【Sqoop】Export data into RDBMS using Sqoop and performance tuning

【Original link】https://hadoopjournal.wordpress.com/2017/08/15/export-data-using-sqoop/


We can export data from HDFS into an RDBMS table using the Sqoop export tool. The target table must already exist in the database for the export job to succeed. The Sqoop export tool uses the MapReduce model to export data into the RDBMS table. Sqoop export provides two modes for exporting data into a relational database table:

  1. INSERT mode.
  2. UPDATE mode.

INSERT mode – This is the default export mode; if you don’t specify a mode in the Sqoop export command, it is picked up automatically. This mode is useful when you only need to insert new records into the table.
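
For comparison, a minimal INSERT-mode export needs no mode-related flags. This is a sketch using the same placeholder connection string, credentials, and names as the commands later in this post:

$ sqoop export --connect connection_string \
  --username user_name \
  --password password \
  --table target_table \
  --export-dir /hdfs/data/location/export.txt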

UPDATE mode – In update mode, Sqoop generates UPDATE statements that replace existing records in the database. Legal values for --update-mode are `updateonly` (the default) and `allowinsert`.

We enable update mode by providing the --update-key <column(s)> command-line argument. This causes Sqoop to generate SQL UPDATE statements to run against the RDBMS.

Assume that you want to update a three-column table with data stored in the HDFS file /user/user1/my-hdfs-file. The file contains this record: 10,100,200
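
For illustration, listing that file (assuming the default comma field delimiter) would show the single record:

$ hdfs dfs -cat /user/user1/my-hdfs-file
10,100,200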

The following abbreviated Sqoop export command generates the corresponding SQL UPDATE statement on your database system:

$ sqoop export (Generic Arguments) \
  --table target_table \
  --update-key column1 \
  --export-dir /hdfs/data/location/export.txt \
  ...

Generates => UPDATE target_table
             SET column2=100, column3=200
             WHERE column1=10;

With the preceding export command, if target_table on your RDBMS or data warehouse system has no record with a matching value in column1, nothing is changed in target_table. Often, though, the data in Hadoop contains both updated records and entirely new ones.

In that case, we can use --update-mode allowinsert to update existing records where they exist and insert new rows where they do not. The Sqoop command is:

$ sqoop export --connect connection_string \
  --username xxxxx \
  --password xxxxx \
  --table target_table \
  --update-key id \
  --update-mode allowinsert \
  --export-dir /hdfs/data/location/export.txt \
  -m 1

Sqoop export performance tuning techniques:

Sqoop export performance can be improved with the following techniques:

  1. Increasing parallelism.
  2. Inserting data in batches.

Increasing parallelism – Since Sqoop export also uses the MapReduce model, we can increase the number of mappers to gain parallelism while exporting the data. By default, Sqoop uses four tasks in parallel for the export process. This may not be optimal; you will need to tune the number up or down for your particular setup. Additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indexes, invoking triggers, and so on, the extra load may decrease performance.

The --num-mappers or -m argument controls the number of map tasks, which is the degree of parallelism used.
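
For example, a sketch that raises the export to eight parallel map tasks (placeholders as before; eight is an illustrative value, not a recommendation):

$ sqoop export --connect connection_string \
  --username user_name \
  --password password \
  --table target_table \
  --export-dir /hdfs/data/location/export.txt \
  --num-mappers 8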

Inserting data in batches – We can group related SQL statements into a batch when we export data. The JDBC interface exposes an API for executing batches in a prepared statement with multiple sets of values, and with the --batch parameter Sqoop can take advantage of it. This API is present in all JDBC drivers because it is required by the JDBC interface. Enable JDBC batching using the --batch parameter with the export command:

$ sqoop export --connect connection_string \
  --username user_name \
  --password password \
  --table table_name \
  --export-dir /hdfs/data/location/export.txt \
  --batch \
  -m 1

The second option is to use the property sqoop.export.records.per.statement to specify the number of records that will be used in each insert statement:

$ sqoop export \
  -Dsqoop.export.records.per.statement=10 \
  --connect connection_string --username user_name \
  --password password --table target_table \
  --export-dir /hdfs/data/location/export.txt

Finally, you can set how many statements are executed per transaction with the sqoop.export.statements.per.transaction property:

$ sqoop export \
  -Dsqoop.export.statements.per.transaction=10 \
  --connect connection_string \
  --username user_name \
  --password password \
  --table target_table \
  --export-dir /hdfs/data/location/export.txt

The default values can vary from connector to connector. Out of the box, Sqoop disables batching and sets both the sqoop.export.records.per.statement and sqoop.export.statements.per.transaction properties to 100.
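
As a closing sketch, the knobs above can be combined in one command. The values here are illustrative assumptions, not recommendations, and how --batch interacts with the two properties can vary by connector and JDBC driver:

$ sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect connection_string \
  --username user_name \
  --password password \
  --table target_table \
  --export-dir /hdfs/data/location/export.txt \
  --batch \
  --num-mappers 8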
