Step by step guide for Sqoop incremental imports | MySQL to Hive | Incremental updates in hive

How to set up Sqoop incremental imports?

Here’s a step-by-step guide for Sqoop incremental imports, and since it says step-by-step, it’s going to be only that 😉.

Hive, a data warehousing tool, is a solution to the problem of accessing data that is spread across flat files. Since we are originally fetching data from “flat” files, it is often challenging to operate over them. Of course, Hive is better suited for batch processing of fairly large, immutable data sets, but fortunately it has matured enough to support CRUD operations on the flat files we store in HDFS.

How does Hive ACTUALLY work?

Hive provides an SQL-like interface called Hive Query Language, abbreviated as HQL (or HiveQL). You can think of Hive as a platform that adds a layer of “schema and metadata” on top of your data, which is stored in flat files.

  • This layer of metadata gives your data the structure and schema needed to query it the way you would in a relational database, i.e. for CRUD operations.
  • This metadata is stored in the Hive metastore, which lives in Derby by default, or you can configure an external database like PostgreSQL, MySQL or Oracle to hold this meta-information.
  • You set values for the relevant properties in hive-site.xml, Hive’s configuration file.
  • So, Hive is basically a data warehousing package built on top of Hadoop, used to give structure to otherwise unstructured data.
  • Internally, Hive converts HiveQL into MapReduce jobs and is thus capable of querying petabytes of data.
  • To store this meta-information, the metastore has tables like:

  • DBS – stores the list of all Hive databases created.
  • TBLS – has all the table details.
  • COLUMNS_V2 – has column-level details.
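
For instance, if your metastore happens to be backed by MySQL (a hypothetical setup; the exact join keys can vary slightly between Hive versions), you can peek at this meta-information directly:

-- list all Hive databases known to the metastore
SELECT NAME, DB_LOCATION_URI FROM DBS;

-- list tables along with the database they belong to
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID;

-- column-level details for one table (test_table is a placeholder name)
SELECT c.COLUMN_NAME, c.TYPE_NAME
FROM COLUMNS_V2 c
JOIN SDS s ON c.CD_ID = s.CD_ID
JOIN TBLS t ON s.SD_ID = t.SD_ID
WHERE t.TBL_NAME = 'test_table';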
What are Sqoop incremental imports/ incremental imports in Sqoop?

Once you have ingested the data into Hive for the first time, you will most probably need Sqoop incremental imports. With incremental imports, you can limit the data you ingest into your warehouse or HDFS: instead of copying the complete data set each time, why not import only the new data, i.e. the part that has not already been moved to the warehouse? Put another way, Sqoop incremental imports let you import only the data that is “newer” than the previously imported data.

To ingest data, I often use Apache Sqoop. Sqoop supports incremental imports in two modes:

  • Append mode
  • Lastmodified mode
What is the difference between append mode and lastmodified mode?

Append mode and lastmodified mode are two different ways of keeping track of the previously imported data.

Append mode:

Append mode in Sqoop keeps track of the previously imported data through a numeric column that is unique and increases as new rows are added (an auto-increment ID, for example).

Arguments for append mode in sqoop command:

--incremental append --check-column entity_id --last-value 0

Now, every time you run the sqoop job with these arguments, it stores the last value of the entity_id column so that the next run imports only the rows where entity_id is greater than the last imported value. If you are not writing sqoop jobs and use the plain sqoop import command instead, Sqoop simply prints the last value of entity_id on the console for you to keep track of. However, if you are writing sqoop jobs, the Sqoop metastore stores the value and automatically imports only the newer data.
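
As a minimal sketch, a plain sqoop import in append mode could look like this (the connection string, credentials, table name and target directory are placeholders, consistent with the job examples later in this post):

sqoop import \
  --connect jdbc:mysql://IP/DB \
  --username USER_NAME --password 'PASSWORD' \
  --table test_table \
  --target-dir LOCATION \
  --incremental append \
  --check-column entity_id \
  --last-value 0

At the end of the run, Sqoop prints the new last value of entity_id on the console; pass that as --last-value on the next run, or let a sqoop job do the bookkeeping for you as described above.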


lastmodified mode

Lastmodified mode in Sqoop tracks a column with a timestamp/datetime datatype that is updated every time the row is updated.
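
For this to work, the source table needs such a column in the first place. As a purely hypothetical example (not taken from this post’s actual source schema), a MySQL table that supports both modes could be defined like this:

-- hypothetical MySQL source table
CREATE TABLE test_table (
  id INT AUTO_INCREMENT PRIMARY KEY,              -- candidate --check-column for append mode
  name VARCHAR(100),
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
             ON UPDATE CURRENT_TIMESTAMP          -- candidate --check-column for lastmodified mode
);

MySQL bumps updated_at automatically whenever a row changes, which is exactly what lastmodified mode relies on.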

Lastmodified mode uses arguments as follows:

--incremental lastmodified --check-column update_date --last-value 0

Of course, you have to tell Sqoop which column to check for the last import timestamp; you do that by passing the column name to the --check-column argument.

Now that you know you can use either an incrementing numeric column or a timestamp to import only the newly added data, which one should you use?

  • Use append mode in sqoop to import data into HDFS/hive when the data in the table is NOT updated.
  • Use lastmodified mode in sqoop when rows in your source table could update.

If you use append mode to import data from a relational database and a row gets updated, the updated values are simply ignored and will never be reflected in your HDFS/Hive data files. However, if you use lastmodified mode (which of course requires a timestamp field that is updated whenever a row changes in your source table), Sqoop will update the corresponding row with the new values in the HDFS files. Please note, though, that an update operation is fairly expensive to perform on flat files.

Q: Now the question is, how does Sqoop know which “corresponding row” in HDFS needs to be updated for a given updated row in the source table?

A: merge-key

--merge-key is an argument that identifies a row uniquely, so that updated values from the source table can be merged into the HDFS files.

This is what a typical sqoop job that uses lastmodified mode for Sqoop incremental imports looks like:

sqoop job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop \
  --create test_table \
  -- import \
  --driver com.mysql.jdbc.Driver \
  --connect 'jdbc:mysql://IP/DB?zeroDateTimeBehavior=convertToNull' \
  --username USER_NAME --password 'PASSWORD' \
  --table test_table \
  --merge-key id --split-by id \
  --target-dir LOCATION \
  --incremental lastmodified --last-value 0 --check-column updated_at

Since I passed the id column as the value of the --merge-key argument, id has to be a unique identifier in your table (which usually means it is a unique key or the primary key). Say the row with id = 6 gets updated: Sqoop will look up the row with id = 6 in HDFS and merge the updated values from the source table into it. Makes sense, right?
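
To illustrate (assuming the hypothetical source table sketched earlier, where updated_at is bumped automatically on every change):

-- an update happens in the MySQL source table
UPDATE test_table SET name = 'new name' WHERE id = 6;

On the next run of the sqoop job, this row is picked up because its updated_at is newer than the stored last-value, and the merge step replaces the old record with id = 6 in the target directory with this new version.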

Note: However, I did come across a weird situation where the updates were not working in Sqoop incremental imports. If your MapReduce job runs to completion successfully and yet the values in the Hive table/HDFS files do not get updated, you can refer here: Sqoop incremental import not working.

Now that you understand what Sqoop incremental imports are, how they work, and how to fix them when they do not update the values in HDFS, let me show you how to set up Sqoop incremental imports in your production environment:

1. Write a sqoop job, like the one we wrote above for Sqoop incremental imports. You can of course switch the mode depending on what you need:

sqoop job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop \
  --create test_table \
  -- import \
  --driver com.mysql.jdbc.Driver \
  --connect 'jdbc:mysql://IP/DB?zeroDateTimeBehavior=convertToNull' \
  --username USER_NAME --password 'PASSWORD' \
  --table test_table \
  --merge-key id --split-by id \
  --target-dir LOCATION \
  --incremental lastmodified --last-value 0 --check-column updated_at

OR

sqoop job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop \
  --create test_table \
  -- import \
  --driver com.mysql.jdbc.Driver \
  --connect 'jdbc:mysql://IP/DB?zeroDateTimeBehavior=convertToNull' \
  --username USER_NAME --password 'PASSWORD' \
  --table test_table \
  --split-by id \
  --target-dir LOCATION \
  --incremental append --last-value 0 --check-column id

Please note that we do not need the --merge-key argument in append mode for Sqoop incremental imports.

Note: If you take the --merge-key argument off the sqoop job while using lastmodified mode and run it a second time, you will get an error that says:

ERROR tool.ImportTool: Error during import: --merge-key or --append is required when using --incremental lastmodified and the output directory exists.

It means that in lastmodified mode you either have to specify --merge-key or switch to append mode. Like I said above, how else would Sqoop know which row to update in HDFS for a given updated row in your source table? Also note that you only see this error after your sqoop job has run to completion at least once, since a merge operation only makes sense when there is already some data in the target directory. By the way, you specify the target directory using the --target-dir argument.
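
Once the job exists, you manage and run it with the sqoop job tool against the same metastore (the FQDN below is the same placeholder used in the job definition):

# list the jobs stored in the Sqoop metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop --list

# show the saved definition of our job, including the stored last-value
sqoop job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop --show test_table

# run the incremental import; schedule this (cron, Oozie, etc.) to keep HDFS in sync
sqoop job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop --exec test_table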

2. Create the table in Hive using the Hue editor.

a) Image: Create table in hive using hue

b) Image: Create table in hive using hue: Step 2

Hue is a beautiful editor and is super easy to use. Why use Hue, you ask?

I would have hated having to create those huge tables manually, specifying every column and its datatype. What a time-killer that would have been (especially since you cannot reuse your SQL DDL statements directly in Hive; they need some changes).

  1. Give your table a name (required)
  2. Give your table an optional description
  3. Specify the HDFS path where your data file exists
  4. In the picture above (image a), the checkbox at the bottom asks whether you want to import the data. If you uncheck it, Hue will create only the table schema and will not import any data. How awesome!
  5. However, if you keep the checkbox checked, Hue is going to “move” your file from the location you specified in the --target-dir argument of your sqoop job. Where is it going to move this data? To the warehouse location configured in hive-site.xml (the hive.metastore.warehouse.dir property).
  6. Go next (refer to image b); the editor will show you the auto-detected datatypes for the columns (you may choose to change them if needed; I’ve found it pretty accurate so far).
  7. Go on to create the table.
  8. Now, go to hive editor and try:

    SELECT * FROM DB.TableName;

    It should give you the table (schema) with no rows (since we have not imported any data into the Hive table yet).

Conclusion: Keep the checkbox unchecked. Choose to only create the schema for now.
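
If you would rather not click through Hue, a hand-written HiveQL statement can create the same external schema directly over the Sqoop target directory. This is only a sketch: the columns mirror the hypothetical source table from earlier, and the delimiter and path are placeholders you would adapt. Note that specifying LOCATION here already covers step 3 below:

-- hypothetical schema; adjust columns, delimiter and location to your data
CREATE EXTERNAL TABLE DB.test_table (
  id INT,
  name STRING,
  updated_at TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hue/DirName';  -- the directory you passed as --target-dir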

3. Map your empty Hive table to the HDFS location where your data file exists (i.e. the location you specified in the --target-dir argument of the Sqoop incremental import job).

ALTER TABLE DB.TABLE_NAME set location "/user/hue/DirName/FileName";

Now, execute
SELECT * FROM DB.TABLE_NAME;

We see the data. Yaaaayyyyyyyyyyyyy!

4. Now, whenever your sqoop job runs, the new data automatically keeps getting added to the files in the target directory, and the Hive table automatically picks it up. I qualify that as something almost MAGICAL! This was my AHA! moment.

Tips:

1. Do not forget to specify the table as external.

ALTER TABLE DB.TABLE_NAME SET TBLPROPERTIES('EXTERNAL'='TRUE');

2. In case you see that there are rows in your Hive table but none of them has a value in any column, don’t worry. You just need to specify a field delimiter, i.e. tell Hive how the fields are separated in the source file in HDFS that you mapped to the Hive table.

Check the delimiter by opening the file in a simple text editor, then execute this statement in your Hive editor:

ALTER TABLE DB.TABLE_NAME SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('field.delim' = ',');

Change the value for field.delim accordingly.

Tip: You can pick the delimiter while creating the table in Hue as well (Hue automatically picks one for you, usually ','). However, if you missed it there or didn’t change it then, use the statement above to change it later.

3. To check whether the ALTER TABLE statements mentioned above worked fine, use:

DESCRIBE FORMATTED DB.TABLE_NAME;

If you were patient enough to read this through, do let me know 😉

P.S.: The internet is a thriving community thanks to the generous contributions of people from across the globe. You can help it thrive too. Please contribute; as a reader, you can contribute with your valuable feedback.
Please drop a comment or share to let people know you were here. 🙂 🙂
