How to run sqoop jobs from Oozie
Sqoop is a tool to import/export data between a relational database and HDFS. It is easy to use and runs MapReduce jobs behind the scenes to move data from source to destination. Oozie, on the other hand, is a job scheduler for Apache Hadoop jobs (just like you use cron to schedule programs, you use Oozie to schedule your Hadoop jobs).
Though Sqoop is fairly easy to use by itself, it can be tricky when it has to interface with other tools. To schedule your data imports/exports using Oozie, the Sqoop metastore must be reachable from that external tool. Likewise, for other users to be able to access the Sqoop jobs, run them, see the last import date and so on, the Sqoop metastore has to expose the jobs to those users.
By default, Sqoop uses an unshared instance of HSQLDB, a file-based database, and stores its job metadata in the .sqoop directory under the home directory of the user who created the jobs.
Let’s see how to run Sqoop jobs from Oozie.
Since Hadoop works in a distributed environment, Oozie may be running on one host and Sqoop on a different host. Unless the metastore location is explicitly specified, Sqoop looks for its jobs on the local machine, at the default location where it stores them.
When it fails to find the specified job at that default location, it throws an error:
“Caused by: java.sql.SQLException: File input/output error /var/lib/hadoop-yarn/.sqoop/metastore.db.properties java.io.FileNotFoundException: /var/lib/hadoop-yarn/.sqoop/metastore.db.properties.new (No such file or directory)”
How to solve File not found Exception in Oozie?
Start using Sqoop meta-store as a service.
From the Sqoop Docs,
Running sqoop-metastore launches a shared HSQLDB database instance on the current machine. Clients can connect to this metastore and create jobs which can be shared between users for execution
The location of the metastore’s files on disk is controlled by the sqoop.metastore.server.location property in conf/sqoop-site.xml. This should point to a directory on the local filesystem.
The metastore is available over TCP/IP. The port is controlled by the sqoop.metastore.server.port configuration parameter, and defaults to 16000.
Clients should connect to the metastore by specifying sqoop.metastore.client.autoconnect.url or --meta-connect with the value jdbc:hsqldb:hsql://<server-name>:<port>/sqoop. For example, jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.
This metastore may be hosted on a machine within the Hadoop cluster, or elsewhere on the network.
Note: You can find the location of your sqoop-site.xml on a Linux-based system using the command:
find / -name sqoop-site.xml
For a client to interact with the Sqoop metastore, it needs to know the metastore’s exact address, i.e. the hostname (or IP) and port number (16000 in this case).
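Before wiring up Oozie, you can sanity-check that the client host can actually reach the metastore port. A quick sketch, to be run from the Oozie/client machine; the hostname below is the hypothetical one from the docs example, so substitute your own:

```
# From the Oozie/client host: test TCP connectivity to the metastore port
nc -z metaserver.example.com 16000 && echo "metastore reachable"
```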
Follow these steps to run the Sqoop metastore as a shared database so that it is available to Oozie or any other client:
1. Edit sqoop-site.xml and un-comment the following lines of code
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:hsqldb:hsql://FQDN:16000/sqoop</value>
</property>
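If you also want to control where the shared metastore keeps its files and which port it listens on, the server-side properties quoted from the Sqoop docs above go in the same file. A sketch; the directory and port below are placeholders to adjust for your setup:

```xml
<!-- Example server-side settings in conf/sqoop-site.xml.
     The directory path is a placeholder; 16000 is the default port. -->
<property>
  <name>sqoop.metastore.server.location</name>
  <value>/var/lib/sqoop/metastore.db</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
</property>
```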
2. Start the Sqoop metastore in the background
To keep the metastore from being killed by hangup signals when your terminal session closes, run the metastore start command with nohup on the server. This is what your command should look like:
nohup sqoop-metastore &
You can read more about nohup here at wikipedia
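A slightly fuller sketch of the same step, assuming sqoop-metastore is on the PATH of the metastore server; the log file path is just an example:

```
# Start the shared metastore in the background, keeping its output in a log
# (nohup lets it survive terminal hangups)
nohup sqoop-metastore > /var/log/sqoop/metastore.log 2>&1 &

# On the server, confirm the metastore is listening on its port (default 16000)
netstat -tln | grep 16000
```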
3. Always specify --meta-connect explicitly, even if you have allowed auto-connect to the URL in sqoop-site.xml.
Even after specifying the URL in the --meta-connect argument and allowing auto-connect to it in the configuration file, I’ve come across instances where clients could not connect to the Sqoop metastore. Therefore, pass the --meta-connect argument with an explicit value both when creating your Sqoop jobs and when executing them.
1. Your Sqoop job should look like:
sqoop job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop \
  --create sample_job \
  -- import \
  --driver com.mysql.jdbc.Driver \
  --connect "jdbc:mysql://IP/db?zeroDateTimeBehavior=convertToNull" \
  --username USERNAME --password 'PASSWORD' \
  --table tableName \
  --merge-key id --split-by id \
  --hive-import --hive-overwrite --hive-database hiveDBNAME
2. The Sqoop execution command in Oozie should look like:
sqoop job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop --exec sample_job
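In the Oozie workflow itself, that command goes inside a Sqoop action. A minimal workflow.xml sketch, assuming Oozie’s standard Sqoop action extension; the workflow and action names are made up, and ${jobTracker}/${nameNode} would come from your job.properties:

```xml
<workflow-app name="sqoop-sample-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="sqoop-exec"/>
  <action name="sqoop-exec">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- Same command as above, without the leading "sqoop" -->
      <command>job --meta-connect jdbc:hsqldb:hsql://FQDN:16000/sqoop --exec sample_job</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <fail name="fail">
    <message>Sqoop job failed</message>
  </fail>
  <end name="end"/>
</workflow-app>
```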
Feel free to redundantly specify the argument --meta-connect when using Sqoop with other clients.
4. Set record password to true in sqoop-site.xml
Since you are going to be scheduling your sqoop-job and it isn’t going to be running in interactive mode, set the sqoop.metastore.client.record.password property to true in sqoop-site.xml.
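In sqoop-site.xml, that setting looks like this (the property name is the standard one mentioned above; make sure the block is not commented out):

```xml
<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>
```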
Even though that should work, I came across a problem where Oozie could run the job the first time but failed every time afterwards. Here’s how I fixed it: http://www.yourtechchick.com/sqoop/sqoop-job-fails-to-record-password-after-running-for-the-first-time-problem-fix/
This should allow you to run your jobs through Oozie workflows now. I hope that helps you run Sqoop jobs from Oozie. :)
Also, if you are going to be using Sqoop incremental imports, here’s a step-by-step guide to help you through it: Step by Step guide for Sqoop incremental Imports
P.S: The Internet is a thriving community thanks to the generous contributions of people from across the globe. You can help it thrive too. Please contribute. As a reader, you can contribute with your valuable feedback.
Please drop by a comment or Share to let people know you were here. 🙂 🙂