Apache Hive is the primary data-warehousing tool we use at our workplace for querying and data-analysis.On top of that, we use Apache Oozie to schedule our workflows of Sqoop and Pig jobs apart from Hive Queries.A recent requirement was end-to-end automation of a report that was usually updated on Hive every-day but sent by our BI team to rest of the teams. I was required to take this Human intervention off and to further automate emailing Hive Query results through Oozie
How to automate emailing Hive Query results through Oozie
Points to Note:
1. Oozie has an email action component in it. You can read more about it here: Oozie Email Action
2.Hue provides an easy to use interface to schedule Oozie jobs as well and we will be using that as I walk you through the steps to automate emailing Hive Query results.
3. You need to configure SMTP server in Oozie. To check if it’s configured or not, Just go to Hue quick start page. It will show you a warning already if SMTP is not configured for Oozie.
Here’s what it would look like if SMTP is not configured:
To set-up your server to send emails, You can use mail sending agents like POSTFIX and SENDMAIL. Both of these are implementations of SMTP.
I have set-up and configured POSTFIX on our master node in Hadoop. Follow this step by step tutorial to confugure postfix as send only smtp server .
2. Open oozie-site.xml to edit it. If you are using Cloudera, just go to Cloudera Manager, Click on Oozie -> Configuration -> Oozie Server -> oozie.email.smtp.host
1 ) Edit this field to add Public DNS of the master server that has postfix enabled. (In my case, Hue is running on the master node and so is Oozie) . We enabled SMTP on this node only.
2 ) oozie.email.smtp.port is set to 25 by default as it is a reserved port for SMTP. You don’t need to change this unless you changed the port for SMTP on your server.
3. Restart Oozie.
Now, we are good to start creating our workflow in Oozie:
In a typical Oozie workflow, I usually would have a Sqoop job, a Hive query and sometimes a shell action. This time, we got an additional Email action at the end of workflow.
But before coming to that
1. Create a sqoop job in case you’re pulling data from a relational database for the task.
In case you don’t know how to run Sqoop jobs from oozie or How to import data from relational database to HDFS and Hive, I got you covered. Here’s a step by step tutorial to do just that.
2. Create a Hive Action in Oozie and put in the query.
Here’s what my Query looked like:
INSERT OVERWRITE DIRECTORY '/user/usernamexyz/samplingNProduction'ROW FORMAT DELIMITED FIELDS TERMINATED BY ','SELECT name, COUNT(sku) as qty, date, SUM(COUNT(sku)) OVER () as totalFROM erp.fabricator_loan flleft joinerp.tailors tailorsON tailors.id = fl.loan_to WHEREdate = CURRENT_DATEGROUP BY name, date;
1. You have to use an INSERT OVERWRITE DIRECTORY query and put in the exact location.
2. Don’t forget to add ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ because we are generating a CSV here that users should be able to read. The default delimiter is “^A”.
3. Please note that we are not writing to LOCAL DIRECTORY here but an HDFS DIRECTORY through ‘INSERT OVERWRITE DIRECTORY’. Without the local keyword in it, it would write to an HDFS directory.
Problem ? How do we get the headers for the CSV file we intend to generate and email?
Solution: Create a Shell action in Oozie.
NOTE: Do not write any files to local system when executing a task in hadoop. Different jobs for a task may execute on different nodes and file written to local system by one task will not be available for the other on a different system.
3. Create a CSV file with just the column names and store it in HDFS. I called it : /user/oozie/fabricator_loan_report_headers.csv
Here’s what my shell script looks like:
hadoop fs -rm -r /user/oozie/fabricator_loan_report.csv;hadoop fs -cp /user/usernamexyz/samplingNProduction/000000_0 /user/oozie/fabricator_loan_report.csv;hadoop fs -cat /user/oozie/fabricator_loan_report_headers.csv /user/oozie/fabricator_loan_report.csv | hadoop fs -put -f - /user/oozie/fabricator_loan_report_new.csv;
-put command takes input from stdin here and writes it finally in /user/oozie/fabricator_loan_report_new.csv. What we are writing to this file is concatenation of fabricator_loan_report_headers.csv and fabricator_loan_report.csv . Yeah, that’s how we get our file the headers. Interesting, right? 😀
4. Now, comes the turn to email this output file we generated through Hive and shell action.
In email attachment action in Oozie, just add path of the file:
This is what it would look like in Oozie:
and That’s it.I hope that helps! 🙂
P.S: Internet is a successfully thriving community from generous contributions of people from across the globe. You can help it thrive too. Please contribute. As a reader, you can be contributing with your valuable feedback.
Please drop by a comment or Share to let people know you were here. 🙂 🙂