Analytics/Cluster/Oozie
Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie is a job scheduler with rich features. Most relevantly, jobs can be triggered by the existence of data in HDFS. This allows a job to be scheduled to run not only at a given timestamp, but when the data it needs is actually available.
Some terms
- action
- An action generally represents a single step in a job workflow. Examples include pig scripts, failure notifications, map-reduce jobs, etc.
- workflow
- Workflows are used to chain actions together. A workflow is synonymous with a job. Workflows describe how actions should run, and how actions should flow. Actions can be chained based on success and failure conditions.
- coordinator
- Coordinators are used to schedule recurring runs of workflows. They can abstractly describe input and output datasets based on on periodicity. Coordinators will submit workflow jobs based on the existence of data.
- bundle
- A bundle is a logical grouping of coordinators that share commonalities. You can use bundles to start and stop whole groups of coordinators at once.
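As an illustrative sketch (the names and paths below are hypothetical, not taken from refinery), a bundle grouping two coordinators looks roughly like this:

```xml
<bundle-app name="demo-bundle" xmlns="uri:oozie:bundle:0.2">
    <controls>
        <!-- when the bundle's coordinators start being materialized -->
        <kick-off-time>${start_time}</kick-off-time>
    </controls>
    <!-- each entry points at an HDFS directory containing a coordinator.xml -->
    <coordinator name="demo-coord-a">
        <app-path>${coordinator_a_path}</app-path>
    </coordinator>
    <coordinator name="demo-coord-b">
        <app-path>${coordinator_b_path}</app-path>
    </coordinator>
</bundle-app>
```

Starting or killing the bundle job then starts or kills both coordinators at once.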
Oozie 101: An Example
Let's run through a simple example of how to set up an oozie job running on the cluster. Keep in mind that this is just an example to get you started, and that more steps than the ones outlined below are needed to get a job running according to our production standards.
In our example we will use oozie to run a job that executes a parameterized hive query. Note that we assume you have access to stat1002.eqiad.wmnet, from which we normally access the hadoop cluster.
CLI
The job we are going to set up uses just a workflow. This means that, in the absence of a coordinator, we will run it by hand using the oozie command line interface (CLI). There is a lot you can do through oozie's CLI; please take a look at the docs here: https://oozie.apache.org/docs/3.1.3-incubating/DG_CommandLineTool.html
There are three files needed:
- A file with the hive query.
- A workflow.xml file that oozie uses to determine which job to run.
- A workflow.properties file that sets concrete values for the properties defined in the workflow.
Both workflow.xml and the hive query need to be available inside hdfs, so we will put them into the hdfs /tmp directory and run the oozie job from there.
hive query
Please note the ${...} placeholders for parameters, and replace <user> with your username.
DROP VIEW IF EXISTS <user>_oozie_test;
CREATE VIEW <user>_oozie_test AS
SELECT
CASE WHEN user_agent LIKE('%iPhone%') THEN 'iOS'
ELSE 'Android' END AS platform,
parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID') AS uuid
FROM ${source_table}
WHERE year=${year}
AND month=${month}
AND day=${day}
AND hour=${hour};
-- Now get a count of totals that will be inserted in some file
INSERT OVERWRITE DIRECTORY "${destination_directory}"
SELECT platform, COUNT(DISTINCT(uuid))
FROM <user>_oozie_test
GROUP BY platform;
Workflow.xml and workflow.properties
<workflow-app name="cmd-param-demo" xmlns="uri:oozie:workflow:0.4">
<parameters>
<property>
<name>queue_name</name>
<value>default</value>
</property>
<!-- Required properties -->
<property><name>name_node</name></property>
<property><name>job_tracker</name></property>
<property>
<name>hive_site_xml</name>
<description>hive-site.xml file path in HDFS</description>
</property>
<!-- specifying parameter values in file to test running -->
<property>
<name>source_table</name>
<description>
Hive table to read data from.
</description>
</property>
<property>
<name>year</name>
<description>The partition's year</description>
</property>
<property>
<name>month</name>
<description>The partition's month</description>
</property>
<property>
<name>day</name>
<description>The partition's day</description>
</property>
<property>
<name>hour</name>
<description>The partition's hour</description>
</property>
</parameters>
<start to="hive-demo"/>
<action name="hive-demo">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${job_tracker}</job-tracker>
<name-node>${name_node}</name-node>
<job-xml>${hive_site_xml}</job-xml>
<configuration>
<property>
<name>mapreduce.job.queuename</name>
<value>${queue_name}</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive-${user}</value>
</property>
</configuration>
<script>generate_daily_uniques.hql</script>
<param>source_table=${source_table}</param>
<param>destination_directory=/tmp/test-mobile-apps/${wf:id()}</param>
<param>year=${year}</param>
<param>month=${month}</param>
<param>day=${day}</param>
<param>hour=${hour}</param>
</hive>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Workflow.properties:
@stat1002:~/workplace/refinery/oozie/mobile-apps/generate_daily_uniques$ more workflow.properties
name_node = hdfs://analytics-hadoop
job_tracker = resourcemanager.analytics.eqiad.wmnet:8032
queue_name = default
oozie_directory = ${name_node}/wmf/refinery/current/oozie
# for testing locally, this won't work:
# hive_site_xml = ${oozie_directory}/util/hive/hive-site.xml
hive_site_xml = ${refinery_directory}/oozie/util/hive/hive-site.xml
# Workflow app to run.
oozie.wf.application.path = hdfs://analytics-hadoop/tmp/tests-<some>/workflow.xml
oozie.use.system.libpath = true
oozie.action.external.stats.write = true
# parameters
source_table = wmf_raw.webrequest
year = 2014
month = 11
day = 20
hour = 10
user = <your-user-in1002>
Validating workflow.xml
After creating the files you should make sure they are valid according to oozie's schema:
oozie validate workflow.xml
Moving files to hdfs
The easiest place to put files is the /tmp directory. Move the workflow.xml and the hive query file there:
hdfs dfs -mkdir /tmp/tests-$USER
hdfs dfs -put workflow.xml /tmp/tests-$USER/workflow.xml
hdfs dfs -cat /tmp/tests-$USER/workflow.xml
Running oozie job
From your local directory (where you have workflow.properties) run:
oozie job -config workflow.properties -run
Running your job, say, once a day
In order to use oozie's crontab-like scheduling you will need a coordinator file.
Good docs about coordinators can be found here: [1]
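As a sketch only (the paths and the derived-property example below are hypothetical, not production refinery code), a minimal coordinator that submits the workflow above once a day could look like this:

```xml
<coordinator-app name="demo-daily-coord" frequency="${coord:days(1)}"
                 start="${start_time}" end="${stop_time}" timezone="Universal"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory containing the workflow.xml from above -->
            <app-path>${workflow_directory}</app-path>
            <configuration>
                <property>
                    <!-- derive the day partition from the coordinator's nominal run time -->
                    <name>day</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'd')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```

The coordinator is then submitted with its own properties file (oozie job -config coordinator.properties -run), and each materialized action fills in the workflow's parameters from its nominal time.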
Running a real oozie example
The main difference from the 101 example is that in this case we will likely need to override the oozie directory. This testing needs to be done from stat1002. Let's assume the refinery code you want to test is deployed to ~/workplace/refinery/oozie
In coordinator.properties, set hive_site_xml to point at the latest hive-site.xml:
hive_site_xml = ${refinery_directory}/oozie/util/hive/hive-site.xml
Put your oozie directory on hdfs
nuria@stat1002:~/workplace/refinery$ ls oozie
hdfs dfs -rmr /tmp/oozie-nuria
hdfs dfs -mkdir /tmp/oozie-nuria
hdfs dfs -put oozie/ /tmp/oozie-nuria
Now the oozie code is on hdfs.
Run the oozie job, overriding properties as needed
Note that we are overriding the refinery_directory variable:
oozie job -run \
  -Duser=nuria \
  -Darchive_directory=hdfs://analytics-hadoop/tmp/nuria \
  -Doozie_directory=/tmp/oozie-nuria/oozie \
  -config ./oozie/pageview/hourly/coordinator.properties \
  -Dstart_time=2015-09-03T00:00Z \
  -Dstop_time=2015-09-03T04:00Z \
  -Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/2015* | tail -n 1 | awk '{print $NF}')
Troubleshooting
Checking logs:
oozie job -log <job_id>
Seeing the last couple of jobs that ran:
oozie jobs -localtime -len 2
Get more info on a job id that failed:
oozie job -info <job-id>
This should display your hadoop job id (see below):
Job ID : 0005783-141210154539499-oozie-oozi-W
------------------------------------------------------------------------------------------------
Workflow Name : cmd-param-demo
App Path      : hdfs://analytics-hadoop/tmp/tests-mobile-apps/workflow.xml
Status        : KILLED
Run           : 0
.....
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------
ID                                              Status  Ext ID                   Ext Status     Err Code
------------------------------------------------------------------------------------------------
0005783-141210154539499-oozie-oozi-W@:start:    OK      -                        OK             -
0005783-141210154539499-oozie-oozi-W@hive-demo  ERROR   job_1415917009743_45854  FAILED/KILLED  40000
0005783-141210154539499-oozie-oozi-W@kill       OK      -                        OK             E0729
------------------------------------------------------------------------------------------------
Logs for hadoop job id 1415917009743_45854 should be at:
/var/log/hadoop-yarn/apps/$USER/logs/application<id>/
They can also be fetched through the YARN CLI (prefix the hadoop job id with application_):
yarn logs -applicationId application_<id>
Kill a job
Get the job id from the following command:
oozie jobs -jobtype coord
oozie job -kill <id>
Check how the cluster is doing
You can see how the queues are being utilized here:
http://localhost:8088/cluster/scheduler
You'll have to set up the ssh tunnel as specified in Analytics/Cluster/Access
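For illustration only, such a tunnel could look like the line below; the hostnames here are assumptions pieced together from this page (the resource manager host appears in workflow.properties above), so check Analytics/Cluster/Access for the canonical command:

```shell
# Forward local port 8088 to the YARN resource manager through stat1002.
# Hostnames are illustrative -- see Analytics/Cluster/Access for the real setup.
ssh -N -L 8088:resourcemanager.analytics.eqiad.wmnet:8088 stat1002.eqiad.wmnet
```

With the tunnel up, http://localhost:8088/cluster/scheduler shows the scheduler UI.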
Currently submitted Refinery Oozie jobs
How to see jobs scheduled to run
To get a bird's eye view over the currently submitted Refinery Oozie jobs, use the commands below.
On stat1002, run
oozie jobs -jobtype bundle -filter status=RUNNING
to see the 100 most recent RUNNING bundles.
On stat1002, run
oozie jobs -jobtype coordinator -filter status=RUNNING
to see the 100 most recent RUNNING coordinators.
On stat1002, run
oozie jobs -jobtype wf -filter status=RUNNING
to see the 100 most recent RUNNING workflows.
To consider more than the 100 most recent jobs, add a -len option at the end, like -len 2000 to get the 2000 most recent ones. To not limit the output to RUNNING jobs, drop the -filter status=RUNNING from the command.
The source for the refinery's oozie production jobs can be found at https://git.wikimedia.org/tree/analytics%2Frefinery/master/oozie .
In the cluster, Oozie's job definitions can be found at /wmf/refinery/... . Never deploy jobs from /wmf/refinery/current/... ; always use one of the refinery variants that have a concrete time and commit in the directory name, like /wmf/refinery/2015-01-09T12.39.20Z--2007cb8 .
How to deploy Oozie production jobs
On stat1002, run

export REFINERY_VERSION=2015-01-05T17.59.18Z--7bb7f07
export PROPERTIES_FILE=oozie/webrequest/legacy_tsvs/bundle.properties
export START_TIME=2015-01-05T11:00Z
cd /mnt/hdfs/wmf/refinery/$REFINERY_VERSION
sudo -u hdfs oozie job \
  -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
  -run \
  -config $PROPERTIES_FILE \
  -D refinery_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION \
  -D oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
  -D queue_name=essential \
  -D start_time=$START_TIME

where
- REFINERY_VERSION should be set to the concrete, deployed version of refinery that you want to deploy from, like 2015-01-05T17.59.18Z--7bb7f07. (Do not use current there, or your job is likely to break when someone deploys refinery afresh.)
- PROPERTIES_FILE should be set to the properties file that you want to deploy, relative to the refinery root, like oozie/webrequest/legacy_tsvs/bundle.properties.
- START_TIME should denote the time the job should run for the first time, like 2015-01-05T11:00Z.
(To get your code to stat1002 in the first place, follow Analytics/Cluster/Refinery#How_to_deploy.)
Alternate example (this should maybe replace the section above)
- restbase coordinator
sudo -u hdfs oozie job --oozie $OOZIE_URL \
  -Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/2015* | tail -n 1 | awk '{print $NF}') \
  -Dqueue_name=production \
  -Doozie_launcher_queue_name=production \
  -Doozie_launcher_memory=256 \
  -Dstart_time=2015-08-01T00:00Z \
  -config /srv/deployment/analytics/refinery/oozie/restbase/coordinator.properties \
  -run