Analytics/Cluster/Spark

How do I ...

See Spark logs on my local machine when using spark-submit

  • If you are running Spark in local mode, spark-submit writes logs to your console by default.
  • How to get logs written to a file?
    • Spark uses log4j for logging, and the log4j config is usually at /etc/spark/log4j.properties
    • This uses a ConsoleAppender by default; if you want to write to a file instead, an example log4j.properties file would be:
# Set everything to be logged to the file
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark.log
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

This should write logs to /tmp/spark.log

  • On the analytics cluster (stat1002):
    • On the analytics cluster, running a Spark job through spark-submit also writes logs to the console, in both YARN and local modes
    • To write logs to a file, create a log4j.properties file similar to the one above that uses the FileAppender
    • One option is to use the --files argument of spark-submit to upload your custom log4j.properties file
    • The other is to use the extraJavaOptions config (see the example commands after this list)
    • For both of these, see http://spark.apache.org/docs/1.3.0/running-on-yarn.html (under "Debugging your Application") for an explanation
  • While running a Spark job through Oozie:
    • The log4j file path now needs to be a location accessible by all drivers/executors running on different machines
    • Putting the file in a temp directory on Hadoop and using an hdfs:// URL should do the trick
    • Note that the logs will be written on the machines where the driver/executors are running - so you'd need access to go look at them
  • Poke madhuvishy on #wikimedia-analytics for help!
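
As an illustration, here is a sketch of what the two spark-submit approaches could look like; the class name, jar, and local file paths are placeholders, not real jobs on the cluster:

# Option 1: ship a file named log4j.properties with the job; YARN places it in the
# containers' working directory, where Spark picks it up
spark-submit --master yarn --deploy-mode cluster \
  --files /path/to/log4j.properties \
  --class org.example.MyJob my-job.jar

# Option 2: point driver and executors at a log4j config that already exists
# locally on all nodes, via extraJavaOptions
spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/etc/spark/log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/etc/spark/log4j.properties" \
  --class org.example.MyJob my-job.jar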

Spark and IPython

The Spark Python API (PySpark) makes working with data in HDFS super easy. For exploratory tasks, I like using IPython Notebooks. You can run Spark from an IPython Notebook by doing the following:

On Stat1002:

Tell pyspark to start the IPython Notebook server when it is called

export IPYTHON_OPTS="notebook --pylab inline --port 8123  --ip='*' --no-browser"

Start pyspark

pyspark --master yarn --deploy-mode client --num-executors 2 --executor-memory 2g --executor-cores 2

On Your Laptop:

Create an SSH tunnel from your machine to the IPython Notebook server

ssh -N bast1001.wikimedia.org -L 8123:stat1002.eqiad.wmnet:8123

Finally, navigate to http://localhost:8123, create a notebook and start coding!
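
Once connected, the notebook session already has a SparkContext available as sc, so a first cell can read data from HDFS directly. A minimal sketch (the HDFS path below is just a placeholder):

# sc is the SparkContext created by pyspark for this notebook session
lines = sc.textFile("hdfs:///tmp/some-sample-data.tsv")  # placeholder path
print(lines.count())   # number of lines in the file
print(lines.take(5))   # peek at the first few records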

Spark and Oozie

Oozie has a spark action that lets you launch Spark jobs (almost) as you would with spark-submit:

<spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${job_tracker}</job-tracker>
            <name-node>${name_node}</name-node>
            <configuration>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queue_name}</value>
                </property>
                <property>
                    <name>oozie.launcher.mapred.job.queue.name</name>
                    <value>${oozie_launcher_queue_name}</value>
                </property>
                <property>
                    <name>oozie.launcher.mapreduce.map.memory.mb</name>
                    <value>${oozie_launcher_memory}</value>
                </property>
            </configuration>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>${spark_job_name}</name>
            <jar>${spark_code_path_jar_or_py}</jar>
            <spark-opts>--conf spark.yarn.jar=${spark_assembly_jar} --executor-memory ${spark_executor_memory} --driver-memory ${spark_driver_memory} --num-executors ${spark_number_executors} --queue ${queue_name} --conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus --driver-class-path ${hive_lib_path} --driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" --files ${hive_site_xml}</spark-opts>
            <arg>--arg1_name</arg>
            <arg>arg1</arg>
            <arg>--arg2_name</arg>
            <arg>arg2</arg>
            ...
        </spark>

The tricky parts are in the spark-opts element: Spark needs to be given specific configuration settings that are not loaded automatically, as they would be with spark-submit:

  • The core Spark assembly jar needs to be given in the configuration:
--conf spark.yarn.jar=${spark_assembly_jar}
# on analytics-hadoop:
#    spark_assembly_jar = hdfs://analytics-hadoop/user/spark/share/lib/spark-assembly.jar
  • When using Python, you need to set the SPARK_HOME environment variable (to a dummy value, for instance):
--conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus
  • If you want to use HiveContext in Spark, you need to add the Hive lib jars and hive-site.xml to Spark (not done by default in our version); a Python usage sketch follows this list:
--driver-class-path ${hive_lib_path} --driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" --files ${hive_site_xml}
# on analytics-hadoop: 
#   hive_lib_path = /usr/lib/hive/lib/*
#   hive_site_xml = hdfs://analytics-hadoop//util/hive/hive-site.xml
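
With those options in place, a Python job launched through the Oozie spark action can use HiveContext. A minimal sketch of such a job (the app name and query are only illustrative):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# The Oozie spark action launches this script on YARN; we only build the contexts here.
conf = SparkConf().setAppName("hive-context-example")  # placeholder app name
sc = SparkContext(conf=conf)

# hive-site.xml shipped via --files lets HiveContext talk to the Hive metastore
hc = HiveContext(sc)
result = hc.sql("SELECT COUNT(*) FROM some_db.some_table")  # placeholder query
result.show()

sc.stop()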