Analytics/Cluster/Spark
How do I ...
See Spark logs on my local machine when using spark-submit
- If you are running Spark in local mode, spark-submit writes logs to your console by default.
- How do I get logs written to a file?
- Spark uses log4j for logging, and the log4j configuration usually lives at /etc/spark/log4j.properties
- The default configuration uses a ConsoleAppender; to write to a file instead, use a log4j properties file like this:
# Set everything to be logged to the file
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark.log
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
This should write logs to /tmp/spark.log
- On the analytics cluster (stat1002):
- Running a Spark job through spark-submit writes logs to the console too, in both yarn and local modes
- To write to a file, create a log4j.properties file similar to the one above, using the FileAppender
- One option is to pass your custom log4j.properties file to spark-submit with the --files argument.
- The other is to set it through the extraJavaOptions configuration.
- For both options, see http://spark.apache.org/docs/1.3.0/running-on-yarn.html (under "Debugging your Application")
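A sketch of the two approaches, assuming log4j 1.x (which reads the -Dlog4j.configuration system property); the app name my_app.py and the local file paths are hypothetical examples:

```shell
# Option 1: ship a custom log4j.properties with --files; the file lands in the
# working directory of the driver and executors, so reference it by bare name
spark-submit --master yarn \
  --files /path/to/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  my_app.py

# Option 2: in local/client mode, point the driver at a local file directly
spark-submit --master local \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
  my_app.py
```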
- While running a Spark job through Oozie
- The log4j file path now needs to be a location accessible by all drivers/executors, which run on different machines
- Putting the file in a temp directory on HDFS and using an hdfs:// URL should do the trick
- Note that the logs will be written on the machines where the driver/executors run - so you'd need access to those machines to go look at them
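A minimal sketch of the HDFS approach; the /tmp/my-oozie-job directory is an arbitrary example path:

```shell
# Upload the custom log4j.properties somewhere every driver/executor can read it
hdfs dfs -mkdir -p /tmp/my-oozie-job
hdfs dfs -put log4j.properties /tmp/my-oozie-job/log4j.properties

# Then reference it with an hdfs:// URL in the job's spark options, e.g.:
#   --files hdfs:///tmp/my-oozie-job/log4j.properties
```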
- Poke madhuvishy on #wikimedia-analytics for help!
Spark and IPython
The Spark Python API makes working with data in HDFS super easy. For exploratory tasks, I like using IPython Notebooks. You can run Spark from an IPython Notebook by doing the following:
On stat1002:
Tell pyspark to start the IPython Notebook server when invoked
export IPYTHON_OPTS="notebook --pylab inline --port 8123 --ip='*' --no-browser"
Start pyspark
pyspark --master yarn --deploy-mode client --num-executors 2 --executor-memory 2g --executor-cores 2
On Your Laptop:
Create a tunnel from your machine to the IPython Notebook server
ssh -N bast1001.wikimedia.org -L 8123:stat1002.eqiad.wmnet:8123
Finally, navigate to http://localhost:8123, create a notebook and start coding!
Spark and Oozie
Oozie has a spark action, allowing you to launch Spark jobs (almost ...) as you would with spark-submit:
<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${job_tracker}</job-tracker>
    <name-node>${name_node}</name-node>
    <configuration>
        <property>
            <name>mapreduce.job.queuename</name>
            <value>${queue_name}</value>
        </property>
        <property>
            <name>oozie.launcher.mapred.job.queue.name</name>
            <value>${oozie_launcher_queue_name}</value>
        </property>
        <property>
            <name>oozie.launcher.mapreduce.map.memory.mb</name>
            <value>${oozie_launcher_memory}</value>
        </property>
    </configuration>
    <master>yarn</master>
    <mode>cluster</mode>
    <name>${spark_job_name}</name>
    <jar>${spark_code_path_jar_or_py}</jar>
    <spark-opts>--conf spark.yarn.jar=${spark_assembly_jar} --executor-memory ${spark_executor_memory} --driver-memory ${spark_driver_memory} --num-executors ${spark_number_executors} --queue ${queue_name} --conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus --driver-class-path ${hive_lib_path} --driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" --files ${hive_site_xml}</spark-opts>
    <arg>--arg1_name</arg>
    <arg>arg1</arg>
    <arg>--arg2_name</arg>
    <arg>arg2</arg>
    ...
</spark>
The tricky parts here are in the spark-opts element: Spark needs some specific configuration settings that are not loaded automatically, as they would be with spark-submit:
- Core spark jar is needed in configuration:
--conf spark.yarn.jar=${spark_assembly_jar}
# on analytics-hadoop:
# spark_assembly_jar = hdfs://analytics-hadoop/user/spark/share/lib/spark-assembly.jar
- When using Python, you need to set the SPARK_HOME environment variable (to a dummy value, for instance):
--conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus
- If you want to use HiveContext in Spark, you need to add the Hive lib jars and hive-site.xml to the Spark classpath (not done by default in our version):
--driver-class-path ${hive_lib_path} \
--driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" \
--files ${hive_site_xml}
# on analytics-hadoop:
# hive_lib_path = /usr/lib/hive/lib/*
# hive_site_xml = hdfs://analytics-hadoop//util/hive/hive-site.xml
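The same Hive wiring can be sketched for an interactive session outside Oozie, using the analytics-hadoop values above; the local hive-site.xml path /etc/hive/conf/hive-site.xml is an assumption, adjust it for your setup:

```shell
# Start pyspark with the Hive lib jars on the driver and executor classpaths,
# and ship hive-site.xml so HiveContext can find the metastore
# (/etc/hive/conf/hive-site.xml is an assumed path)
pyspark --master yarn \
  --driver-class-path '/usr/lib/hive/lib/*' \
  --driver-java-options '-Dspark.executor.extraClassPath=/usr/lib/hive/lib/*' \
  --files /etc/hive/conf/hive-site.xml
```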