Jupyter notebooks on EMR

Exploratory data analysis requires interactive code execution. With Spark on EMR, it is very convenient to run the code from Jupyter notebooks on a remote cluster. EMR allows installing Jupyter on the Spark master. To do that, configure the "Applications" field of the EMR cluster so that it also contains JupyterHub. For example: "Applications": [ { "Name": "Ganglia", "Version": "3.7.2" }, { "Name": "Spark", "Version": "2.4.0" }, { "Name": "Zeppelin", "Version": "0....
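As a rough illustration of the "Applications" field described above, the list can be built in Python before being passed to whatever tool creates the cluster. This is only a sketch: the helper name `emr_applications` is hypothetical, versions are left out on the assumption that EMR picks the ones bundled with the chosen release, and it assumes an EMR release that ships JupyterHub.

```python
import json


def emr_applications(names):
    """Return an EMR-style "Applications" list with one {"Name": ...} entry per application."""
    return [{"Name": name} for name in names]


# Assumption: versions are omitted and resolved by the EMR release label.
apps = emr_applications(["Ganglia", "Spark", "Zeppelin", "JupyterHub"])

# Print the fragment in the same shape as the cluster configuration.
print(json.dumps({"Applications": apps}, indent=2))
```

The printed JSON fragment has the same structure as the example in the excerpt, with JupyterHub added as the last entry.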

February 4, 2019 · SergeM

Spark on a local machine

How to install Spark locally

This assumes a Spark build without Hadoop built in.

Download Hadoop and unpack it to /opt/hadoop/.
Download Spark without Hadoop and unpack it to /opt/spark.
Install Java and set the JAVA_HOME environment variable. For example:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Create the environment variables that Spark requires to run. One can put them in .bashrc:
export HADOOP_HOME=/opt/hadoop
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*
Now you can run pyspark, for example:
$ /opt/spark/bin/pyspark
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information....
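The long SPARK_DIST_CLASSPATH value is easy to mistype, so a small sketch like the following can generate it from HADOOP_HOME. The function name `spark_dist_classpath` and the constant `HADOOP_CLASSPATH_DIRS` are hypothetical; the directories are the ones from the export line above, each listed once.

```python
# Sub-directories of HADOOP_HOME that go on the classpath, in the order used above.
HADOOP_CLASSPATH_DIRS = [
    "etc/hadoop",
    "share/hadoop/common/lib",
    "share/hadoop/common",
    "share/hadoop/hdfs",
    "share/hadoop/hdfs/lib",
    "share/hadoop/yarn/lib",
    "share/hadoop/yarn",
    "share/hadoop/mapreduce/lib",
    "share/hadoop/mapreduce",
    "share/hadoop/tools/lib",
]


def spark_dist_classpath(hadoop_home):
    """Join the '<hadoop_home>/<dir>/*' entries with ':' (the Linux classpath separator)."""
    home = hadoop_home.rstrip("/")
    return ":".join(f"{home}/{d}/*" for d in HADOOP_CLASSPATH_DIRS)


# Print a line that can be pasted into .bashrc.
print("export SPARK_DIST_CLASSPATH=" + spark_dist_classpath("/opt/hadoop"))
```

If the `hadoop` binary is on the PATH, the Spark documentation for Hadoop-free builds sets the same variable with `export SPARK_DIST_CLASSPATH=$(hadoop classpath)`, which may be simpler than maintaining the list by hand.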

January 30, 2019 · SergeM

Spark in Docker with AWS credentials

Running Spark in a Docker container

Setting up Spark is tricky, so it is useful to try things out locally before deploying to the cluster. Docker is a good help here. There is a great Docker image for playing with Spark locally: gettyimages/docker-spark

Examples

Running the SparkPi sample program (one of the examples from the Spark docs):
docker run --rm -it -p 4040:4040 gettyimages/spark bin/run-example SparkPi 10
Running a small example with PySpark:...
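When several variations of the docker run command are needed, it can help to assemble the argument list in Python. The sketch below rebuilds the SparkPi command from the excerpt. The helper `spark_docker_command` and its `forward_env` parameter are hypothetical, and forwarding AWS credentials as environment variables is only one possible approach, assumed here for illustration; the post itself may handle credentials differently.

```python
import shlex


def spark_docker_command(args, image="gettyimages/spark", ui_port=4040, forward_env=()):
    """Build the argv list for running a command inside the Spark image."""
    cmd = ["docker", "run", "--rm", "-it", "-p", f"{ui_port}:{ui_port}"]
    for name in forward_env:
        # `-e NAME` without a value passes NAME through from the host environment.
        cmd += ["-e", name]
    cmd.append(image)
    cmd += list(args)
    return cmd


# The SparkPi example from the text.
cmd = spark_docker_command(["bin/run-example", "SparkPi", "10"])
print(shlex.join(cmd))

# Assumption: credentials are exported on the host under the standard AWS variable names.
aws_cmd = spark_docker_command(
    ["bin/pyspark"],
    forward_env=("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
)
print(shlex.join(aws_cmd))
```

The list form can be handed to `subprocess.run` from an interactive terminal; building it separately keeps the command easy to inspect before anything is executed.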

July 29, 2018 · SergeM