Running Spark in a Docker container
Setting up Spark is tricky, so it is useful to try things out locally before deploying to a cluster.
Docker is a great help here: there is a nice Docker image for playing with Spark locally, gettyimages/docker-spark.
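docker run pulls the image automatically on first use, but you can also fetch it explicitly:
docker pull gettyimages/spark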
Running the SparkPi sample program (one of the examples that ships with Spark):
docker run --rm -it -p 4040:4040 gettyimages/spark bin/run-example SparkPi 10
Running a small example with PySpark:
echo -e "import pyspark\n\nprint(pyspark.SparkContext().parallelize(range(0, 10)).count())" > count.py docker run --rm -it -p 4040:4040 -v $(pwd)/count.py:/count.py gettyimages/spark bin/spark-submit /count.py
Here we create a file with a Python program outside of the container. During docker run we map this file to /count.py inside the container, and then we execute the bin/spark-submit command, which runs our code.
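For readability, the same count.py written out as a regular file is just:

import pyspark

# Create a local SparkContext, distribute the numbers 0..9 and count them.
print(pyspark.SparkContext().parallelize(range(0, 10)).count())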
You can also run PySpark in interactive mode:
$ docker run --rm -it -p 4040:4040 gettyimages/spark bin/pyspark
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
2018-07-29 20:03:59 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.5.3 (default, Jan 19 2017 14:11:04)
SparkSession available as 'spark'.
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
>>>
Now you can enter commands and evaluate your code in interactive mode.
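For example, you can play with the SparkContext sc that the shell has already created:

>>> rdd = sc.parallelize(range(100))
>>> rdd.sum()
4950
>>> rdd.filter(lambda x: x % 2 == 0).count()
50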
Running a cluster with docker-compose
One can use the docker-compose.yaml file from https://github.com/gettyimages/docker-spark.git to run a cluster locally.
The docker-compose.yaml file looks roughly like this (a simplified sketch, not the exact file from the repository):
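# Simplified sketch of the compose file from gettyimages/docker-spark:
# one master service and one worker service, both based on the same image.
version: "2"
services:
  master:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    ports:
      - 4040:4040
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
  worker:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_PUBLIC_DNS: localhost
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker:/conf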
Run it with the following command:
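# run from the directory that contains docker-compose.yaml
docker-compose up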
It mounts the Spark configuration for the master and worker nodes from the conf/ directory of the repository.
Accessing S3 from local Spark
I want to experiment locally with Spark, but my data is stored in the cloud, in AWS S3. If I deploy Spark on EMR, credentials are automatically passed to Spark by AWS, but locally that is not the case. In the simple case one can use environment variables to pass AWS credentials:
docker run --rm -it -e "AWS_ACCESS_KEY_ID=YOURKEY" -e "AWS_SECRET_ACCESS_KEY=YOURSECRET" -p 4040:4040 gettyimages/spark bin/spark-shell
Loading credentials from ~/.aws/credentials
If you want to use the AWS S3 credentials from ~/.aws/credentials, you have to do some configuration.
In the previous cluster example one has to specify the credentials provider explicitly.
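Assuming the s3a filesystem from hadoop-aws is used (this is how s3a is typically configured, not something specific to this image), the provider is selected with the following line in spark-defaults.conf, or via --conf on the command line:

spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.profile.ProfileCredentialsProvider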
Let’s say we want to run some code like this (a PySpark sketch that reads from S3 via s3a; the bucket and path are placeholders):
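import pyspark

sc = pyspark.SparkContext()
# Read a file from S3 through the s3a filesystem.
# The bucket and key below are hypothetical placeholders.
lines = sc.textFile("s3a://my-bucket/path/to/data.csv")
print(lines.count())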
Now, if you configure the rest properly and run the cluster, you can access your S3 data from the local Spark running in a Docker container.
Without the configuration we would get the following error:
Caused by: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint
That basically means that Spark tried three credentials providers: BasicAWSCredentialsProvider, EnvironmentVariableCredentialsProvider and InstanceProfileCredentialsProvider, but none of them worked.
- EnvironmentVariableCredentialsProvider - one that loads the credentials from environment variables
- InstanceProfileCredentialsProvider - one that works on AWS instances, reading credentials from the instance's IAM role (instance profile).
- BasicAWSCredentialsProvider - I don’t know what it is.
What we need is ProfileCredentialsProvider. It reads the credentials from ~/.aws/credentials.
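For the provider to actually find that file inside the container, ~/.aws has to be mounted into the container as well. A minimal sketch, assuming the container runs as root so its home directory is /root:

docker run --rm -it -p 4040:4040 \
  -v ~/.aws:/root/.aws \
  gettyimages/spark bin/spark-shell

For the docker-compose cluster the same directory can be added under volumes: for the master and worker services.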