Jupyter notebooks on EMR

Exploratory data analysis requires interactive code execution. In the case of Spark and EMR it is very convenient to run the code from Jupyter notebooks on a remote cluster. EMR allows installing Jupyter on the Spark master. To do that, configure the "Applications" field of the EMR cluster so that it also contains JupyterHub. For example: "Applications": [ { "Name": "Ganglia", "Version": "3.7.2" }, { "Name": "Spark", "Version": "2.4.0" }, { "Name": "Zeppelin", "Version": "0....
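
The same "Applications" list can also be supplied when the cluster is launched programmatically. Below is a minimal sketch using boto3's run_job_flow; the cluster name, region, release label and instance settings are assumptions for illustration, not values from the post.

import boto3

# Sketch: launch an EMR cluster whose application list also contains JupyterHub.
# Cluster name, region, release label and instance settings are assumptions.
emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="spark-with-jupyterhub",
    ReleaseLabel="emr-5.20.0",
    Applications=[
        {"Name": "Ganglia"},
        {"Name": "Spark"},
        {"Name": "JupyterHub"},
    ],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])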

February 4, 2019 · SergeM

Spark on a local machine

How to install Spark locally, considering Spark without Hadoop built in. Download Hadoop and unpack it to /opt/hadoop/. Download Spark without Hadoop and unpack it to /opt/spark. Install Java and set the JAVA_HOME environment variable, for example:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Create the environment variables required for Spark to run. One can put those in .bashrc:

export HADOOP_HOME=/opt/hadoop
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*

Now you can run pyspark, for example:

$ /opt/spark/bin/pyspark
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information....
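
Once the paths and variables are in place, a quick way to confirm that the setup works is to run a trivial local job. The sketch below assumes pyspark is importable from Python (e.g. via pip install pyspark or by adding /opt/spark/python to PYTHONPATH); the numbers are arbitrary.

from pyspark.sql import SparkSession

# Sketch: a tiny local job to verify the Spark installation and classpath.
spark = SparkSession.builder.master("local[2]").appName("smoke-test").getOrCreate()
df = spark.range(100)                         # one column "id" with values 0..99
print(df.selectExpr("sum(id)").collect())     # expect [Row(sum(id)=4950)]
spark.stop()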

January 30, 2019 · SergeM

GPIO controls for Raspberry Pi

Libraries for GPIO: Node JS, Python. Using RPi.GPIO: How to Exit GPIO programs cleanly, avoid warnings and protect your Pi; Setting up RPi.GPIO, numbering systems and inputs; on using hardware PWM without sudo due to permissions for /dev/gpiomem: discussion. General: pigpio. The library also provides a service. It can be useful if you don’t want to give root access to the client applications and want to control PWM for example....
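
As a concrete illustration of the RPi.GPIO route, here is a minimal software-PWM sketch (assumptions: it runs on a Pi with RPi.GPIO installed and BCM pin 18 free); the try/finally cleanup is what lets the program exit without leaving channels configured.

import time
import RPi.GPIO as GPIO

GPIO.setmode(GPIO.BCM)      # Broadcom pin numbering
GPIO.setup(18, GPIO.OUT)

pwm = GPIO.PWM(18, 1000)    # 1 kHz software PWM on pin 18
pwm.start(50)               # 50% duty cycle
try:
    time.sleep(5)
finally:
    pwm.stop()
    GPIO.cleanup()          # release the pins so the next run starts clean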

September 23, 2018 · SergeM

Reducing disk usage in Ubuntu

Here are some recipes to make Ubuntu installed on a USB drive work faster. [1] [2] Reducing swapping: add these lines to /etc/sysctl.conf, and reboot.

vm.swappiness = 0
vm.dirty_background_ratio = 20
vm.dirty_expire_centisecs = 0
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 0

More caching while writing to disk: add noatime,commit=120,… to the /etc/fstab entries for / and /home
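
After rebooting, the kernel values can be read back from /proc to confirm that /etc/sysctl.conf was picked up. A small Python sketch (the list of keys simply mirrors the settings above):

# Read back the vm.* settings mentioned above from /proc/sys.
keys = [
    "vm/swappiness",
    "vm/dirty_background_ratio",
    "vm/dirty_expire_centisecs",
    "vm/dirty_ratio",
    "vm/dirty_writeback_centisecs",
]
for key in keys:
    with open("/proc/sys/" + key) as f:
        print(key.replace("/", "."), "=", f.read().strip())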

August 16, 2018 · SergeM

Spark in Docker with AWS credentials

Running Spark in a Docker container. Setting up Spark is tricky, so it is useful to try things out locally before deploying to the cluster, and Docker is a good help here. There is a great Docker image for playing with Spark locally: gettyimages/docker-spark. Examples: running the SparkPi sample program (one of the examples from the Spark docs):

docker run --rm -it -p 4040:4040 gettyimages/spark bin/run-example SparkPi 10

Running a small example with Pyspark:...
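
For the AWS-credentials part, one common pattern is to pass the keys into Spark's Hadoop configuration so that s3a:// paths become readable. The sketch below is an illustration under several assumptions: pyspark is available inside the container, the hadoop-aws jars are on the classpath, the AWS_* environment variables are set, and the bucket path is hypothetical.

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()

# Hand the credentials from the environment to the s3a filesystem.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
hadoop_conf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

# Hypothetical bucket and prefix, just to show the call shape.
df = spark.read.csv("s3a://my-bucket/some/prefix/", header=True)
df.show(5)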

July 29, 2018 · SergeM

OKRs

We had a production system written by mathematicians, 50 different stakeholders with conflicting targets, five leadership changes during the last year, a dozen microservices, AWS costs of ten thousand per week, a whole galaxy of legacy databases, cron jobs, Celery, greenlets, … Also, an unstable API as a dependency, 10 GB of text dumps as output, user input without validation, false alarms in monitoring, and two dozen unprotected public endpoints. Not that we needed all that for the work, but once you get locked into a serious agile development, the tendency is to push it as far as you can....

July 4, 2018 · SergeM

Bokeh in Jupyter notebooks for interactive plots

Bokeh is a library for interactive visualization. One can use it in Jupyter notebooks. Here is an example. Let's say we have a pandas dataframe with timestamps and some values:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""timestamp,value
2018-01-01T10:00:00,20
2018-01-01T12:00:00,10
2018-01-01T14:00:00,30
2018-01-02T10:30:00,40
2018-01-02T13:00:00,50
2018-01-02T18:00:40,10
"""), parse_dates=["timestamp"])

You can visualize it as a nice graph with zoom, selection, and mouse-over tooltips using Bokeh:...
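
The plotting side could look like the sketch below (assuming the Bokeh 1.x API with plot_width/plot_height and the dataframe defined above); inside a notebook, output_notebook() makes show() render the interactive figure in the cell.

from bokeh.io import output_notebook
from bokeh.models import HoverTool
from bokeh.plotting import figure, show

output_notebook()   # route Bokeh output into the notebook

p = figure(x_axis_type="datetime", plot_width=600, plot_height=300)
p.line(df["timestamp"], df["value"], line_width=2)
p.circle(df["timestamp"], df["value"], size=6)
p.add_tools(HoverTool(tooltips=[("value", "@y")]))
show(p)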

June 20, 2018 · SergeM

Comparison of click-based config parsers for Python

Problem: there is the click module that allows you to create command line interfaces for your Python scripts. The advantages of click are a nice syntax:

import click

@click.command()
@click.option('--count', default=1, help='Number of greetings.')
@click.option('--name', prompt='Your name', help='The person to greet.')
def hello(count, name):
    """Simple program that greets NAME for a total of COUNT times."""
    for x in range(count):
        click....
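
For reference, a complete minimal variant of that snippet (the body of the loop is an assumption based on click's documentation example, since the excerpt is cut off):

import click

@click.command()
@click.option('--count', default=1, help='Number of greetings.')
@click.option('--name', prompt='Your name', help='The person to greet.')
def hello(count, name):
    """Simple program that greets NAME for a total of COUNT times."""
    for _ in range(count):
        click.echo('Hello, %s!' % name)

if __name__ == '__main__':
    hello()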

June 6, 2018 · SergeM

Vim cheat sheet

Some frequently used commands in Vim. File explorer: :Explore - opens the file explorer window; :E - the same. Visual commands: > - shift right; < - shift left; y - yank (copy) marked text; d - delete marked text; ~ - switch case. Cut and paste: yy - yank (copy) a line; 2yy - yank 2 lines; yw - yank word; y$ - yank to end of line; p - put (paste) the clipboard after cursor; P - put (paste) before cursor; dd - delete (cut) a line; dw - delete (cut) the current word; x - delete (cut) current character. Search/Replace: /pattern - search for pattern; ?...

May 31, 2018 · SergeM

Select lines matching a regular expression in Python

Given a text file, we want to create another file containing only those lines that match a certain regular expression, using Python 3:

import re

with open("./in.txt", "r") as input_file, open("out.txt", "w") as output_file:
    for line in input_file:
        if re.match("(.*)import(.*)", line):
            # "line" already ends with a newline, so suppress print's own
            print(line, end="", file=output_file)

February 26, 2018 · SergeM