CI and monitoring for Apache Spark applications


This article is for those who know the basics of Apache Spark and want to set up continuous integration and monitoring for their Spark projects. Any project sooner or later goes to production, and Spark applications are no exception. After that, the work comes down to making edits and monitoring performance. Monitoring is especially useful for streaming applications, but periodic batch tasks benefit from it too; otherwise you will sit and wait, unaware that Spark has run out of memory or CPU time.

The first step: CI runs tests and pushes the artefact to Nexus

When working with streaming Spark applications, the process is the same as in web development: we make edits, create a commit and push it to GitLab/GitHub/Bitbucket. GitLab then runs the CI task (example .gitlab-ci.yml):

  - mvn clean

    - mvn test deploy

We have already partly covered the CI process in the article Continuous Integration, Jenkins and Middleman.

Pushing an artefact to Nexus: in Maven terms, uploading an artefact to a repository is called deploying. The Nexus documentation describes in detail how to work with it.

The second step: executing our Spark application through spark-jobserver.

Now we only need to run the Spark application on a test cluster and start monitoring it. This is where the spark-jobserver project helps: it provides an HTTP API for managing Spark jobs.

Let's query spark-jobserver for the list of uploaded jars:

curl localhost:8090/jars

Our jar is not there yet, so let's upload it:

curl --data-binary @path/target/to/your.jar localhost:8090/jars/test

There may already be a running job that we need to replace, so we should stop it first:

curl -XDELETE localhost:8090/jobs/<jobId>

You can get the list of all jobs with the following request:

curl localhost:8090/jobs
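The list-then-stop steps can also be scripted. Below is a minimal Python sketch, assuming that `GET /jobs` returns a JSON array of objects with `jobId`, `status` and `classPath` fields (check this against the response of your jobserver version). The function names are ours, and any class name you pass in is your own job class.

```python
import json
import urllib.request

JOBSERVER = "http://localhost:8090"  # same address as in the curl examples


def running_job_ids(jobs, class_path):
    """Return the ids of jobs of the given class that are still running.

    Assumes each entry has "jobId", "status" and "classPath" keys.
    """
    return [
        job["jobId"]
        for job in jobs
        if job.get("status") == "RUNNING" and job.get("classPath") == class_path
    ]


def stop_running_jobs(class_path, base_url=JOBSERVER, timeout=10):
    """List all jobs, then send DELETE /jobs/<jobId> for each running match."""
    with urllib.request.urlopen(base_url + "/jobs", timeout=timeout) as resp:
        jobs = json.load(resp)
    for job_id in running_job_ids(jobs, class_path):
        req = urllib.request.Request(f"{base_url}/jobs/{job_id}", method="DELETE")
        urllib.request.urlopen(req, timeout=timeout).close()
```

Filtering is kept in a separate pure function so that it can be checked against a saved response without touching a live cluster.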

And now we can execute our job:

curl -XPOST 'localhost:8090/jobs?appName=test&'

An important point is ?sync=false: there is no point in waiting for completion when starting a streaming job, or a long batch job for that matter. A streaming job never finishes, so a synchronous request would just hang until it times out. If it is not clear why, read up on how HTTP works, paying special attention to timeouts.

All in all, the jobserver offers many more possibilities; read its documentation for details. Pay special attention to authentication, or else some schoolboys will do their homework on your cluster :)
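For illustration, here is a small Python sketch of an asynchronous submission. It assumes the jobserver accepts `appName`, `classPath` and `sync` as query parameters on `POST /jobs`; the helper names and the class name `com.example.StreamingJob` are hypothetical.

```python
import urllib.parse
import urllib.request


def job_url(base_url, app_name, class_path, sync=False):
    """Build the POST /jobs URL; sync=false means "do not wait for the result"."""
    query = urllib.parse.urlencode({
        "appName": app_name,
        "classPath": class_path,
        "sync": "true" if sync else "false",
    })
    return f"{base_url}/jobs?{query}"


def start_job(base_url, app_name, class_path, timeout=10):
    """Submit the job asynchronously and return the raw response body."""
    req = urllib.request.Request(
        job_url(base_url, app_name, class_path), data=b"", method="POST"
    )
    # The explicit timeout keeps a CI step from hanging on a stuck server.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical class name, for illustration only.
print(job_url("http://localhost:8090", "test", "com.example.StreamingJob"))
```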

As a result, the simplest config for gitlab-ci:

  - mvn clean

    - mvn test deploy
    - curl --data-binary @path/target/to/your.jar localhost:8090/jars/test
    - curl -XPOST 'localhost:8090/jobs?appName=test&'

Monitoring the state of applications

Zabbix is a perfectly decent monitoring system, so let's work with it. If you have never dealt with Zabbix: our Russian guys from Latvia wrote it, and it has great documentation in English.

Now let's configure Zabbix to gather information about a job. If no other jobs are running on the host, the job's metrics are available on port 4040. You can inspect them with the help of that godawful language Python (any other language would do, but Python is fast and efficient here):

curl "" | python -mjson.tool

Now let's write a script that takes the IP address and port of a running Spark process, plus the name of the metric to read, as arguments:

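#!/bin/bash
PYSTR1='import json,sys;obj=json.load(sys.stdin);print(obj["gauges"]["'
PYSTR3='"]["value"])'  # closes the lookup; each gauge is an object with a "value" field
curl -s -X GET "http://$1:$2/metrics/json/" | python -c "$PYSTR1$3$PYSTR3"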
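The same idea can be written as a plain Python script, which avoids gluing code together from shell strings. This is a sketch under the assumption that every entry under `gauges` is an object with a `value` field; the function names are ours.

```python
import json
import sys
import urllib.request


def read_gauge(metrics, name):
    """Return the value of one gauge from the parsed /metrics/json/ document."""
    return metrics["gauges"][name]["value"]


def fetch_metrics(host, port, timeout=5):
    """Download and parse the metrics JSON of a running Spark driver."""
    url = f"http://{host}:{port}/metrics/json/"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def main(argv):
    """Arguments: <host> <port> <gauge name>; prints the value for Zabbix."""
    host, port, name = argv[1:4]
    print(read_gauge(fetch_metrics(host, port), name))


# Only run when exactly three arguments are given, as Zabbix would pass them.
if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

An unknown gauge name raises `KeyError`, so the script exits with a non-zero status instead of returning an empty value.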

Put the script into /var/lib/zabbixsrv/externalscripts/ and don't forget to change its owner, set the SELinux context and make it executable:

chown zabbixsrv:zabbixsrv /var/lib/zabbixsrv/externalscripts/
restorecon -R /var/lib/zabbixsrv/externalscripts/
chmod +x /var/lib/zabbixsrv/externalscripts/

Zabbix has its own system for gathering metrics, called items. It is described very well in the documentation, so let's create an item following those instructions.

We use the ability to call an external script here (the "External check" item type), but that is just one of the options.

In the Key field, enter: ["","4040","local-1453247966071.driver.AggrTouchpointMySql.StreamingMetrics.streaming.lastCompletedBatch_processingDelay"]

Then create a graph that uses the item. Or skip the graph: the most important thing is that we now have the data.
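One thing to watch in this key: the metric name starts with the application id (`local-1453247966071` here), which Spark generates anew for each run, so a hard-coded key stops matching after a restart. A possible workaround, sketched below in Python, is to look the gauge up by its suffix instead. The helper name and the sample values are made up for illustration.

```python
def find_gauge(metrics, suffix):
    """Return (name, value) of the single gauge whose name ends with suffix."""
    matches = [
        (name, gauge["value"])
        for name, gauge in metrics["gauges"].items()
        if name.endswith(suffix)
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one gauge ending in {suffix!r}, found {len(matches)}"
        )
    return matches[0]


# Made-up sample in the shape of the /metrics/json/ response.
sample = {"gauges": {
    "local-1453247966071.driver.AggrTouchpointMySql.StreamingMetrics"
    ".streaming.lastCompletedBatch_processingDelay": {"value": 120},
    "local-1453247966071.driver.BlockManager.memory.memUsed_MB": {"value": 3},
}}

name, value = find_gauge(
    sample, ".StreamingMetrics.streaming.lastCompletedBatch_processingDelay"
)
print(name, value)
```

Raising an error on zero or several matches is deliberate: silently picking the wrong gauge would be worse than a failed check.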

And now gathering metrics properly

Does the script approach feel unserious to you? You are right, it is a clumsy way. Spark can expose its metrics over JMX, and Zabbix can read them. Then we don't need all that hipsterish JSON and bash (hipsterish bash :) ). So let's start from the beginning.

We need to enable the JMX metrics sink in Spark:

cat ./conf/

Create the file /etc/spark/jmxremote.password with a user name and password:

username password

Run the Spark job with the option:

--driver-java-options " -Djava.rmi.server.hostname=localhost"

Make sure to turn on encryption once debugging is done; you can never be too paranoid. It is also a good idea to make the password file itself read-only and to protect it from being read by other users.

chown spark:spark /etc/spark/jmxremote.password
chmod 0400 /etc/spark/jmxremote.password
chattr +i /etc/spark/jmxremote.password
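If the password file is generated by a provisioning script rather than by hand, it can be created with restrictive permissions from the start. The following Python sketch (the function name is ours) writes the `username password` line with mode 0400; the demo writes to a temporary directory rather than /etc/spark, and `chown` and `chattr` still have to be applied as shown above.

```python
import os
import stat
import tempfile


def write_password_file(path, username, password):
    """Create the JMX password file readable only by its owner (mode 0400)."""
    # O_EXCL refuses to overwrite an existing file; the mode is applied at
    # creation time, so the secret is never readable by others, even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{username} {password}\n")


# Demo in a throwaway directory; placeholder credentials as in the article.
demo_path = os.path.join(tempfile.mkdtemp(), "jmxremote.password")
write_password_file(demo_path, "username", "password")
print(oct(stat.S_IMODE(os.stat(demo_path).st_mode)))
```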

It is very easy to connect to JMX with jconsole and look at what is there: a lot of useful information, especially for optimization. Just choose which keys you want to see in Zabbix and create the items you need.

Spark can also send metrics to Ganglia or Graphite; these sinks are described in detail in the Spark monitoring documentation.