Spark Documentation

Step-by-step documentation on how to run Spark SQL queries on raw trade data

About Apache Spark

Apache Spark is a fast and general engine for large-scale data processing. It can process more data than fits into RAM, which is the main limitation when using R. Commercial statistical tools such as SPSS and SAS are not constrained by RAM: observations are loaded, processed and removed sequentially. Recent versions of RStudio, a popular IDE for R, include an interface to Spark. However, this is complicated to set up on Windows with limited user privileges.

Data Preparation

In order to use Spark for processing local data, we need to store the information in a compressed text file format.
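
As a minimal sketch, assuming the raw trade data is exported as gzip-compressed CSV files with a header row (the file path and Spark 2.x are assumptions), Spark reads such files directly without manual decompression:

    import org.apache.spark.sql.SparkSession

    // Local Spark session; "local[*]" uses all available cores.
    val spark = SparkSession.builder()
      .appName("TradeDataPreparation")
      .master("local[*]")
      .getOrCreate()

    // Spark decompresses *.csv.gz transparently while reading.
    val trades = spark.read
      .option("header", "true")       // first line contains column names
      .option("inferSchema", "true")  // derive column types from the data
      .csv("data/trade_raw/*.csv.gz") // placeholder path
    trades.printSchema()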

Remote Setup

The quickest way to execute SQL queries is offered by the Databricks platform in combination with Amazon S3.

Amazon S3

User Login
URL: https://058644585154.signin.aws.amazon.com/console
User: tradeuser
Password: f!3P)oDl7!5%
AWS_ACCESS_KEY_ID: AKIAJLC5BRWMJD5VN2HA
AWS_SECRET_ACCESS_KEY: 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

Databricks

Databricks is an online platform that allows using Spark interactively with minimal investment. The free community edition starts small Spark clusters on Amazon AWS with 6 GB of memory, which is sufficient to convert the data and run the queries.

Register

We need to register on a per-user basis because an email address is required for user authentication.

Convert data for remote use

In order to execute queries on the complete time series, the information must be stored in Parquet format. The data has already been uploaded to Amazon S3. If the input data changes, the conversion of the raw data to Parquet must be repeated.
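
In that case, the conversion could look roughly like the sketch below in a Databricks notebook (Spark 2.x is assumed, where the notebook predefines spark; the bucket name and paths are placeholders, and the S3 credentials are assumed to be configured for the cluster):

    // Read the compressed raw files from S3.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3a://your-trade-bucket/raw/*.csv.gz")

    // Write the data as Parquet; columnar storage lets queries on the
    // complete time series read only the columns they need.
    raw.write
      .mode("overwrite")
      .parquet("s3a://your-trade-bucket/parquet/trades")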

Test Query

The test query schedules multiple jobs that are executed by the cluster's worker nodes. The execution can be tracked with different tools, for example the Spark UI. The result is displayed in the web browser and can be copied manually to a text file for further processing.
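
A minimal sketch of such a test query, assuming the Parquet files written in the previous step and hypothetical table and column names:

    // Register the Parquet data as a temporary view for SQL access.
    spark.read
      .parquet("s3a://your-trade-bucket/parquet/trades")
      .createOrReplaceTempView("trades")

    // Hypothetical aggregation over the complete time series.
    val result = spark.sql("""
      SELECT year, COUNT(*) AS n_trades
      FROM trades
      GROUP BY year
      ORDER BY year
    """)

    display(result)  // Databricks helper; use result.show() outside notebooks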

Local Setup

JDK

In order to compile classes for the JVM, we need a JDK (Java Development Kit).

Lightbend Activator

Spark is implemented in Scala. Running and modifying a Spark application locally requires re-compiling the class files for execution on the JVM. Lightbend Activator is a build tool for Scala that can be used on Windows.
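
A minimal build definition for such a project could look like the sketch below (the project name and version numbers are examples only and must match the locally installed Scala and Spark versions):

    // build.sbt -- minimal sketch
    name := "trade-spark-app"
    version := "0.1"
    scalaVersion := "2.11.8"

    // "provided" because the local Spark installation supplies these jars at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
    )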

Install Spark

Next, we download and install a pre-compiled version of Spark that includes Hadoop.

Convert data for local use

In order to execute queries on the complete time series, the information must be stored in Parquet file format, as in the remote setup.
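
The local conversion mirrors the remote one, only with local paths; a sketch that could be pasted into spark-shell (paths are placeholders, Spark 2.x assumed):

    // Read the compressed raw files and write them as Parquet locally.
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/trade_raw/*.csv.gz")
      .write
      .mode("overwrite")
      .parquet("data/trade_parquet")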

Test Query

An example application that executes a Spark SQL query is available on GitHub.
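
As a rough illustration (not the code from the GitHub repository; the object name, paths and the query are hypothetical), a standalone Spark SQL application has the following shape:

    import org.apache.spark.sql.SparkSession

    object TradeQuery {
      def main(args: Array[String]): Unit = {
        // Local session using all available cores.
        val spark = SparkSession.builder()
          .appName("TradeQuery")
          .master("local[*]")
          .getOrCreate()

        // Expose the local Parquet data to SQL under the name "trades".
        spark.read.parquet("data/trade_parquet").createOrReplaceTempView("trades")

        // Hypothetical query; replace with the query of interest.
        spark.sql("SELECT COUNT(*) AS n FROM trades").show()

        spark.stop()
      }
    }

Such an application can be packaged with Activator/sbt and run against the local Spark installation with spark-submit.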
