Step-by-step documentation on how to run Spark SQL queries on raw trade data
Apache Spark is a fast and general engine for large-scale data processing. It makes it possible to process more data than fits into RAM, which is the main limitation when using R. Commercial statistical tools such as SPSS and SAS are not constrained by RAM in this way: observations are loaded, processed and removed sequentially. Recent versions of RStudio, a popular IDE for R, include an interface to Spark. However, this interface is complicated to set up on Windows with limited user privileges.
In order to use Spark for processing local data, we need to store the information in a compressed text file format: read the rds files created with the R package dryworkflow and export them to csv files compressed in gz format. The shortest way to execute SQL queries is offered by the Databricks platform in combination with Amazon S3.
Databricks is an online platform that allows using Spark interactively with minimal investment. The community edition is free of charge and starts small Spark clusters on Amazon AWS with 6 GB of memory, which is sufficient to convert the data and run the queries. We need to register on a per-user basis because an email address is required for user authentication. Log in with the tradeuser account and its password; access to the Amazon S3 bucket additionally requires the AWS access key ID and secret access key, which are used in the write-parquet-s3.scala notebook.
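The conversion performed in the notebook essentially reads the gz-compressed csv files from S3 and rewrites them as parquet. The following is only a rough sketch of that step, not the actual notebook contents: the bucket name, paths and csv options are placeholders, and spark refers to the SparkSession that Databricks predefines in every notebook.

    // Sketch of the csv-to-parquet conversion; bucket name, paths and options are placeholders.
    val raw = spark.read
      .option("header", "true")        // assumes the exported csv files carry a header row
      .option("inferSchema", "true")   // let Spark guess the column types
      .csv("s3a://my-trade-bucket/raw/*.csv.gz")  // gz-compressed csv is decompressed on the fly

    raw.write
      .mode("overwrite")
      .parquet("s3a://my-trade-bucket/parquet/ct_tariffline")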
In order to execute queries on the complete time series, information must be stored in a different format. Data has already been uploaded to Amazon S3. In case the input data changes, the conversion of raw data to parquet format must be repeated.
Use the LoadDataDS procedure from Spark to read the gz files and save them as parquet files on local disk. The test query schedules multiple jobs that are executed by the cluster's worker nodes; the execution can be tracked with different tools. The result is displayed in the web browser and can be copied manually to a text file for further processing.
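As an illustration of what such a test query can look like (the parquet path, table and column names here are assumptions, not the actual query), counting records per chapter in a notebook cell could be written as:

    // Hypothetical test query: count records per chapter.
    val tariffline = spark.read.parquet("/mnt/trade/parquet/ct_tariffline")  // path is a placeholder
    tariffline.createOrReplaceTempView("ct_tariffline")

    val counts = spark.sql(
      "SELECT chapter, COUNT(*) AS n FROM ct_tariffline GROUP BY chapter ORDER BY chapter")

    counts.show()  // in a Databricks notebook, display(counts) renders the result in the browser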
In order to compile classes for the JVM, we need a JDK. Because the installer cannot be run with limited user privileges, the JDK is extracted from jdk-8u111-windows-x64.exe instead: the file 111 under .rsrc/1033/JAVA_CAB10/ is itself a zipped file containing tools.zip. Extract this file to obtain tools.zip and unzip tools.zip to your desired Java installation path. The sources src.zip are the file 110 at .rsrc/1033/JAVA_CAB9; copy src.zip to the Java installation path where tools.zip has been unzipped. Finally, set the environment variables JAVA_HOME and JDK_HOME to the installation path.

Spark is implemented in Scala. Running and modifying a Spark app locally requires re-compiling the class files for execution on the JVM. Lightbend Activator is a build tool for Scala that can be used on Windows; after unpacking it, add it to the PATH environment variable. Next we download and install a pre-compiled version of Spark including Hadoop and set the SPARK_HOME environment variable to the Spark installation path.
In order to execute queries on the complete time series, the information must be stored in parquet file format. Use the LoadDataDS procedure from Spark to read the gz files and save them as parquet files on local disk. An example application to execute a Spark SQL query is available on github. Start activator in the application directory to compile it; the interactive console can be left with the exit command. The example query is then executed with:

    activator "run-main controller.query.QueryDataDS ct_tariffline_unlogged chapter spark_count_ct_tl_chapter.csv"
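The actual implementation is in the github repository. Purely as a sketch of what such a query runner might do under the hood (argument handling, the parquet location and the exact SQL are assumptions made for illustration), it could look roughly like this:

    package controller.query

    import org.apache.spark.sql.SparkSession

    // Hypothetical sketch of a query runner invoked as:
    //   run-main controller.query.QueryDataDS <table> <group-by column> <output csv>
    object QueryDataDS {
      def main(args: Array[String]): Unit = {
        val Array(table, column, outFile) = args

        val spark = SparkSession.builder()
          .appName("QueryDataDS")
          .master("local[*]")                     // run on all local cores
          .getOrCreate()

        // assumes LoadDataDS saved the parquet files under parquet/<table>
        spark.read.parquet(s"parquet/$table").createOrReplaceTempView(table)

        val result = spark.sql(s"SELECT $column, COUNT(*) AS n FROM $table GROUP BY $column")

        result.coalesce(1)                        // single output file for easier post-processing
          .write
          .option("header", "true")
          .mode("overwrite")
          .csv(outFile)                           // Spark writes a directory named after the argument

        spark.stop()
      }
    }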