Step-by-step documentation on how to run Spark SQL queries on raw trade data
Apache Spark is a fast and general engine for large-scale data processing. It makes it possible to process more data than fits into RAM, which is the main limitation when using R. Commercial statistical tools such as SPSS and SAS are not constrained by RAM: observations are loaded, processed and removed sequentially. Recent versions of RStudio, a popular IDE for R, include an interface to Spark. However, this interface is complicated to set up on Windows with limited user privileges.
In order to use Spark for processing local data, we need to store the information in a compressed text file format:

- save the data in rds format using R dryworkflow
- load the rds files and export them to compressed csv files in gz format

The shortest way to execute SQL queries is offered by the Databricks platform in combination with Amazon S3.
The following credentials and notebook are used:

- Databricks user: tradeuser
- password: f!3P)oDl7!5%
- AWS access key ID: AKIAJLC5BRWMJD5VN2HA
- AWS secret access key: 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
- notebook: write-parquet-s3.scala

Databricks is an online platform that allows using Spark interactively with minimum investment. The community edition is free of charge and starts small Spark clusters on Amazon AWS with 6 GB of memory. This is sufficient to convert the data and run queries.
We need to register on a per-user basis because an email address is required for user authentication.
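The write-parquet-s3.scala notebook itself is not reproduced here. The following is only a minimal sketch of such a conversion in a Scala notebook cell; the bucket name and paths are placeholders, not the actual locations, and the csv files are assumed to carry a header row.

    // Sketch of a conversion cell in a Databricks Scala notebook (spark and sc are predefined there).
    // Replace <bucket> and the placeholder credentials with the actual values.
    val accessKey = "<AWS access key ID>"
    val secretKey = "<AWS secret access key>"
    sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
    sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)

    // gz-compressed csv files are read transparently by Spark
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3a://<bucket>/ct_tariffline/*.csv.gz")

    df.write
      .mode("overwrite")
      .parquet("s3a://<bucket>/ct_tariffline_parquet/")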
In order to execute queries on the complete time series, the information must be stored in a different format. The data has already been uploaded to Amazon S3. In case the input data changes, the conversion of raw data to parquet format must be repeated:

- use the LoadDataDS procedure from Spark to read the gz files and save them as parquet files on local disk

The test query schedules multiple jobs that are executed by the cluster's worker nodes. The execution can be tracked with different tools. The result is displayed in the web browser and can be copied manually to a text file for further processing.
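As an illustration, a test query of this kind could look as follows in a notebook cell. The table name, column name and S3 path are assumptions based on the example command further below, not taken from the actual notebook.

    // Register the parquet files as a temporary view and run a grouped count on it.
    val tl = spark.read.parquet("s3a://<bucket>/ct_tariffline_parquet/")
    tl.createOrReplaceTempView("ct_tariffline")

    val result = spark.sql(
      "SELECT chapter, COUNT(*) AS n FROM ct_tariffline GROUP BY chapter ORDER BY chapter")
    display(result)   // Databricks renders the result as a table in the browser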
In order to compile classes for the JVM, we need a JDK:

- download jdk-8u111-windows-x64.exe
- open the installer as an archive and extract the entry .rsrc/1033/JAVA_CAB10/111, which is also a zipped file containing tools.zip. Extract this file to get tools.zip
- unzip tools.zip to your desired Java installation path
- src.zip is the file 110 at .rsrc/1033/JAVA_CAB9; extract src.zip to the Java installation path where tools.zip has been unzipped
- set JAVA_HOME and JDK_HOME to the installation path

Spark is implemented in Scala. Running and modifying a Spark app locally requires re-compiling the class files for execution on the JVM. Lightbend Activator is a build tool for Scala that can be used on Windows.
- add the Activator installation directory to the PATH environment variable

Next we download and install a pre-compiled version of Spark including Hadoop:
- set SPARK_HOME to the Spark installation path

In order to execute queries on the complete time series, the information must be stored in parquet file format:
- use the LoadDataDS procedure from Spark to read the gz files and save them as parquet files on local disk (see the sketch below)
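LoadDataDS itself is part of the example application; the sketch below only illustrates what such a gz-to-parquet conversion might look like when run locally. The input and output directories and the header option are assumptions, not the actual project layout.

    import org.apache.spark.sql.SparkSession

    object LoadDataDSSketch {
      def main(args: Array[String]): Unit = {
        // local Spark session using all available cores
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("LoadDataDSSketch")
          .getOrCreate()

        // read the compressed csv files (paths are illustrative)
        val df = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("data/csv/ct_tariffline_unlogged/*.csv.gz")

        // save as parquet files on local disk
        df.write.mode("overwrite").parquet("data/parquet/ct_tariffline_unlogged")
        spark.stop()
      }
    }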
An example application to execute a Spark SQL query is available on GitHub:

- start the build tool with the activator command; the interactive console can be left with the exit command
- run the example query:

      activator "run-main controller.query.QueryDataDS ct_tariffline_unlogged chapter spark_count_ct_tl_chapter.csv"
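The actual implementation of QueryDataDS lives in the GitHub project. The sketch below only illustrates how the three command line arguments (table name, grouping column, output csv file) could be used; the parquet directory layout and the shape of the query are assumptions.

    import org.apache.spark.sql.SparkSession

    object QueryDataDSSketch {
      def main(args: Array[String]): Unit = {
        // e.g. args = Array("ct_tariffline_unlogged", "chapter", "spark_count_ct_tl_chapter.csv")
        val Array(table, column, outFile) = args

        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("QueryDataDSSketch")
          .getOrCreate()

        // register the local parquet files as a view named after the table argument
        spark.read.parquet(s"data/parquet/$table").createOrReplaceTempView(table)

        val result = spark.sql(
          s"SELECT $column, COUNT(*) AS n FROM $table GROUP BY $column ORDER BY $column")

        // coalesce(1) writes a single part file inside the output directory named outFile
        result.coalesce(1).write.option("header", "true").mode("overwrite").csv(outFile)
        spark.stop()
      }
    }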