Data processing systems such as Apache Spark [1] rely on runtime code generation [2] to speed up query execution. In this context, code generation typically translates a SQL query into executable Java code, which delivers higher performance than query interpretation. Although SQL code generation can significantly improve Spark's performance, the quality of the generated code is currently suboptimal, and Spark often cannot match the performance of an equivalent C/C++ query execution engine.
In this project, we want to explore how modern compilation techniques commonly used in language VMs such as GraalVM [3] can be applied to SQL code generation to improve the efficiency and performance of systems such as Apache Spark.
[2] https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf