Tech Whims

Spark 3.x Journey of Discovery | Spark 2.4 to 3.4 release notes on Spark Core and SQL

张晓龙 / 2023-05-04



I have collected the parts I care about most here. If you need more detail, see the Spark release notes.

Spark2 to Spark3.0

Spark 3.0: the release vote passed on June 10, 2020.

Spark SQL is the top active component in this release. 46% of the resolved tickets are for Spark SQL.

These enhancements benefit all the higher-level libraries, including structured streaming and MLlib, and higher level APIs, including SQL and DataFrames. Various related optimizations are added in this release.

In the TPC-DS 30TB benchmark, Spark 3.0 is roughly two times faster than Spark 2.4.

The biggest new features in Spark 3.0

Spark3.0 issues

To Spark3.1

This release adds Python type annotations and Python dependency management support as part of Project Zen.

Other major updates include improved ANSI SQL compliance, History Server support for Structured Streaming, the general availability (GA) of Spark on Kubernetes, and node decommissioning in Kubernetes and Standalone.

Highlights in Spark3.1

To Spark3.2 (we currently use Spark 3.2.1 at our company)

In this release, Spark adds a Pandas API layer on top of Spark.

Other major updates include RocksDB StateStore support, session window support, push-based shuffle support, ANSI SQL INTERVAL types, enabling Adaptive Query Execution (AQE) by default, and ANSI SQL mode GA.
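Of the items above, session windows are probably the easiest to illustrate without a cluster. The sketch below is plain Python, not Spark code: it only shows the grouping idea, assuming that an event joins the current session when it arrives less than `gap` after the previous event. The function name `session_windows` and the integer timestamps are made up for the illustration; in PySpark the corresponding built-in is, as far as I know, `pyspark.sql.functions.session_window`.

```python
def session_windows(timestamps, gap):
    """Group event timestamps into sessions split by `gap` of inactivity.

    Returns a list of (session_start, session_end, event_count) tuples,
    where session_end is the last event time plus the gap (exclusive).
    """
    sessions = []
    for ts in sorted(timestamps):
        # An event extends the current session only if it falls inside
        # the half-open range [last_event, last_event + gap).
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return [(s[0], s[-1] + gap, len(s)) for s in sessions]


# Six click events (seconds since some origin) with a 5-second gap.
clicks = [0, 3, 4, 20, 22, 40]
for start, end, count in session_windows(clicks, gap=5):
    print(f"session [{start}, {end}) has {count} event(s)")
```

Unlike tumbling or sliding windows, the window length here is not fixed: it grows as long as events keep arriving within the gap, which is why this kind of aggregation needs state that can be merged across events.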

Highlights in Spark3.2

To Spark3.3

This release improves join query performance via Bloom filters and increases the Pandas API coverage with support for popular Pandas features.

It also simplifies the migration from traditional data warehouses by improving ANSI compliance and supporting dozens of new built-in functions, and it boosts development productivity with better error handling, autocompletion, performance, and profiling.
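To make the Bloom filter join idea more concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: the `BloomFilter` class, the `bloom_filtered_join` helper, and the sizing parameters are all hypothetical and chosen only for illustration. The point it tries to show is that a compact filter built from the small side of a join can discard most non-matching rows on the large side before the expensive part of the join runs.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def bloom_filtered_join(small_side, large_side):
    """Inner-join two lists of (key, value) rows, pre-filtering the large side."""
    bloom = BloomFilter()
    lookup = {}
    for key, value in small_side:
        bloom.add(key)
        lookup.setdefault(key, []).append(value)

    # Rows that fail the Bloom check cannot match, so they are dropped early
    # (in a real engine, ideally before they are shuffled).
    survivors = [row for row in large_side if bloom.might_contain(row[0])]

    # The exact join still runs afterwards and removes any false positives.
    joined = [(k, big, small)
              for k, big in survivors
              for small in lookup.get(k, [])]
    return joined, len(survivors)


dims = [(1, "US"), (2, "DE")]
facts = [(1, 9.5), (3, 1.0), (2, 4.0), (4, 7.25), (1, 3.0)]
rows, kept = bloom_filtered_join(dims, facts)
print(rows)
print(f"{kept} of {len(facts)} large-side rows passed the filter")
```

Because a Bloom filter can return false positives but never false negatives, the pre-filter may let a few extra rows through, but it never drops a row that would have matched. The final result is therefore the same as an unfiltered join.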

Highlights in Spark3.3

To Spark3.4 (Apr 13, 2023)

This release introduces a Python client for Spark Connect, augments Structured Streaming with async progress tracking and Python arbitrary stateful processing, increases Pandas API coverage, and provides NumPy input support. It also simplifies the migration from traditional data warehouses by improving ANSI compliance and implementing dozens of new built-in functions, and it boosts development productivity and debuggability with memory profiling.

Highlights in Spark3.4

Reference

  1. Spark Release 3.0.0
  2. Introducing Apache Spark 3.0
  3. Spark Release 3.1.1