Apache Spark

March 29, 2023 | permanent

tags: Apache Foundation, Tool

Apache Spark™ is built on an advanced distributed SQL engine for large-scale data.

The most widely used engine for scalable computing

Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. The open source project has over 2,000 contributors from industry and academia.

URL

Key features #

Batch/streaming data #

Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
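As a rough illustration of what "unified" means here, the sketch below applies one transformation to both a static and a streaming DataFrame. The function names and the idea of counting identical text lines are assumptions made for the example, not something the Spark site prescribes; only `spark.read`, `spark.readStream`, and `writeStream` are Spark's own API.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # type-only imports; pyspark is needed only when the functions run
    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql.streaming import StreamingQuery


def count_by_value(lines: "DataFrame") -> "DataFrame":
    """Count occurrences of each distinct line (hypothetical example logic)."""
    return lines.groupBy("value").count()


def run_batch(spark: "SparkSession", path: str) -> "DataFrame":
    # spark.read.text yields a static DataFrame with one string column, "value".
    return count_by_value(spark.read.text(path))


def run_streaming(spark: "SparkSession", path: str) -> "StreamingQuery":
    # spark.readStream.text yields an unbounded DataFrame with the same schema,
    # so the same transformation function can be reused unchanged.
    counts = count_by_value(spark.readStream.text(path))
    # "complete" mode re-emits the whole aggregate table after each micro-batch.
    return counts.writeStream.outputMode("complete").format("console").start()
```

With a local session created via `SparkSession.builder.master("local[*]").getOrCreate()`, `run_batch(spark, "logs/").show()` would print the counts once, while `run_streaming(spark, "logs/").awaitTermination()` would keep updating them as new files arrive. Here `logs/` is a placeholder directory.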

SQL analytics #

Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Queries run faster than on most data warehouses.
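A minimal sketch of an ad-hoc reporting query, assuming an `orders` DataFrame with `customer_id` and `amount` columns. The table, the column names, and the `top_customers` helper are all hypothetical; `createOrReplaceTempView` and `spark.sql` are the actual Spark calls.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # type-only imports; pyspark is needed only when the function runs
    from pyspark.sql import DataFrame, SparkSession


def top_customers(spark: "SparkSession", orders: "DataFrame", n: int) -> "DataFrame":
    """Return the n customers with the highest total order amount."""
    # Expose the DataFrame to the SQL engine under a temporary view name.
    orders.createOrReplaceTempView("orders")
    # int(n) guarantees that only a number is interpolated into the query text.
    return spark.sql(
        f"""
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT {int(n)}
        """
    )
```

The result is an ordinary DataFrame, so a dashboard job could call `top_customers(spark, orders, 10).show()` or write it out to storage. Stricter ANSI behavior can be requested with `spark.conf.set("spark.sql.ansi.enabled", "true")`; whether it is already on by default depends on the Spark version.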

Data science at scale #

Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.

Machine learning #

Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
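To make the "same code" claim concrete, here is a small sketch using `pyspark.ml`. The choice of logistic regression, the column names, and the helper functions are assumptions for illustration. The point is that nothing in the training code refers to the cluster: only the master URL given when the session is created would differ between a laptop and a cluster.

```python
from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:  # type-only imports; pyspark is needed only when the functions run
    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.sql import DataFrame


def build_pipeline(feature_cols: Sequence[str], label_col: str = "label") -> "Pipeline":
    """Assemble feature columns into a vector and fit a logistic regression on it."""
    # Imported here so this module can be loaded before a Spark runtime is available.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    classifier = LogisticRegression(
        featuresCol="features", labelCol=label_col, maxIter=10
    )
    return Pipeline(stages=[assembler, classifier])


def train(training: "DataFrame", feature_cols: Sequence[str]) -> "PipelineModel":
    # fit() runs wherever the DataFrame's session runs: local threads or a cluster.
    return build_pipeline(feature_cols).fit(training)
```

On a laptop the session might be created with `.master("local[*]")`; on a cluster the master is typically supplied by `spark-submit`, and `train` is called the same way in both cases.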

Ecosystem #

Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.

Spark SQL engine: under the hood #