Last mod: 2024.12.14

Apache Spark - Installation

Apache Spark is a powerful, open-source framework for big data processing that excels in speed, scalability, and versatility. Its in-memory processing capabilities make it up to 100 times faster than traditional Hadoop MapReduce for certain tasks. Spark supports a wide range of workloads, including batch processing, real-time streaming, machine learning, and graph computation, all within a unified framework. Its ease of integration with popular data sources and robust APIs in multiple languages (Scala, Java, Python, and R) make it a top choice for data engineers and scientists tackling large-scale data challenges.

Software

  • Ubuntu 24.04 LTS
  • OpenJDK 17
  • Spark-3.5.3-bin-hadoop3

Naturally, you can run Spark on another OS and with other versions; I list these specific ones to make the configuration easier to follow.

Installation

Installing OpenJDK 17:

sudo apt install openjdk-17-jdk-headless
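
Once the package is installed, a quick sanity check (the reported version should start with 17):

# Print the installed Java version
java -version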

Before downloading and unpacking Spark, make sure that the user doing the installation has write permissions on the /opt directory.
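
On a fresh Ubuntu system /opt is writable only by root, so one option (just a sketch; adjust the owner to whichever user will run Spark) is to hand the directory over to that user:

# Let the installing user own /opt so the archive can be unpacked there without sudo
sudo chown $(whoami): /opt
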
Download and unpack:

cd /opt/
wget https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
tar -xvzf spark-3.5.3-bin-hadoop3.tgz
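
Optionally, verify the archive against the SHA-512 checksum that Apache publishes next to it (the URL below follows the usual naming pattern for the checksum file; compare the two outputs by eye):

wget https://downloads.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz.sha512
# The locally computed digest should match the published one
sha512sum spark-3.5.3-bin-hadoop3.tgz
cat spark-3.5.3-bin-hadoop3.tgz.sha512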

These commands need to be repeated on every node that will be part of the cluster.

On the selected single host, we launch the master instance:

cd /opt/spark-3.5.3-bin-hadoop3/sbin
./start-master.sh -h MASTER_IP_OR_DNS
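
If you prefer not to pass the address on the command line every time, the same setting can be kept in the standalone configuration instead (a minimal sketch; conf/spark-env.sh does not exist by default and is created from the shipped template):

cd /opt/spark-3.5.3-bin-hadoop3
cp conf/spark-env.sh.template conf/spark-env.sh
# Bind the standalone master to a fixed address (same effect as the -h flag above)
echo 'SPARK_MASTER_HOST=MASTER_IP_OR_DNS' >> conf/spark-env.sh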

Spark opens two ports:

  • 7077 - the cluster port, used by worker nodes and applications to connect to the master
  • 8080 - the master web UI (control panel), accessible via HTTP
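
If a firewall is active on the master host (on Ubuntu this is typically ufw), both ports have to be reachable; a minimal sketch, assuming ufw is in use:

# Allow workers and applications to reach the master, and allow HTTP access to the web panel
sudo ufw allow 7077/tcp
sudo ufw allow 8080/tcp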

We can view the status of the cluster by opening the following address in a browser:
http://SPARK_MASTER_IP:8080/

Apache Spark status

Next, on every node install OpenJDK 17 and unpack Spark in the same way, then start the worker:

cd /opt/spark-3.5.3-bin-hadoop3/sbin
./start-worker.sh spark://MASTER_IP_OR_DNS:7077

Replace MASTER_IP_OR_DNS with the master's IP address or DNS name. (In Spark releases older than 3.1 this script was still called start-slave.sh.)
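
By default the worker offers all of the node's cores and most of its memory to the cluster. If you want to cap this, start-worker.sh accepts the usual worker options, for example (the values here are just placeholders):

# Offer only 4 cores and 8 GiB of RAM from this node to the cluster
./start-worker.sh spark://MASTER_IP_OR_DNS:7077 -c 4 -m 8G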

After starting all nodes, we can refresh the master panel and check the status of the connected nodes:

Apache Spark status with nodes
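
As a final smoke test, we can submit one of the bundled example jobs to the cluster (a sketch; the examples jar name assumes the default Scala 2.12 build of this distribution, so adjust it if your build differs):

cd /opt/spark-3.5.3-bin-hadoop3
# Run the SparkPi example against the standalone master; it should print an approximation of Pi
./bin/spark-submit \
  --master spark://MASTER_IP_OR_DNS:7077 \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.12-3.5.3.jar 100

The finished run should also appear under Completed Applications in the master panel.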

Links

https://openjdk.org/
https://spark.apache.org/downloads.html
https://www.apache.org/dyn/closer.lua/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz


To be continued...