Deployment Overview of Apache Spark on Server¶
Prerequisites and Basic Requirements¶
The server must meet the following requirements to successfully host the Apache Spark application:
- Operating System: Debian (Bookworm) or Ubuntu (Bullseye).
- Privileges: Root access is required for package installation, environment variable configuration, and Docker management.
- Java Runtime: OpenJDK must be installed (version 11 for Ubuntu Bullseye, version 17 for Debian Bookworm).
- Network: The server must have outbound internet access to download the Apache Spark archive and obtain SSL certificates.
- Ports: The following ports must be available:
    - 4040: Spark UI (internal)
    - 8080: Spark Master
    - 8081: Spark Worker
    - 18080: Spark History Server
    - 443: HTTPS (external access via Nginx)
FQDN of the Final Panel¶
The application is accessible via the following Fully Qualified Domain Name (FQDN) format:
- Format: spark<Server ID>.hostkey.in:443
- Example: If the Server ID is 123, the address is spark123.hostkey.in.
Access is provided over HTTPS on port 443.
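
As a small illustration of the naming scheme above, the following sketch builds the hostname and URL from a Server ID. The function names `spark_fqdn` and `spark_url` are hypothetical helpers introduced here, not part of the deployment.

```shell
# Build the panel hostname from a Server ID (pattern taken from the text above).
spark_fqdn() {
    printf 'spark%s.hostkey.in\n' "$1"
}

# Build the HTTPS URL. Port 443 is the HTTPS default, so it is not spelled out.
spark_url() {
    printf 'https://%s/\n' "$(spark_fqdn "$1")"
}

# Example reachability check (needs network access, so it is left commented out):
# curl -sS -o /dev/null -w '%{http_code}\n' "$(spark_url 123)"
```

For Server ID 123, `spark_fqdn 123` prints `spark123.hostkey.in`.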
File and Directory Structure¶
The application and its supporting components are organized in the following directories:
- Apache Spark Installation: /root/spark-3.5.3-bin-hadoop3 (extracted from the archive).
- Nginx Configuration:
    - Compose file: /root/nginx/compose.yml
    - User configuration: /data/nginx/user_conf.d/spark<Server ID>.hostkey.in.conf
- SSL Certificates: /etc/letsencrypt/live/spark<Server ID>.hostkey.in/
- Environment Variables: /etc/environment
Application Installation Process¶
The Apache Spark application is installed manually on the host system using the following steps:
- System Update: APT packages are updated and upgraded.
- Java Installation: The default-jdk package is installed via APT.
- Environment Configuration:
    - JAVA_HOME is set to /usr/lib/jvm/java-11-openjdk-amd64 on Ubuntu or /usr/lib/jvm/java-17-openjdk-amd64 on Debian.
    - SPARK_LOCAL_IP is set to 127.0.0.1.
    - These variables are added to /etc/environment.
- Spark Download: The Apache Spark archive is downloaded from the official Apache archive:
    - URL: https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
    - Destination: /root/spark-3.5.3-bin-hadoop3.tgz
- Extraction: The archive is extracted into /root.
- Cleanup: The original archive file is removed.
- Reboot: The system is rebooted to apply the new environment variables.
The installed version of Apache Spark is 3.5.3.
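
The steps above can be sketched as a shell function. This is an illustration under assumptions, not the actual provisioning script: the function name `install_spark` is hypothetical, `curl` is assumed to be available as the download tool (the text does not say which tool is used), and the caller supplies the `JAVA_HOME` path that matches the distribution. Defining the function changes nothing on the system; it only acts when called as root.

```shell
SPARK_VERSION="3.5.3"
SPARK_PKG="spark-${SPARK_VERSION}-bin-hadoop3"
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PKG}.tgz"

# install_spark <java_home>  (hypothetical helper; must be run as root)
install_spark() {
    java_home="$1"
    # System update and Java installation.
    apt-get update && apt-get -y upgrade &&
    apt-get -y install default-jdk &&
    # Append the environment variables to /etc/environment.
    printf 'JAVA_HOME=%s\nSPARK_LOCAL_IP=127.0.0.1\n' "$java_home" >> /etc/environment &&
    # Download, extract into /root, then remove the archive.
    curl -fL -o "/root/${SPARK_PKG}.tgz" "$SPARK_URL" &&
    tar -xzf "/root/${SPARK_PKG}.tgz" -C /root &&
    rm -f "/root/${SPARK_PKG}.tgz" &&
    # Reboot so the new environment variables take effect.
    reboot
}
```

On Debian, for example, the call would be `install_spark /usr/lib/jvm/java-17-openjdk-amd64`.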
Docker Containers and Their Deployment¶
The reverse proxy and SSL termination are handled by Docker containers managed via Docker Compose.
- Docker Installation: Docker is installed on the host system.
- Compose File Location: /root/nginx/compose.yml
- Service Name: nginx
- Image: jonasal/nginx-certbot:latest
- Network Mode: host
- Volumes:
    - nginx_secrets (external) mounted at /etc/letsencrypt
    - /data/nginx/user_conf.d mounted at /etc/nginx/user_conf.d
- Environment:
    - CERTBOT_EMAIL: [email protected]
    - Environment file: /data/nginx/nginx-certbot.env
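
The settings listed above correspond to a Compose file roughly like the one printed by the sketch below. This is a reconstruction from the listed values, not a copy of the deployed /root/nginx/compose.yml. The function name `print_compose` is hypothetical, and the e-mail address is passed in as an argument because the real value is not recoverable from this page.

```shell
# print_compose <certbot_email>  (hypothetical helper; prints YAML to stdout)
print_compose() {
    email="$1"
    cat <<EOF
services:
  nginx:
    image: jonasal/nginx-certbot:latest
    network_mode: host
    environment:
      - CERTBOT_EMAIL=${email}
    env_file:
      - /data/nginx/nginx-certbot.env
    volumes:
      - nginx_secrets:/etc/letsencrypt
      - /data/nginx/user_conf.d:/etc/nginx/user_conf.d

volumes:
  nginx_secrets:
    external: true
EOF
}

# Starting the service, assuming the Docker Compose v2 plugin is installed:
# (cd /root/nginx && docker compose up -d)
```

Because `nginx_secrets` is declared external, the volume has to exist before the first start (for example via `docker volume create nginx_secrets`).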
The container is started with Docker Compose from the /root/nginx directory.
Proxy Servers¶
Nginx acts as the reverse proxy, handling SSL termination and routing traffic to the internal Spark services.
- Domain: spark<Server ID>.hostkey.in
- Protocol: HTTPS (Port 443)
- SSL Provider: Let's Encrypt (managed by Certbot within the Docker container)
- Certificate Paths:
    - Fullchain: /etc/letsencrypt/live/spark<Server ID>.hostkey.in/fullchain.pem
    - Private Key: /etc/letsencrypt/live/spark<Server ID>.hostkey.in/privkey.pem
    - Chain: /etc/letsencrypt/live/spark<Server ID>.hostkey.in/chain.pem
Routing Configuration¶
The Nginx configuration routes traffic to specific internal ports based on the URL path:
| URL Path | Internal Service | Internal Port |
|---|---|---|
| / | Spark UI | 4040 |
| /master | Spark Master | 8080 |
| /worker | Spark Worker | 8081 |
| /history | Spark History Server | 18080 |
The proxy configuration includes headers for X-Forwarded-Host, X-Forwarded-Server, X-Real-IP, and X-Forwarded-For. WebSocket support is enabled for the main path.
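
To make the routing table concrete, the sketch below prints an Nginx server block that would implement it. It is an assumption-laden illustration, not the deployed user_conf.d file: the function name `print_spark_nginx_conf` is hypothetical, and stripping the `/master`, `/worker`, and `/history` prefixes before proxying (the trailing slashes on `location` and `proxy_pass`) is a guess about how the paths are mapped.

```shell
# print_spark_nginx_conf <server_id>  (hypothetical helper; prints config to stdout)
print_spark_nginx_conf() {
    fqdn="spark$1.hostkey.in"
    # The heredoc is quoted so Nginx variables such as $host stay literal.
    sed "s/__FQDN__/${fqdn}/g" <<'EOF'
server {
    listen 443 ssl;
    server_name __FQDN__;

    ssl_certificate         /etc/letsencrypt/live/__FQDN__/fullchain.pem;
    ssl_certificate_key     /etc/letsencrypt/live/__FQDN__/privkey.pem;
    ssl_trusted_certificate /etc/letsencrypt/live/__FQDN__/chain.pem;

    # Inherited by locations that set no proxy_set_header of their own.
    proxy_set_header X-Forwarded-Host   $host;
    proxy_set_header X-Forwarded-Server $host;
    proxy_set_header X-Real-IP          $remote_addr;
    proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;

    location / {
        proxy_pass http://127.0.0.1:4040;
        # WebSocket upgrade for the main path. Setting headers here disables
        # inheritance from the server level, so the four headers are repeated.
        proxy_http_version 1.1;
        proxy_set_header Upgrade            $http_upgrade;
        proxy_set_header Connection         "upgrade";
        proxy_set_header X-Forwarded-Host   $host;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Real-IP          $remote_addr;
        proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;
    }

    location /master/  { proxy_pass http://127.0.0.1:8080/; }
    location /worker/  { proxy_pass http://127.0.0.1:8081/; }
    location /history/ { proxy_pass http://127.0.0.1:18080/; }
}
EOF
}
```

Running `print_spark_nginx_conf 123` prints the block with `spark123.hostkey.in` substituted for every placeholder.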
Available Ports for Connection¶
The following ports are configured for the application:
- External Access:
    - 443: HTTPS (Nginx Reverse Proxy)
- Internal Access (accessible only from the host or via the proxy):
    - 4040: Spark UI
    - 8080: Spark Master
    - 8081: Spark Worker
    - 18080: Spark History Server
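
One way to see which of these ports currently have a listener is sketched below. It assumes the `ss` utility from iproute2 is installed on the host. `spark_ports` and `check_ports` are hypothetical helper names.

```shell
# List the documented ports, one "port description" pair per line.
spark_ports() {
    printf '%s\n' \
        '443 HTTPS (Nginx Reverse Proxy)' \
        '4040 Spark UI' \
        '8080 Spark Master' \
        '8081 Spark Worker' \
        '18080 Spark History Server'
}

# Report whether a TCP listener exists on each port (assumes `ss` is available).
check_ports() {
    spark_ports | while read -r port name; do
        # Column 4 of `ss -ltn` is the local address:port.
        if ss -ltn | awk -v p=":${port}\$" '$4 ~ p { found = 1 } END { exit !found }'; then
            printf '%-6s %-32s listening\n' "$port" "$name"
        else
            printf '%-6s %-32s not listening\n' "$port" "$name"
        fi
    done
}
```

Note that the Spark UI on 4040 only listens while a Spark application is running, so "not listening" on that port is not necessarily a fault.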
Starting, Stopping, and Updating¶
The Docker-based proxy service is managed using Docker Compose commands from the /root/nginx directory.
- Start/Restart: bring the Compose service up in the background.
- Stop: bring the Compose service down.
- Update: to update the Nginx container image, pull the latest version and restart the service.
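
A minimal sketch of these three operations, assuming the Docker Compose v2 plugin (`docker compose`). Hosts with the legacy standalone binary would use `docker-compose` instead. The exact commands used on the server are not preserved in this page, and `proxy_cmd` and `proxy_ctl` are hypothetical wrapper names.

```shell
COMPOSE_DIR="/root/nginx"

# Print the Compose command for an action; nothing is executed here.
proxy_cmd() {
    case "$1" in
        start)  echo "docker compose up -d" ;;
        stop)   echo "docker compose down" ;;
        update) echo "docker compose pull && docker compose up -d" ;;
        *)      echo "usage: proxy_cmd start|stop|update" >&2; return 2 ;;
    esac
}

# Run the action from the Compose directory.
proxy_ctl() {
    cmd="$(proxy_cmd "$1")" || return $?
    (cd "$COMPOSE_DIR" && sh -c "$cmd")
}
```

For example, `proxy_ctl update` pulls the image and recreates the container if the image changed.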
The Apache Spark application itself runs as a native process on the host, outside Docker. To restart Spark, run the appropriate Spark scripts from the installation directory (/root/spark-3.5.3-bin-hadoop3). If the environment variables in /etc/environment were modified, reboot the host so that they take effect.
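
If the host runs Spark in standalone mode (an assumption suggested by the Master and Worker ports, but not stated explicitly), a restart could use the scripts shipped in Spark's `sbin` directory, as sketched below. The helper names are hypothetical. The master URL is passed in by the caller because it depends on the host configuration; the Spark Master UI displays the actual value.

```shell
SPARK_HOME="/root/spark-3.5.3-bin-hadoop3"

# spark_restart_cmds <master_url>: print the restart sequence, one command per line.
spark_restart_cmds() {
    master_url="$1"
    printf '%s\n' \
        "$SPARK_HOME/sbin/stop-worker.sh" \
        "$SPARK_HOME/sbin/stop-master.sh" \
        "$SPARK_HOME/sbin/start-master.sh" \
        "$SPARK_HOME/sbin/start-worker.sh $master_url"
}

# spark_restart <master_url>: execute the sequence in order.
spark_restart() {
    spark_restart_cmds "$1" | sh
}
```

A call might look like `spark_restart spark://<host>:7077`, where 7077 is the default standalone master port and `<host>` is whatever the Master UI reports.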