Deployment Overview of Apache Spark on Server

Prerequisites and Basic Requirements

The server must meet the following requirements to successfully host the Apache Spark application:

  • Operating System: Debian (Bookworm) or Ubuntu (Bullseye).

  • Privileges: Root access is required for package installation, environment variable configuration, and Docker management.

  • Java Runtime: OpenJDK must be installed (version 11 for Ubuntu Bullseye, version 17 for Debian Bookworm).

  • Network: The server must have outbound internet access to download the Apache Spark archive and SSL certificates.

  • Ports: The following ports must be available (all are used internally except 443, which is exposed externally):

    • 4040: Spark UI (internal)

    • 8080: Spark Master

    • 8081: Spark Worker

    • 18080: Spark History Server

    • 443: HTTPS (external access via Nginx)
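
Before installation it can help to confirm that none of the listed ports is already taken. The following is a minimal sketch, assuming the `ss` utility from iproute2 is present on the server; the helper name `listening_on` is hypothetical and only illustrates one way to parse the socket list.

```shell
# Ports taken from the list above.
SPARK_PORTS="4040 8080 8081 18080 443"

# listening_on PORT: reads `ss -ltnH` style lines on stdin and succeeds
# if any local address (4th column) ends in :PORT.
listening_on() {
    awk -v p="$1" '{ n = split($4, a, ":"); if (a[n] == p) found = 1 } END { exit !found }'
}

if command -v ss >/dev/null 2>&1; then
    sockets=$(ss -ltnH)   # listening TCP sockets, numeric, no header
    for port in $SPARK_PORTS; do
        if printf '%s\n' "$sockets" | listening_on "$port"; then
            echo "port $port: already in use"
        else
            echo "port $port: free"
        fi
    done
else
    echo "ss not found; install iproute2 or use another tool"
fi
```

On a freshly provisioned server every port would be expected to report as free; after deployment the same check should show them in use.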

FQDN of the Final Panel

The application is accessible via the following Fully Qualified Domain Name (FQDN) format:

  • Format: spark<Server ID>.hostkey.in:443

  • Example: If the Server ID is 123, the address is spark123.hostkey.in.

Access is provided over HTTPS on port 443.
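
As a small illustration of the naming scheme, the sketch below builds the panel URL from a Server ID. The function name `spark_fqdn` is hypothetical, and 123 is the example ID from the text, not a real server.

```shell
# spark_fqdn SERVER_ID: print the panel hostname for a given Server ID.
spark_fqdn() {
    printf 'spark%s.hostkey.in\n' "$1"
}

SERVER_ID=123   # example value from the text above
url="https://$(spark_fqdn "$SERVER_ID")/"
echo "$url"

# With network access, the HTTP status of the panel can be checked like this:
# curl -sS -o /dev/null -w '%{http_code}\n' "$url"
```

Port 443 is the HTTPS default, so it does not need to appear in the URL.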

File and Directory Structure

The application and its supporting components are organized in the following directories:

  • Apache Spark Installation: /root/spark-3.5.3-bin-hadoop3 (extracted from the archive).

  • Nginx Configuration:

    • Compose file: /root/nginx/compose.yml

    • User configuration: /data/nginx/user_conf.d/spark<Server ID>.hostkey.in.conf

  • SSL Certificates: /etc/letsencrypt/live/spark<Server ID>.hostkey.in/

  • Environment Variables: /etc/environment

Application Installation Process

The Apache Spark application is installed manually on the host system using the following steps:

  1. System Update: APT packages are updated and upgraded.

  2. Java Installation: The default-jdk package is installed via APT.

  3. Environment Configuration:

    • JAVA_HOME is set to /usr/lib/jvm/java-11-openjdk-amd64 on Ubuntu or /usr/lib/jvm/java-17-openjdk-amd64 on Debian.

    • SPARK_LOCAL_IP is set to 127.0.0.1.

    • These variables are added to /etc/environment.

  4. Spark Download: The Apache Spark archive is downloaded from the official Apache archive:

    • URL: https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

    • Destination: /root/spark-3.5.3-bin-hadoop3.tgz

  5. Extraction: The archive is extracted to the root directory.

  6. Cleanup: The original archive file is removed.

  7. Reboot: The system is rebooted to apply the new environment variables.

The installed version of Apache Spark is 3.5.3.
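
The steps above can be sketched as a shell script. This is an approximation under stated assumptions, not the provisioning script itself: the `run` dry-run wrapper, the `DISTRO` variable, the use of `wget`, and the archive filename derived from the version variable are all illustrative choices. By default the script only prints the commands; setting DRY_RUN=0 would execute them as root, including the final reboot.

```shell
#!/bin/sh
SPARK_VERSION=3.5.3
SPARK_PKG="spark-${SPARK_VERSION}-bin-hadoop3"
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PKG}.tgz"

# run CMD...: print the command in dry-run mode (default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# java_home_for DISTRO: JAVA_HOME value as given in step 3.
java_home_for() {
    case "$1" in
        ubuntu) echo /usr/lib/jvm/java-11-openjdk-amd64 ;;
        debian) echo /usr/lib/jvm/java-17-openjdk-amd64 ;;
        *)      return 1 ;;
    esac
}

DISTRO=${DISTRO:-debian}   # assumption: chosen by the operator
JAVA_HOME_VALUE=$(java_home_for "$DISTRO")

run apt-get update                                   # step 1
run apt-get -y upgrade
run apt-get -y install default-jdk                   # step 2
# step 3: append both variables to /etc/environment
run sh -c "printf 'JAVA_HOME=%s\nSPARK_LOCAL_IP=127.0.0.1\n' '$JAVA_HOME_VALUE' >> /etc/environment"
run wget -O "/root/${SPARK_PKG}.tgz" "$SPARK_URL"    # step 4
run tar -xzf "/root/${SPARK_PKG}.tgz" -C /root       # step 5
run rm -f "/root/${SPARK_PKG}.tgz"                   # step 6
run reboot                                           # step 7
```

Keeping the version in one variable means the download URL, archive name, and extracted directory cannot drift apart.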

Docker Containers and Their Deployment

The reverse proxy and SSL termination are handled by Docker containers managed via Docker Compose.

  • Docker Installation: Docker is installed on the host system.

  • Compose File Location: /root/nginx/compose.yml

  • Service Name: nginx

  • Image: jonasal/nginx-certbot:latest

  • Network Mode: host

  • Volumes:

    • nginx_secrets (external) mounted at /etc/letsencrypt

    • /data/nginx/user_conf.d mounted at /etc/nginx/user_conf.d

  • Environment:

    • CERTBOT_EMAIL: [email protected]

  • Environment file: /data/nginx/nginx-certbot.env

The container is started by running the following command from the /root/nginx directory:

    docker compose up -d
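
To make the settings listed above concrete, the sketch below writes an approximate Compose file to a scratch location for review. It is a reconstruction from the listed settings only: the real /root/nginx/compose.yml may contain further options (for example a restart policy), and the CERTBOT_EMAIL value shown is a placeholder, not the address used on the server.

```shell
# Write a sketch of the compose file to a scratch path (not /root/nginx).
out="${TMPDIR:-/tmp}/compose.sketch.yml"
cat > "$out" <<'EOF'
services:
  nginx:
    image: jonasal/nginx-certbot:latest
    network_mode: host
    environment:
      CERTBOT_EMAIL: admin@example.com   # placeholder address
    env_file:
      - /data/nginx/nginx-certbot.env
    volumes:
      - nginx_secrets:/etc/letsencrypt
      - /data/nginx/user_conf.d:/etc/nginx/user_conf.d

volumes:
  nginx_secrets:
    external: true
EOF
echo "wrote $out"

# If Docker is installed, the file can be validated without starting anything:
# docker compose -f "$out" config
```

Because the service uses host networking, no port mappings are declared; Nginx binds to port 443 on the host directly.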

Proxy Servers

Nginx acts as the reverse proxy, handling SSL termination and routing traffic to the internal Spark services.

  • Domain: spark<Server ID>.hostkey.in

  • Protocol: HTTPS (Port 443)

  • SSL Provider: Let's Encrypt (managed by Certbot within the Docker container)

  • Certificate Paths:

    • Fullchain: /etc/letsencrypt/live/spark<Server ID>.hostkey.in/fullchain.pem

    • Private Key: /etc/letsencrypt/live/spark<Server ID>.hostkey.in/privkey.pem

    • Chain: /etc/letsencrypt/live/spark<Server ID>.hostkey.in/chain.pem
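
A quick way to confirm that a certificate has been issued is to read its subject and expiry date. The sketch below assumes `openssl` is installed and that the path is readable from where the command runs; since /etc/letsencrypt is backed by the nginx_secrets volume, it may be necessary to run the check inside the container instead. The helper name `cert_dir` and the Server ID 123 are illustrative.

```shell
# cert_dir SERVER_ID: directory holding the certificate files for a Server ID.
cert_dir() {
    printf '/etc/letsencrypt/live/spark%s.hostkey.in\n' "$1"
}

SERVER_ID=123   # example value
fullchain="$(cert_dir "$SERVER_ID")/fullchain.pem"

if [ -r "$fullchain" ]; then
    # Print who the certificate was issued to and when it expires.
    openssl x509 -in "$fullchain" -noout -subject -enddate
else
    echo "certificate not readable at $fullchain"
fi
```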

Routing Configuration

The Nginx configuration routes traffic to specific internal ports based on the URL path:

URL Path    Internal Service        Internal Port
/           Spark UI                4040
/master     Spark Master            8080
/worker     Spark Worker            8081
/history    Spark History Server    18080

The proxy configuration includes headers for X-Forwarded-Host, X-Forwarded-Server, X-Real-IP, and X-Forwarded-For. WebSocket support is enabled for the main path.
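
The path-to-port mapping can be expressed as a small lookup, which is handy when checking each backend directly from the host. This sketch assumes prefix matching (so /master/anything also goes to the Spark Master); the actual Nginx location rules may be stricter. The function name `upstream_port` is hypothetical.

```shell
# upstream_port PATH: internal port for a proxied URL path, per the routing table.
upstream_port() {
    case "$1" in
        /master*)  echo 8080 ;;
        /worker*)  echo 8081 ;;
        /history*) echo 18080 ;;
        /*)        echo 4040 ;;
        *)         return 1 ;;
    esac
}

for path in / /master /worker /history; do
    port=$(upstream_port "$path")
    echo "$path -> http://127.0.0.1:$port"
    # On the server, each backend can be probed directly:
    # curl -s -o /dev/null -w '%{http_code}\n' "http://127.0.0.1:$port/"
done
```

The more specific prefixes are listed before the catch-all `/*`, because `case` uses the first pattern that matches.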

Available Ports for Connection

The following ports are configured for the application:

  • External Access:

    • 443: HTTPS (Nginx Reverse Proxy)

  • Internal Access (accessible only from the host or via the proxy):

    • 4040: Spark UI

    • 8080: Spark Master

    • 8081: Spark Worker

    • 18080: Spark History Server

Starting, Stopping, and Updating

The Docker-based proxy service is managed using Docker Compose commands from the /root/nginx directory.

  • Start/Restart:

    cd /root/nginx
    docker compose up -d
    

  • Stop:

    cd /root/nginx
    docker compose down
    

  • Update: To update the Nginx container image, pull the latest version and restart:

    cd /root/nginx
    docker compose pull
    docker compose up -d
    

The Apache Spark application itself runs as a native process on the host, outside Docker. To restart Spark, go to the installation directory (/root/spark-3.5.3-bin-hadoop3) and run the appropriate Spark launcher or service script. If the environment variables in /etc/environment were modified, restart the host so that they take effect.
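
As one possible way to restart Spark, the sketch below uses the standalone-mode scripts that ship in Spark's sbin directory. It assumes Spark is run as a standalone master and worker and that the master listens on the default port 7077 at 127.0.0.1; the text does not state how Spark is launched on this server, so the master URL should be checked against the master's log or UI before use.

```shell
SPARK_HOME=/root/spark-3.5.3-bin-hadoop3
# Assumption: standalone master on Spark's default port 7077.
MASTER_URL="spark://127.0.0.1:7077"

# restart_spark: stop the worker and master, then start them again.
restart_spark() {
    "$SPARK_HOME/sbin/stop-worker.sh"
    "$SPARK_HOME/sbin/stop-master.sh"
    "$SPARK_HOME/sbin/start-master.sh"
    "$SPARK_HOME/sbin/start-worker.sh" "$MASTER_URL"
}

if [ -x "$SPARK_HOME/sbin/start-master.sh" ]; then
    restart_spark
else
    echo "Spark not found under $SPARK_HOME"
fi
```

The worker is stopped first and started last so that it never runs without a master to register with.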
