Deployment Overview of Apache Spark on Server
Prerequisites and Basic Requirements
The server must meet the following operating system and software requirements to host Apache Spark:

- Operating System: Debian 12 (Bookworm) or Debian 11 (Bullseye).
- Privileges: Root access is required for package installation, environment variable configuration, and service management.
- Java Runtime: The default JDK package (default-jdk) must be installed.
- Network: The server needs internet access to download the Apache Spark archive and to obtain SSL certificates.
- Ports: The Spark Master, Worker, and History Server each listen on their own port, and all three are proxied through Nginx.
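The requirements above can be checked before installation with a short preflight script. This is a minimal sketch, not part of the documented deployment: the function names `supported_codename` and `preflight` are hypothetical, and the check assumes `/etc/os-release` provides `VERSION_CODENAME`, as it does on current Debian releases.

```shell
# Hypothetical preflight helpers; nothing runs until preflight is called.

# Succeeds only for the two release codenames named in the requirements.
supported_codename() {
    case "$1" in
        bookworm|bullseye) return 0 ;;
        *) return 1 ;;
    esac
}

preflight() {
    # Root is needed for package installation and /etc/environment edits.
    [ "$(id -u)" -eq 0 ] || { echo "must run as root" >&2; return 1; }

    # Read the codename in a subshell so os-release variables do not leak.
    codename="$(. /etc/os-release && echo "${VERSION_CODENAME:-}")"
    supported_codename "$codename" || {
        echo "unsupported release: ${codename:-unknown}" >&2
        return 1
    }

    # Java is installed later via default-jdk, so a missing JVM is only a warning.
    command -v java >/dev/null 2>&1 || echo "java not found; install default-jdk" >&2
    return 0
}
```

Calling `preflight` as root on a supported release returns 0; any other case prints the reason to stderr.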
File and Directory Structure
The application and its supporting components are organized in the following directories:

- /root/spark-3.5.3-bin-hadoop3/: The extracted Apache Spark installation directory.
- /root/nginx/: The directory containing the Docker Compose configuration for the Nginx proxy.
- /data/nginx/user_conf.d/: The directory storing custom Nginx server block configurations.
- /data/nginx/nginx-certbot.env: The environment file for Nginx and Certbot settings.
- /etc/letsencrypt/: The mount point for SSL certificates managed by Certbot.
- /etc/environment: The system file where global environment variables for Java and Spark are defined.
Application Installation Process
Apache Spark is installed manually from a binary archive rather than through a package manager. The installation involves the following steps:

1. Update and upgrade all APT packages on the system.
2. Install the default-jdk package to provide the Java runtime.
3. Set the JAVA_HOME environment variable in /etc/environment. The path depends on the OS release:
    - For Debian Bullseye: /usr/lib/jvm/java-11-openjdk-amd64
    - For Debian Bookworm: /usr/lib/jvm/java-17-openjdk-amd64
4. Set the SPARK_LOCAL_IP environment variable to 127.0.0.1 in /etc/environment.
5. Download the Apache Spark archive spark-3.5.3-bin-hadoop3.tgz from the Apache archive repository.
6. Extract the archive to the /root directory.
7. Remove the original .tgz archive file to free up space.
8. Reboot the server to apply the changes made to /etc/environment.
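The steps above can be sketched as a shell script. This is an illustration under stated assumptions rather than the actual provisioning code: the function names are hypothetical, the download URL follows the usual layout of archive.apache.org and is not given in the text, and `wget` is assumed to be installed.

```shell
# Hypothetical installer sketch; run as root. Nothing executes until install_spark is called.
SPARK_ARCHIVE="spark-3.5.3-bin-hadoop3.tgz"
# Assumed URL, based on the standard archive.apache.org directory layout.
SPARK_URL="https://archive.apache.org/dist/spark/spark-3.5.3/${SPARK_ARCHIVE}"

# Map a release codename to the JAVA_HOME path listed in step 3.
java_home_for() {
    case "$1" in
        bullseye) echo "/usr/lib/jvm/java-11-openjdk-amd64" ;;
        bookworm) echo "/usr/lib/jvm/java-17-openjdk-amd64" ;;
        *) echo "unsupported release: $1" >&2; return 1 ;;
    esac
}

# Append JAVA_HOME and SPARK_LOCAL_IP to the environment file (steps 3 and 4).
# $1 = codename, $2 = target file (defaults to /etc/environment)
write_environment() {
    env_file="${2:-/etc/environment}"
    java_home="$(java_home_for "$1")" || return 1
    printf 'JAVA_HOME="%s"\n' "$java_home" >> "$env_file"
    printf 'SPARK_LOCAL_IP="127.0.0.1"\n' >> "$env_file"
}

install_spark() {
    codename="$(. /etc/os-release && echo "${VERSION_CODENAME:-}")"
    apt-get update &&
    apt-get -y upgrade &&
    apt-get -y install default-jdk &&
    write_environment "$codename" &&
    wget -q -O "/root/${SPARK_ARCHIVE}" "$SPARK_URL" &&
    tar -xzf "/root/${SPARK_ARCHIVE}" -C /root &&
    rm "/root/${SPARK_ARCHIVE}" &&
    reboot   # applies the /etc/environment changes
}
```

Because `write_environment` appends, running it twice would duplicate the entries; a real script would probably check for existing lines first.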
Docker Containers and Their Deployment
The deployment uses Docker to run an Nginx reverse proxy with integrated Certbot for SSL management. The container is managed via Docker Compose:

- Image: jonasal/nginx-certbot:latest
- Restart Policy: unless-stopped
- Network Mode: host
- Volumes:
    - nginx_secrets (external) mounted to /etc/letsencrypt for certificate storage.
    - /data/nginx/user_conf.d mounted to /etc/nginx/user_conf.d for custom configurations.
- Environment:
    - CERTBOT_EMAIL is set to [email protected].
    - Additional settings are loaded from /data/nginx/nginx-certbot.env.
The container is started using the docker compose up -d command executed from the /root/nginx directory.
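A compose.yml matching the settings above might look like the file written by the following sketch. The function name `write_compose` and the service name `nginx` are assumptions, and the contact email is passed in as an argument because the real address is not shown in the text.

```shell
# Hypothetical helper that writes a compose.yml reflecting the listed settings.
# $1 = target directory (e.g. /root/nginx), $2 = Certbot contact email
write_compose() {
    mkdir -p "$1" || return 1
    cat > "$1/compose.yml" <<EOF
services:
  nginx:   # service name is an assumption
    image: jonasal/nginx-certbot:latest
    restart: unless-stopped
    network_mode: host
    environment:
      - CERTBOT_EMAIL=$2
    env_file:
      - /data/nginx/nginx-certbot.env
    volumes:
      - nginx_secrets:/etc/letsencrypt
      - /data/nginx/user_conf.d:/etc/nginx/user_conf.d

volumes:
  nginx_secrets:
    external: true   # must already exist: docker volume create nginx_secrets
EOF
}
```

Since nginx_secrets is declared external, Docker Compose will not create it; the volume has to exist before `docker compose up -d` is run.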
Proxy Servers
Nginx acts as the reverse proxy and SSL termination point for the Apache Spark services. The configuration is a server block stored in /data/nginx/user_conf.d/ in a file named {prefix}{server_id}.{zone}.conf.
Key proxy settings include:

- SSL Configuration:
    - Certificates are loaded from /etc/letsencrypt/live/{prefix}{server_id}.{zone}/.
    - Files used: fullchain.pem, privkey.pem, and chain.pem.
    - Diffie-Hellman parameters are loaded from /etc/letsencrypt/dhparams/dhparam.pem.
- Proxy Locations:
    - Main Application: Proxies traffic to the internal Spark port with WebSocket support enabled (proxy_http_version 1.1, Upgrade headers).
    - Master Interface: Proxies to the Spark Master port.
    - Worker Interface: Proxies to the Spark Worker port.
    - History Server: Proxies to the Spark History Server port.
- Headers: The proxy forwards X-Forwarded-Host, X-Forwarded-Server, X-Real-IP, X-Forwarded-For, and X-Scheme to the backend services.
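The settings above could translate into a server block along the following lines. This is a sketch only: the function name `render_server_block`, the `/master/` location path, the listen directive, and the header values are assumptions, and the ports are passed in as arguments because the text does not state them. The Worker and History Server locations would follow the same pattern as the Master one.

```shell
# Hypothetical generator for the Nginx server block; prints the config to stdout.
# $1 = server name ({prefix}{server_id}.{zone}), $2 = main Spark port, $3 = Master port
render_server_block() {
    domain="$1"
    app_port="$2"
    master_port="$3"
    cert_dir="/etc/letsencrypt/live/${domain}"
    cat <<EOF
server {
    listen 443 ssl;   # assumed HTTPS listener
    server_name ${domain};

    ssl_certificate         ${cert_dir}/fullchain.pem;
    ssl_certificate_key     ${cert_dir}/privkey.pem;
    ssl_trusted_certificate ${cert_dir}/chain.pem;
    ssl_dhparam             /etc/letsencrypt/dhparams/dhparam.pem;

    # Main application, with WebSocket upgrade support.
    location / {
        proxy_pass http://127.0.0.1:${app_port};
        proxy_http_version 1.1;
        proxy_set_header Upgrade \$http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Forwarded-Host \$host;
        proxy_set_header X-Forwarded-Server \$host;
        proxy_set_header X-Real-IP \$remote_addr;
        proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
        proxy_set_header X-Scheme \$scheme;
    }

    # Master interface; the path is an assumption.
    location /master/ {
        proxy_pass http://127.0.0.1:${master_port}/;
    }
}
EOF
}
```

For example, `render_server_block spark1.example.com 4040 8080 > /data/nginx/user_conf.d/spark1.example.com.conf` would write the file, where the domain and both port numbers are placeholders.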
Permission Settings
The following permission settings are applied to the configuration directories and files:

- The /root/nginx directory is owned by root:root with mode 0644.
- The compose.yml file in /root/nginx is owned by root:root with mode 0644.
- The Nginx configuration files in /data/nginx/user_conf.d/ are owned by root:root with mode 0644.
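One way to apply these settings by hand is a small wrapper around `chown` and `chmod`. The helper name `apply_mode` is hypothetical, and the sketch assumes the files already exist.

```shell
# Hypothetical helper: set owner and mode on one or more paths.
# $1 = owner:group, $2 = octal mode, remaining arguments = paths
apply_mode() {
    owner="$1"
    mode="$2"
    shift 2
    chown "$owner" "$@" && chmod "$mode" "$@"
}

# Example invocation matching the list above (run as root):
#   apply_mode root:root 0644 /root/nginx /root/nginx/compose.yml
#   apply_mode root:root 0644 /data/nginx/user_conf.d/*.conf
```

On a system with GNU coreutils, the result can be inspected with `stat -c '%U:%G %a %n' <path>`.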
Starting, Stopping, and Updating
The Nginx proxy service is managed via Docker Compose commands executed in the /root/nginx directory:

- Start/Restart: Run docker compose up -d to start or restart the Nginx and Certbot containers.
- Stop: Run docker compose down to stop the containers.
- Update: To update the Nginx image, pull the latest version using docker pull jonasal/nginx-certbot:latest and then restart the service with docker compose up -d.
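The three operations above can be wrapped in shell functions so they always run from the right directory. This is a convenience sketch: the function names and the `RUN` dry-run switch are hypothetical additions, not part of the deployment.

```shell
# Hypothetical wrappers around the documented Docker commands.
COMPOSE_DIR="${COMPOSE_DIR:-/root/nginx}"
RUN="${RUN:-}"   # set RUN=echo to print the commands instead of executing them

# Start or restart the Nginx and Certbot container.
proxy_start() {
    (cd "$COMPOSE_DIR" && $RUN docker compose up -d)
}

# Stop and remove the container.
proxy_stop() {
    (cd "$COMPOSE_DIR" && $RUN docker compose down)
}

# Pull the latest image, then recreate the container from it.
proxy_update() {
    $RUN docker pull jonasal/nginx-certbot:latest && proxy_start
}
```

With `RUN=echo` set, `proxy_update` prints the pull command followed by the compose command without touching Docker, which is handy for reviewing what would run.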
Apache Spark services are started manually using the scripts located in the /root/spark-3.5.3-bin-hadoop3/sbin/ directory, such as ./start-master.sh and ./start-worker.sh, once the environment variables from /etc/environment have been loaded.
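A manual start might be scripted as follows. In the Spark binary distribution the start scripts ship in the sbin/ directory, and start-worker.sh expects the master URL as its argument. The function name `spark_start` and the `RUN` dry-run switch are hypothetical, and the master URL is left to the caller because the text does not give the master's host or port (7077 is Spark's default).

```shell
# Hypothetical start helper for a standalone Master and one Worker.
SPARK_HOME="${SPARK_HOME:-/root/spark-3.5.3-bin-hadoop3}"
RUN="${RUN:-}"   # set RUN=echo to print the commands instead of executing them

# $1 = master URL, e.g. spark://<host>:7077
spark_start() {
    [ -n "${1:-}" ] || { echo "usage: spark_start <master-url>" >&2; return 1; }
    $RUN "$SPARK_HOME/sbin/start-master.sh" &&
    $RUN "$SPARK_HOME/sbin/start-worker.sh" "$1"
}
```

The same directory also contains start-history-server.sh for the History Server and matching stop-*.sh scripts.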