Llama-3.3-70B¶
Information¶
Llama-3.3-70B is a high-performance 70-billion-parameter language model for local deployment through Ollama. The model requires substantial computing resources: at least 53 GB of video memory (an NVIDIA A100/H100 or several consumer GPUs). The deployment on Ubuntu 22.04 supports distributed computing and integrates with Open WebUI, providing full data control and performance optimization.
Deployment Features¶
- High-performance architecture: 70 billion parameters optimized for complex tasks with high accuracy, built on modern distributed computing technologies;
- Integration with Open WebUI: a modern web interface available on port 8080, giving you full control over data, computational resources, and processing;
- Distributed computing: support for multi-card configurations with automatic load balancing across multiple GPUs;
- Scalability: horizontal scaling by adding GPUs to increase performance;
- Performance: the LLAMA_FLASH_ATTENTION setting optimizes computation and accelerates request processing (a configuration sketch follows this list);
- Fault tolerance: an automatic recovery system ensures continuous operation.
- Usage examples:
    - Customer support: automating responses to user queries;
    - Education: creating educational materials, assisting in problem-solving;
    - Marketing: generating advertising copy, analyzing reviews;
    - Software development: writing and documenting code.
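These features are controlled through Ollama's runtime environment. Below is a minimal Python sketch for launching the server with flash attention enabled and the model spread across several GPUs; the variable names OLLAMA_FLASH_ATTENTION, OLLAMA_SCHED_SPREAD, and CUDA_VISIBLE_DEVICES match recent Ollama releases, but verify them against the version installed on your server.

```python
import os
import subprocess

# Minimal sketch: start the Ollama server with flash attention and
# multi-GPU spreading enabled. Variable names follow recent Ollama
# releases; confirm them for your installed version.
env = os.environ.copy()
env["OLLAMA_FLASH_ATTENTION"] = "1"  # enable flash-attention kernels
env["OLLAMA_SCHED_SPREAD"] = "1"     # spread model layers across all GPUs
env["CUDA_VISIBLE_DEVICES"] = "0,1"  # which GPUs to expose (example values)

# `ollama serve` blocks, so run it as a managed child process.
subprocess.run(["ollama", "serve"], env=env, check=True)
```

On the pre-installed image these values are normally set by the deployment itself; the sketch is only useful if you need to restart the server with different settings.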
Deployment Details¶
| ID | Compatible OS | VM | BM | VGPU | GPU | Min CPU (cores) | Min RAM (GB) | Min HDD/SSD (GB) | Active |
|---|---|---|---|---|---|---|---|---|---|
| 253 | Ubuntu 22.04 | - | - | + | + | 4 | 64 | - | Yes |
- Installation takes 15-30 minutes, including the OS;
- The Ollama server loads and runs the LLM in memory;
- Open WebUI is deployed as a web application connected to the Ollama server;
- Users interact with the LLM through the Open WebUI web interface, sending requests and receiving responses (a direct API call is sketched after this list);
- Distributed computing is configured for multi-card systems;
- System state is monitored, including GPU temperature and performance;
- Parallel operation of multiple graphics accelerators is optimized;
- All computation and data processing happen locally on the server. Administrators can tune the LLM for specific tasks through Open WebUI tools.
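Because everything runs locally, applications can also bypass the web interface and call the Ollama HTTP API directly on its default port 11434. A minimal sketch, assuming the model was pulled under the tag llama3.3:70b (check the actual tag on your server with `ollama list`):

```python
import json
import urllib.request

# Minimal sketch: send one chat message to the local Ollama server.
# Assumes the default port 11434 and the model tag llama3.3:70b.
payload = {
    "model": "llama3.3:70b",
    "messages": [{"role": "user", "content": "Why deploy an LLM locally?"}],
    "stream": False,  # return one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])
```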
System Requirements and Technical Specifications¶
- Graphics accelerator with CUDA support (one of the following configurations, or better):
    - 1x NVIDIA H100 (80 GB of video memory)
    - 2x NVIDIA A100 (40 GB of video memory each)
    - 2x NVIDIA RTX 5090 (32 GB of video memory each)
    - 2x NVIDIA RTX A6000 (48 GB of video memory each)
    - 3x NVIDIA RTX 4090 (24 GB of video memory each)
    - 3x NVIDIA RTX A5000 (24 GB of video memory each)
- Disk space: an SSD large enough for the system and the model;
- Software: NVIDIA drivers and CUDA;
- Video memory usage: 53 GB with a 2K-token context;
- System monitoring: comprehensive checks of driver status, containers, and GPU temperature (a monitoring sketch follows this list).
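For the monitoring item above, per-GPU temperature and memory usage can be polled with nvidia-smi. A minimal Python sketch, assuming the NVIDIA driver and the nvidia-smi utility are installed (the deployment requires them anyway):

```python
import subprocess

# Minimal sketch: poll per-GPU temperature and memory usage via nvidia-smi.
query = "index,name,temperature.gpu,memory.used,memory.total"
out = subprocess.check_output(
    ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
    text=True,
)
for line in out.strip().splitlines():
    idx, name, temp, used, total = [f.strip() for f in line.split(",")]
    print(f"GPU {idx} ({name}): {temp} C, {used}/{total} MiB VRAM")
```

The same query can be run from cron or a systemd timer to feed an alerting system.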
Getting Started After Deploying Llama-3.3-70B¶
After payment, an email notifying you that the server is ready for use will be sent to the registered email address. It will include the VPS IP address, the login and password for connecting to the server, and a link for accessing Open WebUI. Our company's clients manage equipment through the server management panel and API, Invapi.
- Authentication data for accessing the server's operating system (e.g., via SSH) will be sent to you in that email.
- The link to the Ollama control panel with the Open WebUI web interface can be found under the tag webpanel in the Info >> Tags tab of Invapi's management console. The exact link, in the form https://llama<Server_ID_from_Invapi>.hostkey.in, is sent in an email upon server handover.
After clicking the link from the tag webpanel, a Get started with Open WebUI login window will open, where you need to enter a name, email, and password for your chatbot's admin account, then press the Create Admin Account button:
Attention
After registering the first user, the system automatically assigns them an administrator role. To ensure security and control over the registration process, all subsequent registration requests must be approved in Open WebUI from the administrator account.
Note
Detailed information about working with Ollama's control panel with Open WebUI can be found in the article AI Chatbot on Your Own Server.
Note
For optimal performance, it is recommended to use GPUs with more video memory than the minimum required 53 GB. This provides a buffer for processing larger contexts and parallel requests. Detailed information about Ollama's main settings and Open WebUI can be found in the Ollama developers' documentation and in the Open WebUI developers' documentation.
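The 53 GB baseline mentioned earlier corresponds to a 2K-token context; larger context windows consume proportionally more VRAM. In Ollama the context size can be set per request via the num_ctx option. A minimal sketch, with the same assumed model tag as earlier:

```python
import json
import urllib.request

# Minimal sketch: request a generation with an enlarged context window.
# num_ctx beyond the 2K default raises VRAM usage above the 53 GB baseline.
payload = {
    "model": "llama3.3:70b",       # assumed tag; check with `ollama list`
    "prompt": "Explain distributed inference in two sentences.",
    "stream": False,
    "options": {"num_ctx": 8192},  # 8K-token context window
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```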
Ordering a Server with Llama-3.3-70B Using API¶
To install this software using the API, follow these instructions.
Some of the content on this page was created or translated using AI.