Ollama Installation

Introduction to Ollama

Ollama is a framework for running and managing large language models (LLMs) on local computing resources. It enables the loading and deployment of selected LLMs and provides access to them through an API.

Attention

If you plan to use GPU acceleration for working with LLMs, install the NVIDIA drivers and the CUDA toolkit first.
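
After the driver and CUDA are installed, you can confirm that the GPU is visible to the system with the nvidia-smi utility; its output lists the GPU model, driver version, and available video memory:

nvidia-smi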

System Requirements:

Requirement Specification
Operating System Linux: Ubuntu 22.04 or later
RAM 16 GB for running models up to 7B
Disk Space 12 GB for installing Ollama and the basic models; additional space is required for model data, depending on the models used
Processor A modern CPU with at least 4 cores is recommended; for running models up to 13B, a CPU with at least 8 cores is recommended
Graphics Processing Unit (optional) A GPU is not required for running Ollama, but it can improve performance, especially when working with large models; if a GPU is available, it is used to accelerate model inference

Note

The system requirements may vary depending on the specific LLMs and tasks you plan to perform.
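
Before installing, you can quickly check the server's resources against the table above; the standard Linux utilities below print the CPU core count, the amount of RAM, and the free disk space:

nproc
free -h
df -h /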

Installing Ollama on Linux

  1. Download and install Ollama:

    sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
    sudo chmod +x /usr/bin/ollama
    
  2. Create a user and group for the Ollama service:

    sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
    
  3. Create the Ollama service:

    sudo tee /usr/lib/systemd/system/ollama.service > /dev/null <<EOF
    [Unit]
    Description=Ollama Service
    After=network-online.target
    
    [Service]
    ExecStart=/usr/bin/ollama serve
    User=ollama
    Group=ollama
    Restart=always
    RestartSec=3
    Environment="OLLAMA_HOST=0.0.0.0" 
    Environment="OLLAMA_ORIGINS=*"
    
    [Install]
    WantedBy=default.target
    EOF
    

    For NVIDIA GPUs, add Environment="OLLAMA_FLASH_ATTENTION=1" to the [Service] section to improve token generation speed.

  4. Enable and start the service:

    sudo systemctl daemon-reload
    sudo systemctl enable ollama
    sudo systemctl start ollama
    

Ollama will be accessible at http://127.0.0.1:11434 or http://<your_server_IP>:11434.
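
To verify that the service started correctly, check its status and query the root endpoint of the API, which should respond with a short message indicating that Ollama is running:

sudo systemctl status ollama
curl http://127.0.0.1:11434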

Updating Ollama on Linux

To update Ollama, re-download its binary and make it executable again:

sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama

For ease of future updates, you can create a script ollama_update.sh (run as root or with sudo):

#!/bin/bash
systemctl stop ollama
curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
chmod +x /usr/bin/ollama
systemctl start ollama
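
Assuming the script is saved as ollama_update.sh in the current directory, it can be made executable and run like this; ollama -v prints the installed version, which lets you confirm that the update was applied:

chmod +x ollama_update.sh
sudo ./ollama_update.sh
ollama -v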

Installing Language Models (LLMs)

You can find the list of available language models in the Ollama model library at https://ollama.com/library.

To install a model, click its name and then select the model's size and variant on the next page. Copy the installation command from the panel on the right and run it in your terminal:

ollama run llama3
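
To download a specific size or variant without starting an interactive session, you can use ollama pull with a tag (llama3:8b below is only an example) and ollama list to see which models are already installed:

ollama pull llama3:8b
ollama list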

Note

Recommended models are marked with the latest tag.

Attention

To ensure acceptable performance, the model size should be no more than half of the available RAM on the server and no more than two thirds of the available video memory on the GPU. For example, an 8 GB model requires at least 16 GB of RAM and 12 GB of GPU memory.

After downloading the model, restart the service:

sudo systemctl restart ollama

For more information about Ollama, you can read the developer documentation.
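
Once a model is installed, it can also be queried through the HTTP API mentioned in the introduction. A minimal example using the /api/generate endpoint (it assumes the llama3 model is installed; "stream": false returns a single JSON response instead of a stream):

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'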

Environment Variables

Set these variables in the Ollama service as Environment="VARIABLE=VALUE"; an example using a systemd override is shown after the table.

Variable Description
OLLAMA_DEBUG Show additional debug information (e.g. OLLAMA_DEBUG=1)
OLLAMA_HOST IP address the ollama server listens on (default 127.0.0.1:11434)
OLLAMA_KEEP_ALIVE The duration that models stay loaded in memory (default 5m)
OLLAMA_MAX_LOADED_MODELS Maximum number of loaded models (default 1)
OLLAMA_MAX_QUEUE Maximum number of queued requests waiting to be processed (default 512)
OLLAMA_MODELS The path to the models directory
OLLAMA_NUM_PARALLEL Maximum number of parallel requests (default 1)
OLLAMA_NOPRUNE Do not prune model blobs on startup
OLLAMA_ORIGINS A comma-separated list of allowed origins
OLLAMA_TMPDIR Location for temporary files
OLLAMA_FLASH_ATTENTION Set to 1 to improve token generation speed on Apple Silicon Macs and NVIDIA graphics cards
OLLAMA_LLM_LIBRARY Set the LLM library to bypass autodetection (available libraries: rocm_v6, cpu, cpu_avx, cpu_avx2, cuda_v11, rocm_v5)
OLLAMA_MAX_VRAM Maximum VRAM in bytes (OLLAMA_MAX_VRAM=<bytes>)
OLLAMA_NOHISTORY Set to 1 to disable history in ollama run
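
Instead of editing the unit file directly, you can set these variables through a systemd drop-in override; the variable values below are only illustrative examples:

sudo systemctl edit ollama

Add the following lines in the editor that opens:

[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=2"

Then apply the changes:

sudo systemctl daemon-reload
sudo systemctl restart ollama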