Ollama Installation¶
Introduction to Ollama¶
Ollama is a framework for running and managing large language models (LLMs) on local computing resources. It enables the loading and deployment of selected LLMs and provides access to them through an API.
Attention
If you plan to use GPU acceleration for working with LLMs, install the NVIDIA drivers and CUDA first.
System Requirements:
| Requirement | Specification |
|---|---|
| Operating System | Linux: Ubuntu 22.04 or later |
| RAM | 16 GB for running models up to 7B |
| Disk Space | 12 GB for installing Ollama and basic models. Additional space is required for storing model data, depending on the models used |
| Processor | A modern CPU with at least 4 cores is recommended. For running models up to 13B, a CPU with at least 8 cores is recommended |
| Graphics Processing Unit (optional) | A GPU is not required for running Ollama, but can improve performance, especially when working with large models. If you have a GPU, you can use it to accelerate training of custom models. |
Note
The system requirements may vary depending on the specific LLMs and tasks you plan to perform.
Installing Ollama on Linux¶
Download and install Ollama:
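At the time of writing, Ollama for Linux is installed with the official one-line script (check the Ollama website for the current command before running it):

```bash
# Download and run the official Ollama install script for Linux
curl -fsSL https://ollama.com/install.sh | sh

# Check that the systemd service created by the installer is running
systemctl status ollama
```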
For Nvidia GPUs, add Environment="OLLAMA_FLASH_ATTENTION=1" to improve token generation speed.
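This setting goes into the systemd unit for Ollama. A minimal sketch of the relevant fragment of /etc/systemd/system/ollama.service (the path used by the standard installer; adjust if yours differs):

```ini
[Service]
# Enable flash attention to speed up token generation on NVIDIA GPUs
Environment="OLLAMA_FLASH_ATTENTION=1"
```

After editing the unit file, run sudo systemctl daemon-reload and sudo systemctl restart ollama to apply the change.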
Ollama will be accessible at http://127.0.0.1:11434 or http://<your_server_IP>:11434.
Updating Ollama on Linux¶
To update Ollama, you will need to re-download and install its binary package:
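If you installed Ollama with the official script, re-running it downloads the latest release and replaces the binary in place (verify the current command on the Ollama website):

```bash
# Re-run the install script to fetch and install the latest Ollama release
curl -fsSL https://ollama.com/install.sh | sh
```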
Note
If Ollama is not accessible, you may need to add the following lines to the [Service] section of the service file /etc/systemd/system/ollama.service:
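The exact lines depend on your setup; a typical sketch that makes the API listen on all interfaces and accept requests from other origins (illustrative values, tighten them for production):

```ini
[Service]
# Listen on all network interfaces instead of only 127.0.0.1 (illustrative value)
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Allow requests from any origin (illustrative value; restrict in production)
Environment="OLLAMA_ORIGINS=*"
```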
and restart the service with the following commands:
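Assuming the ollama systemd unit name used by the standard installer:

```bash
# Reload the systemd configuration and restart the Ollama service
sudo systemctl daemon-reload
sudo systemctl restart ollama
```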
Installing Language Models (LLMs)¶
You can find the list of available language models on this page.
To install a model, click on its name and then select the size and type of the model on the next page. Copy the installation command from the right-hand window and run it in your terminal/command line:
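For example, for a model named llama3 (an illustrative name; use the exact command shown on the model's page):

```bash
# Download the model to the local model store
ollama pull llama3

# Download (if not already present) and start an interactive session with the model
ollama run llama3
```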
Note
Recommended models are marked with the latest tag.
Attention
To ensure acceptable performance, the model size should be no more than half of the RAM available on the server and no more than ⅔ of the available GPU video memory. For example, an 8 GB model requires at least 16 GB of RAM and 12 GB of GPU video memory.
After downloading the model, restart the service:
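Assuming the standard systemd unit name:

```bash
# Restart the Ollama service after downloading the model
sudo systemctl restart ollama
```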
For more information about Ollama, you can read the developer documentation.
Environment Variables¶
Set these variables in the Ollama service file as Environment="VARIABLE=VALUE". An example of applying them is shown after the table.
| Variable | Description | Possible values / format | Default value |
|---|---|---|---|
| `OLLAMA_DEBUG` | Level of logging detail: INFO (default), DEBUG, or TRACE | 0, 1, false, true, or integer ≥2 (TRACE level) | 0 (INFO level) |
| `OLLAMA_HOST` | Address and port where the Ollama server runs | [http://\|https://]<host>[:<port>] (e.g., 127.0.0.1:11434, https://ollama.local) | 127.0.0.1:11434 |
| `OLLAMA_KEEP_ALIVE` | Time for which the model remains loaded in memory after the last request | Duration string (5m, 1h, 30s) or integer (seconds); negative → indefinitely | 5m |
| `OLLAMA_LOAD_TIMEOUT` | Maximum wait time for loading a model before timeout (to detect hangs) | Duration string or integer (seconds); ≤0 → indefinitely | 5m |
| `OLLAMA_MAX_LOADED_MODELS` | Maximum number of models that can be simultaneously loaded into memory | Non-negative integer (uint) | 0 (automatic management) |
| `OLLAMA_MAX_QUEUE` | Maximum length of the request queue awaiting processing | Non-negative integer (uint) | 512 |
| `OLLAMA_MODELS` | Path to the directory where models are stored | Absolute or relative path | $HOME/.ollama/models |
| `OLLAMA_NOHISTORY` | Disables saving command history in interactive CLI mode | 0, 1, false, true | false |
| `OLLAMA_NOPRUNE` | Prevents deletion (pruning) of unused model BLOB files on startup | 0, 1, false, true | false |
| `OLLAMA_NUM_PARALLEL` | Maximum number of parallel requests to a single model | Non-negative integer (uint) | 1 |
| `OLLAMA_ORIGINS` | List of allowed CORS origins for web requests (comma-separated) | Comma-separated list of origins (e.g., https://myapp.com,http://localhost:3000) | — (built-in values added) |
| `OLLAMA_FLASH_ATTENTION` | Enables experimental flash attention optimization (acceleration on Apple Silicon and NVIDIA GPUs) | 0, 1, false, true | false |
| `OLLAMA_KV_CACHE_TYPE` | Type of quantization for the key-value cache (K/V cache) | f16, q8_0, q4_0 | — (f16 if empty string) |
| `OLLAMA_LLM_LIBRARY` | Forces use of the specified LLM library instead of auto-detection | cpu, cpu_avx, cpu_avx2, cuda_v11, rocm_v5, rocm_v6 | — (auto-detection) |
| `OLLAMA_SCHED_SPREAD` | Spreads model loading evenly across all available GPUs instead of using just one | 0, 1, false, true | false |
| `OLLAMA_MULTIUSER_CACHE` | Optimizes prompt caching in multi-user scenarios (reduces duplication) | 0, 1, false, true | false |
| `OLLAMA_CONTEXT_LENGTH` | Default maximum context length (in tokens), if the model does not specify otherwise | Positive integer (uint) | 4096 |
| `OLLAMA_NEW_ENGINE` | Uses the new experimental engine instead of llama.cpp | 0, 1, false, true | false |
| `OLLAMA_AUTH` | Enables basic authentication between the client and the Ollama server | 0, 1, false, true | false |
| `OLLAMA_INTEL_GPU` | Enables experimental support for Intel GPUs | 0, 1, false, true | false |
| `OLLAMA_GPU_OVERHEAD` | Amount of VRAM (in bytes) reserved per GPU (for system needs) | Non-negative integer (uint64, in bytes) | 0 |
| `OLLAMA_NEW_ESTIMATES` | Enables the new memory size estimation system required to load a model | 0, 1, false, true | 0 (disabled) |
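As an illustration, variables from the table can be applied through a systemd override for the ollama unit (the values below are examples only):

```bash
# Open a drop-in override file for the ollama unit
sudo systemctl edit ollama.service
```

Add the variables to the [Service] section of the override:

```ini
[Service]
# Keep models in memory for one hour after the last request (example value)
Environment="OLLAMA_KEEP_ALIVE=1h"
# Allow two parallel requests per loaded model (example value)
Environment="OLLAMA_NUM_PARALLEL=2"
```

Apply the changes:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```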