
04.12.2025

Improving LLM Benchmark for GPU Servers Equipped with NVIDIA Cards: A Focus on the Ollama Framework


Author: Alexander Kazantsev, Head of Documentation and Content Development

When we assemble GPU servers or investigate issues reported by our clients, we need to test them. Deploying large language models (LLMs) with Ollama is one of the typical use cases for these servers, and we offer pre-built panels based on such models, so we also need to verify that the servers function correctly and perform well under load through Ollama. For that, we need reliable test results to compare against.

GPU Servers – Hourly Billing
These are high-performance servers featuring the latest NVIDIA graphics cards, available for payment on an hourly or monthly basis with discounts.

After searching for open-source solutions, we couldn’t find a suitable benchmark tool, so we decided to develop our own test. The testing algorithm was designed as follows:

  1. Install the Ollama software on the system.
  2. Determine the available video memory capacity and select the appropriate model based on that capacity. We found that the DeepSeek model with 14/32/70 billion parameters and a maximum context size of 128K performed well. We will be testing GPUs with video memory of 16 GB or more.
  3. Send the same prompt to the model multiple times (ideally one that makes it generate code, so we get a predictable amount of output), gradually increasing the context size from 2000 tokens to the model’s maximum in increments of 2000 tokens.
  4. Record the results, including the generation time and the number of tokens generated per second, as well as other relevant metrics from Ollama and the GPU performance data (from nvidia-smi).
  5. Analyze the relationship between the model parameters and context size, and determine baseline values for subsequent comparisons.

First Approach: Using Bash Script

We decided to write the script in Bash to avoid pulling in more complex languages like Python. The only additional package required is jq for processing the JSON responses from Ollama; curl, which we use for the API calls, is included by default in Ubuntu 22.04 and 24.04.
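
For completeness, on a fresh Ubuntu system the prerequisites can be installed roughly as follows (the Ollama one-liner is the official install script; run with root privileges):

# Install jq (and curl, if it is missing) for JSON parsing and API calls
apt-get update && apt-get install -y jq curl

# Install Ollama using the official install script
curl -fsSL https://ollama.com/install.sh | sh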

To determine the video memory capacity, we’ll use the ollama serve command. However, there’s a slight issue: Ollama is designed as a service, so trying to run the command directly may cause conflicts with already running instances of Ollama. To resolve this, we need to:

  • Stop any currently running Ollama instances.
  • Start the ollama serve command.
  • Extract the available memory information from the output and store it in an array.
  • Use a regular expression to extract the relevant values.
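
In sketch form, the stop/start/capture steps could look like this (the systemd unit name ollama and the 10-second timeout are our assumptions, not an excerpt from the actual script):

# Stop the running service to avoid "address already in use" conflicts
systemctl stop ollama 2>/dev/null || true

# Run a short-lived server: it probes the GPUs at startup and logs the available VRAM
log_output=$(timeout 10 ollama serve 2>&1 || true)

# Restore the service for the API-based tests that follow
systemctl start ollama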

The memory values are then pulled out of the captured log and stored in an array:

mapfile -t available_memory < <(echo "$log_output" | grep -o 'available="[^"]*"' | grep -o '[0-9.]*')

Then we print the detected values and sum them up to get the total available video memory:

printf '%s GiB\n' "${available_memory[@]}"
total_available=$(echo "$log_output" | awk -F'available="' '/available="/ { gsub(/".*/, "", $2); sum += $2 } END { print sum+0 }')

Although we could also use the nvidia-smi command to get the available GPU memory, ollama serve provides direct access to this information, making it more straightforward.
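
For reference, the nvidia-smi equivalent would be something along these lines:

# Free video memory per GPU, in MiB, one line per card
nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits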

After stopping and starting the Ollama service, we can proceed with further testing using its API.

Next, based on the total amount of video memory available, you need to select the appropriate model according to the following criteria:

  • deepseek-r1:14b = 15 GiB (the values are in gigabytes, not gigabits)
  • deepseek-r1:32b = 23 GiB
  • deepseek-r1:70b = 48 GiB
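
In sketch form, the selection could be expressed as follows (total_available is the sum computed earlier; the thresholds come from the list above, the rest is illustrative):

# Pick the largest DeepSeek R1 model that fits into the detected VRAM
if   (( $(echo "$total_available >= 48" | bc -l) )); then MODEL="deepseek-r1:70b"
elif (( $(echo "$total_available >= 23" | bc -l) )); then MODEL="deepseek-r1:32b"
elif (( $(echo "$total_available >= 15" | bc -l) )); then MODEL="deepseek-r1:14b"
else
    echo "Not enough video memory: at least 15 GiB is required" >&2
    exit 1
fi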

You need to “warm up” the model by running Ollama with it. This is necessary to ensure that the benchmark results reflect performance without any delay caused by loading the model into memory.

curl -s http://localhost:11434/api/generate \
      -H "Content-Type: application/json" \
      -d '{"model":"'"$MODEL"'","prompt":"Hello","stream":false}' \
      >/dev/null

After that, you can run a loop that increases the context parameter (ctx, passed to the API as num_ctx) from 2000 to 128000 and, at each step, asks the model to generate a Tetris game using the prompt “Generate Tetris game on HTML and JS”.

# One test iteration: send the prompt with the current num_ctx and save the JSON response
curl -s --max-time 300 \
     -H "Content-Type: application/json" \
     -d '{
           "model": "'"$MODEL"'",
           "prompt": "'"$prompt"'",
           "stream": false,
           "options": {
             "num_ctx": '"$num_ctx"'
           }
         }' \
     http://localhost:11434/api/generate > "$response_file"

# Pull the metrics out of the response (Ollama reports durations in nanoseconds)
local total_tokens=$(jq -r '.eval_count // 0' "$response_file")
local model_response=$(jq -r '.response // ""' "$response_file")
local eval_duration_ns=$(jq -r '.eval_duration // 0' "$response_file")
local load_duration_ns=$(jq -r '.load_duration // 0' "$response_file")

# Convert nanoseconds to seconds and compute the generation speed
local eval_sec=$(echo "$eval_duration_ns / 1000000000" | bc -l)
local load_sec=$(echo "$load_duration_ns / 1000000000" | bc -l)
local tokens_per_sec=0
if (( $(echo "$eval_sec > 0.001" | bc -l) )); then
    tokens_per_sec=$(echo "scale=2; $total_tokens / $eval_sec" | bc)
fi
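
The snippet above is the body of a single iteration; the surrounding loop could look roughly like this (run_single_test is a hypothetical wrapper name used for illustration, not the actual function from the script):

# Step the context from 2000 tokens up to the model maximum in 2000-token increments
for (( num_ctx=2000; num_ctx<=128000; num_ctx+=2000 )); do
    run_single_test "$num_ctx"   # hypothetical wrapper around the curl call shown above
done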

Next, we neatly display all the results on the screen, including the parameters output by nvidia-smi:

nvidia-smi --query-gpu=index,name,memory.used,memory.total,temperature.gpu,power.draw,power.limit --format=csv,noheader

We stop the testing once the model with the current context no longer fits in video memory and the load starts spilling over to the CPU.

That’s it. The first version of the test did only this. It worked, but requirements tend to grow during real-world testing, so our colleagues asked us to improve the test, and that is how the second version came about.

Second Approach

There were quite a few changes made to the second version:

  1. In the first version, only the single model that fit the available video memory was selected; smaller models were simply skipped. This made it hard to compare different GPU server configurations, since each of them ended up running a different model. As a result, two modes were introduced: “max” and “test.” The “max” mode tests only the largest model that fully fits into video memory, while the “test” mode tests all models in ascending order of size.
  2. The initial context size was raised to Ollama’s default of 4K.
  3. You can now provide your own prompt instead of using the default one for Tetris generation.
  4. You can now specify a particular model from the list of available ones.
  5. You can run tests with a custom context value without having to go through all options.
  6. You can choose a different group of models for testing; open-source models from OpenAI and Qwen have also been added.
  7. In the “test” mode, a final report is generated in HTML format.

All these options are now available by providing the necessary parameters when starting the test.

Options:
  -t MODE     Execution mode:
                max   — run the largest model from the group that fits in memory,
                        then test with increasing context (default)
                test  — test all models in the group from smallest to largest
  -p PROMPT   Prompt for generation (default: "Generate Tetris game on HTML and JS")
  -m MODEL    Run ONLY the specified model (ignores -t and -g)
  -c CTX      Use a fixed context size
  -g GROUP    Model group: deepseekr1 (default), gpt-oss, qwen3
  -h          Show this help and exit
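
For example, assuming the script is saved as ollama_gpu_test.sh (the file name here is ours; check the repository for the actual one), a run over the whole Qwen3 group with a custom prompt would look like this:

./ollama_gpu_test.sh -t test -g qwen3 -p "Generate a snake game in HTML and JS"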

The model table now looks like this:

Model              Minimum Required Video Memory (GiB)
deepseek-r1:14b    15
deepseek-r1:32b    23
deepseek-r1:70b    48
gpt-oss:20b        16
gpt-oss:120b       70
qwen3:14b          15
qwen3:32b          23

It can be easily modified or expanded in the future, as it’s defined by the following code snippet:

declare -A MODEL_VRAM
MODEL_VRAM=(
    ["deepseek-r1:14b"]="15"
    ["deepseek-r1:32b"]="23"
    ["deepseek-r1:70b"]="48"
    ["gpt-oss:20b"]="16"
    ["gpt-oss:120b"]="70"
    ["qwen3:14b"]="15"
    ["qwen3:32b"]="23"
)

declare -A MODEL_GROUPS
MODEL_GROUPS["deepseekr1"]="deepseek-r1:14b deepseek-r1:32b deepseek-r1:70b"
MODEL_GROUPS["gpt-oss"]="gpt-oss:20b gpt-oss:120b"
MODEL_GROUPS["qwen3"]="qwen3:14b qwen3:32b"

Additionally, we made some minor formatting improvements and fixed a few small bugs, such as incorrect Ollama initialization when the service was not set up and errors when detecting the offloading mode.

Below is an example of the test results displayed on screen for the second version:

=== [deepseek-r1:70b] Test with num_ctx=30000 ===
Ollama PS:
NAME               ID              SIZE     PROCESSOR    CONTEXT    UNTIL
deepseek-r1:70b    d37b54d01a76    62 GB    100% GPU     30000      4 minutes from now
nvidia-smi:
GPU  Name                 Mem used     Mem total    Temp         Power draw   Power lim
0     NVIDIA GeForce RTX 5090  31447 MiB    32607 MiB    63°C        340.80 W     575.00 W
1     NVIDIA GeForce RTX 5090  31179 MiB    32607 MiB    58°C        295.09 W     575.00 W
Test results:
Loading time: 17.687 seconds
Generation time: 83.218 seconds
Tokens used: 2752
Model speed: 33.06 tokens per second
GPU utilization: 100%
✅ The deepseek-r1:70b model has been fully tested.
🏁 Results are saved in /root/gpu_test/

The full code of the script is available in our GitHub repository and is free to use:

https://github.com/hkadm/ollama_gpu_test/tree/main/nvidia_test

Results

The test proved successful: we have run it on a number of our GPU servers, and even in its current form it revealed some interesting patterns:

  1. Ollama is unable to distribute the workload across multiple GPUs when the ctx (context) size grows for smaller models. In our tests, small models simply crash with an error once they exhaust the memory of a single GPU.
  2. The processing speed of the model does not depend on the ctx size under equal conditions; on average, it remains constant.
  3. Ollama does not parallelize inference across GPUs; it merely spreads the model across all available video memory. Therefore, you shouldn’t expect two NVIDIA RTX 5090s to work twice as fast as a single GPU: the speed will be the same.
  4. Various environmental factors have almost no impact on the final result. We were only able to achieve a maximum of +1 token per second.

The current table with the test results for the GPUs we used (A4000, A5000, RTX 4090, RTX 5090, 2×RTX 4090, and 2×RTX 5090) on DeepSeek R1 models ranging from 7B to 70B can also be found on GitHub: https://github.com/hkadm/ollama_gpu_test/blob/main/test_result/Test_total.md

Keep in mind that Ollama is still under active development, so you may well achieve better results yourself. We’d be happy if you tried our test and shared your findings with us!

