Dedicated Servers
  • Instant
  • Custom
  • Single CPU servers
  • Dual CPU servers
  • Servers with 4th Gen CPUs
  • Servers with AMD Ryzen and Intel Core i9
  • Storage Servers
  • Servers with 10Gbps ports
  • Hosting virtualization nodes
  • GPU
  • Sale
  • VPS
  • Dedicated GPU server
  • VM with GPU
  • Tesla A100 80GB & H100 Servers
  • Sale
  • VMware and RedHat's oVirt Сlusters
  • Proxmox VE
  • Colocation
  • Colocation in the Netherlands
  • Remote smart hands
  • Services
  • Intelligent DDoS protection
  • Network equipment
  • IPv4 and IPv6 address
  • Managed servers
  • SLA packages for technical support
  • Monitoring
  • Software
  • VLAN
  • Announcing your IP or AS (BYOIP)
  • USB flash/key/flash drive
  • Traffic
  • Hardware delivery for EU data centers
  • About
  • Careers at HOSTKEY
  • Server Control Panel & API
  • Data Centers
  • Network
  • Speed test
  • Hot deals
  • Sales contact
  • Reseller program
  • Affiliate Program
  • Grants for winners
  • Grants for scientific projects and startups
  • News
  • Our blog
  • Payment terms and methods
  • Legal
  • Abuse
  • Looking Glass
  • The KYC Verification
  • Hot Deals


    Multi-threaded video encoding on a professional GPU

    server one

    Professional GPUs in servers are positioned as devices for high-performance computing, artificial intelligence systems and rendering farms for 3D graphics. Should they be used for encoding, or is this just overkill? Let's take a closer look.

    If you’re working with multi-threaded video, the power of modern CPUs and solutions like Intel Quick Sync are enough. Moreover, some experts believe that using professional GPUs for decoding and encoding is a waste of resources. The number of incoming streams is specifically limited to two or three on consumer video cards, although we have already seen that a little voodoo with the driver allows you to get around this limitation. In the previous article we tested consumer video cards, and now we will deal with a more serious contender - the NVIDIA RTX A4000.

    Getting ready for testing

    What if the lscpu output gives you something along the lines of an AMD Ryzen 9 5950X 16-Core Processor, but you have an NVIDIA RTX A4000 with 16GB of RAM plugged into your computer and you want to transcode and record multiple network cameras? Their feeds usually comes in via http, rtp or rtsp, and our task is to catch these streams, transcode them into the required format and write each one to a separate file.

    For the test, we at HOSTKEY created a small test bench with the above CPU/GPU configuration without any special optimizations and 32 GB of RAM. On it, through ffmpeg, we will receive multicast broadcasts in http and rtsp formats (we used this video file, bbb_sunflower_1080p_30fps_normal.mp4, from the Blender demo repository), decode it in a different number of ffmpeg streams and record each of them in a separate file. As the name suggests, we are streaming in 1080p (30 frames per second). Encoding will only be applied to the video, as the audio streams will go unchanged.

    It doesn’t really matter whether we take one incoming stream and simulate its multithreading or process several streams in parallel. Networking and current processes on the test bench take up less than 1% of CPU resources, so we can assume that encoding will provide the main load on the processor and disk subsystem.

    All further description is going to be about the http broadcast, as the results for the rtsp stream turned out to be comparable. So as not to produce a lot of terminal consoles on the server, simple bash scripts were created for the test, into which the required number of ffmpeg instances are transferred at startup, transcoding the video stream into h264.

    Encoding on a bare CPU:

    for (( i=0; i<$1; i++ )) do
    ffmpeg -i http://XXX.XXX.XXX.XXX:5454/ -an -vcodec h264 -y Output-File-$i.mp4 &

    For the GPU, we will use the capabilities of the video card through NVENC (we discussed how to build ffmpeg with its support in the first article of the series):

    for (( i=0; i<$1; i++ )) do
    ffmpeg -i http://XXX.XXX.XXX.XXX:5454/ -an -vcodec h264_nvenc -y Output-File-$i.mp4 &

    The scripts run in a multicast loop and snare our rabbit. You should first check through the same vlc or ffplay that the stream is actually being broadcast over. We will evaluate the result by CPU / GPU load, memory utilization and the quality of the recorded video, where the main parameters for our purposes will be the following: fps (it must be stable and not fall below 30 frames per second) and speed (shows whether we have time to process video on the go). For real-time, the speed parameter must be greater than 1.00x.

    Subsidence of these two parameters leads to dropped frames, artifacts, encoding problems and other image damage that you would not want to see on CCTV footage.

    Checking encoding on a bare CPU

    Running one copy of ffmpeg gives us this initial picture:

    CPU usage averages 18-20% per core, and the ffmpeg output shows the following:

    frame=196 fps=87 q=-1.0 Lsize=2685kB time=00:00:06.43 bitrate=3419.0kbits/s speed=2.84x

    There is a lot in reserve, so we can try three streams at once:

    frame=310 fps=54 q=29.0 size=4608kB time=00:00:07.63 bitrate=4945.3kbits/s speed=1.33x

    Four threads take up almost all the resources of the CPU and eat up 13 GB of RAM.

    That being said, the ffmpeg output shows that it can handle a little more:

    frame=332 fps=49 q=29.0 size=3072kB time=00:00:08.36 bitrate=3007.9kbits/s speed=1.23x

    We increase the number of threads to five. The processor runs at its limit, and in some places the frame rate and bitrate drops by 5–10%:

    frame=491 fps=37 q=29.0 size=4864kB time=00:00:13.66 bitrate=2915.6kbits/s speed=1.03x

    Running six threads shows that the limit has indeed been reached. We are getting more and more behind real time and we are starting to drop frames:

    frame=140 fps=23 q=29.0 size=1024kB time=00:00:01.96 bitrate=2954.4kbits/s speed=0.446x

    Testing the might of the GPU

    Rent off-the-shelf GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA RTX 4000 / 5500 / A6000 cards. These solutions are ideal for remote access to high-load applications from any place on Earth.

    We start with one ffmpeg stream with encoding through h264_nvenc. We make sure through the output of the nvidia-smi that we actually have the video card involved:

    As the output is quite cumbersome, we will monitor the GPU parameters with the following command:

    nvidia-smi dmon -s pucm

    Let's decipher the abbreviations:

    • • pwr — power consumed by the video card in watts;
    • • gtemp — video core temperature (in degrees Celsius);
    • • sm — SM, mem — memory, enc — encoder, dec — decoder (use of their resources is given as a percentage);
    • • mclk — current memory frequency (in MHz), pclk — current processor frequency (in MHz);
    • • fb — frame buffer usage (in MB).

    gpu pwr gtemp mtemp sm mem enc dec mclk pclk fb bar1
    Idx W C C % % % % MHz MHz MB MB
    0 35 48 - 1 0 6 0 6500 1560 213 5
    gpu Idx 0
    pwr W 35
    gtemp C 48
    mtemp C -
    sm % 1
    mem % 0
    enc % 6
    dec % 0
    mclk MHz 6500
    pclk MHz 1560
    fb MB 213
    bar1 MB 5

    In this output, we will be interested in the values of GPU encoder load and video memory use.

    The ffmpeg output gives the following results:

    frame=192 fps=96 q=23.0 Lsize=1575kB time=00:00:06.36 bitrate=2027.1kbits/s speed=3.17x

    We launch five streams at once. As can be seen from the htop output, in the case of GPU encoding, the CPU load is minimal, and most of the work falls on the video card. The disk subsystem is also loaded much less.

    gpu pwr gtemp mtemp sm mem enc dec mclk pclk fb bar1
    Idx W C C % % % % MHz MHz MB MB
    0 36 48 - 8 2 40 0 6500 1560 1035 14
    gpu Idx 0
    pwr W 36
    gtemp C 48
    mtemp C -
    sm % 8
    mem % 2
    enc % 40
    dec % 0
    mclk MHz 6500
    pclk MHz 1560
    fb MB 1035
    bar1 MB 14

    The load on the encoding blocks has increased to 40%, we have taken up almost a gigabyte of memory, but the video card is actually not heavily loaded. The ffmpeg output confirms this, showing that we have the resources to increase the number of threads by at least 2x:

    frame=239 fps=67 q=36.0 Lsize=2063kB time=00:00:07.93 bitrate=2130.3kbits/s speed=2.22x

    We put on ten streams. CPU usage is at the level of 15-20%.

    Graphics Card figures:

    gpu pwr gtemp mtemp sm mem enc dec mclk pclk fb bar1
    Idx W C C % % % % MHz MHz MB MB
    0 55 48 - 14 4 61 0 6500 1920 2064 24
    gpu Idx 0
    pwr W 55
    gtemp C 48
    mtemp C -
    sm % 14
    mem % 4
    enc % 61
    dec % 0
    mclk MHz 6500
    pclk MHz 1920
    fb MB 2064
    bar1 MB 24

    Power consumption has increased, the video card was forced to overclock the video core frequency, but the encoding power and video memory allow for increasing the load. We check the output of ffmpeg to make sure:

    frame=1401 fps=36 q=29.0 Lsize=12085kB time=00:00:46.66 bitrate=2121.5kbits/s speed=1.2x

    We try to add four more streams and get the load on the encoding blocks to 100%.

    gpu pwr gtemp mtemp sm mem enc dec mclk pclk fb bar1
    Idx W C C % % % % MHz MHz MB MB
    0 68 59 - 18 7 100 0 6500 1920 2886 33
    gpu Idx 0
    pwr W 68
    gtemp C 59
    mtemp C -
    sm % 18
    mem % 7
    enc % 100
    dec % 0
    mclk MHz 6500
    pclk MHz 1920
    fb MB 2886
    bar1 MB 33

    The ffmpeg output confirms that we have reached the limit. CPU usage still does not exceed 20%.

    frame=668 fps=31 q=26.0 Lsize=5968kB time=00:00:22.23 bitrate=2199.0kbits/s speed=1.04x

    The benchmark 15 threads show that the GPU is starting to fail as the encoders are overloaded, and there is an increase in temperature and power consumption.

    gpu pwr gtemp mtemp sm mem enc dec mclk pclk fb bar1
    Idx W C C % % % % MHz MHz MB MB
    0 70 63 - 18 7 100 0 6500 1920 3092 35
    gpu Idx 0
    pwr W 70
    gtemp C 63
    mtemp C -
    sm % 18
    mem % 7
    enc % 100
    dec % 0
    mclk MHz 6500
    pclk MHz 1920
    fb MB 3092
    bar1 MB 35

    Ffmpeg also confirms that the graphics card is getting stressed. The processing frequency and frame skipping are no longer encouraging:

    frame=310 fps=28 q=29.0 size=2560kB time=00:00:10.23 bitrate=2049.4kbits/s speed=0.939x

    CPU vs. GPU

    Let's summarize: the use of a GPU in such a configuration can be deemed justified because the maximum number of streams processed by the video card is 3 times higher than the capabilities of far from the weakest processors (especially without the support of hardware encoding technologies). On the other hand, we use only a small part of the capabilities of the video adapter. Since the rest of its blocks and video memory are not heavily loaded, the resources of an expensive device are actually being used inefficiently.

    Sophisticated readers may point out that we didn't test the work in 2K/4K modes, and we didn't use the capabilities of modern codecs (such as h265 and VP8/9), and also we installed a video adapter based on the previous generation’s architecture in the test bench. The same A5000 should show a better result, so we will check its work in the next article, and then we will dissect Intel Quick Sync.

    Rent off-the-shelf GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA RTX 4000 / 5500 / A6000 cards. These solutions are ideal for remote access to high-load applications from any place on Earth.

    Other articles


    How to choose the right server with suitable CPU/GPU for your AI?

    Let's talk about the most important components that influence the choice of server for artificial intelligence.


    VPS, Website Hosting or Website Builder? Where to host a website for business?

    We have compared website hosting options, including VPS, shared hosting, and website builders.


    How AI is fighting the monopoly in sports advertising with GPUs and servers

    AI and AR technologies allow sports advertising to be customized to different audiences in real time using cloud-based GPU solutions.


    From xWiki to static-HTML. How we “transferred” documentation

    Choosing a platform for creating a portal with internal and external documentation. Migration of documents from cWiki to Material for MkDocs


    Test Build: Supermicro X13SAE-F Intel Core i9-14900KF 6.0 GHz

    Test results of a computer assembly based on the Supermicro X13SAE-F motherboard and the new Intel Core i9-14900KF processor overclockable up to 6.0 GHz.

    HOSTKEY Dedicated servers and cloud solutions Pre-configured and custom dedicated servers. AMD, Intel, GPU cards, Free DDoS protection amd 1Gbps unmetered port 30
    4.3 67 67