20.06.2022

Multithreaded encoding: Pay twice as much or go for built-in?


Our test of the NVIDIA A4000 practically confirmed that it is able to encode up to 16 independent FullHD video streams in H264 format. Will we be able to multiply the performance with a professional video card, which costs twice as much? Let's check it out.

HOSTKEY
Rent GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA RTX 5500 / 5000 / A4000 cards. VPS with dedicated GPU cards are also available . The GPU card is dedicated to the VM and cannot be used by other clients. GPU performance in virtual machines matches GPU performance in dedicated servers.

In our second article about encoding (with the A4000 test) we missed the fact that a video stream can be of higher resolution, so it's worth testing 4K file encoding. To complete the picture, we will also compare encoding on solutions from NVIDIA with Intel's built-in GPU. Some professionals believe that it is enough to use the same FFmpeg with QuickSync enabled and an external video card will no longer be needed. We will check this assertion as well.

We won't describe in detail the testing process for NVIDIA video cards and why we need FFmpeg, as this is covered in the previous articles (parts one and two). We'd rather focus on the new results and useful tips and tricks.

A4000 vs A5000

We will use the same test rig from the existing HOSTKEY servers, but install an NVIDIA A5000 graphics card with more encoding blocks, 24 GB of video memory and higher power consumption.

NVIDIA A5000

First, let's check its performance based on the number of threads, which turned out to be the limit for the A4000 according to the results of the previous test:

14 threads

gpu    pwr  gtemp  mtemp     sm    mem    enc    dec   mclk   pclk     fb   bar1
Idx      W      C      C      %      %      %      %    MHz    MHz     MB     MB
  0     97     47      -     92      3    100      0   7600   1920   3502     33
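
The table above has the column layout of nvidia-smi dmon with the power/temperature, utilization, clock and memory metric groups enabled. The article does not show the exact invocation, so the "-s pucm" selection in the usage comment is an assumption. A minimal sketch for pulling only the encoder load out of such output, with enc_util as a hypothetical helper name:

```shell
# enc_util: read `nvidia-smi dmon` output on stdin and print the encoder
# utilization (the "enc" column) for every data row.
# Assumes the 12-column layout shown above, where "enc" is column 7.
# Header rows are skipped because their first field is not a GPU index.
enc_util() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 7 { print $7 }'
}

# Usage (assumed invocation, five samples):
#   nvidia-smi dmon -s pucm -c 5 | enc_util
```

A value that stays at or near 100 while the FFmpeg speed falls below 1x suggests the encoder blocks, not video memory, are the bottleneck.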

frame=1015 fps=31 q=28.0 Lsize= 9056kB time=00:00:33.80 bitrate=2194.8kbits/s speed=1.02x

Amazing! We got figures comparable to those of the A4000. Despite higher chip frequency, more video memory and higher power consumption, the A5000 managed to encode only 14 threads and gave up on the fifteenth. This fiasco proves once again that professional video adapters are designed for other purposes.

Switching to 4K

Now let's try broadcasting a stream at 3840x2160 resolution (aka 4K). Thankfully, we have such a video file about a rabbit. With the amount of data multiplied, CPU-only encoding could not keep up in real time even with a single thread:

frame= 2902 fps=27 q=29.0 size=104448kB time=00:01:33.56 bitrate=9144.7kbits/s dup=436 drop=0 speed=0.878x

What are the capabilities of the GPU (remember, the results of the A4000 and A5000 are comparable)? It's 3 threads.

gpu    pwr  gtemp  mtemp     sm    mem    enc    dec   mclk   pclk     fb   bar1
Idx      W      C      C      %      %      %      %    MHz    MHz     MB     MB
  0     96     46      -    100      3     96      0   7600   1920   1112      9

As you can see from the power consumption and the load on the encoding blocks, the video chip is working close to its limit, although only about 1 GB of video memory is being used.

FFmpeg output confirms that the video card is doing fine:

frame= 1465 fps=33 q=35.0 Lsize=12584kB time=00:00:48.80 bitrate=2112.4kbits/s dup=159 drop=0 speed=1.09x

However, the adapter can't handle 4 streams. Although the hardware load remains at about the same values, the frame rate and encoding speed drop below real time:

frame= 614 fps= 26 q=35.0 Lsize=4978kB time=00:00:20.43 bitrate=1995.6kbits/s speed=0.858x
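
In all of these tests, a stream is judged by the speed field at the end of the FFmpeg status line: 1x or more means the encoder keeps up with playback. A small sketch that automates this check follows. The helper names ffmpeg_speed and keeps_realtime are hypothetical, and the 1.0 threshold is our reading of the article, not an FFmpeg setting:

```shell
# ffmpeg_speed: extract the numeric speed factor from one ffmpeg status
# line (the "speed=1.09x" field). Prints nothing if the field is absent.
ffmpeg_speed() {
    printf '%s\n' "$1" | sed -n 's/.*speed= *\([0-9.][0-9.]*\)x.*/\1/p'
}

# keeps_realtime: succeed (exit status 0) when the reported speed is at
# least 1.0x, i.e. the stream is encoded as fast as it is played back.
keeps_realtime() {
    s=$(ffmpeg_speed "$1")
    [ -n "$s" ] && awk -v s="$s" 'BEGIN { exit !(s >= 1.0) }'
}

# Usage:
#   keeps_realtime "frame= 614 fps= 26 ... speed=0.858x" || echo "stream is falling behind"
```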

Using FFmpeg with QuickSync support

According to the developer, QuickSync is supposed to "use the special multimedia processing capabilities of Intel® graphics technology to accelerate decoding and encoding, allowing the processor to perform other tasks in parallel and improving system performance."

For the tests we needed a suitable Intel processor (we found a machine with a Core i9-9900K CPU @ 3.60GHz) and FFmpeg with Quick Sync support. There were no problems with the former (we only needed a 6th-generation or newer chip with an integrated GPU, which is easy to check), but setting up FFmpeg for Ubuntu 20.04 felt like practicing the Kama Sutra. To save you precious time, we will describe how we solved the problem.

Since the packages in the repositories are broken, the first thing to do is to build and install gmmlib and libva libraries, as well as the latest Intel media driver and Media SDK versions in the system. To do that, create a GIT folder in your home directory, go to it and run the following commands in sequence (if any dependencies are missing, install them from the repository; we recommend doing sudo apt install autoconf automake build-essential cmake pkg-config):

# Return to ~/GIT before each clone so the repositories are not nested
mkdir -p ~/GIT && cd ~/GIT
git clone https://github.com/intel/gmmlib.git && cd gmmlib
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_LIBDIR=/usr/lib/x86_64-linux-gnu ..
make -j8
sudo make install

cd ~/GIT
git clone https://github.com/intel/libva.git && cd libva
./autogen.sh --prefix=/usr --libdir=/usr/lib/x86_64-linux-gnu
make -j8
sudo make install

cd ~/GIT
git clone https://github.com/intel/media-driver.git && cd media-driver
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_LIBDIR=/usr/lib/x86_64-linux-gnu ..
make -j8
sudo make install

cd ~/GIT
git clone https://github.com/Intel-Media-SDK/MediaSDK.git && cd MediaSDK
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_LIBDIR=/usr/lib/x86_64-linux-gnu ..
make -j8
sudo make install
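
Each of these builds stops partway if a tool is missing, so it can save time to check for the build tools before the first clone. A minimal sketch, where missing_tools is a hypothetical helper and the tool list in the usage comment is our guess at what the steps above call directly:

```shell
# missing_tools: print the name of every given command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: an empty result means every listed tool was found.
#   missing_tools git cmake make autoconf automake pkg-config
```

This only checks for executables. Missing development libraries will still show up as cmake or configure errors.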

Then you need to build FFmpeg with a few magic commands:

git clone https://github.com/ffmpeg/ffmpeg
cd ffmpeg
./configure --enable-libmfx --enable-vaapi --enable-opencl --enable-libvorbis --enable-libvpx --enable-libdrm --enable-gpl --cpu=native --enable-libfdk-aac --enable-libx264 --enable-libx265 --extra-libs=-lpthread --enable-nonfree
make -j8
sudo make install

It's worth making sure you have Quick Sync support:

ffmpeg -decoders|grep qsv

The output of the command should look something like this:

V....D av1_qsv              AV1 video (Intel Quick Sync Video acceleration) (codec av1) 
V....D h264_qsv             H264 video (Intel Quick Sync Video acceleration) (codec h264) 
V....D hevc_qsv             HEVC video (Intel Quick Sync Video acceleration) (codec hevc) 
V....D mjpeg_qsv            MJPEG video (Intel Quick Sync Video acceleration) (codec mjpeg) 
V....D mpeg2_qsv            MPEG2VIDEO video (Intel Quick Sync Video acceleration) (codec mpeg2video) 
V....D vc1_qsv              VC1 video (Intel Quick Sync Video acceleration) (codec vc1) 
V....D vp8_qsv              VP8 video (Intel Quick Sync Video acceleration) (codec vp8) 
V....D vp9_qsv              VP9 video (Intel Quick Sync Video acceleration) (codec vp9) 
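
The listing above covers decoders. The encoding tests rely on the h264_qsv encoder, which appears in the separate "ffmpeg -encoders" listing. A small sketch for checking either listing by codec name (has_codec is a hypothetical helper name):

```shell
# has_codec: read an `ffmpeg -encoders` or `ffmpeg -decoders` listing on
# stdin and succeed if a codec with the given name appears in it.
# The codec name is the second whitespace-separated field of each row.
has_codec() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Usage:
#   ffmpeg -hide_banner -encoders | has_codec h264_qsv && echo "QSV H.264 encoder available"
```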

Good! Everything is ready for testing.

Testing encoding with Quick Sync

First, let's see how the processor can handle FullHD video encoding without Quick Sync: it can manage 4 threads maximum, with all cores under 100% load.

frame= 1461 fps= 33 q=29.0 size=24064kB time=00:00:46.33 bitrate=4254.7kbits/s speed=1.05x

The fifth thread is no longer being handled by the processor, so we can safely proceed with the Quick Sync test. In the script from the previous article, you will need to change the encoder to h264_qsv, and it will look like this (you can read more about using QuickSync with FFmpeg here):

#!/bin/bash
for (( i=0; i<$1; i++ )) do
	ffmpeg -i http://78.0.75.110:5454/ -an -vcodec h264_qsv -y Output-File-$i.mp4 &
done
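
With many encoders running in the background, their status lines get mixed together in one terminal. The following is a sketch of a variant that writes each stream's messages to its own log and waits for all of them to finish. It is not the script used for the article: the function name launch_streams, the log file names and the FFMPEG override variable are our own assumptions.

```shell
# launch_streams N URL [ENCODER] [OUTDIR]
# Start N parallel encodes of URL, write each stream's ffmpeg messages to
# its own log file, and wait until all of them have finished.
# ENCODER defaults to h264_qsv and OUTDIR to the current directory.
# FFMPEG may name another binary; it defaults to "ffmpeg".
launch_streams() {
    n=$1 url=$2 enc=${3:-h264_qsv} out=${4:-.}
    i=0
    while [ "$i" -lt "$n" ]; do
        "${FFMPEG:-ffmpeg}" -nostdin -i "$url" -an -vcodec "$enc" \
            -y "$out/Output-File-$i.mp4" >"$out/stream-$i.log" 2>&1 &
        i=$((i + 1))
    done
    wait
}

# Usage:
#   launch_streams 6 http://78.0.75.110:5454/
```

Passing a different encoder name as the third argument reuses the same loop for other encoders.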

First we do a test on 6 threads (+2 to the test on a clean CPU):

frame=291 fps=55 q=29.0 size=1280kB time=00:00:10.13 bitrate=1034.8kbits/s dup=2 drop=0 speed=1.93x

The difference is obvious: the CPU load is less than 50%, and the available reserve of computing resources allows you to predict 11 - 12 total threads.
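
The article does not say how the 11 - 12 figure was derived. One rough way to arrive at a similar number, offered here only as an assumption, is to multiply the number of streams by the speed factor FFmpeg reported for them. This treats throughput as scaling linearly with the number of streams, which real hardware does not guarantee. The helper name estimate_streams is hypothetical:

```shell
# estimate_streams STREAMS SPEED
# Multiply the number of concurrent streams by the speed factor ffmpeg
# reported for them, giving a rough count of real-time streams.
estimate_streams() {
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n * s }'
}

estimate_streams 6 1.93   # 6 streams at 1.93x, prints 11.6
```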

Let's try 11 threads:

frame=157 fps=30 q=38.0 Lsize=628kB time=00:00:05.69 bitrate=903.0kbits/s dup=2 drop=0 speed=1.09x

The processor load increases only slightly, but the integrated GPU is already reaching its limits. The twelfth thread lowers the bitrate and drops the processing speed to 24 - 28 frames per second.

Now let's check the threads in 4K. In contrast to AMD, our Intel processor easily handles one thread at this resolution and without hardware acceleration:

frame=655 fps=31 q=-1.0 Lsize=30637kB time=00:00:21.73 bitrate=11547.9kbits/s speed=1.03x

Unfortunately, it couldn’t do more than that. With Quick Sync on, the test computer was able to pull three 4K threads:

frame= 509 fps=31 q=33.0 Lsize=8010kB time=00:00:17.42 bitrate=3764.7kbits/s dup=2 drop=0 speed=1.07x

It failed only on the fourth, but so did our NVIDIA A5000 video card.

Unfortunately, the solution has disadvantages as well. When using the BMC module (for example, when controlling a machine via IPMI), you will not have access to all the hardware acceleration capabilities, even if a GPU is detected in the system. You'll have to choose between the convenience of remote management or getting all the benefits of using Quick Sync.

Bottom line

You can draw your own conclusions. We would only mention that for video encoding, the difference in video card capabilities is not always determined by the price, and for some tasks it is worth paying attention to specialized technologies in CPUs. We used H264 for the tests, but the HEVC (H265) or AV1 codecs should in theory give better results, especially at 4K resolutions. If you do similar tests with the former yourself (hardware support for AV1 is still widely available only for decoding), share your results in the comments.

