20.06.2022

Multithreaded encoding: Pay twice as much or go for built-in?

Our test of the NVIDIA A4000 confirmed in practice that it can encode up to 16 independent FullHD video streams in H264 format. Can we multiply that performance with a professional video card that costs twice as much? Let's check it out.

Index:

A4000 vs. A5000
Switching to 4K
Using FFmpeg with QuickSync support
Testing encoding with Quick Sync
Bottom line

Rent GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA RTX 5500 / 5000 / A4000 cards. VPS with dedicated GPU cards are also available. The GPU card is dedicated to the VM and cannot be used by other clients. GPU performance in virtual machines matches GPU performance in dedicated servers.

In our second article about encoding (the A4000 test) we overlooked the fact that a video stream can have a higher resolution, so it's worth testing 4K file encoding as well. To complete the picture, we will also compare encoding on NVIDIA solutions with Intel's integrated GPU. Some professionals believe that FFmpeg with Quick Sync enabled is enough and a discrete video card is no longer needed. We will check this claim as well.

We won't describe in detail the testing process for NVIDIA video cards and why we need FFmpeg, as this is covered in the previous articles (parts one and two). We'd rather focus on the new results and useful tips and tricks.

A4000 vs. A5000

We will use the same test rig from the existing HOSTKEY servers, but install an NVIDIA A5000 graphics card with more encoding blocks, 24 GB of video memory and higher power consumption.

First, let's check its performance at the number of threads (simultaneous streams) that turned out to be the limit for the A4000 in the previous test:

14 threads

gpu	pwr	gtemp	mtemp	sm	mem	enc	dec	mclk	pclk	fb	bar1
Idx	W	C	C	%	%	%	%	MHz	MHz	MB	MB
0	97	47	-	92	3	100	0	7600	1920	3502	33

frame=1015 fps=31 q=28.0 Lsize= 9056kB time=00:00:33.80 bitrate=2194.8kbits/s speed=1.02x

Amazing! We got figures comparable to those of the A4000. Despite its higher chip frequency, larger video memory and higher power consumption, the A5000 managed to encode only 14 threads in real time and gave up on the fifteenth. This fiasco suggests once again that the extra resources of professional video adapters are designed for purposes other than video encoding.
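The monitoring table above has the column layout that nvidia-smi dmon prints. The article does not say which exact command was used, so the invocation in the comment below is an assumption. As a minimal sketch, a small helper can pull a single column (for example enc, the encoder load) out of such output; the function name dmon_column is ours.

```shell
#!/bin/bash
# Print one named column (default: enc) from nvidia-smi dmon-style rows on stdin.
# dmon_column is a hypothetical helper name, not part of any NVIDIA tool.
dmon_column() {
  col="${1:-enc}"
  awk -v col="$col" '
    { sub(/^#[ \t]*/, "") }   # real dmon header lines start with "#"
    $1 == "gpu" { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    $1 == "Idx" { next }      # skip the units row
    idx && NF >= idx { print $idx }
  '
}

# Assumed usage (power/temperature, utilization, clocks and memory, 5 samples):
#   nvidia-smi dmon -s pucm -c 5 | dmon_column enc
```

Values that stay at or near 100 in the enc column while the FFmpeg speed falls below 1x point to the hardware encoder as the bottleneck.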

Switching to 4K

Now let's try broadcasting a stream at 3840x2160 resolution (aka 4K). Thankfully, we have a suitable video file about a rabbit. With the amount of data multiplied, CPU-only encoding could not keep up even with a single thread:

frame= 2902 fps=27 q=29.0 size=104448kB time=00:01:33.56 bitrate=9144.7kbits/s dup=436 drop=0 speed=0.878x

What can the GPU do (remember, the results of the A4000 and A5000 are comparable)? Three threads.

gpu	pwr	gtemp	mtemp	sm	mem	enc	dec	mclk	pclk	fb	bar1
Idx	W	C	C	%	%	%	%	MHz	MHz	MB	MB
0	96	46	-	100	3	96	0	7600	1920	1112	9

As you can see from the power consumption and the load on the encoding blocks, the video chip is clearly working close to its limit, although only about 1 GB of video memory is in use.

FFmpeg output confirms that the video card is doing fine:

frame= 1465 fps=33 q=35.0 Lsize=12584kB time=00:00:48.80 bitrate=2112.4kbits/s dup=159 drop=0 speed=1.09x

However, the adapter can't handle 4 streams. Although the hardware load stays at about the same values, the frame rate drops and encoding falls behind real time:

frame= 614 fps= 26 q=35.0 Lsize=4978kB time=00:00:20.43 bitrate=1995.6kbits/s speed=0.858x
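The pass/fail judgement in these tests can be read from the speed= field of the FFmpeg status line: a value of 1x or more means the stream is encoded at least in real time. The sketch below applies that rule to a status line; the 1.0 threshold and the helper names speed_of and is_realtime are our assumptions, not something FFmpeg provides.

```shell
#!/bin/bash
# Extract the numeric value of the speed= field from an FFmpeg status line.
speed_of() {
  printf '%s\n' "$1" | sed -n 's/.*speed= *\([0-9][0-9.]*\)x.*/\1/p'
}

# Succeed (exit status 0) when the line reports a speed of at least 1.0x.
is_realtime() {
  s=$(speed_of "$1")
  [ -n "$s" ] && awk -v s="$s" 'BEGIN { exit !(s >= 1.0) }'
}

# Example with the two 4K results quoted above:
#   is_realtime 'frame= 1465 fps=33 ... speed=1.09x'  && echo "keeps up"
#   is_realtime 'frame= 614 fps= 26 ... speed=0.858x' || echo "falls behind"
```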

Using FFmpeg with QuickSync support

According to the developer, QuickSync is supposed to "use the special multimedia processing capabilities of Intel® graphics technology to accelerate decoding and encoding, allowing the processor to perform other tasks in parallel and improving system performance."

For the tests we needed a suitable Intel processor (we found a machine with a Core i9-9900K CPU @ 3.60GHz) and FFmpeg with Quick Sync support. There were no problems with the former (all we needed was a 6th-generation or newer chip with an integrated GPU, which is easy to check), but setting up FFmpeg on Ubuntu 20.04 felt like practicing the Kama Sutra. To save you precious time, we will describe how we solved the problem.

Since the packages in the repositories are broken, the first thing to do is to build and install gmmlib and libva libraries, as well as the latest Intel media driver and Media SDK versions in the system. To do that, create a GIT folder in your home directory, go to it and run the following commands in sequence (if any dependencies are missing, install them from the repository; we recommend doing sudo apt install autoconf automake build-essential cmake pkg-config):

git clone https://github.com/intel/gmmlib.git && cd gmmlib
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_LIBDIR=/usr/lib/x86_64-linux-gnu ..
make -j8
sudo make install

# return to the GIT folder before cloning each subsequent repository
cd ~/GIT
git clone https://github.com/intel/libva.git && cd libva
./autogen.sh --prefix=/usr --libdir=/usr/lib/x86_64-linux-gnu
make -j8
sudo make install

cd ~/GIT
git clone https://github.com/intel/media-driver.git && cd media-driver
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_LIBDIR=/usr/lib/x86_64-linux-gnu ..
make -j8
sudo make install

cd ~/GIT
git clone https://github.com/Intel-Media-SDK/MediaSDK.git && cd MediaSDK
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_LIBDIR=/usr/lib/x86_64-linux-gnu ..
make -j8
sudo make install
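Before starting the builds above, it can save time to confirm that the basic build tools are actually on the PATH. The following is only a sketch: missing_tools is a hypothetical helper, and the tool list in the comment is our guess based on the apt packages recommended above, so adjust it to your system.

```shell
#!/bin/bash
# Print every tool from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed usage (empty output means everything was found):
#   missing_tools git autoconf automake cmake pkg-config make gcc g++
```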

Then you need to configure and build FFmpeg itself with a few magic options:

cd ~/GIT
git clone https://github.com/ffmpeg/ffmpeg
cd ffmpeg
./configure --enable-libmfx --enable-vaapi --enable-opencl --enable-libvorbis --enable-libvpx --enable-libdrm --enable-gpl --cpu=native --enable-libfdk-aac --enable-libx264 --enable-libx265 --extra-libs=-lpthread --enable-nonfree
make -j8
sudo make install

It's worth making sure you have Quick Sync support:

ffmpeg -decoders|grep qsv

The output of the command should look something like this:

V....D av1_qsv              AV1 video (Intel Quick Sync Video acceleration) (codec av1) 
V....D h264_qsv             H264 video (Intel Quick Sync Video acceleration) (codec h264) 
V....D hevc_qsv             HEVC video (Intel Quick Sync Video acceleration) (codec hevc) 
V....D mjpeg_qsv            MJPEG video (Intel Quick Sync Video acceleration) (codec mjpeg) 
V....D mpeg2_qsv            MPEG2VIDEO video (Intel Quick Sync Video acceleration) (codec mpeg2video) 
V....D vc1_qsv              VC1 video (Intel Quick Sync Video acceleration) (codec vc1) 
V....D vp8_qsv              VP8 video (Intel Quick Sync Video acceleration) (codec vp8) 
V....D vp9_qsv              VP9 video (Intel Quick Sync Video acceleration) (codec vp9)
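The listing above covers decoders only. Since the tests in the next section rely on the h264_qsv encoder, it may be worth checking the encoder list in the same way. A minimal sketch, where has_codec is a hypothetical helper that looks for a codec name as a separate word in a listing read from stdin:

```shell
#!/bin/bash
# Succeed when the codec name given as $1 appears as a whole word on stdin.
has_codec() {
  grep -Eq "(^|[[:space:]])$1([[:space:]]|\$)"
}

# Assumed usage:
#   ffmpeg -hide_banner -encoders | has_codec h264_qsv && echo "QSV H.264 encoder available"
```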

Good! Everything is ready for testing.

Testing encoding with Quick Sync

First, let's see how the processor can handle FullHD video encoding without Quick Sync: it can manage 4 threads maximum, with all cores under 100% load.

frame= 1461 fps= 33 q=29.0 size=24064kB time=00:00:46.33 bitrate=4254.7kbits/s speed=1.05x

The processor can no longer keep up with a fifth thread, so we can safely proceed with the Quick Sync test. In the script from the previous article, you will need to change the encoder to h264_qsv, and it will look like this (you can read more about using QuickSync with FFmpeg here):

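#!/bin/bash
# The first argument is the number of parallel encoding threads to start
for (( i=0; i<$1; i++ )); do
	# -an drops the audio track, h264_qsv selects the Quick Sync H264 encoder
	ffmpeg -i http://78.0.75.110:5454/ -an -vcodec h264_qsv -y "Output-File-$i.mp4" &
done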

First we run a test on 6 threads (two more than in the CPU-only test):

frame=291 fps=55 q=29.0 size=1280kB time=00:00:10.13 bitrate=1034.8kbits/s dup=2 drop=0 speed=1.93x

The difference is obvious: the CPU load is less than 50%, and the remaining headroom suggests that 11-12 threads in total should be achievable.
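The same estimate can be cross-checked from the FFmpeg output itself. Assuming throughput scales roughly linearly with the number of threads (our simplification, not something the test proves), 6 threads running at speed=1.93x leave room for about 6 x 1.93 = 11.58 real-time threads. A small sketch of that arithmetic, with estimate_streams as a hypothetical helper name:

```shell
#!/bin/bash
# Estimate how many real-time streams fit, given N streams running at SPEED x.
# Assumes linear scaling; the result is rounded down to whole streams.
estimate_streams() {
  awk -v n="$1" -v s="$2" 'BEGIN { print int(n * s) }'
}

# Assumed usage with the figures from the 6-thread run above:
#   estimate_streams 6 1.93
```

This is only a rough upper bound, and the real limit still has to be found by testing.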

Let's try 11 threads:

frame=157 fps=30 q=38.0 Lsize=628kB time=00:00:05.69 bitrate=903.0kbits/s dup=2 drop=0 speed=1.09x

The processor load increases only slightly, but the integrated GPU is already reaching its limit. Adding a twelfth thread lowers the bitrate and drops the processing speed to 24-28 frames per second.

Now let's check the threads in 4K. In contrast to AMD, our Intel processor easily handles one thread at this resolution even without hardware acceleration:

frame=655 fps=31 q=-1.0 Lsize=30637kB time=00:00:21.73 bitrate=11547.9kbits/s speed=1.03x

Unfortunately, it couldn't do more than that. With Quick Sync on, the test computer was able to handle three 4K threads:

frame= 509 fps=31 q=33.0 Lsize=8010kB time=00:00:17.42 bitrate=3764.7kbits/s dup=2 drop=0 speed=1.07x

It failed only on the fourth, but our NVIDIA A5000 video card did not survive that either.

Unfortunately, the solution has disadvantages as well. When using the BMC module (for example, when controlling a machine via IPMI), you will not have access to all the hardware acceleration capabilities, even if a GPU is detected in the system. You'll have to choose between the convenience of remote management and the full benefits of Quick Sync.

Bottom line

You can draw your own conclusions. We would only note that for video encoding, the difference in video card capabilities is not always determined by price, and for some tasks it is worth paying attention to the specialized technologies built into CPUs. We used H264 for the tests, but the HEVC (H265) or AV1 codecs should in theory give better results, especially at 4K resolution. If you run similar tests with the former yourself (AV1 is so far widely available in hardware only for decoding), share your results in the comments.