03.06.2022

NVIDIA RTX A5500: real power or just a facelift?

One of the new products featured at the GTC 2022 conference was the RTX A5500 graphics card, which expands NVIDIA's range of professional graphics accelerators. It is based on the Ampere architecture with second-generation RT cores and third-generation tensor cores, and it carries 24 GB of GDDR6 memory with ECC error correction and 768 GB/s of peak bandwidth.

The 8 nm graphics chip of the RTX A5500 includes 10,240 CUDA cores, 80 RT cores and 320 tensor cores. NVIDIA quotes 34.1 TFLOPS for single-precision (FP32) operations and 272.8 TFLOPS for half-precision (FP16) operations.
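The quoted figures can be sanity-checked with a little arithmetic. The sketch below assumes that each CUDA core completes one fused multiply-add (counted as 2 floating-point operations) per clock; that factor is an assumption, not something stated in this article. Under that assumption it derives the clock frequency implied by the FP32 figure and the ratio between the FP16 and FP32 figures.

```python
# Figures quoted in the text above.
CUDA_CORES = 10_240
FP32_TFLOPS = 34.1
FP16_TFLOPS = 272.8

# Assumption (not from the article): one fused multiply-add per core
# per clock, counted as 2 floating-point operations.
FLOPS_PER_CORE_PER_CLOCK = 2


def implied_clock_ghz(tflops: float, cores: int) -> float:
    """Clock in GHz at which `cores` cores would deliver `tflops` TFLOPS."""
    flops = tflops * 1e12
    return flops / (cores * FLOPS_PER_CORE_PER_CLOCK) / 1e9


clock = implied_clock_ghz(FP32_TFLOPS, CUDA_CORES)
ratio = FP16_TFLOPS / FP32_TFLOPS

print(f"implied clock: {clock:.3f} GHz")
print(f"FP16 / FP32 ratio: {ratio:.1f}x")
```

If the assumption holds, the FP32 figure corresponds to a clock of roughly 1.67 GHz, and the FP16 figure is eight times the FP32 one.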

All this, as they say, is on paper. Let's see what the video card is really capable of, now that Hostkey has built a machine equipped with it.

Index:

Encoding
CUDA/RT/Tensor Cores
Conclusions

Rent GPU servers with instant deployment, or a server with a custom configuration, with professional-grade NVIDIA RTX A5500 / A5000 / A4000 cards. VPS with dedicated GPU cards are also available. The GPU card is dedicated to the VM and cannot be used by other clients. GPU performance in virtual machines matches GPU performance in dedicated servers.

Encoding

When we compared the RTX A5000 and RTX A4000, we found that neither an increase in CPU frequency nor a larger amount of video memory had much effect on the performance of the cards' encoding block. Readers also rightly pointed out that we had used an automatic quantization setting (which determines the quality of the resulting video) instead of a ready-made h264 codec preset, and that we had skipped the encoding needed for 60 fps streaming.

Let's repeat the same tests on the RTX A5500, starting with 1080p stream encoding at 30 fps. For comparison, the A5000 (just like the A4000) managed only 14 streams.

The A5500 performed better: at 14 streams it clearly had a safety margin (NVIDIA promises up to 16 streams). At the same time, the video card consumed 5 W less power and had a lower GPU temperature (35 °C vs. 47 °C for the A5000), though it used 500 MB more video memory.

Output from nvidia-smi dmon -s pucm

gpu	pwr	gtemp	mtemp	sm	mem	enc	dec	mclk	pclk	fb	bar1
Idx	W	C	C	%	%	%	%	MHz	MHz	MB	MB
0	92	35	-	13	3	100	0	7600	1890	4141	32
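For readers who want to log these readings during their own runs, here is a minimal sketch that turns one dmon header row and one data row into a dictionary. The function name `parse_dmon` is hypothetical, and the sketch assumes whitespace-separated columns with "-" marking an unavailable reading, as in the output above.

```python
from typing import Dict, Optional


def parse_dmon(header: str, values: str) -> Dict[str, Optional[int]]:
    """Map dmon column names to integer readings; '-' becomes None."""
    # Real dmon header rows start with '#'; strip it if present.
    names = header.lstrip("# ").split()
    cells = values.split()
    if len(names) != len(cells):
        raise ValueError("header and data rows have different column counts")
    return {n: (None if c == "-" else int(c)) for n, c in zip(names, cells)}


# The 14-stream 1080p30 reading shown above.
sample = parse_dmon(
    "gpu pwr gtemp mtemp sm mem enc dec mclk pclk fb bar1",
    "0 92 35 - 13 3 100 0 7600 1890 4141 32",
)

# The encoder block is the bottleneck when 'enc' sits at 100 %
# while 'sm' (the main cores) stays low.
encoder_saturated = sample["enc"] == 100
print(sample["pwr"], sample["gtemp"], sample["sm"], encoder_saturated)
```

In this reading the encoder is fully loaded while the main cores are at 13 %, which is why the CUDA core count has little bearing on the number of streams.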


Ffmpeg output gives us the following:

frame = 1051 fps = 32 q = 33.0 size = 9472 kB time = 00:00:34.93 bitrate = 2221.2 kbits/s speed = 1.07x

The card clearly cannot handle 16 video streams:

gpu	pwr	gtemp	mtemp	sm	mem	enc	dec	mclk	pclk	fb	bar1
Idx	W	C	C	%	%	%	%	MHz	MHz	MB	MB
0	96	44	-	13	4	100	0	7600	1905	4732	32


frame = 901 fps = 28 q = 26.0 size = 7680 kB time = 00:00:29.93 bitrate = 2101.8 kbits/s speed = 0.917x

Frames started to drop and the picture filled with artifacts, because the encoder could not keep up and automatically lowered the quality (the q parameter jumped from 26 to 50).
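The difference between the 14-stream and 16-stream runs can also be checked programmatically from the ffmpeg status lines. The sketch below is one possible approach, not the method used for this article: the helper names `parse_status` and `keeps_up` are hypothetical, and the rule "fps at or above the target and speed at or above 1.0x" is an assumed definition of keeping up.

```python
import re

# Matches "key = value" pairs, with or without spaces around "=".
_PAIR = re.compile(r"(\w+)\s*=\s*(\S+)")


def parse_status(line: str) -> dict:
    """Split an ffmpeg status line into a {key: value-string} dict."""
    return dict(_PAIR.findall(line))


def keeps_up(line: str, target_fps: float) -> bool:
    """Assumed rule: the stream keeps up if fps >= target and speed >= 1.0x."""
    fields = parse_status(line)
    fps = float(fields["fps"])
    speed = float(fields["speed"].rstrip("x"))
    return fps >= target_fps and speed >= 1.0


line_14 = ("frame = 1051 fps = 32 q = 33.0 size = 9472 kB "
           "time = 00:00:34.93 bitrate = 2221.2 kbits/s speed = 1.07x")
line_16 = ("frame = 901 fps =28 q= 26.0 size = 7680 kB "
           "time = 00:00:29.93 bitrate = 2101.8 kbits/s speed = 0.917x")

print(keeps_up(line_14, 30), keeps_up(line_16, 30))
```

With a 30 fps target, the 14-stream line passes the check and the 16-stream line does not.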

Let's try to encode video in high quality. We set the parameters corresponding to the High profile of the h264 codec: it is the primary profile for digital broadcasting and for video on optical media, especially high-definition television (it is used for Blu-ray video discs and DVB HDTV broadcasting).
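The exact command line used for these runs is not given in this article. As a rough illustration only, the sketch below builds one plausible ffmpeg invocation per stream using the h264_nvenc encoder with the High profile. The input file name, the scaling step and the choice to discard the output are all assumptions; the commands are only constructed here, not executed.

```python
from typing import List


def nvenc_command(src: str, width: int, height: int, fps: int) -> List[str]:
    """Build one assumed ffmpeg command that encodes `src` with NVENC."""
    return [
        "ffmpeg", "-hide_banner",
        "-i", src,                         # hypothetical input file
        "-vf", f"scale={width}:{height}",  # assumed: rescale to target size
        "-r", str(fps),                    # output frame rate
        "-c:v", "h264_nvenc",              # NVIDIA hardware H.264 encoder
        "-profile:v", "high",              # High profile, as in the text
        "-an",                             # drop audio
        "-f", "null", "-",                 # discard output, measure speed only
    ]


def parallel_commands(src: str, streams: int, width: int = 1920,
                      height: int = 1080, fps: int = 30) -> List[List[str]]:
    """One command per simultaneous stream."""
    return [nvenc_command(src, width, height, fps) for _ in range(streams)]


commands = parallel_commands("input.mp4", streams=14)
print(len(commands), " ".join(commands[0]))
```

Each list could be passed to `subprocess.Popen` to start the streams side by side while `nvidia-smi dmon` runs in another terminal.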

Running 14 streams once again, the load on the video card increases, but the card can handle it:

gpu	pwr	gtemp	mtemp	sm	mem	enc	dec	mclk	pclk	fb	bar1
Idx	W	C	C	%	%	%	%	MHz	MHz	MB	MB
0	95	43	-	13	4	100	0	7600	1890	4141	32


Ffmpeg output:

frame = 968 fps = 32 q = 23.0 size = 7680 kB time = 00:00:32.16 bitrate = 1955.9 kbits/s speed = 1.07x

Let’s try 4K at 30 fps. The card can handle three streams in high profile without any problems:

frame = 257 fps = 37 q = 33.0 size = 2304 kB time = 00:00:08.46 bitrate = 2229.3 kbits/s speed = 1.2x

With four streams, the frame rate drops slightly (as you may remember, the A5000 with four streams and the automatic quality setting delivered only 25-26 frames per second, with artifacts):

frame = 985 fps = 30 q = 37.0 size = 7424 kB time = 00:00:32.73 bitrate = 1858.0 kbits/s speed = 0.995x

The hardware readings were as follows:

gpu	pwr	gtemp	mtemp	sm	mem	enc	dec	mclk	pclk	fb	bar1
Idx	W	C	C	%	%	%	%	MHz	MHz	MB	MB
0	89	32	-	9	4	100	0	7600	1920	1659	11


The video card actually runs at a higher frequency than when encoding Full HD video, but its main cores are not stressed by the load (the chip stays cool, as does the video memory).

As expected, 4K streaming at 60 frames per second was limited to two streams. This time we used a gameplay recording from Doom Eternal rather than a cartoon, which caused some problems for the hardware decoder. The A5500 handled it, but it was pushed to the limit. Worse, AV1 encoding is not available in hardware, and when broadcasting via VLC on Ubuntu 20.04 we failed to get 60 fps because the stream dropped to 30 fps. We built a workaround from ffmpeg and a broadcast server:

frame = 240 fps = 61 q = 32.0 size = 2304 kB time = 00:00:09.48 bitrate = 3991.0 kbits/s speed = 1.03x
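Why two 4K streams at 60 fps were to be expected becomes clearer if the stream counts from this section are converted into encoded pixels per second. The sketch below assumes that "4K" means 3840x2160 and "1080p" means 1920x1080; it ignores content complexity and profile settings, so it is only a rough comparison.

```python
def pixel_rate_mps(streams: int, width: int, height: int, fps: int) -> float:
    """Total encoded megapixels per second across all streams."""
    return streams * width * height * fps / 1e6


# Stream counts taken from this section; resolutions are assumptions.
cases = {
    "14 x 1080p30": (14, 1920, 1080, 30),
    "3 x 4K30": (3, 3840, 2160, 30),
    "4 x 4K30": (4, 3840, 2160, 30),
    "2 x 4K60": (2, 3840, 2160, 60),
}

rates = {name: pixel_rate_mps(*args) for name, args in cases.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} MP/s")
```

Two 4K streams at 60 fps carry exactly the same pixel rate as four 4K streams at 30 fps, which is the point where the encoder was already running at its limit.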

Conclusion: the encoders in the RTX A5500 have been improved, and under equal conditions it outperforms the A5000, rendering a subjectively better picture and working at lower frequencies.

CUDA/RT/Tensor Cores

And what about the rest of the units? We compared the RTX A5500 with the A5000 in several other tests (you can read more about the exact methods in a previous article):

Mining test (with PhoenixMiner).
Machine learning test: on each card we trained a neural network for 100 epochs to determine whether a photograph shows a cat or a dog.
V-Ray 5 Benchmark rendering test, both on CPU + GPU (CUDA test) and purely on the GPU (RTX test).
LuxMark test in three different scenes, measuring OpenCL speed on the GPU.
Blender test in several scenes in OptiX mode, using the full capabilities of the RTX hardware.

Summary Table

Test	RTX A5000	RTX A5500
Mining speed, MH/s	86.66	87.319
ML test, 100 epochs	9 min 9 s	8 min 59 s
V-Ray 5 Benchmark, GPU CUDA (vpaths)	1381	1594
V-Ray 5 Benchmark, GPU RTX (vrays)	2288	2613
LuxMark, Lux ball	74,795	78,554
LuxMark, Hotel	15,794	16,219
LuxMark, Mic	45,640	48,832
Blender, Monster	2312	2468
Blender, Junkshop	1331	1388
Blender, Classroom	1148	1223


The RTX A5500 shows better performance in rendering, but everything depends on optimization: going by the summary table, the gap is 14-15% in V-Ray 5, 3-7% in LuxMark and 4-7% in Blender. Given a run-to-run margin of error of 1-2%, the final performance gain is not very impressive.

In the machine learning test the A5500 finished only about 2% sooner (8 min 59 s vs. 9 min 9 s), and miners will be unpleasantly surprised to find almost the same hash rate on both cards. Note, however, that the manufacturer positions this card for professionals in graphics and neural networks.
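The relative gains can be recomputed directly from the summary table. The sketch below does this for every row. For the machine learning test, where a lower time is better, it uses the ratio of the two times as the speed-up, which is one reasonable convention rather than the only one.

```python
# Scores from the summary table: (RTX A5000, RTX A5500). Higher is better.
scores = {
    "Mining": (86.66, 87.319),
    "V-Ray CUDA": (1381, 1594),
    "V-Ray RTX": (2288, 2613),
    "LuxMark Lux ball": (74795, 78554),
    "LuxMark Hotel": (15794, 16219),
    "LuxMark Mic": (45640, 48832),
    "Blender Monster": (2312, 2468),
    "Blender Junkshop": (1331, 1388),
    "Blender Classroom": (1148, 1223),
}

# ML training times in seconds: 9 min 9 s and 8 min 59 s. Lower is better.
ml_seconds = (9 * 60 + 9, 8 * 60 + 59)


def gain_pct(old: float, new: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1) * 100


gains = {name: gain_pct(a5000, a5500) for name, (a5000, a5500) in scores.items()}
# For a time-based test the speed-up is old_time / new_time.
gains["ML test"] = gain_pct(ml_seconds[1], ml_seconds[0])

for name, pct in gains.items():
    print(f"{name}: {pct:+.1f}%")
```

Run-to-run variation of 1-2% should be kept in mind when reading the smaller differences.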

Conclusions

Alas, no miracle happened: the actual performance increase is 5-10% depending on the task, and in mining and encoding there was hardly any increase at all. On the plus side, we saw lower power consumption, easier cooling thanks to the lower heat output of the video chip, and a larger amount of video memory, which should benefit memory-intensive tasks. Whether it is worth the money is up to the buyer, and you can order a dedicated server with an NVIDIA RTX A5500 from us if you wish to check it out yourself.
