NVIDIA Video Encoder and CUDA
Updated: May 7
Recently someone was looking for help implementing video encoding with NVIDIA CUDA. I asked why they didn't use the NVIDIA Video Codec SDK, which features a high-performance, hardware-accelerated H.264/H.265 encoder (NVENC) and decoder. They complained that NVENC was too slow, "not even real time". This is actually not an uncommon case: people or companies that fail to adopt a cutting-edge technology decide to go in an alternative direction without understanding the consequences.
So, if anybody is wondering whether it makes sense to write a video encoder with CUDA, the short answer is: no, it doesn't. Here is the lengthy answer. NVENC runs on a dedicated encoder hardware block located on the card. All this hardware knows how to do is encode frames, provided in any of the supported YUV formats, into a raw H.264/H.265 bitstream. It doesn't use CUDA cores at all, except when an input frame is in RGBA format (possible with the latest versions of NVENC). In that case a CUDA kernel converts the frame to YUV space, and the driver performs that step before sending the data down to the encoder hardware. The NVENC API is mostly asynchronous; it can run in its own thread without interfering with other GPU-related work going on in your app. If your frames are generated on the GPU, you can feed them directly into the encoder by means of low-overhead resource sharing between the rendering API and NVENC, without a round trip through system memory, which effectively keeps the data on the GPU during the whole encoding process.
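To give a feel for what that RGBA-to-YUV step involves, here is a minimal sketch in Python. It is only an illustration, not the driver's code: I am assuming full-range BT.601 coefficients (the matrix the driver actually uses is not something I can confirm), and the function names are made up for this example.

```python
def rgba_to_yuv(r, g, b, a=255):
    """Convert one 8-bit RGBA pixel to 8-bit YUV; the alpha channel is dropped.

    Assumption: full-range BT.601 coefficients. The matrix used by the
    NVIDIA driver is not stated in this post.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    v = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0

    def clamp(x):
        # Round to the nearest integer and keep the result in 0..255.
        return max(0, min(255, int(round(x))))

    return clamp(y), clamp(u), clamp(v)


def convert_frame(pixels):
    """Convert a list of (r, g, b, a) tuples (hypothetical frame layout)."""
    # Every output pixel depends only on its own input pixel, so the
    # iterations are independent of each other.
    return [rgba_to_yuv(*p) for p in pixels]


if __name__ == "__main__":
    print(convert_frame([(255, 255, 255, 255), (0, 0, 0, 255)]))
```

Because no pixel depends on any other, this kind of work maps naturally onto one GPU thread per pixel, which is why a small CUDA kernel is a reasonable fit for it.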
A video encoder implemented in CUDA will be way slower than NVENC. Here are three main reasons why:
Some performance-critical parts of any video codec algorithm cannot run efficiently on a GPU because their execution is serial by nature, and there is not much to do about that unless you plan to invent a new codec. (Good luck with that one.) See the H.264 spec on entropy coding and macroblocks to understand in depth what I mean.
CUDA uses the same SMs (streaming multiprocessors) that are available to the other GPU-based APIs. That's called a "unified architecture", which NVIDIA introduced with the G80 (Tesla) generation of GPUs, and it basically means that the same resources (cores) can be used for any sort of computation job. In older architectures each stage of the logical pipeline had its own dedicated hardware. So if you run a rendering session with OpenGL, or whatever API you choose, at the same time as you run encoding on CUDA, chances are high that those APIs will share GPU resources, which means fewer SMs available to each of them. Of course, the more SMs your card has, the more computation tasks can run in parallel. But even the strongest GPU has its limits.
Even if you manage to write such a codec, another major performance bottleneck will be what's called the CUDA graphics interoperability API. That's the mechanism for sharing textures and buffers from a rendering API such as OpenGL, DirectX or Vulkan with CUDA. This thing is slow, and I mean very slow. That's the main reason why you won't find many high-performance rendering applications that use CUDA for GPGPU. Instead, they use compute shaders, which live in the same context and don't suffer from the overhead of mapping shared resources. Moreover, you won't be able to accomplish a complete encoding cycle in one kernel invocation. That's impossible both algorithmically and technically. The latter means that your card may run out of the registers needed to execute it all in one go. That was true for Kepler cards. I am not sure how it is with the latest architectures, but I bet you will load the GPU up to the neck with encoder workloads, so it won't be able to run other tasks efficiently.
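To make the first reason above (serial entropy coding) a bit more concrete, here is a toy sketch in Python. It is not CABAC and not anything from the H.264 spec: the model, the update rule and the `rate` parameter are all assumptions I made up for illustration. What it does share with a context-adaptive coder is the important property: the probability state is updated after every coded bit, so the cost of bit i depends on bits 0..i-1.

```python
import math


def adaptive_code_lengths(bits, rate=0.1):
    """Ideal code length, in bits, of each input bit under a toy adaptive model.

    Assumption: a single probability estimate nudged toward each coded bit.
    Real CABAC uses many contexts and table-driven state transitions.
    """
    p_one = 0.5  # current estimate of P(bit == 1)
    lengths = []
    for bit in bits:
        p = p_one if bit == 1 else 1.0 - p_one
        lengths.append(-math.log2(p))
        # The state update uses the bit that was just coded, so the next
        # iteration cannot start before this one has finished.
        p_one += rate * (bit - p_one)
    return lengths


if __name__ == "__main__":
    # Same bits, different order: the per-bit costs change with the history.
    print(adaptive_code_lengths([1, 1, 1, 0]))
    print(adaptive_code_lengths([0, 1, 1, 1]))
```

The loop carries its state from one iteration to the next, which is exactly the kind of dependency that thousands of GPU threads cannot help with.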
The above information is based on my personal experience. Another fact, also based on my experience, is this: if you use NVENC correctly, you can encode 1080p not only in real time (60 fps) but much faster than that, and in more than one concurrent encoding session. See the NVENC performance statistics on the NVIDIA website. They don't lie. The API is hard and poorly documented, but if you take the time to learn it and experiment, you will get encoding rates that are impossible with any other approach.
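For a sense of what "real time" means in numbers, here is a small back-of-the-envelope sketch in Python. The NV12 input format (12 bits per pixel) is my assumption for the example, and `max_realtime_sessions` is a hypothetical helper, not part of any SDK. Plug in the encode rate you actually measure on your own card.

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 60

# Time available to encode one frame if the stream must keep up with 60 fps.
frame_budget_ms = 1000.0 / FPS

# Assumption: NV12 input, i.e. a full-resolution Y plane plus a
# half-resolution interleaved UV plane, 1.5 bytes per pixel in total.
nv12_frame_bytes = WIDTH * HEIGHT * 3 // 2
raw_bytes_per_second = nv12_frame_bytes * FPS


def max_realtime_sessions(measured_fps, target_fps=FPS):
    """How many real-time streams a measured encode rate could sustain.

    Hypothetical helper; it ignores memory limits and any per-session
    overhead, so treat the result as an upper bound.
    """
    return int(measured_fps // target_fps)


if __name__ == "__main__":
    print(f"frame budget: {frame_budget_ms:.2f} ms")
    print(f"raw NV12 frame: {nv12_frame_bytes} bytes")
    print(f"raw input rate: {raw_bytes_per_second} bytes/s")
```

The raw input rate also shows why it pays to avoid the round trip through system memory mentioned earlier: every uncompressed frame you copy back and forth is a few megabytes.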
I will try to find some free time to record a video tutorial on the subject.