Michael Ivanov

Beyond Raster: How Generative AI is shaping the Future of Video Ads

I decided to write this article after OpenAI announced its Sora video generator. The first time I realized that hyper-realistic, AI-driven video generation was just a matter of time was when Midjourney and the Disco Diffusion models were released. It was an easy prediction: if AI can generate realistic images, it should eventually be capable of creating complete video content. Two things simply had to improve over time: content quality and prompt responsiveness (that is, how accurately the prompt instructions are translated into graphic elements).


As someone who has spent the bulk of his professional career — almost 15 years now — developing graphics and video rendering systems for different kinds of video companies and for my own ventures, I feel like I have a pretty good understanding of where this whole industry is heading from a technological standpoint, and I would like to share my thoughts on this matter.


From the dawn of computer graphics until recently, the most common way to create CGI was to combine filmed content with raster rendering.


Raster rendering

Raster rendering, in a nutshell, has probably been the most popular and widely used rendering technique since the scanned computer display was invented. Simply put, raster rendering generates graphic elements on a 2D image surface from geometric shapes defined by sets of lines and curves. Over time this evolved to handle more advanced data types, such as 2D and 3D shapes built from triangles and rasterized (also called '3D rendering') with the help of the GPU. The result of the process is a 2D grid of colored dots, also known as 'pixels', which we call a photo or an image; a video is simply a sequence of such images, either lossless or compressed by a video codec such as H.264.
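To make the raster idea concrete, here is a minimal sketch (plain Python and NumPy, illustrative vertex coordinates) of what a rasterizer does at its core: it samples a 2D triangle, defined by three vertices, onto a grid of pixels. Real GPU rasterizers are enormously faster and more sophisticated, but the output is the same kind of pixel grid described above.

```python
import numpy as np

def edge(a, b, p):
    # Signed area test: tells which side of the edge a->b the point p lies on.
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])

def rasterize_triangle(v0, v1, v2, width=64, height=64, color=(255, 0, 0)):
    # The output of raster rendering: a 2D grid of pixels (an RGB image).
    image = np.zeros((height, width, 3), dtype=np.uint8)
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p)
            # The pixel is covered if it lies on the same side of all three edges.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                image[y, x] = color
    return image

frame = rasterize_triangle((8, 8), (56, 16), (24, 56))  # one rasterized frame
```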


GANs

I checked with GPT about the techniques the DALL-E model uses to generate visuals, to make sure rasterization is not one of them.


OpenAI’s GPT answer:

Generative AI models like DALL-E and video generators like Sora typically do not use traditional raster rendering techniques. Instead, they rely on neural networks and generative algorithms to create images and videos based on input data or parameters. These models generate content in a more abstract and data-driven manner, rather than rendering scenes pixel by pixel like rasterization.


DALL-E and other models based on generative adversarial networks (GANs) generate images iteratively rather than assembling pre-existing visual elements. In these models, the generation process typically starts with a random noise vector or an initial input and gradually refines it to produce the final image. This iterative process involves adding and adjusting details, textures, colors, and other visual elements until the desired output is achieved.
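OpenAI's exact architecture is not spelled out here, so the following is only a generic, toy-sized GAN training loop in PyTorch (hypothetical layer sizes, random stand-in data) illustrating the idea in the answer above: a generator turns a random noise vector into an image, while a discriminator compares the generated 'fakes' against real training images, and both are refined iteratively.

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 100, 3 * 64 * 64  # noise size and a flattened 64x64 RGB image

# Toy generator (noise -> image) and toy discriminator (image -> real/fake score).
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_batch = torch.rand(16, img_dim)  # stand-in for a batch of real training images

for step in range(100):
    # Discriminator step: real images should score 1, generated "fakes" should score 0.
    fake = G(torch.randn(16, latent_dim)).detach()
    d_loss = bce(D(real_batch), torch.ones(16, 1)) + bce(D(fake), torch.zeros(16, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: refine the generator so its output fools the discriminator.
    g_loss = bce(D(G(torch.randn(16, latent_dim))), torch.ones(16, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

final_image = G(torch.randn(1, latent_dim)).view(3, 64, 64)  # one generated image
```

No rasterization of geometric primitives happens anywhere in this loop; the image appears as the learned output of the network.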


OpenAI's DALL-E research page says:

Unlike a 3D rendering engine, whose inputs must be specified unambiguously and in complete detail, DALL·E is often able to “fill in the blanks” when the caption implies that the image must contain a certain detail that is not explicitly stated.


Raster vs GANs

To gain a better understanding of the differences between GPU-based rasterization and GAN-based generators, let's examine the following diagrams. The GPU rasterization process is executed through the standard graphics rendering pipeline, which is exposed by APIs such as Vulkan, OpenGL, and Direct3D:


[Figure: GPU-accelerated raster rendering pipeline]

Note that the above scheme is extremely oversimplified. "The Application" represents a complete rendering engine containing many important and extremely complex subsystems, which take a team of professionals years to develop. Additionally, the "Application Stage" involves the input of graphics assets, such as images, fonts, and geometric data, which requires third-party graphics editing software and the related skill set. Moreover, sophisticated manual work is needed to put all those assets together into artistic compositions, create animations, apply coloring and special effects, and so on.
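For readers who prefer code to diagrams, here is a rough sketch (plain Python/NumPy, toy numbers) of the stage structure the diagram names: the application stage supplies assets and a draw call, the geometry stage transforms vertices into screen space, and the rasterizer stage writes pixels into a framebuffer. In a real engine these stages run on the GPU through Vulkan, OpenGL, or Direct3D shaders; everything below is illustrative only.

```python
import numpy as np

# Application stage: the engine prepares assets (geometry, a transform, a color).
vertices = np.array([[-0.5, -0.5], [0.5, -0.5], [0.0, 0.5]])  # in [-1, 1] space
transform = np.array([[0.8, 0.0], [0.0, 0.8]])                # a simple scaling "MVP"
color = np.array([40, 180, 255], dtype=np.uint8)

# Geometry stage ("vertex shader"): transform vertices and map them to a 64x64 target.
screen = ((vertices @ transform) + 1.0) * 0.5 * 64

# Rasterizer + fragment stage: write covered pixels into the framebuffer.
# (Here only the vertices themselves are plotted; see the triangle-fill sketch
# above for how interior pixels are actually covered.)
framebuffer = np.zeros((64, 64, 3), dtype=np.uint8)
for x, y in screen.astype(int):
    framebuffer[y, x] = color

print(framebuffer.shape)  # the final 2D grid of colored pixels, ready for display
```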


And this is how generative AI image generation works:


[Figure: GAN-driven image generation process]

Generative AI image generation, as you can see, is fundamentally different. The user input is text. The generator runs an iterative process to create the image, constantly comparing the intermediate results ('Fake Images') with the 'Real Images' on which the model was trained. This technique does not require users to provide costly graphics assets or to develop complex rendering setups. Yes, the results are still not perfect. For example, I have found that existing commercial models still struggle to generate an image containing text that is correct both typographically and grammatically.


[Figure: Generating an image with text using DALL-E 2, first attempt]


The first attempt produced an image containing text with a spelling error, and a second attempt still results in wrong text:


[Figure: Generating an image with text using DALL-E 2, second attempt]

Yet, I believe it is just a matter of time until problems like correct text generation are solved completely.


Video advertising industry

I believe the video advertising industry is poised for a tremendous technological revolution, thanks to the latest advancements in generative AI. By 'video advertising industry' I mean advertising and marketing agencies, creative studios, and potentially hundreds or even thousands of SaaS companies offering self-service graphic content creation tools, such as Canva, and popular video editors like Veed, Vimeo Create, etc. Don't misunderstand me; AI won't entirely replace human effort in these companies. Camera-shot content won't become obsolete anytime soon. Moreover, big clients seeking tailor-made, exclusive content (bearing in mind that images generated by AI aren't guaranteed to be unique, at least not yet) are unlikely to forego live shooting for synthetic imagery entirely. Content creators catering to such clients will continue to perform some tasks manually. However, they will also increasingly use AI-driven tools to speed up their workflow and reduce production costs. Nonetheless, some of the industry's currently standard, yet very expensive and complex, technologies will be completely supplanted by AI. Let's discuss this further.



[Figure: Given better context ("a homeless man holding a carton with the writing 'Will work for food'"), DALL-E was able to generate the text correctly.]

Hopefully, I will never have to work for food, but my point is that my key expertise is exactly what generative AI is going to render useless in this industry. For more than a decade, I have made a living mostly by designing and developing GPU-accelerated video rendering systems for advertising companies.


From a technological standpoint, up until recently, virtually all online video creation companies (including those focused on online ads and marketing videos, personalized video, and online video editors) have relied on raster rendering to generate the synthetic part of video graphics. Web companies have utilized HTML5 technology (Canvas2D, Canvas3D) and even run Adobe After Effects on servers, while others have opted to develop their own, primarily server-side, rendering solutions, employing specialists like me.

In the latter case, a team of developers experienced in real-time graphics would write a full-fledged rendering engine, typically leveraging the GPU to boost speed. Writing and maintaining a rendering engine in-house is a very (very) costly affair, and keeping experts on the team is a never-ending pain. I believe the emergence of generative AI is going to change that forever.


Now, let's look at the problem and the new solution from the user's perspective. 'A user' here refers to advertising and marketing agencies, small businesses, and even private users who seek solutions for creating beautiful graphics for social networks. These users don't care whether your product generates the content with After Effects, OpenGL, or generative AI. They care about two things: content quality and cost. Well, some of them also care about content delivery speed, but I am confident that companies such as NVIDIA will solve that in the long run.


Still images: With today’s state of generative AI models such as DALL-E, Midjourney, etc., it is already possible to generate high-quality imagery based on text prompts. Accuracy is not always perfect, but it keeps improving very fast, and prompt design has become a new profession. That means, at least for stills, the problem has been solved. Everyone who can describe the image in a text prompt concisely can generate a desired image fast and cheaply. Content copyright and exclusiveness are still issues for some AI platforms, but that is being actively addressed by B2B products. Commercial-grade online video editors already provide in-house or third-party tools to generate images with the help of AI. Content creators who in the past had to make graphics from scratch using professional software such as Photoshop can now merely generate those graphics and mix AI images into more complex compositions, saving a huge amount of time. DALL-E and others continue to invest in web-based editing tools which allow extraction and replacement of objects inside the generated image. And that’s just the beginning.
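As an illustration, requesting a still from a text prompt can be as short as the sketch below, using the OpenAI Python client (the prompt and parameters are placeholders; check the current API reference before relying on them):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One text prompt in, one still image out.
result = client.images.generate(
    model="dall-e-3",
    prompt="A bright product shot of a reusable water bottle on a wooden table",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # temporary URL of the generated image
```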


Videos: The modern advertising industry heavily relies on video content. Most of the ads you see nowadays are videos, with 71% of B2B marketers using video content. In short, video is king.


Creating videos from AI prompts has not been a straightforward task thus far. For example, the DALL-E 2 and DALL-E 3 APIs allow generating a sequence of images, but not a sequence that tracks and preserves generated elements from previous frames. That means you can't automatically generate an image sequence of a dog running across a lawn without prompting for every single frame, and even then the output is not guaranteed to stay consistent. And as I mentioned above, those models are also bad at generating text in images, which means the end user not only has to use video editing software to stitch together the generated stills but also has to create marketing text overlays using other tools.
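To make the limitation concrete, here is a sketch of that naive frame-by-frame approach (hypothetical prompts, same OpenAI client as above): every frame is requested independently, so nothing in the API enforces that the dog, the lawn, or the lighting stays consistent from one still to the next.

```python
from openai import OpenAI

client = OpenAI()
frame_urls = []
for i in range(4):
    # Each frame is an independent generation; the model has no memory of the
    # previous frames, so subjects and style can drift between stills.
    prompt = (f"A golden retriever running across a lawn, photo, "
              f"frame {i + 1} of 4 of the same continuous run")
    result = client.images.generate(model="dall-e-2", prompt=prompt,
                                    size="1024x1024", n=1)
    frame_urls.append(result.data[0].url)

# The stills then have to be downloaded and stitched into a clip (and any
# marketing text overlaid) with separate video tooling.
print(frame_urls)
```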


Companies such as Runway have recognized the problem and started developing models that transform still images into animated videos, showing us the future potential of video content creation. While Runway's Gen-2 still lacks the image quality required for video ads, we must not forget that this is just the beginning. Hyper-realistic, AI-generated talking avatars are another domain evolving at a remarkable pace. They solve a huge problem: advertising agencies and their clients allocate a considerable part of a campaign's budget to hiring actors, scriptwriters, film crews, directors, decorations, venues, and so on. We are talking about several weeks of work and tens of thousands of dollars to shoot a modest few minutes of footage. The synthetic avatars do not look exactly like real humans yet; there are still many technical issues to solve, such as full 3D transformation of the avatar, facial expressions, and accurate lip-sync. However, these are optimization problems that will likely be solved within the next several years.


Sora

Then OpenAI released Sora. Everyone who was following this domain knew it was just a matter of time.

Sora's capabilities are outlined on the OpenAI website as follows:


The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style.


Also this:


Sora is capable of generating entire videos all at once or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.


Yes, in the showcase videos generated with Sora we can still see all sorts of visual bugs: wrong perspective scale (see 'Snowy Tokyo City'), glitches on animals' limbs (the cat in the bed), abruptly disappearing objects; most of the videos still lack cinematic quality. But overall, OpenAI has nailed it. From now on, it is mostly about improvements. And again, this is just the beginning.


Once Sora goes public, everyone will be able to generate high-quality, compelling videos that can serve as source material for advertising campaigns. Creative agencies or video editing software will still have to be involved to produce a full-fledged marketing campaign video, but graphics rendering technology will no longer be required to generate the synthetic footage.

Here is my summary of the evolution of content creation for the video advertising industry:

[Figure: Summary of the content creation evolution for the video advertising industry]

Where do we go from here?

Does all that mean the end for startups operating in the advertising content creation sector? Of course not! I personally see this as the beginning of a new era and new opportunities. As I have already explained, computer-generated content is just one piece of the quite complex process of making a video ad. Sora and similar technologies have revolutionized that part of the creation flow by allowing the generation of hyper-realistic visuals from textual descriptions. But video ads are not made only of a series of stunning camera shots. The process known as video compositing involves stitching together and blending different video sources, still images, text overlays, background music, voice-overs, and special effects. And while we can agree that all of these can also be driven by AI prompts, I doubt the process can be completely automated, without human brains (and hands) involved, to produce a commercial that looks exactly the way the customer expects.
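As a small illustration of that compositing step, here is a sketch using the moviepy library (file names and overlay text are placeholders, and TextClip additionally requires ImageMagick): a generated clip is combined with a text overlay and a background music track.

```python
from moviepy.editor import (AudioFileClip, CompositeVideoClip,
                            TextClip, VideoFileClip)

base = VideoFileClip("generated_scene.mp4")  # e.g. AI-generated footage
overlay = (TextClip("Summer Sale: 30% off", fontsize=60, color="white")
           .set_position(("center", "bottom"))
           .set_duration(base.duration))
music = AudioFileClip("background_music.mp3").set_duration(base.duration)

# Compositing: blend the video source, the text overlay, and the audio track.
ad = CompositeVideoClip([base, overlay]).set_audio(music)
ad.write_videofile("final_ad.mp4", fps=24)
```

Even with a script like this, someone still has to decide what the overlay says, where it sits, and whether the result matches the brand, which is exactly the human part of the job.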


Video editors are not going to disappear that fast; rather, the technology, tools, and workflows will undergo a dramatic upgrade with the help of AI. For example, online marketing video creation tools such as Canva will rely more and more on AI to generate sophisticated graphics, while at the same time adding robust AI-powered tools that help users assemble the graphic assets into ads more easily. AI will be used not just to generate fancy images and complete video footage; it will also help automate the creation and customization of video templates, currently a daunting task done manually by motion designers. Brand colors and style are other areas that will receive a boost from AI. Laying out graphic elements on the canvas in a way that doesn't ruin the look and feel of the frame usually requires a designer's skills. AI will automate that process too.


Eventually, video editing and personalization software engineers will be able to focus on developing intelligent and user-friendly tools for a fast and efficient video creation workflow that can be operated in a ‘self-serve’ manner by people who are not technical. The competitive advantage will belong to those companies that can fine-tune AI models to provide all that functionality accurately and efficiently, along with tools that can be used painlessly by users to achieve their creative goals. These next-gen products are going to reduce ad content creation costs and make high-fidelity video advertising content affordable for businesses of any size. Similar to how everyone can make a PowerPoint presentation today, I foresee the same outcome for online marketing video ads.


















