Hello, in this post, I will be explaining how digital avatars such as the one in the video above are created! As digital avatars become more popular in sales and consumer-facing roles, it is useful to understand how they work. There are two crucial steps to bringing an avatar to life: the first is a strong text to speech model, so that the avatar sounds realistic rather than robotic; the second is a strong model to artificially generate my facial expressions, so it appears the avatar is making the sounds you are hearing. An AI model is a program that has been trained on a set of data to recognize certain patterns or make certain decisions without further human intervention. Keep in mind, the audio you are hearing is artificially generated, and not actually created by me. Let us start by focusing on the first step: text to speech.
Put simply, text to speech takes a text-based script and reproduces it as spoken audio. This video itself started as a script, and the words were converted to audio before being played over a realistic-looking video.
Text to speech systems use a multi-step process to turn text into spoken words. At a high level, these steps involve text analysis, linguistic processing, and speech synthesis. Let us break each of these down.
First up is text analysis. The text to speech system starts by reading the input text and breaking it down into manageable chunks. This involves identifying words, sentences, and punctuation. The text is parsed to understand its structure and meaning. For instance, the sentence “Let us go to the park!” is split into words and analyzed for context. This helps the system decide how to pronounce each word correctly and how to handle punctuation marks like exclamation points.
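If you are curious what this looks like in practice, here is a rough Python sketch of the text analysis step. The function and the simple rules are purely illustrative, not taken from any real text to speech system.

```python
import re

# Illustrative only: split text into sentences, extract word tokens,
# and note end-of-sentence punctuation so later stages can adjust intonation.
def analyze_text(text):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    parsed = []
    for sentence in sentences:
        parsed.append({
            "words": re.findall(r"[A-Za-z']+", sentence),
            "ends_with_exclamation": sentence.endswith("!"),
        })
    return parsed

print(analyze_text("Let us go to the park!"))
# [{'words': ['Let', 'us', 'go', 'to', 'the', 'park'], 'ends_with_exclamation': True}]
```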
Next, we have Linguistic Processing. This stage involves several key tasks. First, the system performs phonetic transcription, converting text into phonetic symbols, which represent the sounds of speech. This step is crucial because it translates written words into the pronunciation that matches spoken language. For example, the word ‘apple’ is broken down into its phonetic components, something like app and el. The system also applies prosody, which is the rhythm, stress, and intonation of speech. This ensures the generated voice sounds natural and expressive. Prosody adds natural variations in pitch and rhythm, making the speech sound more human and less robotic.
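Here is a similarly rough sketch of the linguistic processing step, using a toy lexicon loosely based on ARPAbet-style phonetic symbols. Real systems rely on much larger pronunciation dictionaries and letter-to-sound models; this is just to make the idea concrete.

```python
# Illustrative only: a toy lexicon maps written words to phonetic symbols,
# and a simple rule marks the pitch contour for prosody.
TOY_LEXICON = {
    "apple": ["AE", "P", "AH", "L"],
    "let":   ["L", "EH", "T"],
    "us":    ["AH", "S"],
    "go":    ["G", "OW"],
}

def to_phonemes(words):
    # Unknown words would normally fall back to letter-to-sound rules.
    return [TOY_LEXICON.get(w.lower(), ["<unk>"]) for w in words]

def add_prosody(sentence, phoneme_seqs):
    contour = "rising" if sentence.endswith("?") else "falling"
    return {"phonemes": phoneme_seqs, "pitch_contour": contour}

print(add_prosody("Let us go!", to_phonemes(["Let", "us", "go"])))
```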
Now, onto the final stage of text to speech: Speech Synthesis. This is where the magic really happens. There are two main methods for synthesizing speech: Concatenative Synthesis and Parametric Synthesis. Concatenative Synthesis involves piecing together pre-recorded human speech segments. Imagine a giant puzzle where each piece is a small chunk of recorded speech. The system selects and combines these segments to form complete sentences. While this method can produce very natural-sounding speech, it’s limited by the available recorded material. On the other hand, Parametric Synthesis generates speech using mathematical models. It doesn’t rely on pre-recorded segments but rather synthesizes speech sounds based on parameters like pitch, duration, and volume. This approach offers more flexibility and requires less storage space, though it can sound less natural compared to concatenative synthesis. My voice was generated using Parametric Synthesis, which is why you may hear me occasionally mispronounce words.
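To illustrate the parametric idea, here is a toy sketch that "synthesizes" audio purely from pitch and duration parameters. It produces simple tones rather than speech, but it shows how sound can be generated from a mathematical description instead of stitched-together recordings.

```python
import numpy as np

SAMPLE_RATE = 16_000  # audio samples per second

# Illustrative parametric synthesis: each "phoneme" is described only by a
# pitch (Hz) and a duration (seconds); real systems use far richer parameters.
def render_phoneme(pitch_hz, duration_s):
    t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
    return 0.3 * np.sin(2 * np.pi * pitch_hz * t)

def synthesize(parameters):
    # Join the generated segments into one continuous waveform.
    return np.concatenate([render_phoneme(p, d) for p, d in parameters])

# (pitch, duration) pairs standing in for a short phoneme sequence.
waveform = synthesize([(220, 0.12), (180, 0.10), (240, 0.15)])
print(waveform.shape)  # number of audio samples produced
```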
Today’s cutting-edge text to speech systems use Deep Learning and Neural Networks to improve speech quality. These systems are trained on vast amounts of speech data, learning to produce voices that are incredibly lifelike and expressive. By analyzing the nuances of human speech, these advanced systems can generate voices that are not only clear but also convey emotions and subtleties. Face and lip syncing techniques then ensure the avatar’s lip movements are both natural and in sync with the spoken output. These innovations in facial movement are essential to creating a convincing talking head video that brings text to speech to life. Neural text to speech systems also allow me to speak in a variety of languages and accents, which is why you are hearing me use a British accent. I could give this same lecture in over 100 different languages by simply regenerating the speech output with the desired language chosen.
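For readers who like code, here is a deliberately tiny neural text to speech skeleton written with PyTorch. The architecture and sizes are assumptions for illustration only; real systems are far larger and pair this kind of model with a separate vocoder that turns spectrogram frames into audio.

```python
import torch
import torch.nn as nn

# A highly simplified neural TTS skeleton (assumed architecture, not a real
# system): phoneme IDs are embedded, passed through a recurrent layer, and
# projected to mel-spectrogram frames that a vocoder would turn into audio.
class TinyTTS(nn.Module):
    def __init__(self, n_phonemes=50, hidden=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_ids):
        x = self.embed(phoneme_ids)   # (batch, time, hidden)
        x, _ = self.rnn(x)            # model context across the sequence
        return self.to_mel(x)         # (batch, time, n_mels)

model = TinyTTS()
fake_phonemes = torch.randint(0, 50, (1, 12))  # one sequence of 12 phoneme IDs
mel = model(fake_phonemes)
print(mel.shape)  # torch.Size([1, 12, 80])
```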
Once the speech audio has been generated, a digital avatar is created whose facial features match the phonetic pronunciations in the audio. Since videos are simply a collection of still images shown at a given frame rate, such as 60 frames per second, individual images are generated to represent each moment in time that corresponds to the audio.
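The arithmetic behind this is straightforward; the clip length below is just an example.

```python
# At 60 frames per second, each video frame covers 1/60 of a second of the
# generated speech, so the frame count follows directly from the audio length.
FPS = 60
audio_duration_s = 2.5            # length of the synthesized audio clip

n_frames = round(audio_duration_s * FPS)
frame_times = [i / FPS for i in range(n_frames)]

print(n_frames)          # 150 images to generate
print(frame_times[:3])   # [0.0, 0.0166..., 0.0333...] seconds into the audio
```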
Digital avatars are becoming increasingly lifelike in appearance, and one of the technologies behind this realism is called Neural Radiance Fields, or NeRF. But what exactly is a NeRF? Let’s break it down. Neural Radiance Fields are a type of machine learning model used to generate highly realistic 3D scenes from 2D images. Unlike traditional 3D modeling techniques, NeRFs use neural networks to create and render scenes by predicting how light interacts with surfaces.
So, how does this apply to creating facial scenes for digital avatars? Imagine we want to create a lifelike 3D model of a person’s face. We start by capturing multiple high-resolution images of the person’s face from various angles. These images serve as the input data for our NeRF model. It learns not just the shape of the face but also how light reflects off different features like the eyes, skin, and hair. Once the NeRF model has learned these details, it can render new views of the face, with expressions that match any text being spoken. This level of detail is crucial for creating avatars that not only look realistic but can also convey a range of expressions and emotions.
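Here is a minimal sketch of the core NeRF idea in PyTorch: a small network maps a 3D point and a viewing direction to a colour and a density, and rendering an image means sampling many points along each camera ray and compositing these outputs. It omits important details such as positional encoding and the actual volume rendering step, so treat it as a cartoon of the technique rather than a working renderer.

```python
import torch
import torch.nn as nn

# Illustrative NeRF core: (3D point, view direction) -> (RGB colour, density).
class TinyNeRF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 3, hidden), nn.ReLU(),  # xyz position + view direction
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                 # RGB colour + density
        )

    def forward(self, points, view_dirs):
        out = self.net(torch.cat([points, view_dirs], dim=-1))
        rgb = torch.sigmoid(out[..., :3])    # colours in [0, 1]
        density = torch.relu(out[..., 3:])   # non-negative opacity
        return rgb, density

model = TinyNeRF()
points = torch.rand(1024, 3)      # sample points along camera rays
view_dirs = torch.rand(1024, 3)   # direction each ray is looking
rgb, density = model(points, view_dirs)
print(rgb.shape, density.shape)   # torch.Size([1024, 3]) torch.Size([1024, 1])
```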
In addition to NeRF, deep learning techniques can train a model for avatar animation using video data of people speaking. The model learns to map different phonemes, which are the distinct units of sound, to specific facial movements and expressions. This way, when the avatar speaks, its lips and expressions match the audio output.
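As a final sketch, here is an assumed (and heavily simplified) version of that mapping: one phoneme label per video frame goes in, and predicted mouth landmark positions come out. Every name and size here is illustrative, not a description of any particular production system.

```python
import torch
import torch.nn as nn

# Assumed sketch of the phoneme-to-facial-movement idea: a small network maps
# each frame's phoneme (with context from its neighbours) to mouth landmark
# positions for that frame of video.
class PhonemeToMouth(nn.Module):
    def __init__(self, n_phonemes=50, hidden=64, n_landmarks=20):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.to_xy = nn.Linear(hidden, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, phoneme_ids):
        x = self.embed(phoneme_ids)
        x, _ = self.rnn(x)   # context captures how neighbouring sounds blend
        batch, time = phoneme_ids.shape
        return self.to_xy(x).reshape(batch, time, -1, 2)

model = PhonemeToMouth()
frame_phonemes = torch.randint(0, 50, (1, 150))  # one phoneme label per video frame
mouth_shapes = model(frame_phonemes)
print(mouth_shapes.shape)  # torch.Size([1, 150, 20, 2])
```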
These techniques are not perfect, which is why you may sometimes notice a mismatch between my facial expressions and the sounds you hear.
I hope that, between text to speech and the models that animate my face, you now have a better understanding of how digital avatars work!