It is astonishing: after more than two thousand years of slumber, the Terracotta Warriors of the Qin Emperor's mausoleum seem to have "come to life", singing traditional Qin opera in remarkably convincing style.
In a series of eye-catching demonstration videos, the Terracotta Warriors do not merely sing; their expressions and gestures reach an impressive level of realism. On closer inspection, one can even see the Adam's apple move and the chest resonate.
Not just the Terracotta Warriors: historical and cultural figures such as the Mona Lisa, Audrey Hepburn, Confucius, and Lu Xun need only a static photo paired with an audio clip to become digital portraits that can talk, sing, or even rap.
Recall the fashionable woman walking the Tokyo streets in the Sora demo videos; she, too, can now sing, in a scene that blurs the boundary between reality and the virtual.
All of this is the work of EMO (Emote Portrait Alive), an artificial intelligence model developed by Tongyi Lab. Given nothing more than a portrait photo and an audio clip, EMO can produce a lifelike video of the portrait speaking.
On April 26, EMO was officially integrated into the Tongyi APP and opened to all users free of charge.
At present, users can choose freely among templates such as songs, popular memes, and emoji packs; uploading a photo is all it takes to make the portrait sing along.
EMO not only takes a different technical route from Sora; it also breaks with traditional face-swapping and digital-double technology.
In fact, the generative AI community has been watching EMO with high expectations for some time. As early as the end of February this year, Tongyi Lab published its technical paper on arXiv and showcased the project publicly on GitHub.
After publication, the paper quickly drew widespread attention from foreign media and was hailed as "one of the most eye-catching AI video models after Sora".
On GitHub, EMO's popularity has kept climbing, and the project has now collected more than 6,700 stars.
In just two months, a product built on EMO has gone live in the Tongyi APP and been offered to the public for free, a turnaround whose speed has earned wide praise.
In the technical paper "EMO: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions," EMO is defined as an audio-driven framework for generating portrait videos.
As for the technical principles: during training, the research team built a large dataset containing more than 250 hours of video and over 15 million images. During generation, the EMO model first extracts features from the reference image and preceding video frames in a "frames encoding" stage, obtains audio embeddings from a pretrained audio encoder, and then combines a facial-region mask with multi-frame noise to drive precise generation of the facial video frames.
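To make that flow concrete, here is a minimal PyTorch sketch of the components just described: a reference encoder for the portrait, a stand-in for the pretrained audio encoder, a facial-region mask, and multi-frame noise fed to a diffusion-style denoiser. The module names, shapes, and wiring are illustrative assumptions, not Tongyi Lab's actual implementation.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Encodes the reference portrait into identity-preserving features."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, image):                      # image: (B, 3, H, W)
        return self.net(image)                     # (B, dim, H, W)

class AudioEncoder(nn.Module):
    """Stand-in for a pretrained speech encoder producing per-frame audio embeddings."""
    def __init__(self, n_mels=80, dim=64):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)

    def forward(self, mel):                        # mel: (B, T, n_mels)
        return self.proj(mel)                      # (B, T, dim)

class Denoiser(nn.Module):
    """Toy denoiser: predicts noise for T frames, conditioned on identity and audio."""
    def __init__(self, dim=64):
        super().__init__()
        self.video_net = nn.Conv3d(3 + 1, 3, 3, padding=1)  # noisy frames + face mask
        self.ref_proj = nn.Conv2d(dim, 3, 1)                 # inject reference features
        self.audio_proj = nn.Linear(dim, 3)                  # inject per-frame audio features

    def forward(self, noisy, face_mask, ref_feat, audio_feat):
        x = torch.cat([noisy, face_mask], dim=1)             # (B, 4, T, H, W)
        eps = self.video_net(x)                              # (B, 3, T, H, W)
        eps = eps + self.ref_proj(ref_feat).unsqueeze(2)     # identity bias, broadcast over T
        gain = self.audio_proj(audio_feat).permute(0, 2, 1)[..., None, None]
        return eps * (1 + gain)                              # audio modulates each frame

# One denoising step over a clip of T frames.
B, T, H, W = 1, 8, 64, 64
ref_img = torch.randn(B, 3, H, W)             # the single portrait photo
mel = torch.randn(B, T, 80)                   # audio features aligned to T frames
face_mask = torch.ones(B, 1, T, H, W)         # facial-region mask
noisy = torch.randn(B, 3, T, H, W)            # multi-frame noise

eps_hat = Denoiser()(noisy, face_mask, ReferenceEncoder()(ref_img), AudioEncoder()(mel))
print(eps_hat.shape)                          # torch.Size([1, 3, 8, 64, 64])
```

In a real diffusion pipeline this noise prediction would be applied repeatedly to denoise the multi-frame latent; the sketch only shows how the three conditioning signals enter a single step.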
Within this wave of technology, EMO is the one focused on bringing photos to life, using its own mechanism to make portrait videos vivid. Part of the EMO pipeline consists of "reference attention" and "audio attention" modules, dedicated respectively to keeping the character's identity consistent and to modulating the character's movements. This sets it markedly apart from large video generation models such as Sora, which generate video from text descriptions.
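The identity-preserving part can be pictured as follows: a minimal sketch, under assumed shapes, of the "reference attention" idea, in which latent tokens of the frame being generated attend over keys and values that also include tokens from the encoded reference portrait, tying every generated frame back to the same identity. Names and dimensions are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class ReferenceAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens, ref_tokens):
        # frame_tokens: (B, N, dim) spatial tokens of the frame under generation
        # ref_tokens:   (B, M, dim) tokens from the encoded reference portrait
        kv = torch.cat([frame_tokens, ref_tokens], dim=1)  # keys/values span both sources
        out, _ = self.attn(frame_tokens, kv, kv)            # queries stay on the current frame
        return frame_tokens + out                            # residual update

tokens = ReferenceAttention()(torch.randn(2, 256, 64), torch.randn(2, 256, 64))
print(tokens.shape)  # torch.Size([2, 256, 64])
```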
Liefeng Bo, head of the XR Lab at Tongyi Lab and a co-author of the paper, pointed out that EMO is fundamentally different from existing "face swapping" technology. Face swapping typically copies and fits one person's facial features onto another's face, focusing on the replacement and synthesis of features. Digital-double technology, by contrast, relies on computer graphics and motion capture to create virtual characters that simulate human behavior and expressions, but it usually requires existing footage of the person as learning material. Bo spoke enthusiastically of EMO's prospects, foreseeing broad application across video scenarios.
In the research paper, the terms "weak conditions" and "expressive" stand out. Liefeng Bo prefers the phrase "weak control" when explaining EMO's core advantage: the connection between voice and facial expression, head movement, and body language is no accident. Different sounds and emotions naturally evoke corresponding expressions and movements; the voice itself is expressive. EMO's goal is to have the model capture the emotion in the voice and translate it naturally into facial expressions and movements.
"Strong control", by contrast, produces audio-synchronized video frames by explicitly modeling facial keypoints or expression movements. Through training, the EMO model achieves outstanding expressiveness without such explicit control. Bo noted that after training on a large number of speaker videos, the model learns not only the correspondence between speech and mouth shape but also the link between vocal pitch characteristics and expression, so a character's expressions and movements respond naturally and sensitively to the emotion carried by the audio.
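The difference between the two regimes comes down to what drives the generator. The tensors below are purely illustrative (not the paper's interfaces): under strong control the driving signal is an explicit per-frame facial-keypoint trajectory, while under EMO-style weak control the only driving signal is the audio embedding, leaving expressions, blinks, and head motion for the trained model to infer.

```python
import torch

T = 8                                  # frames in one generated clip

# Strong control: an explicit landmark sequence, e.g. 68 (x, y) keypoints per frame.
keypoint_track = torch.randn(T, 68, 2)

# Weak control: only per-frame audio embeddings drive the generation.
audio_embeddings = torch.randn(T, 64)

print(keypoint_track.shape, audio_embeddings.shape)
```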
With EMO, one can even imagine new ways of communicating, such as conversing with "images" of oneself at different ages. These advances offer an innovative way to make portrait videos more vivid and expressive.
A particularly novel part of the EMO model is its Audio-Attention module, which matches audio features to pixels in the image. With this mechanism, the model strengthens how the emotional characteristics of the audio are expressed in key facial regions such as the mouth and eyes, ensuring that the emotion in the voice is clearly reflected in the facial expression.
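One way to picture this matching is as cross-attention between image and audio tokens. The sketch below, under assumed shapes, lets spatial tokens of the face latent act as queries and a small window of audio embeddings act as keys and values, so the attention map records which pixels respond to which audio frames; it is an illustration, not the actual Audio-Attention layer from the EMO paper.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pixel_tokens, audio_tokens):
        # pixel_tokens: (B, H*W, dim) spatial tokens of the current frame's latent
        # audio_tokens: (B, A, dim)   audio embeddings around the current frame
        out, weights = self.attn(pixel_tokens, audio_tokens, audio_tokens)
        return pixel_tokens + out, weights   # weights: per-pixel attention over audio frames

pixels = torch.randn(1, 32 * 32, 64)         # 32x32 latent grid
audio = torch.randn(1, 5, 64)                # a 5-frame audio window
updated, attn_map = AudioCrossAttention()(pixels, audio)
print(updated.shape, attn_map.shape)         # (1, 1024, 64) (1, 1024, 5)
```

Regions whose attention weights concentrate on the audio tokens, such as the mouth and eyes, are the ones most strongly steered by the voice.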
Demonstration videos released by Tongyi Lab show how far EMO goes toward highly realistic results. It supports speech in Mandarin, Cantonese, English, Japanese, and Korean, as well as singing, and it handles diverse presentation styles, spanning photographs, traditional painting, comics, 3D renders, AI digital humans, sculpture, and other art forms.
Liefeng Bo highlighted EMO's role in pushing talking-head technology forward. Compared with strong-control approaches, the weak-control route that EMO champions lowers the barrier to use, allowing a much wider range of users to enjoy the convenience the technology brings: with nothing more than a static image and an audio clip, EMO can generate a video with expressive sound and visuals.
In today's fast-moving AI landscape, however, the authenticity of video content is increasingly called into question. To address these concerns, Liefeng Bo revealed that Tongyi Lab embeds recognizable watermarks in the generated videos; even when the naked eye cannot detect them directly, technical means can verify whether the content is genuine. He emphasized that AI video generation and AI video verification are complementary technologies developing in step.
To prevent misuse of EMO, Tongyi Lab has adopted a series of precautions. In the Tongyi APP, only reviewed official audio templates can be used; users cannot upload their own audio, and the photos they upload must meet platform standards and comply with the platform agreement. In addition, all user-generated content goes through both algorithmic and manual review to ensure safety and compliance.
The EMO API is not yet open to the public, and the team is working out the relevant security strategy with safety as the first priority. Going forward, Tongyi Lab will continue to advance large-model technology on the premise of platform safety, and it welcomes ideas and suggestions from all sectors of society.
Since ChatGPT sparked the trend of generative artificial intelligence in 2023, China’s large model technology has rapidly developed and attracted broader attention in 2024. With the emergence of video generation models like Sora and EMO, artificial intelligence technology is changing our world in various forms.
Faced with an ever-changing world, are you ready to embrace the challenges?
In this era of great change, we all need to equip ourselves for continuous transformation. Whether in personal life or professional careers, adaptability has become an indispensable skill. Amid shifting technological, economic, social, and even educational landscapes, only by continuing to learn and improve can we stand at the forefront of the times.