Wladislav Radchenko


Voice cloning from music, removal of dynamic text from videos, and new features in the second part of the Wunjo AI update

Hello, reader! After many sleepless nights I have finally finished the second part of the update to the open-source project Wunjo AI and brought my vision of the application to life. This update focuses primarily on sound: voice cloning has been improved, vocals or melodies can now be extracted from songs, and speech quality can be enhanced for video editing. But that's not all: new video features have also been introduced, including text removal, video quality enhancement, and deepfake creation. Let's go through everything step by step, starting with sound and then moving on to videos and deepfakes. At the end of this article you'll find a video explaining how to work with videos in the application and how the neural networks behind deepfakes and the other features operate.

If you're interested, you can read the previous articles about creating deepfakes in Wunjo AI, its deepfake-related functionality, and prompt-driven video-to-video editing with generative AI.

Let's begin with sound, one of the primary objectives of the second part of the update. Initially, Wunjo AI used an adapted version of Real Time Voice Cloning. That approach has been completely overhauled, resulting in a much better version of voice cloning. I now use an encoder trained on audio material from Real Time Voice Cloning in conjunction with HuBERT Soft. This combination reproduces speech rate and timbre more precisely during the synthesis phase, before the signal reaches the vocoder. In addition, the gender of the voice (male or female) is detected from the noise-cleaned original audio, and the vocoder settings are then adjusted accordingly.
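The post doesn't show how the gender detection works internally; below is a minimal, self-contained sketch of one common heuristic: estimating the fundamental frequency (F0) with autocorrelation and splitting around a typical male/female boundary. The function names, the 165 Hz threshold, and the whole approach are illustrative assumptions, not Wunjo AI's actual implementation.

```python
import numpy as np

def estimate_f0(signal, sr, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of a voiced signal
    by finding the strongest autocorrelation peak within the
    plausible pitch range [fmin, fmax]."""
    signal = signal - signal.mean()
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(sr / fmax)  # shortest period considered
    lag_max = int(sr / fmin)  # longest period considered
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sr / lag

def guess_gender(signal, sr, threshold_hz=165.0):
    """Crude male/female split around a typical F0 boundary:
    adult male speech usually sits below ~165 Hz, female above."""
    return "female" if estimate_f0(signal, sr) > threshold_hz else "male"
```

A real pipeline would average F0 over many voiced frames and handle unvoiced segments, but the decision rule is essentially this simple.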

However, this article focuses on simpler aspects without delving into technical details. Let's take a look at the voice cloning process in Wunjo AI.

Wunjo AI can now clone voices in English, Russian, and Chinese. If you would like to see support for your language, let me know in the discussion.

Russian Voice

Excerpt from the song Enjoykin - Cutlets with mashed potatoes

The new version of Wunjo AI can not only extract vocals from songs but also clone voices. In addition, a convenient panel for manually separating vocals from the melody or background noise in audio or video has been introduced, giving you more flexibility.

Panel for manual voice extraction

The previous version of Wunjo AI could not extract vocals from a song; the new version adds this capability. The sound separation method is based on Open-Unmix, which cleanly extracts the vocals or the accompaniment from a song.
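Open-Unmix itself predicts source spectrograms with a neural network; what can be shown compactly here is the masking step that turns those estimates into separated sources. The sketch below (hypothetical names, pure NumPy, not the project's code) applies Wiener-style ratio masks to a mixture magnitude spectrogram:

```python
import numpy as np

def soft_mask_separate(mix_mag, vocal_est, accomp_est, eps=1e-8):
    """Wiener-style ratio masking: distribute the mixture
    spectrogram between the sources in proportion to each
    model estimate, so the two outputs sum back to the mix."""
    total = vocal_est + accomp_est + eps
    vocals = mix_mag * (vocal_est / total)
    accompaniment = mix_mag * (accomp_est / total)
    return vocals, accompaniment
```

The separated magnitudes are then combined with the mixture phase and inverted back to waveforms; when the network's estimates are accurate, the masks recover each source almost exactly.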

Extracted Vocals

How cloning worked before improvement

How voice cloning works in the new version

The quality has clearly improved, and the voice was cloned directly from the original excerpt, without any need for manual vocal extraction.

English Voice

Is the improvement in voice quality perhaps due to the model? No, the model has not been altered, and this can be demonstrated by cloning an English voice using the basic Real Time Voice Cloning models.

Excerpt from the song "Tessa Violet - Crush"

Extracted vocals for cloning in previous version

How cloning worked before improvement

How voice cloning works in the new version

Cloning a voice into free text

However, it's no secret that the Real Time Voice Cloning approach works at a reduced sample rate, so to achieve the best cloning quality the input audio has to be downsampled. Any reduction in sample rate loses high-frequency detail and therefore sound quality. To restore it, a Speech Enhancement step is applied; it operates on both audio and video, improving sound quality and recovering the original sample rate.
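To see why a lower sample rate costs quality, recall the Nyquist limit: audio sampled at rate f can only represent frequencies below f/2. The toy sketch below (illustrative only, not the project's code) zeroes out everything above the target Nyquist, which is exactly the content a downsampled recording can never carry, and measures what is lost:

```python
import numpy as np

def band_energy(x, sr, f_lo, f_hi):
    """Sum of spectral power between f_lo and f_hi (Hz)."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    return spec[(freqs >= f_lo) & (freqs < f_hi)].sum()

def simulate_resample_loss(x, sr, target_sr):
    """Remove all content above the target Nyquist frequency,
    i.e. the information a lower sample rate cannot represent."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    spec[freqs > target_sr / 2] = 0.0
    return np.fft.irfft(spec, n=len(x))
```

A 7 kHz component in a 16 kHz recording simply vanishes once the audio is brought down to 8 kHz, which is why a separate enhancement model is needed to reconstruct plausible high-frequency detail afterwards.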

Voice enhancement panel

How voice cloning works in the new version + speech improvement

Cloning voice to free text + speech improvement

Improving the speech cloning pipeline took significant work and more than a few sleepless nights. Now let's move on to the next area: working with videos.

Text Removal from Videos

Have you ever needed, or simply wanted, to remove text from a video: full-screen captions, subtitles, text on product packaging, brand logos, or even street signs in your video or image? It occurred to me that this would be a useful feature for Wunjo AI users, letting them remove text from a video in just a couple of clicks and simplifying the task for anyone cleaning up video material.
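Text removal of this kind is typically a mask-plus-inpainting problem: detect where the text sits, then fill that region from its surroundings. The sketch below uses naive diffusion inpainting as a toy stand-in for the learned inpainting a real remover would use; all names and the method itself are illustrative assumptions, not Wunjo AI's implementation.

```python
import numpy as np

def diffusion_inpaint(img, mask, iters=200):
    """Fill masked pixels by repeatedly averaging their
    4-neighbours (heat diffusion), so the surrounding
    background 'flows' into the hole left by the text."""
    out = img.astype(float).copy()
    out[mask] = out[~mask].mean()  # rough initial fill
    for _ in range(iters):
        avg = (np.roll(out, 1, 0) + np.roll(out, -1, 0) +
               np.roll(out, 1, 1) + np.roll(out, -1, 1)) / 4.0
        out[mask] = avg[mask]      # only masked pixels are updated
    return out
```

On video, a production system would also borrow pixels from neighbouring frames where the text region is uncovered, which is why results on moving scenes are usually better than single-image inpainting.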

Text removal panel

Deleting text with unusually similar people

It doesn't work perfectly, but it can be useful in most cases.

Video Style Change Panel

In the previous update, I added the ability to alter videos using text.

An example of the method of changing video using text in Wunjo AI for 8 GB VRAM

In this update I added a panel for the second part of the text-driven video modification module. As I mentioned in the previous article, that module requires a large amount of video memory, and I only have 8 GB. The advantage of this approach is that the next frame of the video is generated not only from the current frame but also from the data of the previous frame, which gives better control over the changes.

The second part of this approach is less resource-intensive than the first. With my video memory, for instance, I can work at a resolution of 1280x1280, which is already pleasing. What's the gist of it? You upload the video, select the keyframes where the scene changes significantly, modify those frames separately in AUTOMATIC1111, add them to the panel, and start processing. The video style is then changed by EbSynth, which has been slightly modified compared to the original repository. Without the first part of the module, creating those stylized keyframes is entirely up to you.
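Selecting keyframes "where the scene changes significantly" can also be automated with a simple frame-difference heuristic. The sketch below is an illustrative assumption, not the app's actual selection logic: a frame becomes a keyframe when it differs enough from the previous keyframe.

```python
import numpy as np

def pick_keyframes(frames, threshold=0.1):
    """Greedy keyframe selection: start a new keyframe whenever
    the mean absolute pixel difference from the last keyframe
    exceeds the threshold (frames are float arrays in [0, 1])."""
    keys = [0]  # the first frame is always a keyframe
    for i in range(1, len(frames)):
        if np.abs(frames[i] - frames[keys[-1]]).mean() > threshold:
            keys.append(i)
    return keys
```

Each selected frame would then be stylized in AUTOMATIC1111, and EbSynth propagates the style across the in-between frames.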

Style change panel

Original video excerpt

Original video excerpt

Result

Result

Even with limited video memory, we can achieve higher-quality results. Moreover, in the new version of Wunjo AI, video enhancement has been added.

Video Enhancement

You can now enhance not only face quality but also the overall quality of video clips, or improve drawn (animated) video material, for which the enhancement approach is more aggressive.

Video enhancement panel

Any quality gain in this fragment is hard to see after the video was compressed for embedding as a GIF, so let's look at another segment created specifically for this purpose.

Improve video quality

What else?

Previously, some Windows users without Visual Studio could run into problems launching Wunjo AI because of the build requirements of dlib, the library used for face processing. That library has now been replaced entirely, without adding any new dependencies.

And what about deepfakes?

Work with deepfakes has been optimized for less powerful PCs with limited RAM. If you want to learn more about deepfakes, how neural networks operate within the process of creating them, or about other video-related features in the Wunjo AI project, you can visit my YouTube channel. There, you'll find videos on how to work with deepfakes, install the application, train a neural network on your voice within the app, and also view my creative sketches. In any case, promise to use this technology for the benefit of humanity!

Documentation for Wunjo AI, the open-source code on GitHub, and the official website for downloading installers or portable versions with GPU support for Windows are available for you. Just choose the icon corresponding to your operating system as indicated by the arrow. Remember, for using the GPU-supported version, you'll need to install CUDA 11.8.

Additionally, I'll add that you can use any language interface within the application. Instructions on how to add your language can be found in the documentation on this page.

That's all! I hope you found it interesting and helpful. Bye for now!
