The goal is to create an open-source app or library which allows musicians to expedite the process of creating visuals for their music:
Lip Sync
In parallel with my study of shader functions, I have been exploring ways to incorporate an animation of my face (or any character I wish to create) that will lip-sync to my song in an HTML/Canvas animation.
This was originally inspired by the output from the forced aligner I used (gentle), which included the time the word was spoken, as well as the duration of each phoneme of the word (phonemes are fundamental units of a word's sound).
For example, gentle's result for the word "let" (the duration of the phoneme is shown in seconds):
{
  "alignedWord": "let",
  "phones": [
    { "duration": 0.09, "phone": "l_B" },
    { "duration": 0.09, "phone": "eh_I" },
    { "duration": 0.04, "phone": "t_E" }
  ]
}
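To drive an animation, these per-phoneme durations can be converted into absolute time ranges. Here is a minimal sketch of that conversion (it assumes each aligned-word object also carries the word's start time in seconds, which gentle reports alongside the phones):

// Expand a gentle word result into a flat timeline of phonemes.
// Assumes `word.start` holds the word's start time in seconds.
function phonemeTimeline(word) {
  let t = word.start;
  return word.phones.map((p) => {
    const entry = { phone: p.phone, start: t, end: t + p.duration };
    t += p.duration;
    return entry;
  });
}

// Find which phoneme (if any) is active at a given playback time.
function phonemeAt(timeline, time) {
  return timeline.find((p) => time >= p.start && time < p.end) || null;
}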
My first plan was to map mouth-shape coordinates to each phoneme and use them when rendering the canvas at each frame of the animation. As a first attempt, I used the following image I found on the web, which shows the mouth shape corresponding to different letters:
Source: https://fmspracticumspring2017.blogs.bucknell.edu/2017/04/18/odds-ends-lip-syncing/
I've tried to articulate my intention with comments throughout the code, but essentially, the master image (which contains all of the mouth shapes) is translated so that the mouth shape for the current phoneme shows while each word is displayed.
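As a rough sketch of the idea (the cell size and the phoneme-to-cell mapping below are placeholders, not the actual values from the pen), drawing one cell of the mouth-shape sprite sheet amounts to shifting the source offset passed to drawImage:

// Hypothetical mapping from gentle's phoneme symbols to cells in the sprite sheet.
const phonemeToCell = { l: 0, eh: 1, t: 2 }; // assumed layout
const CELL_W = 100; // assumed cell width in px
const CELL_H = 100; // assumed cell height in px

function drawMouth(ctx, spriteSheet, phone) {
  // "l_B" -> "l": strip gentle's position suffix (_B/_I/_E).
  const base = phone.split('_')[0];
  const cell = phonemeToCell[base] ?? 0;

  // Copy one cell of the master image onto the canvas; changing the
  // source x-offset is equivalent to "translating" the master image.
  ctx.clearRect(0, 0, CELL_W, CELL_H);
  ctx.drawImage(
    spriteSheet,
    cell * CELL_W, 0, CELL_W, CELL_H, // source rect in the sprite sheet
    0, 0, CELL_W, CELL_H              // destination rect on the canvas
  );
}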
I feel confident that this case study can be extended to a full song, with custom mouth-shape coordinates (which will probably start out as drawings made in vectr). This will likely be the next step I take to produce a full song's animation.
But before I proceed down that route, I wanted to try out something I came across a few days ago: RunwayML, software that provides a GUI for running different open-source ML models. RunwayML is explicitly marketed as software for creators. There's a free download and it's unbelievably easy to use, so if you are interested in using machine learning for creative endeavors, I highly recommend it.
Using RunwayML
Instead of using the image of mouth shapes, or drawing my own, I was happy to utilize the power of facial recognition to do that work for me.
I started by recording a short video of myself with my phone:
I then created a new workspace in RunwayML and added to it the Face Landmarks model, which is described by its author as follows:
A ResNet-32 network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of datasets. The face scrub dataset, the VGG dataset, and then a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and face scrub. Also, the total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.
The model takes a video file as input and outputs the coordinates (in x, y pixels) of the different recognized face features. The output format I chose was JSON, and the resulting data structure is:
[
  {
    time: 0.01,
    landmarks: [
      {
        bottom_lip: [[x0, y0], [x1, y1], ...],
        chin: [[x0, y0], [x1, y1], ...],
        left_eye: [[x0, y0], [x1, y1], ...],
        ...
      }
    ]
  }
]
Each time value (based on the frame rate of the export, which in this case is 10 fps) has a corresponding set of landmarks (facial features), and each facial feature is assigned an array of [x, y] pixel coordinates.
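For example, here is a minimal sketch of drawing one frame of this output onto a canvas (the drawFrame helper and the choice to join each feature's points into a stroked path are mine, not necessarily how the pen does it):

// Draw every landmark feature of a single exported frame as a connected path.
// `frame` is one element of the array shown above.
function drawFrame(ctx, frame) {
  ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
  const features = frame.landmarks[0];
  for (const points of Object.values(features)) {
    ctx.beginPath();
    points.forEach(([x, y], i) => (i === 0 ? ctx.moveTo(x, y) : ctx.lineTo(x, y)));
    ctx.stroke();
  }
}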
Here's the RunwayML interface during the export. The top panel shows the uploaded video, the bottom panel shows the export/preview of the model's output, and the side panel has the model parameters:
I copied the JSON output over to a pen and built out a 10 fps animation using the face landmark coordinates:
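One way to drive playback at the export's 10 fps is to pick the frame whose index matches the elapsed time; here's a sketch of that loop (assuming frames holds the parsed JSON and drawFrame is the helper sketched above):

const FPS = 10; // export frame rate used in RunwayML

function animate(ctx, frames) {
  const start = performance.now();
  function tick(now) {
    // Map elapsed seconds to an index into the exported frames.
    const index = Math.min(Math.floor(((now - start) / 1000) * FPS), frames.length - 1);
    drawFrame(ctx, frames[index]);
    if (index < frames.length - 1) requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
}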
Woo!! I think that's pretty awesome, given how smoothly the whole process went. Note that I did not adjust or study any of the model parameters, so I will explore that next.
A small note if you are new to RunwayML: make sure you download, install and open Docker Desktop if you are running the model locally. RunwayML does give you credits to use a remote GPU to run the model, and I'll be using that this week to run a full video with a higher export frame-rate.
Top comments (1)
To create a desktop application that maps mouth shape coordinates to phonemes using HTML, CSS, and JavaScript, you can leverage Electron, a framework that allows you to build cross-platform desktop applications with web technologies. Here’s a step-by-step guide to create such an application:
Install Node.js: Ensure you have Node.js installed on your system.
Create a Project Directory: Create a directory for your project and navigate to it.
mkdir mouth-phoneme-mapper
cd mouth-phoneme-mapper
Initialize the Project: Initialize a Node.js project.
npm init -y
Install Electron: Install Electron as a development dependency.
npm install electron --save-dev
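For reference, the remaining snippets assume the files below all sit flat in the project root (this layout isn't stated explicitly in the steps; it's just what the code implies):

mouth-phoneme-mapper/
├── package.json
├── main.js
├── preload.js
├── index.html
├── style.css
└── script.js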
main.js: This file will create the main window of your application.
// main.js
const { app, BrowserWindow } = require('electron');
const path = require('path');

function createWindow() {
  const mainWindow = new BrowserWindow({
    width: 800,
    height: 600,
    webPreferences: {
      preload: path.join(__dirname, 'preload.js'),
      nodeIntegration: true,
      contextIsolation: false,
    },
  });

  // Load the app's HTML page into the window.
  mainWindow.loadFile('index.html');
}

app.whenReady().then(() => {
  createWindow();
});

// Quit when all windows are closed, except on macOS.
app.on('window-all-closed', () => {
  if (process.platform !== 'darwin') {
    app.quit();
  }
});
index.html: The main HTML file for your application.
<!DOCTYPE html>
<html>
  <head>
    <title>Mouth-Phoneme Mapper</title>
    <link rel="stylesheet" href="style.css" />
  </head>
  <body>
    <h1>Mouth-Phoneme Mapper</h1>
    <input type="file" id="upload-photo" accept="image/*" />
    <canvas id="photo-canvas"></canvas>
    <div id="phoneme-data"></div>
    <script src="script.js"></script>
  </body>
</html>
style.css: The CSS file for styling your application.
body {
  font-family: Arial, sans-serif;
  text-align: center;
}

#photo-canvas {
  /* Styling for the photo canvas can go here. */
}
script.js: The JavaScript file for the main functionality.
// script.js
document.getElementById('upload-photo').addEventListener('change', loadPhoto);

async function loadPhoto(event) {
  const file = event.target.files[0];
  if (!file) return;

  // Draw the uploaded photo onto the canvas once it has loaded.
  const img = new Image();
  img.src = URL.createObjectURL(file);
  img.onload = () => {
    const canvas = document.getElementById('photo-canvas');
    const ctx = canvas.getContext('2d');
    canvas.width = img.width;
    canvas.height = img.height;
    ctx.drawImage(img, 0, 0);
  };

  // Sample phoneme data (same structure as the gentle output in the post).
  const phonemeData = [
    { "phoneme": "l_B", "duration": 0.09, "mouth_shape": [[30, 50], [32, 52]] },
    { "phoneme": "eh_I", "duration": 0.09, "mouth_shape": [[35, 55], [37, 57]] },
    { "phoneme": "t_E", "duration": 0.04, "mouth_shape": [[40, 60], [42, 62]] }
  ];

  // Display the phoneme data below the canvas.
  const phonemeDiv = document.getElementById('phoneme-data');
  phonemeData.forEach(p => {
    const pElement = document.createElement('p');
    pElement.textContent =
      `Phoneme: ${p.phoneme}, Duration: ${p.duration}, Mouth Shape: ${JSON.stringify(p.mouth_shape)}`;
    phonemeDiv.appendChild(pElement);
  });
}
preload.js: This script runs in the renderer process before the page loads.

// preload.js
window.addEventListener('DOMContentLoaded', () => {
  // Helper for replacing an element's text by id.
  // Nothing calls it yet; it's a placeholder for later UI updates.
  const replaceText = (selector, text) => {
    const element = document.getElementById(selector);
    if (element) element.innerText = text;
  };
});
package.json: Add a start script so the app can be launched with npm.

{
  "name": "mouth-phoneme-mapper",
  "version": "1.0.0",
  "main": "main.js",
  "scripts": {
    "start": "electron ."
  },
  "devDependencies": {
    "electron": "latest"
  }
}
Run the Application: Start the app with Electron.
npm start
Integrate Facial Landmark Detection: Use a library like face-api.js or create a backend with Python to process the image and return the landmarks (a rough sketch with face-api.js is included at the end of this comment).
Map Landmarks to Phonemes: Add logic to map the detected facial landmarks to the phonemes based on the provided data.
Improve the User Interface: Enhance the UI for a better user experience, possibly adding more interactive features.
By following these steps, you can create a basic Electron application that allows users to upload their photo and see the mapping of mouth shapes to phonemes using HTML, CSS, and JavaScript.
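As a rough sketch of how steps 7 and 8 might start in the renderer with face-api.js (the '/models' path is an assumption, and reusing the detected mouth points as a phoneme's mouth_shape is just one possible mapping; check the face-api.js docs for the exact setup):

// Detect face landmarks in the uploaded photo with face-api.js and
// pull out the mouth points to use as a mouth shape.
// Assumes face-api.js is loaded and its model files are served from '/models'.
async function detectMouthShape(imgElement) {
  await faceapi.nets.ssdMobilenetv1.loadFromUri('/models');
  await faceapi.nets.faceLandmark68Net.loadFromUri('/models');

  const detection = await faceapi
    .detectSingleFace(imgElement)
    .withFaceLandmarks();
  if (!detection) return null;

  // The 68-point landmark model exposes the mouth as a group of points.
  return detection.landmarks.getMouth().map((pt) => [pt.x, pt.y]);
}

From there, the detected mouth points could replace the hard-coded mouth_shape arrays in the sample phonemeData.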