YouTube Murals: Painting Topics in YouTube Videos with Natural Language Processing

#a11y #showdev #youtube

Click here for a demo ✨

During my college modern physics class, my group and I did research on the Breakthrough Starshot project. I looked for videos that talked about the project, and thankfully they got the point across in 5 minutes. This video was next on autoplay, and since I'm an Elon Musk superfan, not even a long day's worth of classes could make me shelve it in my infinite display of forgotten Chrome tabs.

I eventually watched it all in one sitting, but I really just wanted to watch the parts I needed. Between group projects, homework, and an hour-long commute home, having a free hour to sit and watch it would be a miracle.

Cool video, but long, right?

No way I had the energy to retain everything that was said. What if I could skip over other parts of the video?

The second time I tried to watch it, I noticed that YouTube includes time-encoded transcripts, and you can view the entire transcript to the right of the video. Because they’re indexed with timestamps, I can search for specific words or phrases and jump to the parts I need.

Time encoded transcripts (not machine generated)

Time encoded search results for the desired topic

Accessibility Implications

What started off as an idea for a creative coding sketch turned into a new art project with enormous implications for YouTube and human cognition. I had a relatively easy way out because I could hear what was said at a given time. This isn’t the case for many others. YouTube demonstrated its commitment to accessibility with closed captions, yet this is just the beginning. I believe there are more approaches that can contribute to this.

“Everyone should be able to access and enjoy the web. We’re committed to making that a reality.” — Google Accessibility.

Making Topics Visible

A bag o’ words approach sounded straightforward and effective: group words associated with a certain topic at a given point in time. The size of each group depends on a specified duration or sentence length. Thankfully, YouTube’s API makes it possible to retrieve captions, and not just noisy auto-generated ones. These come in the form of Timed Text Markup Language (TTML). I’ll demonstrate this using a transcript from Elon Musk’s talk.

TTML v1 snippet from Elon Musk Talk

After running the transcript through an XML parser, I got the following, which made processing much easier:

{'start': '75.936', 'dur': '1.302'} 
ELON MUSK: Thank you. Thank you very
{'start': '77.238', 'dur': '1.887'} 
much for having me. I look forward to
{'start': '79.125', 'dur': '3.547'} 
talking about the SpaceX Mars
{'start': '82.672', 'dur': '2.25'} 
architecture. And what I really want to
{'start': '84.922', 'dur': '3.019'} 
achieve here is to make Mars seem
{'start': '87.941', 'dur': '2.949'} 
possible, make it seem as though it&#39;s
{'start': '90.89', 'dur': '2.32'} 
something that we can do in our
{'start': '93.21', 'dur': '3.11'} 
lifetimes and that you can go. And is
{'start': '96.32', 'dur': '1.76'} 
there really a way that anyone can go if
{'start': '98.08', 'dur': '1.55'} 
they wanted to?
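
As a rough sketch of that parsing step, the snippet below reads caption XML into (start, duration, text) tuples with Python's standard library. It assumes each caption is a `text` element with `start` and `dur` attributes, which is what the output above suggests. The `SAMPLE_XML` string and the `parse_captions` name are made up for illustration, and the tag names in the real caption file may differ.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the parsed output above:
# each <text> element carries "start" and "dur" attributes in seconds.
SAMPLE_XML = """<transcript>
  <text start="75.936" dur="1.302">ELON MUSK: Thank you. Thank you very</text>
  <text start="87.941" dur="2.949">possible, make it seem as though it&amp;#39;s</text>
</transcript>"""


def parse_captions(xml_string):
    """Return a list of (start_seconds, duration_seconds, text) tuples."""
    captions = []
    for node in ET.fromstring(xml_string).iter("text"):
        start = float(node.attrib["start"])
        dur = float(node.attrib.get("dur", 0.0))
        # Caption text can arrive with escaped entities such as &#39;
        text = html.unescape(node.text or "").replace("\n", " ").strip()
        captions.append((start, dur, text))
    return captions


captions = parse_captions(SAMPLE_XML)
```

Unescaping with `html.unescape` turns leftovers like `&#39;` back into an apostrophe before any word counting happens.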

In Natural Language Processing, it’s important to filter out stop words (common words in a specific language) to reduce noise when clustering words by topic. There’s no complete list of stop words, but I used NLTK’s stop words corpus (encoded with unicode). In English, these include pronouns, prepositions, articles, and conjunctions. I removed these words from the text clusters above:

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', 
u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', 
u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', 
u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what',
u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', 
u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', 
u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', 
u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', 
u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', 
u'between', u'into', u'through', u'during', u'before', u'after', 
u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', 
u'off', u'over', u'under', u'again', u'further', u'then', u'once', 
u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', 
u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', 
u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', 
u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', 
u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', 
u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', 
u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', 
u'wouldn', u"'s", u"n't", u"'m", u"'d"]
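
Here is a minimal sketch of the filtering step. To stay runnable without downloading the NLTK corpus, it uses a small hand-picked subset of the list above. The `STOP_WORDS` set, the `remove_stop_words` name, and the simple regex tokenizer are all assumptions for illustration, not necessarily what the project used.

```python
import re

# A small hand-picked subset of the list above, used so the sketch runs
# without downloading the NLTK corpus (an assumption for illustration).
STOP_WORDS = {"i", "and", "what", "to", "is", "it", "as", "that", "we",
              "can", "do", "in", "our", "here", "very", "for", "me", "you",
              "the", "about", "s", "t"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


kept = remove_stop_words(
    "And what I really want to achieve here is to make Mars seem")
```

With NLTK installed and the corpus downloaded, the full list is available through `nltk.corpus.stopwords.words("english")` and could be passed in as `stop_words`.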

Latent Dirichlet Allocation

LDA is a common topic model in Natural Language Processing that discovers topics in a collection of texts, such as sentences or documents. It assumes you already have word counts for each piece of text. It scans the text and finds groups of key words that it uses to learn what the document is about.
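
To make the word-count input concrete, here is a small sketch that turns token lists into (word id, count) pairs. The function names `build_vocabulary` and `to_bag_of_words` and the toy documents are hypothetical; topic modeling libraries ship their own equivalents of this step.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bag_of_words(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())


# Toy documents standing in for stop-word-filtered caption groups.
docs = [["mars", "rocket", "mars"], ["rocket", "engine"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(d, vocab) for d in docs]
```

Word order is thrown away here on purpose: the model only sees how often each word occurs in each group.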

Think of harried high school students who’re preparing for a quiz in English class by reading (or skimming) their assignment for the next class, and making sure they understand what’s going on because their teacher’s on to their Spark Notes use. They’ll highlight words, sentences, paragraphs, or 99 percent of the text in a reading assignment. They’ll pick a few topics that they covered in class, and when they read the next assignment they’ll try to figure out why it fits with the topic. They’ll go through the text over and over to make sure the parts they took notes on fit under that topic.

Just like LDA, a student takes a mental snapshot of their reading notes: for each chapter c and topic category t, they can see which highlighted sections fall under each category, giving a full representation of c. If they’re reading something like a Harry Potter book and have finished chapter c, they might find that the topic distribution is 10% friendship, 50% magic, 10% bravery, and 30% family.

After looking over the document repeatedly, the model returned the probabilities of words under each topic in Musk’s transcript, assuming it’s looking for 5 topics:

LdaModel(num_terms=1016, num_topics=5, decay=0.5, chunksize=100)[(0, 
u'0.041*"engine" + 0.026*"really" + 0.025*"make" + 0.022*"tank" + 
0.016*"rocket" + 0.014*"vehicle" + 0.013*"also" + 0.013*"merlin" + 
0.013*"capable" + 0.012*"because"'), (1, u'0.031*"mars" + 0.028*"use" + 
0.025*"mission" + 0.024*"carbon" + 0.022*"fiber" + 0.022*"liquid" + 
0.018*"thing" + 0.017*"falcon" + 0.016*"day" + 0.015*"very"'), (2, 
u'0.045*"system" + 0.035*"would" + 0.032*"propel" + 0.028*"go" + 
0.026*"time" + 0.024*"solar" + 0.022*"mars" + 0.018*"orbit" + 
0.018*"cost" + 0.016*"greater"'), (3, u'0.034*"first" + 0.027*"station" + 
0.026*"applause" + 0.024*"dragon" + 0.020*"space" + 0.016*"think" + 
0.016*"ton" + 0.015*"mars" + 0.014*"launch" + 0.013*"go"'), (4, 
u'0.028*"booster" + 0.028*"spaceship" + 0.025*"get" + 0.023*"land" + 
0.022*"like" + 0.021*"really" + 0.021*"maybe" + 0.019*"go" + 
0.018*"anywhere" + 0.018*"actual"')]
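
Topic strings in this `weight*"word"` form are easy to turn into data for drawing. The sketch below parses one such string into (word, weight) pairs. The `parse_topic` helper and its regular expression are assumptions for illustration, not part of any library.

```python
import re

# Matches one weight*"word" term in a topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Parse a topic string into a list of (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]


# First three terms of topic 0 from the output above.
topic_0 = parse_topic('0.041*"engine" + 0.026*"really" + 0.025*"make"')
```

Each pair can then be looked up when deciding how strongly a word in a time interval should count toward a topic row.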

Now that we have the estimated topic mixtures from the Musk talk after 50 iterations, here is the mural:

YouTube Mural for “Making Humans a Multiplanetary Species”

Mural Samples 🎨

The mural above was painted assuming there were 5 possible topics in the video, and that words are grouped in 60-second intervals. These settings can be configured to show more or fewer topics, use different time intervals, and so on. Below are more murals with different settings for time intervals, sentence intervals, number of topics, and LDA iterations.

10 topics grouped by 60 second intervals after 10 LDA iterations

10 topics grouped by 5 second intervals after 10 LDA iterations

5 topics grouped by 60 second intervals after 10 LDA iterations

10 topics grouped by 60 words per sentence after 10 LDA iterations
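
The time-interval grouping behind these settings might look something like the sketch below. It assumes captions are available as (start seconds, duration seconds, text) tuples. The `group_by_interval` name is hypothetical, and the third caption is invented so that the example has more than one bucket.

```python
from collections import defaultdict


def group_by_interval(captions, interval_seconds=60.0):
    """Concatenate caption text into buckets of `interval_seconds` each.

    `captions` is an iterable of (start_seconds, duration_seconds, text).
    Returns a list of (bucket_start_seconds, combined_text), ordered by time.
    """
    buckets = defaultdict(list)
    for start, _dur, text in captions:
        buckets[int(start // interval_seconds)].append(text)
    return [(index * interval_seconds, " ".join(buckets[index]))
            for index in sorted(buckets)]


captions = [
    (75.936, 1.302, "ELON MUSK: Thank you. Thank you very"),
    (77.238, 1.887, "much for having me. I look forward to"),
    (125.0, 2.0, "hypothetical later caption"),  # invented for the example
]
groups = group_by_interval(captions)
```

Changing `interval_seconds` to 5 would give the finer-grained grouping used in the 5-second mural.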

Each row in a mural represents a different topic, with the color varying based on the distribution of each word. The length of the mural is always constant, with each mark mapped according to duration (words will share topics, especially common words that didn’t appear in the stop-words corpus). Brighter areas in a topic row show where more words are associated with that topic.
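
One way to picture that mapping is the sketch below, which scales a caption's time span onto a fixed-width mural and a topic weight onto a brightness value. The linear scaling, the 800-pixel width, and both function names are assumptions for illustration; the actual drawing code may work differently.

```python
def caption_to_pixels(start, dur, video_length, mural_width=800):
    """Map a caption's time span onto a mural of constant pixel width."""
    x_start = int(start / video_length * mural_width)
    x_end = int(min(start + dur, video_length) / video_length * mural_width)
    # Always draw at least one pixel so very short captions remain visible.
    return x_start, max(x_end, x_start + 1)


def weight_to_brightness(weight, max_weight):
    """Scale a topic weight linearly to a 0-255 brightness value."""
    if max_weight <= 0:
        return 0
    return round(255 * min(weight / max_weight, 1.0))


# A 25-second mark starting halfway through a 100-second video.
span = caption_to_pixels(50, 25, 100)
```

Because positions are scaled by the video length, a 5-minute video and an hour-long talk both fill the same mural width.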

5 topics grouped by 60 second intervals after 20 LDA iterations

5 topics grouped by 60 second intervals after 50 LDA iterations

5 topics grouped by 60 second intervals after 200 LDA iterations

Murals painted after more iterations typically have brighter spots. As in the high school students example earlier, the students go back and double-check their text to make sure they’re ready for the quiz the next day. The same applies to LDA: the model needs to re-check its assignments over repeated passes to make sure words are grouped correctly.

A YouTube mural used to navigate Musk’s talk.

Future goals for this project include the following:

Update XML parsing to use latest version
Implement this for TED Talks

Conclusions

Color is a powerful stimulus for the brain: it’s what the brain notices and remembers first. Try watching Musk’s talk now and see if you notice any differences in how you pay attention and remember it. If you did notice improvements, imagine what this would do for class lectures, videos for kids, or older users with age-related accessibility needs such as Alzheimer’s.

There are so many people for whom interacting in the physical world is really tough, yet interacting with an accessible web is easy, and it will get easier thanks to technological advances.