DEV Community

Justin L Beall

Social Learning Journal - Parsing Audiobooks

Since January of 2017, I have been documenting all of my professional learning activities on Twitter. As an established software engineer, I want to share the process it takes to get here. Here is my Twitter account, dev3l_.

I use structured messages for journal-specific events. For example, when I listen to a podcast, I start the Tweet with "Listened to:". I have similar semantic tags for reading books, listening to books, and attending conferences/courses. In addition, I hashtag the Tweets with classification identifiers, such as #agile. Typically, I try to write a few notes about the event. Finally, the event is tagged with a duration using the caret symbol and an amount, such as "^45m" to signify 45 minutes in length.
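To make the message format concrete, here is a minimal sketch of pulling the pieces back out of one entry. The sample text and the `parse_journal_entry` helper are hypothetical, and the duration pattern only covers the minutes form ("^45m") mentioned above.

```python
import re
from typing import Optional

# Hypothetical sample message; the tag, hashtag, and "^45m" duration follow
# the conventions described above, but the wording is made up.
SAMPLE = "Listened to: An example podcast episode\nA few notes. #agile ^45m"


def parse_journal_entry(text: str) -> dict:
    """Split a structured message into tag, title, hashtags, and minutes."""
    # The semantic tag is everything before the first colon.
    tag, _, rest = text.partition(":")
    # The title runs from the tag to the end of the first line.
    title = rest.split("\n")[0].strip()
    hashtags = re.findall(r"#(\w+)", text)
    # Only the minutes form of the duration is handled here.
    duration = re.search(r"\^(\d+)m\b", text)
    minutes: Optional[int] = int(duration.group(1)) if duration else None
    return {"tag": tag.strip(), "title": title, "hashtags": hashtags, "minutes": minutes}


entry = parse_journal_entry(SAMPLE)
```

Keeping the tag, title, and duration in predictable positions is what makes a few string operations enough here.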

At one point, Gary Vaynerchuk had a video called Document, Don't Create. This journal is my attempt to do just that.

New Project Setup

I am building on my previous post, Message bot to find a PS5 on sale. I intend this to be a Python Flask API using MongoDB that will support a ReactJS front end. For today, I want to get CI set up with Travis, coverage using Coveralls, code quality on Code Climate, and the whole thing hosted on Heroku. If anyone has questions on how to set these technologies up, leave a comment below and I will create some more detailed steps.


For today, my goal is to parse a Twitter data archive to create a list of audiobooks I have listened to this year. On Twitter, you can go to Settings and privacy and request an archive download of all your data. I will use this archive as seed data, a starting point. Inside the archive is a tweet.js file that contains every Tweet from the account in JSON format.

The JSON structure of a Tweet in an archive closely matches what the Twitter API returns. By writing the Tweet parser against a static export first, I can reuse it later in a scheduled job that handles Tweets in near real time.
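One detail worth planning for: the filters later in this post read each archive record through a "tweet" key. If a live API payload turns out not to have that wrapper (an assumption on my part, not something verified here), a small adapter keeps the parser reusable. `unwrap_tweet` is a hypothetical helper name.

```python
def unwrap_tweet(record: dict) -> dict:
    """Return the inner Tweet dict whether or not it is wrapped in "tweet"."""
    inner = record.get("tweet")
    return inner if isinstance(inner, dict) else record


# Archive-style record, matching how this post indexes the export.
archive_record = {"tweet": {"full_text": "Listened to: Example"}}
# Assumed shape of a record without the wrapper (made-up data).
api_record = {"full_text": "Listened to: Example"}

same_text = unwrap_tweet(archive_record)["full_text"] == unwrap_tweet(api_record)["full_text"]
```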

In the future, I imagine pulling from data sources outside of Twitter. Any platform with an available API could be used, like GitHub, LinkedIn, and YouTube.

Structured Messages

When I first started doing this, I did not know the importance of structured messages. A few years ago, I went to create an initial prototype and realized it was pretty hard to pull meaningful data out - it took a lot of hand manipulation. Given an appropriate classifier with machine learning, it would have been possible, but for now, it is much easier to just add a little bit of metadata to each message.

Instead of working with tweet.js as a JavaScript file, I removed the leading JavaScript assignment, window.YTD.tweet.part0 =, from the file and saved it as tweet.json. The result can be loaded into Python as a JSON document, and we can start working with it right away.
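The manual edit could also be scripted. The sketch below assumes the file contents are a single JavaScript assignment followed by a JSON array, so everything before the first "[" can be skipped; `load_archive_tweets` is a hypothetical helper and the sample string is made up.

```python
import json


def load_archive_tweets(raw: str) -> list:
    """Parse the contents of tweet.js by skipping the JavaScript assignment."""
    # Everything before the first "[" is the "window.YTD.tweet.part0 = " prefix.
    start = raw.index("[")
    return json.loads(raw[start:])


# Tiny stand-in for the real file contents (made-up data).
raw_js = 'window.YTD.tweet.part0 = [{"tweet": {"full_text": "Listened to: Example"}}]'
tweets = load_archive_tweets(raw_js)
```

With the real archive, `raw` would come from reading tweet.js as text, which avoids keeping a hand-edited copy in sync with each new export.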

import json
with open(DATA_SEED_TWITTER_PATH) as data_seed:
    data = json.load(data_seed)

With three lines of code, I can start manipulating my 4,556 journaled events (the count at the time of this article).

This is a lot of data when all I want is to see the audiobooks for this year. Next, let's filter the data set down to items from this year based on the created_at attribute.

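from dateutil.parser import parse  # assumed source of parse: python-dateutil

def filter_by_this_year(tweet: dict) -> bool:
    # first_of_year and end_of_year are timezone-aware datetimes bounding the year
    created_at = parse(tweet['tweet']['created_at'])
    return first_of_year <= created_at <= end_of_year

tweets_from_this_year = list(filter(filter_by_this_year, data))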
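For a version with the year bounds spelled out, here is a standard-library sketch. It swaps in `datetime.strptime` with an assumed created_at layout (for example "Wed Oct 10 20:19:24 +0000 2018") instead of the parser used above; `year_bounds` and `in_year` are hypothetical names and the sample records are made up.

```python
from datetime import datetime, timezone

# Assumed created_at layout, e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def year_bounds(year: int) -> tuple:
    """Return timezone-aware (start, end) datetimes for a calendar year in UTC."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return start, end


def in_year(tweet: dict, year: int) -> bool:
    """True when the Tweet's created_at falls inside the given year."""
    created_at = datetime.strptime(tweet["tweet"]["created_at"], TWITTER_DATE_FORMAT)
    start, end = year_bounds(year)
    # Half-open interval: the first instant of the next year is excluded.
    return start <= created_at < end


sample = [
    {"tweet": {"created_at": "Wed Oct 10 20:19:24 +0000 2018"}},
    {"tweet": {"created_at": "Sat Feb 01 08:00:00 +0000 2020"}},
]
from_2020 = [t for t in sample if in_year(t, 2020)]
```

The bounds are timezone-aware because the parsed timestamps carry a UTC offset, and Python refuses to compare aware and naive datetimes.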

Now our data is a little bit more manageable at 449 events. Applying another filter, I'll look for the text of "Started listening to:" to whittle it down to the list of the books I am interested in.

def filter_by_audiobook_start(tweet: dict) -> bool:
    text = tweet['tweet']['full_text']
    return "Started listening to:" in text
audio_books_from_this_year = list(filter(filter_by_audiobook_start, tweets_from_this_year))

At this point, I have found 11 books. This makes sense to me as I have an Audible subscription that allows for one book a month. I'm not perfect with my annotations and sometimes log the end of a book without marking the start of the book. So I changed the filter to include "Finished listening to:" and found one other book.

"Started listening to:" in text or "Finished listening to:" in text
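Put together, the widened filter might look like the sketch below. Keeping the tags in a tuple is my own choice so that more tags can be added later; `AUDIOBOOK_TAGS`, `is_audiobook_event`, and the sample records are hypothetical.

```python
AUDIOBOOK_TAGS = ("Started listening to:", "Finished listening to:")


def is_audiobook_event(tweet: dict) -> bool:
    """True when the Tweet text contains any of the audiobook tags."""
    text = tweet["tweet"]["full_text"]
    return any(tag in text for tag in AUDIOBOOK_TAGS)


# Made-up records: two audiobook events and one podcast event.
sample = [
    {"tweet": {"full_text": "Started listening to: Example Book\nnotes"}},
    {"tweet": {"full_text": "Finished listening to: Example Book\nnotes"}},
    {"tweet": {"full_text": "Listened to: Example Podcast\nnotes"}},
]
audiobook_events = list(filter(is_audiobook_event, sample))
```

Note that a book with both a start and a finish entry now appears twice, which is why the next step collects titles into a set.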

Audiobook List

Now that we have identified the list of books, it's time for a bit of string manipulation to get to the title. After the type identifier tag, I put the name of the book followed by a newline. Using a reduce function, we can put these titles into a set and create a unique list of books listened to.

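from functools import reduce

def reduce_book_titles(result: set, tweet: dict) -> set:
    text = tweet['tweet']['full_text']
    # The title sits between the first ":" and the end of that line
    title = text.split(":")[1].split("\n")[0].strip()
    result.add(title)
    return result

audio_book_titles = list(reduce(reduce_book_titles, audio_books_from_this_year, set()))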
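One caveat with splitting on ":" is that a title containing a colon would be cut short. A possible variation, sketched below with made-up sample text, uses `str.partition` so that only the first colon counts, then sorts the unique titles before printing. `extract_title` is a hypothetical helper.

```python
def extract_title(text: str) -> str:
    """Return the text between the first ":" and the end of that line."""
    _, _, after_tag = text.partition(":")
    return after_tag.split("\n")[0].strip()


# Made-up entries: a start/finish pair and a title that contains a colon.
sample_texts = [
    "Started listening to: Example Book\nnotes #agile",
    "Finished listening to: Example Book\nmore notes",
    "Started listening to: Another Title: With a Colon\nnotes",
]

# A set comprehension removes the duplicate; sorting gives a stable order.
titles = sorted({extract_title(text) for text in sample_texts})
for title in titles:
    print(f"  • {title}")
```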

Looping through and printing the titles, I can see that I have the following audiobooks under my belt for the year. The full source to this simple yet powerful script can be found here:

2020 Audiobooks:

  • The Unicorn Project - A Novel About Developers, Digital Disruption, and Thriving in the Age of Data
  • Talking to Strangers
  • Understanding Software - Simplicity, Coding, and How to Suck Less as a Programmer
  • Escaping the Build Trap - How Effective Product Management Creates Real Value
  • Good to Great - Why Some Companies Make the Leap...And Others Don't
  • Agile Conversations - Transform Your Conversations, Transform Your Culture
  • Creativity, Inc. - Overcoming the Unseen Forces That Stand in the Way of True Inspiration
  • Doing Agile Right - Transformation Without Chaos
  • The Pragmatic Programmer
  • The 7 Habits of Highly Effective People - Powerful Lessons in Personal Change
  • Sense & Respond - How Successful Organizations Listen to Customers and Create New Products Continuously
  • The Infinite Game
  • The Art of War
