How Things Work
This marks the start of what I hope will be an ongoing series. Throughout this series, we will explore the inner workings of various technologies. While we won't delve too deeply into technical details, the goal is to provide an accessible and inquisitive understanding of how things function.
Shazam
Have you ever found yourself humming along to a catchy tune without being able to put a name to the song? That's where music recognition apps come to the rescue, and one of the most renowned ones is Shazam. Shazam can identify music based on a short sample captured through the device's microphone. It typically takes about 5 seconds to locate a match within its extensive music database, even in noisy environments.
Originally created by London-based Shazam Entertainment, Shazam gained further prominence when it was acquired by Apple Inc. in 2018 for a reported sum of about $400 million.
How It Works
Shazam is reported to have around 11 million tracks stored in its database, and yet, it takes only a few seconds to find a match. How does this happen?
Shazam creates an audio fingerprint for every audio file in its database and also creates one for the sample audio to be compared. What is an audio fingerprint? It is a compact representation of an audio signal, created by extracting the relevant features and patterns of the audio content that make it distinct.
The fingerprint of the sample audio is compared against the fingerprints of all the audio files in the database to find matches. It is possible for two audio files to have similar fingerprints but be different, so each match is evaluated for correctness.
There are some guiding principles for creating these fingerprints.
- Temporal Locality: The fingerprint should only consider the music that's happening around a specific moment in time. It shouldn't be influenced by far-off parts of the music. Imagine you're listening to a song, and you want to identify a particular saxophone solo in that song. Temporal locality in this context means that you're pinpointing the exact moment in the song when the saxophone solo starts and ends. You're not concerned with events happening far before or after the solo; you're specifically interested in that solo's duration and the sounds within that time frame.
- Translation-Invariance: If the same audio appears in different parts of a file, the fingerprint should still be the same. It doesn't matter where in the file that music is.
- Robustness: Even if the audio is a bit degraded or low quality, the fingerprint should still work. It should be able to handle copies of the music that aren't perfect.
- Entropy: The fingerprint should be complex enough to minimize the chance of falsely thinking two pieces of music are the same when they're not.
First, a spectrogram is created: a visual representation of audio that shows time, frequency, and amplitude on a single graph. To make the recognition algorithm robust to noise and distortion, points on the spectrogram with higher amplitude (loudness) than their neighbouring points are selected as peaks. The spectrogram is then reduced to a sparse set of time-frequency coordinates called a constellation map, which discards the amplitude information but retains the essential features.
(Image credit: http://insidebigdata.com)
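To make this more concrete, here is a rough Python sketch of how one might compute a spectrogram and reduce it to a constellation map using NumPy and SciPy. The function name, window size, neighborhood size, and amplitude threshold are illustrative choices of mine, not Shazam's actual parameters.

```python
import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def constellation_map(samples, sample_rate, neighborhood=20, amp_threshold=1e-6):
    """Reduce raw audio to a sparse set of (time_bin, freq_bin) peak coordinates.

    `neighborhood` and `amp_threshold` are illustrative tuning knobs,
    not values from the Shazam paper.
    """
    # Short-time Fourier transform: rows are frequency bins, columns are time bins
    freqs, times, spec = signal.spectrogram(samples, fs=sample_rate, nperseg=1024)

    # A point is a peak if it is the loudest value in its local neighborhood
    local_max = maximum_filter(spec, size=neighborhood) == spec
    peak_coords = np.argwhere(local_max & (spec > amp_threshold))

    # Keep only the coordinates; the amplitude information is discarded here
    return sorted((int(t), int(f)) for f, t in peak_coords)
```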
To match a piece of audio, the constellation map of that audio is overlaid on the map of a song in the database and slid along it. When a significant number of dots align, a match has been found. Because only a small set of points needs to line up, this method can rapidly identify a sample in a vast database, even in the presence of noise or when some audio features are missing.
However, direct matching from constellation maps can be slow. Therefore, researchers have developed a faster method using combinatorial hashing. Fingerprint hashes are generated from the constellation map by pairing up time-frequency points in a specific way. For a given point (anchor point) in the audio signal, other points (target points) ahead in time are selected, typically within a limited time window. These selected pairs create a combinatorial association of features.
The frequency values of these pairs are mapped to 10-bit integers, while the time differences are encoded as 12-bit integers. This encoding results in a single 32-bit integer hash for each pair. These hashes effectively summarize the audio content within the paired segments.
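Continuing the sketch, each anchor peak can be paired with a few peaks ahead of it, and each pair packed into a 32-bit integer roughly as described: 10 bits per frequency and 12 bits for the time difference. The fan-out value and exact bit layout below are my own illustrative choices, and the coordinates are assumed to be the spectrogram bin indices produced by the function above.

```python
def fingerprint_hashes(peaks, fan_out=5):
    """Turn a constellation map into a list of (hash, anchor_time) pairs.

    Each hash packs (anchor_freq, target_freq, time_delta) into 32 bits
    as 10 + 10 + 12 bits. Inputs are (time_bin, freq_bin) integer pairs.
    """
    hashes = []
    peaks = sorted(peaks)  # order anchors by time
    for i, (t1, f1) in enumerate(peaks):
        # Pair the anchor with the next few peaks in its "target zone"
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            h = (f1 & 0x3FF) << 22 | (f2 & 0x3FF) << 12 | (dt & 0xFFF)
            hashes.append((h, t1))
    return hashes
```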
Once these hashes are generated, they can be used to search and identify audio content in a database. The hashes, along with associated metadata such as the track ID and the time of occurrence, are stored in a database for each known audio track. Each audio track is represented by a set of these hash values in the database.
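A minimal (hypothetical) way to organize such a database in Python is an inverted index mapping each 32-bit hash to the tracks and times where it occurs; a production system would use a disk-backed store, but the lookup idea is the same.

```python
from collections import defaultdict

class FingerprintIndex:
    """Toy in-memory index: hash -> list of (track_id, time_in_track)."""

    def __init__(self):
        self.index = defaultdict(list)

    def add_track(self, track_id, hashes):
        # `hashes` is the output of fingerprint_hashes() for the full track
        for h, t in hashes:
            self.index[h].append((track_id, t))
```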
To find a match for an audio sample, the same process of generating hashes is followed for the sample. The generated hashes are compared against the hashes stored in the database. This comparison is done efficiently because the hash values are 32-bit integers, which are computationally lightweight to compare. A match is identified when there's a significant number of matching query hashes that align well in terms of time with the hashes in a specific audio track in the database. This alignment indicates a likely match between the query and that particular audio track. To confirm the match, the associated time offsets and track IDs of the matching hashes are examined. A high concentration of matching hashes at similar time offsets and track IDs confirms the correctness of the match.
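Sticking with the same toy index, the matching step can be sketched as counting, for each candidate track, how many hashes line up at a consistent time offset (track time minus sample time) and picking the track with the largest cluster. The minimum score below is a made-up placeholder, not a value Shazam uses.

```python
from collections import Counter

def best_match(index, sample_hashes, min_score=20):
    """Return (track_id, score) for the track whose hashes align best in time."""
    offset_votes = Counter()
    for h, sample_time in sample_hashes:
        for track_id, track_time in index.index.get(h, []):
            # Hashes from the true track cluster around the same offset
            offset_votes[(track_id, track_time - sample_time)] += 1

    if not offset_votes:
        return None
    (track_id, _offset), score = offset_votes.most_common(1)[0]
    return (track_id, score) if score >= min_score else None
```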
Shazam has evolved since it last revealed its audio search algorithm, but we've provided a basic understanding of how it works, encompassing concepts like audio fingerprints and combinatorial hashing.
References:
https://en.wikipedia.org/wiki/Shazam_(application)
https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf