Since it started in 1999, Shazam has been used to identify songs over fifty billion times, and that’s not even counting the IDs from Soundhound, MusicID, and other sound-recognition apps.
From a user’s perspective, it’s simple: Start the app, press a button, and let your phone listen to the song. After a few seconds, even with background noise and distortion, the app will tell you what the song is. It works so quickly and so well that it almost seems like magic – but, as with most magical things these days, it’s mostly run by algorithms.
What’s the idea behind these apps?
Shazam, Soundhound, and other music-identification services all work basically the same way: they have a big database of song information, an algorithm that can quickly extract information from your song sample, and an app to let you interface with those things. Technically, you don’t even need a smartphone.
Shazam was originally usable on old-fashioned flip phones: you just recorded a song and texted the recording to the service. Soundhound has gone a few steps further by also letting you sing or hum into its app, matching your recording against a user-submitted database of other singing and humming clips.
How do they work?
In simple terms, the process looks like this:
- The app’s database has a massive collection of song “fingerprints,” or small pieces of data about the song’s unique sound patterns.
- When a user hits the “Record” button, the app listens to the music and creates a fingerprint based on the few seconds of audio it hears.
- This fingerprint is checked against the database of existing fingerprints. If your ten-second fingerprint is a match to part of a song, you get your (hopefully correct) song result. If it’s not, you’ll get back an error.
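The three steps above can be reduced to a toy sketch. The fingerprints and song names here are invented for illustration, and a real fingerprint is far more elaborate than a tuple of strings, but the shape of the lookup is the same:

```python
# Toy illustration of the lookup step (hypothetical data, not a real
# fingerprinting scheme): the database maps fingerprints to titles,
# and an unmatched sample produces an error, as in step 3 above.
database = {
    ("fp1", "fp2", "fp3"): "Song A",
    ("fp4", "fp5", "fp6"): "Song B",
}

def look_up(sample_fingerprint):
    # A hit returns the song; a miss returns an error message
    return database.get(sample_fingerprint, "Error: no match found")
```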
If you’re just looking for a surface-level explanation, that’s all you need to know. The really interesting part is how you actually get that fingerprint.
It all starts with a spectrogram, like the ones reproduced in a paper written by one of Shazam’s founders, Avery Wang. This is essentially a graph with time on the x-axis (horizontal), frequency on the y-axis (vertical), and amplitude represented by different levels of color intensity. Any sequence of sounds can thus be converted into a spectrogram, and any point on the spectrogram can be assigned a set of coordinates. Just like that, notes can be numbers.
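To make the idea concrete, here is a minimal spectrogram computed with plain NumPy. The window size, hop length, and 440 Hz test tone are arbitrary choices for illustration; real systems use optimized signal-processing libraries rather than a loop like this:

```python
# A minimal spectrogram sketch (an illustration, not Shazam's actual
# implementation). The signal is cut into short overlapping windows;
# an FFT of each window gives the energy at each frequency, and
# stacking the windows over time yields the time/frequency grid.
import numpy as np

def spectrogram(signal, window_size=1024, hop=512):
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = signal[start:start + window_size] * window
        # Magnitude of the FFT = amplitude at each frequency bin
        frames.append(np.abs(np.fft.rfft(frame)))
    # Rows = time steps, columns = frequency bins
    return np.array(frames)

# Example: one second of a 440 Hz tone sampled at 8 kHz
rate = 8000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(tone)
peak_freq = spec[0].argmax() * rate / 1024  # bin index -> Hz
```

Reading `peak_freq` back out confirms the round trip: the brightest frequency bin of the first window lands within a few hertz of the 440 Hz input tone.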
If all you needed to do was match a few sounds to each other, you could stop here. If you want to look through a database full of millions of songs, though, a full-detail spectrogram has way too many data points to look through at any sort of speed.
The big breakthrough in music recognition was the realization that you can identify sounds with only a few pieces of data: the peaks, or the most intense parts. Stripping out most of a song’s lower-energy parts not only shrinks the spectrogram, but also makes the apps less likely to mistake dull, consistent background noise for part of the target sound. Imagine a city skyline – the most identifiable parts are the tops of buildings, not the middle floors, and that’s what you can see from farthest away.
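A constellation-style peak picker can be sketched in a few lines. The neighbourhood size and threshold here are arbitrary assumptions; production systems select peaks far more carefully:

```python
# Toy peak picker (a sketch, not Shazam's exact method): keep a
# time/frequency point only if it is the strongest value in a small
# neighbourhood of the spectrogram -- the "tops of the buildings".
import numpy as np

def find_peaks(spec, neighborhood=2, threshold=0.0):
    peaks = []
    n_times, n_freqs = spec.shape
    for t in range(n_times):
        for f in range(n_freqs):
            value = spec[t, f]
            if value <= threshold:
                continue
            # Bounds of the local neighbourhood around (t, f)
            t0, t1 = max(0, t - neighborhood), min(n_times, t + neighborhood + 1)
            f0, f1 = max(0, f - neighborhood), min(n_freqs, f + neighborhood + 1)
            if value >= spec[t0:t1, f0:f1].max():
                peaks.append((t, f))  # a "star" in the constellation
    return peaks
```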
So every second of every song is stripped down to just a few of the most intense data points; everything on the city skyline is removed except the very top. But that’s still not quite efficient enough to be immediately searchable, so the next step is to “hash” this sequence of peaks. Hashing simply takes a set of inputs, runs them through an algorithm, and assigns them an integer output. In this case, the hash is generated by taking two of the high-intensity peaks, measuring the time between them, and combining that time gap with the two peak frequencies.
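The pairing-and-hashing step might look like the following sketch. The bit-field widths and fan-out value are illustrative assumptions, not Shazam's actual parameters:

```python
# Sketch of the scheme described above: two peaks give (f1, f2, dt),
# packed into one integer. Field widths are arbitrary choices here.
def hash_pair(f1, f2, dt):
    # 10 bits each for the two frequency bins, 12 bits for the time
    # gap, yielding a single 32-bit fingerprint hash
    return ((f1 & 0x3FF) << 22) | ((f2 & 0x3FF) << 12) | (dt & 0xFFF)

def fingerprint(peaks, fan_out=5):
    """peaks: list of (time, frequency) tuples, sorted by time."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        # Pair each peak with a few of the peaks that follow it
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            hashes.append((hash_pair(f1, f2, t2 - t1), t1))
    return hashes  # (hash, time-offset-within-song) pairs
```

Pairing each peak with several later peaks multiplies the number of hashes, which costs storage but makes the fingerprint much more robust to noise and dropouts.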
The result is a set of integers, easily storable and searchable; when a computer reads one of these hashes, it can unpack it back into two frequencies and a time gap. Once all the peak pairs in the song have been hashed, the transformation is complete: the song is now represented by a large collection of 32-bit numbers, each stored with its time offset, that together serve as the song’s ID in the database. More importantly, every second of the song is covered by some of those numbers.
When your phone hears music, it goes through this exact process: it filters out everything but the highest-intensity points, hashes them, and creates a fingerprint for the few seconds of audio it has recorded. Then your phone just needs to see where the corresponding numbers appear in the database, match the detected frequencies and timings to the correct song, and return the result to you in seconds.
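The database search can be sketched as a voting scheme: each matched hash votes for a (song, time-offset) pair, and a genuine match produces many votes at one consistent offset, because the sample's hashes all come from the same stretch of the song. The data layout and vote threshold below are assumptions for illustration, not Shazam's implementation:

```python
# Toy lookup: the index maps each hash to (song_id, time-in-song)
# entries; a sample matches a song when many of its hashes line up
# with one consistent offset between the sample and the full track.
from collections import Counter, defaultdict

def build_index(songs):
    """songs: {song_id: [(hash, t_song), ...]}"""
    index = defaultdict(list)
    for song_id, hashes in songs.items():
        for h, t in hashes:
            index[h].append((song_id, t))
    return index

def identify(index, sample_hashes, min_votes=2):
    votes = Counter()
    for h, t_sample in sample_hashes:
        for song_id, t_song in index.get(h, []):
            # Same song + same alignment -> same (id, offset) bucket
            votes[(song_id, t_song - t_sample)] += 1
    if not votes:
        return None  # nothing matched: report an error
    (song_id, _), count = votes.most_common(1)[0]
    return song_id if count >= min_votes else None
```

Requiring a minimum number of aligned votes is what keeps a single coincidental hash collision from producing a false match.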
Music and more
This technology has been most widely used for music recognition, but sound recognition apps can also work with movies, commercials, TV shows, bird songs, and more. Shazam and Soundhound are the most well known, but you can also now ask Google what song is playing and get an accurate response.
And if you’re wondering, “Do these companies keep track of which songs get asked about?” the answer is “yes.” Music identification statistics have actually been able to predict the success of songs and artists with a fairly high level of accuracy, and big record labels like Warner have contracted with apps like Shazam to help find up-and-coming artists. So, if you want to support an artist, you may as well do your part and look up their song! You may just help them take off.