Machine Learning and Audio Data

I was excited to see the WSJ feature Shazam. The 6-minute video explains how Shazam works, albeit in simplified terms.

How Shazam Makes Unique Audio Fingerprints to Identify Songs
More than 23,000 songs are identified each minute by Shazam and the app has been used over 70 billion times. But while using it is simple, a complex computation

Shazam is an early example of using machine learning to address a narrow but popular use case: identifying, in near real time, the name of a song that you hear in a store or cafe. The concept is simple, but the machine learning approach behind it is not, even though it resembles the layman's description of "fingerprint matching."

Since Shazam's founding, the techniques for transforming audio and sound data into frequency components using Fourier analysis have advanced. We are now much better at computing the Fourier transform, parallelizing the processing and the similarity-based matching, and doing all of this in near real time by chunking the data in more granular ways.
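To make the idea concrete, here is a minimal sketch of chunked Fourier analysis followed by a similarity comparison. It is not Shazam's actual algorithm: the function names `chunk_peaks` and `similarity`, the chunk size, and the 10 Hz tolerance are all assumptions chosen for illustration, and the "fingerprint" here is simply the loudest frequency in each chunk.

```python
import numpy as np

def chunk_peaks(signal, sample_rate, chunk_size=1024):
    """Split the signal into fixed-size chunks and return the
    dominant frequency (in Hz) of each chunk."""
    n_chunks = len(signal) // chunk_size
    window = np.hanning(chunk_size)  # taper each chunk to reduce spectral leakage
    freqs = np.fft.rfftfreq(chunk_size, d=1.0 / sample_rate)
    peaks = []
    for i in range(n_chunks):
        chunk = signal[i * chunk_size:(i + 1) * chunk_size]
        spectrum = np.abs(np.fft.rfft(chunk * window))
        peaks.append(freqs[np.argmax(spectrum)])
    return np.array(peaks)

def similarity(peaks_a, peaks_b, tolerance_hz=10.0):
    """Fraction of aligned chunks whose dominant frequencies agree
    within tolerance_hz (a hypothetical, deliberately crude score)."""
    n = min(len(peaks_a), len(peaks_b))
    if n == 0:
        return 0.0
    return float(np.mean(np.abs(peaks_a[:n] - peaks_b[:n]) <= tolerance_hz))

# Synthetic one-second test signals (made-up data, not real recordings).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone_a = np.sin(2 * np.pi * 440 * t)
tone_b = np.sin(2 * np.pi * 1000 * t)
rng = np.random.default_rng(0)
noisy_a = tone_a + 0.1 * rng.standard_normal(len(t))

print(similarity(chunk_peaks(tone_a, sample_rate), chunk_peaks(noisy_a, sample_rate)))
print(similarity(chunk_peaks(tone_a, sample_rate), chunk_peaks(tone_b, sample_rate)))
```

Because each chunk is processed independently, the loop is easy to parallelize, and a smaller `chunk_size` gives finer time resolution at the cost of coarser frequency resolution.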

The approach that Shazam uses is now applied in many different use cases. For example, a similar approach is used in voice biometrics and identification, recommendation systems (which look not for the single most similar match but for near neighbors or items in the same cluster), sound, audio, and music generation, and so on.
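The difference between finding the single best match and finding near neighbors can be sketched in a few lines. This is only an illustration: the feature vectors are made up, cosine similarity is one assumed choice of similarity measure among many, and `best_match` and `near_neighbors` are hypothetical helper names.

```python
import numpy as np

def cosine_similarities(query, catalog):
    """Cosine similarity between one query vector and each catalog row.
    Assumes no vector is all zeros."""
    catalog = np.asarray(catalog, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(catalog, axis=1) * np.linalg.norm(query)
    return catalog @ query / norms

def best_match(query, catalog):
    """Identification-style lookup: index of the single most similar item."""
    return int(np.argmax(cosine_similarities(query, catalog)))

def near_neighbors(query, catalog, k=3):
    """Recommendation-style lookup: indices of the k most similar items."""
    sims = cosine_similarities(query, catalog)
    return [int(i) for i in np.argsort(-sims)[:k]]

# Made-up 3-dimensional audio feature vectors, one row per track.
catalog = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

print(best_match(query, catalog))
print(near_neighbors(query, catalog, k=2))
```

An identification task stops at `best_match`, while a recommender would typically drop the exact match and surface the remaining neighbors.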

I am excited to see more advances in applying machine learning to audio data, especially ones that build on other machine learning and technology innovations such as edge computing and federated learning.

Here is a good post about applying machine learning to audio data.
