Audio Processing

How Our Technology Works

Spoken language differs considerably from other sources of information. Audio recordings vary widely in signal-to-noise ratio, bandwidth, audio encoding, language, dialect, speaking style, and the speaker's gender and pitch, to name just a few factors, and each of these poses a challenge. Because speech is redundant, however, processing spoken language with a focus on information retrieval has in the last few years already shown better results than human performance. Creating an error-free transcript remains a challenge, although deep learning has reduced errors tremendously in recent years.


We process spoken language at several levels, using acoustic, lexical, syntactic and semantic models, and we also consider the communication process itself. Our work is inherently multidisciplinary, requiring competence in signal processing, acoustics, phonetics and phonology, linguistics and computer science.


Having an in-house production environment allows us to validate research results on live data immediately. This is important because too many scientists prove their results on artificially created, years-old test beds and lack the expertise and resources to validate them on live data.

Audio processing consists of four main steps: (1) audio segmentation, (2) speech recognition, (3) speaker recognition and (4) speaker clustering.
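
To make the flow between these four steps concrete, here is a minimal Python sketch of how they might be chained. It is an illustration only, not a description of our production code: the `Segment` record, the `process_audio` function and the four callables passed into it are all hypothetical names, and the toy stand-ins at the bottom return placeholder values rather than real results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One stretch of audio (hypothetical record, for illustration only)."""
    start: float                      # start time in seconds
    end: float                        # end time in seconds
    kind: str                         # e.g. "speech" or "music"
    transcript: Optional[str] = None  # filled in by speech recognition
    speaker: Optional[str] = None     # filled in by speaker recognition
    cluster: Optional[int] = None     # filled in by speaker clustering


def process_audio(audio, segment, transcribe, identify, cluster):
    """Chain the four steps; each step is supplied as a callable."""
    segments = segment(audio)                          # (1) audio segmentation
    speech = [s for s in segments if s.kind == "speech"]
    for s in speech:
        s.transcript = transcribe(audio, s)            # (2) speech recognition
        s.speaker = identify(audio, s)                 # (3) speaker recognition
    for s, cluster_id in zip(speech, cluster(audio, speech)):
        s.cluster = cluster_id                         # (4) speaker clustering
    return segments


# Toy stand-ins so the sketch runs; real components would replace them.
result = process_audio(
    audio=None,
    segment=lambda audio: [Segment(0.0, 4.0, "music"),
                           Segment(4.0, 9.5, "speech")],
    transcribe=lambda audio, s: "placeholder transcript",
    identify=lambda audio, s: None,   # no enrolled voice matched
    cluster=lambda audio, speech: list(range(len(speech))),
)
for s in result:
    print(s)
```

Only segments marked as speech are passed on to the later steps, which mirrors the order described above: segmentation decides what the remaining steps get to see.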


Audio Segmentation

When listening to a stream of audio coming in, e.g. from a TV or radio station, the first step in analyzing the data is to segment it according to acoustic characteristics. Is it a person speaking, or a song? Is it a single person speaking for a long time (e.g. reading a novel), or is it a dialog between two or more speakers? Audio segmentation gives us the answers.
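
As a rough illustration of the idea, the following Python sketch splits a signal into runs of frames based on one simple acoustic characteristic, short-time energy. This is a deliberately simplified stand-in and not our segmentation method: telling speech from music or one speaker from another requires much richer features and trained models. The function names, the frame length of 160 samples and the energy threshold of 0.01 are assumptions chosen for the example.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def segment_by_energy(samples, frame_len=160, threshold=0.01):
    """Group consecutive frames with the same label into segments.

    Returns (start_frame, end_frame, label) tuples; end_frame is exclusive.
    """
    segments = []
    for idx, energy in enumerate(frame_energies(samples, frame_len)):
        label = "active" if energy >= threshold else "quiet"
        if segments and segments[-1][2] == label:
            start, _, _ = segments[-1]
            segments[-1] = (start, idx + 1, label)   # extend the current run
        else:
            segments.append((idx, idx + 1, label))   # start a new run
    return segments


# Synthetic signal: 3 quiet frames, 4 frames of a sine tone, 2 quiet frames.
quiet = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
signal = quiet * 3 + tone * 4 + quiet * 2

print(segment_by_energy(signal))
# → [(0, 3, 'quiet'), (3, 7, 'active'), (7, 9, 'quiet')]
```

The same run-grouping pattern applies when the per-frame label comes from a classifier (speech, music, noise) instead of an energy threshold.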

Speech Recognition

For all audio segments in which people speak, we can use speech recognition to obtain a rough transcript of what was said. Our high-accuracy-transcription (HAT) speech engine uses neural networks trained with deep learning algorithms on a cluster of GPUs. These deep learning methods generate best-of-breed results on continuous speech as typically found in broadcast-monitoring applications.

Speaker Recognition & Speaker Clustering

Audio segments can be clustered according to the voice fingerprints of the people speaking. This allows us to analyze an interview and identify which person is speaking in which segment. If you have a voice fingerprint of a person, you can use speaker recognition to label the audio segments in which that person speaks with their name.
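
The difference between the two tasks can be sketched in a few lines of Python. The example assumes that each segment has already been reduced to a fixed-length numeric fingerprint vector; how such fingerprints are computed is not shown here. The greedy clustering strategy, the cosine-similarity measure, the 0.9 threshold, the function names and the enrolled speaker "Alice" are all assumptions made for illustration, not a description of our engine.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_segments(fingerprints, threshold=0.9):
    """Greedy clustering: join the first cluster whose representative is
    similar enough, otherwise open a new cluster. No names are needed."""
    representatives, labels = [], []
    for fp in fingerprints:
        for cluster_id, rep in enumerate(representatives):
            if cosine(fp, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            representatives.append(fp)
            labels.append(len(representatives) - 1)
    return labels


def recognize(fingerprint, enrolled, threshold=0.9):
    """Return the best-matching enrolled name, or None if nobody is close."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine(fingerprint, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy fingerprints for four segments of an interview (made-up numbers).
segments = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.9, 0.2, 0.0], [0.1, 0.9, 0.1]]
enrolled = {"Alice": [1.0, 0.0, 0.0]}  # hypothetical known voice

print(cluster_segments(segments))                  # → [0, 1, 0, 1]
print([recognize(fp, enrolled) for fp in segments])
# → ['Alice', None, 'Alice', None]
```

Clustering groups segments 0 and 2 and segments 1 and 3 without knowing who is speaking; recognition can additionally attach a name, but only for voices that have been enrolled beforehand.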

Find more information about the other technologies used by eMM:

> Video Processing

> Real Time

> Semantic Analysis