Jayson DeLancey for Dolby.io

Originally published at dolby.io

Processing Collections of Archival Audio Recordings

Like many, I enjoy listening to a good podcast. One of those that I have enjoyed over the years is Lexicon Valley, a podcast about language that explores the way we speak, read, and write. In a recent episode titled "This Am a Minstrel Stereotype, Right?" the host investigated the use of "Am" in Black American English.

One of these recordings is done in 1940 and the ex-slave is named Irene… talking about how things used to be.

The interview with Irene Williams from Rome, Mississippi was captured without the modern advantages of professional-grade microphones or a studio environment. That speaks to a common media processing task: taking audio created in non-optimal circumstances and repurposing it for new media content. It also raises a second problem: how to do such tasks at scale when dealing with large collections of media.

Let’s look at some tricks for automating that type of workflow with Python and the Dolby.io Media Processing APIs.

Voices Remembering Slavery Collection

The audio recording referenced in the podcast and others like it are part of collections maintained by the Library of Congress.

We can use the Media Processing APIs to get a little insight into the collection. Each record in the collection has a unique identifier. For example, the three-part interview corresponds to these records:

There are another 60+ recordings, each with corresponding audio files that can be downloaded. Generally, the Dolby.io Media Processing APIs work best with uncompressed raw audio files, so I started by taking a look at some of the .wav files.

The URL follows a pattern that looks like this:
https://tile.loc.gov/storage-services/master/afc/afc1940003/afc1940003_afs04016/afc1940003_afs04016a.wav

Build a Processing List

When working with a batch of media, it can be helpful to make a list of the input files, but also to maintain state for processing jobs and their corresponding output files. We can do this with Python and a data file.

For example, we might start with a list of record ids:

afc1984011_afs25745a
afc1941016_afs05500a
afc1984011_afs25745b
afc1941002_afs04778a
afc1984011_afs25750b
afc1941002_afs04777a
afc1984011_afs25659a
...

We can iterate over the list, making a request to the Media Analyze API for each file to get some basic data about the media.

def start_batch_processing():
    jobs = {}
    with open('batch.txt', 'r') as batch:
        for line in batch:
            name = line.strip()
            # Skip comment lines in the batch file
            if name.startswith('#'):
                continue
            # Ids look like afc1940003_afs04016a: a group id, an item id,
            # and a trailing a/b side marker that rstrip removes
            group_id, _item_id = name.split('_')
            dir_name = name.rstrip('ab')
            url = 'https://tile.loc.gov/storage-services/master/afc/{}/{}/{}.wav'.format(group_id, dir_name, name)

            jobs[name] = {
                'url': url,
                'job_id': None,
                'status': "Pending",
                'response': None,
            }

    for name in jobs.keys():
        print("Start Analyzing: {}".format(jobs[name]['url']))
        job_id = post_media_analyze(jobs[name]['url'])
        print(job_id)
        jobs[name]['job_id'] = job_id

    return jobs


Start Analyzing

For the functions post_media_analyze() and get_media_analyze(), you can find sample code using the requests library in the dolbyio/media-api-samples repository on GitHub. There are also samples in other languages.

The media file is given as an input parameter, which can be any world-readable URL such as the one we constructed here from this collection.

import os

import requests

def get_url():
    return 'https://api.dolby.com/media/analyze'

def get_headers():
    return {
        "x-api-key": os.environ['DOLBYIO_API_KEY'],
        "Content-Type": "application/json",
        "Accept": "application/json",
    }

def post_media_analyze(input_url):
    url = get_url()
    headers = get_headers()
    body = {"input": input_url}

    response = requests.post(url, json=body, headers=headers)
    response.raise_for_status()
    return response.json()["job_id"]

def get_media_analyze(job_id):
    url = get_url()
    headers = get_headers()
    params = {"job_id": job_id}

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    data = response.json()
    return data

Track Progress

There are many ways to maintain state, but generally it is a good idea to keep track of the job_id and status of any jobs being initiated. That way, if your media pipeline workflow is interrupted, or you want to resume or retry an existing job, you'll have a record of which jobs were running and, later, where you put the output files. For simplicity, this can even just be JSON data that is written to and read from a file on disk.

import json

def write_jobs(jobs, job_file):
    with open(job_file, 'w') as output:
        output.write(json.dumps(jobs, sort_keys=True, indent=4))

def read_jobs(job_file):
    with open(job_file, 'r') as json_jobs:
        jobs = json.load(json_jobs)
        return jobs

We can use this file between iterations and loop to verify that all of the jobs reach a complete state (i.e., Success or Failed).

import logging
import time

def check_job_status(jobs):
    # Check status until all jobs are complete
    active_jobs = True
    while active_jobs:
        active_jobs = False
        for name in jobs.keys():
            # Pending and Running status indicate the job is still processing
            status = jobs[name]['status']
            if status in {"Pending", "Running"}:
                # Check again to see if there has been a change since
                # the last loop.
                response = get_media_analyze(jobs[name]['job_id'])
                jobs[name]['status'] = response['status']
                if response['status'] in {"Pending", "Running"}:
                    active_jobs = True
                else:
                    # The job is complete, also store the response for later
                    logging.debug(response['status'])
                    jobs[name]['response'] = response

        # Wait a bit and then retry
        time.sleep(5)

    return jobs

Analyzing the Collection

After processing completes for all of the jobs, I have a collection of data about these files that can provide some interesting insights.

To start learning a bit more about the collection, we can gather the data into a pandas DataFrame. If you are not familiar with it, pandas is a Python data analysis library that can be very handy for exploring data. It can also be useful to use an interactive tool like the IPython REPL or a Jupyter notebook during this exploration phase.

import pandas

def get_dataframe(jobs):
    rows = []
    for item in jobs:
        result = jobs[item]['response']['result']
        row = pandas.json_normalize(result, sep='.')
        row['label'] = item
        rows.append(row)

    # DataFrame.append returns a new frame rather than mutating in place,
    # so collect the rows and concatenate them in a single step
    return pandas.concat(rows, ignore_index=True)

df = get_dataframe(jobs)
print(df.columns)

What we've done here is flatten the keys from the JSON response returned by the Media Analyze API so the results are available in a tabular format we can query for some data analysis.

Index(['media_info.container.kind', 'media_info.container.duration',
       'media_info.container.bitrate', 'media_info.container.size',
       'media_info.audio.codec', 'media_info.audio.bit_depth',
       'media_info.audio.channels', 'media_info.audio.sample_rate',
       'media_info.audio.duration', 'media_info.audio.bitrate',
       'audio.clipping.num_sections', 'audio.loudness.measured',
       'audio.loudness.range', 'audio.loudness.gating_mode',
       'audio.loudness.sample_peak', 'audio.loudness.true_peak',
       'audio.bandwidth', 'audio.noise.snr_average',
       'audio.noise.level_average', 'audio.music.percentage',
       'audio.music.num_sections', 'audio.other.percentage',
       'audio.other.num_sections', 'audio.speech.percentage',
       'audio.speech.num_sections', 'audio.silence.percentage',
       'audio.silence.num_sections'],
      dtype='object')

Total Duration

As an example, we might be curious how much analyzing a collection like this would cost. Using the pricing page we can come up with an estimate based on the total duration of the media being analyzed.

# Sum the file durations (in seconds), convert to minutes,
# and multiply by the per-minute price
total_cost = df['media_info.audio.duration'].sum() / 60 * .05

It turns out the collection has over 450 minutes of content, which would take a long time if we had to listen and re-listen to these files manually in a digital audio workstation.

Detecting Audio Anomalies

We might be interested in detecting anomalies. Across an entire collection, we may expect some uniformity when it comes to encoding settings, so any outliers become interesting.

We could look at the bitrate:

> df['media_info.container.bitrate'].value_counts()
1411200    50
705600      8
1411199     4

Or we might check the number of channels:

> df['media_info.audio.channels'].value_counts()
2    54
1     8

We can repeat that pattern for codec, sample_rate, etc.
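As a sketch, the same value_counts() check can be looped over the other encoding-related columns from the index above:

# Check each encoding-related column for unexpected outliers
encoding_columns = [
    'media_info.audio.codec',
    'media_info.audio.sample_rate',
    'media_info.audio.bit_depth',
]
for column in encoding_columns:
    print(df[column].value_counts(), '\n')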

Detecting Issues

When looking at loudness or noise, the values are less discrete so we may be looking for media that exceeds a threshold that matters for our use case.

For example, let's look at noise.

> df['audio.noise.level_average'].describe()
count    62.000000
mean    -65.693226
std      13.105599
min     -89.320000
25%     -75.380000
50%     -62.640000
75%     -53.910000
max     -47.740000

To look at only the files with the most noise present, we can group these continuous values into discrete buckets or bins. For example:

> df['noise.bins'] = pandas.cut(df['audio.noise.level_average'], [-90, -80, -70, -60, -50, -40])
> df['noise.bins'].value_counts()
(-90, -80]    12
(-80, -70]    14
(-70, -60]     7
(-60, -50]    24
(-50, -40]     5

Given that these are archival recordings, it is not unexpected to find a lot of noise, but the data may help us prioritize which media is beyond recovery or where to focus manual workflows.
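For example, a quick filter can pull out the labels of the noisiest recordings, here using a -50 dB cutoff that corresponds to the noisiest bin above:

# Select recordings whose average noise level falls in the noisiest bin
noisiest = df[df['audio.noise.level_average'] > -50]
print(noisiest['label'].tolist())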

Reducing Noise in the Collection

After looking at those insights about the collection overall, we may want to pick a few files to improve their quality. We can do this with the Media Enhance API.

Let's use the same batching mechanism to focus on a subset of files. We need to modify it a little bit to also account for where to send the enhanced results.

def get_url():
    return 'https://api.dolby.com/media/enhance'

def post_media_enhance(input_url, output_url):
    url = get_url()
    headers = get_headers()
    body = {
        "input": input_url,
        "output": output_url,
        "audio": {
            "speech": {
                "isolation": {
                    "amount": 100
                }
            }
        }
    }

    response = requests.post(url, json=body, headers=headers)
    response.raise_for_status()
    return response.json()["job_id"]

There are a few differences to make note of.

First, we must provide a parameter to the API for where to put the output after processing is complete. This could be a cloud storage location I already have available for my applications, but the requirement is that the URL is something the Dolby.io API can make a PUT request to. The Media Input and Output tutorial goes into more detail about some of those options.
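If you go the cloud storage route, one option is a pre-signed URL. Here is a minimal sketch with boto3, where the bucket and key names are hypothetical placeholders:

import boto3

# Generate a pre-signed URL that the Enhance API can PUT the result to.
# The bucket and key here stand in for your own storage.
s3 = boto3.client('s3')
output_url = s3.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'my-media-bucket', 'Key': 'afc1940003_afs04011a.enhanced.wav'},
    ExpiresIn=3600,  # the URL remains valid for one hour
)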

For this project, I used the Dolby.io Media Output API. By specifying an output URL such as dlb://out/afc1940003_afs04011a.enhanced.wav, the file is available for me to download and can be correlated with the input file sharing the same item id.
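As a sketch, those output fields can be derived from each record id when building the jobs dictionary; the naming convention here is my own choice:

# Derive matching output locations from the record id so the input
# and output files stay correlated
output_path = '{}.enhanced.wav'.format(name)
output_url = 'dlb://out/{}'.format(output_path)

jobs[name]['output_path'] = output_path
jobs[name]['output_url'] = output_url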

Second, the Media Enhance API will do its best to improve content through an intelligent noise management approach, balancing noise reduction and speech isolation. With badly damaged recordings like these, turning speech isolation up to 100 gives better results. Once I determined the right tuning on a few sample files, the same settings could be applied to the entire collection, which was created using similar production approaches.

Once a job is kicked off, I'm monitoring the job with a collection of details in a local file on disk like this:

{
    "afc1984011_afs25745a": {
        "job_id": "43dc001d-44ce-4694-b8d2-b72974f4ba81",
        "output_path": "afc1984011_afs25745a.enhanced.wav",
        "output_url": "dlb://out/afc1984011_afs25745a.enhanced.wav",
        "response": null,
        "status": "Pending",
        "url": "https://tile.loc.gov/storage-services/master/afc/afc1984011/afc1984011_afs25745/afc1984011_afs25745a.wav"
}

Downloading Output

If you used your own storage, the processed result may already be sitting in an S3 bucket or other service, ready for your review. To pick up the processed result from the Dolby.io Media Output API, I need to download the file from the /media/output endpoint.

import shutil

def get_media_output(output_url, local_output_path):
    url = 'https://api.dolby.com/media/output'
    headers = get_headers()
    args = {
        "url": output_url
    }

    with requests.get(url, params=args, headers=headers, stream=True) as response:
        response.raise_for_status()
        response.raw.decode_content = True
        print("Downloading {0}".format(output_url))
        with open(local_output_path, "wb") as output_file:
            shutil.copyfileobj(response.raw, output_file)

...

for name in jobs.keys():
    ...
    if jobs[name]['status'] == "Success":
        get_media_output(jobs[name]['output_url'], jobs[name]['output_path'])

Conclusion

Following this podcast episode, one of the audio engineering communities picked up on the story and processed some files from this collection manually. The community spirit of crowd-sourcing this effort led a few more to jump in and try various tools to clean up the audio.

When faced with a large collection of media in a task like this, being able to take advantage of the horizontal scaling afforded by an API becomes very powerful. Instead of listening to 450 minutes of audio and editing it manually, the elasticity of the cloud allowed me to process the entire batch with very good results while letting the algorithms do the hard work.

I hope some of these Python samples help you get started automating your own media workflows for improving and getting insights into collections of audio.

References

Lomax, John A., et al. Interview with Irene Williams, Rome, Mississippi, October. Rome, Mississippi, 1940. PDF. Retrieved from the Library of Congress.
