🚀 Feature Request: Loading audio data from BytesIO or memory #800

antimora · 2020-07-19T20:26:04Z

🚀 Feature

The load API does not support loading audio bytes from the memory. It would a great addition to be able to load file like object, e.g. BytesIO. This is would be similar to SoundFile's read function (https://github.com/bastibe/SoundFile/blob/master/soundfile.py#L170)

Motivation

This addition will support a use case for reading audio as blobs directly from DB instead writing the files locally first.

Pitch

Without this feature, torchaudio.load is not useful for users who load files from DB and would love to use torchaudio for all audio operations.

Alternatives

SoundFile supports loading from bytes but currently does not support MP3 files. CommonVoice's audio files are saved in MP3, which requires to convert to FLAC or WAV before training.

waveform, samplerate = sf.read(file=io.BytesIO(audio_bytes), dtype='float32')

The text was updated successfully, but these errors were encountered:

mthrok · 2020-07-20T21:34:39Z

Hi @antimora (and @faroit ) cc @vincentqb

Thanks for bring this up. I gave some thoughts to this and did quick research. Let me start with stating where we stand.

torchaudio does not implement encode/decode, instead it uses external libraries for this.

We use libsox or PySoundFile for this.

We cannot pass file object to libsox.

I do not see a way to pass file object (FILE*) to libsox or extract FILE* from file-like object on Python 3. I looked into a way to pass data as stream, but this says that it's not portable on macOS or Windows.

PySoundFile (or libsndfile) does not support MP3.

There is nothing much we can do about it.

Following these, I looked at some Python solution for MP3, and it seems that miniaudio supports in-memory MP3 decoding.

I did not use this library, but if this library provides MP3 decoding, then one should be able to convert it to (via NumPy array) torch Tensor type.

As of adding such functionality to torchaudio, maybe we can add it as a new backend, though I cannot promise. We will need to discuss interface/design if we do so.

However at this moment, we, torchaudio team does not have much bandwidth to actively work on this.

If someone can confirm that miniaudio approach mentioned above works, then we can discuss the possibility to add it to torchaudio.

What do you think?

antimora · 2020-07-21T06:27:13Z

@mthrok , thank you very much for you detailed quick response. This is very helpful.

I agree with you regarding the challenges and limitations of currently used back-ends.

After you have mentioned miniaudio library, I have checked out and I can confirm it perfectly satisfies my use case. Not only I can load MP3 data from memory but I can also down-sample (from 44100 to 16000) on the fly. Also the library seems native and does not spawn a separate process like pydub.AudioSegment. Another bonus is there are no OS dependency, like ffmpeg. miniaudio uses C lib (https://miniaud.io/). I definitely recommend looking into this as a new backend.

For those who wishes to see a working code, here it is:

from pathlib import Path

import matplotlib.pyplot as plt
import torch
from miniaudio import SampleFormat, decode


# get mp3 bytes
audio_bytes = Path('common_voice_en_20603299.mp3').read_bytes()

# decode mp3 bytes, and at the same time downsample and have the output in signed 32 bit integer
decoded_audio = decode(audio_bytes, nchannels=1, sample_rate=16000, output_format=SampleFormat.SIGNED32)

# create tensor out of the audio samples
decoded_audio = torch.FloatTensor(decoded_audio.samples)

# normalize 32 integer bit audio by dividing by 2147483648 (or short hand 1 << 31)
decoded_audio /= (1 << 31)

print('Max:', decoded_audio.max())
print('Min:', decoded_audio.min())
print('Shape:', decoded_audio.shape)
print('Dtype:', decoded_audio.dtype)

# plot to visually verify
plt.plot(decoded_audio.numpy(), linewidth=1)
plt.savefig('miniaudio-16000-normalized.png')

MP3 file: common_voice_en_20603299.zip

Plot output:

vincentqb · 2020-07-21T14:51:33Z

How about minimp3 mentioned here? Would that do what you want? (unless minimp3 and miniaudio are in fact related?)

mthrok · 2020-07-21T17:40:06Z

@antimora

I am glad that it worked well for your needs, and thank you for the code snippet. This is useful for the whole torchaudio user base.

miniaudio seems to have a lot of ideal features. I see that it can be used as argumentation (applying codec to degrade the data) as well, so I will definitely look into it.

antimora · 2020-07-21T18:19:51Z

Thanks, @vincentqb . I think minimp3 could be another good candidate for torchaudio's backend. Unfortunately, a python project with minimp3 binding seems to unmaintained and unreleased: https://github.com/pyminimp3/pyminimp3. So I cannot immediately use it.

It looks like miniaudio uses dr_libs for its decoders:
https://github.com/mackron/dr_libs. And its mp3 decoder is apparently based off minimp3.

Probably dr_libs decoders is more suitable for torchaudio since it does not have extra functionalities, such as playing playing audio on a device.

One note that I wanted to share with others. miniaudio uses Linear Resampling (https://miniaud.io/docs/manual/index.html#ResamplingAlgorithms) as its default resampling algorithm. Does anyone know how much this might affect speech training vs if sinc_interpolation is uses, such as in torchaudio's resample: https://pytorch.org/audio/transforms.html#resample?

sw005320 · 2020-07-21T19:29:11Z

Does anyone know how much this might affect speech training vs if sinc_interpolation is uses, such as in torchaudio's resample: https://pytorch.org/audio/transforms.html#resample?

I compared several resampling methods in some speech recognition tasks a long time ago, but we did not find a large impact. With feature extraction, normalization, and re-training, such difference can be ignored. (it might be a problem if we cannot re-train a model).

antimora · 2020-07-23T05:56:26Z

EDIT (by @mthrok): The memory leak issue mentioned here is fixed at irmen/pyminiaudio@07773a6

UPDATE:

I have discovered decode function in pyminiaudio has serious memory leak, unfortunately. Memory usage blows up during training.

I have filed a bug with pyminiaudio about this: irmen/pyminiaudio#19

I will do more debugging to see if there is a workaround. But currently I am back where I started.

mthrok · 2020-07-23T20:52:13Z

UPDATE:

I have discovered decode function in pyminiaudio has serious memory leak, unfortunately. Memory usage blows up during training.

I have filed a bug with pyminiaudio about this: irmen/pyminiaudio#19

I will do more debugging to see if there is a workaround. But currently I am back where I started.

@antimora

Thanks for sharing your finding. That's very unfortunate.

antimora · 2020-07-23T21:32:44Z

@mthrok and others.

I found a workaround for the memory leak that I described a comment of the bug I have reported: irmen/pyminiaudio#19 (comment). The solution still uses miniaudio's functionality but calls different function. The memory leak appears in pyminiaudio's implementation of decode* functions, which do not release memory.

For those wishing to use pyminiaudio's in memory MP3 decoder, here is a working code which I will be using in my Common Voice training. Note: I have reimplemented mp3_read_f32 function because of the irmen/pyminiaudio#18 bug and it currently does not report sample_rate back to the caller.

import numpy as np
import torchaudio
import array
from pathlib import Path

import matplotlib.pyplot as plt
import torch
from miniaudio import DecodeError, ffi, lib
import resampy
import soundfile as sf

# get mp3 bytes
audio_bytes = Path('common_voice_en_20603299.mp3').read_bytes()


def mp3_read_f32(data: bytes) -> array:
    '''Reads and decodes the whole mp3 audio data. Resulting sample format is 32 bits float.'''
    config = ffi.new('drmp3_config *')
    num_frames = ffi.new('drmp3_uint64 *')
    memory = lib.drmp3_open_memory_and_read_pcm_frames_f32(data, len(data), config, num_frames, ffi.NULL)
    if not memory:
        raise DecodeError('cannot load/decode data')
    try:
        samples = array.array('f')
        buffer = ffi.buffer(memory, num_frames[0] * config.channels * 4)
        samples.frombytes(buffer)
        return samples, config.sampleRate, config.channels
    finally:
        lib.drmp3_free(memory, ffi.NULL)
        ffi.release(num_frames)


decoded_audio, sample_rate, channels = mp3_read_f32(audio_bytes)

assert channels == 1

# TODO handle channels > 1 cases

decoded_audio = np.asarray(decoded_audio)

# Resample to 16000
decoded_audio = resampy.resample(decoded_audio, sample_rate, 16000, axis=0, filter='kaiser_best')

decoded_audio = torch.FloatTensor(decoded_audio)

# Or resample with torchaudio's sinc_interpolation
# resampler = torchaudio.transforms.Resample(sample_rate, 16000)
# decoded_audio = resampler(decoded_audio)

# Scale down to [-1:1] Resampling somehow scales up.
decoded_audio /= decoded_audio.abs().max()

print('Max:', decoded_audio.max())
print('Min:', decoded_audio.min())
print('Shape:', decoded_audio.shape)
print('Dtype:', decoded_audio.dtype)

# plot to visually verify
plt.plot(decoded_audio, linewidth=1)
plt.savefig('mp3_read_f32-16000-torchaudio-normalized-kaiser_best.png')

# test audio quality
sf.write(open('mp3_read_f32.wav', 'wb'), decoded_audio.numpy(), 16000)

mthrok · 2020-07-24T01:36:21Z

Hi @antimora

That's wonderful! I am glad that it's working again for you and thanks for sharing the resolution 😀

vadimkantorov · 2020-07-28T22:04:51Z

Just for information, I'm working on a pure-C ffmpeg audio reading bindings for PyTorch (currently C-side allocates gives back a DLPack tensor): https://github.com/vadimkantorov/readaudio

I'm planning to also support filter graphs (for augmentation) and decoding from byte array.

Of course it's not mature or tested, but if using ffmpeg directly works okay from perf point of view, this simple approach may be okay.

antimora · 2020-07-28T22:45:53Z

@vadimkantorov Thanks for the information and your effort.

I started using miniaudio API to decode which is very fast compared to other python converted. However, when I used resampy or torchaudio's resampling, the conversation ended up being extremely slow (see my previous comment. Luckily miniaudio has resample function which is fast. Uses linear algorithm (https://miniaud.io/docs/manual/index.html#ResamplingAlgorithms) which is the same as torchaudio's linear resampler (

audio/torchaudio/compliance/kaldi.py

Line 890 in ddb8577

def resample_waveform(waveform: Tensor,

) but written in C.

I'll post my working function for in-memory mp3 decoder and resampler (48000 -> 16000). I'll be using to convert CommonVoice MP3 files. Currently it converts 100K mp3 files (4 second long) in 9 mins.

mthrok · 2020-07-28T22:55:13Z

Just for information, I'm working on a pure-C ffmpeg audio reading bindings for PyTorch (currently C-side allocates gives back a DLPack tensor): https://github.com/vadimkantorov/readaudio

I'm planning to also support filter graphs (for augmentation) and decoding from byte array.

Of course it's not mature or tested, but if using ffmpeg directly works okay from perf point of view, this simple approach may be okay.

@vadimkantorov FYI torchvision has FFmpeg binding which can decode video (and audio, I think). One proposal we got is to adopt it to torchaudio once they clean up the installation process.

@antimora If you want a fast resampling, you can also do torchaudio.sox_effects.apply_effects_tensor(effects=["rate", "16000"]) too. (you need master build though)

rbracco · 2020-11-23T14:51:33Z

@antimora Thank you for your work. If you have time can you please share the final code using miniaudio for resampling and decoding? Thank you!

mthrok · 2021-01-07T21:43:07Z

file-like object support has been added in #1158

keunwoochoi · 2021-04-20T06:21:39Z

Thanks @antimora for the fantastic example!
One thing, that code doesn't work when there are more than one channel. And it will silently fail! That's because, the output of miniaudio.decode is always a 1D array and we have to interleave it manually. This can be done like this:

import numpy as np

stereo_audio = np.reshape(np.array(decoded_audio.samples), (-1, 2).T  #  shape: (N,) → (2, N//2)
# fyi: np.reshape(decoded audio, (2, -1)) is wrong!

* Example for combining DDP + RPC Summary: The example includes a simple model consisting of a sparse part and a dense part. The sparse part is an nn.EmbeddingBag stored on a parameter server and the dense part is an nn.Linear module residing on the trainers. The dense part on the trainers are replicated via DistributedDataParallel. A master creates the nn.EmbeddingBag and drives the training loop on the trainers. The training loop performs an embedding lookup via the Distributed RPC Framework and then executes the local dense component. Test Plan: Reviewers: Subscribers: Tasks: Tags: * Address review comments. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Address more comments. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: Co-authored-by: pritam <pritam.damania@fb.com>

antimora changed the title ~~Loading audio data from BytesIO or memory~~ 🚀 Feature Request: Loading audio data from BytesIO or memory Jul 19, 2020

This was referenced Jul 23, 2020

Add sample_rate conversion option to sox_io_backend.load #816

Closed

🚀 Feature Request: Audio mix and degrader #823

Closed

mthrok mentioned this issue Nov 3, 2020

RFC: Fast Audio Loading (from file) #1000

Closed

mthrok mentioned this issue Dec 21, 2020

Support file-like object in info #1108

Merged

mthrok mentioned this issue Jan 7, 2021

Support file-like object in load function #1158

Merged

mthrok closed this as completed Jan 7, 2021

teowenshen mentioned this issue Jan 8, 2022

Support for Online Decoding k2-fsa/icefall#173

Closed

PaulLockett mentioned this issue Feb 5, 2024

TorchAudio constantly fails due to no backend livekit/agents#151

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🚀 Feature Request: Loading audio data from BytesIO or memory #800

🚀 Feature Request: Loading audio data from BytesIO or memory #800

antimora commented Jul 19, 2020

mthrok commented Jul 20, 2020

antimora commented Jul 21, 2020 •

edited

Loading

vincentqb commented Jul 21, 2020

mthrok commented Jul 21, 2020

antimora commented Jul 21, 2020

sw005320 commented Jul 21, 2020

antimora commented Jul 23, 2020 •

edited by mthrok

Loading

mthrok commented Jul 23, 2020

antimora commented Jul 23, 2020

mthrok commented Jul 24, 2020

vadimkantorov commented Jul 28, 2020 •

edited

Loading

antimora commented Jul 28, 2020

mthrok commented Jul 28, 2020 •

edited

Loading

rbracco commented Nov 23, 2020

mthrok commented Jan 7, 2021

keunwoochoi commented Apr 20, 2021

🚀 Feature Request: Loading audio data from BytesIO or memory #800

🚀 Feature Request: Loading audio data from BytesIO or memory #800

Comments

antimora commented Jul 19, 2020

🚀 Feature

Motivation

Pitch

Alternatives

mthrok commented Jul 20, 2020

antimora commented Jul 21, 2020 • edited Loading

vincentqb commented Jul 21, 2020

mthrok commented Jul 21, 2020

antimora commented Jul 21, 2020

sw005320 commented Jul 21, 2020

antimora commented Jul 23, 2020 • edited by mthrok Loading

mthrok commented Jul 23, 2020

antimora commented Jul 23, 2020

mthrok commented Jul 24, 2020

vadimkantorov commented Jul 28, 2020 • edited Loading

antimora commented Jul 28, 2020

mthrok commented Jul 28, 2020 • edited Loading

rbracco commented Nov 23, 2020

mthrok commented Jan 7, 2021

keunwoochoi commented Apr 20, 2021

antimora commented Jul 21, 2020 •

edited

Loading

antimora commented Jul 23, 2020 •

edited by mthrok

Loading

vadimkantorov commented Jul 28, 2020 •

edited

Loading

mthrok commented Jul 28, 2020 •

edited

Loading