
Low WER training pipeline in torchaudio with wav2letter #913

Closed
vincentqb opened this issue Sep 18, 2020 · 12 comments

@vincentqb (Contributor)

torchaudio is targeting speech recognition as a full audio application (internal). Along this line, we implemented a wav2letter pipeline to obtain a low character error rate (CER). We want to expand on this and showcase a new pipeline that also has a low word error rate (WER). To achieve this, we are considering the following additions to torchaudio, from highest to lowest priority.

Token Decoder: Add a lexicon-constrained beam search algorithm, based on fairseq (search class, sequence generator) since it is torchscriptable.
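To make the idea concrete, here is a toy prefix beam search in plain Python; it omits the lexicon constraint, LM fusion, and torchscript considerations, and a real decoder would merge prefix scores with logsumexp rather than taking a max:

```python
import math
from collections import defaultdict

def prefix_beam_search(log_probs, beam_size=8, blank=0):
    """Toy prefix beam search over per-frame log-probabilities.

    `log_probs` is a (time x vocab) list of lists, with index `blank`
    reserved for the CTC blank. A real decoder would merge prefix
    scores with logsumexp, track blank/non-blank endings, and apply
    the lexicon/LM constraints mentioned above.
    """
    beams = {(): 0.0}  # prefix (tuple of token ids) -> best log score
    for frame in log_probs:
        candidates = defaultdict(lambda: -math.inf)
        for prefix, score in beams.items():
            for token, lp in enumerate(frame):
                if token == blank or (prefix and prefix[-1] == token):
                    new_prefix = prefix  # blank / repeated token: no emission
                else:
                    new_prefix = prefix + (token,)
                candidates[new_prefix] = max(candidates[new_prefix], score + lp)
        # Keep only the `beam_size` best-scoring prefixes.
        beams = dict(sorted(candidates.items(), key=lambda kv: -kv[1])[:beam_size])
    return max(beams.items(), key=lambda kv: kv[1])
```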

Acoustic Model: Add a transformer-based acoustic model, e.g. speech-transformer, comparison.
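As a rough sketch (not the actual speech-transformer architecture; the class name and sizes below are made up), such a model can be a feature projection followed by torch.nn.TransformerEncoder layers and a token classifier:

```python
import torch
from torch import nn

class TinyTransformerAM(nn.Module):
    """Sketch of a transformer-based acoustic model (made-up sizes,
    not the speech-transformer paper's architecture)."""

    def __init__(self, num_features=80, num_tokens=29,
                 d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.proj = nn.Linear(num_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_tokens)

    def forward(self, features):
        # features: (time, batch, num_features) filterbank frames
        x = self.encoder(self.proj(features))
        return self.classifier(x)  # (time, batch, num_tokens) logits

logits = TinyTransformerAM()(torch.randn(100, 2, 80))
```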

Language Model: Add KenLM to use a 4-gram language model based on LibriSpeech Language Model, as done in paper.
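For reference, scoring a hypothesis with KenLM's Python bindings looks roughly like this; the model path is a placeholder for a 4-gram model trained on the LibriSpeech LM corpus:

```python
import kenlm  # KenLM Python bindings

# "4-gram.arpa" is a placeholder for a model trained on the
# LibriSpeech LM corpus.
lm = kenlm.Model("4-gram.arpa")

# Total log10 probability of a tokenized hypothesis, including the
# begin/end-of-sentence transitions; useful for rescoring beam
# search candidates.
print(lm.score("the cat sat on the mat", bos=True, eos=True))
```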

Training Loss: Add the RNN Transducer loss to replace the CTC loss in the pipeline.
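For context, this is the CTC loss call being replaced (shapes are illustrative); an RNN Transducer loss would instead consume joiner outputs with an extra target-length axis:

```python
import torch

# T time steps, N batch, C classes (blank at index 0), S target length.
T, N, C, S = 50, 4, 29, 12
ctc_loss = torch.nn.CTCLoss(blank=0)

logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()

# An RNN Transducer loss would instead take joiner outputs with an
# extra target axis, typically shaped (N, T, S + 1, C), plus the same
# two length tensors.
```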

Transformations: SpecAugment is already available in the wav2letter pipeline.
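Concretely, the masking half of SpecAugment can be composed from transforms torchaudio already ships (time warping is left out here):

```python
import torch
import torchaudio

# The masking half of SpecAugment from existing torchaudio transforms;
# time warping is left out. Input is one second of fake 16 kHz audio.
waveform = torch.randn(1, 16000)
spec = torchaudio.transforms.MelSpectrogram(sample_rate=16000)(waveform)
spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)(spec)
spec = torchaudio.transforms.TimeMasking(time_mask_param=30)(spec)
```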

See also internal

cc @astaff @dongreenberg @cpuhrsch

@scarecrow1123

Hey @vincentqb, these planned additions look great and useful! Can you clarify the points below, please?

1. Are these open for community contributions, given the internal links you've included in the description above?

2. For things like the transducer loss and token decoder, are you planning to add other libraries (fairseq, warp-transducer, etc.) as dependencies of the example application? Or would they be self-contained, as was done for the wav2letter implementation in Example pipeline with wav2letter #632?

@vincentqb (Contributor, Author)

Hey, thanks for commenting :)

1. For now, we are looking for high-level feedback on the plan. This has not yet been broken down into tasks for the community :)

2. We aim to keep the implementations self-contained within torchaudio as much as possible.

Thoughts?

@Edresson

@vincentqb Wouldn't it be interesting to create a torchaudio_ASR repository?

It could support various things: an independent feature-preprocessing step that saves the features as .pt files, so we can extract any feature with torchaudio and easily support features from fairseq (like wav2vec).

I feel that ASR lacks a torchaudio-based repository that is modular enough to accept new features, and that makes it simple for people to add new models beyond those implemented in torchaudio (currently only wav2letter). I believe the community would like this idea and contribute to the repository.

@vincentqb (Contributor, Author)

> @vincentqb Wouldn't it be interesting to create a torchaudio_ASR repository?

My first step is to understand what would be missing in torchaudio to serve the ASR community best :) Can you provide some examples?

> It could support various things: an independent feature-preprocessing step that saves the features as .pt files, so we can extract any feature with torchaudio and easily support features from fairseq (like wav2vec).

torch.save can save any nn.Module and tensors. Is that what you meant?
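For instance, a minimal cache-and-reload round trip (filenames are placeholders):

```python
import torch
import torchaudio

# Extract a feature once and cache it to disk; filenames are placeholders.
waveform, sample_rate = torchaudio.load("utterance.wav")
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate)(waveform)
torch.save(mfcc, "utterance.pt")

# Later, e.g. inside a dataset's __getitem__:
mfcc = torch.load("utterance.pt")
```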

> I feel that ASR lacks a torchaudio-based repository that is modular enough to accept new features, and that makes it simple for people to add new models beyond those implemented in torchaudio (currently only wav2letter). I believe the community would like this idea and contribute to the repository.

Are there models you would like to contribute to torchaudio? :)

@Edresson

@vincentqb I would contribute the Jasper model. Right now I'm short on time, so maybe soon I can send a PR :).

If that list of things is added to torchaudio, it will be very good for ASR :). It will be even simpler to build a great pipeline using only torchaudio :).

My initial suggestion is to keep notebooks external. Why not make a torchaudio_ASR repository? That way, some things would not need to be implemented in torchaudio itself, but in this new repository. There we could extract features ahead of time, save them with torch.save, and write a generic dataloader that reads them back; this would make it easy to support new features like wav2vec. To support a new extraction method independent of torchaudio, one would just write a new preprocessing class.
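Something like this is what I have in mind (a rough sketch; paths and the MFCC feature choice are just illustrative, and a wav2vec extractor could be swapped in the same way):

```python
import os
import torch
import torchaudio

def preprocess(wav_dir, feat_dir):
    """Extract features once and cache them as .pt files.
    MFCC is just an example; a wav2vec extractor could be swapped in."""
    os.makedirs(feat_dir, exist_ok=True)
    extract = torchaudio.transforms.MFCC()
    for name in os.listdir(wav_dir):
        if name.endswith(".wav"):
            waveform, _ = torchaudio.load(os.path.join(wav_dir, name))
            feats = extract(waveform)
            torch.save(feats, os.path.join(feat_dir, name[:-4] + ".pt"))

class CachedFeatures(torch.utils.data.Dataset):
    """Generic dataset that only reads back the cached features."""

    def __init__(self, feat_dir):
        self.paths = sorted(
            os.path.join(feat_dir, f)
            for f in os.listdir(feat_dir) if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return torch.load(self.paths[idx])
```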

@vincentqb (Contributor, Author)

> @vincentqb I would contribute the Jasper model. Right now I'm short on time, so maybe soon I can send a PR :).

Great! Feel free to ping me when you do :)

> If that list of things is added to torchaudio, it will be very good for ASR :). It will be even simpler to build a great pipeline using only torchaudio :).

> My initial suggestion is to keep notebooks external.

We currently offer training examples such as wav2letter using torchaudio. The example I linked shows one way of doing preprocessing. Does that help?

> Why not make a torchaudio_ASR repository? That way, some things would not need to be implemented in torchaudio itself, but in this new repository. There we could extract features ahead of time, save them with torch.save, and write a generic dataloader that reads them back; this would make it easy to support new features like wav2vec. To support a new extraction method independent of torchaudio, one would just write a new preprocessing class.

Our goal with torchaudio is to provide flexible building blocks for audio-related fields, such as ASR. As such, we want to make sure we capture what would be useful to the community, and to ASR. Can you provide an example of your suggested workflow?

@Edresson

> We currently offer training examples such as wav2letter using torchaudio. The example I linked shows one way of doing preprocessing. Does that help?

I had not seen this example; it's very good :)

> Our goal with torchaudio is to provide flexible building blocks for audio-related fields, such as ASR. As such, we want to make sure we capture what would be useful to the community, and to ASR. Can you provide an example of your suggested workflow?

Now that I've seen the example: are you thinking about supporting wav2vec?

The easiest way I see to add this support is to change the example, separating the feature extraction (MFCC / waveform) from the model training: basically a preprocess.py that extracts the features and saves them with torch.save, so the main script only reads the saved files.

Do you know a simpler way to integrate wav2vec with torchaudio?

@vincentqb (Contributor, Author)

Adding an example workflow with wav2vec would be a great addition! I see you have already mentioned Jasper in a comment, so let's move that discussion there :)

> Now that I've seen the example: are you thinking about supporting wav2vec?
>
> The easiest way I see to add this support is to change the example, separating the feature extraction (MFCC / waveform) from the model training: basically a preprocess.py that extracts the features and saves them with torch.save, so the main script only reads the saved files.
>
> Do you know a simpler way to integrate wav2vec with torchaudio?

We currently don't have a pipeline that includes wav2vec, but this would be a great addition. torchaudio is designed to be modular and built on standard PyTorch operations, so the pre-trained tensors from fairseq can be loaded with standard PyTorch tools. Is that what you meant?
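As a rough, hypothetical sketch (the path, the "model" key, and the stub module below are placeholders, not a real wav2vec port):

```python
import torch
from torch import nn

class Wav2vecEncoderStub(nn.Module):
    """Hypothetical stand-in for a re-implemented wav2vec encoder."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 512, kernel_size=10, stride=5)

    def forward(self, waveform):
        return self.conv(waveform)

# "wav2vec.pt" is a placeholder path. Real fairseq checkpoints nest
# the weights (often under a "model" key) with their own parameter
# names, so a key-remapping step is usually needed as well.
checkpoint = torch.load("wav2vec.pt", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)
Wav2vecEncoderStub().load_state_dict(state_dict, strict=False)
```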

@Edresson

> Adding an example workflow with wav2vec would be a great addition! I see you have already mentioned Jasper in a comment, so let's move that discussion there :)
>
> > Now that I've seen the example: are you thinking about supporting wav2vec?
> > The easiest way I see to add this support is to change the example, separating the feature extraction (MFCC / waveform) from the model training: basically a preprocess.py that extracts the features and saves them with torch.save, so the main script only reads the saved files.
> > Do you know a simpler way to integrate wav2vec with torchaudio?
>
> We currently don't have a pipeline that includes wav2vec, but this would be a great addition. torchaudio is designed to be modular and built on standard PyTorch operations, so the pre-trained tensors from fairseq can be loaded with standard PyTorch tools. Is that what you meant?

Do you have any idea how this support would work?
Do you intend to add fairseq as a dependency of torchaudio, or of the example in question?

Or do you want to make it independent of the fairseq structure and just create a class with the wav2vec architecture and load the checkpoint into it?

@vincentqb (Contributor, Author)

@Edresson -- those are great questions, and thanks for sharing your thoughts :)

> Do you intend to add fairseq as a dependency of torchaudio, or of the example in question?
> Or do you want to make it independent of the fairseq structure and just create a class with the wav2vec architecture and load the checkpoint into it?

We do not want torchaudio to depend on fairseq, no. For the example implementation, we also aim to avoid such dependencies as much as possible.

I would aim instead for fairseq to use torchaudio building blocks in some places.

> Do you have any idea how this support would work?

What I meant above about the checkpoint was just that torchaudio uses standard PyTorch, so a user can interact with it through standard PyTorch means. For instance, someone could preprocess a torchaudio dataset with torchaudio, import a model from somewhere else, and then follow a torchaudio example for the training loop.
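A rough sketch of that mix-and-match workflow, where `external_model` is a placeholder for a model imported from another library:

```python
import torchaudio

# torchaudio dataset + torchaudio transform, model from elsewhere.
# `external_model` is a placeholder for the imported model.
dataset = torchaudio.datasets.LIBRISPEECH(
    "./data", url="train-clean-100", download=True)
mel = torchaudio.transforms.MelSpectrogram()

waveform, sample_rate, transcript, *_ = dataset[0]
features = mel(waveform)
# logits = external_model(features)
# ...then follow one of the torchaudio training-loop examples.
```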

Is this what you meant?

@Edresson

> @Edresson -- those are great questions, and thanks for sharing your thoughts :)
>
> > Do you intend to add fairseq as a dependency of torchaudio, or of the example in question?
> > Or do you want to make it independent of the fairseq structure and just create a class with the wav2vec architecture and load the checkpoint into it?
>
> We do not want torchaudio to depend on fairseq, no. For the example implementation, we also aim to avoid such dependencies as much as possible.
>
> I would aim instead for fairseq to use torchaudio building blocks in some places.
>
> > Do you have any idea how this support would work?
>
> What I meant above about the checkpoint was just that torchaudio uses standard PyTorch, so a user can interact with it through standard PyTorch means. For instance, someone could preprocess a torchaudio dataset with torchaudio, import a model from somewhere else, and then follow a torchaudio example for the training loop.
>
> Is this what you meant?

I believe so :). We could, for example, preprocess the dataset with torchaudio, extract features from the audio with wav2vec, and save them with torch.save; afterwards we go back to torchaudio and use the torchaudio ASR models. I suggest saving with torch.save because extracting features with wav2vec every time can be too slow; this way the extraction is done only once. Does that make sense?

@vincentqb (Contributor, Author)

> I believe so :).

Great!

> We could, for example, preprocess the dataset with torchaudio, extract features from the audio with wav2vec, and save them with torch.save; afterwards we go back to torchaudio and use the torchaudio ASR models. I suggest saving with torch.save because extracting features with wav2vec every time can be too slow; this way the extraction is done only once. Does that make sense?

Yup, it does to me :)
