In February I took part in the Huggingface robust speech recognition challenge, which covered all languages of the Common Voice dataset (the training and test set of choice, though anyone could use any speech data) and even a few more. I was lucky to notice it was happening at all: I just visited the Common Voice Matrix channel and learned about it there two days before it began. It's a pity that I missed a similar Coqui event that took place in November. Then again, I had other (health) worries at the time.
The event was very educational and very well prepared. The model architecture was Wav2vec2-XLS-R (in versions with 300M, 1B, and 2B parameters), recently added to Huggingface. Particularly outstanding was the effectively unlimited free GPU time from OVH (I used more than 300 euros' worth overall); this doesn't mean you couldn't use your local machine, but basically everyone who asked for GPU time got it.
Training could be done using Jupyter notebooks or plain Python code, both of which are supported by OVH AI Training. I chose a Jupyter notebook, adapted from an earlier example, because my typing speed is limited and, frankly, also because I am a bit lazy.
It went smoothly for me, largely because I already had some experience training speech recognition models, although I knew nothing about Huggingface or the Wav2vec2-XLS-R model. When I learned the trick (in the Discord chat) that gradient accumulation can be turned off in favor of larger batch sizes, training sped up enormously, and I regretted not choosing a larger model than the 300M one. I made several Czech models as well as several non-Czech ones, including Upper Sorbian, but only my best Czech one was among the winners.
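The trick works because the number of samples contributing to each optimizer step is the same whether you accumulate gradients over several small batches or fit one large batch into GPU memory; the large batch just avoids the overhead of the extra forward/backward passes. A minimal sketch (the helper function is my own illustration, not code from the challenge):

```python
# Hypothetical helper: samples contributing to one optimizer update.
def effective_batch_size(per_device_batch: int,
                         accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    return per_device_batch * accumulation_steps * num_gpus

# With gradient accumulation: 4 small batches of 8 per update.
with_accumulation = effective_batch_size(8, 4)

# Without accumulation: one batch of 32 per update, if it fits in memory.
without_accumulation = effective_batch_size(32, 1)

# Same effective batch size either way; the second variant runs
# a single forward/backward pass per update instead of four.
assert with_accumulation == without_accumulation == 32
```

In Huggingface's `Trainer`, these knobs correspond to the `per_device_train_batch_size` and `gradient_accumulation_steps` training arguments.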
Yes, the best model in each language was a winner, and there were prizes. In the end it was decided these would be Huggingface merchandise plus a generous 200-euro voucher for further OVH AI Training (which I unfortunately more or less wasted). And everyone with a decent model got a T-shirt.
All in all, a great event that helped me discover Huggingface, as well as the fact that new speech models are getting substantially better every year.