I have released my May 2022 Czech Coqui speech-to-text model. The acoustic model (neural network weights) has seen an error reduction of up to about 10%, but the main improvement is in the language model (scorer). Heavy KenLM pruning combined with raising the n-gram order to 5 cuts the average word error rate by a further 50%.
Depending on the kind of audio, word error rates range from below 10% for very easy recordings (people pronouncing carefully in order to be recognized) to about 35% for very noisy, low-quality telephone recordings. This makes the model already usable in many situations, while its vocabulary (the list of recognizable words) stays above roughly 500k words. The release can be downloaded from GitHub. There is a Gradio-based inference example on Hugging Face.
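
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration; this is not necessarily how the figures above were measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
print(word_error_rate("dobrý den jak se máte", "dobrý den jak se máš"))  # 0.2
```

On this definition, a word error rate of 35% means roughly one error for every three reference words, and the value can exceed 100% when the hypothesis contains many inserted words.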