It can transcribe speech in English and several other languages,[3] and can also translate several non-English languages into English. OpenAI claims that the combination of different kinds of training data used in its development has led to improved recognition of accents, background noise, and jargon compared to previous approaches.[4]
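As an illustration of these two tasks, the following minimal sketch uses the open-source openai/whisper Python package (assuming it is installed, e.g. via pip install -U openai-whisper); the model size "base" and the file name "audio.mp3" are placeholders:

```python
import whisper

# Load one of the pretrained checkpoints; "base" is a placeholder size.
model = whisper.load_model("base")

# Task 1: transcribe speech in its original language.
result = model.transcribe("audio.mp3")
print(result["text"])

# Task 2: translate non-English speech into English.
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])
```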
Transformers, introduced by Google in 2017, displaced many prior state-of-the-art approaches in machine learning and became the core neural architecture in fields such as language modeling and computer vision;[10] in the early 2020s, weakly supervised approaches to training acoustic models with deep neural networks were recognized as promising for speech recognition.[11]
According to a report in The New York Times, OpenAI believed in 2021 that it had exhausted its sources of higher-quality data for training its large language models and decided to supplement scraped web text with transcriptions of YouTube videos and podcasts, developing Whisper for this task.[12]
Whisper was trained using weakly supervised learning on 680,000 hours of multilingual and multitask data, of which about a sixth (117,000 hours) was non-English audio. Whisper does not outperform models that specialize in the LibriSpeech dataset, but when tested across many diverse datasets it is more robust, making 50% fewer errors than such models.[13]
Whisper's error rate varies by language, with higher word error rates for languages that are under-represented in the training data.[14]
The model has been used as the base for a unified model for speech recognition and more general sound recognition.[15]
The Whisper architecture is based on an encoder-decoder transformer. Input audio is split into 30-second chunks, each converted into a log-Mel spectrogram that is passed to the encoder. The decoder is trained to predict the corresponding text caption, with special tokens directing it to perform tasks such as language identification, translation, and phrase-level timestamping.[13]
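A sketch of this pipeline, assuming the reference openai/whisper package is installed; it uses the package's documented helpers (whisper.load_audio, whisper.pad_or_trim, whisper.log_mel_spectrogram, whisper.decode), and the model size and file name are again placeholders:

```python
import whisper

model = whisper.load_model("base")

# Load audio and pad/trim it to the model's fixed 30-second window.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that is fed to the encoder.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Special tokens emitted by the decoder identify the spoken language.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decoding options select the task via special tokens; here plain
# transcription with timestamp tokens suppressed.
options = whisper.DecodingOptions(task="transcribe", without_timestamps=True)
result = whisper.decode(model, mel, options)
print(result.text)
```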
^ Yu, Dong; Deng, Li (2014). Automatic Speech Recognition: A Deep Learning Approach. Signals and Communication Technology (2015 ed.). London; Heidelberg: Springer. p. 9. ISBN 978-1-4471-5778-6.
^ Kamath, Uday; Graham, Kenneth L.; Emara, Wael (2022). Transformers for Machine Learning: A Deep Dive. Chapman & Hall/CRC Machine Learning & Pattern Recognition (1st ed.). Boca Raton; London; New York: CRC Press, Taylor & Francis Group. p. xix. ISBN 978-0-367-76734-1.
^ Paaß, Gerhard; Giesselbach, Sven (2023-02-16). "Foundation Models for Speech, Images, Videos, and Control". Foundation Models for Natural Language Processing. Artificial Intelligence: Foundations, Theory, and Algorithms. pp. 313–382. arXiv:2302.08575. doi:10.1007/978-3-031-23190-2_7. ISBN 978-3-031-23189-6. S2CID 257019816.