Microsoft Azure offers speech recognition and synthesis through the Azure AI Speech service, which includes:
- Speech to text
- Text to speech
- Speech translation
Speech to text
You can use the Azure AI Speech to text API to perform real-time or batch transcription of audio into text. The audio source can be a real-time stream from a microphone or an audio file.
Azure AI’s Speech to text API is based on Microsoft’s Universal Language Model. The model is trained on Microsoft-owned data and is deployed to Azure. It is optimized for two scenarios: conversational and dictation. If the prebuilt models from Microsoft don’t provide what you need, you can also create and train your own custom models, including acoustic, language, and pronunciation models.
Real-time transcription: Real-time speech to text allows you to transcribe audio streams to text. You can use real-time transcription for presentations, demos, or any other scenario where a person is speaking.
For real-time transcription to work, your application needs to listen for incoming audio from a microphone or another audio input source, such as an audio file. Your application code streams the audio to the service, which returns the transcribed text.
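The streaming flow described above can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes the `azure-cognitiveservices-speech` package is installed, that `key` and `region` come from your own Speech resource, and that the source delivers raw 16 kHz, 16-bit mono PCM (the push stream's default format). The helper names `iter_audio_chunks` and `transcribe_stream` are made up for illustration.

```python
import io
import threading
from typing import BinaryIO, Iterator, List

CHUNK_BYTES = 3200  # about 100 ms of 16 kHz, 16-bit mono PCM


def iter_audio_chunks(source: BinaryIO, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield successive chunks of audio bytes until the source is exhausted."""
    while True:
        chunk = source.read(chunk_bytes)
        if not chunk:
            break
        yield chunk


def transcribe_stream(source: BinaryIO, key: str, region: str) -> List[str]:
    """Stream audio to the Speech service and collect the recognized phrases."""
    # Imported lazily so the chunking helper works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    push_stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    phrases: List[str] = []
    done = threading.Event()

    def on_recognized(evt):
        # Keep only non-empty final results.
        if evt.result.text:
            phrases.append(evt.result.text)

    recognizer.recognized.connect(on_recognized)
    # Either event signals that the service has finished with the stream.
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())

    recognizer.start_continuous_recognition()
    for chunk in iter_audio_chunks(source):
        push_stream.write(chunk)  # the application streams audio to the service
    push_stream.close()           # signals end of audio
    done.wait()
    recognizer.stop_continuous_recognition()
    return phrases


if __name__ == "__main__":
    # Offline demo of the chunking step only; no service call is made.
    fake_audio = io.BytesIO(b"\x00" * 7000)
    print([len(c) for c in iter_audio_chunks(fake_audio)])
```

For live microphone input, the SDK can read the device directly, so the push stream would typically be replaced by an audio configuration that points at the default microphone.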
Batch transcription: Not all speech to text scenarios are real time. You might have audio recordings stored on a file share, a remote server, or even on Azure storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results.
Run batch transcription asynchronously, because batch jobs are scheduled on a best-effort basis. A job normally starts executing within minutes of the request, but there’s no guaranteed estimate for when it changes into the running state.
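Because there is no guaranteed start time, a client usually submits the job and then polls for its status. The following is a rough sketch using only the Python standard library. The REST path (`/speechtotext/v3.1/transcriptions`), the request fields, and the status strings are assumptions based on the v3.1 batch transcription API and may differ in other API versions. The function names, the SAS URL, and the polling interval are placeholders for illustration.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable

# Assumption: v3.1 REST path; check the API version your resource supports.
API_PATH = "/speechtotext/v3.1/transcriptions"
FINAL_STATES = ("Succeeded", "Failed")


def build_job_body(content_urls: Iterable[str], locale: str = "en-US",
                   display_name: str = "batch-demo") -> dict:
    """Build the JSON body that points the service at audio files via SAS URIs."""
    return {
        "contentUrls": list(content_urls),
        "locale": locale,
        "displayName": display_name,
    }


def submit_job(region: str, key: str, body: dict) -> str:
    """Create a transcription job and return the URL used to query it."""
    request = urllib.request.Request(
        f"https://{region}.api.cognitive.microsoft.com{API_PATH}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["self"]


def fetch_status(job_url: str, key: str) -> str:
    """Return the job's current status string."""
    request = urllib.request.Request(
        job_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["status"]


def wait_until_done(get_status: Callable[[], str], poll_seconds: float = 30.0,
                    max_polls: int = 120,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job reaches a final state or the poll budget runs out."""
    for _ in range(max_polls):
        status = get_status()
        if status in FINAL_STATES:
            return status
        sleep(poll_seconds)  # the job may sit in a not-started state for a while
    raise TimeoutError("transcription job did not finish within the poll budget")
```

A caller would chain these as `job_url = submit_job(region, key, build_job_body([sas_url]))` followed by `wait_until_done(lambda: fetch_status(job_url, key))`. Passing `get_status` and `sleep` as parameters keeps the polling loop testable without network access.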