DeepSpeech is a speech-to-text (STT), or automatic speech recognition (ASR), engine developed by Mozilla. It recognizes speech and converts spoken words into text. DeepSpeech is an open-source, deep-learning-based ASR engine implemented with TensorFlow.
This tutorial provides an example of how to use DeepSpeech to convert speech to text from a WAV audio file.
Use the pip package manager to install deepspeech from the command line:
pip install deepspeech
DeepSpeech offers pre-trained models for American English. Download the model (deepspeech-X.Y.Z-models.pbmm) from the DeepSpeech releases page, where X.Y.Z stands for the version. The model performs best when recordings are made in low-noise environments.
To improve accuracy further, we can also use an external scorer that relies on a vocabulary. A scorer (deepspeech-X.Y.Z-models.scorer) can be downloaded from the releases page.
DeepSpeech also offers a few sample audio files in WAV format. Download the archive (audio-X.Y.Z.tar.gz) and extract the files.
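The extraction step can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name extract_audio_archive and its dest_dir parameter are hypothetical, introduced here for illustration, and the archive is assumed to have been downloaded already.

```python
import tarfile
from pathlib import Path


def extract_audio_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return its member names.

    Hypothetical helper for illustration; it is not part of DeepSpeech.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter where this Python supports it.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return names
```

For example, extract_audio_archive('audio-0.8.2.tar.gz') would unpack the archive into the current directory, assuming that file is present there.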
from deepspeech import Model
import wave
import numpy as np

modelPath = 'deepspeech-0.8.2-models.pbmm'
scorerPath = 'deepspeech-0.8.2-models.scorer'
audioPath = 'audio/2830-3980-0043.wav'

# Load the acoustic model and attach the external scorer
ds = Model(modelPath)
ds.enableExternalScorer(scorerPath)

# Read all frames from the WAV file as 16-bit samples
fin = wave.open(audioPath, 'rb')
frames = fin.readframes(fin.getnframes())
fin.close()
audio = np.frombuffer(frames, np.int16)

# Run speech-to-text and print the transcript
text = ds.stt(audio)
print(text)
We can also use our own WAV audio files. The voice must be recorded with parameters that match what the model was trained on:
- Sample rate: 16 kHz
- Channels: 1 (mono)
- Bit rate: 256 kb/s
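Before passing a recording to the model, it can help to confirm that the file really has these parameters. The sketch below uses Python's standard wave module; the names check_wav and bit_rate are hypothetical helpers, and the 16-bit sample width is inferred from the np.int16 conversion in the example code (16000 Hz x 16 bits x 1 channel = 256 kb/s).

```python
import wave

# Expected values from the parameter list. The 2-byte sample width is
# inferred from the np.int16 conversion used in the example code.
EXPECTED_RATE = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample


def bit_rate(rate, channels, sample_width):
    """Return the uncompressed PCM bit rate in bits per second."""
    return rate * channels * sample_width * 8


def check_wav(path):
    """Return a list of mismatches between a WAV file and the expected parameters."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()

    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"channel count is {channels}, expected {EXPECTED_CHANNELS}")
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"sample width is {width} bytes, expected {EXPECTED_SAMPLE_WIDTH}")
    return problems
```

An empty list from check_wav('test.wav') means the file matches all three parameters; otherwise each entry describes one mismatch.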
A voice can be recorded using the SoX (Sound eXchange) command-line tool.
- On Ubuntu or Debian, run the following command to install SoX:
sudo apt install sox
- On Windows, download SoX from SourceForge.
After installing SoX, we can record a voice with the following command.
- On Ubuntu or Debian:
rec -r 16k -c 1 test.wav
- On Windows:
sox -t waveaudio -r 16k -c 1 -d test.wav
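If the recording is started from a Python script, the right command can be chosen per platform. This is a minimal sketch: build_record_command is a hypothetical helper, it reuses the two commands shown above unchanged, and it assumes that any non-Windows system behaves like Ubuntu or Debian, which this tutorial does not verify.

```python
import platform


def build_record_command(output_path, system=None):
    """Return the SoX recording command as an argument list.

    Hypothetical helper: `system` defaults to platform.system(), and every
    non-Windows value falls back to the Ubuntu/Debian `rec` command.
    """
    system = system or platform.system()
    if system == "Windows":
        return ["sox", "-t", "waveaudio", "-r", "16k", "-c", "1", "-d", output_path]
    return ["rec", "-r", "16k", "-c", "1", output_path]
```

The resulting list can be passed to subprocess.run, for example subprocess.run(build_record_command('test.wav'), check=True), assuming SoX is installed and available on the PATH.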