Speech to Text using DeepSpeech

DeepSpeech is a speech-to-text (STT), or automatic speech recognition (ASR), engine developed by Mozilla. It recognizes speech and converts spoken words into text. DeepSpeech is open source, based on deep learning, and implemented with TensorFlow.

This tutorial provides an example of how to use DeepSpeech to convert speech to text from a WAV audio file.

Using the pip package manager, install deepspeech from the command line.

pip install deepspeech

DeepSpeech offers pre-trained models for American English. Download the model (deepspeech-X.Y.Z-models.pbmm) from the releases page of the mozilla/DeepSpeech repository, where X.Y.Z stands for the version. The model performs best when recordings are made in low-noise environments.

In addition, to improve accuracy, we can use an external scorer, which is a language model built from a vocabulary. A scorer (deepspeech-X.Y.Z-models.scorer) can be downloaded from the same releases page.

DeepSpeech also offers a few sample audio files in WAV format. Download the archive (audio-X.Y.Z.tar.gz) and extract the files.
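The file naming and the extraction step can also be scripted. The following is a minimal sketch, assuming the three files have already been downloaded by hand from the releases page; the helper names release_files and extract_archive are hypothetical and are not part of DeepSpeech.

```python
import tarfile
from pathlib import Path


def release_files(version):
    """Build the release file names for a given X.Y.Z version string."""
    return {
        "model": f"deepspeech-{version}-models.pbmm",
        "scorer": f"deepspeech-{version}-models.scorer",
        "audio": f"audio-{version}.tar.gz",
    }


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Newer Python versions provide a safer "data" extraction filter;
        # fall back to the plain call where it is not available.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names


if __name__ == "__main__":
    files = release_files("0.8.2")
    print(files["model"], files["scorer"], files["audio"])
    # Only extract if the sample archive is present in the current directory.
    if Path(files["audio"]).exists():
        print(extract_archive(files["audio"]))
```

Running the script in the directory that holds the downloaded archive should unpack the sample WAV files next to it.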

We create a DeepSpeech model and enable the external scorer. The wave module is used to read the WAV audio file, and NumPy converts the raw frames into 16-bit samples. We convert speech to text by using the stt method.

from deepspeech import Model
import wave
import numpy as np

modelPath = 'deepspeech-0.8.2-models.pbmm'
scorerPath = 'deepspeech-0.8.2-models.scorer'
audioPath = 'audio/2830-3980-0043.wav'

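# Load the acoustic model and attach the external scorer
ds = Model(modelPath)
ds.enableExternalScorer(scorerPath)

# Read all frames; the with statement closes the file afterwards
with wave.open(audioPath, 'rb') as fin:
    frames = fin.readframes(fin.getnframes())

# Interpret the raw bytes as 16-bit signed integer samples
audio = np.frombuffer(frames, np.int16)
text = ds.stt(audio)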

print(text)

We can also use our own WAV audio files. The recording must use parameters that match what the model was trained on:

  • Sample rate: 16 kHz
  • Channels: 1 (mono)
  • Bit rate: 256 kb/s (16-bit samples)
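
Before passing a recording to the model, it can help to confirm that the file really has these parameters. The sketch below uses only the standard wave module; the names describe_wav and matches_expected are hypothetical helpers. It assumes uncompressed PCM audio, where the bit rate is sample rate × channels × bits per sample, so 16000 × 1 × 16 = 256000 bits per second.

```python
import wave

EXPECTED_RATE = 16000        # 16 kHz
EXPECTED_CHANNELS = 1        # mono
EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16-bit


def describe_wav(path):
    """Read the header of a WAV file and return its basic parameters."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
    return {
        "sample_rate": rate,
        "channels": channels,
        "sample_width": width,
        # Bit rate of uncompressed PCM audio, in bits per second
        "bit_rate": rate * channels * width * 8,
    }


def matches_expected(info):
    """Return True if the parameters match the values listed above."""
    return (
        info["sample_rate"] == EXPECTED_RATE
        and info["channels"] == EXPECTED_CHANNELS
        and info["sample_width"] == EXPECTED_SAMPLE_WIDTH
    )
```

For example, matches_expected(describe_wav('test.wav')) returns False for a 44.1 kHz stereo recording, which would need to be re-recorded or converted before use.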

A voice can be recorded by using the SoX (Sound eXchange) command-line tool.

  • On Ubuntu or Debian, run the following command to install SoX:
sudo apt install sox

After installing SoX, we can record a voice with one of the following commands. Press Ctrl+C to stop recording.

  • On Ubuntu or Debian:
rec -r 16k -c 1 test.wav
  • On Windows:
sox -t waveaudio -r 16k -c 1 -d test.wav
