Speech Recognition of Digits using TensorFlow


Here we're going to build our own deep neural network that learns to recognize spoken numbers. We're going to:

  1. Download a labelled dataset of people saying numbers,
  2. Build a neural network,
  3. Train it on that data,
  4. Finally, test it out to see if it can recognize other spoken numbers.

This example demonstrates a TensorFlow implementation of speech recognition. We build an LSTM recurrent neural network using TFLearn, a high-level TensorFlow-based library, train it on a labeled dataset of spoken digits, and then test it on spoken digits.

Dependencies

  1. tensorflow (https://www.tensorflow.org/versions/r0.12/get_started/os_setup.html)
  2. tflearn (http://tflearn.org/)
  3. future

Use pip to install any missing dependencies.
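
For example, to pull in all three at once (assuming pip points at the same Python interpreter you'll run the demo with; older TensorFlow releases may need the version-specific instructions at the link above):

pip install tensorflow tflearn future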

Let's dive into our code:

In [ ]:
# demo.py (Python 2.7)
from __future__ import division, print_function, absolute_import
import tflearn
import speech_data
import tensorflow as tf

First we import TFLearn, a high-level library built on top of TensorFlow that is easier to read and great for quick prototyping.

Our other import is a helper module called speech_data, which fetches data from the web and formats it for us. You can get it from here.

Now that we have our libraries, let's define our hyperparameters, or tuning knobs. We have three of them:

  1. learning_rate : how strongly each update is applied to the weights. The greater the learning rate, the faster our network trains; the lower the learning rate, the more precisely it converges. So it represents a trade-off between training time and accuracy.
  2. training_iters : defines how many steps we want to train for - 300,000.
  3. batch_size : how many labeled examples we feed the network at a time - 64.
In [ ]:
learning_rate = 0.0001
training_iters = 300000  # steps
batch_size = 64

We have our hyperparameters. Now we can fetch our data. This is where we'll use our helper module speech_data.py, specifically its mfcc_batch_generator function. This function downloads a set of wave files, each a recording of a spoken digit labeled with the corresponding written digit, and yields the labeled speech files as batches. We can pull a batch with Python's built-in next function and split it into training and testing data. For simplicity we will use the same data for both, so the model will be able to recognize the speaker we trained it on, but not other speakers.

In [ ]:
width = 20  # mfcc features
height = 80  # (max) length of utterance
classes = 10  # digits

batch = speech_data.mfcc_batch_generator(batch_size)
X, Y = next(batch)
trainX, trainY = X, Y
testX, testY = X, Y  # overfit for now
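
As a quick sanity check, we can inspect the shapes of what the generator returned (a hedged sketch; the exact shapes depend on the speech_data implementation, but each example should be a width x height matrix of MFCC features and each label a one-hot vector over the ten digit classes):

In [ ]:
import numpy as np
print(np.asarray(X).shape)  # expected: (batch_size, 20, 80) - MFCC features per utterance
print(np.asarray(Y).shape)  # expected: (batch_size, 10) - one-hot digit labels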

Now that we have our training and testing data, it's time to build our neural network. Since spoken words are sequences of sound waves, we want a recurrent neural network, because recurrent nets are capable of processing sequences. We'll initialize our net by calling TFLearn's input_data function. This initial input layer is the gateway through which data is fed into the network, and its parameters define the shape of our input data, or as TensorFlow calls it, our input tensor. A 'tensor' is a fancy word for a multi-dimensional array of data. Our two parameters are the width and height: the width is the number of MFCC features extracted from each utterance by our speech_data helper, and the height is the max length of each utterance.

In [ ]:
# Network building
net = tflearn.input_data([None, width, height])

For our next layer we use TFLearn's lstm (Long Short-Term Memory) function. In a recurrent net, the output is influenced not only by the input we just fed in but by the entire history of inputs to the recurring loop. LSTMs are a type of recurrent net designed to retain information across long sequences, and because of that they consistently outperform plain recurrent nets.

We'll pass our previous layer as the first parameter, since we are feeding tensors from one layer to the next. Then comes the number of neurons. There's no hard rule for how many neurons to use in a layer: too few will lead to bad predictions, and too many will overfit to our training data, meaning the model won't generalize well. Let's pick 128. The last argument is our dropout value. Dropout helps prevent overfitting by randomly turning off some neurons during training, so the network is forced to find new paths between layers, allowing for a more generalized model. (Note that TFLearn's lstm layer interprets this argument as a keep probability, so 0.8 means roughly 80% of connections are kept.)

In [ ]:
net = tflearn.lstm(net, 128, dropout=0.8)

Our next layer will be fully connected, meaning every neuron in the previous layer connects to every neuron in this one, and its size is our number of classes - ten, since we are only recognizing 10 digits. We'll set the activation function to softmax, which converts the raw outputs into a probability distribution over the classes.

In [ ]:
net = tflearn.fully_connected(net, classes, activation='softmax')

Lastly we'll wrap the network in a regression layer, which defines how it will be trained: we use the popular Adam optimizer to minimize our categorical cross-entropy loss function over time, so we get more accurate predictions. At prediction time, the class with the highest probability is our predicted digit.

In [ ]:
net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate, loss='categorical_crossentropy')
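
For intuition, categorical cross-entropy compares the softmax output against the one-hot label; only the true class's predicted probability contributes to the loss. A quick worked example (illustrative numbers only):

In [ ]:
import numpy as np

# One-hot label for the digit 3, and a hypothetical softmax output.
label = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
probs = np.array([0.01, 0.02, 0.05, 0.80, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02])

loss = -np.sum(label * np.log(probs))  # only the true-class term survives
print(loss)  # ~0.223; a confident correct prediction gives a low loss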

Now we can initialize our network using TFLearn's DNN (Deep Neural Net) function, setting tensorboard_verbose to 0, which gives us the basic loss and accuracy visualization. We'll run our training loop, fitting the model to the training and testing data for ten epochs at a time with our specified batch size. Then we'll predict the spoken digits' values from our training data. We also make sure to save our model for later use and print our result.

In [ ]:
# Training

### add this "fix" for tensorflow version errors
col = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
for x in col:
    tf.add_to_collection(tf.GraphKeys.VARIABLES, x)


model = tflearn.DNN(net, tensorboard_verbose=0)
# Bounded loop (the original `while 1:` never terminated, so
# model.save() below was never reached). Tune the range as needed.
for _ in range(training_iters // (10 * batch_size)):
    model.fit(trainX, trainY, n_epoch=10, validation_set=(testX, testY),
              show_metric=True, batch_size=batch_size)
_y = model.predict(X)
model.save("tflearn.lstm.model")
print(_y)  # predicted probabilities
print(Y)   # actual one-hot labels

TFLearn logs the important training variables built-in just from running the fit function, so we don't have to specify what to print. After it's done training, the script predicts the digits, and if we wanted to, we could record ourselves saying a number, place the file in the data directory, and predict that instead.
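
Here's a minimal sketch of how you might reload the saved model later and classify one utterance. It assumes the network-building code above has been re-run to recreate net, and it reuses mfcc_batch_generator rather than any single-file loader, since that's the only speech_data function this tutorial relies on:

In [ ]:
import numpy as np

# Rebuild `net` exactly as above before loading the saved weights.
model = tflearn.DNN(net, tensorboard_verbose=0)
model.load("tflearn.lstm.model")

demo_batch = speech_data.mfcc_batch_generator(1)  # one utterance at a time
demo_X, demo_Y = next(demo_batch)
prediction = model.predict(demo_X)
digit = np.argmax(prediction)  # index of the highest softmax probability
print("predicted digit:", digit, "actual:", np.argmax(demo_Y))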

Summary:

So to break it down: LSTM neural networks are used in state-of-the-art speech recognition. We can use TFLearn to quickly build and train a deep neural network to recognize speech. And good hyperparameters, like the learning rate, are those that balance trade-offs like training time and accuracy.

Entire demo.py script:

In [ ]:
from __future__ import division, print_function, absolute_import
import tflearn
import speech_data
import tensorflow as tf

learning_rate = 0.0001
training_iters = 300000  # steps
batch_size = 64

width = 20  # mfcc features
height = 80  # (max) length of utterance
classes = 10  # digits

batch = speech_data.mfcc_batch_generator(batch_size)
X, Y = next(batch)
trainX, trainY = X, Y
testX, testY = X, Y  # overfit for now

# Network building
net = tflearn.input_data([None, width, height])
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, classes, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate, loss='categorical_crossentropy')
# Training

### add this "fix" for tensorflow version errors
col = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
for x in col:
    tf.add_to_collection(tf.GraphKeys.VARIABLES, x)


model = tflearn.DNN(net, tensorboard_verbose=0)
# Bounded loop (the original `while 1:` never terminated, so
# model.save() below was never reached). Tune the range as needed.
for _ in range(training_iters // (10 * batch_size)):
    model.fit(trainX, trainY, n_epoch=10, validation_set=(testX, testY),
              show_metric=True, batch_size=batch_size)
_y = model.predict(X)
model.save("tflearn.lstm.model")
print(_y)  # predicted probabilities
print(Y)   # actual one-hot labels

  • Run this code after installing all the dependencies and downloading the speech_data.py helper module.
  • This downloads the dataset into the 'data' folder and initiates the training process.
  • It will take a couple of hours to train fully.
  • You can then test the trained model on your own .wav samples of spoken digits (see the reload sketch above).
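
To kick off training from a terminal once everything is in place (assuming you saved the script as demo.py):

python demo.py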

You can download the entire code from here.