Here we're going to build our own deep neural network that learns to recognize spoken numbers.
This example demonstrates a TensorFlow implementation of speech recognition. We build an LSTM recurrent neural network using TFLearn, a high-level library built on TensorFlow, and train it on a labeled dataset of spoken digits. Then we test it on spoken digits.
Use pip to install any missing dependencies.
Let's dive into our code:
#! python27
# demo.py
from __future__ import division, print_function, absolute_import
import tflearn
import speech_data
import tensorflow as tf
Now that we have our libraries, let's define our hyperparameters, or tuning knobs. We have three of them:
learning_rate = 0.0001
training_iters = 300000 # steps
batch_size = 64
We have our hyperparameters. Now we can fetch our data. This is where we'll use our helper class speech_data.py, specifically its mfcc_batch_generator function. This function will download a set of wave files. Each wave file is a recording of a spoken digit, and each is labeled with the matching written digit. The function returns a generator that yields the labeled speech files in batches, and Python's built-in next function pulls one batch out of it. We then use that batch as both our training and our testing data. Testing on the training data keeps things simple, but it means the network will be able to recognize the speaker we trained it on and not other speakers.
width = 20 # mfcc features
height = 80 # (max) length of utterance
classes = 10 # digits
batch = word_batch = speech_data.mfcc_batch_generator(batch_size)
X, Y = next(batch)
trainX, trainY = X, Y
testX, testY = X, Y #overfit for now
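To make the shapes above concrete, here is a minimal sketch of how a single example could be brought to a fixed width x height size and how its label could be one-hot encoded. The helper names pad_features and one_hot are hypothetical and are not taken from speech_data.py, which may prepare its batches differently.

```python
WIDTH = 20    # MFCC features per frame
HEIGHT = 80   # maximum utterance length in frames
CLASSES = 10  # digits 0-9


def pad_features(features, height=HEIGHT):
    """Right-pad each feature row with zeros (or truncate it) to `height` frames."""
    return [(row + [0.0] * height)[:height] for row in features]


def one_hot(digit, classes=CLASSES):
    """Return a list with 1.0 at index `digit` and 0.0 everywhere else."""
    return [1.0 if i == digit else 0.0 for i in range(classes)]


# A made-up utterance: 20 features over 35 frames, labeled as the digit 7.
utterance = [[0.5] * 35 for _ in range(WIDTH)]
x = pad_features(utterance)
y = one_hot(7)

print(len(x), len(x[0]))  # 20 80
print(y.index(1.0))       # 7
```

Padding every utterance to the same length is what lets a whole batch be stacked into one tensor, and the one-hot label lines up with the ten outputs the network will produce.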
Now that we have our training and testing data, it's time to make our neural network. Spoken words are sequences of sound, so we want a recurrent neural network, which is built to process sequences. We'll initialize our net by calling TFLearn's input_data function. This input layer is the gateway through which data is fed into the network, and its parameter defines the shape of our input data or, as TensorFlow calls it, our input tensor. A 'tensor' is a fancy word for a multi-dimensional array of data. The shape we pass has a leading None, which leaves the batch size unspecified, followed by the width and the height. The width is the number of features extracted from each utterance in our speech_data helper class, and the height is the maximum length of an utterance.
# Network building
net = tflearn.input_data([None, width, height])
For our next layer we use TFLearn's LSTM, or Long Short-Term Memory, function. In a recurrent net the output is influenced not only by the input we just put in but by the entire history of inputs to the recurrent loop. LSTMs are a type of recurrent net with gates that control what is kept and what is forgotten, which lets them hold on to information across long sequences. Because of that they typically outperform plain recurrent nets on tasks like this one.
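To see what "influenced by the entire history of inputs" means, here is a minimal sketch of a plain one-unit recurrent cell. It is not an LSTM (there are no gates), and the function name run_rnn and the weights w_x and w_h are made up for illustration.

```python
import math


def run_rnn(inputs, w_x=0.5, w_h=0.9):
    """Feed a sequence through a one-unit recurrent cell and return the final state."""
    h = 0.0  # the hidden state starts out empty
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h)
    return h


# Both sequences end with the same input (1.0) but have different histories.
a = run_rnn([0.0, 0.0, 1.0])
b = run_rnn([1.0, 1.0, 1.0])
print(a, b)  # the two final states differ because the histories differ
```

An LSTM keeps the same loop but adds gates that decide how much of the old state to keep at each step.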
When calling the LSTM function, we pass our previous layer as the first parameter, since we are feeding tensors from one layer to the next. Next comes the number of neurons. There's not really a rule for how many neurons to use in a layer: too few will lead to bad predictions, and too many will overfit to our training data, meaning the model will not generalize well. Let's pick 128. Last is our dropout value. Dropout helps prevent overfitting by randomly turning off some neurons during training, so the network can't rely on any single path between layers and ends up with a more general model. TFLearn treats this value as a keep probability, so 0.8 keeps roughly 80% of the values and drops the rest.
net = tflearn.lstm(net, 128, dropout=0.8)
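Here is a minimal sketch of the dropout idea on its own, assuming the value is a keep probability. It uses the common "inverted dropout" variant, where surviving values are scaled up so the average stays the same. The function name dropout is ours, and this is not TFLearn's internal implementation.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and scale survivors by 1/keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)        # fixed seed so the run is repeatable
activations = [1.0] * 10000   # pretend layer outputs
dropped = dropout(activations, keep_prob=0.8, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(kept_fraction, mean_after)  # roughly 0.8 and roughly 1.0
```

Dropout is only applied during training; at prediction time all units stay on.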
Our next layer will be fully connected, meaning every neuron in the previous layer is connected to every neuron in this one. It has one neuron per class, and we have ten classes since we are only recognizing ten digits. We'll set the activation function to softmax, which converts the layer's raw outputs into probabilities.
net = tflearn.fully_connected(net, classes, activation='softmax')
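As a quick illustration of what softmax does to the ten raw outputs, here is a small standalone sketch. The scores are made up, and the function is a plain Python version written for this example rather than the one TFLearn uses.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw outputs for the ten digit classes; class 3 scores highest.
scores = [0.0] * 10
scores[3] = 2.0
probs = softmax(scores)
best = probs.index(max(probs))
print(best)  # 3
```

Every output ends up between 0 and 1, and the largest score gets the largest probability.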
Lastly we'll add TFLearn's regression layer, which attaches the optimizer and the loss function used for training. The network's prediction for an utterance is the digit with the highest probability. We're using the popular Adam optimizer to minimize our categorical cross-entropy loss, so our predictions get more accurate over time.
net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate, loss='categorical_crossentropy')
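To get a feel for the loss being minimized, here is a minimal sketch of categorical cross-entropy for a single example. The function and the numbers are illustrative only. The small eps guard is our own addition to avoid taking log(0).

```python
import math


def categorical_crossentropy(predicted, target, eps=1e-12):
    """Cross-entropy between a predicted distribution and a one-hot target."""
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))


target = [0.0, 0.0, 1.0]        # the true class is index 2
confident = [0.05, 0.05, 0.90]  # most of the probability is on the right class
unsure = [0.40, 0.30, 0.30]     # probability is spread out

loss_confident = categorical_crossentropy(confident, target)
loss_unsure = categorical_crossentropy(unsure, target)
print(loss_confident < loss_unsure)  # True
```

The loss shrinks as more probability lands on the correct digit, which is exactly what the optimizer pushes the network toward.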
Now we can initialize our network using TFLearn's DNN (Deep Neural Net) function. We set tensorboard_verbose to 0, which logs only the basics (loss and accuracy) and is the fastest setting; higher values give more detailed visualizations. We'll start our training loop and fit our model to the training and testing data for ten epochs at a time with our specified batch size. Then we predict the spoken digits from our training data. We also make sure to save our model for later use and print our result.
# Training
### add this "fix" for tensorflow version errors
col = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
for x in col:
    tf.add_to_collection(tf.GraphKeys.VARIABLES, x)
model = tflearn.DNN(net, tensorboard_verbose=0)
# Each fit call runs 10 epochs of one batch each, i.e. 10 training steps.
for _ in range(training_iters // 10):
    model.fit(trainX, trainY, n_epoch=10, validation_set=(testX, testY), show_metric=True,
              batch_size=batch_size)
    _y = model.predict(X)
model.save("tflearn.lstm.model")
print(_y)
print(Y)
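The printed predictions are rows of ten probabilities, which are hard to read by eye. Here is a minimal sketch of turning such rows into digits and an accuracy score. The helper names argmax and accuracy are ours, and the sample values are made-up stand-ins for the model's output and the labels, shortened to three classes.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predictions, labels):
    """Fraction of rows where the predicted class matches the one-hot label."""
    hits = sum(1 for p, t in zip(predictions, labels) if argmax(p) == argmax(t))
    return hits / len(labels)


# Made-up stand-ins for the predicted probabilities and the one-hot labels.
fake_predictions = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]
fake_labels = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]

digits = [argmax(p) for p in fake_predictions]
acc = accuracy(fake_predictions, fake_labels)
print(digits)  # [1, 0, 2]
print(acc)     # two of the three rows match
```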
TFLearn has a nice built-in log of important training variables that we get just from running the fit function, so we don't have to specify what to print. After it's done training we predict the digits. If we wanted to, we could record ourselves saying a number, place the recording in the data directory, and predict that.
So to break it down: LSTM neural networks are used in state-of-the-art speech recognition. We can use TFLearn to quickly build and train a deep neural network to recognize speech. And good hyperparameters, like the learning rate, are those that balance trade-offs such as training time and accuracy.
demo.py script:
from __future__ import division, print_function, absolute_import
import tflearn
import speech_data
import tensorflow as tf
learning_rate = 0.0001
training_iters = 300000 # steps
batch_size = 64
width = 20 # mfcc features
height = 80 # (max) length of utterance
classes = 10 # digits
batch = word_batch = speech_data.mfcc_batch_generator(batch_size)
X, Y = next(batch)
trainX, trainY = X, Y
testX, testY = X, Y #overfit for now
# Network building
net = tflearn.input_data([None, width, height])
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, classes, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate, loss='categorical_crossentropy')
# Training
### add this "fix" for tensorflow version errors
col = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
for x in col:
    tf.add_to_collection(tf.GraphKeys.VARIABLES, x)
model = tflearn.DNN(net, tensorboard_verbose=0)
# Each fit call runs 10 epochs of one batch each, i.e. 10 training steps.
for _ in range(training_iters // 10):
    model.fit(trainX, trainY, n_epoch=10, validation_set=(testX, testY), show_metric=True,
              batch_size=batch_size)
    _y = model.predict(X)
model.save("tflearn.lstm.model")
print(_y)
print(Y)