Abstract
This paper demonstrates how to train a deep neural network for speech recognition on Intel® architecture and run inference with it. The model was trained from scratch on the Speech Commands dataset that TensorFlow* recently released. Inference was run on test audio clips to detect their labels. The experiments were run on an Intel® Xeon® Gold processor-based system.
Introduction
Audio classification tasks fall into three subdomains: music classification, speech recognition (particularly the acoustic model), and acoustic scene classification. With the rapid development of mobile devices, speech-related technologies are becoming increasingly popular. For example, Google offers voice search on Android* phones. In this study, we approach the speech recognition problem by building a basic speech recognition network that recognizes 30 different words, using a TensorFlow-based implementation.
To help with this experiment, TensorFlow recently released the Speech Commands dataset. It includes 65,000 one-second-long utterances of 30 short words, spoken by thousands of different people.
Continued research in deep learning has produced many frameworks for solving the complex problem of speech recognition. These frameworks are optimized for the hardware they run on, yielding better accuracy, reduced loss, and increased speed. Along these lines, Intel has optimized the TensorFlow library for better performance on Intel® Xeon® processors. This paper discusses training and inference for a speech recognition model built with a sample convolutional neural network (CNN) architecture in the TensorFlow framework, running on a cluster powered by Intel® processors. We trained the model from scratch.
Document Content
This section describes the end-to-end steps, from choosing the environment to running the tests on the trained speech recognition model.
Choosing the environment
Hardware
Experiments were performed on Intel Xeon Gold processor-powered systems. Table 1 lists the hardware details.
Table 1. Intel Xeon Gold processor configuration.
Architecture | x86_64 |
CPU op-mode(s) | 32-bit, 64-bit |
Byte order | Little endian |
CPU(s) | 24 |
Core(s) per socket | 6 |
Socket(s) | 2 |
CPU family | 6 |
Model | 85 |
Model name | Intel Xeon Gold 6128 CPU at 3.40 GHz |
RAM | 92 GB |
Software
The Intel® Optimization for TensorFlow* framework, along with the Intel® Distribution for Python*, was used as the software configuration. Table 2 lists the details of the software.
Table 2. Software configuration – Intel Xeon Gold processor
TensorFlow | 1.4.0 (optimized by Intel) |
Python* | 3.6 |
TensorBoard* | 0.1.5 |
The software configurations listed in Table 2 were already available in the chosen hardware environment, so no source build of TensorFlow was necessary.
Dataset
The Speech Commands dataset (a TAR archive of more than 1 GB) comprises 65,000 WAVE audio files (.wav) of people saying 30 different words. The data was collected by Google and released under a CC BY license. Each file is a one-second audio clip labeled as silence, an unknown word, or one of the ten words yes, no, up, down, left, right, on, off, stop, and go. Of the 30 sounds in the full dataset, these 12 classes were used for this experiment.
Total data set: 23,701 files
- Training: 80 percent (18,961)
- Validation: 10 percent (2,370)
- Testing: 10 percent (2,370)
We used a hash-function-based split to prevent files from repeating across sets.
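A hash-based split can be sketched as follows. This is modeled on the approach used in TensorFlow's speech_commands example; the constants and the helper name `which_set` here are illustrative, not an exact copy of the library code.

```python
# Sketch of a hash-based train/validation/test split: hash a stable key for
# each file and map the hash onto a 0..99 bucket, so every file lands in the
# same set on every run.
import hashlib
import re

VALIDATION_PERCENT = 10
TESTING_PERCENT = 10

def which_set(filename):
    """Deterministically assign a file to 'training', 'validation', or 'testing'."""
    base = filename.split('/')[-1]
    # Group all clips from the same speaker by ignoring the _nohash_ suffix,
    # so one speaker's recordings never straddle two sets.
    speaker_id = re.sub(r'_nohash_.*$', '', base)
    digest = hashlib.sha1(speaker_id.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < VALIDATION_PERCENT:
        return 'validation'
    elif bucket < VALIDATION_PERCENT + TESTING_PERCENT:
        return 'testing'
    return 'training'

print(which_set('left/a5d485dc_nohash_0.wav'))
```

Because the assignment depends only on the speaker identifier, adding new files later never moves existing files between sets.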
We maintained a list of all words, such as up, go, off, on, stop, and so on. The train and test split was done per word to ensure all classes were covered and there was no class imbalance.
CNN-TRAD-POOL3 architecture
The architecture used is based on the Convolutional Neural Networks for Small-footprint Keyword Spotting paper. TensorFlow provides different approaches to building neural network models. We chose CNN-TRAD-POOL3, because it is comparatively simple, quick to train, and easy to understand. The CNN-TRAD-POOL3 network is made of two convolution layers, max-pooling layers, one linear low-rank layer, one DNN layer, and one softmax layer. Figure 1 shows the CNN-TRAD-POOL3 architecture.
Figure 1. CNN-TRAD-POOL3 model.
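The shapes flowing through a conv + max-pool stack like this can be worked out with simple arithmetic. The input and filter sizes below are illustrative placeholders, not the model's exact hyperparameters.

```python
# Back-of-the-envelope output-shape arithmetic for a convolution followed by
# max pooling, as in a CNN-TRAD-POOL3-style network.

def conv_output(size, filter_size, stride=1):
    """Output length of a VALID (no-padding) convolution along one dimension."""
    return (size - filter_size) // stride + 1

def pool_output(size, pool_size, stride):
    """Output length of a max-pooling window along one dimension."""
    return (size - pool_size) // stride + 1

# Assume a 98 x 40 time-by-frequency spectrogram for a one-second clip.
time_steps, freq_bins = 98, 40

# Example first convolution: a 20 x 8 filter, stride 1.
t = conv_output(time_steps, 20)
f = conv_output(freq_bins, 8)
print('after conv1:', t, 'x', f)   # after conv1: 79 x 33

# Example max pool: 2 x 2 window, stride 2.
t = pool_output(t, 2, 2)
f = pool_output(f, 2, 2)
print('after pool:', t, 'x', f)    # after pool: 39 x 16
```

The same two formulas apply to every subsequent layer, which makes it easy to check that the flattened feature size feeding the dense layer matches expectations.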
Execution steps
This section describes the steps we used in the end-to-end process for training, validation, and testing the speech recognition model on Intel® architecture.
- Setup for training
- Model training
- Inference
Setup for training
- Install the Intel® Optimization for TensorFlow* framework.
- Clone the TensorFlow repository from https://github.com/tensorflow/tensorflow.
Model training
After cloning the TensorFlow repository, the next step is to train the model. We trained all layers from scratch rather than fine-tuning a pretrained model.
The following command downloads the speech commands dataset and trains the algorithm toward detecting audio samples:
python tensorflow/examples/speech_commands/train.py
Experimental runs with inference
On the Intel Xeon Gold Processor – Intel® AI DevCloud Cluster
To execute on the Intel AI DevCloud cluster, use the following command to submit the training job:
qsub speech.sh -l walltime=24:00:00
On this cluster, jobs are limited by walltime: the default is six hours, and the maximum that can be requested is 24 hours. As shown in the qsub command, the walltime is set to 24 hours.
The job script speech.sh has the following code:
#!/bin/sh
#PBS -l walltime=24:00:00
which python
cd ~/tensorflow/
export PATH=/glob/intel-python/python3/bin/:$PATH
numactl --interleave=all python ~/tensorflow/tensorflow/examples/speech_commands/train.py
TensorBoard* Graphs
TensorBoard is an effective tool for visualizing training progress. By default, the script saves events to /tmp/retrain_logs; load them by running the following command:
tensorboard --logdir /tmp/retrain_logs
Figure 2 shows the TensorBoard graphs for the Intel Xeon Gold processor.
Figure 2. TensorBoard graphs - Intel Xeon Gold processor.
The script used to export the trained model file for inference is as follows:
echo python ~/tensorflow/tensorflow/examples/speech_commands/freeze.py \
  --start_checkpoint=~/kaggle-speech/speech_commands_train/conv.ckpt-68000 \
  --output_file=~/kaggle-speech/my_frozen_graph_68000.pb | qsub
After the frozen model has been created, using the following code, test it with the label_wav.py script:
echo python ~/tensorflow/tensorflow/examples/speech_commands/label_wav.py \
  --graph=~/kaggle-speech/my_frozen_graph_68000.pb \
  --labels=~/kaggle-speech/speech_commands_train/conv_labels.txt \
  --wav=~/kaggle-speech/speech_dataset/left/a5d485dc_nohash_0.wav | qsub

The job output lists the top-scoring labels:

left (score = 0.96563)
right (score = 0.02616)
_unknown_ (score = 0.00717)
left receives the top score, matching the clip's true label.
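The ranking step that produces this kind of output can be sketched in a few lines: apply softmax to the model's raw scores and sort. The label list and logit values below are made-up stand-ins for real model output, not values from this experiment.

```python
# Sketch of how a label_wav-style script ranks predictions: softmax the raw
# model outputs into probabilities, then report the highest-scoring labels.
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_predictions(labels, logits, how_many=3):
    """Return the `how_many` highest-scoring (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return ranked[:how_many]

labels = ['_silence_', '_unknown_', 'left', 'right', 'up']
logits = [0.1, 1.2, 6.3, 2.7, 0.4]   # made-up scores favoring 'left'
for label, prob in top_predictions(labels, logits):
    print('%s (score = %.5f)' % (label, prob))
```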
Intel® Xeon® Gold processor metrics
Table 3. Intel Xeon Gold processor metrics.
Properties | Intel Xeon Gold Processor |
---|---|
Total training time | 83,400 seconds |
Total number of steps | 68,000 |
Batch size | 100 |
Total WAV files processed | 6,800,000 |
WAV files per second (total WAV files / total time) | 81.53 |
Training accuracy | 93 percent |
Validation accuracy | 92 percent |
Testing accuracy | 92.5 percent |
Conclusion
In this paper, we showed how we trained and tested a speech recognition model from scratch using a sample CNN architecture and the TensorFlow Speech Commands dataset in an Intel Xeon Gold processor-based environment. The experiment can be extended by applying different optimization algorithms, changing learning rates, and varying input sizes to further improve accuracy.
About the Authors
Rajeswari Ponnuru and Ravi Keron Nidamarty are members of the Intel team, working on evangelizing artificial intelligence in the academic environment.
Related Resources
TensorFlow* Optimizations on Modern Intel® Architecture
Build and Install TensorFlow* on Intel® Architecture