Abstract
This paper demonstrates how to train a deep neural network for speech recognition on Intel® architecture and run inference with it. The model was trained from scratch on the Speech Commands dataset that TensorFlow* recently released. Inference was run on test audio clips to detect their labels. The experiments were run on an Intel® Xeon® Gold processor-based system.
Introduction
Audio classification tasks fall into three subdomains: music classification, speech recognition (particularly the acoustic model), and acoustic scene classification. With the rapid development of mobile devices, speech-related technologies are becoming increasingly popular. For example, Google offers voice search on Android* phones. In this study, we approach the speech recognition problem by building a basic speech recognition network that recognizes 30 different words, using a TensorFlow-based implementation.
To help with this experiment, TensorFlow recently released the Speech Commands dataset. It includes 65,000 one-second-long utterances of 30 short words, spoken by thousands of different people.
Continued research in deep learning has produced many frameworks for solving the complex problem of speech recognition. These frameworks are optimized for the hardware they run on, yielding better accuracy, reduced loss, and increased speed. Along these lines, Intel has optimized the TensorFlow library for better performance on Intel® Xeon® processors. This paper discusses training and inference for a speech recognition model built with a sample convolutional neural network (CNN) architecture in the TensorFlow framework, running on a cluster powered by Intel® processors. We trained the model from scratch.
Document Content
This section describes the end-to-end steps, from choosing the environment to running the tests on the trained speech recognition model.
Choosing the environment
Hardware
Experiments were performed on Intel Xeon Gold processor-powered systems. Table 1 lists the hardware details.
Table 1. Intel Xeon Gold processor configuration.
Architecture | x86_64 |
CPU op-mode(s) | 32-bit, 64-bit |
Byte order | Little endian |
CPU(s) | 24 |
Core(s) per socket | 6 |
Socket(s) | 2 |
CPU family | 6 |
Model | 85 |
Model name | Intel Xeon Gold 6128 CPU at 3.40 GHz |
RAM | 92 GB |
Software
The Intel® Optimization for TensorFlow* framework, along with the Intel® Distribution for Python*, was used as the software configuration. Table 2 lists the details of the software.
Table 2. Software configuration – Intel Xeon Gold processor
TensorFlow | 1.4.0 (optimized by Intel) |
Python* | 3.6 |
TensorBoard* | 0.1.5 |
The software configurations listed in Table 2 were already available in the chosen hardware environment, so no source build of TensorFlow was necessary.
Dataset
The Speech Commands dataset (a TAR archive of more than 1 GB) comprises 65,000 WAVE audio files (.wav) of people saying 30 different words. The data was collected by Google and released under a CC BY license. Each file is a one-second audio clip labeled as silence, an unknown word, or one of the ten words yes, no, up, down, left, right, on, off, stop, and go. Of the 30 sounds in the full dataset, these 12 classes were used for this experiment.
Total data set: 23,701 files
- Training: 80 percent (18,961)
- Validation: 10 percent (2,370)
- Testing: 10 percent (2,370)
We used a hash-function-based split to prevent files from repeating across sets.
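A hash-based split can be sketched as follows. This is modeled on the approach used in TensorFlow's speech_commands example; the constants and the helper name `which_set` here are illustrative, not an exact copy of the library code.

```python
# Sketch of a hash-based train/validation/test split: hash a stable key for
# each file and map the hash onto a 0..99 bucket, so every file lands in the
# same set on every run.
import hashlib
import re

VALIDATION_PERCENT = 10
TESTING_PERCENT = 10

def which_set(filename):
    """Deterministically assign a file to 'training', 'validation', or 'testing'."""
    base = filename.split('/')[-1]
    # Group all clips from the same speaker by ignoring the _nohash_ suffix,
    # so one speaker's recordings never straddle two sets.
    speaker_id = re.sub(r'_nohash_.*$', '', base)
    digest = hashlib.sha1(speaker_id.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < VALIDATION_PERCENT:
        return 'validation'
    elif bucket < VALIDATION_PERCENT + TESTING_PERCENT:
        return 'testing'
    return 'training'

print(which_set('left/a5d485dc_nohash_0.wav'))
```

Because the assignment depends only on the speaker identifier, adding new files later never moves existing files between sets.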
We maintained a list of all words, such as up, go, off, on, stop, and so on. The train and test split was done per word to ensure all classes were covered and there was no class imbalance.
CNN-TRAD-POOL3 architecture
The architecture used is based on the Convolutional Neural Networks for Small-footprint Keyword Spotting paper. TensorFlow provides different approaches to building neural network models. We chose CNN-TRAD-POOL3, because it is comparatively simple, quick to train, and easy to understand. The CNN-TRAD-POOL3 network is made of two convolution layers, max-pooling layers, one linear low-rank layer, one DNN layer, and one softmax layer. Figure 1 shows the CNN-TRAD-POOL3 architecture.
Figure 1. CNN-TRAD-POOL3 model.
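The shapes flowing through a conv + max-pool stack like this can be worked out with simple arithmetic. The input and filter sizes below are illustrative placeholders, not the model's exact hyperparameters.

```python
# Back-of-the-envelope output-shape arithmetic for a convolution followed by
# max pooling, as in a CNN-TRAD-POOL3-style network.

def conv_output(size, filter_size, stride=1):
    """Output length of a VALID (no-padding) convolution along one dimension."""
    return (size - filter_size) // stride + 1

def pool_output(size, pool_size, stride):
    """Output length of a max-pooling window along one dimension."""
    return (size - pool_size) // stride + 1

# Assume a 98 x 40 time-by-frequency spectrogram for a one-second clip.
time_steps, freq_bins = 98, 40

# Example first convolution: a 20 x 8 filter, stride 1.
t = conv_output(time_steps, 20)
f = conv_output(freq_bins, 8)
print('after conv1:', t, 'x', f)   # after conv1: 79 x 33

# Example max pool: 2 x 2 window, stride 2.
t = pool_output(t, 2, 2)
f = pool_output(f, 2, 2)
print('after pool:', t, 'x', f)    # after pool: 39 x 16
```

The same two formulas apply to every subsequent layer, which makes it easy to check that the flattened feature size feeding the dense layer matches expectations.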
Execution steps
This section describes the steps we used in the end-to-end process for training, validation, and testing the speech recognition model on Intel® architecture.
- Setup for training
- Model training
- Inference
Setup for training
- Install the Intel® Optimization for TensorFlow* framework.
- Clone the TensorFlow repository from https://github.com/tensorflow/tensorflow.
Model training
After cloning the TensorFlow repository, the next step is to train the model. We trained all layers from scratch rather than fine-tuning a pretrained model.
The following command downloads the speech commands dataset and trains the algorithm toward detecting audio samples:
python tensorflow/examples/speech_commands/train.py
Experimental runs with inference
On the Intel Xeon Gold Processor – Intel® AI DevCloud Cluster
To execute on the Intel AI DevCloud cluster, use the following command to submit the training job:
qsub speech.sh -l walltime=24:00:00
On this cluster, jobs are limited by walltime: the default is six hours, and the maximum that can be requested is 24 hours. As shown in the qsub command, the walltime is set to 24 hours.
The job script speech.sh has the following code:
#!/bin/sh
#PBS -l walltime=24:00:00
which python
cd ~/tensorflow/
export PATH=/glob/intel-python/python3/bin/:$PATH
numactl --interleave=all python ~/tensorflow/tensorflow/examples/speech_commands/train.py
TensorBoard* Graphs
TensorBoard is an effective tool for visualizing training progress. By default, the script saves events to /tmp/retrain_logs; load them by running the following command:
tensorboard --logdir /tmp/retrain_logs
Figure 2 shows the TensorBoard graphs for the Intel Xeon Gold processor.
Figure 2. TensorBoard graphs - Intel Xeon Gold processor.
The script used to export the trained model file for inference is as follows:
echo python ~/tensorflow/tensorflow/examples/speech_commands/freeze.py \
  --start_checkpoint=~/kaggle-speech/speech_commands_train/conv.ckpt-68000 \
  --output_file=~/kaggle-speech/my_frozen_graph_68000.pb | qsub
After the frozen model has been created, using the following code, test it with the label_wav.py script:
echo python ~/tensorflow/tensorflow/examples/speech_commands/label_wav.py \
  --graph=~/kaggle-speech/my_frozen_graph_68000.pb \
  --labels=~/kaggle-speech/speech_commands_train/conv_labels.txt \
  --wav=~/kaggle-speech/speech_dataset/left/a5d485dc_nohash_0.wav | qsub

The job output lists the top-scoring labels:

left (score = 0.96563)
right (score = 0.02616)
_unknown_ (score = 0.00717)
left receives the top score, matching the clip's true label.
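The ranking step that produces this kind of output can be sketched in a few lines: apply softmax to the model's raw scores and sort. The label list and logit values below are made-up stand-ins for real model output, not values from this experiment.

```python
# Sketch of how a label_wav-style script ranks predictions: softmax the raw
# model outputs into probabilities, then report the highest-scoring labels.
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_predictions(labels, logits, how_many=3):
    """Return the `how_many` highest-scoring (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return ranked[:how_many]

labels = ['_silence_', '_unknown_', 'left', 'right', 'up']
logits = [0.1, 1.2, 6.3, 2.7, 0.4]   # made-up scores favoring 'left'
for label, prob in top_predictions(labels, logits):
    print('%s (score = %.5f)' % (label, prob))
```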
Intel® Xeon® Gold processor metrics
Table 3. Intel Xeon Gold processor metrics.
Properties | Intel Xeon Gold Processor |
---|---|
Total training time | 83,400 seconds |
Total number of steps | 68,000 |
Batch size | 100 |
Total WAV files processed | 6,800,000 |
WAV files per second (total WAV files / total time) | 81.53 |
Training accuracy | 93 percent |
Validation accuracy | 92 percent |
Testing accuracy | 92.5 percent |
Conclusion
In this paper, we showed how we trained and tested a speech recognition model from scratch using a sample CNN architecture and the TensorFlow Speech Commands dataset in an Intel Xeon Gold processor-based environment. The experiment can be extended by applying different optimization algorithms, changing learning rates, and varying input sizes to further improve accuracy.
About the Authors
Rajeswari Ponnuru and Ravi Keron Nidamarty are members of the Intel team, working on evangelizing artificial intelligence in the academic environment.
Related Resources
TensorFlow* Optimizations on Modern Intel® Architecture
Build and Install TensorFlow* on Intel® Architecture