Abstract
The Flashcard code sample demonstrates some of the speech recognition features in the Intel® RealSense™ SDK for Windows*. The SDK includes speech modules for integrating dictation and verbal command control into your applications. These two modes of operation provide the following:
- Dictation - The SDK module returns the user’s dictated sentence.
- Command and Control - The application defines a list of words as the command list and the SDK module recognizes speech based solely on the command list.
The Flashcard app uses the Command and Control mode to accept verbal input from the user; it does not demonstrate any Dictation features. The app displays simple multiplication problems and matches the user’s spoken response against the correct answer.
Introduction
This code sample demonstrates the basics of using the Command and Control speech recognition capabilities of the SDK. The app displays randomly generated multiplication problems and waits for verbal input from the user.
Figure 1: The Flashcard sample recognizes spoken numbers as input
If the user says the correct answer as shown in Figure 1, the app responds by displaying the user’s answer in green and indicating “Correct!” on the screen. After a short delay, the app displays another randomly created multiplication problem and awaits a response from the user.
Figure 2: Incorrect answers are displayed in red
If the user says the incorrect answer as shown in Figure 2, the app responds by displaying the user’s answer in red and shows the correct answer on the screen.
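The round logic behind these two outcomes is easy to sketch without any SDK calls. The short Python sketch below is our own illustration of that flow, not code from the sample, and its function names are placeholders:

```python
import random

def new_problem(rng=random):
    """Return a random single-digit multiplication problem as (a, b, product)."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return a, b, a * b

def check_answer(spoken_value, correct_answer):
    """Return the feedback text for an already-recognized numeric answer."""
    if spoken_value == correct_answer:
        return "Correct!"  # the sample shows this answer in green
    # for a wrong answer, the sample shows it in red with the correct result
    return "Incorrect. The answer is {}.".format(correct_answer)
```

In the real app the `spoken_value` comes from the speech recognition module; everything after that point is ordinary application logic.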
Purpose
The purpose of this code sample is to distill the complexities of the SDK down to the basics of using the speech recognition module and to present them in a simple use case.
Development Environment
The sample app can be built using Microsoft Visual Studio* Express 2013 for Windows Desktop or the professional versions of Visual Studio 2013.
Configuring the Speech Recognition Module
A method named ConfigureRealSense() is called on startup to prepare the app for accepting speech commands from the user. This method performs the following actions:
- Instantiates session and audio source objects
- Selects the audio device
- Sets the audio recording volume
- Creates a speech recognition instance
- Initializes the speech recognition module
- Builds and sets the active grammar
- Displays device information
The sample app selects the first audio device (index 0) from the audio source device list; however, the SDK provides a mechanism to scan and enumerate audio devices on the computer to allow a user to select the desired input device. This technique is shown in the SDK documentation.
In the sample app the recording volume is set to a fixed value. In a full-featured app, however, it is recommended to provide a control for setting this parameter and to give visual feedback indicating whether the recording volume is set appropriately.
Handling Speech Recognition Events
An OnRecognition() event handler is implemented to capture data from the speech recognition module when active recognition results are available. The RecognitionData structure passed to the handler describes the details of the recognition event (e.g., confidence, sentence, etc.).
The sample app uses a fixed threshold for evaluating the confidence level returned by the speech recognition module; however, the SDK documentation suggests that you “use thresholding to increase or decrease certain aspect of voice recognition. For example, your application may expose a graphical user interface control to let the user adjust what is the acceptable recognition rate. The application can use 50% as the baseline.”
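The thresholding idea quoted above can be illustrated in a few lines. The Python sketch below is our own stand-in (the `Candidate` type merely mimics the fields of interest in the SDK's RecognitionData structure): a result is acted on only when its confidence meets a user-adjustable threshold, with 50 as the baseline.

```python
from collections import namedtuple

# Stand-in for the fields of interest in the SDK's RecognitionData structure.
Candidate = namedtuple("Candidate", ["sentence", "confidence"])

class RecognitionFilter:
    """Accept recognition results whose confidence meets an adjustable threshold."""

    def __init__(self, threshold=50):
        # 50 is the baseline the SDK documentation suggests; a GUI control
        # could let the user raise or lower it.
        self.threshold = threshold

    def accept(self, candidate):
        """Return True if the result is confident enough to act on."""
        return candidate.confidence >= self.threshold
```

Raising the threshold reduces false recognitions at the cost of ignoring more genuine (but quiet or unclear) utterances, which is why exposing it to the user is worthwhile.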
Setting the Active Grammar
When using the Command and Control mode, the speech recognition module uses a list of commands (referred to as the “grammar”) and ignores any words or phrases not contained in the list. The commands can be loaded using either the BuildGrammarFromStringList() method to define the list programmatically or the BuildGrammarFromFile() method to read the grammar from a Java* Speech Grammar Format (JSGF) file. We use the latter method so that we can take advantage of a shorthand for our grammar, and not have to enter all possible answer numbers as distinct strings.
The Flashcard app uses the SDK’s BuildGrammarFromFile() method to open the grammar.jsgf file and build its grammar from the file’s contents. (For more information on the JSGF file format, refer to http://www.w3.org/TR/jsgf/.) The contents of grammar.jsgf are shown in the following listing.
#JSGF V1.0;
grammar Digits;
public <Digits> = ( <digit> ) + ;
<digit> = ( zero | one | two | three | four | five | six | seven | eight | nine | ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen | twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety );
The notation used in this code sample is similar to the examples shown in the SDK’s documentation (RSSDK_DIR\doc\PDF\sdkmanuals.pdf), which you are encouraged to review for a more thorough explanation of the available formatting options. The public <Digits> rule is the grammar’s entry point, and the <digit> rule lists the words the speech recognition module will accept. The “+” sign signifies that whatever comes before it can occur one or more times. This format permits not only single words like “four” to be recognized, but also phrases like “forty four”.
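To see why the repetition operator matters, consider how a multi-word utterance maps to a number. The Python sketch below is independent of the SDK; it simply converts phrases the grammar above would accept (one or more digit words, such as “forty four”) into integers, and rejects anything outside the grammar’s word list, much as the recognizer ignores out-of-grammar speech.

```python
# Word values mirroring the <digit> rule in grammar.jsgf.
UNITS = {w: v for v, w in enumerate(
    ["zero", "one", "two", "three", "four", "five", "six", "seven",
     "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
     "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"])}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def phrase_to_int(phrase):
    """Convert a phrase such as 'forty four' into 44.

    Returns None for an empty phrase or any word outside the grammar's
    word list.
    """
    words = phrase.lower().split()
    if not words:
        return None
    total = 0
    for word in words:
        if word in TENS:
            total += TENS[word]  # e.g. "forty" contributes 40
        elif word in UNITS:
            total += UNITS[word]  # e.g. "four" contributes 4
        else:
            return None  # out-of-grammar word
    return total
```

Without the “+” in `( <digit> ) +`, only single words would match and two-word answers like “forty four” could never be recognized.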
Check It Out
Download the app and learn more about how speech recognition works in the Intel RealSense SDK for Windows.
About Intel® RealSense™ Technology
To get started and learn more about the Intel RealSense SDK for Windows, go to https://software.intel.com/en-us/intel-realsense-sdk
About the Author
Bryan Brown is a software applications engineer in the Developer Relations Division at Intel.