
Intel® RealSense™ SDK Voice Command Sample Application


Download Code Sample [Zip: 23 KB]


Introduction

Thinking about exploring speech recognition in your code? Do you want more detailed information on the inner workings of the Intel® RealSense™ SDK and voice commands? In this article, we’ll show you a sample application that uses the speech recognition feature of the Intel RealSense SDK, using C# and Visual Studio* 2015, the Intel RealSense SDK R4 or above, and an Intel® RealSense™ camera F200.

Project Structure

In this sample application, I separated out the Intel RealSense SDK functionality from the GUI layer code to make it easier for a developer to focus on the SDK’s speech functionality. I’ve done this by creating a C# wrapper class (RSSpeechEngine) around the Intel RealSense SDK Speech module. Additionally, this sample app is using the “command” mode from the Intel RealSense speech engine.

The Windows* application uses a standard Windows Form class for the GUI controls and interaction with the RSSpeechEngine class. The form class makes use of delegates as well as multithreaded technology to ensure a responsive application.

I am not trying to make a bullet-proof application. I have added some degree of exception handling, but it’s up to you to ensure that proper engineering practices are in place to ensure a stable, user friendly application.

Requirements

Hardware requirements:

  • 4th generation Intel® Core™ processors based on the Intel microarchitecture code name Haswell
  • 8 GB free hard disk space
  • Intel RealSense camera F200 (must be connected to a USB 3 port)

Software requirements:

  • Microsoft Windows* 8.1/Win10 OS 64-bit
  • Microsoft Visual Studio 2010–2015 with the latest service pack
  • Microsoft .NET* 4.0 (or higher) Framework for C# development
  • Unity* 5.x or higher for Unity game development

WordEventArg.CS

WordEventArg derives from the C# EventArgs class. It’s a small wrapper that adds one private data member: the string _detectedWord, which holds the word that was detected by the speech engine.

This class is used as an event argument when the RSSpeechEngine class dispatches an event back to the Form class indicating the word that was detected.
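
For reference, here is a minimal sketch of what the class might look like. Only the _detectedWord field is described in the article; the DetectedWord property name is illustrative.

public class WordEventArg : EventArgs
{
    private string _detectedWord;      // word detected by the speech engine

    public WordEventArg( string detectedWord )
    {
        _detectedWord = detectedWord;
    }

    // Read-only access for the client (property name is illustrative).
    public string DetectedWord
    {
        get { return _detectedWord; }
    }
}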

RSSpeechEngine.CS

RSSpeechEngine is a wrapper class, an engine so to speak, around the speech module’s command mode. I wrote the class with the following goals in mind:

  • Cleanly and clearly isolate as much of the Intel RealSense SDK functionality as possible from the client application.
  • Isolate each of the steps needed to get the command mode up and running in easy-to-understand function blocks.
  • Try to provide comments in the code to help the reader understand what the code is doing.

Below, I describe functions that comprise the RSSpeechEngine class.

public event EventHandler<WordEventArg>     OnWordDetected;

The OnWordDetected event sends a message back to the client application, letting it know that a given word was detected. The client creates an event handler to handle the WordEventArg object.

public RSSpeechEngine( )

RSSpeechEngine is the constructor for the class, and it takes no parameters. The constructor creates the global session object, which is needed in many different areas of the class for initialization.

Next I create the speech recognition module itself. If that succeeds, the constructor creates the speech implementation object, followed by the grammar, the audio source, and the speech event handler. If none of those functions fail, _initialized is set to true, and the client application has the green light to try to use the class.

You might wonder whether the private _initialized variable is worth having, given that each function returns a Boolean value and, if it’s false, I manually throw an error. In this example, the only real benefit of this variable is in the StartSpeechRecognition() function, where _initialized acts as a gate, allowing the recognition to start or not.
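
A condensed sketch of the constructor’s flow under those assumptions follows; the exception messages and exact grouping of the helper calls are illustrative, not copied from the sample.

public RSSpeechEngine( )
{
    // Global session used throughout the class.
    _session = PXCMSession.CreateInstance( );
    if( _session == null )
        throw new Exception( "Unable to create the PXCMSession" );

    if( !CreateSpeechRecognitionModule( ) ) throw new Exception( "Unable to create the speech recognition module" );
    if( !CreateSpeechModuleGrammar( ) )     throw new Exception( "Unable to create the grammar" );
    if( !CreateAudioSource( ) )             throw new Exception( "Unable to create the audio source" );
    if( !CreateSpeechHandler( ) )           throw new Exception( "Unable to create the speech handler" );

    // Everything succeeded; the client may now start recognition.
    _initialized = true;
}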

public void Dispose( )

This is a cleanup function that the client application calls to ensure memory is properly cleaned up.

public bool Initialized

This property exposes the private _initialized variable for the client application to use if they so choose. This example uses it as a gate in the StartSpeechRecognition() function.

private bool CreateSpeechRecognitionModule( )

This sets up the module itself by calling the session object’s CreateImpl function specifying the name of the module to be created. This function has one “out” parameter, which is a PXCMSpeechRecognition object.
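
A minimal sketch of that call, assuming the generic form of CreateImpl and a pxcmStatus return check:

private bool CreateSpeechRecognitionModule( )
{
    // Ask the session for a speech recognition implementation.
    pxcmStatus status = _session.CreateImpl<PXCMSpeechRecognition>( out _speechRecognition );
    return status >= pxcmStatus.PXCM_STATUS_NO_ERROR && _speechRecognition != null;
}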

private bool CreateSpeechModuleGrammar( )

This function ensures the speech module has a grammar to work with. A grammar for command mode is a set of words that speech recognition will recognize. In this example I’ve used the words “Fire,” “Bomb,” and “Lazer” as a grammar.

It should be noted that it’s possible to have more than one grammar loaded into the speech recognition engine at one time. As an example, you could do the following, which would load three different grammars into the speech recognition module, all waiting to be used.

_speechRecognition.BuildGrammarFromStringList( 1, _commandWordsList1, null );
_speechRecognition.BuildGrammarFromStringList( 2, _commandWordsList2, null );
_speechRecognition.BuildGrammarFromStringList( 3, _commandWordsList3, null );

Then, when you want to use one of them, you would activate it with SetGrammar:

_speechRecognition.SetGrammar( 1 );
_speechRecognition.SetGrammar( 2 );
_speechRecognition.SetGrammar( 3 );

While you can build multiple grammars for speech recognition, only one grammar can be active at a time.

When might you want to use multiple grammars? Maybe you have a game where each level has a different grammar. You can load one list for one level, and set its grammar. Then on a different level, you can use a different word list that you’ve already loaded. To use it, you simply set that level’s grammar.

As the sample code shows, I created one array that contains three words. I use the BuildGrammarFromStringList function to load that grammar, then use the SetGrammar function to ensure it’s active.
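
A sketch of that single-grammar setup, assuming a member array named _commandWords (the member names are illustrative):

private string[] _commandWords = new string[] { "Fire", "Bomb", "Lazer" };

private bool CreateSpeechModuleGrammar( )
{
    // Grammar id 1 holds the only word list this sample needs.
    pxcmStatus status = _speechRecognition.BuildGrammarFromStringList( 1, _commandWords, null );
    if( status < pxcmStatus.PXCM_STATUS_NO_ERROR )
        return false;

    // Make grammar 1 the active grammar.
    return _speechRecognition.SetGrammar( 1 ) >= pxcmStatus.PXCM_STATUS_NO_ERROR;
}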

private bool CreateAudioSource( )

The CreateAudioSource function finds and connects to the Intel RealSense camera’s internal microphone. Even though I am specifically targeting this microphone, you can use any microphone attached to your computer. As an example, I’ve even used my Plantronics headset, and it works fine.

The first thing the function does is initialize the _audioSource PXCMAudioSource object by calling the session’s CreateAudioSource() function, and then check to ensure it was successfully created by checking it against null.

If I have a valid audio source, the next step is to create the device information for the audio source, which is covered in the next function description, CreateDeviceInfo(). For now, let’s assume that valid audio device information was created. I set the volume of the microphone, which controls the volume at which the microphone records the audio signal. Then I set the audio source’s device information that was created in the CreateDeviceInfo() function.
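
A sketch of that sequence; the volume value shown here is illustrative, not taken from the sample.

private bool CreateAudioSource( )
{
    _audioSource = _session.CreateAudioSource( );
    if( _audioSource == null )
        return false;

    // Find the Intel RealSense camera's microphone (see CreateDeviceInfo below).
    if( !CreateDeviceInfo( ) )
        return false;

    _audioSource.SetVolume( 0.2f );              // recording volume (illustrative value)
    _audioSource.SetDevice( _audioDeviceInfo );  // use the device found in CreateDeviceInfo
    return true;
}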

private bool CreateDeviceInfo( )

This function queries all the audio devices on a computer. First it instructs the audio source created in the previous function to scan for all audio devices on the computer by calling ScanDevices(). After the scan, the next step is to iterate over all the devices found. This step is important because ALL audio devices connected to your computer will be detected by the Intel RealSense SDK. For example, I have a Roland OctaCapture* connected to my computer at home via USB. When I run this function on my computer, I get eight different audio devices listed for just this one Roland unit.

There are several ways to do this, but I’ve chosen what seems to be the standard in the SDK examples I’ve seen: I loop over the devices, querying the system and populating the _audioDeviceInfo object with the audio device at index i. If the current audio device’s name matches the name of the audio device I want, I set the created variable to true and break out of the loop.

DEVICE_NAME was created and initialized at the top of the source code file as

string DEVICE_NAME = "Microphone Array (2- Creative VF0800)";

How did I know this name? I had to run the for loop, set a break point, and look at all the different devices as I iterated through them. It was obvious on my computer which one was the Intel RealSense camera in comparison to the other devices.

Once we have a match, we can stop looking. The global PXCMAudioSource.DeviceInfo object _audioDeviceInfo will now contain the proper device information to be used back in the calling function CreateAudioSource().

NOTE: I have seen situations where the device name is different. On one computer the Intel RealSense camera’s name will be "Microphone Array (2- Creative VF0800)" and on my other computer with the same Intel RealSense camera but a different physical device, the name will be "Microphone Array (4- Creative VF0800)". I’m not sure why this is but it’s something to keep in mind.
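
A sketch of the scan-and-match loop described above; stopping when QueryDeviceInfo no longer returns a device is one common way the SDK samples enumerate audio devices.

private bool CreateDeviceInfo( )
{
    bool created = false;
    _audioSource.ScanDevices( );

    for( int i = 0; ; i++ )
    {
        PXCMAudioSource.DeviceInfo deviceInfo;
        if( _audioSource.QueryDeviceInfo( i, out deviceInfo ) < pxcmStatus.PXCM_STATUS_NO_ERROR )
            break;   // no more devices to inspect

        if( deviceInfo.name == DEVICE_NAME )
        {
            _audioDeviceInfo = deviceInfo;   // remember the RealSense microphone
            created = true;
            break;
        }
    }
    return created;
}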

private bool CreateSpeechHandler( )

The CreateSpeechHandler function tells the speech recognition engine what to do after a word has been detected. I create a new PXCMSpeechRecognition.Handler object, ensuring it’s not null, and if not, I assign the OnSpeechRecognition function to the onRecognition delegate.
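
A sketch of that wiring:

private bool CreateSpeechHandler( )
{
    _handler = new PXCMSpeechRecognition.Handler( );
    if( _handler == null )
        return false;

    // Called by the SDK whenever a word from the grammar is recognized.
    _handler.onRecognition = OnSpeechRecognition;
    return true;
}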

private void OnSpeechRecognition( PXCMSpeechRecognition.RecognitionData data )

OnSpeechRecognition is the event handler for the _handler.onRecognition delegate, called when a word has been detected. It accepts a single RecognitionData parameter. This parameter contains things like a list of scores, which is used when we need to maintain a certain confidence level in the word that was detected, as can be seen in the if(..) statement. I want to be sure that the confidence level of the currently detected word is at least 50, as defined here:

int CONFIDENCE_LEVEL = 50;

If the confidence level is at least 50, I raise the OnWordDetected event, passing in a new instance of the WordEventArg object. WordEventArg takes a single parameter, which is the word that was detected.

At this point, OnWordDetected sends a message to all subscribers, in this case the Form, informing it that the speech module detected one of the words in the list.
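
A sketch of the handler; it assumes the first entry in the scores list holds the best match, with its sentence and confidence fields.

private void OnSpeechRecognition( PXCMSpeechRecognition.RecognitionData data )
{
    // Only report words we are reasonably confident about.
    if( data.scores[0].confidence >= CONFIDENCE_LEVEL )
    {
        if( OnWordDetected != null )
            OnWordDetected( this, new WordEventArg( data.scores[0].sentence ) );
    }
}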

public void StartSpeechRecognition( )

StartSpeechRecognition tells the speech recognition functionality to start listening for speech. First it checks to see if everything was initialized properly. If not, it returns out of the function.

Next, I tell it to start recording, passing in the audio source I want to listen to and the handler object whose function is called when a word is detected. And for good measure, I check to see whether any error occurred.
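
A sketch of the start logic, assuming the SDK’s StartRec call with the audio source and handler created earlier:

public void StartSpeechRecognition( )
{
    if( !_initialized )
        return;

    pxcmStatus status = _speechRecognition.StartRec( _audioSource, _handler );
    if( status < pxcmStatus.PXCM_STATUS_NO_ERROR )
        throw new Exception( "Unable to start speech recognition" );
}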

public void StopSpeechRecognition( )

The function calls the module’s StopRec() function to stop processing and kill the internal thread. This can take a few milliseconds, so if you immediately call Dispose() on this class, there is a strong chance the Dispose() code will throw an exception: setting _speechRecognition to null and calling its dispose method before StopRec() has completed will crash the application. This is why I added the Thread.Sleep() call. I want execution to pause just long enough to give StopRec() time to complete before moving on.
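
A sketch of the stop logic; the sleep duration shown is illustrative.

public void StopSpeechRecognition( )
{
    _speechRecognition.StopRec( );

    // Give StopRec()'s internal thread time to finish before Dispose()
    // tears the module down.
    System.Threading.Thread.Sleep( 500 );
}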

MainForm.CS

MainForm is the Windows GUI client to RSSpeechEngine. As mentioned previously, I designed this entire application so that the Form class only handles GUI operations and kicks off the RSSpeechEngine engine.

The RSSpeechEngineSample application itself is not a multithreaded application per se. However, because the PXCMSpeechRecognition module has a function that runs its own internal thread and we need data from that thread, we have to use some multithreaded constructs in the main form. This can be seen in the function that updates the label and the list box.

To start with, I create a global RSSpeechEngine _RSSpeechEngine object that gets initialized in the constructor. After that I declare two delegates (sketched below). These delegates do the following:

  • SetStatusLabelDelegate. Sets the application’s status label from stopped to running.
  • AddWordToListDelegate. Adds a detected word to the list on the form.
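
Illustrative declarations of those two delegates:

private delegate void SetStatusLabelDelegate( string status );
private delegate void AddWordToListDelegate( string word );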

public MainForm( )

This is the form’s constructor. It initializes the _RSSpeechEngine object and assigns the OnWordDetected event to the AddWordToListEventHandler function.
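
A sketch of that wiring, assuming a standard Windows Forms designer-generated InitializeComponent:

public MainForm( )
{
    InitializeComponent( );

    _RSSpeechEngine = new RSSpeechEngine( );
    _RSSpeechEngine.OnWordDetected += AddWordToListEventHandler;
}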

private void btnStart_Click( object sender, EventArgs e )

This is the start button’s click event handler, which updates the label to “Running” and calls the RSSpeechEngine’s StartSpeechRecognition() function.

private void btnStop_Click( object sender, EventArgs e )

This is the stop button’s click event handler, which tells the RSSpeechEngine to stop processing and sets the label to “Not Running.”

private void AddWordToListEventHandler( object source, WordEventArg e )

This is the definition for the AddWordToListEventHandler that gets called when _RSSpeechEngine has detected a word. It calls the AddWordToList function, which knows how to deal with the multithreaded functionality.

private void AddWordToList( string s )

AddWordToList takes one parameter, the word that was detected by the RSSpeechEngine engine. Due to the nature of multithreaded applications in Windows Forms, this function looks a little strange.

When dealing with multithreaded applications in Windows where form elements/controls need to be updated, you have to check a control’s InvokeRequired property. If it’s true, a delegate is required to make the update: a new instance of AddWordToListDelegate is created, specifying the function it is to call, which in this case is a call back into this same AddWordToList function.

Once the delegate has been initialized, I tell the form object to invoke the delegate with the original “s” parameter that came in.
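
A sketch of that pattern; the list box name lstWords is illustrative.

private void AddWordToList( string s )
{
    if( this.InvokeRequired )
    {
        // We are on the speech engine's thread: marshal the call onto the UI thread.
        AddWordToListDelegate d = new AddWordToListDelegate( AddWordToList );
        this.Invoke( d, new object[] { s } );
        return;
    }

    // Now on the UI thread: safe to update the control.
    lstWords.Items.Add( s );
}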

private void SetLabel( string s )

This function works exactly like AddWordToList in the multithreaded area.

private void FormIsClosing( object sender, FormClosingEventArgs e )

This is the event handler that gets triggered when the application is closing. I check to ensure that _RSSpeechEngine is not null, and if _RSSpeechEngine has been successfully initialized, I call StopSpeechRecognition() to force processing to stop and then call the engine’s Dispose function to clean up after itself.
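
A sketch of that handler:

private void FormIsClosing( object sender, FormClosingEventArgs e )
{
    if( _RSSpeechEngine != null && _RSSpeechEngine.Initialized )
    {
        _RSSpeechEngine.StopSpeechRecognition( );
        _RSSpeechEngine.Dispose( );
    }
}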

Conclusion

I hope this article and sample code have helped you gain a better understanding of how to use Intel RealSense SDK speech recognition. The same principles apply if you are using Unity. The intent was to show how to use Intel RealSense SDK speech recognition in an easy-to-understand, simple application, covering everything you need to successfully implement your own solution.

If you think I have left out any explanation or haven’t been clear in a particular area, shoot me an email at rick.blacker@intel.com or make a comment below.

About Author

Rick Blacker is a seasoned software engineer who spent many of his years authoring solutions for database driven applications. Rick has recently moved to the Intel RealSense technology team and helps users understand the technology.

