
User Feedback for Natural User Interfaces with Intel® RealSense™ Technology Part 2


By Justin Link on July 8, 2015

In my previous article, User Feedback for Natural User Interfaces with Intel® RealSense™ Technology Part 1, I discussed user expectations when interacting naturally with computers, the importance of user feedback in natural user interfaces (NUIs), and the challenges involved in creating it. I also discussed user feedback when using hands, based on our experience creating the games Space Between* and The Risen*, both built with Intel® RealSense™ technology. In this article, I discuss the other input modalities that Intel RealSense technology offers: head and voice.


An IDF 2015 attendee playing The Risen.

Interacting with your hands in an application (once you’ve gotten past the technical limitations of the recognition hardware and software) tends to be a natural and intuitive experience. We use our hands every day for nearly everything. They are our manipulators of the physical world, so it makes sense that they would manipulate virtual worlds as well. Using your head and face to manipulate or control things, however, is a much more abstract task. In contrast, using voice to control applications is intuitive but presents a different kind of challenge, one that mostly revolves around the state of the technology and our own expectations. In both cases, the challenges require specific design and attention to user feedback when developing an application.

A Quick Review of What User Feedback Is (and Why It’s Important)

In case you haven’t read Part 1 of this article, I want to quickly review what user feedback in an application is. Put simply, user feedback is any kind of notification the application gives the user to signal that the software has recognized their input or has changed state. For example, a button in an application usually changes visually when the mouse hovers over it, when the user presses it down, and when the user releases the click. These visual changes are important because they tell the user that this is an element that can be interacted with and what kind of input affects it. This user feedback even extends outside the software itself into the hardware we use to interface with it. The mouse, when clicked, gives an audible and tangible response to the user’s button press for the same reasons a button in software does. All of this stems from the natural way we interact with people and our environments.

User feedback is important in NUIs for three reasons: (1) most NUIs are touchless, and so any kind of haptic feedback is impossible, (2) there are no standards for user feedback in NUIs as there are in other software input modalities, and (3) when using NUIs there is an expectation that they behave, well, naturally. Losing haptic feedback in any input modality means that the software has to account for this either visually or aurally. If you don’t design this in, your software will be much less intuitive and will create more frustration for a new user.

The lack of any kind of universal standard for this feedback means not only that you will have to figure out what works best, but also that your users will be unfamiliar with the kinds of feedback they’re getting and will have to learn them, increasing the application’s learning curve. Working around user expectations is the final challenge, since the medium stems directly from (and is trying to be) human language. Unfortunately, the current state of the technology is not mature enough to replicate a natural conversation, so you will find users interacting with the software as they would with another person, only to discover that it doesn’t quite work the same way.

Head Tracking in the Intel® RealSense™ SDK


Image 1. Head tracking in the Intel RealSense SDK showing the orientation of a user’s head.

The advantages of knowing where a user’s head is aren’t immediately obvious. Unlike hands, you probably won’t want to build intricate control mechanisms on top of head-position tracking, or you’ll risk giving your users self-induced whiplash. However, light use of head-position tracking can add a unique layer to your application and can even immerse users further in your virtual worlds. The Intel RealSense SDK has a couple of limitations with head tracking that you’ll need to consider, especially when designing user feedback into your application.

Limitations of Head Tracking

The Tracking Volume


Image 2. The tracking volume for the Intel RealSense SDK is finite and can be restrictive to the application.

As I discussed in Part 1, understanding that tracking in the Intel RealSense SDK happens only within the camera’s field of view is critical to knowing how to use the device. This is the number one problem users have when interacting with any Intel RealSense application, and it applies to every modality the SDK offers except voice. The limitation is less pronounced with the head-tracking module since users will generally be seated, but it can still be an issue, especially if your application has the user leaning left and right.
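One way to surface this limitation is to build the field of view into the feedback itself, for example by fading in a warning (a vignette, an arrow, or a dimmed border) as the tracked head approaches the edge of the frame. The sketch below is one minimal way to do that; it assumes the tracking layer already gives you a head position normalized to the camera image, and the 15 percent margin is an illustrative value rather than a tuned one.

// Sketch: warn the user when the tracked head nears the edge of the
// camera's field of view. Assumes headX/headY are already normalized
// to 0..1 across the camera image by your tracking layer.
#include <algorithm>

struct EdgeWarning {
    bool  nearEdge;   // true when the head is close to leaving the frame
    float strength;   // 0..1, e.g., the alpha of a warning vignette
};

EdgeWarning CheckTrackingEdge(float headX, float headY, float margin = 0.15f) {
    // Distance from the nearest edge of the normalized image rectangle.
    float dx = std::min(headX, 1.0f - headX);
    float dy = std::min(headY, 1.0f - headY);
    float d  = std::min(dx, dy);

    EdgeWarning warning;
    warning.nearEdge = d < margin;
    // Fade the warning in as the head approaches the edge.
    warning.strength = warning.nearEdge ? 1.0f - std::clamp(d / margin, 0.0f, 1.0f) : 0.0f;
    return warning;
}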

Tracking is Based on Face Detection

Most of the time users will be facing the camera, so the software’s need to detect a face for head tracking won’t be a huge issue. Face detection matters most during initial detection and during subsequent re-detections when tracking is lost. The software can have particular trouble picking up a user’s face when the camera is placed well above or below the user’s head (creating a sharp perspective from the camera’s view). The solution, as with hands, is to show the camera the thing it’s looking for: in this case, the face. Needing a detected face has other implications for head tracking too, like not being able to track the back of the head if the user turns around.
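When tracking is lost, the most effective feedback is often a simple, clearly worded prompt to face the camera again. Here is a small sketch of that idea, assuming your integration can report each frame whether a face was detected; the one-second grace period is an assumption to keep brief detection hiccups from flashing UI at the user.

// Sketch: show a "face the camera" prompt after tracking has been lost
// for a short grace period. faceDetected comes from whatever your
// face/head tracking module reports each frame.
class FaceLossPrompt {
public:
    // Call once per frame with the current detection result.
    void Update(bool faceDetected) {
        framesLost_ = faceDetected ? 0 : framesLost_ + 1;
    }

    // Only show the prompt after the grace period so brief detection
    // hiccups don't flash UI at the user.
    bool ShouldShowPrompt() const { return framesLost_ > kGraceFrames; }

private:
    static const int kGraceFrames = 30;  // roughly one second at 30 fps
    int framesLost_ = 0;
};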

Head as a Cursor


Image 3. The Twilight Zone stage from Space Between used a glowing cursor to track the user’s head.

In Space Between, we used a head cursor to represent the player’s head position in two dimensions on the screen. While our usage didn’t have users selecting things with their head the way you normally would with a cursor, we ended up basing the control of a whale on our “hands as a cursor” implementation.

Next I will talk about some of our challenges when designing this kind of head-tracking interaction, go over our implementation, and discuss from a usability perspective what worked and what didn’t.

Our Challenges

People often lean out of tracking bounds.
Again, this is the most common issue for new users of Intel RealSense applications, but different users had different levels of engagement in our application, which for some meant leaning much further than we anticipated. Leaving the tracking bounds wouldn’t be a problem if it didn’t force the SDK to re-detect the user afterward. We found that this could throw off the entire experience, so we needed to design around it.

Moving up/down wasn’t as intuitive for people as we expected.
Horizontal movement when leaning maps pretty naturally to a cursor when you’re using your head for input, but vertical movement isn’t quite the same. Literally raising and lowering your head didn’t make sense for vertical position, since it would mean users having to stand up or crouch down to move the cursor up or down relative to the camera. Instead we chose to use distance from the camera (leaning in and out) to control vertical cursor position, as sketched below, but we found that this wasn’t as intuitive for some players as we had expected.
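To make that mapping concrete, here is a minimal sketch of the approach described above: horizontal head position drives the cursor’s X, and distance from the camera drives its Y. The normalized inputs and the near/far lean range are assumptions for illustration, not values taken from Space Between.

// Sketch: map horizontal lean and camera distance to a 2D cursor.
// headX is the head's horizontal position normalized to the camera
// image (0..1); headZ is its distance from the camera in meters.
#include <algorithm>

struct HeadCursor { float x; float y; };  // both normalized 0..1

HeadCursor MapHeadToCursor(float headX, float headZ) {
    // Leaning left/right maps directly to horizontal cursor position.
    float x = std::clamp(headX, 0.0f, 1.0f);

    // Leaning in/out maps to vertical position: remap a comfortable lean
    // range to 0..1 (flip it if dive/ascend feels reversed to players).
    const float nearZ = 0.4f;  // meters, leaned all the way in (illustrative)
    const float farZ  = 0.9f;  // meters, leaned all the way back (illustrative)
    float y = std::clamp((headZ - nearZ) / (farZ - nearZ), 0.0f, 1.0f);

    return HeadCursor{ x, y };
}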

The Cursor in Space Between

For Space Between, we used the head cursor to control a whale in a stage called the Twilight Zone. In this stage, the player could lean left or right to swim left or right, and lean in and out to dive and ascend. The whale was fixed to a rail, and leaning would make the whale swim within a certain distance of that rail, allowing players to pick up points along the way.
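As a rough sketch of that rail constraint, the cursor can be remapped to an offset around the rail point and smoothed so small head movements don’t make the whale jitter. The types, the offset limit, and the smoothing factor here are illustrative stand-ins rather than values from our implementation.

// Sketch: follow the head cursor while staying within maxOffset of the
// rail, with simple exponential smoothing. Vec2 is a stand-in for
// whatever vector type your engine provides; cursorX/cursorY are the
// normalized 0..1 head-cursor coordinates.
struct Vec2 { float x, y; };

Vec2 UpdateWhaleOffset(Vec2 currentOffset, float cursorX, float cursorY,
                       float maxOffset, float smoothing /*0..1 per frame*/) {
    // Remap the 0..1 cursor to a -maxOffset..+maxOffset range around the rail.
    Vec2 target { (cursorX - 0.5f) * 2.0f * maxOffset,
                  (cursorY - 0.5f) * 2.0f * maxOffset };

    // Ease toward the target each frame instead of snapping to it.
    currentOffset.x += (target.x - currentOffset.x) * smoothing;
    currentOffset.y += (target.y - currentOffset.y) * smoothing;
    return currentOffset;
}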

What Worked
From a user-feedback perspective, showing a cursor that mapped to the head’s position helped players understand how their head was being tracked. We also front-loaded each game with instructions and graphics showing which input modalities were being used, which helped prepare players to explore the input. Once people understood exactly what the cursor was showing (head position), it also helped them intuitively learn where the camera’s field of view was, since a cursor at the edge of the screen meant the player’s head was at the edge of the camera’s field of view.


Image 4. A snippet from our instructions in Space Between showing the input we’re using from the Intel RealSense SDK.

What Didn’t Work
While we did have an animation of the whale turning as you leaned left or right, it was pretty subtle, and there were times when people didn’t know they were moving in the direction they intended. We needed a stronger visual indication that leaning was directly related to moving the whale left, right, up, or down. There was also sometimes initial confusion over what the cursor represented. To alleviate that confusion, I think we could have done a better job showing or explaining that the cursor represented head position.

Takeaways

  • It’s important to prepare the user for the kind of input they will be doing.
  • To help eliminate loss of tracking, show in some way where the edge of the camera’s field of view is relative to the input being used.
  • The control is much more intuitive when the visual feedback is tied to what a user’s input is controlling.

Voice Recognition in the Intel RealSense SDK

I saved the most challenging aspect for last, and I’ll start with a disclaimer: most of what we learned was about limitations and what didn’t work. For those not familiar, voice recognition in the Intel RealSense SDK comes in two flavors: command and dictation. In command mode you give the SDK a specific list of commands to listen for, while in dictation mode you’re given a string of recognized speech as it comes. While we tried some things that definitely improved the user experience with the voice module, it has still been by far the most frustrating modality for users to use and for us to implement. The challenge here is to leverage user feedback to mitigate the technical limitations of voice recognition.
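As a rough illustration of how command mode fits into application code, here is a minimal sketch built around a hypothetical SpeechModule wrapper; it is not the SDK’s actual API. The application registers the vocabulary it cares about and reacts only when the recognized text matches one of those commands.

// Sketch: command-mode wiring around a hypothetical SpeechModule wrapper
// (not the SDK's actual API). The wrapper holds the command vocabulary
// and invokes a callback when recognized text matches one of them.
#include <functional>
#include <string>
#include <utility>
#include <vector>

class SpeechModule {
public:
    void SetCommands(std::vector<std::string> commands,
                     std::function<void(const std::string&)> onCommand) {
        commands_ = std::move(commands);
        onCommand_ = std::move(onCommand);
    }

    // Call this with each string the recognition backend returns.
    void DeliverRecognition(const std::string& text) {
        for (const auto& cmd : commands_) {
            if (text == cmd) {
                if (onCommand_) onCommand_(cmd);
                return;
            }
        }
        // Unmatched text is simply ignored in command mode.
    }

private:
    std::vector<std::string> commands_;
    std::function<void(const std::string&)> onCommand_;
};

// Example usage for a game like The Risen:
//   speech.SetCommands({ "forward", "back", "attack", "defend" },
//       [](const std::string& cmd) { /* update state, flash the word on screen */ });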

Limitations of Voice Recognition

The module’s accuracy does not meet user expectations

Most people have had some experience with voice recognition software such as Apple Siri*, Google Now*, or Microsoft Cortana*. Each of those solutions is cloud-based, leveraging enormous amounts of data, complex algorithms, and so on. Those capabilities aren’t available in a local solution like the one the Intel RealSense SDK uses. Because user expectations are set by the more capable cloud-based solutions, you’ll need to manage and mitigate this limitation through design and by providing instructions and user feedback.

There is sometimes a significant delay between spoken commands and recognized speech

Depending on the application, there will sometimes be a significant delay between when the user speaks a command and when the Intel RealSense SDK processes that command and returns it as text.

Voice pitch, timbre, and volume play a role in voice-recognition accuracy

From our experience, adult male voices are recognized best, while higher-pitched and quieter voices are not recognized as well.

Accents play a role in voice-recognition accuracy

For English, there are two versions you can set in the Intel RealSense SDK: American and British. This obviously does not cover the range of accents within those dialects, so people with other accents will have a harder time getting their speech recognized.

Microphone quality plays a large role in voice-recognition accuracy

The microphone built into the Intel RealSense camera (F200) works well as an all-around webcam microphone, but for voice recognition we’ve found that headset mics work better.

Environment noise plays an even larger role in voice-recognition accuracy

This is the biggest challenge for any voice recognition-enabled application. Environments vary greatly in ambient noise, and voice recognition works best in quiet environments where detected speech is clear and discernible. A headset mic helps mitigate the problem, but don’t expect voice recognition to work well outside of a relatively quiet home office.

Voice Recognition as an Input Controller


Image 5. Here I am giving a powerful command to my skeletons in The Risen.

Using your voice to command an application is one of the biggest ways you can knock down the wall between humans and computers. When it works, it’s magical. When it doesn’t, it’s frustrating. In our game The Risen we used voice to let players give real commands to their skeleton minions. Next I will talk about some of our challenges and how we approached them from a user feedback perspective.

Our Challenges

Voice commands often go unrecognized.
This alone is enough to make you think hard about whether to include voice input at all, and designing user feedback that mitigates it within the technical limitations of the Intel RealSense SDK is a challenge.

Users often don’t know why their command wasn’t recognized.
Was it because I wasn’t loud enough? Because I didn’t speak clearly? Because the module didn’t initialize, or does it just not like me? These are some of the questions users ask themselves when trying to use voice recognition.

It was easy to forget what the commands were when playing for the first time.
We did our best to represent the voice commands with icons, but when you only see the words once, they are easy to forget.

Voice Recognition in The Risen

In The Risen, you could issue four simple commands to direct your skeletal minions: forward, back, attack, and defend. Each of these put your skeletons into a specific behavior state, allowing for high-level control of their actions. Skeleton states were represented by colored icons in the GUI and by effects on the skeletons themselves.

We also had GUI elements to give users feedback on when speech detection had begun and ended, as well as a slider to control microphone input volume. For feedback on detecting commands, we started playing an animation of a mouth moving on our GUI player skeleton when we received the LABEL_SPEECH_BEGIN alert and stopped playing it when we received LABEL_SPEECH_END. The microphone slider was there to improve the quality of recognized speech, but it also changed color to indicate whether the detected speech was too loud or too quiet.


Image 6. The microphone slider in The Risen.
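Here is a minimal sketch of how that feedback might be wired together: the speech alerts described above toggle the talking-mouth animation, and the detected input volume picks a slider color. The thresholds and color choices are illustrative assumptions, not values from The Risen.

// Sketch: drive the talking-mouth animation from speech alerts and color
// the microphone slider from the detected input volume. HandleAlert is
// assumed to be called from your integration's alert callback with the
// alert's label as a string.
#include <string>

struct VoiceFeedbackUI {
    bool mouthAnimating = false;

    void HandleAlert(const std::string& label) {
        if (label == "LABEL_SPEECH_BEGIN")    mouthAnimating = true;
        else if (label == "LABEL_SPEECH_END") mouthAnimating = false;
    }

    // Returns a color for the slider based on normalized input volume.
    const char* SliderColor(float volume /*0..1*/) const {
        if (volume < 0.2f) return "blue";   // too quiet to recognize reliably
        if (volume > 0.8f) return "red";    // too loud / clipping
        return "green";                     // in the usable range
    }
};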

What Worked
In terms of knowing what state the skeletons were in, the visual effects on the skeletons were the most informative. Before we had them, people would spam commands without knowing the skeletons were already in that state, or wonder why specific skeletons were in a different state than the global skeleton state (an intended game mechanic). The visual effects also helped us better debug our skeleton AI.

The microphone volume slider ended up helping so much that I recommend any game using voice recognition implement this feature. Not only did it provide a way to dynamically adjust microphone input volume and improve the success rate of recognized commands, it also gave us a way to tell users why input might not be working. This is huge for mitigating user frustration because it implicitly told users that the microphone was working and that speech was being detected, and it gave them a hint on how to improve their input.

What Didn’t Work
The animated player skeleton that was supposed to indicate when a user was talking didn’t quite work for telling people when commands were being recognized. I think this was because there were quite a few things to look at in the interface, so the animation detail was often overlooked. That said, we only created a short demo level for this game, so users didn’t have much time to become familiar with the UI.

I also think the icons we used to represent the skeleton state were mostly overlooked. This probably would have been fine for a game not controlled by voice, but with the goal of informing the user what command was just detected (and when), it was a problem. To show that a voice command was recognized, I think we needed to flash the word on the screen for a second or so to get the user’s attention and let them know the system recognized their command. This approach would also help users remember the specific commands they need to use.
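As a small illustration of that idea, here is a sketch of a timed on-screen flash for the recognized command. The one-second default and the class name are hypothetical, not from The Risen’s code.

// Sketch: flash the recognized command on screen for about a second.
// Call Show() when a command is recognized and Update(dt) once per
// frame; Text() returns an empty string after the flash expires.
#include <string>

class CommandFlash {
public:
    void Show(const std::string& command, float duration = 1.0f) {
        text_ = command;
        timeLeft_ = duration;
    }

    void Update(float dt) {
        if (timeLeft_ <= 0.0f) return;
        timeLeft_ -= dt;
        if (timeLeft_ <= 0.0f) text_.clear();
    }

    const std::string& Text() const { return text_; }

private:
    std::string text_;
    float timeLeft_ = 0.0f;
};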

Takeaways

  • Tell the user when speech is being detected, before it has been processed, to avoid frustration and repeated commands while processing happens.
  • Make it obvious what the speech controls, and make the changes between those states apparent.
  • Give users a microphone input volume slider that also indicates when the detected speech is too loud or too quiet.
  • Consider showing the recognized command on the screen to help users remember what the available commands are.
  • Make it obvious to the user when a command has been recognized.

Man’s New Best Friend

Computers are finding their way into every aspect of our lives. As their technology continues to improve, we find new ways to utilize them and entrust them with even greater responsibilities. Fast approaching is the day when even we ourselves have bits of computers integrated into us. As such, our relationship with computers is becoming more and more human.

Intel RealSense technology and other NUIs are the first step in this direction. These technologies give us the capability to truly shift our perspective on how we see and interact with our world. But the relationship is still young, and as designers and developers we are responsible for guiding it in the right direction. One day our computers will be like our best friends, able to anticipate our intentions before we even start expressing them; but for now, they’re more like our pets and still need a little help telling us when they need to go outside.

About the Author

Justin Link is an Interactive Media Developer for Chronosapien Interactive in Orlando, Florida. His game Space Between placed second in the Intel® Perceptual Computing Challenge. The game used the Intel Perceptual Computing gesture camera and focused on using gestures and voice to control underwater sea creatures in three mini-games. As one of the top 10% of Intel Innovators, he has trained more than 1,000 developers on perceptual computing, including the 2014 Intel RealSense technology.

For More Information

User Feedback for Natural User Interfaces with Intel RealSense Technology Part 1 – by Justin Link
Using Intel® RealSense™ to Create "The Space Between" – Intel Software TV video
Space Between* Plumbs Ocean Depths with Intel® RealSense™ Technology – a case study with Ryan Clark of Chronosapien
Get the Intel RealSense SDK
Get an Intel® RealSense™ camera

