By Justin Link
Software enables us to reach beyond the limitations of human ability. It has given the blind the power to read, taken us to the moon and back, and connected us globally in ways that seemed like science fiction only decades ago. However, despite the power software has in our lives, the way we interact with it is, in most cases, still rudimentary.
With the advent of natural user interfaces, or NUIs, like Intel® RealSense™ technology, we can interact with software in new ways that are more second nature. NUIs can make the work we do more efficient, more user friendly, and more powerful. But with these new ways to interact with software, designers must now create a new language to support the way that we interface with it.
This article will, in two parts, examine our experience at Chronosapien Interactive in developing NUIs for games, focusing specifically on how important user feedback is in this environment. In this space, many of the old rules and tricks we’ve used before no longer apply, and ironically we must look back at the ways we’ve been interfacing with each other for thousands of years.
User Expectations
Software in its current state is rigid and unforgiving. It assumes nothing and expects explicit commands before it delivers the intended action. We have been trained well and have adapted to software’s requirements. With NUIs, though, that expectation changes. Everything we’ve learned about computers and how they understand the world disappears when we say “hello” to them. When we’re told to wave at our screen and it doesn’t respond immediately, we’re confused, because from our perspective we did exactly what we were asked. Part of this disconnect comes from a lack of understanding of the technology, but most of it comes from asking users to communicate naturally, which leads them, justifiably, to personify the computer. Though users act as if they’re communicating with a person, they won’t receive many of the cues used in natural communication, such as facial expression, eye shape, eye contact, and body language. You need to make up for that by creating obvious responses to a user’s interactions that say things like “we got your message,” “I didn’t understand that because…,” and “got it, I’m working on an answer.” A certain level of training is also needed to set users’ expectations. Think of it like meeting a new friend from another country.
Have you ever been in a conversation with someone who took an extended pause to put their thoughts together mid-idea? Or waved at someone who awkwardly responded with a half-raised wave back? Maybe you’ve been in a loud room and heard only bits of your friend shouting, “it’s about time to leave.” In situations like these, you were able to use context clues and past experience to interpret the other person’s idea from partial information. Each of these situations, however, poses a problem for current NUI technology.
Though some information was missing from these example interactions, most of us could reconstruct the intent through other, related information. When someone pauses mid-idea to collect their thoughts, you don’t completely forget what was said before, or respond to their half-finished sentence without letting them finish. That’s because you know, through cues such as voice intonation, facial expression, and eye contact, that they had something else to say. If someone gives you an awkward, half-raised wave, you don’t get completely confused because their hand signal didn’t conform to the universal standard for waving. Instead, you interpret it as what they were most likely trying to say in the given context, and you probably make some assumptions about their personality so that you can better handle information from them in the future. When hearing only part of a statement in a loud, crowded room, you don’t need a full and complete sentence to understand it’s time to leave. The two important things to glean from these examples are context and related information. One of the themes running through my examples of giving user feedback for NUIs is that it’s better to give too much information than not enough.
Challenges in Creating User Feedback in NUIs
The analogy of trying to talk to someone in a loud and crowded room is actually a good one for working with NUIs—except that in that room you’ve got the short term memory of a toddler and the contextual awareness of a fruit fly. Below are a few of the main challenges in creating user feedback using data in an Intel RealSense application:
- You often won’t know when a user has started to interact with the application
- You won’t be able to distinguish between a user interacting with the application and the user doing something completely unrelated
- You won’t be able to easily distinguish between a user who is interacting with the application and another person who happens to be within the camera view
- Data for interactions will be noisy and sometimes wrong
- Data is not constrained to real-world limitations
- Data takes time to process, making for awkward pauses between command and response
The sections below address these challenges, with a focus on hand interactions, across the different ways we have implemented Intel RealSense technology. In general, these are things you will want to keep in mind when designing both the user feedback and the interactions themselves. Though the work I’ve done has produced solutions to some of these, they still stand as major hurdles to using computers naturally. When developing with and for NUIs, prepare for lots of testing and iteration. Some of the problems you will encounter are a result of the hardware, some are a result of the SDK, and others are simply problems with NUIs in general.
Hand Tracking in the Intel® RealSense™ SDK
The ability for software to interpret hand movements opens new avenues for designers. Aside from offering an intuitive platform on which to build interactions, using hands provides a level of immersion into an application that cannot be had any other way. With the Intel RealSense SDK, developers have access to a multitude of tracked nodes in a hand, its current “openness” state and value, poses, size, position, and gestures. These abilities do not come without limitations, however, and as with the other Intel RealSense application modalities, developers will have to mask or otherwise account for these. Below, I review these limitations as well as discuss some of the different implementations using hands we have tried.
Limitations of Hand Interactions
The Tracking Volume
Image 1. The tracking volume for the Intel® RealSense™ hand modality is finite and can restrict the application
One of the biggest problems with hand interactions in the SDK is the hardware’s limited tracking range. Because of the wide range of motion humans have with their arms and hands, hands quite often leave this volume. Leaving the tracking volume is the most common issue new users have when using their hands in Intel RealSense applications.
Occlusion
Image 2. Hand occlusion, from Robust Arm and Hand Tracking by Unsupervised Context Learning
After the limited tracking volume, the next biggest limitation of the SDK, and of other image-based tracking systems, is occlusion. In simple terms, when something is blocked by something else, it is occluded. This is a particular problem when tracking hands because many natural poses or gestures involve moments when hands occlude themselves from the camera’s perspective. Conversely, when using a screen as the viewing medium, hands will often occlude the screen from the user’s perspective.
Hand Size Relative to Screen Size
When interacting naturally with hands in an application, it is intuitive to design the interface as if the user is reaching into the viewing medium, most often a screen. However, when hands become the interaction method in this way, not much space is left on the screen to do anything else. This creates problems both for the GUI and for the application itself.
Arm Fatigue
While using your hands to manipulate the digital world is a liberating experience, it is easily overdone. One of the biggest problems we’ve seen with our applications, and many others that use hands to interact, is that users become fatigued within 60 to 90 seconds of use. Resting elbows on a desk mitigates this, but it doesn’t completely solve the problem.
No Haptic Feedback
Of all that is lost in the transition from traditional computer interfaces, haptic feedback has to be the most valuable. Simple feedback, such as the snap when a button is clicked, no longer exists when swiping in the air. This means the application must account for its absence both visually and aurally.
Hands as a Cursor
Image 3. Our implementation of hands as a cursor in Space Between. Our cursor is the glowing ball near the sharks.
In our Space Between game, we found that using hands like a cursor was a simple way to control the application. It provided an intuitive bridge between controlling the application traditionally with a mouse and using hands in a new way. Below I discuss some of the problems we encountered with this approach, our implementation, and what worked and what didn’t from a usability perspective.
Our Challenges
At a glance, here are the problems we encountered when using hands like a cursor.
Users didn’t know what they were controlling
In Space Between, users have direct control over a glowing ball that follows their hand position relative to the screen in real time. We used this in our games by having the player character follow the cursor, which resulted in somewhat indirect control over the creature. Many times when people first play the game, it takes a moment for it to click that they are in fact controlling the cursor, and not the creature itself.
Users didn’t know what was controlling the cursor
Because our game reused the cursor method in different contexts and in different ways, users sometimes became confused with what was supposed to be controlling the cursor.
Users’ hands often left the tracking volume
As stated earlier, this is the most common problem when using hands to interact with an Intel RealSense application. Even though the cursor would visibly reach the edge of the screen, users did not associate that with reaching the edge of the tracking volume.
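Before getting into the specific cursors, here is a minimal sketch of the basic hand-to-cursor mapping all three of our approaches share, assuming the tracker reports a normalized hand position each frame. HandSample and Cursor are hypothetical types for illustration, not part of the SDK or our shipped code; the clamping shows why the cursor "sticks" to the screen border as the hand nears the edge of the volume.

```cpp
#include <algorithm>

// Stand-in for whatever per-frame hand data the tracking SDK reports;
// the normalized 0..1 coordinates are an assumption for illustration.
struct HandSample {
    float x;        // 0 = left edge of the tracking volume, 1 = right edge
    float y;        // 0 = top edge, 1 = bottom edge
    bool  tracked;  // false when the hand has left the volume or is occluded
};

struct Cursor {
    float screenX = 0.0f;
    float screenY = 0.0f;
};

// Map the hand sample into pixel coordinates. Clamping means the cursor
// visibly "sticks" to the screen border as the hand nears the volume edge,
// which is the only hint users get unless extra feedback is added.
void UpdateCursor(const HandSample& hand, Cursor& cursor,
                  float screenWidth, float screenHeight)
{
    if (!hand.tracked)
        return; // hold the last known position instead of snapping to zero

    cursor.screenX = std::clamp(hand.x, 0.0f, 1.0f) * screenWidth;
    cursor.screenY = std::clamp(hand.y, 0.0f, 1.0f) * screenHeight;
}
```

Holding the last known position on tracking loss is a deliberate choice: a cursor that snaps to a default location reads as a bug to the user, while one that simply freezes at least matches their mental model of "the system stopped seeing me."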
The Cursor in Space Between
In Space Between, we used the 2D cursor method in three ways:
A Gust of Wind
Image 4. The gust of wind cursor from Space Between
What worked
Of the three, the gust of wind was the most abstract to control. What it had going for it was that its amorphous shape masked much of the positional noise that occurs in an Intel RealSense application. It also used an audio loop whose volume was driven by the cursor’s velocity. This was nice because, aside from the cloud visually moving, it let users know when they were and weren’t being tracked while moving their hands.
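A rough sketch of that velocity-to-volume mapping, assuming the engine exposes a simple handle for the looping sound; AudioLoop and the 1500 px/s ceiling are illustrative assumptions, not our actual audio code.

```cpp
#include <algorithm>
#include <cmath>

// Stand-in for the engine's handle to the looping wind sound.
struct AudioLoop {
    float volume = 0.0f;
    void SetVolume(float v) { volume = std::clamp(v, 0.0f, 1.0f); }
};

// Drive the loop's volume from how fast the cursor moved this frame, so
// users hear that tracking is live even if they miss the visual motion.
void UpdateWindAudio(AudioLoop& loop,
                     float prevX, float prevY,  // cursor position last frame (pixels)
                     float curX,  float curY,   // cursor position this frame (pixels)
                     float deltaTime)           // seconds since last frame
{
    float speed = std::hypot(curX - prevX, curY - prevY) / std::max(deltaTime, 1e-4f);

    // Map speed into 0..1; the 1500 px/s ceiling is a tuning guess.
    loop.SetVolume(speed / 1500.0f);
}
```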
What didn’t work
While the amorphous shape was good for masking noise in the position data, it made it more difficult to discern exactly where the cursor was on the screen. This was a problem when doing things like hovering to select games.
Glowing Orb
Image 5. A look at another cursor in Space Between
What worked
Letting the cursor emit light onto the environment, yet drawing it on top, meant that users could tell where in the environment their character would go without issues like the cursor getting lost inside walls. Because of its relatively small size, it also showed off the accuracy of the SDK’s hand module. Initially, we used the orb by itself to represent the cursor. A problem we ran into was that it was easy to lose during fast hand motions. To account for this, we created a particle trail behind the cursor that lasted a second or so. This had the side effect of making it fun to just move the cursor around, since you could draw shapes in the air. Lastly, to help connect the cursor to the player’s character, we created a trail between the two. This helped especially when the player character was trapped by the environment and was no longer moving.
What didn’t work
The main problem with the glowing orb in our games was that users sometimes didn’t realize they were controlling the cursor and not the character itself. Another issue was that, aside from controlling character position, the orb also tried to represent another function in the game: hand openness. To show this, we increased the intensity of the light attached to it and made the orb itself brighter. Looking forward, we will probably add more to the cursor to visually show that hand openness is changing it, and possibly show a brief graphic of a hand near it to let users know exactly what they have control over.
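As an illustration of the openness-to-brightness idea described above, here is a minimal sketch. PointLight is a placeholder for the engine's light component, and the 0 to 100 openness range and the intensity values are assumptions for the example, not our shipped tuning.

```cpp
#include <algorithm>

// Placeholder for the engine's point light attached to the orb.
struct PointLight {
    float intensity = 1.0f;
    float range     = 5.0f;
};

// Scale the orb's light with hand openness (assumed here to be 0 = fist,
// 100 = fully open) so the user can see that the system tracks the change.
void ApplyOpennessToOrb(PointLight& light, float openness)
{
    float t = std::clamp(openness / 100.0f, 0.0f, 1.0f);

    light.intensity = 0.5f + t * 1.5f;  // dim when closed, bright when open
    light.range     = 3.0f + t * 4.0f;  // light reaches farther as the hand opens
}
```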
Hand Cursor
Image 6. Our hand cursor in Space Between, used for menu-type interactions
What worked
The hand cursor was by far the easiest and most intuitive to use of the three cursor styles. Because the cursor was the shape of a hand (and showed the correct hand, left or right), users knew immediately that they were controlling it. Taking it further, we created animated sprites that blended between hand poses to reflect the current state of the hand. This was great because it told the user immediately that the system had recognized the change and was responding to it. Even if the action wasn’t being used in the current context, the player easily learned what the application was able to interpret, and how well.
What didn’t work
Though the hand cursor was great for usability, it stuck out like a sore thumb in the game’s environment. This meant that unless we wanted to break immersion, we could only use it in application-control contexts such as pause or options menus.
Takeaways
Our three approaches showed that there is no single answer to how to implement a hand cursor; it’s application and context specific. However, there are a couple of rules I think can be applied across the board when giving user feedback for hand cursors:
- Reflect state changes in the hand both visually and aurally whenever possible. This helps the player understand what is being tracked and intuitively teaches them what the system’s capabilities are.
- Make it clear to the user when they are leaving the tracking volume. This is something we currently lack in Space Between, but it prevents a lot of user experience headaches, such as not understanding why tracking was lost or why there is a delay when the hand returns. A minimal sketch of one approach follows this list.
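The sketch below shows one way the second takeaway could be handled, assuming the tracker reports a normalized hand position. EdgeWarning and the 15% threshold are hypothetical; the point is simply to fade in a cue before the hand is lost rather than after.

```cpp
#include <algorithm>

// Hypothetical overlay that fades in as the hand approaches the edge
// of the tracking volume.
struct EdgeWarning {
    float opacity = 0.0f;   // 0 = hidden, 1 = fully visible
};

void UpdateEdgeWarning(EdgeWarning& warning, float normX, float normY, bool tracked)
{
    if (!tracked) {
        warning.opacity = 1.0f;  // hand already lost: show the warning fully
        return;
    }

    // Distance from the nearest edge of the normalized (0..1) tracking volume.
    float edgeDist = std::min({normX, 1.0f - normX, normY, 1.0f - normY});

    // Begin fading the warning in once the hand is within 15% of any edge.
    const float threshold = 0.15f;
    warning.opacity = std::clamp(1.0f - edgeDist / threshold, 0.0f, 1.0f);
}
```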
Hands Using Gestures
Image 7. The first stage of the raise gesture in The Risen
Gestures are a powerful way to communicate ideas and perform actions. Their expressiveness allows for very specific control and a feel that can be completely unique to the environment in which they’re used. Using gestures helped define our Intel RealSense technology games, Space Between and The Risen, and connect players to the actions they perform. As before, I will first discuss the problems we encountered when using gestures, then how we implemented them, and finally what we thought did and didn’t work with our approaches.
Our Challenges
Gestures are more complex than simple position tracking of features. Here are some of the main issues we faced when designing for gesture input.
There is no way to tell when a gesture has begun
This depends somewhat on the gesture being used, but in general the out-of-the-box gestures in the Intel RealSense SDK give no indication that they have started; you only find out after the gesture has been performed. This may seem trivial, but for complex gestures, having to wait until you’ve completed the gesture only to find that it didn’t work makes for tedious repetition.
Many people perform the gesture correctly, but not accurately enough to be recognized by the application
As I mentioned earlier, gesture recognition software is quite rigid in its detection. Swipes need to travel a specific distance, hand poses need to be performed in certain ways, certain distances from the camera must be maintained, and so on. All of this together often makes hand gestures frustrating to use.
Certain hand angles are not optimized for tracking in Intel RealSense technology
One of the biggest problems with the hand tracking algorithms is the inability to track certain hand angles. Currently the system is great at detecting hands with palms pointed directly at the camera, but not so great when hands are perpendicular to it. This has implications for many uses of hands, but for gestures specifically it is a problem when trying to create and perform gestures with complex motion. For example, in our game The Risen we created a gesture to raise skeletons in which users first show their palms to the camera, then bring their hands low and point their palms toward the ceiling, then lift them to complete the raise. During the part of the gesture where the hands become flat, the application often loses track of the hands, breaking the gesture mid-performance.
The Raise Gesture in The Risen
Image 8. The second stage of the raise gesture in The Risen
In The Risen, a custom raise gesture was essential in giving players the feeling that they were a part of the world. Here are some things we learned from building it.
What worked
We really wanted to make sure that people knew exactly what motion they were expected to perform, since the gesture would be used so much throughout the game. We also wanted to avoid complicated text trying to describe minute details of hand positions over time. Our solution was to put animated hands in the scene, in a learning section of the game, to show exactly how the gesture should be performed. The animated hands were the same size as the user’s hands in the scene, so players were easily able to see what was expected.
When designing the gesture, we knew that the player would most likely not have their hands positioned in the scene. We were also aware of the SDK’s hand tracking limitations. To account for this, we made the first step in the gesture a pose that the gesture recognition module could easily recognize. This had the second benefit of being a place where we could give the user feedback along the lines of “hey, it looks like you’re about to perform the raise gesture.” Being able to visually and aurally signal that the system knows the gesture is in its first stage both prevents unnecessary repetition and teaches the player intuitively what the system is looking for.
Following the theme of breaking the gesture into sections for usability, we also triggered effects and audio when the second stage of the gesture was reached. Because our gesture was relatively complex (and unique), this helped signal to the player that they were performing it correctly and were now entering the final stage.
While we did break the gesture up into parts for technical and usability reasons, it should be noted that it can be performed completely fluidly; the stages in the gesture were there to give users cues that they were performing the gesture correctly, or to give a place to look when they were not.
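To make the staged approach concrete, here is a rough sketch of how such a gesture could be structured as a small state machine. The pose flags and height thresholds are hypothetical stand-ins for whatever the tracking data actually provides; the point is that each transition is a natural place to fire visual and audio cues, and a tracking loss resets cleanly instead of leaving the player guessing.

```cpp
// Sketch of a staged gesture, loosely modeled on the raise in The Risen.
enum class RaiseStage { Idle, PalmsShown, HandsLowered, Raised };

struct HandsState {
    bool  tracked;
    bool  palmsTowardCamera;   // hypothetical pose flag
    bool  palmsUp;             // hypothetical pose flag
    float height;              // normalized hand height, 0 = bottom of the volume
};

RaiseStage UpdateRaiseGesture(RaiseStage stage, const HandsState& hands)
{
    if (!hands.tracked)
        return RaiseStage::Idle;             // tracking lost: drop back and re-cue the player

    switch (stage) {
    case RaiseStage::Idle:
        if (hands.palmsTowardCamera)
            return RaiseStage::PalmsShown;   // feedback: "raise gesture started"
        break;
    case RaiseStage::PalmsShown:
        if (hands.palmsUp && hands.height < 0.3f)
            return RaiseStage::HandsLowered; // feedback: "second stage reached"
        break;
    case RaiseStage::HandsLowered:
        if (hands.height > 0.7f)
            return RaiseStage::Raised;       // gesture complete: raise the skeletons
        break;
    case RaiseStage::Raised:
        break;
    }
    return stage;
}
```

Because the stages only gate feedback, a player who performs the whole motion fluidly passes through them without ever noticing the machinery.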
What didn’t work
Our biggest problem with the gesture came from the gesture module’s tracking limitations. When players reached the part of the gesture where the hands become flat relative to the camera, tracking would often drop out, cancelling the gesture in the process. This isn’t something we have much control over at this point, but looking forward, I think that training users about this limitation would help.
Takeaways
Here are a couple of the key things to remember when designing feedback for hand gesture input:
- Proper set up and explanation are key to understanding how to perform gestures. We used an animation of 3D hands to show our gesture, and I think this works best since it shows the user what to do.
- Giving feedback at different stages in complex gestures helps to avoid potential frustration. Once users are more comfortable with the technology, telling them specifically when the system is working (or not) helps to avoid having to run through the gesture again and again, not really knowing where it is failing.
Virtual Hands
Image 9. Using virtual hands to interact with the environment in The Risen
Being able to reach into a virtual world and interact with it as we would our own is a truly liberating experience. The level of immersion gained from doing so cannot be achieved any other way. In our game The Risen, we let players reach into the environment to open doors or activate traps. Below I’ve listed some of the problems we encountered when using hands to interact in this way, how we implemented virtual hands in The Risen, and how that worked out.
Our Challenges
Though controlling virtual hands is pretty awesome, implementing them can be a little tricky with the SDK’s out-of-the-box features. Here are some problems you can expect to design around.
Data is noisy
A rigged hand displayed and controlled via data from the SDK is quite jittery. Though the SDK has some smoothing algorithms, they don’t completely remove unwanted noise.
Data is not constrained to real-world limitations
Along with being noisy, hand nodes will sometimes become oriented in ways that are simply not physically possible. They also have a tendency to jump across the screen at lightning speed for a few frames at a time when visibility of certain parts of the hand is low.
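One way to approach both problems, sketched below, is to layer a simple filter on top of the SDK's own smoothing: an exponential low-pass to damp jitter, plus a speed check that rejects physically implausible jumps. The 3 m/s limit and the smoothing factor are tuning guesses for illustration, not values from our actual implementation.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

// Per-joint filter: exponential smoothing for jitter, plus rejection of
// samples that would require implausibly fast motion (the "teleport" frames).
struct JointFilter {
    Vec3 smoothed    = {0.0f, 0.0f, 0.0f};
    bool initialized = false;

    Vec3 Update(const Vec3& raw, float deltaTime)
    {
        if (!initialized) {
            smoothed = raw;
            initialized = true;
            return smoothed;
        }

        float dx = raw.x - smoothed.x;
        float dy = raw.y - smoothed.y;
        float dz = raw.z - smoothed.z;
        float speed = std::sqrt(dx * dx + dy * dy + dz * dz) / std::max(deltaTime, 1e-4f);

        // Ignore frames where the joint appears to move faster than a
        // plausible hand speed (3 m/s here is a tuning guess).
        if (speed > 3.0f)
            return smoothed;

        // Exponential smoothing: alpha trades responsiveness against jitter.
        const float alpha = 0.35f;
        smoothed.x += alpha * dx;
        smoothed.y += alpha * dy;
        smoothed.z += alpha * dz;
        return smoothed;
    }
};
```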
Minute interactions are very difficult to perform and detect
We wanted players to be able to interact with objects in the world that were relatively small compared to the size of the hand. However, the combination of noisy data, a vague sense of depth, and no haptic feedback makes this almost impossible.
Virtual Hands in The Risen
In The Risen, players can reach into the world with ghostly skeleton hands, which are used to help the player’s skeletons in different ways, such as opening doors or triggering traps for enemies. We learned quite a bit from implementing them.
What worked
Image 10. GUI showing a detected face and right hand in The Risen
The first thing worth mentioning is the GUI we created for The Risen. In it, a skull in the upper left represents the player and their currently tracked features. When hands are detected, they animate in the GUI to show the player that the system recognizes them. Though simple, having a way for the player to understand what is and isn’t working really helps. For example, if the system is detecting the user’s head but not their hands, their hands are probably outside the tracking volume.
To indicate which objects in the world could be used with hands, we displayed an icon that hovered over them and showed how they could be interacted with when the player saw them for the first time. We wanted players to know the different kinds of things that could be used, but we also wanted a sense of discovery when finding something interactive in the environment. Showing an icon in the early parts of the game was a nice balance between these two goals.
I’ll mention our initial approach to interacting with environment objects in the “what didn’t work” section below, but what we ended up with, and what worked pretty well, was a simple grab gesture that used the entire hand. This addressed two of the problems above, a vague sense of depth and no haptic feedback, and didn’t really compromise the game. However, it did mean that we had to be more selective about the kinds of objects that could be interacted with this way, since two or more in the same area as the hand would be triggered at the same time.
To indicate to users when their hands are in the “interacting” state (a closed hand), we changed their color. In this way, using hands was almost like using buttons: there was an inactive state and an active state that made it clear what the application was expecting. From there, all users had to figure out was where to trigger the interaction. A rough sketch of this state change follows.
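A minimal sketch of that button-like state change, assuming an openness value in the 0 to 100 range; the thresholds and tint values are illustrative, not our shipped numbers. The hysteresis (closing well below the opening threshold) keeps the state from flickering when openness hovers around a single cutoff.

```cpp
// Treat the whole hand as a two-state "button" driven by hand openness.
struct HandVisual {
    bool  interacting = false;
    float tintR = 1.0f, tintG = 1.0f, tintB = 1.0f;
};

void UpdateGrabState(HandVisual& hand, float openness /* 0 = fist, 100 = open */)
{
    // Hysteresis: enter the interacting state only when clearly closed,
    // and leave it only when clearly open again.
    if (!hand.interacting && openness < 25.0f)
        hand.interacting = true;          // hand closed: start looking for grabbable objects
    else if (hand.interacting && openness > 60.0f)
        hand.interacting = false;         // hand opened: release

    // Recolor the hand so the active state reads like a pressed button.
    if (hand.interacting) { hand.tintR = 0.4f; hand.tintG = 1.0f; hand.tintB = 0.6f; }
    else                  { hand.tintR = 1.0f; hand.tintG = 1.0f; hand.tintB = 1.0f; }
}
```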
What didn’t work
When we first envisioned using hands to interact with the environment, we pictured things like pulling chains and pushing books as if they were right in front of you. The problem, as it turned out, was that it was very difficult to accurately perform these minute interactions. Grabbing a chain with your fingers when you can’t really perceive depth or receive any haptic feedback made for lots of tedious, failed attempts at interacting. While I think this issue may be mitigated with more accurate tracking, the real solution is a stereoscopic display and haptic feedback in the fingers.
Takeaways
A quick recap of the main lessons learned using virtual hands:
- Simple gesture interactions work best. Maybe when the technology has matured or if you’re using a different viewing medium, you can try small gestures, but for now stick to the basics.
- Give visual and aural feedback when hands are in the “interacting” state. This tells the user when the system is looking for objects in range, simplifying the interaction.
To be continued…
In Part 1, I have discussed our experiences and findings in giving user feedback for Intel RealSense applications, specifically as they relate to the hand input modality. In the next article, I will explore the other modalities Intel RealSense technology has to offer: head tracking, emotion detection, and voice recognition. Keep your eyes open for Part 2, coming in a few weeks.
About the Author
Justin Link is an Interactive Media Developer for Chronosapien Interactive in Orlando, Florida. His game Space Between placed 2nd in the Intel® Perceptual Computing Challenge. The game utilized the Intel Perceptual Computing gesture camera and focused on using gestures and voice to control underwater sea creatures in three mini games. In the top 10% of Intel Innovators, he has trained more than 1000 developers on perceptual computing, including the new 2014 Intel RealSense technology.