Introduction
User experience (UX) guidelines exist for implementing Intel® RealSense™ technology in applications. However, these guidelines are hard to visualize for four main reasons: (a) you have to interpret end-user interaction in a non-tactile environment during the application design phase, before you have a prototype for end-user testing; (b) the application could run on different form factors, such as laptops and All-in-Ones, where the Field of View (FOV) and user placement for interaction differ; (c) you have to work with the different fidelities and FOVs of the color and depth cameras; and (d) different Intel® RealSense™ SDK modalities have different UX requirements. A real-time feedback mechanism for gauging these effects is therefore critical. In this article, we cover an application developed to help Intel® RealSense™ application developers visualize the UX requirements and implement these guidelines in code. The source code for the application is available for download through this article.
The Application
The application works with the user-facing cameras only. Both the F200 and SR300 cameras are covered, and you can switch seamlessly between the two within the application. With the F200 camera, the application runs on Windows* 8 or Windows® 10; the SR300 camera requires Windows 10.
There are two windows within the application. One window provides the real-time camera feed where the user can interact. This window also provides visual indicators, analogous to the visual feedback you will provide in your own application; in each of the scenarios below, we call out the visual feedback that has been implemented. The other window shows the code snippets required to implement a specific UX scenario. In the sections below, I will walk you through the scenarios covered. The application is developed in WPF.
Version of the Intel RealSense SDK: Build 8.0.24.6528
Version of Intel® RealSense™ Depth Camera Manager (DCM) (F200): Version 1.4.27.52404
Version of the Intel RealSense Depth Camera Manager (SR300): Version 3.1.25.2599
The application is expected to work on more recent versions of the SDK but has been validated only on the versions listed above.
Scenarios
General scenarios
Depth and RGB resolution
The RGB and the depth cameras support different resolutions and have different aspect ratios. Different modalities also have different resolution requirements for each of these cameras.
Below is a snapshot of the stream resolution and frame rate as indicated in the SDK documentation on working with multiple modalities:
The UX problem:
How do I know which areas of the screen real estate should be used for 3D interactions and which ones for UI placement? How can I indicate to the end user visually or through auditory feedback when they have moved out of the interaction zone?
The implementation:
The application uses the SDK APIs (mentioned below) to obtain the color and depth resolution data for each modality and overlays the depth map on the color map to show where the two superimpose. Within the camera feed window, look for the yellow boundary that indicates the region where the color and depth maps overlap. This is your visual feedback. From a UX perspective, you can now visually identify which areas of the screen should be used for 3D interactions within the FOV as opposed to UI element placement. Experiment by selecting the different modalities in the first column and choosing from the available color and depth resolutions to understand the implications of RGB-to-depth mapping for your desired usage. The snapshots below show some examples of how this overlap changes with the change in inputs.
Example using both depth and color:
Experiment with how the mapping changes as the user switches between different color and depth resolutions. Also choose other modalities that use both depth and RGB to see how the supported color and depth resolution lists change.
Example using only depth:
An example where this is handy is when you are using the hand skeletal tracking. You do not need the color camera for this use case; however, you can switch between the available depth resolutions to see how the screen mapping changes.
Example using only color:
If your application is restricted to using only facial detection, 2D capability will suffice as all you need is the bounding box for the faces. However, if you need the 78 landmarks, you will need to switch to using the 3D example.
The sample application available for download from this article walks through the code required to implement this in your application. As a highlight, the two APIs you will need to build the iterative lists of depth and color resolutions for each modality are PXCMDevice.QueryCaptureProfile(int i) and PXCMVideoModule.QueryCaptureProfile(). For the visual representation of how the two maps overlap, however, you will have to use the Projection interface. Each pixel has a color and a depth value associated with it, but to overlay the depth map on the color map this example uses only a single depth value. To obtain it, the application uses the blob module as a workaround: it takes the blob closest to the camera (say, your hand) and maps the center of that blob (observable as a cyan dot on the screen). The depth value of that pixel is then used as the single depth value for mapping the depth map to the color map.
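To make the overlap concrete, the following sketch (not the sample's exact code) maps the four corners of the depth image into color-image coordinates at a single depth value, such as the depth at the center of the closest blob; the resulting points can then be drawn as the yellow boundary. The helper name GetDepthToColorOverlap is illustrative, and the exact MapDepthToColor wrapper signature should be verified against your SDK version.

```csharp
// Sketch only: map the corners of the depth image into color-image space at a
// single depth value (for example, the depth at the center of the closest blob),
// so the overlap region can be drawn as the yellow boundary in the camera feed.
private PXCMPointF32[] GetDepthToColorOverlap(
    PXCMCapture.Device device, int depthWidth, int depthHeight, float depthValueMm)
{
    PXCMProjection projection = device.CreateProjection();

    // Four corners of the depth image, all assumed to sit at the chosen depth (in mm).
    PXCMPoint3DF32[] depthCorners = new PXCMPoint3DF32[]
    {
        new PXCMPoint3DF32() { x = 0,              y = 0,               z = depthValueMm },
        new PXCMPoint3DF32() { x = depthWidth - 1, y = 0,               z = depthValueMm },
        new PXCMPoint3DF32() { x = depthWidth - 1, y = depthHeight - 1, z = depthValueMm },
        new PXCMPoint3DF32() { x = 0,              y = depthHeight - 1, z = depthValueMm }
    };

    PXCMPointF32[] colorCorners = new PXCMPointF32[depthCorners.Length];
    pxcmStatus status = projection.MapDepthToColor(depthCorners, colorCorners);
    projection.Dispose();

    // Returns the corner coordinates in color-image space (or null on failure);
    // the caller can draw a polygon through them as the overlap indicator.
    return status >= pxcmStatus.PXCM_STATUS_NO_ERROR ? colorCorners : null;
}
```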
Optimal Lighting
The Intel RealSense SDK does not provide any direct API to identify the lighting situation in the environment where the camera is operating. Bad lighting can result in a lot of noise within the color data.
The UX problem:
From within an application, it would be helpful to provide visual feedback asking the user to move to a better-lit environment. Within the sample application, watch how the camera feed displays the current luminance value on the screen.
The implementation:
The application uses the RGB values and applies the log-average luminance technique to estimate the lighting conditions. More information on the use of log-average luminance can be found here.
The formula used to compute the luminance value of each pixel is:
L = 0.27R + 0.67G + 0.06B
The log-average of these per-pixel values is then taken over the whole frame to summarize the scene. The resulting values range from 0 for pitch black to 1 for very bright light. We do not define a threshold in this sample because that is something developers will have to experiment with. Factors that can affect the luminance value include backlighting, black clothing (many pixels rated close to 0 bring down the average), and outdoor versus indoor lighting conditions.
Since we have to perform this calculation per pixel in each frame of data, this is a compute-intensive operation. The application shows how to implement this computation using the GPU for optimal performance.
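For clarity, here is a minimal CPU-side sketch of the same computation (the sample itself performs it on the GPU for performance). It assumes the color frame has already been copied into a 32-bit BGRA byte array; the helper name is illustrative.

```csharp
// Reference (CPU) implementation of the log-average luminance computation.
// Assumes the color frame is available as a BGRA byte array with values 0-255.
private static double ComputeLogAverageLuminance(byte[] bgraPixels)
{
    const double delta = 0.0001;   // small offset so log(0) is never taken
    double logSum = 0;
    int pixelCount = bgraPixels.Length / 4;

    for (int i = 0; i < bgraPixels.Length; i += 4)
    {
        double b = bgraPixels[i]     / 255.0;
        double g = bgraPixels[i + 1] / 255.0;
        double r = bgraPixels[i + 2] / 255.0;

        // Per-pixel luminance: L = 0.27R + 0.67G + 0.06B
        double luminance = 0.27 * r + 0.67 * g + 0.06 * b;
        logSum += Math.Log(delta + luminance);
    }

    // Log-average luminance of the frame: exp of the mean of the per-pixel logs.
    // Close to 0 for a dark scene, approaching 1 for a very bright one.
    return Math.Exp(logSum / pixelCount);
}
```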
Raw Streams
The Intel RealSense SDK provides APIs for capturing the color and depth streams. In some cases, however, it may be necessary to capture the raw streams to perform low-level computation. The Intel RealSense SDK exposes a C++ API with .NET wrappers, which means the memory containing the images lives in unmanaged memory. This is suboptimal when displaying images in WPF.
One way to work around this is to use the PXCMImage.ToBitmap() API to create an unmanaged HBITMAP wrapped around the image data, then call System.Windows.Interop.Imaging.CreateBitmapSourceFromHBitmap() to copy the data into the managed heap and wrap a WPF BitmapSource object around it.
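A sketch of that CPU-bound path is shown below, assuming colorImage is a PXCMImage acquired from the color stream; here the conversion goes through PXCMImage.ImageData.ToBitmap after AcquireAccess, and the helper name ToBitmapSource is illustrative.

```csharp
// Sketch of the CPU-bound approach described above.
[System.Runtime.InteropServices.DllImport("gdi32.dll")]
private static extern bool DeleteObject(IntPtr hObject);

private System.Windows.Media.Imaging.BitmapSource ToBitmapSource(PXCMImage colorImage)
{
    PXCMImage.ImageData data;
    if (colorImage.AcquireAccess(PXCMImage.Access.ACCESS_READ,
            PXCMImage.PixelFormat.PIXEL_FORMAT_RGB32, out data) < pxcmStatus.PXCM_STATUS_NO_ERROR)
        return null;

    // YUY2 -> RGB32 conversion and the copy into a GDI bitmap both happen on the CPU.
    System.Drawing.Bitmap bitmap = data.ToBitmap(0, colorImage.info.width, colorImage.info.height);
    IntPtr hBitmap = bitmap.GetHbitmap();
    try
    {
        // Second copy: unmanaged HBITMAP into a managed WPF BitmapSource.
        return System.Windows.Interop.Imaging.CreateBitmapSourceFromHBitmap(
            hBitmap, IntPtr.Zero, System.Windows.Int32Rect.Empty,
            System.Windows.Media.Imaging.BitmapSizeOptions.FromEmptyOptions());
    }
    finally
    {
        DeleteObject(hBitmap);          // avoid leaking the GDI handle
        bitmap.Dispose();
        colorImage.ReleaseAccess(data);
    }
}
```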
The UX problem:
The problem with the above-mentioned approach is that the YUY2-to-RGB conversion is done on the CPU, after which we also have to copy the image data from unmanaged to managed memory. This slows the pipeline down considerably and can result in dropped frames and a jittery display.
The implementation:
The application shows an alternative implementation using the Direct3D* image source (D3DImage) introduced in .NET Framework 3.5 SP1, which allows arbitrary DirectX* 9 surfaces to be included in WPF. We implement an unmanaged DirectX library that performs the color conversion for display on the GPU. This approach also allows GPU-accelerated image processing via pixel shaders for any custom manipulation needed (for example, processing depth image data). The snapshot below shows the raw color, IR, and depth streams, and the depth image as rendered by the custom shader.
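The WPF side of this interop can be sketched as follows. The unmanaged renderer that performs the YUY2-to-RGB conversion and exposes an IDirect3DSurface9 pointer is assumed to exist elsewhere, and the class and property names here are illustrative only.

```csharp
// WPF side of the Direct3D interop: wrap the shared surface in a D3DImage and
// invalidate it once per camera frame after the unmanaged side has rendered.
using System;
using System.Windows;
using System.Windows.Interop;

public class D3DStreamPresenter
{
    private readonly D3DImage d3dImage = new D3DImage();

    public D3DImage ImageSource { get { return d3dImage; } }   // bind an <Image> Source to this

    // surfacePtr is the IDirect3DSurface9* exposed by the unmanaged renderer (assumed).
    public void SetBackBuffer(IntPtr surfacePtr)
    {
        d3dImage.Lock();
        d3dImage.SetBackBuffer(D3DResourceType.IDirect3DSurface9, surfacePtr);
        d3dImage.Unlock();
    }

    // Call once per camera frame, after the unmanaged side has rendered the new image.
    public void PresentFrame()
    {
        if (!d3dImage.IsFrontBufferAvailable) return;   // e.g., during a device loss
        d3dImage.Lock();
        d3dImage.AddDirtyRect(new Int32Rect(0, 0, d3dImage.PixelWidth, d3dImage.PixelHeight));
        d3dImage.Unlock();
    }
}
```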
Facial Recognition
One of the most commonly used modalities in the Intel RealSense SDK is the face module. It can detect up to four people in the FOV and provides 78 landmark points for each face. Using these data points, it is possible to integrate facial recognition into applications. Windows Hello* in Windows 10 uses these landmarks to build templates that identify people at login. More information on how Windows Hello works can be found here. In this application, we focus on some of the UX issues around this module and how to provide visual feedback that guides end-user interaction for a better experience.
The UX problem:
The most prominent UX challenge comes from the fact that end users may not know where the camera's FOV is. They may be completely outside this frustum or too far away from the computer and hence out of range. The Intel RealSense SDK provides many alerts to capture these scenarios, but implementing them to give the end user visual feedback when they move out of the FOV is critical. In the application, when the end user is within the FOV and the allowed range, a green bounding box indicates that they are within the interaction zone. Experiment by moving your head toward the edges of your computer or by moving farther away; you will notice a red bounding box appear as soon as the camera loses face data.
The implementation:
The Intel RealSense SDK provides the following alerts for effectively handling user errors: ALERT_FACE_OUT_OF_FOV, ALERT_FACE_OCCLUDED, and ALERT_FACE_LOST. For more information on alerts, refer to the PXCMFaceModule documentation. The application uses a simple ViewModel architecture to capture the alerts and act on them in the XAML code.
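A sketch of the wiring is shown below, assuming senseManager is a PXCMSenseManager pipeline on which EnableFace() has been called; the FaceInView view-model property is an illustrative name, not one from the sample.

```csharp
// Subscribe to the face alerts and flip a view-model flag that the XAML binds to
// for the green/red bounding box.
private void ConfigureFaceAlerts(PXCMSenseManager senseManager)
{
    PXCMFaceModule faceModule = senseManager.QueryFace();
    PXCMFaceConfiguration faceConfig = faceModule.CreateActiveConfiguration();
    faceConfig.detection.isEnabled = true;    // bounding-box (detection) data is enough here
    faceConfig.EnableAllAlerts();             // includes the three alerts listed above
    faceConfig.SubscribeAlert(OnFaceAlert);
    faceConfig.ApplyChanges();
    faceConfig.Dispose();
}

private void OnFaceAlert(PXCMFaceData.AlertData alertData)
{
    switch (alertData.label)
    {
        case PXCMFaceData.AlertData.AlertType.ALERT_FACE_OUT_OF_FOV:
        case PXCMFaceData.AlertData.AlertType.ALERT_FACE_OCCLUDED:
        case PXCMFaceData.AlertData.AlertType.ALERT_FACE_LOST:
            viewModel.FaceInView = false;     // XAML switches the bounding box to red
            break;
        default:
            viewModel.FaceInView = true;      // simplification: any other alert means the face is visible again
            break;
    }
}
```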
Immersive Collaboration
Imagine a photo booth setup where you are trying to obtain a background-segmented image of yourself. As mentioned in the Depth and RGB resolution scenario above, the range for each Intel RealSense modality is different. So how do we indicate to the end user the optimal range of the 3D camera so that they can position themselves accordingly within the FOV?
The UX problem:
As with the facial detection scenario, it is important to provide a visual indicator when the end user moves in and out of range. In this application, note that the slider is set to the optimal range of the camera FOV for 3D segmentation (indicated in green). To find the minimum range, move the left slider toward the end with the picture of the camera and note how the pixels turn white. To find the maximum optimal range, move the right slider toward the right; beyond the optimal point, the pixels are tinted red. The range between the two sliders is the optimal range for segmentation.
Take a look at the last image for a second; it illustrates another UX issue when using background segmentation (BGS). As I move closer to the background, in this case the chair, the 3D segmentation module merges the foreground and the background object into one blob. You will also notice this when you have a black background and are wearing a black shirt: identifying depth across uniform pixels is hard. We do not address that scenario in this application, but it is a UX challenge to be aware of.
The implementation:
The 3D segmentation module provides alerts for handling such UX scenarios. Some of the important alerts implemented here are ALERT_USER_IN_RANGE, ALERT_USER_TOO_CLOSE, and ALERT_USER_TOO_FAR. The application uses these alerts to drive the pixel tinting as well as textual feedback indicating when the user is too close or too far.
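A sketch of a handler for these alerts is shown below. The handler is registered on the PXCM3DSeg instance returned by senseManager.Query3DSeg(); check the PXCM3DSeg documentation for the exact subscription call and alert-data type in the .NET wrapper, and note that the OutOfRange and RangeMessage view-model properties are illustrative names only.

```csharp
// Handler for the 3D segmentation range alerts; the alert-data shape is assumed
// to mirror the other modules (a 'label' field of the module's AlertEvent type).
private void OnSegmentationAlert(PXCM3DSeg.AlertData alertData)
{
    switch (alertData.label)
    {
        case PXCM3DSeg.AlertEvent.ALERT_USER_TOO_CLOSE:
            viewModel.OutOfRange = true;
            viewModel.RangeMessage = "Too close - please move back";   // pixels tinted white
            break;
        case PXCM3DSeg.AlertEvent.ALERT_USER_TOO_FAR:
            viewModel.OutOfRange = true;
            viewModel.RangeMessage = "Too far - please move closer";   // pixels tinted red
            break;
        case PXCM3DSeg.AlertEvent.ALERT_USER_IN_RANGE:
            viewModel.OutOfRange = false;
            viewModel.RangeMessage = string.Empty;
            break;
    }
}
```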
3D Scanning
The 3D scanning module for front-facing cameras provides for scanning faces and small objects. In this application, we use the face-scanning example to demonstrate some of the UX challenges and show how to add visual and auditory feedback in code.
The UX problem:
One of the key challenges in getting a good scan is detecting the scan area, which is usually locked a few seconds after the scan begins. Here is a snapshot of the region the camera needs to detect for a good scan:
If the user is not positioned within the correct scan area, the scanning module fails. As an example of how things can go wrong: while scanning a face, the user is required to face the camera until the camera detects the face, then turn slowly to the left and back through the center to the right. Providing visual feedback, in the form of a bounding box around the face when the user is within the camera FOV, is therefore important while the user is still looking at the screen. Note that this feedback is needed before the scan can even start. Once the scan begins and the user turns to the left or right, they can no longer see the screen, so visual feedback alone is of little use. In the sample application, we build both visual and audio feedback to assist with this scenario.
The implementation:
The PXCM3DScan module incorporates the following alerts: ALERT_IN_RANGE, ALERT_TOO_CLOSE, ALERT_TOO_FAR, ALERT_TRACKING, and ALERT_TRACKING_LOST. Within the application, we capture these alerts asynchronously and provide both the visual and audio feedback as necessary. Here is a snapshot of the application capturing the alerts and providing feedback.
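A sketch of such a handler, using the alert names listed above, might look like the following. How the handler is registered on the PXCM3DScan instance should be checked against the SDK documentation, and the ScanStatus view-model property is an illustrative name; System.Media.SystemSounds provides a simple audio cue without shipping sound files.

```csharp
// Handle the scan alerts with both visual and audio cues, since the user cannot
// watch the screen while turning their head. Alert-data shape assumed to mirror
// the other modules (a 'label' field of the module's AlertEvent type).
private void OnScanAlert(PXCM3DScan.AlertData alertData)
{
    switch (alertData.label)
    {
        case PXCM3DScan.AlertEvent.ALERT_IN_RANGE:
            viewModel.ScanStatus = "In range - hold still";
            break;
        case PXCM3DScan.AlertEvent.ALERT_TOO_CLOSE:
            viewModel.ScanStatus = "Too close - move back";
            System.Media.SystemSounds.Exclamation.Play();   // audio cue while the face is turned away
            break;
        case PXCM3DScan.AlertEvent.ALERT_TOO_FAR:
            viewModel.ScanStatus = "Too far - move closer";
            System.Media.SystemSounds.Exclamation.Play();
            break;
        case PXCM3DScan.AlertEvent.ALERT_TRACKING:
            viewModel.ScanStatus = "Tracking - turn slowly";
            break;
        case PXCM3DScan.AlertEvent.ALERT_TRACKING_LOST:
            viewModel.ScanStatus = "Tracking lost - face the camera";
            System.Media.SystemSounds.Exclamation.Play();
            break;
    }
}
```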
Visual feedback before starting the scan and while the scan is in progress:
Note that in this example, we are not demonstrating how you can save the mesh and render it. You can learn more about the specifics of implementing the 3D scan module in your apps through the SDK API documentation.
Summary
Using Intel RealSense technology in applications poses many UX challenges, both in understanding non-tactile interaction and in how end users perceive and use the technology. Through a real-time demonstration of some of these UX challenges, together with code snippets showing potential ways to address them, we hope this application helps developers and UI designers gain a better understanding of Intel RealSense technology.
Additional Resources
UX Best Practices for Intel® RealSense™ Camera (User Facing) - Technical Tips