Abstract
This article demonstrates object detection [1] on drone videos using the Intel® optimized neon™ framework [2] on Intel® processors. The task is to identify pedestrians, trees, and vehicles such as cars, trucks, buses, and boats in real-world video footage captured by commercially available drones. We ran multiple experiments to find the batch size, iteration count, and learning rate at which the model converges best. The resulting deep learning model detects the trained object classes in real-world scenes with high confidence, and the number of detected objects closely matches the number of objects actually present.
Introduction
Modern drones have become powerful platforms and are now equipped with capable cameras. They have proven successful in areas such as aerial photography and surveillance. Integrating computer vision with drones is now a pressing need.
Object detection and segmentation are classic problems in computer vision. Drones add further challenges: footage is shot from top-down viewing angles, and the compute-intensive operations require integration with a deep learning system.
In this project, we implemented the detection component using the Single Shot MultiBox Detector (SSD) topology [1]. We ran our solution on Intel® Xeon Phi™ processors and evaluated frame rate and accuracy on several videos captured by a drone.
Experiment Setup
The following hardware and software environments were used to perform the experiments.
Hardware
Architecture | x86_64 |
CPU op-mode(s) | 32-bit, 64-bit |
Byte order | Little Endian |
CPU(s) | 256 |
On-line CPU(s) list | 0-255 |
Thread(s) per core | 4 |
Core(s) per socket | 64 |
Socket(s) | 1 |
NUMA node(s) | 2 |
Vendor ID | GenuineIntel |
CPU family | 6 |
Model | 87 |
Model name | Intel® Xeon Phi™ Processor 7210 @ 1.30 GHz |
Stepping | 1 |
CPU MHz | 1302.386 |
BogoMIPS | 2600.06 |
L1d cache | 32K |
L1i cache | 32K |
L2 cache | 1024K |
NUMA node0 CPU(s) | 0-255 |
Software
neon™ Framework Setup
neon™ version | 2.2 |
Intel® Nervana™ platform aeon (data loader) | 1.0 |
Python* | 2.7 |
GCC version | 6.3.1 |
Model
The experiments detailed in the subsequent sections employ the transfer learning technique to speed up the entire process. For this purpose, we used a pre-trained Visual Geometry Group (VGG) 16 model with SSD topology.
Solution Design
The system consists of two main components: a training component and a detection algorithm running SSD. Although SSD is compute-intensive, it has been optimized for Intel® architecture. We adopted the neon™ framework as our deep learning framework and ran it on an Intel® Xeon Phi™ processor.
In this work, the entire solution is divided into three stages:
- Dataset preparation
- Model training
- Inferencing
Dataset Preparation
The dataset used to train the model was collected using unmanned aerial vehicles (UAVs). The collected images vary in resolution, aspect, and orientation with respect to the object of interest.
The high-level objective of preprocessing is to convert the raw, high-resolution drone images into an annotated file format, which is then used for training the deep learning model.
The various processes involved in the preprocessing pipeline are as follows:
- Dataset creation
- Video to image frame conversion
- Image annotation
- Conversion to framework-native data format
The individual steps are detailed as follows:
Step 1. Dataset creation
The dataset chosen for these experiments consisted of 30 real-world drone videos covering the following seven classes: boat, bus, car, person, train, tree, and truck.
Step 2. Video to image frame conversion
To train the model, all the video files were converted to image frames. The conversion code was built using OpenCV 3 [3].
The final dataset prepared for training consisted of 1,312 color images.
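The video-to-frame step could look roughly like the sketch below. It is not the authors' conversion code: the function names, the output file naming, and the idea of sampling at a target frame rate are assumptions made for illustration. Only `cv2.VideoCapture`, `cv2.CAP_PROP_FPS`, and `cv2.imwrite` are taken from OpenCV itself.

```python
import os


def sampling_step(native_fps, target_fps):
    """Return N such that keeping every N-th frame approximates target_fps."""
    if native_fps <= 0 or target_fps <= 0:
        return 1  # unknown frame rate: keep every frame
    return max(int(native_fps // target_fps), 1)


def extract_frames(video_path, out_dir, target_fps=1.0):
    """Save sampled frames of video_path into out_dir; return how many were saved."""
    import cv2  # imported here so sampling_step() stays usable without OpenCV

    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    capture = cv2.VideoCapture(video_path)
    if not capture.isOpened():
        raise IOError("Cannot open video: %s" % video_path)

    step = sampling_step(capture.get(cv2.CAP_PROP_FPS), target_fps)
    saved, index = 0, 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of stream or read error
            break
        if index % step == 0:
            # Hypothetical naming scheme: frame_000000.jpg, frame_000030.jpg, ...
            cv2.imwrite(os.path.join(out_dir, "frame_%06d.jpg" % index), frame)
            saved += 1
        index += 1
    capture.release()
    return saved
```

For example, `extract_frames("flight01.mp4", "frames/flight01", target_fps=2.0)` would keep roughly two frames per second of footage; the file and directory names here are placeholders.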
Step 3. Image annotation
The image annotation task involved manually labeling the objects in the training image set. For this we used LabelImg* [4], a Python* annotation tool. The tool outputs the object coordinates in XML format for further processing.
The following figure shows the training split for each class:
Figure 1. Training data set distribution.
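The XML annotations from Step 3 can be read back with the Python standard library. The sketch below assumes the Pascal VOC layout that LabelImg writes by default (`object`, `name`, and `bndbox` elements); the sample annotation and the function names are made up for illustration. It also counts labels per class, which is one way a distribution like the one in Figure 1 could be tallied.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation in the Pascal VOC layout (values are illustrative).
SAMPLE_XML = """
<annotation>
  <filename>frame_000030.jpg</filename>
  <object>
    <name>car</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>35</xmax><ymax>98</ymax></bndbox>
  </object>
</annotation>
"""


def parse_annotation(xml_text):
    """Return a list of {label, xmin, ymin, xmax, ymax} dicts for one image."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        entry = {"label": obj.findtext("name")}
        for key in ("xmin", "ymin", "xmax", "ymax"):
            entry[key] = int(float(box.findtext(key)))
        objects.append(entry)
    return objects


def class_distribution(annotations):
    """Count labeled objects per class over many parsed annotations."""
    counts = Counter()
    for objects in annotations:
        counts.update(item["label"] for item in objects)
    return counts
```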
Step 4. Conversion to framework-native data format
To enable fast and flexible access to data during training of the network, we used framework-specific file formats.
Data conversion for the neon™ framework:
We used the aeon data loader [5], a new and evolving Nervana project. The aeon data loader is designed for large datasets of different modalities, including image, video, and audio, that may be too large to load directly into memory. We used a macro batching approach, in which the data is loaded in chunks (macro batches) that are then split further into mini batches to feed the model. The basic workflow is depicted in Figure 2.
Figure 2. The aeon data loader pipeline.
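The macro batching idea can be illustrated in a few lines of plain Python. This is a conceptual sketch only, not how aeon is implemented: the function names and the in-memory list standing in for a dataset are assumptions.

```python
def macro_batches(items, macro_size):
    """Yield successive chunks (macro batches) of at most macro_size items."""
    for start in range(0, len(items), macro_size):
        yield items[start:start + macro_size]


def mini_batches(items, macro_size, mini_size):
    """Load one macro batch at a time and split it into mini batches."""
    for chunk in macro_batches(items, macro_size):
        for start in range(0, len(chunk), mini_size):
            yield chunk[start:start + mini_size]
```

Only one macro batch needs to be resident at a time, which is what lets a loader handle datasets larger than memory. Note that the last mini batch of each macro batch may be smaller than `mini_size`.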
First, users perform ingestion, which means generating a manifest file in comma-separated values (CSV) format. This file tells the data loader where the input and target data reside. Given a configuration file (JSON), the aeon data loader handles the remaining steps (green box in Figure 2). The first time a dataset is encountered, the data loader caches the data in CPIO format, allowing for quick subsequent reads. During provisioning, the data loader reads the data from disk, performs any needed transformations on the fly, transfers the data to device memory, and provisions the data to the model as an input-target pair. A multithreaded library hides the latency of these disk reads and operations behind the device compute.
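A minimal sketch of the ingestion step, assuming a two-column manifest that pairs each image path with its annotation path. The exact manifest layout and target file type that aeon expects depend on the aeon version and the configured data types, so the column order, the `.xml` targets, and the function names below are assumptions rather than the loader's documented format.

```python
import csv
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def build_manifest_rows(image_dir, annotation_dir, filenames):
    """Pair every image file name with the annotation file of the same stem."""
    rows = []
    for name in sorted(filenames):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in IMAGE_EXTENSIONS:
            continue  # skip anything that is not an image
        rows.append((os.path.join(image_dir, name),
                     os.path.join(annotation_dir, stem + ".xml")))
    return rows


def write_manifest(rows, manifest_path):
    """Write one 'input,target' line per training example."""
    with open(manifest_path, "w") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerows(rows)
```

In practice the file names would come from something like `os.listdir(image_dir)`.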
Network Topology and Model Training
SSD
The SSD approach discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages, and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component.
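To make the notion of default boxes concrete, the sketch below tiles one feature map with boxes of several aspect ratios. It assumes the parameterization commonly used for SSD (box centers at cell centers, width `scale * sqrt(ratio)`, height `scale / sqrt(ratio)`, all in normalized image coordinates); the function name and the example values are illustrative and are not taken from the authors' configuration.

```python
import math


def default_boxes(rows, cols, scale, aspect_ratios):
    """Return (cx, cy, w, h) default boxes for a rows x cols feature map."""
    boxes = []
    for i in range(rows):
        for j in range(cols):
            # Center of cell (i, j) in normalized [0, 1] image coordinates.
            cx = (j + 0.5) / cols
            cy = (i + 0.5) / rows
            for ratio in aspect_ratios:
                width = scale * math.sqrt(ratio)
                height = scale / math.sqrt(ratio)
                boxes.append((cx, cy, width, height))
    return boxes
```

For instance, `default_boxes(2, 2, 0.5, [1.0, 2.0])` produces eight boxes: two shapes at each of the four cells. Every box of a given scale covers the same area regardless of aspect ratio.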
Model
In our experiment, we used the VGG 16 as the base network. An auxiliary structure is then appended to the network to produce detections with the following key features:
- Multiscale feature maps for detection: We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales.
- Convolutional predictors: Each added feature layer can produce a fixed set of detection predictions using a set of convolutional filters. For a feature layer of size m×n with p channels, the basic element for predicting parameters of a potential detection is a small 3×3×p kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m×n locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default box position relative to each feature map location.
- Default boxes and aspect ratios: We associate a set of default bounding boxes with each feature map cell for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the four offsets relative to the original default box shape. This results in a total of (c+4)k filters that are applied around each location in the feature map, yielding (c+4)kmn outputs for an m×n feature map.
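The filter and output counts in the last bullet can be checked with a short calculation. The values used in the example (a 38×38 feature map, k = 4 default boxes, and c = 8 class scores, that is, the seven dataset classes plus an assumed background class) are illustrative assumptions, not the authors' reported configuration.

```python
def detection_outputs(m, n, k, c):
    """Return (filters per location, total outputs) for an m x n feature map.

    Each of the k default boxes needs c class scores plus 4 box offsets.
    """
    filters = (c + 4) * k
    return filters, filters * m * n


# (8 + 4) * 4 = 48 filters, and 48 * 38 * 38 = 69312 outputs.
filters, outputs = detection_outputs(m=38, n=38, k=4, c=8)
```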
To achieve faster convergence, we trained the network starting from an ImageNet pre-trained model. The model is available for download from the neon™ Model Zoo [6].
Transfer Learning
In our experiments, we applied transfer learning to a VGG 16 model pre-trained on ImageNet. The transfer learning approach initializes the last fully connected layer with random weights (or zeroes), and these weights are readjusted when the system is trained on the new data. The underlying idea is that the initial layers in the topology have already learned basic features such as edges and curves, and this learning can be reused for the new problem with the new data. The final, fully connected layers, however, are tuned to the specific labels they were originally trained on, so they need to be retrained on the new data.
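The weight initialization described above can be sketched without any framework. The example below is a conceptual illustration, not the neon™ API: weights are modeled as a plain dictionary of lists, and the layer names, the function name, and the small Gaussian initialization are all assumptions.

```python
import random


def init_for_transfer(pretrained, final_layer, final_shape, seed=0):
    """Reuse all pre-trained weights except final_layer, which is re-initialized.

    pretrained  -- dict mapping layer name to its weights
    final_layer -- name of the layer to retrain from scratch
    final_shape -- (rows, cols) of the new final layer, e.g. (features, classes)
    """
    rng = random.Random(seed)
    weights = {}
    for name, values in pretrained.items():
        if name != final_layer:
            weights[name] = list(values)  # copy learned features as-is
    rows, cols = final_shape
    # Small random values; training on the new data readjusts them.
    weights[final_layer] = [[rng.gauss(0.0, 0.01) for _ in range(cols)]
                            for _ in range(rows)]
    return weights
```

Here the early layers keep their learned edge and curve detectors, while the final layer is resized for the new label set (seven classes in this article) and starts from random values.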
Inferencing
The video captured by the drone is broken down into frames using OpenCV at a configurable frame rate. As the frames are generated, they are passed to the detection model, which localizes each object with four coordinates (xmin, xmax, ymin, and ymax) and assigns a classification score for each possible object class. Applying confidence thresholds and a non-maximum suppression (NMS) threshold reduces the number of predictions, keeping only the best prediction for each object. OpenCV is then used to draw colored rectangular boxes around the detected objects (see Figure 3).
Figure 3. Detection flow diagram.
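The filtering step in the detection flow can be sketched as a greedy non-maximum suppression over scored boxes. This is a minimal illustration rather than the authors' code: the box order `(xmin, ymin, xmax, ymax)`, the default threshold values, and the function names are assumptions, and a full detector would typically run this once per class.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union) if union > 0 else 0.0


def non_max_suppression(detections, conf_threshold=0.5, nms_threshold=0.45):
    """Filter (score, box) pairs by confidence, then suppress overlaps.

    A box is kept only if its IoU with every higher-scoring kept box
    does not exceed nms_threshold.
    """
    candidates = sorted((d for d in detections if d[0] >= conf_threshold),
                        key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in candidates:
        if all(iou(box, kept_box) <= nms_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

The surviving boxes could then be drawn on the frame with OpenCV's `cv2.rectangle`.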
Results
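The experiments varied the batch size and iteration count. The table below shows how the thread affinity (KMP_AFFINITY) and thread count (OMP_NUM_THREADS) settings affected training time at a batch size of 64. Training times are relative to the fastest configuration (1.00X).
Batch Size | KMP_AFFINITY | OMP_NUM_THREADS | Training Time (relative) |
---|---|---|---|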
64 | granularity=thread,verbose,balanced | 16 | 1.10X |
64 | none,verbose,compact | 16 | 4.29X |
64 | granularity=thread,verbose,balanced | 24 | 1.30X |
64 | granularity=fine,verbose,balanced | 24 | 1.00X |
64 | none,verbose,compact | 32 | 2.50X |
Note: The numbers in the table are indicative. Results may vary depending on hyperparameter tuning.
The following detection was obtained by running inference on the sample image below.
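The two OpenMP settings in the table are environment variables read by the Intel OpenMP runtime. One way to apply the fastest configuration from the table is to set them from Python before any library that uses OpenMP is loaded. The helper name below is hypothetical, and setting the variables in the shell before launching the training script would work equally well.

```python
import os


def configure_openmp(num_threads, affinity):
    """Set the OpenMP thread count and Intel thread affinity for this process."""
    os.environ["OMP_NUM_THREADS"] = str(num_threads)
    os.environ["KMP_AFFINITY"] = affinity


# Fastest row of the table: 24 threads, fine granularity, balanced placement.
configure_openmp(24, "granularity=fine,verbose,balanced")

# Import the deep learning framework only after this point, so that the
# OpenMP runtime sees these values when it initializes.
```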
Figure 4. Red bounding boxes display the objects detected.
Conclusion and Future Work
This paper addressed the detection of vehicles and pedestrians in footage from a drone or other aerial vehicle. Because the training data was hand-crafted from videos, it was skewed towards cars relative to the other objects of interest. The use case could be extended to video surveillance and tracking.
About the Authors
Krishnaprasad T and Ratheesh Achari are part of the Intel team working on artificial intelligence (AI) evangelization.
References
The references and links used to create this paper are as follows:
1. Object Detection, 2014: https://arxiv.org/pdf/1412.1441.pdf
2. neon™ framework: https://neon.nervanasys.com/index.html/
3. OpenCV: https://github.com/opencv/opencv
4. LabelImg: https://github.com/tzutalin/labelImg
5. Nervana aeon: http://aeon.nervanasys.com/index.html/
6. neon™ Model Zoo: https://github.com/NervanaSystems/ModelZoo/tree/master/ImageClassification/ILSVRC2012/VGG