Object Detection on Drone Videos using Neon™ Framework

Abstract

The purpose of this article is to showcase the implementation of object detection [1] on drone videos using the Intel®-optimized neon™ framework [2] on Intel® processors. The functional problem tackled in this work is the identification of pedestrians, trees, and vehicles such as cars, trucks, buses, and boats from real-world video footage captured by commercially available drones. In this work, we conducted multiple experiments to derive the optimal batch size, iteration count, and learning rate for the model to converge. The deep learning model developed is able to detect the trained objects in real-world scenarios with high confidence, and the number of objects it detects closely matches the number of objects expected.

Introduction

Modern drones have become very powerful, especially since they have been equipped with potent cameras. They have been successful in areas such as aerial photography and surveillance. The integration of smart computer vision with drones has become the need of the moment.

Object detection and segmentation are classic problems in computer vision. Drones add further challenges because of their top-down viewing angles and the difficulty of integrating them with a deep learning system for compute-intensive operations.

In this project, we implemented the detection component using the Single Shot MultiBox Detector (SSD) topology [1]. We implemented our solution on Intel® Xeon Phi™ processors and evaluated frame rate and accuracy on several videos captured by the drone.

Experiment Setup

The following hardware and software environments were used to perform the experiments.

Hardware

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel® Xeon Phi™ Processor 7210 @ 1.30 GHz
Stepping: 1
CPU MHz: 1302.386
BogoMIPS: 2600.06
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-255

Software

neon™ Framework Setup

neon™ version: 2.2
Intel® Nervana™ platform aeon (data loader): 1.0
Python*: 2.7
GCC version: 6.3.1

Model

The experiments detailed in the subsequent sections employ the transfer learning technique to speed up the entire process. For this purpose, we used a pre-trained Visual Geometry Group (VGG) 16 model with SSD topology.

Solution Design

The main components of our system are a training component and a detection algorithm running SSD. Although SSD is compute-intensive, it has been optimized for Intel® architecture. We adopted the neon™ framework as our deep learning framework, and the hardware is an Intel® Xeon Phi™ processor.

In this work, the entire solution is divided into three stages:

  1. Dataset preparation
  2. Model training
  3. Inferencing

Dataset Preparation

The dataset used for training the model is collected through unmanned aerial vehicles (UAVs). The images collected vary in resolution, aspect, and orientation with respect to the object of interest.

The high-level objective of preprocessing is to convert the raw, high-resolution drone images into an annotated file format, which is then used for training the deep learning model.

The various processes involved in the preprocessing pipeline are as follows:

  1. Dataset creation
  2. Video to image frame conversion
  3. Image annotation
  4. Conversion to framework-native data format

The individual steps are detailed as follows:

Step 1. Dataset creation

The dataset chosen for these experiments consisted of 30 real-time drone videos in the following 7 classes: boat, bus, car, person, train, tree, and truck.

Step 2. Video to image frame conversion

To train the model, all the video files were converted to image frames. The entire conversion code was built using OpenCV 3 [3].

The final dataset prepared for training consisted of 1,312 color images.
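As an illustration of this conversion step, a minimal sketch using OpenCV's Python bindings is shown below; the file names, output directory, and sampling interval (every_nth) are illustrative assumptions rather than the exact code used in this work.

# Extract frames from drone videos with OpenCV (cv2).
# A minimal sketch; paths and the sampling interval are illustrative.
import os
import cv2

def video_to_frames(video_path, out_dir, every_nth=10):
    """Save every n-th frame of a video as a JPEG image and return the count."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_nth == 0:
            name = os.path.join(out_dir, "frame_%06d.jpg" % idx)
            cv2.imwrite(name, frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

print(video_to_frames("drone_video.mp4", "frames/drone_video"))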

Step 3. Image annotation

The image annotation task involved manually labeling the objects within the training image set. In our experiment, we relied on the Python* tool LabelImg* [4] for annotation. The tool outputs the object coordinates in XML format for further processing.
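LabelImg writes PASCAL VOC-style XML, so reading an annotation back for later conversion can be sketched as follows; the file name and the assumption that every object carries a <name> and a <bndbox> element follow the standard LabelImg output rather than anything specific to this work.

# Parse a LabelImg (PASCAL VOC-style) XML annotation file.
# A sketch assuming the standard <object>/<name>/<bndbox> layout.
import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    """Return a list of (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        label = obj.find("name").text
        bbox = obj.find("bndbox")
        coords = [int(float(bbox.find(t).text))
                  for t in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((label,) + tuple(coords))
    return boxes

# Example: read_annotation("frames/frame_000010.xml")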

The following figure shows the training split for each class:


Figure 1. Training data set distribution.

Step 4. Conversion to framework-native data format

To enable fast and flexible access to data during training of the network, we used framework-specific file formats.

Data conversion for the neon™ framework:

We used the aeon data loader [5], which is Nervana's new and evolving project. The aeon data loader is designed to deal with large datasets from different modalities, including image, video, and audio, that may be too large to load directly into memory. We used a macro-batching approach, where the data is loaded in chunks (macro batches) that are then split further into mini batches to feed the model. The basic workflow is depicted in Figure 2.


Figure 2. The aeon data loader pipeline.

First, users perform ingestion, which means generating a manifest file in comma-separated values (CSV) format. This file indicates to the data loader where the input and target data reside. Given a configuration file (JSON), the aeon data loader processes the next steps (green box). During operation, the first time a dataset is encountered, the data loader caches the data in CPIO format, allowing for quick subsequent reads. During provisioning, the data loader reads the data from disk, performs any needed transformations on the fly, transfers the data to device memory, and provisions the data to the model as an input-target pair. A multithreaded library is used to hide the latency of these disk reads and transformations behind the device compute.
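As an illustration of the ingestion step, the sketch below writes such a manifest by pairing each image with its annotation file; the directory names and the simple two-column layout are assumptions, and the exact manifest format expected by a given aeon release should be checked against its documentation.

# Build a CSV manifest pairing each training image with its annotation file.
# A sketch of the ingestion step; the exact manifest layout expected by the
# aeon data loader depends on the aeon release, so verify against its docs.
import csv
import glob
import os

def write_manifest(image_dir, annot_dir, manifest_path):
    with open(manifest_path, "wb") as f:    # "wb" suits the csv module on Python 2.7
        writer = csv.writer(f)
        for img in sorted(glob.glob(os.path.join(image_dir, "*.jpg"))):
            stem = os.path.splitext(os.path.basename(img))[0]
            annot = os.path.join(annot_dir, stem + ".xml")
            if os.path.exists(annot):
                writer.writerow([img, annot])

write_manifest("frames", "annotations", "train_manifest.csv")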

Network Topology and Model Training

SSD

The SSD approach discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages, and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component.
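To make the notion of default boxes over different aspect ratios and scales concrete, the sketch below tiles a single feature map with centered default boxes; the scale and aspect-ratio values, and the (cx, cy, w, h) box encoding, are illustrative choices rather than the exact SSD configuration used here.

# Tile an m x n feature map with default (anchor) boxes of several
# aspect ratios at a single scale. Values are illustrative only.
import itertools
import numpy as np

def default_boxes(feat_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (m*n*len(aspect_ratios), 4) boxes as normalized (cx, cy, w, h)."""
    m, n = feat_size
    boxes = []
    for i, j in itertools.product(range(m), range(n)):
        cx, cy = (j + 0.5) / float(n), (i + 0.5) / float(m)   # cell center
        for ar in aspect_ratios:
            w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
            boxes.append([cx, cy, w, h])
    return np.array(boxes)

print(default_boxes((3, 3), scale=0.2).shape)   # (27, 4): 3*3 cells x 3 ratios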

Model

In our experiment, we used VGG 16 as the base network. An auxiliary structure is then appended to the network to produce detections with the following key features:

  • Multiscale feature maps for detection: We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales.
  • Convolutional predictors: Each added feature layer can produce a fixed set of detection predictions using a set of convolutional filters. For a feature layer of size m×n with p channels, the basic element for predicting parameters of a potential detection is a small 3×3×p kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m×n locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default box position relative to each feature map location.
  • Default boxes and aspect ratios: We associate a set of default bounding boxes with each feature map cell for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the four offsets relative to the original default box shape. This results in a total of (c + 4)k filters that are applied around each location in the feature map, yielding (c + 4)kmn outputs for an m×n feature map (a small worked example follows this list).
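As a worked example of this output count, the snippet below plugs in illustrative numbers: c = 8 (assuming the 7 object classes plus a background class), k = 6 default boxes, and a 38×38 feature map; these are not the exact configuration used in this work.

# Worked example of the (c + 4) * k * m * n output count for one feature map.
# c = 8 assumes the 7 object classes plus a background class; k, m, n are
# illustrative values, not the exact configuration used in this work.
c, k, m, n = 8, 6, 38, 38
filters_per_location = (c + 4) * k          # (8 + 4) * 6  = 72 filters
outputs = filters_per_location * m * n      # 72 * 38 * 38 = 103,968 outputs
print(filters_per_location, outputs)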

To achieve faster convergence, we trained the network using an ImageNet pre-trained model. The model is available for download from the neon™ Model Zoo [6].

Transfer Learning

In our experiments, we applied transfer learning to a pre-trained VGG 16 model (trained on ImageNet). The transfer learning approach initializes the last fully connected layer with random weights (or zeroes), and when the system is trained on the new data, these weights are readjusted. The basic idea of transfer learning is that the initial layers of the topology have already learned base features such as edges and curves, and this learning can be reused for the new problem with the new data. The final fully connected layers, however, are tuned to the specific labels they were originally trained on and therefore need to be retrained on the new data.
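The idea can be illustrated in a framework-agnostic way: keep the pretrained weights of the earlier layers fixed, re-initialize the final layer randomly, and update only that layer on the new data. The NumPy sketch below is purely conceptual (random stand-in weights and a single gradient step) and is not the neon™ training code used in this work.

# Conceptual illustration of transfer learning with NumPy: reuse "pretrained"
# early-layer weights, randomly re-initialize the last layer, and update only
# the last layer on new data. Purely illustrative; not the neon training code.
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for pretrained feature-extractor weights (kept frozen).
W_feat = rng.randn(4096, 512) * 0.01

# New final layer for 7 drone classes, randomly initialized and trainable.
n_classes = 7
W_head = rng.randn(512, n_classes) * 0.01

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# One illustrative gradient step on a small random batch.
x = rng.randn(32, 4096)                      # batch of input features
y = rng.randint(0, n_classes, size=32)       # integer class labels
feats = np.maximum(0, x.dot(W_feat))         # frozen layer + ReLU
probs = softmax(feats.dot(W_head))
grad_logits = probs.copy()
grad_logits[np.arange(32), y] -= 1.0         # d(cross-entropy)/d(logits)
grad_W_head = feats.T.dot(grad_logits) / 32.0
W_head -= 0.01 * grad_W_head                 # only the new head is updated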

Inferencing

The video captured by the drone is broken down into frames using OpenCV at a configurable frame rate. As the frames are generated, they are passed to the detection model, which localizes the different objects in the form of four coordinates (xmin, xmax, ymin, and ymax) and provides a classification score for the different possible objects. By applying a non-maximum suppression (NMS) threshold and a confidence threshold, the predictions are reduced to the most suitable ones. OpenCV is used to draw rectangular boxes in various colors around the detected objects (see Figure 3).
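The confidence filtering, NMS, and box-drawing steps can be sketched as follows; the thresholds, the (xmin, ymin, xmax, ymax) box layout, and the helper names are illustrative assumptions rather than the exact values used in this work.

# Filter detections by confidence, apply a simple greedy NMS, and draw the
# surviving boxes on a frame with OpenCV. Thresholds and layout are illustrative.
import cv2
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression; boxes are (xmin, ymin, xmax, ymax)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                  (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

def draw_detections(frame, boxes, scores, labels, conf_thresh=0.6):
    boxes, scores = np.asarray(boxes, dtype=float), np.asarray(scores)
    mask = scores >= conf_thresh                 # drop low-confidence predictions
    boxes, scores = boxes[mask], scores[mask]
    labels = [l for l, m in zip(labels, mask) if m]
    for i in nms(boxes, scores):
        x1, y1, x2, y2 = [int(v) for v in boxes[i]]
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 2)
        cv2.putText(frame, "%s %.2f" % (labels[i], scores[i]), (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
    return frame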


Figure 3. Detection flow diagram.

Results

The different iterations of the experiments involved varying batch sizes and iteration counts. The table below summarizes relative training times for a batch size of 64 under different KMP_AFFINITY and OMP_NUM_THREADS settings.

Batch Size | KMP_AFFINITY | OMP_NUM_THREADS | Training Time (relative)
64 | granularity=thread,verbose,balanced | 16 | 1.10X
64 | none,verbose,compact | 16 | 4.29X
64 | granularity=thread,verbose,balanced | 24 | 1.30X
64 | granularity=fine,verbose,balanced | 24 | 1.00X
64 | none,verbose,compact | 32 | 2.50X

Note: The numbers in the table are indicative. Results may vary depending on hyperparameter tuning.
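For reference, the affinity settings in the table are ordinary environment variables. One way to set them from the training script itself, before the framework and its OpenMP runtime are loaded, is sketched below; the chosen values simply mirror the best-performing row of the table and are not a general recommendation.

# Set the OpenMP/affinity environment variables from the table before the
# deep learning framework (and its OpenMP runtime) is imported, so they are
# picked up when the runtime initializes. Values mirror one row of the table.
import os

os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,balanced"
os.environ["OMP_NUM_THREADS"] = "24"

# Import the framework and start training only after the variables are set.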

The following detection was obtained when the inference use case was run on the sample image below.


Figure 4. Red bounding boxes display the objects detected.

Conclusion and Future Work

The functional use case attempted in this paper involved the detection of vehicles and pedestrians from a drone or aerial vehicle. The training data was skewed towards cars as opposed to other objects of interest, since it was hand-crafted from videos. The use case could be further expanded to video surveillance and tracking.

About the Authors

Krishnaprasad T and Ratheesh Achari are part of the Intel team working on artificial intelligence (AI) evangelization.

References

The references and links used to create this paper are as follows:

  1. Object Detection, 2014: https://arxiv.org/pdf/1412.1441.pdf
  2. Neon™ framework: https://neon.nervanasys.com/index.html/
  3. OpenCV: https://github.com/opencv/opencv
  4. LabelImg: https://github.com/tzutalin/labelImg
  5. Nervana aeon: http://aeon.nervanasys.com/index.html/
  6. Neon™ Model Zoo: https://github.com/NervanaSystems/ModelZoo/tree/master/ImageClassification/ILSVRC2012/VGG
