
FOMO: An Efficient Way to Do Object Detection on Edge Devices



In today's world, we are witnessing a proliferation of AI solutions. However, many of these solutions never reach consumers because of the high hardware resource requirements needed to run the models. To scale our AI journey, we need solutions that are efficient, fast, and accurate enough to run on edge devices. This is where FOMO comes into the picture.


Object detection is a crucial aspect of computer vision that has been explored for many years. Deep learning and neural networks have revolutionized the field, enabling more precise and accurate results in object detection. Deep learning-based architectures such as R-CNN and its variants are now prevalent in object detection, although feature-based methods like Haar cascades, SIFT, SURF, and HOG still play a significant role in certain applications. The strengths and weaknesses of each method should be considered when selecting the best approach.


Object detection techniques have greatly benefited from convolutional neural networks, but CNNs traditionally require specialized hardware and computational resources. TinyML has enabled deep learning on microcontrollers, making real-time multi-object detection possible on constrained devices. This breakthrough opens up new possibilities for object detection applications, as deep learning models can now run directly on the devices that capture the images.


TinyML has made great strides in image classification, which predicts the presence of an object in an image. Object detection is harder: it must identify multiple objects and their bounding boxes, making it more complex and memory-intensive. Traditional object detection models processed an image in multiple passes, whereas newer models like YOLO use single-shot detection for near real-time results. Even so, these models still require large amounts of memory and training data, making it challenging to run them on small devices and to detect small objects.


FOMO (Faster Objects, More Objects) is a concept that challenges the idea that all object-detection applications require high-precision output from deep learning models. It suggests that by balancing accuracy, speed, and memory, deep-learning models can be reduced to small sizes while remaining useful. One way this can be achieved is by predicting the object's centre rather than detecting bounding boxes. Many object detection applications only require the location of objects in the frame, not their sizes, and detecting centroids is more compute-efficient than bounding box prediction while requiring less data.
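To make the compute argument concrete, here is a minimal sketch of how a ground-truth bounding box reduces to a single grid cell once only the centroid matters. The image size, grid size, and function name are illustrative, not taken from FOMO itself:

```python
# A bounding box needs four regressed values (x, y, width, height); a centroid
# only needs the index of the output grid cell it falls in, plus a class label.

def centroid_cell(box, img_size=160, grid_size=20):
    """box = (x_min, y_min, x_max, y_max) in pixels; returns the (row, col) cell."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2           # centroid x in pixels
    cy = (y_min + y_max) / 2           # centroid y in pixels
    cell_px = img_size / grid_size     # pixels per grid cell (160 / 20 = 8)
    return int(cy // cell_px), int(cx // cell_px)

print(centroid_cell((40, 56, 72, 104)))  # -> (10, 7): one cell instead of a box
```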



FOMO changes the structure of deep learning models

FOMO changes the structure of deep learning models for object detection. Single-shot detectors use convolutional layers to extract features and fully-connected layers to predict bounding boxes. The convolutional layers detect increasingly complex features, such as lines, corners, and object parts, while pooling layers reduce the output size and highlight important features. With more layers, the feature maps can detect intricate patterns such as faces.


Although an image classifier's output is binary (i.e., "face" or "no face"), the underlying architecture is composed of convolutional layers that create a diffused lower-resolution image of the previous layer. In a standard image classification network, this locality, or "receptive field," decreases as you move deeper into the network until there is only one output. FOMO uses the same architecture but replaces the final layer with a per-region class probability map and a custom loss function that preserves locality in the final layer, resulting in a heatmap of object locations.
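A minimal Keras sketch of this idea follows. The cut point, layer names, and head layout are assumptions for illustration, not the actual Edge Impulse implementation:

```python
import tensorflow as tf

NUM_CLASSES = 3  # e.g. "lamp", "plant", "background"

# Standard MobileNetV2 backbone (weights=None, so any alpha is allowed).
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), alpha=0.35,
    include_top=False, weights=None)

# Cut the network where the feature map is still 1/8 of the input (20x20 here)
# instead of pooling all the way down to a single classification output.
cut = backbone.get_layer("block_6_expand_relu").output

# Replace the fully-connected classifier with 1x1 convolutions that emit a
# per-cell class probability map of shape (20, 20, NUM_CLASSES).
x = tf.keras.layers.Conv2D(32, kernel_size=1, activation="relu")(cut)
heatmap = tf.keras.layers.Conv2D(
    NUM_CLASSES, kernel_size=1, activation="softmax")(x)

model = tf.keras.Model(backbone.input, heatmap)
model.summary()  # final output shape: (None, 20, 20, 3)
```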

Heatmap of object locations

The size of the heat map is determined by where the layers of the network are cut off. In the beer-bottle FOMO model, the heat map is one-eighth the size of the input image, giving a 20x20 heat map for a 160x160 input image. However, this ratio can be adjusted: when it is set to 1:1, the output provides pixel-level segmentation, allowing many small objects to be counted.
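The arithmetic behind that ratio is simple; a tiny sketch (the helper name is illustrative):

```python
def heatmap_side(input_side, downsample):
    """Side length of the output grid for a given input size and cut-off ratio."""
    return input_side // downsample

print(heatmap_side(160, 8))  # 20 -> the 20x20 heat map of the beer-bottle model
print(heatmap_side(160, 1))  # 160 -> 1:1, i.e. pixel-level segmentation
```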


FOMO stands out from other object detection algorithms because it does not directly generate bounding boxes. However, it is straightforward to convert the heat map into bounding boxes by drawing a box around each highlighted region.
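One possible post-processing step is sketched below with NumPy and SciPy. The threshold, cell size, and function name are assumptions for illustration, not FOMO's actual post-processing:

```python
import numpy as np
from scipy import ndimage

def heatmap_to_boxes(heatmap, threshold=0.5, cell_px=8):
    """Threshold a single-class heat map and box each connected highlighted region."""
    mask = heatmap > threshold                   # highlighted cells
    labels, _ = ndimage.label(mask)              # group them into regions
    boxes = []
    for rows, cols in ndimage.find_objects(labels):
        # Scale grid coordinates back to input-image pixels.
        boxes.append((cols.start * cell_px, rows.start * cell_px,
                      cols.stop * cell_px, rows.stop * cell_px))
    return boxes

hm = np.zeros((20, 20))
hm[4:6, 10:13] = 0.9                             # one detected blob
print(heatmap_to_boxes(hm))                      # [(80, 32, 104, 48)]
```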



One limitation of using a heat map is that each cell functions as an independent classifier. For instance, if the classes are "lamp," "plant," and "background," each cell will only be classified as either lamp, plant, or background. Consequently, detecting objects with overlapping centroids is not possible.
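A tiny example of this collision (the pixel coordinates are made up for illustration):

```python
# Two objects whose centroids land in the same 8x8-pixel cell collide: each
# cell predicts a single class, so one of the two objects is lost.
def cell_of(cx, cy, cell_px=8):
    return int(cy // cell_px), int(cx // cell_px)

lamp_cell  = cell_of(83, 41)    # lamp centroid  -> cell (5, 10)
plant_cell = cell_of(86, 44)    # plant centroid -> cell (5, 10)
print(lamp_cell == plant_cell)  # True: only one of them can be reported
```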


During the initial evaluation, it was discovered that while bounding boxes are a common output format for object detection models, they are not always necessary. In many cases, the object size is not a concern since cameras are fixed and objects have a consistent size, so what is needed is simply the object location and count. Consequently, the model has been adapted to train on object centroids, making it easier to count closely located objects. Since the neural network architecture is convolutional, it naturally searches for objects surrounding the centroid.



In addition, FOMO can be used with any MobileNetV2 model, allowing for the selection of a model with a higher or lower alpha depending on the deployment requirements. Transfer learning is also possible, although base models must be trained specifically with FOMO in mind. This makes FOMO suitable for a wide range of hardware, from microcontrollers to gateways and GPUs.
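For a feel of how alpha scales the footprint, the sketch below instantiates a few widths with Keras (weights=None; as noted above, FOMO base models are trained with FOMO in mind rather than reused from ImageNet checkpoints):

```python
import tensorflow as tf

# The width multiplier (alpha) scales every layer's channel count, trading
# accuracy for model size and speed.
for alpha in (0.05, 0.35, 1.0):
    m = tf.keras.applications.MobileNetV2(
        input_shape=(96, 96, 3), alpha=alpha,
        include_top=False, weights=None)
    print(f"alpha={alpha}: {m.count_params():,} parameters")
```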


For example, the top video on the FOMO webpage demonstrates detection at 60 frames per second on a Raspberry Pi 4 with a MobileNetV2 model of 0.1 alpha and 160x160 grayscale input. This is 20 times faster than MobileNet SSD, which manages only around 3 frames per second. Meanwhile, the second video on the page shows detection at 30 frames per second on an Arduino Nicla Vision board with a MobileNetV2 model of 0.35 alpha and 96x96 grayscale input, using only 240 KB of RAM. The smallest version of FOMO, which uses a MobileNetV2 model of 0.05 alpha and 96x96 grayscale input, runs at around 10 fps and requires less than 100 KB of RAM on a Cortex-M4F processor at 80 MHz.


