Fishing and aquaculture play a hugely important role in our way of life. Fishing captures are approaching 100 million tonnes per annum. The FAO estimates illicit fishing at another 15 million tonnes per annum, threatening food security and biodiversity.

Unsustainable fishing is a real problem threatening fisheries across the globe. International agreements aspire to control it, but enforcing the 2009 Agreement on “Port State Measures to Prevent, Deter and Eliminate Illegal, Unreported and Unregulated Fishing” (PSMA) stretches resources. To free up and reallocate those resources, organizations are turning to technology.

The Nature Conservancy is betting that data scientists can help automate the supervision of high-value tuna catches. Together with Kaggle, they’re hosting a competition to design a machine learning algorithm that replaces the time-consuming task of reviewing video footage from cameras on board fishing vessels.

The algorithm should identify when a fish is caught and what species it is. Its input is the images from camera equipment mounted on a fishing vessel, and its output is the likelihood of particular fish species being present at a location in the images.

This problem can be broken down into two major parts: first, a localizer to find a region of the image that likely contains a fish, and second, a classifier to assign the fish in that region to one of the given species.

In recent years, classifier performance has improved dramatically with the adoption of Deep Neural Networks (DNNs), specifically Convolutional Neural Networks (CNNs), and of Graphics Processing Units (GPUs). These better-performing classifiers are encouraging the development of more efficient localizers. It’s this development that leads to my focus on the Single Shot MultiBox Detector (SSD).

Before the SSD localizer, engineers had to choose between a high-accuracy localizer and a real-time localizer. High-accuracy localizers such as Selective Search, R-CNN or SPP relied upon region proposal followed by a high-quality classifier. Later, Faster R-CNN replaced the region proposal with a DNN, but didn’t quite reach real-time inference speed. Among real-time localizers, DPM reached 26% mean Average Precision (mAP) on PASCAL VOC ’07 data, followed by MSC-MultiBox at 45% mAP and YOLO at 63% mAP. All of these fell short of their slower counterparts in accuracy. Following the integrated design of MultiBox and YOLO, SSD made some key improvements to achieve high accuracy at real-time speed.

Single Shot MultiBox Detector (2016)

SSD replaced its predecessors’ complex pipeline of region proposal, re-sampling and classification with a single CNN detection network. The CNN model begins with a typical VGG16 network, on top of which additional convolutional layers of decreasing size are appended. Each of these layers provides predictions of object class and bounding box at its particular scale.
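To make the idea of one scale per layer concrete, here is a minimal sketch of how box scales and shapes could be assigned to the feature maps. The linear spacing between 0.2 and 0.9 and the use of six feature maps follow the SSD paper rather than this article, and the function names `default_box_scales` and `default_box_shapes` are hypothetical.

```python
import math

def default_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature map (fraction of image size).

    s_min and s_max are assumed values taken from the SSD paper.
    """
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]

def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) of each box at one scale; the area stays scale ** 2."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]

# Early, high-resolution maps get small boxes; late, coarse maps get large ones.
scales = default_box_scales(6)
for s in scales:
    print(round(s, 2), [(round(w, 2), round(h, 2)) for w, h in default_box_shapes(s)])
```

The aspect ratios shown are only a subset chosen for illustration; a real configuration would pick them to suit the objects in the data, such as long, thin fish.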

SSD Model Design

At each location of a feature map, the network must output a score for each object class and an offset for each bounding box. The boxes are restricted to a small predefined set varying in size and aspect ratio. During training, each possible bounding box is compared with the ground-truth bounding box using the Jaccard overlap (intersection over union). When the overlap exceeds a defined minimum threshold, the possible bounding box is considered a positive match.
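The matching rule can be sketched in a few lines. This is an illustration only: boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples, the 0.5 threshold is the value used in the SSD paper rather than something stated here, and `jaccard` and `match_boxes` are hypothetical names. The paper additionally matches every ground truth to its single best box first, which this sketch omits.

```python
def jaccard(a, b):
    """Jaccard overlap (intersection over union) of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_boxes(possible_boxes, truths, threshold=0.5):
    """Return (box index, truth index) pairs whose best overlap exceeds threshold."""
    matches = []
    for i, box in enumerate(possible_boxes):
        best_j, best_overlap = -1, 0.0
        for j, truth in enumerate(truths):
            overlap = jaccard(box, truth)
            if overlap > best_overlap:
                best_j, best_overlap = j, overlap
        if best_overlap > threshold:
            matches.append((i, best_j))
    return matches

possible = [(0.0, 0.0, 1.0, 1.0), (0.5, 0.5, 1.5, 1.5), (0.0, 0.0, 1.1, 1.0)]
truths = [(0.0, 0.0, 1.0, 1.0)]
print(match_boxes(possible, truths))
```

In this example the first and third boxes overlap the ground truth strongly and become positives, while the second, shifted box is left as a negative.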

Overlapping Ground-Truth/Possible Bounding Boxes

For training, the model only needs each image together with its ground-truth bounding boxes. However, some extra work is required to map these ground truths to the fixed set of detector outputs.
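Part of that extra work is expressing each matched ground-truth box relative to the box it was matched with, so the network regresses small offsets instead of absolute coordinates. The sketch below uses the center-size offset encoding described in the SSD paper; the `(cx, cy, w, h)` format and the names `encode_box` and `decode_box` are assumptions for illustration, and the variance scaling that many implementations add is left out.

```python
import math

def encode_box(truth, default):
    """Offsets of a ground-truth box relative to a default box, both (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = truth
    d_cx, d_cy, d_w, d_h = default
    return (
        (g_cx - d_cx) / d_w,   # center shift, in units of default width
        (g_cy - d_cy) / d_h,   # center shift, in units of default height
        math.log(g_w / d_w),   # log of the width ratio
        math.log(g_h / d_h),   # log of the height ratio
    )

def decode_box(offsets, default):
    """Inverse of encode_box: turn predicted offsets back into a box."""
    t_cx, t_cy, t_w, t_h = offsets
    d_cx, d_cy, d_w, d_h = default
    return (
        d_cx + t_cx * d_w,
        d_cy + t_cy * d_h,
        d_w * math.exp(t_w),
        d_h * math.exp(t_h),
    )

default = (0.5, 0.5, 0.2, 0.2)
truth = (0.55, 0.48, 0.3, 0.15)
offsets = encode_box(truth, default)
print(offsets, decode_box(offsets, default))
```

At inference time only `decode_box` is needed, applied to the offsets the network predicts for each box.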

The objective during training is to minimize the weighted sum of the Confidence Loss and the Localization Loss. The Confidence Loss is the softmax loss over the class confidences. The Localization Loss is the Smooth L1 loss between the predicted bounding box and the ground-truth bounding box.
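As a rough illustration of this objective, the sketch below combines a softmax cross-entropy term with a localization term and normalizes by the number of positive matches. It uses the Smooth L1 variant of the L1 loss, as in the SSD paper. The weight `alpha`, the convention that class 0 is background, and all function names are assumptions made for the example.

```python
import math

def softmax_cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def ssd_loss(conf_logits, conf_targets, loc_preds, loc_targets, alpha=1.0):
    """Weighted sum of confidence and localization loss over a set of boxes.

    conf_targets holds one class index per box; 0 is assumed to mean background.
    Localization loss is only counted for positive (non-background) boxes.
    """
    num_pos = sum(1 for t in conf_targets if t > 0)
    if num_pos == 0:
        return 0.0
    conf = sum(softmax_cross_entropy(l, t) for l, t in zip(conf_logits, conf_targets))
    loc = sum(
        smooth_l1(p - g)
        for preds, gts, t in zip(loc_preds, loc_targets, conf_targets)
        if t > 0
        for p, g in zip(preds, gts)
    )
    return (conf + alpha * loc) / num_pos

loss = ssd_loss(
    conf_logits=[[0.0, 0.0]],
    conf_targets=[1],
    loc_preds=[[0.5, 0.0, 0.0, 0.0]],
    loc_targets=[[0.0, 0.0, 0.0, 0.0]],
)
print(loss)
```

In practice this would be written with a tensor library so that gradients flow through it; plain Python is used here only to keep the arithmetic visible.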

Two other steps are important. Hard negative mining resolves the imbalance between the many negative matches and the few positive ones. Data augmentation through random patch sampling, horizontal flipping and distortion of the inputs makes the detector more robust to variations in object size and shape.
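Hard negative mining can be sketched as a simple selection step: keep every positive box, then keep only the negatives with the highest confidence loss, up to a fixed ratio. The 3:1 ratio below is the one reported in the SSD paper, and `hard_negative_mining` is a hypothetical helper, not part of any library.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return a keep-mask with all positives and only the hardest negatives.

    conf_losses: confidence loss per box.
    is_positive: True where the box matched a ground truth.
    """
    num_pos = sum(1 for pos in is_positive if pos)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives are those the classifier currently gets most wrong.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    keep = set(negatives[: neg_pos_ratio * num_pos])
    return [pos or (i in keep) for i, pos in enumerate(is_positive)]

losses = [0.1, 2.0, 0.5, 3.0, 0.2, 0.05]
positives = [True, False, False, False, False, False]
print(hard_negative_mining(losses, positives))
```

Only the boxes marked `True` in the returned mask would contribute to the confidence loss for that training step.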

The result is an object detector that runs in real time at 46 FPS, with an accuracy of 74% mAP on the PASCAL VOC dataset, as good as Faster R-CNN.
