Similarity Learning: It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are.
Current models use sparse ground-truth matching as their training objective, which ignores many informative regions in the image.
The aim is to incorporate similarity learning into the tracking pipeline.
Quasi-Dense Similarity Learning densely samples hundreds of region proposals for contrastive learning, which aims to learn representations such that similar samples stay close to each other while dissimilar ones are far apart.
A key observation is that, at inference time, the learned feature space admits a simple nearest-neighbour search.
It outperforms all existing methods on the MOT, BDD100K, Waymo, and TAO tracking benchmarks.
MOT (multiple object tracking) is a fundamental problem, as is evident from the rise of self-driving cars and similar applications.
Older MOT methods used tracking-by-detection: they detect objects frame by frame and associate them across frames based on instance similarity.
Newer works have shown that, provided objects are accurately detected in the first place, spatial proximity between objects in consecutive frames (measured via IoUs or center distances) can be used to associate them.
However, problems arise when objects are occluded or scenes are crowded; to cope with this, methods such as motion estimation (determining motion vectors from adjacent frames) and displacement regression are used.
Search regions are constrained to local neighbourhoods to avoid distraction from surrounding regions.
Object appearance similarity is also used as a secondary cue to strengthen beliefs about particular associations, or to recover a track in case an object vanishes temporarily. A minimal sketch of such IoU-based association follows.
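As an illustration of the IoU-based frame-to-frame association described above (the box format, threshold, and greedy strategy are illustrative assumptions, not the paper's method):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate_by_iou(prev_boxes, curr_boxes, thresh=0.5):
    """Greedy matching: each previous box is matched to its highest-IoU
    unmatched current box, provided the IoU exceeds `thresh`."""
    matches, used = {}, set()
    for i, pb in enumerate(prev_boxes):
        candidates = [(iou(pb, cb), j) for j, cb in enumerate(curr_boxes) if j not in used]
        if candidates:
            best, j = max(candidates)
            if best >= thresh:
                matches[i] = j
                used.add(j)
    return matches
```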
So why are humans able to identify identical objects immediately while computers struggle? The paper conjectures that this might be because image and object information is not fully used during training.
Re-identification (reID) is the process of associating images or videos of the same person taken from different angles and cameras.
The four approaches shown indicate different ways similarity learning has been used. The first three treat similarity learning as a separate, later stage, or use only sparse ground-truth bounding boxes as training samples.
In the real world it’s highly unlikely that you’ll find identical-looking objects in an image, so a properly trained model should be able to distinguish similar-looking objects based on the features they have or lack compared with older frames.
A trick employed is to use many bounding boxes: the ones close to the ground-truth box provide positive examples, while the ones far apart act as negative samples (see the labeling sketch below).
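A minimal sketch of that labeling rule, reusing the iou() helper from the earlier sketch (the 0.7/0.3 thresholds mirror common two-stage detector practice and are assumptions here):

```python
def label_proposals(proposals, gt_box, pos_thresh=0.7, neg_thresh=0.3):
    """Split proposals into positives/negatives by IoU with the ground truth.
    Proposals in the (neg_thresh, pos_thresh) band are ignored."""
    positives, negatives = [], []
    for p in proposals:
        overlap = iou(p, gt_box)  # iou() as defined in the earlier sketch
        if overlap >= pos_thresh:
            positives.append(p)
        elif overlap < neg_thresh:
            negatives.append(p)
    return positives, negatives
```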
The paper employs quasi-dense contrastive learning: an image pair is split into hundreds of regions of interest, which are matched between the two images. From these matches it learns a metric with a contrastive loss, which pushes a sample's embedding towards positive targets (the same instance) and away from negative ones. The exact mathematical formula is as follows:
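For a training sample with embedding $\mathbf{v}$, its positive target $\mathbf{k}^{+}$, and negative targets $\mathbf{k}^{-}$ (reproduced here from the paper, with notation adapted):

$$\mathcal{L}_{\mathrm{embed}} = -\log\frac{\exp(\mathbf{v}\cdot\mathbf{k}^{+})}{\exp(\mathbf{v}\cdot\mathbf{k}^{+}) + \sum_{\mathbf{k}^{-}}\exp(\mathbf{v}\cdot\mathbf{k}^{-})}$$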
These quasi-dense samples cover most informative regions of the image, providing more box examples and hard negatives.
One simple approach would be to use only a handful of ground-truth labels. The paper instead enhances similarity learning: since each sample has many positive targets on the other image, the contrastive loss can be extended to account for multiple positives, which makes quasi-dense learning feasible. Each sample is thus trained to distinguish all proposals on the other image simultaneously.
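With multiple positive targets per sample, the loss generalizes (again following the paper, notation adapted) to

$$\mathcal{L}_{\mathrm{embed}} = \log\left[1 + \sum_{\mathbf{k}^{+}}\sum_{\mathbf{k}^{-}}\exp\left(\mathbf{v}\cdot\mathbf{k}^{-} - \mathbf{v}\cdot\mathbf{k}^{+}\right)\right],$$

so every positive target is contrasted against every negative target simultaneously.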
Apart from measuring similarity, MOT needs to handle false positives, ID switches, newly appeared objects, and terminated tracks.
Quasi-dense contrastive learning can be used on top of pre-existing detectors such as R-CNNs or YOLO.
The paper uses it on top of Faster R-CNN, adding a lightweight embedding extractor and residual connections to the network.
(See Figure 2 in the paper.)
It is not at all trivial how we’re going to use all this machinery we’ve built so far to actually track objects. Heck, does it even make sense? Aren’t we just doing some boring maths?
Let’s try to make some sense of what we are doing.
Suppose you have an object and you get no target, or more than one, among the matching candidates. What do you do? The nearest-neighbour logic ceases to work; for the simple model to work, there must be exactly one target among the matching candidates.
Various issues such as false positives, ID switches, newly appeared objects (of course they appear when you drive, for instance), and terminated tracks make it much less trivial to understand the tracking process, or even build one in the first place.
Now the authors come to our rescue: they observed that their inference strategy, which includes ways of maintaining matching candidates and measuring instance similarity, can mitigate these problems.
Bi-directional Softmax
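Instance similarity between the $N$ detections in the current frame (embeddings $\mathbf{n}_i$) and the $M$ matching candidates (embeddings $\mathbf{m}_j$) is measured with a softmax applied in both directions, and matching then reduces to a simple nearest-neighbour search under $f$. Reproduced from the paper (notation adapted):

$$f(i,j) = \frac{1}{2}\left[\frac{\exp(\mathbf{n}_i \cdot \mathbf{m}_j)}{\sum_{k=1}^{M}\exp(\mathbf{n}_i \cdot \mathbf{m}_k)} + \frac{\exp(\mathbf{n}_i \cdot \mathbf{m}_j)}{\sum_{k=1}^{N}\exp(\mathbf{n}_k \cdot \mathbf{m}_j)}\right]$$

A direct NumPy translation (a sketch, not the paper's code; it omits the per-axis max subtraction a numerically stable softmax would use):

```python
import numpy as np

def bi_softmax(det_emb, cand_emb):
    """Bi-directional softmax similarity.
    det_emb:  (N, d) embeddings of current-frame detections.
    cand_emb: (M, d) embeddings of matching candidates."""
    logits = det_emb @ cand_emb.T               # (N, M) dot-product similarities
    exp = np.exp(logits)
    row = exp / exp.sum(axis=1, keepdims=True)  # softmax over candidates, per detection
    col = exp / exp.sum(axis=0, keepdims=True)  # softmax over detections, per candidate
    return 0.5 * (row + col)
```

The bi-directional form rewards matches that are consistent in both directions: a detection should pick the candidate, and the candidate should pick the detection back.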
No Target Cases
Multi-Target Classes
But a problem arises: some detections might be in the same location but have different classes, and usually only one of them is our required prediction.
Keeping such candidates can boost object recall and increase the mean Average Precision (mAP).
The downside is that it creates duplicate feature embeddings; to deal with this, inter-class duplicate removal is performed using non-maximum suppression (NMS).
The IoU threshold for NMS is 0.7 for objects with high detection confidence (larger than 0.5) and 0.3 for objects with lower detection confidence (below 0.5); a sketch of this follows.
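A sketch of such confidence-dependent inter-class duplicate removal, again reusing iou() from the earlier sketch (the paper's actual implementation may differ in details):

```python
def remove_duplicates(boxes, scores, classes):
    """Inter-class duplicate removal with confidence-dependent IoU thresholds:
    0.7 for confident detections (score > 0.5), 0.3 otherwise."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        thresh = 0.7 if scores[i] > 0.5 else 0.3
        # suppress i only if it overlaps a kept box of a *different* class too much
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep if classes[j] != classes[i]):
            keep.append(i)
    return keep
```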
MOT
BDD100K
Waymo
TAO
128 RoIs are chosen from the key frame as training samples, and 256 RoIs from the reference frame as contrastive targets, with a positive-to-negative ratio of 1.0.
IoU-balanced sampling is used to sample the RoIs.
Feature embedding extraction is done using a 4conv-1fc head with group normalization.
There are 256 channels for the embedding features; a sketch of such a head follows.
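A minimal PyTorch sketch of such a 4conv-1fc embedding head; everything except the 4conv-1fc structure, the group normalization, and the 256-d output is an assumption:

```python
import torch.nn as nn

class EmbedHead(nn.Module):
    """4 conv layers (with group norm) followed by 1 fc layer,
    producing a 256-d embedding per RoI."""
    def __init__(self, in_channels=256, roi_size=7, embed_dim=256, groups=32):
        super().__init__()
        convs = []
        for _ in range(4):
            convs += [
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.GroupNorm(groups, in_channels),
                nn.ReLU(inplace=True),
            ]
        self.convs = nn.Sequential(*convs)
        self.fc = nn.Linear(in_channels * roi_size * roi_size, embed_dim)

    def forward(self, x):  # x: (num_rois, C, roi_size, roi_size)
        x = self.convs(x)
        return self.fc(x.flatten(1))
```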
The hyperparameters are as follows: batch size 16; initial learning rate 0.02, trained for 12 epochs with the learning rate multiplied by 0.1 after epochs 8 and 11.
The original image is used without any rescaling.
The only data augmentation applied is horizontal flipping.
The model is initialized from ImageNet-pretrained weights.
A new track is only initialized if the detection confidence is above 0.8.
Feature embeddings are updated online with a momentum of 0.8; these settings are gathered in the sketch below.
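The settings above, collected into one hypothetical config dict, together with the momentum update the last point describes (the exact weighting convention is an assumption; the paper only states that embeddings are updated online with momentum 0.8):

```python
train_cfg = {
    "key_frame_rois": 128,       # training samples
    "ref_frame_rois": 256,       # contrastive targets
    "pos_neg_ratio": 1.0,
    "batch_size": 16,
    "lr": 0.02,
    "epochs": 12,
    "lr_decay_epochs": [8, 11],  # multiply lr by 0.1 at these epochs
    "new_track_conf": 0.8,
    "embed_momentum": 0.8,
}

def update_track_embedding(old_emb, new_emb, momentum=0.8):
    """Online momentum update of a track's feature embedding.
    NOTE: the weighting convention here is assumed, not stated in the paper."""
    return momentum * old_emb + (1.0 - momentum) * new_emb
```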
The standard procedure is followed on MOT17 to get results comparable to other papers. Images are randomly resized with the longer side capped at 1088 pixels, keeping the aspect ratio unchanged during training and inference. Random horizontal flipping as well as color jittering is applied, randomly changing the brightness, contrast, and saturation of an image; the flip-and-jitter part is sketched below.
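A torchvision sketch of that augmentation (the jitter strengths are assumptions; the aspect-ratio-preserving resize to a 1088-pixel longer side needs a custom transform and is omitted):

```python
import torchvision.transforms as T

# horizontal flip + color jitter, as described above; strengths are assumed
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
])
```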
No extra data was used apart from a model pre-trained on COCO; the benchmark rules don’t count COCO as additional training data, and it is heavily used. COCO is a large-scale object detection, segmentation, and captioning dataset.
On the TAO dataset the shorter side of the image is randomly rescaled between 640 and 800 pixels during training; at inference time the shorter side is rescaled to 800. An LVIS pre-trained model is used; however, overfitting was frequently observed, so the authors decided to freeze the detection model and only train the embedding head to obtain instance representations.
The method outperformed all existing methods on these benchmarks.
Scores can be seen in the paper; key observations from each dataset are mentioned below:
MOT
BDD100K
Waymo
TAO
What is an ablation study? In artificial intelligence, particularly machine learning, ablation is the removal of a component of an AI system. An ablation study evaluates the system's performance after removing certain components, to understand each component's contribution to the overall system.
The BDD100K validation set was used to test the different model components.
Importance of Quasi-Dense Matching