How to detect motile objects in video frames more accurately with deep learning?


Object detection has long been one of the most challenging problems in computer vision. Since the advent of deep convolutional networks, detection accuracy has improved considerably. However, there is still plenty of room to improve neural networks and detection accuracy in specific domains.

In some fields, we may need to detect motile objects that have few distinguishing attributes or appear under poor imaging conditions. Examples include vehicles moving in bad weather such as rain or fog, or small objects like cells or sperms. In these cases, it is hard for the deep feature extractor to capture the objects' appearance features, so the network may miss objects or detect them incorrectly. We therefore propose a method that lets the network also exploit the motility of the objects across consecutive frames. Consequently, the detection accuracy should increase.

The video in which we want to detect objects consists of consecutive frames. Traditionally, we would extract the frames of the video with a generator and feed each frame to the deep network separately, so that objects are detected frame by frame. With this approach, the network cannot perceive the motility of the objects. Our proposal for such cases is to concatenate several consecutive frames and give them to the network as a single input.
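The concatenation step itself is simple: stack the frames along the channel axis so the detector sees one multi-channel image. Here is a minimal sketch with NumPy (the function name and dummy frames are my own, for illustration):

```python
import numpy as np

def concat_frames(frames, center, n=3):
    """Stack n consecutive grayscale frames around `center` along the
    channel axis, producing one multi-channel input for the detector."""
    half = n // 2
    window = frames[center - half : center + half + 1]  # n frames
    return np.stack(window, axis=-1)  # shape: (H, W, n)

# Example: five dummy 4x4 grayscale frames with constant values 0..4
frames = [np.full((4, 4), i, dtype=np.float32) for i in range(5)]
x = concat_frames(frames, center=2, n=3)
print(x.shape)  # (4, 4, 3): frames 1, 2, 3 stacked channel-wise
```

For RGB video the same idea applies; stacking three RGB frames simply yields a 9-channel input instead of a 3-channel one.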

In the training stage, you can concatenate three consecutive frames and give them to the network together with the ground truth of the middle frame. This way, the network learns that the target objects move from the previous frame to the next frame, and it uses this motility feature, alongside the appearance features, to detect and classify the objects.
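A training-sample generator for this scheme pairs each stacked window with the annotations of its middle frame only. The sketch below assumes per-frame bounding-box lists; the generator name and box format are illustrative, not from the paper:

```python
import numpy as np

def training_samples(frames, annotations, n=3):
    """Yield (input, target) pairs for training: each input stacks n
    consecutive frames channel-wise; the target is the ground-truth
    annotation (e.g. bounding boxes) of the middle frame only."""
    half = n // 2
    for t in range(half, len(frames) - half):
        x = np.stack(frames[t - half : t + half + 1], axis=-1)
        yield x, annotations[t]

# Example with five dummy frames and one hypothetical box per frame
frames = [np.zeros((8, 8), dtype=np.float32) for _ in range(5)]
boxes = [[(0, 0, 2, 2)] for _ in range(5)]
pairs = list(training_samples(frames, boxes))
print(len(pairs))         # 3 samples, centered on frames 1, 2, 3
print(pairs[0][0].shape)  # (8, 8, 3)
```

Note that the first and last frames of the video never become window centers during training, since they lack a neighbor on one side.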

In the testing phase, to detect the objects in a frame, we concatenate that frame with its previous and next frames and feed the result to the network. Since the network has learned, during training, to detect the objects in the middle frame using their motility in addition to their other attributes, it does the same at test time and detects the objects in the middle frame more accurately. We can also use more consecutive frames, such as five or seven.
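At test time the same windowing is applied around the frame of interest. The helper below clamps indices at the video boundaries, replicating the first or last frame so every frame still gets a full window; that edge-handling choice is an assumption of this sketch, not something specified in the post:

```python
import numpy as np

def inference_input(frames, t, n=3):
    """Build the test-time input for frame t: stack its n-frame temporal
    window channel-wise. Indices are clamped at the video boundaries
    (replicating the first/last frame) -- an assumed edge-handling rule."""
    half = n // 2
    idx = [min(max(i, 0), len(frames) - 1) for i in range(t - half, t + half + 1)]
    return np.stack([frames[i] for i in idx], axis=-1)

frames = [np.full((4, 4), i, dtype=np.float32) for i in range(6)]
a = inference_input(frames, 0)        # frame 0 replicated on the left
b = inference_input(frames, 3, n=5)   # a five-frame window
print(a.shape, b.shape)  # (4, 4, 3) (4, 4, 5)
```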

This idea is from this paper:

In the detection phase of this paper, we implemented the proposed method on RetinaNet to detect sperms in phase-contrast microscopy image sequences. In our case, concatenating three frames achieved better results.
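The post does not show how the detector's input layer was adapted to the wider input. One common way, sketched below in PyTorch as an assumption rather than the paper's exact implementation, is to widen the backbone's first convolution, repeating the pretrained RGB weights across the extra frames and rescaling so activations keep roughly the same magnitude:

```python
import torch
import torch.nn as nn

def widen_first_conv(conv, k):
    """Return a copy of `conv` that accepts k times as many input
    channels, initialized by tiling the original weights and dividing
    by k to preserve the activation scale."""
    new = nn.Conv2d(conv.in_channels * k, conv.out_channels,
                    kernel_size=conv.kernel_size, stride=conv.stride,
                    padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new.weight.copy_(conv.weight.repeat(1, k, 1, 1) / k)
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new

# A ResNet-style stem conv expects 3 (RGB) channels; widen it for
# three concatenated RGB frames (3 x 3 = 9 channels).
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
wide = widen_first_conv(conv, 3)
x = torch.randn(1, 9, 224, 224)
print(wide(x).shape)  # torch.Size([1, 64, 112, 112])
```

For grayscale microscopy frames the same trick applies with one channel per frame; only the channel counts change.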

The code is available at

The graphical abstract of our paper, which includes detection and tracking phases, is depicted in the next figure:

Graphical abstract of our paper

The results of our paper are shown in the following plots.

As is evident, concatenating three consecutive frames improves both the detection accuracy and the speed of network convergence.