Monocular depth estimation and object detection
Little assistive technology currently exists for the visually impaired, so there is a need for a device that helps with daily activities. This project aims to help a person with impaired eyesight detect the objects in front of them without the need for a companion. The method runs object detection on images sent from the user's end, and the detected object or person is then relayed back to the user as an audio message.
The goal of this project is to develop software that receives frames from a camera, detects the closest object, and sends back audio cues so that the user can avoid the obstacle.
The datasets we used to train MiDaS for this project are MegaDepth, ReDWeb, and KITTI.
MegaDepth is a structure-from-motion (SfM) dataset which consists of diverse outdoor images.
ReDWeb, which stands for Relative Depth from Web, is an RGB-D dataset of images collected from the web; this gives a diverse range of pictures and includes dynamic scenes.
KITTI is a stereo image dataset captured from a car driving around city streets.
A custom dataset was prepared to fine-tune the YOLO model. It consists of seven classes: car, bicycle, bench, stop sign, fire hydrant, person, and animal, with a split of 0.8 for training, 0.1 for validation, and 0.1 for testing. The dataset contains 2,000 images, which were further augmented using Keras's ImageDataGenerator.
The data was collected primarily from Google Open Images.
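As a rough illustration of the augmentation step, the sketch below assumes the 2,000 images have already been split into train/validation/test folders; the directory names and the specific augmentation parameters are illustrative assumptions rather than the exact settings used.

```python
# Minimal sketch of the augmentation step with Keras's ImageDataGenerator.
# Directory names and augmentation parameters are illustrative assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values to [0, 1]
    rotation_range=15,        # small random rotations
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)

# Only rescaling for validation data, no augmentation.
val_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = train_datagen.flow_from_directory(
    "dataset/train", target_size=(32, 32), batch_size=32, class_mode="categorical"
)
val_gen = val_datagen.flow_from_directory(
    "dataset/val", target_size=(32, 32), batch_size=32, class_mode="categorical"
)
```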
The primary models we used were MiDaS_small and YOLOv5. We chose MiDaS_small because it has a faster inference speed, and for this project to work in real time we need it to be as fast as possible. We chose YOLOv5 for the same reason, as it is faster than models such as R-CNN (which relies on selective search) and SSD.
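Both models are available through PyTorch Hub; a minimal loading sketch is shown below. The yolov5s checkpoint here is illustrative, since in our case the detector was fine-tuned on the custom seven-class dataset described above.

```python
# Sketch: loading MiDaS_small and a YOLOv5 checkpoint from PyTorch Hub.
# The yolov5s checkpoint is illustrative; the actual detector was fine-tuned
# on the custom seven-class dataset.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# MiDaS_small: lightweight monocular depth estimation model
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
midas_transform = midas_transforms.small_transform

# YOLOv5: object detector (smallest variant for speed)
yolo = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True).to(device)
```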
To train the model, the images are loaded and the labels are one-hot encoded.
The split is 80% for training and 20% for testing.
Each input image is converted to floating point and normalized to the range 0–1 by dividing every pixel value by 255.
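A minimal sketch of this preprocessing is shown below; the placeholder arrays stand in for the real loaded images and labels, and the variable names are illustrative.

```python
# Sketch of the preprocessing pipeline: one-hot labels, 80/20 train/test
# split, and 0-1 normalization. The placeholder arrays are for illustration
# only; replace them with the real loaded dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

images = np.random.randint(0, 256, size=(2000, 32, 32, 3), dtype=np.uint8)
labels = np.random.randint(0, 7, size=(2000,))

# One-hot encode the seven class labels
labels_onehot = to_categorical(labels, num_classes=7)

# 80% training, 20% testing
x_train, x_test, y_train, y_test = train_test_split(
    images, labels_onehot, test_size=0.2, random_state=42
)

# Convert to float and normalize pixel values to [0, 1]
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
```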
Model Structure
Our model consists of six convolutional layers and six fully connected layers, with an input shape of (32, 32, 3).
We also used Dropout layers and Batch Normalization to improve validation accuracy.
All layers except the final one use ReLU as the activation function; the final layer uses softmax.
We chose SGD as our optimizer and categorical cross-entropy as our loss function.
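A sketch of an architecture matching this description is shown below. Only the layer counts, activations, optimizer, and loss are stated above, so the filter counts, dense-layer widths, dropout rates, and learning rate are our assumptions.

```python
# Sketch of a CNN matching the description: six Conv2D layers, six Dense
# layers, BatchNormalization and Dropout, ReLU activations, softmax output,
# SGD optimizer and categorical cross-entropy loss. Filter counts, dense
# widths, dropout rates, and learning rate are illustrative assumptions.
from tensorflow.keras import layers, models, optimizers

num_classes = 7

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),

    # Convolutional block 1
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),

    # Convolutional block 2
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),

    # Convolutional block 3
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),

    # Fully connected head: five ReLU layers plus the softmax output layer
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```

Training would then proceed with model.fit on the preprocessed arrays for the 50 epochs reported below.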
The validation accuracy after 50 epochs was 69.4%.
Even though the validation accuracy seemed acceptable, the model did not perform as expected on real-world images, so we switched to a fine-tuned YOLOv5 model.
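To make the end-to-end flow concrete, below is a minimal inference sketch that combines the YOLOv5 detections with the MiDaS depth map to pick the closest detected object and announce it. It assumes the models were loaded as in the earlier snippet; the closeness heuristic (comparing mean relative inverse depth inside each box) and the pyttsx3 text-to-speech step are illustrative assumptions, not the exact implementation.

```python
# Sketch: pick the closest detected object using the MiDaS depth map and
# announce it as audio. Assumes `midas`, `midas_transform`, `yolo`, and
# `device` were loaded as in the earlier snippet. The closeness heuristic
# and the pyttsx3 text-to-speech step are illustrative assumptions.
import cv2
import torch
import pyttsx3

engine = pyttsx3.init()

frame = cv2.imread("frame.jpg")              # one frame from the user's camera
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

# Depth estimation: MiDaS outputs relative inverse depth
# (larger values = closer), so objects can be compared by their mean value.
with torch.no_grad():
    depth = midas(midas_transform(rgb).to(device))
    depth = torch.nn.functional.interpolate(
        depth.unsqueeze(1), size=rgb.shape[:2], mode="bicubic", align_corners=False
    ).squeeze().cpu().numpy()

# Object detection: YOLOv5 returns boxes as (x1, y1, x2, y2, conf, class)
detections = yolo(rgb).xyxy[0].cpu().numpy()

closest_label, closest_score = None, -1.0
for x1, y1, x2, y2, conf, cls in detections:
    box_depth = depth[int(y1):int(y2), int(x1):int(x2)]
    if box_depth.size == 0:
        continue
    score = float(box_depth.mean())          # higher mean = closer object
    if score > closest_score:
        closest_score = score
        closest_label = yolo.names[int(cls)]

if closest_label is not None:
    engine.say(f"{closest_label} ahead")     # audio cue for the user
    engine.runAndWait()
```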