Find Duplicate Images using imagededup

Find Duplicate Images using imagededup

The imagededup is a Python package that allows to find exact and near duplicate images in the collection of images. It can be useful to find and remove duplicate images from a dataset when training a model.

The imagededup provides various algorithms to find duplicates. This tutorial provides an example how to use convolutional neural network (CNN) to find duplicate images in a directory.

Using pip package manager, install imagededup from the command line. The pip installs TensorFlow 2 because it required by imagededup.

pip install imagededup

We will use 9 images which stored in the images directory. We will try to find duplicates for image01.jpg.

Images Collection for Finding Duplicates

We create a convolutional neural network by using CNN class. To find duplicate images, we use find_duplicates method. The image_dir parameter defines the path to the directory that contains images. If the scores parameter is True then similarity scores are returned together with duplicates. The min_similarity_threshold is a threshold value which defines a minimum score of the similarity. If similarity score is greater than min_similarity_threshold value, then the image will be a duplicate.

from imagededup.methods import CNN
from imagededup.utils import plot_duplicates

imgDir = 'images'
img = 'image01.jpg'

cnn = CNN()
duplicates = cnn.find_duplicates(image_dir=imgDir, scores=True,
                                 min_similarity_threshold=0.9)

plot_duplicates(image_dir=imgDir, duplicate_map=duplicates, filename=img)

The find_duplicates method returns a dictionary of the form like this:

{
    'image01.jpg': [
        ('image05.jpg', 0.9601821),
        ('image07.jpg', 0.95339376),
        ('image09.jpg', 0.9276193)
    ],
    'image02.jpg': [
        ('image06.jpg', 0.93348324)
    ],
    'image03.jpg': [
        ('image04.jpg', 0.92860264),
        ('image08.jpg', 0.91540354)
    ],
    'image04.jpg': [
        ('image03.jpg', 0.92860264),
    ...
}

Finally, we use the plot_duplicates function to display duplicated images for the image image01.jpg.

Duplicated Images

Leave a Comment

Cancel reply

Your email address will not be published.