Find Duplicate Images using imagededup

imagededup is a Python package for finding exact and near-duplicate images in a collection of images. It is useful for finding and removing duplicate images from a dataset before training a model.

imagededup provides several algorithms for finding duplicates. This tutorial shows how to use a convolutional neural network (CNN) to find duplicate images in a directory.

Install imagededup from the command line using the pip package manager. pip also installs TensorFlow 2, because imagededup requires it.

pip install imagededup

We will use 9 images stored in the images directory and try to find duplicates of image01.jpg.

Images Collection for Finding Duplicates

We create the convolutional neural network with the CNN class and find duplicate images with its find_duplicates method. The image_dir parameter is the path to the directory that contains the images. If the scores parameter is True, similarity scores are returned together with the duplicates. The min_similarity_threshold parameter sets the minimum similarity score: an image is reported as a duplicate of another image when their similarity score is greater than this value.

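from imagededup.methods import CNN
from imagededup.utils import plot_duplicates

img_dir = 'images'      # directory that contains the images
img = 'image01.jpg'     # image whose duplicates we want to display

cnn = CNN()

# Map every image to a list of (duplicate filename, similarity score) tuples.
# 0.9 is the library's default threshold.
duplicates = cnn.find_duplicates(image_dir=img_dir, scores=True,
                                 min_similarity_threshold=0.9)

# Plot image01.jpg together with the duplicates found for it
plot_duplicates(image_dir=img_dir, duplicate_map=duplicates, filename=img)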
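
To make the role of min_similarity_threshold more concrete, here is a minimal sketch of threshold-based matching. It assumes the score is a cosine similarity between feature vectors. The helper names cosine_similarity and find_similar and the toy 3-dimensional vectors are hypothetical and are not part of imagededup, whose real encodings come from the CNN.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_similar(encodings, query, min_similarity_threshold=0.9):
    """Return (name, score) pairs scoring above the threshold, best first."""
    matches = []
    for name, vector in encodings.items():
        if name == query:
            continue  # an image is not a duplicate of itself
        score = cosine_similarity(encodings[query], vector)
        if score > min_similarity_threshold:
            matches.append((name, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Toy "encodings" (hypothetical values, for illustration only)
toy_encodings = {
    'a.jpg': [1.0, 0.0, 0.0],
    'b.jpg': [0.9, 0.1, 0.0],  # points almost the same way as a.jpg
    'c.jpg': [0.0, 1.0, 0.0],  # orthogonal to a.jpg
}

similar = find_similar(toy_encodings, 'a.jpg')
print([name for name, _ in similar])
```

Raising the threshold keeps only closer matches, while lowering it also returns images that are merely similar.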

The find_duplicates method returns a dictionary of the following form:

{
    'image01.jpg': [
        ('image05.jpg', 0.9601821),
        ('image07.jpg', 0.95339376),
        ('image09.jpg', 0.9276193)
    ],
    'image02.jpg': [
        ('image06.jpg', 0.93348324)
    ],
    'image03.jpg': [
        ('image04.jpg', 0.92860264),
        ('image08.jpg', 0.91540354)
    ],
    'image04.jpg': [
        ('image03.jpg', 0.92860264),
        ...
    ],
    ...
}
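
Because the result is an ordinary dictionary, it can be processed with plain Python. The sketch below uses the first three entries of the output above as sample data. The helpers best_match and files_to_remove are hypothetical names, not imagededup functions, and keeping the alphabetically first image of each group is just one possible policy.

```python
# Sample result copied from the first three entries of the output above
duplicates = {
    'image01.jpg': [
        ('image05.jpg', 0.9601821),
        ('image07.jpg', 0.95339376),
        ('image09.jpg', 0.9276193),
    ],
    'image02.jpg': [('image06.jpg', 0.93348324)],
    'image03.jpg': [
        ('image04.jpg', 0.92860264),
        ('image08.jpg', 0.91540354),
    ],
}


def best_match(duplicate_map, filename):
    """Return the (name, score) pair with the highest score, or None."""
    matches = duplicate_map.get(filename, [])
    if not matches:
        return None
    return max(matches, key=lambda m: m[1])


def files_to_remove(duplicate_map):
    """Keep the first image of each group (by name) and list the rest."""
    keep, remove = set(), set()
    for name in sorted(duplicate_map):
        if name in remove:
            continue  # already marked as a duplicate of a kept image
        keep.add(name)
        for dup, _score in duplicate_map[name]:
            if dup not in keep:
                remove.add(dup)
    return sorted(remove)


print(best_match(duplicates, 'image01.jpg'))
print(files_to_remove(duplicates))
```

Review the resulting list before deleting any files, since near duplicates are not always images you want to discard.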

Finally, we use the plot_duplicates function to display the duplicates found for image01.jpg.

Duplicated Images
