The imagededup is a Python package that allows to find exact and near duplicate images in the collection of images. It can be useful to find and remove duplicate images from a dataset when training a model.
The imagededup provides various algorithms to find duplicates. This tutorial provides an example how to use convolutional neural network (CNN) to find duplicate images in a directory.
Using pip
package manager, install imagededup
from the command line. The pip
installs TensorFlow 2 because it required by imagededup.
pip install imagededup
We will use 9 images which stored in the images
directory. We will try to find duplicates for image01.jpg
.
We create a convolutional neural network by using CNN
class. To find duplicate images, we use find_duplicates
method. The image_dir
parameter defines the path to the directory that contains images. If the scores
parameter is True
then similarity scores are returned together with duplicates. The min_similarity_threshold
is a threshold value which defines a minimum score of the similarity. If similarity score is greater than min_similarity_threshold
value, then the image will be a duplicate.
from imagededup.methods import CNN
from imagededup.utils import plot_duplicates
imgDir = 'images'
img = 'image01.jpg'
cnn = CNN()
duplicates = cnn.find_duplicates(image_dir=imgDir, scores=True,
min_similarity_threshold=0.9)
plot_duplicates(image_dir=imgDir, duplicate_map=duplicates, filename=img)
The find_duplicates
method returns a dictionary of the form like this:
{
'image01.jpg': [
('image05.jpg', 0.9601821),
('image07.jpg', 0.95339376),
('image09.jpg', 0.9276193)
],
'image02.jpg': [
('image06.jpg', 0.93348324)
],
'image03.jpg': [
('image04.jpg', 0.92860264),
('image08.jpg', 0.91540354)
],
'image04.jpg': [
('image03.jpg', 0.92860264),
...
}
Finally, we use the plot_duplicates
function to display duplicated images for the image image01.jpg
.
Leave a Comment
Cancel reply