Search Datasets in Kaggle using API and Python

Suppose you don't know the exact dataset identifier. The Kaggle API lets you search datasets by keyword, tags, file type, license, or owner. Results are paginated, so to get the next page you need to pass a page number.

The Kaggle API client provides the dataset_list method for searching datasets. The following code searches for datasets matching the keyword iris and asks for the results sorted by votes.

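import os
from pprint import pprint

# Set the credentials before importing the client.
os.environ['KAGGLE_USERNAME'] = 'YOUR_USERNAME'
os.environ['KAGGLE_KEY'] = 'YOUR_KEY'

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Search by keyword and sort the results by number of votes.
datasets = api.dataset_list(search='iris', sort_by='votes')

# Print the identifier of each dataset followed by all of its attributes.
for dataset in datasets:
    print('---- ' + dataset.ref + ' ----')
    pprint(vars(dataset))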

The dataset_list method returns a list of Dataset objects. For each one, we print the dataset identifier followed by all of the object's attributes. Part of the output:

---- uciml/iris ----
{'creatorName': 'Kaggle Team',
 'creatorUrl': 'kaggleteam',
 'currentVersionNumber': 2,
 'description': None,
 'downloadCount': 196903,
 'files': [],
 'id': 19,
 'isFeatured': False,
 'isPrivate': False,
 'isReviewed': True,
 'kernelCount': 5239,
 'lastUpdated': datetime.datetime(2016, 9, 27, 7, 38, 5),
 'licenseName': 'CC0: Public Domain',
 'ownerName': 'UCI Machine Learning',
 'ownerRef': 'uciml',
 'ref': 'uciml/iris',
 'size': '4KB',
.......................
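Printing every attribute is noisy when only a few fields matter. The following is a minimal sketch of one way to reduce each result to a short summary. The helper names summarize and top_by_downloads and the SUMMARY_FIELDS tuple are hypothetical, and the attribute names are copied from the sample output above, so they are an assumption that may not hold for every version of the kaggle package. A stand-in object replaces a real API result so the sketch runs without credentials.

```python
from types import SimpleNamespace

# Attribute names are taken from the sample output above (an assumption);
# attributes that are missing on an object fall back to None.
SUMMARY_FIELDS = ("ref", "ownerName", "licenseName", "size", "downloadCount")


def summarize(dataset, fields=SUMMARY_FIELDS):
    """Return a plain dict holding only the selected attributes of a dataset."""
    return {name: getattr(dataset, name, None) for name in fields}


def top_by_downloads(datasets, limit=5):
    """Return summaries of the most downloaded datasets, highest first."""
    rows = [summarize(d) for d in datasets]
    # Treat a missing download count as 0 so sorting never compares None.
    rows.sort(key=lambda row: row["downloadCount"] or 0, reverse=True)
    return rows[:limit]


# Stand-in for one object returned by api.dataset_list(), so this runs offline.
sample = SimpleNamespace(ref="uciml/iris", ownerName="UCI Machine Learning",
                         licenseName="CC0: Public Domain", size="4KB",
                         downloadCount=196903)
print(summarize(sample))
```

With a real result list, top_by_downloads(datasets, limit=3) would give the three most downloaded matches as plain dictionaries.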

dataset_list method parameters:

| No  | Parameter    | Default value | Description |
|-----|--------------|---------------|-------------|
| 1.  | sort_by      | None          | Sort results. Available options: hottest (default), votes, updated, active, published. |
| 2.  | file_type    | None          | Search for datasets by file type. Available options: all (default), csv, sqlite, json, bigQuery. |
| 3.  | license_name | None          | Search for datasets by license. Available options: all (default), cc, gpl, odb, other. |
| 4.  | tag_ids      | None          | Search for datasets by tags. Tags should be separated by commas. |
| 5.  | search       | None          | Search for datasets by keyword. |
| 6.  | user         | None          | Search for datasets by owner. |
| 7.  | mine         | False         | Return datasets owned by the currently logged-in user. |
| 8.  | page         | 1             | Page number. |
| 9.  | max_size     | None          | Maximum size of the dataset. |
| 10. | min_size     | None          | Minimum size of the dataset. |
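
Because results are paginated, collecting more than one page means calling dataset_list repeatedly with an increasing page value. The sketch below shows one possible loop. The function name search_all_pages and its max_pages cap are hypothetical, and it assumes that a page past the last one comes back empty, which is not confirmed here. The client is passed in as an argument, so any object with a dataset_list method works.

```python
def search_all_pages(api, max_pages=5, **filters):
    """Collect results of api.dataset_list() across several pages.

    Assumption: a page past the last one comes back empty (or None),
    which is treated here as the end of the results.
    """
    results = []
    for page in range(1, max_pages + 1):
        batch = api.dataset_list(page=page, **filters)
        if not batch:
            break  # no more results
        results.extend(batch)
    return results


# Usage with the authenticated client from the example above:
# datasets = search_all_pages(api, max_pages=3, search='iris',
#                             file_type='csv', sort_by='votes')
# print(len(datasets))
```

The max_pages cap keeps a broad search from issuing an unbounded number of requests; raise it when you really need more results.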
