Awesome Public Datasets


Awesome Public Datasets

MachineLearning

Image classification

Source Citation Download Description
MNIST LeCun et al., 1998a download Classic dataset of small (28x28) handwritten grayscale digits, developed in the 1990s for testing the most sophisticated models of the day; today, often used as a basic “hello world” for introducing deep learning. This fast.ai datasets version uses a standard PNG format instead of the special binary format of the original, so you can use the regular data pipelines in most libraries; if you want to use just a single input channel like the original, simply pick a single slice from the channels axis.
CIFAR10 Krizhevsky, 2009 download 60000 32x32 colour images in 10 classes, with 6000 images per class (50000 training images and 10000 test images). Very widely used today for testing performance of new algorithms. This fast.ai datasets version uses a standard PNG format instead of the platform-specific binary formats of the original, so you can use the regular data pipelines in most libraries.
CIFAR100 Krizhevsky, 2009 download This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).
Caltech-UCSD Birds-200-2011 Lin et al. 2015 download An image dataset with photos of 200 bird species (mostly North American); it can also be used for localization. Number of categories: 200; Number of images: 11,788; Annotations per image: 15 Part Locations, 312 Binary Attributes, 1 Bounding Box
Caltech 101 L. Fei-Fei et al., 2004 download Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. The size of each image is roughly 300 x 200 pixels. Can also be used for localization.
Oxford-IIIT Pet O. M. Parkhi et al., 2012 download A 37 category pet dataset with roughly 200 images for each class. The images have a large variations in scale, pose and lighting. Can also be used for localization.
Oxford 102 Flowers Nilsback, M-E. and Zisserman, A., 2008 download A 102 category dataset consisting of 102 flower categories, commonly occuring in the United Kingdom. Each class consists of 40 to 258 images. The images have large scale, pose and light variations.
Food-101 Bossard, Lukas et al., 2014 download 101 food categories, with 101,000 images; 250 test images and 750 training images per class. The training images were not cleaned. All images were rescaled to have a maximum side length of 512 pixels.
Stanford cars Jonathan Krause et al., 2013 download 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, where each class has been split roughly in a 50-50 split. Classes are typically at the level of Make, Model, Year.
Imagenette Based on Deng et al., 2009 Full size 320 px 160 px A subset of 10 easily classified classes from Imagenet: tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute
Imagewoof Based on Deng et al., 2009 Full size 320 px 160 px A subset of 10 harder to classify classes from Imagenet (all dog breeds): Australian terrier, Border terrier, Samoyed, beagle, Shih-Tzu, English foxhound, Rhodesian ridgeback, dingo, golden retriever, Old English sheepdog

NaturalLanguage

Source Citation Download Description
IMDb Large Movie Review Dataset Andrew L. Maas et al., 2011 download A dataset for binary sentiment classification containing 25,000 highly polarized movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Wikitext-103 Stephen Merity et al., 2016 download A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Widely used for language modeling, including the pretrained models used in the fastai library and ULMFiT algorithm.
Wikitext-2 Stephen Merity et al., 2016 download A subset of Wikitext-103; useful for testing language model training on smaller datasets.
WMT 2015 French/English parallel texts Callison-Burch et al., 2009 download French/English parallel texts for training translation models. Over 20 million sentences in French and English. Dataset created by Chris Callison-Burch, who crawled millions of web pages and then used a set of simple heuristics to transform French URLs onto English URLs, and assumed that these documents are translations of each other.
AG News Xiang Zhang et al., 2015 download 496,835 categorized news articles from >2000 news sources from the 4 largest classes from AG’s corpus of news articles, using only the title and description fields. The number of training samples for each class is 30,000 and testing 1900.
Amazon reviews - Full Xiang Zhang et al., 2015 download 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This full dataset contains 600,000 training samples and 130,000 testing samples in each class.
Amazon reviews - Polarity Xiang Zhang et al., 2015 download 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This subset contains 1,800,000 training samples and 200,000 testing samples in each polarity sentiment.
DBPedia ontology Xiang Zhang et al., 2015 download 40,000 training samples and 5,000 testing samples from 14 nonoverlapping classes from DBpedia 2014.
Sogou news Xiang Zhang et al., 2015 download 2,909,551 news articles from the SogouCA and SogouCS news corpora, in 5 categories. The number of training samples selected for each class is 90,000 and testing 12,000. Note that the Chinese characters have been converted to Pinyin.
Yahoo! Answers Xiang Zhang et al., 2015 download The 10 largest main categories from the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset. Each class contains 140,000 training samples and 5,000 testing samples.
Yelp reviews - Full Xiang Zhang et al., 2015 download 1,569,264 samples from the Yelp Dataset Challenge 2015. This full dataset has 130,000 training samples and 10,000 testing samples in each star.
Yelp reviews - Polarity Xiang Zhang et al., 2015 download 1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity.

文章作者: Terence Cai
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Terence Cai !
  目录