Awesome Public Datasets

Datasets

Published: 2021-04-22

Updated: 2021-04-22

Machine Learning

Image classification

Source	Citation	Download	Description
MNIST	LeCun et al., 1998a	download	Classic dataset of small (28x28) handwritten grayscale digits, developed in the 1990s for testing the most sophisticated models of the day; today, often used as a basic “hello world” for introducing deep learning. This fast.ai datasets version uses a standard PNG format instead of the special binary format of the original, so you can use the regular data pipelines in most libraries; if you want to use just a single input channel like the original, simply pick a single slice from the channels axis.
CIFAR10	Krizhevsky, 2009	download	60000 32x32 colour images in 10 classes, with 6000 images per class (50000 training images and 10000 test images). Very widely used today for testing performance of new algorithms. This fast.ai datasets version uses a standard PNG format instead of the platform-specific binary formats of the original, so you can use the regular data pipelines in most libraries.
CIFAR100	Krizhevsky, 2009	download	This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).
Caltech-UCSD Birds-200-2011	Lin et al. 2015	download	An image dataset with photos of 200 bird species (mostly North American); it can also be used for localization. Number of categories: 200; Number of images: 11,788; Annotations per image: 15 Part Locations, 312 Binary Attributes, 1 Bounding Box
Caltech 101	L. Fei-Fei et al., 2004	download	Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. The size of each image is roughly 300 x 200 pixels. Can also be used for localization.
Oxford-IIIT Pet	O. M. Parkhi et al., 2012	download	A 37-category pet dataset with roughly 200 images for each class. The images have large variations in scale, pose and lighting. Can also be used for localization.
Oxford 102 Flowers	Nilsback, M-E. and Zisserman, A., 2008	download	A dataset of 102 flower categories commonly occurring in the United Kingdom. Each class consists of 40 to 258 images. The images have large scale, pose and light variations.
Food-101	Bossard, Lukas et al., 2014	download	101 food categories, with 101,000 images; 250 test images and 750 training images per class. The training images were not cleaned. All images were rescaled to have a maximum side length of 512 pixels.
Stanford cars	Jonathan Krause et al., 2013	download	16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, with each class divided roughly 50-50 between the two. Classes are typically at the level of Make, Model, Year.
Imagenette	Based on Deng et al., 2009	Full size, 320 px, 160 px	A subset of 10 easily classified classes from ImageNet: tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute.
Imagewoof	Based on Deng et al., 2009	Full size, 320 px, 160 px	A subset of 10 harder-to-classify classes from ImageNet (all dog breeds): Australian terrier, Border terrier, Samoyed, beagle, Shih-Tzu, English foxhound, Rhodesian ridgeback, dingo, golden retriever, Old English sheepdog.
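
The MNIST entry above mentions picking a single slice from the channels axis to get back a one-channel image. A minimal sketch of that step with NumPy follows. It assumes your image loader returns an array shaped (height, width, channels); the helper name `to_single_channel` and the synthetic test image are illustrative only and are not part of any dataset or library.

```python
import numpy as np


def to_single_channel(image: np.ndarray) -> np.ndarray:
    """Return one channel of an (H, W, C) image as an (H, W) array.

    Hypothetical helper: assumes the channels axis is last. An image
    that is already 2-D is returned unchanged.
    """
    if image.ndim == 2:
        return image
    return image[..., 0]  # slice index 0 along the channels axis


# Synthetic stand-in for a decoded 28x28 PNG whose three channels are identical.
gray = (np.arange(28 * 28) % 256).astype(np.uint8).reshape(28, 28)
rgb = np.stack([gray, gray, gray], axis=-1)  # shape (28, 28, 3)

single = to_single_channel(rgb)
print(rgb.shape, single.shape)
```

If your loader puts channels first, as many deep learning tensors do, slice the first axis instead (`image[0]`).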

Natural Language

Source	Citation	Download	Description
IMDb Large Movie Review Dataset	Andrew L. Maas et al., 2011	download	A dataset for binary sentiment classification containing 25,000 highly polarized movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Wikitext-103	Stephen Merity et al., 2016	download	A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Widely used for language modeling, including the pretrained models used in the fastai library and ULMFiT algorithm.
Wikitext-2	Stephen Merity et al., 2016	download	A subset of Wikitext-103; useful for testing language model training on smaller datasets.
WMT 2015 French/English parallel texts	Callison-Burch et al., 2009	download	French/English parallel texts for training translation models. Over 20 million sentences in French and English. Dataset created by Chris Callison-Burch, who crawled millions of web pages, used a set of simple heuristics to map French URLs onto English URLs, and assumed that the paired documents are translations of each other.
AG News	Xiang Zhang et al., 2015	download	Built from the 4 largest classes of AG’s corpus of news articles (496,835 categorized articles from more than 2,000 news sources), using only the title and description fields. Each class has 30,000 training samples and 1,900 testing samples.
Amazon reviews - Full	Xiang Zhang et al., 2015	download	34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This full dataset contains 600,000 training samples and 130,000 testing samples in each class.
Amazon reviews - Polarity	Xiang Zhang et al., 2015	download	34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This subset contains 1,800,000 training samples and 200,000 testing samples in each polarity sentiment.
DBPedia ontology	Xiang Zhang et al., 2015	download	40,000 training samples and 5,000 testing samples from each of 14 nonoverlapping classes from DBpedia 2014.
Sogou news	Xiang Zhang et al., 2015	download	2,909,551 news articles from the SogouCA and SogouCS news corpora, in 5 categories. The number of training samples selected for each class is 90,000 and testing 12,000. Note that the Chinese characters have been converted to Pinyin.
Yahoo! Answers	Xiang Zhang et al., 2015	download	The 10 largest main categories from the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset. Each class contains 140,000 training samples and 5,000 testing samples.
Yelp reviews - Full	Xiang Zhang et al., 2015	download	1,569,264 samples from the Yelp Dataset Challenge 2015. This full dataset has 130,000 training samples and 10,000 testing samples in each star.
Yelp reviews - Polarity	Xiang Zhang et al., 2015	download	1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity.
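
Several rows in this table give sample counts per class rather than totals. As a rough sketch, the totals follow from multiplying by the number of classes. The figures below are copied from the table rows that state their class count explicitly; the `SplitSpec` class and `SPLITS` dictionary are illustrative names only, not part of any dataset package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitSpec:
    """Per-class split sizes for a balanced classification dataset."""

    classes: int
    train_per_class: int
    test_per_class: int

    @property
    def train_total(self) -> int:
        return self.classes * self.train_per_class

    @property
    def test_total(self) -> int:
        return self.classes * self.test_per_class


# Class counts and per-class sizes as stated in the table above.
SPLITS = {
    "AG News": SplitSpec(classes=4, train_per_class=30_000, test_per_class=1_900),
    "DBPedia ontology": SplitSpec(classes=14, train_per_class=40_000, test_per_class=5_000),
    "Sogou news": SplitSpec(classes=5, train_per_class=90_000, test_per_class=12_000),
}

for name, spec in SPLITS.items():
    print(f"{name}: {spec.train_total:,} train / {spec.test_total:,} test")
```

Comparing these totals with the row counts of the files you download is a quick way to check that an archive extracted completely.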

Terence Cai

http://terence1023.github.io/2021/04/22/awesome-public-datasets/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Terence Cai when reposting.
