Awesome Public Datasets
MachineLearning
- All-Age-Faces Dataset - Contains 13,322 Asian face images distributed […]
- Audi Autonomous Driving Dataset - We have published the Audi Autonomous […]
- Context-aware data sets from five domains
- Delve Datasets for classification and regression
- Discogs Monthly Data
- Free Music Archive
- IMDb Database
- Iranis - A Large-scale Dataset of Farsi/Arabic License Plate Characters
- Keel Repository for classification, regression and time series
- Labeled Faces in the Wild (LFW)
- Lending Club Loan Data
- Machine Learning Data Set Repository [fixme]
- Million Song Dataset
- More Song Datasets
- MovieLens Data Sets
- New Yorker caption contest ratings
- RDataMining - “R and Data Mining” ebook data
- Registered Meteorites on Earth [fixme]
- Restaurants Health Score Data in San Francisco
- TikTok Dataset - More than 300 dance videos that capture a single person […]
- UCI Machine Learning Repository
- Yahoo! Ratings and Classification Data
- YouTube-BoundingBoxes
- YouTube-8M
- eBay Online Auctions (2012)
Image classification
Source | Citation | Download | Description |
---|---|---|---|
MNIST | LeCun et al., 1998a | download | Classic dataset of small (28x28) handwritten grayscale digits, developed in the 1990s for testing the most sophisticated models of the day; today, often used as a basic “hello world” for introducing deep learning. This fast.ai datasets version uses a standard PNG format instead of the special binary format of the original, so you can use the regular data pipelines in most libraries; if you want to use just a single input channel like the original, simply pick a single slice from the channels axis. |
CIFAR10 | Krizhevsky, 2009 | download | 60,000 32x32 colour images in 10 classes, with 6,000 images per class (50,000 training images and 10,000 test images). Very widely used today for testing the performance of new algorithms. This fast.ai datasets version uses a standard PNG format instead of the platform-specific binary formats of the original, so you can use the regular data pipelines in most libraries. |
CIFAR100 | Krizhevsky, 2009 | download | This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). |
Caltech-UCSD Birds-200-2011 | Lin et al. 2015 | download | An image dataset with photos of 200 bird species (mostly North American); it can also be used for localization. Number of categories: 200; Number of images: 11,788; Annotations per image: 15 Part Locations, 312 Binary Attributes, 1 Bounding Box |
Caltech 101 | L. Fei-Fei et al., 2004 | download | Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. The size of each image is roughly 300 x 200 pixels. Can also be used for localization. |
Oxford-IIIT Pet | O. M. Parkhi et al., 2012 | download | A 37-category pet dataset with roughly 200 images for each class. The images have large variations in scale, pose and lighting. Can also be used for localization. |
Oxford 102 Flowers | Nilsback, M-E. and Zisserman, A., 2008 | download | A dataset of 102 flower categories commonly occurring in the United Kingdom. Each class consists of 40 to 258 images. The images have large scale, pose and light variations. |
Food-101 | Bossard, Lukas et al., 2014 | download | 101 food categories, with 101,000 images; 250 test images and 750 training images per class. The training images were not cleaned. All images were rescaled to have a maximum side length of 512 pixels. |
Stanford cars | Jonathan Krause et al., 2013 | download | 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, with each class split roughly 50-50. Classes are typically at the level of Make, Model, Year. |
Imagenette | Based on Deng et al., 2009 | Full size, 320 px, 160 px | A subset of 10 easily classified classes from ImageNet: tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute |
Imagewoof | Based on Deng et al., 2009 | Full size, 320 px, 160 px | A subset of 10 harder-to-classify classes from ImageNet (all dog breeds): Australian terrier, Border terrier, Samoyed, beagle, Shih-Tzu, English foxhound, Rhodesian ridgeback, dingo, golden retriever, Old English sheepdog |
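
The MNIST row above mentions picking a single slice from the channels axis to recover a one-channel input. A minimal sketch of that step in plain Python follows. It assumes the PNG has already been decoded into a height x width x channels structure; the `single_channel` helper and the toy `rgb` image are hypothetical names for illustration, not part of fast.ai or any other library.

```python
def single_channel(image, channel=0):
    """Return a height x width grid holding only the chosen channel.

    `image` is assumed to be a nested sequence indexed as
    image[row][column][channel], e.g. a decoded RGB PNG.
    """
    return [[pixel[channel] for pixel in row] for row in image]


# Toy 2x2 stand-in for a decoded image. A grayscale digit stored as RGB
# usually repeats the same value in every channel (assumption).
rgb = [
    [(0, 0, 0), (255, 255, 255)],
    [(128, 128, 128), (64, 64, 64)],
]

gray = single_channel(rgb)
print(gray)  # [[0, 255], [128, 64]]

# With a NumPy array of shape (H, W, C), the same slice is arr[:, :, 0].
```

If the three channels really are identical, any channel index gives the same result, so the choice of `channel=0` is arbitrary.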
NaturalLanguage
Source | Citation | Download | Description |
---|---|---|---|
IMDb Large Movie Review Dataset | Andrew L. Maas et al., 2011 | download | A dataset for binary sentiment classification containing 25,000 highly polarized movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. |
Wikitext-103 | Stephen Merity et al., 2016 | download | A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Widely used for language modeling, including the pretrained models used in the fastai library and ULMFiT algorithm. |
Wikitext-2 | Stephen Merity et al., 2016 | download | A subset of Wikitext-103; useful for testing language model training on smaller datasets. |
WMT 2015 French/English parallel texts | Callison-Burch et al., 2009 | download | French/English parallel texts for training translation models. Over 20 million sentences in French and English. Dataset created by Chris Callison-Burch, who crawled millions of web pages and then used a set of simple heuristics to map French URLs onto English URLs, assuming that the paired documents are translations of each other. |
AG News | Xiang Zhang et al., 2015 | download | 496,835 categorized news articles from more than 2,000 news sources, drawn from the 4 largest classes of AG’s corpus of news articles and using only the title and description fields. Each class has 30,000 training samples and 1,900 testing samples. |
Amazon reviews - Full | Xiang Zhang et al., 2015 | download | 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This full dataset contains 600,000 training samples and 130,000 testing samples in each class. |
Amazon reviews - Polarity | Xiang Zhang et al., 2015 | download | 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This subset contains 1,800,000 training samples and 200,000 testing samples for each sentiment polarity. |
DBpedia ontology | Xiang Zhang et al., 2015 | download | 40,000 training samples and 5,000 testing samples from each of 14 non-overlapping classes of DBpedia 2014. |
Sogou news | Xiang Zhang et al., 2015 | download | 2,909,551 news articles from the SogouCA and SogouCS news corpora, in 5 categories. Each class has 90,000 training samples and 12,000 testing samples. Note that the Chinese characters have been converted to Pinyin. |
Yahoo! Answers | Xiang Zhang et al., 2015 | download | The 10 largest main categories from the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset. Each class contains 140,000 training samples and 5,000 testing samples. |
Yelp reviews - Full | Xiang Zhang et al., 2015 | download | 1,569,264 samples from the Yelp Dataset Challenge 2015. This full dataset has 130,000 training samples and 10,000 testing samples for each star rating. |
Yelp reviews - Polarity | Xiang Zhang et al., 2015 | download | 1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 testing samples for each polarity. |
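
The WMT row above describes pairing documents by rewriting French URLs into English ones. The actual heuristics are not listed here, so the sketch below only illustrates the general idea. The rewrite rules, the `english_candidate` and `pair_urls` helpers, and the `example.org` URLs are all hypothetical and should not be read as the rules used to build the dataset.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rewrite rules (assumptions for illustration only).
PATH_RULES = [
    (re.compile(r"(^|/)fr(/|$)"), r"\1en\2"),  # an "fr" path segment
    (re.compile(r"_fr(\.\w+)$"), r"_en\1"),    # "_fr" before the file extension
]
QUERY_RULES = [
    (re.compile(r"\blang=fr\b"), "lang=en"),   # a "lang=fr" query parameter
]


def english_candidate(french_url):
    """Return the guessed English URL, or None if no rule changes the URL."""
    parts = urlsplit(french_url)
    path, query = parts.path, parts.query
    for pattern, replacement in PATH_RULES:
        path = pattern.sub(replacement, path)
    for pattern, replacement in QUERY_RULES:
        query = pattern.sub(replacement, query)
    candidate = urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment)
    )
    return candidate if candidate != french_url else None


def pair_urls(crawled):
    """Pair each URL with its guessed English counterpart if both were crawled."""
    crawled = set(crawled)
    pairs = []
    for url in sorted(crawled):
        candidate = english_candidate(url)
        if candidate in crawled:
            pairs.append((url, candidate))
    return pairs


crawled = [
    "https://example.org/fr/about.html",
    "https://example.org/en/about.html",
    "https://example.org/fr/only-french.html",
]
print(pair_urls(crawled))
```

A pairing like this only asserts that two URLs look related. Treating the paired pages as translations of each other is the additional assumption the dataset description mentions.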
- Automatic Keyphrase Extraction
- The Big Bad NLP Database
- Blizzard Challenge Speech - The speech + text data comes from […]
- Blogger Corpus
- CLiPS Stylometry Investigation Corpus [fixme]
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia - 4.58M things with 583M facts
- Dirty Words - With millions of images in our library and billions of […]
- Flickr Personal Taxonomies
- Freebase of people, places, and things [fixme]
- German Political Speeches Corpus - Collection of political speeches from […]
- Google Books Ngrams (2.2TB)
- Google MC-AFP - Generated based on the publicly available Gigaword dataset […]
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List [fixme]
- Hansards text chunks of Canadian Parliament
- LJ Speech - Speech dataset consisting of 13,100 short audio clips of a […]
- M-AILabs Speech - The M-AILABS Speech Dataset is the first large dataset […] [fixme]
- Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
- Machine Comprehension Test (MCTest) of text from Microsoft Research
- Machine Translation of European languages
- Making Sense of Microposts 2013 - Concept Extraction [fixme]
- Making Sense of Microposts 2016 - Named Entity rEcognition and Linking
- Multi-Domain Sentiment Dataset (version 2.0)
- Noisy speech database for training speech enhancement algorithms and TTS […] [fixme]
- Open Multilingual Wordnet
- POS/NER/Chunk annotated data
- Personae Corpus [fixme]
- SMS Spam Collection in English
- SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
- Stanford Question Answering Dataset (SQuAD)
- USENET postings corpus of 2005~2011
- Universal Dependencies
- Webhose - News/Blogs in multiple languages
- Wikidata - Wikipedia databases
- Wikipedia Links data - 40 Million Entities in Context
- WordNet databases and tools
- WorldTree Corpus of Explanation Graphs for Elementary Science Questions - […]