List of datasets for machine-learning research

From Wikipedia, the free encyclopedia

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.[2][3][4]

Many organizations, including governments, publish and share their datasets. Datasets are classified by license as either open data or non-open data.
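
The open versus non-open split can be pictured as a lookup on each dataset's license. The following is a minimal sketch, not a description of how any particular portal works: the `Dataset` record, the example catalog entries, and the short `OPEN_LICENSES` set are all assumptions made for illustration, and a real classification would rely on a fuller, maintained list of licenses.

```python
from dataclasses import dataclass

# Assumption: a small illustrative subset of licenses commonly treated as
# open, written as SPDX-style identifiers. This is not an exhaustive list.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "ODbL-1.0", "PDDL-1.0"}


@dataclass(frozen=True)
class Dataset:
    name: str     # hypothetical dataset name
    license: str  # license identifier published with the dataset


def classify(dataset: Dataset) -> str:
    """Return "open" if the license is in OPEN_LICENSES, else "non-open"."""
    return "open" if dataset.license.strip() in OPEN_LICENSES else "non-open"


def group_by_openness(datasets):
    """Group dataset names under the keys "open" and "non-open"."""
    groups = {"open": [], "non-open": []}
    for ds in datasets:
        groups[classify(ds)].append(ds.name)
    return groups


# Hypothetical catalog entries, invented for this example.
catalog = [
    Dataset("city-air-quality", "CC-BY-4.0"),
    Dataset("vendor-speech-corpus", "Proprietary"),
    Dataset("road-network", "ODbL-1.0"),
]
print(group_by_openness(catalog))
```

Anything whose license is not recognized falls into the non-open group, which is a deliberately conservative default for this sketch.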

Datasets from various governmental bodies are presented in List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as an open API. They are organized into types and subtypes.

List of sorting used for datasets

More information Type, Subtypes ...

Data portals are classified by their type of license. Portals that publish data under open-source licenses are known as open data portals and are used by many government organizations and academic institutions.

List of open data portals

More information Portal-name, License ...

List of portals suitable for multiple types of applications

Some data portals list a wide variety of dataset subtypes covering many machine learning applications.

Academic Torrents https://academictorrents.com
Amazon Datasets https://registry.opendata.aws/
Awesome Public Datasets Collection https://github.com/awesomedata/awesome-public-datasets
data.world https://data.world/datasets/machine-learning
Datahub – Core Datasets https://datahub.io/docs/core-data
DataONE https://www.dataone.org/
DataPortals https://dataportals.org/
Datasetlist.com https://www.datasetlist.com
Global Open Data Index – Open Knowledge Foundation https://okfn.org/ Archived 25 May 2020 at the Wayback Machine
Google Dataset Search https://datasetsearch.research.google.com/
Hugging Face https://huggingface.co/docs/datasets/
IBM's Data Asset Exchange https://developer.ibm.com/exchanges/data/
Jupyter – Tutorial Data https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html
Kaggle https://www.kaggle.com/datasets
Machine learning datasets https://macgence.com/data-sets-and-cataloges/
Major Smart Cities with Open Data https://rlist.io/l/major-smart-cities-with-open-data-portals
Microsoft Datasets https://msropendata.com/datasets
Open Data Inception https://opendatainception.io/
Opendatasoft https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en
OpenDOAR https://v2.sherpa.ac.uk/opendoar/
OpenML https://www.openml.org/search?type=data
Papers with Code https://paperswithcode.com/datasets
Penn Machine Learning Benchmarks https://github.com/EpistasisLab/pmlb/tree/master/datasets
Public APIs https://github.com/public-apis/public-apis
Registry of Open Access Repositories http://roar.eprints.org/ 
REgistry of REsearch Data REpositories https://www.re3data.org/ 
UCI Machine Learning Repository http://mlr.cs.umass.edu/ml/ Archived 26 June 2020 at the Wayback Machine
Speech Dataset https://www.shaip.com/offerings/speech-data-catalog/
Visual Data Discovery https://visualdata.io/discovery

List of portals suitable for a specific subtype of applications

Data portals suited to a specific subtype of machine learning application are listed in the following sections.

Image data

Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

More information Dataset Name, Brief description ...

News articles

More information Dataset Name, Brief description ...

Messages

More information Dataset Name, Brief description ...

Twitter and tweets

More information Dataset Name, Brief description ...

Dialogues

More information Dataset Name, Brief description ...
More information Dataset Name, Brief description ...

Other text

More information Dataset Name, Brief description ...

Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.

Speech

More information Dataset Name, Brief description ...

Music

More information Dataset Name, Brief description ...

Other sounds

More information Dataset Name, Brief description ...

Signal data

These datasets contain electrical signal information that requires signal processing before further analysis.

Electrical

More information Dataset Name, Brief description ...

Motion-tracking

More information Dataset Name, Brief description ...

Other signals

More information Dataset Name, Brief description ...

Physical data

Datasets from physical systems.

High-energy physics

More information Dataset Name, Brief description ...

Systems

More information Dataset Name, Brief description ...

Astronomy

More information Dataset Name, Brief description ...

Earth science

More information Dataset Name, Brief description ...

Other physical

More information Dataset Name, Brief description ...

Biological data

Datasets from biological systems.

Human

More information Dataset Name, Brief description ...

Animal

More information Dataset Name, Brief description ...

Fungi

More information Dataset Name, Brief description ...

Plant

More information Dataset Name, Brief description ...

Microbe

More information Dataset Name, Brief description ...

Drug discovery

More information Dataset Name, Brief description ...

Anomaly data

More information Dataset Name, Brief description ...

Question answering data

This section includes datasets that deal with structured data.

More information Dataset Name, Brief description ...

Dialog or instruction prompted data

This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.

More information Dataset Name, Brief description ...

Cybersecurity

More information Dataset Name, Brief description ...

Climate and sustainability

More information Dataset Name, Brief description ...

Code data

More information Dataset Name, Brief description ...

Multivariate data

Financial

More information Dataset Name, Brief description ...

Weather

More information Dataset Name, Brief description ...

Census

More information Dataset Name, Brief description ...

Transit

More information Dataset Name, Brief description ...

Internet

More information Dataset Name, Brief description ...

Games

More information Dataset Name, Brief description ...

Other multivariate

More information Dataset Name, Brief description ...

Curated repositories of datasets

As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research.

  • OpenML:[494] Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
  • PMLB:[495] A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
  • Metatext NLP: A community-maintained web repository (https://metatext.io/datasets) containing nearly 1,000 benchmark datasets and growing. Covers many tasks, from classification to question answering, and languages including English, Portuguese, and Arabic.
  • Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.[496][497]
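
The value of a standardized format is that one evaluation loop can run every algorithm on every dataset. The sketch below illustrates that idea under stated assumptions: the function names, the two simple classifiers, and the tiny in-memory `toy_datasets` entry are invented for this example, and it does not use the OpenML or PMLB APIs. In practice the data would be fetched through one of those repositories' Python interfaces and passed to the same kind of loop.

```python
from collections import Counter


def majority_baseline(train_X, train_y, test_X):
    """Predict the most common training label for every test row."""
    label = Counter(train_y).most_common(1)[0][0]
    return [label for _ in test_X]


def one_nearest_neighbor(train_X, train_y, test_X):
    """Predict the label of the closest training row (squared Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    preds = []
    for row in test_X:
        idx = min(range(len(train_X)), key=lambda i: dist(train_X[i], row))
        preds.append(train_y[idx])
    return preds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def benchmark(datasets, algorithms):
    """Return {dataset_name: {algorithm_name: accuracy}}.

    Every dataset is assumed to share one layout:
    (train_X, train_y, test_X, test_y).
    """
    results = {}
    for name, (train_X, train_y, test_X, test_y) in datasets.items():
        results[name] = {
            algo_name: accuracy(test_y, algo(train_X, train_y, test_X))
            for algo_name, algo in algorithms.items()
        }
    return results


# Toy placeholder data standing in for datasets from a curated repository.
toy_datasets = {
    "two_clusters": (
        [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]],
        ["a", "a", "b", "b", "b"],
        [[0.05, 0.1], [1.1, 0.9]],
        ["a", "b"],
    ),
}
algorithms = {"majority": majority_baseline, "1-NN": one_nearest_neighbor}
scores = benchmark(toy_datasets, algorithms)
print(scores)
```

Because every entry follows the same layout, adding a dataset or an algorithm only means adding one dictionary entry; the loop itself does not change.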

See also

References
