What datasets are required for machine learning? Precautions for use and 20 recommended datasets

October 13, 2023


What is a dataset in machine learning?
There are 3 types of datasets
How to obtain datasets for use in machine learning
Can I make it myself? How to create a dataset for machine learning
Points to note when using datasets in machine learning
20 recommended datasets for machine learning
summary

Machine learning depends on datasets, which are collections of data. To improve the accuracy of machine learning, you need to select and handle data that is abundant, high in quality, and suited to the purpose. If you do not have enough data or cannot prepare a dataset yourself, it is convenient to build one from open data, which can be downloaded for free on the web. This article explains what datasets mean in machine learning, the precautions to take when using them, and 20 recommended sites that provide datasets useful for dataset construction.

What is a dataset in machine learning?

Machine learning is one of the elemental technologies of AI; it implements in machines a mechanism equivalent to human “learning.” For this to work, data must be organized in a form that computers can understand and learn from, which is why machine learning requires datasets. First, we will explain what the datasets used for machine learning are.

A dataset is a collection of data used for machine learning

A dataset is a collection of data that is processed by a program for machine learning. Machine learning generally uses three types of datasets: a training set, a validation set, and a test set. The training set is used to update the parameters of the classifier (machine learning model), the validation set is used to check the quality of manually set parameters (hyperparameters), and the test set is used to check generalization performance after learning.
When performing machine learning, you need to use a dataset appropriate for each purpose. Basic datasets are available for free on the web as open data, so they can be adjusted and used according to the purpose.

Importance of datasets in machine learning

Machine learning handles large amounts of data, such as image data, video data, and text data, and the accuracy of the results varies with the quality and quantity of that data. Because data quality matters so much, data cleansing is necessary: finding duplicates, errors, and variations in notation in the data, and deleting or correcting them so that the data is easier to process.
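
As a rough illustration of data cleansing, the following Python sketch unifies variations in notation, discards records with missing or clearly invalid values, and removes duplicates. The record layout (name, age), the valid age range, and the function names are assumptions made up for this example, not part of any particular tool.

```python
import unicodedata


def normalize(text):
    """Unify notation: full-width/half-width forms, letter case, and spacing."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split()).lower()


def cleanse(records):
    """Drop records with missing or invalid values, then remove duplicates."""
    seen = set()
    cleaned = []
    for name, age in records:
        if name is None or not name.strip():
            continue  # missing value
        if not isinstance(age, int) or not 0 <= age <= 120:
            continue  # obvious input error (hypothetical valid range)
        key = (normalize(name), age)
        if key in seen:
            continue  # duplicate once notation is unified
        seen.add(key)
        cleaned.append(key)
    return cleaned


raw = [
    ("Tokyo  Taro", 34),
    ("tokyo taro", 34),          # same record, different notation
    ("ＴＯＫＹＯ ＴＡＲＯ", 34),  # same record in full-width characters
    ("Osaka Hanako", -5),        # invalid age
    ("", 20),                    # missing name
    ("Osaka Hanako", 28),
]
cleaned = cleanse(raw)
print(cleaned)
```

In real projects the rules for what counts as an error or a duplicate depend on the data, so the checks above should be treated only as a starting point.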

There are 3 types of datasets

Datasets are considered the most important element in machine learning. Generally, there are three types of datasets:

training set

The training set is the first dataset used and the largest. It is given to a machine learning algorithm to train the model under development.

validation set

Validation sets are used to tune the classifier’s hyperparameters, which are the parameters that control the behavior of the machine learning algorithm. After training models with different hyperparameter settings on the training set, we evaluate each of them on the validation set and select the one with the best performance.
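
The idea of selecting a hyperparameter with a validation set can be sketched as follows. This is a minimal example that assumes a tiny k-nearest-neighbor classifier written in plain Python and made-up one-dimensional toy data; the hyperparameter being tuned is k, the number of neighbors.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(train, data, k):
    """Fraction of points in data whose label is predicted correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)


# Hypothetical toy data: (feature, label). Two training points are noisy on purpose.
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.6, "b"),
         (5.4, "a"), (6.0, "b"), (6.5, "b"), (7.0, "b")]
validation = [(1.2, "a"), (2.4, "a"), (5.6, "b"), (6.8, "b")]

# Try each candidate value of k and keep the one that scores best on the validation set.
candidates = [1, 3, 5]
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])
print(scores, best_k)
```

The test set does not appear here at all: it is kept aside until the hyperparameter has been fixed.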

test set

A test set is a dataset used to check the accuracy of the finished model. It is typically used only at the final stage, and only for performance testing, so that it gives an unbiased estimate of generalization performance.
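
The three types of datasets are usually produced by splitting one collection of samples. The sketch below shows one way to do this using only the Python standard library; the 70/15/15 ratio, the fixed random seed, and the function name are assumptions chosen for illustration, not fixed rules.

```python
import random


def split_dataset(samples, train_ratio=0.7, val_ratio=0.15, seed=0):
    """Shuffle samples and split them into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that remains
    return train, val, test


samples = list(range(100))  # stand-in for 100 real samples
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

Shuffling before splitting avoids putting, for example, only the oldest records in the training set. Each sample ends up in exactly one of the three sets.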

How to obtain datasets for use in machine learning

Datasets can also be built from open data published on the Internet. Sources run by domestic and foreign government agencies include “DATA GO JP,” an open data catalog operated by the Administrative Management Bureau of the Ministry of Internal Affairs and Communications, and “Data.gov,” which provides data published by American government agencies on government budgets, weather, economic indicators, and more. Public data is also available on the web from the National Institute of Informatics, the University of California, Harvard University, Kaggle (a community site where people involved in machine learning and data science gather), and Link Data (a site that aims to process and share open data).

Can I make it myself? How to create a dataset for machine learning

You can create your own dataset, but you need to prepare enough data and enter the data needed for the analysis. If you wish to perform the analysis in-house, first collect the data from Excel files, experiment notebooks, and so on, and then organize it so that it can be analyzed easily.
Organizing a dataset as a csv file strips out unnecessary information, which makes the data easier to check and modify and makes analysis smoother. When preparing a dataset in Excel, save it as a csv file instead of an xlsx file. Keep the layout simple: arrange samples vertically (one per row) and variables (features) horizontally (one per column). Make sure every sample has a different sample name and every feature has a different feature name. Excel has a cell merging function, but do not use it, because merged cells make the data impossible to read when combining datasets.
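
The layout rules above (one sample per row, one feature per column, unique sample and feature names) can be checked mechanically. The following sketch uses Python's standard csv module; the column names, the sample values, and the function name are hypothetical, and the check assumes that the first column holds the sample names.

```python
import csv
import io


def check_table(csv_text):
    """Return a list of problems found in a samples-by-features csv table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    problems = []
    feature_names = header[1:]  # first column is assumed to hold sample names
    if len(set(feature_names)) != len(feature_names):
        problems.append("duplicate feature names")
    sample_names = [row[0] for row in body]
    if len(set(sample_names)) != len(sample_names):
        problems.append("duplicate sample names")
    if any(len(row) != len(header) for row in body):
        problems.append("rows with a different number of columns")
    return problems


good = "sample,temperature,pressure\ns1,20.5,1.0\ns2,21.0,1.1\n"
bad = "sample,temperature,temperature\ns1,20.5,1.0\ns1,21.0\n"
print(check_table(good))
print(check_table(bad))
```

A table that passes these checks can be combined with other tables by sample name or feature name without ambiguity.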

Points to note when using datasets in machine learning

Improving the accuracy of machine learning requires selecting an appropriate dataset and verifying it repeatedly. This section introduces the points to keep in mind when using datasets for machine learning.

Choose the right dataset for your company

There are various types of datasets, so you need to choose one that suits your company’s purpose and use. Unless the dataset matches your purpose and usage, you will not be able to realize the machine learning algorithm you envisioned. Additionally, to improve the performance of machine learning, it is important to choose data that is as typical as possible of the cases the model will handle, rather than data skewed toward unusual examples.

Eliminate unused data

When choosing a dataset, humans must decide which data to actually use. Data that is difficult to analyze can reduce the accuracy of the system at the validation stage. Because including unnecessary data can lower accuracy, take care to remove unused data each time.

Verify and improve even after completion

Just because a dataset is complete doesn’t mean it will always be in its best condition. Rather than leaving it as it is, verify and improve it regularly by finding and fixing problems while actually using it.

Be careful about copyright

Since machine learning handles a lot of data such as image data, video data, and text data, it is necessary to keep various rights in mind when handling it. When using data for commercial purposes, please be careful of copyright.

20 recommended datasets for machine learning

Some governments, websites, etc. provide datasets that are useful for machine learning. Creating datasets in-house requires labor and knowledge, so it is convenient to use open data as much as possible. We will introduce sites that provide datasets recommended for machine learning, along with links.

5 recommended comprehensive data sets

Comprehensive datasets are provided by governments, websites, etc. First, we will introduce five recommended sites for comprehensive data sets.

DATA GO JP (https://www.data.go.jp)
An open data catalog site published by the Japanese government to provide information on, and cross-sectional search of, public data that can be reused. It publishes data in machine-readable formats under usage rules that allow secondary use, including for commercial purposes.
National Informatics Research Data Repository (https://www.nii.ac.jp/dsc/idr/datalist.html)
A dataset sharing project operated by the Dataset Sharing Research and Development Center (DSC) of the National Institute of Informatics (NII). It provides researchers with data from private companies, universities, and other researchers.
Link Data (http://linkdata.org/home)
A site that supports conversion and publication of table data. Popular datasets are displayed on the top screen, and the datasets are arranged in an easy-to-read manner.
Kaggle (https://www.kaggle.com)
Kaggle is a platform for competing on predictive models and analysis. As it is an overseas site, all of the text is in English, but you can download various datasets for free.
Harvard Dataverse (https://dataverse.harvard.edu)
A dataset repository published by Harvard University, a prestigious university in the United States. This is also an overseas site, so all of the text is in English, but there are nearly 500 datasets that can be used for machine learning and other purposes.

5 recommended image datasets

Many of these are overseas sites, but freely available image datasets exist, so you may want to use them depending on your purpose. Here are five recommended sites where you can use image datasets.

MegaFace (http://megaface.cs.washington.edu)
Used in the public face recognition algorithm competition held at the University of Washington. As this is an overseas site, all information is in English, but it publishes a large-scale face recognition dataset mixed with noise data.
Deep Fashion (http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)
DeepFashion is a publicly available fashion image dataset consisting of over 800,000 images in 50 categories.
Google Open Image V4 (Open Images V6 – Description)
Google Open Image V4 is a dataset published by Google of images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships. It contains approximately 9 million images.
MNIST(http://yann.lecun.com/exdb/mnist/)
MNIST is a dataset of handwritten digit images. It is also said to be a dataset mainly for machine learning beginners.
CIFAR-10/CIFAR-100(http://www.cs.toronto.edu/~kriz/cifar.html)
CIFAR-10 consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class, divided into 50,000 training images and 10,000 test images. CIFAR-100 provides 100 classes of 600 images each, with 500 training images and 100 test images per class.

5 recommended video datasets

As with image datasets, many of the sites are overseas, but some video datasets are available for free. Here are five recommended sites that publish video datasets.

YouTube-8M Dataset(https://research.google.com/youtube8m/)
A dataset published by Google’s research team. It contains 8 million YouTube videos tagged with 4,800 Knowledge Graph entities.
Kinetics(https://deepmind.com/research/open-source/kinetics)
Kinetics is a video dataset published by DeepMind. It contains approximately 650,000 videos labeled with human-object interactions, such as playing musical instruments, and actions such as shaking hands.
Moments in Time Dataset(http://moments.csail.mit.edu)
Moments in Time Dataset is a joint research project between MIT and IBM. It provides a video dataset in which an action label is assigned to each 3-second video.
Atomic Visual Actions (AVA)(https://research.google.com/ava/)
Atomic Visual Actions (AVA) is a dataset for recognizing human movements, published by Google, a major overseas company. Approximately 57,000 videos have been assigned 80 types of labels, such as walking and flying motions.
BDD100K: A Large-scale Diverse Driving Video Database(https://bair.berkeley.edu/blog/2018/05/30/bdd/)
BDD100K: A Large-scale Diverse Driving Video Database is a driving video dataset released by the Berkeley AI Research Lab (BAIR). The site has a dataset of 10-second videos labeled with bounding boxes of road objects, drivable areas, lane markings, etc., which can be downloaded for free.

summary

In machine learning, datasets, which are collections of data, are essential. The accuracy of machine learning varies with the quality and quantity of data, so it is important to proceed step by step when preparing, processing, and handling data. It is also important to understand what data is available and what data is needed.
People with the knowledge and skills to handle datasets are therefore essential for machine learning. However, although the number of companies seeking data analysis and AI-related skills is increasing, there is a shortage of people with these specialized skills, and many companies find it difficult to secure people with programming and data analysis knowledge.
