
Different ways to get the datasets you need for your data science tasks!

by Yasir Aslam

Resources for finding datasets that meet your needs

Looking through the list of articles I've posted on Medium, I found quite a few about obtaining datasets for data science tasks. Some of them focus on websites that publish good datasets, while others focus on how to create custom datasets. This article collects the techniques for obtaining datasets covered in those articles into a single compilation, with links back to the originals.

1. Advanced Google Search


Google search is by far the most popular way to look for datasets. But did you know that you can get more accurate results, faster, by customizing your search queries? In the article below, you'll find three ways to optimize your searches on the Internet.

The article "Use Google Search More Efficiently to Find Data" introduces three advanced search methods:

Advanced Google Search Techniques for Dataset Search

  • Specify the file type to search for using the search operator "filetype".
  • Specify the website to search using the search operator "site".
  • Specify different file formats from the Search Options page.

In addition to the above, other search operators are listed on the Google Search help page "Improving the accuracy of web searches". Search operators are also explained in detail in the blog post "40 useful commands for Google search – search operators that you should know in 2022 (useful for research, competitive analysis, SEO)" on the blog run by the hosting company Kinsta.
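As a rough illustration of how the "filetype" and "site" operators combine, the sketch below builds a query string and the matching search URL in Python. The helper names, the keywords, and the example domain are assumptions made for illustration; they do not come from the original article.

```python
from urllib.parse import urlencode


def build_dataset_query(keywords, filetype=None, site=None):
    """Combine keywords with the `filetype:` and `site:` search operators."""
    parts = [keywords]
    if filetype:
        parts.append(f"filetype:{filetype}")
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


def search_url(query):
    """Return a Google search URL with the query URL-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical search: CSV files about air quality on one domain.
query = build_dataset_query("air quality", filetype="csv", site="europa.eu")
print(query)
print(search_url(query))
```

The same query string can simply be typed into the Google search box; building it in code is mainly useful when you want to try many keyword and file-type combinations.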

・・・

2. Useful Sites for Finding Datasets for Data Analysis Tasks


Google search is great, but there are also specialized sites that offer good datasets. In the article below, I've listed five such sites, with videos detailing how to access them. Don't worry, I've left out common ones like the UCI Machine Learning Repository, Kaggle datasets, and Data.gov to highlight the lesser-known ones.

"Useful Sites for Finding Datasets for Data Analysis Tasks" lists the following five websites.

5 sites to help you get datasets

  • Google Dataset Search: A web page dedicated to dataset search, developed by Google. It builds a search index from dataset descriptions and displays the matching results.
  • OpenML: An open data science platform aimed at democratizing machine learning research. In addition to datasets, it also publishes machine learning models.
  • FiveThirtyEight: A web media outlet that advocates reporting based on data analysis. The datasets used to create its articles are also made public.
  • awesome public datasets: A list of public datasets published on GitHub.
  • BuzzFeed News US GitHub page: The media outlet publishes data and analysis related to its articles on GitHub. Among the publicly available datasets is "Contributions to US Presidential Elections".

・・・

3. 5 real-world datasets to sharpen your exploratory data analysis skills


If you want to jump right into the analysis without searching for datasets, this article will help you. It lists five datasets that are well suited to exploratory data analysis and visualization. You can analyze salary datasets, clinical trial reports, or even air traffic data. Fortunately, all of these datasets are available on Kaggle, so all you have to do is start a notebook and begin.

In "5 Real-World Datasets to Sharpen Your Exploratory Data Analysis Skills," the following real-world datasets are presented:

5 Real-World Datasets to Sharpen Your Exploratory Data Analysis Skills

  • Palmer Archipelago Penguin Dataset: A dataset of three species of penguins inhabiting the Palmer Archipelago, Antarctica.
  • COVID-19 clinical trial dataset: A dataset created by extracting COVID-19-related data from ClinicalTrials.gov, a website that catalogs clinical trials conducted around the world.
  • Forbes Highest-Paid Athletes Dataset: A dataset summarizing the rankings of the highest-paid athletes published by the US edition of Forbes since 1990.
  • EU Region IT Salary Survey: A dataset summarizing the results of a salary survey of IT specialists in Europe, mainly in Germany. Results for 2018-2020 are included.
  • US International Air Traffic Data: A US airport traffic dataset based on the US International Air Passenger and Cargo Statistical Report.

・・・

4. Create a custom image dataset


If you're into deep learning and want to work on a project with your own image dataset, this article lists five tools, most of them browser extensions, that make bulk image downloads easy. However, be careful not to download images in violation of copyright.

In "Creating Custom Image Datasets for Deep Learning Projects", five ways to download images from the Internet are presented:

5 Useful Tools to Download Images

  • Fatkun Batch Download Image : Chrome extension that lets you filter images to download by resolution or link
  • Imageye : Chrome extension that filters images based on pixel width and height
  • Download All Images : A Chrome extension that downloads all images on a web page and puts them in a ZIP file
  • ImageAssistant Batch Image Downloader : A Chrome extension that allows you to batch extract image URLs and filter them by extension and resolution size
  • Code published on GitHub: A way to download images using the code in Practical-Deep-Learning-for-Coders-2.0, published on GitHub. It requires installing the fastai library.

5. Extract data from HTML table

Datasets published on the Internet are sometimes provided as HTML tables. Such tables are often long, spanning the entire web page. The data in HTML tables can also be dynamic, that is, updated at regular intervals. As a result, copying and pasting an HTML table into an Excel sheet is not always convenient. Scraping is one option, but there is an easier way: Google Sheets has a handy function called IMPORTHTML, which imports data from tables and lists within HTML pages. The article below describes the end-to-end process for bringing tables (and lists) into Google Sheets.

See also the official Google help page for the IMPORTHTML function in Google Sheets. The article above, "How to easily import an HTML table into a Google Spreadsheet", also explains how to combine IMPORTHTML with the QUERY function to extract data from a specified location in a table.
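In Google Sheets, the call takes the form `=IMPORTHTML(url, "table", 1)`, where the last argument is the 1-based position of the table on the page. For comparison, the scraping route mentioned above can be sketched in Python with only the standard library. This is a minimal sketch, not the method from the linked article: the class and function names and the sample HTML are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row, per table."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_table(html, index=1):
    """Return table number `index` (1-based, like IMPORTHTML) as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up sample page content, used only to demonstrate the parser.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Count</th></tr>
  <tr><td>apples</td><td>3</td></tr>
  <tr><td>pears</td><td>5</td></tr>
</table>
"""

rows = extract_table(SAMPLE_HTML, index=1)
print(rows)
```

Unlike IMPORTHTML, a script like this does not refresh on its own, so it has to be re-run whenever the source table changes.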

・・・

6. Extract data from PDF

Extracting tabular data from a PDF is painstaking work. A bigger problem is that much open data is published as PDF files. This open data is valuable for analysis, but getting at the data inside the PDF becomes a bottleneck. This article describes Camelot, an open-source Python library that makes it easy to extract tables from PDFs. I'll also cover a web interface called Excalibur for those who want the functionality of the library without writing code.

Camelot has a page on the Python Package Index (PyPI), the repository of Python packages. Table extraction with Camelot is also explained in the article "Efforts to digitize personnel changes – PDF table data extraction using Camelot" on the Sansan Builders Blog, the engineering blog of Sansan Corporation. Excalibur is listed on PyPI as well. In addition, the OPTiM TECH BLOG, the engineering blog of OPTiM Corporation, whose core business is IT solutions, shows an example of using Excalibur in the article "Try extracting PDF table data with Excalibur". The author of this article, Parul Pandey, has also posted a video demonstrating Excalibur on YouTube.

7. Extracting information from XML files


We have learned to work with data in the form of HTML tables and PDF files. XML files are another format that needs to be processed before the data can be used. XML stands for Extensible Markup Language. As the name suggests, it is a markup language that defines a set of rules for encoding documents in a format that is both machine-readable and human-readable. In this article, I'll walk you through the process of converting XML data into a CSV file and then loading it into pandas for further analysis.

Extracting information from an XML file into a pandas DataFrame uses the Python module xml.etree.ElementTree for XML parsing and data extraction. How to use this module is also explained in the blog article "[Python Introduction] Let's try to analyze XML using ElementTree!".
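The XML-to-CSV step can be sketched as follows, using only xml.etree.ElementTree and the standard csv module. This is an illustrative sketch rather than the code from the original article: the sample XML, its element names (`catalog`, `book`, `title`, `price`), and the helper functions are all assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample document; real files will have their own element names.
SAMPLE_XML = """
<catalog>
  <book id="b1"><title>First Title</title><price>10.50</price></book>
  <book id="b2"><title>Second Title</title><price>7.25</price></book>
</catalog>
"""

FIELDS = ["id", "title", "price"]


def xml_to_rows(xml_text):
    """Parse the XML and return one dict per <book> element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for book in root.findall("book"):
        rows.append({
            "id": book.get("id"),              # attribute value
            "title": book.findtext("title"),   # child element text
            "price": book.findtext("price"),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Once the CSV text is written to a file, it can be loaded with `pandas.read_csv` for further analysis.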

・・・

8. Read data from clipboard to DataFrame in pandas

This article describes a very useful pandas function, read_clipboard(), which creates a DataFrame from data copied to the clipboard. It reads the text from the clipboard and passes it to read_csv(), which returns a parsed DataFrame object.

See also the pandas API reference page for read_clipboard .

Conclusion

This article has presented several techniques for obtaining datasets. Some of them should help you find the dataset you need for your next project. You can also create your own datasets or perform meaningful analysis on downloaded data. The possibilities of datasets are endless!
