Resources for finding datasets that meet your needs
Looking through the articles I’ve published on Medium, I found quite a few about obtaining datasets for data science tasks. Some of them focus on websites that publish good datasets, while others explain how to create custom datasets. This article collects those techniques in one place; follow the links to the original articles for the details.
1. Advanced Google Search
Google Search is by far the most popular way to look for datasets. But did you know that customizing your search queries gets you more accurate results, faster? In the article below, you’ll find three ways to optimize your searches on the Internet.
The article “Use Google Search More Efficiently to Find Data” introduces three advanced search methods:
Advanced Google Search Techniques for Dataset Search
In addition to the above, more search operators are listed on the Google Search help page “Improving the accuracy of web searches”. Search operators are also explained in detail in the blog post “40 useful commands for Google search – search operators that you should know in 2022 (useful for research, competitive analysis, SEO)” on the blog run by the web hosting company Kinsta.
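As a rough illustration of what a search operator does, the sketch below builds a query that combines a topic with the `site:` and `filetype:` operators and turns it into a search URL. The operators chosen here are only examples and are not necessarily the three methods the linked article covers; the helper names, the topic, and the domain `example.org` are all hypothetical.

```python
from urllib.parse import urlencode


def build_dataset_query(topic, site=None, filetype=None):
    """Combine a topic with optional site: and filetype: operators."""
    parts = [topic]
    if site:
        parts.append(f"site:{site}")          # restrict results to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict results to one file type
    return " ".join(parts)


def google_search_url(query):
    """Return a Google search URL for the query (URL-encoded)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical example: CSV files about air quality on example.org
query = build_dataset_query("air quality", site="example.org", filetype="csv")
print(query)
print(google_search_url(query))
```

The same query string can simply be typed into the search box; the URL helper is only there to show how the operators travel as part of the `q` parameter.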
・・・
2. Useful Sites for Finding Datasets for Data Analysis Tasks
Google Search is great, but there are also specialized sites that host good datasets. In the article below, I’ve listed five such sites, with videos detailing how to access them. Don’t worry, I’ve left out common ones like the UCI Machine Learning Repository, Kaggle datasets, and Data.gov to highlight the lesser-known ones.
The article “Useful Sites for Finding Datasets for Data Analysis Tasks” lists the following five websites:
5 sites to help you get datasets
・・・
3. 5 real-world datasets to sharpen your exploratory data analysis skills
If you want to jump right into the analysis without searching for datasets, this article will help you. We have listed five datasets that are ideal for performing exploratory data analysis and visualization. You can analyze salary datasets, clinical trial reports, or even air traffic data. Fortunately, all of these datasets are available on Kaggle, so all you have to do is run your notebook and get started.
In “5 Real-World Datasets to Sharpen Your Exploratory Data Analysis Skills,” the following real-world datasets are presented:
5 Real-World Datasets to Sharpen Your Exploratory Data Analysis Skills
・・・
4. Create a custom image dataset
If you’re into deep learning and want to build your own image dataset for a project, this article lists 5 browser extensions that make bulk image downloads easy. However, be careful not to infringe copyright when downloading images.
In “Creating Custom Image Datasets for Deep Learning Projects”, 5 ways to download images from the Internet are presented:
5 Useful Tools to Download Images
5. Extract data from HTML tables
Datasets published on the Internet are sometimes provided as HTML tables. Such tables are typically long and span the entire web page. The data in them is also often dynamic; that is, it is updated at regular intervals. As a result, copying and pasting an HTML table into an Excel sheet is not always convenient. Scraping is one option, but there is an easier way: Google Sheets has a handy function called IMPORTHTML, which imports data from tables and lists within HTML pages. The article below describes the end-to-end process for bringing tables (and lists) into Google Sheets.
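For comparison, here is a minimal sketch of the scraping route, using only Python’s standard library. It is not the IMPORTHTML approach from the linked article. The `TableExtractor` class and the sample HTML are made up for illustration, and the sketch assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in a page, row by row."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # row currently being read, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Made-up sample page; a real page would be downloaded first.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.95</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(SAMPLE_HTML)
rows = parser.tables[0]  # first table on the page
print(rows)
```

Even for this tiny table the scraper needs a fair amount of code, which is why a single spreadsheet formula is attractive when the data only needs to end up in a sheet.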
・・・
6. Extract data from PDF
Extracting tabular data from a PDF is painstaking. A bigger problem is that much open data is published as PDF files. This data is valuable for analysis, but getting at the data inside the PDF becomes a bottleneck. This article describes Camelot, an open-source Python library that makes it easy to extract tables from PDFs. It also covers Excalibur, a web interface for those who want the library’s functionality without writing code.
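A minimal sketch of the Camelot workflow might look like the following. It assumes the third-party `camelot-py` package is installed; the helper name `extract_pdf_tables` and the file name `report.pdf` are hypothetical. The import sits inside the function so the snippet can be loaded even where Camelot is not available.

```python
def extract_pdf_tables(pdf_path, pages="1", flavor="lattice"):
    """Return the tables Camelot finds on the given pages as DataFrames.

    flavor="lattice" targets tables with ruling lines between cells;
    flavor="stream" targets tables laid out with whitespace.
    """
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    # Each detected table exposes its contents as a pandas DataFrame via .df
    return [tables[i].df for i in range(len(tables))]


# Hypothetical usage, assuming a file named report.pdf exists:
#   frames = extract_pdf_tables("report.pdf", pages="1")
#   frames[0].to_csv("table_0.csv", index=False)
```

Which flavor works better depends on how the table is drawn in the PDF, so it is worth trying both when the first result looks wrong.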
7. Extracting information from XML files
So far we have worked with data in HTML tables and PDF files. XML files are another format that needs processing before the data can be used. XML stands for Extensible Markup Language. As the name suggests, it is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. This article walks you through converting XML data into a CSV file and then loading it into pandas for further analysis.
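The XML-to-CSV step can be sketched with Python’s standard library as shown below. The sample document, the `xml_to_csv` helper, and the tag names are invented for illustration; the linked article may structure the conversion differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample document with two records.
SAMPLE_XML = """<catalog>
  <book id="b1"><title>First</title><price>10.5</price></book>
  <book id="b2"><title>Second</title><price>7.0</price></book>
</catalog>"""


def xml_to_csv(xml_text, record_tag, fields):
    """Flatten every <record_tag> element into one CSV row.

    Each name in `fields` is looked up as a child element of the record;
    a missing child becomes an empty cell.
    """
    root = ET.fromstring(xml_text)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)  # header row
    for record in root.iter(record_tag):
        writer.writerow([record.findtext(field, default="") for field in fields])
    return buffer.getvalue()


csv_text = xml_to_csv(SAMPLE_XML, "book", ["title", "price"])
print(csv_text)
```

From here, `pandas.read_csv(io.StringIO(csv_text))` (or `read_csv` on a saved file) would load the rows into a DataFrame.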
・・・
8. Read data from clipboard to DataFrame in pandas
This article describes a handy pandas function, read_clipboard(), which creates a DataFrame from data copied to the clipboard. It reads the text from the clipboard and passes it to read_csv(), which returns the parsed DataFrame.
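The sketch below shows both halves of that description, assuming pandas is installed. `frame_from_clipboard` wraps the real `pandas.read_clipboard()` call and needs a working system clipboard, while `frame_from_text` is a hypothetical stand-in that feeds the same kind of copied text to `read_csv()`, so the parsing step can be tried without a clipboard. Both helper names and the sample text are illustrative.

```python
import io


def frame_from_clipboard(**read_csv_kwargs):
    """Build a DataFrame from whatever text is currently on the clipboard."""
    import pandas as pd

    # Extra keyword arguments are forwarded to read_csv() by pandas.
    return pd.read_clipboard(**read_csv_kwargs)


def frame_from_text(copied_text, sep="\t"):
    """Parse text as it would look after copying cells from a spreadsheet."""
    import pandas as pd

    return pd.read_csv(io.StringIO(copied_text), sep=sep)


# Cells copied from a spreadsheet usually arrive as tab-separated text:
#   df = frame_from_text("name\tscore\nAda\t90\nLin\t85\n")
#   df = frame_from_clipboard()  # after copying a table
```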
Conclusion
This article presents several techniques for obtaining datasets. Some of them should help you find the dataset you need for your next project. You can also create your own datasets or perform meaningful analysis on data you have downloaded. The possibilities are endless!