5 Ways to Get Interesting Datasets for Your Next Data Project (Not Kaggle)

Bored of Kaggle and FiveThirtyEight? Here are the alternative strategies I use for getting high-quality and unique datasets

Image by Efe Kurnaz on Unsplash

The key to a great data science project is a great dataset, but finding great data is much easier said than done.

I remember back when I was studying for my master’s in Data Science, a little over a year ago. Throughout the course, I found that coming up with project ideas was the easy part; it was finding good datasets that I struggled with the most. I would spend hours scouring the internet, pulling my hair out trying to find juicy data sources and getting nowhere.

Since then, I’ve come a long way in my approach, and in this article I want to share with you the 5 strategies that I use to find datasets. If you’re bored of standard sources like Kaggle and FiveThirtyEight, these strategies will help you get data that are unique and much more tailored to the specific use cases you have in mind.

Yep, believe it or not, creating your own data is actually a legit strategy. It’s even got a fancy technical name (“synthetic data generation”).

If you’re trying out a new idea or have very specific data requirements, making synthetic data is a fantastic way to get original and tailored datasets.

For example, let’s say that you’re trying to build a churn prediction model: a model that can predict how likely a customer is to leave a company. Churn is a fairly common “operational problem” faced by many companies, and tackling a problem like this is a great way to show recruiters that you can use ML to solve commercially relevant problems, as I’ve argued previously.

However, if you search online for “churn datasets,” you’ll find that there are (at the time of writing) only two main datasets readily available to the public: the Bank Customer Churn Dataset and the Telecom Churn Dataset. These datasets are a fantastic place to start, but might not reflect the kind of data required for modelling churn in other industries.

Instead, you can try creating synthetic data that’s more tailored to your requirements.

If this sounds too good to be true, here’s an example dataset which I created with just a short prompt to that old chestnut, ChatGPT:

Image by author

Of course, ChatGPT is limited in the speed and size of the datasets it can create, so if you want to scale this technique up I’d recommend using either the Python library faker or scikit-learn’s sklearn.datasets.make_classification and sklearn.datasets.make_regression functions. These tools are a fantastic way to programmatically generate huge datasets in the blink of an eye, and ideal for building proof-of-concept models without having to spend ages searching for the perfect dataset.
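
To give you a flavour of what this looks like, here’s a minimal sketch that combines faker and make_classification to build a churn-style table. It assumes the faker, scikit-learn and pandas packages are installed, and the column names and parameters are purely illustrative, not taken from any real churn dataset.

```python
# A minimal sketch of programmatic synthetic data generation.
# Column names and parameters are illustrative assumptions.
import pandas as pd
from faker import Faker
from sklearn.datasets import make_classification

fake = Faker()
n_customers = 1_000

# Numeric features plus a binary "churned" label
X, y = make_classification(
    n_samples=n_customers,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    weights=[0.8, 0.2],  # churn is usually the minority class
    random_state=42,
)

df = pd.DataFrame(
    X, columns=["monthly_spend", "tenure_months", "support_tickets", "logins_per_week"]
)
df["customer_name"] = [fake.name() for _ in range(n_customers)]
df["signup_date"] = [fake.date_between(start_date="-3y", end_date="today") for _ in range(n_customers)]
df["churned"] = y

print(df.head())
```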

In practice, I have rarely needed to use synthetic data creation techniques to generate entire datasets (and, as I’ll explain later, you’d be wise to exercise caution if you intend to do this). Instead, I find this is a really neat technique for generating adversarial examples or adding noise to your datasets, enabling me to test my models’ weaknesses and build more robust versions. But, regardless of how you use this technique, it’s an incredibly useful tool to have at your disposal.
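
As a rough sketch of the noise-injection idea, you can perturb the test features and see how much your model’s accuracy degrades; the model choice and noise scale below are my own assumptions, purely for illustration.

```python
# A rough sketch: add Gaussian noise to the test set and compare accuracy.
# The logistic regression model and noise scale are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

rng = np.random.default_rng(0)
X_noisy = X_test + rng.normal(scale=0.5, size=X_test.shape)  # perturb every feature

print("clean accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("noisy accuracy:", accuracy_score(y_test, model.predict(X_noisy)))
```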

Creating synthetic data is a nice workaround for situations when you can’t find the kind of data you’re looking for, but the obvious problem is that you’ve got no guarantee that the data are good representations of real-life populations.

If you want to guarantee that your data are realistic, the best way to do that is, surprise surprise…

… to actually go and find some real data.

One way of doing this is to reach out to companies that might hold such data and ask whether they’d be interested in sharing some with you. At the risk of stating the obvious, no company is going to give you data that are highly sensitive or if you are planning to use them for commercial or unethical purposes. That would just be plain silly.

However, if you intend to use the data for research (e.g., for a university project), you may well find that companies are open to providing data if it’s in the context of a quid pro quo joint research agreement.

What do I mean by this? It’s actually pretty simple: I mean an arrangement whereby they provide you with some (anonymised/de-sensitised) data and you use the data to conduct research which is of some benefit to them. For example, if you’re interested in studying churn modelling, you could put together a proposal for comparing different churn prediction techniques. Then, share the proposal with some companies and ask whether there’s potential to work together. If you’re persistent and cast a wide net, you’ll likely find a company that’s willing to provide data for your project as long as you share your findings with them so that they get a benefit out of the research.

If that sounds too good to be true, you might be surprised to hear that this is exactly what I did during my master’s degree. I reached out to a couple of companies with a proposal for how I could use their data for research that would benefit them, signed some paperwork to confirm that I wouldn’t use the data for any other purpose, and conducted a really fun project using some real-world data. It really can be done.

The other thing I particularly like about this strategy is that it provides a way to exercise and develop quite a broad set of skills which are important in Data Science. You have to communicate well, show commercial awareness, and become a pro at managing stakeholder expectations, all of which are essential skills in the day-to-day life of a Data Scientist.

Pleeeeeease let me have your data. I’ll be a good boy, I promise! Image by Nayeli Rosales on Unsplash

A lot of datasets used in academic studies aren’t published on platforms like Kaggle, but are still publicly available for use by other researchers.

One of the best ways to find datasets like these is by looking in the repositories associated with academic journal articles. Why? Because many journals require their contributors to make the underlying data publicly available. For example, two of the data sources I used during my master’s degree (the Fragile Families dataset and the Hate Speech Data website) weren’t available on Kaggle; I found them through academic papers and their associated code repositories.

How can you find these repositories? It’s actually surprisingly easy: I start by opening up paperswithcode.com, search for papers in the area I’m interested in, and look at the available datasets until I find something that looks interesting. In my experience, this is a really neat way to find datasets which haven’t been done to death by the masses on Kaggle.

Honestly, I’ve no idea why more people don’t make use of BigQuery Public Datasets. There are literally hundreds of datasets covering everything from Google Search Trends to London Bicycle Hires to Genomic Sequencing of Cannabis.

One of the things I especially like about this source is that many of these datasets are incredibly commercially relevant. You can kiss goodbye to niche academic topics like flower classification and digit prediction; in BigQuery, there are datasets on real-world business issues like ad performance, website visits and economic forecasts.

A lot of people shy away from these datasets because they require SQL skills to load them. But even if you don’t know SQL and only know a language like Python or R, I’d still encourage you to take an hour or two to learn some basic SQL and then start querying these datasets. It doesn’t take long to get up and running, and this truly is a treasure trove of high-value data assets.

To use the datasets in BigQuery Public Datasets, you can sign up for a completely free account and create a sandbox project by following the instructions here. You don’t need to enter your credit card details or anything like that; just your name, your email, a bit of information about the project, and you’re good to go. If you need more computing power at a later date, you can upgrade the project to a paid one and access GCP’s compute resources and advanced BigQuery features, but I’ve personally never needed to do this and have found the sandbox to be more than adequate.
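
To show how little SQL you actually need, here’s a minimal sketch that queries the public London bicycle hires dataset from Python using the google-cloud-bigquery client. The project ID is a placeholder, you’ll need to have authenticated with GCP first, and the pandas/db-dtypes extras are assumed to be installed for to_dataframe().

```python
# A minimal sketch of querying a BigQuery public dataset from Python.
# Assumes google-cloud-bigquery is installed, you have authenticated
# (e.g. via `gcloud auth application-default login`), and
# "your-sandbox-project" is replaced with your own sandbox project ID.
from google.cloud import bigquery

client = bigquery.Client(project="your-sandbox-project")

query = """
    SELECT start_station_name, COUNT(*) AS num_hires
    FROM `bigquery-public-data.london_bicycles.cycle_hire`
    GROUP BY start_station_name
    ORDER BY num_hires DESC
    LIMIT 10
"""

# The ten most popular start stations, returned as a pandas DataFrame
df = client.query(query).to_dataframe()
print(df)
```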

My final tip is to try using a dataset search engine. These are incredible tools which have only emerged in the past few years, and they make it very easy to quickly see what’s out there. Three of my favourites are:

In my experience, searching with these tools can be a much more effective strategy than using generic search engines, as you’re often provided with metadata about the datasets and you have the ability to rank them by how often they’ve been used and by publication date. Quite a nifty approach, if you ask me.
