The synthetic data field guide
A guide to the assorted species of faux data: Part 2

If you want to work with data, what are your options? Here's an answer that's as coarse as possible: you can get hold of real data or you can get hold of fake data.

In my previous article, we made friends with the concept of synthetic data and discussed the thought process around creating it. We compared real data, noisy data, and handcrafted data. Let's dig into the species of synthetic data that are fancier than asking a human to pick a number, any number…

A classic of British sketch comedy.

(Note: the links in this post take you to explainers by the same author.)

Duplicated data

Perhaps you measured 10,000 real human heights but you need 20,000 datapoints. One approach you could take is to assume your existing dataset already represents your population fairly well. (Assumptions are always dangerous, proceed with caution.) Then you can simply duplicate the dataset or duplicate some portion of it using ye olde copy-paste. Ta-da! More data! But is it good and useful data? That always depends on what you need it for. For many situations, the answer will be no. But hey, there are reasons you were born with a head, and those reasons are to chew and to use your best judgment.
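A minimal sketch of ye olde copy-paste, using a handful of made-up heights (the numbers are mine, not from the article):

```python
# Toy dataset of heights in centimeters (invented for illustration).
heights = [158.2, 171.5, 164.0, 180.3, 167.8]

# Duplication: double the dataset by appending a copy of itself.
doubled = heights + heights

print(len(heights))  # 5
print(len(doubled))  # 10
```

Note that the doubled dataset contains no new information at all, which is exactly why "is it useful?" depends entirely on what you need it for.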

Resampled data

Speaking of duplicating only a portion of your data, there's a way to inject a bit of randomness to help you decide which portion to pick. You can use a random number generator to help you pick which height to draw from your existing list of heights. You can do this "without replacement", meaning that you make at most one copy of each existing height, but…
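The "without replacement" flavor can be sketched with Python's standard library (toy heights, seed chosen arbitrarily for reproducibility):

```python
import random

heights = [158.2, 171.5, 164.0, 180.3, 167.8, 175.1]

rng = random.Random(42)  # seeded so the draw is reproducible

# "Without replacement": each existing height is copied at most once.
subset = rng.sample(heights, k=3)
```

Every value in `subset` comes straight from the original list, and no position is drawn twice.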

Bootstrapped data

You'll more often see people doing this "with replacement", meaning that each time you randomly pick a height to copy, you immediately forget you did this so that the same height can make its way into your dataset as a second, third, fourth, etc. copy. Perhaps if there's enough interest in the comments, I'll explain why this is a powerful and effective technique (yes, it sounds like witchcraft at first, I thought so too) for population inference.
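A minimal bootstrap sketch, again with invented heights; the 1,000 resamples is an arbitrary choice for illustration:

```python
import random

heights = [158.2, 171.5, 164.0, 180.3, 167.8, 175.1]
rng = random.Random(0)

# "With replacement": after each pick we forget we picked it,
# so the same height can show up two, three, four... times.
bootstrap_sample = rng.choices(heights, k=len(heights))

# The witchcraft part: repeat many times and look at the spread of the
# resampled means, which approximates the sampling distribution of the mean.
boot_means = [
    sum(rng.choices(heights, k=len(heights))) / len(heights)
    for _ in range(1000)
]
```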

Augmented data

Augmented data might sound fancy, and there *are* fancy ways to augment data, but usually when you see this term, it means you took your resampled data and added some random noise to it. In other words, you generated a random number from a statistical distribution and, typically, you just added it to the resampled datapoint. That's it. That's the augmentation.
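The resample-then-add-noise recipe in a few lines; the Gaussian noise with a 2 cm standard deviation is an assumed scale, not anything prescribed by the article:

```python
import random

heights = [158.2, 171.5, 164.0, 180.3, 167.8, 175.1]
rng = random.Random(7)

# Step 1: resample with replacement. Step 2: add a little random noise
# drawn from a statistical distribution (here, Normal(0, 2.0 cm)).
augmented = [h + rng.gauss(0.0, 2.0) for h in rng.choices(heights, k=10)]
```

That's it. That really is the whole vanilla augmentation.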

All image rights belong to the author.

Oversampled data

Speaking of duplicating only a portion of your data, there's also a way to be intentional about boosting certain characteristics over others. Perhaps you took your measurements at a typical AI conference, so female heights are underrepresented in your data (sad but true today). That's called the problem of unbalanced data. There are techniques for rebalancing the representation of those characteristics, such as SMOTE (Synthetic Minority Oversampling TEchnique), which is pretty much what it sounds like. The most naive way to smite the problem is to simply limit your resampling to the minority datapoints, ignoring the others. So in our example, you'd just resample the female heights while ignoring the other data. You could also consider more sophisticated augmentation, still limiting your efforts to the female heights.
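Here's the naive minority-only resampling sketched on a made-up unbalanced dataset (this is the simple limit-your-resampling approach, not SMOTE itself, which interpolates between minority neighbors):

```python
import random

rng = random.Random(1)

# Toy unbalanced dataset: (height_cm, group) pairs with "F" underrepresented.
data = [(172.0, "M"), (168.5, "M"), (181.2, "M"), (176.4, "M"),
        (175.9, "M"), (179.0, "M"), (161.3, "F"), (165.8, "F")]

minority = [row for row in data if row[1] == "F"]
majority = [row for row in data if row[1] == "M"]

# Naive oversampling: resample ONLY the minority rows (with replacement)
# until the two groups are the same size, ignoring the majority rows.
extra = rng.choices(minority, k=len(majority) - len(minority))
balanced = data + extra
```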

If you wanted to get even fancier, you could look up techniques like ADASYN (Adaptive Synthetic Sampling) and follow the breadcrumbs on a trail that's out of scope for a quick intro to this topic.

Edge case data

You could also make up (handcrafted) data that's totally unlike anything you (or anyone) has ever seen. This would be a very silly thing to do if you were trying to use it to create models of the real world, but it's clever if you're using it to, for example, test your system's ability to handle weird things. To get a sense of whether your model/theory/system chokes when it meets an outlier, you might make synthetic outliers on purpose. Go ahead, put in a height of 3 meters and see what explodes. Kind of like a fire drill at work. (Don't leave an actual fire in the building or an actual monster outlier in your dataset.)
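The fire drill might look like this; `summarize` is a hypothetical stand-in for whatever downstream code you're stress-testing:

```python
heights = [158.2, 171.5, 164.0, 180.3, 167.8]

# Handcraft a deliberately absurd outlier -- 3 meters tall -- on purpose.
fire_drill = heights + [300.0]

def summarize(xs):
    # Stand-in for a downstream consumer of the data; a fragile one
    # might divide by a range, assume bounds, overflow a buffer, etc.
    return min(xs), max(xs), sum(xs) / len(xs)

print(summarize(heights))     # sane numbers
print(summarize(fire_drill))  # the max and mean jump visibly -- did anything choke?
```

Then remember to take the monster back out of your dataset before anyone trains on it.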

http://bit.ly/quaesita_ytoutliers

Simulated data

Once you're getting cozy with the idea of making data up according to your specifications, you might like to go a step further and create a recipe to describe the underlying nature of the kind of data you'd like in your dataset. If there's a random component, then what you're actually doing is simulating from a statistical distribution that lets you specify what the core principles are, as described by a model (which is just a fancy way of saying "a formula that you're going to use as a recipe") with a rule for how the random bits work. Instead of adding random noise to an existing datapoint as the vanilla data augmentation techniques do, you can add noise to a set of rules you came up with, either by meditating or by doing some statistical inference with a related dataset. Learn more about that here.
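A minimal simulation sketch: the "recipe" here is a normal distribution, and the mean and standard deviation are assumed values you'd normally get by meditating or by inference from a related dataset:

```python
import random

rng = random.Random(123)

# The recipe: heights ~ Normal(MEAN_CM, SD_CM).
# These parameter values are assumptions for illustration.
MEAN_CM, SD_CM = 170.0, 8.0

# No existing datapoints needed -- every value comes from the recipe itself.
simulated = [rng.gauss(MEAN_CM, SD_CM) for _ in range(10_000)]

sample_mean = sum(simulated) / len(simulated)
```

Unlike augmentation, nothing here started life as a real measurement; change the recipe and you change the whole dataset.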

All image rights belong to the author.

Heights? Wait, you're asking me for a dataset of nothing but one height at a time? How boring! How… floppy disk era of us. We call this univariate data and it's rare to see it collected in the wild today.

Now that we have incredible storage capacity, data can come in much more interesting and complicated forms. It's very cheap to grab some extra characteristics along with heights while we're at it. We could, for example, record hairstyle, making our dataset bivariate. But why stop there? How about the age too, so our data's multivariate? How fun!
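Extending the simulation recipe from one column to several; the hairstyle list, age range, and height parameters are all invented for the example:

```python
import random

rng = random.Random(5)

HAIRSTYLES = ["buzz", "bob", "ponytail", "curly"]  # made-up categories

def make_record():
    # One synthetic "person": height plus extra characteristics.
    return {
        "height_cm": rng.gauss(170.0, 8.0),
        "hairstyle": rng.choice(HAIRSTYLES),  # two columns: bivariate
        "age": rng.randint(18, 80),           # three columns: multivariate
    }

dataset = [make_record() for _ in range(100)]
```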

But today, we can go wild and mix all that with image data (take a photo during the height measurement) and text data (that essay they wrote about how unnecessarily boring their statistics class was). We call this multimodal data and we can synthesize that too! If you'd like to learn more about that, let me know in the comments.

Why might someone want to make synthetic data? There are good reasons to love it and some solid reasons to avoid it like the plague (article coming soon), but if you're a data science professional, head over to this article to find out which reason I think should be your favorite for using it often.

If you had fun here and you're looking for a whole applied AI course designed to be fun for beginners and experts alike, here's the one I made for your amusement:

Enjoy the course on YouTube here.

P.S. Have you ever tried hitting the clap button here on Medium more than once to see what happens? ❤️
