What's synthetic data?


A field guide to the varied species of faux data: Part 1

Synthetic data is, to put it bluntly, fake data. As in, data that's not actually from the population you're interested in. (Population is a technical term in data science, which I explain here.) It's data that you're planning to treat as if it came from the place/group you wish it came from. (It didn't.)

Synthetic data is, to put it bluntly, fake data.

Artificial data, synthetic data, fake data, and simulated data are all synonyms with slightly different heydays as the term du jour, so they carry poetic connotations from different eras. These days, the cool kids prefer the synthetic data buzzword, perhaps because investors need to be convinced that something new has been invented, rather than rediscovered. And there is something slightly new in play here, but (in my opinion) not new enough for all the old ideas to be irrelevant.

Let’s dive in!

Some synthetic numbers! All image rights belong to the author.

(Note: the links in this post take you to explainers by the same author.)

Infinite possibilities

If you've suffered through a graduate course on advanced probability and measure theory like I have (my therapist and I are still working through it over a decade later), you'll be all too aware that there are infinitely many real numbers. Among other things, this means that if you try to enumerate them all, I can swoop in like a jerk and find you a new one, for example by adding 1 to your largest number, taking the average of your two closest numbers, or popping a digit onto the end of the number with the longest series of digits after the decimal point.

This also means that if you give me the list of all the numbers ever recorded by humans over the history of humankind, I can still make a brand new one. Boom! The power.
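Here is a minimal Python sketch of those three "find a new number" tricks. The function names and the tiny `recorded` list are made up for illustration, and exact fractions and digit strings stand in for real numbers so that floating-point rounding doesn't muddy the point.

```python
from fractions import Fraction


def add_one_to_largest(numbers):
    """Anything bigger than the maximum cannot already be in the list."""
    return max(numbers) + 1


def average_of_two_closest(numbers):
    """The midpoint of the two closest distinct values lies strictly between them."""
    ordered = sorted(set(numbers))
    low, high = min(zip(ordered, ordered[1:]), key=lambda pair: pair[1] - pair[0])
    return (low + high) / 2


def append_digit(decimal_strings, digit="7"):
    """Add one more nonzero digit to the entry with the most decimal places."""
    def places(text):
        return len(text.partition(".")[2])

    longest = max(decimal_strings, key=places)
    separator = "" if "." in longest else "."
    return longest + separator + digit


# A pretend list of "every number ever recorded" (illustrative values only).
recorded = [Fraction(173), Fraction(174), Fraction(347, 2), Fraction(180)]
recorded_as_text = ["173", "174", "173.5", "173.25"]

new_by_adding = add_one_to_largest(recorded)
new_by_averaging = average_of_two_closest(recorded)
new_by_appending = append_digit(recorded_as_text)

print(new_by_adding, new_by_averaging, new_by_appending)
```

Each result is guaranteed to be missing from the list it was built from, no matter how long that list gets, as long as it is finite (and, for the averaging trick, has at least two distinct values).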

Where am I going with this, besides providing fodder for your next beery debate on whether there's such a thing as true originality (ugh)?

Let's say you have a dataset full of human heights. Between any two measurements (say 173cm and 174cm, the interval in which you'll find my height) there are infinite possibilities for a number you could write down. Just keep adding decimal places beyond the reasonable ability of our measuring tools. Beyond subatomic particles. Beyond common sense. There are still plenty of numbers I could make up, like: 173.4335524095820398502639008342984598739874944444443842397593645873649572850263894458092843956389479592489586232342349832842849687394208287645545352525353353826482384724628732648732799999992323…

The rules governing the creation of this silly number are way out there, beyond the realm of what's useful and practical, so when you ask me to give you a number that could represent a human height for you to add to your dataset, how might I approach your request?

Real world data

One option is to give you real data from a real human. I look around the room, spot my bff Heather (true story, she says hi), and measure her for your dataset. If your population of interest is all humans, her height would be a legit datapoint for your dataset if (and that's a big if) I measured it according to the rules you laid out for how your population should be measured.

Noisy data

If I measure Heather's height in laptops (I didn't bring a tape measure to our weekend retreat, sorry) to the nearest 13 inches while you measured heights in millimeters using one of those meter rulers, we'll have problems.

When we say noisy data, we mean there's nondeterministic error in there that hides the true answer. And that's exactly what'll happen if I get it into my head to measure Heather in laptops. (Or Smoots.)

Any measurement you get from me will have random error built in that has a different profile from what's in the rest of your data. To deal with the can of worms we're potentially opening up here, be sure to include a record of the source of the data. (Who collected it: you or me?) You can always nuke my entries later… as long as they're not hiding among your legit contributions.

When collecting data from the real world, it's surprisingly easy to mess up. To learn more, check out my series on data design and data collection:
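To make the "different error profile" idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the "true" heights are invented, the wobble sizes (10 mm for laptops, 1 mm for the ruler) are picked out of thin air, and the `source` field is just one simple way to record who collected each entry.

```python
import random

LAPTOP_MM = 13 * 25.4  # one 13-inch "laptop" expressed in millimeters


def measure_in_laptops(true_mm, rng):
    """Hypothetical sloppy process: random wobble, then round to whole laptops."""
    wobble = rng.gauss(0, 10)  # assumed standard deviation of 10 mm
    whole_laptops = round((true_mm + wobble) / LAPTOP_MM)
    return whole_laptops * LAPTOP_MM


def measure_with_ruler(true_mm, rng):
    """Hypothetical careful process: tiny wobble, then round to the millimeter."""
    wobble = rng.gauss(0, 1)  # assumed standard deviation of 1 mm
    return float(round(true_mm + wobble))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented "true" heights in millimeters, purely for illustration.
dataset = [
    {"height_mm": measure_with_ruler(true_mm, rng), "source": "you"}
    for true_mm in (1620, 1734, 1815)
]
dataset.append({"height_mm": measure_in_laptops(1700, rng), "source": "me"})

# Because every row says who collected it, nuking my entries later is easy.
trusted = [row for row in dataset if row["source"] == "you"]

print(dataset[-1])
print(len(trusted), "rows left after dropping my laptop measurements")
```

The laptop measurement can only ever land on a multiple of 330.2 mm, so its error can reach about 165 mm in either direction, while the ruler entries stay within a few millimeters of the truth. Without the `source` field there would be no clean way to tell the two kinds of row apart.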

Handcrafted data

What if there was no one to measure but you wanted another datapoint anyway? (Why might you want to do that and what are the pros and cons? See my next blog post!)

Then you're saying you're okay with synthetic data. (If you allow synthetic data into your project, always keep a record of which datapoints are synthetic and how they were made!)

I could also give you a height datapoint by making up a number following no rules at all. If I'm especially perverse, I might even throw out a complex number like -5 + 60*sqrt(-1) just to mess with you. Did you say I couldn't? You should. If you're letting me make stuff up, you need to constrain my creativity.

No imaginary numbers? Okay, how about -100?

Oh, it needs to be within the range of actual human heights? How about that 173.43355240… number from earlier?

Too many decimal places because human measuring instruments aren't that sensitive? Fine, how about 173.5cm?
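That back-and-forth is really a list of constraints, and constraints are easy to write down. The sketch below is one hedged way to do it in Python: the function name `check_height_cm` is hypothetical, and the 50cm to 275cm range and the 1 mm resolution are placeholder assumptions, not official limits on human height.

```python
from decimal import Decimal

# Placeholder constraints (assumptions for illustration only).
MIN_CM = Decimal("50")
MAX_CM = Decimal("275")
RESOLUTION_CM = Decimal("0.1")  # nearest millimeter


def check_height_cm(candidate):
    """Return 'ok' or the reason a made-up height breaks the rules."""
    if isinstance(candidate, complex):
        return "rejected: not a real number"
    value = Decimal(str(candidate))
    if not MIN_CM <= value <= MAX_CM:
        return "rejected: outside the assumed height range"
    if value % RESOLUTION_CM != 0:
        return "rejected: more decimal places than the instrument supports"
    return "ok"


# The candidates from the conversation above (the long one is truncated here).
for candidate in (-5 + 60j, -100, "173.43355240", "173.5"):
    print(candidate, "->", check_height_cm(candidate))
```

Heights are passed as strings and handled with `Decimal` so that the resolution check works on the digits as written rather than on a binary floating-point approximation.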

We might call this handcrafted data, since I, a human, came up with it by handcrafting an example that appeals to me.

But what if you wanted more than one new height for your dataset? And you tell me to be reasonable and round my choices to the nearest millimeter?

Well, I might give you: 173.5cm, 182.4cm, 175.1cm, 190.2cm, 180.1cm…

These are all plausible human measurements, but they're on the tallish side. They probably don't represent your population of interest very well. They're biased by my ideas of what good entries in your dataset look like. And what do I know about human heights anyway? You can do better.

So let's do better in Part 2, where we'll go on a journey that covers:
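A quick Python check makes the lopsidedness visible, using only numbers already mentioned in this post: the five handcrafted heights and the 173cm to 174cm interval that contains my own height. The record layout (the `synthetic` and `method` fields) is an assumed convention for keeping track of how each datapoint was made.

```python
from decimal import Decimal

handcrafted_cm = [Decimal(text) for text in ("173.5", "182.4", "175.1", "190.2", "180.1")]

# Keep a record of which datapoints are synthetic and how they were made.
records = [
    {"height_cm": height, "synthetic": True, "method": "handcrafted"}
    for height in handcrafted_cm
]

MY_INTERVAL_LOW_CM = Decimal("173")  # lower end of the interval containing my height

mean_cm = sum(handcrafted_cm) / len(handcrafted_cm)
shortest_cm = min(handcrafted_cm)
shorter_than_me = [height for height in handcrafted_cm if height < MY_INTERVAL_LOW_CM]

print("mean:", mean_cm)
print("shortest:", shortest_cm)
print("entries shorter than me:", len(shorter_than_me))
```

Not one of the invented heights comes in below my own, which is a strong hint that I anchored on myself and drifted upward from there. A real population would include plenty of people shorter than the person making up the numbers.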

  • duplicated data
  • resampled data
  • bootstrapped data
  • augmented data
  • oversampled data
  • edge case data
  • simulated data
  • univariate data
  • bivariate data
  • multivariate data
  • multimodal data

Or help yourself to one of my other data taxonomy guides here:

Thanks for reading! How about a YouTube course?

If you had fun here and you're looking for an entire applied AI course designed to be fun for beginners and experts alike, here's the one I made for your amusement:

Enjoy the course on YouTube here.

P.S. Have you ever tried hitting the clap button here on Medium more than once to see what happens? ❤️
