Here’s why your efforts to extract value from data are going nowhere
By Cassie Kozyrkov

My favorite way of explaining the difference between data science and data engineering is this:

If data science is “making data useful,” then data engineering is “making data usable.”

These disciplines are so exciting that it’s easy to get ahead of ourselves and forget that before we can make data usable (let alone useful), we need to make data in the first place.

But what about “making data” in the first place?

The art of making good data is terribly neglected. If you have no data (no inputs) to work with, then there’s not much that your data engineers and data scientists can help you with.

But even when you do have some data, there’s a chance you’re missing something: data quality. If you’ve collected truly rancid data, forget about extracting value from it. It’s futile to fight the inescapable gravity of this basic law of nature: Garbage In, Garbage Out.

An analogy for AI by the author, from the article “Why Businesses Fail at Machine Learning.”

Data plays the same role in data science and AI as ingredients play in cooking. A spiffy kitchen full of all the latest implements won’t save you; if your ingredients are garbage, you might as well give up. No matter how you slice and dice them, you’re not about to cook up anything worthwhile. That’s why you should think about investing in good data before you rush headlong into your project.

If you care about results, invest in good data before chasing fancy algorithms, models, and a parade of data scientists.

Speaking of Garbage In, Garbage Out, your author went into this place and came out exactly the same. ¯\_(ツ)_/¯

Let me make a little guess about you, dear reader: you’re not new to Garbage In, Garbage Out (GIGO). Or QIQO for the more upbeat glass-half-full personalities out there (the Q is for quality). You’re practically begging me to say something you haven’t heard before, yet here I am chafing your patience with GIGO talk. Again. Yes, we’ve all repeated the GIGO principle ad nauseam. I’m at least as sick of it as you are.

But riddle me this. If we’ve got a whole industry of GIGO-respecting professionals and we also understand that designing quality datasets isn’t trivial, where’s the evidence that we put our money where our mouths are?

If data quality is so obviously important (after all, it’s the foundation of the whole multibillion-dollar data/AI/ML/statistics/analytics shebang), what do we call the professionals who are responsible for it? This isn’t a trick question. All I want you to tell me is:

What’s the *job title* of the person whose primary role is the design, collection, curation, and documentation of high-quality datasets?

Except, unfortunately, it might as well be a trick question. Whenever I chat with a group of datafolk at a conference, I try to sneak the question in. And every time I’ve asked them who’s responsible for data quality in their organizations, they’ve never come up with anything remotely resembling consensus. Whose job is it? Data engineers say data engineers, statisticians say statisticians, researchers say researchers, UX designers say UX designers, product managers say product managers… GIGO ad nauseam indeed. Data quality seems to be exactly the kind of “everybody’s job” that ends up being no one’s job, because it requires skills (!) yet nobody seems to be investing in them intentionally, let alone sharing best practices.

Data quality is exactly the kind of “everybody’s job” that ends up being no one’s job.

Maybe I care a little too much about the data science profession. If I were here just for my own career, I’d make a quick buck with data charlatanism, but I want data careers in general to matter. To be worth something. To be useful. To make the world better than we found it. So when I see the two most important prerequisites neglected (data quality and data leadership), it breaks my heart.

If the {data quality professional / data designer / data curator / data collector / data steward / dataset engineer / data excellence expert} profession doesn’t even have a name (see?) or a community, no wonder you won’t find it on a resume or in a university program. What keywords will your recruiters use to search for candidates? What interview questions will you use to screen for the core skills? And good luck finding excellence: your candidate will need quite the symphony of skills.

What keywords will your recruiters use to search for candidates? What interview questions will you use to screen for the core skills?

First off, let’s recognize that we’re not talking about your kid cousin’s “data labeling” summer job here, the kind of job that involves mindless data entry and/or selecting all the cupcake photos among a purgatory of bakery thumbnails and/or going door to door with a paper survey. Thought I’d mention this because “isn’t it just data labeling?” is a question I’ve been asked multiple times in a tone of polite concern for my blood pressure. What a way to dismiss a whole category of genius.

“Isn’t it just data labeling?” No. (What a way to dismiss a whole category of genius.)

No, we’re talking about the kind of person who designs that data collection process in the first place. It takes at least a pinch of user experience design, a splash of decision science, a spoonful of survey design experience, a lump of psychology, a dollop of experimental social science with field experience (anyone who’s got real experience will anticipate the Philadelphia Problem for you in their sleep), and a chunk of statistics training too (though you don’t need a full statistician), plus solid analytics experience, loads of domain expertise, some project/program management skills, a little bit of exposure to data product management, and enough of a data engineering background to think about data collection at scale. This is a rare mix; we urgently need a new specialization.

To have any hope of building a mature data ecosystem, we must give a new generation of specialists a good home where they will be appreciated for flexing their specialist skills.

But until we’ve fought for a data-making profession that’s well recognized, well managed, and well rewarded, we’re stuck. Budding badasses with a flair for this array of skills would be lemmings to throw themselves at it. It’s a desk-in-the-basement kind of job today, if it’s a job at all. To have any hope of building a mature data ecosystem, we must give a new generation of specialists a good home where they will be appreciated for flexing their specialist skills.

So what can you do?

If there are already people with these skills and talents who, despite a history of neglect, are stepping up in your organization to tackle data quality, are you encouraging them? Are you nurturing them? Are you rewarding them? I hope you are. Whereas if you’re creating incentives to chase the paychecks in buzzy MLOps or PhD-spangled data science, you’re shooting yourself (and our whole industry) in the foot.

Google’s People + AI Research (PAIR) team recently released the Data Cards Playbook to help train the community in data design, data transparency, data quality, and data documentation best practices. I’m so proud of our work and I’m thrilled those materials are freely available for everybody’s benefit, but there’s still so much to learn. If you’re on this path too and passionately championing data excellence, please share the lessons you’re learning with the rest of the world.

Get it here: bit.ly/datacardsplaybook (Image by Mahima Pushkarna, playbook co-creator, used with permission)

If a research paper falls in a forest and nobody uses it, did it make a sound? It’s a long journey from good ideas to an established discipline of excellence… a journey that needs all the cheerleading and amplifying it can get. If you believe in this and you can encourage even one other person to take it seriously, you’ll have played a vital part in building the future. Thanks in advance for spreading the word.

Our community has done a great job of celebrating data scientists. We’re doing a decent job of celebrating MLOps and data engineers. But we’re doing a pathetic job of celebrating the people on whom all the other data careers depend: the people who design data collection and are responsible for data excellence, documentation, and curation. Maybe we could start by naming them (I’d love to hear your suggestions) and at least acknowledging that they matter. From there, can we progress to training them, hiring them, and appreciating them for their specialized skills? I sure hope so.

If you had fun here and you’re looking for a complete applied AI course designed to be fun for beginners and experts alike, here’s the one I made for your amusement:

Enjoy the course on YouTube here.

P.S. Have you ever tried hitting the clap button here on Medium more than once to see what happens? ❤️

Here are some of my favorite 10-minute walkthroughs:

Although the site emphasizes data documentation and AI (gotta catch that zeitgeist), the Data Cards Playbook is so much more. It’s the strongest set of general data design resources I’m aware of. Preview:

Let’s be friends! You can find me on Twitter, YouTube, Substack, and LinkedIn. Interested in having me speak at your event? Use this form to get in touch.