You’ve argued that a well-designed experiment can teach you more than knowing the counterfactual. In practice, where experimentation remains underused, what’s your minimum viable experiment when data is scarce or stakeholders are impatient?
I do think that experimentation remains underused, and it may be even more underused now than it has been historically. Observational data is cheaper, easier to access, and more abundant with every passing day – and that’s an excellent thing. But for this reason, I don’t think many data scientists have what Paul Rosenbaum called the “experimental frame of mind” in his book. In other words, I believe that observational data has crowded out experimental data in a variety of places. While observational data can legitimately be used for causal analysis, experimental data will always be the gold standard.
One of my mentors frequently says “some testing is better than no testing.” That is an effective, pragmatic philosophy in industry. In business, learning doesn’t have intrinsic value – we don’t run experiments simply to learn, we run them to add value. Because experimental learnings have to be converted into economic value, they must be balanced against the cost of experimentation, which can also be measured in economic value. We only want to do things that have a net benefit to the organization. For this reason, statistically ideal experiments are often not economically ideal. I believe data scientists’ focus must be on understanding the different levels of business constraints on the experimental design and articulating how those constraints will impact the value of the learnings. With those key ingredients, the right compromises can be made that result in experiments with a positive value impact to the organization overall. In my mind, a minimum viable experiment is one that stakeholders are willing to sign off on and that is expected to have a positive economic impact to the firm.
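To make that trade-off concrete, here is a minimal sketch (an illustration of mine, assuming a two-arm test analyzed with a two-sample t-test) of how a standard power calculation quantifies what a budget constraint costs you in learning:

```python
# A minimal sketch: trading experiment cost against learning value with a
# standard power analysis for a two-arm A/B test (two-sample t-test).
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Sample size needed per arm to detect a small effect (Cohen's d = 0.2)
# at the conventional alpha = 0.05 and 80% power.
n_ideal = power_analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"Statistically ideal: ~{n_ideal:.0f} users per arm")

# If stakeholders will only fund 500 users per arm, the smallest effect
# we can reliably detect grows -- that is the learning we trade away.
detectable = power_analysis.solve_power(nobs1=500, alpha=0.05, power=0.8)
print(f"With 500 per arm, minimum detectable effect: d = {detectable:.2f}")
```

Shrinking the sample cuts the cost of the experiment, but the minimum detectable effect grows – that lost sensitivity is exactly what needs to be weighed against the savings.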
Where has AI improved your day-to-day workflow as a practicing and leading data scientist, and where has it made things worse?
Generative AI has made me a more productive data scientist overall. I do, however, think there are drawbacks if we “abuse” it.
Improvements to productivity
Coding
I leverage GenAI to make my coding faster – right now I use it to help (1) write and (2) debug code.
Much of the productivity I see from GenAI is related to writing basic Python code. GenAI can write basic snippets of code faster than I can. I often find myself telling ChatGPT to write a fairly simple function, and I reply to a message or read an email while it writes the code. When ChatGPT first came out, I found that the code was often pretty bad and required a lot of debugging. But now, the code is generally pretty good – of course I’m always going to review and test the generated code, but the higher quality of the generated code increases my productivity even more.
Generally, Python error notifications are pretty helpful, but sometimes they’re cryptic. It is really nice to just copy/paste an error and immediately get clues as to what’s causing it. Before, I would have to spend a lot of time parsing through Stack Overflow and other similar sites, hoping to find a post close enough to my problem to help. Now I can debug much faster.
I haven’t used GenAI to write code documentation or answer questions about codebases yet, but I hope to experiment with these capabilities in the future. I’ve heard really good things about these tools.
Research
The second way that I use GenAI to increase my productivity is in research. I have found GenAI to be a great study companion as I research and study data science topics. I’m always careful not to believe everything it generates, but I have found that the material is generally quite accurate. When I want to learn something, I usually find a paper or published book to read through. Often, I’ll have questions about parts of the texts that aren’t clear, and ChatGPT does a pretty good job of clarifying things I find confusing.
I have also found ChatGPT to be an excellent resource for finding resources. I can tell it that I’m trying to solve a specific type of problem at work and that I want it to refer me to papers and books that cover the topic. I’ve found its recommendations to generally be pretty helpful.
Drawback: Substituting artificial intelligence for actual intelligence
Socrates was skeptical of storing knowledge in writing (that’s why we primarily learn about him through Plato’s books – Socrates didn’t write). One of his concerns with writing was that it makes our memory worse – that we rely on external writing instead of relying on our internal memorization and deep understanding of topics. I have this concern for myself and humanity with GenAI. Because it is always available, it is easy to just ask the same things over and over again and never remember or even understand the things that it generates. I know that I’ve asked it to write similar code multiple times. Instead, I should ask it once, take notes, and memorize the techniques and approaches it generates. While that’s the ideal, it can definitely be a challenge to hold to that standard when I have deadlines, emails, chats, etc. vying for my time. Basically, I’m concerned that we’ll use artificial intelligence as a substitute for actual intelligence rather than as a complement and multiplier.
I’m also concerned that access to quick answers leads to a shallow understanding of topics. We can generate an answer to anything and get the ‘gist’ of the information. This can often result in knowing just enough to ‘be dangerous.’ That is why I use GenAI as a complement to my studies, not as a primary source.
You’ve written about breaking into data science, and you’ve hired interns. If you were advising a career-switcher today, which “break-in” tactics still work, which have aged poorly, and what early signals really predict success on a team?
I believe that all of the tactics I’ve shared in previous articles still apply today. If I were to write the article again, though, I would probably add two points.
One is that not everyone is looking for GenAI experience in data science. It’s an important and trendy skill, but there are still a lot of what I would call “traditional” data science positions that require traditional data science skills. Make sure you understand which type of position you are applying for. Don’t send a GenAI-saturated resume to a traditional position or vice versa.
The second is to pursue an intellectual mastery of the fundamentals of data science. Actual intelligence is a differentiator in the age of artificial intelligence. The academic field has become pretty crowded with short data science master’s programs that often seem to teach people just enough to have a superficial conversation about data science topics, train a cookie-cutter model in Python, and rattle off a few buzzwords. Our interview process elicits deeper conversations on topics – that is where candidates with shallow knowledge go off the rails. For example, many interns have told me in interviews that accuracy is a good performance measurement for regression models. Accuracy is often not even a good performance metric for classification problems, and it doesn’t make any sense for regression. Candidates who say this know that accuracy is a performance metric and not much more. You want to develop a deep understanding of the fundamentals so you can have in-depth conversations in interviews at first, and later effectively solve analytics problems.
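To make the accuracy point concrete, here is a minimal sketch (mine, not a question from our interviews): accuracy is defined for discrete class labels, while regression calls for error-based metrics such as MSE or R².

```python
# A minimal sketch: accuracy only makes sense when predictions are
# discrete labels; regression needs error-based metrics.
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Classification: discrete labels, so accuracy is at least well-defined
# (though it can still mislead on imbalanced classes).
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print("Accuracy:", accuracy_score(y_true_cls, y_pred_cls))  # 0.8

# Regression: continuous predictions almost never match exactly, so
# "fraction exactly correct" is meaningless; measure error instead.
y_true_reg = [2.5, 0.3, 4.1, 7.8]
y_pred_reg = [2.4, 0.5, 3.9, 8.1]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("R^2:", r2_score(y_true_reg, y_pred_reg))
```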
You have written about a wide range of topics on TDS. How do you decide what to write about next?
Generally, the inspiration for my topics comes from a mixture of necessity and curiosity.
Necessity
Often I want to get a deeper understanding of a topic because of a problem I’m trying to solve at work. This leads me to research and study to gain more in-depth knowledge. After learning more, I’m usually pretty excited to share my knowledge. My series on linear programming is a good example of this. I had taken a linear programming course in college (which I really enjoyed), but I didn’t feel like I had a deep mastery of the subject. At work, I had a project that was using linear programming for a prescriptive analytics optimization engine. I decided I wanted to become an expert in linear programming. I bought a textbook, read it, replicated a lot of the processes from scratch in Python, and wrote some articles to share the knowledge that I had recently mastered.
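For readers who have never seen one, here is a toy linear program (an illustration of mine, not the actual engine from that project), solved with scipy:

```python
# A toy prescriptive-analytics example: choose production quantities to
# maximize profit subject to resource constraints, as a linear program.
from scipy.optimize import linprog

# Maximize 3x + 5y  <=>  minimize -3x - 5y (linprog minimizes).
c = [-3, -5]

# Resource constraints: 2x + y <= 100 (machine hours), x + 3y <= 90 (labor).
A_ub = [[2, 1], [1, 3]]
b_ub = [100, 90]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)],
                 method="highs")
print("Optimal quantities:", result.x)   # [42. 16.]
print("Maximum profit:", -result.fun)    # 206.0
```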
Curiosity
I’ve always been an intensely curious person, and learning has been fun for me. Because of these personality traits, I’m often reading books and thinking about topics that seem interesting. This naturally generates a never-ending backlog of things to write about. My curiosity-driven approach has two elements – (1) reading/researching and (2) taking intentional time away from the books to digest what I read and make connections – what Kethledge and Erwin refer to as the definition of solitude in their book. This combined approach is far greater than the sum of its parts. If I just read the whole time and didn’t take time to think about what I was reading, I wouldn’t internalize the information or come up with my own unique insights on the material. If I just thought about things, I’d be ignoring lifetimes of research by other people. By combining both elements, I learn a lot and I have insights and opinions about what I learn.
The data science and philosophy series I wrote is a good example of curiosity-driven articles. I got really curious about philosophy a few years ago. I read multiple books and watched some lectures on it. I also took a lot of time to set the books down and just think about the ideas in them. That’s when I realized that many of the concepts I studied in philosophy had strong implications for, and connections to, my work as a data scientist. I wrote down my thoughts and had the outline for my first article series!
What does your drafting workflow for an article look like? How do you decide when to include code or visuals, and who (if anyone) do you ask to review your draft before you publish it?
Typically I’ll have mulled over an idea for an article for a few months before I start writing. At any given point in time I have 2-4 article ideas in my head. Because of the length of time that I think about articles, I usually have a pretty good structure before I start writing. When I start writing, I put the headers in the article first, then I write down good sentences that I previously came up with. At that point, I start filling in the gaps until I feel that the article gives a clear picture of the thoughts I’ve generated through my studies and contemplations. This process works very well for my goal of writing one article every month. If I wanted to write more, I’d probably have to be a little more intentional and less organic in my process.
Any time I find myself writing a paragraph that’s painful to write and read, I try to come up with a graphic or visual to replace it. Graphics with concise commentary can be really powerful and far better at creating understanding than a lengthy, cumbersome paragraph.
I often insert code for the same reason that I include visuals. It’s annoying to read a verbal description of what code is doing – it’s far better to just read well-commented code. I also like putting code in articles to demonstrate “baby” solutions to problems that any practitioner would use pre-built packages to actually solve. It helps me (and hopefully others) get an intuitive understanding of what is happening under the hood.
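As an example of what I mean by a “baby” solution (an illustrative sketch, not code from one of my articles), here is ordinary least squares fit from scratch via the normal equations and checked against scikit-learn:

```python
# A "baby" solution: fit ordinary least squares by hand via the normal
# equations, then confirm against a pre-built package.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# From scratch: beta = (X'X)^(-1) X'y, with a column of ones for the intercept.
X1 = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
print("From scratch:", beta)

# What a practitioner would actually use:
model = LinearRegression().fit(X, y)
print("Pre-built:   ", [model.intercept_, *model.coef_])
```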
To learn more about Jarom’s work and stay up-to-date with his latest articles, you can follow him on TDS or LinkedIn.