Google’s Data Science Agent: Can It Really Do Your Job?


On March 3rd, Google officially rolled out its Data Science Agent to most Colab users for free. This is not something brand new — it was first announced in December last year — but it is now integrated into Colab and widely accessible.

Google calls it “The future of data analysis with Gemini”. But is it a real game-changer in data science? What can it actually do, and what can’t it do? Is it ready to replace data analysts and data scientists? And what does it tell us about the future of data science careers?

In this article, I’ll explore these questions with real-world examples.


What It Can Do

The Data Science Agent is easy to use:

  1. Open a new notebook in Google Colab — you only need a Google Account, and Google Colab is free to use;
  2. Click “Analyze files with Gemini” — this opens the Gemini chat window on the right;
  3. Upload your data file and describe your goal in the chat. The agent will generate a series of tasks accordingly;
  4. Click “Execute Plan”, and Gemini will start to write the Jupyter Notebook automatically.

Data Science Agent UI (image by author)

Let’s look at a real example. Here, I used the dataset from the Regression with an Insurance Dataset Kaggle Playground Prediction Competition (Apache 2.0 license). This dataset has 20 features, and the goal is to predict the insurance premium amount. It has both continuous and categorical variables, with scenarios like missing values and outliers. Therefore, it’s a good example dataset for machine learning practice.
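
Before uploading anything, it helps to glance at the file locally so you know what the agent will be working with. Here is a minimal sketch, assuming the competition’s train.csv file name; trimming to the first 50k rows was my own choice for a quick test, not part of the agent’s workflow:

```python
import pandas as pd

# Load the Kaggle training file (file name assumed from the competition download)
df = pd.read_csv("train.csv")

# Quick look at the 20 features: types, missing values, and basic stats
print(df.shape)
print(df.dtypes)
print(df.isna().sum().sort_values(ascending=False).head(10))
print(df.describe())

# Keep only the first 50k rows for a quick test before uploading to Colab
df.head(50_000).to_csv("train_50k.csv", index=False)
```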

Jupyter Notebook generated by the Data Science Agent (image by author)

After running my experiment, here are the highlights I observed from the Data Science Agent’s performance:

  • Customizable execution plan: Based on my prompt, the Data Science Agent first came up with a series of 10 tasks. This is a pretty standard and reasonable way of conducting exploratory data analysis and building a machine learning model. It then asked for my confirmation and feedback before executing the plan. I tried asking it to focus on exploratory data analysis first, and it was able to adjust the execution plan accordingly. This gives you the flexibility to customize the plan based on your needs.

Initial tasks the agent generated (image by author)

Plan adjustment based on feedback (image by author)

  • End-to-end execution and autocorrection: After confirming the plan, the Data Science Agent was able to execute it end-to-end autonomously. Whenever it encountered errors while running Python code, it diagnosed what was wrong and attempted to correct the error on its own. For example, at the model training step, it first ran into a DTypePromotionError because a datetime column was included in training. It decided to drop that column in the next attempt, but then got the error ValueError: Input X contains NaN. In its third attempt, it added a SimpleImputer to impute all missing values with the mean of each column, and the step finally worked. A minimal sketch of the fix it converged on appears after the figure below.

The agent ran into an error and auto-corrected it (image by author)
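
For reference, here is roughly the shape of the fix the agent converged on. This is a sketch, not the agent’s exact code; the column names are my assumptions based on the competition data, and I show only one of the three models it trained:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("train_50k.csv")

# Attempt 2: drop the datetime column that caused the DTypePromotionError
# (column names are assumptions, not the agent's exact code)
X = df.drop(columns=["Premium Amount", "Policy Start Date"])
y = df["Premium Amount"]

# Keep numeric features for this sketch; the agent handled categoricals with one-hot encoding
X_num = X.select_dtypes(include="number")

# Attempt 3: impute remaining missing values with the column mean
# to resolve "ValueError: Input X contains NaN"
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_num)

model = RandomForestRegressor(random_state=42)
model.fit(X_imputed, y)
```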

  • Interactive and iterative notebook: Because the Data Science Agent is built into Google Colab, it populates a Jupyter Notebook as it executes. This comes with several benefits:
    • Real-time visibility: Firstly, you can actually watch the Python code running in real time, including the error messages and warnings. The dataset I provided was a bit large — even though I only kept the first 50k rows for the sake of a quick test — and it took about 20 minutes to finish the model optimization step in the Jupyter Notebook. The notebook kept running without timing out, and I received a notification once it finished.
    • Editable code: Secondly, you can edit the code on top of what the agent has built for you. This is clearly better than the official Data Analyst GPT in ChatGPT, which also runs code and shows the result but requires you to copy and paste the code elsewhere to make manual iterations.
    • Seamless collaboration: Lastly, having a Jupyter Notebook makes it very easy to share your work with others — you can now collaborate with both AI and your teammates in the same environment. The agent also drafted step-by-step explanations and key findings, making it much more presentation-friendly.

Summary section generated by the agent (image by author)


What It Cannot Do

We’ve talked about its advantages; now let’s discuss some missing pieces I’ve noticed that keep the Data Science Agent from being a true autonomous data scientist.

  • It doesn’t modify the notebook based on follow-up prompts. I mentioned that the Jupyter Notebook environment makes it easy to iterate. In this example, after its initial execution, I noticed the feature importance charts didn’t have feature labels. Therefore, I asked the agent to add the labels. I assumed it would update the Python code directly, or at least add a new cell with the refined code. However, it merely provided the revised code in the chat window, leaving the actual notebook update to me (a sketch of that kind of revision is shown after the figures below). Similarly, when I asked it to add a new section with recommendations for lowering insurance premium costs, it added a markdown response with its suggestion in the chat 🙁 Although copy-pasting the code or text isn’t a big deal for me, I still felt disappointed – once the notebook is generated in its first pass, all further interactions stay in the chat, just like ChatGPT.

My follow-up on updating the feature importance chart (image by author)

My follow-up on adding recommendations (image by author)
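
For context, the revision I ended up pasting in myself looked roughly like the following. It is a sketch that continues from the training snippet earlier, so the variable names are illustrative rather than the agent’s own:

```python
import numpy as np
import matplotlib.pyplot as plt

# Continuing from the training sketch above: `model` is the fitted RandomForestRegressor
# and X_num holds the numeric training features (illustrative variable names)
feature_names = np.array(X_num.columns)
importances = model.feature_importances_
order = np.argsort(importances)

# Horizontal bar chart with the feature names as labels
plt.figure(figsize=(8, 6))
plt.barh(feature_names[order], importances[order])
plt.xlabel("Feature importance")
plt.title("Feature importance with labels")
plt.tight_layout()
plt.show()
```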

  • It doesn’t always choose the best data science approach. For this regression problem, it followed a reasonable workflow – data cleaning (handling missing values and outliers), data wrangling (one-hot encoding and log transformation), feature engineering (adding interaction features and other new features), and training and optimizing three models (Linear Regression, Random Forest, and Gradient Boosting Trees). However, when I looked into the details, I realized not all of its operations were necessarily best practices. For example, it imputed missing values using the mean, which may not be a good idea for heavily skewed data and can distort correlations and relationships between variables. Also, we normally test many different feature engineering ideas and see how they impact the model’s performance. Therefore, while it sets up a solid foundation and framework, an experienced data scientist is still needed to refine the analysis and modeling, for instance by switching to median imputation for skewed columns, as sketched below.
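
Here is a minimal sketch of that kind of refinement, continuing from the numeric feature table X_num above; the skewness threshold of 1 is my choice, not something the agent suggested:

```python
from sklearn.impute import SimpleImputer

# Impute heavily skewed columns with the median instead of the mean,
# which is more robust to outliers and skewed distributions
skewed_cols = [c for c in X_num.columns if abs(X_num[c].skew()) > 1]
other_cols = [c for c in X_num.columns if c not in skewed_cols]

for cols, strategy in [(skewed_cols, "median"), (other_cols, "mean")]:
    if cols:
        X_num[cols] = SimpleImputer(strategy=strategy).fit_transform(X_num[cols])
```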

These are the two main limitations of the Data Science Agent’s performance in this experiment. But if we think about the whole data project pipeline and workflow, there are broader challenges in applying this tool to real-world projects:

  • What’s the goal of the project? This dataset is provided by Kaggle for a playground competition, so the project goal is well-defined. However, a data project at work can be pretty ambiguous. We often need to talk to many stakeholders to understand the business goal, and go back and forth many times to make sure we stay on the right track. This is not something the Data Science Agent can handle for you. It requires a clear goal to generate its list of tasks. In other words, if you give it the wrong problem statement, the output will be useless.
  • How do we get a clean dataset with documentation? Our example dataset is relatively clean, with basic documentation. However, this rarely happens in industry. Every data scientist or data analyst has probably experienced the pain of talking to multiple people just to find one data point, solving the mystery of some random columns with confusing names, and putting together hundreds of lines of SQL to prepare the dataset for analysis and modeling. This sometimes takes 50% of the actual work time. In that case, the Data Science Agent can only help with the start of the other 50% of the work (so maybe 10 to 20%).

Who Are the Target Users?

With the pros and cons in mind, who are the target users of the Data Science Agent? Or who will benefit the most from this new AI tool? Here are my thoughts:

  1. Aspiring data scientists. Data science is still a hot space, with plenty of beginners starting every day. Given that the agent “understands” the standard process and basic concepts well, it can provide valuable guidance to those just getting started, setting up a great framework and explaining the techniques with working code. For example, many people tend to learn from participating in Kaggle competitions. Just like what I did here, they can ask the Data Science Agent to generate an initial notebook, then dig into each step to understand why the agent does certain things and what can be improved.
  2. People with clear data questions but limited coding skills. The key requirements here are that (1) the problem is clearly defined and (2) the data task is standard (not as complicated as optimizing a predictive model with 20 columns). Let me give you some scenarios:
    • Many researchers need to run analyses on the datasets they collected. They typically have a clearly defined data question, which makes it easier for the Data Science Agent to assist. Moreover, researchers usually have a good understanding of basic statistical methods but might not be as proficient in coding. So the agent can save them the time of writing code, while they can still judge the correctness of the methods the AI used. This is the same use case Google mentioned when it first introduced the Data Science Agent.
    • Product managers often need to do some basic analysis themselves — they need to make data-driven decisions. They know their questions well (and often the potential answers), and they can pull some data from internal BI tools or with the help of engineers. For example, they might want to check the correlation between two metrics or understand the trend of a time series. In that case, the Data Science Agent can help them conduct the analysis with the problem context and data they provide (see the sketch after this list).
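
To make that scenario concrete, here is a minimal sketch of the kind of analysis the agent could generate for a product manager. The file and column names are made up purely for illustration:

```python
import pandas as pd

# Hypothetical export from an internal BI tool: daily signups and support tickets
df = pd.read_csv("daily_metrics.csv", parse_dates=["date"])

# Correlation between two metrics
corr = df["signups"].corr(df["support_tickets"])
print(f"Correlation between signups and support tickets: {corr:.2f}")

# Trend of a time series: weekly totals smoothed with a 4-week rolling average
weekly = df.set_index("date")["signups"].resample("W").sum()
print(weekly.rolling(4).mean().tail())
```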

Can It Replace Data Analysts and Data Scientists Yet?

We finally come to the question that every data scientist or analyst cares about the most: Is it ready to replace us yet?

The short answer is “No”. There are still major blockers for the Data Science Agent to be a real data scientist — it can’t modify the Jupyter Notebook based on follow-up questions, it still requires someone with solid data science knowledge to audit the methods and make manual iterations, and it needs a clear data problem statement with a clean and well-documented dataset.

However, AI is a fast-evolving space that keeps improving significantly. Just looking at where it came from and where it stands now, here are some important lessons for data professionals to stay competitive:

  1. AI is a tool that greatly improves productivity. Instead of worrying about being replaced by AI, it is better to embrace the benefits it brings and learn how it can improve your work efficiency. Don’t feel guilty if you use it to write basic code — nobody remembers all of the numpy and pandas syntax and scikit-learn models 🙂 Coding is a tool for completing complex statistical analysis quickly, and AI is a new tool that saves you even more time.
  2. If your work consists mostly of repetitive tasks, then you are at risk. It is very clear that these AI agents are getting better and better at automating standard, basic data tasks. If your job today is mostly making basic visualizations, building standard dashboards, or doing simple regression analysis, then the day AI automates your job might come sooner than you expect.

  3. Being a domain expert and a good communicator will set you apart. To make AI tools work, you must understand your domain well and be able to communicate and translate business knowledge and problems to both your stakeholders and the AI tools. When it comes to machine learning, we always say “Garbage in, garbage out”. The same applies to an AI-assisted data project.
