of the universe (made by one of the most iconic singers ever) says this:
Wish I could go back
And change these years
I'm going through changes
(Black Sabbath – Changes)
This song is incredibly powerful and talks about how quickly life can change right in front of you.
That song is about a broken heart and a love story. Nevertheless, it also reminds me a lot of the changes that my job, as a Data Scientist, has undergone during the last 10 years of my career:
- When I began studying Physics, the only thing I thought of when someone said "Transformer" was Optimus Prime. Machine Learning for me was all about Linear Regression, SVM, Random Forest, etc.
- When I did my Master's Degree in Big Data and Physics of Complex Systems, I first heard of "BERT" and various Deep Learning technologies that seemed very promising at the time. The first GPT models came out, and they looked very interesting, even though nobody expected them to be as powerful as they are today.
- Fast forward to my life now as a full-time Data Scientist. Today, if you don't know what GPT stands for and have never read "Attention Is All You Need", you have very few chances of passing a Data Science System Design interview.
When people say that the tools and the everyday life of a person working with data are substantially different than 10 (or even 5) years ago, I agree all the way. What I disagree with is the idea that the tools used in the past must be erased simply because everything now appears to be solvable with GPT, LLMs, or Agentic AI.
The goal of this article is to consider a single task, which is classifying the love/hate/neutral intent of a tweet. In particular, we will do it with traditional Machine Learning, Deep Learning, and Large Language Models.
We'll do this hands-on, using Python, and we will describe why and when to use each approach. Hopefully, after this article, you'll learn that:
- The tools used in the early days should still be considered, studied, and at times adopted.
- Latency, accuracy, and cost must be evaluated when selecting the best algorithm for your use case.
- Changes in the Data Science world are important and should be embraced without fear 🙂
Let’s start!
1. The Use Case
The case we’re coping with is something that is definitely very adopted in Data Science/AI applications: sentiment evaluation. Which means, given a text, we would like to extrapolate the “feeling” behind the writer of that text. This may be very useful for cases where you ought to gather the feedback behind a given review of an object, a movie, an item you might be recommending, etc…
On this blog post, we’re using a really “famous” sentiment evaluation example, which is classifying the sensation behind a tweet. As I wanted more control, we won’t work with organic tweets scraped from the net (where labels are uncertain). As an alternative, we will probably be using content generated by Large Language Models that we will control.
This method also allows us to tune the problem and the range of the issue and to watch how different techniques react.
- Easy case: the love tweets sound like postcards, the hate ones are blunt, and the neutral messages talk about weather and coffee. If a model struggles here, something else is off.
- Harder case: still love, hate, neutral, but now we inject sarcasm, mixed tones, and subtle hints that demand attention to context. We also have less data, so there is a smaller dataset to train with.
- Extra hard case: we move to five emotions (love, hate, anger, disgust, envy), so the model has to parse richer, more layered sentences. Furthermore, we have 0 labeled entries: we cannot do any training.
I have generated the data and put each of the files in a specific folder of the public GitHub folder I have created for this project [data].
Our goal is to build a smart classification system that will be able to efficiently grasp the sentiment behind the tweets. But how should we do it? Let's figure it out.
2. System Design
An image that’s at all times extremely helpful to contemplate is the next:
Accuracy, cost, and scale in a Machine Learning system form a triangle. You’ll be able to only fully optimize two at the identical time.
You’ll be able to have a really accurate model that scales thoroughly with hundreds of thousands of entries, but it surely won’t be quick. You’ll be able to have a fast model that scales with hundreds of thousands of entries, but it surely won’t be that accurate. You’ll be able to have an accurate and quick model, but it surely won’t scale thoroughly.
These considerations are abstracted from the precise problem, but they assist guide which ML System Design to construct. We’ll come back to this.
Also, the ability of our model must be proportional to the dimensions of our training set. Generally, we attempt to avoid the training set error to diminish at the price of a rise within the test set (the famous overfitting).

We don’t wish to be within the Underfitting or Overfitting area. Let me explain why.
In easy terms, underfitting happens when your model is just too easy to learn the true pattern in your data. It’s like attempting to draw a straight line through a spiral. Overfitting is the alternative. The model learns the training data too well, including all of the noise, so it performs great on what it has already seen but poorly on recent data. The sweet spot is the center ground, where your model understands the structure without memorizing it.
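To make this concrete, here is a minimal, self-contained sketch (not part of the original notebook) that fits polynomials of increasing degree to noisy data: the degree-1 model underfits, the degree-15 model overfits, and the gap between train and test error tells the story.

```python
# Minimal under- vs overfitting sketch with polynomial regression on noisy sine data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:>2}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```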
We’ll come back to this one as well.
3. Easy Case: Traditional Machine Learning
We open with the friendliest scenario: a highly structured dataset of 1,000 tweets that we generated and labelled. The three classes (positive, neutral, negative) are balanced on purpose, the language is very explicit, and each row lives in a clean CSV.
Let’s start with a straightforward import block of code.
Let’s see what the dataset looks like:

Now, we anticipate that this won't scale to millions of rows (since the dataset is too structured to be diverse). Nevertheless, we can build a very quick and accurate method for this tiny and specific use case. Let's start with the modeling. Three main points to consider (a sketch of the pipeline follows this list):
- We're doing a train/test split with 20% of the dataset in the test set.
- We're going to use a TF-IDF approach to get the embeddings of the words. TF-IDF stands for Term Frequency–Inverse Document Frequency. It's a classic technique that transforms text into numbers by giving each word a weight based on how important it is in a document compared with the whole dataset.
- We'll combine this technique with two ML models: Logistic Regression and Support Vector Machines, from scikit-learn. Logistic Regression is simple and interpretable, often used as a strong baseline for text classification. Support Vector Machines focus on finding the best boundary between classes and typically perform very well when the data is not too noisy.
And the performance is practically perfect for both models.

For this very simple case, where we have a consistent dataset of 1,000 rows, a traditional approach gets the job done. No need for billion-parameter models like GPT.
4. Hard Case: Deep Learning
The second dataset is still synthetic, but it is designed to be annoying on purpose. Labels remain love, hate, and neutral, yet the tweets lean on sarcasm, mixed tone, and backhanded compliments. On top of that, the training pool is smaller while the validation slice stays large, so the models work with less evidence and more ambiguity.
Now that we have this ambiguity, we need to bring out the bigger guns. There are Deep Learning embedding models that maintain strong accuracy and still scale well in these cases (remember the triangle and the error versus complexity plot!). In particular, Deep Learning embedding models learn the meaning of words from their context instead of treating them as isolated tokens.
For this blog post, we will use BERT, which is one of the most famous embedding models out there. Let's first import some libraries:
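The original import cell is not shown; a minimal equivalent, assuming PyTorch and the Hugging Face `transformers` library (the original notebook may use a different stack), could be:

```python
# Hedged sketch of the imports for the BERT-based approach.
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
```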
… and a few helpers.
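The helpers are not reproduced either; a sketch of what they might look like, embedding each tweet as a mean-pooled BERT vector and evaluating a simple classifier on top of any feature matrix:

```python
# Load a pre-trained BERT encoder (inference only, no fine-tuning).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed_texts(texts, batch_size=32):
    """Return one mean-pooled BERT vector per text."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            hidden = bert(**batch).last_hidden_state        # (batch, tokens, 768)
            mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding tokens
            pooled = (hidden * mask).sum(1) / mask.sum(1)   # mean over real tokens
            vectors.append(pooled.numpy())
    return np.vstack(vectors)

def evaluate(train_X, test_X, train_y, test_y):
    """Fit a logistic regression head on the features and print a report."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_X, train_y)
    print(classification_report(test_y, clf.predict(test_X)))
```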
Thanks to these functions, we can quickly evaluate our embedding model against the TF-IDF approach.
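A sketch of how that comparison could be run; the file names, split, and column names are assumptions:

```python
# Hypothetical file names for the harder dataset and its held-out slice.
from sklearn.feature_extraction.text import TfidfVectorizer

train_df = pd.read_csv("data/hard_tweets_train.csv")
test_df = pd.read_csv("data/hard_tweets_test.csv")

# TF-IDF baseline
tfidf = TfidfVectorizer()
evaluate(tfidf.fit_transform(train_df["text"]), tfidf.transform(test_df["text"]),
         train_df["label"], test_df["label"])

# BERT embeddings
evaluate(embed_texts(train_df["text"].tolist()), embed_texts(test_df["text"].tolist()),
         train_df["label"], test_df["label"])
```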


As we can see, the TF-IDF model severely underperforms on the positive labels, while accuracy stays high when we use the embedding model (BERT).
5. Extra Hard Case: LLM Agent
Okay, now let’s make things VERY hard:
- We only have 100 rows.
- We assume we do not know the labels, meaning we cannot train any machine learning model.
- We have five labels: envy, hate, love, disgust, anger.

As we cannot train anything, but we still want to perform our classification, we must adopt a method that somehow already has the classification knowledge inside. Large Language Models are the perfect example of such a method.
Note that if we used LLMs for the other two cases, it would be like shooting a fly with a cannon. But here, it makes perfect sense: the task is difficult, and we have no way to do anything smarter, because we cannot train our model (we don't have the training set).
In this case, we get accuracy at a large scale. Nevertheless, the API takes a while, so we have to wait a second or two before the response comes back (remember the triangle!).
Let’s import some libraries:
And this is the classification API call:
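A zero-shot classification sketch; the model name, prompt wording, and file path are all assumptions:

```python
# Ask the LLM to pick exactly one of the five emotions for each tweet.
LABELS = ["envy", "hate", "love", "disgust", "anger"]

def classify_tweet(text: str) -> str:
    prompt = (
        "Classify the emotion of the following tweet as exactly one of: "
        f"{', '.join(LABELS)}. Answer with the label only.\n\nTweet: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # hypothetical choice of model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

df = pd.read_csv("data/extra_hard_tweets.csv")   # hypothetical path
df["predicted_label"] = df["text"].apply(classify_tweet)
```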
And we can see that the LLM does an amazing classification job:
6. Conclusions
Over the past decade, the role of the Data Scientist has changed as dramatically as the technology itself. This might lead to the idea of just using the most powerful tools out there, but that's NOT the best route in many cases.
Instead of reaching for the biggest model first, we tested one problem through a simple lens: accuracy, latency, and cost.
In particular, here's what we did, step by step:
- We defined our use case as tweet sentiment classification, aiming to detect love, hate, or neutral intent. We designed three datasets of increasing difficulty: a clean one, a sarcastic one, and a zero-training one.
- We tackled the easy case using TF-IDF with Logistic Regression and SVM. The tweets were clear and direct, and both models performed almost perfectly.
- We moved to the hard case, where sarcasm, mixed tone, and subtle context made the task more complex. We used BERT embeddings to capture meaning beyond individual words.
- Finally, for the extra hard case with no training data, we used a Large Language Model to classify emotions directly through zero-shot learning.
Each step showed how the right tool depends on the problem. Traditional ML is fast and reliable when the data is structured. Deep Learning models help when meaning hides between the lines. LLMs are powerful when you have no labels or need broad generalization.
7. Before you head out!
Thanks again for your time. It means a lot ❤️
My name is Piero Paialunga, and I’m this guy here:

I’m originally from Italy, hold a Ph.D. from the University of Cincinnati, and work as a Data Scientist at The Trade Desk in Latest York City. I write about AI, Machine Learning, and the evolving role of knowledge scientists each here on TDS and on LinkedIn. In case you liked the article and wish to know more about machine learning and follow my studies, you possibly can:
A. Follow me on Linkedin, where I publish all my stories
B. Follow me on GitHub, where you possibly can see all my code
C. For questions, you possibly can send me an email at
