The attention mechanism is a game changer in Machine Learning. In fact, in the recent history of Deep Learning, the idea of allowing models to focus on the most relevant parts of an input sequence when making a prediction completely revolutionized the way we look at Neural Networks.
That being said, there is one controversial take that I have about the attention mechanism:
The best way to learn the attention mechanism is not through Natural Language Processing (NLP)
It’s (technically) a controversial take for two reasons.
- First, people naturally use NLP cases (e.g., translation or NSP) because NLP is the reason the attention mechanism was developed in the first place. The original goal was to overcome the limitations of RNNs and CNNs in handling long-range dependencies in language (if you haven’t already, you should really read the paper Attention Is All You Need).
- Second, I will also have to admit that the general idea of putting the “attention” on a specific word for translation tasks is very intuitive.
That being said, if we want to understand how attention REALLY works in a hands-on example, I believe that Time Series is the best framework to use. There are many reasons why I say that.
- Computers are not really “made” to work with strings; they work with ones and zeros. All the embedding steps that are necessary to convert text into vectors add an extra layer of complexity that is not strictly related to the attention idea.
- The attention mechanism, though it was first developed for text, has many other applications (for example, in computer vision), so I like the idea of exploring attention from another angle as well.
- With time series specifically, we can create very small datasets and run our attention models in minutes (yes, including the training) without any fancy GPUs.
In this blog post, we will see how we can build an attention mechanism for time series, specifically in a classification setup. We will work with sine waves, and we will try to distinguish a normal sine wave from a “modified” sine wave. The “modified” sine wave is created by flattening a portion of the original signal. That is, at a certain location in the wave, we simply remove the oscillation and replace it with a flat line, as if the signal had temporarily stopped or become corrupted.
To make things more general, we will assume that the sine wave can have any frequency or amplitude, and that the location and extension (we call it length) of the “rectified” part are also parameters. In other words, the sine can be any sine, and we can put our “straight line” wherever we like on the sine wave.
Well, okay, but why should we even bother with the attention mechanism? Why don’t we use something simpler, like Feed Forward Neural Networks (FFNNs) or Convolutional Neural Networks (CNNs)?
Well, because again we are assuming that the “modified” signal can be “flattened” anywhere (at any location in the time series), and that it can be flattened for any length (the rectified part can have any length). This means that a standard Neural Network is not that efficient, because the anomalous “part” of the time series is not always in the same portion of the signal. In other words, if you just try to deal with this using a linear weight matrix + a non-linear function, you will get suboptimal results, because index 300 of time series 1 can be completely different from index 300 of time series 14. What we need instead is a dynamic approach that puts the attention on the anomalous part of the series. This is why (and where) the attention method shines.
This blog post will be divided into four steps:
- Code Setup. Before getting into the code, I’ll show the setup, with all the libraries we are going to need.
- Data Generation. I’ll provide the code that we’ll need for the data generation part.
- Model Implementation. I’ll provide the implementation of the attention model.
- Exploration of the results. The benefit of the attention model will be displayed through the attention scores and classification metrics used to assess the performance of our approach.
It looks like we have plenty of ground to cover. Let’s get started! 🚀
1. Code Setup
Before delving into the code, let’s invoke some friends that we’ll need for the rest of the implementation.
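The original import cell isn’t reproduced here, so below is a minimal sketch of what it might contain, inferred from the libraries used later in the post (the exact list is an assumption):

```python
# Assumed imports, inferred from the rest of the post: numpy for the waves,
# torch for the model, sklearn for the metrics, matplotlib for the plots.
import json

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn import metrics
```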
These are just standard libraries that are used throughout the project. What you see below is the short and sweet requirements.txt file.
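The file itself isn’t shown here; assuming only the core libraries above, a plausible version would be:

```text
numpy
matplotlib
torch
scikit-learn
```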
I like it when things are easy to change and modular. For this reason, I created a .json file where we can change everything about the setup. Some of these parameters are:
- The number of normal vs. abnormal time series (the ratio between the two)
- The number of time series steps (how long your time series is)
- The size of the generated dataset
- The min and max locations and lengths of the linearized part
- Much more.
The .json file looks like this.
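The original file isn’t reproduced here, so here is a hypothetical config.json covering the parameters listed above (all key names and values are illustrative, not the author’s exact ones):

```json
{
    "num_samples": 10000,
    "anomaly_ratio": 0.5,
    "num_steps": 500,
    "min_loc": 50,
    "max_loc": 400,
    "min_length": 30,
    "max_length": 100,
    "train_split": 0.7,
    "val_split": 0.15,
    "seed": 42
}
```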
So, before going to the next step, make sure you have:
- The constants.py file in your work folder
- The .json file in your work folder or in a path that you remember
- The libraries in the requirements.txt file installed
2. Data Generation
Two simple functions build the normal sine wave and the modified (rectified) one. The code for this can be found in data_utils.py:
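The actual data_utils.py isn’t shown here; a minimal sketch of the two functions, under the assumptions described above (names and signatures are mine), could be:

```python
import numpy as np

def generate_sine(num_steps, freq, amplitude, phase=0.0):
    """Build a plain sine wave with the given frequency and amplitude."""
    t = np.linspace(0.0, 1.0, num_steps)
    return amplitude * np.sin(2 * np.pi * freq * t + phase)

def rectify_sine(wave, loc, length):
    """Flatten a portion of the wave: starting at index `loc`, replace
    `length` samples with the constant value found at the start."""
    modified = wave.copy()
    end = min(loc + length, len(wave))
    modified[loc:end] = modified[loc]
    return modified
```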
Now that we have the basics, we can do all the backend work in data.py. This is meant to be the function that does it all:
- Receives the setup information from the .json file (that’s why you need it!)
- Builds the modified and normal sine waves
- Does the train/val/test split for the model validation
The data.py script is the following:
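A sketch of what data.py could look like, reusing the hypothetical helpers and config keys from the previous snippets (the real script may be organized differently):

```python
import json
import numpy as np
from data_utils import generate_sine, rectify_sine  # hypothetical module from above

def build_dataset(config_path):
    """Read the .json setup, build normal and rectified sine waves with random
    parameters, and return (train, val, test) splits as (X, y) tuples."""
    with open(config_path) as f:
        cfg = json.load(f)

    rng = np.random.default_rng(cfg["seed"])
    waves, labels = [], []
    for _ in range(cfg["num_samples"]):
        wave = generate_sine(
            cfg["num_steps"],
            freq=rng.uniform(1.0, 10.0),      # "whatever frequency"
            amplitude=rng.uniform(0.5, 2.0),  # "whatever amplitude"
        )
        is_anomalous = rng.random() < cfg["anomaly_ratio"]
        if is_anomalous:
            loc = int(rng.integers(cfg["min_loc"], cfg["max_loc"]))
            length = int(rng.integers(cfg["min_length"], cfg["max_length"]))
            wave = rectify_sine(wave, loc, length)
        waves.append(wave)
        labels.append(int(is_anomalous))

    X = np.array(waves, dtype=np.float32)
    y = np.array(labels, dtype=np.float32)
    n_train = int(cfg["train_split"] * len(X))
    n_val = n_train + int(cfg["val_split"] * len(X))
    return (X[:n_train], y[:n_train]), (X[n_train:n_val], y[n_train:n_val]), (X[n_val:], y[n_val:])
```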
The additional data script is the one that prepares the data for Torch (SineWaveTorchDataset), and it looks like this:
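The class name comes from the post; the body below is a sketch, assuming each sample is a univariate sequence with one feature per time step:

```python
import torch
from torch.utils.data import Dataset

class SineWaveTorchDataset(Dataset):
    """Wrap the numpy waves and labels so Torch DataLoaders can consume them."""
    def __init__(self, waves, labels):
        # Shape (num_samples, num_steps, 1): one feature per time step.
        self.waves = torch.tensor(waves, dtype=torch.float32).unsqueeze(-1)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.waves)

    def __getitem__(self, idx):
        return self.waves[idx], self.labels[idx]
```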
If you want to have a look, this is a random anomalous time series:
And this is a non-anomalous time series:

Now that we have our dataset, we can worry about the model implementation.
3. Model Implementation
The implementation of the model, the training, and the loader can be found in the model.py code:
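The full model.py isn’t reproduced here; below is a minimal sketch of the model part (a BiLSTM encoder with additive attention and a linear head), with class and attribute names that are my assumptions rather than the author’s exact code:

```python
import torch
import torch.nn as nn

class AttentionLSTMClassifier(nn.Module):
    """BiLSTM encoder + additive attention + linear classification head."""
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim,
                            batch_first=True, bidirectional=True)
        # Additive attention: one scalar score per time step.
        self.attn = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):
        h, _ = self.lstm(x)                              # (batch, T, 2*hidden)
        scores = self.attn(h).squeeze(-1)                # (batch, T)
        alpha = torch.softmax(scores, dim=1)             # attention weights
        context = (alpha.unsqueeze(-1) * h).sum(dim=1)   # weighted sum over time
        logit = self.classifier(context).squeeze(-1)     # raw score for BCE loss
        return logit, alpha                              # alpha feeds the plots later
```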
Now, let me take some time to explain why the attention mechanism is a game-changer here. Unlike an FFNN or a CNN, which would treat all time steps equally, attention dynamically highlights the parts of the sequence that matter most for classification. This allows the model to “zoom in” on the anomalous section (regardless of where it appears), making it especially powerful for irregular or unpredictable time series patterns.
Let me be more precise here and talk about the Neural Network.
In our model, we use a bidirectional LSTM to process the time series, capturing both past and future context at every time step. Then, instead of feeding the LSTM output directly into a classifier, we compute attention scores over the whole sequence. These scores determine how much weight each time step should have when forming the final context vector used for classification. This means the model learns to focus only on the meaningful parts of the signal (i.e., the flat anomaly), no matter where they occur.
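In formulas, this is the standard additive-attention recipe (the exact parametrization in the code sketch above may differ slightly in details):

$$e_t = \mathbf{v}^\top \tanh\left(W \mathbf{h}_t + \mathbf{b}\right), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{s=1}^{T} \exp(e_s)}, \qquad \mathbf{c} = \sum_{t=1}^{T} \alpha_t \mathbf{h}_t$$

where $\mathbf{h}_t$ is the BiLSTM output at time step $t$, $\alpha_t$ is the attention score, and $\mathbf{c}$ is the context vector that goes into the classifier.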
Now let’s connect the model and the data to see the performance of our approach.
4. A practical example
4.1 Training the Model
Given the large backend part that we developed, we can train the model with this super simple block of code.
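The block isn’t shown here; a sketch of what it might look like, using the hypothetical names from the previous snippets (train_model is an assumed backend helper, not a function defined in this post):

```python
from torch.utils.data import DataLoader

# Build the splits and wrap them for Torch (names from the earlier sketches).
(X_train, y_train), (X_val, y_val), (X_test, y_test) = build_dataset("config.json")
train_loader = DataLoader(SineWaveTorchDataset(X_train, y_train), batch_size=32, shuffle=True)
val_loader = DataLoader(SineWaveTorchDataset(X_val, y_val), batch_size=32)

model = AttentionLSTMClassifier()
# Assumed backend helper: trains with BCE loss and early stopping on the val loss.
model = train_model(model, train_loader, val_loader, epochs=50)
```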
This took around 5 minutes to complete on the CPU.
Notice that we implemented (on the backend) early stopping and a train/val/test split to avoid overfitting. We are responsible kids.
4.2 Attention Mechanism
Let’s use the following function to display the attention scores alongside the sine function.
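The author’s plotting function isn’t shown; a sketch that overlays the attention weights on the signal (axis labels and styling are mine) could be:

```python
import matplotlib.pyplot as plt
import torch

def plot_attention(model, wave):
    """Plot one time series and overlay its attention scores on a twin axis."""
    model.eval()
    with torch.no_grad():
        x = torch.tensor(wave, dtype=torch.float32).reshape(1, -1, 1)
        _, alpha = model(x)                   # alpha: (1, T)

    fig, ax1 = plt.subplots(figsize=(10, 4))
    ax1.plot(wave, color="tab:blue", label="signal")
    ax1.set_xlabel("time step")
    ax1.set_ylabel("signal")
    ax2 = ax1.twinx()
    ax2.plot(alpha.squeeze(0).numpy(), color="tab:red", label="attention")
    ax2.set_ylabel("attention score")
    fig.tight_layout()
    plt.show()
```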
Let’s show the attention scores for a normal time series.

As we can see, the attention scores are localized (with a sort of time shift) on the areas where there is a flat part, which can be near the peaks. However, again, these are only localized spikes.
Now let’s take a look at an anomalous time series.

As we can see here, the model recognizes (with the same time shift) the area where the function flattens out. However, this time, it is not a localized peak: it is a whole section of the signal where we have higher than usual scores. Bingo.
4.3 Classification Performance
Okay, this is nice and all, but does it actually work? Let’s implement the function to generate the classification report.
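The function isn’t reproduced here; a sketch built on scikit-learn (the author’s exact implementation may differ) could look like this:

```python
import numpy as np
import torch
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def report_classification(model, loader):
    """Run the model on a test DataLoader and print the usual metrics."""
    model.eval()
    y_true, y_prob = [], []
    with torch.no_grad():
        for x, y in loader:
            logits, _ = model(x)
            y_prob.extend(torch.sigmoid(logits).numpy())
            y_true.extend(y.numpy())
    y_true, y_prob = np.array(y_true), np.array(y_prob)
    y_pred = (y_prob > 0.5).astype(int)

    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision : {precision_score(y_true, y_pred):.4f}")
    print(f"Recall : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print(f"ROC AUC Score : {roc_auc_score(y_true, y_prob):.4f}")
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
```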
The results are the following:
Accuracy : 0.9775
Precision : 0.9855
Recall : 0.9685
F1 Score : 0.9769
ROC AUC Score : 0.9774

Confusion Matrix:
[[1002   14]
 [  31  953]]
Very high performance in terms of all the metrics. Works like a charm. 🙃
5. Conclusions
Thank you very much for reading through this article ❤️. It means a lot. Let’s summarize what we found in this journey and why it was helpful. In this blog post, we applied the attention mechanism to a classification task for time series. The classification was between normal time series and “modified” ones. By “modified” we mean that a part (a random part, with random length) has been rectified (substituted with a straight line). We found that:
- Attention mechanisms were originally developed in NLP, but they also excel at identifying anomalies in time series data, especially when the location of the anomaly varies across samples. This flexibility is hard to achieve with traditional CNNs or FFNNs.
- By using a bidirectional LSTM combined with an attention layer, our model learns which parts of the signal matter most. We saw that a posteriori through the attention scores (alpha), which reveal which time steps were most relevant for classification. This framework provides a clear and interpretable approach: we can visualize the attention weights to understand why the model made a certain prediction.
- With minimal data and no GPU, we trained a highly accurate model (F1 score ≈ 0.98) in only a few minutes, proving that attention is accessible and powerful even for small projects.
6. About me!
Thanks again for your time. It means a lot ❤️
My name is Piero Paialunga, and I’m this guy here:

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department. I talk about AI and Machine Learning in my blog posts, on LinkedIn, and here on TDS. If you liked the article and want to know more about machine learning and follow my studies, you can:
A. Follow me on LinkedIn, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email at
Ciao!