Google Trends Is Misleading You: How to Do Machine Learning with Google Trends Data


What a gift to society that is. If not for Google Trends, how would we ever have known that more Disney movie releases in the 2000s led to fewer divorces in the UK? Or that drinking Coca-Cola is an unknown treatment for cat scratches?

Wait, am I getting confused by correlation vs causation again?

If you prefer watching over reading, you can do so right here:

Google Trends is one of the most widely used tools for analysing human behaviour at scale. Journalists use it. Data scientists use it. Entire papers are built on it. But there's a fundamental property of Google Trends data that makes it very easy to misuse, especially when you're working with time series or trying to build models, and most people never realise they're doing it.

All charts and screenshots are created by the author unless stated otherwise.

The Problem with Google Trends Data

Google doesn't actually publish raw figures for search volume. That information prints money for them, and there's no way they'd open it up for other people to monetise. What they do give us is a time series: a way to see how searches for a particular term change over time. And the way they do that is by giving us a normalised set of data.

This doesn't sound like a problem until you try to do some machine learning with it. Because when it comes to getting a machine to learn anything, we need to give it a lot of data.

My initial idea was to grab a window of 5 years, but I immediately hit a problem: the larger the time window, the less granular the data. I couldn't get daily data for five years, and while I then thought "just take the maximum time period you can get daily data for and slide that window along", that was a problem too. Because it was here that I discovered the true terror of normalisation:

Whatever time period I use and whatever single search term I use, the data point with the highest number of searches is automatically set to 100. That means the meaning of 100 changes with every window I use.

This whole post exists because of this.

Google Trends Basics

Now, I don't know whether you've used Google Trends before, but if you haven't, I'm going to talk you through it so we can get to the meat of the problem.

So I'm going to search the word "motivation". It defaults to the UK, because that's where I'm from, and to the past day, and we get a beautiful graph showing how often people searched the word "motivation" in the last 24 hours.

Searches for "motivation" in the UK over the past 24 hours. Screenshot by Author

I like this because you can see really clearly that people are mostly searching for motivation during the working day, nobody is searching it when most of the country is asleep, and there are definitely a couple of kids needing some encouragement for their homework. I don't have an explanation for the late-night searches, but I'd kind of guess these are people not ready to go back to work tomorrow.

Now this is beautiful, but while eight-minute increments over 24 hours do give us a nice 180 data points to use, most of them are actually zero, and I don't know whether the past 24 hours have been highly demotivating compared with the rest of the year or whether today represents the year's highest GDP contribution. So I'm going to extend the window a little.

The moment we go to a week, the first thing you notice is that the data is a lot less granular. We now have a week of data, but it's only hourly, and I still have the same core problem of not knowing how representative this week is.

I can keep zooming out. 30 days, 90 days. At each step we lose granularity and don't have anywhere near as many data points as we did for 24 hours. If I'm going to build an actual model, this isn't going to cut it. I need to go big.

And when I select five years is where we run into the problem that motivated this whole video (excuse the pun, that was unintentional): I can't get daily data. And also, why is today not at 100 anymore?

Searches for "motivation" in the UK over the past five years. Screenshot by Author

Herein Lies the True Problem with Google Trends Data

As I mentioned earlier, Google Trends data is normalised. That means that whatever time period I use and whatever single search term I use, the data point with the highest number of searches is automatically set to 100. All the other points are scaled down accordingly. If the 1st of April had half the searches of the maximum, then the 1st of April is going to have a Google Trends score of 50.
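To make that scaling rule concrete, here's what it looks like in a few lines of Python, using made-up raw counts (Google never publishes the real ones):

```python
import pandas as pd

# Hypothetical raw daily search counts; Google only ever shows us the scaled version
raw = pd.Series({"2025-04-01": 4000, "2025-04-02": 8000, "2025-04-03": 6500})

# The busiest day becomes 100; everything else is scaled relative to it and rounded
trends_score = (100 * raw / raw.max()).round().astype(int)
print(trends_score)  # 1st April -> 50, 2nd April -> 100, 3rd April -> 81
```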

So let's look at an example just to illustrate the point. Let's take the months of May and June 2025, each 30 or 31 days, so we have daily data here; we actually lose it beyond 90 days. If I look at May, you can see we're scaled so we hit 100 on the 13th, and in June we hit it on the 10th. So does that mean motivation was searched just as often on the 10th of June as it was on the 13th of May?

May 2025. Screenshot by Author
June 2025. Screenshot by Author

If I zoom out now so that I have May and June on the same graph, you can immediately see that that's not the case. When both months are included, we see that the searches for motivation had a Google Trends score of 81 on the 10th of June, meaning that, as a proportion of searches in the UK, it was 81% of the proportion of searches on the 13th of May. If we hadn't zoomed out, we wouldn't have known that.

May and June 2025 on the same graph. Screenshot by Author

Now, all is not lost. We did get a good bit of information from this experiment, because we know we can see the relative difference between two data points if they're both included in the same graph. So if we did load May and June separately, knowing the 10th of June is 81% of the 13th of May means we can scale June down accordingly and the data will be comparable.

So that's what I decided I'd do. I'd fetch my Google Trends data with a one-day overlap on each window, so 1st of January to 31st of March, then 31st of March to 31st of July. Then I could use March 31st in both data sets to scale the second set to be comparable with the first.
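To make that fetching pattern concrete, here's a minimal sketch using the unofficial pytrends library (this isn't the project's actual code, and the example windows below are trimmed to 90 days so each one still returns daily data):

```python
from pytrends.request import TrendReq

# Unofficial Google Trends client (pip install pytrends); heavy use gets rate-limited
pytrends = TrendReq(hl="en-GB", tz=0)

windows = []
# Two 90-day windows sharing a single day (31st of March) as the overlap
for timeframe in ["2025-01-01 2025-03-31", "2025-03-31 2025-06-28"]:
    pytrends.build_payload(kw_list=["motivation"], timeframe=timeframe, geo="GB")
    daily = pytrends.interest_over_time()  # each window is normalised to its own 100
    windows.append(daily["motivation"])
```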

But while this is close to something we can use, there's one more problem I need to make you aware of.

Google Trends: Another Layer of Randomness

So when it comes to Google Trends data, Google isn't actually tracking every search. That would be a computational nightmare. Instead, Google uses sampling techniques to build a representation of search volumes.

This means that while the sample is likely very well built, it is Google after all, every day will have some natural random variation. If by chance March 31st was a day where Google's sample happened to be unusually high or low compared with the real world, our overlap method would introduce an error into our entire data set.

On top of this, we also have to consider rounding. Google Trends rounds everything to the nearest whole number. There's no 50.5; it's 50 or it's 51. Now this seems like a small detail, but it can actually become a big problem. Let me show you why.

On the 4th of October 2021, there was a massive spike in searches for Facebook. This massive spike gets scaled to 100, and as a result everything else in that period is much closer to zero. When you're rounding to the nearest whole number, that tiny error of up to 0.5 suddenly becomes a huge proportional error when your number is only 1 or 2. This means our solution needs to be robust to noise, not just scaling.
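To put rough numbers on that (the values here are purely illustrative):

```python
true_score = 1.4              # what a quiet day "should" score once a huge spike is pinned at 100
reported = round(true_score)  # Google Trends reports 1
error = abs(reported - true_score) / true_score
print(f"{error:.0%}")         # ~29% relative error from rounding alone
```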

So how do we solve this? Well, we know that on average the samples will be representative, so let's just take a bigger sample. If we use a larger window to get our overlap, the random variation and rounding errors have less of an impact.

So here's the final plan. I know I can get daily data for up to 90 days. I'm going to load a rolling window of 90-day periods, but I'll make sure each window overlaps by a full month with the next. That way, our overlap isn't just one potentially noisy day but a stable month-long anchor that we can use to scale our data more accurately.
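Here's a sketch of how that anchor-based scaling might look in code, assuming each 90-day window has already been pulled into a pandas Series of daily scores as in the earlier snippet (the function names are my own, not the project's):

```python
import pandas as pd

def chain_with_anchor(first: pd.Series, second: pd.Series) -> pd.Series:
    """Put `second` on the same scale as `first` using their shared dates."""
    overlap = first.index.intersection(second.index)
    # Averaging over the whole shared month dampens sampling noise and rounding
    # error far more than a single-day ratio would
    ratio = first.loc[overlap].mean() / second.loc[overlap].mean()
    rescaled = second * ratio
    # Keep the overlapping dates only once in the stitched series
    return pd.concat([first, rescaled.drop(index=overlap)])

def stitch(windows: list[pd.Series]) -> pd.Series:
    """Chain an ordered list of overlapping windows into one long daily series."""
    series = windows[0]
    for window in windows[1:]:
        series = chain_with_anchor(series, window)
    return series
```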

So it seems like we've got a plan. I've got some concerns, mainly that with lots of batches there are going to be compounding errors, and that could end with the numbers absolutely blowing up. But to see how this shakes out with real data, we have to go and do it. So here's one I made earlier.

Writing Code to Figure Out Google Trends

After writing up everything we've discussed in code form, and after having some fun getting temporarily banned from Google Trends for pulling too much data, I've put together some graphs. My immediate response when I saw this was: "Oh no, it blew up".

Image by Author

The graph below shows my chained-together five years of search volumes for Facebook. You'll see a fairly regular downward trend, but two spikes stand out. The first of these was the huge spike on 4th October 2021 that we mentioned earlier.

Five years of chained daily search data for Facebook. Image by Author

My first thought was to verify the spikes. I, unironically, googled it and found out about the widespread Meta outages that day. I pulled data for Instagram and WhatsApp over the same period and saw similar spikes. So I knew the spike was real, but I still had a question: was it too big?

When I put my time series side by side with Google Trends' own graph, my heart sank. My spikes were huge in comparison. I started thinking about how to handle this. Should I cap the maximum spike value? That felt arbitrary and would lose information about the relative sizes of spikes. Should I apply an arbitrary scaling factor? Again, it felt like a guess.

My chained series next to Google Trends' own graph. Screenshot by Author

That was until I had a bolt of inspiration. Remember, Google Trends only gives us weekly data for a period this long; that's the whole reason we're doing this. What if I averaged my daily data for that week to see how it compared with Google's weekly value?

This is where I breathed a huge sigh of relief. That week was the biggest spike on Google Trends, so it was set to 100. When I averaged my data for the same week, I got 102.8. Incredibly close to Google Trends. We also finish in about the same place. This means the compounding errors from my scaling method haven't blown up my data. I have something that looks and behaves just like the Google Trends data!
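If you want to repeat that sanity check yourself, it's essentially a resample plus a rescale. In this sketch, `full_daily` stands in for the stitched Facebook series from the chaining step (the name is mine), and depending on which day Google starts its weeks you may need `"W-SAT"` or `"W-SUN"` instead of plain `"W"`:

```python
# Collapse the stitched daily scores into weekly averages
weekly = full_daily.resample("W").mean()

# Rescale so the biggest weekly average is 100, mirroring Google's own weekly chart
weekly_scaled = 100 * weekly / weekly.max()
print(weekly_scaled.loc["2021-09":"2021-11"])  # the outage week should land near 100
```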

So now we have a solid methodology for creating a clean, comparable daily time series for any search term. Which is great. But what if we actually want to do something useful with it, like comparing search terms around the world, for instance?

Because while Google Trends lets you compare multiple search terms, it doesn't allow direct comparison of multiple countries. So I can grab a dataset of motivation for every country using the method we've discussed today, but how do I make them comparable? Facebook is part of the answer.

But that solution is one for a later blog post, one in which we're going to build a "basket of goods" to compare countries and see exactly how Facebook fits into all of this.

So today we started with the question of whether we can model national motivation, and in attempting to do so we immediately hit a wall. Because Google Trends daily data is misleading. Not because of an error, but by its very design. We've found a way to tackle that now, but in the life of a data scientist, there are always more problems lurking around the corner.
