I Stole a Wall Street Trick to Solve a Google Trends Data Problem


Google Trends is a godsend for market research. If you need to understand interest in a particular term, you can just look it up and see how it's changing over time. This is the kind of data we could do some serious data science with. Or rather, it would be if the data were actually usable.

In reality, Google Trends exists solely to do what it says: show trends. The data is normalised and regionalised to the point where it's impossible to come up with comparable data to do any meaningful modelling with. Unless we have a couple of tricks up our sleeve.

In my last post on this topic we introduced the concept of chaining data across overlapping windows to get around the granularity limitations of Google Trends data. Today we're going to learn how to compare that data across countries and regions so you can use it for real insights.
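As a quick refresher, here is a minimal sketch of what chaining two overlapping windows could look like. It is not the exact method from the previous post: the function name `chain_windows`, the choice of using the ratio of totals over the overlap, and the data are all assumptions made for illustration.

```python
def chain_windows(first, second, overlap):
    """Rescale `second` onto the scale of `first` using their shared dates.

    Assumption: the last `overlap` points of `first` cover the same dates
    as the first `overlap` points of `second`.
    """
    tail = first[-overlap:]
    head = second[:overlap]
    # The ratio of totals over the shared dates acts as a conversion factor.
    factor = sum(tail) / sum(head)
    rescaled = [value * factor for value in second]
    # Keep the first window as-is and append only the dates it lacks.
    return first + rescaled[overlap:]


# Hypothetical data: one underlying series seen through two windows,
# each normalised to its own maximum of 100.
window_a = [50, 100, 80, 60]   # days 1 to 4
window_b = [100, 75, 50, 25]   # days 3 to 6 (days 3 and 4 overlap)

chained = chain_windows(window_a, window_b, overlap=2)
print(chained)
```

The second window peaks at 100 on its own scale, but once it is rescaled its values continue smoothly from where the first window leaves off.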

Motivation: Comparing Motivation

Google Trends allows the downloading and reuse of Trends data with citation, so I've gone and downloaded the data on motivation for five years and scaled it so that we have one dataset of motivation searches for each country, which gives us a rough idea of how each country's interest in motivation changes over time. My goal was to see how motivated different countries are, but I have a problem. I don't know whether a Google Trends score of 100 in the US is greater or smaller than a score of 100 in the UK, and my first idea for how to work that out fell flat. Let me explain.

When I started this project I wasn't a connoisseur of Google Trends, and I quite naively tried typing in motivation for the UK, then adding a comparison, typing in motivation again and changing the location to the US. Admittedly, I was confused as to why it was the same graph. I assumed the UK and US were just too similar, so I added Japan, and it wasn't until I got to China that I realised the graph was changing all the lines to be that country's motivation.

I thought I was changing countries. Turns out I was just reloading the same data three times. Screenshot by the author. Data source: Google Trends (https://www.google.com/trends)

So if I can't get the countries on the same graph then I can't compare them. Unless I find a more creative way…

My next brainwave came from looking at the US, because if you scroll down on Google Trends you'll see that there's a subregion section showing the states in the US in relative terms. The state with the highest search volume is set to 100 and the other states are scaled accordingly.

US search results for motivation scaled relatively by state. Screenshot by the author. Data source: Google Trends (https://www.google.com/trends)

So I thought I was a genius: I'd just set the region to worldwide, see the different numbers that come out for my countries of interest and multiply the results for each country by that number.

But it turns out I had misunderstood something fundamental again. And I'm sorry, but we're going to need to do some maths to explain it.

The Maths Behind Google Trends Normalisation

So I grabbed ninety days of data from the US and the UK from the 24th of April on two separate Google Trends graphs, as you can see here. They're each scaled so the maximum is at 100, which occurs on a different day for each country.

When 100 means something different on each side of the Atlantic. Screenshot by the author. Data source: Google Trends (https://www.google.com/trends)
Graph of the US and UK showing interest over time in searches for motivation over 90 days. Screenshot by the author. Data source: Google Trends (https://www.google.com/trends)

The problem is that because we're looking at two different countries, the Google Trends scores are in fundamentally different units for each country. Just as inches and centimetres are different units of measurement, so are US Google Trends units and UK Google Trends units. And unlike inches and centimetres, we don't know the conversion factor here.
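To make the units problem concrete, here is a small sketch. It simplifies the normalisation to "divide by the peak and multiply by 100", and the raw counts are invented, since Google doesn't publish them. The function name `normalise_to_peak` is mine.

```python
def normalise_to_peak(raw_counts):
    """Scale a series so that its maximum becomes 100."""
    peak = max(raw_counts)
    return [round(100 * count / peak) for count in raw_counts]


# Hypothetical raw daily search counts for the same three days.
us_raw = [70_000, 100_000, 85_000]
uk_raw = [16_000, 20_000, 12_000]

us_scores = normalise_to_peak(us_raw)
uk_scores = normalise_to_peak(uk_raw)

print(us_scores)
print(uk_scores)
```

Both series top out at 100 even though the made-up US peak is five times the UK peak, so a score only tells you where a day sits relative to that country's own best day.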

Let's assume that on the worldwide graph the US is given a score of 100 and the UK is given a score of 50. The UK score of 50 means that the peak of the UK is 50% of the peak of the US. At first glance this would suggest that the conversion factor between these two units is a half, i.e. UK units are half the US units, or equivalently one US unit is 2 UK units. I'm now going to convince you that this isn't true.

Let's take this to a day that's not a peak day. Let's look at the 30th of April and say hypothetically that its score was 70 in the US and 80 in the UK. This means that the score in the US that day was 70% of its peak and the score in the UK that day was 80% of its peak. Let's look at it with some maths:

70% of US peak = 70% * 100 US units = 70% * 2 * 100 UK units (based on the scaling factor of 1 US unit = 2 UK units) = 140 UK units

Now looking at it from a UK perspective:

80% of UK peak = 80% * 100 UK units = 80 UK units

And last time I checked, 140 was not double 80.

Just because the peak of the US is twice the peak of the UK doesn't mean that the US data is twice the UK data for the whole time period!
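The same arithmetic as a few lines of Python, using only the hypothetical scores from the example above. The variable names are mine.

```python
# Assumed conversion from the hypothetical worldwide scores (US 100, UK 50).
US_UNITS_TO_UK_UNITS = 2

us_score_30_april = 70  # hypothetical score, in US units
uk_score_30_april = 80  # hypothetical score, in UK units

# Convert the US score into UK units using the peak-to-peak factor.
us_in_uk_units = us_score_30_april * US_UNITS_TO_UK_UNITS

# Compare the two countries on that day in the same units.
ratio_on_day = us_in_uk_units / uk_score_30_april

print(us_in_uk_units, ratio_on_day)
```

The ratio on the 30th of April comes out different from the peak-to-peak ratio of 2, which is exactly why a single number read off the worldwide view can't be applied to every day.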

So okay, we can't just take the worldwide ratios to compare the data of different countries. So what can we do?

The thing I love most about data science is that the underlying science and methodologies we use can translate across multiple different domains, so for this problem I'm going to take an analogous approach from another field.

Because I learned my data science skills before I even knew what a data scientist was, forged in the chaos that is the trading floor of an investment bank. If you've ever heard the term "Exchange Traded Fund" then that may give you a little bit of an idea of what you're in for, but if not, don't worry.

Taking Inspiration from the Stock Market

So the stock market, as you're probably aware, is a place for buying and selling equity, or shares in a company. These shares are a partial ownership and often come with things like voting rights or the ability to receive dividends, like a small bonus for being an owner of the company. Stocks can be held by individuals like you and me, or by big investors like banks, hedge funds or other private firms.

The stock market can be used as a measure of the economic health of a country. When stocks are going up, we're in a bull market and the country is, in theory, financially prosperous. When the market starts to fall we enter a bear market and things are going less well. This is a huge simplification; the markets move according to human behaviour, which is notoriously difficult to understand. But for our purposes this generalisation holds: we can gain an understanding of a country's economic health based on its stock market.

Tracking the Market Through Indices

So how do we track the stock market as a whole? Well, the obvious thing to do is to take all the shares on the stock exchange and add up all their prices to get an overall number for the value of the stock market. But this isn't how it works in reality. In reality, we use indices.

You've probably heard of the S&P 500, an index made up of 500 of the biggest companies in the US. It's used to track the US market because these companies, being the largest, cover about 80% of the total market capitalisation (effectively, the market's value) and are also very liquid, meaning they're easily traded and their prices move quite a bit.
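Here is a toy illustration of why a handful of large constituents can stand in for a whole market. The market caps are invented and the function name `top_n_coverage` is mine; a real index uses far more involved rules for choosing and weighting constituents.

```python
def top_n_coverage(market_caps, n):
    """Fraction of total market capitalisation held by the n largest companies."""
    ranked = sorted(market_caps, reverse=True)
    return sum(ranked[:n]) / sum(ranked)


# Hypothetical market caps (in billions) for a ten-company market.
caps = [500, 300, 100, 50, 30, 10, 5, 3, 1, 1]

coverage = top_n_coverage(caps, 3)
print(f"Top 3 of {len(caps)} companies cover {coverage:.0%} of the market")
```

In this made-up market, tracking fewer than a third of the companies captures most of the value, which is the property we want from a basket.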

Because they cover the majority of the market, they give a good representation of the whole market in a smaller collection of 500 stocks. Why 500? Well, for starters the S&P 500 was introduced in 1957, and I was going to say that the computational power available to calculate the market capitalisation of thousands of stocks wasn't there like it is today. But it's even more interesting than that: the S&P 500 was only created with 500 stocks because of a new electronic calculation method that enabled 500 stocks to be included in the calculation. Before that, indices were even smaller because they were calculated by hand!

Why you'd estimate in this big data world

Now we do have the computational power to calculate the whole market if we want; a few thousand stocks is small fry in today's big data world. But it's not really necessary. Adding in smaller companies means more overhead in tracking them all, and some of them may not get traded very often, meaning the data about them goes stale. The pros of adding them are outweighed by the cons.

And this idea pops up throughout finance. The UK has the FTSE 100, a basket of 100 stocks. Commodity baskets can be used to track the health of specific industries such as oil or agriculture. And inflation, measured by CPI, is made up of a basket of goods to track price changes over time.

FTSE 100. Screenshot by the author.

So if a basket of representative items can be used to measure the whole stock market, or inflation, why not use one to track search volumes?

Applying ETFs to Google Trends Data

So if I want to use this concept, what I really need is some idea of the most commonly searched terms that I can use to construct an S&P-500-esque index for each country. One of the things we can use is Google Trends' Year in Search functionality to get basket candidates from popular search terms.

The daily Google Trends data for Facebook, as built using my chaining methodology. Image by the author.

So let's say for now that I did have the average search volumes for at least one country, let's say the US. The way we get around this is to average the scaling factors for a subset of my basket (or the whole basket) and use this as an average conversion from US Google Trends units to real-world search volumes. I can then use this number to get an idea of the absolute search volumes for motivation.
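A minimal sketch of that averaging step, assuming every basket term has already been put on the same US Trends scale. All the numbers are invented, "weather" is a stand-in basket term I've added for illustration, and the names `basket` and `us_scaling_factor` are mine.

```python
from statistics import mean

# Hypothetical benchmark: average Trends score (US units) alongside a
# made-up figure for real monthly searches for each basket term.
basket = {
    "facebook": {"trends_score": 80.0, "real_searches": 160_000_000},
    "instagram": {"trends_score": 50.0, "real_searches": 110_000_000},
    "weather": {"trends_score": 40.0, "real_searches": 72_000_000},
}

# One scaling factor per term: real searches per Trends unit.
factors = [
    term["real_searches"] / term["trends_score"] for term in basket.values()
]

# Average them to get a single US-units-to-searches conversion.
us_scaling_factor = mean(factors)

# Apply the conversion to a hypothetical motivation score on the same scale.
motivation_trends_score = 0.5
estimated_motivation_searches = motivation_trends_score * us_scaling_factor

print(us_scaling_factor, estimated_motivation_searches)
```

The per-term factors won't agree exactly in practice, and averaging is one simple way to smooth that out; a median or a volume-weighted mean would be reasonable alternatives.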

Making Search Data Truly Comparable Across Countries

Now there are a few caveats here. I don't know how representative my basket is. In reality, I'm constrained by how much Google Trends data I can manually download, so my basket was small: just nine items. In addition, some countries may have very large search volumes for particular terms that are completely absent from my basket. For example, I have Facebook and Instagram in my basket, which are extremely popular in places like the UK and US. But in China, the equivalent would be WeChat, which isn't used very much outside the country.

I wouldn't put WeChat in my basket, because it's not representative of the vast majority of countries around the world. But it is very representative of China.

The other problem I have to solve is that even if I can benchmark for one country, how do I scale the other countries, which I don't have a benchmark for?

To tackle this problem I had a think about things that could influence the search volumes of a country. An obvious one is the population of the country. The US has roughly five times as many people as the UK, so it wouldn't be surprising if the US had five times the search volume of the UK. But actually I think we can do better.

Because internet access is not uniform across the population. There are still many places in the world where people find themselves without internet access. There are older people who grew up without technology and have no interest in learning, toddlers who haven't yet been given a tablet, and people who for whatever reason decide to opt out. The demographics of these non-internet users will be very country dependent, so a more accurate figure would be the share of internet users in each country.

I actually managed to find this data, and by combining it with population we can get a figure for the absolute number of internet users in each country. By taking the ratio of internet users in the country to internet users in the US, we can calculate an adjustment to the US scaling factor for each country, leaving us with a way to calculate the absolute search volume of any term for any country.
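Sketched in code, the adjustment could look like the following. The populations and internet shares are round placeholder numbers rather than the figures I used, and `us_scaling_factor` is a hypothetical value standing in for the benchmark conversion.

```python
def internet_users(population, share_online):
    """Absolute number of internet users from population and share online."""
    return population * share_online


# Placeholder inputs, not real statistics.
us_users = internet_users(population=330_000_000, share_online=0.90)
uk_users = internet_users(population=66_000_000, share_online=0.95)

# Ratio of internet users: how big the UK's online population is
# relative to the US benchmark country.
adjustment = uk_users / us_users

# Hypothetical US conversion (real searches per US Trends unit).
us_scaling_factor = 2_000_000

# Scale the US conversion down to get an estimated UK conversion.
uk_scaling_factor = us_scaling_factor * adjustment

print(round(adjustment, 3), round(uk_scaling_factor))
```

With these placeholders the adjustment comes out a little above the plain population ratio of one fifth, because a larger share of the UK's hypothetical population is online.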

When the maths simplifies itself

Now with that in mind, I do have another caveat. In order to compare countries and model motivation trends, what we're modelling isn't absolute search volumes for motivation. If it were, we'd conclude the US is less motivated than the UK because it searches for motivation more, but in reality we know that Americans are not necessarily less motivated; there are just more of them.

So to solve this problem I'd need to look at search volumes of motivation as a proportion of total search volume, and we've already built something to model this: our basket of terms. So I can calculate the absolute search volume for all of these terms, add them up for the basket and divide absolute motivation by absolute basket.

You may have noticed something here. If I do this, won't all my scaling factors cancel out? And actually the answer is yes. All of these scaling factors cancel out, rendering the work we've done before unnecessary, from a certain perspective.
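A quick way to convince yourself of the cancellation, assuming a single scaling factor per country applied to every term (which is how the factor was built above). The scores are invented and the function name `relative_interest` is mine.

```python
import math


def relative_interest(term_score, basket_scores, scaling_factor=1.0):
    """Term volume as a share of basket volume, after scaling both."""
    absolute_term = scaling_factor * term_score
    absolute_basket = sum(scaling_factor * score for score in basket_scores)
    return absolute_term / absolute_basket


# Hypothetical Trends scores for one country, all on the same scale.
basket_scores = [80.0, 50.0, 40.0, 30.0]
motivation_score = 4.0

with_factor = relative_interest(motivation_score, basket_scores, 2_000_000)
without_factor = relative_interest(motivation_score, basket_scores)

print(with_factor, without_factor)
print(math.isclose(with_factor, without_factor))
```

The factor multiplies the numerator and every term of the denominator, so it drops out of the ratio and the raw scores give the same answer.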

Adjusting for reality: accounting for differences in internet access when estimating search volumes across countries. Image by the author.

But actually, it's not unnecessary. Because if I'd started this post saying "let's just add up the Google Trends scores of the basket and divide motivation by it" you probably would have thought "Why? Is that something we can actually do?" Until we did this analysis, we didn't know we could.

There's also an additional benefit to this. I was aware that by the time we've chained all the data and scaled all the numbers, we've accumulated a lot of estimations and consequently a lot of noise that could pollute our numbers. By cancelling out our scaling factors, we're removing a lot of that noise.

Compounding errors in action. Image by the author.

So yes, we did work that's unnecessary to the final calculation. But we did it because it enabled us to understand the problem and trust that what we've come up with is robust. And that makes it worthwhile.

At Evil Works we're all about improving the life of the data scientist, through showcasing real-world projects and building the tools to just do data science better. Click the links to find out more.
