I Stole a Wall Street Trick to Solve a Google Trends Data Problem

Towards Data Science | 💼 Business
#tip #Google Trends #pro tip #data analysis #data normalisation #Wall Street

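Summary

This article introduces a new methodology, borrowed from a technique used on Wall Street, to solve the problem that global Google Trends data is hard to compare across countries. The methodology aims to make search-term popularity accurately comparable between countries by normalising and calibrating the data. This lets researchers improve the accuracy of their analyses and reduce the risk of misinterpretation.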


Article

In reality, Google Trends exists solely to do what it says: show trends. The data is normalised and regionalised to the point where it’s impossible to get hold of comparable data to do any meaningful modelling with. Unless we have a few tricks up our sleeve.

In my last post on this topic we introduced the concept of chaining data across overlapping windows to get around the granularity limitations of Google Trends data. Today we’re going to learn how to compare that data across countries and regions so you can use it for real insights.

Motivation: Comparing Motivation

Google Trends allows the downloading and reuse of Trends data with citation, so I’ve downloaded five years of data on motivation and scaled it so we have one dataset of motivation searches for each country, which gives us a rough idea of how each country’s interest in motivation changes over time.

My goal was to compare how motivated different countries are, but I have a problem. I don’t know whether a Google Trends score of 100 in the US is bigger or smaller than a score of 100 in the UK, and my first idea for how to work that out fell flat. Let me explain.

When I started this project I wasn’t a connoisseur of Google Trends, and I quite naively tried typing in motivation with the location set to the UK, then adding a comparison, typing in motivation again and changing the location to the US. Admittedly, I was confused as to why it was the same graph. I thought it was just that the UK and the US were too similar, so I added Japan, and it wasn’t until I got to China that I realised the graph was changing all of the lines to show that country’s motivation. So if I can’t get the countries on the same graph, then I can’t compare them. Unless I find a more creative way…

My next brainwave came from looking at the US, because if you scroll down on Google Trends you’ll see that there’s a subregion section showing the states in the US in relative terms.
The state with the highest search volume is set to 100 and the other states are scaled accordingly. So I thought I was a genius: I’ll just set the region to worldwide, see the different numbers that come out for my countries of interest and multiply the results for each country by that number. But it turns out I had misunderstood something fundamental again. And I’m sorry, but we’re going to need to do some maths to explain it.

The Maths Behind Google Trends Normalisation

I grabbed ninety days of data for the US and the UK, starting from the 24th of April, on two separate Google Trends graphs, as you can see here. They’re both scaled so the maximum is at 100, which occurs on a different day for each country.

The problem is that because we’re looking at two different countries, the Google Trends scores are in fundamentally different units for each country. Just like inches and centimetres are different units of measurement, so are US Google Trends units and UK Google Trends units. And unlike inches to centimetres, we don’t know the conversion factor here.

Let’s assume that on the worldwide graph the US is given a score of 100 and the UK is given a score of 50. The UK score of 50 means that the peak of the UK is 50% of the peak of the US. At first glance this might suggest that the conversion factor between these two units is a half, i.e. UK units are half the size of US units, or equivalently one US unit is 2 UK units. I’m now going to convince you why this isn’t true.

Let’s take this to a day that’s not a peak day. Let’s look at the 30th of April and say, hypothetically, that its score was 70 in the US and 80 in the UK. This means that the score in the US that day was 70% of its peak and the score in the UK that day was 80% of its peak.
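To make the "different units" point concrete, here is a minimal Python sketch. The raw daily counts are invented purely for illustration (the downloaded scores are only relative, so real raw volumes are not available), and `scale_to_peak` is a hypothetical helper, not part of any Google Trends tooling. It mimics the per-graph scaling described above, where each country's maximum becomes 100.

```python
# Hypothetical raw daily search counts. These numbers are made up:
# the Google Trends scores in the article are relative, not raw volumes.
raw_us = [700, 1000, 850, 900]
raw_uk = [240, 180, 300, 210]


def scale_to_peak(series):
    """Rescale a series so that its maximum value becomes 100."""
    peak = max(series)
    return [round(100 * value / peak) for value in series]


trends_us = scale_to_peak(raw_us)  # [70, 100, 85, 90]
trends_uk = scale_to_peak(raw_uk)  # [80, 60, 100, 70]

# Size of one "Trends unit" in raw searches for each country.
# We can only compute this because we invented the raw data.
unit_us = max(raw_us) / 100  # 10.0 raw searches per US unit
unit_uk = max(raw_uk) / 100  # 3.0 raw searches per UK unit

print(trends_us, trends_uk)
print(unit_us, unit_uk)
```

Both scaled series peak at 100 even though the invented raw peaks differ (1000 versus 300), so a score of 100 means a different thing in each country. The unit sizes that would let us convert between them are exactly the information the scaling throws away.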
Let’s look at it with some maths:

70% of US peak = 70% * 100 US units
= 70% * 2 * 100 UK units (based on the scaling factor of one US unit = 2 UK units)
= 140 UK units

Now looking at it from a UK perspective:

80% of UK peak = 80% * 100 UK units
= 80 UK units

And last time I checked, 140 was not double 80. Just because the peak of the US is twice the peak of the UK doesn’t mean that the US data is twice the UK data for the whole time period!

So okay, we can’t just take the worldwide ratios to compare the data of different countries. So what can we do?

The thing I love the most about data science is that the underlying science and methodologies we use can translate across multiple different domains, so for this problem I’m going to borrow an approach from finance, because I learned my data science skills before I even knew what a data scientist was, forged in the chaos that is the trading floor of an investment bank. If you’ve ever heard of the term “Exchange Traded Fund” then that might give you a little bit of an idea of what you’re in for, but if not, do not fear.

Taking Inspiration from the Stock Market

The stock market, as you’re probably aware, is a place for buying and selling equity, or shares in a company. These shares are a partial ownership and usually come with things like voting rig
