Okay, this is too weird. I recently started playing with the Google Correlate tool and it’s cool and all, but I hadn’t put in my own data yet. Then I started thinking about what correlations I might tease out, so I went to the CDC site and got a list of adult obesity rates by state. It’s incredibly easy to upload this to Google Correlate. I would say it took under two minutes. You just copy the table, paste into Excel, delete the columns/rows you don’t want, save as CSV and upload to Correlate. The data uploaded just fine. Then I hit the button to see correlations… and that’s where things get interesting.
What is Google Correlate?
But first, just so you’re appropriately wowed and impressed by the results, let me detour through a brief explanation of what Google Correlate does. In essence, it takes a real-world data series and finds the search queries whose volume most closely correlates with it. It started when some engineer at Google realized that by finding correlations between historic flu data and searches, they could see which searches correlated to flu outbreaks. By tracking those searches, they could then get very early indicators of a flu outbreak coming on, data that can precede that of the CDC and other authorities because people in the earliest phases of illness are already searching on things like “sore throat”. They then took that tool and generalized it to work with any data series, not just historic flu data.
So, for example, you could input things like average days of sunshine by state and see which search terms most precisely match that distribution pattern (at least that’s how I think it works). You could do this with inflation over time, with temperature by state, with anything you can imagine as long as you have the data. This, by the way, is just another example of how Google is moving to tie its data to the real world, as we’ve seen with Google Maps, the Bacon Number generator, the new author tag and more. Make no mistake, Google is coming to a real world near you.
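To make "match that distribution pattern" a little more concrete, here is a minimal sketch of the idea as I understand it: score each search term by how strongly its per-state volume correlates with your uploaded series, then sort. Everything in it is an assumption for illustration. The function names, the five "states" and the numbers are made up, and plain Pearson correlation is my guess at the scoring, not a description of what Google actually runs.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_terms(target, term_series):
    """Return (correlation, term) pairs, most positively correlated first."""
    scored = [(pearson(target, series), term)
              for term, series in term_series.items()]
    return sorted(scored, reverse=True)

# Made-up numbers for five hypothetical states (not real data).
sunshine_days = [120, 200, 260, 150, 300]
search_volume = {
    "sunscreen": [1.0, 2.1, 2.9, 1.4, 3.6],
    "snow tires": [3.0, 1.8, 0.9, 2.7, 0.2],
    "pizza near me": [2.0, 2.2, 1.9, 2.1, 2.0],
}

ranking = rank_terms(sunshine_days, search_volume)
for r, term in ranking:
    print(f"{r:+.4f}  {term}")
```

With these toy numbers the term that rises and falls with sunshine lands at the top and the one that moves in the opposite direction lands at the bottom, which is the same shape of output as the ranked list Correlate hands back.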
And What Does This Have to Do with Rap, Obesity and God?
So anyway, here’s where it gets interesting. I expected obesity rates to correlate most to things like “nearest McDonald’s” or “fried food” or maybe inversely to “marathon dates” or who knows what. At the least I expected the correlations to be obvious — food, exercise, sedentary leisure activities (queries about television or video games for example). But no, what sticks out in states with high obesity? Queries about… get ready… RAP MUSIC! Seriously? But here’s the list.
Correlated with Obesity by State
0.8512 popular rap
0.8485 top rap
0.8460 top rap songs
0.8448 if god
0.8385 symptoms of high blood sugar
0.8378 small puppies
0.8376 top 20 rap
0.8368 popular rap songs
0.8367 plus size girls
0.8366 symptoms of high blood
Yes, of course there are some terms that stand out as obvious: [symptoms of high blood sugar], [symptoms of high blood] (presumably a truncated search about blood sugar or blood pressure), and, of course, [plus size girls]. That still leaves seven searches with no obvious connection to obesity, and five of those, fully half of the ten terms returned, relate to rap songs. If we look at the most closely correlated search term, [popular rap], and compare its search prevalence to a map of obesity by state, we get a pair of maps like this:
It’s pretty striking, though far from a perfect overlay. Perhaps it’s no surprise that, relatively speaking, North Dakota is high in obesity but not so much for [popular rap], though California shows a similar pattern.
We’re not done though. Let’s look at a couple of others. If I were going to be snarky, I might say that I can see a correlation between small puppies and obesity, but that’s still pretty weird. And [if god] simply defies explanation at first. However, notice what happens when you type [if god] into Google and wait for Google Suggest to propose terms:
See how Google offers ten choices once you type [if god], and at least half are clearly related to music (they use the words “dj” or “lyrics”, or are shortened versions of searches that do). At least one other, [if god is for us], is also a song. So since at least six out of ten suggested [if god] searches are music-related, and a couple are specifically DJ-related, we can lump that term in with the rap-related ones.
As for small puppies, that one is just a mystery to me, but that may be due to my total ignorance of rap culture and music. To my ignorant eyes, none of the suggested searches seem rap-related:
So in short, that leaves us with at least six out of the ten terms most closely correlated with obesity by state being related to music, mostly rap.
Where does one go from here? It seems that an obvious takeaway is that if you want to bring obesity down in those places where it’s highest, you might want to get rappers and DJs involved in your efforts. Or, if you are working the evil side of the equation, if you’re trying to sell super-sized double bacon lard burgers, you might get rappers to do your ads.
A little looking around and I found a Hacker News post about this with an excerpt from the Google whitepaper saying:
In our Approximate Nearest Neighbor (ANN) system, we achieve a good balance of precision and speed by using a two-pass hash-based system. In the first pass, we compute an approximate distance from the target series to a hash of each series in our database. In the second pass, we compute the exact distance function on the top results returned from the first pass.
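The two-pass idea in that excerpt can be sketched in a few lines. This is only an illustration of the general pattern (cheap approximate pass, then exact pass on a shortlist), not Google's implementation: the `coarse_summary` function below is a hypothetical stand-in for their hash, and the term names and numbers are made up.

```python
import math

def distance(a, b):
    """Exact Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def coarse_summary(series, block=4):
    """Stand-in for a hash (an assumption): average each block of values."""
    return [sum(series[i:i + block]) / len(series[i:i + block])
            for i in range(0, len(series), block)]

def two_pass_search(target, database, shortlist_size=3, top_k=1):
    target_summary = coarse_summary(target)
    # Pass 1: cheap approximate distance against the short summaries.
    # (A real system would precompute and store these summaries.)
    approx = sorted(
        (distance(target_summary, coarse_summary(series)), name)
        for name, series in database.items()
    )
    shortlist = [name for _, name in approx[:shortlist_size]]
    # Pass 2: exact distance, but only on the shortlisted series.
    exact = sorted((distance(target, database[name]), name)
                   for name in shortlist)
    return exact[:top_k]

# Made-up series for five hypothetical search terms.
database = {
    "term_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "term_b": [8, 7, 6, 5, 4, 3, 2, 1],
    "term_c": [1, 2, 3, 4, 5, 6, 7, 9],
    "term_d": [4, 3, 2, 1, 8, 7, 6, 5],
    "term_e": [0, 0, 0, 0, 0, 0, 0, 0],
}
target = [1, 2, 3, 4, 5, 6, 7, 8.4]

print(two_pass_search(target, database))
```

In this toy data, term_d has the same coarse summary as term_a even though its shape is quite different, so it survives the first pass and only gets weeded out by the exact comparison in the second. That is the trade the excerpt describes: the first pass is fast but fuzzy, the second is precise but only has to look at a handful of candidates.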
My statistician friend Doug, using that info, puts it in terms that explain the rap results a bit better:
Seems like the vast majority of what you will find will be simple serendipity, or really randomness that momentarily appears to have a pattern (like rolling 5 dice and getting 5 ones). I think a first step in making it a more useful tool would be gaining a deeper understanding of who turns to google, and what kind of information they turn to google for.
Millions of searches have been summarized (“reduced dimensionality”) to a 51 number series (in the case of States). And the plethora of 51-number summaries have been lumped together in groups (the hash table). Then it looks to me like they pick the top two correlated hashes, and in the second pass pick the highest correlations from those two hash lists.
So one thing to observe about the algorithm is that multiple references to rap are because the rap references are all related to each other – you’ve really only got one underlying correlation there. I wouldn’t get too hung up on there being a lot of rap references, the algorithm almost guarantees that there will be some coherence to the search terms that are returned to you.
The connection between rap and obesity might be real … data mining can be useful in marketing and genomics. We don’t even need for there to be a causal connection if you are trying to build a marketing campaign. If rap music happens to be popular in states with high obesity, it could still be a useful vehicle for getting out some message, at least until the history of popular culture moves on.
Among the correlates of hunting accidents is “edward steichen”! An epidemiologist or demographer would say “interesting!” and begin looking for data that illuminated the process they thought they were observing.
With State by State statistics, you are only correlating 51 numbers. You are correlating them with millions? of search phrases, so strong but spurious correlations shouldn’t be surprising. Interestingly, when I plug in 51 numbers simulated from a random normal distribution, I get nothing back! So there must be more to it than simple correlation ….
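Doug's point about 51 numbers versus a huge pool of search phrases is easy to poke at with a small simulation. The sketch below is my own illustration under simplifying assumptions: it draws a random 51-number target, compares it against more and more purely random series, and tracks the best plain Pearson correlation found so far. Real search series are not independent random noise, and as Doug found, Correlate itself evidently does something beyond simple correlation, so treat this as a demonstration of the statistical effect only.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def running_best(target, checkpoints, rng):
    """Best correlation seen after each checkpoint number of random series."""
    best = -1.0
    results = {}
    for i in range(1, max(checkpoints) + 1):
        # Each candidate is pure noise with no real link to the target.
        candidate = [rng.gauss(0, 1) for _ in target]
        best = max(best, pearson(target, candidate))
        if i in checkpoints:
            results[i] = best
    return results

rng = random.Random(0)  # fixed seed so the run is repeatable
target = [rng.gauss(0, 1) for _ in range(51)]  # one value per "state"
results = running_best(target, (10, 100, 1000, 5000), rng)

for n, r in results.items():
    print(f"best of {n:>5} random series: r = {r:.3f}")
```

The best correlation can only go up as the pool grows, and it climbs well above what a single random pair would typically show, even though nothing here is related to anything else. With millions of candidate phrases instead of a few thousand, a strong-looking match for 51 numbers is less surprising than it first appears.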