Why are the Numbers So Whack? Plausible Explanations for Crazy Gem Download Counts
A couple of days ago I published an analysis of the usage data provided by NewRelic and GemStats.org, which showed that real world usage does not seem to correlate with the gem download counts provided by RubyGems.org (or with watchers/forks on GitHub). This generated a fair bit of feedback from people who were skeptical of the results. This article briefly attempts to address some of the questions and concerns people have about the analysis. I guess you could call it a FAQ, of sorts.
1. Why use Gemfile and not Gemfile.lock?
It was GemStats.org’s hypothesis that the Gemfile.lock would not correspond to real world usage and that the Gemfile would. This is a completely valid hypothesis, and one that is best explained through an anecdote.
There are certain gems that have tons of downloads but that you’ve probably never heard of, or have heard of and never used. The first that immediately comes to mind is hoe. It has over 1.1 million downloads. That’s quite a lot. You’d expect one in every 10 people to have used it before (at least). Well, I’d like to think I know a lot of Rubyists, and yet I don’t really know any who actively use that gem. On the other hand, hoe has about as many downloads as sinatra (which has 1.2 million), and I know quite a few people who have used and still use sinatra. Maybe my sample is just totally skewed, so ask yourself the same question: do you really know as many hoe users as sinatra users? The download count for this gem seems heavily skewed to me. The question is: why?
And the answer is: dependencies. According to The Ruby Toolbox, hoe is used by 2000+ gems. A lot of those incorrectly list it as a runtime dependency (uh oh!), which means it sometimes gets downloaded just because you required a library in your project. Some of the gems that depend on hoe are very popular. But just because you used those popular gems doesn’t mean you’re using their dependencies (like hoe).
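To make the runtime versus development distinction concrete, here is a minimal sketch of a gemspec. The gem name, author, version constraints, and the rack dependency are all made up for illustration; only the RubyGems calls themselves are real. A gem that needs hoe just to build and release itself would declare it as a development dependency, so that installing the gem does not pull hoe in for every user.

```ruby
require "rubygems"

# Hypothetical gem that uses hoe only for its build and release tasks.
# Name, author, and version constraints are illustrative assumptions.
spec = Gem::Specification.new do |s|
  s.name    = "example_gem"
  s.version = "0.1.0"
  s.summary = "Illustrative gemspec"
  s.authors = ["Example Author"]

  # Needed when the library runs, so it is installed for every user.
  s.add_runtime_dependency "rack", ">= 1.0"

  # Only needed to build and release the gem, so a plain
  # `gem install example_gem` does not need to download it.
  s.add_development_dependency "hoe", ">= 2.0"
end

runtime_names     = spec.runtime_dependencies.map(&:name)
development_names = spec.development_dependencies.map(&:name)

puts "runtime:     #{runtime_names.inspect}"
puts "development: #{development_names.inspect}"
```

If hoe were declared with add_runtime_dependency instead, it would show up in the runtime list, and every install of the hypothetical example_gem would count as a hoe download too.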
Another example would be the i18n gem. Another boatload of downloads, but are there really 4.3 million of you guys writing internationalized applications? I wish! It’s more like you’re just downloading it because it’s part of Rails. Does that count as “usage”? Not in my mind.
That is why GemStats.org framed their hypothesis the way they did. Their goal was to calculate usage based on the libraries that users explicitly asked for in their projects, which means no backdoor dependencies like hoe or i18n creeping into usage counts. You’ll notice that their hypothesis was correct in this regard: the raw data shows that a gem like i18n showed up far less often in Gemfiles than in the Gemfile.lock data that NewRelic had. It was explicitly specified 5% of the time, versus 89% of the time in the lock file data. My guess is the number of internationalized Rails apps is probably way closer to the 5% mark than the 89% mark. Can we at least agree on that?
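As a rough sketch of how the two measurements differ, the snippet below counts how often a gem shows up in a set of projects, once using only the explicitly requested gems (what a Gemfile lists) and once using the fully resolved set (what a Gemfile.lock lists). The four projects and their gem lists are invented purely for illustration and are not the GemStats.org or NewRelic data, so the percentages it prints are not the 5% and 89% figures above.

```ruby
# Hypothetical projects: :explicit mirrors what a Gemfile asks for,
# :locked mirrors everything a Gemfile.lock ends up resolving.
PROJECTS = [
  { explicit: %w[rails],      locked: %w[rails i18n rack] },
  { explicit: %w[rails i18n], locked: %w[rails i18n rack] },
  { explicit: %w[sinatra],    locked: %w[sinatra rack] },
  { explicit: %w[rails],      locked: %w[rails i18n rack] }
].freeze

# Percentage of projects whose list under `key` includes gem_name.
def share(projects, gem_name, key)
  hits = projects.count { |project| project[key].include?(gem_name) }
  (100.0 * hits / projects.size).round(1)
end

explicit_share = share(PROJECTS, "i18n", :explicit)
locked_share   = share(PROJECTS, "i18n", :locked)

puts "i18n asked for explicitly: #{explicit_share}%"
puts "i18n present in lock data: #{locked_share}%"
```

The gap between the two numbers comes entirely from projects that got i18n as a dependency of something else, which is the effect the Gemfile-only measurement tries to filter out.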
So they performed a valid experiment. They asked the question: “what happens to correlation if we only consider explicitly used gems?” It turned out that there is no real correlation either way (Gemfile or Gemfile.lock), but good science is all about asking these kinds of questions. We got extra data out of this too, because testing this alternate scenario helped make the data set more robust.
2. How is it that users are downloading these gems and not using them?
Indeed. Why is the mapping between downloading a gem and using it so far off? After all, why would you download a gem if you aren’t going to use it?
There are actually many valid explanations as to why the correlation is so poor, and they all have to do with installation rates. In fact, that’s really what download rates are: a gem is downloaded every time it is installed, and gems get installed for more reasons than “usage”. Here are a couple of valid scenarios:
- Every time you deploy your code on a PaaS host like Heroku, your instance is reinitialized, and your code and dependencies are re-installed. Although a service like Heroku will cache all your gems (per app, not system wide), caches get invalidated, and you can imagine that other services might not be as efficient. Basically, each deploy might trigger a re-download of a bunch of gems. You're not "using" them any more than you were before. Instead, the number scales with how often you deploy your site. I hear Etsy deploys their code like 40 times a day. I doubt they reinstall all their gems each time, but I imagine there are others out there that deploy a bunch and do re-install each time.
- Developer machines tend to have multiple versions of Ruby installed (thank you RVM!), and each of those Rubies has its own gem installations. Of course, developers are probably using the same gems in many of those installations, so download counts can potentially triple or quadruple with respect to actual use.
- Ditto for VMs. I personally have 4 or so VMs for my multiple OS environment tests, and I have to install multiple Rubies in these multiple VMs. In total, I might have 16 copies of Ruby installed across my various systems all with the same 30 or so gems. That's a 16:1 download to usage ratio, and that's just me. Multiply that by other users who have similar usage patterns, and it makes sense why the numbers are what they are.
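The VM example works out as simple arithmetic. The sketch below assumes 4 Rubies on each of the 4 VMs (which is what gives the 16 copies mentioned above) and that every gem is downloaded exactly once per Ruby installation; both are simplifying assumptions.

```ruby
# Figures from the VM example; rubies_per_vm is an assumption chosen
# so that the total matches the 16 Ruby installs mentioned above.
vms           = 4
rubies_per_vm = 4
gems_in_use   = 30

ruby_installs = vms * rubies_per_vm          # separate gem directories
downloads     = ruby_installs * gems_in_use  # one download per gem per install
ratio         = downloads / gems_in_use      # downloads per gem actually used

puts "#{ruby_installs} Ruby installs, #{downloads} downloads, #{ratio}:1 ratio"
```

One developer using 30 gems is counted as 480 downloads here, before any re-installs or deploys are taken into account.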
I think that, along with the data, these explanations provide a really solid basis for why gem downloads do not correlate with actual usage, and why downloads are not a reliable indicator of popularity. The data shows it and the explanations are fairly sound. That seems like a good nail in the coffin to me.