Map Reducing the Royals with Mongo

MongoDB is a real lifesaver when it comes to developer productivity in web applications, but that’s only a small part of its power. To do the deep-down data mining, we need to learn to use Map/Reduce to massage our data. Some of this can be accomplished with Mongo’s aggregate functions, but I’ve intentionally avoided them: there are limitations on using aggregates in sharded environments, and I expect most of my Mongo apps will need to be sharded.

Since we just finished the 2012 All-Star Game here in Kansas City, a baseball statistics example seems appropriate.

Setting up your environment
You’ll need console access to a MongoDB database. To set up Mongo on your computer, see the Quick Start.

Loading some sample data
Let’s create some realistic baseball stats. I’ll start with the real roster for the Kansas City Royals. However, instead of using their real stats, we’ll generate some random numbers using JavaScript’s Math object. For example, we know that the best players in the league will get about 200 hits and the worst players get none. Math.floor(Math.random()*200) will give us a random integer from 0 to 199. We’ll make sure that the number of hits never exceeds the number of at-bats (at-bats start at 200), and we’ll keep the number of home runs under 50 (rather generous for the Royals).

To add a single player, we can run the following javascript:

    db.Player.insert({
        number: 47,
        name: 'Nathan Adcock',
        hits: Math.floor(Math.random() * 200),
        ab: Math.floor(Math.random() * 300) + 200,
        bb: Math.floor(Math.random() * 50) + 5,
        hr: Math.floor(Math.random() * 50)
    });

Grab the script for the whole roster here, and run it in your mongo console.
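
If you’d rather see the idea than download the script, here’s a rough sketch of how such a script might build each document. The randomPlayer helper and the roster array are my own hypothetical names for illustration, not necessarily what the real script uses, and only two roster entries are included.

```javascript
// Hypothetical helper (an assumption, not taken from the roster script):
// builds one player document with random stats in the ranges described above.
function randomPlayer(number, name) {
  return {
    number: number,
    name: name,
    hits: Math.floor(Math.random() * 200),     // integer 0..199
    ab: Math.floor(Math.random() * 300) + 200, // integer 200..499, always above hits
    bb: Math.floor(Math.random() * 50) + 5,    // integer 5..54
    hr: Math.floor(Math.random() * 50)         // integer 0..49
  };
}

// Two roster entries only; the full roster has 43 players.
var roster = [
  [47, 'Nathan Adcock'],
  [16, 'Billy Butler']
];

var players = roster.map(function (entry) {
  return randomPlayer(entry[0], entry[1]);
});

// In the mongo shell you could then store each document, for example:
// players.forEach(function (p) { db.Player.insert(p); });
```

Because at-bats start at 200 and hits stop at 199, no generated player can end up with more hits than at-bats.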

Counting Home Runs
Confirm that you’ve got the data loaded. Your stats for Billy Butler will vary (my Billy Butler kind of sucks), but you should always have 43 players.

    > db.Player.count();
    43
    > db.Player.find({name: 'Billy Butler'});
    {
        "_id" : ObjectId("50021639b5145ef5327c66b2"),
        "number" : 16,
        "name" : "Billy Butler",
        "hits" : 66,
        "ab" : 386,
        "bb" : 5,
        "hr" : 5
    }

We now know how many home runs Billy Butler hit this season, but let us say we want to find the number of home runs that the combined Royals roster hit this season.

    // Emit every player's home runs under one shared key so that
    // they all land in a single group.
    var map = function() {
        emit({}, { hr: this.hr });
    };

    // Return the same shape that map emits: MongoDB may feed the
    // output of reduce back into reduce.
    var reduce = function(key, values) {
        var sum = 0;
        values.forEach(function(doc) {
            sum += doc.hr;
        });
        return { hr: sum };
    };

    homeRuns = db.runCommand({
        mapreduce: 'Player',
        map: map,
        reduce: reduce,
        out: 'totalHomers',
        verbose: true
    });

    db[homeRuns.result].find();
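
To get a feel for what the server does with a map and a reduce function, here’s a toy in-memory version you can run in plain JavaScript, no database needed. The emit stand-in and the four stat lines are made up for illustration, and this is not how MongoDB implements it internally. The point it shows is that reduce can be called again on its own partial results, so it should return the same shape that map emits.

```javascript
// Stand-in for the emit() that MongoDB provides inside map (an assumption
// for this sketch): it just records every emitted key/value pair.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

var map = function () {
  emit({}, { hr: this.hr });
};

var reduce = function (key, values) {
  var sum = 0;
  values.forEach(function (doc) {
    sum += doc.hr;
  });
  return { hr: sum }; // same shape as the emitted values
};

// Made-up home run totals, not real stats.
var players = [{ hr: 5 }, { hr: 12 }, { hr: 0 }, { hr: 30 }];

// Map phase: call map with `this` bound to each document.
players.forEach(function (p) {
  map.call(p);
});
var values = emitted.map(function (e) {
  return e.value;
});

// One pass over every emitted value.
var onePass = reduce({}, values);

// Two passes: reduce each half, then reduce the partial results.
var partialA = reduce({}, values.slice(0, 2));
var partialB = reduce({}, values.slice(2));
var twoPass = reduce({}, [partialA, partialB]);

console.log(onePass.hr, twoPass.hr); // 47 47
```

If reduce returned a bare number instead, the second pass would read doc.hr from a number, get undefined, and the total would turn into NaN.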

A more complex example

Cool, huh? Let’s take a slightly more complicated case. We’d like to take all players with more than 250 at-bats (AB) and group them by batting average.

    // Bucket each player by batting average and emit one count per player.
    var map = function() {
        var ba = this.hits / this.ab;
        var key;
        if (ba < .250) {
            key = '< .250';
        } else if (ba < .300) {
            key = '.250 -> .300';
        } else {
            key = '>= .300';
        }
        emit(key, { count: 1 });
    };

    // Return the same shape that map emits, since MongoDB may
    // re-reduce partial results.
    var reduce = function(key, values) {
        var sum = 0;
        values.forEach(function(doc) {
            sum += doc.count;
        });
        return { count: sum };
    };

    ba = db.runCommand({
        mapreduce: 'Player',
        map: map,
        reduce: reduce,
        query: {"ab": {$gt: 250}},
        out: 'battingAverages',
        verbose: true
    });

    db[ba.result].find();

Source Code

These examples are pretty simple, but we can still take away a few lessons:

  • Do the heavy lifting in the map function. These are the tasks that get executed in parallel across your shards. For example, by pushing the batting average calculation and the categorization into the map function, we let each shard do that work on its own slice of a large dataset.
  • Make use of the query arg for the map/reduce command. By filtering out the undesirable data up front, we save mapping operations and reduce the load on the db.
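
Here’s a small plain-JavaScript sketch of the second lesson. It fakes the query filter with Array.prototype.filter and counts how many times map actually runs. The emit stand-in, the mapCalls counter, and the five stat lines are all made up for illustration; a real run would happen inside MongoDB, not in memory like this.

```javascript
var mapCalls = 0;
var counts = {};

// Stand-in for MongoDB's emit() (an assumption for this sketch): it tallies
// counts per key directly instead of handing values to a reduce function.
function emit(key, value) {
  counts[key] = (counts[key] || 0) + value.count;
}

var map = function () {
  mapCalls += 1;
  var ba = this.hits / this.ab;
  var key;
  if (ba < 0.250) {
    key = '< .250';
  } else if (ba < 0.300) {
    key = '.250 -> .300';
  } else {
    key = '>= .300';
  }
  emit(key, { count: 1 });
};

// Made-up stat lines, not real Royals numbers.
var players = [
  { hits: 66, ab: 386 },  // about .171
  { hits: 150, ab: 480 }, // about .313
  { hits: 40, ab: 210 },  // dropped by the filter (ab <= 250)
  { hits: 120, ab: 440 }, // about .273
  { hits: 10, ab: 230 }   // dropped by the filter (ab <= 250)
];

// In-memory equivalent of query: { ab: { $gt: 250 } }
var eligible = players.filter(function (p) {
  return p.ab > 250;
});
eligible.forEach(function (p) {
  map.call(p);
});

console.log(mapCalls); // 3 (map ran for 3 of the 5 players)
console.log(counts);
```

Without the filter, map would run for all five players and then need its own check to throw two of them away.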

Thanks to several bloggers who helped me understand this concept:

Full source code is available on github.