Friday, July 29, 2011

Ubuntu takes UFOs to the cloud


I've always believed in UFOs as a kid, and while I've never seen one (yet?) I am still more on the believer side! I was interested to stumble upon a database of UFO sightings at http://www.infochimps.com/tags/ufo# A shout-out at infochimps (you guys are great!). Downloading the sightings DB (around 80MB), I found a listing of 60,000 documented sightings, hmm interesting! I started thinking I could crunch on this data in some useful and fun way, what about finding the most commonly spotted UFO shape?! Sounds like I could use hadoop for that, just for the coolness factor really, the data is not that large anyway, but hey why not! I had no-idea how to get started with hadoop though and wasn't really interested in learning up all the gory details!

Well Ensemble to the rescue, hadoop master and slave formulas exist, which means someone else packaged the knowledge needed to setup and run a hadoop cluster for me. All I needed to do was ask Ensemble to deploy me a couple of cloud instances and start playing. Let's see how you can do that for yourself

I won't repeat the instructions to get started with Ensemble, since the documentation is a good place for that (and it's so easy anyway!). If you feel you need more help there, this little video should be helpful. If you're still stuck, you can always drop by on irc/freenode at #ubuntu-ensemble and ask your questions

Hadoop node, with an extra slave please


So, let's start ensembling
bzr branch lp:~negronjl/+junk/hadoop-master
bzr branch lp:~negronjl/+junk/hadoop-slave
ensemble bootstrap
wait a minute or two for EC2 to spin up the instance, then
ensemble status
which’ll give you output like
$ ensemble status
2011-07-12 15:20:54,978 INFO Connecting to environment.
The authenticity of host 'ec2-50-17-28-19.compute-1.amazonaws.com (50.17.28.19)' can't be established.
RSA key fingerprint is c5:21:62:f0:ac:bd:9c:0f:99:59:12:ec:4d:41:48:c8.
Are you sure you want to continue connecting (yes/no)? yes
machines:
  0: {dns-name: ec2-50-17-28-19.compute-1.amazonaws.com, instance-id: i-8bc034ea}
services: {}
2011-07-12 15:21:01,205 INFO 'status' command finished successfully

Now let's deploy a two node hadoop cluster
ensemble deploy --repository . hadoop-master
ensemble deploy --repository . hadoop-slave
ensemble add-relation hadoop-master hadoop-slave

Yeah it's that easy! Ensemble formulas manage all the kung-fu for you. The hadoop cluster is ready, let's ssh into the master node and switch to user hdfs
ensemble ssh hadoop-master/0
sudo -su hdfs

Downloading UFOs


Download the infochimps sightings database here, unzip it and locate the TSV file (tab separated values) file. Note that you can download the file from infochimps without registering on their website (didn't I say these guys were great :)

Upload the TSV DB to hadoop's distributed filesystem
hadoop dfs -copyFromLocal ufo_awesome.tsv ufo_awesome.tsv

Almost ready, the corpus has been uploaded. Now we need to write some map/reduce jobs to do the actual crunching. Not being a pro developer, the thought of writing that in java was like (oh no ew), so python to the rescue! Thanks to the great instructions at Michael Noll's blog, I was able to massage some of that code to get it to do what I wanted. I pushed my code to launchpad, so that you can grab it directly from the hadoop master node

cd /tmp
bzr branch lp:~kim0/+junk/ufo-ensemble-cruncher
cd ufo-ensemble-cruncher

Unleashing the elephant


Now for the big moment, let's launch the elephant
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar -file ./mapper.py -mapper mapper.py -file ./reducer.py -reducer reducer.py -input ufo_awesome.tsv -output ufo-output
packageJobJar: [./mapper.py, ./reducer.py, /tmp/hadoop-hdfs/hadoop-unjar1418682529553378062/] [] /tmp/streamjob5701745574334998473.jar tmpDir=null
11/07/29 12:27:52 INFO mapred.FileInputFormat: Total input paths to process : 1
11/07/29 12:27:53 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hdfs/mapred/local]
11/07/29 12:27:53 INFO streaming.StreamJob: Running job: job_201107290935_0010
11/07/29 12:27:53 INFO streaming.StreamJob: To kill this job, run:
11/07/29 12:27:53 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=domU-12-31-39-10-81-8E.compute-1.internal:8021 -kill job_201107290935_0010
11/07/29 12:27:53 INFO streaming.StreamJob: Tracking URL: http://domU-12-31-39-10-81-8E.compute-1.internal:50030/jobdetails.jsp?jobid=job_201107290935_0010
11/07/29 12:27:54 INFO streaming.StreamJob:  map 0%  reduce 0%
11/07/29 12:28:11 INFO streaming.StreamJob:  map 10%  reduce 0%
11/07/29 12:28:12 INFO streaming.StreamJob:  map 19%  reduce 0%
11/07/29 12:28:14 INFO streaming.StreamJob:  map 72%  reduce 0%
11/07/29 12:28:16 INFO streaming.StreamJob:  map 100%  reduce 0%
11/07/29 12:28:33 INFO streaming.StreamJob:  map 100%  reduce 100%
11/07/29 12:28:37 INFO streaming.StreamJob: Job complete: job_201107290935_0010
11/07/29 12:28:37 INFO streaming.StreamJob: Output: ufo-output

Woohoo success! Now let's grab the results, sorting it to easily see the most popular sighting shape

Is the answer really 42


hadoop dfs -cat ufo-output/part-00000 | sort -k 2,2 -nr
light   12202
triangle        6082
circle  5271
disk    4825
other   4593
unknown 4490
sphere  3637
fireball        3452
oval    2869
formation       1788
cigar   1782
changing        1546
flash   990
cylinder        982
rectangle       966
diamond 915
chevron 760
egg     664
teardrop        595
cone    265
cross   177
delta   8
round   2
crescent        2
pyramid 1
hexagon 1
flare   1
dome    1
changed 1

The answer is "light" then! Wow that was a blast! I had fun doing this exercise. Now I am no hadoop expert in any way (so direct those hadoopy questions to someone who can actually answer them), however I was quite pleased Ensemble could help me get up and running that fast. The Ensemble community is doing a great job wrapping many free software with formulas, such that you can always get up and running with any app you need in seconds rather than days (months?). You too can write Ensemble formulas for your favorite (server?) application. Hop on to #ubuntu-ensemble and grab me (kim0) or any of the dev team and ask any questions on your mind! We're a happy community

So was that fun? Can you think of something cooler you want to see done? Leave me a comment, let me know about it