I've believed in UFOs ever since I was a kid, and while I've never seen one (yet?), I'm still more on the believer side! So I was excited to stumble upon a database of UFO sightings at http://www.infochimps.com/tags/ufo# (a shout-out to infochimps: you guys are great!). Downloading the sightings DB (around 80MB), I found a listing of 60,000 documented sightings. Hmm, interesting! I started thinking I could crunch this data in some useful and fun way: what about finding the most commonly spotted UFO shape?! Sounds like I could use Hadoop for that, just for the coolness factor really; the data isn't that large anyway, but hey, why not! I had no idea how to get started with Hadoop, though, and wasn't really interested in learning all the gory details.
Well, Ensemble to the rescue! Hadoop master and slave formulas already exist, which means someone else has packaged up the knowledge needed to set up and run a Hadoop cluster for me. All I needed to do was ask Ensemble to deploy a couple of cloud instances and start playing. Let's see how you can do that for yourself.
I won't repeat the instructions to
get started with Ensemble, since the documentation is a good place for that (and it's so easy anyway!). If you feel you need more help there, this little
video should be helpful. If you're still stuck, you can always drop by #ubuntu-ensemble on Freenode IRC and ask your questions.
Hadoop node, with an extra slave please
So, let's start ensembling
bzr branch lp:~negronjl/+junk/hadoop-master
bzr branch lp:~negronjl/+junk/hadoop-slave
ensemble bootstrap
wait a minute or two for EC2 to spin up the instance, then
ensemble status
which’ll give you output like
$ ensemble status
2011-07-12 15:20:54,978 INFO Connecting to environment.
The authenticity of host 'ec2-50-17-28-19.compute-1.amazonaws.com (50.17.28.19)' can't be established.
RSA key fingerprint is c5:21:62:f0:ac:bd:9c:0f:99:59:12:ec:4d:41:48:c8.
Are you sure you want to continue connecting (yes/no)? yes
machines:
0: {dns-name: ec2-50-17-28-19.compute-1.amazonaws.com, instance-id: i-8bc034ea}
services: {}
2011-07-12 15:21:01,205 INFO 'status' command finished successfully
Now let's deploy a two-node Hadoop cluster
ensemble deploy --repository . hadoop-master
ensemble deploy --repository . hadoop-slave
ensemble add-relation hadoop-master hadoop-slave
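While the instances spin up (it takes a minute or two, just like the bootstrap node did), running ensemble status again should show the two services and their units appearing in the output, instead of the empty services: {} we saw earlier.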
Yeah, it's that easy! Ensemble formulas manage all the kung-fu for you. The Hadoop cluster is ready; let's ssh into the master node and switch to the hdfs user
ensemble ssh hadoop-master/0
sudo -su hdfs
Downloading UFOs
Download the infochimps sightings database here, unzip it, and locate the TSV (tab-separated values) file. Note that you can download the file from infochimps without registering on their website (didn't I say these guys were great? :)
Upload the TSV DB to Hadoop's distributed filesystem
hadoop dfs -copyFromLocal ufo_awesome.tsv ufo_awesome.tsv
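A quick hadoop dfs -ls afterwards should show ufo_awesome.tsv sitting in HDFS.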
Almost ready; the corpus has been uploaded. Now we need to write some map/reduce jobs to do the actual crunching. Not being a pro developer, I dreaded the thought of writing those in Java (oh no, ew), so Python to the rescue! Thanks to the great instructions on Michael Noll's blog, I was able to massage some of that code into doing what I wanted (I'll show a quick sketch of the mapper and reducer below). I pushed my code to Launchpad, so you can grab it directly from the hadoop master node
cd /tmp
bzr branch lp:~kim0/+junk/ufo-ensemble-cruncher
cd ufo-ensemble-cruncher
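In case you're curious what a streaming mapper and reducer actually look like, here's a minimal sketch along the lines of the code in that branch (not the exact code, and I'm assuming the sighting shape is the fourth tab-separated field, which is how it looks in my copy of the data). The mapper reads raw TSV rows on stdin and emits one shape/count pair per sighting:

#!/usr/bin/env python
# mapper.py -- reads TSV sighting rows on stdin and emits "shape<TAB>1"
# pairs on stdout. Assumes the shape is the 4th tab-separated field.
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 4:
        continue  # skip malformed rows
    shape = fields[3].strip()
    if shape:
        sys.stdout.write('%s\t1\n' % shape)

Hadoop sorts the mapper output by key before it reaches the reducer, so the reducer only has to sum up consecutive counts for each shape:

#!/usr/bin/env python
# reducer.py -- sums the "shape<TAB>count" pairs from the mappers,
# relying on Hadoop having sorted them by key already.
import sys

current_shape = None
current_count = 0

for line in sys.stdin:
    shape, _, count = line.rstrip('\n').partition('\t')
    try:
        count = int(count)
    except ValueError:
        continue  # ignore anything we can't parse
    if shape == current_shape:
        current_count += count
    else:
        if current_shape is not None:
            sys.stdout.write('%s\t%d\n' % (current_shape, current_count))
        current_shape = shape
        current_count = count

if current_shape is not None:
    sys.stdout.write('%s\t%d\n' % (current_shape, current_count))

A nice property of streaming jobs is that you can sanity-check them locally without Hadoop at all: cat ufo_awesome.tsv | ./mapper.py | sort | ./reducer.py should spit out the same counts right on your laptop.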
Unleashing the elephant
Now for the big moment, let's launch the elephant
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar -file ./mapper.py -mapper mapper.py -file ./reducer.py -reducer reducer.py -input ufo_awesome.tsv -output ufo-output
packageJobJar: [./mapper.py, ./reducer.py, /tmp/hadoop-hdfs/hadoop-unjar1418682529553378062/] [] /tmp/streamjob5701745574334998473.jar tmpDir=null
11/07/29 12:27:52 INFO mapred.FileInputFormat: Total input paths to process : 1
11/07/29 12:27:53 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hdfs/mapred/local]
11/07/29 12:27:53 INFO streaming.StreamJob: Running job: job_201107290935_0010
11/07/29 12:27:53 INFO streaming.StreamJob: To kill this job, run:
11/07/29 12:27:53 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=domU-12-31-39-10-81-8E.compute-1.internal:8021 -kill job_201107290935_0010
11/07/29 12:27:53 INFO streaming.StreamJob: Tracking URL: http://domU-12-31-39-10-81-8E.compute-1.internal:50030/jobdetails.jsp?jobid=job_201107290935_0010
11/07/29 12:27:54 INFO streaming.StreamJob: map 0% reduce 0%
11/07/29 12:28:11 INFO streaming.StreamJob: map 10% reduce 0%
11/07/29 12:28:12 INFO streaming.StreamJob: map 19% reduce 0%
11/07/29 12:28:14 INFO streaming.StreamJob: map 72% reduce 0%
11/07/29 12:28:16 INFO streaming.StreamJob: map 100% reduce 0%
11/07/29 12:28:33 INFO streaming.StreamJob: map 100% reduce 100%
11/07/29 12:28:37 INFO streaming.StreamJob: Job complete: job_201107290935_0010
11/07/29 12:28:37 INFO streaming.StreamJob: Output: ufo-output
Woohoo, success! Now let's grab the results, sorting them so the most popular sighting shape is easy to spot
Is the answer really 42?
hadoop dfs -cat ufo-output/part-00000 | sort -k 2,2 -nr
light 12202
triangle 6082
circle 5271
disk 4825
other 4593
unknown 4490
sphere 3637
fireball 3452
oval 2869
formation 1788
cigar 1782
changing 1546
flash 990
cylinder 982
rectangle 966
diamond 915
chevron 760
egg 664
teardrop 595
cone 265
cross 177
delta 8
round 2
crescent 2
pyramid 1
hexagon 1
flare 1
dome 1
changed 1
The answer is "light", then! Wow, that was a blast! I had fun doing this exercise. Now, I am no Hadoop expert in any way (so direct those hadoopy questions to someone who can actually answer them), but I was quite pleased Ensemble could get me up and running that fast. The Ensemble community is doing a great job wrapping lots of free software in formulas, so that you can get up and running with any app you need in seconds rather than days (months?).
You too can write Ensemble formulas for your favorite (server?) application. Hop on to #ubuntu-ensemble, grab me (kim0) or any of the dev team, and ask any questions on your mind! We're a happy community.
So, was that fun? Can you think of something cooler you'd like to see done? Leave me a comment and let me know about it.