I believed in UFOs as a kid, and while I've never seen one (yet?), I'm still more on the believer side! So I was intrigued to stumble upon a database of UFO sightings at http://www.infochimps.com/tags/ufo# (a shout-out to Infochimps, you guys are great!). Downloading the sightings DB (around 80MB), I found a listing of 60,000 documented sightings. Hmm, interesting! I started thinking I could crunch this data in some useful and fun way. What about finding the most commonly spotted UFO shape?! Sounds like a job for Hadoop, mostly just for the coolness factor; the data is not that large anyway, but hey, why not! I had no idea how to get started with Hadoop, though, and wasn't really interested in learning all the gory details.
Well, Ensemble to the rescue! Hadoop master and slave formulas already exist, which means someone else has packaged up the knowledge needed to set up and run a Hadoop cluster for me. All I needed to do was ask Ensemble to deploy a couple of cloud instances and start playing. Let's see how you can do that for yourself.
I won't repeat the instructions for getting started with Ensemble, since the documentation is the best place for that (and it's so easy anyway!). If you feel you need more help, this little video should be useful. If you're still stuck, you can always drop by #ubuntu-ensemble on Freenode IRC and ask your questions.
Hadoop node, with an extra slave please
So, let's start ensembling:
bzr branch lp:~negronjl/+junk/hadoop-master
bzr branch lp:~negronjl/+junk/hadoop-slave
ensemble bootstrap

Wait a minute or two for EC2 to spin up the instance, then run

ensemble status

which'll give you output like
$ ensemble status
2011-07-12 15:20:54,978 INFO Connecting to environment.
The authenticity of host 'ec2-50-17-28-19.compute-1.amazonaws.com (50.17.28.19)' can't be established.
RSA key fingerprint is c5:21:62:f0:ac:bd:9c:0f:99:59:12:ec:4d:41:48:c8.
Are you sure you want to continue connecting (yes/no)? yes
machines:
  0: {dns-name: ec2-50-17-28-19.compute-1.amazonaws.com, instance-id: i-8bc034ea}
services: {}
2011-07-12 15:21:01,205 INFO 'status' command finished successfully
Now let's deploy a two-node Hadoop cluster:
ensemble deploy --repository . hadoop-master
ensemble deploy --repository . hadoop-slave
ensemble add-relation hadoop-master hadoop-slave
Yeah, it's that easy! Ensemble formulas manage all the kung-fu for you. The Hadoop cluster is ready; let's ssh into the master node and switch to the hdfs user:
ensemble ssh hadoop-master/0
sudo -su hdfs
Downloading UFOs
Download the Infochimps sightings database here, unzip it and locate the TSV (tab-separated values) file. Note that you can download the file from Infochimps without registering on their website (didn't I say these guys were great? :)
Upload the TSV DB to Hadoop's distributed filesystem:
hadoop dfs -copyFromLocal ufo_awesome.tsv ufo_awesome.tsv
Almost ready; the corpus has been uploaded. Now we need to write some map/reduce jobs to do the actual crunching. Not being a pro developer, the thought of writing those in Java was a bit of an "oh no, ew" moment, so Python to the rescue! Thanks to the great instructions on Michael Noll's blog, I was able to massage some of that code into doing what I wanted. I pushed my code to Launchpad, so you can grab it directly from the hadoop master node:
cd /tmp
bzr branch lp:~kim0/+junk/ufo-ensemble-cruncher
cd ufo-ensemble-cruncher
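If you're curious what a streaming mapper and reducer look like, here's a minimal sketch in the same spirit as the branch above (not the exact code from it). It assumes the UFO shape sits in the fourth tab-separated column of ufo_awesome.tsv; peek at the file first if you adapt this.

#!/usr/bin/env python
# mapper.py: read TSV sighting records on stdin, emit "<shape><TAB>1" per record.
# Assumes the shape is the 4th tab-separated column; adjust if your copy differs.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 3 and fields[3]:
        print("%s\t1" % fields[3])

#!/usr/bin/env python
# reducer.py: sum the counts per shape. Hadoop streaming sorts the mapper
# output by key, so all lines for one shape arrive consecutively on stdin.
import sys

current_shape, count = None, 0
for line in sys.stdin:
    shape, _, value = line.rstrip("\n").partition("\t")
    if shape != current_shape:
        if current_shape is not None:
            print("%s\t%d" % (current_shape, count))
        current_shape, count = shape, 0
    count += int(value)
if current_shape is not None:
    print("%s\t%d" % (current_shape, count))

A nice thing about streaming jobs is you can dry-run them without Hadoop at all: cat ufo_awesome.tsv | ./mapper.py | sort | ./reducer.py should produce the same tallies on a single machine.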
Unleashing the elephant
Now for the big moment, let's launch the elephant:
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar -file ./mapper.py -mapper mapper.py -file ./reducer.py -reducer reducer.py -input ufo_awesome.tsv -output ufo-output

packageJobJar: [./mapper.py, ./reducer.py, /tmp/hadoop-hdfs/hadoop-unjar1418682529553378062/] [] /tmp/streamjob5701745574334998473.jar tmpDir=null
11/07/29 12:27:52 INFO mapred.FileInputFormat: Total input paths to process : 1
11/07/29 12:27:53 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hdfs/mapred/local]
11/07/29 12:27:53 INFO streaming.StreamJob: Running job: job_201107290935_0010
11/07/29 12:27:53 INFO streaming.StreamJob: To kill this job, run:
11/07/29 12:27:53 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=domU-12-31-39-10-81-8E.compute-1.internal:8021 -kill job_201107290935_0010
11/07/29 12:27:53 INFO streaming.StreamJob: Tracking URL: http://domU-12-31-39-10-81-8E.compute-1.internal:50030/jobdetails.jsp?jobid=job_201107290935_0010
11/07/29 12:27:54 INFO streaming.StreamJob: map 0% reduce 0%
11/07/29 12:28:11 INFO streaming.StreamJob: map 10% reduce 0%
11/07/29 12:28:12 INFO streaming.StreamJob: map 19% reduce 0%
11/07/29 12:28:14 INFO streaming.StreamJob: map 72% reduce 0%
11/07/29 12:28:16 INFO streaming.StreamJob: map 100% reduce 0%
11/07/29 12:28:33 INFO streaming.StreamJob: map 100% reduce 100%
11/07/29 12:28:37 INFO streaming.StreamJob: Job complete: job_201107290935_0010
11/07/29 12:28:37 INFO streaming.StreamJob: Output: ufo-output
Woohoo, success! Now let's grab the results, sorting them so the most popular sighting shape is easy to see:
Is the answer really 42?
hadoop dfs -cat ufo-output/part-00000 | sort -k 2,2 -nr

light      12202
triangle   6082
circle     5271
disk       4825
other      4593
unknown    4490
sphere     3637
fireball   3452
oval       2869
formation  1788
cigar      1782
changing   1546
flash      990
cylinder   982
rectangle  966
diamond    915
chevron    760
egg        664
teardrop   595
cone       265
cross      177
delta      8
round      2
crescent   2
pyramid    1
hexagon    1
flare      1
dome       1
changed    1
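By the way, since the whole dataset fits comfortably in memory, you can sanity-check the Hadoop run with a few lines of plain Python right on the master node. This sketch makes the same assumption as the mapper above (shape in the fourth tab-separated column):

#!/usr/bin/env python
# count_shapes.py: local sanity check, tallying shapes straight from the TSV.
# Assumes the shape is the 4th tab-separated column of ufo_awesome.tsv.
from collections import Counter

counts = Counter()
with open("ufo_awesome.tsv") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 3 and fields[3]:
            counts[fields[3]] += 1

for shape, n in counts.most_common(10):
    print("%s\t%d" % (shape, n))

If all went well, its top line should agree with the Hadoop output above.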
The answer is "light", then! Wow, that was a blast; I had fun doing this exercise. Now, I am no Hadoop expert in any way (so direct those hadoopy questions to someone who can actually answer them), but I was quite pleased Ensemble could get me up and running that fast. The Ensemble community is doing a great job wrapping lots of free software in formulas, so you can get up and running with any app you need in seconds rather than days (months?). You too can write Ensemble formulas for your favorite (server?) application. Hop onto #ubuntu-ensemble and grab me (kim0) or anyone on the dev team, and ask any questions on your mind! We're a happy community.
So, was that fun? Can you think of something cooler you'd like to see done? Leave me a comment and let me know!