So this week was one heck of a week. We started Data Engineering and Data at Scale, had several guest lectures and played with some really cool tools. I got the RAM kit I ordered over the weekend, so it was nice starting the week running on 16GB of RAM. You could easily wrangle a few gigs of data locally without resorting to AWS or mmap tricks.
Okay, this post is going to be a long one. The events here are in chronological order.
The week started off with a quick assessment. We are getting to the end of the structured part of the curriculum, so this really helped pinpoint some areas and topics to revisit. The day officially started with an introduction to Hadoop, Hadoop Streaming and the MapReduce framework via mrjob, and we wrote map-reduce jobs to do word counts. Hadoop and MapReduce made their way into the mainstream because people realized they could now store massive amounts of data (terabytes to petabytes) on a bunch of commodity nodes and also run the computation on those same nodes. Of course, Google also had a hand in this after they published their celebrated MapReduce paper back in 2004. They seem to be years ahead of everyone else.
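For flavor, here's roughly what one of those word-count jobs looks like with mrjob (a minimal sketch, not our exact solution):

```python
# A minimal word-count job with mrjob: mappers emit (word, 1) pairs,
# reducers sum the counts for each word.
import re

from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts for each word across all mappers.
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()
```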
The next day, we worked on a more interesting problem using mrjob to do recommendations at scale by building a scaled-down version of LinkedIn's People You May Know. We developed a multi-step map-reduce program with mrjob to analyze a social graph of friends and then applied triadic closure to recommend new friends. Basically, if A knows B and A knows C, then to some degree B and C should also be friends.
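To make that concrete, here's a rough sketch of how such a multi-step mrjob might look (the class and field names are mine, not our actual solution): it builds each user's friend list, proposes friend-of-friend pairs, filters out pairs who are already friends, and ranks the rest by mutual friend count.

```python
# A hedged sketch of triadic-closure friend recommendations with mrjob.
# Input lines are assumed to be undirected edges: "userA userB".
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRFriendRecommender(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_edges, reducer=self.reducer_candidates),
            MRStep(reducer=self.reducer_recommend),
        ]

    def mapper_edges(self, _, line):
        a, b = line.split()
        yield a, b
        yield b, a

    def reducer_candidates(self, user, friends):
        friends = sorted(set(friends))
        # Existing friendships, so they can be filtered out in the next step.
        for f in friends:
            yield tuple(sorted((user, f))), ('friend', None)
        # Every pair of this user's friends is a triadic-closure candidate.
        for i in range(len(friends)):
            for j in range(i + 1, len(friends)):
                yield (friends[i], friends[j]), ('candidate', user)

    def reducer_recommend(self, pair, values):
        values = list(values)
        already_friends = any(kind == 'friend' for kind, _ in values)
        mutuals = [user for kind, user in values if kind == 'candidate']
        if not already_friends and mutuals:
            # Recommend the pair to each other, ranked by mutual friend count.
            yield pair, len(mutuals)


if __name__ == '__main__':
    MRFriendRecommender.run()
```

The nice part is that the same script runs locally by default and on an Elastic MapReduce cluster just by passing `-r emr` on the command line.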
Next day, we moved on to using Hive to build a contextualized ads server on AWS. Tools like Hive and Pig (which both have SQL-like syntax) are used to query data on Hadoop / distributed file systems, and they both compile down to map-reduce. These tools were developed to cater to people who were already familiar with SQL but were not Java experts. We also discussed some of the next-generation frameworks like Spark (in-memory), Impala (in-memory), Storm (real-time), Giraph (graphs), Flume (streaming log files), etc.
After the first few days working with data at scale, we moved in a different direction: building data products with Flask and yHat. Using these tools basically opens the floodgates for data products. If you can build a predictive model in Python or R, you can push it to production or to a client with minimal plumbing.
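To give you an idea of how little plumbing is involved, here's a minimal Flask sketch that serves a pickled model behind a JSON endpoint (the model file, route and feature format are made up for illustration):

```python
# A minimal sketch of serving a predictive model with Flask.
# Assumes a scikit-learn-style model was trained offline and pickled.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open('model.pkl', 'rb') as f:   # hypothetical model file
    model = pickle.load(f)


@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {"features": [5.1, 3.5, 1.4, 0.2]}
    payload = request.get_json()
    prediction = model.predict([payload['features']])
    return jsonify({'prediction': prediction.tolist()})


if __name__ == '__main__':
    app.run(debug=True)
```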
We wrapped up the week discussing project proposals, doing reviews, a breakout on Deep Learning and catching up on sprints. I ended up exploring a bunch of APIs from some non-profits. I was impressed that some of them have built APIs to let people access their data as well as other government data.
Highlights of the week:
- A lot happened this week. We started out with a guest lecture by Dennis from @IdleGames. He gave his point of view from being in a Data Scientist role for the past year. A few quotes from the talk: "if you can't explain your analysis, it didn't happen!!", "Remove yourself as the bottleneck and automate everything". He's working on some interesting projects like churn prediction, integrating offline analysis into live games, collusion detection, customer long-term value (LTV) and the uncertainty associated with it, and building out data products to democratize insights.
- I had a chance to attend a meetup hosted by Microsoft. There were two talks: setting up and running Hadoop on Cloudera, and a more hands-on intro to using F# for machine learning. It's not every day that the words 'open source' and 'Microsoft' show up in the same sentence, but I was really impressed with how expressive the language was. And did I say it was open source and multi-platform? There was a really nice demo and I liked the F# type providers. One of the great things about the Bay Area is that on any given day, there are at least a handful of pretty good tech meetups with very interesting topics and speakers.
- Tuesday, we had another guest lecture with Tyler from @radius. He's a Data Engineer who uses mrjob daily to write map-reduce jobs. He went through some detailed examples and sample code. It's really cool how mrjob handles a lot of plumbing around writing map-reduce jobs. You can run your code locally and also push it up to an Amazon Elastic MapReduce cluster with minimal changes.
- Next day, we attended another meetup hosted at YelpHQ. There were two very good presentations: @bigmlcom on doing exploratory data analysis / modeling, and @YhatHQ on productionizing your Python and R models. BigML has some cool features for getting a feel for your data and some pretty cool model visualizations. They focus mainly on decision trees and random forest ensembles. BigML's tool also lets you download your models either as pseudo-code or in a variety of languages like Python, R, Java, etc. yHat is the glue between the analytics stack and the engineering stack. It lets Data Scientists be more productive by pushing their models to production via a REST streaming API with a JSON endpoint (there's a small client-side sketch of calling such an endpoint after this list). I actually wrote about both startups a few months ago here.
- The co-founders of @YhatHQ dropped by the 'Data Warehouse' and gave a nice demo of a beer recommender service built with yHat and Flask. Greg was very helpful in walking us through some of the issues we had setting things up.
- We had a special guest this week, @nathanmarz. He's a former Twitter Engineer and creator of Storm and Cascalog. His talk was half technical (overview of Clojure) and half philosophical (thoughts on where things are headed in the Big Data ecosystem). He reiterated the concept of doing your data analysis by building abstractions. I have to say, it is truly inspiring watching a Master at work.
- Our last guest lecture of the week was by Nick from @DominoDataLab. They've built a Heroku for Data Science. Domino takes care of the technical infrastructure problems like spinning up clusters and lets you focus on the Data Science. Their tool provides version control for your models and data, group collaboration, flexible cron-style scheduling, etc.
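Since @YhatHQ's pitch (and the beer recommender demo) is all about exposing models through a REST API with a JSON endpoint, here's a quick client-side sketch of what calling such an endpoint looks like. The URL and payload are made up, and it would work just as well against the little Flask app above:

```python
# A hedged sketch of calling a model served behind a JSON prediction endpoint.
import requests

ENDPOINT = 'http://localhost:5000/predict'    # hypothetical endpoint URL

payload = {'features': [5.1, 3.5, 1.4, 0.2]}  # example feature vector
response = requests.post(ENDPOINT, json=payload)
response.raise_for_status()

print(response.json())   # e.g. {'prediction': [...]}
```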