Yet Another Data Blog: immersive learning


Saturday, March 1, 2014

Week 6 : Zipfian Academy - The Elephant, the Pig, the Bee and other stories

So this week was one heck of a week. We started Data Engineering and Data at Scale, had several guest lectures and played with some really cool tools. I got the RAM kit I ordered over the weekend, so it was nice starting the week running on 16 GB of RAM. With that much memory you can easily wrangle a few gigs of data locally without resorting to AWS or mmap tricks.

Okay, this post is going to be a long one. The events here are in chronological order.

The week started off with a quick assessment. We are getting to the end of the structured part of the curriculum, so this really helped pinpoint some areas and topics to revisit. The day officially started with an introduction to Hadoop, Hadoop Streaming and the map-reduce framework via mrjob. We wrote map-reduce jobs to do word counts. Hadoop and the MapReduce framework made their way into the mainstream because people realized they could now store massive amounts of data (terabytes to petabytes) on a cluster of commodity nodes and also run computation on those same nodes. Of course Google also had a hand in this after they published their celebrated MapReduce paper back in 2004. They seem to be years ahead of everyone else.
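The word count job follows the classic map, shuffle, reduce pattern. Here is a minimal sketch in plain Python that simulates the three phases in a single process. It does not use mrjob or Hadoop, and the function names (`mapper`, `shuffle`, `reducer`, `word_count`) are my own illustrative choices, not part of any library.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z']+")


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce phase: collapse the list of counts for one word into a total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())
```

On a real cluster the mapper and reducer run on different nodes and the framework does the shuffle for you. The point of the sketch is only that the user-written part is two small functions.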

The next day, we worked on a more interesting problem: using mrjob to do recommendations at scale by building a scaled-down version of LinkedIn's People You May Know. We developed a multi-step map-reduce program with mrjob to analyze a social graph of friends and then applied triadic closure to recommend new friends. So basically, if A knows B and A knows C, then B and C are likely to know each other too, and the more mutual friends they share, the stronger the recommendation.
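The triadic closure idea can be sketched in a few lines of plain Python. This is not the mrjob program from class, just a single-machine illustration with two steps that mirror a multi-step job. The function name `recommend_friends` and the `top_n` parameter are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def recommend_friends(friendships, top_n=3):
    """Rank non-friends by the number of mutual friends they share.

    friendships: iterable of (a, b) pairs, treated as undirected edges.
    Returns {person: [(candidate, mutual_count), ...]} sorted by count.
    """
    graph = defaultdict(set)
    for a, b in friendships:
        graph[a].add(b)
        graph[b].add(a)

    # Step 1 ("map"): every person introduces each pair of their own friends.
    mutual = defaultdict(int)
    for person, friends in graph.items():
        for b, c in combinations(sorted(friends), 2):
            if c not in graph[b]:  # skip pairs that are already friends
                mutual[(b, c)] += 1

    # Step 2 ("reduce"): collect and rank candidates for each person.
    recs = defaultdict(list)
    for (b, c), count in mutual.items():
        recs[b].append((c, count))
        recs[c].append((b, count))
    return {
        person: sorted(cands, key=lambda x: (-x[1], x[0]))[:top_n]
        for person, cands in recs.items()
    }
```

For example, if B and C are not connected but both know A and D, each is recommended to the other with a mutual count of 2.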

Next day, we moved on to using Hive to build a contextualized ads server on AWS. Tools like Hive and Pig are used to query data on Hadoop / distributed file systems, and they both compile down to map-reduce. Hive's query language is SQL-like, while Pig has its own higher-level dataflow language, Pig Latin. These tools were developed for people who were already comfortable with SQL or scripting but were not Java experts. We also discussed some of the next generation frameworks like Spark (in-memory), Impala (in-memory), Storm (real-time), Giraph (graphs), Flume (streaming log files), etc.

After the first few days working with data at scale, we moved in a different direction to building data products with Flask and yHat. Using these tools basically opens the floodgates for data products. If you can build a predictive model in python or R, you can push it to production or to a client with minimal plumbing.

We wrapped up the week discussing project proposals, doing reviews, a breakout on Deep Learning and catching up on sprints. I ended up exploring a bunch of APIs from some non-profits. I was impressed some of them would build APIs to let people access their data and other government data.

Highlights of the week:

A lot happened this week. We started out with a guest lecture from Dennis from @IdleGames. He gave his point of view after a year in a Data Scientist role. A few quotes from the talk: "if you can't explain your analysis, it didn't happen!!", "Remove yourself as the bottleneck and automate everything". He's working on some interesting projects like churn prediction, integrating offline analysis into live games, collusion detection, customer lifetime value (LTV) and the uncertainty associated with it, and also building out data products to democratize insights.
I had a chance to attend a meetup hosted by Microsoft. There were two talks: setting up and running Hadoop on Cloudera, and a more hands-on intro to using F# for machine learning. It's not every day the words 'open source' and 'Microsoft' show up in the same sentence, but I was really impressed with how expressive the language was, and did I say it was open source and multi-platform? There was a really nice demo and I liked the F# type providers. One of the great things about the Bay Area is that on any given day, there are at least a handful of pretty good tech meetups with very interesting topics and speakers.
Tuesday, we had another guest lecture with Tyler from @radius. He's a Data Engineer who uses mrjob daily to write map-reduce jobs. He went through some detailed examples and sample code. It's really cool how mrjob handles a lot of plumbing around writing map-reduce jobs. You can run your code locally and also push it up to an Amazon Elastic MapReduce cluster with minimal changes.
Next day, we attended another meetup hosted at YelpHQ. There were two very good presentations from @bigmlcom on doing exploratory data analysis / modeling and @YhatHQ on productionizing your python and R models. BigML has some cool features for getting a feel for your data and some pretty cool model visualizations. They focus mainly on decision trees and random forest ensembles. BigML's tool also lets you download your models either in pseudo-code or in a variety of languages like python, R, Java, etc. yHat is the glue between the analytics stack and the engineering stack. They also enable Data Scientists to be more productive by letting them push out their models to production via a REST or streaming API with a JSON endpoint. I actually wrote about both startups a few months ago here.
The co-founders of @YhatHQ dropped by the 'Data Warehouse' and gave a nice demo of a beer recommender service built with yHat and Flask. Greg was very helpful in walking us through some of the issues we had setting things up.
We had a special guest this week, @nathanmarz. He's a former Twitter Engineer and creator of Storm and Cascalog. His talk was half technical (overview of Clojure) and half philosophical (thoughts on where things are headed in the Big Data ecosystem). He reiterated the concept of doing your data analysis by building abstractions. I have to say, it is truly inspiring watching a Master at work.
Our last guest lecture of the week was by Nick from @DominoDataLab. They've built a Heroku for Data Science. Domino takes care of technical infrastructure problems like spinning up clusters and lets you focus on Data Science. Their tool provides version control for your models / data, group collaboration, cron-style scheduling, etc.

Sunday, February 23, 2014

Week 5 : Zipfian Academy - Graphs and Community Detection

The update for last week will be short and quick. Doing these blog posts is getting much harder.

We started the week looking at unsupervised learning techniques like k-means and hierarchical clustering. We also visited dimensionality reduction techniques like SVD and NMF. By mid-week, we switched gears to graph analysis and covered, in no particular order, BFS, DFS, A*, Dijkstra and community detection in graph networks.
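Of the graph algorithms listed, Dijkstra is a good one to write out since it shows how a priority queue drives the search. The version below is a minimal sketch using the standard library `heapq`. The adjacency-dict input format is an assumption for the example, and it only handles non-negative edge weights.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node.

    graph: {node: {neighbor: non_negative_edge_weight}}
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist
```

If every edge has weight 1, the distances this returns are the same hop counts BFS would give you.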

Takeaways from the week:

We had several guest lectures this week. @kanjoya is working on the cutting edge of Natural Language Processing. They help their clients derive actionable intelligence from emotions and intuition. The speaker discussed the general NLP landscape: tools and techniques. I found it interesting that some of their training data comes from The Experience Project.
@geli gave an interesting talk. They've basically built an OS for energy systems and hope to revolutionize the energy management space.
@thomaslevine's talk was on open data initiatives around the country. Open Data is one of those things cities like to talk about but very few of them are doing it well.
Things were switched around this week. We ended the week working on a dataset from one of the partner companies. The dataset recorded mobile ads served to users at various locations, and we were supposed to do some exploration and find the best locations to serve ads to users. The dataset had a couple million records. Trying to wrangle gigabyte-sized data on just 4 GB of RAM is definitely not fun. I ordered a 16 GB RAM kit and should get it by this weekend. If you are thinking of enrolling for the course, you should shoot for at least 8 GB of RAM.

Saturday, February 15, 2014

Week 4 : Zipfian Academy - Oh SQL, Oh SQL... MySQL and some NLP too

So things were totally ramped up this week. We started out by scraping, parsing and cleaning data from the NYTimes API and then jsonified and stored the data in MongoDB. The next day we used the same dataset ported to a few SQL tables and implemented the Naive Bayes algorithm in SQL to classify which labels an article would fall under. We continued with some diagnostics like confusion matrices, confusion tables, false alarm rate, hit rate, precision, recall, ROC curves, etc. Other topics covered include NLTK, tokenization, TF-IDF, n-grams, regular expressions, and feature selection using Chi-Squared and Mutual Information. We ended the week by working on another past Kaggle Competition - StumbleUpon Evergreen Classification Challenge.
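We did the Naive Bayes exercise in SQL, but the counting logic is easier to see in a few lines of Python. The sketch below is a multinomial Naive Bayes with add-one (Laplace) smoothing. It is not the class solution, and the function names and the toy documents in the usage note are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (list_of_tokens, label). Returns the fitted counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log posterior for the tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Trained on a handful of toy articles labeled "politics" and "sports", `predict(model, ["vote", "senate"])` picks "politics" because those words were only counted under that label.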

We are at the halfway point for the structured part of the class. Just in case you're thinking of doing this, my schedule these days is about 12 - 15 hrs / day during the week doing daily sprints (data scrubbing, transformation / machine learning challenges), reading data science materials and attending lectures. Over the weekend, I'd say about 10 hrs / day closing the loop on a few of the sprints from the current week and doing more data science readings for the following week. You basically live and breathe data science... all day long, all week long.

Highlights from the week :

We had two guest lectures this week. They were on Naive Bayes and feature extraction in NLP. Zipfian also added a guest lecturer to their roster. The new instructor is a Deep Learning expert and I'm really excited to explore working on new datasets with Neural Networks.
Implementing things from first principles gives you a better understanding of how some of these algorithms work and what may be going on under the hood when they fail.
My team also took the top spot in the Kaggle competition for the second week. The problem we worked on was a classification problem using AUC (Area Under the ROC Curve) as the evaluation metric. We achieved an AUC of $\approx 0.8895$, which is about 0.008 off the leading Kaggle submission on the public leaderboard.
Cross-validating on your training set is always a good idea.
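Since AUC was the evaluation metric, it helps to see what it measures: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. Here is a minimal sketch that computes it directly from that definition. The function name `auc_score` is my own, and the pairwise loop is quadratic, so it is only meant for small examples.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: iterable of 0/1; scores: predicted scores (higher = more positive).
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A model that ranks every positive above every negative scores 1.0, and one that scores everything identically gets 0.5.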

Saturday, February 8, 2014

Week 3 : Zipfian Academy - Multi-armed bandits and some Machine Learning

We started the week by finishing off the session on Bayesian Statistics with the study of Bayesian A/B Testing techniques. Some of the strategies covered are solutions to the Multi-armed Bandit problem: epsilon-greedy, Bayesian Bandits and UCB1. These algorithms typically outperform traditional A/B testing because they shift traffic toward the better variant while the experiment is still running. We officially started machine learning this week with the treatment of linear regression, multiple linear regression, hetero/homoscedasticity and multicollinearity. Other topics we covered include Lasso / Ridge regression, cross-validation, overfitting, bias / variance and Gradient Descent. We capped off the week by working on data from one of the past Kaggle competitions - Blue Book for Bulldozers.
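Two of the bandit strategies fit in a handful of lines each. The sketch below shows epsilon-greedy and UCB1 as arm-selection functions plus a tiny simulator with Bernoulli rewards. The function names, the default `epsilon=0.1` and the simulated conversion rates are all assumptions for the sake of the example.

```python
import math
import random


def epsilon_greedy(counts, rewards, rng, epsilon=0.1):
    """Explore a random arm with probability epsilon, else exploit the best mean."""
    if rng.random() < epsilon or not any(counts):
        return rng.randrange(len(counts))
    means = [r / c if c else 0.0 for r, c in zip(rewards, counts)]
    return max(range(len(counts)), key=means.__getitem__)


def ucb1(counts, rewards, rng):
    """Pick the arm with the highest upper confidence bound on its mean reward."""
    for arm, c in enumerate(counts):
        if c == 0:
            return arm  # play every arm once before using the bound
    total = sum(counts)
    bounds = [
        r / c + math.sqrt(2 * math.log(total) / c)
        for r, c in zip(rewards, counts)
    ]
    return max(range(len(counts)), key=bounds.__getitem__)


def simulate(policy, true_rates, rounds=5000, seed=0):
    """Play Bernoulli arms with the given hidden rates; return pulls per arm."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    rewards = [0.0] * len(true_rates)
    for _ in range(rounds):
        arm = policy(counts, rewards, rng)
        counts[arm] += 1
        rewards[arm] += 1.0 if rng.random() < true_rates[arm] else 0.0
    return counts
```

Running `simulate(ucb1, [0.05, 0.10, 0.50])` and looking at the pull counts is a quick way to see how the policy concentrates traffic on the arm that is paying off instead of splitting it evenly like a fixed A/B test.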

A few takeaways from this week:

There were a few algorithms I had always sort of understood. Some of these algorithms become very clear once you implement them from first principles and then apply them on a dataset. We implemented a Gradient Descent function and then used it to minimize the cost function of both linear and logistic regression problems (I'll probably have a more detailed blog post on this). Working on some regularization with Lasso and Ridge also gave me a better understanding of how they both work.
We had a visit from @StreetLightData. Very cool problem they're working on. They essentially model mini migration patterns in cities / across the country. They feed data from cell signals, GPS, Census Data (Demographics / Geo) and Traffic data into their systems to extract insights used for marketing and planning.
Always remember 80-20. Data scientists spend 80% of their time cleaning datasets and extracting features (or at least more than half their time) and about 20% of their time doing modeling and parameter tuning. Forget those datasets you used in Stats class, real world data can be real messy.
$k$-fold Cross-validation helps you guard against overfitting, gives you an estimate of your prediction error and helps you understand how stable / robust your model is:

$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_{i}$$

where $MSE_i$ is the Mean Squared Error on the $i$-th held-out fold.

My team took the top spot in the Kaggle competition we worked on. We had an RMSLE (Root Mean Squared Logarithmic Error) of $\approx 0.43$, which is about $0.2$ off the winning Kaggle submission. Decent for a few hours of work. It does look like working on Kaggle competitions may become a regular end-of-week exercise.
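Two of the takeaways above, gradient descent and $k$-fold cross-validation, fit together nicely in one small example. The sketch below fits a one-feature linear regression by batch gradient descent and then computes $CV_{(k)}$ as the average held-out MSE. It is not the class implementation. The learning rate, step count and the interleaved fold assignment are assumptions picked to keep the example short.

```python
def fit_linear_gd(xs, ys, lr=0.01, steps=5000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(w, b, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_cv(xs, ys, k=5, **fit_kwargs):
    """Return CV_(k): the mean of the k held-out MSE values."""
    fold_errors = []
    points = list(zip(xs, ys))
    for i in range(k):
        test_idx = set(range(i, len(points), k))  # every k-th point, offset i
        train = [p for j, p in enumerate(points) if j not in test_idx]
        test = [p for j, p in enumerate(points) if j in test_idx]
        w, b = fit_linear_gd(*zip(*train), **fit_kwargs)
        fold_errors.append(mse(w, b, *zip(*test)))
    return sum(fold_errors) / k
```

One practical note: the learning rate has to be small relative to the scale of the features or the updates diverge, which is why scaling your features before running gradient descent is usually worth the effort.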

Week 2 : Zipfian Academy - Are you Frequentist or Bayesian ?

We started the week with a tour de force of matplotlib and then switched gears to Statistics. For the rest of the week, we covered Hypothesis Testing, Goodness of Fit (Kolmogorov-Smirnov Test), Distributions, Confidence Intervals, p-values, t-tests, Frequentist A/B vs Bayesian A/B testing, and MCMC.

A few notes from this week:

There's a huge debate on the Frequentist vs Bayesian schools of thought with proponents on both sides. Just in case you're on the fence, here's an Open Letter on why you should think about going Bayesian.
The goal in Bayesian inference is to get a good handle on the posterior distribution over the model parameters. Some of the math can get pretty beefy, so you would normally use a package like pymc or rjags to fit your distributions / models. Some good resources for pymc and MCMC are this Stats Book for Hackers and this set of videos from mathematicalmonk.
Some EDA (Exploratory Data Analysis) tools you should have in your workflow include Raw and CartoDB.
We had two pretty good talks at one of the local meetup groups: @nitin on LearnDataScience and @Udacity on their Data Science course development.
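For the simplest Bayesian A/B test you don't even need MCMC. With a Beta prior and conversion counts, the posterior is another Beta distribution, so you can sample it directly. The sketch below assumes uniform Beta(1, 1) priors on both variants. The function name and the example counts are made up for illustration, and a package like pymc is still what you'd reach for when the model is not this tidy.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) under independent Beta(1, 1) priors.

    With a Beta prior and binomial data, each posterior is
    Beta(1 + conversions, 1 + non_conversions), so we sample it directly.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws
```

Calling `prob_b_beats_a(10, 100, 30, 100)` answers the question people actually ask in an A/B test, "how likely is it that B is better than A?", which is not something a p-value tells you.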

Tuesday, January 28, 2014

Week 1 : Zipfian Academy - Priming the Pump , some Unix shell, python, recommenders and data wrangling

So things got off to a speedy start for the Zipfian Academy winter cohort. I think this first week was used to establish a pace for the course and it was intense. We probably spent about 8 - 10 hours a day on lectures / sprints and working on problems. I would imagine some folks probably spent a few more hours working on code / readings when they got back home.

For the first week, we covered the following in no particular order: overview of probability, Git / Github, some Bayesian Statistics, Bash / Unix shell, python, pdb, ipython, TDD, OOP, SVD, matrix factorization and an overview of Linear Algebra. We also built a recommender system based on a large Amazon dataset.

A few interesting themes I noticed

Pair programming is woven into the fabric of the program. We'll be pair programming for the first few weeks and will break off after about the first month or so. It does take a lot of getting used to, especially if you haven't done it before.
Git / Github, which is probably the standard version control tool at most tech shops, is also tightly integrated into the program. From the first day, we were expected to fork repos and also push and pull code.
Emergence of ipython as a truly awesome collaborative tool for data analysis. If you do python and you haven't used ipython and ipython notebooks, please stop reading this blog post and google them. You'll thank me, seriously.
The ever powerful and omni-present Bash and Unix command line. Every now and then, you're reminded how powerful these tools are for data analysis and data pipelining tasks.
It is amazing how much you can learn by osmosis when you find yourself immersed in a collaborative environment with very like-minded colleagues.
In thinking like a Data Scientist, your mind always has to be on the business problem you're trying to solve. It's not always about running the fastest or most exciting machine learning algorithm.

Looking at the curriculum for the next couple of weeks, saying it will be intense is probably an understatement, but I'll try and put out weekly updates as time permits.