Monday, March 31, 2014

Some more interesting links...

An awesome list of April Fools gadgets from various companies. Maybe someone should bring these products to life.. that would be pretty epic

Another Python vs R post

Speeding up python


Friday, March 28, 2014

Week 9 : Zipfian Academy - Personal projects

This is a little late, so I'll try and make this quick. Personal projects began this week and we'll be working on them for another week. My project is focused on modeling Senators' past voting patterns and using that to predict how they'll vote on future legislation and whether bills pass or not.

Data Acquisition : I initially planned to source my data using APIs from The Sunlight Foundation and VoteSmart, but quickly realized things might take much longer with the APIs since I needed several different datasets and a way to aggregate all the data. I decided it would be better to go straight to the source: the US Senate website. Setting up and debugging my data ingestion pipeline took another two days, and by the end of the week I had all the data I needed: scraped, cleaned and packaged nicely in a database and several Python pickle objects.
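For a rough idea of what the ingestion loop looks like, here's a stripped-down sketch with requests and BeautifulSoup. The URL pattern and selectors below are placeholders for illustration, not the Senate site's actual structure.

import pickle
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.senate.gov/legislative/votes/{year}/vote_{number}.htm"  # placeholder pattern

def scrape_vote(year, number):
    """Fetch one roll-call vote page and pull out the fields we care about."""
    resp = requests.get(BASE_URL.format(year=year, number=number))
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # placeholder selectors -- the real pages needed a fair amount of trial and error
    title = soup.find("h1").get_text(strip=True)
    positions = [td.get_text(strip=True) for td in soup.find_all("td", class_="vote")]
    return {"year": year, "number": number, "title": title, "positions": positions}

if __name__ == "__main__":
    votes = [scrape_vote(2013, n) for n in range(1, 11)]
    with open("senate_votes.pkl", "wb") as f:   # pickle the scraped records for later
        pickle.dump(votes, f)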

Data Transformation : Getting the data is one thing; transforming it to get it ready for analysis and modeling is another. Most Data Scientists tend to spend a lot of their project time cleaning, aggregating and transforming data.
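As a hypothetical example of the kind of reshaping involved, here's a small pandas sketch that turns long (senator, bill, position) records into a senator-by-bill matrix ready for modeling. The records are made up for illustration.

import pandas as pd

records = pd.DataFrame([
    {"senator": "Senator A", "bill": "S.1", "position": "Yea"},
    {"senator": "Senator A", "bill": "S.2", "position": "Nay"},
    {"senator": "Senator B", "bill": "S.1", "position": "Yea"},
    {"senator": "Senator B", "bill": "S.2", "position": "Yea"},
])

# encode positions numerically and pivot into a feature matrix
records["vote"] = records["position"].map({"Yea": 1, "Nay": 0})
vote_matrix = records.pivot(index="senator", columns="bill", values="vote")
print(vote_matrix)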

Highlights from the week:
  • Got a chance to attend a meetup organized by BaRUG (Bay Area R Users Group). There was a talk from the author of the caret package (this is kind of the R version of scikit-learn) and another from the Human Rights Data Analysis Group - they use R to build statistical models to work on human rights projects across the globe.
  • We had a guest talk from a former physicist who is now a Data Scientist @WalmartLabs. He works with a group that deals with algorithmic business optimization. The talk was actually quite insightful as he touched on some interesting pain points: "reconciling technical and business needs" and "the simpler the model, the better."
  • We also had another guest lecture on visualization. The speaker also worked on this awesome visualization of BART employee salaries 
  • Several of us attended a D3 workshop organized by the VUDlab at UC Berkeley

Sunday, March 23, 2014

A few interesting links - Wolfram Language, CLT visualized, Google graveyard .....

Wolfram language looks pretty impressive. Definitely need to give it a try

Central Limit Theorem visualized, Another awesome D3 visualization

Data Elite : This is the YCombinator for Big Data startups

Metacademy - machine learning focused search engine

Nate Silver Interview : pretty insightful interview from Nate Silver

Deep Belief Networks in the browser. This is actually pretty cool. It would be nice if there was an API around this

explainshell helps you understand all those shell scripts you come across

Google Graveyard of products : many a Google product has graced us in the past decade. Hoping Google Glass and Google Driverless Cars don't end up in the graveyard. Hmm... looks like someone already created a spot for Google Glass

Sergey Brin's old grad school resume from almost two decades ago... Bet he didn't know then that he'd become one of the richest people in the world today

Monday, March 17, 2014

Week 8 : Zipfian Academy - Assessment and Review

This is the last week for the structured part of the curriculum... moving forward, it's projects, hiring day and interviews.

We started the week by reviewing material from the first three weeks : statistics, distributions, hypothesis testing, Bayesian A/B testing and different distance metrics (Jaccard, Euclidean, Hamming, cosine), and then we moved on to material from the next four weeks : web scraping with BeautifulSoup and Naive Bayes. We had a breakout on advanced web scraping with Scrapy and mechanize. You use these advanced tools when you have complex scraping requirements like JavaScript loading, pagination, etc. Tools like Kimono and import.io were also introduced. These tools can be a lifesaver, especially when you're working on a tight schedule.
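As a quick refresher, here's what those distance metrics look like with scipy. The two vectors are just made-up examples.

import numpy as np
from scipy.spatial import distance

a = np.array([1, 0, 1, 1, 0], dtype=float)
b = np.array([1, 1, 0, 1, 0], dtype=float)

print(distance.euclidean(a, b))   # straight-line distance
print(distance.hamming(a, b))     # fraction of positions that differ
print(distance.cosine(a, b))      # 1 - cosine similarity
print(distance.jaccard(a, b))     # dissimilarity of the sets of nonzero positions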

By midweek, we reviewed some of the material we covered earlier : more Naive Bayes, regression and outliers. The next day, we worked on an assessment. It was an NLP classification problem from one of our partner companies. The dataset was fairly small and the problem was well defined. Things got a bit interesting when I tried doing "extensive" grid search and cross validation on a pipeline of different models. It took more than 12 hours to go through one model... this is where you either push things to a beefy machine on AWS or perform a randomized grid search to help you find the best hyperparameters for your model. We also reviewed sample interview problems.
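Here's a minimal sketch of swapping an exhaustive grid search for a randomized search in scikit-learn. The pipeline, toy documents and parameter ranges are made up for illustration; the import path moved to sklearn.model_selection in newer releases.

from scipy.stats import uniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in newer releases

docs = ["great product", "terrible support", "loved it", "awful experience"]  # toy data
labels = [1, 0, 1, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

param_dist = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": uniform(0.1, 2.0),   # sample alpha from a range instead of enumerating it
}

# n_iter caps the number of candidates, so the search finishes in bounded time
search = RandomizedSearchCV(pipe, param_dist, n_iter=10, cv=2)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)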

We finished off the week with another assessment. This was more involved and had a few parts to it : some data wrangling of machine-generated log data for a video content site, classification, regression, clustering and building a recommendation engine. It really feels like things are winding down as we move to personal projects next week.

Highlights from the week:
  • We had a guest lecture from @enjalot. He's working on the bleeding edge of data visualization. Some of the things he's worked on include this visualization of BART employee salaries that made the rounds during the BART strike

Friday, March 14, 2014

In honor of Pi Day : Estimating the value of pi via Monte Carlo simulation

In honor of $\pi$ day, I'll run you through calculating the numerical value of $\pi$ using a method called Monte Carlo simulation. It is basically a type of simulation that samples a lot of random values from a distribution. It is used widely to solve many types of problems, including those that don't have closed-form solutions: estimating the value of numerical integrals, sensitivity analysis, Bayesian inference, predicting election results, stock price movements and the list goes on...

In our scenario, we want to calculate the value of $\pi$. To do this, we'll be throwing darts at a square dart board (dimensions : 1 by 1) with a quadrant (radius : 1) inscribed in it. After throwing a bunch of darts at the board, we'll find the ratio of the number of darts that land inside the quadrant to the total number of darts we threw and then multiply that ratio by 4 to get an estimate for the value of $\pi$.

Let's work through the math:

$\frac{N_{hits}}{N_{trials}} \approx \frac{A_{quadrant}}{A_{square}} = \frac{\frac{1}{4}\pi r^{2}}{(1)^{2}} = \frac{\frac{1}{4}\pi (1)^{2}}{(1)^{2}} = \frac{\pi}{4} \qquad \Rightarrow \qquad \pi \approx 4 \times \frac{N_{hits}}{N_{trials}}$

In our Monte Carlo simulation, we'll be sampling random points onto our 1 by 1 space and comparing the number of points that end up in the quadrant to the total number of points.
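Here's a minimal version of the simulation (a numpy sketch; your exact numbers will vary from run to run since the darts are random):

import numpy as np

def estimate_pi(n_trials):
    """Throw n_trials random darts at the unit square and estimate pi."""
    x = np.random.uniform(0, 1, n_trials)
    y = np.random.uniform(0, 1, n_trials)
    hits = np.sum(x ** 2 + y ** 2 <= 1)      # darts that land inside the quadrant
    return 4.0 * hits / n_trials

if __name__ == "__main__":
    for n in [100, 10000, 1000000]:
        print("%8d trials -> pi estimate: %.5f" % (n, estimate_pi(n)))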



From the code sample above and the output on my CLI, we see that as we increase the number of trials, our estimated value of $\pi$ gets closer to the real value, and if you run enough trials you approach steady state (the true value). Another version of the code sample runs a lot more trials, so you can visually see what's happening to the estimated value of $\pi$.

See the graph below. The peaks occur between 3.140 and 3.144, which tells us that the true value of $\pi$ lies somewhere in that range


Monday, March 10, 2014

Week 7 : Zipfian Academy - Advanced Machine Learning and Deep Learning

The week started with a detour into more advanced machine learning algorithms. We covered Logistic Regression, SVM, Naive Bayes and $k$-Nearest Neighbors. We compared these algorithms on several datasets to see in which situations one would perform better than another. Before now, I would usually pick my favorite algorithms and apply them to a dataset - that's the wrong way to approach things. You need to be more strategic in choosing which algorithm you use (use the right tool for the job). Machine learning algorithms are broadly classified as either generative or discriminative, and you have MCMC-based methods (generative) and Neural Networks (discriminative) at opposite ends of this spectrum.
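A minimal sketch of that kind of comparison in scikit-learn, using a built-in dataset (the scores are only illustrative, and the cross-validation import moved to sklearn.model_selection in newer releases):

from sklearn.datasets import load_digits
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X, y = digits.data, digits.target

models = {
    "logistic regression": LogisticRegression(),
    "svm (rbf)": SVC(),
    "naive bayes": GaussianNB(),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

# score every model with the same 5-fold cross-validation
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print("%-22s accuracy: %.3f" % (name, scores.mean()))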

Next day, we moved to Decision Trees, Random Forests and ensembles. We used BigML to visualize decision trees and I must say they have one of the best visualizations for Decision Trees. It comes in pretty handy when you're just doing EDA. Building Decision Trees can be slow and they're prone to overfitting, but Random Forests solve a lot of these issues: the individual trees can be trained in parallel, and averaging many trees tends to give you models with low bias and low variance.
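To see the overfitting point in code, here's a small sketch comparing a single tree to a forest on a held-out test set (built-in dataset, illustrative numbers):

from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier().fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X_train, y_train)  # trees grown in parallel

# a big gap between train and test accuracy is the overfitting signature
print("single tree   - train: %.3f  test: %.3f" % (tree.score(X_train, y_train), tree.score(X_test, y_test)))
print("random forest - train: %.3f  test: %.3f" % (forest.score(X_train, y_train), forest.score(X_test, y_test)))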

By mid-week, we delved into Deep Learning and built a Deep Belief Network using this library, stacking several hidden layers (Restricted Boltzmann Machines) to classify digits from the popular MNIST dataset. We had a feed-forward setup with no backpropagation. Neural Networks have been around for a while but they were put back on the map with the advent of Deep Learning about a decade ago. In academic circles, most of the hottest Deep Learning research is going on at places like NYU/Facebook (Yann LeCun), Toronto/Google (Geoff Hinton), Montreal (Yoshua Bengio) and Stanford/Google (Andrew Ng) (not listed in any particular order, of course).
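As a rough sketch of the idea (using scikit-learn's BernoulliRBM rather than the library we used in class, and only a single RBM layer rather than a full Deep Belief Network), you can stack an RBM feature extractor in front of a simple classifier:

from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

digits = load_digits()
X = digits.data / 16.0          # scale pixel values into [0, 1] for the RBM
y = digits.target

rbm = BernoulliRBM(n_components=100, learning_rate=0.06, n_iter=20, random_state=0)
model = Pipeline([("rbm", rbm), ("logistic", LogisticRegression())])
model.fit(X, y)
print("training accuracy: %.3f" % model.score(X, y))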

Towards the end of the week, we took another detour into time series analysis and worked on some trend and seasonality analysis using pandas. Friday was more of a catch-up day. There were no official sprints. We worked on project proposals and had a Deep Learning and git breakout. We're nearing the end of the structured curriculum and everyone is slowly moving into project mode. I'll say there were a lot of 'aha' moments for me this week.
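For the time series detour, here's a small pandas sketch of pulling trend and seasonality out of a synthetic monthly series (pd.rolling_mean became .rolling().mean() in later pandas releases; the data is made up):

import numpy as np
import pandas as pd

# synthetic series: upward trend + yearly seasonality + noise
idx = pd.date_range("2008-01-01", periods=72, freq="M")
values = np.linspace(10, 30, 72) + 5 * np.sin(2 * np.pi * np.arange(72) / 12) + np.random.randn(72)
ts = pd.Series(values, index=idx)

trend = pd.rolling_mean(ts, window=12)        # ts.rolling(12).mean() in newer pandas
seasonal = ts - trend                          # what's left after removing the trend
monthly_profile = seasonal.groupby(seasonal.index.month).mean()  # average seasonal effect per month

print(monthly_profile)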

Highlights of the week:
  • We had a guest lecture from Allen on multi-layer perceptrons. He talked about some of the interesting research he worked on at Nike and how they used Neural Networks and other machine learning tools to design better footwear.
  • The Chief Data Scientist at @Personagraph gave an interesting lecture on how they're using machine learning
  • We closed out the busy week with a Mentor Mixer. A lot of industry practitioners attended. The goal was to match current students with practicing Data Scientists. There were a variety of mentors who showed up, mostly Data Scientists, some Chief Data Scientists and a two-time Kaggle Competition Winner (yes, they are a rare breed but they do exist)

Thursday, March 6, 2014

Slides from in-database analytics talk #MADlib

Here are my slides from a talk I gave on in-database analytics systems and the MADlib library

Saturday, March 1, 2014

Week 6 : Zipfian Academy - The Elephant, the Pig, the Bee and other stories

So this week was one heck of a week. We started Data Engineering and Data at Scale, had several guest lectures and played with some really cool tools. I got the RAM kit I ordered over the weekend, so it was nice starting the week running on 16GB of RAM. You can easily wrangle a few gigs of data locally without resorting to AWS or mmap tricks.

Okay, this post is going to be a long one. The events here are in chronological order.

The week started off with a quick assessment. We are getting to the end of the structured part of the curriculum, so this really helped pinpoint some areas and topics to revisit. The day officially started with an introduction to Hadoop, Hadoop Streaming and the map-reduce framework via mrjob. We wrote map-reduce jobs to do word counts. The MapReduce framework / Hadoop easily made its way into the mainstream because people realized they could now store massive amounts of data (terabytes -> petabytes) on a bunch of commodity nodes and also perform computation on those nodes. Of course, Google had a hand in this after publishing their celebrated MapReduce paper back in 2004. They seem to be years ahead of everyone else.
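The word count job itself is short. Here's a sketch of what it looks like with mrjob:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit each word with a count of 1
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum the counts for each word
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Running python word_count.py input.txt executes it locally; the same script can be pushed to an Elastic MapReduce cluster with the -r emr flag.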

The next day, we worked on a more interesting problem: using mrjob to do recommendations at scale by building a scaled-down version of LinkedIn's People You May Know. We developed a multi-step map-reduce program with mrjob to analyze a social graph of friends and then applied triadic closure to recommend new friends. So basically, if A knows B and A knows C, then to some degree B and C should also know each other.
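Here's the same idea on a single machine (the mrjob version splits this into mapper and reducer steps over a much bigger graph; the friendship data below is made up): recommend friends-of-friends you aren't already connected to, ranked by mutual friends.

from collections import defaultdict
from itertools import combinations

# toy undirected friendship graph (hypothetical data)
friends = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

mutual = defaultdict(int)
for person, circle in friends.items():
    # every pair of this person's friends shares them as a mutual friend
    for u, v in combinations(sorted(circle), 2):
        if v not in friends[u]:            # skip pairs that are already friends
            mutual[(u, v)] += 1

for (u, v), count in sorted(mutual.items(), key=lambda kv: -kv[1]):
    print("recommend %s <-> %s (%d mutual friend(s))" % (u, v, count))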

Next day, we moved on to using Hive to build a contextualized ads server on AWS. Tools like Hive (with its SQL-like syntax) and Pig (a higher-level dataflow language) are used to query data on Hadoop / distributed file systems, and both compile down to map-reduce jobs. These tools were developed to cater to people who were already familiar with SQL but weren't Java experts. We also discussed some of the next-generation frameworks like Spark (in-memory), Impala (in-memory), Storm (real-time), Giraph (graphs), Flume (streaming log files), etc.

After the first few days working with data at scale, we moved in a different direction: building data products with Flask and yHat. Using these tools basically opens the floodgates for data products. If you can build a predictive model in Python or R, you can push it to production or to a client with minimal plumbing.
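As a minimal sketch of the Flask side of this (yHat replaces most of this plumbing with its own deployment API, which isn't shown here), a trained scikit-learn model can sit behind a single JSON endpoint:

from flask import Flask, request, jsonify
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

app = Flask(__name__)

# train a toy model at startup (a real service would load a pickled model instead)
iris = load_iris()
model = RandomForestClassifier(n_estimators=50).fit(iris.data, iris.target)

@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()["features"]
    prediction = int(model.predict([features])[0])
    return jsonify({"species": str(iris.target_names[prediction])})

if __name__ == "__main__":
    app.run(debug=True)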

We wrapped up the week discussing project proposals, doing reviews, a breakout on Deep Learning and catching up on sprints. I ended up exploring a bunch of APIs from some non-profits. I was impressed that some of them had built APIs to let people access their data and other government data.

Highlights of the week:
  • A lot happened this week. We started out with a guest lecture from Dennis of @IdleGames. He gave his point of view from being in a Data Scientist role for the past year. A few quotes from the talk : "if you can't explain your analysis, it didn't happen!!", "Remove yourself as the bottleneck and automate everything". He's working on some interesting projects like churn prediction, integrating offline analysis into live games, collusion detection, customer lifetime value (LTV) and the uncertainty associated with it, and also building out data products to democratize insights.
  • I had a chance to attend a meetup hosted by Microsoft. There were two talks at this meetup: setting up and running Hadoop on Cloudera, and a more hands-on intro to using F# for machine learning. It's not every day the words 'open source' and 'Microsoft' show up in the same sentence, but I was really impressed with how expressive the language was. And did I say it was open source and multi-platform? There was a really nice demo and I liked the F# type providers. One of the great things about the Bay Area is that on any given day, there are at least a handful of pretty good tech meetups with very interesting topics and speakers.
  • Tuesday, we had another guest lecture with Tyler from @radius. He's a Data Engineer who uses mrjob daily to write map-reduce jobs. He went through some detailed examples and sample code. It's really cool how mrjob handles a lot of plumbing around writing map-reduce jobs. You can run your code locally and also push it up to an Amazon Elastic MapReduce cluster with minimal changes. 
  • Next day, we attended another meetup hosted at YelpHQ. There were two very good presentations: @bigmlcom on exploratory data analysis / modeling and @YhatHQ on productionizing your Python and R models. BigML has some cool features for getting a feel for your data and some pretty cool model visualizations. They focus mainly on decision trees and random forest ensembles. BigML's tool also lets you download your models either in pseudo-code or in a variety of languages like Python, R, Java, etc. yHat is the glue between the analytics stack and the engineering stack. They enable Data Scientists to be more productive by letting them push their models to production via a streaming REST API with a JSON endpoint. I actually wrote about both startups a few months ago here
  • The co-founders of @YhatHQ dropped by the 'Data Warehouse' and gave a nice demo of a beer recommender service built with yHat and Flask. Greg was very helpful in walking us through some of the issues we had setting things up.
  • We had a special guest this week, @nathanmarz. He's a former Twitter Engineer and creator of Storm and Cascalog. His talk was half technical (overview of Clojure) and half philosophical (thoughts on where things are headed in the Big Data ecosystem). He reiterated the concept of doing your data analysis by building abstractions. I have to say, it is truly inspiring watching a Master at work.
  • Our last guest lecture of the week was by Nick from @DominoDataLab. They've built a Heroku for Data Science. Domino takes care of technical infrastructure problems like spinning up clusters and lets you focus on the Data Science. Their tool provides version control for your models / data, group collaboration, cron flexibility, etc.