Yet Another Data Blog: Bayesian Analysis

Saturday, February 8, 2014

Week 3: Zipfian Academy - Multi-armed bandits and some Machine Learning

We started the week by finishing off the session on Bayesian Statistics with a study of Bayesian A/B testing techniques. Some of the strategies covered are approaches to the Multi-armed Bandit problem: epsilon-greedy, Bayesian Bandits and UCB1. These algorithms typically outperform traditional A/B testing, because they shift traffic toward the better-performing option while the experiment is still running. We officially started machine learning this week with linear regression, multiple linear regression, heteroscedasticity / homoscedasticity and multicollinearity. Other topics we covered include Lasso / Ridge regression, cross-validation, overfitting, bias / variance and Gradient Descent. We capped off the week by working on data from one of the past Kaggle competitions, Blue Book for Bulldozers.
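To make the bandit idea concrete, here is a rough sketch of two of the strategies named above, epsilon-greedy and a Bayesian bandit. Everything specific in it is an assumption for illustration rather than what we wrote in class: the arms are Bernoulli (convert / don't convert), the conversion rates are made up, and the Bayesian bandit is written as Thompson sampling with a Beta(1, 1) prior.

```python
import random


def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms; return pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(n_rounds):
        # Explore with probability epsilon, or while any arm is still untried.
        if rng.random() < epsilon or 0 in pulls:
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best observed conversion rate.
            arm = max(range(n_arms), key=lambda a: wins[a] / pulls[a])
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return pulls


def bayesian_bandit(true_rates, n_rounds=5000, seed=0):
    """Thompson sampling with a Beta(1, 1) prior on each arm's rate."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible rate per arm from its Beta posterior,
        # then play the arm whose draw is highest.
        draws = [rng.betavariate(1 + wins[a], 1 + losses[a])
                 for a in range(n_arms)]
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


# Hypothetical conversion rates for three page variants.
rates = [0.05, 0.10, 0.30]
eg_pulls = epsilon_greedy(rates)
bb_pulls = bayesian_bandit(rates)
print("epsilon-greedy pulls per arm:", eg_pulls)
print("Bayesian bandit pulls per arm:", bb_pulls)
```

In both cases most of the pulls should end up on the best arm, which is the point: a classic A/B test would keep splitting traffic evenly until the test ends.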

A few takeaways from this week:

There were a few algorithms I had always sort of understood. Some of these become very clear once you implement them from first principles and then apply them to a dataset. We implemented a Gradient Descent function and then used it to minimize the cost function of both linear and logistic regression problems (I'll probably have a more detailed blog post on this). Working on some regularization with Lasso and Ridge also gave me a better understanding of how they both work.
We had a visit from @StreetLightData. Very cool problem they're working on: they essentially model mini migration patterns in cities / across the country. They feed data from cell signals, GPS, Census data (demographics / geo) and traffic data into their systems to extract insights used for marketing and planning.
Always remember 80-20. Data scientists spend 80% of their time cleaning datasets and extracting features (or at least more than half their time) and about 20% of their time doing modeling and parameter tuning. Forget those datasets you used in Stats class; real-world data can be really messy.
$k$-fold cross-validation helps you guard against overfitting, gives you an estimate of your prediction error and helps you understand how stable / robust your model is:

$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_{i}$$

where $MSE_i$ is the Mean Squared Error on the $i$-th held-out fold.
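Here is a small sketch of the $CV_{(k)}$ formula above, using a from-scratch batch gradient descent (in the spirit of the exercise from the first takeaway) to fit a linear model in each fold. The data is synthetic, and the learning rate, iteration count and function names are my own choices for illustration, not the code we wrote in class.

```python
import random


def predict(w, b, x):
    """Linear model prediction: b + w . x"""
    return b + sum(wj * xj for wj, xj in zip(w, x))


def gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the MSE cost of a linear model."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # Gradient of (1/n) * sum(err^2) is (2/n) * sum(err * x).
        b -= lr * 2 * grad_b / n
        w = [wj - lr * 2 * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def mse(X, y, w, b):
    """Mean squared error of the model (w, b) on (X, y)."""
    return sum((predict(w, b, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)


def k_fold_cv(X, y, k=5, seed=0):
    """Return (CV_(k), per-fold MSEs) for the linear model above."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint index sets
    fold_mse = []
    for test in folds:
        held_out = set(test)
        train = [j for j in idx if j not in held_out]
        # Fit on k-1 folds, score on the held-out fold.
        w, b = gradient_descent([X[j] for j in train], [y[j] for j in train])
        fold_mse.append(mse([X[j] for j in test], [y[j] for j in test], w, b))
    return sum(fold_mse) / k, fold_mse


# Synthetic data: y = 3*x1 - 2*x2 + 1 plus Gaussian noise (sd 0.1).
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
y = [3 * x1 - 2 * x2 + 1 + rng.gauss(0, 0.1) for x1, x2 in X]

cv_score, fold_scores = k_fold_cv(X, y, k=5)
print("per-fold MSE:", fold_scores)
print("CV_(5):", cv_score)
```

The spread of the per-fold MSEs is a quick read on how stable the model is; if one fold is far worse than the others, that is worth a closer look.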

My team took the top spot in the Kaggle competition we worked on. We had an RMSLE (Root Mean Squared Log Error) of $\approx 0.43$, which is about $0.2$ off the winning Kaggle submission. Decent for a few hours of work. It does look like working on Kaggle competitions may become a regular end-of-week exercise.
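For reference, here is a minimal sketch of how RMSLE can be computed, assuming the usual $\log(1 + x)$ form of the metric. The prices below are made up and have nothing to do with our actual submission.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error, using log(1 + x) on both sides."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(actual)) ** 2
        for actual, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical sale prices and predictions.
actual_prices = [66000, 57000, 10000, 38500]
predicted_prices = [60000, 70000, 15000, 30000]
score = rmsle(actual_prices, predicted_prices)
print("RMSLE:", score)
```

Because the error is taken on the log scale, being off by a factor of two costs the same on a 10,000 dollar machine as on a 100,000 dollar one, which suits prices that span a wide range.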

Saturday, November 10, 2012

Nate Silver and the new Numerati

By now, we probably all know who Nate Silver is. He correctly forecast the result in 49 out of 50 states and all 35 US Senate races in the 2008 election cycle, and all 50 states in the 2012 election cycle. How did he do this? Bayesian Analysis. Ignore all the political pundits. Nate simply separated the true signal from the noise. You can check out his 538 blog at the New York Times for more details.

We have also learned that the Obama campaign engaged in a massive data mining operation, with microtargeting and voter segmentation, that many could say helped them win. They hired Data Scientists to build predictive models for everything from potential voters and their likely turnout to fundraising through email campaigns, including finding out which email campaigns raised the most money and why. This was recently featured in Time Magazine. If 2008 was the Facebook election, you could say 2012 was the data election.

Nate recently published a book, 'The Signal and the Noise'. It should make for an interesting read.