
Saturday, February 8, 2014

Week 3: Zipfian Academy - Multi-armed bandits and some Machine Learning

We started the week by finishing the session on Bayesian statistics with a study of Bayesian A/B testing techniques. Some of the strategies covered are extensions of the multi-armed bandit problem: epsilon-greedy, Bayesian bandits, and UCB1. These algorithms typically outperform traditional A/B testing. We officially started machine learning this week with the treatment of linear regression, multiple linear regression, hetero/homoscedasticity, and multicollinearity. Other topics we covered include Lasso / Ridge regression, cross-validation, overfitting, the bias / variance trade-off, and gradient descent. We capped off the week by working on data from a past Kaggle competition - Blue Book for Bulldozers.
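As an illustration of the simplest of these strategies, here is a minimal epsilon-greedy sketch. The conversion rates, seed, and function names below are hypothetical choices for the demo, not class code:

```python
import random

def choose_arm(values, epsilon=0.1):
    """Epsilon-greedy: explore a random arm with probability epsilon,
    otherwise exploit the arm with the highest estimated reward."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

def update(counts, values, arm, reward):
    """Incrementally update the running mean reward of the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Toy simulation: two page variants with made-up conversion rates;
# variant 1 is truly better and should attract most of the traffic.
random.seed(42)
true_rates = [0.05, 0.15]
counts, values = [0, 0], [0.0, 0.0]
for _ in range(10_000):
    arm = choose_arm(values)
    reward = 1 if random.random() < true_rates[arm] else 0
    update(counts, values, arm, reward)
```

Unlike a classic A/B test that splits traffic 50/50 for the whole experiment, the bandit shifts traffic toward the better variant as evidence accumulates, which is why these methods tend to waste fewer conversions.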

A few takeaways from this week:
  • There were a few algorithms I had only sort of understood. Some of them become very clear once you implement them from first principles and apply them to a dataset. We implemented a gradient descent function and then used it to minimize the cost function of both linear and logistic regression problems (I'll probably write a more detailed blog post on this). Working on regularization with Lasso and Ridge also gave me a better understanding of how they both work.
  • We had a visit from @StreetLightData. Very cool problem they're working on: they essentially model mini migration patterns within cities and across the country. They feed cell signal, GPS, Census (demographic / geographic), and traffic data into their systems to extract insights used for marketing and planning.
  • Always remember 80-20. Data scientists spend about 80% of their time cleaning datasets and extracting features (or at least more than half their time) and about 20% doing modeling and parameter tuning. Forget those datasets you used in stats class; real-world data can be really messy.
  • k-fold cross-validation helps you prevent overfitting, gives you an estimate of your prediction error, and helps you understand how stable / robust your model is:
CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_{i}
          where MSE_i is the Mean Squared Error on the i-th held-out fold
  • My team took the top spot in the Kaggle competition we worked on. We had an RMSLE (Root Mean Squared Log Error) of \approx 0.43, which is about 0.2 off the winning Kaggle submission. Decent for a few hours of work. It does look like working on Kaggle competitions may become a regular end-of-week exercise.
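The gradient descent exercise mentioned above can be sketched in a few lines of NumPy. This is a minimal batch version for the linear regression cost only; the function name, toy data, and learning rate are my own choices for illustration, not the class implementation:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the linear-regression MSE cost
    J(theta) = (1/2m) * ||X @ theta - y||^2."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of J w.r.t. theta
        theta -= lr * grad
    return theta

# Toy data: y = 2 + 3x exactly, with a column of ones for the intercept
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x
theta = gradient_descent(X, y)
```

For logistic regression the same loop applies; only the prediction (a sigmoid of `X @ theta`) and hence the cost and gradient change.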
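On the Lasso / Ridge point: Ridge has a convenient closed-form solution, which makes the shrinkage effect easy to see (Lasso has no closed form and needs an iterative solver such as coordinate descent). A minimal sketch, with hypothetical toy data:

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: theta = (X'X + alpha*I)^-1 X'y.
    For simplicity this sketch penalizes the intercept too; in practice
    you would center the data or leave the intercept unpenalized."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# The L2 penalty shrinks coefficients toward zero as alpha grows
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, 100)
theta_ols = ridge_fit(X, y, alpha=0.0)     # plain least squares
theta_ridge = ridge_fit(X, y, alpha=10.0)  # heavily regularized
```

The added `alpha * np.eye(...)` term also makes the matrix being inverted better conditioned, which is exactly why Ridge helps with the multicollinearity problems covered earlier in the week.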
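The CV_{(k)} formula above translates almost directly into code. Here is a minimal sketch using ordinary least squares as the model being validated; the function name and toy data are illustrative:

```python
import numpy as np

def k_fold_cv_mse(X, y, k=5, seed=0):
    """Estimate prediction error as the average MSE over k held-out folds:
    CV_(k) = (1/k) * sum_i MSE_i."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    mses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit ordinary least squares on the k-1 training folds
        theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = X[test] @ theta - y[test]
        mses.append(float(np.mean(resid ** 2)))
    return float(np.mean(mses))

# Noiseless linear data: the CV error estimate should be essentially zero
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x
cv_error = k_fold_cv_mse(X, y, k=5)
```

Because every point is held out exactly once, the spread of the per-fold MSEs also gives a rough feel for how stable the model is across resamples.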
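For reference, the RMSLE metric from the bulldozer competition is simple to compute: it is RMSE on log(1 + y), so errors on expensive machines don't swamp errors on cheap ones. The sample prices below are made up for illustration:

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root Mean Squared Log Error: RMSE computed on log1p-transformed
    values, so the metric is relative rather than absolute."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical sale-price predictions vs. actuals
y_true = np.array([9_500.0, 66_000.0, 21_500.0])
y_pred = np.array([11_000.0, 60_000.0, 25_000.0])
score = rmsle(y_pred, y_true)  # lower is better; a perfect model scores 0
```

A practical consequence: if you train a model on log1p(price) and measure plain RMSE on that scale, you are optimizing RMSLE directly.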
