
Saturday, February 8, 2014

Week 3: Zipfian Academy - Multi-armed bandits and some Machine Learning

We started the week by finishing the session on Bayesian statistics with a study of Bayesian A/B testing techniques. Some of the strategies covered are extensions of the multi-armed bandit problem: epsilon-greedy, Bayesian bandits, and UCB1. These algorithms typically outperform traditional A/B testing. We officially started machine learning this week with the treatment of linear regression, multiple linear regression, hetero/homoscedasticity, and multicollinearity. Other topics we covered include Lasso / Ridge regression, cross-validation, overfitting, the bias / variance trade-off, and gradient descent. We capped off the week by working on data from a past Kaggle competition - Blue Book for Bulldozers.
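As an illustration of the simplest of these strategies, here is a minimal epsilon-greedy sketch. The conversion rates, seed, and function names below are hypothetical choices for the demo, not class code:

```python
import random

def choose_arm(values, epsilon=0.1):
    """Epsilon-greedy: explore a random arm with probability epsilon,
    otherwise exploit the arm with the highest estimated reward."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

def update(counts, values, arm, reward):
    """Incrementally update the running mean reward of the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Toy simulation: two page variants with made-up conversion rates;
# variant 1 is truly better and should attract most of the traffic.
random.seed(42)
true_rates = [0.05, 0.15]
counts, values = [0, 0], [0.0, 0.0]
for _ in range(10_000):
    arm = choose_arm(values)
    reward = 1 if random.random() < true_rates[arm] else 0
    update(counts, values, arm, reward)
```

Unlike a classic A/B test that splits traffic 50/50 for the whole experiment, the bandit shifts traffic toward the better variant as evidence accumulates, which is why these methods tend to waste fewer conversions.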

A few takeaways from this week:
  • There were a few algorithms I had only sort of understood. Some of them become very clear once you implement them from first principles and apply them to a dataset. We implemented a gradient descent function and then used it to minimize the cost function of both linear and logistic regression problems (I'll probably write a more detailed blog post on this). Working on regularization with Lasso and Ridge also gave me a better understanding of how they both work.
  • We had a visit from @StreetLightData. Very cool problem they're working on: they essentially model mini migration patterns within cities and across the country. They feed cell signal, GPS, Census (demographic / geographic), and traffic data into their systems to extract insights used for marketing and planning.
  • Always remember 80-20. Data scientists spend about 80% of their time cleaning datasets and extracting features (or at least more than half their time) and about 20% doing modeling and parameter tuning. Forget those datasets you used in stats class; real-world data can be really messy.
  • k-fold cross-validation helps you prevent overfitting, gives you an estimate of your prediction error, and helps you understand how stable / robust your model is:
CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_{i}
          where MSE_i is the Mean Squared Error on the i-th held-out fold
  • My team took the top spot in the Kaggle competition we worked on. We had an RMSLE (Root Mean Squared Log Error) of \approx 0.43, which is about 0.2 off the winning Kaggle submission. Decent for a few hours of work. It does look like working on Kaggle competitions may become a regular end-of-week exercise.
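The gradient descent exercise mentioned above can be sketched in a few lines of NumPy. This is a minimal batch version for the linear regression cost only; the function name, toy data, and learning rate are my own choices for illustration, not the class implementation:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the linear-regression MSE cost
    J(theta) = (1/2m) * ||X @ theta - y||^2."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of J w.r.t. theta
        theta -= lr * grad
    return theta

# Toy data: y = 2 + 3x exactly, with a column of ones for the intercept
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x
theta = gradient_descent(X, y)
```

For logistic regression the same loop applies; only the prediction (a sigmoid of `X @ theta`) and hence the cost and gradient change.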
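On the Lasso / Ridge point: Ridge has a convenient closed-form solution, which makes the shrinkage effect easy to see (Lasso has no closed form and needs an iterative solver such as coordinate descent). A minimal sketch, with hypothetical toy data:

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: theta = (X'X + alpha*I)^-1 X'y.
    For simplicity this sketch penalizes the intercept too; in practice
    you would center the data or leave the intercept unpenalized."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# The L2 penalty shrinks coefficients toward zero as alpha grows
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, 100)
theta_ols = ridge_fit(X, y, alpha=0.0)     # plain least squares
theta_ridge = ridge_fit(X, y, alpha=10.0)  # heavily regularized
```

The added `alpha * np.eye(...)` term also makes the matrix being inverted better conditioned, which is exactly why Ridge helps with the multicollinearity problems covered earlier in the week.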
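The CV_{(k)} formula above translates almost directly into code. Here is a minimal sketch using ordinary least squares as the model being validated; the function name and toy data are illustrative:

```python
import numpy as np

def k_fold_cv_mse(X, y, k=5, seed=0):
    """Estimate prediction error as the average MSE over k held-out folds:
    CV_(k) = (1/k) * sum_i MSE_i."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    mses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit ordinary least squares on the k-1 training folds
        theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = X[test] @ theta - y[test]
        mses.append(float(np.mean(resid ** 2)))
    return float(np.mean(mses))

# Noiseless linear data: the CV error estimate should be essentially zero
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x
cv_error = k_fold_cv_mse(X, y, k=5)
```

Because every point is held out exactly once, the spread of the per-fold MSEs also gives a rough feel for how stable the model is across resamples.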
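For reference, the RMSLE metric from the bulldozer competition is simple to compute: it is RMSE on log(1 + y), so errors on expensive machines don't swamp errors on cheap ones. The sample prices below are made up for illustration:

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root Mean Squared Log Error: RMSE computed on log1p-transformed
    values, so the metric is relative rather than absolute."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical sale-price predictions vs. actuals
y_true = np.array([9_500.0, 66_000.0, 21_500.0])
y_pred = np.array([11_000.0, 60_000.0, 25_000.0])
score = rmsle(y_pred, y_true)  # lower is better; a perfect model scores 0
```

A practical consequence: if you train a model on log1p(price) and measure plain RMSE on that scale, you are optimizing RMSLE directly.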
