A few takeaways from this week:
- There were a few algorithms I had only ever sort of understood; they become much clearer once you implement them from first principles and then apply them to a dataset. We implemented a gradient descent function and used it to minimize the cost functions of both linear and logistic regression (I'll probably have a more detailed blog post on this; a minimal sketch follows this list). Working on regularization with Lasso and Ridge also gave me a better understanding of how they both work
- We had a visit from @StreetLightData. Very cool problem they're working on: they essentially model mini migration patterns within cities and across the country. They feed cell-signal, GPS, Census (demographics / geo), and traffic data into their systems to extract insights used for marketing and planning
- Always remember 80-20. Data scientists spend about 80% of their time cleaning datasets and extracting features (or at least more than half their time) and roughly 20% of their time on modeling and parameter tuning. Forget those datasets you used in stats class; real-world data can be really messy
- k-fold cross-validation helps you prevent overfitting, gives you an estimate of your prediction error, and shows you how stable / robust your model is (a code sketch follows this list):
CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_{i}

where MSE_{i} is the Mean Squared Error on the i-th held-out fold.
- My team took the top spot in the Kaggle competition we worked on. We had an RMSLE (Root Mean Squared Log Error; sketched below) of \approx 0.43, which is about 0.2 off the winning Kaggle submission. Decent for a few hours of work. It does look like working on Kaggle competitions may become a mainstay / regular end-of-week exercise
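
For reference, here's a minimal sketch of the kind of gradient descent loop we built, applied to the linear regression cost. The function name, learning rate, and toy data are mine for illustration, not the exact code from the week:

```python
import numpy as np

def gradient_descent(X, y, lr=0.5, n_iters=1000):
    """Minimize the mean-squared-error cost of linear regression
    with batch gradient descent. X is assumed to include a bias column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        preds = X @ theta                 # current predictions
        grad = (X.T @ (preds - y)) / m    # gradient of (1/2m) * sum (pred - y)^2
        theta -= lr * grad                # step downhill
    return theta

# toy usage: y = 2 + 3x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 + 3 * x + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones_like(x), x])  # prepend the bias column
print(gradient_descent(X, y))              # roughly [2, 3]
```

The same loop minimizes the logistic regression cost once you swap the predictions for a sigmoid and the cost for log-loss; the gradient expression keeps the same X^T (pred - y) form.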
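And a sketch of how the CV_{(k)} formula above maps to code, using scikit-learn's KFold. The helper name cv_mse and the toy data are assumptions of mine, not the class exercise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cv_mse(model, X, y, k=5):
    """CV_(k) = (1/k) * sum_i MSE_i, averaging the MSE over k held-out folds."""
    fold_mses = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])          # train on k-1 folds
        preds = model.predict(X[test_idx])             # predict on the held-out fold
        fold_mses.append(mean_squared_error(y[test_idx], preds))
    return np.mean(fold_mses)

# usage on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
print(cv_mse(LinearRegression(), X, y, k=5))
```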
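For completeness, RMSLE is just RMSE computed on log(1 + value), which keeps large targets from dominating the score. A small sketch (the function name is mine):

```python
import numpy as np

def rmsle(y_pred, y_true):
    """sqrt(mean((log(pred + 1) - log(actual + 1))^2))"""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# e.g. rmsle(np.array([105.0, 48.0]), np.array([100.0, 50.0]))
```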