A few takeaways from this week:
- There were a few algorithms I had always only sort of understood. They become much clearer once you implement them from first principles and apply them to a dataset. We implemented a Gradient Descent function and used it to minimize the cost functions of both linear and logistic regression problems (I'll probably have a more detailed blog post on this). Working on regularization with Lasso and Ridge also gave me a better understanding of how they both work (see the gradient descent sketch after this list)
- We had a visit from @StreetLightData. Very cool problem they're working on: they essentially model mini-migration patterns within cities and across the country. They feed cell-signal, GPS, Census (demographic / geographic), and traffic data into their systems to extract insights used for marketing and planning
- Always remember 80-20. Data scientists spend about 80% of their time cleaning datasets and extracting features (or at least more than half their time) and about 20% doing modeling and parameter tuning. Forget those datasets you used in Stats class; real-world data can be real messy
- $k$-fold cross-validation helps you prevent overfitting, gives you an estimate of your prediction error, and helps you understand how stable / robust your model is:
$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^k \mathrm{MSE}_{i}$$ where $\mathrm{MSE}_{i}$ is the Mean Squared Error computed on the $i$-th held-out fold (a quick code sketch is below, after the list)
- My team took the top spot in the Kaggle competition we worked on. We had an RMSLE (Root Mean Squared Log Error) of $\approx 0.43$, which is about $0.2$ off the winning Kaggle submission. Decent for a few hours of work. It does look like Kaggle competitions may become a regular end-of-week exercise (a quick sketch of the RMSLE metric is also below)
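
For the gradient descent bullet above, here's a minimal sketch of the kind of exercise we did, assuming a plain MSE cost for linear regression with an optional Ridge-style L2 penalty. The function name `gradient_descent`, the learning rate, and the synthetic data are my own illustration, not the exact code from class.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000, l2=0.0):
    """Minimize the MSE cost of linear regression with batch gradient descent.

    l2 > 0 adds a Ridge-style penalty on the weights (bias term excluded).
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a bias column
    theta = np.zeros(n + 1)
    for _ in range(n_iters):
        error = Xb @ theta - y             # residuals
        grad = (Xb.T @ error) / m          # gradient of 0.5 * MSE
        grad[1:] += (l2 / m) * theta[1:]   # Ridge penalty gradient (skip bias)
        theta -= lr * grad
    return theta

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(gradient_descent(X, y, lr=0.1, n_iters=2000))  # roughly [3, 2, -1]
```

Setting `l2=0` recovers plain linear regression. A Lasso (L1) penalty isn't differentiable at zero, so it's usually handled with coordinate descent or a subgradient rather than this vanilla update.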
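And here's a small sketch of the $CV_{(k)}$ estimate from the cross-validation bullet, assuming scikit-learn handles the splitting and the model; `cv_mse` and the toy data are just illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def cv_mse(model, X, y, k=5, seed=0):
    """Estimate CV_(k): the average of the k held-out-fold MSEs."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    fold_mses = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        fold_mses.append(mean_squared_error(y[test_idx], preds))
    # mean is the CV estimate; the spread across folds hints at model stability
    return np.mean(fold_mses), np.std(fold_mses)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.2 * rng.normal(size=200)
print(cv_mse(LinearRegression(), X, y, k=5))
```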
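Finally, a quick sketch of the RMSLE metric from the Kaggle bullet. This is the standard log1p-based definition; the sample numbers are made up.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error: penalizes relative rather than absolute
    errors, so predicting 110 for 100 costs about as much as 1100 for 1000."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle(np.array([100.0, 200.0]), np.array([110.0, 150.0])))
```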