We are at the halfway point of the structured part of the class. In case you're thinking of doing this yourself, my schedule these days runs about 12-15 hours a day during the week: daily sprints (data scrubbing, transformation, and machine learning challenges), plus lectures and reading data science materials. Over the weekend it's roughly 10 hours a day closing the loop on a few of the sprints from the current week and doing more reading for the following week. You basically live and breathe data science... all day long, all week long.
Highlights from the week:
- We had two guest lectures this week, one on Naive Bayes and one on feature extraction for NLP. Zipfian also added a new guest lecturer to their roster: a Deep Learning expert. I'm really excited to explore working on new datasets with neural networks.
- Implementing some of these algorithms from first principles gives you a much better understanding of how they work and what may be going on under the hood when they fail (a from-scratch sketch in that spirit follows this list).
- My team also took the top spot in the Kaggle competition for the second week running. The problem we worked on was a classification task with AUC (Area Under the ROC Curve) as the evaluation metric. We achieved an AUC of about 0.8895, roughly 0.008 behind the leading submission on the public leaderboard.
- Cross-validating on your training set is always a good idea; it gives a far more honest estimate of how a model will do on unseen data than the training score alone (the second sketch below shows this with AUC as the scoring metric).
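
To give a flavor of what "from first principles" means, here is a minimal Gaussian Naive Bayes classifier written from scratch with NumPy. This is just an illustrative sketch; the class name and the toy data below are made up for this post, not the actual sprint code.

```python
# Minimal Gaussian Naive Bayes from scratch (illustrative sketch, not the sprint code).
import numpy as np

class SimpleGaussianNB:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Per-class feature means, variances (with a small smoothing term), and log priors.
        self.theta_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def predict(self, X):
        # Log Gaussian likelihood per class, summed over features, plus the log prior;
        # the argmax over classes is the prediction.
        log_likelihood = -0.5 * (
            np.log(2 * np.pi * self.var_)
            + (X[:, None, :] - self.theta_) ** 2 / self.var_
        ).sum(axis=2)
        return self.classes_[np.argmax(log_likelihood + self.log_prior_, axis=1)]

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(SimpleGaussianNB().fit(X, y).predict(X[:5]))
```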
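And here is roughly what cross-validating on the training set with AUC as the scoring metric looks like in scikit-learn. The logistic regression model and the synthetic dataset are stand-ins for illustration, not the model or data from the competition.

```python
# Cross-validated AUC on the training set (illustrative stand-in model and data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# scoring="roc_auc" computes the area under the ROC curve on each held-out fold,
# which tracks leaderboard performance better than the training score alone.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"5-fold AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```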
Great posts, thanks for sharing. Would you say the workload was the same intensity for the entire 12 weeks or did it taper off at some point?