Yet Another Data Blog: modeling

Showing posts with label modeling. Show all posts

Sunday, April 13, 2014

Slides from Project Presentation #How Will My Senator Vote?

Here are slides from my project presentation on analyzing How Senators vote in Congress and building a model to predict how they would vote on future bills

Saturday, April 12, 2014

Week 11 : Zipfian Academy -The Beginning of the End

We started the week wrapping things up with our personal projects, putting together decks for our Hiring Day presentations and doing mock runs of our presentations. Towards mid-week, we did more mock runs and put final touches on our presentation decks.

Hiring Data was pretty hectic. It started off with a short mixer with representatives from the various companies that attended. Each of the companies did a quick presentation on who they were and what they were looking for. Once that was done, we proceeded to presenting each of our projects taking a few questions from the audience at the end of each presentation. There were a lot of really cool projects.

After project presentations and lunch, we had "speed dating" sessions with each of the companies that attended. It was a couple of minutes introducing yourself to the company, hearing what they were looking for and seeing if there's a good fit. It was quite tiring going through 16 or so interviews in the span of two hours but it was a worthwhile experience.

Most of us spent the last day of the week cleaning up and refactoring our project code.

Project Next Steps : I do plan to continue working on my project down the line, making some more improvements to my pipeline, looking at new and richer data sources, asking more interesting questions and doing some more analysis to improve my prediction accuracy. There's still a lot of ground to cover here. I also plan to use Latent Dirichlet Allocation (LDA) to extract better features from my data as you can pull out really rich and interesting features from your data using topic modeling. My original model used a "bag of words" approach. The eventual goal would be to release this as a web app anyone could use.

Highlights from the week:

We started the week with a guest lecture from @itsthomson. He is the founder of framed.io. He just finished the YC program and had lots of words of wisdom. He walked us through his experience making the transition from academia to Data Science, moving to a Chief Scientist role and now Founder. It's refreshing to hear from someone that has gone through the process. Some quotes from his lecture : "Data is the most offensive (vs defense) resource a company has",.. "In Data Science, you have to know a little of everything",.."Being technical helps, but being convincing is better",.. "Understanding how your analysis ties back to your business / organization is key"
We attended a Data Science for Social Good panel event at TaggedHQ. The panelist included CTO - Code for America, CEO - Lumiata, Data Scientist - BrightBytes, Data Scientist - OPower and Lead Data Scientist - Change.org. These companies are utilizing data science to make a difference. It was a very insightful panel session.
Hiring Day was rather interesting. 16 companies attended. The companies came from different verticals including CRM, consulting, social good, social, health, payments, real estate, education and infrastructure. It was interesting hearing some of the problems they were trying to solve in their respective domains

Saturday, April 5, 2014

Week 10 : Zipfian Academy - Closing the loop ... rinse..repeat

Continued working on my personal project and was glad my data ingestion and aggregation pipeline was built and optimized.

Analysis : Now that I had most of the data I needed, the next step was trying the close the loop as soon as possible, get some predictions for each Senator and then iterate. One challenge was trying the find signals that indicated uniqueness just from voting patterns and the content of the bills. As part of my analysis, I used techniques like MDS, clustering and NLP to extract salient features from my initial dataset. I did find out from my analysis that over the past 3.5 years, Democrats are more predictable and are more alike than Republicans based on just their voting patterns.

Modeling : I started off with a Naive bag of words model and got an average prediction accuracy in the low 60's. I went back and did some chi-squared feature selection, natural language processing (tfidf, n-grams, stop-words, stemming, binning, lemmatization, etc...), grid search and cross-validation on a pipeline of models (Logistic Regression, Random Forest, SVM, AdaBoost, Naive Bayes, kNN) and added some social data from wikipedia and twitter. This improved my average prediction accuracy to the high 60's. Moving forward, there's still a lot of ground to cover here. I can probably get this to low 80's on average prediction accuracy across all the Senators in congress The biggest take away here is to spend time and lots of it understanding your dataset, crafting better features and adding external data that would give additional insights or increase the richness of your data. The modeling part can be automated but your models can only be as good as the data you feed them.

At this point, we're all seeing the light at the end of the tunnel. I gave a top level overview of my project. I'm working on putting up a Github repo with a more in-depth version.

Highlights from the week:

We had a guest lecture from @WibiData on building real-time recommendation engines at scale with kiji. The kiji platform seems pretty mature and has support for quite a few languages and connectors to several Big Data frameworks. Evaluating recommender engines has always been a problem. One approach is to perform validation on a hold out sample of your data.
We also had another pretty interesting guest lecture from @maebert. They've built an automatic journaling tool. They built a data product out all the ambient data (passively generated data) we generate by triangulating your position using GPS signals and cell phone towers. They use those data points to tell a story about you. Their pipeline looks something like this (Signals -> Data -> Information -> Knowledge -> Stories). It's actually quite cool how patterns start to emerge when you look at aggregate data. I guess we all know a little something about that from the "revelations" that happened last summer. They utilize techniques like LSA / LDA/ SVD to extract concepts and their weights, expectation maximization (Gaussian mean shift) and some NLP. They try to see if the concepts change over time and also try to enrich their datasets using external feeds for weather data, events data, ticketing, etc
We had breakouts on presentations. We worked on our projects for two weeks and trying to bottle all that work into a three minute presentation won't do it justice. So you'd want to answer the following questions to give the audience enough to spark some interest - What?, Why?, How?, So What?, Next Steps?