Yet Another Data Blog: closing the loop

Friday, April 18, 2014

Week 12 : Zipfian Academy - And That's All folks....

And so, all great things must come to an end. This is the final week of the bootcamp program. We continued with interview prep, whiteboarding and code reviews. Apparently, interviewing feels like having a full-time job. Towards the end of the week, we continued with project one-on-ones, more whiteboarding and interview prep, and runtime and complexity analysis.

At the end of the week, we had a get-together to celebrate the past 12 weeks. A handful of alums from the last cohort attended, and it's kind of cool to see what past alums are doing now. Some are at stealth startups and startups, while others are working at some very impressive companies.

Highlights of the week:

We had a guest lecture from former cosmologist and Data Scientist @datamusing on using topic models to understand restaurant reviews. The topics were learned from the review corpus using LDA and NNMF. He also had a pretty cool D3 + Flask visualization to show the results.
We spent a day at the Big Data Innovation Summit. The morning talks mostly felt like business sales pitches. The afternoon talks were a lot more interesting, as there were breakouts for Data Science, Machine Learning, Hadoop, etc.
In the Data Science breakouts, there were a lot of LDA-related talks, including using topic modeling in health care and using LDA to extract features for matches on a dating website.
Lots of interview prep and whiteboarding.
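
For readers who have not seen a topic model in action, here is a minimal sketch of non-negative matrix factorization using the classic multiplicative update rules. It is not the guest lecturer's code: the vocabulary, the tiny review-by-term count matrix and every function name below are invented for illustration, and a real project would more likely use a library implementation.

```python
import random

# Invented toy data: rows are reviews, columns are counts of each vocabulary term.
vocab = ["pizza", "pasta", "sauce", "service", "waiter", "slow"]
reviews = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 1],
    [1, 1, 0, 1, 1, 0],
]


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[t][j] * num[t][j] / (den[t][j] + eps) for j in range(m)]
             for t in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][t] * num[i][t] / (den[i][t] + eps) for t in range(k)]
             for i in range(n)]
    return w, h


w, h = nmf(reviews, k=2)
for t, weights in enumerate(h):
    ranked = sorted(zip(vocab, weights), key=lambda pair: -pair[1])
    print(f"topic {t}:", [term for term, _ in ranked[:3]])
```

Each row of `h` is a topic expressed as weights over the vocabulary, and each row of `w` says how much of each topic a review contains. LDA gives a similar docs-by-topics and topics-by-terms picture, but through a probabilistic model rather than a matrix factorization.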

And so this is it. My hope is that someone actually finds my ramblings over the past 12 weeks somewhat helpful in forging their own path into Data Science. Signing off.

Sunday, April 13, 2014

Slides from Project Presentation #How Will My Senator Vote?

Here are the slides from my project presentation on analyzing how senators vote in Congress and building a model to predict how they would vote on future bills.

Saturday, April 12, 2014

Week 11 : Zipfian Academy -The Beginning of the End

We started the week wrapping things up with our personal projects, putting together decks for our Hiring Day presentations and doing mock runs of our presentations. Towards mid-week, we did more mock runs and put final touches on our presentation decks.

Hiring Day was pretty hectic. It started off with a short mixer with representatives from the various companies that attended. Each of the companies did a quick presentation on who they were and what they were looking for. Once that was done, we proceeded to present each of our projects, taking a few questions from the audience at the end of each presentation. There were a lot of really cool projects.

After project presentations and lunch, we had "speed dating" sessions with each of the companies that attended. It was a couple of minutes introducing yourself to the company, hearing what they were looking for and seeing if there's a good fit. It was quite tiring going through 16 or so interviews in the span of two hours but it was a worthwhile experience.

Most of us spent the last day of the week cleaning up and refactoring our project code.

Project Next Steps : I do plan to continue working on my project down the line, making some more improvements to my pipeline, looking at new and richer data sources, asking more interesting questions and doing some more analysis to improve my prediction accuracy. There's still a lot of ground to cover here. I also plan to use Latent Dirichlet Allocation (LDA) to extract better features from my data as you can pull out really rich and interesting features from your data using topic modeling. My original model used a "bag of words" approach. The eventual goal would be to release this as a web app anyone could use.

Highlights from the week:

We started the week with a guest lecture from @itsthomson. He is the founder of framed.io. He just finished the YC program and had lots of words of wisdom. He walked us through his experience making the transition from academia to Data Science, moving to a Chief Scientist role and now Founder. It's refreshing to hear from someone who has gone through the process. Some quotes from his lecture: "Data is the most offensive (vs defense) resource a company has", "In Data Science, you have to know a little of everything", "Being technical helps, but being convincing is better", "Understanding how your analysis ties back to your business / organization is key".
We attended a Data Science for Social Good panel event at TaggedHQ. The panelists included CTO - Code for America, CEO - Lumiata, Data Scientist - BrightBytes, Data Scientist - OPower and Lead Data Scientist - Change.org. These companies are utilizing data science to make a difference. It was a very insightful panel session.
Hiring Day was rather interesting. 16 companies attended. The companies came from different verticals including CRM, consulting, social good, social, health, payments, real estate, education and infrastructure. It was interesting hearing some of the problems they were trying to solve in their respective domains.

Saturday, April 5, 2014

Week 10 : Zipfian Academy - Closing the loop ... rinse..repeat

Continued working on my personal project and was glad my data ingestion and aggregation pipeline was built and optimized.

Analysis : Now that I had most of the data I needed, the next step was trying to close the loop as soon as possible, get some predictions for each Senator and then iterate. One challenge was trying to find signals that indicated uniqueness just from voting patterns and the content of the bills. As part of my analysis, I used techniques like MDS, clustering and NLP to extract salient features from my initial dataset. I did find out from my analysis that over the past 3.5 years, Democrats have been more predictable and more alike than Republicans based on just their voting patterns.
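
One simple way to quantify "more alike" is pairwise agreement between voting records, which is also the kind of similarity matrix that MDS or clustering would take as input. The sketch below is an illustration on invented toy votes, not the project's data or pipeline. The senator names, the vote vectors and the helper functions are all hypothetical.

```python
from itertools import combinations

# Invented roll-call data: 1 = yea, 0 = nay, None = did not vote.
votes = {
    "Dem A": [1, 1, 0, 1, 1, 0],
    "Dem B": [1, 1, 0, 1, 1, 1],
    "Dem C": [1, 1, 0, 1, None, 0],
    "Rep A": [0, 0, 1, 0, 1, 1],
    "Rep B": [0, 1, 1, 1, 0, 1],
    "Rep C": [1, 0, 1, 0, 0, 0],
}
party = {name: name.split()[0] for name in votes}


def agreement(a, b):
    """Fraction of bills on which both senators voted and voted the same way."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)


def mean_within_party_agreement(label):
    """Average pairwise agreement among all senators of one party."""
    members = [name for name in votes if party[name] == label]
    pairs = list(combinations(members, 2))
    return sum(agreement(votes[a], votes[b]) for a, b in pairs) / len(pairs)


dem_cohesion = mean_within_party_agreement("Dem")
rep_cohesion = mean_within_party_agreement("Rep")
print(f"Dem cohesion: {dem_cohesion:.2f}, Rep cohesion: {rep_cohesion:.2f}")
```

On real data, `1 - agreement` for every pair of senators gives a distance matrix, and a higher within-party average is one way to express that a party votes more as a block.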

Modeling : I started off with a naive bag-of-words model and got an average prediction accuracy in the low 60s. I went back and did some chi-squared feature selection, natural language processing (tf-idf, n-grams, stop-words, stemming, binning, lemmatization, etc.), grid search and cross-validation on a pipeline of models (Logistic Regression, Random Forest, SVM, AdaBoost, Naive Bayes, kNN) and added some social data from Wikipedia and Twitter. This improved my average prediction accuracy to the high 60s. Moving forward, there's still a lot of ground to cover here. I can probably get this to the low 80s on average prediction accuracy across all the Senators in Congress. The biggest takeaway here is to spend time, and lots of it, understanding your dataset, crafting better features and adding external data that would give additional insights or increase the richness of your data. The modeling part can be automated, but your models can only be as good as the data you feed them.
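
To give a feel for what chi-squared feature selection does on a bag-of-words representation, here is a minimal sketch. It computes the textbook Pearson chi-squared statistic on a 2x2 table (term present or absent versus vote label) for each term. The toy bill texts and labels are invented, and this is not necessarily the exact statistic a given library computes, so treat it as an illustration rather than the code behind the numbers above.

```python
# Invented labelled corpus: (bill text, how a hypothetical senator voted).
docs = [
    ("tax cut for small business", "yea"),
    ("tax relief and business credit", "yea"),
    ("business tax reform", "yea"),
    ("expand health care coverage", "nay"),
    ("health care subsidy expansion", "nay"),
    ("tax on health care plans", "nay"),
]


def chi2_score(term, docs, positive="yea"):
    """Pearson chi-squared for the 2x2 table (term present?) x (label == positive?)."""
    a = b = c = d = 0
    for text, label in docs:
        present = term in text.split()
        is_pos = label == positive
        if present and is_pos:
            a += 1      # term present, positive label
        elif present:
            b += 1      # term present, negative label
        elif is_pos:
            c += 1      # term absent, positive label
        else:
            d += 1      # term absent, negative label
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Bag-of-words vocabulary, then rank terms by how strongly they track the label.
vocabulary = sorted({word for text, _ in docs for word in text.split()})
scores = {term: chi2_score(term, docs) for term in vocabulary}
top_terms = sorted(vocabulary, key=lambda term: (-scores[term], term))[:3]
print(top_terms)
```

Terms that appear only with one label score highest, while a term like "tax" that shows up on both sides scores lower. Keeping only the top-scoring terms shrinks the feature space before the grid search and cross-validation steps.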

At this point, we're all seeing the light at the end of the tunnel. I gave a top-level overview of my project. I'm working on putting up a GitHub repo with a more in-depth version.

Highlights from the week:

We had a guest lecture from @WibiData on building real-time recommendation engines at scale with Kiji. The Kiji platform seems pretty mature and has support for quite a few languages and connectors to several Big Data frameworks. Evaluating recommender engines has always been a problem. One approach is to perform validation on a hold-out sample of your data.
We also had another pretty interesting guest lecture from @maebert. They've built an automatic journaling tool. They built a data product out of all the ambient (passively generated) data we produce, triangulating your position using GPS signals and cell phone towers. They use those data points to tell a story about you. Their pipeline looks something like this (Signals -> Data -> Information -> Knowledge -> Stories). It's actually quite cool how patterns start to emerge when you look at aggregate data. I guess we all know a little something about that from the "revelations" that happened last summer. They utilize techniques like LSA / LDA / SVD to extract concepts and their weights, expectation maximization (Gaussian mean shift) and some NLP. They try to see if the concepts change over time and also try to enrich their datasets using external feeds for weather data, events data, ticketing, etc.
We had breakouts on presentations. We worked on our projects for two weeks, and trying to bottle all that work into a three-minute presentation won't do it justice. So you'd want to answer the following questions to give the audience enough to spark some interest - What?, Why?, How?, So What?, Next Steps?
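
The hold-out idea mentioned for evaluating recommenders can be sketched in a few lines. This is a minimal illustration, not Kiji's API or the lecturer's method: the ratings are synthetic, the "recommender" is just a per-item average, and all function names are hypothetical.

```python
import random
from collections import defaultdict
from math import sqrt

# Synthetic (user, item, rating) triples standing in for real interaction data.
ratings = [(u, i, 1 + (u * 3 + i * 5) % 5) for u in range(20) for i in range(10)]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle once, then hold out a fixed fraction that the model never sees."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_item_means(train):
    """Baseline 'recommender': predict each item's mean training rating."""
    totals = defaultdict(lambda: [0.0, 0])
    for _, item, rating in train:
        totals[item][0] += rating
        totals[item][1] += 1
    global_mean = sum(r for _, _, r in train) / len(train)
    item_means = {item: s / n for item, (s, n) in totals.items()}
    return item_means, global_mean


def rmse(test, item_means, global_mean):
    """Root mean squared error on the held-out ratings."""
    errors = [(r - item_means.get(item, global_mean)) ** 2 for _, item, r in test]
    return sqrt(sum(errors) / len(errors))


train, test = train_test_split(ratings)
item_means, global_mean = fit_item_means(train)
holdout_rmse = rmse(test, item_means, global_mean)
print(f"hold-out RMSE: {holdout_rmse:.3f}")
```

The key point is that the model is fit only on `train` and scored only on `test`, so the error reflects ratings it has not seen. Any smarter recommender can be swapped in for `fit_item_means` and compared against this baseline on the same split.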