Extreme Quantified Self. This MIT professor analyzed about 90,000 hours of video / 140,000 hours of audio / 200 terabytes of home videos to understand how his child's speech developed. This is probably one of the coolest things I've seen. He started a company (Bluefin Labs) around the technology he used for the analysis and then sold that company to Twitter for a fat wad of cash. This is his TED talk
You definitely want to utilize the resources at your local public library. These days they have amazing resources like access to Safari which gives you access to O'Reilly and Packt titles
Some very good advice if you're on the interview trail - Always Be Coding
A nice visualization / simulation of what's happening in a Multi - armed Bandit problem
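For anyone who prefers code to pictures, here is a minimal sketch of one common bandit strategy, epsilon-greedy, in plain Python. It is not the linked visualization's code; the payout rates, step count, epsilon value and the function name `epsilon_greedy` are all made up for illustration.

```python
import random


def epsilon_greedy(true_rates, steps, epsilon, seed=0):
    """Simulate an epsilon-greedy agent on Bernoulli arms.

    Returns (pull counts per arm, estimated payout rate per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Hypothetical arms paying out 20%, 50% and 80% of the time.
counts, estimates = epsilon_greedy([0.2, 0.5, 0.8], steps=5000, epsilon=0.1)
```

With a small epsilon the agent spends most of its pulls on whichever arm currently looks best, while still sampling the others often enough to correct an unlucky start.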
Thoughts on Collective Intelligence, Data Wrangling, Data Science, Predictive Modeling, Start-ups and a repository for ideas and things I hope not to forget
Sunday, April 27, 2014
Friday, April 25, 2014
Zipfian Academy - All 12 weeks
Here you go.. a week to week summary of my experience at Zipfian Academy
Week 2 : Zipfian Academy - Are you Frequentist or Bayesian ?
Week 3 : Zipfian Academy - Multi-armed bandits and some Machine Learning
Week 4 : Zipfian Academy - Oh SQL, Oh SQL... MySQL and some NLP too
Week 5 : Zipfian Academy - Graphs and Community Detection
Week 6 : Zipfian Academy - The Elephant, the Pig, the Bee and other stories
Week 7 : Zipfian Academy - Advanced Machine Learning and Deep Learning
Week 8 : Zipfian Academy - Assessment and Review
Week 9 : Zipfian Academy - Personal projects
Week 10 : Zipfian Academy - Closing the loop ... rinse..repeat
Week 11 : Zipfian Academy - The Beginning of the End
Week 12 : Zipfian Academy - And That's All folks....
For a different point of view about the Zipfian experience, do check out fellow alum Melanie's blog, All the Tech Things
Friday, April 18, 2014
Week 12 : Zipfian Academy - And That's All folks....
And so, all great things must come to an end. This is the final week for the bootcamp program. We continued interview prep, white boarding and code reviews. Apparently interviewing feels like having a full time job. Towards the end of the week, we continued with project one on ones, some more white boarding and interview prep, runtime and complexity analysis.
At the end of the week, we had a get together to celebrate the past 12 weeks. A handful of alums from the last cohort attended and it's kind of cool to see what past alums are doing now: some are at stealth startups and startups while others are working at some very impressive companies.
Highlights of the week:
- We had a guest lecture from former cosmologist and Data Scientist @datamusing on using topic models to understand restaurant reviews. The topics were learned from the review corpus using LDA and NNMF. He also had a pretty cool d3 + flask visualization to show the results
- We spent a day at the Big Data Innovation Summit. The morning talks mostly felt like business sales pitches. The afternoon talks were a lot more interesting as there were breakouts for Data Science, Machine Learning, Hadoop, etc
- In the Data Science breakouts, there were a lot of LDA related talks including using topic modeling in Health Care and using LDA to extract features for matches in a dating website.
- Lots of interview prep and white boarding
And so this is it. My hope is that someone actually finds my ramblings over the past 12 weeks somewhat helpful in forging their own path into Data Science......signing off.
Sunday, April 13, 2014
Slides from Project Presentation #How Will My Senator Vote?
Here are slides from my project presentation on analyzing How Senators vote in Congress and building a model to predict how they would vote on future bills
Saturday, April 12, 2014
Week 11 : Zipfian Academy - The Beginning of the End
We started the week wrapping things up with our personal projects, putting together decks for our Hiring Day presentations and doing mock runs of our presentations. Towards mid-week, we did more mock runs and put final touches on our presentation decks.
Hiring Day was pretty hectic. It started off with a short mixer with representatives from the various companies that attended. Each of the companies did a quick presentation on who they were and what they were looking for. Once that was done, we proceeded to present each of our projects, taking a few questions from the audience at the end of each presentation. There were a lot of really cool projects.
After project presentations and lunch, we had "speed dating" sessions with each of the companies that attended. It was a couple of minutes introducing yourself to the company, hearing what they were looking for and seeing if there's a good fit. It was quite tiring going through 16 or so interviews in the span of two hours but it was a worthwhile experience.
Most of us spent the last day of the week cleaning up and refactoring our project code.
Project Next Steps : I do plan to continue working on my project down the line, making some more improvements to my pipeline, looking at new and richer data sources, asking more interesting questions and doing some more analysis to improve my prediction accuracy. There's still a lot of ground to cover here. I also plan to use Latent Dirichlet Allocation (LDA) to extract better features from my data as you can pull out really rich and interesting features from your data using topic modeling. My original model used a "bag of words" approach. The eventual goal would be to release this as a web app anyone could use.
Highlights from the week:
- We started the week with a guest lecture from @itsthomson. He is the founder of framed.io. He just finished the YC program and had lots of words of wisdom. He walked us through his experience making the transition from academia to Data Science, moving to a Chief Scientist role and now Founder. It's refreshing to hear from someone that has gone through the process. Some quotes from his lecture : "Data is the most offensive (vs defense) resource a company has",.. "In Data Science, you have to know a little of everything",.."Being technical helps, but being convincing is better",.. "Understanding how your analysis ties back to your business / organization is key"
- We attended a Data Science for Social Good panel event at TaggedHQ. The panelists included CTO - Code for America, CEO - Lumiata, Data Scientist - BrightBytes, Data Scientist - OPower and Lead Data Scientist - Change.org. These companies are utilizing data science to make a difference. It was a very insightful panel session.
- Hiring Day was rather interesting. 16 companies attended. The companies came from different verticals including CRM, consulting, social good, social, health, payments, real estate, education and infrastructure. It was interesting hearing some of the problems they were trying to solve in their respective domains
Saturday, April 5, 2014
Data Science Bootcamp Programs - Full Time, Part Time and Online
I've gotten a lot of inquiries on options to move into Data science. This is my attempt to answer that question. If I excluded any programs from this, please feel free to ping me. You'll see that there are quite a few options and you need to find the best fit based on your profile. This list does not include any university programs.
Everyone seems to reference the quote from Google Economist Hal Varian "Being a statistician is the sexiest job of the 21st century" and the McKinsey report about the shortage in Data Science talent.
For a guide on factors to consider when Choosing a Data Science Bootcamp Program, the article should be helpful.
We are collecting and publishing detailed Data Science Bootcamp Reviews from students that have attended and graduated from the various Data Science Bootcamps
Visit this link for more in depth coverage of Data Science Bootcamp Programs including Interviews with Data Science Bootcamp Founders
Regarding Data Science Interview Resources : I hear from a lot of people asking about interview resources and the most efficient way to prepare for Data Science interviews. At a lot of companies and startups, a very important component of the interview process is the Take Home Data Challenge and/or the Onsite Data Challenge. Another important component is the Theory interview; I'll talk more on this later.
This is also a great resource for individuals who feel they have the background and experience to interview for jobs without going through a bootcamp type program.
To become more familiar with and get efficient working on Data Challenges, I recommend taking a look at the Collection of Data Science Take-home Challenges book. It gives very clear and realistic examples of some of the types of problems you could face on a Data Challenge and projects you could potentially work on as a data scientist
Here we go...
Full Time
Zipfian Academy : This is not a 0-60 school. It's more like 40-80. They are currently about to graduate their second cohort.
- Notes : Of all the Data Science bootcamps, Zipfian has the most ambitious curriculum. Graduates from the first cohort are currently working in Data Scientist roles across the Bay Area. I'm currently part of the second cohort
- Location : San Francisco, CA
- Requirement : Familiar with programming, statistics and math. Quantitative background
- Duration : 12 weeks
Update : Since the initial post went up a few months ago, Zipfian Academy has added two more programs
Data Engineering 12 - week Immersive : This follows the same format as the Data Science Immersive. The first cohort for this program will start January 2015
- Notes : This follows the same format as the Data Science Immersive
- Location : San Francisco, CA
- Requirement : Quantitative / Software Engineering background
- Duration : 12 weeks
Data Fellows 6 - week Fellowship : The first cohort for the fellows program will start Summer 2014
- Notes : This program is free for accepted fellows
- Location : San Francisco, CA
- Requirement : Significant Data Science Skills, Quantitative background
- Duration : 6 weeks
Also see a recent google hangout explaining these new programs : Zipfian Academy Data Fellows Program - Information Session
Data Science Europe Bootcamp : This looks like it's modeled after the Insight program. They select a small group of very smart people with advanced degrees and help them get ready for Data Science roles in 6 weeks.
Interview with Data Science Europe Founder
Data Science Europe Student Reviews
- Notes : It enrolls the first cohort January 2015. Also, if you don't receive an offer for a quantitative job within 6 months of completing the course, you'll receive a full refund on tuition paid. They're currently on their second cohort and have a 100% placement rate
- Location : Dublin, Ireland
- Requirement : Quantitative Degree, Programming knowledge and Statistics background. It looks like they prefer graduate students and Post Docs but are open to applications from undergrads.
- Duration : 6 weeks
Insight Data Science : Accepts only PhDs or PostDocs. They have completed 5 cohorts in Palo Alto and are opening up a new class in New York this summer. From their website, it does look like they have almost perfect placement. It is project based self directed learning, so if you need some hand holding or you're not already very familiar with the material this may not be the program for you
- Notes : No Fees, pays Stipend
- Location : Palo Alto, CA / New York, NY
- Requirement : PhD / PostDoc
- Duration : 7 weeks
Insight Data Engineering : They'll enroll the first cohort this summer. Bootcamp will focus on the data engineering track. It is project based self directed learning, so if you need some hand holding or you're not already very familiar with the material this may not be the program for you
- Notes : No Fees
- Location : Palo Alto, CA
- Requirement : strong background in math, science and software engineering
- Duration : 7 weeks
- Notes : Curriculum is mostly in R, though they support other languages (python). They have tiered pricing for the class, so you can pay for which tier meets your needs
- Location : Berlin
- Requirement : Experience with programming, databases, R, Python
- Duration : 12 weeks
Data Science For Social Good : hosted by the University of Chicago. The students work with non-profits, federal agencies and local governments on projects that have a social impact
- Notes : they focus on civic projects or projects with social impact
- Location : Chicago, IL
- Requirement : It looks like they target academics (undergraduate and graduate students)
- Duration : 12 weeks
Metis Data Science Bootcamp : This looks like it's modeled after the Zipfian program from a duration / structure / curriculum standpoint. It is owned by Kaplan, which also recently acquired Dev Bootcamp. Looks like the big .edu players are trying to make a play for the tech bootcamp space
Interview with Metis Data Science Cofounder
- Notes : It enrolls the first cohort Fall 2014. For individuals who are not already in the US or are international students, you could obtain an M-1 visa to attend. They're probably the first bootcamp that are able to issue M-1 student visas
- Location : New York, NY and San Francisco, CA
- Requirement : Familiarity with Statistics and Programming
- Duration : 12 weeks
Science to Data Science : They accept only PhDs / Post Docs or those close to completing their PhD studies. We are seeing more bootcamps adopt this model.
- Notes : It enrolls the first cohort August 2014. There is a small registration fee for the course otherwise the program is free for participants
- Location : London, UK
- Requirement : PhD / Post Doc
- Duration : 5 weeks
Level Data Analyst Bootcamp : This is one of the first full time Data Analyst bootcamps we've seen and it's run by a University, which is also a first. I think folks in academia have realized that the typical university structure can't keep up with the pace of innovation in the space
- Notes : Curriculum looks standard for the Data Analyst and Marketing Analytics job track. They also run hybrid and full - time programs
- Location : Boston, MA, Charlotte, NC, Seattle, WA, Silicon Valley, CA
- Requirement :
- Duration : 8 weeks
Praxis Data Science : This is another program coming with an interesting approach. Another option for individuals with a strong STEM and programming background who want to make a move into Data Science
- Notes : It enrolls the first cohort Summer 2015. They also offer a money back guarantee and will refund up to half of the fees paid if you're unable to find a job within 3 months. This speaks to the fact that they have a vested interest in their students' success. The curriculum also seems to focus on building the practical skills needed to both land a role and continue to grow as a Data Scientist.
- Location : Silicon Valley, CA
- Requirement : Looks like they're looking for people with a STEM background (advanced degrees preferred) and programming / quantitative experience
- Duration : 6 weeks
Insight Health Data Science : This is the first significant deviation we've seen from the norm (focus - wise). This program seems to have the same structure as the other Insight programs but the focus here is solely in Healthcare and Life Sciences. It is project based self directed learning, so if you need some hand holding or you're not already very familiar with the material this may not be the program for you
- Notes : No Fees
- Location : Boston, MA
- Requirement : PhD / PostDoc
- Duration : 7 weeks
Startup.ML Data Science Fellowship : Startup.ML is taking an interesting approach to Data Science education. Their fellows work on real problems with established Data Science teams or on undefined startup problems.
- Notes : No Fees. They also enrolled their first cohort in March 2015. I would imagine the typical profile here is someone that may be much further along.
- Location : San Francisco, CA
- Requirement : Background in Software Engineering, Quantitative Analysis. Advanced Quantitative degrees
- Duration : 4 months
ASI Data Science Fellowship : This is another program modeled after the Insight program. They pair students with an Industry partner which allows students to work on real business problems / data. They also have a modular program which allows for some customization.
GA Data Science Immersive : General Assembly was actually one of the first outfits to start part time Data Science classes. It looks like they've decided to also jump into the fray with a full time Data Science Immersive
- Notes : They've been doing part time Data Science classes for at least two years already
- Location : San Francisco, CA
- Requirement : Seems like they're interested in folks with quantitative backgrounds looking to transition to Data Science
- Duration : 12 weeks
Catenus Science : Catenus is also taking a very different approach here. Catenus Science is a paid apprenticeship program helping skilled Data Scientists explore opportunities at different startups / domains
- Notes : Paid Apprenticeship. Rotate through three different startups, applying your skills to month long projects with these startups. The next session starts June 2016
- Location : San Francisco, CA
- Requirement : Background and Experience in Statistics, Machine Learning, Programming, Product Development. They're probably looking for people who are much further along.
- Duration : 13 weeks
The Data Incubator : Accepts only STEM PhDs or PostDocs. The first class is starting summer 2014.
- Notes : No Fees
- Location : New York, NY
- Requirement : PhD / PostDoc
- Duration : 6 weeks
NYC Data Science Academy : This looks like it's also modeled after the Zipfian 12 week immersive. Another option for non-postdocs on the east coast looking to make the transition to Data Science
- Notes : It enrolls the first cohort February 2015. Just looking at the curriculum, it appears well thought out and seems to cover a lot of breadth. They focus on R and Python and spend significant amounts of the course time covering both ecosystems.
- Location : Manhattan, NY
- Requirement : Looks like they prefer people with STEM advanced degrees or equivalent experience in a Quantitative discipline or programming
- Duration : 12 weeks
Silicon Valley Data Academy : This also looks like another program modeled after the Insight program. It does look like they skew towards applicants that are much further along the skills spectrum
- Notes : No Fees and they run both Data Science and Data Engineering programs
- Location : Redwood City, CA
- Requirement : Advanced Degrees / PhD / Post Docs , Extensive quantitative / engineering background
- Duration : 8 weeks
Microsoft Research Data Science Summer School : targets upper level undergraduate students attending college in the New York area. Program instructors are research scientists from Microsoft Research
- Notes : Each student receives a stipend and a laptop
- Location : New York, NY
- Requirement : upper level undergraduate students interested in continuing to graduate school in computer science or a related field, or in breaking into Data Science
- Duration : 8 weeks
Part Time
- General Assembly - Data Science : San Francisco / New York. Part time program over 11 weeks (2 evenings a week)
- Hackbright - Data Science San Francisco. Full Stack Data Science class over one weekend
- District Data Labs : Washington DC. Data workshops and project based courses on weekends
- Persontyle : London, UK. Offering R based Data Science short classes
- Data Science Dojo : Silicon Valley, CA / Seattle, WA / Austin, TX. Offering data science talks, tutorials and hands on workshops and are looking to build a data science community
- AmpCamp : This is run by UC Berkeley AMPLab. Over two days, attendees learn how to solve big data problems using tools from the Berkeley Data Analytics Stack. The event is also live streamed and archived on YouTube
- NextML
Online
If you have enough time and patience to work through problems yourself, some of these resources will get you started with Data Science.
- Udacity Data Science + GeorgiaTech
- Learn Data Science
- Open Source Data Science Masters
- Data Science Apprenticeship | Overview of Apprenticeship
- Coursera Data Science + Johns Hopkins
- edX
- SlideRule Data Science
- Coursera Data Mining + University of Illinois
- Udacity Data Analyst Nanodegree
- Thinkful Data Science
- DataQuest
- Udacity Machine Learning Nanodegree
- Leada
These bootcamps are popping up and thriving because there is currently an imbalance between demand and supply of Data Science talent, and the acceptance rates at some of the full time bootcamps are anywhere from 1 in 20 to 1 in 40.
p.s : I need to stress that with any of the programs listed above, you need to do your due diligence and ask the tough questions to find out if it's a good fit for you. You probably want to be on the look out for programs that are not transparent about their placement.
Update 1 - 05/14 : Added the new Zipfian programs, Persontyle
Update 2 - 07/14 : Added Metis, Data Science Europe, Science to Data Science
Update 3 - 08/14 : Added Data Science Dojo
Update 4 - 10/14 : Added AMPLab
Update 5 - 11/14 : Added Coursera/UIUC, Udacity Data Analyst Nanodegree, Thinkful, DataInquest
Update 6 - 12/14 : Added NYC Data Science Academy
Update 7 - 01/15 : Added Next.ML, Bitbootcamp, DataQuest
Update 8 - 04/15 : Added Praxis Data Science, Insight Health Data Science
Update 9 - 05/15 : Added Startup.ML Fellowship, ASI Fellowship
Update 10 - 09/15 : Added Silicon Valley Data Academy
Update 11 - 01/16 : Added GA Data Science Immersive, Level Data Analyst Bootcamp, Udacity ML Nanodegree, Leada
Update 12 - 05/16 : Added Catenus Science
Week 10 : Zipfian Academy - Closing the loop ... rinse..repeat
Continued working on my personal project and was glad my data ingestion and aggregation pipeline was built and optimized.
Analysis : Now that I had most of the data I needed, the next step was trying to close the loop as soon as possible, get some predictions for each Senator and then iterate. One challenge was trying to find signals that indicated uniqueness just from voting patterns and the content of the bills. As part of my analysis, I used techniques like MDS, clustering and NLP to extract salient features from my initial dataset. I did find out from my analysis that over the past 3.5 years, Democrats are more predictable and are more alike than Republicans based on just their voting patterns.
Modeling : I started off with a naive bag of words model and got an average prediction accuracy in the low 60's. I went back and did some chi-squared feature selection, natural language processing (tfidf, n-grams, stop-words, stemming, binning, lemmatization, etc...), grid search and cross-validation on a pipeline of models (Logistic Regression, Random Forest, SVM, AdaBoost, Naive Bayes, kNN) and added some social data from Wikipedia and Twitter. This improved my average prediction accuracy to the high 60's. Moving forward, there's still a lot of ground to cover here. I can probably get this to the low 80's on average prediction accuracy across all the Senators in Congress. The biggest takeaway here is to spend time, and lots of it, understanding your dataset, crafting better features and adding external data that would give additional insights or increase the richness of your data. The modeling part can be automated but your models can only be as good as the data you feed them.
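To make the chi-squared feature selection step a bit more concrete, here is a minimal sketch in plain Python. It is not the project code: the tiny "bill" snippets, the yea/nay labels and the function names are invented for illustration, and it computes the textbook 2x2 chi-squared statistic on term presence, which is not necessarily what any particular library computes.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def chi2_scores(docs, labels):
    """Score each term with the chi-squared statistic of its 2x2
    presence/absence by label contingency table (labels are 0 or 1)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    present_pos, present_neg = Counter(), Counter()
    for text, label in zip(docs, labels):
        for term in set(tokenize(text)):
            (present_pos if label == 1 else present_neg)[term] += 1
    scores = {}
    for term in set(present_pos) | set(present_neg):
        a = present_pos[term]   # term present, label 1
        b = present_neg[term]   # term present, label 0
        c = n_pos - a           # term absent, label 1
        d = n_neg - b           # term absent, label 0
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # Terms that appear in every document carry no signal.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_terms(docs, labels, k):
    """Return the k highest scoring terms, ties broken alphabetically."""
    scores = chi2_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: 1 = voted yea, 0 = voted nay.
DOCS = [
    "bill to cut taxes",
    "bill to cut spending",
    "bill to raise taxes",
    "bill to raise spending",
]
LABELS = [1, 1, 0, 0]

best_terms = top_k_terms(DOCS, LABELS, k=2)
```

In this toy corpus the words that separate the two labels ("cut" and "raise") score highest, while words that show up everywhere ("bill", "to") score zero, which is the intuition behind keeping only the top scoring terms as features.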
At this point, we're all seeing the light at the end of the tunnel. I gave a top level overview of my project. I'm working on putting up a Github repo with a more in-depth version.
Highlights from the week:
- We had a guest lecture from @WibiData on building real-time recommendation engines at scale with kiji. The kiji platform seems pretty mature and has support for quite a few languages and connectors to several Big Data frameworks. Evaluating recommender engines has always been a problem. One approach is to perform validation on a hold out sample of your data.
- We also had another pretty interesting guest lecture from @maebert. They've built an automatic journaling tool. They built a data product out of all the ambient data (passively generated data) we generate, triangulating your position using GPS signals and cell phone towers. They use those data points to tell a story about you. Their pipeline looks something like this (Signals -> Data -> Information -> Knowledge -> Stories). It's actually quite cool how patterns start to emerge when you look at aggregate data. I guess we all know a little something about that from the "revelations" that happened last summer. They utilize techniques like LSA / LDA / SVD to extract concepts and their weights, expectation maximization (Gaussian mean shift) and some NLP. They try to see if the concepts change over time and also try to enrich their datasets using external feeds for weather data, events data, ticketing, etc
- We had breakouts on presentations. We worked on our projects for two weeks and trying to bottle all that work into a three minute presentation won't do it justice. So you'd want to answer the following questions to give the audience enough to spark some interest - What?, Why?, How?, So What?, Next Steps?
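The hold-out validation idea from the recommender lecture above can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the toy interaction data, the popularity baseline and the names `split_holdout`, `recommend_popular` and `hit_rate`. None of it comes from kiji.

```python
from collections import Counter


def split_holdout(interactions):
    """Hold out each user's last item for testing and train on the rest.
    Users with fewer than two items stay entirely in the training set."""
    train, held_out = {}, {}
    for user, items in interactions.items():
        if len(items) < 2:
            train[user] = list(items)
        else:
            train[user] = list(items[:-1])
            held_out[user] = items[-1]
    return train, held_out


def recommend_popular(train, user, k):
    """Baseline recommender: the k most popular items the user has not seen."""
    counts = Counter(item for items in train.values() for item in items)
    seen = set(train.get(user, []))
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return [i for i in ranked if i not in seen][:k]


def hit_rate(train, held_out, k):
    """Fraction of users whose held-out item shows up in their top-k list."""
    if not held_out:
        return 0.0
    hits = sum(
        1 for user, item in held_out.items()
        if item in recommend_popular(train, user, k)
    )
    return hits / len(held_out)


# Hypothetical user -> ordered list of items they interacted with.
INTERACTIONS = {
    "ann": ["a", "b", "c"],
    "bob": ["a", "c", "b"],
    "cat": ["b", "a", "d"],
    "dan": ["a"],
}

train, held_out = split_holdout(INTERACTIONS)
score = hit_rate(train, held_out, k=1)
```

Any recommender can be swapped in for `recommend_popular`; the point is that it only ever sees the training split, and is scored on items it was never shown.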