Wednesday, December 31, 2014

Year in Review

It's been a really interesting year. I moved to the Bay Area. It's one thing to read about Silicon Valley or to visit briefly; it's another to actually live here and experience everything it has to offer. This is the center of the data revolution everyone seems to be talking about. If you can manage the ridiculously expensive housing and the generally higher cost of living, you should be fine.





To wrap up the year, here is Jeff Leek's Non-comprehensive list of awesome things other people did in 2014. It has an R slant since he's a statistician.


I had more blog posts and traffic this year than the previous three years combined. I'm hoping this trend continues. Judging from my traffic, there does appear to be a lot more interest in Data Science education and immersive experiences like boot camps.


Going forward, I plan to do more tutorial style posts showing side projects or other interesting tech I encounter. 


I do want to spend more time delving into Deep Learning, starting with the nuts and bolts and then moving to available libraries and implementations, sharing some of what I learn along the way... stay tuned 


Monday, December 22, 2014

Some more interesting links-4, Machine Intelligence, TDA, ipython notebooks

Most Topological Data Analysis tools are either stuck in academic research papers or locked up as company intellectual property. DataRefiner might help to change that

Python for Exploratory Computing : a collection of ipython notebooks covering Python basics, statistics and advanced Python topics

A collection of ipython notebooks on hacking security data 

This is the future of education: Open Loop University, where your education is spread over several years, with periods of work and schooling interlaced

Detailed infographic showing the major players in the Machine Intelligence space

You should look at this if you're interested in the Quantified Self space

I've been looking for something like this. Instant temporary ipython notebooks hosted in the cloud

An extensive Deep Learning Reading list

Nice reading on Generative vs Discriminative algorithms (Naive Bayes vs. Logistic Regression)


Sunday, August 17, 2014

Getting Started with Vowpal Wabbit - Part 1 : Installation

After a very long hiatus, I'm back blogging. I'm really excited about how the year is shaping up.... stay tuned.

I discovered Vowpal Wabbit about a year ago but only recently started using it. Vowpal Wabbit is a very fast out-of-core learning system. It's the brainchild of John Langford, and development has been supported by Microsoft Research and, in the past, Yahoo! Research.

This is the first part of a series about getting started with Vowpal Wabbit

To get started on OS X, make sure you already have Xcode and Homebrew installed, then run the command below to update Homebrew:

brew update

Vowpal Wabbit has a few dependencies that also need to be installed via brew. The official docs list the boost library as the only external dependency, but I ran into issues until I also installed automake and libtool:

brew install automake
brew install boost
brew install libtool

To prevent conflicts with Apple's own libtool, Homebrew prefixes a "g" when you install libtool, so you get glibtool and glibtoolize instead. The commands below add a symbolic link:

cd /usr/local/bin/
ln -s glibtoolize libtoolize

Clone the Vowpal Wabbit git repo for the latest code

git clone https://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit

Then you should run

./autogen.sh
./configure
make
make install

If you're having setup issues or problems with dependencies, you may want to spin up a virtual machine. On Ubuntu, you can simply run

sudo apt-get install vowpal-wabbit
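Whichever route you take, a quick sanity check is to hand-write a couple of training examples. Vowpal Wabbit reads a simple plain-text format: a label, a pipe, then features (optionally name:value). The file below (call it train.vw) is a made-up toy example, not real data:

```
1 | price:0.23 sqft:0.25 age:0.05
-1 | price:0.53 sqft:0.32 age:0.87
```

Running `vw train.vw -f model.vw` should train without errors and save the regressor; `vw -t -i model.vw -p preds.txt train.vw` loads the model back (-t means test only, no learning) and writes predictions.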

Two of the most informative blogs with great coverage of Vowpal Wabbit are MLWave and FastML (the latter looks like it's behind a paywall)

Sunday, April 27, 2014

Some more interesting links-3, Quantified Self, Bandits

Extreme Quantified Self. This MIT professor analyzed about 90,000 hours of video, 140,000 hours of audio and 200 terabytes of home video to understand how his child's speech developed. This is probably one of the coolest things I've seen. He started a company (Bluefin Labs) around the technology he used for the analysis and then sold it to Twitter for a fat wad of cash. Here is his TED talk

You definitely want to utilize the resources at your local public library. These days they have amazing resources like access to Safari which gives you access to O'Reilly and Packt titles

Some very good advice if you're on the interview trail - Always Be Coding

A nice visualization / simulation of what's happening in a multi-armed bandit problem
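If you'd rather poke at the mechanics yourself, here is a minimal epsilon-greedy bandit simulation in plain Python (the arm payoff probabilities are made up):

```python
import random

def epsilon_greedy(true_means, n_steps=10000, epsilon=0.1, seed=42):
    """Simulate an epsilon-greedy agent on a Bernoulli multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms      # pulls per arm
    values = [0.0] * n_arms    # running mean reward per arm
    total_reward = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        total_reward += reward
    return counts, values, total_reward

counts, values, total = epsilon_greedy([0.2, 0.5, 0.8])
```

With enough steps, the agent's pull counts concentrate on the best arm, which is exactly the dynamic the visualization shows.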

Friday, April 25, 2014

Zipfian Academy - All 12 weeks

Here you go: a week-by-week summary of my experience at Zipfian Academy

Week 1 : Zipfian Academy - Priming the Pump , some Unix shell, python, recommenders and data wrangling 
Week 2 : Zipfian Academy - Are you Frequentist or Bayesian ? 
Week 3 : Zipfian Academy - Multi-armed bandits and some Machine Learning 
Week 4 : Zipfian Academy - Oh SQL, Oh SQL... MySQL and some NLP too 
Week 5 : Zipfian Academy - Graphs and Community Detection 
Week 6 : Zipfian Academy - The Elephant, the Pig, the Bee and other stories 
Week 7 : Zipfian Academy - Advanced Machine Learning and Deep Learning 
Week 8 : Zipfian Academy - Assessment and Review 
Week 9 : Zipfian Academy - Personal projects
Week 10 : Zipfian Academy - Closing the loop ... rinse..repeat 
Week 11 : Zipfian Academy -The Beginning of the End
Week 12 : Zipfian Academy - And That's All folks....

For a different point of view on the Zipfian experience, do check out fellow alum Melanie's blog, All the Tech Things 

Friday, April 18, 2014

Week 12 : Zipfian Academy - And That's All folks....

And so, all great things must come to an end. This was the final week of the bootcamp program. We continued interview prep, whiteboarding and code reviews; apparently, interviewing feels like having a full-time job. Towards the end of the week, we continued with project one-on-ones, some more whiteboarding and interview prep, and runtime and complexity analysis.

At the end of the week, we had a get-together to celebrate the past 12 weeks. A handful of alums from the last cohort attended, and it was kind of cool to see what past alums are doing now: some are at stealth startups and startups while others are working at some very impressive companies.

Highlights of the week:
  • We had a guest lecture from former cosmologist and Data Scientist @datamusing on using topic models to understand restaurant reviews. The topics were learned from the review corpus using LDA and NNMF. He also had a pretty cool d3 + flask visualization to show the results
  • We spent a day at the Big Data Innovation Summit. The morning talks mostly felt like business sales pitches. The afternoon talks were a lot more interesting as there were breakouts for Data Science, Machine Learning, Hadoop, etc  
  • In the Data Science breakouts, there were a lot of LDA-related talks, including using topic modeling in Health Care and using LDA to extract features for matching on a dating website. 
  • Lots of interview prep and white boarding 

And so this is it. My hope is that someone actually finds my ramblings over the past 12 weeks somewhat helpful in forging their own path into Data Science......signing off.

Sunday, April 13, 2014

Slides from Project Presentation - How Will My Senator Vote?

Here are slides from my project presentation on analyzing How Senators vote in Congress and building a model to predict how they would vote on future bills

Saturday, April 12, 2014

Week 11 : Zipfian Academy -The Beginning of the End

We started the week wrapping things up with our personal projects, putting together decks for our Hiring Day presentations and doing mock runs of our presentations. Towards mid-week, we did more mock runs and put final touches on our presentation decks.

Hiring Day was pretty hectic. It started off with a short mixer with representatives from the various companies that attended. Each company did a quick presentation on who they were and what they were looking for. Once that was done, we proceeded to present our projects, taking a few questions from the audience at the end of each presentation. There were a lot of really cool projects.

After project presentations and lunch, we had "speed dating" sessions with each of the companies that attended. It was a couple of minutes introducing yourself to the company, hearing what they were looking for and seeing if there's a good fit. It was quite tiring going through 16 or so interviews in the span of two hours but it was a worthwhile experience.

Most of us spent the last day of the week cleaning up and refactoring our project code.

Project Next Steps : I do plan to continue working on my project down the line: making more improvements to my pipeline, looking at new and richer data sources, asking more interesting questions and doing more analysis to improve my prediction accuracy. There's still a lot of ground to cover here. I also plan to use Latent Dirichlet Allocation (LDA) to extract better features, since topic modeling can pull really rich and interesting features out of your data; my original model used a "bag of words" approach. The eventual goal is to release this as a web app anyone can use.

Highlights from the week:
  • We started the week with a guest lecture from @itsthomson. He is the founder of framed.io, just finished the YC program and had lots of words of wisdom. He walked us through his experience making the transition from academia to Data Science, moving to a Chief Scientist role and now Founder. It's refreshing to hear from someone who has gone through the process. Some quotes from his lecture: "Data is the most offensive (vs. defensive) resource a company has", "In Data Science, you have to know a little of everything", "Being technical helps, but being convincing is better", "Understanding how your analysis ties back to your business / organization is key"
  • We attended a Data Science for Social Good panel event at TaggedHQ. The panelists included the CTO of Code for America, the CEO of Lumiata, a Data Scientist from BrightBytes, a Data Scientist from Opower and the Lead Data Scientist at Change.org. These companies are utilizing data science to make a difference. It was a very insightful panel session.
  • Hiring Day was rather interesting. 16 companies attended. The companies came from different verticals including CRM, consulting, social good, social, health, payments, real estate, education and infrastructure. It was interesting hearing some of the problems they were trying to solve in their respective domains

Saturday, April 5, 2014

Data Science Bootcamp Programs - Full Time, Part Time and Online

I've gotten a lot of inquiries about options for moving into Data Science. This is my attempt to answer that question. If I've excluded any programs, please feel free to ping me. You'll see that there are quite a few options, and you need to find the best fit based on your profile. This list does not include any university programs.

Everyone seems to reference the quote from Google Economist Hal Varian "Being a statistician is the sexiest job of the 21st century" and the McKinsey report about the shortage in Data Science talent.

For a guide on factors to consider when choosing a Data Science bootcamp, this article should be helpful.

We are collecting and publishing detailed Data Science Bootcamp Reviews from students that have attended and graduated from the various Data Science Bootcamps

Visit this link for more in depth coverage of Data Science Bootcamp Programs including Interviews with Data Science Bootcamp Founders

Regarding Data Science Interview Resources: I hear from a lot of people asking about interview resources and the most efficient way to prepare for Data Science interviews. At a lot of companies and startups, a very important component of the interview process is the take-home data challenge and/or the onsite data challenge. Another important component is the theory interview; more on this later.

This is also a great resource for individuals who feel they have the background and experience to interview for jobs without going through a bootcamp-type program.

To become more familiar with and get efficient working on Data Challenges, I recommend taking a look at the Collection of Data Science Take-home Challenges book. It gives very clear and realistic examples of some of the types of problems you could face on a Data Challenge and projects you could potentially work on as a data scientist



Here we go...

Full Time

Zipfian Academy : This is not a 0-60 school. It's more like 40-80. They are currently about to graduate their second cohort.

  • Notes : Of all the Data Science bootcamps, Zipfian has the most ambitious curriculum. Graduates from the first cohort are currently working in Data Scientist roles across the Bay Area. I'm currently part of the second cohort
  • Location : San Francisco, CA
  • Requirement : Familiar with programming, statistics and math. Quantitative background
  • Duration : 12 weeks

Update : Since the initial post went up a few months ago, Zipfian Academy has added two more programs

Data Engineering 12-week Immersive : The first cohort for this program will start January 2015
  • Notes : This follows the same format as the Data Science Immersive
  • Location : San Francisco, CA
  • Requirement :  Quantitative / Software Engineering background
  • Duration : 12 weeks
Data Fellows 6-week Fellowship : The first cohort for the fellows program will start Summer 2014
  • Notes : This program is free for accepted fellows
  • Location : San Francisco, CA
  • Requirement :  Significant Data Science Skills, Quantitative background
  • Duration : 6 weeks
Also see a recent google hangout explaining these new programs :  Zipfian Academy Data Fellows Program  - Information Session 



Data Science Europe Bootcamp : This looks like it's modeled after the Insight program: select a small group of very smart people with advanced degrees and help them get ready for Data Science roles in 6 weeks. 



Interview with Data Science Europe Founder

Data Science Europe Student Reviews
  • Notes : It enrolled its first cohort in January 2015. If you don't receive an offer for a quantitative job within 6 months of completing the course, you'll receive a full refund on tuition paid. They're currently on their second cohort and have a 100% placement rate 
  • Location : Dublin, Ireland 
  • Requirement : Quantitative Degree, Programming knowledge and Statistics background. It looks like they prefer graduate students and Post Docs but are open to applications from undergrads.
  • Duration : 6 weeks 

Insight Data Science : Accepts only PhDs or PostDocs. They have completed 5 cohorts in Palo Alto and are opening up a new class in New York this summer. From their website, it does look like they have almost perfect placement. It is project based self directed learning, so if you need some hand holding or you're not already very familiar with the material this may not be the program for you



    • Notes : No Fees, pays Stipend
    • Location : Palo Alto, CA / New York, NY
    • Requirement : PhD / PostDoc
    • Duration : 7 weeks 

    Insight Data Engineering : They'll enroll the first cohort this summer. Bootcamp will focus on the data engineering track. It is project based self directed learning, so if you need some hand holding or you're not already very familiar with the material this may not be the program for you

    • Notes : No Fees 
    • Location : Palo Alto, CA
    • Requirement : strong background in math, science and software engineering
    • Duration : 7 weeks 

    Data Science Retreat : Follows the same format as Zipfian but is based in Europe


      • Notes : Curriculum is mostly in R, though they support other languages (python). They have tiered pricing for the class, so you can pay for which tier meets your needs
      • Location : Berlin
      • Requirement : Experience with programming, databases, R, Python
      • Duration : 12 weeks 

      Data Science For Social Good : hosted by the University of Chicago. The students work with non-profits, federal agencies and local governments on projects that have a social impact

      • Notes : they focus on civic projects or projects with social impact
      • Location : Chicago, IL
      • Requirement : It looks like they target academics (undergraduate and graduate students)
      • Duration : 12 weeks 

      Metis Data Science Bootcamp  : This looks like it's modeled after the Zipfian program from a duration / structure / curriculum standpoint. It is owned by Kaplan, which also recently acquired Dev Bootcamp. Looks like the big .edu players are trying to make a play for the tech bootcamp space

      Interview with Metis Data Science Cofounder
      • Notes : It enrolls the first cohort Fall 2014. For individuals who are not already in the US or are international students, you could obtain an M-1 visa to attend. They're probably the first bootcamp able to issue M-1 student visas
      • Location : New York, NY and San Francisco, CA
      • Requirement : Familiarity with Statistics and Programming
      • Duration : 12 weeks 

      Science to Data Science : They accept only PhDs / Post Docs or those close to completing their PhD studies. We are seeing more bootcamps adopt this model.


      • Notes : It enrolls the first cohort August 2014. There is a small registration fee for the course otherwise the program is free for participants
      • Location : London, UK
      • Requirement : PhD / Post Doc
      • Duration : 5 weeks 

      Level Data Analyst Bootcamp : This is one of the first full time Data Analyst bootcamps we've seen, and it's run by a university, which is also a first. I think folks in academia have realized that the typical university structure can't keep up with the pace of innovation in the space



      • Notes : Curriculum looks standard for the Data Analyst and Marketing Analytics job track. They also run hybrid and full-time programs
      • Location : Boston, MA, Charlotte, NC, Seattle, WA, Silicon Valley, CA
      • Requirement : 
      • Duration : 8 weeks 


      Praxis Data Science : This is another program coming with an interesting approach. Another option for individuals with a strong STEM and programming background who want to make a move into Data Science


      • Notes : It enrolls the first cohort Summer 2015. They also offer a money back guarantee and will refund up to half of the fees paid if you're unable to find a job within 3 months. This speaks to the fact that they have a vested interest in their students' success. The curriculum also seems to focus on building the practical skills needed to both land a role and continue to grow as a Data Scientist.
      • Location : Silicon Valley, CA 
      • Requirement : Looks like they're looking for people with a STEM background (advanced degrees preferred) and programming / quantitative experience 
      • Duration : 6 weeks 

      Insight Health Data Science : This is the first significant deviation we've seen from the norm (focus - wise). This program seems to have the same structure as the other Insight programs but the focus here is solely in Healthcare and Life Sciences. It is project based self directed learning, so if you need some hand holding or you're not already very familiar with the material this may not be the program for you

      • Notes : No Fees
      • Location : Boston, MA
      • Requirement : PhD / PostDoc
      • Duration : 7 weeks 

      Startup.ML Data Science Fellowship : Startup.ML is taking an interesting approach to Data Science education. Their fellows work on real problems with established Data Science teams or on undefined startup problems.



      • Notes : No Fees. They also enrolled their first cohort in March 2015. I would imagine the typical profile here is someone that may be much further along.
      • Location : San Francisco, CA 
      • Requirement : Background in Software Engineering, Quantitative Analysis. Advanced Quantitative degrees 
      • Duration : 4 months

      ASI Data Science Fellowship : This is another program modeled after the Insight program. They pair students with an Industry partner which allows students to work on real business problems / data. They also have a modular program which allows for some customization.



      • Notes : No Fees
      • Location : London, UK
      • Requirement : PhD
      • Duration : 8 weeks

      GA Data Science Immersive : General Assembly was actually one of the first outfits to start part time Data Science classes. It looks like they've decided to also jump into the fray with a full time Data Science Immersive


      • Notes : They've been doing part time Data Science classes for at least two years already
      • Location : San Francisco, CA
      • Requirement :  Seems like they're interested in folks with quantitative backgrounds looking to transition to Data Science
      • Duration : 12 weeks 

      Catenus Science :  Catenus is also taking a very different approach here. Catenus Science is a paid apprenticeship program helping skilled Data Scientists explore opportunities at different startups / domains



      • Notes : Paid apprenticeship. You rotate through three different startups, applying your skills to month-long projects with each. The next session starts June 2016
      • Location : San Francisco, CA 
      • Requirement : Background and experience in Statistics, Machine Learning, Programming, Product Development. They're probably looking for people who are much further along.  
      • Duration : 13 weeks

      The Data Incubator : Accepts only STEM PhDs or PostDocs. The first class is starting summer 2014.

      • Notes : No Fees
      • Location : New York, NY
      • Requirement : PhD / PostDoc
      • Duration : 6 weeks 

      NYC Data Science Academy : This looks like it's also modeled after the Zipfian 12-week immersive. Another option for non-postdocs on the east coast looking to make the transition to Data Science

      • Notes : It enrolls the first cohort February 2015. Just looking at the curriculum, it appears well thought out and seems to cover a lot of breadth. They focus on R and Python and spend significant amounts of the course time covering both ecosystems. 
      • Location : Manhattan, NY 
      • Requirement : Looks like they prefer people with STEM advanced degrees or equivalent experience in a Quantitative discipline or programming 
      • Duration : 12 weeks 

      Silicon Valley Data Academy : This looks like another program modeled after the Insight program. It does look like they skew towards applicants who are much further along the skills spectrum

      • Notes : No Fees and they run both Data Science and Data Engineering programs
      • Location : Redwood City, CA
      • Requirement : Advanced Degrees / PhD / Post Docs , Extensive quantitative / engineering background 
      • Duration : 8 weeks 

      Microsoft Research Data Science Summer School  : targets upper level undergraduate students attending college in the New York area. Program instructors are research scientists from Microsoft Research
      • Notes : Each student receives a stipend and a laptop
      • Location : New York, NY 
      • Requirement :  Upper-level undergraduate students interested in continuing to graduate school in computer science or a related field, or breaking into Data Science
      • Duration : 8 weeks 

      Part Time
      • General Assembly - Data Science : San Francisco / New York. Part time program over 11 weeks (2 evenings a week) 
      • Hackbright - Data Science : San Francisco. Full Stack Data Science class over one weekend
      • District Data Labs : Washington DC.  Data workshops and project based courses on weekends
      • Persontyle : London, UK. Offering R based Data Science short classes
      • Data Science Dojo : Silicon Valley, CA /  Seattle, WA / Austin, TX. Offering data science talks, tutorials and hands on workshops and are looking to build a data science community
      • AmpCamp : This is run by UC Berkeley AMPLab. Over two days, attendees learn how to solve big data problems using tools from the Berkeley Data Analytics Stack. The event is also live streamed and archived on YouTube
      • NextML

      These bootcamps are popping up and thriving because there is currently an imbalance between demand and supply of Data Science talent, and the acceptance rates at some of the full time bootcamps are anywhere from 1 in 20 to 1 in 40

      P.S. : I need to stress that with any of the programs listed above, you need to do your due diligence and ask the tough questions to find out if it's a good fit for you. You probably want to be on the lookout for programs that are not transparent about their placement.


      Update 1 - 05/14  : Added the new Zipfian programs, Persontyle
      Update 2 - 07/14 :  Added Metis, Data Science Europe,  Science to Data Science
      Update 3 - 08/14 :  Added Data Science Dojo
      Update 4 - 10/14 :  Added AMPLab
      Update 5 - 11/14 :  Added Coursera/UIUC, Udacity Data Analyst Nanodegree, Thinkful, DataInquest
      Update 6 - 12/14 :  Added NYC Data Science Academy
      Update 7 - 01/15 :  Added Next.ML, Bitbootcamp, DataQuest  
      Update 8 - 04/15 :  Added Praxis Data Science, Insight Health Data Science
      Update 9 - 05/15 :  Added Startup.ML Fellowship, ASI Fellowship
      Update 10 - 09/15 : Added Silicon Valley Data Academy
      Update 11 - 01/16 : Added GA Data Science Immersive, Level Data Analyst Bootcamp, Udacity ML Nanodegree, Leada
      Update 12 - 05/16 : Added Catenus Science 

      Week 10 : Zipfian Academy - Closing the loop ... rinse..repeat

      I continued working on my personal project and was glad my data ingestion and aggregation pipeline was built and optimized.

      Analysis : Now that I had most of the data I needed, the next step was trying to close the loop as soon as possible, get some predictions for each Senator and then iterate. One challenge was trying to find signals that indicated uniqueness just from voting patterns and the content of the bills. As part of my analysis, I used techniques like MDS, clustering and NLP to extract salient features from my initial dataset. My analysis showed that over the past 3.5 years, Democrats are more predictable and more alike than Republicans based on their voting patterns alone.
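      The voting-pattern similarity idea reduces to something very simple; here is a toy sketch in plain Python (the senators and votes are invented):

```python
# Toy roll-call data: 1 = yea, 0 = nay (senators and votes are invented)
votes = {
    "Senator A": [1, 1, 0, 1, 0],
    "Senator B": [1, 1, 0, 1, 1],
    "Senator C": [0, 0, 1, 0, 1],
}

def agreement(a, b):
    """Fraction of roll calls on which two senators voted the same way."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Pairwise agreement scores across all senator pairs
pairs = {}
names = list(votes)
for i, n1 in enumerate(names):
    for n2 in names[i + 1:]:
        pairs[(n1, n2)] = agreement(votes[n1], votes[n2])

most_alike = max(pairs, key=pairs.get)
```

An agreement (or distance) matrix like this is exactly what MDS and clustering algorithms consume to place senators in a low-dimensional space.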

      Modeling : I started off with a naive bag-of-words model and got an average prediction accuracy in the low 60s. I went back and did some chi-squared feature selection, natural language processing (tf-idf, n-grams, stop-words, stemming, binning, lemmatization, etc.), grid search and cross-validation on a pipeline of models (Logistic Regression, Random Forest, SVM, AdaBoost, Naive Bayes, kNN) and added some social data from Wikipedia and Twitter. This improved my average prediction accuracy to the high 60s. Moving forward, there's still a lot of ground to cover here; I can probably get this to the low 80s on average prediction accuracy across all the Senators in Congress. The biggest takeaway here is to spend time, and lots of it, understanding your dataset, crafting better features and adding external data that gives additional insight or increases the richness of your data. The modeling part can be automated, but your models can only be as good as the data you feed them.
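      For anyone curious what that kind of pipeline looks like in code, here is a minimal scikit-learn sketch of the tf-idf + chi-squared selection + grid search portion. The bill snippets, labels and parameter grid are all invented stand-ins, and the real project tried several more models:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Invented stand-ins for bill text and one senator's yea (1) / nay (0) votes
bills = [
    "a bill to increase funding for public schools",
    "a bill to reduce corporate tax rates",
    "a bill to expand health insurance coverage",
    "a bill to cut spending on federal programs",
    "a bill to raise the minimum wage",
    "a bill to deregulate the banking industry",
] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("select", SelectKBest(chi2, k=10)),   # keep the 10 most predictive terms
    ("clf", LogisticRegression()),
])

# Cross-validated grid search over the regularization strength
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(bills, labels)
```

Swapping in Random Forest, SVM or the other models mentioned above is just a matter of changing the final pipeline step.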

      At this point, we're all seeing the light at the end of the tunnel. I gave a top level overview of my project. I'm working on putting up a Github repo with a more in-depth version.

      Highlights from the week:
      • We had a guest lecture from @WibiData on building real-time recommendation engines at scale with kiji. The kiji platform seems pretty mature and has support for quite a few languages and connectors to several Big Data frameworks. Evaluating recommender engines has always been a problem. One approach is to perform validation on a hold out sample of your data.
      • We also had another pretty interesting guest lecture from @maebert. They've built an automatic journaling tool: a data product built out of all the ambient data (passively generated data) we produce, triangulating your position using GPS signals and cell phone towers and using those data points to tell a story about you. Their pipeline looks something like this: Signals -> Data -> Information -> Knowledge -> Stories. It's actually quite cool how patterns start to emerge when you look at aggregate data. I guess we all know a little something about that from the "revelations" that happened last summer. They utilize techniques like LSA / LDA / SVD to extract concepts and their weights, expectation maximization (Gaussian mean shift) and some NLP. They try to see if the concepts change over time and also enrich their datasets using external feeds for weather data, events data, ticketing, etc
      • We had breakouts on presentations. We worked on our projects for two weeks and trying to bottle all that work into a three minute presentation won't do it justice. So you'd want to answer the following questions to give the audience enough to spark some interest - What?, Why?, How?, So What?, Next Steps? 
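      The hold-out evaluation idea from the recommender lecture can be sketched in a few lines; here is a toy popularity recommender in plain Python (the users, items and choice of hit-rate metric are all invented for illustration):

```python
import random

# Invented toy ratings: user -> set of liked items
likes = {
    "u1": {"a", "b", "c", "d"},
    "u2": {"b", "c", "e", "f"},
    "u3": {"a", "c", "e", "g"},
}

random.seed(0)

# Hold out one liked item per user for evaluation
train, held_out = {}, {}
for user, items in likes.items():
    held = random.choice(sorted(items))
    held_out[user] = held
    train[user] = items - {held}

# "Recommender": rank items by overall popularity in the training data
popularity = {}
for items in train.values():
    for item in items:
        popularity[item] = popularity.get(item, 0) + 1

def recommend(user, k=3):
    """Top-k most popular items the user hasn't already liked."""
    candidates = [i for i in popularity if i not in train[user]]
    return sorted(candidates, key=popularity.get, reverse=True)[:k]

# Hit rate: how often the held-out item appears in the top-k list
hits = sum(held_out[u] in recommend(u) for u in likes)
hit_rate = hits / len(likes)
```

Real systems replace the popularity ranking with collaborative filtering, but the evaluation loop (hide some interactions, recommend, check hits) stays the same.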

      Monday, March 31, 2014

      Some more interesting links...

      An awesome list of April Fools gadgets from various companies. Maybe someone should bring these products to life.. that would be pretty epic

      Another Python vs R  post

      Speeding up python


      Friday, March 28, 2014

      Week 9 : Zipfian Academy - Personal projects

      This is a little late, so I'll try and make this quick. Personal projects began this week and we'll be working on them for another week. My project is focused on modeling Senators' past voting patterns and using that to predict how they'll vote on future legislation and whether bills pass or not.

      Data Acquisition : I initially planned to source my data using APIs from The Sunlight Foundation and VoteSmart but quickly realized things might take much longer with the APIs, since I needed several different datasets and also needed a way to aggregate all the data. I decided it would be more efficient to go straight to the source: the US Senate website. Setting up and debugging my data ingestion pipeline took another two days, and by the end of the week I had all the data I needed: scraped, cleaned and packaged nicely in a database and several Python pickle objects.
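      The scrape-and-clean step can be illustrated with Python's standard-library parser; the HTML snippet and class names below are invented stand-ins, not the Senate site's actual markup:

```python
from html.parser import HTMLParser

# Invented stand-in for a roll-call page; the real Senate markup differs
PAGE = """
<table>
  <tr><td class="member">Senator A</td><td class="vote">Yea</td></tr>
  <tr><td class="member">Senator B</td><td class="vote">Nay</td></tr>
</table>
"""

class RollCallParser(HTMLParser):
    """Collect (senator, vote) pairs from 'member'/'vote' table cells."""
    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None
        self._current = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "td" and cls in ("member", "vote"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:
                self.records.append((self._current["member"],
                                     self._current["vote"]))
                self._current = {}

parser = RollCallParser()
parser.feed(PAGE)
```

In practice I'd reach for BeautifulSoup on the fetched pages, but the idea is the same: parse, extract the fields you care about, then persist them.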

      Data Transformation : Getting the data is one thing; transforming it to get it ready for analysis and modeling is another. Most Data Scientists tend to spend a lot of project time cleaning, aggregating and transforming data.

      Highlights from the week:
      • Got a chance to attend a meetup organized by BaRUG (Bay Area R Users Group). There was a talk from the author of the caret package (this is kind of the R version of scikit-learn) and another from the Human Rights Data Analysis Group - they use R to build statistical models to work on human rights projects across the globe.
      • We had a guest talk from a former physicist who is now a Data Scientist @WalmartLabs. He works with a group that deals with algorithmic business optimization. The talk was actually quite insightful as he touched on some interesting pain points.."reconciling technical and business needs "..."The simpler the model the better"
      • We also had another guest lecture on visualization. The speaker also worked on this awesome visualization of BART employee salaries 
      • Several of us attended a D3 workshop organized by the VUDlab at UC Berkeley

      Sunday, March 23, 2014

      A few interesting links - Wolfram Language, CLT visualized, Google graveyard .....

      Wolfram language looks pretty impressive. Definitely need to give it a try

      Central Limit Theorem visualized, Another awesome D3 visualization

      Data Elite : This is the YCombinator for Big Data startups

      Metacademy - machine learning focused search engine

      Nate Silver Interview : pretty insightful interview from Nate Silver

      Deep Belief Networks in the browser. This is actually pretty cool. It would be nice if there was an API around this

      explainshell helps you understand all those shell scripts you come across

      Google Graveyard of products : many a Google product has graced us in the past decade. Hoping Google Glass and Google Driverless Cars don't end up in the graveyard. Hmm.. looks like someone already created a spot for Google Glass

      Sergey Brin's old grad school resume from almost 2 decades ago... Bet you he didn't know he'd become one of the richest people in the world today

      Monday, March 17, 2014

      Week 8 : Zipfian Academy - Assessment and Review

      This is the last week for the structured part of the curriculum....moving forward, it's projects, hiring day and interviews.

      We started the week by reviewing material from the first three weeks : statistics, distributions, hypothesis testing, Bayesian A/B testing and different distance metrics (Jaccard, Euclidean, Hamming, cosine), and then moved on to reviewing material from the next four weeks : web scraping with BeautifulSoup and Naive Bayes. We had a breakout on advanced web scraping with Scrapy and mechanize. You reach for these tools when you have complex scraping requirements like JavaScript-rendered content, pagination, etc. Tools like kimono and import.io were also introduced. These tools can be a lifesaver, especially when you're working on a tight schedule.

      By midweek, we reviewed some of the material we covered earlier : more Naive Bayes, regression and outliers. The next day, we worked on an assessment. It was an NLP classification problem from one of our partner companies. The dataset was fairly small and the problem was well defined. Things got a bit interesting when I tried doing "extensive" grid search and cross validation on a pipeline of different models. It took more than 12 hours to go through one model... this is where you either push things to a beefy machine on AWS or perform a randomized grid search, which samples a subset of the hyperparameter combinations instead of trying them all. We also reviewed sample interview problems.
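      The cost difference between exhaustive and randomized search is easy to see in a sketch. The grid below is purely illustrative (parameter names and values are made up), and the randomized version is a bare-bones stand-in for what scikit-learn's RandomizedSearchCV does for you:

```python
import itertools
import random

# Hypothetical hyperparameter grid (names and values are for illustration only)
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["linear", "rbf"],
}

def exhaustive_search(grid):
    """Every combination: cost multiplies with each parameter you add."""
    keys = sorted(grid)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(grid[k] for k in keys))]

def randomized_search(grid, n_iter, seed=0):
    """Sample a fixed budget of random combinations instead."""
    rng = random.Random(seed)
    keys = sorted(grid)
    return [{k: rng.choice(grid[k]) for k in keys} for _ in range(n_iter)]

full = exhaustive_search(param_grid)       # 5 * 4 * 2 = 40 models to fit
sample = randomized_search(param_grid, 8)  # fixed budget of 8, whatever the grid size
```

      Each sampled dict would then be handed to your model and cross-validated; you trade a guarantee of finding the best grid point for a fixed, predictable runtime.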

      We finished off the week with another assessment. This was more involved and had a few parts to it : some data wrangling of machine generated log data for a video content site,  classification, regression, clustering and building a recommendation engine. It really feels like things are winding down as we move to personal projects next week.

      Highlights from the week:
      • We had a guest lecture from @enjalot. He's working on the bleeding edge of data visualization. Some of the things he's worked on include this visualization of BART employee salaries that made the rounds during the BART strike

      Friday, March 14, 2014

      In honor of Pi Day : Estimating the value of pi via Monte Carlo simulation

      In honor of $\pi$ day, I'll run you through calculating the numerical value of $\pi$ using a method called Monte Carlo simulation. It is basically a type of simulation that draws a large number of random samples from a distribution. It is used widely to solve many types of problems, including those that don't have closed-form solutions: estimating numerical integrals, sensitivity analysis, Bayesian inference, predicting election results, stock price movements and the list goes on ...

      In our scenario, we want to calculate the value of $\pi$. To do this, we'll be throwing darts at a square dart board (dimensions : 1 by 1) with a quadrant (radius : 1) inscribed in it. After throwing a bunch of darts at the board, we'll find the ratio of the number of darts that end up inside the quadrant to the total number of darts we threw, and then multiply that number by 4 to get an estimate for the value of $\pi$.

      Let's work through the math:

      $\frac{A_{quadrant}}{A_{square}} = \frac{\frac{1}{4}\pi r^{2}}{(1)^2} = \frac{\frac{1}{4}\pi (1)^{2}}{(1)^2} = \frac{\pi}{4} \approx \frac{N_{hits}}{N_{trials}} \implies \pi \approx 4 \times \frac{N_{hits}}{N_{trials}}$

      In our Monte Carlo simulation, we'll be sampling random points onto our 1 by 1 space and comparing the number of points that end up in the quadrant to the total number of points.
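      A minimal version of the simulation looks like this (a sketch of the idea; the fixed seed is just so the run is reproducible):

```python
import random

def estimate_pi(n_trials, seed=42):
    """Throw n_trials random darts at the unit square and count how many
    land inside the quarter circle of radius 1 centered at the origin."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:   # dart landed inside the quadrant
            hits += 1
    return 4.0 * hits / n_trials

for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n))
```

      The error of the estimate shrinks roughly like $1/\sqrt{N_{trials}}$, which is why you need a lot of darts for each extra digit of accuracy.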



      From the code sample above and my CLI output, we see that as we increase the number of trials, our estimated value of $\pi$ gets closer to the real value. And if you run enough trials you will approach steady state (the true value). Another version of the code sample runs a lot of trials, so you can visually see what's happening to the estimated value of $\pi$

      See the graph below. The peaks occur between 3.140 and 3.144, which tells us that the true value of $\pi$ lies somewhere in that range


      Monday, March 10, 2014

      Week 7 : Zipfian Academy - Advanced Machine Learning and Deep Learning

      The week started with a detour into more advanced machine learning algorithms. We covered Logistic Regression, SVM, Naive Bayes and $k$-Nearest Neighbors. We compared these algorithms on several datasets to see in which situations one would perform better than another. Before now, I would usually pick my favorite algorithms and apply them to a dataset; that's the wrong way to approach things. You need to be more strategic in choosing which algorithm you use (use the right tool for the job). Machine learning algorithms are broadly classified as either generative or discriminative, with MCMC-based generative models and Neural Networks (discriminative) sitting at opposite ends of that spectrum.

      The next day, we moved on to Decision Trees, Random Forests and ensembles. We used BigML to visualize decision trees, and I must say they have one of the best visualizations for Decision Trees; it comes in pretty handy when you're just doing EDA. Building Decision Trees can be slow and prone to overfitting, but Random Forests solve a lot of these issues, and the modeling process can be parallelized. They also tend to give you models with low bias and low variance.
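      The variance-reduction trick behind Random Forests, training many trees on bootstrap samples and taking a majority vote, can be sketched without any libraries. The depth-1 "stump" below is a deliberately tiny stand-in for a real decision tree, and the dataset is a made-up toy:

```python
import random
from collections import Counter

def train_stump(data):
    """A depth-1 'tree': pick the threshold on x with the fewest errors.
    Stands in for a full decision tree to keep the sketch short."""
    best_err, best_model = len(data) + 1, None
    for t, _ in data:                       # candidate thresholds from the data
        for left_label in (0, 1):
            right_label = 1 - left_label
            err = sum(1 for x, y in data
                      if (y != left_label if x <= t else y != right_label))
            if err < best_err:
                best_err, best_model = err, (t, left_label, right_label)
    return best_model

def predict_stump(model, x):
    t, left, right = model
    return left if x <= t else right

def bagged_ensemble(data, n_trees=25, seed=0):
    """Bootstrap-sample the data and train one stump per sample."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict_ensemble(models, x):
    """Majority vote across the ensemble."""
    return Counter(predict_stump(m, x) for m in models).most_common(1)[0][0]

# Toy 1-D dataset: label 1 when x > 5, with one deliberately noisy point
data = [(x, int(x > 5)) for x in range(11)]
data[3] = (3, 1)  # label noise

forest = bagged_ensemble(data)
```

      A single stump can get pulled around by the noisy point, but the vote across 25 bootstrapped stumps smooths that out; that averaging is the variance reduction, and since each tree trains independently the whole thing parallelizes trivially.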

      By mid-week, we delved into Deep Learning and built a Deep Belief Network using this library with several hidden layers (stacked Restricted Boltzmann Machines) to classify digits from the popular MNIST dataset. We had a feed-forward setup with no back propagation. Neural Networks have been around for a while but they were put back on the map with the advent of Deep Learning about a decade ago. In academic circles, most of the hottest Deep Learning research is going on at places like NYU/Facebook (Yann LeCun), Toronto/Google (Geoff Hinton), Montreal (Yoshua Bengio) and Stanford/Google (Andrew Ng) (..of course, not listed in order)

      Towards the end of the week, we took another detour into time series analysis and worked on some trend and seasonality analysis using pandas. Friday was more of a catch-up day. There were no official sprints; we worked on project proposals and had Deep Learning and git breakouts. We're slowly nearing the end of the structured curriculum and everyone is easing into project mode. I'll say there were a lot of 'aha' moments for me this week.

      Highlights of the week:
      • We had a guest lecture from Allen on multi-layer perceptrons. He talked about some of the interesting research he worked on at Nike and how they used Neural Networks and other machine learning tools to design better footwear.
      • Chief Data Scientist  at @Personagraph  gave an interesting lecture on how they're using machine learning
      • We closed out the busy week with a Mentor Mixer. A lot of industry practitioners attended. The goal was to match current students with practicing Data Scientists. There were a variety of mentors who showed up, mostly Data Scientists, some Chief Data Scientists and a two-time Kaggle Competition Winner (yes, they are a rare breed but they do exist)

      Thursday, March 6, 2014

      Slides from in-database analytics talk #MADlib

      Here are my slides from a talk I gave on in-database analytics systems and the MADlib library

      Saturday, March 1, 2014

      Week 6 : Zipfian Academy - The Elephant, the Pig, the Bee and other stories

      So this week was one heck of a week. We started Data Engineering and Data at Scale, had several guest lectures and played with some really cool tools. I got the RAM kit I ordered over the weekend, so it was nice starting the week running on 16gb of RAM. You could easily wrangle a few gigs of data locally without resorting to AWS or mmap tricks.

      Okay, this post is going to be a long one. The events here are in chronological order.

      The week started off with a quick assessment. We are getting to the end of the structured part of the curriculum, so this really helped pinpoint some areas and topics to revisit. The day officially started with an introduction to Hadoop, Hadoop Streaming and the map-reduce framework via mrjob. We wrote map-reduce jobs to do word counts. The MapReduce framework and Hadoop made their way into the mainstream because people realized they could now store massive amounts of data (terabytes -> petabytes) on a bunch of commodity nodes and also perform computation on those nodes. Of course Google had a hand in this after publishing their celebrated MapReduce paper back in 2004. They seem to be years ahead of everyone else.
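      The word-count exercise really boils down to the two small functions you hand to the framework. Here's a pure-Python sketch of the pattern, with a `run_job` helper standing in for the shuffle-and-sort step Hadoop does between map and reduce (mrjob's actual class-based API differs, so treat this as the idea, not the library):

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every word in a line of input."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Sum all the counts emitted for a single word."""
    yield word, sum(counts)

def run_job(lines):
    """Simulate the shuffle: group mapper output by key, then reduce each group."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            shuffled[key].append(value)
    return dict(kv for key in sorted(shuffled)
                for kv in reducer(key, shuffled[key]))

result = run_job(["the quick brown fox", "the lazy dog", "the fox"])
print(result)
```

      The point of the framework is that `mapper` and `reducer` see only one record or one key at a time, so the same two functions run unchanged whether the input is three lines or three petabytes spread across a cluster.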

      The next day, we worked on a more interesting problem: using mrjob to do recommendations at scale by building a scaled-down version of LinkedIn's People You May Know. We developed a multi-step map-reduce program with mrjob to analyze a social graph of friends and then applied triadic closure to recommend new friends. So basically, if A knows B and A knows C, then to some degree B and C should also know each other.
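      On a single machine, the triadic-closure idea is just counting mutual friends over adjacency sets. The toy graph below is made up, and the real sprint spread this counting across multiple map-reduce steps, but the logic is the same:

```python
from collections import Counter

def recommend(friends, user, top_n=3):
    """Rank non-friends of `user` by their number of mutual friends."""
    mutual = Counter()
    for friend in friends[user]:
        for fof in friends[friend]:                      # friends-of-friends
            if fof != user and fof not in friends[user]:
                mutual[fof] += 1                         # one shared friend
    return [person for person, _ in mutual.most_common(top_n)]

# Toy social graph as adjacency sets
friends = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D", "E"},
    "D": {"B", "C"},
    "E": {"C"},
}

recs = recommend(friends, "A")  # D is reachable through both B and C
```

      D tops A's recommendations because two of A's friends already know D, which is exactly the "if A knows B and A knows C..." rule applied in bulk.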

      Next day, we moved on to using Hive to build a contextualized ads server on AWS. Tools like Hive (SQL-like queries) and Pig (a higher-level dataflow language) are used to query data on Hadoop / distributed file systems, and both compile down to map-reduce jobs. They were developed to cater to an audience already comfortable with SQL-style analysis who were not Java experts. We also discussed some of the next generation frameworks like Spark (in-memory), Impala (in-memory), Storm (real-time), Giraph (graphs), Flume (streaming log collection), etc.

      After the first few days working with data at scale, we moved in a different direction: building data products with Flask and yHat. Using these tools basically opens the flood gates for data products. If you can build a predictive model in Python or R, you can push it to production or to a client with minimal plumbing.

      We wrapped up the week discussing project proposals, doing reviews, a breakout on Deep Learning and catching up on sprints. I ended up exploring a bunch of APIs from some non-profits. I was impressed some of them would build APIs to let people access their data and other government data.

      Highlights of the week:
      • A lot happened this week, we started out with a guest lecture with Dennis from @IdleGames. He gave his point of view from being in a Data Scientist role for the past year. A few quotes from the talk : "if you can't explain your analysis, it didn't happen!!", "Remove yourself as the bottleneck and automate everything". He's working on some interesting projects like churn prediction, integrating offline analysis into live games, collusion detection, customer long term value (LTV) and the uncertainty associated with that and also building out data products to democratize insights.
      • I had a chance to attend a meetup hosted by Microsoft. There were two talks at this meetup: setting up and running Hadoop on Cloudera, and a more hands-on intro to using F# for machine learning. It's not every day the words 'open source' and 'Microsoft' show up in the same sentence, but I was really impressed with how expressive the language was. And did I say it was open source and multi-platform? There was a really nice demo and I liked the F# type providers. One of the great things about the Bay Area is that on any given day, there are at least a handful of pretty good tech meetups with very interesting topics and speakers.
      • Tuesday, we had another guest lecture with Tyler from @radius. He's a Data Engineer who uses mrjob daily to write map-reduce jobs. He went through some detailed examples and sample code. It's really cool how mrjob handles a lot of plumbing around writing map-reduce jobs. You can run your code locally and also push it up to an Amazon Elastic MapReduce cluster with minimal changes. 
      • Next day, we attended another meetup hosted at YelpHQ. There were two very good presentations: @bigmlcom on exploratory data analysis / modeling and @YhatHQ on productionizing your Python and R models. BigML has some cool features for getting a feel for your data and some pretty cool model visualizations. They focus mainly on decision trees and random forest ensembles. BigML's tool also lets you download your models either in pseudo-code or in a variety of languages like Python, R, Java, etc. yHat is the glue between the analytics stack and the engineering stack. It lets Data Scientists be more productive by pushing their models to production via a REST API with a JSON endpoint. I actually wrote about both startups a few months ago here
      • The co-founders of @YhatHQ dropped by the 'Data Warehouse' and gave a nice demo of a beer recommender service built with yHat and Flask. Greg was very helpful in walking us through some of the issues we had setting things up.
      • We had a special guest this week, @nathanmarz. He's a former Twitter Engineer and creator of Storm and Cascalog. His talk was half technical (overview of Clojure) and half philosophical (thoughts on where things are headed in the Big Data ecosystem). He reiterated the concept of doing your data analysis by building abstractions. I have to say, it is truly inspiring watching a Master at work.
      • Our last guest lecture of the week was by Nick from @DominoDataLab. They've built a Heroku for Data Science. Domino takes care of technical infrastructure problems like spinning up clusters and lets you focus on Data Science. Their tool provides version control for your models / data, group collaboration, cron-style job scheduling, etc

      Sunday, February 23, 2014

      Week 5 : Zipfian Academy - Graphs and Community Detection

      The update for last week will be short and quick. Doing these blog posts is getting much harder.

      We started the week looking at unsupervised learning techniques like k-means and hierarchical clustering. We also visited dimensionality reduction techniques like SVD and NMF. By mid-week, we switched gears to graph analysis and covered, in no particular order, BFS, DFS, A*, Dijkstra and community detection in graph networks
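      Of the graph sprints above, Dijkstra is a nice one to keep in your back pocket: with a priority queue it fits in about a dozen lines. A stdlib sketch on a small made-up graph:

```python
import heapq

def dijkstra(graph, start):
    """Shortest-path distances from `start` on a weighted directed graph
    given as {node: [(neighbor, weight), ...]}."""
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return dist

graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
dist = dijkstra(graph, "A")
print(dist)  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

      The lazy-deletion trick (pushing duplicates and skipping stale pops) keeps the code short at the cost of a slightly larger queue; it's the usual Python idiom since `heapq` has no decrease-key.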

      Take aways from the week:
      • We had several guest lectures this week. @kanjoya is working on the cutting edge of Natural Language Processing. They help their clients derive actionable intelligence from emotions and intuition. The speaker discussed the general NLP landscape : tools and techniques. I found it interesting that some of their training data comes from The Experience Project
      • @geli gave an interesting talk. They've basically built an OS for energy systems and hope to revolutionize the energy management space 
      • @thomaslevine's talk was on open data initiatives around the country. Open Data is one of those things cities like to talk about, but very few of them are doing it well
      • Things were switched around this week. We ended the week working on a dataset from one of the partner companies. The dataset recorded mobile ads served to users at various locations; we were asked to do some exploration and find the best locations to serve ads. The dataset had a couple million records. Trying to wrangle gigabyte-sized data on just 4gb of RAM is definitely not fun. I ordered a 16gb RAM kit and should get it by this weekend. If you are thinking of enrolling for the course, you should shoot for at least 8gb of RAM.

      Saturday, February 15, 2014

      Week 4 : Zipfian Academy - Oh SQL, Oh SQL... MySQL and some NLP too

      So things were totally ramped up this week. We started out by scraping, parsing and cleaning data from the NYTimes API, then jsonified and stored the data in MongoDB. The next day we ported the same dataset to a few SQL tables and implemented the Naive Bayes algorithm in SQL to classify which labels an article would fall under. We continued with some diagnostics: confusion matrix, confusion tables, false alarm rate, hit rate, precision, recall, ROC curves, etc. Other topics covered include NLTK, tokenization, TF-IDF, n-grams, regular expressions, and feature selection using Chi-Squared and Mutual Information. We ended the week by working on another past Kaggle Competition - StumbleUpon Evergreen Classification Challenge
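      At its core, the Naive Bayes classifier from that sprint is just counting plus a sum of log probabilities, which is exactly why it translates to SQL GROUP BYs so naturally. A pure-Python sketch with add-one (Laplace) smoothing, on a made-up toy corpus:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (word_list, label). Returns the counts prediction needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(model, words):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior + sum of per-word log likelihoods with add-one smoothing
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ("senate passes budget bill".split(), "politics"),
    ("vote on new legislation".split(), "politics"),
    ("team wins championship game".split(), "sports"),
    ("star player scores twice".split(), "sports"),
]
model = train(docs)
label = predict(model, "senate vote".split())
```

      The smoothing term is what keeps a single unseen word from zeroing out a class, and working in log space avoids underflow when documents get long.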

      We are at the halfway point for the structured part of the class. Just in case you're thinking of doing this, my schedule these days is about 12 - 15 hrs / day during the week doing daily sprints (data scrubbing, transformation / machine learning challenges), reading data science materials and attending lectures. Over the weekend, I'd say about 10 hrs / day closing the loop on a few of the sprints from the current week and doing more data science readings for the following week. You basically live and breathe data science... all day long... all week long

      Highlights from the week :
      • We had two guest lectures this week. They were on Naive Bayes and feature extraction in NLP. Zipfian also added a guest lecturer to their roster. The new instructor is a Deep Learning expert and I'm really excited to explore working on new datasets with Neural Networks.
      • Implementing things from first principles gives you a better understanding of how some of these algorithms work and what may be going on under the hood when they fail.
      • My team also took the top spot in the Kaggle competition for the second week. The problem we worked on was a classification problem using AUC (Area Under the Curve) as the evaluation metric. We achieved an AUC of $\approx 0.8895$  which is about 0.008 off the leading Kaggle submission on the public leaderboard
      • Cross-validating on your training set is always a good idea
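      AUC, the metric from that competition, has a handy interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. That pairwise definition can be computed directly (fine for small data; real libraries sort once instead of comparing all pairs). The labels and scores below are made up:

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    with ties counting as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
score = auc(labels, scores)  # 8 of the 9 pos/neg pairs are ranked correctly
```

      Note that AUC only cares about the ranking of the scores, not their absolute values, which is why it's a popular evaluation metric for Kaggle-style classification leaderboards.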

      Thursday, February 13, 2014

      Some Things you should be reading

      An Intro to Statistical Learning has become my favorite statistical learning book. It was written by two masters of the field and their collaborators. You don't often see such brilliance and clarity in a book.

      A Few Useful Things to Know about Machine Learning  is a really good paper you should look at if you're interested in the field of Data Science 

      Some Practical Machine Learning tricks is a summary of tips and tricks to always keep in mind when working on machine learning problems

      A collection of videos on different Machine Learning topics from Caltech