Thursday, December 31, 2015

Year in Review

It's been quite a year.

It appears we're moving closer to the Hardware + Software + AI singularity and all the stuff that comes with that... and it's kind of scary.

Here again is Jeff Leek's Non-comprehensive list of awesome things other people did in 2015.


Some of the majors like Google, Facebook, Baidu and Microsoft open sourced some of their internal Deep Learning tools / frameworks. Most of the value coming from these tools will be interesting and useful products built with / on them.

Also, this one literally sent chills down my spine - Landing of Falcon 9 first stage. I guess we're one step closer to Mars and becoming a multiplanetary species.

I didn't put out much content this year. I'm hoping to do more writing next year.

Stay tuned...

Wednesday, August 5, 2015

Some more interesting links-6, Movie Math, Random Forests, TED, Unicorns


Remember that Math problem Matt Damon solved in Good Will Hunting? ... It turns out this problem is actually accessible to us mere mortals. Do check out this awesome video explanation

I'm a fan of TEDtalks. Here is a nice playlist of interesting Data Related TEDtalks

Very detailed coverage of feature transformations

Random Forest workhorse : [Paper 1] [Paper 2]

Nice coverage on python

Deep Learning libraries by language

Sam Altman's Startup Class

YC Open Office Hours, Fellowship [1] [2], Research and Blog

Just in case anyone was keeping score, TC's Unicorn Dashboard

Sunday, July 19, 2015

Book Review : Elon Musk - Tesla, SpaceX and the Quest for a Fantastic Future

This book gives you a glimpse into the man, the machines and companies he has built, and the trials you'll face as an entrepreneur. While reading you'll experience occasional bursts of laughter. This is quite an interesting read.

This is the second biography I've bought and read (the first was of Robert Oppenheimer, one of the principal architects of the Manhattan Project).

Elon Musk was born in South Africa and moved to Canada when he was 17 to attend college. He eventually transferred to the University of Pennsylvania to continue his studies.

I keep wondering what would have happened had he not been able to make it to the US. The companies he founded or has been involved with at a high level (PayPal, SolarCity, Tesla, SpaceX), which collectively employ tens of thousands of people, might never have happened. I guess we'll never know.

This guy is transforming three multi-billion dollar industries and their derivatives at the same time. Truly a modern day Renaissance man.

Just imagine.. in our lifetimes (in 20 or 30 years or maybe even less), humans will have boots on the ground on Mars and Elon Musk is leading the vanguard here.

On a recent visit to the Tesla Factory in Fremont, he seems to be owning the whole Iron Man / Tony Stark comparison.

Friday, May 15, 2015

Choosing a Data Science Bootcamp program? - questions to ask, things to look for and look out for

Over the past year, I have had the opportunity to speak with a lot of prospective Data Science bootcamp students sharing my pre and post bootcamp experiences and helping them put in context some of the major factors they need to consider before deciding to attend a Data Science bootcamp. This post is a summary to shed some light on some of those thoughts I've shared privately with prospective Data Science bootcamp students, things they should look for and things to look out for.

The list below may not be all-encompassing, as each prospective Data Science bootcamp student is unique in their own way and in what they hope to get out of the experience.

Do keep in mind this list was put together for those considering full time Data Science bootcamp programs.

Without further ado.. here we go

Background : Data Science is a hybrid role. Having a background with the right mix of Quantitative skills, Programming, Statistics, Math, Business Acumen, Databases, and Machine Learning would probably work in your favor. 6 or 12 weeks is a very short time to learn these things from scratch.

Also, having a good background improves your chances of getting into one of these Data Science bootcamp programs. I hear they're getting quite competitive these days.

Cohort Makeup : At most bootcamps, about half of your learning comes from lectures and interactions with the instructors, TAs and guest speakers. The other half comes from working and collaborating with your cohort mates on the course materials and projects. It is important that a cohort have people with a diversity of past educational / professional experiences. Your cohort mates will become your friends, co-workers, collaborators and maybe even co-founders.

Placement Rate : This is a really interesting one. A bootcamp with 100% placement may not always be the best choice. I've heard some bootcamps drop students who they feel may not be able to find a job and don't include them in the numbers. Prospective students have to dig deeper on the placement rates and ask the following questions:
  • Percent of students placed in actual Data Science roles
  • Percent of students placed within one month or three months of finishing the program
  • Percent of students placed through Hiring Day 
  • Percent of students placed through an introduction that the bootcamp made
  • Percent of students actually looking for a job post bootcamp
  • Median Salaries for students placed
  • Salary Range for students placed
Going through a Data Science bootcamp is definitely not a silver bullet. There are a lot of people that go through these programs and still end up with non-optimal outcomes.

Hiring Day : As far as I know, most of the Data Science bootcamps have a hiring day event where students get an opportunity to present their capstone projects to potential employers and "speed date" with those employers. Some Data Science bootcamps have exclusive hiring events with employers in their Hiring Network or guest lectures and presentations from companies that might be looking for talent.

Cost : This could be a major factor in deciding to attend a bootcamp. Data Science Bootcamp program tuition ranges from free to $16,000. There might also be other costs like room and board, incidentals, relocation and lost wages. What this mix looks like will be different for each prospective student.

One way to look at cost is that this is a short-term investment for a chance to break into a new career.

A good amount of the material you need to learn to become a Data Scientist is free and available on the internet. Some students that get admitted to these Data Science bootcamps could have chosen to lock themselves in a room for 6 months and study all this material and then emerge having learned all the material / skills required to be able to land a job.

This is entirely possible but you lose out on all the intangibles you get from attending an in-person Data Science bootcamp - mentoring from instructors/guest lecturers, structured learning, motivation, positive reinforcement, collaborating with cohort mates, networking, getting a different view on approaching and solving problems, etc. What these intangibles are worth / could be worth down the line should be carefully evaluated and added to the cost equation.

Interview Prep / Soft skills / Business Acumen : It is important to know how much time the Data Science bootcamp spends on soft skills, interviewing and whiteboarding. Most job interviews you go to may require you to work through programming problems, communicate the results of an analysis you may have worked on to a technical / non-technical audience, work through a modeling case study, etc.

These are skills you get better at with practice. Some Data Science bootcamps weave this in as part of the curriculum so the students are more comfortable with this by the end of the program whereas others may reserve time towards the last few weeks of the program to work on these.

Curriculum : It is difficult to learn everything you need to become a Data Scientist in 6 or 12 weeks. You want to look for a program that will give you enough breadth and depth and a good enough foundation to start and build a career in this field.

Location : The majority of the Data Science bootcamp programs are based either in the Bay Area, New York or scattered through Europe (London, Dublin, Berlin) and most graduates end up working in those places. I've seen some set up shop in other tech metros like Boston, Seattle and Denver.

Contact Alumni : There is a lot of information to be gleaned from talking to past students of Data Science bootcamp programs you're considering. You'll get a raw and unfiltered view of their experience.

Projects : You should look for programs that will enable you to work on a variety of projects with small, medium and large data. This way you'll have a broad range of experiences and a portfolio of interesting projects or analyses to talk about once you hit the interview trail.

In-Person vs Online : It is very difficult to replicate the collaborative environment of a full time in-person Data Science bootcamp in an online setting. Assuming there are no other extenuating factors, choosing a full time in-person bootcamp should be the preferred option.

Established vs New Programs : This is actually one of the most frequent questions I get. Prospective students are usually torn between going with a more established program which has gone through several cohorts and has established a track record versus a new and upcoming program which may have gone through one or two cohorts or is just getting started.

Prospective students need to evaluate the Data Science bootcamp programs they're considering on their merits and on the factors that are actually most important to the student. There are advantages to going with either a more established program or a much newer one. The prospective student needs to do some introspecting, after which the path they have to take becomes very obvious.

To keep things in context, none of these Data Science bootcamp programs existed 3 years ago.

Alumni Network : Generally, this is a perk. Going to a bootcamp with a strong alum network could sometimes make the difference. You could be exposed to opportunities and / or jobs that you may not otherwise have access to. Having access to an active and collaborative alum network is worth its weight in gold (or whatever precious metal you prefer).

Outcomes : Students going through Data Science bootcamps usually have different goals. For most, it's probably getting a Data Science / Machine Learning focused job. For others it could be gaining a skill set that'll enable them to work on their own ideas, break into the industry and / or move up the ladder in their current job. Whatever those goals are, going with a bootcamp that can work with you or even personalize some of the curriculum to ensure you're getting the best value for your time and money would be ideal.

As far as customizing the curriculum, some of the bootcamps with smaller cohorts will have a much easier time doing this.

These are outcome based programs so you should go with a program that'll give you the best chance of finishing with a positive outcome whatever that may be.

As with programming bootcamps, Data Science bootcamps are now becoming commoditized. Some of these Data Science bootcamps consider themselves Post Doctoral training programs while others want to own a different segment of the market.

If you're considering a Data Science bootcamp program, do go through this list and then pick the program that is the best fit and will deliver the highest delta / value for you.

Hopefully, this blog post will help start the conversation.

Saturday, March 14, 2015

In honor of Pi Day : Estimating the value of pi via Buffon's Needle

Last year we estimated the value of $\pi$ via Monte Carlo simulation. This year, we'll be revisiting the same exercise but using a different approach : Buffon's Needle. This approach is actually one of the oldest geometrical probability problems and it involves dropping needles on a lined sheet of paper and calculating the probability of the needles crossing lines in the page. This technique was first used by 18th century mathematician Georges-Louis Leclerc, Comte de Buffon.

In this scenario, we'll be dropping a bunch of randomly generated needles of length 1 on a grid with vertical lines. The spacing between the vertical lines is also of length 1. It turns out that you can estimate the value of $\pi$ by taking the ratio of the number of needles you dropped (Drops) to the number that crossed any of the vertical lines (Hits) and multiplying by twice the length of a needle. See this ipython notebook for the code used.

The following two graphs show our grid with 100 and 1000 randomly generated needles respectively



Let's work through the math:

$ 2 \times \text{needle length} \times \frac{\text{Drops}}{\text{Hits}} \approx \pi $

where the length of the needle is 1 and the spacing between the vertical grid lines is also 1.
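
Here is a short sketch of why this ratio works. It relies on the usual assumptions for this problem, which the post does not spell out: the needle's center lands uniformly between two lines and its direction is uniformly random. The symbols $l$, $t$, $x$ and $\theta$ are introduced here for illustration. Let $l$ be the needle length, $t$ the line spacing with $l \le t$, $x$ the distance from the needle's center to the nearest line, and $\theta$ the acute angle between the needle and the lines. The needle crosses a line when $x \le \frac{l}{2} \sin\theta$. Since $x$ is uniform on $[0, t/2]$ and $\theta$ is uniform on $[0, \pi/2]$,

$ P(\text{hit}) = \frac{2}{t} \cdot \frac{2}{\pi} \int_0^{\pi/2} \frac{l}{2} \sin\theta \, d\theta = \frac{2l}{\pi t} $

With $l = t = 1$ this gives $P(\text{hit}) = \frac{2}{\pi}$. Replacing $P(\text{hit})$ with the observed fraction $\frac{\text{Hits}}{\text{Drops}}$ and rearranging gives $\pi \approx 2 \times \frac{\text{Drops}}{\text{Hits}}$.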

The graphs below were generated from a few hundred trials. For each trial, we increased the number of randomly generated needles. We can see the estimated value of  $\pi$ is about 3.12 which is a bit off from the true value of 3.14. I suspect there might be something going on with how the random needle center coordinates are generated since the needle graphs above are showing some symmetry. Regardless, we are still within 1% of the true value of $\pi$.
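
If you'd like to experiment without opening the notebook, here is a minimal sketch of the simulation. It is not the notebook's code: the function name `estimate_pi_buffon`, its parameters, and the way needle directions are sampled are all assumptions made for illustration. The directions are drawn by rejection sampling inside the unit disk so that the simulation never uses the value of $\pi$ itself. The estimate is also divided by the line spacing, which changes nothing when the spacing is 1 as in this post.

```python
import numpy as np


def estimate_pi_buffon(n_drops, needle_length=1.0, spacing=1.0, seed=None):
    """Estimate pi by dropping needles on a grid of vertical lines.

    Hypothetical helper for illustration. Assumes needle_length <= spacing,
    as in the post (both equal 1).
    """
    rng = np.random.default_rng(seed)

    # x-coordinate of each needle center, measured from the grid line on its left.
    center_x = rng.uniform(0.0, spacing, size=n_drops)

    # Random needle directions, drawn without using pi: sample points in the
    # square [-1, 1]^2, keep those inside the unit disk, and normalize.
    abs_cos = np.empty(0)
    while abs_cos.size < n_drops:
        pts = rng.uniform(-1.0, 1.0, size=(2 * n_drops, 2))
        r2 = (pts ** 2).sum(axis=1)
        inside = (r2 > 0.0) & (r2 <= 1.0)
        abs_cos = np.concatenate(
            [abs_cos, np.abs(pts[inside, 0]) / np.sqrt(r2[inside])]
        )
    abs_cos = abs_cos[:n_drops]

    # Half of the needle's horizontal extent; a needle hits a line when its
    # center is closer than this to the line on its left or on its right.
    half_width = 0.5 * needle_length * abs_cos
    hits = np.count_nonzero(
        (center_x <= half_width) | (center_x >= spacing - half_width)
    )

    if hits == 0:
        raise ValueError("no needle crossed a line; drop more needles")
    return 2.0 * needle_length * n_drops / (spacing * hits)


if __name__ == "__main__":
    # Print estimates for an increasing number of dropped needles.
    for n in (100, 1_000, 100_000):
        print(n, estimate_pi_buffon(n, seed=2015))
```

Sampling the direction and the center independently and uniformly is the part worth double-checking if an estimate looks biased, since the formula depends on both assumptions.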




It's actually pretty cool to see how the value of $\pi$ sneaks out from the woodwork. There's probably a more intuitive way to explain how $\pi$ shows up in places we least expect.

For all the code used for this analysis, visit this ipython notebook



Saturday, February 28, 2015

Some more interesting links-5, YC Companies, ipython and more pandas

Comprehensive list of all YC companies and a comprehensive list of all accelerators / cohorts

Good article explaining *args and **kwargs  and generators in python

Pivot Tables in pandas and more pandas

Gallery of some of the best ipython notebooks

Interactive ipython notebooks

Monday, January 19, 2015

Slides from Getting Started with Vowpal Wabbit Talk #Vowpal Wabbit

Slides from Getting Started with Vowpal Wabbit talk