Yet Another Data Blog

Sunday, December 31, 2017

Year in Review

To say 2017 was an interesting year would be quite the understatement.

We saw the explosion of Bitcoin, blockchain and cryptocurrencies. Too bad if you didn't get in on the crypto craze before the wave.

We also witnessed the growth and maturity of several Deep Learning frameworks:

Facebook - PyTorch
Apple - CoreML
Uber - Pyro and Michelangelo
Amazon - Gluon + MXNet
Google - TensorFlow
DeepMind - Sonnet

We were wowed once again by the Model 3 launch and the Tesla Semi / Roadster v2 unveil. We also got more insight into the Model Y and the Falcon Heavy launch.

We also witnessed more products and services powered by Deep Learning / AI and more maturity in the different DL / ML frameworks out there. I really do feel that a lot of the worthwhile work in the space going forward will be in applying these tools and techniques to real problems that have a real impact on our lives, rather than things that just end up in research papers (even though that's important too), and in making these tools more accessible.

I'll leave you with a summary of the awesome things other folks did in 2017:

A Year in Computer Vision by M Tank
Deep Learning Achievements Over the past Year by Eduard Tyantov
AI and Deep Learning in 2017 – A Year in Review by WILDML
A non-comprehensive list of awesome things other people did in 2017 by Jeff Leek
Interesting ML projects from 2017
NIPS 2017 notes

Week 3 : fast.ai - 1 v2 - More CNN Internals, Structured Data

Jeremy did a more in-depth run through of the ConvNet pipeline in Excel.

Architectures that make heavy use of fully connected layers, like VGG, have a lot of weights, which means they can be prone to overfitting and can also be very slow to train.

The fast.ai library automatically switches into multi-label mode for multi-label classification problems by switching the final activation from softmax to sigmoid.
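To see why that switch matters, here's a minimal sketch in plain Python (not the fast.ai implementation). The function names and the example scores are made up for illustration. Softmax forces the outputs to compete and sum to 1, which suits "exactly one label" problems, while sigmoid scores every label independently so several labels can be "on" at once.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (single-label case)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each raw score independently into (0, 1) (multi-label case)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical raw scores for three labels on one image
logits = [2.0, 1.5, -3.0]
single_label = softmax(logits)  # probabilities compete with each other
multi_label = sigmoid(logits)   # each label gets its own yes/no probability
```

With these scores, the first two labels both end up above 0.5 under sigmoid, whereas softmax has to split a total probability of 1 between them.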

Highlights

Differential learning rates let you pass an array of learning rates, which the fast.ai library automatically applies across the layer groups of your network. The last learning rate is typically applied to your fully connected layers while the other n-1 learning rates are spread evenly over the remaining layer groups.
data.resize() reduces the input image sizes, which helps to speed up your training, especially if you're working with large images.
You can define custom metrics within the fast.ai library.
If you're using a pre-trained network, you typically want to fine-tune your fully connected layers first before you unfreeze the pre-trained weights in your convolutional layers.
learn.summary() shows you the model architecture summary.
add_datepart() pulls interesting features from time series data.
The feather format, which pandas supports, is a really efficient way to dump data to disk in binary format.
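To give a feel for the kind of features you can pull out of a date column, here's a small standard-library sketch. It is not the fast.ai add_datepart() code. The function name date_parts and the feature names are assumptions chosen for illustration.

```python
from datetime import date

def date_parts(d):
    """Expand a single date into a few simple calendar features (illustrative subset)."""
    return {
        "Year": d.year,
        "Month": d.month,
        "Week": d.isocalendar()[1],          # ISO week number
        "Day": d.day,
        "Dayofweek": d.weekday(),            # Monday = 0 ... Sunday = 6
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_start": d.day == 1,
    }

features = date_parts(date(2017, 12, 31))
```

Features like day of week or day of year let a model pick up on weekly and seasonal patterns that a raw timestamp hides.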

Some Useful links

CurlWget chrome extension
FileLink
Intuitive Understanding of ConvNets by Otavio Good
Entity Embedding of Categorical Variables

Sunday, December 17, 2017

Week 2 : fast.ai - 1 v2 - CNNs, Differential Learning Rates, Data Augmentation, State of the art Image Classifiers

Jeremy emphasized that a learning rate that is too low is a better problem to have than a learning rate that is too high. If the learning rate is too high, you'll probably overshoot the global minimum.

Following the Cyclical Learning Rate paper, we gradually increase the learning rate with each mini-batch iteration until we get to a point where the loss starts to worsen. From the graph of loss vs learning rate below, we find that the optimal learning rate is usually one or two orders of magnitude smaller than the learning rate at the minimum point on the graph (in the graph below, learning_rate = 0.01).

Courtesy of fast.ai
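Here's a toy sketch of the learning rate range test described above, run on a one-parameter quadratic loss instead of a real network. This is not the fast.ai lr_find() implementation. The function name, the stopping threshold (loss more than 4x the best seen) and the toy loss are all assumptions for illustration.

```python
def lr_range_test(grad_fn, loss_fn, w0, lr_start=1e-5, lr_end=10.0, steps=100):
    """Raise the learning rate geometrically each step and record (lr, loss)."""
    factor = (lr_end / lr_start) ** (1.0 / (steps - 1))
    w, lr = w0, lr_start
    best = float("inf")
    history = []
    for _ in range(steps):
        w = w - lr * grad_fn(w)      # one gradient descent update at the current lr
        loss = loss_fn(w)
        history.append((lr, loss))
        best = min(best, loss)
        if loss > 4 * best:          # loss has clearly started to worsen: stop
            break
        lr *= factor
    return history

# Toy problem (an assumption): loss = 0.5 * w^2, so the gradient is simply w
history = lr_range_test(grad_fn=lambda w: w, loss_fn=lambda w: 0.5 * w * w, w0=5.0)

best_lr, best_loss = min(history, key=lambda pair: pair[1])
suggested_lr = best_lr / 10.0        # one order of magnitude below the minimum
```

On a real model you would plot history and eyeball it rather than trust a single number, but the idea is the same: find where the loss bottoms out and step back by an order of magnitude or so.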

Data Augmentation

The most important strategy to improve your model is to get more data which also helps prevent overfitting. One way to do this is via data augmentation. The fast.ai library supports multiple data augmentation strategies. Some simple transforms include horizontal and vertical flips. Which transform you use depends on the orientation of your images. Data augmentation also helps us train our model to recognize images from different angles.

It's always a good idea to rerun the learning rate finder when you make significant changes to your architecture like unfreezing early layers.
Precomputed activations are the cached outputs of the earlier (frozen) layers. To take advantage of data augmentation in the fast.ai library we set learn.precompute = False so that the activations are recomputed for the augmented data.

Highlights

We typically want to decrease the learning rate (learning rate annealing / stepwise annealing) as we get closer to the global minimum. A better way to do learning rate annealing is to use a functional form like one half of a cosine curve, i.e. cosine annealing.
In the diagram below, the image on the left uses a regular learning rate schedule and converges to a minimum at the end of training, whereas the image on the right has multiple jumps. Each jump starts a new annealing cycle, and the cycle length is controlled by the cycle_len parameter in the fast.ai library. The learning rate schedule for the image on the right also uses cosine annealing, where cycle_len=1 denotes one annealing cycle per epoch. This is also described in depth in the Snapshot Ensembles paper. This strategy helps you get unstuck from local minima and gives you lots of opportunities to find the global minimum. The image on the right is using Stochastic Gradient Descent with restarts.

Courtesy of Huang et al from Snapshot Ensemble paper

Courtesy of fast.ai

When fine-tuning a pre-trained network, the early layers usually need little or no refitting since they capture general-purpose features (e.g. lines and edges), whereas the deeper (later) layers typically need to be fine-tuned on your dataset since they capture the higher-level representations. This is especially true if the images in your dataset are very different from the images the pre-trained network was trained on, e.g. training on satellite images using a pre-trained ImageNet network. We can also define an array of learning rates (differential learning rates) and use these different learning rates on different parts of the network architecture, including the unfrozen pre-trained layers.
We can also vary the length of a cycle when using SGD with restarts via the cycle_mult parameter in the fast.ai library. You can think of this as a way to adjust the explore/exploit trade-off of the model, which helps it find the global minimum faster.

Courtesy of fast.ai

We also utilized Test Time Augmentation (TTA) to improve our model further. When we make predictions on each image in the validation data set, we also take 4 random augmentations of each individual image, make label predictions on those augmentations plus the original validation image and then average the predictions. This technique typically reduces validation error by 10% - 20%.
As a general rule of thumb for differential learning rates: if your dataset is very similar to the data your pre-trained network was trained on, you'd want a 10x difference between the learning rates of successive layer groups; otherwise a 3x difference is recommended.
One way to avoid overfitting is to start your model training on smaller images over a few epochs and then switch to larger images to continue training.
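The cosine annealing with restarts schedule from the highlights above can be sketched in a few lines of plain Python. The parameter names cycle_len and cycle_mult mirror the fast.ai ones mentioned above, but the function itself (sgdr_schedule) and the iters_per_epoch argument are assumptions for illustration, not the library's code.

```python
import math

def sgdr_schedule(max_lr, n_cycles, cycle_len=1, cycle_mult=1,
                  iters_per_epoch=100, min_lr=0.0):
    """Per-iteration learning rates: half a cosine per cycle, then a restart."""
    lrs = []
    epochs = cycle_len
    for _ in range(n_cycles):
        n_iters = epochs * iters_per_epoch
        for i in range(n_iters):
            cos = math.cos(math.pi * i / n_iters)   # falls from 1 towards -1 over the cycle
            lrs.append(min_lr + 0.5 * (max_lr - min_lr) * (1 + cos))
        epochs *= cycle_mult                        # the next cycle is cycle_mult times longer
    return lrs

# Three cycles lasting 1, 2 and 4 epochs (10 iterations per epoch to keep it small)
lrs = sgdr_schedule(max_lr=0.01, n_cycles=3, cycle_len=1, cycle_mult=2,
                    iters_per_epoch=10)
```

Each cycle starts back at max_lr (the "restart") and decays smoothly towards min_lr. With cycle_mult=2 the later cycles are longer, so the model explores early on and then spends more time settling.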

Summary of steps to build state of the art Image classifier using fast.ai

Enable data augmentation, and precompute=True
Use lr_find() to find highest learning rate where loss is still clearly improving
Train last layer from precomputed activations for 1-2 epochs
Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
Unfreeze all layers
Set earlier layers to 3x-10x lower learning rate than next higher layer
Use lr_find() again
Train full network with cycle_mult=2 until over-fitting
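As a rough sketch of the "3x-10x lower learning rate for earlier layers" step, here's a hypothetical helper (differential_lrs is my own name, not a fast.ai function) that builds the array of learning rates, assuming three layer groups.

```python
def differential_lrs(top_lr, n_groups=3, factor=10.0):
    """Learning rates per layer group, earliest first; each group is
    `factor` times lower than the next one up."""
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

# Dataset similar to what the pre-trained network saw: 10x steps
similar = differential_lrs(1e-2, factor=10)    # roughly [1e-4, 1e-3, 1e-2]

# Dataset that looks quite different (e.g. satellite images): 3x steps
different = differential_lrs(1e-2, factor=3)   # roughly [1e-2 / 9, 1e-2 / 3, 1e-2]
```

The last entry is the rate for the final (fully connected) layers, and the earlier entries shrink so the general-purpose early layers change the least.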

Some Useful Links

kaggle-cli [2] - nice command line tool to download kaggle datasets
Visualizing and Understanding CNNs [2]

Sunday, December 10, 2017

Week 1 : fast.ai - 1 v2 - CNNs, Kernels, Optimal Learning Rate

Deep Learning is essentially a particular way of doing Machine Learning where you give your system a bunch of examples and it learns the rules and representations, instead of you manually programming the rules.

We have interesting applications today like Cancer diagnosis, Language Translation, Inbox by Google, Style Transfer, Data Center Costs optimization and Playing the Game Of Go, among others, that are powered by Deep Learning. Jeremy also emphasizes the negatives that come with the growth of Deep Learning like algorithmic bias, societal implications, job automation, etc.

With Deep Learning we need an infinitely flexible mathematical function that can solve any problem. Such a function would be quite large with lots of parameters and we have to fit those parameters in a fast and scalable way using GPUs.

The fast.ai philosophy is closely modeled after some of the concerns Paul Lockhart voiced in his essay A Mathematician's Lament, and it pushes you to start by doing right away and then gradually peel back the layers, modify and look under the hood. The general feeling out there is that there is a survivorship bias problem in the Deep Learning space, which is typified by this Hacker News post. The only currency that should matter is how well you're able to use these tools to solve problems and generate value.

Convolutions

CNNs are the most important architecture for Deep Neural Networks. They're the state of the art for solving problems in many areas of Image Processing, NLP, Speech Processing, etc.

The basic building block of a CNN is the convolution, which on its own is a relatively straightforward process. A convolution is a linear operation that finds interesting features in an image. Performing one instance of the convolution operation (element-wise multiplication and addition) requires the following steps:

Identify kernel matrix - this is typically a 3 x 3 matrix or in some cases a 1 x 1 matrix
Pass kernel matrix over image (see figure below)
Perform element-wise multiplication between kernel and overlaying image pixels (see red box in image below)
Sum all the elements in the resulting matrix (in the figure below, the sum is 297)
Assign the sum as the new pixel value for the center pixel of the overlaid image crop in the activation map
This operation is repeated until you've completed passes over the entire image
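The steps above can be sketched in plain Python. This is a minimal illustration with no padding and a stride of 1, and the 4 x 4 "image" is made up. Like most deep learning libraries, it slides the kernel without flipping it (strictly speaking a cross-correlation), which makes no difference for a symmetric kernel like sharpen.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position multiply
    element-wise and sum to get one value of the activation map."""
    k = len(kernel)
    out_h = len(image) - k + 1      # no padding, stride 1
    out_w = len(image[0]) - k + 1
    activation_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        activation_map.append(row)
    return activation_map

# The sharpen kernel
sharpen = [[0, -1, 0],
           [-1, 5, -1],
           [0, -1, 0]]

# Hypothetical 4 x 4 grayscale image: a bright square on a dark border
image = [[10, 10, 10, 10],
         [10, 50, 50, 10],
         [10, 50, 50, 10],
         [10, 10, 10, 10]]

result = convolve2d(image, sharpen)   # 2 x 2 activation map
```

For the top-left position the center pixel is 50 and its four neighbours are 10, 10, 50 and 50, so the new value is 5 * 50 - 120 = 130, brighter than the original 50.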

There are other parameters like kernel stride and padding that determine the dimensions of the activation maps. I'll be doing a more in-depth post on Convolutional Neural Networks to discuss these and the full CNN pipeline.

Courtesy of setosa.io/ev/image-kernels/

In the figure above, we used the sharpen kernel. There are also a few other predefined kernels like sobel, emboss, etc. In a typical CNN pipeline, we start with randomly initialized convolution filters, apply a non-linear ReLU activation (remove negatives) and then use SGD + backpropagation to find the best convolution filters. If we do this with enough filters and data we end up with a state of the art image recognizer. CNNs are able to learn filters that detect edges and other simple image characteristics in the lower layers and then use those to detect more complex image features and objects in the deeper layers.

To train an image classifier using the fast.ai library you need to

Select a starting architecture: resnet34 is a good option to start with.
Determine the number of epochs: start with 1, observe results and then run multiple epochs if needed
Determine the learning rate: Using the strategy in the Cyclical Learning Rate paper, we keep increasing the learning rate until the loss stops decreasing. This will probably take less than one epoch if you have a small batch size. From the figure below, we want to pick the largest learning rate at which the loss is still decreasing (in this case learning_rate = 0.01)
Train learn object

Courtesy of fast.ai

Highlights

We used a pre-trained ResNet34 to train a CNN on data from the Cats vs Dogs Kaggle competition and obtained > 99% accuracy
Used a new method (Cyclical Learning Rates for Training Neural Networks) to determine the optimal learning rate which determines how quickly we update our weights.

Some Useful Links

Sunday, December 3, 2017

Week 0 : fast.ai - 1 v2 - Setting Up, GPU instances and some housekeeping

I think fast.ai has been one of the many exciting developments in the Deep Learning / Machine Learning education space over the past 1.5 years. Fast.ai really champions the top-down approach where the goal is to make Deep Learning more accessible and get you to become productive using these tools to solve a variety of problems without spending years working on a Ph.D.

I've had some false starts with Deep Learning. I literally have stacks of books, papers and PDFs on the subject and a multitude of unfinished Coursera courses and MOOCs but I feel what's usually missing is how you distill all that knowledge to help you become productive right away.

I had the opportunity to watch several videos from the previous fast.ai MOOC and I really liked the teaching style, so I decided to sign up for the second iteration of the MOOC and I must say I haven't been disappointed. Another added benefit is that the course is taught by Jeremy Howard. The next series of posts will highlight and summarize learnings from the class sessions. My goal with this is to help spread fast.ai's message of making state-of-the-art Deep Learning research / tools more accessible.

This is really powerful stuff and what you want is for people to use these Deep Learning techniques on some of the problems they're working on which may help push the envelope in many ways.

Hardware Setup

A lot has already been written on why GPUs and specialized hardware are great for Deep Learning so I'll cut to the chase and run you through several setup options you could use for the course

AWS : This is a great guide to set up an AWS instance. A p2.xlarge GPU instance is probably okay for the course and it runs about 90 cents an hour. If your p2 instance limit is 0, you may have to request an instance limit increase through the AWS support center, which usually takes a day or two to be approved. Jeremy also set up a fastai Community AMI in the Oregon and N. Virginia regions and a few other regions around the world (Mumbai, Sydney, Ireland) with all the needed software pre-installed.

Once you're done setting up your instance, the next steps would be to get your ssh keys and then finish the ssh config. My current setup is shown below. Notice the config file at ~/.ssh/config and its contents.

The setup allows you to ssh into your AWS instance by using the command ssh fast-ai. Notice that the Hostname is blank. You'll have to fill that in with your instance's IPv4 Public IP, which you can find on your EC2 dashboard. Do keep in mind that this changes whenever you stop/start your instance. You can either update this manually or write a script to replace the hostname each time.

Also, notice the last line in the config file. LocalForward establishes tunneling between your system and the AWS instance, allowing you to launch a jupyter instance on the AWS instance and then open the actual notebook on your local system. Running the command jupyter notebook gives the following output. We can then paste the url in a browser on our system to launch the notebook.
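For the "write a script to replace the hostname" option mentioned above, here's one way it could look in Python. This is only a sketch: the config contents below are a hypothetical stand-in (the user, key path, port and IP address are all assumptions, not my actual setup), and the function works on the config text so you can check the result before writing it back to ~/.ssh/config.

```python
def set_hostname(config_text, host_alias, new_ip):
    """Return config_text with the HostName line of `Host <host_alias>` set to new_ip."""
    out = []
    in_block = False
    for line in config_text.splitlines():
        words = line.split()
        if words and words[0].lower() == "host":
            in_block = host_alias in words[1:]      # entering a new Host block
        if in_block and words and words[0].lower() == "hostname":
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}HostName {new_ip}"     # overwrite the old / blank value
        out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical config entry; every value here is a placeholder
example_config = """Host fast-ai
    HostName
    User ubuntu
    IdentityFile ~/.ssh/aws-key.pem
    LocalForward 8888 localhost:8888
"""

updated = set_hostname(example_config, "fast-ai", "203.0.113.10")
```

You would call this with the new public IP each time the instance restarts and then write the returned text back to the config file.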

Crestle : This is the easiest way to get a GPU enabled notebook. It's built on AWS so you get access to an AMI via ssh. It also has the fast.ai datasets loaded in by default and it was built by a former fast.ai student. The setup process is quick compared to AWS and Paperspace.

Paperspace: Paperspace is a cloud platform that enables you to spin up a GPU enabled, fully-fledged Ubuntu desktop in a browser. They have faster / cheaper GPUs than AWS and you're charged for compute (by the hour) and storage.

They also have a fast.ai public template so it's a really quick way to set things up for the course.

Python / Anaconda : I also finally switched my python version from 2.7 to 3.6. This is trivial to do via Anaconda and you can easily switch between versions. The fast.ai pytorch library requires python 3.6.

Personal Workstation : It could make economic sense to build your own Deep Learning workstation. An NVIDIA 1080ti GPU would cost you around $700 or so and you can supercharge it with multiple GPU cards. I have a box with a 1060 card but I'm still running into some issues setting it up.

Saturday, December 31, 2016

Year in Review

It's been quite an interesting year to say the least.

Lots of new Machine Learning and Deep Learning tools and libraries were released into the wild reducing the barriers to entry.

I'm really hoping to turn a corner and do more writing next year. I just can't seem to shake off my writer's block.

Here again is Jeff Leek's Non-comprehensive list of awesome things other people did in 2016

Sunday, July 24, 2016

Data Science Bootcamp Reviews

This post is also posted in whole / part here :

When we started this, our primary goal was to bring to light as much information as we could regarding Data Science Bootcamps. We initially published several in-depth interviews with bootcamp founders. We’re still working on a few more and we are also embarking on the next stage.

We contacted and conducted detailed interviews with individuals who have graduated from different Data Science Bootcamps and as you might guess, we heard a lot of very interesting anecdotes.

We noticed a disparity between what we heard from the individuals we interviewed about bootcamp placements and outcomes and the information some bootcamps put out there.

We actually don’t think Data Science Bootcamps should guarantee placements or positive outcomes but a lot of them do imply it by using wordplay, sleight of hand or displaying statistics that are either out of date or are aggregates which may not be very useful.

A better approach in our opinion would be for the Data Science Bootcamps to publish 3 and 6 month post-mortems or detailed placement reports (covering at least 6 months) for each cohort they graduate. A prospective bootcamp student would probably find it more useful to know that for a cohort, 30% of the students were not looking for a job, 15% decided this wasn’t the right path for them and of the remaining 55%, most ended up with placements or positive outcomes, versus just saying the cohort had an 80% placement rate without providing any other information. So a 100% placement rate for a cohort might not always be as good as it sounds. You just have to look behind the curtain at the details.
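To make the arithmetic concrete, here's a tiny sketch with a made-up cohort of 100 students using the percentages above. The number of students placed (44) is purely hypothetical, as is the function name.

```python
def placement_rates(cohort_size, not_looking, opted_out, placed):
    """Compare the placement rate among job seekers with the rate over the whole cohort."""
    job_seekers = cohort_size - not_looking - opted_out
    return {
        "job_seekers": job_seekers,
        "of_job_seekers": placed / job_seekers,
        "of_whole_cohort": placed / cohort_size,
    }

# Hypothetical cohort: 30 not looking, 15 decided it wasn't for them, 44 placed
rates = placement_rates(cohort_size=100, not_looking=30, opted_out=15, placed=44)
```

Here the same 44 placements read as an 80% placement rate if you only count the 55 job seekers, but 44% if you count everyone who enrolled, which is why the breakdown matters more than the headline number.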

We know for a fact that some bootcamps kick people out that they feel will not be able to find a job and sometimes don’t include individuals that fall off the radar or aren’t able to find jobs in their placement stats.

People attend these Bootcamps for very different reasons. For some it’s probably to transition to a Data Science or Data related role; for others, it could be to skill-up and then make lateral moves within their organization or to work on their ideas and personal projects.

Over the next few weeks and months, we will be publishing some of these Data Science Bootcamp reviews here.

If you have attended and graduated from a Data Science Bootcamp and you’d like to do a review of your experience, we’d love to hear from you. Please fill out this form and we’ll reach out to you to conduct the short interview.

Having this information out there helps prospective Data Science Bootcamp students understand the dynamics of each of these bootcamps and the value it will deliver for them. This will also give them enough information to decide which program is the best fit for them / their goals.

Friday, April 8, 2016

Some more interesting links-6, Tensorflow, Falcon 9 #falconhaslanded

Google open sourced TensorFlow, one of its machine learning interfaces [PDF] [Slides]
Jeff Dean on TensorFlow
TensorFlow meetup recording

Startup Pitch Decks

Machine Intelligence Landscape 2.0

Google Self Driving Car

OpenAI is really looking more like Xerox PARC and Bell Labs

And the #falconhaslanded. Falcon 9 first stage landing on a Drone Ship. I guess with Elon Musk it was always really a matter of "when" and not "if". Another step closer to re-usability and Mars.

Thursday, December 31, 2015

Year in Review

It's been quite a year.

It appears we're moving closer to the Hardware + Software + AI singularity and all the stuff that comes with that... and it's kind of scary.

Here again is Jeff Leek's Non-comprehensive list of awesome things other people did in 2015.

Some of the majors like Google, Facebook, Baidu and Microsoft open sourced some of their internal Deep Learning tools / frameworks. Most of the value coming from these tools will be interesting and useful products built with / on them.

Also, this one literally sent chills down my spine - Landing of Falcon 9 first stage. I guess we're one step closer to Mars and becoming a multi planetary species.

I didn't put much content out there this year. I'm hoping to do more writing next year.

Stay tuned...

Wednesday, August 5, 2015

Some more interesting links-6, Movie Math, Random Forests, TED, Unicorns

Remember that Math problem Matt Damon solved in Good Will Hunting? ... It turns out this problem is actually accessible to us mere mortals. Do check out this awesome video explanation.

I'm a fan of TEDtalks. Here is a nice playlist of interesting Data Related TEDtalks

A very detailed coverage on feature transformations

Random Forest workhorse : [Paper 1] [Paper 2]

Nice coverage on python

Deep Learning libraries by language

Sam Altman's Startup Class

YC Open Office Hours, Fellowship [1] [2], Research and Blog

Just in case anyone was keeping score, TC's Unicorn Dashboard

Sunday, July 19, 2015

Book Review : Elon Musk - Tesla, SpaceX and the Quest for a Fantastic future

This book gives you a glimpse into the man, the machines and companies he has built, and the trials you'll face as an entrepreneur. While reading you'll experience occasional bursts of laughter. This is quite an interesting read.

This is the second biography I've bought and read (the first was of Robert Oppenheimer, one of the principal architects of the Manhattan Project).

Elon Musk was born in South Africa and moved to Canada when he was 17 to attend college. He eventually transferred to the University of Pennsylvania to continue his studies.

I just keep wondering what would have happened if he hadn't been able to make it to the US. The companies he founded and is/was involved with at a high level - PayPal, SolarCity, Tesla, SpaceX - which collectively employ tens of thousands of people, may never have happened. I guess we'll never know.

This guy is transforming three multi-billion dollar industries and their derivatives at the same time. Truly a modern day Renaissance man.

Just imagine... in our lifetimes (in 20 or 30 years or maybe even less), humans will have boots on the ground on Mars and Elon Musk is leading the vanguard here.

On a recent visit to the Tesla Factory in Fremont, it seemed he's owning the whole Iron Man / Tony Stark comparison.

Friday, May 15, 2015

Choosing a Data Science Bootcamp program? - questions to ask, things to look for and look out for

Over the past year, I have had the opportunity to speak with a lot of prospective Data Science bootcamp students, sharing my pre and post bootcamp experiences and helping them put in context some of the major factors they need to consider before deciding to attend a Data Science bootcamp. This post summarizes the thoughts I've shared privately with those students: things they should look for and things to look out for.

The list below may not be all-encompassing as each prospective Data Science bootcamp student is unique in their own way and in what they hope to get out of the experience.

Do keep in mind this list was put together for those considering full time Data Science bootcamp programs.

Without further ado.. here we go

Background : Data Science is a hybrid role. Having a background with the right mix of Quantitative skills, Programming, Statistics, Math, Business Acumen, Databases, and Machine Learning would probably work in your favor. 6 or 12 weeks is a very short time to learn these things from scratch.

Also, having a good background improves your chances of getting into one of these Data Science bootcamp programs. I hear they're getting quite competitive these days.

Cohort Makeup : At most bootcamps, half of your learning comes from lectures and interactions with the instructors, TAs and guest speakers. The other half comes from working and collaborating with your cohort mates on the course materials and projects. It is important that a cohort have people with a diversity of past educational / professional experiences. Your cohort mates will become your friends, co-workers, collaborators and maybe even co-founders.

Placement Rate : This is a really interesting one. A bootcamp with 100% placement may not always be the best choice. I've heard some bootcamps drop students who they feel may not be able to find a job and don't include them in the numbers. Prospective students have to dig deeper on the placement rates and ask the following questions:

Percent of students placed in actual Data Science roles
Percent of students placed within one month or three months of finishing the program
Percent of students placed through Hiring Day
Percent of students placed through an introduction that the bootcamp made
Percent of students actually looking for a job post bootcamp
Median Salaries for students placed
Salary Range for students placed

Going through a Data Science bootcamp is definitely not a silver bullet. There are a lot of people that go through these programs and still end up with non-optimal outcomes.

Hiring Day : As far as I know, most of the Data Science bootcamps have a hiring day event where students get an opportunity to present their capstone projects to potential employers and "speed date" with those employers. Some Data Science bootcamps have exclusive hiring events with employers in their Hiring Network or guest lectures and presentations from companies that might be looking for talent.

Cost : This could be a major factor in deciding to attend a bootcamp. Data Science Bootcamp program tuition ranges from free to $16,000. There might also be other costs like room and board, incidentals, relocation and lost wages. What this mix looks like will be different for each prospective student.

One way to look at cost is that this is a short term investment for a chance to break into a new career.

A good amount of the material you need to learn to become a Data Scientist is free and available on the internet. Some students that get admitted to these Data Science bootcamps could have chosen to lock themselves in a room for 6 months and study all this material and then emerge having learned all the material / skills required to be able to land a job.

This is entirely possible but you lose out on all the intangibles you get from attending an in-person Data Science bootcamp - mentoring from instructors/guest lecturers, structured learning, motivation, positive reinforcement, collaborating with cohort mates, networking, getting a different view on approaching and solving problems, etc. What these intangibles are worth / could be worth down the line should be carefully evaluated and added to the cost equation.

Interview Prep / Soft skills / Business Acumen : It is important to know how much time the Data Science bootcamp spends on soft skills, interviewing and white boarding. Most job interviews you go to may require you to work through programming problems, communicate the results of an analysis you have worked on to a technical / non-technical audience, work through a modeling case study, etc.

These are skills you get better at with practice. Some Data Science bootcamps weave this in as part of the curriculum so the students are more comfortable with this by the end of the program whereas others may reserve time towards the last few weeks of the program to work on these.

Curriculum : It is difficult to learn everything you need to become a Data Scientist in 6 or 12 weeks. You want to look for a program that will give you enough breadth and depth and a good enough foundation to start and build a career in this field.

Location : The majority of the Data Science bootcamp programs are based either in the Bay Area, New York or scattered through Europe (London, Dublin, Berlin) and most graduates end up working in those places. I've seen some set up shop in other tech metros like Boston, Seattle and Denver.

Contact Alumni : There is a lot of information to be gleaned from talking to past students of Data Science bootcamp programs you're considering. You'll get a raw and unfiltered view of their experience.

Projects : You should look for programs that will enable you to work on a variety of projects with small, medium and large data. This way you'll have a broad range of experiences and a portfolio of interesting projects or analyses to talk about once you hit the interview trail.

In-Person vs Online : It is very difficult to replicate the collaborative environment of a full time in-person Data Science bootcamp in an online setting. Assuming there are no other extenuating factors, choosing a full time in-person bootcamp should be the preferred option.

Established vs New Programs : This is actually one of the most frequent questions I get. Prospective students are usually torn between going with a more established program which has gone through several cohorts and has established a track record versus a new and upcoming program which may have gone through one or two cohorts or is just getting started.

Prospective students need to evaluate Data Science bootcamp programs they're considering on their merits and the factors that are actually most important to the student. There are advantages going with either a more established program or a much newer one. The prospective student needs to do some introspecting after which the path they have to take becomes very obvious.

To keep things in context, none of these Data Science bootcamp programs existed 3 years ago.

Alumni Network : Generally, this is a perk. Going to a bootcamp with a strong alum network could sometimes make the difference. You could be exposed to opportunities and / or jobs that you may not otherwise have access to. Having access to an active and collaborative alum network is worth its weight in gold (or whatever precious metal you prefer)

Outcomes : Students going through Data Science bootcamps usually have different goals. For most, it's probably getting a Data Science / Machine Learning focused job. For others it could be gaining a skill set that'll enable them to work on their own ideas, break into the industry and / or move up the ladder in their current job. Whatever those goals are, going with a bootcamp that can work with you or even personalize some of the curriculum to ensure you're getting the best value for your time and money would be ideal.

As far as customizing the curriculum, some of the bootcamps with smaller cohorts will have a much easier time doing this.

These are outcome based programs so you should go with a program that'll give you the best chance of finishing with a positive outcome whatever that may be.

As with programming bootcamps, Data Science bootcamps are now becoming commoditized. Some of these Data Science bootcamps consider themselves Post Doctoral training programs while others want to own a different segment of the market.

If you're considering a Data Science bootcamp program, do go through this list and then pick the program that is the best fit and will deliver the highest delta / value for you.

Hopefully, this blog post will help start the conversation.

Saturday, March 14, 2015

In honor of Pi Day : Estimating the value of pi via Buffon's Needle

Last year we estimated the value of $\pi$ via Monte Carlo simulation. This year, we'll be revisiting the same exercise but using a different approach : Buffon's Needle. This approach is actually one of the oldest geometrical probability problems and it involves dropping needles on a lined sheet of paper and calculating the probability of the needles crossing lines in the page. This technique was first used by 18th century mathematician Georges-Louis Leclerc, Comte de Buffon.

In this scenario, we'll be dropping a bunch of randomly generated needles of length 1 on a grid with vertical lines. The spacing between the vertical lines is also of length 1. It turns out that you can estimate the value of $\pi$ by taking the ratio of the number of needles you dropped (Drops) to the number that crossed any of the vertical lines (Hits) and multiplying by twice the length of a needle. See this ipython notebook for code used.

The following two graphs show our grid with 100 and 1000 randomly generated needles respectively

Let's work through the math:

$ 2 \times \text{needle length} \times \frac{\text{Drops}}{\text{Hits}} \approx \pi $

where length of needle is 1 and the length of the spacing between the vertical grid lines is also 1
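As a rough illustration of the formula above, here is a minimal Python sketch of the experiment. It is not the code from the notebook; the function name `estimate_pi` and its parameters are made up for this example, and it assumes the needle is no longer than the line spacing. The needle direction is drawn by rejection sampling inside the unit disk so that the simulation never uses the value of $\pi$ it is trying to estimate.

```python
import math
import random


def estimate_pi(drops, needle_length=1.0, spacing=1.0, seed=None):
    """Estimate pi by dropping needles on a grid of vertical lines.

    Hypothetical helper for illustration; assumes needle_length <= spacing.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(drops):
        # Horizontal position of the needle center, measured from the
        # grid line immediately to its left.
        x = rng.uniform(0.0, spacing)

        # Pick a uniformly random direction without using pi:
        # sample a point in the unit disk and normalize it.
        while True:
            u = rng.uniform(-1.0, 1.0)
            v = rng.uniform(-1.0, 1.0)
            r2 = u * u + v * v
            if 0.0 < r2 <= 1.0:
                break
        cos_theta = u / math.sqrt(r2)

        # Half of the needle's horizontal extent.
        half_span = 0.5 * needle_length * abs(cos_theta)

        # The needle crosses a line if it reaches the line on its left
        # or the line on its right.
        if x - half_span <= 0.0 or x + half_span >= spacing:
            hits += 1

    if hits == 0:
        raise ValueError("no needle crossed a line; use more drops")

    # 2 * needle length * Drops / Hits (spacing is 1 in the post).
    return 2.0 * needle_length * drops / (spacing * hits)


if __name__ == "__main__":
    for n in (100, 1000, 100000):
        print(n, estimate_pi(n, seed=0))
```

The estimate is noisy for small numbers of drops and only settles down slowly as the number of drops grows, which is typical of Monte Carlo estimates.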

The graphs below were generated from a few hundred trials. For each trial, we increased the number of randomly generated needles. We can see the estimated value of $\pi$ is about 3.12 which is a bit off from the true value of 3.14. I suspect there might be something going on with how the random needle center coordinates are generated since the needle graphs above are showing some symmetry. Regardless, we are still within 1% of the true value of $\pi$.

It's actually pretty cool to see how the value of $\pi$ sneaks out from the woodwork. There's probably a more intuitive way to explain how $\pi$ shows up in places we least expect
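One way to see where the $\pi$ comes from, sketched under the standard assumptions (the symbols $l$ for needle length and $d$ for line spacing are my own notation, with $l \le d$, and the needle's position and angle are uniformly random): let $x$ be the distance from the needle's center to the nearest line, uniform on $[0, d/2]$, and let $\theta$ be the acute angle between the needle and the lines, uniform on $[0, \pi/2]$. The needle crosses a line exactly when $x \le \frac{l}{2}\sin\theta$, so

$ P(\text{Hit}) = \int_{0}^{\pi/2} \int_{0}^{\frac{l}{2}\sin\theta} \frac{2}{d} \cdot \frac{2}{\pi} \, dx \, d\theta = \frac{2l}{\pi d} $

With $l = d = 1$ this gives $P(\text{Hit}) = \frac{2}{\pi}$. Since Hits / Drops approximates that probability, $2 \times \frac{\text{Drops}}{\text{Hits}} \approx \pi$. In other words, $\pi$ enters through the uniformly random angle of the needle.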

For all the code used for this analysis, visit this ipython notebook

Saturday, February 28, 2015

Some more interesting links-5, YC Companies, ipython and more pandas

Comprehensive list of all YC companies and a comprehensive list of all accelerators / cohorts

Good article explaining *args and **kwargs and generators in python

Pivot Tables in pandas and more pandas

Gallery of some of the best ipython notebooks

Interactive ipython notebooks

Monday, January 19, 2015

Slides from Getting Started with Vowpal Wabbit Talk #Vowpal Wabbit

Slides from Getting Started with Vowpal Wabbit talk

Wednesday, December 31, 2014

Year in Review

It's been a really interesting year. I moved to the Bay Area. It's one thing to read about Silicon Valley or visit briefly. It's another to actually live out here and experience all it has to offer. This is the center of this data revolution everyone seems to be talking about. Obviously, if you can manage the ridiculously expensive housing and how much more expensive everything else is out here, then you should be fine.

To wrap up the year, here is Jeff Leek's Non-comprehensive list of awesome things other people did in 2014. It has an R slant since he's a statistician.

I had more blog posts and traffic this year than in the previous 3 years combined. Hoping this trend continues. Just looking at my traffic, it does appear there is a lot more interest in Data Science Education and immersive experiences like boot camps.

Going forward, I plan to do more tutorial style posts showing side projects or other interesting tech I encounter.

I do want to spend more time delving into Deep Learning. Starting with the nuts and bolts and then moving to available libraries / implementations and sharing some of what I learn along the way... stay tuned

Monday, December 22, 2014

Some more interesting links-4, Machine Intelligence, TDA, ipython notebooks

Most Topological Data Analysis tools are either stuck in academic research papers or Company intellectual property. DataRefiner might help to change that

Python for Exploratory Computing : Collection of ipython notebooks showing python basics, statistics and advanced python topics

A collection of ipython notebooks on hacking security data

This is the future of education: Open Loop University, where your education is spread over several years. You'll have periods of work with schooling interlaced in between

Detailed infograph showing major players in the Machine Intelligence space

You should look at this if you're interested in the Quantified Self space

I've been looking for something like this. Instant temporary ipython notebooks hosted in the cloud

An extensive Deep Learning Reading list

Nice reading on Generative vs Discriminative Algorithms (Naive Bayes - Logistic Regression)

Sunday, August 17, 2014

Getting Started with Vowpal Wabbit - Part 1 : Installation

After a very long hiatus, I'm back blogging. I'm really excited about how the year is shaping up.... stay tuned.

I discovered Vowpal Wabbit about a year ago but only recently started using it. Vowpal Wabbit is a very fast out-of-core learning system. It's the brainchild of John Langford, and development has been supported by Microsoft Research and, in the past, Yahoo Research.

This is the first part of a series about getting started with Vowpal Wabbit

To get started on OSX, you need to ensure you already have XCode and Homebrew installed. If you already have these installed, run the command below to update Homebrew

brew update

Vowpal Wabbit has a few dependencies that also need to be installed via brew. The official docs have the boost library as the only external dependency, but I was having a few issues until I installed automake and libtool

brew install automake
brew install boost
brew install libtool

In order to prevent conflicts with Apple's own libtool, a "g" is appended when you install libtool so you have instead: glibtool and glibtoolize. The code below adds a symbolic link.

cd /usr/local/bin/
ln -s glibtoolize libtoolize

Clone the Vowpal Wabbit git repo for the latest code

git clone https://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit

Then you should run

./autogen.sh 
./configure 
make 
make install

If you are having set up issues or issues with dependencies, you may want to spin up a virtual machine. If you're on Ubuntu, you should run

sudo apt-get install vowpal-wabbit

Two of the most informative blogs out there with great coverage of Vowpal Wabbit are MLWave and FastML (it looks like this is behind a paywall)

Sunday, April 27, 2014

Some more interesting links-3, Quantified Self, Bandits

Extreme Quantified Self. This MIT professor analyzed about 90,000 hours of video / 140,000 hours of audio / 200 terabytes of home videos to understand how his child's speech developed. This is probably one of the coolest things I've seen. He started a company (Bluefin Labs) around the technology he used for the analysis and then sold that company to Twitter for a fat wad of cash. This is his TED talk

You definitely want to utilize the resources at your local public library. These days they have amazing resources like access to Safari which gives you access to O'Reilly and Packt titles

Some very good advice if you're on the interview trail - Always Be Coding

A nice visualization / simulation of what's happening in a Multi-armed Bandit problem

Friday, April 25, 2014

Zipfian Academy - All 12 weeks

Here you go: a week-by-week summary of my experience at Zipfian Academy

Friday, April 18, 2014

Week 12 : Zipfian Academy - And That's All folks....

And so, all great things must come to an end. This is the final week for the bootcamp program. We continued interview prep, white boarding and code reviews. Apparently interviewing feels like having a full time job. Towards the end of the week, we continued with project one on ones, some more white boarding and interview prep, runtime and complexity analysis.

At the end of the week, we had a get together to celebrate the past 12 weeks. A handful of alums from the last cohort attended and it's kind of cool to see what past alums are doing now. Some are at stealth startups and other startups while others are working at some very impressive companies.

Highlights of the week:

We had a guest lecture from former cosmologist and Data Scientist @datamusing on using topic models to understand restaurant reviews. The topics were learned from the review corpus using LDA and NNMF. He also had a pretty cool d3 + flask visualization to show the results
We spent a day at the Big Data Innovation Summit. The morning talks mostly felt like business sales pitches. The afternoon talks were a lot more interesting as there were breakouts for Data Science, Machine Learning, Hadoop, etc
In the Data Science breakouts, there were a lot of LDA related talks including using topic modeling in Health Care and using LDA to extract features for matches in a dating website.
Lots of interview prep and white boarding

And so this is it. My hope is that someone actually finds my ramblings over the past 12 weeks somewhat helpful in forging their own path into Data Science......signing off.

Sunday, April 13, 2014

Slides from Project Presentation #How Will My Senator Vote?

Here are slides from my project presentation on analyzing How Senators vote in Congress and building a model to predict how they would vote on future bills

Saturday, April 12, 2014

Week 11 : Zipfian Academy - The Beginning of the End

We started the week wrapping things up with our personal projects, putting together decks for our Hiring Day presentations and doing mock runs of our presentations. Towards mid-week, we did more mock runs and put final touches on our presentation decks.

Hiring Day was pretty hectic. It started off with a short mixer with representatives from the various companies that attended. Each of the companies did a quick presentation on who they were and what they were looking for. Once that was done, we proceeded to present each of our projects, taking a few questions from the audience at the end of each presentation. There were a lot of really cool projects.

After project presentations and lunch, we had "speed dating" sessions with each of the companies that attended. It was a couple of minutes introducing yourself to the company, hearing what they were looking for and seeing if there's a good fit. It was quite tiring going through 16 or so interviews in the span of two hours but it was a worthwhile experience.

Most of us spent the last day of the week cleaning up and refactoring our project code.

Project Next Steps : I do plan to continue working on my project down the line, making some more improvements to my pipeline, looking at new and richer data sources, asking more interesting questions and doing some more analysis to improve my prediction accuracy. There's still a lot of ground to cover here. I also plan to use Latent Dirichlet Allocation (LDA) to extract better features from my data as you can pull out really rich and interesting features from your data using topic modeling. My original model used a "bag of words" approach. The eventual goal would be to release this as a web app anyone could use.
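For readers unfamiliar with the term, here is a minimal sketch of what a "bag of words" representation looks like, using only the Python standard library. This is not the project's actual pipeline; the function names `bag_of_words` and `vectorize` and the sample bill titles are made up for illustration. Each document becomes a vector of word counts, with word order thrown away.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word appears in a piece of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def vectorize(docs):
    """Turn documents into count vectors over a shared, sorted vocabulary."""
    bags = [bag_of_words(doc) for doc in docs]
    vocab = sorted(set().union(*bags))
    # A Counter returns 0 for words a document does not contain.
    rows = [[bag[word] for word in vocab] for bag in bags]
    return vocab, rows


if __name__ == "__main__":
    # Hypothetical bill titles, purely for illustration.
    bills = [
        "A bill to fund public schools",
        "A bill to cut taxes and reform taxes on small business",
    ]
    vocab, rows = vectorize(bills)
    print(vocab)
    for row in rows:
        print(row)
```

Vectors like these can be fed directly to a classifier. A topic model such as LDA would instead summarize each document as a mix of a few topics, which is the kind of richer feature mentioned above.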

Highlights from the week:

We started the week with a guest lecture from @itsthomson. He is the founder of framed.io. He just finished the YC program and had lots of words of wisdom. He walked us through his experience making the transition from academia to Data Science, moving to a Chief Scientist role and now Founder. It's refreshing to hear from someone that has gone through the process. Some quotes from his lecture : "Data is the most offensive (vs defense) resource a company has", "In Data Science, you have to know a little of everything", "Being technical helps, but being convincing is better", "Understanding how your analysis ties back to your business / organization is key"
We attended a Data Science for Social Good panel event at TaggedHQ. The panelists included CTO - Code for America, CEO - Lumiata, Data Scientist - BrightBytes, Data Scientist - OPower and Lead Data Scientist - Change.org. These companies are utilizing data science to make a difference. It was a very insightful panel session.
Hiring Day was rather interesting. 16 companies attended. The companies came from different verticals including CRM, consulting, social good, social, health, payments, real estate, education and infrastructure. It was interesting hearing some of the problems they were trying to solve in their respective domains

Saturday, April 5, 2014

Data Science Bootcamp Programs - Full Time, Part Time and Online

I've gotten a lot of inquiries on options to move into Data Science. This is my attempt to answer that question. If I excluded any programs from this, please feel free to ping me. You'll see that there are quite a few options and you need to find the best fit based on your profile. This list does not include any university programs.

Everyone seems to reference the quote from Google Economist Hal Varian "Being a statistician is the sexiest job of the 21st century" and the McKinsey report about the shortage in Data Science talent.

For a guide on the factors to consider when choosing a program, the article Choosing a Data Science Bootcamp Program should be helpful.

We are collecting and publishing detailed Data Science Bootcamp Reviews from students that have attended and graduated from the various Data Science Bootcamps

Visit this link for more in depth coverage of Data Science Bootcamp Programs including Interviews with Data Science Bootcamp Founders

Regarding Data Science Interview Resources : I hear from a lot of people asking about interview resources and the most efficient way to prepare for Data Science interviews. At a lot of companies and startups, a very important component of the interview process is the Take Home Data Challenge and / or Onsite Data Challenge. Another important component is the Theory interview; I'll talk more about this later.

This is also a great resource for individuals who feel they have the background and experience to interview for jobs without going through a bootcamp type program.

To become more familiar with and get efficient working on Data Challenges, I recommend taking a look at the Collection of Data Science Take-home Challenges book. It gives very clear and realistic examples of some of the types of problems you could face on a Data Challenge and projects you could potentially work on as a data scientist

Here we go...

Full Time

Zipfian Academy : This is not a 0-60 school. It's more like 40-80. They are currently about to graduate their second cohort.

Notes : Of all the Data Science bootcamps, Zipfian has the most ambitious curriculum. Graduates from the first cohort are currently working in Data Scientist roles across the Bay Area. I'm currently part of the second cohort
Location : San Francisco, CA
Requirement : Familiar with programming, statistics and math. Quantitative background
Duration : 12 weeks

Update : Since the initial post went up a few months ago, Zipfian Academy has added two more programs

Data Engineering 12 - week Immersive : This follows the same format as the Data Science Immersive. The first cohort for this program will start January 2015

Notes : This follows the same format as the Data Science Immersive
Location : San Francisco, CA
Requirement : Quantitative / Software Engineering background
Duration : 12 weeks

Data Fellows 6 - week Fellowship : The first cohort for the fellows program will start Summer 2014

Notes : This program is free for accepted fellows
Location : San Francisco, CA
Requirement : Significant Data Science Skills, Quantitative background
Duration : 6 weeks

Also see a recent google hangout explaining these new programs : Zipfian Academy Data Fellows Program - Information Session

Data Science Europe Bootcamp : This looks like it's modeled after the Insight program. Select a small group of very smart people with advanced degrees and help them get ready for Data Science roles in 6 weeks.

Interview with Data Science Europe Founder

Data Science Europe Student Reviews

Notes : It enrolls the first cohort January 2015. Also if you don't receive an offer for a quantitative job within 6 months of completing the course, you'll receive a full refund on tuition paid. They're currently on their second cohort and have a 100% placement rate
Location : Dublin, Ireland
Requirement : Quantitative Degree, Programming knowledge and Statistics background. It looks like they prefer graduate students and Post Docs but are open to applications from undergrads.
Duration : 6 weeks

Insight Data Science : Accepts only PhDs or PostDocs. They have completed 5 cohorts in Palo Alto and are opening up a new class in New York this summer. From their website, it does look like they have almost perfect placement. It is project based self directed learning, so if you need some hand holding or you're not already very familiar with the material this may not be the program for you

Insight Data Science Student Reviews

Notes : No Fees, pays Stipend
Location : Palo Alto, CA / New York, NY
Requirement : PhD / PostDoc
Duration : 7 weeks

Insight Data Engineering : They'll enroll the first cohort this summer. Bootcamp will focus on the data engineering track. It is project based self directed learning, so if you need some hand holding or you're not already very familiar with the material this may not be the program for you

Notes : No Fees
Location : Palo Alto, CA
Requirement : strong background in math, science and software engineering
Duration : 7 weeks

Data Science Retreat : Follows the same format as Zipfian but is based in Europe

Interview with Data Science Retreat Founder

Notes : Curriculum is mostly in R, though they support other languages (python). They have tiered pricing for the class, so you can pay for which tier meets your needs
Location : Berlin
Requirement : Experience with programming, databases, R, Python
Duration : 12 weeks

Data Science For Social Good : hosted by the University of Chicago. The students work with non-profits, federal agencies and local governments on projects that have a social impact

Notes : they focus on civic projects or projects with social impact
Location : Chicago, IL
Requirement : It looks like they target academics (undergraduate and graduate students)
Duration : 12 weeks

Metis Data Science Bootcamp : This looks like it's modeled after the Zipfian program from a duration / structure / curriculum standpoint. It is owned by Kaplan which also recently acquired Dev Bootcamp. Looks like the big .edu players are trying to make a play for the tech bootcamp space

Interview with Metis Data Science Cofounder

Notes : It enrolls the first cohort Fall 2014. Individuals who are not already in the US, or who are international students, could obtain an M-1 visa to attend. They're probably the first bootcamp able to issue M-1 student visas
Location : New York, NY and San Francisco, CA
Requirement : Familiarity with Statistics and Programming
Duration : 12 weeks

Science to Data Science : They accept only PhDs / Post Docs or those close to completing their PhD studies. We are seeing more bootcamps adopt this model.

Interview with Pivigo / S2DS CEO

Notes : It enrolls the first cohort August 2014. There is a small registration fee for the course otherwise the program is free for participants
Location : London, UK
Requirement : PhD / Post Doc
Duration : 5 weeks

Level Data Analyst Bootcamp : This is one of the first full time Data Analyst bootcamps we've seen and it's run by a University which is also a first. I think folks in academia have realized that the typical university structure can't keep up with the pace of innovation in the space

Interview with Level Education Director

Notes : Curriculum looks standard for the Data Analyst and Marketing Analytics job track. They also run hybrid and full-time programs
Location : Boston, MA, Charlotte, NC, Seattle, WA, Silicon Valley, CA
Requirement :
Duration : 8 weeks

Praxis Data Science : This is another program coming with an interesting approach. Another option for individuals with a strong STEM and programming background who want to make a move into Data Science

Notes : It enrolls the first cohort Summer 2015. They also offer a money back guarantee and will refund up to half of the fees paid if you're unable to find a job within 3 months. This speaks to the fact that they have a vested interest in their students' success. The curriculum also seems to focus on building the practical skills needed to both land a role and continue to grow as a Data Scientist.
Location : Silicon Valley, CA
Requirement : Looks like they're looking for people with a STEM background (advanced degrees preferred) and programming / quantitative experience
Duration : 6 weeks

Insight Health Data Science : This is the first significant deviation we've seen from the norm (focus-wise). This program seems to have the same structure as the other Insight programs but the focus here is solely on Healthcare and Life Sciences. It is project based self directed learning, so if you need some hand holding or you're not already very familiar with the material this may not be the program for you

Notes : No Fees
Location : Boston, MA
Requirement : PhD / PostDoc
Duration : 7 weeks

Startup.ML Data Science Fellowship : Startup.ML is taking an interesting approach to Data Science education. Their fellows work on real problems with established Data Science teams or on undefined startup problems.

Notes : No Fees. They also enrolled their first cohort in March 2015. I would imagine the typical profile here is someone that may be much further along.
Location : San Francisco, CA
Requirement : Background in Software Engineering, Quantitative Analysis. Advanced Quantitative degrees
Duration : 4 months

ASI Data Science Fellowship : This is another program modeled after the Insight program. They pair students with an Industry partner which allows students to work on real business problems / data. They also have a modular program which allows for some customization.

Notes : No Fees
Location : London, UK
Requirement : PhD
Duration : 8 weeks

GA Data Science Immersive : General Assembly was actually one of the first outfits to start part time Data Science classes. It looks like they've decided to also jump into the fray with a full time Data Science Immersive

Notes : They've been doing part time Data Science classes for at least two years already
Location : San Francisco, CA
Requirement : Seems like they're interested in folks with quantitative backgrounds looking to transition to Data Science
Duration : 12 weeks

Catenus Science : Catenus is also taking a very different approach here. Catenus Science is a paid apprenticeship program helping skilled Data Scientists explore opportunities at different startups / domains

Notes : Paid Apprenticeship. Rotate through three different startups applying your skills to month long projects with these startups. The next session starts June 2016
Location : San Francisco, CA
Requirement : Background and Experience in Statistics, Machine Learning, Programming, Product Development. They're probably looking for people who are much further along.
Duration : 13 weeks

The Data Incubator : Accepts only STEM PhDs or PostDocs. The first class is starting summer 2014.

Notes : No Fees
Location : New York, NY
Requirement : PhD / PostDoc
Duration : 6 weeks

NYC Data Science Academy : This looks like it's also modeled after the Zipfian 12 week immersive. Another option for non-postdocs on the east coast looking to make the transition to Data Science

Notes : It enrolls the first cohort February 2015. Just looking at the curriculum, it appears well thought out and seems to cover a lot of breadth. They focus on R and Python and spend significant amounts of the course time covering both ecosystems.
Location : Manhattan, NY
Requirement : Looks like they prefer people with STEM advanced degrees or equivalent experience in a Quantitative discipline or programming
Duration : 12 weeks

Silicon Valley Data Academy : This also looks like another program modeled after the Insight program. It does look like they skew towards applicants that are much further along the skills spectrum

Notes : No Fees and they run both Data Science and Data Engineering programs
Location : Redwood City, CA
Requirement : Advanced Degrees / PhD / Post Docs , Extensive quantitative / engineering background
Duration : 8 weeks

Microsoft Research Data Science Summer School : targets upper level undergraduate students attending college in the New York area. Program instructors are research scientists from Microsoft Research

Notes : Each student receives a stipend and a laptop
Location : New York, NY
Requirement : upper level undergraduate students interested in continuing to graduate school in computer science or a related field or breaking into Data Science
Duration : 8 weeks

Part Time

General Assembly - Data Science : San Francisco / New York. Part time program over 11 weeks (2 evenings a week)
Hackbright - Data Science : San Francisco. Full Stack Data Science class over one weekend
District Data Labs : Washington DC. Data workshops and project based courses on weekends
Persontyle : London, UK. Offering R based Data Science short classes
Data Science Dojo : Silicon Valley, CA / Seattle, WA / Austin, TX. Offering data science talks, tutorials and hands on workshops and are looking to build a data science community
AmpCamp : This is run by UC Berkeley AMPLab. Over two days, attendees learn how to solve big data problems using tools from the Berkeley Data Analytics Stack. The event is also live streamed and archived on YouTube
NextML

Online
If you have enough time and patience to work through problems yourself, some of these resources will get you started with Data Science.

These bootcamps are popping up and thriving because there is currently an imbalance between demand and supply of Data Science talent. The acceptance rates at some of the full time bootcamps are anywhere from 1 in 20 to 1 in 40 (roughly 5% down to 2.5%).

p.s : I need to stress that with any of the programs listed above, you need to do your due diligence and ask the tough questions to find out if it's a good fit for you. You probably want to be on the look out for programs that are not transparent about their placement.

Update 1 - 05/14 : Added the new Zipfian programs, Persontyle
Update 2 - 07/14 : Added Metis, Data Science Europe, Science to Data Science
Update 3 - 08/14 : Added Data Science Dojo
Update 4 - 10/14 : Added AMPLab

Update 5 - 11/14 : Added Coursera/UIUC, Udacity Data Analyst Nanodegree, Thinkful, DataInquest

Update 6 - 12/14 : Added NYC Data Science Academy
Update 7 - 01/15 : Added Next.ML, Bitbootcamp, DataQuest

Update 8 - 04/15 : Added Praxis Data Science, Insight Health Data Science

Update 9 - 05/15 : Added Startup.ML Fellowship, ASI Fellowship
Update 10 - 09/15 : Added Silicon Valley Data Academy
Update 11 - 01/16 : Added GA Data Science Immersive, Level Data Analyst Bootcamp, Udacity ML Nanodegree, Leada
Update 12 - 05/16 : Added Catenus Science