Yet Another Data Blog: 2013

Tuesday, December 31, 2013

Year in Review

It has been a really interesting year. I changed the name of this blog to make it more relevant and had more visitors this year than last. I also wrote more blog posts this year than in the past two years combined and, of course, realized along the way that blogging is hard work. You don't know how your content will be received or even if anyone is reading it.

In the next year, I plan to write more tutorial-based posts about projects I'm working on or tech I'm using. That'll be more useful for me and for my potential readers, if there are any. I'm looking forward to a really exciting next couple of months and will try to document every step along the way.

A few new things I'd like to explore next year are the current Internet of Things revolution (a la Raspberry Pi, sensors, etc.) and the Quantified Self movement. These paradigms provide really interesting data to analyze and I've seen a lot of fascinating projects built on them.

Tuesday, October 29, 2013

NSA, PRISM, Data Brokers and your metadata

Earlier this summer, we learned about some really interesting NSA programs like PRISM and XKeyscore that scooped up tons of data about everything you do online, plus telephone and email metadata: who you call, how long the call lasts, who you email, when and how often, etc. And it seems like every few weeks we are graced with more revelations, like the NSA collecting millions of email address books, sharing call metadata collected by their NATO allies and spying on other world leaders.

There has been some emphasis that the data collected was just metadata (data about data), but having access to call and email metadata can probably be more intrusive than having access to the contents of actual phone calls or emails. This is because metadata can expose your network of friends or contacts, how often and how long you correspond with those contacts, etc. You can always control what you write in an email or say over a phone call, but you probably can't avoid contact with other human beings or live in isolation.

To test this, I found this really cool application called Immersion, designed by some folks from the MIT Media Lab. It helps you visualize your network using just email metadata (From, To and Cc fields and timestamp information). They use network detection and clustering algorithms to build up your network. I was actually very surprised when I tried it on my inbox. On the Immersion network diagram, a node represents a person while a line represents the communication between people.
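To give a feel for how little is needed to build such a network, here is a minimal sketch in Python. It is not the MIT tool's actual algorithm. The sample messages, the field names and the `build_contact_graph` helper are all made up for illustration, and it simply counts how often two people appear on the same message.

```python
from collections import Counter
from itertools import combinations

# Hypothetical email metadata: only From, To, Cc and a date, no message bodies.
emails = [
    {"from": "me@example.com", "to": ["ann@example.com"],
     "cc": ["bob@example.com"], "date": "2013-10-01"},
    {"from": "ann@example.com", "to": ["me@example.com", "bob@example.com"],
     "cc": [], "date": "2013-10-02"},
    {"from": "me@example.com", "to": ["carol@example.com"],
     "cc": [], "date": "2013-10-03"},
]


def build_contact_graph(messages):
    """Return a Counter mapping (person_a, person_b) to shared message count."""
    edges = Counter()
    for msg in messages:
        # Everyone who appears on the message, sender included.
        people = {msg["from"], *msg["to"], *msg["cc"]}
        # Each pair of people on the same message gets a heavier edge.
        for a, b in combinations(sorted(people), 2):
            edges[(a, b)] += 1
    return edges


graph = build_contact_graph(emails)
for (a, b), weight in graph.most_common():
    print(f"{a} <-> {b}: {weight} message(s)")
```

Even with three messages, the heavier edges already hint at who forms a group. A real tool would layer clustering and time information on top of counts like these.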

I would say I was quite surprised there was such an uproar over these NSA programs. If you are doing anything online these days, the expectation of ultimate privacy is probably an illusion, as someone is always listening and you've probably already given your data for free to LinkedIn, Facebook and other social media platforms.

Data brokers have actually been collecting data on everyone for ages. Companies like Acxiom, Versium and LexisNexis have troves of 'personally identifiable' and public data on hundreds of millions of people, which they sell to third parties who use this data for "marketing purposes". They use this data to build a better profile of you: the websites you visit, your financial situation, whether you have kids, what type of ads you click on, what type of programmes you watch, your religious and political affiliations, and the list just goes on. If you're really curious about the type of data Acxiom has on you, do check out About The Data.

The credit rating agencies also boast a huge cache of your personal and financial data. The only way to really escape any one of these companies is to live off the grid. Below is a nice visual on the sometimes opaque world of data brokers.

by amccartney

If history is any guide, this is probably not the last we've heard of such NSA data aggregation programs. Back in the 40's, there was Project SHAMROCK, which collected all incoming and outgoing telegraphic data from the US, and there were a few similar projects through the decades. This much is obvious: the intelligence community is fraught with half truths and there are probably a few other programs out there we still don't know about. It's like a game of poker: you don't show your hand until you have to or are forced to (a la Snowden).

Tuesday, October 22, 2013

Real Time Speech Translation

Now this is pretty cool: real time language translation...

Language translation has come a long way, from Hidden Markov Models and pattern recognition to today's deep neural networks. These deep neural networks are also behind the new Google 'Hummingbird' algorithm.

Tuesday, October 15, 2013

US Wealth Inequality : A Data Introspective

An excellent example of data bringing to life some of the facts most of us already know about wealth.

Saturday, October 5, 2013

Overview : Data Week 2013

The conference started with two days of bootcamps. There was a good representation of topics, from machine learning, Hadoop and visualization to NoSQL and R. The rest of the week was filled with exhibitions, talks and presentations by practitioners in the field.

Trends
There are a few recurrent themes or trends that I noticed:

Everyone is trying to interface with Hadoop. Everyone is jumping on the Hadoop bandwagon, and yes, I mean everyone, including database, statistics and ETL giants like SAS, Oracle, Informatica and Teradata, among others. There were companies showing their SQL-on-Hadoop solutions, querying JSON data with SQL, etc.
A lot of established companies are making big data plays. A few examples: Monsanto's acquisition of Climate Corporation, Home Depot's acquisition of BlackLocus, CSC's acquisition of Infochimps and eBay's acquisition of Decide. They understand that data is now a strategic asset, and having talent on hand to analyze and extract actionable insights from data is just par for the course.
More open source software is being developed at some of the most innovative companies and then released into the wild, like Cassandra (Facebook), Storm (Twitter) and Impala (Cloudera). Obviously this is not new.
Democratizing data and insights via easy-to-use APIs.

The underlying theme here is that if you can turn data into useful products that make people's lives easier and/or more efficient, or help companies understand and/or monetize their customers better, you're on your way to huge valuations and IPO riches (right...).

Collaboration
There were talks by Causes and Civic Data Challenge around collaboration and analyzing and using data for public good. You probably didn't know that every time you waste 10 seconds of your life trying to figure out the words on a reCAPTCHA, you're actually helping to digitize books, and when you try to learn a new language on Duolingo, you're helping to translate the web. These massive online collaborative efforts have helped sites like Wikipedia, which now has an army of proofreaders and writers.

Democratizing Data
There were a lot of vendors showcasing their data, analytics, machine learning and visualization APIs. Democratizing data by getting it in the hands of more people / decision makers and reducing time to insights will only help organizations become more agile and nimble.

Evolution of the Data Scientist
The term was officially coined a few years ago, but the role has always been around in various forms. Organizations are trying to work out how to fit data scientists into the product pipeline. Some of them discussed embedding them with various teams, deploying the data science team as a skunk-works, etc.

Summary
In all, there was a wide variety of speakers and topics about data. The value you get from these conferences is in discovering new/innovative companies in the space working on cool products/ideas and getting a chance to meet, listen to and converse with most of the names behind the big data tools, packages and utilities you already use.

Thursday, September 19, 2013

9 Major Tasks in Data Science

Most problems in Data Science can be grouped under these 9 broad tasks. Of course, some problems can be classified under more than one of these tasks.

Classification (Class probability estimation)
Regression (Numerical value estimation)
Similarity Matching (Recommendation systems)
Clustering
Co-occurrence grouping (Market basket analysis or associative rule mining)
Profiling
Link Prediction (Graphs or networks)
Data Reduction (Dimensionality Reduction : PCA, SVD, Factor Analysis)
Causal Modeling (Cause and effect pairs or groups)

Adapted from Data Science for Business

Monday, August 26, 2013

Machine Learning as a Service

There are a couple of start-ups working on delivering machine learning as a service. Some of these solutions provide a turn-key data science environment : some data prep, munging, models, predictions and visualizations

Prediction.io - this is an open source machine learning server. It looks like they focus mainly on recommendations and you can build and fine tune your own models

Yhat - build and deploy models with R and Python

BigML - this has a user friendly interface and great visualizations for non-techies. You also have the choice to modify and fine tune your models and can solve both classification and regression problems here

wise.io - they seem to have a wider variety of problems they can tackle. They also have their own optimized Random Forest implementation that showed some impressive stats when benchmarked against R, Weka and Scikit-Learn

Precog - another strong contender in the space. They accept a wider variety of input data types: logs, JSON, NoSQL, etc. They were recently acquired, so their service may be shut down.

Ersatz - their service uses deep neural networks (black box). Looks like they're still in private beta

Tuesday, August 13, 2013

Some Interesting links. Hyperloop, Google n-grams, PredPol

Finally, Elon Musk unveiled some details about the Hyperloop transport system. It would be quite interesting if this makes it out of conception and actually gets built. Right now, California is on the verge of spending more than 10x what it would cost to build the Hyperloop system in order to build one of the slowest high-speed rail systems in the world. A nice infographic of the Hyperloop Transport System

A very detailed Data Mining Map

An interesting visual mashup showing the Start-up Universe. This mashup shows everything you'd want to know, including company details, funding, VCs, rounds, etc. The data for this mashup is from the CrunchBase API.

I recently came across Google's N-gram viewer. It graphs yearly counts of n-grams over the past 200 years. You could literally see when people started using "Donut" over "Doughnut" (link). The rise of "Donut" starts at around the same time breakfast chains like Dunkin' Donuts were founded.
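The idea of counting n-grams per year can be sketched in a few lines of Python. The tiny corpus and the `yearly_counts` helper below are invented for illustration. This counts raw occurrences only, whereas the real viewer works on a huge book corpus and plots relative frequencies.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus of (year, text) pairs.
corpus = [
    (1900, "fresh doughnut and coffee"),
    (1950, "a donut shop opened next to the doughnut bakery"),
    (2000, "donut donut donut and one doughnut"),
]


def yearly_counts(docs, n=1):
    """Count every n-gram (n consecutive words) per year."""
    counts = defaultdict(Counter)
    for year, text in docs:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            counts[year][" ".join(words[i:i + n])] += 1
    return counts


counts = yearly_counts(corpus)
for year in sorted(counts):
    print(year, "donut:", counts[year]["donut"],
          "doughnut:", counts[year]["doughnut"])
```

Passing `n=2` would count two-word phrases instead of single words.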

The Atlas learning environment from O'Reilly. You could read a bunch of tech books both in early release and published.

PredPol, a system that helps predict crime in real time. They claim to place police officers at the right time and place, giving them the best chance to prevent crimes. The system analyses historical crime data and assigns probabilities of future events / crimes to regions of space and time. This probably sounds like Tom Cruise's sci-fi thriller, Minority Report, but without psychics and bathtubs. The system predicts the 'when' and the 'where' of a crime but not the 'whom'. I would not be surprised if the 'whom' can be predicted in the near future. PredPol was designed by scientists at UCLA, Santa Clara and UCI.

And finally, the road to becoming a Data Scientist can be a long and winding one. Here is Swami Chandrasekaran's take on the Data Science Curriculum.
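As a rough illustration of assigning probabilities to regions of space and time, here is a naive Python sketch. It is not PredPol's model, which I have not seen. It just buckets made-up historical incidents into grid cells and time windows and uses relative frequencies. The cell size, the window width and the sample data are all assumptions.

```python
from collections import Counter

# Hypothetical historical incidents: (x_km, y_km, hour_of_day).
incidents = [
    (0.2, 0.3, 22), (0.4, 0.1, 23), (0.3, 0.4, 22),
    (1.5, 0.2, 9), (0.1, 0.2, 21), (1.6, 1.7, 14),
]

CELL_KM = 1.0   # side length of a square grid cell (assumption)
WINDOW_H = 6    # width of a time window in hours (assumption)


def bucket(x, y, hour):
    """Map an incident to a (grid column, grid row, time window) cell."""
    return (int(x // CELL_KM), int(y // CELL_KM), hour // WINDOW_H)


def cell_probabilities(events):
    """Share of past incidents that fell in each space-time cell."""
    counts = Counter(bucket(*event) for event in events)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


probs = cell_probabilities(incidents)
for cell, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(cell, round(p, 2))
```

The cell with the highest share is where this toy model would send a patrol. Notice it says nothing about who is involved, only where and when.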

Monday, February 11, 2013

R for Pythonistas

This gives a nice translation of commands between R and Python/NumPy: Numpy for R
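To give a flavour of that kind of translation, here are a few common R commands next to NumPy equivalents. This is my own small selection, not taken from the linked page, and it assumes NumPy is installed.

```python
import numpy as np

# R: x <- c(1, 2, 3, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])

# R: seq(0, 1, length.out = 5)
s = np.linspace(0, 1, 5)

# R: m <- matrix(1:6, nrow = 2, byrow = TRUE)
m = np.arange(1, 7).reshape(2, 3)

# R: dim(m)
shape = m.shape

# R: m %*% t(m)
product = m @ m.T

# R: x[x > 2]
big = x[x > 2]

# R: mean(x); sd(x)   (R's sd divides by n - 1, hence ddof=1)
mean, sd = x.mean(), x.std(ddof=1)

print(shape, product, big, mean, sd, s, sep="\n")
```

One thing to watch when translating: R indexes from 1 and fills matrices by column by default, while NumPy indexes from 0 and fills by row.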

Wednesday, February 6, 2013

Installing JAGS and rjags on Ubuntu 12.04

I ran into a few issues earlier trying to install JAGS/rjags on Ubuntu. This set of instructions assumes you already have R installed on your system.

Download and install R (32/64-bit).
Download the Bayesian sampling program JAGS from here. Extract the contents of the tar file and install the JAGS program by running the following commands from the JAGS root folder.

Download and install the rjags package: I tried install.packages('rjags') in R, which failed initially on my system (Ubuntu). This may work if you're on a Mac or Windows. If you experience the same problems, you need to download the appropriate rjags tar file and then install it manually.

Rrasch

JAGS is a program used for the analysis of Bayesian hierarchical models using MCMC (Markov Chain Monte Carlo) simulations. I used JAGS over BUGS (WinBUGS/OpenBUGS) because it is cross-platform, runs on Linux and appears to be more robust, among other reasons.
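For anyone new to MCMC, here is a minimal random-walk Metropolis sampler in Python, just to show the general idea. It is not what JAGS does internally, since JAGS chooses its own samplers. The coin-flip model (7 heads in 10 flips, flat prior), the step size and the function names are all assumptions made for this example.

```python
import math
import random


def log_posterior(theta, heads=7, flips=10):
    """Unnormalized log posterior of a coin's bias under a flat prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(n_samples, step=0.1, start=0.5, seed=42):
    """Random-walk Metropolis: propose a nearby value, accept or stay put."""
    rng = random.Random(seed)
    theta = start
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = log_posterior(proposal) - log_posterior(theta)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        samples.append(theta)
    return samples


samples = metropolis(20000)
kept = samples[2000:]  # drop the first draws as burn-in
posterior_mean = sum(kept) / len(kept)
print("posterior mean of the coin bias:", round(posterior_mean, 3))
```

For this toy model the exact posterior is Beta(8, 4), with mean 8/12, so the sampled mean should land close to that. The point of tools like JAGS is that you describe the model and they handle the sampling for cases with no neat closed form.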

Thursday, January 10, 2013

CES 2013

While it would be really fun to follow the technology pilgrimage to Las Vegas every January, that's not always possible. Instead, I'll highlight a few of the ideas and gadgets from CES that I have found interesting.

Nest: The smart thermostat. They've been around for a while. This thermostat helps you save on energy by learning your usage patterns, and it looks really cool.

Aereo: They let you watch live TV over the internet for $8/month or $80/year. Finally, a cord-cutting alternative that doesn't break the bank. They capture over-the-air signals and give you access to about 30 channels. Currently only available in the New York area, but they are expanding across the country.

Liquipel: Everyone probably knows someone who's lost a phone to water. Liquipel applies a nano coating to your phone or tablet to protect it from liquid spills.

Lockitron: Keyless entry powered by your phone. You could literally open your apartment door from anywhere in the world.