Tuesday, October 29, 2013

NSA, PRISM, Data Brokers and your metadata

Earlier this summer, we learned about some really interesting NSA programs like PRISM and XKeyscore that scooped up tons of data about everything you do online, along with telephone and email metadata: who you call, how long the call lasts, who you email, when and how often, etc. And it seems like every few weeks we are graced with more revelations, like the NSA collecting millions of email address books, sharing call metadata collected by their NATO allies and spying on other world leaders.

There has been some emphasis on the fact that the data collected was "just" metadata (data about data), but having access to call and email metadata can arguably be even more intrusive than having access to the contents of the actual phone calls or emails. This is because metadata exposes your network of friends and contacts, and how often and how long you correspond with each of them. You can always control what you write in an email or say over a phone call, but you probably can't avoid contact with other human beings or live in isolation.

To test this, I found a really cool application called Immersion, designed by some folks at the MIT Media Lab. It helps you visualize your network using just email metadata (the From, To and CC fields plus timestamp information). They use network detection and clustering algorithms to build up your network. I was actually very surprised when I tried it on my inbox. On the Immersion network diagram, a node represents a person and a line represents communication between two people.
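Immersion's actual algorithms aren't shown here, but the core idea is simple enough to sketch: every pair of people who appear together in a message's headers get a tie, and ties strengthen with repetition. Below is a minimal stand-alone sketch of that idea using only the Python standard library; the email addresses and message dicts are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def build_network(messages, me="me@example.com"):
    """Build a weighted contact graph from email metadata alone.

    Each message is a dict with 'from', 'to' and 'cc' address lists.
    No subject lines or bodies are needed -- metadata is enough."""
    edges = Counter()
    for msg in messages:
        people = {msg["from"], *msg.get("to", []), *msg.get("cc", [])}
        people.discard(me)  # look at the network *around* the inbox owner
        # every pair of co-addressed people gets their tie strengthened
        for a, b in combinations(sorted(people), 2):
            edges[(a, b)] += 1
    return edges

# A toy inbox: three messages' worth of headers
inbox = [
    {"from": "alice@x.com", "to": ["me@example.com"], "cc": ["bob@x.com"]},
    {"from": "bob@x.com",   "to": ["me@example.com", "alice@x.com"]},
    {"from": "carol@y.com", "to": ["me@example.com"]},
]
graph = build_network(inbox)
print(graph)  # alice and bob share two messages; carol has no ties yet
```

Even on this toy inbox the graph already shows a cluster (alice and bob, tied twice) and an isolated contact (carol), which is exactly the kind of structure Immersion surfaces visually.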

I would say I was quite surprised there was such an uproar over these NSA programs. If you do anything online these days, the expectation of total privacy is probably an illusion: someone is always listening, and you've probably already given your data away for free to LinkedIn, Facebook and other social media platforms.

Data brokers have actually been collecting data on everyone for ages. Companies like Acxiom, Versium and LexisNexis have troves of 'personally identifiable' and public data on hundreds of millions of people, which they sell to third parties who use it for "marketing purposes". They use this data to build a better profile of you: the websites you visit, your financial situation, whether you have kids, what type of ads you click on, what type of programmes you watch, your religious/political affiliations, and the list just goes on. If you're really curious about the type of data Acxiom has on you, do check out About The Data.

The credit rating agencies also boast a huge cache of your personal and financial data. The only way to really escape all of these companies is to live off the grid. Below is a nice visual of the sometimes opaque world of data brokers.


If history is any guide, this is probably not the last we've heard of such NSA data aggregation programs. Back in the 40's there was Project SHAMROCK, which collected all incoming and outgoing telegraphic data from the US, and there were a few similar projects through the decades. This much is obvious: the intelligence community is fraught with half-truths, and there are probably a few other programs out there we just don't know about yet. It's like a game of poker; you don't show your hand until you have to or are forced to (a la Snowden).

Tuesday, October 22, 2013

Real Time Speech Translation

Now this is pretty cool: real time language translation...


Language translation has come a long way, from Hidden Markov Models through pattern recognition to today's deep neural networks. These deep neural networks are also behind the new Google 'Hummingbird' algorithm.

Tuesday, October 15, 2013

US Wealth Inequality : A Data Introspective

An excellent example of data bringing to life some facts most of us already know about wealth.


Saturday, October 5, 2013

Overview : Data Week 2013

The first two days of the conference started with bootcamps. There was a good representation of topics, from machine learning and Hadoop to visualization, NoSQL and R. The rest of the week was filled with exhibitions, talks and presentations by practitioners in the field.

Trends
There are a few recurrent themes or trends that I noticed:
  • Everyone is jumping on the Hadoop bandwagon.. yes, I mean everyone, including database / statistics / ETL giants like SAS, Oracle, Informatica and Teradata, among others. There were companies showing off their SQL-on-Hadoop solutions, querying JSON data with SQL, etc.
  • A lot of established companies are making big data plays: Monsanto's acquisition of Climate Corporation, Home Depot's acquisition of BlackLocus, CSC's acquisition of InfoChimps and eBay's acquisition of Decide, just to name a few. They understand that data is now a strategic asset, and having talent on hand to analyze and extract actionable insights from it is just par for the course.
  • More open source software developed at some of the most innovative companies and then released into the wild, like Cassandra (Facebook), Storm (Twitter) and Impala (Cloudera). Obviously this is not new.
  • Democratizing data and insights via easy-to-use APIs.
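The "querying JSON data with SQL" demos are easy to reproduce in miniature. The sketch below uses SQLite's JSON1 extension (bundled with most modern Python builds) as a stand-in for a SQL-on-Hadoop engine; the table, field names and records are made up for illustration, and the point is simply that plain SQL can aggregate over semi-structured JSON documents.

```python
import json
import sqlite3

# Stuff raw JSON documents into a single-column table, schema-on-read style
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")
rows = [
    {"user": "alice", "action": "click", "ms": 120},
    {"user": "bob",   "action": "view",  "ms": 45},
    {"user": "alice", "action": "view",  "ms": 300},
]
conn.executemany("INSERT INTO events VALUES (?)",
                 [(json.dumps(r),) for r in rows])

# Ordinary SQL over the JSON payloads via json_extract
cur = conn.execute("""
    SELECT json_extract(doc, '$.user') AS user, COUNT(*) AS n
    FROM events
    GROUP BY user
    ORDER BY user
""")
result = cur.fetchall()
print(result)  # [('alice', 2), ('bob', 1)]
```

The Hadoop-world engines shown at the conference differ in scale and syntax, but the schema-on-read idea (store documents raw, project out fields at query time) is the same.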

The underlying theme here is that if you can turn data into useful products that make people's lives easier and/or more efficient, or help companies understand and/or monetize their customers better, you're on your way to huge valuations and IPO riches (right...)

Collaboration
There were talks by Causes and the Civic Data Challenge around collaboration, and around analyzing and using data for the public good. You probably didn't know that every time you waste 10 seconds of your life trying to figure out the words in a reCAPTCHA, you're actually helping to digitize books, and when you try to learn a new language on Duolingo, you're helping to translate the web. Massive online collaborative efforts like these have also powered sites like Wikipedia, which now has an army of proofreaders and writers.

Democratizing Data
There were a lot of vendors showcasing their data, analytics, machine learning and visualization APIs. Democratizing data by getting it in the hands of more people / decision makers and reducing time to insights will only help organizations become more agile and nimble.

Evolution of the Data Scientist
The term was officially coined only a few years ago, but the role has always been around in various forms. Organizations are trying to work out how to fit data scientists into the product pipeline; some discussed embedding them with various teams, deploying the data science team as a skunkworks, etc.

Summary
All in all, there was a wide variety of speakers and topics about data. The value you get from these conferences is in discovering new, innovative companies in the space working on cool products and ideas, and in getting a chance to meet, listen to and converse with many of the names behind the big data tools, packages and utilities you already use.