Data Acquisition: I initially planned to source my data using APIs from The Sunlight Foundation and Votesmart, but quickly realized things might take much longer with the APIs since I needed several different datasets and also needed a way to aggregate all the data. I decided it would be better to go straight to the source: the US Senate website. Setting up and debugging my data ingestion pipeline took another two days, and by the end of the week all the data I needed was scraped, cleaned, and packaged nicely in a database and several Python pickle objects.
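The post doesn't include the actual pipeline code, but a minimal sketch of the scrape-and-pickle step might look like the following. The URL, the table structure, the database schema, and the file names are all illustrative assumptions, not the real pipeline:

```python
import pickle
import sqlite3

import requests
from bs4 import BeautifulSoup

# Hypothetical target page; the real pipeline scraped the US Senate site,
# but this URL and the table layout below are assumptions for illustration.
URL = "https://www.senate.gov/legislative/votes.htm"

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Pull each table row into a tuple of cell texts (structure is assumed).
records = [
    tuple(cell.get_text(strip=True) for cell in row.find_all("td"))
    for row in soup.find_all("tr")
    if row.find_all("td")
]

# Persist to a SQLite database (schema is a placeholder)...
with sqlite3.connect("senate.db") as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS votes (member TEXT, vote TEXT)")
    conn.executemany(
        "INSERT INTO votes VALUES (?, ?)",
        [r[:2] for r in records if len(r) >= 2],
    )

# ...and to a pickle object for quick reloading in later analysis.
with open("senate_votes.pkl", "wb") as f:
    pickle.dump(records, f)
```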
Data Transformation: Getting the data is one thing; transforming it so it is ready for analysis and modeling is another. Most data scientists tend to spend a large share of project time cleaning, aggregating, and transforming data, as in the sketch below.
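As an illustration of the kind of cleaning and aggregation involved, here is a small pandas sketch. The column names, the pickle file, and the aggregation are hypothetical, since the post doesn't show the actual data:

```python
import pickle

import pandas as pd

# Load the scraped records (file name matches the hypothetical sketch above).
with open("senate_votes.pkl", "rb") as f:
    records = pickle.load(f)

df = pd.DataFrame(records, columns=["member", "vote"])

# Typical cleaning steps: normalize strings, drop incomplete rows.
df["vote"] = df["vote"].str.strip().str.lower()
df = df.dropna()

# Typical aggregation: count each vote type per member.
summary = df.groupby(["member", "vote"]).size().unstack(fill_value=0)
print(summary.head())
```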
Highlights from the week:
- Got a chance to attend a meetup organized by BaRUG (Bay Area R Users Group). There was a talk from the author of the caret package (roughly the R equivalent of scikit-learn) and another from the Human Rights Data Analysis Group, which uses R to build statistical models for human rights projects across the globe.
- We had a guest talk from a former physicist who is now a Data Scientist @WalmartLabs, working with a group that deals with algorithmic business optimization. The talk was quite insightful, as he touched on some interesting pain points: "reconciling technical and business needs" and "the simpler the model, the better."
- We also had another guest lecture on visualization. The speaker also worked on this awesome visualization of BART employee salaries.
- Several of us attended a D3 workshop organized by the VUDlab at UC Berkeley.