See the projects and the lessons learned
I’ve decided to try to catalog some of the side projects I’ve done that I feel I can discuss publicly. Some of these were born of professional projects, while others were just things that piqued my curiosity. (Disclaimer: Unless explicitly granted permission, I never publish full results of professional projects. However, when possible I get permission to discuss them obliquely, without identifying the companies or the final results.)
Feel free to contact me if you’d like more information on any of these!
Projects and lessons learned (in roughly alphabetical order).
BasesRoaded
This is a baseball road trip planner, used to create individual road trips visiting a number of professional baseball stadiums. Available on the web since 2012, it can be found at www.basesroaded.com. It is kept up to date for the current season and still sees regular use.
Lessons learned:
As in other projects, the scraping of third-party data (to get the yearly schedule) remains painful, as sites change their HTML and access from year to year. I gained a great appreciation for web design, particularly how hard it is to do well!
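For flavor, here’s a minimal sketch of the kind of schedule scraping involved, using the rvest package. The URL, the table layout, and the Date column are hypothetical stand-ins, not the actual source:

```r
library(rvest)

# Hypothetical schedule page; the real source and its markup change year to year.
url <- "https://example.com/mlb/schedule/2024"

page <- read_html(url)

# Pull the first HTML table on the page into a data frame.
schedule <- page |>
  html_element("table") |>
  html_table()

# Typical cleanup step: parse the (assumed) Date column so games can be ordered.
schedule$Date <- as.Date(schedule$Date, format = "%m/%d/%Y")

head(schedule)
```

Every year some part of this breaks, which is the lesson above in miniature.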
BayesMatch – LinkedIn Consolidation
A project to consolidate the LinkedIn contacts of a company’s sales staff. It used Bayesian analysis to do a spell-checker-like comparison to determine common companies across the employees’ contact lists.
Lesson learned: It wasn’t that effective for such a limited amount of data (this was a small company). At some level, the automation wasn’t much less effort than just a database-backed reporting system using some fancy regex matching and a rudimentary spell checker à la Peter Norvig.
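The edit-distance flavor of the matching looks roughly like this sketch using base R’s adist; the company names are made up, and the real project layered the Bayesian weighting on top of something like this:

```r
# Toy contact data: the same employer spelled several ways.
companies <- c("Acme Corp", "ACME Corporation", "Acme Corp.", "Widgets Inc")

# Normalize case and punctuation, then compute pairwise edit distances.
norm <- tolower(gsub("[[:punct:]]", "", companies))
d <- adist(norm)

# Cluster names within 7 edits of each other (complete linkage).
cl <- cutree(hclust(as.dist(d)), h = 7)
split(companies, cl)
```

The edit-distance threshold is the weak point: too loose and distinct companies merge, too tight and variants stay separate.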
Career Year in Baseball
An investigation into the concept of the “career year” in Major League Baseball. This involved scraping the statistics, followed by an analysis in R. It can be found here.
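The gist of the analysis, sketched with a made-up data frame: flag a season as a “career year” when a rate stat sits well above that player’s own career norm. The column names and the 15%-above-career cutoff here are illustrative assumptions, not the actual criteria from the write-up:

```r
library(dplyr)

# Made-up season-level batting data: player, season, OPS.
seasons <- tibble(
  player = rep(c("Smith", "Jones"), each = 5),
  year   = rep(2001:2005, times = 2),
  ops    = c(.710, .725, .890, .730, .715, .800, .810, .805, .795, .820)
)

# A "career year" here: OPS at least 15% above the player's career average.
career_years <- seasons |>
  group_by(player) |>
  mutate(vs_career = ops / mean(ops) - 1) |>
  filter(vs_career > 0.15) |>
  ungroup()

career_years  # flags Smith's 2003 season in this toy data
```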
Concert Finder
A project to provide concert notifications for bands I was interested in. Essentially, I wanted to be notified whenever one of the 25 or so bands I followed announced a concert in my city. Initially this scraped each band’s website individually on a nightly basis. That became too labor-intensive as websites changed structure, and many were poorly formed HTML in the first place. I then switched to bands listed on www.bandsintown.com via their API. Eventually they shut down free access, so I stopped altogether.
Lesson learned: Relying on scraping individual sites scales poorly for a small project without other people to maintain it. Relying on a third-party API to access data has its own perils, as seen elsewhere on this list.
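The nightly check reduced to something like the sketch below: pull an events feed, diff it against what’s already been seen, and flag anything new. The endpoint, the JSON shape, and the column names are hypothetical; the real Bandsintown API (while I had access) differed:

```r
library(jsonlite)

# Hypothetical JSON events feed for one band, parsed into a data frame.
events <- fromJSON("https://example.com/api/band/events.json")

# Event IDs already notified about, persisted between nightly runs.
seen_file <- "seen_events.rds"
seen <- if (file.exists(seen_file)) readRDS(seen_file) else character(0)

# Anything local and not yet seen triggers a notification.
new_local <- subset(events, city == "Chicago" & !(id %in% seen))
if (nrow(new_local) > 0) {
  message(nrow(new_local), " new local show(s): ",
          paste(new_local$venue, collapse = ", "))
}

# Remember what we've reported so tomorrow's run stays quiet.
saveRDS(c(seen, new_local$id), seen_file)
```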
CPL Notifier
This was a scraping system to notify me of pending due dates for books I had borrowed from the local public library. This was prior to the library emailing such notifications itself, at which point I terminated the project.
Lesson learned: It’s always easier to use a free public service than to try to recreate and maintain it yourself!
Crime on the 606
An investigation into whether crime was increasing along the 606 trail in Chicago (also known as the Bloomingdale trail). This used crime data from the Chicago Open Data Portal to do a spatial analysis using GIS tools. Some results can be found here.
Lessons learned: When you live in a big city, maybe you don’t want to be aware of where every local crime occurs.
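The core of the spatial analysis looked roughly like this sf sketch: buffer the trail, then count the crime points falling inside. The file names and the quarter-mile buffer are illustrative assumptions:

```r
library(sf)

# Illustrative inputs: the trail as a line layer, crimes as lon/lat points.
trail  <- st_read("bloomingdale_trail.geojson")
crimes <- st_read("crimes.geojson")

# Buffering wants a projected CRS; EPSG 3435 (Illinois East, ftUS) suits Chicago.
trail_ft  <- st_transform(trail, 3435)
crimes_ft <- st_transform(crimes, 3435)

# Quarter-mile (1320 ft) buffer around the trail, then count crimes inside it.
zone   <- st_union(st_buffer(trail_ft, dist = 1320))
inside <- lengths(st_within(crimes_ft, zone)) > 0
sum(inside)
```

Comparing that count year over year (and against a control corridor) is what turns the counting into an answer.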
The Traveling Divvyer
A whimsical entry in the 2015 Divvy Data Challenge, this site created an efficient route visiting every Divvy station, starting from any chosen station. The website can be found here.
Lessons learned: This was a fun exercise and my first foray into deploying Leaflet to present dynamic maps on a website.
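R has a Leaflet binding as well; a minimal sketch of the kind of map involved, with made-up station coordinates already sorted into visiting order:

```r
library(leaflet)

# Made-up Divvy stations, assumed already ordered by the route solver.
route <- data.frame(
  name = c("Station A", "Station B", "Station C"),
  lat  = c(41.8781, 41.8847, 41.8919),
  lng  = c(-87.6298, -87.6351, -87.6278)
)

# Markers for each stop plus a polyline tracing the route between them.
leaflet(route) |>
  addTiles() |>
  addCircleMarkers(lng = ~lng, lat = ~lat, label = ~name, radius = 5) |>
  addPolylines(lng = ~lng, lat = ~lat)
```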
Eat Countries – Eating Globally, Locally
A project to investigate how many countries’ cuisines are represented by restaurants in the Chicagoland area. This involved scraping UN data for the list of countries and a lot of manual research and tabulation for the restaurants, followed by a GIS analysis in R. This can be found here.
Lesson learned: There’s still a lot of eating to be done.
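The bookkeeping behind that lesson is a simple anti-join; a sketch with toy data (the real lists came from the UN scrape and the restaurant research):

```r
library(dplyr)

# Toy versions of the two lists: all countries vs. cuisines found locally.
countries <- tibble(country = c("Mexico", "Ethiopia", "Laos", "Iceland"))
found     <- tibble(country = c("Mexico", "Ethiopia"))

# Countries whose cuisine hasn't turned up in a local restaurant yet.
anti_join(countries, found, by = "country")
```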
FARS Bicycle Fatalities Involving Cars
An investigation into national data on bicycling fatalities involving cars. The incident data was available from the FARS data site. An introductory post on this can be found here.
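Pulling the bicyclist subset out of FARS looks roughly like this; the file name is a placeholder, and the code values reflect my reading of the FARS person-type and injury-severity codings, so treat them as assumptions to verify against the current data dictionary:

```r
# FARS person-level file for one year (placeholder file name).
person <- read.csv("fars_person_2019.csv")

# My reading of the FARS codebook: PER_TYP 6 = bicyclist, INJ_SEV 4 = fatal.
# Verify both against the FARS Analytical User's Manual before relying on this.
bike_deaths <- subset(person, PER_TYP == 6 & INJ_SEV == 4)

nrow(bike_deaths)
```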
GolfPairs
A project to determine pairings for a golf weekend. The objective was to schedule two foursomes so that each golfer played with every other golfer the same amount over the weekend. The core of the project was a simulation in R, available here.
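The simulation amounted to scoring random schedules for balance; a condensed sketch for eight golfers split into two foursomes each round, where balance is the variance of the pairwise play counts. The parameters (8 golfers, 7 rounds, 2000 tries) are illustrative, not the originals:

```r
set.seed(1)
golfers  <- 1:8
n_rounds <- 7

# How often each pair shares a foursome under a given schedule.
pair_counts <- function(schedule) {
  counts <- matrix(0, 8, 8)
  for (rnd in schedule) {
    for (grp in rnd) {
      for (p in combn(grp, 2, simplify = FALSE)) {
        counts[p[1], p[2]] <- counts[p[1], p[2]] + 1
      }
    }
  }
  counts[upper.tri(counts)]  # the 28 pair counts, zeros included
}

# Random schedule: shuffle golfers each round, split into two foursomes.
random_schedule <- function() {
  lapply(seq_len(n_rounds), function(i) {
    s <- sample(golfers)
    list(sort(s[1:4]), sort(s[5:8]))
  })
}

# Keep the most balanced schedule (lowest variance in pair counts) of many tries.
best <- NULL
best_var <- Inf
for (i in 1:2000) {
  sched <- random_schedule()
  v <- var(pair_counts(sched))
  if (v < best_var) { best <- sched; best_var <- v }
}
best_var
```

With 7 rounds there are 84 pair-slots across 28 pairs, so a perfectly balanced schedule (every pair together exactly 3 times) has variance zero; the simulation hunts toward that.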
Golf Road Trip
A tool to create road trips to visit top public golf courses in the US.
National Park Road Trip
Using the basic functionality I had built for other applications, this was originally going to be a tool to dynamically plan road trips between the national parks. I never implemented the full functionality as a website, but here is a static version of the best trip it generated.
NOMT
A website used to investigate the Twitter usage profile for a chosen user. NOMT was initially chosen to be an acronym for Not On My Time, as this was developed as a tool for a boss looking into excessive employee Twitter usage during work hours.
Lesson learned: Fun technically, but I have no interest in facilitating such snooping, so it was never built out much past its initial use. The throttled version is still available here (it uses only a very small subset of a user’s tweets, merely for demonstration purposes).
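The “usage profile” boils down to bucketing tweet timestamps. A sketch assuming the timestamps have already been fetched into a POSIXct vector (however that was done), flagging the share posted during a nominal work week:

```r
# Assume tweet timestamps were already fetched; these four are made up.
tweet_times <- as.POSIXct(c("2019-03-04 09:15:00", "2019-03-04 14:30:00",
                            "2019-03-05 22:05:00", "2019-03-06 11:45:00"),
                          tz = "America/Chicago")

hours <- as.integer(format(tweet_times, "%H"))
wdays <- format(tweet_times, "%u")  # 1 = Monday ... 7 = Sunday

# Share of tweets during a nominal 9-to-5, Monday-Friday work week.
work_hours <- hours >= 9 & hours < 17 & wdays %in% as.character(1:5)
mean(work_hours)

# Hour-of-day profile for the full history.
table(hours)
```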
Quotes on Data
A bot to post quotes about statistics and data in general to the Twitter account @quotesondata. You can see the Twitter account here.
RATS and Cats
A project for a local animal shelter, this was an investigation into whether TNR (trap, neuter, release) programs for feral cats affect local rodent populations. It was a spatial analysis comparing the locations of TNR cat colonies with public 311 data as a proxy for the rodent population.
Lesson learned: I hoped to help the shelter for advocacy purposes (and it seemed like an interesting question). In the end, the results weren’t very definitive, and even if they had been, I doubt they would have made for effective advocacy unless they were overwhelmingly positive. In hindsight, I could have been more helpful simply by providing more effective data collection to track the colonies.
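The comparison reduced to point-to-point distances; a sketch with sf, with placeholder file names: the distance from each 311 rodent complaint to the nearest TNR colony, which can then be compared before and after a colony’s placement:

```r
library(sf)

# Placeholder inputs: TNR colony sites and 311 rodent complaints as points.
colonies <- st_read("tnr_colonies.geojson")
rodents  <- st_read("rat_complaints.geojson")

# Work in a projected CRS (Illinois East, ftUS) so distances come out in feet.
colonies <- st_transform(colonies, 3435)
rodents  <- st_transform(rodents, 3435)

# Distance from each complaint to its nearest colony.
nearest <- st_nearest_feature(rodents, colonies)
rodents$dist_ft <- st_distance(rodents, colonies[nearest, ], by_element = TRUE)

# e.g., complaints within a quarter mile of some colony.
sum(as.numeric(rodents$dist_ft) < 1320)
```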
Space Filling Curves for the TSP
I became interested in the concept of space-filling curves as an algorithmic approach to the traveling salesman problem. An example of this applied to the Divvy data from the project above can be found here.
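The heuristic itself: snap each stop onto a grid, compute its position along a Hilbert curve, and visit the stops in that order. The index function below follows the standard iterative algorithm; the 64 x 64 grid resolution and the random toy coordinates are assumptions for illustration:

```r
# Rotate/reflect a quadrant; helper for the Hilbert index.
hilbert_rot <- function(n, x, y, rx, ry) {
  if (ry == 0) {
    if (rx == 1) { x <- n - 1 - x; y <- n - 1 - y }
    return(c(y, x))  # swap x and y
  }
  c(x, y)
}

# Distance along the Hilbert curve for cell (x, y) of an n x n grid (n a power of 2).
hilbert_d <- function(n, x, y) {
  d <- 0
  s <- n %/% 2
  while (s > 0) {
    rx <- as.integer(bitwAnd(x, s) > 0)
    ry <- as.integer(bitwAnd(y, s) > 0)
    d <- d + s * s * bitwXor(3L * rx, ry)
    xy <- hilbert_rot(n, x, y, rx, ry)
    x <- xy[1]; y <- xy[2]
    s <- s %/% 2
  }
  d
}

# Toy stops: scale lon/lat onto a 64 x 64 grid, then tour in Hilbert order.
set.seed(42)
stops <- data.frame(lng = runif(20, -87.70, -87.60), lat = runif(20, 41.85, 41.95))
n  <- 64
gx <- pmin(floor((stops$lng - min(stops$lng)) / diff(range(stops$lng)) * n), n - 1)
gy <- pmin(floor((stops$lat - min(stops$lat)) / diff(range(stops$lat)) * n), n - 1)
tour <- order(mapply(hilbert_d, n, gx, gy))
stops[tour, ]
```

The appeal is that nearby points get nearby curve positions, so a single sort yields a respectable tour in O(n log n).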
Stolen Bike Finder
The idea here was to provide a service for tracking down stolen bicycles listed on Craigslist. It would take a bicycle listing from a stolen bicycle registry and automatically search Craigslist (and possibly other sources) for possible matches, notifying the bike’s owner of potential leads. Early results seemed fairly effective. However, it used the 3Taps API to access the Craigslist listings; when that API was shut down due to a legal ruling, access to the Craigslist data became untenable, effectively making the overall project infeasible. Thus, I shelved it.
Lessons learned: Again, the dangers of relying on third-party data as the cornerstone of a project.
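The matching step was essentially fuzzy text search; a sketch using base R’s agrep over toy listing titles (the real pipeline matched on structured fields as well):

```r
# Toy registry entry and scraped listing titles.
stolen <- "trek 520 touring blue"
listings <- c("Blue Trek 520 touring bike - great shape",
              "Schwinn cruiser, red",
              "trek 520, blue, must sell today")

# Token-level match score: share of registry tokens found (fuzzily) in a listing.
tokens <- strsplit(stolen, " ")[[1]]
score <- sapply(tolower(listings), function(title) {
  mean(sapply(tokens, function(tok) length(agrep(tok, title, max.distance = 1)) > 0))
})

# Rank listings by how much of the registry description they contain.
listings[order(score, decreasing = TRUE)]
```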
Twit Compare
This was a somewhat simplistic attempt to determine whether two Twitter accounts were written by the same author. The approach used natural language processing to measure the similarity of the two bodies of Twitter posts and was inspired by Frederick Mosteller’s analysis of the Federalist Papers (note: I am not an NLP professional, so I’m sure there are more state-of-the-art approaches these days).
Lessons learned: I was able to implement a local website to do this, but in the end it probably isn’t a great approach to the problem. First, how many people really care to do this? (I had more of an intellectual interest than a practical one.) Second, the applications tend to be curiosities, e.g. trying to deduce whether someone is behind a parody account. But in that case, the “voice” of the parody account tends to differ from that of the author’s actual Twitter account, which rather negates the approach.
I toyed with expanding it a bit to allow feeds from systems other than Twitter, but the same obstacles and lack of applicability remained.
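The similarity measure was along these lines: bag-of-words frequencies for each account, then cosine similarity between the two vectors. The two “accounts” here are toy strings standing in for concatenated tweet histories:

```r
# Toy stand-ins for two accounts' concatenated tweet text.
a <- "data is messy data needs cleaning before modeling"
b <- "cleaning data before modeling the data is half the work"

# Word-frequency table for one body of text.
bag <- function(txt) table(strsplit(tolower(txt), "\\s+")[[1]])

# Align both frequency vectors on a shared vocabulary.
ta <- bag(a); tb <- bag(b)
vocab <- union(names(ta), names(tb))
va <- as.numeric(ta[vocab]); va[is.na(va)] <- 0
vb <- as.numeric(tb[vocab]); vb[is.na(vb)] <- 0

# Cosine similarity: 1 = identical word profiles, 0 = no shared words.
sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
```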
Texas BBQ Road tripping
A quick project using some existing functionality to create an efficient route visiting Texas Monthly’s top 50 BBQ destinations (that would be a pretty ambitious trip, but nevertheless…). I emailed this to the Texas Monthly BBQ team but never published it, as I felt the input data (the ranked BBQ locations) was proprietary to them, and since I never heard back, I wasn’t comfortable using it publicly.
Chicago’s Tallest Buildings by Ward
An investigation into the tallest building by ward in Chicago. This was a mashup of data using the Chicago open data portal. You can find the discussion here.
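The mashup was a point-in-polygon join; a sketch with sf and dplyr, where the file names and the ward and height_ft columns are placeholders for the portal’s actual layers:

```r
library(sf)
library(dplyr)

# Placeholder inputs: ward boundaries (polygons) and building footprints
# with a height column, both from the Chicago open data portal.
wards     <- st_read("wards.geojson")
buildings <- st_read("buildings.geojson")

# Tag each building with the ward it falls in, then keep the tallest per ward.
tallest <- st_join(buildings, wards) |>
  st_drop_geometry() |>
  group_by(ward) |>
  slice_max(height_ft, n = 1) |>
  ungroup()

tallest
```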
Wisconsin Road Trip Eating
A website I use for myself, which is a mashup of the locations of 1) supper clubs, 2) breweries, 3) bike trails, 4) best burgers, and 5) pizza farms in the state of Wisconsin. I use it to scout locations for short biking, eating, and drinking trips. Visit the site here.
Lessons learned: There are a lot of breweries popping up! As a result, this doesn’t get maintained so well.