A TSP Variant – An efficient route for visiting all counties in a state?

I was reading about a bicycle tourist’s goal of visiting every county in his home state. He had diligently chipped away at it for a number of years, visiting a few counties at a time.

The question occurred to me: how would you route that trip if you wanted to do it in one fell swoop (assuming you had the physical ability to do so)? It’s sort of a variant of the famous traveling salesman problem (TSP), but with a couple of differences.

If we decided we wanted to visit a consistent location in each county (say the county seat, or maybe the centroid of the county), then it would be exactly the traveling salesman problem. But I’m thinking of a slightly different formulation. Visiting a county in this formulation is merely setting foot in said county (equivalent to saying I’ve been to Wisconsin if I’ve been anywhere at all in Wisconsin, rather than having had to visit the capital, Madison, in order to count it as having visited the state).

So how does this affect our attempted solution to the problem?

1. The most significant difference, it would seem, is in the construction of the distance matrix. In a standard TSP, we usually use some agreed-upon distance matrix between locations. When we try to solve the problem of visiting all county seats in a state, for instance, we know all the points we might visit in advance, so we might use Google driving distance as our metric and calculate all the distances up front. And while multiple routes may be possible between any two county seats, we might agree to take Google’s shortest suggestion as our distance. In our new formulation, we need to decide what the distance from one county to another even means. How do we define it? Is it the smallest distance between any two points in the two counties? (One way to compute that is sketched just after this list.)

2. The point where we step into a county is almost certainly not the same point where we leave it to enter the next county. So we need to account for this, as well as any distance traveled within a county to get from its entrance point to its exit point.
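
If we take the smallest distance between any two points in the two counties as our metric, building the matrix from county boundary polygons is straightforward. Here’s a minimal sketch using fiona and shapely; the shapefile name is hypothetical (any county boundary file, such as a Census TIGER/Line extract, would do), and this gives straight-line distance rather than road distance, so it’s only a starting point.

# Sketch: build a "minimum distance between any two points" matrix for counties.
# Assumes a county-boundary shapefile (e.g., a Census TIGER/Line extract) readable
# by fiona; distances are straight-line, in the shapefile's units (degrees if
# unprojected), and adjacent counties that share a border come out as 0.
import fiona
from shapely.geometry import shape

counties = []
with fiona.open('wi_counties.shp') as src:          # hypothetical file name
    for feat in src:
        counties.append((feat['properties']['NAME'], shape(feat['geometry'])))

n = len(counties)
dist = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        # shapely's distance() is the minimum distance between the two geometries
        d = counties[i][1].distance(counties[j][1])
        dist[i][j] = dist[j][i] = d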

So I don’t know if this is a well known variant of the TSP, but it’s something I will look into over some upcoming posts.

Mapping the Mathematical Streets of Paris

I recently came across a reference that stated over 100 streets in Paris are named after mathematicians. Having never heard that before, I decided to investigate and see what that looked like.

The following site had tallied 91 streets named after mathematicians, and I thought I’d see what they might look like on a map of Paris itself:
http://www-groups.dcs.st-and.ac.uk/~history/Honours/ParisNames.html

The next step was to find some source of open data representing Paris streets. I was able to find exactly that at the following site (after some Google translation; I don’t speak French beyond some intermediate menu-ordering skills):

https://opendata.paris.fr/explore/dataset/voie/export/

I grabbed a shapefile from here and imported the street data into PostgreSQL using the shp2pgsql utility (see here for specifics on using the utility). Now I had a table containing GIS information for all Paris streets, from which I needed to filter the streets named after mathematicians.

I had previously imported the list of the 91 streets named after mathematicians into its own PostgreSQL table, so I naively thought a quick inner join of the street data to this table would get the GIS data for the 91 streets. But, of course, this is not the way it works when joining two disparate sources of data. It’s not the way it EVER works. Surprisingly, I was able to retrieve 80 of the 91 just with a join on name, but this still meant I had to reconsider relying on a join alone. As the data was small enough, I chose to add a column to the overall street table indicating whether a street was named after a mathematician. Then I could automatically flag the 80 streets matched above and, with a little manual intervention, handle most of the remaining 11. If there were thousands of such streets, this would require a more sophisticated attempt at pattern matching; as it is, a few SQL statements left 86 of the 91 streets readily identifiable. The general idea is sketched below.
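
For illustration, the flagging step looked roughly like this; the table and column names (paris_streets, math_streets, nom, name, is_mathematician) are placeholders rather than my actual schema.

# Sketch of the flagging step. Table and column names are placeholders.
import psycopg2

conn = psycopg2.connect(dbname='paris')           # connection details assumed
cur = conn.cursor()

# Add a boolean flag to the street table
cur.execute("ALTER TABLE paris_streets ADD COLUMN is_mathematician boolean DEFAULT false")

# Flag streets whose names match the mathematician list exactly (this caught 80 of the 91)
cur.execute("""
    UPDATE paris_streets s
       SET is_mathematician = true
      FROM math_streets m
     WHERE lower(s.nom) = lower(m.name)
""")
conn.commit()
# The stragglers that don't match exactly were flagged with individual UPDATE statements.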

The following streets from the original list of 91 will not show on the map: four because I couldn’t match them to an actual street, and one (Emmy Noether) because it appears to be outside the city limits. Still, we will be mapping 86 streets named after mathematicians.

Rue Arago
Emmy Noether – Out of city limits
Square du General Morin
Square Oronce Fine
Square Charle Hermite

Without further ado, here is the map of Paris streets, showing all streets in light blue and those named after mathematicians in dark red. I used QGIS with PostgreSQL as the back end. Can we make this look nicer? I bet we can!


Extract Chicago Crime Data from the Open Data Portal Using Python

As part of the investigation into crime around the 606 trail, we pulled the crime data from Chicago’s open data portal. The following is a straightforward sample Python script for pulling a daily extract. Note that the available crime data on the portal lags by 7 days, so this script will extract a day’s worth of data from 8 days prior to the run date.

import urllib
import json
from datetime import datetime,timedelta
now = datetime.now()

sep  = ','

# Base url for Chicago Open Data Portal crime API; we'll add date and location filters
baseurl="https://data.cityofchicago.org/resource/6zsd-86xi.json"
datebetw = "?$where=date between "

# Crime data availability lags by 7 days; we run on a daily basis and grab 8 days ago 
end_date = now - timedelta(days=8)
st_date = now - timedelta(days=9)

dateocc_1 = "'" + str(st_date.year) + '-' +  str(st_date.month) + '-' + str(st_date.day) + 'T00:00:00' + "'" 
dateocc_2 = "'" + str(end_date.year) + '-' +  str(end_date.month) + '-' + str(end_date.day) + 'T00:00:00' + "'" 


# the syntax for this filter is  'within_box(location_col, NW_lat, NW_long, SE_lat, SE_long)'
boxurl = 'within_box(location, 41.924393, -87.736558, 41.902645, -87.6538)'

# Create the overall URL to query the API with our date and location filters
ourl = baseurl + datebetw + dateocc_1  + ' and ' + dateocc_2  + ' AND ' + boxurl


jsonurl = urllib.urlopen(ourl.replace(' ', '%20'))  # encode the spaces in the SoQL filter so the request is well formed
text = json.loads(jsonurl.read())

# Create a file name for the daily extract with a timestamp
fname = 'extract_' + str(now.year) + str(now.strftime('%m')) + str(now.strftime('%d')) + '_T' + str(now.strftime('%H')) +  str(now.strftime('%M')) +  str(now.strftime('%S')) +'.txt'    

tfile = open(fname, 'w')

# Loop through the crimes and write a comma-separated line for each
for t in text:
	tfile.write(t['id'] + sep + t['block'] + sep + t['date'] + sep + t['description'] + sep + t['primary_type'] + sep + t['location_description']) 
	tfile.write("\n")

tfile.close()

See the projects and the lessons learned

I’ve decided to try and catalog some of the side projects I’ve done that I feel I can discuss publicly. Some of these were born of professional projects, while others were just things that piqued my curiosity. (Disclaimer: Unless explicitly granted permission, I never publish full results of professional projects. However, when possible I get permission to obliquely discuss them without identifying companies or final results.)

Feel free to contact me if you’d like more information on any of these!

Projects and lessons learned (in vaguely alphabetical order).

BasesRoaded
This is a baseball road trip planner, used to create individual road trips visiting a number of professional baseball stadiums. Available on the web since 2012, it can be found at  www.basesroaded.com. It is up to date for the current season and used by many.

Lessons learned:
As in other projects, the scraping of third-party data (to get the yearly schedule) remains painful as sites change their HTML and access from year to year. I gained a great appreciation for web design, particularly how hard it is to do well!

BayesMatch – LinkedIn Consolidation
A project to consolidate the LinkedIn contacts of a company’s sales staff. It used Bayesian analysis to do a spell-checker-like comparison to determine common companies in the employees’ contact lists.

Lesson learned: It wasn’t that effective for such a limited amount of data (this was a small company). At some level, the automation wasn’t much less effort than a database reporting system using some fancy regex matching and a rudimentary spell check a la Peter Norvig.

Career Year in Baseball

An investigation into the concept of the “career year” in major league baseball. This was a scraping of data for the statistics, followed by an analysis in R. It can be found here.

Concert Finder

A project to provide concert notifications for bands I was interested in. Essentially, I wanted to be notified whenever one of 25 or so bands announced a concert in my city. Initially this scraped each band’s website individually on a nightly basis. This became too labor-intensive as websites changed structure, and many were poorly formed HTML in the first place. I then started adding bands that were listed on www.bandsintown.com via their API. Eventually they shut that free access down, so I stopped doing it altogether.

Lesson learned: Relying on scraping individual sites scales poorly for a small project without other people to help maintain it. Relying on a 3rd-party API to access their data has its own perils, as seen elsewhere as well.

CPL Notifier

This was a scraping system to notify me of pending due dates for books I had borrowed from the local public library. This was before the library sent such notifications itself; once it began doing so, I terminated the project.

Lesson learned: It’s always easier to use a free public service than to try to recreate and maintain it!

Crime on the 606

An investigation into whether crime was increasing along the 606 trail in Chicago (also known as the Bloomingdale trail). This used crime data from the Chicago Open Data Portal to do a spatial analysis using GIS tools.  Some results can be found here.

Lessons learned: When you live in a big city, maybe you don’t want to be aware of where every local crime occurs.

The Traveling Divvyer

A whimsical entry in the 2015 Divvy Data Challenge, this site created an efficient route to visit all Divvy stations, starting from any chosen station. The website can be found here.

Lessons learned: This was a fun exercise and my first foray into deploying Leaflet to present dynamic maps on a website.

Eat Countries – Eating Globally, Locally

A project to investigate how many countries’ cuisines are represented by restaurants in the Chicagoland area. This involved scraping UN data for the list of countries and a lot of manual research and tabulation for the restaurants, followed by a GIS analysis in R. This can be found here.

Lesson learned: There’s still a lot of eating to be done.

FARS Bicycle Fatalities Involving Cars

An investigation into national data for bicycling fatalities involving cars. The incident data was available from the FARS data site. An introductory post of this can be found here.

GolfPairs

A project to determine golf pairings for a golf weekend. The objective was to provide pairings for two foursomes so that each golfer played with each of the other golfers an equal amount. The center of the project was a simulation in R, available here.

Golf Road Trip

A tool to create road trips to visit top public golf courses in the US.

National Park Road Trip

Using the basic functionality I had used in other applications, this was originally going to be a tool to dynamically plan road trips between the national parks. I never implemented the full functionality via website, but here is a static version of the best trip it generated.

NOMT

A website used to investigate the Twitter usage profile for a chosen user. NOMT was initially chosen to be an acronym for Not On My Time, as this was developed as a tool for a boss looking into excessive employee Twitter usage during work hours.

Lesson learned: Fun technically, but I have no interest in facilitating such snooping, so it was never built out much past its initial usage. The throttled version is still available here (it only uses a very small subset of a user’s tweets, merely for demonstration purposes).

Quotes on Data

A bot to post quotes about statistics and data in general to the Twitter account @quotesondata. You can see the Twitter account here.

RATS and Cats

A project for a local animal shelter, this was an investigation into whether TNR (trap, neuter, release) programs for feral cats affect local rodent populations. This was a spatial analysis comparing locations of TNR cat colonies with public 311 data as a proxy for the rodent population.

Lesson learned: I hoped to help the shelter for advocacy purposes (as well as it seeming like an interesting question). In the end, the results weren’t very definitive, and even if they had been, I don’t know that they would have made for effective advocacy unless they were overwhelmingly positive. I think I could have been more helpful by merely providing more effective data collection to track the colonies.

Space Filling Curves for the TSP

I became interested in the concept of space filling curves as an algorithmic approach to the traveling salesman problem. An example of this applied to the Divvy data of this project can be found here.

Stolen Bike Finder

The idea here was to provide a service for tracking down stolen bicycles listed on Craigslist. It would take a bicycle listing from a stolen bicycle registry and automatically search Craigslist (and possibly other sources) for possible matches, in order to notify the bike’s owner of possible leads on the stolen bike being listed. Early results seemed fairly effective. However, it used the 3Taps API to access the Craigslist listings. This API was shut down due to a legal ruling, and access to the Craigslist data became untenable, effectively making the overall project infeasible. Thus, I shelved it.

Lessons learned: Again, the dangers of relying on 3rd party data as a cornerstone for a project.

Twit Compare 

This was a somewhat simplistic attempt to determine whether two Twitter accounts were written by the same author. The approach used natural language processing to determine the similarity of the two bodies of Twitter posts, and was inspired by the analysis Frederick Mosteller had done on the Federalist Papers (note: I am not an NLP professional, so I’m sure there are more state-of-the-art approaches these days).

Lessons learned: I was able to implement a local website to do this, but in the end it probably isn’t a great approach to the problem. First, how many people really care to do this? (I had more of an intellectual interest than a practical one.) Second, the applications tend to be curiosities, e.g. trying to deduce whether someone is behind a parody account. But in that case, the “voice” of the parody account tends to be different from that used in the author’s actual Twitter account, which sort of negates the approach.

I toyed with expanding a bit to allow feeds from systems other than Twitter, but the same obstacles and lack of applicability remain.

Texas BBQ Road tripping

A quick project using some existing functionality to create an efficient routing to visit Texas Monthly’s top 50 BBQ destinations (that would be a pretty ambitious trip, but nevertheless…). I emailed this to the Texas Monthly BBQ team but never published it: I felt the input data (the ranked BBQ locations) was proprietary to them, and since I never heard back I was not comfortable using it publicly.

Chicago’s Tallest Buildings by Ward

An investigation into the tallest building by ward in Chicago. This was a mashup of data using the Chicago open data portal. You can find the discussion here.

Wisconsin Road Trip Eating

A website I use for myself, which is a mashup of the locations of 1) supper clubs, 2) breweries, 3) bike trails, 4) best burgers and 5) pizza farms in the state of Wisconsin. I use it to scout locations for short trips for biking, eating and drinking. Visit the site here.

Lessons learned: There are a lot of breweries popping up! This does not get maintained so well as a result.

A GIS Buffer for the 606 Trail

In a previous post, I examined creating a bounding box for the 606 trail to enable analysis of surrounding crime. While this is straightforward and a convenient heuristic to explain what we were doing, notice that it’s not fully what we wanted as an area from which to analyze crimes. Specifically, the bounding box we produced wouldn’t be a very good border for investigating crime near the ends of the trail. A better way to define the area would be to use the GIS concept of a buffer, where in this case we’d have an area that looks like a semicircle on each end of the rectangle of the bounding box. This is relatively easy to accomplish within GIS software such as QGIS.

The downside of this approach is that, while we have a more appropriate take on the area we want to consider, it’s not quite as straightforward to retrieve the crimes of interest. With the bounding box approach, we could just use standard SQL and the corners of the box as filters to return the crimes of interest. With those curved ends, it’s not quite that easy, but all is not lost if we use GIS features such as those included in the PostGIS extension of PostgreSQL.
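
As a rough sketch of the PostGIS approach (table and column names here are placeholders, not the actual schema): if the crime points and the trail line are both stored in lat/long, ST_DWithin on the geography type does the “within X meters of the trail” test directly, rounded ends included. A city block is roughly 100 meters.

# Sketch: select crimes within ~100 m (about a city block) of the 606 trail line.
# Assumes a crimes table with a point geometry column and a trail_606 table holding
# the trail linestring, both in lat/long (SRID 4326); names are placeholders.
import psycopg2

conn = psycopg2.connect(dbname='chicago')        # connection details assumed
cur = conn.cursor()
cur.execute("""
    SELECT c.id, c.date, c.primary_type, c.description
      FROM crimes c, trail_606 t
     WHERE ST_DWithin(c.geom::geography, t.geom::geography, 100)
""")
rows = cur.fetchall()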

In the image below, the buffer is the entire yellow area, whereas the original bounding box is shown with the dashed line. You can see how the buffer encompasses more area at the ends, which makes sense given the trail’s endpoints. This buffer is about a city block around the trail.

 

In the following, we can expand the buffer to encompass the area about two blocks (about a quarter mile) around the trail. The original bounding box is still dashed, the previous one-block buffer is in green, and the new two-block buffer is in light blue.

Naturally occurring GIS – An Oil Stain of Michigan in Algoma, WI

Inspired by this New Jersey-shaped sidewalk crack, I found the following oil stain on a street in Algoma, WI. It’s not a perfect match (the ”thumb” portion is a bit off), but it’s awfully close to the shape of the lower peninsula of Michigan.

 

(Above image from http://www.thepumpguide.com/michiganLP/ )

(above map created in QGIS from the Michigan open data portal county shapefile)

Collecting U.S. Bicycle Fatality Data from the FARS database

We were looking to investigate bicycle fatalities involving cars, first at a more recent and local level and then eventually at a national and historical level. This post will detail our data collection at the national and historical level.

The National Highway Traffic Safety Administration (NHTSA) maintains an online resource for collecting information on “public yearly data regarding fatal injuries suffered in motor vehicle traffic crashes”. This is known as the Fatality Analysis Reporting System (FARS) and contains extensive information available to the public. A subset of this information of interest for this project is the crash-level detail for all crashes resulting in the death of a bicyclist. The site also has a very nice query system, but we were hoping to mash up the data with some external data, so we wanted to store the data ourselves for those purposes. Visit the FARS website here.

Data Formats
As of March 2017, the FTP data from the NHTSA is available from 1975 through 2015. We wanted to store the bicycle crashes in a SQL database, so we were hoping to see the data available in a CSV or SQL format. Unfortunately, this was only the case for the year 2015. Prior to that, data is available in either DBF or SAS format. We decided to use the DBF format and convert it to CSV files, which we would then load into the SQL database.
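
The DBF-to-CSV step is simple in Python. Here’s a minimal sketch using the dbfread package (one of several DBF readers; the package choice and file names are illustrative, not necessarily what we used):

# Sketch: convert a FARS DBF file (e.g., one year's Person table) to CSV.
# Uses the dbfread package; file names are illustrative.
import csv
from dbfread import DBF

table = DBF('PERSON.dbf')                     # one year's Person table
with open('person.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(table.field_names)        # header row from the DBF schema
    for record in table:
        writer.writerow(list(record.values()))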

Data Necessary for Reporting Bike Fatalities from FARS
The information contained in the FARS database is really quite extensive. We were looking for a subset of this, namely crashes involving bicycle fatalities, so we would not need the entirety of the data. The database is very well documented; indeed, there’s a 500+ page manual on the data elements available. After poring through (OK, skimming) this information, we determined we would need the following tables from the FARS system to start.

NOTE: The following generally holds for the range of years analyzed; starting in 2015, the reporting for identifying bicyclists is slightly different. Additionally, not all years share the exact same format, so there is a bit of work in importing and storing the data.

Accident table
This is the main table in the FARS system. It contains every crash that has occurred, indexed by a state code and a case number within the state. Additionally, it contains crash-level data such as timing, location, weather and the number of people involved.

Person table
The Person table contains one entry for each person involved in a crash, and is tied to the crash via the state and state case number. It is here that we can determine whether a person involved in a crash was a bicyclist and also whether the crash resulted in a fatality for the bicyclist.
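
Once both tables are loaded, pulling the bicyclist fatalities is essentially one join on state and case number. Below is a sketch of the query; the table names are mine, and the person-type and injury-severity codes (PER_TYP = 6 for a bicyclist, INJ_SEV = 4 for a fatal injury) are my reading of the pre-2015 coding and should be verified against the FARS manual for each year.

# Sketch: crashes involving a bicyclist fatality, joining Accident and Person
# on state + case number. Table names and code values are assumptions to verify
# against the FARS coding manual (they shift slightly across years, e.g. 2015+).
import psycopg2

conn = psycopg2.connect(dbname='fars')           # connection details assumed
cur = conn.cursor()
cur.execute("""
    SELECT a.state, a.st_case
      FROM accident a
      JOIN person p
        ON p.state = a.state AND p.st_case = a.st_case
     WHERE p.per_typ = 6      -- bicyclist (verify per year)
       AND p.inj_sev = 4      -- fatal injury (verify per year)
""")
bike_fatalities = cur.fetchall()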

Additional Master Data
In addition to the individual crash data noted above, we would need data to translate coded values such as state, city and county into names. This data itself was not available from the FARS website, but the codes are standardized and so were available elsewhere.

A quick snapshot

Using the data, here’s a snapshot of all auto crashes resulting in bike deaths between the years 2011 and 2014. Besides being taken aback by how many there are, there’s not much to be gleaned here since, as is often the case, the most incidents occur in the most populated areas. But having the data in SQL form, we are ready to start more in-depth analyses. Feel free to contact me if you have an interest in this data. At some point we may stand the SQL database up on a website, but for now we are keeping it local.

Bike accidents 2011 thru 2014

 

Crimes occurring ON the 606 trail

In a previous post, I gave an overview of a project to investigate the occurrence of crimes near the 606 (Bloomingdale) trail in Chicago. Data for the investigation came from the City of Chicago Open Data Portal, and digging into the data made me wonder whether I could list a subset of it, namely the crimes that actually occurred ON the trail (versus just in the vicinity).

Identifying the crimes occurring on the trail itself came with its own set of challenges; I’m not sure it can be done perfectly, due to a number of factors. That said, I think we can probably do a reasonable job.

 

The Challenges:

The locations of crimes on the data portal have been adjusted for privacy. That is, the latitude and/or longitude are not exact, so that individual homes cannot be explicitly identified from the data. I’m not sure of the algorithm used to do this (hopefully that’s the point of doing it), but it means that even if a crime occurred exactly on the trail, the reported latitude and longitude will not necessarily indicate so.

Another challenge we may face is that it may not be clear when a crime is “on the trail”.
Consider this citizen’s report here. This sure sounds like a crime I would say occurred on the trail, but it actually happened at the foot of an access ramp, so maybe it is not, and it may not be reported as such.

 

An Approximate Solution

There is a column on each crime row which indicates the location type of the crime. Crimes occurring on the trail are reported with a value of ‘PARK PROPERTY’ in this column. Thus we can use a combination of latitude/longitude (proximity to the trail) and this indicator to narrow down our possibilities. But there are still issues, as there are a number of parks that abut the trail, and crimes in those parks may well be caught in this sieve.
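
Concretely, the filter is just the proximity test from the buffer post plus the location-type check. A sketch, again with placeholder table and column names, and with a guessed tolerance to absorb the privacy-adjusted coordinates:

# Sketch: crimes likely to be ON the trail itself - close to the trail line
# AND recorded as PARK PROPERTY. Table/column names are placeholders, and the
# 30 m tolerance is a guess to allow for the privacy-adjusted coordinates.
import psycopg2

conn = psycopg2.connect(dbname='chicago')        # connection details assumed
cur = conn.cursor()
cur.execute("""
    SELECT c.id, c.date, c.primary_type, c.description
      FROM crimes c, trail_606 t
     WHERE c.location_description = 'PARK PROPERTY'
       AND ST_DWithin(c.geom::geography, t.geom::geography, 30)
""")
on_trail = cur.fetchall()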

 

The Results
Let’s use the methods above and see what the results are. These are the crimes identified as occurring on the trail, from its opening on 6/6/15 through 1/31/17.

Crime ON 606 trail thru 013117

A GEOJSON and KML file for the 606 (Bloomingdale) trail

As part of my explication of the process for investigating crime “around” the 606, I had created a bounding box for the trail and wanted to visually display the trail within the box. I couldn’t find a handy GEOJSON or KML file for the trail, so I created my own and will make it available here.

The methodology was relatively crude, and the resultant file is likely far from perfect. I rode the trail on my bike using the GPS tracking app MyTracks, and then converted the resulting KML file into GEOJSON using the ogr2ogr utility within QGIS.

Here are the files:

KML for the 606

GEOJSON file for the 606


A bounding box for the 606 trail

In a previous post, we had discussed investigating crime data around the 606 (Bloomingdale) trail in Chicago. One of the main issues I faced was how to identify crimes that occurred “around” the 606. Each detailed crime record contains information including:

  • Ward number
  • An approximate address
  • Approximate Latitude
  • Approximate Longitude

Using the ward number wasn’t really a possibility for a number of reasons:

  1. Many records didn’t have a ward on the record (can’t I just stop here?)
  2. Chicago wards are irregularly shaped; thus there is no particular ward which is always near the trail
  3. Wards are often redistricted, whereby the boundaries of a ward are changed. It wasn’t clear to me whether the ward listed on a record was a snapshot in time or a reflection of the current ward definitions (I would guess the former)

I also considered using the approximate address. This would require maintaining a list of all streets adjacent to the trail and selecting based on a range of street addresses on each street. That’s probably doable, but prone to error and omission (mostly mine, not the crime data’s).

In the end I decided to define a bounding box that encompassed the entire trail. Then I could use the latitude and longitude from a record to determine whether it fell within the bounding box and thus should be included in the analysis. Additionally, I could fairly easily alter the size of the bounding box to include a smaller or larger area to be analyzed.
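
This is the same test the daily extract script above pushes to the API via within_box; done locally, it is just two comparisons per record. A minimal sketch using the corner coordinates from that script:

# Sketch: keep only crimes whose (privacy-adjusted) coordinates fall inside the
# bounding box. Corners are the NW/SE points used in the extract script above.
NW_LAT, NW_LON = 41.924393, -87.736558
SE_LAT, SE_LON = 41.902645, -87.6538

def in_bbox(lat, lon):
    """True if the point lies within the 606 bounding box."""
    return SE_LAT <= lat <= NW_LAT and NW_LON <= lon <= SE_LON

# Example: a point near the middle of the trail
print(in_bbox(41.913, -87.70))    # True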

 

Defining the bounding box

The 606 is referred to as a linear park. In reality, of course, it isn’t perfectly straight, and unlike a line it actually has some width to it (I’d guess around 50 feet in most areas). But I think for our purposes it is close enough to a line that we can define a rectangle around it that encompasses the entire trail, with the line fairly centered in the box. Also fortunately, the line is oriented in a fairly E-W direction, which makes our task a bit easier.

The question now becomes how big that rectangle should be. My first pass was to make it a city block in each direction, north and south of the line itself. Subsequent analyses may widen or narrow this rectangle as we investigate further.

Though we won’t actually use this visual, this is what the bounding box within a city block of the trail would look like. We would use the boundaries of this box and select anything that fell within it for an analysis. Recall this is just a first pass; one could argue that the box should be wider or extended at its ends.

 

bbox_1blk

Here is another view of the bounding box, with the trail placed inside in green.

Bounding Box and 606 trail from QGIS - Cropped
