Looking to take a road trip of major league baseball stadiums in 2020 ? The Bases Roaded website to plan your baseball road trip has been updated to reflect the 2020 schedule and is now available.

Happy planning and happy travels !

Third Landing
Investigations in Data

Looking to take a road trip of major league baseball stadiums in 2020 ? The Bases Roaded website to plan your baseball road trip has been updated to reflect the 2020 schedule and is now available.

Happy planning and happy travels !

I have done a fair amount of road trip related projects, including routing baseball trips as well as golf trips . The golf trip tool linked here is a sort of ambitious traveling tool, picking top golf courses and building an efficient road trip for the selected courses.

But what if we looked at this a little differently, maybe not quite so ambitiously ? Suppose we wanted to know all the golf courses within a given driving distance of your home town ? That is, you might be asking where can I go within 3 hours of home ? Well, you might say, I can drive about 200 miles in 3 hours, so let me find all the courses within 200 miles. That is a sort of standard GIS problem (finding all points of interest within a given distance). But I might suggest that if you are planning on drawing a circle with a radius of 200 miles around your home, then you don’t live in Chicago, with its traffic and that big lake sitting there.

But all is not lost; we are looking at a slightly different problem which is fairly well known. Rather than drawing that circle, we will draw a polygon called an isochrone (iso = same, chrone = time) where anything falling within the polygon is within a given driving distance of the center point.

For now, we will show an example of an isochrone , and then later walk through the process of a creating some isochrones that will show golf courses within a given driving time of a point. Here is an isochrone which represents all points within a one hour drive of downtown (note the lack of driving on the actual lake). This was created using QGIS.

I have done a number of applications that have a flavor of the Traveling Salesman Problem (TSP). That is, they concern themselves with visiting a number of locations in an efficient manner. None of these applications have ever concerned themselves with finding the optimal manner of visiting all the locations; a “good enough” solution has sufficed to date, and I’m not sure I have the mental horsepower to understand all that is involved in producing optimal or even near-optimal solutions.

That said, the problem that has always particularly intrigued me is that of the baseball road trip. This problem has an additional twist to it. On the one hand, it is only a 30 stop tour, which when using programs like Concorde is a piece of cake. But the additional twist is that there is not a game in every city every day. So it is as if the traveling salesman can only visit certain of his or her clients on certain days (Jones in Boise will only see you on Monday or Tuesday).

This twist really throws a wrench into any formulation of the TSP that I have ever come across (granted, I am not a professional researcher in the field). Essentially ,everyday, the cities which can be visited are just a subset of the 30 cities with baseball stadiums. And the next day, it is also a subset , but now possibly a different subset.

I’ve never been able to formulate this into any sort of structure, and as such have used greedy-ish algorithms of a sort. They simplify the solution, but that solution is likely far from optimal. So it is something I think about, though not all the time, and not particularly successfully !

**UPDATE:** It turns out the formulation of this variation of the TSP is called Traveling Salesman Problem with Time Windows (TSPTW)

The Traveling Salesperson Problem (TSP) is a well known problem with many applications.

One such application could be visiting all the baseball stadiums in Major League Baseball.

These type of trips (or subsets of it) are the passion of many people, and there are even websites devoted to planning the routes (disclaimer: I am the author of that fine site).

I was looking for an application to illustrate the functionality of the pgRouting module in PostgreSQL, and due to my familiarity with the baseball road trip, this seemed like a good fit. Well, sort of…..note that one twist in visiting a baseball stadium on a roadtrip is that there is not always a game happening and ostensibly you would want to visit a stadium only when there was one. However , for this illustration, we will just try and calculate the best route for visiting all the parks exactly once regardless of whether there is a game happening. (again , go to BasesRoaded.com to do it only when games are being played !)

So without further ado, let’s tackle this with pgRouting.

**PRELIMINARIES:**

Before we are going to calculate anything based on the distances between stadiums, we would normally need get the distances between all the stadiums. What we are calculating here is what is known as a distance matrix. For each of the 30 MLB teams, it will contain the distance from its stadium to each of the other 29 stadiums. However, in this implementation, we will have pgRouting calculate the distances so we will skip this and only get the latitudes and longitudes of the stadiums.

Since I already have these steps completed due to my site BasesRoaded.com, I will use them here. But in general, the steps to get them was something like:

1. **Stadiums:** Prepare list of all 30 stadiums.

2. **Geocode:** Geocode this list to get the associated latitude and longitude for each stadium.

3. **Directions:** Use a direction API like Google or Mapquest to calculate the distances between all the stadiums. Note that we will not be doing this step here, as we will let pgRouting calculate the Euclidean distances using the lat/longs of the stadiums.

```
CREATE TABLE stadium
(
id smallint NOT NULL,
name character varying(50) NOT NULL,
"long" numeric(9,6) NOT NULL,
lat numeric(9,6) NOT NULL,
)
```

**pgRouting:**

I had some issues that I assume were my failing, but in the end I was able to use pgRouting to produce a tour of all the MLB stadiums. I was running PgRouting 2.2.2 within PostgreSQL 9.5. In particular, working within pgAdmin I was not able to successfully nest the _pgr_makeDistanceMatrix function within the pgr_tsp call using a nested SQL call. I think this may have been due to escaping (or not escaping) single quotes, but I wasn’t able to

figure that out. So rather than producing a distance matrix via SQL and essentially piping it into pgr_tsp, I produced the distance matrix, inserted into a table and then cut and paste the value into a pgr_tsp call.

**Create and store the distance matrix**

The following SQL will insert exactly one row into dm_tbl. The first column of the insert will be an NxN distance matrix.

```
insert into dm_tbl
Select dmatrix,ids from (
SELECT dmatrix,ids from _pgr_makeDistanceMatrix('SELECT id::int4, long::float8 x,lat::float8 y FROM stadium order by id')
) dm1
```

**Run pgr_tsp to produce the tour**

The pgRouting documentation had the example below of running pgr_tsp with a hard coded distance matrix. I just cut and pasted my distance matrix from the table above and

overlaid the simpler version below. I had a hard time doing this from within pgAdmin, likely due to the size of pasting a large distance matrix into the SQL window. So I created a version of the following with the full distance matrix in a file and executed from the psql command line.

```
SELECT seq, id FROM pgr_tsp('{{0,1,2,3},{1,0,4,5},{2,4,0,6},{3,5,6,0}}'::float8[],1);
SELECT seq, id FROM pgr_tsp('INSERT FULL DISTANCE MATRIX HERE'::float8[],1);
```

Here are the results from the pgr_tsp call. After this, we will map the route using QGIS.

**RESULTS:**

seq | name

—–+——————————

0 | Wrigley Field

1 | Guaranteed Rate Field

2 | Comerica Park

3 | Progressive Field

4 | Rogers Centre

5 | Fenway Park

6 | Citi Field

7 | Yankee Stadium

8 | Citizens Bank Park

9 | Oriole Park at Camden Yards

10 | Nationals Park

11 | PNC Park

12 | Great American Ball Park

13 | SunTrust Park

14 | Marlins Park

15 | Tropicana Field

16 | Minute Maid Park

17 | Globe Life Park in Arlington

18 | Chase Field

19 | PETCO Park

20 | Angels Stadium of Anaheim

21 | Dodger Stadium

22 | OAC Coliseum

23 | ATT Park

24 | Safeco Field

25 | Coors Field

26 | Kauffman Stadium

27 | Busch Stadium

28 | Target Field

29 | Miller Park

Here is a map of the route generated in QGIS. The routes shown are actual driving routes. Note that this is not a closed loop as it doesn’t return to the starting point of Chicago from the final point of Milwaukee (though it would be easy enough to do so).

In another post, I made reference to producing the latitudes and longitudes of all the stadiums in Major League Baseball. I thought I would share that data here.

In other posts, I had need to determine driving distances between major league baseball ballparks. The link below contains a CSV file containing a distance matrix for such distances.

Below is a snippet of the file. There is one line for for each pairing of a to/from combination of stadiums. The file represents a symmetrical matrix, so for instance there is a line for Wrigley Field —> Miller Park as well as a line for Miller Park —> Wrigley Field. These two lines will have the same driving distance. There are no lines in the file representing the distance from a stadium to itself (it will always be 0), so as there are 30 teams overall there are 870 lines in the file.

Driving distances were computed using the Mapquest driving distance API. Distances are in miles.

‘stad1′,’stad2′,’name’,’name’,’dist1′

1,2,’Angels Stadium of Anaheim’,’Chase Field’,357

1,3,’Angels Stadium of Anaheim’,’SunTrust Park’,2173

1,4,’Angels Stadium of Anaheim’,’Oriole Park at Camden Yards’,2642

1,5,’Angels Stadium of Anaheim’,’Fenway Park’,2987

Here is the entire file:

In previous analyses of the concept of the career year in baseball, I had occasion to calculate some statistics by year of a player’s career. That is, I wanted to analyze questions like ‘How did a player do in their 4th professional season compared to their 3rd professional season and how did that compare to other player’s development from their 3rd to their 4th season ?”

I had data stored in a MySQL database and had player statistics by calendar year. So, for instance, I had Billy Williams’ home runs, RBIs, WAR, etc for 1968, 1969 and so on, but I had no ready indicator to say which year was his 4th season. If I had the list of his seasons laid out in front of me I could work that out, but I wanted to do comparisons from the database across players, so I needed some way to compute this query. Essentially, I was looking to rank seasons by calendar year.

In Oracle and PostgreSQL, I could use a windowing function and RANK function to accomplish this. However, I am not aware of this functionality being available in MySQL, so I had to replicate using standard MySQL functionality.

Let’s say I have a table of players containing their yearly home run totals by calendar year (e.g. 2012, 2013, etc.) but I want to calculate a value to indicate home run totals by year of career (e.g rookie year, 2nd year of career, etc). So my sample data looks like the following. but I want to find the average number a player hit in the second year of his career. That’s not comparing the same year, but rather asking us to average Bob’s number from 2002, Phil’s number from 1994 and Ron’s number from 2014. Again we’re really ranking each player by hr_yr and selecting the row from each player with rank 2. (ignoring for now if we had a degenerate case of a 1 year career)

We can use the following SQL to accomplish this, which will add a column yr_of_career to the result which would could then be queried for such things as retrieving only those rows where yr_of_career = 2 in our example above. Essentially, yr_of_career is a running counter within each row of each player.

```
SELECT @row_number:=CASE WHEN @pname=pname THEN @row_number+1 ELSE 1 END AS yr_of_career,
@pname:=pname AS names, hr_yr, hr_num
FROM (select pname,hr_yr,hr_num from test_hrs order by hr_yr) plyr,
(SELECT @row_number:=0,@pname:='') as cnt
```

Here are the results of the query:

A while back, I had undertaken a little project to determine how many country’s restaurants were represented in the Chicagoland area. This was one of the earliest projects I used that leveraged location and GIS information, and I thought I would revisit it with a more updated knowledge of GIS tools. The first pass at this was using some of the spatial capabilities in R , and it produced what I had hoped to at the time. Then, I was really looking for a sort of ‘at a glance’ view, without providing too much detail.

Now I’d like to use a new approach to eventually produce an interactive web map. The approach here will be to showcase some different components:

- Use PostgreSQL to store and process data on countries and the restaurants representing them in Chicago.
- Use QGIS to produce the map locally.
- Use the QGIS2Web plug in to produce javascript to be incorporated into a website.
- Incorporate the code to present an interactive map on a website.

I will add further details of some of the technical details involved in this later. For now, let’s see the results.

Here is the newly updated map, hosted on a Django website. This map is interactive; clicking on a green colored country will indicate the restaurant in Chicago associated with that country. White colored countries do not have restaurants represented in Chicago.

Also, keep in mind that this is mainly an updating of the technology and not necessarily the content. I notice several of the restaurants have closed since the initial listing. I may update at some point, but if anyone wants to point out any changes, please let me know here.

You can also see the map below directly here.

In a previous post, we discussed the problem of determining an efficient route to visit every county. Part of that post explicated the difficulty of even defining the problem. In this post, let’s take a look at using space filling curves (SFC) to address the problem. In other posts, we had discussed using SFC to address the traveling salesman problem (TSP). Let’s look at applying these to our current problem; in particular, they seem to let us skirt some of the definitional problems we encountered in our previous post.

Specifically, using SFC obviates the need to define where we enter and exit a county and where we head net when we are in a county. As in other SFC problems we discussed, we are just going to overlay a SFC over the state and then trace the curve as our route. As we trace, we’ll note when we are entering a previously unvisited county.

A new issue introduced with this method is we will most likely visit a given county more than once. Whether this matters at this point is unclear. For now, let’s keep our eyes on the prize: creating an efficient to visit each county in a state. We’ll start with one of my favorite biking states – Wisconsin. Our basic steps will be to 1) determine a bounding box for the state, 2) create a space filling curve within that box and 3) traverse the space filling curve from its start to its end, accounting for each county as it is first visited.

**Determining a boundary box for Wisconsin**

We could do this in any most GIS software (e.g. ArcGIS or QGIS), but for our purposes we will just use good old Google maps and sort of eyeball a bounding box by determining the extreme most north, south, east and west points (in lat and long coordinates). These are , respectively, 46.972535, 42.512362, -86.834162 and -92.898615. Note these may not be exact, but we don’t really need exacting precision here.

**Create a space filling curve within the bounding box**

We did something similar like this previously when we introduced this concept. Here’s an example of a fairly coarse curve overlaid on the county map of Wisconsin. We’ll likely use a finer curve, but this is better for illustrative purposes, as a finer curve doesn’t look as clear. We’ll traverse this curve from the southwest corner, ending up in the southeast corner.

**Traversing the curve – Results**

We can do this programatically as before and for now we’ll just show the results visually. The route highlighted below was derived from traversing a space filling curve that is a little finer than that shown above.

How does it look ? Well, it seems like a reasonable enough route, except for the NE corner portion. That portion , with Door County highlighted in green, actually requires a drive over water, which seems to be a weakness of this approach. Just using the naked eye, it would seem to be a reasonably efficient drive to come just around that bay, but the SFC can’t find it with the straight forward method we are using. So it seems this method will suffer a bit with odd shapes like that. Still, unusual shapes like that (or any sort of “isolated” destination) will likely cause problems for most solutions. Consider a baseball road trip to all the MLB stadiums; at some point, you’ll always have a drive out to Seattle and back that seems out of the way. But the SFC approach here is actually returning a route that is NOT driveable and it’s not clear given the nature of an SFC (essentially one continuous curve in an area) how once would work around this.

In any case, we should compare this method versus others for a more neatly shaped area that doesn’t have this sort of unusual shape.

I’ve spent the majority of my life in the Chicagoland area. So it came as somewhat of a surprise to me when I recently ran across the name of a city that was less than 30 miles away that I had never heard of.

It got me thinking if there were cities even closer to me that I never heard of. (Yes, I’m going to keep ending this sentence with a preposition throughout). How would one go about finding this answer definitively? Well, “definitively” in the sense of ranking the closest cities to a given location. I hold no hope I can determine whether someone has heard of a city before !

This is the exact sort of proximity analysis for which spatial databases are so useful. So how would we go about approaching this problem ? To start, let’s assume that we know the location from which we are trying to determine the closest cities. (At some point, I may put a little site together to allow entry of coordinates, but for now let’s just say we know the latitude and longitude of the location in which we are interested)

**Step 1 – Determine our methodology**

If we are going to determine “closest” cities, we first have to determine what we mean by closest. Two metrics for the distance between our point and a given city come to mind:

a. The distance between our location and the closest point in a city

b. The distance from our location to the center (centroid) of a city

For this analysis, let’s say we are going to use (b), the distance to the centroid of the city. So we will want to

1. determine all candidate cities nearby,

2. compute the centroids of all the candidate cities above,

3. compute the distance from our location to all of the centroids

4. rank the distances from closest to furthest centroid

5. determine the closest city which we’ve never heard of (alas, this must remain a manual step)

**Step 2 – Determine our candidate cities**

This represents sort of a rough first pass at the problem. We want to say “what are all the cities” around here, and then proceed to compute their distances. An easy first cut might be all cities in the same county I am in. But what if I’m near a county border ? Wouldn’t I want to investigate the adjoining counties too ? So maybe I can pick my county and all the counties touching my county. What if that’s not big enough to yield a city I’ve never heard of ? Maybe I grab all the cities in my state. But what if I live near a state border ? It makes sense the closest city may be in that neighboring state. You can see the sort of issues we face in the first cut here. To start, though, let’s do this for the state of our location.

**Where do I get a listing (and some sort of GIS coordinates) for all the cities in my state?**

The US government has shapefiles available that contain cities by state. I grabbed the Illinois file from here

https://catalog.data.gov/dataset/tiger-line-shapefile-2013-state-illinois-current-places

I want to do this analysis from PostgreSQL/PostGIS, so let’s import that shapefile into a PostgreSQL table. A helpful cheat sheet for such a process can be found here. We will use the shp2pgsql command to do this step.

> shp2pgsql “c:\folder\test.shp” public.tshp> test.sql

> psql -h localhost -d closest_city -U username -f test.sql

After successfully running this, we should have the data representing all the cities in Illinois in table tshp.

**Step 3 – Compute centroids**

In a simplified version of the problem, in Step 2 sometimes we might be able to retrieve the centroids of the candidate cities. If that’s not the case, we will have some polygon representation of a city and wish to compute the centroid of the polygon. We can normally use the features of a spatial database to compute this and this is true for PostGIS. Specifically, we can use the ST_Centroid function.

**Step 4 – Compute the distance to the centroids**

Again, this is where a spatial database really shines. To compute the distance between our location and the centroids, it is a fairly straightforward call to a function, in this case ST_Distance.

**Step 5 – Rank the distances**

We can just use the old trusted SQL keyword ORDER BY to achieve this.

**Step 6 – Consolidate the above functions into one SQL statement**

select name, to_char(ST_Distance(ST_SetSRID(ST_Centroid(geom),4326)::geography,ST_SetSRID(ST_MakePoint(-87.684,41.836),4326)::geography)/1609,'9999.00') as dist_mi from tshp order by dist_mi

Let’s break that calculated field down a little. It’s basically a formatted version of the statement ST_DISTANCE(A,B)

where A is the centroid of each city in our table (calculated from the geom column) and B is our location of interest from which we are calculating distance, in this case the centroid of the city of Chicago. In both cases of A and B we set the SRID to 4326 with the ST_SetSRID function so we are comparing equivalent projections. We also cast A and B from a geometry to a geography (via the ::geography operator) so our distance calculation will return meters and subsequently divide by 1609 to convert the distance to miles.

The final to_char conversion is just a simple way of rounding to 2 digits for our printing purposes.

**Step 7 – Determine the closest city you’ve never heard of**

I won’t list the distances to all 1,367 cities from Chicago , but here are the 10 closest Illinois cities to Chicago and the 10 furthest cities from Chicago still in Illinois. Note that these are ‘as the crow flies’ straight line distances (in miles), not driving distances. For the purposes of our exercise, the closest city I had never heard of was Fairmont,IL at 27.14 miles (this game will be hard to play twice, as now I’ve heard of it).

Closest:

Furthest:

Top