Speedrunning the New York Subway

Speedrunning the New York Subway

It all began at Empire Hacking, where I encountered the inimitable @ImmigrantJackson, a YouTuber with a penchant for public transit and a dream: to break the world record for visiting every subway stop in New York City in the least time possible.

The rules were simple. He didn’t need to exit the subway car; simply passing through a stop, even on an express train, counted as a “visit.” Stops could be visited multiple times, although this would of course be suboptimal. And there was no need to visit Staten Island … because we’re civilized. We live in a society.

The coterie of curious computer coders coalescing around the inquisitor quickly classified this question as a case of the Traveling Salesman Problem (TSP). TSP is a classical problem in computer science where one must find the shortest route for a traveling salesman to visit each city on a map.

TSP is known to be computationally intractable to solve optimally for networks even as small as the New York subway system. Therefore, everyone dismissed the problem as “impossible.”

I knew that there were algorithms capable of solving TSP very quickly, producing a route guaranteed to be at most 50% longer than the optimal solution.

The first challenge was that, even when you discard Staten Italy, the subway system still has a lot of stations. There are close to 500. This exercise was fun and all, but I wasn’t about to spend hours encoding hundreds of stations, lines, transfer points, headways, and average trip times into a program.

Fortunately, the MTA has a public API implementing GTFS. It’s just a collection of CSV files that are quite straightforward to parse.

The dataset is sufficient to construct a graph with a node for each subway station and an edge if there is a subway line that connects the two stations. Actually, the data represents subway platforms rather than stations, which is necessary to calculate things like transfer times between train lines.

The New York Subway network is a directed graph: some neighboring stations are accessible to each other only in one direction.

For example, the Aqueduct Racetrack station only has a platform for northbound trains, not southbound trains.

The best algorithms for approximating a solution to TSP on directed graphs are not great, only guaranteeing a solution of three or four times the length of the optimal solution.

I decided to relax the problem to an undirected graph (i.e., I assume that every station has trains to its neighbors bidirectionally). This turned out not to be an issue and permitted the use of the Christofides algorithm, which guarantees a solution at most 50% longer than optimal.

The next and final relaxation is to partially ignore actual timetables and headways. When a transfer between lines is necessary, I assume that the transfer will require the “minimum transfer time” reported by GTFS (i.e., the amount of time required to walk from one platform to another) plus one-half the average headway for that line throughout the day.

The Christofides algorithm was able to approximate a solution to the TSP for my relaxed subway network graph in a matter of milliseconds. This is compared to the unfathomable amount of computation required to brute-force calculate the optimal solution, an amount so large that it afforded me the horrifying opportunity to learn that there is a whole community of “googologists” who compete with each other to name large numbers.

The resulting tour visits all 474 stations, 155 of which are visited more than once. The tour requires 34 transfers. The expected time for completing this tour is 20 hours, 42 minutes.

This is neat and all, but we spent an unnecessary amount of energy encoding the New York subway system as a graph. What else might we do with it?

One straightforward computation is to calculate each station’s eigenvector centrality.

Eigenvector centrality calculates the probability that you are at a particular station if you were to pause your infinite tour at any arbitrary point in time.

Eigenvector centrality is actually what Google originally used in its PageRank algorithm to rank the relative importance (centrality) of web pages.

Each page is like a subway station, and hyperlinks on the page are like the subway lines connecting them.

Eigenvector centrality is relatively easy to calculate, either directly (it’s related to the eigenspectrum of the graph’s adjacency matrix, thanks to spooky spectral graph theory magic) or using a technique called power iteration (which relies on a convergence that happens when you multiply the adjacency matrix by a vector a bunch of times).

Either way, you can calculate it with a single function call in NetworkX.

What do you think will be the most probable stop on your infinite subway tour? Or, another way to ask the same question: Which stop will you visit most often on your tour?

Unsurprisingly, it will be Times Square, with a probability of a little over 30%.

The next most probable station is 42nd St. Port Authority Bus Terminal, coming in at about 8%. Then 50th St., 59th St. Columbus Circle, Grand Central 42nd St., and 34th St. Penn Station are all between 4% and 5%

Conclusion

First of all, don’t attend Empire Hacking without the expectation of being intellectually stimulated.

Secondly, don’t immediately discount a problem just because it is NP-hard or computationally intractable to solve; you might be able to approximate a solution of sufficient optimality.

Thirdly, Times Square may not be the “center of the universe,” but it is definitely the center of the New York subway system.

Finally, remember to like, , and subscribe!