'Sane'itizing MTA's Turnstile Datasets: Geotagging
Update: Looks like others have geotagged the turnstile dataset too.
A Backstory
In early 2019, I conducted research at BMCC on a noise pollution platform detailing the estimated noise pollution of neighborhoods in NYC (full paper on the platform here if you’re interested). Essentially, the platform inferred the level and type of noise in an area from public, ubiquitous datasets using machine learning. After familiarizing myself with the paper, I hypothesized a correlation between noise levels and the number of people within an area at any given time. As you might have guessed, it’s not easy to know the exact number of people in a neighborhood at a given time, but given that more than half of NYC uses public transportation, ridership data from the MTA would be helpful for testing my hypothesis. Dealing with the MTA dataset was not fun, though.
The Turnstile Dataset
The MTA turnstile dataset is a great asset for many analysis projects (see this for example), but it has a few issues: limited information on column definitions (I googled "control area" and saw something about airspace??), cumulative counts running backwards unexpectedly, weird-looking outliers, and all sorts of strangeness. Unfortunately, I wasted a lot of time trying to make this dataset somewhat useful (and it turns out I’m not the only one). Of course I tried googling, but, you know, scarcity and what not. Anyway, in an attempt to save the next gal/guy from burning hours away trying to tag station locations onto the MTA turnstile dataset, I’m sharing this NAAS, or notebook-as-a-sitepost. Haha, okay, I promise I’ll be less cheeky for the rest of the post.
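For context on the "cumulative counts running backwards" problem: ENTRIES and EXITS are raw counter readings, so per-interval counts come from differencing consecutive readings per turnstile, and counter resets or reversed counters show up as negative or absurdly large deltas. Here's a minimal sketch of one way to handle that; the column names are from the real dataset, but the 10,000 cutoff is my own arbitrary sanity threshold, not anything the MTA documents:

```python
import pandas as pd

def interval_entries(df: pd.DataFrame, max_delta: int = 10_000) -> pd.Series:
    """Turn cumulative ENTRIES counter readings into per-interval counts.

    Counters occasionally reset or run backwards, so we difference per
    turnstile (a turnstile is identified by C/A + UNIT + SCP), take absolute
    values, and null out implausible spikes.
    """
    per_turnstile = df.sort_values(["DATE", "TIME"]).groupby(["C/A", "UNIT", "SCP"])
    deltas = per_turnstile["ENTRIES"].diff().abs()
    # Anything above max_delta in a 4-hour window is almost certainly a reset.
    return deltas.mask(deltas > max_delta)
```

The first reading of each turnstile has no predecessor, so it comes back as NaN rather than a bogus count.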
I’d also like to experiment with creating a pipeline API for processing MTA datasets using some sort of app hosting platform with a free, limited tier (like Heroku), to test ways of optimizing for limited resources. A bit like what game developers did with 8- and 16-bit hardware, but that’s for another time.
Introducing the Dataset
The turnstile dataset is available for public use at http://web.mta.info/developers/turnstile.html. Each dataset contains a week’s worth of aggregated, 4-hour snapshots of the state of the counters within the station turnstiles. Let’s briefly look at it:
import pandas # Good ol' pandas
import numpy as np
import urllib.request
from parsel import Selector # Scrapy's parsel makes it easy to select html elements from a web document (https://github.com/scrapy/parsel)
# So, this function maps the date of the dataset to the dataset location.
# It's not exactly necessary here, but I'm reusing code I already have, so bear with me.
def get_available_datasets():
    """
    Fetches links to available datasets from the MTA turnstile website.
    Returns a dict of date: url pairs.
    Date format example: Saturday, April 11, 2020
    """
    MTA_URL = 'http://web.mta.info/developers/turnstile.html'
    DATA_URL_IDENTIFIER = 'data/nyct/turnstile/turnstile_'
    MTA_DOMAIN = 'http://web.mta.info/developers/'
    with urllib.request.urlopen(MTA_URL) as response:
        home_page = Selector(response.read().decode("utf8"))
    links = home_page.xpath('//a[contains(@href, "{}")]'.format(DATA_URL_IDENTIFIER))
    return {link.xpath('text()').get(): MTA_DOMAIN + link.xpath('@href').get() for link in links}
# Okay, let's pull a dataset and create a dataframe. We'll use last week's data (Saturday, June 13, 2020).
url = get_available_datasets()['Saturday, June 13, 2020'] # Ideally, what's returned from this function should be cached, but it'll run once so ¯\_(ツ)_/¯
with urllib.request.urlopen(url) as response:
    df = pandas.read_csv(response)
df.head(5)
| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 00:00:00 | REGULAR | 7420920 | 2521129 |
1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 04:00:00 | REGULAR | 7420920 | 2521130 |
2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 08:00:00 | REGULAR | 7420928 | 2521141 |
3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 12:00:00 | REGULAR | 7420941 | 2521163 |
4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 16:00:00 | REGULAR | 7420972 | 2521174 |
Not bad, right? Aside from the columns that aren’t immediately obvious (C/A, SCP, etc.), we can see that the first few rows are counts from the 59 ST station. Wait, which station? There are multiple stations at 59 ST in Manhattan.
Well, if we look at the lines that stop at that station (represented by the LINENAME column), we can guess that it must be the Lexington Avenue/59th Street station. You can also confirm this by listing all the stations with 59 ST in their names; you’ll see that the one for Columbus Circle says “Columbus” in its name:
# Computer, I command you to give me all the Station names with "59 ST" in them
df[df["STATION"].str.contains("59 ST")]["STATION"].unique()
array(['59 ST', '5 AV/59 ST', '59 ST COLUMBUS'], dtype=object)
It would’ve been nice if it was geotagged though, right?
Yeah?
Well, let’s tag it then.
There’s a dataset from the MTA with latitude/longitude pairs for stations that we can use to geotag our turnstile datasets. The dataset can be downloaded here. Unfortunately, it isn’t consistent with the turnstile dataset, so we cannot simply join the two datasets on, say, the name of the station. We need to seek alternative measures.
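To see the mismatch concretely: a naive join on the name columns leaves every row untagged, because the two datasets spell station names differently. pandas' `merge` with `indicator=True` makes this easy to quantify. The frames below are tiny made-up stand-ins, just to illustrate the failure mode:

```python
import pandas as pd

# Toy versions of the two datasets; the names deliberately mirror the real
# mismatch ("59 ST" in the turnstile data vs "59 St" in the locations data).
turnstile = pd.DataFrame({"STATION": ["59 ST", "5 AV/59 ST", "SOUTH FERRY"]})
locations = pd.DataFrame({"Stop Name": ["59 St", "5 Av/59 St", "South Ferry"],
                          "GTFS Latitude": [40.7625, 40.7648, 40.7021]})

merged = pd.merge(turnstile, locations,
                  left_on="STATION", right_on="Stop Name",
                  how="left", indicator=True)
# "left_only" marks turnstile rows that found no location record.
print((merged["_merge"] == "left_only").sum())  # → 3
```

Exact string equality is case- and punctuation-sensitive, so all three rows fail, which is why we need something fuzzier.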
The easiest solution I could think of was fuzzy string matching on station names, along with a bit of manual pairing. Using this method, I was able to match most of the stations listed in the turnstile dataset with corresponding latitude/longitude coordinates. It wasn’t easy, though, and took a considerable number of hours.
For explanation’s sake, I’ll briefly take you through that journey. Don’t worry, I’ll try to make it bearable.
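If you want a feel for fuzzy matching without installing anything, the standard library's difflib scores string similarity in the same spirit (the ratios won't match fuzzywuzzy's `token_sort_ratio` exactly; this is just to show the idea, with the same space/hyphen stripping used later in the post):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] after normalizing case, spaces, and hyphens."""
    norm = lambda s: s.lower().replace(" ", "").replace("-", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# The Columbus Circle stop scores far higher than an unrelated station name.
print(ratio("59 ST COLUMBUS", "59 St - Columbus Circle")
      > ratio("59 ST COLUMBUS", "Whitehall St"))  # True
```

The normalization step matters a lot here: without it, case and punctuation differences drag down otherwise obvious matches.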
First, let’s get the station location dataset I mentioned earlier and take a look:
with urllib.request.urlopen("http://web.mta.info/developers/data/nyct/subway/Stations.csv") as response:
    stationsdf = pandas.read_csv(response)
stationsdf.head(5)
| | Station ID | Complex ID | GTFS Stop ID | Division | Line | Stop Name | Borough | Daytime Routes | Structure | GTFS Latitude | GTFS Longitude | North Direction Label | South Direction Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | R01 | BMT | Astoria | Astoria - Ditmars Blvd | Q | N W | Elevated | 40.775036 | -73.912034 | NaN | Manhattan |
1 | 2 | 2 | R03 | BMT | Astoria | Astoria Blvd | Q | N W | Elevated | 40.770258 | -73.917843 | Ditmars Blvd | Manhattan |
2 | 3 | 3 | R04 | BMT | Astoria | 30 Av | Q | N W | Elevated | 40.766779 | -73.921479 | Astoria - Ditmars Blvd | Manhattan |
3 | 4 | 4 | R05 | BMT | Astoria | Broadway | Q | N W | Elevated | 40.761820 | -73.925508 | Astoria - Ditmars Blvd | Manhattan |
4 | 5 | 5 | R06 | BMT | Astoria | 36 Av | Q | N W | Elevated | 40.756804 | -73.929575 | Astoria - Ditmars Blvd | Manhattan |
Great, now we have quite a bit of location data for each station. Let’s see what shows up when we search for “59 St” in the stop name; that’ll give us an idea of whether joining on the Stop Name/STATION columns is possible:
# Computer, I am once again asking you to give me all the Station names with "59 St" in them
stationsdf[stationsdf["Stop Name"].str.contains("59 St")]
| | Station ID | Complex ID | GTFS Stop ID | Division | Line | Stop Name | Borough | Daytime Routes | Structure | GTFS Latitude | GTFS Longitude | North Direction Label | South Direction Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | 613 | R11 | BMT | Astoria | Lexington Av/59 St | M | N W R | Subway | 40.762660 | -73.967258 | Queens | Downtown & Brooklyn |
7 | 8 | 8 | R13 | BMT | Astoria | 5 Av/59 St | M | N W R | Subway | 40.764811 | -73.973347 | Queens | Downtown & Brooklyn |
34 | 35 | 35 | R41 | BMT | 4th Av | 59 St | Bk | N R | Subway | 40.641362 | -74.017881 | Manhattan | Coney Island - Bay Ridge |
160 | 161 | 614 | A24 | IND | 8th Av - Fulton St | 59 St - Columbus Circle | M | A B C D | Subway | 40.768296 | -73.981736 | Uptown & The Bronx | Downtown & Brooklyn |
315 | 315 | 614 | 125 | IRT | Broadway - 7Av | 59 St - Columbus Circle | M | 1 | Subway | 40.768247 | -73.981929 | Uptown & The Bronx | Downtown |
400 | 400 | 613 | 629 | IRT | Lexington Av | 59 St | M | 4 5 6 | Subway | 40.762526 | -73.967967 | Uptown & The Bronx | Downtown & Brooklyn |
Yikes. Okay. So there are a couple of issues here (funny thing is, I hadn’t noticed these issues until I started writing this. Data science, amirite?). The first is that the names in our location dataset aren’t station names; they’re actually the names of the stops. I totally missed that earlier, but fortunately, they line up well enough with a lot of the names in the “STATION” column of our turnstile dataset. We can still use this dataset, but we’ll have to do more than just join on “STATION” and “Stop Name”.
Okay, let’s examine our station/stop locations dataset first. From our example above, we can see that some stations are broken down into multiple stops, with differing lines and routes (for example, the first and last rows). What you should also notice is that the Complex ID is the same for both rows. Coincidence? Well, let’s find out. (You’d think the Station ID would be the same, but nooope. I tried searching for a key to the column names, but found nothing, sigh.)
# This one-liner groups the rows by "Complex ID", filters the groups that have more than one row,
# and sorts them so we can see the groups next to each other.
stationsdf.groupby('Complex ID').filter(lambda g: len(g) > 1).sort_values("Complex ID", ascending=False)
| | Station ID | Complex ID | GTFS Stop ID | Division | Line | Stop Name | Borough | Daytime Routes | Structure | GTFS Latitude | GTFS Longitude | North Direction Label | South Direction Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
174 | 174 | 636 | A41 | IND | 8th Av - Fulton St | Jay St - MetroTech | Bk | A C F | Subway | 40.692338 | -73.987342 | Manhattan | Euclid - Lefferts - Rockaways - Coney Island |
24 | 25 | 636 | R29 | BMT | Broadway | Jay St - MetroTech | Bk | R | Subway | 40.692180 | -73.985942 | Manhattan | Bay Ridge - 95 St |
330 | 330 | 635 | 142 | IRT | Broadway - 7Av | South Ferry | M | 1 | Subway | 40.702068 | -74.013664 | Uptown & The Bronx | NaN |
22 | 23 | 635 | R27 | BMT | Broadway | Whitehall St | M | R W | Subway | 40.703087 | -74.012994 | Uptown & Queens | Brooklyn |
127 | 128 | 630 | L17 | BMT | Canarsie | Myrtle - Wyckoff Avs | Bk | L | Subway | 40.699814 | -73.911586 | Manhattan | Canarsie - Rockaway Parkway |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
462 | 461 | 461 | R09 | BMT | Astoria | Queensboro Plaza | Q | N W | Elevated | 40.750582 | -73.940202 | Astoria - Flushing | Manhattan |
166 | 167 | 167 | A32 | IND | 8th Av - Fulton St | W 4 St | M | A C E | Subway | 40.732338 | -74.000495 | Uptown - Queens | Downtown & Brooklyn |
167 | 167 | 167 | D20 | IND | 6th Av - Culver | W 4 St | M | B D F M | Subway | 40.732338 | -74.000495 | Uptown - Queens | Downtown & Brooklyn |
149 | 151 | 151 | A12 | IND | 8th Av - Fulton St | 145 St | M | A C | Subway | 40.824783 | -73.944216 | Uptown & The Bronx | Downtown & Brooklyn |
150 | 151 | 151 | D13 | IND | Concourse | 145 St | M | B D | Subway | 40.824783 | -73.944216 | Uptown & The Bronx | Downtown & Brooklyn |
86 rows × 13 columns
Okay. Let’s take some samples and test. We’ve already seen for Lexington Av/59 St that the “Daytime Routes” (a.k.a. the trains that run there) mostly add up to the “LINENAME” in our turnstile dataset (except for the Q train, which I assume is there because of the access to the Lexington Av/63 St station, where the Q runs. The Q also substituted for the W from 2010 to 2016; details here if you’re interested). Let’s try rows 3 and 4: Whitehall St and South Ferry. If you combine their “Daytime Routes”, you’ll see it corresponds with the “LINENAME” values for “SOUTH FERRY” in the turnstile dataset:
df[df["STATION"] == "SOUTH FERRY"].head(1)
| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS |
---|---|---|---|---|---|---|---|---|---|---|---|
128358 | R101 | R001 | 02-00-00 | SOUTH FERRY | 1RW | IRT | 06/06/2020 | 01:00:00 | REGULAR | 7025506 | 283662 |
Great. So maybe we can use the LINENAME/Daytime Routes to geotag the turnstile dataset.
Let’s try it.
First, I’ll consolidate the “Daytime Routes” for the stops that are in the same station, and replace the “Daytime Routes” for each with the consolidated value. Afterward, we’ll check Lex Av/59 St and South Ferry/Whitehall:
# this needs to run ONLY ONCE or you'll get the world of duplicates
stationsdf["Daytime Routes"] = stationsdf[["Complex ID","Daytime Routes"]].groupby("Complex ID")["Daytime Routes"].transform(lambda x: ''.join([y.replace(' ', '') for y in x]))
# Check Lex av/59th
stationsdf[stationsdf["Stop Name"].str.contains("59 St")]
| | Station ID | Complex ID | GTFS Stop ID | Division | Line | Stop Name | Borough | Daytime Routes | Structure | GTFS Latitude | GTFS Longitude | North Direction Label | South Direction Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | 613 | R11 | BMT | Astoria | Lexington Av/59 St | M | NWR456 | Subway | 40.762660 | -73.967258 | Queens | Downtown & Brooklyn |
7 | 8 | 8 | R13 | BMT | Astoria | 5 Av/59 St | M | NWR | Subway | 40.764811 | -73.973347 | Queens | Downtown & Brooklyn |
34 | 35 | 35 | R41 | BMT | 4th Av | 59 St | Bk | NR | Subway | 40.641362 | -74.017881 | Manhattan | Coney Island - Bay Ridge |
160 | 161 | 614 | A24 | IND | 8th Av - Fulton St | 59 St - Columbus Circle | M | ABCD1 | Subway | 40.768296 | -73.981736 | Uptown & The Bronx | Downtown & Brooklyn |
315 | 315 | 614 | 125 | IRT | Broadway - 7Av | 59 St - Columbus Circle | M | ABCD1 | Subway | 40.768247 | -73.981929 | Uptown & The Bronx | Downtown |
400 | 400 | 613 | 629 | IRT | Lexington Av | 59 St | M | NWR456 | Subway | 40.762526 | -73.967967 | Uptown & The Bronx | Downtown & Brooklyn |
Great. Now let’s see if we can get somewhere with this. We know we can’t join directly because the stations don’t have exactly the same names. The next best thing is fuzzy search. We’ll use fuzzywuzzy for the fuzzy matching, so now we need a strategy to make our matches as accurate as possible.
The easiest strategy I can think of is to match on the names first, then on the routes, and assume the highest-scoring pair is correct (we’ll test, of course). With this in mind, I can match the closest name whose routes are essentially the same.
Let’s see how that plays out:
from fuzzywuzzy import process, fuzz
from operator import itemgetter

station_stops = {x[0].replace(' ', '').replace('-', ''): x[0] for x in zip(stationsdf["Stop Name"])}
station_stop_names = station_stops.keys()

def match_records(row):
    """
    Receives a unique (STATION, LINENAME) row from the turnstile dataset and attempts to match it with a record
    from the stop/station location dataset using the names and routes.
    The function runs fuzzy search on the station name and the linename/routes and aggregates the results.
    Then it sorts the results in descending order by the match ratio of the name first, then the linename/route.
    Finally, it returns the result with the highest name and line/route match ratio.
    Returns a dictionary in the format:
    {'station': '59 ST',           == Corresponds to the STATION column in the turnstile dataset
     'name_match': ('59St', 100),  == Corresponds to the matched stop name key & the match ratio
     'line_match': 'NWR456',       == Corresponds to the lines/Routes
     'stop_name_match': '59 St',   == Corresponds to the actual, unmodified stop name
     'line_fuzz_ratio': 77}        == Corresponds to the match ratio of the Routes against the LINENAME from the
                                      turnstile dataset
    """
    # Preprocess the station name to increase the accuracy of fuzzy matching
    station_name = row["STATION"].replace(' ', '').replace('-', '')
    station_routes = row["LINENAME"]
    # First, get the potential matches.
    candidates = process.extract(station_name,
                                 station_stop_names, scorer=fuzz.token_sort_ratio, limit=10)
    # Stores the results of all the line matches for each name match
    records = []
    for x in candidates:
        # Find the corresponding record(s) from the station locations dataframe
        candidate_match = stationsdf[stationsdf["Stop Name"] == station_stops[x[0]]]["Daytime Routes"].unique()
        for y in candidate_match:
            records.append({
                "station": row["STATION"],
                "name_match": x,
                "line_match": y,
                "station_line": station_routes,
                "stop_name_match": station_stops[x[0]],
                "line_fuzz_ratio": fuzz.token_sort_ratio(station_routes, y)
            })
    # Sort the results by name first, line/route second, in descending order, and return the first result
    return sorted(records, key=lambda x: (x["name_match"][1], x["line_fuzz_ratio"]), reverse=True)[0]

# Now we apply the function to our dataframes
results = df[['STATION','LINENAME']].drop_duplicates().reset_index().apply(match_records, axis=1)
Okay, we’re close! There are a couple of stations that aren’t matching correctly, or don’t exist in our locations dataset. We’ll have to do some manual vetting to take care of those (don’t worry, I did it myself).
Now, let’s clean our data, fix the wrong matches, and use the mapping to geotag the turnstile dataset.
# Here's a list of items that weren't a perfect match on both station name and routes *suppressed to the first 2*
# We'll review these and remove the ones we can't fix, or update the mismatches
[x for x in results if x['line_fuzz_ratio'] < 100 and x["name_match"][1] < 100][:2]
[{'station': 'WHITEHALL S-FRY',
'name_match': ('WhitehallSt', 83),
'line_match': 'RW1',
'station_line': 'R1W',
'stop_name_match': 'Whitehall St',
'line_fuzz_ratio': 67},
{'station': 'DELANCEY/ESSEX',
'name_match': ('DelanceySt', 75),
'line_match': 'JMZF',
'station_line': 'FJMZ',
'stop_name_match': 'Delancey St',
'line_fuzz_ratio': 75}]
# Fix missed matches here:
# the 'patched' key is there to identify records that were manually corrected
mismatches = {
'FLATBUSH AV-B.C': {'station': 'FLATBUSH AV-B.C',
'name_match': ('Flatbush Av - Brooklyn College', 61),
'line_match': '25',
'station_line': '25',
'stop_name_match': 'Flatbush Av - Brooklyn College',
'line_fuzz_ratio': 100,
'patched': True},
'ROCKAWAY PARK B': {'station': 'ROCKAWAY PARK B',
'name_match': ('RockawayParkBeach116St', 100),
'line_match': 'AS',
'station_line': 'AS',
'stop_name_match': 'Rockaway Park - Beach 116 St',
'line_fuzz_ratio': 100,
'patched': True},
    # ... suppressed
}
# The locations of these aren't in the dataset, so we'll ignore them.
# I don't know what some of the names correspond to either
unknowns = [
"RIT-ROOSEVELT",
"RIT-MANHATTAN",
"ST. GEORGE",
"ORCHARD BEACH",
"NEWARK HW BMEBE",
"HARRISON",
"JOURNAL SQUARE",
'EXCHANGE PLACE',
'PAVONIA/NEWPORT',
'CITY / BUS',
# ... surpressed
]
# Let's now filter out the unknowns and fix the mismatches
results_filtered = [x for x in results if x['station'] not in unknowns]
# and also, fix the mismatches
results_fixed = [mismatches[x['station']] if mismatches.get(x['station'], None) is not None else x for x in results_filtered]
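Side note: the patch-up step can be written a little more compactly with `dict.get`, which falls back to the record itself when there's no manual fix. A small helper sketch (the function name is mine; behavior is the same as the comprehension above):

```python
def apply_patches(records, patches):
    """Replace each record with its manual patch when one exists.

    `records` is a list of match dicts keyed by 'station';
    `patches` maps a station name to its hand-corrected record.
    """
    return [patches.get(r['station'], r) for r in records]
```

With the post's variables, that would read `results_fixed = apply_patches(results_filtered, mismatches)`.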
Alright, now it’s time to geotag each record. To do so, I’ll map station names and routes to a particular location, then use that mapping to update the turnstile records:
# Maps the station and line to the geolocations
_ = [
    x.update(
        stationsdf[
            (stationsdf["Stop Name"] == x["stop_name_match"]) &
            (stationsdf["Daytime Routes"] == x["line_match"])
        ]
        [["GTFS Latitude", "GTFS Longitude"]].head(1).to_dict('list')
    ) for x in results_fixed]
# Some records have the lat/long values in a list (like [40.824073], for example). The lines below fix that
_ = [
    x.update(
        {'GTFS Latitude': x['GTFS Latitude'][0] if type(x['GTFS Latitude']) is list else x['GTFS Latitude'],
         'GTFS Longitude': x['GTFS Longitude'][0] if type(x['GTFS Longitude']) is list else x['GTFS Longitude']}
    ) for x in results_fixed]
# I'll convert it into a df, then use 'station' and 'station_line' to join it on the turnstile dataset
mappingdf = pandas.DataFrame(results_fixed).drop(['name_match', 'line_match', 'stop_name_match', 'line_fuzz_ratio', 'patched'], axis=1)
pandas.merge(df, mappingdf, left_on=['STATION', 'LINENAME'], right_on=['station', 'station_line'], how='left')
| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS | station | station_line | GTFS Latitude | GTFS Longitude |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 00:00:00 | REGULAR | 7420920 | 2521129 | 59 ST | NQR456W | 40.762526 | -73.967967 |
1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 04:00:00 | REGULAR | 7420920 | 2521130 | 59 ST | NQR456W | 40.762526 | -73.967967 |
2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 08:00:00 | REGULAR | 7420928 | 2521141 | 59 ST | NQR456W | 40.762526 | -73.967967 |
3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 12:00:00 | REGULAR | 7420941 | 2521163 | 59 ST | NQR456W | 40.762526 | -73.967967 |
4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 16:00:00 | REGULAR | 7420972 | 2521174 | 59 ST | NQR456W | 40.762526 | -73.967967 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
206657 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 05:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206658 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 09:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206659 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 13:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206660 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 17:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206661 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 21:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206662 rows × 15 columns
Done! We’re all geotagged now!
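Well, almost: the unknowns we filtered out earlier show up as NaN coordinates in the merged frame (like the RIT-ROOSEVELT tram rows above), so a quick sanity check on how much of the data actually got tagged is worth it. A small helper sketch, assuming the merged result is stored in a dataframe like the one printed above:

```python
import pandas as pd

def untagged_fraction(geotagged: pd.DataFrame) -> float:
    """Fraction of turnstile rows that didn't pick up coordinates."""
    return geotagged["GTFS Latitude"].isna().mean()
```

If that fraction is more than a couple of percent, it's a sign that new station names have crept into the turnstile data and the mismatch/unknown lists need another pass.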
Improvements and Contributions Welcome!
I’ll have the code and dataset available here. Feel free to open issues and PRs for fixes/improvements.
That is all.