'Sane'itizing MTA's Turnstile Datasets: Geotagging
Update: Looks like others have geotagged the turnstile dataset too.
A Backstory
In early 2019, I conducted research at BMCC on a noise pollution platform detailing the estimated noise pollution of neighborhoods in NYC (full paper on the platform here if you’re interested). Essentially, the platform inferred the level and type of noise in an area from public, ubiquitous datasets using machine learning. After familiarizing myself with the paper, I hypothesized a correlation between noise levels and the number of people within an area at any given time. As you might have guessed, it’s not easy to know the exact number of people in a neighborhood at a given time, but given that more than half of NYC uses public transportation, ridership data from the MTA would be helpful for testing my hypothesis. Dealing with the MTA dataset was not fun, though.
The Turnstile Dataset
The MTA turnstile dataset is a great asset for many analysis projects (see this for example), but it has a few issues: limited information on column definitions (I googled "control area" and saw something about airspace??), cumulative counts running backwards unexpectedly, weird-looking outliers, and all sorts of strangeness. Unfortunately, I wasted a lot of time trying to make this dataset somewhat useful (and it turns out I’m not the only one). Of course I tried googling, but, you know, scarcity and what not. Anyway, in an attempt to save the next gal/guy from burning hours away trying to tag station locations onto the MTA turnstile dataset, I’m sharing this NAAS, or notebook-as-a-sitepost. Haha, okay, I promise I’ll be less cheeky for the rest of the post.
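For context on the "cumulative counts running backwards" problem: ENTRIES and EXITS are raw counter readings, so per-interval counts come from differencing consecutive readings per turnstile, and counter resets or reversed counters show up as negative or absurdly large deltas. Here's a minimal sketch of one way to handle that; the column names are from the real dataset, but the 10,000 cutoff is my own arbitrary sanity threshold, not anything the MTA documents:

```python
import pandas as pd

def interval_entries(df: pd.DataFrame, max_delta: int = 10_000) -> pd.Series:
    """Turn cumulative ENTRIES counter readings into per-interval counts.

    Counters occasionally reset or run backwards, so we difference per
    turnstile (a turnstile is identified by C/A + UNIT + SCP), take absolute
    values, and null out implausible spikes.
    """
    per_turnstile = df.sort_values(["DATE", "TIME"]).groupby(["C/A", "UNIT", "SCP"])
    deltas = per_turnstile["ENTRIES"].diff().abs()
    # Anything above max_delta in a 4-hour window is almost certainly a reset.
    return deltas.mask(deltas > max_delta)
```

The first reading of each turnstile has no predecessor, so it comes back as NaN rather than a bogus count.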
I’d also like to experiment with creating a pipeline API for processing MTA datasets using some sort of app hosting platform with a free, limited tier (like Heroku), to test ways of optimizing for limited resources. A bit like what game developers did with 8- and 16-bit hardware, but that’s for another time.
Introducing the Dataset
The turnstile dataset is available for public use at http://web.mta.info/developers/turnstile.html. Each dataset contains a week’s worth of aggregated, 4-hour snapshots of the state of the counters within the station turnstiles. Let’s briefly look at it:
import pandas # Good ol' pandas
import numpy as np
import urllib.request
from parsel import Selector # Scrapy's parsel makes it easy to select html elements from a web document (https://github.com/scrapy/parsel)
# So, this function maps the date of the dataset to the dataset location.
# It's not exactly necessary here, but I'm reusing code I already have, so bear with me.
def get_available_datasets():
    """
    Fetches links to available datasets from the MTA turnstile website.
    Returns a dict of date: url pairs.
    Date format example: Saturday, April 11, 2020
    """
    MTA_URL = 'http://web.mta.info/developers/turnstile.html'
    DATA_URL_IDENTIFIER = 'data/nyct/turnstile/turnstile_'
    MTA_DOMAIN = 'http://web.mta.info/developers/'
    with urllib.request.urlopen(MTA_URL) as response:
        home_page = Selector(response.read().decode("utf8"))
    links = home_page.xpath('//a[contains(@href, "{}")]'.format(DATA_URL_IDENTIFIER))
    return {link.xpath('text()').get(): MTA_DOMAIN + link.xpath('@href').get() for link in links}
# Okay, let's pull a dataset and create a dataframe. We'll use last week's data (Saturday, June 13, 2020).
url = get_available_datasets()['Saturday, June 13, 2020'] # Ideally, what's returned from this function should be cached, but it'll run once so ¯\_(ツ)_/¯
with urllib.request.urlopen(url) as response:
    df = pandas.read_csv(response)
df.head(5)
| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 00:00:00 | REGULAR | 7420920 | 2521129 |
1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 04:00:00 | REGULAR | 7420920 | 2521130 |
2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 08:00:00 | REGULAR | 7420928 | 2521141 |
3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 12:00:00 | REGULAR | 7420941 | 2521163 |
4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 16:00:00 | REGULAR | 7420972 | 2521174 |
Not bad, right? Aside from the columns that aren’t immediately obvious (C/A, SCP, etc.), we can see that the first few rows are counts from the 59 ST station. Wait, which station? There are multiple stations at 59 ST in Manhattan.
Well, if we look at the lines that stop at that station (represented by the LINENAME column), we can guess that it must be the Lexington Avenue/59th Street station. You can also confirm this by listing all the stations with 59 ST in their names; you’ll see that the one for Columbus Circle says “Columbus” in its name:
# Computer, I command you to give me all the Station names with "59 ST" in them
df[df["STATION"].str.contains("59 ST")]["STATION"].unique()
array(['59 ST', '5 AV/59 ST', '59 ST COLUMBUS'], dtype=object)
It would’ve been nice if it was geotagged though, right?
Yeah?
Well, let’s tag it then.
There’s a dataset from the MTA with latitude/longitude pairs for stations that we can use to geotag our turnstile datasets. The dataset can be downloaded here. Unfortunately, it isn’t consistent with the turnstile dataset, so we cannot simply join the two datasets on, say, the name of the station. We need to seek alternative measures.
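To see the mismatch concretely: a naive join on the name columns leaves every row untagged, because the two datasets spell station names differently. pandas' `merge` with `indicator=True` makes this easy to quantify. The frames below are tiny made-up stand-ins, just to illustrate the failure mode:

```python
import pandas as pd

# Toy versions of the two datasets; the names deliberately mirror the real
# mismatch ("59 ST" in the turnstile data vs "59 St" in the locations data).
turnstile = pd.DataFrame({"STATION": ["59 ST", "5 AV/59 ST", "SOUTH FERRY"]})
locations = pd.DataFrame({"Stop Name": ["59 St", "5 Av/59 St", "South Ferry"],
                          "GTFS Latitude": [40.7625, 40.7648, 40.7021]})

merged = pd.merge(turnstile, locations,
                  left_on="STATION", right_on="Stop Name",
                  how="left", indicator=True)
# "left_only" marks turnstile rows that found no location record.
print((merged["_merge"] == "left_only").sum())  # → 3
```

Exact string equality is case- and punctuation-sensitive, so all three rows fail, which is why we need something fuzzier.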
The easiest solution I could think of was fuzzy string matching on station names, along with a bit of manual pairing. Using this method, I was able to match most of the stations listed in the turnstile dataset with corresponding latitude/longitude coordinates. It wasn’t easy, though, and took a considerable number of hours.
For explanation’s sake, I’ll briefly take you through that journey. Don’t worry, I’ll try to make it bearable.
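If you want a feel for fuzzy matching without installing anything, the standard library's difflib scores string similarity in the same spirit (the ratios won't match fuzzywuzzy's `token_sort_ratio` exactly; this is just to show the idea, with the same space/hyphen stripping used later in the post):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] after normalizing case, spaces, and hyphens."""
    norm = lambda s: s.lower().replace(" ", "").replace("-", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# The Columbus Circle stop scores far higher than an unrelated station name.
print(ratio("59 ST COLUMBUS", "59 St - Columbus Circle")
      > ratio("59 ST COLUMBUS", "Whitehall St"))  # True
```

The normalization step matters a lot here: without it, case and punctuation differences drag down otherwise obvious matches.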
First, let’s get the station location dataset I mentioned earlier and take a look:
with urllib.request.urlopen("http://web.mta.info/developers/data/nyct/subway/Stations.csv") as response:
    stationsdf = pandas.read_csv(response)
stationsdf.head(5)
| | Station ID | Complex ID | GTFS Stop ID | Division | Line | Stop Name | Borough | Daytime Routes | Structure | GTFS Latitude | GTFS Longitude | North Direction Label | South Direction Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | R01 | BMT | Astoria | Astoria - Ditmars Blvd | Q | N W | Elevated | 40.775036 | -73.912034 | NaN | Manhattan |
1 | 2 | 2 | R03 | BMT | Astoria | Astoria Blvd | Q | N W | Elevated | 40.770258 | -73.917843 | Ditmars Blvd | Manhattan |
2 | 3 | 3 | R04 | BMT | Astoria | 30 Av | Q | N W | Elevated | 40.766779 | -73.921479 | Astoria - Ditmars Blvd | Manhattan |
3 | 4 | 4 | R05 | BMT | Astoria | Broadway | Q | N W | Elevated | 40.761820 | -73.925508 | Astoria - Ditmars Blvd | Manhattan |
4 | 5 | 5 | R06 | BMT | Astoria | 36 Av | Q | N W | Elevated | 40.756804 | -73.929575 | Astoria - Ditmars Blvd | Manhattan |
Great, now we have quite a bit of location data for each station. Let’s see what shows up when we search for “59 St” in the stop name; that’ll give us an idea of whether joining on the Stop Name/STATION columns is possible:
# Computer, I am once again asking you to give me all the Station names with "59 St" in them
stationsdf[stationsdf["Stop Name"].str.contains("59 St")]
| | Station ID | Complex ID | GTFS Stop ID | Division | Line | Stop Name | Borough | Daytime Routes | Structure | GTFS Latitude | GTFS Longitude | North Direction Label | South Direction Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | 613 | R11 | BMT | Astoria | Lexington Av/59 St | M | N W R | Subway | 40.762660 | -73.967258 | Queens | Downtown & Brooklyn |
7 | 8 | 8 | R13 | BMT | Astoria | 5 Av/59 St | M | N W R | Subway | 40.764811 | -73.973347 | Queens | Downtown & Brooklyn |
34 | 35 | 35 | R41 | BMT | 4th Av | 59 St | Bk | N R | Subway | 40.641362 | -74.017881 | Manhattan | Coney Island - Bay Ridge |
160 | 161 | 614 | A24 | IND | 8th Av - Fulton St | 59 St - Columbus Circle | M | A B C D | Subway | 40.768296 | -73.981736 | Uptown & The Bronx | Downtown & Brooklyn |
315 | 315 | 614 | 125 | IRT | Broadway - 7Av | 59 St - Columbus Circle | M | 1 | Subway | 40.768247 | -73.981929 | Uptown & The Bronx | Downtown |
400 | 400 | 613 | 629 | IRT | Lexington Av | 59 St | M | 4 5 6 | Subway | 40.762526 | -73.967967 | Uptown & The Bronx | Downtown & Brooklyn |
Yikes. Okay. So there are a couple of issues here (funny thing is, I hadn’t noticed these issues until I started writing this. Data science, amirite?). The first is that the names in our location dataset aren’t station names; they’re actually the names of the stops. I totally missed that earlier, but fortunately, they line up well enough with a lot of the names in the “STATION” column of our turnstile dataset. We can still use this dataset, but we’ll have to do more than just join on “STATION” and “Stop Name”.
Okay, let’s examine our station/stop locations dataset first. From our example above, we can see that some stations are broken down into multiple stops, with differing lines and routes (for example, the first and last rows). What you should also notice is that the Complex ID is the same for both rows. Coincidence? Well, let’s find out. (You’d think the Station ID would be the same, but nooope. I tried searching for a key to the column names, but found nothing, sigh.)
# This one-liner groups the rows by "Complex ID", filters the groups that have more than one row,
# and sorts them so we can see the groups next to each other.
stationsdf.groupby('Complex ID').filter(lambda g: len(g) > 1).sort_values("Complex ID", ascending=False)
| | Station ID | Complex ID | GTFS Stop ID | Division | Line | Stop Name | Borough | Daytime Routes | Structure | GTFS Latitude | GTFS Longitude | North Direction Label | South Direction Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
174 | 174 | 636 | A41 | IND | 8th Av - Fulton St | Jay St - MetroTech | Bk | A C F | Subway | 40.692338 | -73.987342 | Manhattan | Euclid - Lefferts - Rockaways - Coney Island |
24 | 25 | 636 | R29 | BMT | Broadway | Jay St - MetroTech | Bk | R | Subway | 40.692180 | -73.985942 | Manhattan | Bay Ridge - 95 St |
330 | 330 | 635 | 142 | IRT | Broadway - 7Av | South Ferry | M | 1 | Subway | 40.702068 | -74.013664 | Uptown & The Bronx | NaN |
22 | 23 | 635 | R27 | BMT | Broadway | Whitehall St | M | R W | Subway | 40.703087 | -74.012994 | Uptown & Queens | Brooklyn |
127 | 128 | 630 | L17 | BMT | Canarsie | Myrtle - Wyckoff Avs | Bk | L | Subway | 40.699814 | -73.911586 | Manhattan | Canarsie - Rockaway Parkway |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
462 | 461 | 461 | R09 | BMT | Astoria | Queensboro Plaza | Q | N W | Elevated | 40.750582 | -73.940202 | Astoria - Flushing | Manhattan |
166 | 167 | 167 | A32 | IND | 8th Av - Fulton St | W 4 St | M | A C E | Subway | 40.732338 | -74.000495 | Uptown - Queens | Downtown & Brooklyn |
167 | 167 | 167 | D20 | IND | 6th Av - Culver | W 4 St | M | B D F M | Subway | 40.732338 | -74.000495 | Uptown - Queens | Downtown & Brooklyn |
149 | 151 | 151 | A12 | IND | 8th Av - Fulton St | 145 St | M | A C | Subway | 40.824783 | -73.944216 | Uptown & The Bronx | Downtown & Brooklyn |
150 | 151 | 151 | D13 | IND | Concourse | 145 St | M | B D | Subway | 40.824783 | -73.944216 | Uptown & The Bronx | Downtown & Brooklyn |
86 rows × 13 columns
Okay. Let’s take some samples and test. We’ve already seen for Lexington Av/59 St that the “Daytime Routes” (a.k.a. the trains that run there) mostly add up to the “LINENAME” in our turnstile dataset (except for the Q train, which I assume is there because of the access to the Lexington Av/63 St station, where the Q runs. The Q also substituted for the W from 2010 to 2016; details here if you’re interested). Let’s try rows 3 and 4: Whitehall St and South Ferry. If you combine their “Daytime Routes”, you’ll see it corresponds with the “LINENAME” values for “SOUTH FERRY” in the turnstile dataset:
df[df["STATION"] == "SOUTH FERRY"].head(1)
| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS |
---|---|---|---|---|---|---|---|---|---|---|---|
128358 | R101 | R001 | 02-00-00 | SOUTH FERRY | 1RW | IRT | 06/06/2020 | 01:00:00 | REGULAR | 7025506 | 283662 |
Great. So maybe we can use the LINENAME/Daytime Routes to geotag the turnstile dataset.
Let’s try it.
First, I’ll consolidate the “Daytime Routes” for the stops that are in the same station, and replace the “Daytime Routes” for each with the consolidated value. Afterward, we’ll check Lex Av/59 St and South Ferry/Whitehall:
# this needs to run ONLY ONCE or you'll get the world of duplicates
stationsdf["Daytime Routes"] = stationsdf[["Complex ID","Daytime Routes"]].groupby("Complex ID")["Daytime Routes"].transform(lambda x: ''.join([y.replace(' ', '') for y in x]))
# Check Lex av/59th
stationsdf[stationsdf["Stop Name"].str.contains("59 St")]
| | Station ID | Complex ID | GTFS Stop ID | Division | Line | Stop Name | Borough | Daytime Routes | Structure | GTFS Latitude | GTFS Longitude | North Direction Label | South Direction Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | 613 | R11 | BMT | Astoria | Lexington Av/59 St | M | NWR456 | Subway | 40.762660 | -73.967258 | Queens | Downtown & Brooklyn |
7 | 8 | 8 | R13 | BMT | Astoria | 5 Av/59 St | M | NWR | Subway | 40.764811 | -73.973347 | Queens | Downtown & Brooklyn |
34 | 35 | 35 | R41 | BMT | 4th Av | 59 St | Bk | NR | Subway | 40.641362 | -74.017881 | Manhattan | Coney Island - Bay Ridge |
160 | 161 | 614 | A24 | IND | 8th Av - Fulton St | 59 St - Columbus Circle | M | ABCD1 | Subway | 40.768296 | -73.981736 | Uptown & The Bronx | Downtown & Brooklyn |
315 | 315 | 614 | 125 | IRT | Broadway - 7Av | 59 St - Columbus Circle | M | ABCD1 | Subway | 40.768247 | -73.981929 | Uptown & The Bronx | Downtown |
400 | 400 | 613 | 629 | IRT | Lexington Av | 59 St | M | NWR456 | Subway | 40.762526 | -73.967967 | Uptown & The Bronx | Downtown & Brooklyn |
Great. Now let’s see if we can get somewhere with this. We know we can’t join directly because the stations don’t have exactly the same names. The next best thing is fuzzy search. We’ll use fuzzywuzzy for the fuzzy matching, so now we need a strategy to make our matches as accurate as possible.
The easiest strategy I can think of is to match on the names first, then on the routes, and assume the highest-scoring pair is correct (we’ll test, of course). With this in mind, I can match the closest name whose routes are essentially the same.
Let’s see how that plays out:
from fuzzywuzzy import process, fuzz
from operator import itemgetter

station_stops = {x[0].replace(' ', '').replace('-', ''): x[0] for x in zip(stationsdf["Stop Name"])}
station_stop_names = station_stops.keys()

def match_records(row):
    """
    Receives a unique (STATION, LINENAME) row from the turnstile dataset and attempts to match it with a record
    from the stop/station location dataset using the names and routes.
    The function runs fuzzy search on the station name and the linename/routes and aggregates the results.
    Then it sorts the results in descending order by the match ratio of the name first, then the linename/route.
    Finally, it returns the result with the highest name and line/route match ratio.
    Returns a dictionary in the format:
    {'station': '59 ST',           == Corresponds to the STATION column in the turnstile dataset
     'name_match': ('59St', 100),  == Corresponds to the matched stop name key & the match ratio
     'line_match': 'NWR456',       == Corresponds to the lines/Routes
     'stop_name_match': '59 St',   == Corresponds to the actual, unmodified stop name
     'line_fuzz_ratio': 77}        == Corresponds to the match ratio of the Routes against the LINENAME from the
                                      turnstile dataset
    """
    # Preprocess the station name to increase the accuracy of fuzzy matching
    station_name = row["STATION"].replace(' ', '').replace('-', '')
    station_routes = row["LINENAME"]
    # First, get the potential matches.
    candidates = process.extract(station_name,
                                 station_stop_names, scorer=fuzz.token_sort_ratio, limit=10)
    # Stores the results of all the line matches for each name match
    records = []
    for x in candidates:
        # Find the corresponding record(s) from the station locations dataframe
        candidate_match = stationsdf[stationsdf["Stop Name"] == station_stops[x[0]]]["Daytime Routes"].unique()
        for y in candidate_match:
            records.append({
                "station": row["STATION"],
                "name_match": x,
                "line_match": y,
                "station_line": station_routes,
                "stop_name_match": station_stops[x[0]],
                "line_fuzz_ratio": fuzz.token_sort_ratio(station_routes, y)
            })
    # Sort the results by name first, line/route second, in descending order, and return the first result
    return sorted(records, key=lambda x: (x["name_match"][1], x["line_fuzz_ratio"]), reverse=True)[0]

# Now we apply the function to our dataframes
results = df[['STATION','LINENAME']].drop_duplicates().reset_index().apply(match_records, axis=1)
Okay, we’re close! There are a couple of stations that aren’t matching correctly, or don’t exist in our locations dataset. We’ll have to do some manual vetting to take care of those (don’t worry, I did it myself).
Now, let’s clean our data, fix the wrong matches, and use the mapping to geotag the turnstile dataset.
# Here's a list of items that weren't a perfect match on both station name and routes *suppressed to the first 2*
# We'll review these and remove the ones we can't fix, or update the mismatches
[x for x in results if x['line_fuzz_ratio'] < 100 and x["name_match"][1] < 100][:2]
[{'station': 'WHITEHALL S-FRY',
'name_match': ('WhitehallSt', 83),
'line_match': 'RW1',
'station_line': 'R1W',
'stop_name_match': 'Whitehall St',
'line_fuzz_ratio': 67},
{'station': 'DELANCEY/ESSEX',
'name_match': ('DelanceySt', 75),
'line_match': 'JMZF',
'station_line': 'FJMZ',
'stop_name_match': 'Delancey St',
'line_fuzz_ratio': 75}]
# Fix missed matches here:
# the 'patched' key is there to identify records that were manually corrected
mismatches = {
'FLATBUSH AV-B.C': {'station': 'FLATBUSH AV-B.C',
'name_match': ('Flatbush Av - Brooklyn College', 61),
'line_match': '25',
'station_line': '25',
'stop_name_match': 'Flatbush Av - Brooklyn College',
'line_fuzz_ratio': 100,
'patched': True},
'ROCKAWAY PARK B': {'station': 'ROCKAWAY PARK B',
'name_match': ('RockawayParkBeach116St', 100),
'line_match': 'AS',
'station_line': 'AS',
'stop_name_match': 'Rockaway Park - Beach 116 St',
'line_fuzz_ratio': 100,
'patched': True},
    # ... suppressed
}
# The locations of these aren't in the dataset, so we'll ignore them.
# I don't know what some of the names correspond to either
unknowns = [
"RIT-ROOSEVELT",
"RIT-MANHATTAN",
"ST. GEORGE",
"ORCHARD BEACH",
"NEWARK HW BMEBE",
"HARRISON",
"JOURNAL SQUARE",
'EXCHANGE PLACE',
'PAVONIA/NEWPORT',
'CITY / BUS',
# ... surpressed
]
# Let's now filter out the unknowns and fix the mismatches
results_filtered = [x for x in results if x['station'] not in unknowns]
# and also, fix the mismatches
results_fixed = [mismatches[x['station']] if mismatches.get(x['station'], None) is not None else x for x in results_filtered]
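Side note: the patch-up step can be written a little more compactly with `dict.get`, which falls back to the record itself when there's no manual fix. A small helper sketch (the function name is mine; behavior is the same as the comprehension above):

```python
def apply_patches(records, patches):
    """Replace each record with its manual patch when one exists.

    `records` is a list of match dicts keyed by 'station';
    `patches` maps a station name to its hand-corrected record.
    """
    return [patches.get(r['station'], r) for r in records]
```

With the post's variables, that would read `results_fixed = apply_patches(results_filtered, mismatches)`.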
Alright, now it’s time to geotag each record. To do so, I’ll map station names and routes to a particular location, then use that mapping to update the turnstile records:
# Maps the station and line to the geolocations
_ = [
    x.update(
        stationsdf[
            (stationsdf["Stop Name"] == x["stop_name_match"]) &
            (stationsdf["Daytime Routes"] == x["line_match"])
        ]
        [["GTFS Latitude", "GTFS Longitude"]].head(1).to_dict('list')
    ) for x in results_fixed]
# Some records have the lat/long values in a list (like [40.824073], for example). The lines below fix that
_ = [
    x.update(
        {'GTFS Latitude': x['GTFS Latitude'][0] if type(x['GTFS Latitude']) is list else x['GTFS Latitude'],
         'GTFS Longitude': x['GTFS Longitude'][0] if type(x['GTFS Longitude']) is list else x['GTFS Longitude']}
    ) for x in results_fixed]
# I'll convert it into a df, then use 'station' and 'station_line' to join it on the turnstile dataset
mappingdf = pandas.DataFrame(results_fixed).drop(['name_match', 'line_match', 'stop_name_match', 'line_fuzz_ratio', 'patched'], axis=1)
pandas.merge(df, mappingdf, left_on=['STATION', 'LINENAME'], right_on=['station', 'station_line'], how='left')
| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS | station | station_line | GTFS Latitude | GTFS Longitude |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 00:00:00 | REGULAR | 7420920 | 2521129 | 59 ST | NQR456W | 40.762526 | -73.967967 |
1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 04:00:00 | REGULAR | 7420920 | 2521130 | 59 ST | NQR456W | 40.762526 | -73.967967 |
2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 08:00:00 | REGULAR | 7420928 | 2521141 | 59 ST | NQR456W | 40.762526 | -73.967967 |
3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 12:00:00 | REGULAR | 7420941 | 2521163 | 59 ST | NQR456W | 40.762526 | -73.967967 |
4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 06/06/2020 | 16:00:00 | REGULAR | 7420972 | 2521174 | 59 ST | NQR456W | 40.762526 | -73.967967 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
206657 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 05:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206658 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 09:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206659 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 13:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206660 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 17:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206661 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 06/12/2020 | 21:00:00 | REGULAR | 5554 | 514 | NaN | NaN | NaN | NaN |
206662 rows × 15 columns
Done! We’re all geotagged now!
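Well, almost: the unknowns we filtered out earlier show up as NaN coordinates in the merged frame (like the RIT-ROOSEVELT tram rows above), so a quick sanity check on how much of the data actually got tagged is worth it. A small helper sketch, assuming the merged result is stored in a dataframe like the one printed above:

```python
import pandas as pd

def untagged_fraction(geotagged: pd.DataFrame) -> float:
    """Fraction of turnstile rows that didn't pick up coordinates."""
    return geotagged["GTFS Latitude"].isna().mean()
```

If that fraction is more than a couple of percent, it's a sign that new station names have crept into the turnstile data and the mismatch/unknown lists need another pass.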
Improvements and Contributions Welcome!
I’ll have the code and dataset available here. Feel free to open issues and PRs for fixes/improvements.
That is all.