'Sane'itizing MTA's Turnstile Datasets: Geotagging

Update: Looks like others have geotagged the turnstile dataset too.

A Backstory

In early 2019, I conducted research at BMCC on a noise pollution platform detailing the estimated noise pollution of neighborhoods in NYC (full paper on the platform here if you’re interested). Essentially, the platform inferred the level and type of noise in an area from public, ubiquitous datasets using machine learning. After familiarizing myself with the paper, I hypothesized a correlation between noise levels and the number of people within an area at any given time. As you might have guessed, it’s not easy to know the exact number of people in a neighborhood at a given time, but given that more than half of NYC uses public transportation, ridership data from the MTA would be helpful for testing my hypothesis. Dealing with the MTA dataset was not fun, though.

The Turnstile Dataset

The MTA turnstile dataset is a great asset for many analysis projects (see this for example), but it comes with a few issues: limited information on column definitions (I googled "control area" and saw something about airspace??), cumulative data counting backwards unexpectedly, weird-looking outliers, and all sorts of strangeness. Unfortunately, I wasted a lot of time trying to make this dataset somewhat useful (and it turns out I’m not the only one). Of course I tried googling, but, you know, scarcity and whatnot. Anyways, in an attempt to save the next gal/guy from burning hours away trying to tag the station location onto the MTA turnstile dataset, I’m sharing this NAAS, or notebook-as-a-sitepost. Haha, okay, I promise I’ll be less cheeky for the rest of the post.

I’d also like to experiment with creating a pipeline API for processing MTA datasets using some sort of app hosting platform with a free, limited option (like Heroku), to test ways of optimizing for limited resources. A bit like what game developers did with 8- and 16-bit hardware, but that’s for another time.

Introducing the Dataset

The turnstile dataset is available for public use at http://web.mta.info/developers/turnstile.html. Each dataset contains a week’s worth of aggregated, four-hour snapshots of the state of a counter within the station turnstiles. Let’s briefly look at it:

import pandas  # Good ol' pandas
import numpy as np
import urllib.request
from parsel import Selector  # Scrapy's parsel makes it easy to select html elements from a web document (https://github.com/scrapy/parsel)



# So, this function maps the date of the dataset to the dataset location. 
# It's not exactly necessary here, but I'm reusing code I already have, so bear with me
def get_available_datasets():
    """
    fetches links to available datasets from MTA turnstile website.
    returns a dict of  date: url pairs.
    date format example: Saturday, April 11, 2020
    """

    MTA_URL = 'http://web.mta.info/developers/turnstile.html'
    DATA_URL_IDENTIFIER = 'data/nyct/turnstile/turnstile_'
    MTA_DOMAIN = 'http://web.mta.info/developers/'

    with urllib.request.urlopen(MTA_URL) as response:
      home_page = Selector(response.read().decode("utf8"))

    links = home_page.xpath('//a[contains(@href, "{}")]'.format(DATA_URL_IDENTIFIER))
    return {link.xpath('text()').get(): MTA_DOMAIN + link.xpath('@href').get() for link in links}

  
# Okay, let's pull a dataset and create a dataframe. We'll use last week's data (Saturday, June 13, 2020)

url = get_available_datasets()['Saturday, June 13, 2020']  # Ideally, what's returned from this function should be cached, but it'll run once so ¯\_(ツ)_/¯

with urllib.request.urlopen(url) as response:
  df = pandas.read_csv(response)

df.head(5)

C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
0 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 00:00:00 REGULAR 7420920 2521129
1 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 04:00:00 REGULAR 7420920 2521130
2 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 08:00:00 REGULAR 7420928 2521141
3 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 12:00:00 REGULAR 7420941 2521163
4 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 16:00:00 REGULAR 7420972 2521174

Not bad, right? Besides the columns that aren’t immediately obvious (C/A, SCP, etc.), we can see that the first few rows are counts from the 59 ST station. Wait, which station? There are multiple stations at 59 ST in Manhattan.

Well, if we look at the lines serving that station (represented by the LINENAME column), we can guess that it must be the Lexington Avenue/59th Street station. You can also confirm this by looking for all the stations with 59 ST in their names. You’ll see that the one for Columbus Circle says “Columbus” in its name:

# Computer, I command you to give me all the Station names with "59 ST" in them
df[df["STATION"].str.contains("59 ST")]["STATION"].unique()
array(['59 ST', '5 AV/59 ST', '59 ST COLUMBUS'], dtype=object)

It would’ve been nice if it was geotagged though, right?

Yeah?

Well, let’s tag it then.

There’s a dataset with latitude/longitude pairs for stations from MTA that we can use to geotag our turnstile datasets. The dataset can be downloaded here. Unfortunately, the dataset isn’t consistent with the turnstile dataset, and we cannot simply join the two datasets on, say, the name of the station. So we need to seek alternative measures.

The easiest solution I could think of was fuzzy string matching on station names, along with a bit of manual pairing. Using this method, I was able to match most of the stations listed in the turnstile dataset with corresponding latitude/longitude coordinates. It wasn’t easy, though, and took a considerable number of hours.

For explanation’s sake, I’ll briefly take you through that journey. Don’t worry, I’ll try to make it bearable.
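Before we dive in, a word on the fuzzy matching itself. We’ll later lean on fuzzywuzzy’s token_sort_ratio scorer; here’s a rough, stdlib-only sketch of the idea behind it (my approximation for illustration, not fuzzywuzzy’s actual implementation): normalize both strings, sort their tokens, then compare the results for a 0-100 score.

```python
from difflib import SequenceMatcher

def approx_token_sort_ratio(a: str, b: str) -> int:
    """Rough approximation of a token-sort similarity score (0-100)."""
    def norm(s: str) -> str:
        # lowercase, split on spaces/hyphens/slashes, then sort the tokens
        return ' '.join(sorted(s.lower().replace('-', ' ').replace('/', ' ').split()))
    return round(SequenceMatcher(None, norm(a), norm(b)).ratio() * 100)

print(approx_token_sort_ratio("59 ST", "59 St"))                   # 100: case differences wash out
print(approx_token_sort_ratio("WHITEHALL S-FRY", "Whitehall St"))  # high, but not a perfect match
```

This is why normalized, token-sorted comparisons forgive the cosmetic differences between the two datasets’ naming conventions, while still penalizing genuinely different names.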

First, let’s get the station location dataset I mentioned earlier and take a look:

with urllib.request.urlopen("http://web.mta.info/developers/data/nyct/subway/Stations.csv") as response:
  stationsdf = pandas.read_csv(response)

stationsdf.head(5)

Station ID Complex ID GTFS Stop ID Division Line Stop Name Borough Daytime Routes Structure GTFS Latitude GTFS Longitude North Direction Label South Direction Label
0 1 1 R01 BMT Astoria Astoria - Ditmars Blvd Q N W Elevated 40.775036 -73.912034 NaN Manhattan
1 2 2 R03 BMT Astoria Astoria Blvd Q N W Elevated 40.770258 -73.917843 Ditmars Blvd Manhattan
2 3 3 R04 BMT Astoria 30 Av Q N W Elevated 40.766779 -73.921479 Astoria - Ditmars Blvd Manhattan
3 4 4 R05 BMT Astoria Broadway Q N W Elevated 40.761820 -73.925508 Astoria - Ditmars Blvd Manhattan
4 5 5 R06 BMT Astoria 36 Av Q N W Elevated 40.756804 -73.929575 Astoria - Ditmars Blvd Manhattan

Great, now we have quite a bit of location data for each station. Let’s see what shows up when we search for “59 st” in the stop name. That’ll give us an idea of whether we can join on the Stop Name/STATION columns:

# Computer, I am once again asking you to give me all the Station names with "59 St" in them
stationsdf[stationsdf["Stop Name"].str.contains("59 St")]

Station ID Complex ID GTFS Stop ID Division Line Stop Name Borough Daytime Routes Structure GTFS Latitude GTFS Longitude North Direction Label South Direction Label
6 7 613 R11 BMT Astoria Lexington Av/59 St M N W R Subway 40.762660 -73.967258 Queens Downtown & Brooklyn
7 8 8 R13 BMT Astoria 5 Av/59 St M N W R Subway 40.764811 -73.973347 Queens Downtown & Brooklyn
34 35 35 R41 BMT 4th Av 59 St Bk N R Subway 40.641362 -74.017881 Manhattan Coney Island - Bay Ridge
160 161 614 A24 IND 8th Av - Fulton St 59 St - Columbus Circle M A B C D Subway 40.768296 -73.981736 Uptown & The Bronx Downtown & Brooklyn
315 315 614 125 IRT Broadway - 7Av 59 St - Columbus Circle M 1 Subway 40.768247 -73.981929 Uptown & The Bronx Downtown
400 400 613 629 IRT Lexington Av 59 St M 4 5 6 Subway 40.762526 -73.967967 Uptown & The Bronx Downtown & Brooklyn

Yikes. Okay. So, a couple of issues here (funny thing is, I hadn’t noticed these issues until I started writing this. Data science, am I right?). The first is that the names in our location dataset aren’t the station names; they’re actually the names of the stops. I totally missed that earlier but, fortunately, they line up well enough with a lot of the names in the “STATION” column of our turnstile dataset. We can still use this dataset, but we’ll have to do more than just join on “STATION” and “Stop Name”.

Okay, let’s examine our station/stop locations dataset first. From our example above, we can see that some stations are broken down into multiple stops, with differing lines and routes (for example, the first and last rows). What you should also notice is that the Complex ID is the same for both rows. Coincidence? Well, let’s find out. (You’d think the Station ID would be the same, but nooope. I tried searching for a key to the column names, but found nothing, sigh.)

# What this one liner does is group the rows by "Complex ID", filter the groups that have more than one row, 
# and sort them so we can see the groups next to each other. 
stationsdf.groupby('Complex ID').filter(lambda g: len(g) > 1).sort_values("Complex ID", ascending=False)

Station ID Complex ID GTFS Stop ID Division Line Stop Name Borough Daytime Routes Structure GTFS Latitude GTFS Longitude North Direction Label South Direction Label
174 174 636 A41 IND 8th Av - Fulton St Jay St - MetroTech Bk A C F Subway 40.692338 -73.987342 Manhattan Euclid - Lefferts - Rockaways - Coney Island
24 25 636 R29 BMT Broadway Jay St - MetroTech Bk R Subway 40.692180 -73.985942 Manhattan Bay Ridge - 95 St
330 330 635 142 IRT Broadway - 7Av South Ferry M 1 Subway 40.702068 -74.013664 Uptown & The Bronx NaN
22 23 635 R27 BMT Broadway Whitehall St M R W Subway 40.703087 -74.012994 Uptown & Queens Brooklyn
127 128 630 L17 BMT Canarsie Myrtle - Wyckoff Avs Bk L Subway 40.699814 -73.911586 Manhattan Canarsie - Rockaway Parkway
... ... ... ... ... ... ... ... ... ... ... ... ... ...
462 461 461 R09 BMT Astoria Queensboro Plaza Q N W Elevated 40.750582 -73.940202 Astoria - Flushing Manhattan
166 167 167 A32 IND 8th Av - Fulton St W 4 St M A C E Subway 40.732338 -74.000495 Uptown - Queens Downtown & Brooklyn
167 167 167 D20 IND 6th Av - Culver W 4 St M B D F M Subway 40.732338 -74.000495 Uptown - Queens Downtown & Brooklyn
149 151 151 A12 IND 8th Av - Fulton St 145 St M A C Subway 40.824783 -73.944216 Uptown & The Bronx Downtown & Brooklyn
150 151 151 D13 IND Concourse 145 St M B D Subway 40.824783 -73.944216 Uptown & The Bronx Downtown & Brooklyn

86 rows × 13 columns

Okay. Let’s take some samples and test. We’ve already seen for Lexington Av/59 St that the “Daytime Routes” (AKA the trains that run there) mostly add up to the “LINENAME” in our turnstile dataset (except for the Q train, which I assume is there because of the access to the Lexington Av/63rd St station, which the Q serves. The Q also substituted for the W from 2010 to 2016; details here if you’re interested). Let’s try rows 3 & 4: Whitehall St and South Ferry. If you combine their “Daytime Routes”, you’ll see they correspond with the “LINENAME” values for “SOUTH FERRY” in the turnstile dataset:

df[df["STATION"] == "SOUTH FERRY"].head(1) 

C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
128358 R101 R001 02-00-00 SOUTH FERRY 1RW IRT 06/06/2020 01:00:00 REGULAR 7025506 283662

Great. So maybe we can use the LINENAME/Daytime Routes to geotag the turnstile dataset.

Let’s try it.

First, I’ll consolidate the “Daytime Routes” for the stops that are in the same station complex, and replace the “Daytime Routes” for each with the consolidated value. Afterward, we’ll check Lex Av/59th and South Ferry/Whitehall:

# this needs to run ONLY ONCE or you'll get a world of duplicates
stationsdf["Daytime Routes"] = stationsdf[["Complex ID","Daytime Routes"]].groupby("Complex ID")["Daytime Routes"].transform(lambda x: ''.join([y.replace(' ', '') for y in x]))
# Check Lex av/59th
stationsdf[stationsdf["Stop Name"].str.contains("59 St")]

Station ID Complex ID GTFS Stop ID Division Line Stop Name Borough Daytime Routes Structure GTFS Latitude GTFS Longitude North Direction Label South Direction Label
6 7 613 R11 BMT Astoria Lexington Av/59 St M NWR456 Subway 40.762660 -73.967258 Queens Downtown & Brooklyn
7 8 8 R13 BMT Astoria 5 Av/59 St M NWR Subway 40.764811 -73.973347 Queens Downtown & Brooklyn
34 35 35 R41 BMT 4th Av 59 St Bk NR Subway 40.641362 -74.017881 Manhattan Coney Island - Bay Ridge
160 161 614 A24 IND 8th Av - Fulton St 59 St - Columbus Circle M ABCD1 Subway 40.768296 -73.981736 Uptown & The Bronx Downtown & Brooklyn
315 315 614 125 IRT Broadway - 7Av 59 St - Columbus Circle M ABCD1 Subway 40.768247 -73.981929 Uptown & The Bronx Downtown
400 400 613 629 IRT Lexington Av 59 St M NWR456 Subway 40.762526 -73.967967 Uptown & The Bronx Downtown & Brooklyn
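A quick aside on the transform above: groupby(...).transform() broadcasts the group-level result back onto every row of the group, which is exactly why both stops of a complex end up with the same consolidated routes. If that behavior is unfamiliar, here’s a minimal toy version of the same step (the values are made up to mimic the Lex Av/59 St complex):

```python
import pandas as pd

toy = pd.DataFrame({
    "Complex ID": [613, 613, 8],
    "Daytime Routes": ["N W R", "4 5 6", "N W R"],
})

# join the routes of every stop in a complex, writing the result back to each row
toy["Daytime Routes"] = toy.groupby("Complex ID")["Daytime Routes"] \
    .transform(lambda x: ''.join(y.replace(' ', '') for y in x))

print(toy["Daytime Routes"].tolist())  # ['NWR456', 'NWR456', 'NWR']
```

Both rows of complex 613 get "NWR456", while the standalone complex keeps just its own routes, matching what we saw in the real dataset above.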

Great. Now let’s see if we can get somewhere with this. We know we can’t directly join because the stations don’t have the exact names. The next best thing is to fuzzy search. We’ll use fuzzywuzzy to do the fuzzy matching, so now we need a strategy to ensure our dataset is as accurate as possible.

The easiest strategy I can think of is to match the names first, then match the routes, and assume the highest-scoring pair is correct (we’ll test, of course). With this in mind, I can pair the closest name with the routes that are essentially the same.

Let’s see how that plays out:

from fuzzywuzzy import process, fuzz
from operator import itemgetter


# maps normalized stop names (spaces and hyphens removed) to the original stop names
station_stops = {name.replace(' ', '').replace('-', ''): name for name in stationsdf["Stop Name"]}

station_stop_names = station_stops.keys()

def match_records(row):
  """
  Receives a unique (STATION, LINENAME) row from the turnstile dataset and attempts to match it with a record
  from the stop/station location dataset using the names and Routes.
  
  The function runs fuzzy search on the station name and the linename/routes and aggregates the results.
  Then it sorts the results in descending order by the match ratio of the name first, then the linename/route.
  Finally, it returns the result with the highest name and line/route match ratio
  
  Returns a dictionary in the format 
    {'station': '59 ST',        == Corresponds to the STATION column in the turnstile dataset
   'name_match': ('59St', 100), == Corresponds to the matched stop name keys & the match ratio
   'line_match': 'NWR456',      == Corresponds to the lines/Routes 
   'stop_name_match': '59 St',  == Corresponds to the actual, unmodified stop name
   'line_fuzz_ratio': 77}       == Corresponds to the match ratio of the Routes with that of the LINENAME From the
                                   turnstile dataset 
  """
  
  # preprocess the station name to increase the accuracy of fuzzy matching
  station_name = row["STATION"].replace(' ', '').replace('-', '')
  
  station_routes = row["LINENAME"]
  
  # first, get the potential matches. 
  candidates = process.extract(station_name, 
                               station_stop_names, scorer=fuzz.token_sort_ratio, limit=10)
  
  
  # stores the results of all the line matches for each name match 
  records = []
 
  for x in candidates:
    
    # Find corresponding record(s) from the station locations dataframe 
    candidate_match = stationsdf[stationsdf["Stop Name"] == station_stops[x[0]]]["Daytime Routes"].unique()
    
    for y in candidate_match:
      records.append({
        "station": row["STATION"],
        "name_match": x,
        "line_match": y,
        "station_line": station_routes,
        "stop_name_match": station_stops[x[0]],
        "line_fuzz_ratio": fuzz.token_sort_ratio(station_routes, y)
      })
  
  # sorts the results by name first, line/route second, in descending order, and returns the first result
  return sorted(records, key=lambda x: (x["name_match"][1], x["line_fuzz_ratio"]), reverse=True)[0]

# Now we apply the function to our dataframes
results = df[['STATION','LINENAME']].drop_duplicates().reset_index().apply(match_records, axis=1)

Okay, we’re close! There are a couple of stations that aren’t matching correctly, or don’t exist in our locations dataset. We’ll have to do some manual vetting to take care of those (don’t worry, I did it myself).

Now, let’s clean our data, fix the wrong matches, and use the mapping to geotag the turnstile dataset.

# Here's a list of items that weren't a perfect match for both station names and routes *suppressed to the first 2*
# We'll review these and remove the ones we can't fix, or update the mismatches
[x for x in results if x['line_fuzz_ratio'] < 100 and x["name_match"][1] < 100][:2]
[{'station': 'WHITEHALL S-FRY',
  'name_match': ('WhitehallSt', 83),
  'line_match': 'RW1',
  'station_line': 'R1W',
  'stop_name_match': 'Whitehall St',
  'line_fuzz_ratio': 67},
 {'station': 'DELANCEY/ESSEX',
  'name_match': ('DelanceySt', 75),
  'line_match': 'JMZF',
  'station_line': 'FJMZ',
  'stop_name_match': 'Delancey St',
  'line_fuzz_ratio': 75}]
# Fixed missed matches here:
# the patched key is there to identify records that were manually corrected
mismatches = {
 'FLATBUSH AV-B.C': {'station': 'FLATBUSH AV-B.C',
  'name_match': ('Flatbush Av - Brooklyn College', 61),
  'line_match': '25',
  'station_line': '25',
  'stop_name_match': 'Flatbush Av - Brooklyn College',
  'line_fuzz_ratio': 100,
  'patched': True},
  
  'ROCKAWAY PARK B': {'station': 'ROCKAWAY PARK B',
  'name_match': ('RockawayParkBeach116St', 100),
  'line_match': 'AS',
  'station_line': 'AS',
  'stop_name_match': 'Rockaway Park - Beach 116 St',
  'line_fuzz_ratio': 100,
  'patched': True},
# ... suppressed
}

# The locations of these aren't in the dataset, so we'll ignore them.
# I don't know what some of the names correspond to either
unknowns = [
  "RIT-ROOSEVELT",
  "RIT-MANHATTAN",
  "ST. GEORGE",
  "ORCHARD BEACH",
  "NEWARK HW BMEBE",
  "HARRISON",
  "JOURNAL SQUARE",
  'EXCHANGE PLACE',
  'PAVONIA/NEWPORT',
  'CITY / BUS',
  # ... suppressed
]
# Let's now filter out the unknowns
results_filtered = [x for x in results if x['station'] not in unknowns]

# and also, fix the mismatches
results_fixed = [mismatches.get(x['station'], x) for x in results_filtered]

Alright, now it’s time to geotag each record. To do so, I’ll map station names and routes to a particular location, then use that mapping to update the turnstile records.

# maps the station and line to the geolocations

_ = [
  x.update(
    stationsdf[
      (stationsdf["Stop Name"] == x["stop_name_match"]) & 
      (stationsdf["Daytime Routes"] == x["line_match"])
    ]
    [["GTFS Latitude", "GTFS Longitude"]].head(1).to_dict('list')
  ) for x in results_fixed]

# Some records have the lat/long values in a single-element list (like [40.824073]). The lines below fix that
_ = [
  x.update(
    {'GTFS Latitude': x['GTFS Latitude'][0] if type(x['GTFS Latitude']) is list else x['GTFS Latitude'],
     'GTFS Longitude': x['GTFS Longitude'][0] if type(x['GTFS Longitude']) is list else x['GTFS Longitude']}
  ) for x in results_fixed]


# I'll convert it into a df, then use 'station' and 'station_line' to join it on the turnstile dataset
mappingdf = pandas.DataFrame(results_fixed).drop(['name_match', 'line_match', 'stop_name_match', 'line_fuzz_ratio', 'patched'], axis=1)
pandas.merge(df, mappingdf, left_on=['STATION', 'LINENAME'], right_on=['station', 'station_line'], how='left')

C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS station station_line GTFS Latitude GTFS Longitude
0 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 00:00:00 REGULAR 7420920 2521129 59 ST NQR456W 40.762526 -73.967967
1 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 04:00:00 REGULAR 7420920 2521130 59 ST NQR456W 40.762526 -73.967967
2 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 08:00:00 REGULAR 7420928 2521141 59 ST NQR456W 40.762526 -73.967967
3 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 12:00:00 REGULAR 7420941 2521163 59 ST NQR456W 40.762526 -73.967967
4 A002 R051 02-00-00 59 ST NQR456W BMT 06/06/2020 16:00:00 REGULAR 7420972 2521174 59 ST NQR456W 40.762526 -73.967967
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
206657 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 06/12/2020 05:00:00 REGULAR 5554 514 NaN NaN NaN NaN
206658 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 06/12/2020 09:00:00 REGULAR 5554 514 NaN NaN NaN NaN
206659 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 06/12/2020 13:00:00 REGULAR 5554 514 NaN NaN NaN NaN
206660 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 06/12/2020 17:00:00 REGULAR 5554 514 NaN NaN NaN NaN
206661 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 06/12/2020 21:00:00 REGULAR 5554 514 NaN NaN NaN NaN

206662 rows × 15 columns

Done! We’re all geotagged now!
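Well, almost: because we merged with how='left', any turnstile row that never found a mapping (like the RIT-ROOSEVELT tram rows above) keeps NaN coordinates, so a quick way to audit the result is to count those rows. Here’s a toy sketch of that check (hypothetical rows; in the real notebook you’d run it on the merged frame from the cell above):

```python
import pandas as pd

turnstile = pd.DataFrame({
    "STATION": ["59 ST", "RIT-ROOSEVELT"],
    "LINENAME": ["NQR456W", "R"],
})
mapping = pd.DataFrame({
    "station": ["59 ST"],
    "station_line": ["NQR456W"],
    "GTFS Latitude": [40.762526],
    "GTFS Longitude": [-73.967967],
})

merged = turnstile.merge(mapping, left_on=["STATION", "LINENAME"],
                         right_on=["station", "station_line"], how="left")

# rows with NaN coordinates are the ones that never matched a station location
unmatched = merged[merged["GTFS Latitude"].isna()]["STATION"].tolist()
print(unmatched)  # ['RIT-ROOSEVELT']
```

If anything besides the known unknowns shows up in that list, it’s a sign a station name slipped through the fuzzy matching and needs a manual patch.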

Improvements and Contributions Welcome!

I’ll have the code and dataset available here. Feel free to open issues and PRs for fixes/improvements.

That is all.