Data Cleaning and Investigation

import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Chicago Divvy Bike Data

I downloaded Chicago Divvy data for the last quarter before the pandemic (Q4 of 2019) from: https://divvy-tripdata.s3.amazonaws.com/index.html

Map of the system is here: https://member.divvybikes.com/map/

List of stations can be found here: https://data.cityofchicago.org/Transportation/Divvy-Bicycle-Stations/bbyy-e7gq/data

The downloaded dataset has more than 700K rides, and the I am choosing to focus on rides centered around my campus in Hyde Park.

bike = pd.read_csv("Bike.csv")
bike.head(5)
trip_id start_time end_time bikeid tripduration from_station_id from_station_name to_station_id to_station_name usertype gender birthyear
0 25317046 10/7/19 13:39 10/7/19 14:05 4231 1533.0 424 Museum of Science and Industry 355 South Shore Dr & 67th St Customer NaN NaN
1 25577704 10/27/19 16:44 10/27/19 16:49 523 283.0 345 Lake Park Ave & 56th St 423 University Ave & 57th St Subscriber Male 1989.0
2 25823341 12/4/19 17:24 12/4/19 17:32 376 486.0 423 University Ave & 57th St 419 Lake Park Ave & 53rd St Subscriber Male 1993.0
3 25735449 11/19/19 18:32 11/19/19 18:35 5708 174.0 248 Woodlawn Ave & 55th St 322 Kimbark Ave & 53rd St Subscriber Male 1998.0
4 25716762 11/16/19 21:09 11/16/19 21:19 1993 633.0 322 Kimbark Ave & 53rd St 426 Ellis Ave & 60th St Subscriber Male 2001.0
#function to calculate the number of duplicated rows
def num_dupe_func(df):
    df_nodupe = df.drop_duplicates()
    return len(df) - len(df_nodupe)
num_dupe_func(bike)
20
#creating df with no duplicates
bikeND = bike.drop_duplicates()
#columns with missing values
print('columns with missing values:')
print(bikeND.isnull().sum())
columns with missing values:
trip_id                0
start_time             0
end_time               0
bikeid                 0
tripduration           0
from_station_id        0
from_station_name      2
to_station_id          0
to_station_name        2
usertype               0
gender               206
birthyear            199
dtype: int64
#missing rows
nullrows = bikeND.isnull().any(axis='columns')
print('# missing rows:', nullrows.sum())
# missing rows: 210
#investigating how missing values differ by user type
null_df = bikeND[nullrows]
null_df.head(10)
null_df.groupby(by='usertype').count()
trip_id start_time end_time bikeid tripduration from_station_id from_station_name to_station_id to_station_name gender birthyear
usertype
Customer 199 199 199 199 199 199 199 199 199 0 0
Subscriber 11 11 11 11 11 11 9 11 9 4 11

The table above counts the non null values in each column grouped by user type from a table containing only all the rows with null values. Based on the this, customers are only missing data in their gender and birthyear columns. Subscribers are missing two values ‘from station’ names and two values ‘to station’ names, and seven subscribers are missing their gender as well (all birth years are there).

Filling in missing station names using station ids

id_name = bikeND.iloc[:,[5,6]].drop_duplicates()
#find index and missing station names
filled_df = bikeND.iloc[:,[5,6,7,8]].fillna(0)
print(filled_df[filled_df.iloc[:,1]==0])
print(filled_df[filled_df.iloc[:,3]==0])
     from_station_id from_station_name  to_station_id      to_station_name
12               423                 0            418  Ellis Ave & 53rd St
175              418                 0            328  Ellis Ave & 58th St
     from_station_id         from_station_name  to_station_id to_station_name
44               423  University Ave & 57th St            426               0
276              345   Lake Park Ave & 56th St            423               0
#find station name based on station id
print(id_name[id_name.from_station_id==423])
print(id_name[id_name.from_station_id==418])
print(id_name[id_name.from_station_id==426])
    from_station_id         from_station_name
2               423  University Ave & 57th St
12              423                       NaN
     from_station_id    from_station_name
9                418  Ellis Ave & 53rd St
175              418                  NaN
    from_station_id    from_station_name
46              426  Ellis Ave & 60th St
bikeND.loc[12, 'from_station_name'] = 'University Ave & 57th St'
bikeND.loc[175, 'from_station_name'] = 'Ellis Ave & 53rd St'
bikeND.loc[44, 'to_station_name'] = 'Ellis Ave & 60th St'
bikeND.loc[276, 'to_station_name'] = 'University Ave & 57th St' 
#check that station names are imputed
bikeND.loc[[12,175,44,276],:]
trip_id start_time end_time bikeid tripduration from_station_id from_station_name to_station_id to_station_name usertype gender birthyear
12 25600923 10/29/19 17:49 10/29/19 17:54 2054 328.0 423 University Ave & 57th St 418 Ellis Ave & 53rd St Subscriber Male 1980.0
175 25507288 10/21/19 8:38 10/21/19 8:42 4392 288.0 418 Ellis Ave & 53rd St 328 Ellis Ave & 58th St Subscriber Female 2000.0
44 25601662 10/29/19 18:23 10/29/19 18:27 486 248.0 423 University Ave & 57th St 426 Ellis Ave & 60th St Subscriber Male 2000.0
276 25366919 10/10/19 8:10 10/10/19 8:15 875 292.0 345 Lake Park Ave & 56th St 423 University Ave & 57th St Subscriber Male 1986.0

Investigating the Data

What was the longest ride?

bikeND.sort_values(by = 'tripduration', ascending = False).head(1)
trip_id start_time end_time bikeid tripduration from_station_id from_station_name to_station_id to_station_name usertype gender birthyear
408 25575836 10/27/19 15:12 10/28/19 14:09 1187 82667.0 424 Museum of Science and Industry 653 Cornell Dr & Hayes Dr Customer NaN NaN

Where do people go when they take long trips?

#top 244 longest trips (top 10%)
sort_tripdur = bikeND.sort_values(by='tripduration', ascending = False).iloc[0:244]

#where do they go?
end_where = sort_tripdur.groupby(by='to_station_name').count().idxmax()

#select out msi
msi_df = sort_tripdur[sort_tripdur.to_station_id==424]

#when do they arrive? what day of the week?
go_when = msi_df.groupby(by='end_hour').count().idxmax()
week_when = msi_df.groupby(by='weekday').count().idxmax()

print('most common ending location:', end_where[0])
print('most common hour of arrival:', go_when[0])
print('most common weekday to go:', week_when[0])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-72f2d571538d> in <module>
      9 
     10 #when do they arrive? what day of the week?
---> 11 go_when = msi_df.groupby(by='end_hour').count().idxmax()
     12 week_when = msi_df.groupby(by='weekday').count().idxmax()
     13 

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   6519             squeeze=squeeze,
   6520             observed=observed,
-> 6521             dropna=dropna,
   6522         )
   6523 

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    531                 observed=observed,
    532                 mutated=self.mutated,
--> 533                 dropna=self.dropna,
    534             )
    535 

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    779                 in_axis, name, level, gpr = False, None, gpr, None
    780             else:
--> 781                 raise KeyError(gpr)
    782         elif isinstance(gpr, Grouper) and gpr.key is not None:
    783             # Add key to exclusions

KeyError: 'end_hour'

I chose to look into where people tended to go when they took longer trips. I sorted the dataframe by trip duration and selected the top 244 longest trips which would be the top 10%. From there, I looked at where people went by grouping the dataframe by the ‘to_station_name’ column. It turns out that the top 10% longest trips were to the Museum of Science and Industry. Next, I wanted to investigate when people got there and on what day of the week. Thus, I selected out the MSI as the final destination and I looked into when the most common time that people got there was and on what day of the week they went. Ultimately, it turns out the top 10% of longest rides are to the MSI on Sunday, arriving at 1 PM! Sounds like a good way to spend a Sunday :)

How do durations of trips differ between men and women?

females = bikeND[bikeND.gender=='Female']
males = bikeND[bikeND.gender=='Male']

bikeND.boxplot(by = 'gender', column = 'tripduration', figsize=(7,7)) 
plots.title('Trip Duration Grouped by Gender')
plots.ylabel('Trip Duration')
plots.xlabel('Gender')
plots.suptitle('')
plots.ylim(0,2000)
plots.show()

females.hist(column = 'tripduration', bins=1000)
plots.title('Trip Duration - Females')
plots.xlabel('Trip Duration')
plots.ylabel('Frequency')
plots.xlim(0,1500)
plots.show()

males.hist(column = 'tripduration', bins=1500)
plots.title('Trip Duration - Males')
plots.xlabel('Trip Duration')
plots.ylabel('Frequency')
plots.xlim(0,1500)
plots.show()
_images/data_cleaning_23_0.png _images/data_cleaning_23_1.png _images/data_cleaning_23_2.png

The boxplot shows us that men have a smaller IQR, and the median trip duration appears slightly shorter for men than women. Men also appear to have more outliers beyond the max. Based on the histograms, both men and women have similar distributions in their trip durations, but it is harder to discern differences between the genders here.

How does age (being in college vs not) impact trip duration?

no_nas = bikeND.dropna()
college = np.ones(no_nas.shape[0])
college[no_nas.birthyear<1997]=0 #not college
college[no_nas.birthyear>2001]=0 #not college
no_nas.insert(12, 'college', college)
no_nas.head(12)

students = no_nas[no_nas.college==1]
notstud = no_nas[no_nas.college==0]
students.hist(column = 'tripduration', bins=1000)
plots.title('Trip Duration - College')
plots.xlabel('Trip Duration')
plots.ylabel('Frequency')
plots.xlim(0,1500)
plots.show()

notstud.hist(column = 'tripduration', bins=150)
plots.title('Trip Duration - Not in College')
plots.xlabel('Trip Duration')
plots.ylabel('Frequency')
plots.xlim(0,1500)
plots.show()

no_nas.boxplot(by = 'college', column = 'tripduration', figsize=(7,7)) 
plots.title('Trip Duration Grouped by Age')
plots.suptitle('')
plots.ylabel('Trip Duration')
plots.xlabel('Not in College (0), College (1)')
plots.ylim(0,2000)
plots.show()
_images/data_cleaning_26_0.png _images/data_cleaning_26_1.png _images/data_cleaning_26_2.png

The histograms show similar distributions (both right skewed), however, college students generally take shorter rides as that distribution is very strongly right skewed. The boxplot supports what is seen in the histograms. The college student trip duration median is less than that of individuals not in college, and their IQR is also much smaller. The college max duration is also much lower than the non college max trip duration. Hence, college students generally take shorter rides.

#Calculating Age v Trip Duration
age = 2021-no_nas.birthyear
no_nas.insert(13, 'age', age)
plots.scatter(no_nas.age, no_nas.tripduration/60)
plots.ylim(0,125)
plots.xlabel('Age')
plots.ylabel('Trip Duration (min)')
plots.title('Trip Duration v Age')
plots.show()
_images/data_cleaning_29_0.png

There does not appear to be much of trend regarding trip duration and age, although most of the points are between ages 20-40. There also appears to be a substational amount of points between 40-60, but not as many. The trip duration is generally less than 20 minutes, and younger riders (20-40) take more longer trips than individuals over 40.

Exploring trip duration by day of the week

dow = pd.DatetimeIndex(bikeND['start_time']).day_name()
bikeND.insert(12, 'weekday', dow)
downum = pd.DatetimeIndex(bikeND['start_time']).dayofweek
bikeND.insert(13, 'weekday_num', downum)
dow_mean = bikeND.groupby(by='weekday_num').mean()
dow_mean['Week Day'] = ['M', 'Tu', 'W', 'Th', 'F', 'Sa', 'Su']
dow_mean.plot.bar(x = 'Week Day', y = 'tripduration')
plots.ylabel('Mean Trip Duration')
plots.title('Mean Trip Duration by Day of the Week')
plots.show()
_images/data_cleaning_33_0.png

The plot above shows the average trip duration by the day of the week. On average, it appears the longest trips are taken on Saturday and the shortest trips are taken on Wednesday. In general, the weekend (Fri, Sat, Sun) has the longest average trips which is to be expected since people have more time to go out and do things.

Trips to and from our main campus library

Station 423 is in front of the Regenstein Library. I want to know what the most common hour of the day is (10am, 11pm etc) to start and end a trip there.

All the trips that start between 10:00am and 10:59am are considered in the 10am group.

round_start = pd.to_datetime(bikeND['start_time']).dt.floor('H')
round_shr = pd.to_datetime(round_start).dt.time
bikeND.insert(14, 'start_hour', round_shr)
round_end = pd.to_datetime(bikeND['end_time']).dt.floor('H')
round_ehr = pd.to_datetime(round_end).dt.time
bikeND.insert(15, 'end_hour', round_ehr)
reg_df = bikeND[bikeND.from_station_id==423]
start_reg = reg_df.groupby(by='start_hour').count().idxmax()
end_reg = reg_df.groupby(by='end_hour').count().idxmax()
print('Most common start time:', start_reg[0])
print('Most common end time:', end_reg[0])
Most common start time: 17:00:00
Most common end time: 17:00:00

The most common start and end time regarding the Reg is 5 pm (17:00)

What day were the most rides taken?

start_date = pd.to_datetime(bikeND['start_time']).dt.date
bikeND.insert(16, 'start_date', start_date)
pop_start = bikeND.groupby(by='start_date').count().idxmax()
print('Most common start date:', pop_start[0])
Most common start date: 2019-10-03