Visual Analytics Assignment

Bluebikes - Enabling Optimization of Bike Sharing Operations

Author

Affiliation

Vikram Shashank Dandekar

 

Published

April 5, 2021

DOI

Overview

Purpose of Visual Analytics Assignment

The objective of this assignment is to explore the methods of visualization that can be applied to the trip data of Blue Bikes, which is a public share bike system in Boston. The focus of this assignment is on analyzing the flow of bikes moving in and out of the numerous stations that have been set up. More specifically, we would like to look at the balance (deficit or excess) of bikes at the stations. This would eventually link to the Shiny application development of the Visual Analytics Project.

Visual Components to be explored

  1. Using a full month of data, we would like to visualize the distribution of deficit or excess across all the different stations. Ideally, there would be an option to customize district selection to see how the distribution changes.

  2. For a good interactive experience, the application user should be able to compare station to station data for balance. A comparison of at least three stations side by side would be useful.

  3. It would be interesting to visualize to the smallest detail, the movement in overall balance of bikes throughout the days and weeks.

Where possible, interactive visualizations are good to have, but they may not be necessary. We will look at this in closer detail.

Step by Step Data Preparation

Installing packages

First, we will check for the following packages to see whether they have been installed - sf, tidyverse, lubridate, data.table, leaflet, plotly, leaflet.extras, psych, ggstatsplot, hrbrthemes, hms and infer. If not installed, R will go ahead to install these packages before launching them.

packages = c('sf', 'data.table','tidyverse','leaflet','leaflet.extras','lubridate', 'infer', 'ggstatsplot', 'hrbrthemes', 'hms', 'plotly', 'psych')
for (p in packages){
  if(!require(p, character.only = T)){
  install.packages(p)
  }
  library(p,character.only= T )
}

Importing the Data

Let us bring in two data sets of Blue Bikes. The first data set, which we will label trip is the trip data for the month of January 2020, while the second data set which we will label station summarizes the details of all the Blue Bikes stations that are available.

trip <- read_csv('data/202001-bluebikes-tripdata.csv')
station <- read.csv('data/current_bluebikes_stations.csv', skip = 1)

Data Wrangling & Preparation

Inspecting the structure of the station data set:

str(station)
'data.frame':   364 obs. of  7 variables:
 $ Number     : Factor w/ 364 levels "A32000","A32001",..: 15 339 162 337 264 159 291 349 82 297 ...
 $ Name       : Factor w/ 364 levels "175 N Harvard St",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Latitude   : num  42.4 42.4 42.3 42.4 42.4 ...
 $ Longitude  : num  -71.1 -71.1 -71.1 -71.1 -71.1 ...
 $ District   : Factor w/ 9 levels "Arlington","Boston",..: 2 8 2 8 4 2 4 5 2 4 ...
 $ Public     : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ Total.docks: int  18 19 17 15 23 19 25 15 12 19 ...

It is noted that the Station name is a factor type. We will need to ensure that the same is true for the trip data set, so that when we join the data together, there will be no errors.

Inspecting the structure of the trip data set:

str(trip)
spec_tbl_df [128,598 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ tripduration           : num [1:128598] 478 363 284 193 428 ...
 $ starttime              : POSIXct[1:128598], format: "2020-01-01 00:04:05" ...
 $ stoptime               : POSIXct[1:128598], format: "2020-01-01 00:12:04" ...
 $ start station id       : num [1:128598] 366 219 219 396 60 372 36 36 36 36 ...
 $ start station name     : chr [1:128598] "Broadway T Stop" "Boston East - 126 Border St" "Boston East - 126 Border St" "Main St at Beacon St" ...
 $ start station latitude : num [1:128598] 42.3 42.4 42.4 42.4 42.4 ...
 $ start station longitude: num [1:128598] -71.1 -71 -71 -71.1 -71.1 ...
 $ end station id         : num [1:128598] 93 212 212 387 49 178 23 23 23 23 ...
 $ end station name       : chr [1:128598] "JFK/UMass T Stop" "Maverick Square - Lewis Mall" "Maverick Square - Lewis Mall" "Norman St at Kelvin St" ...
 $ end station latitude   : num [1:128598] 42.3 42.4 42.4 42.4 42.4 ...
 $ end station longitude  : num [1:128598] -71.1 -71 -71 -71.1 -71.1 ...
 $ bikeid                 : num [1:128598] 6005 3168 3985 2692 4978 ...
 $ usertype               : chr [1:128598] "Customer" "Subscriber" "Subscriber" "Subscriber" ...
 $ birth year             : num [1:128598] 1969 2000 2001 1978 1987 ...
 $ gender                 : num [1:128598] 0 1 1 1 1 1 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   tripduration = col_double(),
  ..   starttime = col_datetime(format = ""),
  ..   stoptime = col_datetime(format = ""),
  ..   `start station id` = col_double(),
  ..   `start station name` = col_character(),
  ..   `start station latitude` = col_double(),
  ..   `start station longitude` = col_double(),
  ..   `end station id` = col_double(),
  ..   `end station name` = col_character(),
  ..   `end station latitude` = col_double(),
  ..   `end station longitude` = col_double(),
  ..   bikeid = col_double(),
  ..   usertype = col_character(),
  ..   `birth year` = col_double(),
  ..   gender = col_double()
  .. )

It is apparent that some modifications need to be made to the data set to enable a good visualization later:

#Calculating Age from Birth year
trip$Age <- 2020 - trip$'birth year'

#Modifying Gender data
trip$gender <- as.factor(recode(trip$gender, '0' = 'Female', '1' = 'Male', '2' = 'Prefer not to say'))

#Separating date and time data
trip_1 <- trip %>%
  separate(starttime, into = c("start_date", "start_time"), sep = " ") %>%
  separate(stoptime, into = c("stop_date", "stop_time"), sep = " ")

#Formatting date and time data types
trip_1$start_date <- ymd(trip_1$start_date)
trip_1$stop_date <- ymd(trip_1$stop_date)
trip_1$start_time <- as_hms(trip_1$start_time)
trip_1$stop_time <- as_hms(trip_1$stop_time)

#Convert start station name and end station name to factor
trip_1$`start station name` <- as.factor(trip_1$`start station name`)
trip_1$`end station name` <- as.factor(trip_1$`end station name`)

The processed trip data set file will be saved for use in subsequent analysis

save(trip_1, station, file = "data/station_trip_processed.Rdata")
date_window_start <- ymd("2020-01-01")
date_window_stop <- ymd("2020-01-31")

# Unique values of date and station names
dates <- seq(date_window_start, date_window_stop, by = 'days')
stations <- station$Name

#Create a full table with all combinations
date_station <- expand_grid(stations, dates)
names(date_station) <- c("start station name", "trip_date")

#Count number of trips out and save as table
num_out_station_date <- trip_1 %>%
  group_by(`start station name`, start_date) %>%
  count()

num_out_station_date$start_date = ymd(num_out_station_date$start_date)

#Join the date_station table to the num_out_station_date table
trips_out <- date_station %>%
  left_join(num_out_station_date, by = c("start station name" = "start station name", "trip_date" = "start_date"))

#Replacing NA values with zeros
trips_out$n <- trips_out$n %>% replace_na(0)

#A glimpse at the trips_out table
head(trips_out)
# A tibble: 6 x 3
  `start station name` trip_date      n
  <fct>                <date>     <dbl>
1 175 N Harvard St     2020-01-01     7
2 175 N Harvard St     2020-01-02    16
3 175 N Harvard St     2020-01-03    22
4 175 N Harvard St     2020-01-04    10
5 175 N Harvard St     2020-01-05     8
6 175 N Harvard St     2020-01-06    10
#Repeating for in-trips, Count number of trips in and save as table
num_in_station_date <- trip_1 %>%
  group_by(`end station name`, stop_date) %>%
  count()

num_in_station_date$stop_date = ymd(num_in_station_date$stop_date)

#Join the date_station table to the num_in_station_date table
trips_in <- date_station %>%
  left_join(num_in_station_date, by = c("start station name" = "end station name", "trip_date" = "stop_date"))

#Replacing NA values with zeros
trips_in$n <- trips_in$n %>% replace_na(0)

#A glimpse at the trips_in table
head(trips_in)
# A tibble: 6 x 3
  `start station name` trip_date      n
  <fct>                <date>     <dbl>
1 175 N Harvard St     2020-01-01     8
2 175 N Harvard St     2020-01-02    15
3 175 N Harvard St     2020-01-03    17
4 175 N Harvard St     2020-01-04    15
5 175 N Harvard St     2020-01-05    12
6 175 N Harvard St     2020-01-06    11

Now, we combine the trips_in and trips_out tables and create a summary table that also shows the overall balance per station per date:

summary_balance <- trips_in %>% 
  left_join(trips_out, by = c("start station name", "trip_date"))

names(summary_balance) <- c("station_name", "trip_date", "n_in", "n_out")

summary_balance <- summary_balance %>%
  mutate(balance = n_in - n_out)

head(summary_balance)
# A tibble: 6 x 5
  station_name     trip_date   n_in n_out balance
  <fct>            <date>     <dbl> <dbl>   <dbl>
1 175 N Harvard St 2020-01-01     8     7       1
2 175 N Harvard St 2020-01-02    15    16      -1
3 175 N Harvard St 2020-01-03    17    22      -5
4 175 N Harvard St 2020-01-04    15    10       5
5 175 N Harvard St 2020-01-05    12     8       4
6 175 N Harvard St 2020-01-06    11    10       1

There is one more condition to eliminate. There are instances where both n_in and n-out are zero, this happens because not all the stations in the station data set may be present in the trip data set for the month of Jan 2020. Therefore, a filtering needs to be done to remove these data points so that they do not impact the statistical analysis later.

summary_balance <- summary_balance %>%
  filter(n_in!=0 | n_out!=0)

It would be interesting to eventually see the balance distributions at a District level. Therefore, let’s bring in the District names by using the station table and joining to the summary_balance table:

summary_balance_1 <- summary_balance %>%
  left_join(station, by = c("station_name" = 'Name')) %>%
  select("station_name",  "trip_date", "n_in", "n_out", "balance", "District", "Total.docks") 

Exploration of visualization methods

This section is an exploration of the different visualizations that can be applied to the prepared data set. The objective of this section is to evaluate which are the most appropriate methods that could be utilized in the final Shiny application.

Balance Across Stations and Districts

Let’s find a good way to visualize the distribution of the balance across stations, starting with a boxplot:

p <- ggplot(summary_balance_1) +
  geom_boxplot(aes(x = station_name, y = balance),
               outlier.size = 0.1)
ggplotly(p)
175 N Harvard St191 Beacon St30 Dane St359 Broadway - Broadway at Fayette Street699 Mt Auburn St700 Commonwealth Ave.700 Huron Ave75 Binney St84 Cambridgepark DrAirport T Stop - Bremen St at Brooks StAlbany St at E. Brookline StAlewife MBTA at Steel PlaceAlewife Station at Russell FieldAmerican Legion Hwy at Canterbury StAmes St at BroadwayAmes St at Main StAquarium T Stop - 200 Atlantic AveAshmont T StopAssembly Square TB.U. Central - 725 Comm. Ave.Back Bay T Stop - Dartmouth St at Stuart StBartlett St at John Elliot SqBeacon St at Tappan StBelgrade Ave at Walworth StBennington St at Byron StBennington St at Constitution BeachBerkshire Street at Cambridge StreetBinney St / Sixth StBlue Hill Ave at Almont StBoston City Hall - 28 State StBoston Convention and Exhibition Center - Summer St at West Side DrBoston East - 126 Border StBoston LandingBoston Public MarketBowdoin St at Quincy StBoylston St at Exeter StBoylston St at Jersey StBrigham Circle - Francis St at Huntington AveBrighton Center - Washington St at Cambridge StBrighton Mills - 370 Western AveBroadway at Beacham StBroadway at Central StBroadway at Lynde StBroadway at Maple StBroadway St at Mt Pleasant StBrookline Town HallBrookline Village - Station Street at MBTABunker Hill Community CollegeCambridge Dept. of Public Works -147 Hampshire St.Cambridge Main Library at Broadway / Trowbridge StCambridge St - at Columbia St / Webster AveCambridge St at Joy StCambridgeSide Galleria - CambridgeSide PL at Land BlvdCentral Sq Post Office / Cambridge City Hall at Mass Ave / Pleasant StCentral Square at Mass Ave / Essex StCentral Square East BostonCentre St at Parkway YMCACentre St at W. Roxbury Post OfficeCharles Circle - Charles St at Cambridge StCharlestown Navy YardChelsea St at Saratoga StChelsea St at Vine StChild St at North StChinatown Gate PlazaChinatown T StopChristian Science Plaza - Massachusetts Ave at Westland AveCleveland CircleCodman Square LibraryColleges of the Fenway - Fenway at Avenue Louis PasteurColumbia Rd at Ceylon StColumbia Rd at Tierney Community CenterCommonwealth Ave at Agganis WayCommonwealth Ave At Babcock StCommonwealth Ave at Chiswick RdCommonwealth Ave at Griggs StCommonwealth Ave at Kelton StCommunity Path at Cedar StreetCongress St at Boston City HallConway Park - Somerville AvenueCoolidge Corner - Beacon St at Centre StCopley Square - Dartmouth St at Boylston StCross St at Hanover StCurtis Hall - South St at Centre StDana ParkDanehy ParkDartmouth St at Newbury StDavis SquareDeerfield St at Commonwealth AveDiscovery Park - 30 Acorn Park DriveDorchester Ave at Gillette ParkDudley Square - Bolling BuildingEast Boston Neighborhood Health Center - 20 Maverick SquareEast Somerville Library (Broadway and Illinois)Edgerly Education CenterEdwards Playground - Main St at Eden StEF - North Point ParkEgleston Square - Atherton St at Washington StEncoreEverett Square (Broadway at Chelsea St)Fan PierFaneuil St at Arlington StFarragut Rd at E. 6th StForest HillsFoss ParkFour Corners - 157 Washington StFranklin Park Zoo - Franklin Park Rd at Blue Hill AveFresh Pond ReservationGallivan Blvd at Adams StGeiger Gibson Community Health CenterGlendale Square (Ferry St at Broadway)Government Center - Cambridge St at Court StGreen Street T Stop - Green St at Amory StGrove Hall Library - 41 Geneva AveGrove St at Community PathHarrison Ave at Bennet StHarrison Ave at Mullins WayHarvard Ave at Brainerd RdHarvard Kennedy School at Bennett St / Eliot StHarvard Law School at Mass Ave / Jarvis StHarvard Square at Brattle St / Eliot StHarvard Square at Mass Ave/ DunsterHarvard St at Greene-Rose Heritage ParkHarvard University / SEAS Cruft-Pierce Halls at 29 Oxford StHarvard University Housing - 115 Putnam Ave at Peabody TerraceHarvard University Radcliffe Quadrangle at Shepard St / Garden StHarvard University River Houses at DeWolfe St / Cowperthwaite StHMS/HSPH - Avenue Louis Pasteur at Longwood AveHonan LibraryHuntington Ave at Mass ArtHuron Ave At Vassal LaneHyde Park Ave at Walk Hill StHyde Square - Barbara St at Centre StID Building EastID Building WestInk Block - Harrison Ave at Herald StInman Square at Springfield St.Innovation Lab - 125 Western Ave at Batten WayJackson Square T StopJamaica St at South StJFK/UMass T StopKendall StreetKendall TKenmore SquareKennedy-Longfellow School 158 Spring StLafayette Square at Mass Ave / Main St / Columbia StLansdowne T StopLechmere Station at Cambridge St / First StLesley UniversityLewis Wharf at Atlantic AveLongwood Ave at Binney StLower Cambridgeport at Magazine St / Riverside RdMagoun Square at Trum FieldMain St at Austin StMain St at Baldwin StMain St at Thompson SqMass Ave at Albany StMass Ave at Hadley/WaldenMass Ave T StationMassachusetts Ave at Columbus AveMattapan LibraryMattapan T StopMaverick Square - Lewis MallMaverick St at Massport PathMedford St at Charlestown BCYFMIT at Mass Ave / Amherst StMIT Pacific St at Purrington StMIT Stata Center at Vassar St / Main StMIT Vassar StMLK Blvd at Washington StMt AuburnMugar Way at Beacon StMurphy Skating Rink - 1880 Day BlvdMuseum of ScienceNashua Street at Red Auerbach WayNCAAA - Walnut Ave at Crawford StNew Balance - 20 Guest StNewmarket Square T Stop - Massachusetts Ave at Newmarket SquareOak Square - 615 Washington StOne Brigham CircleOne Broadway / Kendall Sq at Main St / 3rd StOne Kendall Square at Hampshire St / Portland StOne Memorial DriveOrient Heights T Stop - Bennington St at Saratoga StPackard's Corner - Commonwealth Ave at Brighton AvePackard Ave at Powderhouse BlvdPark Dr at Buswell StPark Plaza at Charles St S.Perry ParkPorter Square StationPowder House Circle - Nathan Tufts ParkPrudential Center - 101 Huntington AvePurchase St at Pearl StRindge Avenue - O'Neill LibraryRing RdRogers St & Land BlvdRoslindale Village - South StRoslindale Village - Washington StRowes Wharf at Atlantic AveRoxbury Crossing T Stop - Columbus Ave at Tremont StRoxbury YMCA - Warren St at MLK BlvdRuggles T Stop - Columbus Ave at Melnea Cass BlvdS Huntington Ave at Heath StSavin Hill T Stop - S Sydney St at Bay StSeaport Blvd at Sleeper StSeaport Hotel - Congress St at Seaport LnSeaport Square - Seaport Blvd at Northern AveShawmut Ave at Oak St WShawmut T StopSoldiers Field Park - 111 Western AveSomerville City HallSouth Boston Library - 646 E BroadwaySouth End Library - Tremont St at W Newton StSouth Station - 700 Atlantic AveSpaulding Rehabilitation Hospital - Charlestown Navy YardSpring St at Powell StState Street at Channel CenterStony Brook T StopStuart St at Berkeley StStuart St at Charles StSullivan SquareSurface Rd at India StSurface Rd at Summer StSydney St at Carson StTalbot Ave At Blue Hill AveTappan St at Brookline Hills MBTAThe Dimock CenterThe Eddy - New St at Sumner StThe Lawn on DThird at BinneyTremont St at E Berkeley StTremont St at Hamilton PlTremont St at Northampton StTremont St at W. Dedham StTremont St at West StTroy BostonUnion Square - Brighton Ave at Cambridge StUnion Square - SomervilleUniversity of Massachusetts Boston - Campus CenterUniversity ParkVassal Lane at Tobin/VLUSVerizon Innovation Hub 10 Ware StreetW Broadway at D StW Broadway at Dorchester StWarren St at Chelsea StWashington St at Brock StWashington St at Lenox StWashington St at Melnea Cass BlvdWashington St at Rutland StWashington St at Talbot AveWashington St at Waltham StWatermark Seaport - Boston Wharf Rd at Seaport BlvdWentworth Institute of Technology - Huntington Ave at Vancouver StWest End ParkWhittier St Health CenterWilliams St at Washington StWilson Square-50-250255075
station_namebalance

From this visualization, one can roughly tell that the majority of the mean balances are at or near zero level, with a handful of extreme outliers. However, this boxplot visualization is so packed with information that it is messy & hard to read, even if it was interactive. Given the large number of stations in the data set, it seems to be more prudent to enable the user to analyze subsets of the data. From an macro perspective, let’s take a look at whether looking at the distribution using histograms would be a better visualization.

#Plotting a histogram using ggstatsplot
gghistostats(data = summary_balance_1, x = balance)

This is a good visual to show us the distribution of the balance for all the stations across the month. We can already see a symmetric t distribution centered on a mean balance of 0.01. Now, let’s look at density Plots:

#Plotting a density plot
ggplot(summary_balance_1, aes(balance, color = District))+
  geom_density(na.rm = TRUE)
#Faceting density plots by District
ggplot(summary_balance_1, aes(balance))+
  geom_density(na.rm = TRUE) +
  facet_wrap(~District)

Both these views of density plot help us see that even across districts, the distributions are very similar.

Let us also look at the distributions across districts and stations:

#Checking distribution across districts
ggbetweenstats(data = summary_balance_1, 
               x = District, 
               y = balance)
#Selecting three station names to compare side by side

station_compare <- summary_balance_1 %>%
  filter(station_name == "175 N Harvard St"| station_name == "191 Beacon St"| station_name == "30 Dane St")

ggbetweenstats(data = station_compare, 
               x = station_name, 
               y = balance)

This would be a good set of plots for the application to give a top line summary of the balance distribution across Districts. This can also be applied to compare specific stations side by side as an interactive feature.

Let us now look at what happens when the same visualizations are made interactive using ggplotly:

#Distribution across districts - Interactive
p2 <- ggbetweenstats(data = summary_balance_1, 
               x = District, 
               y = balance)

ggplotly(p2)
Boston(n = 4614)Brookline(n = 186)Cambridge(n = 1814)Everett(n = 155)Somerville(n = 583)-50050100
District balance
#Three station names to compare side by side - Interactive

p3 <- ggbetweenstats(data = station_compare, 
               x = station_name, 
               y = balance)

ggplotly(p3)
175 N Harvard St(n = 31)191 Beacon St(n = 31)30 Dane St(n = 27)-505
station_name balance

It can be observed that some of the statistical results are missing in the interactive version of the visualizations as compared to the static version. In this context, it would make more sense to use the static versions, because when comparing the balance across stations, the key statistical data that comes with the static plot makes a big difference to the user’s understanding - like number of observations, mean, F score & confidence intervals. In this scenario, it would not be critically necessary to have interactivity to zoom into specific data points.

Balance per Station across the month

Now, let’s look at visualizing the trend of the balance across the month per station. For the purpose of testing the visualization, we will look at the station named 175 N Harvard St:

#Filter for a selected station, and plot a line graph

harvard <- summary_balance %>%
  filter(station_name == "175 N Harvard St")

p <- ggplot(harvard) +
  geom_line(aes(x = trip_date, y = balance)) + theme_light()

ggplotly(p, dynamicTicks = TRUE) %>%
  rangeslider() %>%
  layout(hovermode = "x")

This interactive view will help the user to look into the data at a detailed level, looking at balance fluctuations at a station level across the month, days and hours using the range slider. We will include this feature into the application with full interactivity. Potentially, we can also look into having up to three stations balance data at this detailed level for the detail specific user to look into comparing balance across stations at a day/time level if required.

Finally, let’s look into whether histograms & density plots are helpful at a station level:

#Creating a separate data set filtered for 191 Beacon Street station
beacon <- summary_balance %>%
  filter(station_name == "191 Beacon St")

#Density plot for 191 Beacon Street station data
ggplot(beacon, aes(balance)) +
  geom_density(na.rm = TRUE)
#Histogram for 191 Beacon Street station data
gghistostats(data = beacon, x = balance)

This view of density plot & histogram per station is not very helpful to the user, as compared to the station comparison visualizations above. Therefore, this is not recommended to be used in the final application.

Reflections on Visualization methods

For an application to allow its users to interact fully with the data at all levels, it is essential that there are features that enable both macro and micro analysis. This will ensure that the application caters to all types of users, not only those who want to get a sense of the big picture, but also individuals who need to deep-dive into the data.

There is also the realization that it is not always necessary to have an interactive visualization. The key point is that the visualization should serve it’s purpose, even if it is a static view.

After an exploration of the visualization methods, the recommendations for features to be used in an application are as follows:

Sketch of Proposed Visualization

Future Work

Looking ahead, more work could be done on exploring the relationship of the balance to the days of the week to identify some trends and enable Bluebikes as a company to further optimize their operations. As this project is only focusing on 1 month of data, perhaps the application could be further enhanced to allow users to upload any month’s data set and perform the same balance analysis.

References

  1. http://hselab.org/use-r-to-explore-station-balance-in-bike-share-systems-and-learn-a-little-dplyr-programming.html

  2. https://nycdatascience.com/blog/student-works/data-visualization-and-analysis-of-nice-ride/

  3. https://ggplot2.tidyverse.org/reference/geom_qq.html

  4. https://data.library.virginia.edu/understanding-q-q-plots/

  5. https://mse.redwoods.edu/darnold/math15/spring2013/R/Activities/AssessingNormality1.html

  6. https://indrajeetpatil.github.io/ggstatsplot_slides/slides/ggstatsplot_presentation.html#74

  7. https://plotly-r.com/improving-ggplotly.html

  8. https://rpubs.com/aaronsc32/games-howell-test