Bluebikes - Enabling Optimization of Bike Sharing Operations
The objective of this assignment is to explore visualization methods that can be applied to the trip data of Blue Bikes, a public bike-share system in Boston. The focus is on analyzing the flow of bikes moving in and out of the numerous stations, and more specifically on the balance (deficit or excess) of bikes at each station. This will eventually feed into the Shiny application development of the Visual Analytics Project.
Using a full month of data, we would like to visualize the distribution of deficit or excess across all the different stations. Ideally, there would be an option to customize district selection to see how the distribution changes.
For a good interactive experience, the application user should be able to compare balance data station by station; a side-by-side comparison of at least three stations would be useful.
It would also be interesting to visualize, at the finest level of detail, how the overall balance of bikes moves across the days and weeks.
Where possible, interactive visualizations are good to have, but they may not be necessary. We will look at this in closer detail.
First, we check whether the following packages have been installed: sf, tidyverse, lubridate, data.table, leaflet, plotly, leaflet.extras, psych, ggstatsplot, hrbrthemes, hms and infer. Any missing packages are installed before all of them are loaded.
packages = c('sf', 'data.table', 'tidyverse', 'leaflet', 'leaflet.extras',
             'lubridate', 'infer', 'ggstatsplot', 'hrbrthemes', 'hms',
             'plotly', 'psych')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}
Let us bring in the two Blue Bikes data sets. The first, which we label trip, contains the trip data for January 2020; the second, which we label station, summarizes the details of all available Blue Bikes stations.
trip <- read_csv('data/202001-bluebikes-tripdata.csv')
station <- read.csv('data/current_bluebikes_stations.csv', skip = 1)
Inspecting the structure of the station data set:
str(station)
'data.frame': 364 obs. of 7 variables:
$ Number : Factor w/ 364 levels "A32000","A32001",..: 15 339 162 337 264 159 291 349 82 297 ...
$ Name : Factor w/ 364 levels "175 N Harvard St",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Latitude : num 42.4 42.4 42.3 42.4 42.4 ...
$ Longitude : num -71.1 -71.1 -71.1 -71.1 -71.1 ...
$ District : Factor w/ 9 levels "Arlington","Boston",..: 2 8 2 8 4 2 4 5 2 4 ...
$ Public : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Total.docks: int 18 19 17 15 23 19 25 15 12 19 ...
Note that the station Name is stored as a factor. We will convert the corresponding station-name columns in the trip data set to factors as well, so that the two tables join without errors later.
Inspecting the structure of the trip data set:
str(trip)
spec_tbl_df [128,598 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ tripduration : num [1:128598] 478 363 284 193 428 ...
$ starttime : POSIXct[1:128598], format: "2020-01-01 00:04:05" ...
$ stoptime : POSIXct[1:128598], format: "2020-01-01 00:12:04" ...
$ start station id : num [1:128598] 366 219 219 396 60 372 36 36 36 36 ...
$ start station name : chr [1:128598] "Broadway T Stop" "Boston East - 126 Border St" "Boston East - 126 Border St" "Main St at Beacon St" ...
$ start station latitude : num [1:128598] 42.3 42.4 42.4 42.4 42.4 ...
$ start station longitude: num [1:128598] -71.1 -71 -71 -71.1 -71.1 ...
$ end station id : num [1:128598] 93 212 212 387 49 178 23 23 23 23 ...
$ end station name : chr [1:128598] "JFK/UMass T Stop" "Maverick Square - Lewis Mall" "Maverick Square - Lewis Mall" "Norman St at Kelvin St" ...
$ end station latitude : num [1:128598] 42.3 42.4 42.4 42.4 42.4 ...
$ end station longitude : num [1:128598] -71.1 -71 -71 -71.1 -71.1 ...
$ bikeid : num [1:128598] 6005 3168 3985 2692 4978 ...
$ usertype : chr [1:128598] "Customer" "Subscriber" "Subscriber" "Subscriber" ...
$ birth year : num [1:128598] 1969 2000 2001 1978 1987 ...
$ gender : num [1:128598] 0 1 1 1 1 1 0 0 0 0 ...
- attr(*, "spec")=
.. cols(
.. tripduration = col_double(),
.. starttime = col_datetime(format = ""),
.. stoptime = col_datetime(format = ""),
.. `start station id` = col_double(),
.. `start station name` = col_character(),
.. `start station latitude` = col_double(),
.. `start station longitude` = col_double(),
.. `end station id` = col_double(),
.. `end station name` = col_character(),
.. `end station latitude` = col_double(),
.. `end station longitude` = col_double(),
.. bikeid = col_double(),
.. usertype = col_character(),
.. `birth year` = col_double(),
.. gender = col_double()
.. )
Some modifications need to be made to the data set to enable good visualizations later:
# Calculate age from birth year
trip$Age <- 2020 - trip$`birth year`

# Recode gender from numeric codes to labels
trip$gender <- as.factor(recode(trip$gender,
                                `0` = 'Female', `1` = 'Male',
                                `2` = 'Prefer not to say'))

# Separate the date and time components
trip_1 <- trip %>%
  separate(starttime, into = c("start_date", "start_time"), sep = " ") %>%
  separate(stoptime, into = c("stop_date", "stop_time"), sep = " ")

# Format the date and time data types
trip_1$start_date <- ymd(trip_1$start_date)
trip_1$stop_date <- ymd(trip_1$stop_date)
trip_1$start_time <- as_hms(trip_1$start_time)
trip_1$stop_time <- as_hms(trip_1$stop_time)

# Convert the start and end station names to factors
trip_1$`start station name` <- as.factor(trip_1$`start station name`)
trip_1$`end station name` <- as.factor(trip_1$`end station name`)
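As a quick, optional sanity check on the transformations above (a sketch; it relies on the trip_1 object created in the previous chunk):

```r
# Verify the recoded gender levels and derived columns
table(trip_1$gender)      # counts for Female / Male / Prefer not to say
summary(trip_1$Age)       # implausibly high ages would flag placeholder birth years
class(trip_1$start_date)  # should be "Date"
class(trip_1$start_time)  # should be "hms" "difftime"
```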
The processed trip data set will be saved for use in subsequent analysis:
save(trip_1, station, file = "data/station_trip_processed.Rdata")
date_window_start <- ymd("2020-01-01")
date_window_stop <- ymd("2020-01-31")

# Unique values of dates and station names
dates <- seq(date_window_start, date_window_stop, by = 'days')
stations <- station$Name

# Create a full grid with all station-date combinations
date_station <- expand_grid(stations, dates)
names(date_station) <- c("start station name", "trip_date")
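Since expand_grid() builds the full Cartesian product, the grid should contain exactly one row per station-date pair; a quick check (assuming the 364 stations and 31 days above):

```r
# 364 stations x 31 days = 11,284 station-date rows
nrow(date_station) == length(stations) * length(dates)
```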
# Count the number of trips out of each station per day and save as a table
num_out_station_date <- trip_1 %>%
  group_by(`start station name`, start_date) %>%
  count()
num_out_station_date$start_date = ymd(num_out_station_date$start_date)
# Join the date_station grid to the num_out_station_date table
trips_out <- date_station %>%
  left_join(num_out_station_date,
            by = c("start station name" = "start station name",
                   "trip_date" = "start_date"))

# Replace NA counts (no trips that day) with zeros
trips_out$n <- trips_out$n %>% replace_na(0)
#A glimpse at the trips_out table
head(trips_out)
# A tibble: 6 x 3
`start station name` trip_date n
<fct> <date> <dbl>
1 175 N Harvard St 2020-01-01 7
2 175 N Harvard St 2020-01-02 16
3 175 N Harvard St 2020-01-03 22
4 175 N Harvard St 2020-01-04 10
5 175 N Harvard St 2020-01-05 8
6 175 N Harvard St 2020-01-06 10
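As a design note, the expand_grid / left_join / replace_na sequence can also be collapsed into one step with tidyr::complete(); a sketch (equivalent output assumed, though complete() would additionally retain any trip stations absent from the station table):

```r
# count() tallies trips per station-date; complete() then fills in the
# missing station-date combinations with n = 0
trips_out_alt <- trip_1 %>%
  count(`start station name`, start_date, name = "n") %>%
  complete(`start station name` = factor(station$Name),
           start_date = dates,
           fill = list(n = 0))
```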
# Repeat for in-trips: count the number of trips into each station per day
num_in_station_date <- trip_1 %>%
  group_by(`end station name`, stop_date) %>%
  count()
num_in_station_date$stop_date = ymd(num_in_station_date$stop_date)
# Join the date_station grid to the num_in_station_date table
trips_in <- date_station %>%
  left_join(num_in_station_date,
            by = c("start station name" = "end station name",
                   "trip_date" = "stop_date"))

# Replace NA counts (no trips that day) with zeros
trips_in$n <- trips_in$n %>% replace_na(0)
#A glimpse at the trips_in table
head(trips_in)
# A tibble: 6 x 3
`start station name` trip_date n
<fct> <date> <dbl>
1 175 N Harvard St 2020-01-01 8
2 175 N Harvard St 2020-01-02 15
3 175 N Harvard St 2020-01-03 17
4 175 N Harvard St 2020-01-04 15
5 175 N Harvard St 2020-01-05 12
6 175 N Harvard St 2020-01-06 11
Now, we combine the trips_in and trips_out tables and create a summary table that also shows the overall balance per station per date:
summary_balance <- trips_in %>%
  left_join(trips_out, by = c("start station name", "trip_date"))
names(summary_balance) <- c("station_name", "trip_date", "n_in", "n_out")
summary_balance <- summary_balance %>%
  mutate(balance = n_in - n_out)
head(summary_balance)
# A tibble: 6 x 5
station_name trip_date n_in n_out balance
<fct> <date> <dbl> <dbl> <dbl>
1 175 N Harvard St 2020-01-01 8 7 1
2 175 N Harvard St 2020-01-02 15 16 -1
3 175 N Harvard St 2020-01-03 17 22 -5
4 175 N Harvard St 2020-01-04 15 10 5
5 175 N Harvard St 2020-01-05 12 8 4
6 175 N Harvard St 2020-01-06 11 10 1
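Since every trip contributes exactly one departure and one arrival, the balances should net out to roughly zero across all stations and dates; a quick check (small deviations are expected from trips whose stations are missing from the station table, or whose start and stop fall on different sides of the month boundary):

```r
# Net balance across the whole month; expected to be close to zero
sum(summary_balance$balance)
```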
There is one more condition to handle. There are instances where both n_in and n_out are zero; this happens because not all stations in the station data set appear in the trip data for January 2020. These rows are filtered out so that they do not distort the statistical analysis later. Note that the filter also drops individual days on which an otherwise active station recorded no trips.
summary_balance <- summary_balance %>%
  filter(n_in != 0 | n_out != 0)
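To see which stations this removes entirely, i.e. stations with no January 2020 trips at all, one option is (a sketch):

```r
# Stations present in the station table but absent from the filtered data
inactive_stations <- setdiff(station$Name, unique(summary_balance$station_name))
length(inactive_stations)
```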
It would be interesting to eventually see the balance distributions at a District level. Therefore, let’s bring in the District names by using the station table and joining to the summary_balance table:
summary_balance_1 <- summary_balance %>%
  left_join(station, by = c("station_name" = "Name")) %>%
  select(station_name, trip_date, n_in, n_out, balance, District, Total.docks)
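Any station name that does not match exactly between the two tables will pick up an NA District in this join, which is worth checking before visualizing (a sketch):

```r
# Stations whose names failed to match the station table
summary_balance_1 %>%
  filter(is.na(District)) %>%
  distinct(station_name)
```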
This section is an exploration of the different visualizations that can be applied to the prepared data set. The objective of this section is to evaluate which are the most appropriate methods that could be utilized in the final Shiny application.
Let’s find a good way to visualize the distribution of the balance across stations, starting with a boxplot:
p <- ggplot(summary_balance_1) +
  geom_boxplot(aes(x = station_name, y = balance),
               outlier.size = 0.1)
ggplotly(p)
From this visualization one can roughly tell that the majority of median balances sit at or near zero, with a handful of extreme outliers. However, the boxplot is so densely packed with information that it is messy and hard to read, even when interactive. Given the large number of stations in the data set, it seems more prudent to let the user analyze subsets of the data. From a macro perspective, let's take a look at whether histograms give a better view of the distribution.
#Plotting a histogram using ggstatsplot
gghistostats(data = summary_balance_1, x = balance)

This is a good visual of the balance distribution for all stations across the month: roughly symmetric and centered on a mean balance of 0.01. Now, let's look at density plots:
# Density plot of balance, colored by District
ggplot(summary_balance_1, aes(balance, color = District)) +
  geom_density(na.rm = TRUE)

# Facet the density plots by District
ggplot(summary_balance_1, aes(balance)) +
  geom_density(na.rm = TRUE) +
  facet_wrap(~District)

Both these views of the density plot show that the distributions are very similar even across districts.
Let us also look at the distributions across districts and stations:
# Check the distribution across districts
ggbetweenstats(data = summary_balance_1,
               x = District,
               y = balance)

# Select three stations to compare side by side
station_compare <- summary_balance_1 %>%
  filter(station_name %in% c("175 N Harvard St", "191 Beacon St", "30 Dane St"))
ggbetweenstats(data = station_compare,
               x = station_name,
               y = balance)

This would be a good set of plots for the application to give a top line summary of the balance distribution across Districts. This can also be applied to compare specific stations side by side as an interactive feature.
Let us now look at what happens when the same visualizations are made interactive using ggplotly:
# Distribution across districts - interactive
p2 <- ggbetweenstats(data = summary_balance_1,
                     x = District,
                     y = balance)
ggplotly(p2)

# Three stations compared side by side - interactive
p3 <- ggbetweenstats(data = station_compare,
                     x = station_name,
                     y = balance)
ggplotly(p3)
It can be observed that some of the statistical results shown in the static versions are missing from the interactive ones. In this context it makes more sense to use the static versions: when comparing the balance across stations, the key statistical annotations that come with the static plot, such as the number of observations, means, F statistic and confidence intervals, make a big difference to the user's understanding, and the ability to zoom into specific data points is not critical here.
Now, let’s look at visualizing the trend of the balance across the month per station. For the purpose of testing the visualization, we will look at the station named 175 N Harvard St:
# Filter for a selected station and plot a line graph
harvard <- summary_balance %>%
  filter(station_name == "175 N Harvard St")
p <- ggplot(harvard) +
  geom_line(aes(x = trip_date, y = balance)) +
  theme_light()
ggplotly(p, dynamicTicks = TRUE) %>%
  rangeslider() %>%
  layout(hovermode = "x")
This interactive view helps the user drill into the data, examining balance fluctuations at a single station across the month and down to individual days using the range slider. We will include this feature in the application with full interactivity. Potentially, we could also show up to three stations at this level of detail, for users who want to compare balances day by day.
Finally, let’s look into whether histograms & density plots are helpful at a station level:
# Create a separate data set filtered for the 191 Beacon St station
beacon <- summary_balance %>%
  filter(station_name == "191 Beacon St")

# Density plot for the 191 Beacon St station
ggplot(beacon, aes(balance)) +
  geom_density(na.rm = TRUE)

# Histogram for the 191 Beacon St station
gghistostats(data = beacon, x = balance)

This per-station view of a density plot and histogram is not very helpful to the user compared with the station-comparison visualizations above, so it is not recommended for the final application.
For an application to allow its users to interact fully with the data at all levels, it is essential that there are features that enable both macro and micro analysis. This will ensure that the application caters to all types of users, not only those who want to get a sense of the big picture, but also individuals who need to deep-dive into the data.
There is also the realization that an interactive visualization is not always necessary. The key point is that the visualization should serve its purpose, even if it is a static view.
After an exploration of the visualization methods, the recommendations for features to be used in an application are as follows:
From a macro standpoint, it makes sense to have a visualization that shows the overall distribution of the data, with an option to filter by district.
It is recommended to have comparison features between districts and more importantly stations. This provides a good second-level look into the balance data.
Finally, it is crucial to have a feature that drills into the lowest level of detail in the balance data, down to individual days. This is where an interactive plot helps the user, as demonstrated earlier with the range-slider addition.

Looking ahead, more work could be done on exploring the relationship between balance and day of the week to identify trends and enable Bluebikes to further optimize its operations. As this project focuses on only one month of data, the application could be enhanced to allow users to upload any month's data set and perform the same balance analysis.
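As a starting point for the day-of-week idea, a sketch using lubridate::wday() on the summary_balance table prepared earlier:

```r
# Mean balance per weekday across all active stations
weekday_balance <- summary_balance %>%
  mutate(weekday = wday(trip_date, label = TRUE)) %>%
  group_by(weekday) %>%
  summarise(mean_balance = mean(balance), .groups = "drop")

ggplot(weekday_balance, aes(weekday, mean_balance)) +
  geom_col() +
  theme_light()
```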