NYC Citi Bike Share Program Analysis. Part 1: Performing analysis

January 15, 2023 | Andrew Kotovets

Project background

Since 2013, the Citi Bike Program has implemented a robust infrastructure for collecting data on the program's utilization. Through the team's efforts, each month bike data is collected, organized, and made public on the "Citi Bike Data" webpage.

This project involved extracting a few NYC Citi Bike datasets and using to visualize and determine answers to company leadership questions.

The data was extracted from the website, transformed in a Jupyter Notebook using Python and Pandas then cleaned in a BigQuery using SQL queries and pushed out to a combined csv to then load into Tableau.

Each of the visualizations created contains interactive features for users to explore the different trends related to short-term customers and annual Citi Bike members, peak hours and days of the week, and the breakdown of bike types and seasons (winter and summer).

A report of findings you can read here.

Some of the questions answered include:

How many trips have been recorded total during the chosen period?
By what percentage has total ridership grown?
How has the proportion of short-term customers and annual members changed?
What are the peak hours in which bikes are used during summer months?
What are the peak hours in which bikes are used during winter months?
What are the top 10 stations in the city for starting a journey/ ending a journey?
What are the bottom 10 stations in the city for starting a journey/ ending a journey?
What is the bikes type breakdown of all participants, short-term customers and annual members?
How does the average trip duration change?

Datasets Downloaded: (3 months of Winter 2021-2022 & 3 months of Summer 2022)

202112-citibike-tripdata.csv
202201-citibike-tripdata.csv
202202-citibike-tripdata.csv
202206-citibike-tripdata.csv
202207-citibike-tripdata.csv
202208-citibike-tripdata.csv

Note: The datasets used are not included in this folder due to storage space, but can be downloaded from this website directly.

Performing analysis

Transforming dataset with Python and Pandas

Use Pandas to calculate the difference between two dates (Started time and Ended time). Check the format of the data: df.info()

Convert the dates to datetime format:

df['started_at'] = pd.to_datetime(df['started_at'], errors='coerce')
df['ended_at'] = pd.to_datetime(df['ended_at'], errors='coerce')

Calculate the difference between two dates as a timedelta and creating a new column 'tripduration': df['tripduration'] = df['ended_at'] - df['started_at']
Calculate the difference between two dates in seconds and creating a new column 'tripduration_seconds': df['tripduration_seconds'] = (df['ended_at'] - df['started_at']).dt.total_seconds()