US Interstate-94 Traffic: Analysis and Predictive Modeling
Interstate 94 (I-94) is a major east-west interstate highway in the US whose Minnesota stretch connects Minneapolis and St Paul. Because of its strategic importance, I-94 carries heavy traffic volumes that are influenced by factors such as time of day, weather conditions, and holidays. According to a report cited by WILX 10 (a television station based in Michigan, USA), it is also the deadliest highway in Michigan in terms of fatal car crashes. All of this makes it important for policymakers and commuters alike to understand what drives traffic volume along this highway and how congestion might be mitigated.
In this article, we will use a Jupyter environment to explore the Interstate Traffic Dataset (US) from Kaggle, which contains hourly traffic-volume data for westbound I-94. The data was collected by the Minnesota Department of Transportation (MnDOT) from 2012 to 2018 at a station roughly midway between the two cities. The article is split into two sections: Exploration, where we perform exploratory data analysis and answer key questions about the dataset, and Prediction, where we apply Machine Learning (ML) regression models to predict traffic volume from the other attributes in the dataset.
Exploration
Data Preprocessing
Before we proceed with our exploratory data analysis (EDA), we must do a bit of data cleaning. Duplicate rows must be identified and removed, the date-time attribute must be appropriately formatted, and missing values must be handled correctly.
import pandas as pd

df = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")

# Formatting date_time and sorting chronologically
df['date_time'] = pd.to_datetime(df['date_time'], format='%d-%m-%Y %H:%M')
df.sort_values('date_time', inplace=True)

# Remove duplicate rows
df = df.drop_duplicates()

# Handle missing values: categorical columns get a default or the mode,
# numeric columns get the mean/median, precipitation defaults to 0
df['holiday'] = df['holiday'].fillna('No Holiday')
df['temp'] = df['temp'].fillna(df['temp'].mean())
df['rain_1h'] = df['rain_1h'].fillna(0)
df['snow_1h'] = df['snow_1h'].fillna(0)
df['clouds_all'] = df['clouds_all'].fillna(df['clouds_all'].median())
df['weather_main'] = df['weather_main'].fillna(df['weather_main'].mode()[0])
df['weather_description'] = df['weather_description'].fillna(df['weather_description'].mode()[0])
Next, we will extract ‘year’, ‘month’, ‘day’, and ‘hour’ from the date_time attribute into separate columns. This gives us more granularity when analyzing variables in relation to time.
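A minimal sketch of this extraction, using the pandas datetime accessors on the parsed date_time column (whether ‘day’ means the weekday name or the day of month is an assumption here; the weekday name is used because the column is later treated as categorical):

# Extract calendar components from the parsed date_time column
df['year'] = df['date_time'].dt.year
df['month'] = df['date_time'].dt.month
df['day'] = df['date_time'].dt.day_name()  # weekday name, e.g. 'Monday'; use .dt.day for day of month instead
df['hour'] = df['date_time'].dt.hour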
Question 1: What does the distribution of rain and snow look like?
The first thing we would like to investigate is the distribution of rain and snow throughout the 7-year period. In other words, how much rain and snow (recorded in mm for each hour) fell along the I-94 highway from 2012 to 2018? We can answer this by plotting a line chart of total rain per month and observing the shape of the graph for potential outliers.
import matplotlib.pyplot as plt

# Substitute 'rain' with 'snow' to plot a similar chart for snow
# Create a new column for the year-month
df['year_month'] = df['year'].astype(str) + '-' + df['month'].astype(str).str.zfill(2)

# Aggregate rain by month
monthly_rain = df.groupby('year_month')['rain_1h'].sum().reset_index()

plt.figure(figsize=(12, 6))
plt.plot(monthly_rain['year_month'], monthly_rain['rain_1h'], marker='o', linestyle='-')
plt.xlabel('Year-Month')
plt.ylabel('Total Rain (mm)')
plt.title('Total Rain Over Time (Monthly)')
# only label months where total rain exceeds 8000 mm
xticks_labels = monthly_rain['year_month'].where(monthly_rain['rain_1h'] > 8000, '')
plt.xticks(ticks=range(len(monthly_rain)), labels=xticks_labels, rotation=0)
plt.tight_layout()
plt.show()
Based on the rain plot above, total rain per month rarely exceeded 1000 mm across the whole 7-year period, save for one peculiar month, July 2016, when the total was close to 10000 mm, which might indicate long periods of thunderstorms and heavy rain.
Similar to the rain distribution, the snow distribution stayed close to zero for almost the entire period. The only exceptions are December 2015 and January 2016, when total snow reached roughly 8 mm and 3 mm respectively, which may indicate sustained periods of heavy snow.
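To double-check these outliers numerically, a quick sketch (assuming the monthly_rain aggregate built above, plus an analogous monthly_snow aggregate) prints the months with the heaviest totals:

# Month with the highest total rain
print(monthly_rain.loc[monthly_rain['rain_1h'].idxmax()])

# Analogous aggregation for snow, then the two snowiest months
monthly_snow = df.groupby('year_month')['snow_1h'].sum().reset_index()
print(monthly_snow.sort_values('snow_1h', ascending=False).head(2))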
Question 2: What are the different categories of weather?
Next, we will explore the different types of weather conditions available in the dataset. The weather_main attribute contains the broad weather condition (Clouds, Rain, Snow, etc.) of each instance (recorded every hour), whereas the weather_description attribute describes each condition in more detail (overcast clouds, light rain, light snow, etc.). The sunburst plot below offers a detailed breakdown of each weather condition:
import pandas as pd
import plotly.graph_objs as go
import plotly.offline as pyo

# Group by 'weather_main' and collect the unique 'weather_description' values for each category
weather_categories = df.groupby('weather_main')['weather_description'].unique()

labels = []
parents = []
for category, subcategories in weather_categories.items():
    labels.append(category)
    parents.append("")
    for subcat in subcategories:
        labels.append(subcat)
        parents.append(category)

trace = go.Sunburst(
    labels=labels,
    parents=parents,
    branchvalues='total'
)
layout = go.Layout(
    margin=dict(t=0, l=0, r=0, b=0)
)
fig = go.Figure(trace, layout)
pyo.iplot(fig)
It is clear from the chart that Thunderstorm has the largest number of subcategories (9), followed by Rain and Snow, while Squall, Smoke, Mist, Haze, and Fog have only one subcategory each.
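The subcategory counts themselves can be verified with a short snippet (a sketch, assuming the same df used to build the sunburst):

# Count how many distinct weather_description values each weather_main category has
subcategory_counts = df.groupby('weather_main')['weather_description'].nunique()
print(subcategory_counts.sort_values(ascending=False))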
Question 3: How many different holidays are there?
Next, we shall explore the different holidays. Days that are holidays are far less frequent than regular days, so including the “None” category (i.e. regular days) would dominate the plot and obscure the holidays themselves. Our breakdown therefore covers only the holiday instances:
# Keep the full df intact; work on a filtered copy without 'None' values
holidays = df[df['holiday'] != 'None']

counts = holidays['holiday'].value_counts().reset_index()
counts.columns = ['holiday', 'count']

# Generate a list of colors (one color for each holiday)
colors = plt.cm.tab20(range(len(counts)))

plt.figure(figsize=(12, 8))
plt.barh(counts['holiday'], counts['count'], color=colors)
plt.xlabel('Count')
plt.ylabel('Holiday')
plt.title('Breakdown of Holidays (Excluding "None")')
plt.gca().invert_yaxis()  # Invert y-axis to have the highest count at the top
plt.show()
A total of 11 unique holidays were present in the dataset. The most frequent holiday by count is Labor Day (7 instances), with the remaining holidays appearing 6 or 5 times each, as shown in the bar chart above.
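If you prefer the raw numbers to the chart, the counts can be printed directly (assuming the filtered holidays frame defined in the plotting code above):

# Number of distinct holidays and how often each appears
print(holidays['holiday'].nunique())
print(holidays['holiday'].value_counts())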
Question 4: What is the distribution of traffic volume by hour?
Next, we would like to know at which hours of the day (over a 24-hour period) traffic volume is highest and lowest. This can be done by plotting the distribution of traffic volume for each hour using boxplots.
# One boxplot per hour of the day, with outliers hidden for readability
plt.figure(figsize=(12, 6))
bp = plt.boxplot([df[df['hour'] == h]['traffic_volume'] for h in range(24)], showfliers=False, patch_artist=True)
for box in bp['boxes']:
box.set_facecolor('orange')
for median in bp['medians']:
median.set_color('black')
plt.xlabel('Hour of the Day')
plt.ylabel('Traffic Volume')
plt.title('Traffic Volume Distribution by Hour of the Day')
plt.xticks(range(1, 25), labels=range(0, 24))
plt.grid(True)
plt.show()
As expected, traffic volume peaks during the morning rush when people drive to work (0600 to 0700 hrs) and again in the late afternoon when they return home (1600 to 1700 hrs). Traffic volume is lowest in the early hours of the morning, between 0200 and 0300 hrs.
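The same peaks and troughs can be read off numerically; a small sketch using the median traffic volume per hour:

# Median traffic volume for each hour of the day
hourly_median = df.groupby('hour')['traffic_volume'].median()
print(hourly_median.nlargest(3))   # busiest hours
print(hourly_median.nsmallest(3))  # quietest hours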
Question 5: What is the average traffic volume per month?
Next, we would like to find out the average traffic volume of each month from 2012 to 2018. We can do this by plotting a line chart of Average Traffic Volume against Year-Month to help us visualize which months and years contain the highest and lowest average traffic volumes.
# Aggregate traffic volume by month
df['year_month'] = df['date_time'].dt.to_period('M') # Create a new column for year-month
monthly_traffic = df.groupby('year_month')['traffic_volume'].mean().reset_index()
plt.figure(figsize=(12, 6))
plt.plot(monthly_traffic['year_month'].astype(str), monthly_traffic['traffic_volume'], marker='o', linestyle='-')
plt.xlabel('Year-Month')
plt.ylabel('Average Traffic Volume')
plt.title('Average Traffic Volume Over Time (Monthly)')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Several peaks and troughs can be observed in the line graph above. The highest peak occurred in May 2017, the month of Memorial Day, followed by the New Year months of January 2016 and January 2013. This is expected, since national holidays such as Memorial Day and New Year’s Day often bring heavier traffic than usual.
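The busiest months can also be listed directly from the monthly aggregate (a sketch assuming the monthly_traffic frame built above):

# Three months with the highest average traffic volume
print(monthly_traffic.nlargest(3, 'traffic_volume'))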
Question 6: Which factors contribute the most to traffic volume?
Lastly, and most importantly, we want to determine which of the attributes discussed above contribute the most to spikes in traffic volume. Since the relationships in this time series are non-linear, a simple correlation analysis would be insufficient. Instead, we can train a Random Forest and use its impurity-based (Gini) feature importances to rank how useful each attribute is for predicting the target label, which in this case is traffic volume.
from sklearn.ensemble import RandomForestRegressor

# x_train and y_train come from the train/test split described in the Prediction section below
model = RandomForestRegressor(random_state=42)
model.fit(x_train, y_train)

# Rank the features by their Gini (impurity-based) importance
feature_importances = model.feature_importances_
feature_names = x_train.columns
importance_dict = dict(zip(feature_names, feature_importances))
sorted_importances = sorted(importance_dict.items(), key=lambda x: x[1], reverse=True)

print("Feature Importances:")
for feature, importance in sorted_importances:
    print(f"{feature}: {importance:.4f}")

plt.figure(figsize=(10, 6))
plt.bar([x[0] for x in sorted_importances], [x[1] for x in sorted_importances])
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(True)
plt.show()
Based on the Gini importances obtained from the Random Forest regressor, the ‘hour’ attribute contributes by far the most to predicting traffic volume (importance: 0.8255). This is expected, since we previously observed a clear rise and fall in traffic volume across the hours of the day. Attributes such as weather, rain, snow, and holiday contribute very little, since they remain relatively constant for most of the period, apart from the occasional outlier.
Prediction
In the second part of the article, we perform regression analysis to predict traffic volume from the other attributes in the dataset, using a Random Forest as the regression model. We will work with the cleaned dataset produced by the preprocessing steps above.
Before feeding the data into the Random Forest, we must prepare it accordingly. First, we encode the categorical attributes into numerical ones using scikit-learn's LabelEncoder:
from sklearn.preprocessing import LabelEncoder

catcol = ['holiday', 'weather_main', 'weather_description', 'day']
encoder = LabelEncoder()
for col in catcol:
    df[col] = encoder.fit_transform(df[col])
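Note that the loop above re-fits a single LabelEncoder on each column, so only the mapping for the last column survives. If you later want to decode predictions or inspect the mappings, one encoder per column can be kept instead; a sketch of that alternative (not part of the original pipeline):

from sklearn.preprocessing import LabelEncoder

# Keep a dedicated encoder per categorical column so mappings can be inspected or inverted later
encoders = {}
for col in catcol:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Inspect the category -> integer mapping for one column
print(dict(zip(encoders['weather_main'].classes_,
               range(len(encoders['weather_main'].classes_)))))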
Next, we normalize the traffic_volume and temp attributes using the MinMaxScaler method. This ensures the values of these attributes fall within a fixed range (0 to 1 by default), keeping the features on comparable scales.
from sklearn.preprocessing import MinMaxScaler

st = MinMaxScaler()
df['traffic_volume'] = st.fit_transform(df[['traffic_volume']])
df['temp'] = st.fit_transform(df[['temp']])
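MinMaxScaler rescales each value as (x - min) / (max - min), so both columns should now span exactly 0 to 1; a quick sanity check:

# Verify that the scaled columns now lie in [0, 1]
print(df[['traffic_volume', 'temp']].agg(['min', 'max']))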
Then we split the data into features (x) and the target label (y), which are in turn split into training and test sets (15% of the data is held out for testing).
from sklearn.model_selection import train_test_split

# Drop the target plus the raw datetime helper columns, whose information is already
# captured by the extracted year/month/day/hour columns
x = df.drop(columns=['traffic_volume', 'date_time', 'year_month'])
y = df['traffic_volume']
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.85, shuffle=True, random_state=42)
Finally, we can train and test the Random Forest model and evaluate its performance with three key metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.8f}')
print(f'Mean Absolute Error: {mae:.6f}')
print(f'R-squared: {r2:.8f}')
The results obtained by the Random Forest model are as follows:
Mean Squared Error: 0.00302850
Mean Absolute Error: 0.030635
R-squared: 0.95976209
Both the MSE and MAE are very low on the normalized scale, indicating that the model's predictions sit very close to the actual values: since traffic_volume was min-max scaled to the range 0 to 1, an MAE of about 0.03 corresponds to roughly 3% of the full traffic range. The R-squared of about 0.96 means the model explains about 96% of the variance in traffic volume, indicating strong explanatory power and a very good fit.
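Because traffic_volume was scaled before training, the MAE is expressed as a fraction of the full traffic range. A sketch for translating it back into vehicles per hour, assuming a dedicated scaler (here called volume_scaler, a name introduced for illustration) is fitted on the raw column:

# Fit a dedicated scaler on the raw traffic_volume column so the error
# can be mapped back to vehicles per hour
volume_scaler = MinMaxScaler()
raw_volume = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")[['traffic_volume']]
volume_scaler.fit(raw_volume)

# data_range_ holds (max - min) of the raw column
mae_vehicles = mae * volume_scaler.data_range_[0]
print(f"MAE in vehicles per hour: {mae_vehicles:.0f}")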