Predicting Bus Delays with Machine Learning: A Practical Guide

Can we predict if your bus will be late before it even happens? I built an ML model that forecasts Dublin bus delays 15 minutes in advance with 87% accuracy. Here's exactly how I did it.

The Problem Worth Solving#

Every day, thousands of Dublin commuters stand at bus stops, uncertain whether their bus is running on time. The real-time apps show current delays, but by then it's too late—you're already waiting in the rain.

The question I wanted to answer: Can we use historical patterns and current conditions to predict delays before they happen?

Spoiler: Yes, and it's more accurate than you might expect.

The Data Foundation#

I built this on top of my Dublin Bus Real-Time Pipeline, which collects data from Transport for Ireland's GTFS-Realtime API.

Data Available#

Plain Text

1┌─────────────────────────────────────────────────┐
2│               GTFS-RT Data Points               │
3├─────────────────────────────────────────────────┤
4│ • Vehicle positions (lat/long every 30 sec)    │
5│ • Trip updates (delays at each stop)           │
6│ • Route information                            │
7│ • Timestamps and direction                     │
8└─────────────────────────────────────────────────┘

After a few hours of collection, I had:

100,000+ trip update records
700+ unique vehicles
198 routes covered

Feature Engineering: The Secret Sauce#

The raw data isn't directly usable for ML. The magic happens in feature engineering—transforming raw data into predictive signals.

Features I Created#

Plain Text

1def engineer_features(trip_data):
2    features = {
3        # Temporal features
4        'hour_of_day': trip_data['timestamp'].hour,
5        'day_of_week': trip_data['timestamp'].dayofweek,
6        'is_rush_hour': is_rush_hour(trip_data['timestamp']),
7        'is_weekend': trip_data['timestamp'].dayofweek >= 5,
8        
9        # Route features
10        'route_id_encoded': encode_route(trip_data['route_id']),
11        'direction': trip_data['direction_id'],
12        
13        # Historical features (most important!)
14        'route_historical_avg_delay': get_route_avg_delay(trip_data['route_id']),
15        'recent_delays_mean': get_recent_delays(trip_data, window=3),
16        'recent_delays_trend': get_delay_trend(trip_data, window=5),
17        
18        # Spatial features
19        'stop_sequence': trip_data['stop_sequence'],
20        'distance_to_city_centre': calculate_distance(trip_data['position'])
21    }
22    return features

Why These Features Matter#

| Feature | Importance | Reasoning | |---------|------------|-----------| | Recent delays (last 3 stops) | 34% | Delays propagate—if a bus is late, it usually stays late | | Time of day | 22% | Rush hours have predictably higher delays | | Route historical average | 18% | Some routes are consistently worse | | Day of week | 12% | Monday mornings are chaos | | Distance/position | 8% | City centre = more delays |

Model Selection: Why Gradient Boosting?#

I tested several approaches:

Approaches Considered#

Linear Regression - Too simple, can't capture non-linear patterns
Random Forest - Good baseline, but slow for real-time inference
XGBoost - Fast, handles mixed features well ✓
Neural Network - Overkill for this data size, harder to interpret

XGBoost Won Because:#

Plain Text

1# Fast inference (critical for real-time predictions)
2model = xgb.XGBRegressor(
3    n_estimators=100,
4    max_depth=6,
5    learning_rate=0.1,
6    subsample=0.8,
7    colsample_bytree=0.8
8)
9 
10# Training time: ~30 seconds
11# Inference time: <50ms per prediction

Key advantages:

Handles categorical features (route IDs) naturally
Built-in feature importance
Fast enough for real-time use
Robust to missing values

Training Pipeline#

Plain Text

1from sklearn.model_selection import train_test_split, cross_val_score
2import xgboost as xgb
3 
4# Prepare data
5X = df[feature_columns]
6y = df['arrival_delay_minutes']
7 
8# Split with temporal awareness (don't leak future data!)
9X_train, X_test, y_train, y_test = train_test_split(
10    X, y, test_size=0.2, shuffle=False  # No shuffle for time series
11)
12 
13# Train model
14model = xgb.XGBRegressor(
15    objective='reg:squarederror',
16    n_estimators=100,
17    max_depth=6,
18    learning_rate=0.1,
19    random_state=42
20)
21 
22model.fit(
23    X_train, y_train,
24    eval_set=[(X_test, y_test)],
25    early_stopping_rounds=10,
26    verbose=False
27)
28 
29# Cross-validation
30cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
31print(f"CV MAE: {-cv_scores.mean():.2f} ± {cv_scores.std():.2f}")

Results: Better Than Expected#

Model Performance#

| Metric | Value | Interpretation | |--------|-------|----------------| | MAE | 1.8 min | Average error is under 2 minutes | | RMSE | 2.4 min | Penalizes big misses more | | R² | 0.74 | Explains 74% of variance | | Within ±3 min | 87% | Useful for practical decisions |

Confusion Matrix (Categorical)#

Plain Text

1Predicted:     On-Time    Slight    Moderate   Severe
2Actual:
3On-Time        [  82%  ]    15%        2%        1%
4Slight Delay      18%    [  71%  ]    8%        3%
5Moderate          5%       22%     [  65%  ]    8%
6Severe            2%        8%       20%     [  70%  ]

The model is best at predicting on-time arrivals and severe delays—the cases where predictions are most useful.

Feature Importance Analysis#

Plain Text

1Recent delays (3 stops)  ████████████████████████████████████ 34%
2Time of day              ██████████████████████ 22%
3Route historical avg     ██████████████████ 18%
4Day of week              ████████████ 12%
5Distance to centre       ████████ 8%
6Other features           ██████ 6%

Key insight: The best predictor of future delays is... recent delays. This makes intuitive sense—a bus that's already running late tends to stay late.

Making Predictions#

Plain Text

1def predict_delay(route_id, current_position, timestamp):
2    """Predict delay for a bus arrival"""
3    
4    # Engineer features
5    features = engineer_features({
6        'route_id': route_id,
7        'position': current_position,
8        'timestamp': timestamp
9    })
10    
11    # Get prediction
12    predicted_delay = model.predict([features])[0]
13    
14    # Calculate confidence based on feature availability
15    confidence = calculate_confidence(features)
16    
17    return {
18        'predicted_delay_minutes': round(predicted_delay, 1),
19        'confidence': confidence,
20        'prediction_time': datetime.now(),
21        'valid_for_minutes': 15
22    }
23 
24# Example usage
25prediction = predict_delay(
26    route_id='46A',
27    current_position=(53.35, -6.26),
28    timestamp=datetime.now()
29)
30 
31# Output:
32# {
33#   'predicted_delay_minutes': 3.2,
34#   'confidence': 0.85,
35#   'prediction_time': '2026-01-30T22:45:00',
36#   'valid_for_minutes': 15
37# }

Lessons Learned#

What Worked#

Feature engineering > model complexity - Simple features with XGBoost beat complex neural networks
Recent history is gold - The last 3 stops' delays are highly predictive
Temporal validation matters - Random splits overestimate accuracy on time series data

What Didn't Work#

Weather data - Surprisingly low correlation (less than 5%) with delays
Exact GPS coordinates - Too noisy; discretized zones work better
Deep learning - Overkill and slower, no accuracy gain

What I'd Do Differently#

Collect more data (weeks, not hours) for seasonal patterns
Add event calendar features (matches, concerts)
Build a proper API for real-time serving

Production Considerations#

If deploying this for real:

Plain Text

1# Model serving architecture
2┌─────────────┐     ┌─────────────┐     ┌─────────────┐
3│  GTFS-RT    │────▶│   Feature   │────▶│   XGBoost   │
4│  Stream     │     │   Store     │     │   Model     │
5└─────────────┘     └─────────────┘     └──────┬──────┘
6                                               │
7                                               ▼
8                                        ┌─────────────┐
9                                        │  Prediction │
10                                        │     API     │
11                                        └─────────────┘

Key requirements:

Feature store for historical averages (Redis/DynamoDB)
Model versioning (MLflow)
A/B testing framework
Monitoring for model drift

Try It Yourself#

The complete code is available in my Dublin Bus Pipeline project.

Plain Text

1# Clone and run
2git clone https://github.com/mayankgulaty/mycodingjourney
3cd projects/dublin-bus-pipeline
4 
5# Collect training data
6python src/data_collector.py --duration 60 --interval 30
7 
8# Train model (notebook)
9jupyter notebook notebooks/delay_prediction.ipynb

Conclusion#

Predicting transit delays is a tractable ML problem with real-world impact. With just a few hours of data and careful feature engineering, we achieved 87% accuracy within ±3 minutes.

The key insights:

Recent history matters most - delays propagate
Simple models win - XGBoost beats neural networks here
Feature engineering is everything - transform raw data into predictive signals

Next step: deploying this as a real-time notification service. Stay tuned!

Questions? Connect with me on LinkedIn or check the full project on GitHub.

Predicting Bus Delays with Machine Learning: A Practical Guide

Predicting Bus Delays with Machine Learning: A Practical Guide

The Problem Worth Solving#

The Data Foundation#

Data Available#

Feature Engineering: The Secret Sauce#

Features I Created#

Why These Features Matter#

Model Selection: Why Gradient Boosting?#

Approaches Considered#

XGBoost Won Because:#

Training Pipeline#

Results: Better Than Expected#

Model Performance#

Confusion Matrix (Categorical)#

Feature Importance Analysis#

Making Predictions#

Lessons Learned#

What Worked#

What Didn't Work#

What I'd Do Differently#

Production Considerations#

Try It Yourself#

Conclusion#

Written by Mayank Gulaty

Related Articles

When Should You Catch the Bus in Dublin? A Time-Based Analysis

Which Dublin Bus Routes Are Actually Reliable? A Data-Driven Analysis

Building a Real-Time Transit Data Pipeline: Dublin Bus Analytics