CovidCast: Predict to Protect


By Sam Celarek

"Like a weather forecast for pandemics, COVIDCast leverages state-of-the-art machine learning and epidemiological models to deliver precise outbreak predictions."

🎯 Project Overview

During COVID-19, the unknown course of the pandemic was almost as deadly as the virus itself. COVIDCast aims to address this problem by Predicting to Protect. To do this, COVIDCast combines the expert knowledge of epidemiological models with the forecasting power of time series models to predict the next 14 days of new COVID cases and protect governments, hospitals, and people from the coming storm.



📊 Dataset

The data for this project comes from three sources: Google, Our World in Data (OWID), and CovsirPhy.

These repositories gathered data from a number of sources all over the world, including the WHO, Johns Hopkins University, and the CDC. I used six different CSV files in the master file, combining mobility data (Google), weather data (Google), government restrictions (Google), hospitalizations and tests (OWID), and case counts and epidemiological variables (CovsirPhy). There was some feature overlap between the sources, which I used to help impute missing data and create a more complete final master_df.


🧹 Data Cleaning

Originally 30% of the data was missing. I used a variety of techniques to make this data more manageable:


👓 EDA

To properly apply time series models to the data, I had to assess:

Here is my Preprocessing and EDA Presentation


💠 Modeling

COVIDCast works by taking the spread, death, and recovery parameters estimated by an epidemiological SIRD model and feeding them into the time series models as exogenous variables. This gives the time series models better information about the underlying dynamics of the disease being predicted.
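
As a rough illustration of what the SIRD parameters describe, the sketch below integrates the standard SIRD equations with a simple forward-Euler step. The symbols `beta` (spread), `gamma` (recovery), and `mu` (death), the function name `simulate_sird`, and all numeric values are assumptions chosen for illustration. They are not the parameter names or estimates used in this project, and the README does not show how the parameters were fitted.

```python
def simulate_sird(population, infected0, beta, gamma, mu, days, steps_per_day=10):
    """Integrate the SIRD equations with forward Euler.

    Returns one (S, I, R, D) tuple per day, starting with day 0.
    beta, gamma and mu are per-day rates of spread, recovery and death.
    """
    s = float(population - infected0)
    i = float(infected0)
    r = 0.0
    d = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r, d)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            new_deaths = mu * i * dt
            s -= new_infections
            i += new_infections - new_recoveries - new_deaths
            r += new_recoveries
            d += new_deaths
        history.append((s, i, r, d))
    return history


# Illustrative values only (hypothetical, not estimated from the project data).
history = simulate_sird(
    population=1_000_000, infected0=100, beta=0.3, gamma=0.1, mu=0.01, days=120
)
peak_day = max(range(len(history)), key=lambda day: history[day][1])
print(f"Infections peak on day {peak_day}")
```

In a pipeline like the one described here, it would be the fitted rates themselves, re-estimated over time, that are passed to the forecasting model as extra regressors, rather than the simulated curves.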

🦠 Epidemiological Model Overview

🧪 Time Series Forecasting Algorithms

🔢 Feature Selection and Tuning the Models:

For the ARIMA time series models, I wanted to add exogenous variables that carry information about how the target will fluctuate over the next time steps. To determine these features, I assessed each variable's stationarity, Granger causality, linear correlation strength, multicollinearity, and importance. I found that the 6 variables I had thought to be important from background knowledge were in fact the best additional regressors for the model. I used autoarima to grid search SARIMAX's orders and found the order (3, 0, 2)(2, 1, 1)[7] with an intercept to be the most predictive in cross-validation.

For Prophet modeling, I grid searched with cross-validation over the 3 to 15 most important features, as ranked by Recursive Feature Elimination with LightGBM as the regressor. The best-performing model from this search had 11 exogenous variables. I then grid searched over the hyperparameters and landed on the final settings of changepoint_prior_scale=10, seasonality_prior_scale=0.01, holidays_prior_scale=10, and growth='linear'.
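
Both searches above rely on cross-validation over time-ordered data. The exact scheme is not described here, so the sketch below assumes an expanding-window split, where each fold trains on everything before its test window. A seasonal-naive forecaster stands in for SARIMAX or Prophet to keep the example dependency-free, and every function name is hypothetical.

```python
def expanding_window_splits(n_obs, initial_train, horizon):
    """Yield (train_indices, test_indices) pairs that never look ahead."""
    train_end = initial_train
    while train_end + horizon <= n_obs:
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += horizon


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def seasonal_naive(train, horizon, season):
    """Stand-in model: repeat the last `season` observed values."""
    return [train[-season + (h % season)] for h in range(horizon)]


def grid_search(series, param_grid, initial_train, horizon):
    """Return (best_params, best_score) by mean MAE across all folds."""
    best = None
    for params in param_grid:
        fold_errors = []
        for train_idx, test_idx in expanding_window_splits(
            len(series), initial_train, horizon
        ):
            train = [series[i] for i in train_idx]
            test = [series[i] for i in test_idx]
            preds = seasonal_naive(train, horizon, **params)
            fold_errors.append(mean_absolute_error(test, preds))
        score = sum(fold_errors) / len(fold_errors)
        if best is None or score < best[1]:
            best = (params, score)
    return best


# Toy series with an exact weekly pattern; the search should prefer season=7.
series = [t % 7 for t in range(70)]
best_params, best_score = grid_search(
    series, [{"season": 5}, {"season": 7}], initial_train=28, horizon=14
)
print(best_params, best_score)
```

Swapping `seasonal_naive` for a function that fits a real model on `train` and returns `horizon` predictions would give the same search structure for model orders, feature counts, or Prophet hyperparameters.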


📈 Results and Discussion

The training period was from February 15th, 2020 to March 5th, 2023, and the testing period was the 16 days from March 5th, 2023 to March 21st, 2023. The forecasts and benchmarks below are based on the models' performance during the testing period.

Forecasts:

SARIMAX Model:


The SARIMAX model adeptly captures the weekly COVID case variations, demonstrating minimal residuals for low case counts. Although it occasionally misses predicting peaks, the observed values still lie within its 95% confidence interval.

Prophet Model:


The Prophet model seems to capture a long-term trend, even venturing into negative COVID case counts. Efforts to employ a logistic growth curve didn't enhance its accuracy. It struggles with weekly fluctuations and, at times, is directionally incorrect. Given its current state, it's not recommended for predicting COVID cases.

Testing metrics:


On the testing scores, the SARIMAX model with 6 exogenous variables and a time series order of (3, 0, 2)(2, 1, 1)[7] with an intercept demonstrates the best understanding of daily COVID case trends, with superior scores across all metrics.
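
For comparison with literature values such as the sMAPE figure mentioned under Next Steps, sMAPE can be computed as in the sketch below. It uses one common definition (absolute error divided by the mean of the absolute actual and predicted values, averaged and expressed as a percentage). Whether the cited work or this project's metrics use the same variant is an assumption, and the function name `smape` is illustrative.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denominator = (abs(a) + abs(p)) / 2.0
        # Treat a pair of zeros as a perfect prediction to avoid dividing by zero.
        total += 0.0 if denominator == 0 else abs(a - p) / denominator
    return 100.0 * total / len(actual)


# Hypothetical daily case counts and forecasts.
print(smape([100, 200, 300], [110, 190, 330]))
```

Because the denominator includes the prediction, sMAPE penalizes over- and under-forecasts differently, which is worth keeping in mind when comparing models on low case counts.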

Next Steps

Future work for this project includes adapting the target variable to deaths, hospitalizations, or weekly COVID case averages to assess forecast accuracy on those targets. Also, the literature on COVID prediction suggests that an RNN with LSTM layers yields a cutting-edge sMAPE of 5%. I would like to recreate this RNN model and explore whether integrating the SIRD model parameters further enhances its predictive power.

And, of course, the ambitious goal for this project is for COVIDCast to Predict to Protect against any future pandemics.

Thank you for your interest in COVIDCast. For further inquiries or insights, contact via this GitHub repository or at scelarek@gmail.com.

Best Wishes,
Sam Celarek

Other Resources