During COVID-19, the unknown course of the pandemic is almost as deadly as the virus itself. COVIDCast aims to address this problem by Predicting to Protect. To do this, COVIDCast combines the expert knowledge of epidemiological models with the forecasting power of time series models to predict the next 14 days of new COVID cases and protect governments, hospitals, and people from the coming storm.
The data for this project comes from three sources:
These repositories gathered data from sources all over the world, including the WHO, Johns Hopkins, and the CDC. I used six CSV files to build the master file, combining mobility data (Google), weather data (Google), government restrictions (Google), hospitalizations and tests (OWID), and case counts and epidemiological variables (CovsirPhy). Some features overlapped between sources, and I used that overlap to impute missing values in the other datasets and create a more complete final master_df.
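The idea of using overlapping features to fill gaps can be sketched as a simple coalesce: for each date, take the value from one source and fall back to the other when it is missing. This is a minimal, dependency-free illustration and not the project's actual code; the function name `coalesce`, the variable names, and the values are all hypothetical. In a pandas workflow, `Series.combine_first` does the equivalent.

```python
from typing import Dict, Optional

# Hypothetical daily series keyed by ISO date; None marks a missing value.
Series = Dict[str, Optional[float]]


def coalesce(primary: Series, fallback: Series) -> Series:
    """Fill gaps in `primary` with the value `fallback` has for the same date."""
    merged: Series = {}
    for date in sorted(set(primary) | set(fallback)):
        value = primary.get(date)
        merged[date] = value if value is not None else fallback.get(date)
    return merged


# Illustrative values only, not project data.
source_a = {"2021-01-01": 100.0, "2021-01-02": None, "2021-01-03": 120.0}
source_b = {"2021-01-02": 110.0, "2021-01-03": 125.0, "2021-01-04": None}

filled = coalesce(source_a, source_b)
```

Values present in the primary source always win, so the fallback is only consulted where the primary has a gap; dates missing from both sources stay missing.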
Originally 30% of the data was missing. I used a variety of techniques to make this data more manageable:
(current_hospitalizations, fatal)
(excess_mortality, derived_reproduction_rate)
(vaccine_policy, new_vaccinations)
To properly apply time series models to the data, I had to assess:
Here is my Preprocessing and EDA Presentation
COVIDCast works by taking the spread, death, and recovery parameters estimated by an epidemiological SIRD model and feeding them into the time series models as exogenous variables, giving those models better information about the underlying nature of the disease being predicted.
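For readers unfamiliar with SIRD, the sketch below shows what the three parameters control, using a textbook discrete-time SIRD update. It is an illustration under stated assumptions, not COVIDCast's estimation code: the names `beta` (spread), `gamma` (recovery), and `mu` (death), the daily Euler step, and all numeric values are hypothetical.

```python
from typing import List, Tuple

State = Tuple[float, float, float, float]  # (susceptible, infected, recovered, dead)


def sird_step(state: State, beta: float, gamma: float, mu: float, dt: float = 1.0) -> State:
    """Advance the SIRD compartments by one Euler step of length `dt` days."""
    s, i, r, d = state
    n = s + i + r + d                      # total population stays constant
    new_infections = beta * s * i / n * dt
    new_recoveries = gamma * i * dt
    new_deaths = mu * i * dt
    return (
        s - new_infections,
        i + new_infections - new_recoveries - new_deaths,
        r + new_recoveries,
        d + new_deaths,
    )


def simulate(initial: State, beta: float, gamma: float, mu: float, days: int) -> List[State]:
    """Return the trajectory of states, including the initial one."""
    path = [initial]
    for _ in range(days):
        path.append(sird_step(path[-1], beta, gamma, mu))
    return path


# Illustrative parameter values, not estimates from the project.
path = simulate((999_000.0, 1_000.0, 0.0, 0.0), beta=0.30, gamma=0.10, mu=0.01, days=14)
```

In the approach described above, it is time-varying estimates of parameters like these, rather than the simulated compartments, that serve as exogenous inputs to the forecasting models.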
ARIMA, SARIMA, SARIMAX models: These models predict future trends using moving averages, linear autoregression, and differencing. SARIMA adds seasonality and SARIMAX adds exogenous variables, making them potent tools for forecasting. However, they don't model nonlinear trends very well, and they require a lot of up-front setup by the forecaster.
Prophet model: From Facebook’s own description, “Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.”
For the ARIMA time series models, I wanted to add exogenous variables that carry information about how the target is going to fluctuate in the next time steps. To determine these features I had to assess each variable's Stationarity, Granger-Causality, Linear Correlation Strength, Multi-Collinearity, and Importance. I found that the 6 variables I had expected to be important based on background knowledge were in fact the best additional regressors for the model. I used auto-ARIMA to grid search SARIMAX's orders and found the order (3,0,2)(2,1,1)[7] with intercept to be the most predictive in cross-validation.
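In standard (p,d,q)(P,D,Q)[m] notation, this order means 3 autoregressive terms, no regular differencing, and 2 moving-average terms, plus a seasonal part with 2 autoregressive terms, one seasonal difference, and 1 moving-average term at a period of 7 days. The sketch below illustrates only the seasonal differencing step (D=1, m=7) on made-up data; the helper `difference` and the series are hypothetical, and a SARIMAX implementation performs this step internally.

```python
from typing import List


def difference(series: List[float], lag: int = 1) -> List[float]:
    """Return series[t] - series[t - lag]; the first `lag` points are dropped."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Illustrative daily counts: a repeating weekly pattern plus a drift of 2 per week.
weekly_pattern = [50, 80, 90, 85, 80, 40, 30]
cases = [weekly_pattern[t % 7] + 2 * (t // 7) for t in range(28)]

# D=1 with period 7: subtract the value from the same weekday one week earlier.
seasonally_differenced = difference(cases, lag=7)
```

The weekly pattern cancels out and only the week-over-week change remains, which is why a period of 7 suits daily case counts with a strong weekday effect.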
For Prophet modeling, I grid searched with cross-validation over a range of 3 to 15 of the most important features, as determined by Recursive Feature Elimination with LightGBM as my regressor. The best-performing model using this technique had 11 exogenous variables. I then grid searched over the hyperparameters to land on the final settings of changepoint_prior_scale=10, seasonality_prior_scale=0.01, holidays_prior_scale=10, and growth='linear'.
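Cross-validating a forecasting model requires splits that never train on the future. The source does not say which scheme was used, so the following is only one common option, sketched under that assumption: rolling-origin splits with an expanding training window. The function name and all sizes are hypothetical.

```python
from typing import Iterator, List, Tuple


def rolling_origin_splits(
    n_obs: int, initial_train: int, horizon: int, step: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with an expanding training window."""
    train_end = initial_train
    while train_end + horizon <= n_obs:
        train = list(range(train_end))
        test = list(range(train_end, train_end + horizon))
        yield train, test
        train_end += step  # move the forecast origin forward


# Illustrative sizes: 100 daily observations, 14-day forecast horizon.
splits = list(rolling_origin_splits(n_obs=100, initial_train=60, horizon=14, step=14))
```

Each candidate feature set or hyperparameter combination would be scored on every test window and the scores averaged, so the chosen configuration is the one that forecasts well from several different origins rather than just one.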
The training period ran from February 15th, 2020 to March 5th, 2023, and the testing period was the 16 days from March 5th, 2023 to March 21st, 2023. The forecasts and benchmarks below are based on the models' performance during the testing period.
The SARIMAX model adeptly captures the weekly COVID case variations, demonstrating minimal residuals for low case counts. Although it occasionally misses predicting peaks, the observed values still lie within its 95% confidence interval.
The Prophet model seems to capture a long-term trend, even venturing into negative COVID case counts. Efforts to employ a logistic growth curve didn't enhance its accuracy. It struggles with weekly fluctuations and, at times, is directionally incorrect. Given its current state, it's not recommended for predicting COVID cases.
Comparing the testing scores, the SARIMAX model with 6 exogenous variables and a time series order of (3,0,2)(2,1,1)[7] with an intercept demonstrates the best understanding of daily COVID case trends, showcasing superior scores across all metrics.
Future endeavors for this project include changing the target variable to deaths, hospitalizations, or weekly COVID case averages and assessing the forecast accuracy for each. Also, literature on COVID prediction suggests that an RNN with LSTM layers yields a cutting-edge sMAPE of 5%. I would like to recreate this RNN model and explore whether integrating the SIRD model parameters further enhances its predictive power.
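For reference, sMAPE (symmetric mean absolute percentage error) has several variants. The sketch below uses one common definition, with the mean of the absolute actual and forecast values as the denominator; the literature referred to above may use a different variant, and the handling of the 0/0 case here is an assumption.

```python
from typing import Sequence


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE in percent, using the (|A| + |F|) / 2 denominator."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Assumption: count a 0/0 term as zero error instead of dividing by zero.
        total += 0.0 if denom == 0 else abs(f - a) / denom
    return 100 * total / len(actual)


# Illustrative values only, not project results.
score = smape([100, 200, 0], [110, 180, 0])
```

Under this definition the score ranges from 0% for a perfect forecast to 200% when forecast and actual have opposite signs or one of them is zero.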
And, of course, the ambitious goal for this project is for COVIDCast to Predict to Protect against any future pandemics.
Thank you for your interest in COVIDCast. For further inquiries or insights, contact via this GitHub repository or at scelarek@gmail.com.