Model Overview

The COVID-19 Outbreak Detection Tool is designed to detect recent county-level COVID-19 outbreaks and to predict how fast an outbreak would spread in each county in the United States. Using a machine learning approach, the tool estimates the doubling time of COVID-19 cases in each county by accounting for reported COVID-19 cases and deaths, face mask mandates, social distancing policies, each county's social vulnerability index (e.g., income level, employment rate), and the daily number of tests performed and the test positivity rate in each state. The tool is updated at least once a week.

Methodology

We built a generalized random forest model, a machine learning method, to estimate the exponential growth rate r of incident COVID-19 cases in each county, where r is defined by:

$$x(t_0 + t) = x(t_0)\, e^{rt},$$

where x(t0) is the current incident case number, and x(t0+t) is the incident case number t days later if the exponential growth rate r persists [1]. In our heat map, we convert this exponential growth rate to the more intuitive concept of case doubling time: if an exponential growth rate r persists, the initial case number doubles in $\ln(2)/r$ days.
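The conversion from growth rate to doubling time can be sketched in a few lines. This is a minimal illustration rather than the tool's own code: the function names and the example rate of 0.05 per day are hypothetical, and returning infinity for a non-positive rate is an assumption made here, since the text does not say how flat or shrinking case counts are displayed.

```python
import math

def doubling_time(r: float) -> float:
    """Days for cases to double at a constant exponential growth rate r (per day)."""
    if r <= 0:
        # Assumption: flat or shrinking case counts never double.
        return math.inf
    return math.log(2) / r

def project_cases(x0: float, r: float, t: float) -> float:
    """Incident cases t days ahead: x(t0 + t) = x(t0) * exp(r * t)."""
    return x0 * math.exp(r * t)

r = 0.05                      # hypothetical growth rate per day
td = doubling_time(r)         # ln(2) / 0.05
print(round(td, 2))           # 13.86
print(round(project_cases(100, r, td), 6))  # 200.0, i.e. double the starting 100
```

Projecting forward by exactly one doubling time returns twice the starting count, which is a quick consistency check between the two formulas.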

The incident case number at a time point x(t) is defined as:

$$x(t) = I(t) - I(t - 22),$$

where I(t) and I(t-22) are the cumulative case numbers up to times t and t-22, respectively. We assume that a patient has either recovered or died 22 days after infection [2], so x(t) approximates the number of currently infectious cases. We used a 7-day moving average to smooth out the weekend-driven oscillations in reported case numbers [3]. Furthermore, our analysis excludes counties with fewer than 20 incident cases at time t to reduce noise in the data.
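The incident case calculation can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the text does not specify whether the moving average is trailing or centered or which series it is applied to, so the sketch assumes a trailing 7-day average applied to the cumulative series before differencing. All function and parameter names are hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average; entries before a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out

def incident_cases(cumulative, infectious_days=22, min_cases=20):
    """x(t) = I(t) - I(t - 22); None where undefined or below the 20-case threshold."""
    out = []
    for t in range(len(cumulative)):
        earlier = cumulative[t - infectious_days] if t >= infectious_days else None
        if cumulative[t] is None or earlier is None:
            out.append(None)
            continue
        x = cumulative[t] - earlier
        out.append(x if x >= min_cases else None)
    return out

# Hypothetical county reporting 10 new cases every day for 40 days.
cumulative = [10 * day for day in range(40)]
incident = incident_cases(moving_average(cumulative))
print(incident[27], incident[28])  # None 220.0
```

With a constant 10 new cases per day, the incident count settles at 22 × 10 = 220 once both the 7-day window and the 22-day lookback are filled.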

To obtain robust and up-to-date county-level estimates of the exponential growth rate, we followed the steps below:

  1. We first constructed a database of daily COVID-19 case growth rates for each county from the historical cumulative confirmed case numbers reported by the New York Times ( https://github.com/nytimes/COVID-19-data ).
  2. We then augmented these growth rates with features capturing factors that affect the spread of the disease. These include county-specific characteristics, such as geographic location, demographics and the social vulnerability index, along with time-specific characteristics, such as economic stimulus packages and social distancing interventions. The following features were included:
    • Historical growth pattern and cumulative confirmed case number in each county:
      • daily case growth rate over the county's full reporting history
      • initial cumulative case number for each day
    • Geographical location of each county, as captured by the county centroid longitude and latitude
    • Social vulnerability index (SVI) components of each county, such as (see the SVI data dictionary for a complete list):
      • per capita income
      • employment rate
      • insurance coverage
    • COVID-19-related economic and social distancing policies of each state, such as (see the CUSP data dictionary for a complete list):
      • face mask mandate
      • mandatory quarantine for those entering the state
      • paid sick leave
    • Daily number of tests performed in each state, by test type and result:
      • total PCR tests
      • positive PCR tests
      • total antibody tests
      • positive antibody tests
      • total antigen tests
      • positive antigen tests
  3. Finally, we used the generalized random forest algorithm [4] to match incident case growth trends based on these features and estimated an exponential growth rate for each growth pattern. This method allows us to balance the bias-variance tradeoff in estimating the exponential growth rate.
    1. First, our estimates are based on the most recent growth pattern for each county, which accounts for changing conditions and policies in a county and thus reduces bias. For example, case growth patterns observed before a mask mandate should not be used to estimate the exponential growth rate after the mandate, because conditions have changed and the earlier data would bias the estimate.
    2. Second, this method pools all relevant trends across counties and over time when estimating the exponential growth rate for each county, which reduces the variance of our estimates. For example, to estimate the exponential growth rate of county A, the algorithm can use data from a different county B to reduce the estimation variance, provided that counties A and B are sufficiently similar in their county- and state-level features.
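The pooling idea in step 3 can be illustrated with a deliberately simplified sketch. It is not the generalized random forest algorithm: a generalized random forest learns its similarity weights from the data, whereas this sketch substitutes a fixed Gaussian kernel on hypothetical, already standardized feature vectors. All names, feature values and case series below are invented for illustration.

```python
import math

def daily_growth_rates(incident):
    """r_t = ln(x(t+1) / x(t)) for consecutive days with positive incident counts."""
    return [math.log(b / a) for a, b in zip(incident, incident[1:]) if a > 0 and b > 0]

def similarity_weight(features_a, features_b, bandwidth=1.0):
    """Gaussian kernel on squared Euclidean distance (a stand-in for forest weights)."""
    d2 = sum((a - b) ** 2 for a, b in zip(features_a, features_b))
    return math.exp(-d2 / (2 * bandwidth ** 2))

def pooled_growth_rate(target_features, counties, bandwidth=1.0):
    """Weighted mean of daily growth rates; similar counties contribute more."""
    num = den = 0.0
    for features, incident in counties:
        w = similarity_weight(target_features, features, bandwidth)
        for r in daily_growth_rates(incident):
            num += w * r
            den += w
    return num / den if den > 0 else float("nan")

# Hypothetical counties: (standardized features, incident case series).
county_a = ((0.0, 0.0), [100, 105, 110.25])   # grows 5% per day
county_b = ((0.1, 0.0), [200, 208, 216.32])   # similar to A, grows 4% per day
county_c = ((5.0, 5.0), [50, 100, 200])       # very different, doubles daily

r_hat = pooled_growth_rate(county_a[0], [county_a, county_b, county_c])
print(math.log(1.04) < r_hat < math.log(1.05))  # True
```

County B is close to county A in feature space, so its growth history pulls the estimate for A toward its own rate, while the dissimilar county C receives a negligible weight and barely affects the result.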

Implementation and analysis were performed in R.

 

References

  1. Ma J. Estimating epidemic exponential growth rate and basic reproduction number. Infectious Disease Modelling. 2020;5:129-141.
  2. The University of Melbourne. Coronavirus 10-day forecast. Available at http://covid19forecast.science.unimelb.edu.au/ . August 26, 2020.
  3. Bergman A, Sella Y, Agre P, Casadevall A. Oscillations in US COVID-19 incidence and mortality data reflect diagnostic and reporting factors. mSystems. 2020;5(4).
  4. Athey S, Tibshirani J, Wager S. Generalized random forests. The Annals of Statistics. 2019;47(2):1148-1178.

Data Sources