PRediction Error Sum of Squares (PRESS)

$PRESS$ measures the quality of a regression model from its residuals. It is a validation-type estimator that uses the deleted residuals to estimate the prediction error. When comparing alternative regression models, selecting the model with the lowest $PRESS$ is a good approach, because that model is expected to produce the least error when making new predictions (see Helsel et al., 2020).

It is particularly valuable for comparing alternative forms of multiple linear regression, but it is also useful for comparing different choices of the explanatory variable in single-variable regression models.
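As an illustration of the deleted-residual idea, the sketch below compares two candidate models by their PRESS values. It is not tidyhydro code: it uses only base R and the built-in `cars` data set, and `press_lm()` is a hypothetical helper introduced here. It relies on the standard identity for ordinary least squares that the deleted residual equals $e_i / (1 - h_{ii})$, where $h_{ii}$ is the leverage of observation $i$.

```r
# Illustrative sketch only: base R and the built-in `cars` data,
# not part of tidyhydro.
fit_linear <- lm(dist ~ speed, data = cars)
fit_quad   <- lm(dist ~ speed + I(speed^2), data = cars)

# Hypothetical helper: PRESS from deleted (leave-one-out) residuals.
# For an OLS fit, the deleted residual is e_i / (1 - h_ii).
press_lm <- function(fit) {
  deleted <- residuals(fit) / (1 - hatvalues(fit))
  sum(deleted^2)
}

press_values <- c(
  linear    = press_lm(fit_linear),
  quadratic = press_lm(fit_quad)
)
press_values

# Name of the candidate with the lowest PRESS
names(which.min(press_values))
```

Both candidates share the same untransformed response (`dist`), so their PRESS values are directly comparable.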

Usage

press(data, ...)

# S3 method for class 'data.frame'
press(data, truth, estimate, na_rm = TRUE, ...)

press_vec(truth, estimate, na_rm = TRUE, ...)

Arguments

data: A data.frame containing the columns specified by the truth and estimate arguments.
...: Not currently used.
truth: The column identifier for the true results (which must be numeric). This should be an unquoted column name, although this argument is passed by expression and supports quasiquotation (you can unquote column names). For _vec() functions, a numeric vector.
estimate: The column identifier for the predicted results (which must also be numeric). As with truth, this can be specified in different ways, but the primary method is to use an unquoted variable name. For _vec() functions, a numeric vector.
na_rm: A logical value indicating whether NA values should be stripped before the computation proceeds.

Value

A tibble with columns .metric, .estimator, and .estimate and 1 row of values.

For grouped data frames, the number of rows returned will be the same as the number of groups.

For press_vec(), a single numeric value (or NA).

Details

$PRESS$ is only meaningful when comparing regression models whose response variables have the same units (Rasmussen et al., 2009).

It is computed as follows:

$$ PRESS = \sum_{i=1}^{n}{(sim_i - obs_i)^2} $$

where:

$sim_i$ is the simulated (model-predicted) value at time step $i$
$obs_i$ is the observed value at time step $i$
$n$ is the number of time steps
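The formula above can be reproduced by hand in base R. The following is a minimal sketch with made-up toy values, not the package's implementation. Dropping incomplete pairs with `complete.cases()` is an assumption about how `na_rm = TRUE` behaves.

```r
# Hypothetical toy values, for illustration only
obs <- c(2.0, 3.5, 5.0, 4.0, NA)
sim <- c(2.5, 3.0, 5.5, 3.0, 4.2)

# Assumption: incomplete (obs, sim) pairs are dropped, as na_rm = TRUE suggests
keep <- complete.cases(obs, sim)

# Sum of squared differences between simulated and observed values
press_manual <- sum((sim[keep] - obs[keep])^2)
press_manual
#> [1] 1.75
```

If `press_vec()` follows the formula as stated, `press_vec(obs, sim)` should return the same value for these vectors.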

Note

The $PRESS$ statistic is not appropriate for comparing models that use different transformations of the response variable, e.g., a linear regression and a log-transformed linear regression (Helsel et al., 2020).

References

Rasmussen, P. P., Gray, J. R., Glysson, G. D. & Ziegler, A. C. Guidelines and procedures for computing time-series suspended-sediment concentrations and loads from in-stream turbidity-sensor and streamflow data. in U.S. Geological Survey Techniques and Methods book 3, chap. C4 53 (2009) https://pubs.usgs.gov/tm/tm3c4/.

Helsel, D. R., Hirsch, R. M., Ryberg, K. R., Archfield, S. A. & Gilroy, E. J. Statistical Methods in Water Resources. 484 (2020) doi:10.3133/tm4A3.

Examples

library(tidyhydro)
data(avacha)

# Supply truth and predictions as bare column names
press(avacha, obs, sim)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 press   standard     228469.

# Or as numeric vectors
press_vec(avacha$obs, avacha$sim)
#> [1] 228469.5