Bondad de ajuste en modelos no lineales


“All models are wrong, but some are useful” - George Box.    

Partiendo de la idea anterior, en esta entrada del blog intentaremos dar algunas ideas sobre cómo seleccionar el modelo "más" útil, o el menos inútil. Para ello debemos describir qué aspectos de los datos son capturados correctamente por el modelo y cuáles no. Compararemos el "ajuste" de varios modelos, considerando el concepto de parsimonia. Es decir, no elegiremos el modelo basados solamente en que reproduzca mejor los datos observados, sino que el modelo seleccionado deberá ser el más simple (el más parsimonioso). Existen numerosos índices de ajuste para evaluar el grado de congruencia entre el modelo y los datos, y se recomienda presentar varios de ellos para complementar el análisis.
Las medidas que presentaré se pueden utilizar para cualquier tipo de modelo de regresión, pero es especialmente útil para modelos no-lineales, donde no podemos usar el famoso R2.

¿CÓMO EVALUAR LA BONDAD DE AJUSTE EN UN MODELO NO LINEAL?

  • ESTADÍSTICOS DE BONDAD DE AJUSTE
  • ANÁLISIS DE LOS RESIDUOS
  • INTERVALOS DE CONFIANZA Y PREDICCIÓN

  • ESTADÍSTICOS DE BONDAD DE AJUSTE o BONDAD DE PREDICCIÓN
Este es un resumen de los principales estadísticos/medidas que se pueden calcular mediante R.

  • deviance. Función del paquete stat. La "Devianza" del modelo es una medida del grado de diferencia entre las frecuencias observadas y predichas por el modelo de la variable dependiente, de forma que a mayor devianza, peor es el modelo. Su cálculo es -2 veces el logaritmo neperiano de la verosimilitud del modelo. La devianza nos puede orientar durante la etapa de selección del modelo final. Idealmente el modelo final, el mejor modelo, debería tener la menor devianza de los modelos analizados.
  • PseudoR2=1-(deviance(fit)/sum((Y-mean(Y))^2))
Cameron, A.C., Windmeijer, F.A.G., 1996. R^2 measures for count data regression models with applications to health-care utilization.  J. Bus. Econom. Statist. 14, 209–220     
  • Medidas de la función pcrGOF del paquete qpcR, para el modelo ajustado nls.
Rsq=1 - (rss/tss)R-square value of a fitted model
Rsq.ad=1 - (n - 1)/(n - p) * (1 - rsq), adjusted R-square value of a fitted model
AICAkaike's An Information Criterion
AICcAkaike's second-order corrected Information Criterion
BIC, Bayesian information criterion or Schwarz Criterion
HQICHannan-Quinn Information Criterion (es una variante del BIC con una pequeña penalización de la magnitud del tamaño muestral).

HQIC = -2 \cdot log(\mathcal{L}_{max}) + 2 \cdot k \cdot log(log(n))
with \mathcal{L}_{max} = maximum likelihood, k = number of parameters and n = number of observations.For n > \sim 20 or so BIC is the strictest in penalizing loss of degree of freedom by having more parameters in the fitted model. For n > \sim 40 AIC is the least strict of the three and HQIC holds the middle ground, or is the least penalizing for n < \sim 20.

resVar, residual variance, 
RMSE, root-mean-squared-error, RMSE = √{\overline{(y_i-\hat{y_i})^2}}
P.square, the PRESS P^2 value (if PRESS = TRUE).

  • Medidas de la función gof del paquete hydroGOF, para la comparación entre las observaciones y las predicciones realizadas por el modelo ajustado nls.
me
Mean Error
mae
Mean Absolute Error
mse
Mean Squared Error
rmse
Root Mean Square Error
nrmse
Normalized Root Mean Square Error ( -100% <= nrms <= 100% )
PBIAS
Percent Bias
pbiasfdc
PBIAS in the slope of the midsegment of the Flow Duration Curve
RSR
Ratio of RMSE to the Standard Deviation of the Observations, RSR = rms / sd(obs). ( 0 <= RSR <= +Inf )
rSD
Ratio of Standard Deviations, rSD = sd(sim) / sd(obs)
NSeff
Nash-Sutcliffe Efficiency ( -Inf <= NSeff <= 1 ). 

Una métrica para medir la capacidad de un modelo para predecir los valores observados. Este índice produce resultados menores o iguales a 1, si el resultado es 1 el ajuste es perfecto, si es cero el error es del mismo orden de magnitud que la varianza de los datos observados por lo que la media de los datos observados tendrá una capacidad predictora similar al modelo. Valores inferiores a cero implican que la media tiene una capacidad predictora más alta que el modelo (lo que implica desde luego que el modelo es muy malo). Este índice no es sensible al efecto de los valores proporcionales pero sigue siendo sensible a los valores extremos.

NSE = 1 - ( sum( (obs - sim)^2 ) / sum( (obs - mean(obs))^2 )The Nash-Sutcliffe efficiency (NSE) is a normalized statistic that determines the relative magnitude of the residual variance ("noise") compared to the measured data variance ("information") (Nash and Sutcliffe, 1970).  
mNSeff
Modified Nash-Sutcliffe Efficiency. 
mNSE = 1 - ( sum( abs(obs - sim)^j ) / sum( abs(obs - mean(obs))^j ).
rNSeff
Relative Nash-Sutcliffe Efficiency.
rNSE = 1 - ( sum( ( (obs - sim)/ mean(obs) )^2 ) / sum( abs( (obs - mean(obs)) / mean(obs) )^2 )

d
Index of Agreement ( 0 <= d <= 1 ).

d = 1 - [ ( sum( (obs - sim)^2 ) ] / sum( ( abs(sim - mean(obs)) + abs(obs - mean(obs)) )^2 )
The Index of Agreement (d) developed by Willmott (1981) as a standardized measure of the degree of model prediction error and varies between 0 and 1.
A value of 1 indicates a perfect match, and 0 indicates no agreement at all (Willmott, 1981).
The index of agreement can detect additive and proportional differences in the observed and simulated means and variances; however, it is overly sensitive to extreme values due to the squared differences (Legates and McCabe, 1999).

d1
Modified Index of Agreement
mNSeff = 1 - ( sum( abs(obs - sim)^j ) / sum( abs(obs - mean(obs))^j )

When j=1, the modified NSeff is not inflated by the squared values of the differences, because the squares are replaced by absolute values.
rd
Relative Index of Agreement
rd = 1 - [ sum( ( (obs - sim) / obs )^2 ] / sum( ( (abs(sim - mean(obs) ) + abs(obs - mean(obs) ) ) / mean(obs) )^2 )
It varies between 0 and 1. A value of 1 indicates a perfect match, and 0 indicates no agreement at all.

cp
Persistence Index ( 0 <= PI <= 1 )

cp = 1 - [ sum( (obs[2:n] - sim[2:n])^2 ] / sum( ( obs[2:n] - obs[1:(n-1)] )^2 )
Coefficient of persistence (Kitadinis and Bras, 1980; Corradini et al., 1986) is used to compare the model performance against a simple model using the observed value of the previous day as the prediction for the current day.
The coefficient of persistence compare the predictions of the model with the predictions obtained by assuming that the process is a Wiener process (variance increasing linearly with time), in which case, the best estimate for the future is given by the latest measurement (Kitadinis and Bras, 1980).
Persistence model efficiency is a normalized model evaluation statistic that quantifies the relative magnitude of the residual variance (noise) to the variance of the errors obtained by the use of a simple persistence model (Moriasi et al., 2007).
CP ranges from 0 to 1, with CP = 1 being the optimal value and it should be larger than 0.0 to indicate a minimally acceptable model performance.

r
Pearson Correlation coefficient ( -1 <= r <= 1 )
r.Spearman
Spearman Correlation coefficient ( -1 <= r.Spearman <= 1 )
R2
Coefficient of Determination ( 0 <= R2 <= 1 ). Gives the proportion of the variance of one variable that is predictable from the other variable
bR2
R2 multiplied by the coefficient of the regression line between sim and obs ( 0 <= bR2 <= 1 )
KGE
Kling-Gupta efficiency between sim and obs ( 0<=KGE<= 1)

This goodness-of-fit measure was developed by Gupta et al. (2009) to provide a diagnostically interesting decomposition of the Nash-Sutcliffe efficiency (and hence MSE), which facilitates the analysis of the relative importance of its different components in the context of hydrological modelling".
In the computation of this index, there are three main components involved:
-) r : the Pearson product-moment correlation coefficient. Ideal value is r=1.
-) Alpha: the ratio between the standard deviation of the simulated values and the standard deviation of the observed ones. Ideal value is Alpha=1.
-) Beta : the ratio between the mean of the simulated values and the mean of the observed ones. Ideal value is Beta=1.

KGE = 1 - ED
ED = √{ (s[1]*(r-1))^2 +(s[2]*(α-1))^2 + (s[3]*(β-1))^2 }
α=σ_s/σ_o
KGE = 1 - sqrt[ (s[1]*(r-1))^2 + (s[2]*(Alpha-1))^2 + (s[3]*(Beta-1))^2]
Kling-Gupta efficiencies range from -Inf to 1. Essentially, the closer to 1, the more accurate the model is. 



Más información: link1 y link2.






Algunas órdenes para generar gráficos con los valores observados (obs) y predichos (sim) por el modelo ajustado. Se agregan además los valores de las medidas de ajuste detalladas anteriormente.





#gráfico de valores observados y valores predichos (o simulados) y medidas de bondad de ajuste
#sea fit.nlsveert el modelo ajustado mediante nls
library(hydroGOF)
sim=c(NA,predict(fit.nlsvert))
obs=Rt
ggof(sim,obs, digits=4)

  • ANÁLISIS DE LOS RESIDUOS

# ejemplo del paquete hydroTSM, para series temporales.
library(hydroTSM)
data(OcaEnOnaQts)  
## 3 ts, 3 boxplots and 3 histograms
  hydroplot(OcaEnOnaQts, FUN=mean, ylab= "Q", var.unit = "m3/s")




  • INTERVALOS DE CONFIANZA Y PREDICCIÓN


 
library(nls2)
fm <- fit.nlsvert
ic=predict(as.lm(fm), interval = "confidence")
pr=predict(as.lm(fm), interval = "prediction")
par(mfrow=c(2,1))
#P-factor is the percent of observations that are within the given uncertainty bounds.
#R-factor represents the average width of the given uncertainty bounds divided by the standard deviation of the observations.
#dates, date.fmt="%Y-%m-%d"
date=NULL;for(i in 1960:2009)
{
datei=paste(i,"-01","-01",sep="")
date=c(date,datei)
}
plotbands(x=Rt[-1],lband=ic[,2],uband=ic[,3],sim=sim[-1],main="Confidence Bounds for x",dates=date[-length(date)])
plotbands(x=Rt[-1],lband=pr[,2],uband=pr[,3],sim=sim[-1],main="Confidence Bounds for predictions",dates=date[-length(date)])





In: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/psuedo_rsquareds.htm



FAQ: What are pseudo R-squareds?



As a starting point, recall that a non-pseudo R-squared is a statistic generated in ordinary least squares (OLS) regression that is often used as a goodness-of-fit measure.  In OLS,




where N is the number of observations in the model, y is the dependent variable, y-bar is the mean of the y values, and y-hat is the value predicted by the model. The numerator of the ratio is the sum of the squared differences between the actual y values and the predicted y values.  The denominator of the ratio is the sum of squared differences between the actual y values and their mean. 


There are several approaches to thinking about R-squared in OLS.  These different approaches lead to various calculations of pseudo R-squareds with regressions of categorical outcome variables. 



  1. R-squared as explained variability - The denominator of the ratio can be thought of as the total variability in the dependent variable, or how much y varies from its mean.  The numerator of the ratio can be thought of as the variability in the dependent variable that is not predicted by the model.  Thus, this ratio is the proportion of the total variability unexplained by the model.  Subtracting this ratio from one results in the proportion of the total variability explained by the model.  The more variability explained, the better the model.



  2. R-squared as improvement from null model to fitted model - The denominator of the ratio can be thought of as the sum of squared errors from the null model--a model predicting the dependent variable without any independent variables.  In the null model, each y value is predicted to be the mean of the y values. Consider being asked to predict a y value without having any additional information about what you are predicting.  The mean of the y values would be your best guess if your aim is to minimize the squared difference between your prediction and the actual value.  The numerator of the ratio would then be the sum of squared errors of the fitted model.  The ratio is indicative of the degree to which the model parameters improve upon the prediction of the null model.  The smaller this ratio, the greater the improvement and the higher the R-squared.



  3. R-squared as the square of the correlation - The term "R-squared" is derived from this definition.  R-squared is the square of the correlation between the model's predicted values and the actual values.  This correlation can range from -1 to 1, and so the square of the correlation then ranges from 0 to 1.  The greater the magnitude of the correlation between the predicted values and the actual values, the greater the R-squared, regardless of whether the correlation is positive or negative.  




When analyzing data with a logistic regression, an equivalent statistic to R-squared does not exist.  The model estimates from a logistic regression are maximum likelihood estimates arrived at through an iterative process.  They are not calculated to minimize variance, so the OLS approach to goodness-of-fit does not apply.  However, to evaluate the goodness-of-fit of logistic models, several pseudo R-squareds have been developed.   These are "pseudo" R-squareds because they look like R-squared in the sense that they are on a similar scale, ranging from 0 to 1 (though some pseudo R-squareds never achieve 0 or 1) with higher values indicating better model fit, but they cannot be interpreted as one would interpret an OLS R-squared and different pseudo R-squareds can arrive at very different values.  Note that most software packages report the natural logarithm of the likelihood due to floating point precision problems that more commonly arise with raw likelihoods.


Commonly Encountered Pseudo R-Squareds












Pseudo R-SquaredFormulaDescription
Efron's



Efron's mirrors approaches 1 and 3 from the list above--the model residuals are squared, summed, and divided by the total variability in the dependent variable, and this R-squared is also equal to the squared correlation between the predicted values and actual values. 


When considering Efron's, remember that model residuals from a logistic regression are not comparable to those in OLS. The dependent variable in a logistic regression is not continuous and the predicted value (a probability) is. In OLS, the predicted values and the actual values are both continuous and on the same scale, so their differences are easily interpreted. 

McFadden's



Mfull = Model with predictors


Mintercept = Model without predictors



McFadden's mirrors approaches 1 and 2 from the list above.  The log likelihood of the intercept model is treated as a total sum of squares, and the log likelihood of the full model is treated as the sum of squared errors (like in approach 1). The ratio of the likelihoods suggests the level of improvement over the intercept model offered by the full model (like in approach 2).

A likelihood falls between 0 and 1, so the log of a likelihood is less than or equal to zero.  If a model has a very low likelihood, then the log of the likelihood will have a larger magnitude than the log of a more likely model.  Thus, a small ratio of log likelihoods indicates that the full model is a far better fit than the intercept model.

If comparing two models on the same data, McFadden's would be higher for the model with the greater likelihood. 
McFadden's (adjusted)




McFadden's adjusted mirrors the adjusted R-squared in OLS by penalizing a model for including too many predictors. If the predictors in the model are effective, then the penalty will be small relative to the added information of the predictors.  However, if a model contains predictors that do not add sufficiently to the model, then the penalty becomes noticeable and the adjusted R-squared can decrease with the addition of a predictor, even if the R-squared increases slightly.

Note that negative McFadden's adjusted R-squared are possible.
Cox & Snell


Cox & Snell's mirrors approach 2 from the list above.  The ratio of the likelihoods reflects the improvement of the full model over the intercept model (the smaller the ratio, the greater the improvement).Consider the definition of L(M).  L(M) is the conditional probability of the dependent variable given the independent variables. If there are N observations in the dataset, then L(M) is the product of N such probabilities.  Thus, taking the nthroot of the product L(M) provides an estimate of the likelihood of each Y value.  Cox & Snell's presents the R-squared as a transformation of the -2ln[L(MIntercept)/L(MFull)] statistic that is used to determine the convergence of a logistic regression.

Note that Cox & Snell's pseudo R-squared has a maximum value that is not 1: if the full model predicts the outcome perfectly and has a likelihood of 1, Cox & Snell's is then 1-L(MIntercept)2/Nwhich is less than one. 
Nagelkerke / Cragg & Uhler's


Nagelkerke/Cragg & Uhler's mirrors approach 2 from the list above.  It adjusts Cox & Snell's so that the range of possible values extends to 1.To achieve this, the Cox & Snell R-squared is divided by its maximum possible value, 1-L(MIntercept)2/N.  Then, if the full model perfectly predicts the outcome and has a likelihood of 1, Nagelkerke/Cragg & Uhler's R-squared = 1.

When L(Mfull) = 1, then R2 = 1
When L(Mfull) = L(Mintercept), then R2 = 0.

McKelvey & Zavoina



McKelvey & Zavoina's mirrors approach 1 from the list above, but its calculations are based on predicting a continuous latent variable underlying the observed 0-1 outcomes in the data.  The model predictions of the latent variable can be calculated using the model coefficients (NOT the log-odds) and the predictor variables.McKelvey & Zavoina's also mirrors approach 3.  Because of the parallel structure between McKelvey & Zavoina's and OLS R-squareds, we can examine the square root of McKelvey & Zavoina's to arrive at the correlation between the latent continuous variable and the predicted probabilities.

Note that, because y* is not observed, we cannot calculate the variance of the error (the second term in the denominator).  It is assumed to be π2/3 in logistic models.
Count


Count R-Squared does not approach goodness of fit in a way comparable to any OLS approach.  It transforms the continuous predicted probabilities into a binary variable on the same scale as the outcome variable (0-1) and then assesses the predictions as correct or incorrect.Count R-Square treats any record with a predicted probability of .5 or greater as having a predicted outcome of 1 and any record with a predicted probability less than .5 as having a predicted outcome of 0.  Then, the predicted 1s that match actual 1s and predicted 0s that match actual 0s are tallied.  This is the number of records correctly predicted, given this cutoff point of .5. The R-square is this correct count divided by the total count. 
Adjusted Count

 


n = Count of most frequent outcome

The Adjusted Count R-Square mirrors approach 2 from the list above.  This adjustment is unrelated to the number of predictors and is not comparable to the adjustment to OLS or McFadden's R-Squareds.Consider this scenario: If you are asked to predict who in a list of 100 random people is left-handed or right-handed, you could guess that everyone in the list is right handed and you would be correct for the majority of the list. Your guess could be thought of as a null model.

The Adjusted Count R-Squared controls for such a null model.  Without knowing anything about the predictors, one could always predict the more common outcome and be right the majority of the time. An effective model should improve on this null model, and so this null model is the baseline for which the Count R-Square is adjusted.  The Adjusted Count R-squared then measures the proportion of correct predictions beyond this baseline. 


Espero que les haya interesado, iré agregando más material próximamente!.

Comentarios

Publicar un comentario