Remuestreo
Los métodos de remuestreo o submuestreo (i.e. resampling) evalúan los estadísticos en remuestras o submuestras obtenidas a partir de los datos originales, y mediante estos valores obtienen estimadores de lmedidas de exactitud o de la distribución muestral del estadístico.
El remuestreo son métodos que permiten:
- Estimar la precisión de muestras estadísticas (según su media, mediana, varinza, etc..).
- Intercambiar marcadores de puntos de datos al realizar test de significación (test de permutación, tambien denominados tests exactos, tests de aleatoriedad, o pruebas de re-aleatoriedad)
- Validar modelos para el uso de subconjuntos aleatorios (bootstrapping, validación cruzada)
Tipos de remuestreo.
Existen al menos 4 grupos principales de remuestreo:
- aleatorización de pruebas exactas (randomization exact test)
- validación cruzada (cross-validation): se usa para evaluar los resultados de un análisis estadístico y garantizar que son independientes de la partición entre datos de entrenamiento y prueba. Definida una muestra, se realizan particiones de las mismas y se evalúan los estadísticos para cada una de las particiones. El caso más sencillo consiste en tomar 2 particiones, una de entrenamiento y la otra de prueba.
- jackknife: para la estimación del sesgo y del error estándar de un estadístico. El nombre alude a las navajas, y más precisamente a aquellas multiuso, que sirven pa tó. Definida una muestra de observaciones de tamaño n, el método jacknife consiste en suprimir cada vez un conjunto de observaciones g y calcular sobre el conjunto (n-g) restante de datos el estadístico de interés. Una vez obtenidas las n estimaciones del estadístico para cada una de las muestras Jackknife, se puede calcular una estimación del sesgo asociado a la muestra original, así como una estimación no paramétrica del error estándar del estadístico.
- boostrap: su nombre refiere a su autosuficiencia, ya que significa “levantarse tirando hacia arriba de las propias correas de las botas”. El Bootstrap se basa en el remuestreo con reposición a partir de la muestra original para obtener nuevas muestras, calcular en cada una de ellas el estadístico deseado, y así obtener una distribución de valores para el estadístico en la que podemos calcular su dispersión (análogo del error estándar) y determinar unos límites de confianza utilizando esa distribución.
Descripción y aplicación en R
#### 1) Aleatorización de pruebas exactas o test de permutación (Randomization exact test or permutation test):
NOTA: Also known as the permutation test, this test was developed by R. A. Fisher (1935/1960), the founder of classical statistical testing. However, in his later years Fisher lost interest in the permutation method because there were no computers in his days to automate such a laborious method.
El propósito es simular artificialmente la aleatoriedad.
Los datos se re-asignan aleatoriamente (datos permutados) de tal manera de obtener p-valores basados en estos nuevos datos.
EJ: evaluar la diferencia entre dos m?todos. Luego de evaluar cada conjunto de datos permutados, por ejemplo 100, el investigador puede poner juntos los t-valores y obtener un "p-valor exacto" (la probabilidad de que la diferencia solo ocurra por azar). Por ejemplo, si el t-valor de 1.55 ocurre solo 5 veces en las 100, el p-valor exacto ser?a .05.
Problemas: d?bil ante tama?os de muestras peque?os.
#### 2) Cross-validation:
NOTA: Simple cross-validation was proposed by Kurtz (1948) as a remedy for the Rorschach test, a form of personality test which was criticized by psychometricians for its lack of common psychometric attributes such as data normality. Based on Kurtz's simple cross-validation, Mosier (1951) developed double cross-validation, which was later extended to multicross-validation by Krus and Fuller (1982).
El propósito es averiguar si los resultados son replicables o solo una cuesti?n de fluctuaciones aleatorias, es decir, verificar la replicabilidad de los resultados.
La muestra se divide aleatoriamente en 2 o más subconjuntos y los resultados de la prueba sn validados comparando las sub-muestras.
Varias categor?as: simple, doble o m?ltiple:
*Simple cross-validation. Take regression as an example. In the process of implementing a simple cross-validation, the first sub-sample is usually used for deriving the regression equation while another sub-sample is used for generating predicted scores from the first regression equation. Next, the cross-validity coefficient is computed by correlating the predicted scores and the observed scores on the outcome variable.
*Double cross-validation. Double cross-validation is a step further than its simple counterpart. Take regression as an example again. In double cross-validation regression equations are generated in both sub-samples, and then both equations are used to generate predicted scores and cross-validity coefficients.
*Multicross-validation. Multicross-validation is an extension of double cross-validation. In this form of cross-validation, double cross-validation procedures are repeated many times by randomly selecting sub-samples from the data set. In the context of regression analysis, beta weights computed in each sub-sample are used to predict the outcome variable in the corresponding sub-sample. Next, the observed and predicted scores of the outcome variable in each sub-sample are used to compute the cross validated coefficient.
Problemas: d?bil ante tama?os de muestras peque?os. Ang (1998) argued that cross-validation is problematic because splitting an already small sample increases the risk that the beta weights are just artifacts of the sub-sample. Thus, Ang endorsed use of Jackknife, which will be discussed next.
Es ?til cuando tenemos que elegir entre modelos (por ejemplo, anidados) ya que la validaci?n cruzada nos permite elegir el modelo que "mejor predice" los valores, a diferencia de los m?todos de m?nimos cuadrados o m?xima verosimilitud que eligen los par?metros dentro de un conjunto de modelos que "mejor ajustan" los valores.
NOTA: a pesar de que la validaci?n cruzada es ?til para elegir entre modelos, no tenemos idea de si la diferencia entre ellos es estad?sticamente significativa.
La manera m?s f?cil de hacer validaci?n cruzada es dividir los datos en dos partes: el conjunto de entrenamiento (training set) que ser? usado para ajustar varios modelos a los datos, y el conjunto de prueba (test set) que se usar? para elegir entre modelos.
un inconveniente del m?todo anterior es que reduce el poder de los datos para ajustar un modelo ocultando algunas observaciones para validaci?n. Podemos minimizar esta p?rdida tomando un conjunto de entrenamiento tan grande como sea posible, pero en tal caso no tendr?amos suficientes puntos para la validaci?n. Evitamos esto utilizando una estratgia "leave one out", superficialmente similar al jackknife. (el jackknife trabaja con un modelo particular paara estimar un estimador de sesgo y varianza muestral, mientras que en la validaci?n cruzada estamos eligiendo entre modelos).
Ejemplo: an?lisis de regersi?n
? Leaving out the ith pair of points (Xi, Yi), use least squares methods to find parameter estimates ?a-i and ?b-i.
? Hence, find the best predictor of the left out Yi: ?Y-i = ?a-i + ?b-i*Xi
? Repeat the above for each point and calculate the cross validation score, or estimate predictive error (1/n)*sum(Yi-?Y-i)^2
? If we also calculate the cross validation score for the log(X) model we can again use predictive error as a basis to choose between the models.
#### 3) Jackknife: Also known as the Quenouille-Tukey Jackknife, this tool was invented by Maurice Quenouille (1949) and later developed by John W. Tukey (1958). As the father of EDA, John Tukey attempted to use Jackknife to explore how a model is influenced by subsets of observations when outliers are present. The name "Jackknife" was coined by Tukey to imply that the method is an all-purpose statistical tool.
El prop?sito es detectar valores extremos y c?mo cada sub-muestra afecta al modelo.
Se repite el mismo test dejando un sujetopor vez ("leave one out"). Este procedimiento es especialmente ?til cuando la dispersi?n de la distribuci?n es amplia o existen valores extremos en el conjunto de datos. En estos casos se espera que el Jackknife retorne una estimaci?n de menor sesgo.
Se lo puede ver como una versi?n m?s organizada o sistem?tica del bootstrap.
Algoritmo:
Given a data set of n observations we create n new data sets, each of size n - 1, by leaving out one point at a time.
? Suppose we have a sample of observations X = {X1, . . . Xn} that we will use to estimate a parameter Theta.
? Suppose that T = T(X) is an estimator of Theta.
? Draw n jackknife samples X-1, X-2, ...., X-n by leaving out the first, second . . . nth observation respectively.
? Calculate T-i=T(X-i) for each i.
? Calculate T* =(1/n)*sum(T-i)
? The jackknife estimate of variance is ((n-1)/n)*sum(T-i-T*)^2
? And the jackknife estimate of bias is (n - 1)(T*- T) which leads to a bias corrected estimate of nT - (n - 1)T* when we subtract this from T.
Ejemplo:
dat=c(3.9140011, 0.1819694, 2.5604846, 0.3703330, 6.7950865, 4.0200424, 0.660441) # queremos estimar su media.
jk = rep(0,length(dat))
for (i in 1:length(jk)) jk[i] = mean(dat[-i])
mean(jk) #[1] 2.643194
(length(jk)-1)/length(jk) * sum((jk-mean(jk))^2) # [1] 0.855013
mean(dat) #[1] 2.643194, observemos que la media de la muestra jackknife es igual a la media de los datos originales.
var(dat)/length(dat) #[1] 0.855013, observemos que la estimaci?n usual de la varianza (insesgada) de la media muestra es la misma que la del estimados jackknife de la varianza. Sin embargo, la estimaci?n bootstrap de la varianza de ma media muestral conveerge al MLE (sesgado).
# f?rmula general de jackknife
jackknife=function(T,x)
{
n = length(x)
jk = rep(0,n)
for (i in 1:n)
jk[i] = T(x[-i])
Tstar = mean(jk)
jkvar = (n-1)/n * sum((jk-Tstar)^2)
list(var = jkvar, bias = (n-1)*(Tstar - T(x)), corrected = n*T(x)-(n-1)*Tstar)
}
salida=jackknife(T=mean,x=dat)
# la estimaci?n de la varianza por jackknife puede utilizarse para realizar intervalos de confianza y test de hip?tesis, tal como se realiza con la manera usual de estimar la varianza.
#### 4) Bootstrap: This technique was invented by Bradley Efron (1979, 1981, 1982) and further developed by Efron and Tibshirani (1993). "Bootstrap" means that one available sample gives rise to many others by resampling (a concept reminiscent of pulling yourself up by your own bootstrap). While the original objective of cross-validation is to verify replicability of results and that of Jackknife is to detect outliers, Efron (1981, 1982) developed bootstrap with inferential purposes.
El prop?sito es la inferencia.
Genera resultados con menor sesgo y m?s consistentes, respecto al Jackknife (Fan & Wang, 1996).
Comparado con la t?cnica de Jackknife, la estrategia de remuestreo del Bootstrao es m?s rigurosa en t?rminos de la magnitud de la replicaci?n. En Jackknife, el n?mero de remuestreos est? limitado por el n?mero de observaciones (n-1), pero en bootstrap, la muestra original puede ser duplicada tantas veces como que quiera y por tanto esta muestra expandida se trata como una poblaci?n virtual. Las muestras se obtienen de esta poblaci?n para veerificar los estimadores.
A diferencia de la validaci?n cruzada o el Jackknife, el Bootstrap emplea un muestreo con reemplazamiento, que -en t?rminos de simulaci?n de aleatoriedad- es m?s adecuado que el muestreo sin reemplazo.
A diferencia de la validaci?n cruzada o el Jackknife, donde el n de las submuestras es menor que el de la muestra original, en Bootstrap cada remuestreo tiene el mismo n?mero de observacinones que la muestra original. Por tanto, tiene la ventaja de modelar los impactos del tama?o real de la muestra (Fan & Wang, 1996).
Algoritmo:
? Suppose we have a sample of observations X = {X1, . . . Xn} that we want to use to estimate a parameter Theta.
? Suppose also that we have derived T(X), an estimator of Theta, and want to evaluate its sampling variance.
? Draw a bootstrap sample X1 of size n by sampling with replacement from X.
? Calculate T1=T(X1)
? Repeat the resampling b times to get T1, T2, ..., Tb
? Calculate the mean of the bootstrapped statistics meanT=(1/b)*sum(Ti)
? Finally, calculate the bootstrap estimate of variance (1/(b-1))*sum(Ti-meanT)^2
Ejemplo:
dat=c(3.9140011, 0.1819694, 2.5604846, 0.3703330, 6.7950865, 4.0200424, 0.660441) #sea el conjunto de datos dat, queremos usarlo para estimar la media de la poblaci?n de la que proviene
mean(dat)
#para evaluar la varianza de esta estimaci?n, podemos comenzar tomando una muestra bootstrap y calculando la media en ella.
boot1=sample(dat, replace=TRUE)
boot1
mean(boot1) # esto nos da T1=2.461595
# si lo ponemos en un loop podemos generar una muestra completa de bootstrap y evaluar la varianza.
bs=rep(0,100000)
for (i in 1:length(bs)) bs[i] = mean(sample(dat,replace=TRUE))
var(bs) # mi estimaci?n de la varianza muestra del estimador es 0.7376547
sum((dat-mean(dat))^2) / length(dat) / length(dat) # la comparamos con la forma usual de obtener la varianza muestral
#cuantas muestras tomar depende en qu? tan r?pido queramos calcular nuestras estimaciones.
# f?rmula general de bootsrap
bootstrap=function(T,x,b=10000)
{
bs = rep(0,b)
for (i in 1:b)
bs[i] = T(sample(x,replace=TRUE))
bs
}
salida=bootstrap(T=mean,x=dat,b=10000) #mi estimaci?n de la varianza muestra del estimador es 0.7312076
# la estimaci?n de la varianza por bootstrap puede utilizarse para realizar intervalos de confianza y test de hip?tesis, tal como se realiza con la manera usual de estimar la varianza.
## Variantes del bootstrap
# i) smoothed bootstrap. Se agrega a cada remuestreo una cantidad comparativamente peque?a de valores generados aleatoriamente. Esto tiene el mismo efecto que ajustar una estimaci?n del n?cle (kernel) de densidad a los datos, y remuestrar en ellos.
# ii) parametric bootstrap. Se ajusta una densidad param?trica a los datos observados y se realiza el remuestreo en ella.
# iii) wild bootstrap. Para datos distribuidos sim?tricamente respecto a 0. Cada valor remuestrado se multiplica por -1 con una probabilidad 0.5 antes de ser usado.
# iv) balanced bootstrap. concatenamos b copias del conjunto original de datos en un vector y permutamso aleatoriamente en ?l. Los primeros n elementos se vuelven la primer muestra bootstrap, los degundos n elementos la segunda muestra y as? sucesivamente.
#### 6) Markov chain Monte Carlo or MCMC (o Metropolis & Hastings samplers)
Dada una distribuci?n phi en un n?mero finito de estados, podemos construir alg?n proceso equivalente a una cadena de Markov irreducible P de tal manera que la distribuci?n erg?dica de la cadena sea phi.
mirar tambi?n: http://cran.r-project.org/web/packages/mcmc/vignettes/demo.pdf
Mirar en: http://www.iiap.res.in/astrostat/School07/R/MCMC.html
A toy example to calculate P(-1 < X < 0) when X is a Normal(0,1) random variable:
xs = rnorm(10000) # simulate 10,000 draws from N(0,1)
xcount = sum((xs>-1) & (xs<0)) # count number of draws between -1 and 0
xcount/10000 # Monte Carlo estimate of probability
pnorm(0)-pnorm(-1) # Compare it to R's answer (cdf at 0) - (cdf at -1)
#i) Metropolis sampling
Implica una etapa de proposici?n y una etapa de aceptaci?n, de tal manera que para movernos desde el estado i al estado j debemos primero proponer el movimiento y luego aceptar el movimiento.
Algoritmo:
? Proposals are generated according to a symmetric transition matrix Q, so that Qi,j = Qj,i ,or the probability of proposing a move to state j from state i is the same as proposing a move to state i from state j.
? A proposed move from state i to state j is accepted with probability min(1,phij/phii)
? If the proposal is accepted we move to state j, and if it is rejected, we stay in state i.
para correr esta cadena solo neceistamos definir el mecanismo de propuesta, que no depende de phi, y luego evaluar la propuesta utilizando la proporci?n phij/phii. Note que no necesitariamente tenemos que conocer Qi,j, solo necesitamos conocer que Q sea sim?trica.
Note que para muestras v?lidas de phi, la cadena de markov debe ser irreducible.
Ejemplo: implementar Metropolis sampler dado un vector de probabilidades phi.
f.metro=function(pi,n=10000)
{
s = 1:length(pi)
x = rep(0,n)
x[1] = sample(s,1)
for (i in 2:n)
{
if (runif(1) < 0.5)
z = x[i-1] - 1
else
z = x[i-1] + 1
if (z < 1 | z > length(s))
a = 0
else
a = min(1, pi[z]/pi[x[i-1]])
if (runif(1) < a)
x[i] = z
else
x[i] = x[i-1]
}
x
}
f.runmc(P)
pi=
z=f.metro(pi,100000)
table(z)/length(z)
#z
#1 2 3 4 5 6
#0.20470 0.18060 0.04159 0.04179 0.12459 0.40673
# esta salida concuerda bastante bien con el vector phi de entrada.
# From Robert & Casells: A Metropolis{Hastings sample is then generated with the following R code:
a=2.7; b=6.3; c=2.669 # initial values
Nsim=5000
X=rep(runif(1),Nsim) # initialize the chain
for (i in 2:Nsim)
{
Y=runif(1)
rho=dbeta(Y,a,b)/dbeta(X[i-1],a,b)
X[i]=X[i-1] + (Y-X[i-1])*(runif(1)<rho)
}
#i) Hasting?s sampling
Este m?todo generaliza el m?todo de Metropolis al caso en el cual Q no es sim?trico.
Tiene la misma estructura de propuesta y aceptaci?n, pero la probabilidad de aceptaci?n viene dada por max(1,(phij*Qj,i)/(phii*Qi,j)).
Para realizar este muestreo necesitamos conocer m?s expl?citamente en detalle la matriz Q, especificamente necesitamos la proporci?n Qi,j/Qj,i.
### M?s ejemplos:
http://www.mayin.org/ajayshah/KB/R/documents/boot.html
http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf
http://www.statmethods.net/advstats/bootstrapping.html
http://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf
##### Types of resampling:
# bibliograf?a:
http://pareonline.net/getvn.asp?v=8&n=19
http://episun7.med.utah.edu/~alun/teach/compstats/notes.pdf
http://cran.r-project.org/web/packages/mcmc/vignettes/demo.pdf
EJEMPLOS
#################################################################################
#### Resample data
#################################################################################
## Basado en http://www.zoology.ubc.ca/~schluter/zoo502stats/Rtips.resample.html
#Let's assume that the data are a sample of measurements for a single variable stored in a vector "x". The data may be numeric or categorical.
#A single bootstrap replicate is obtained as follows. The replace option is used to indicate that sampling is carried out with replacement.
xboot = sample(x, replace=TRUE)
#Calculate the statistic of interest (for example, the mean) on the resampled data in "xboot" and store the result in a vector created for this purpose.
z = vector() # initialize z (do only once)
z[1] = mean(xboot) # first bootstrap replicate estimate
#Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for subsequent analysis.
#In other cases, two or more variables are measured on individuals (e.g., stem height, leaf area, petal diameter, etc). Assume that each row of a data frame "mydata" is a different individual, and each column a different variable.
#To resample individuals (i.e., rows),
iboot = sample(1:nrow(mydata), replace=TRUE))
bootdata = mydata[iboot,]
#The data frame "bootdata" will contain a single bootstrap replicate including all the variables.
#Calculate the statistic of interest on the resampled data and store the result in vector created for this purpose. For example, to calculate the correlation between two variables x and y in "bootdata",
z = vector() # initialize z (do only once)
z[1] = cor(bootdata$x, bootdata$y)
#Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for subsequent analysis.
# Basado en http://www.zoology.ubc.ca/~schluter/zoo502stats/Rtips.resample.html
### Randomization test
#A randomization test uses resampling and the computer to generate a null distribution for a test statistic. The test statistic is a measure of association between two variables or difference between groups. Each randomization step involves randomly resampling without replacement the values of one of the two variables in the data. The two variables may be categorical (character or factor), numeric, or one of each.
## Both variables categorical
#R has a built-in randomization procedure for a contingency test of association when both of two variables are categorical (call them A1 and A2). To apply it, execute the usual command for the ?2 contingency test, but set the "simulate.p.value" option to TRUE. The number of replicates in the randomization is set by the option "B" (default is 2000). Each randomization rearranges the values in the contingency table while keeping all the row and column totals fixed to their observed values.
chisq.test(A1, A2, simulate.p.value = TRUE, B = 5000)
## Data are numeric
#If one or both of the variables is numeric, then you will need to create a short loop to carry out the resampling necessary for the randomization test. Choose one of the two variables to resample (call it "x"). It doesn't matter which of the two variables you choose. Keep the other variable (call it "y") unchanged (there is no benefit to resampling both variables).
#Resample "x" without replacement to create a new vector (call it x1).
x1 = sample(x, replace = FALSE)
#Calculate the test statistic to measure association between y and the randomized variable x1. Store the result in a vector created for this purpose. For example, to calculate the correlation between the two variables,
z = vector() # initialize z (do only once)
z[1] = cor(x1, y) # first randomization result
#Repeat steps (1) and (2) many times. The result will be a large collection of replicates representing the null distribution of your test statistic.
#Use this null distribution to calculate an approximate P-value for your test. This involves calculating the proportion of values in the null distribution that are as extreme or more extreme than the observed value of the test statistic (calculated from the real data). Whether this calculation must take account of just one or both tails of the null distribution depends on your test statistic.
#################################################################################
### Permutation test: randomization test for small size samples
#################################################################################
## Basado en http://www.burns-stat.com/pages/Tutor/bootstrap_resampling.html
# The idea: Permutation tests are restricted to the case where the null hypothesis really is null -- that is, that there is no effect. If changing the order of the data destroys the effect (whatever it is), then a random permutation test can be done. The test checks if the statistic with the actual data is unusual relative to the distribution of the statistic for permuted data.
# Our example permutation test is to test volatility clustering of the S&P returns. Below is an R function that computes the statistic for Engle's ARCH test.
engle.arch.test = function (x, order=10)
{
xsq = x^2
nobs = length(x)
inds = outer(0:(nobs - order - 1), order:1, "+")
xmat = xsq[inds]
dim(xmat) = dim(inds)
xreg = lm(xsq[-1:-order] ~ xmat)
summary(xreg)$r.squared * (nobs - order)
}
# All you need to know is that the function returns the test statistic and that a big value means there is volatility clustering, but here is an explanation of it if you are interested. The test does a regression with the squared returns as the response and some number of lags (most recent previous data) of the squared returns as explanatory variables. (An estimate of the mean of the returns is generally removed first, but this has little impact in practice.) If the last few squared returns have power to predict tomorrow's squared return, then there must be volatility clustering. The tricky part of the function is the line that creates inds. This object is a matrix of the desired indices of the squared returns for the matrix of explanatory variables. Once the explanatory matrix is created, the regression is performed and the desired statistic is returned. The default number of lags to use is 10.
# A random permutation test compares the value of the test statistic for the original data to the distribution of test statistics when the data are permuted. We do this below for our example.
spx.arch.perm = numeric(1000)
for(i in 1:1000) {
spx.arch.perm[i] = engle.arch.test(sample(spxret))
}
hist(spx.arch.perm, col='yellow')
abline(v=engle.arch.test(spxret), col='blue')
# The simplest way to get a random permutation in R is to put your data vector as the only argument to the sample function. The call:
sample(spxret)
#is equivalent to:
sample(spxret, size=length(spxret), replace=FALSE)
# A simple calculation for the p-value of the permutation test is to count the number of statistics from permuted data that exceed the statistic for the original data and then divide by the number of permutations performed. In R a succinct form of this calculation is:
mean(spx.arch.perm >= engle.arch.test(spxret))
# In my case I get 0.016. That is, 16 of the 1000 permutations produced a test statistic larger than the statistic from the real data. A test using 100,000 permutations gave a value of 0.0111.
## Basado en ayuda del paquete BHH2
## Permutation test for means and variance comparisons.
library(BHH2)
# Permutation test for Tomato Data
data(tomato.data)
cat("Tomato Data (not paired):\n")
attach(tomato.data)
a = pounds[fertilizer=="A"]
b = pounds[fertilizer=="B"]
print(round(test = permtest(b,a),3))
detach()
# Permutation test for Boy's Shoes Example
data(shoes.data)
cat("Shoes Data (paired):\n")
attach(shoes.data)
x = matB-matA
print(round(test = permtest(x),3))
detach()
#################################################################################
###############################################################
library(bootstrap)
# to do a leave-on-out jackknife estimate for the mean of the
# data ?jackknife gives an example
# Having a look at the jackknife function we see that it demands
# two parameters: x and theta. x is supposed to contain the data
# and theta the function that is applied to the data
x = rnorm(20)
theta = function(x){mean(x)}
results = jackknife(x,theta)
###############################################################
#### Jackknife
###############################################################
#Bootstrap
#Now I want to program the estimation for one coefficient of a linear regression model.
DF = as.data.frame(matrix(rnorm(250), ncol=5)) # my data
model.lm = formula(V1 ~ V2 + V3 + V4) # my model
# Now I need to specify the theta function. Here x is not the
# data itself but is used as the row index vector to select
# a subset from the data frame (xdata). Also the coefficient
# to be returned is specified.
theta = function(x, xdata, coefficient)
{
coef(lm(model.lm, data=xdata[x,]))[coefficient]
}
# So now at each leave-on-out run the model is calculated with
# a subset defined by the vector x (here one is left out) and one
# coefficient is returned:
results = jackknife(1:50, theta, xdata=DF, coefficient="(Intercept)")
## To expand this code onto the estimation of all the regression coefficients is only a small step now. As the theta function is supposed to return a scalar and not a list of estimates for each coefficient, the following workaround is used: The sapply function calls the jackknife function four times prompting a different parameter estimate at each run. The prompted coeffient is passed on to the jackknife function by the three point (?) argument .
# The following function calculates all then coefficients
jackknife.apply = function(x, xdata, coefs)
{
sapply(coefs,
function(coefficient) jackknife(x, theta, xdata=xdata,
coefficient=coefficient),
simplify=F)
}
# now jackknife.apply() can be called
results = jackknife.apply(1:50, DF, c("(Intercept)", "V2", "V3", "V3"))
###############################################################
### Cross Validation
#################################################################################
# Basado en http://www.burns-stat.com/pages/Tutor/bootstrap_resampling.html
#The idea: Models should be tested with data that were not used to fit the model. If you have enough data, it is best to hold back a random portion of the data to use for testing. Cross validation is a trick to get out-of-sample tests but still use all the data. The sleight of hand is to do a number of fits, each time leaving out a different portion of the data.
#Cross validation is perhaps most often used in the context of prediction. Everyone wants to predict the stock market. So let's do it.
#Below we predict tomorrow's IBM return with a quadratic function of today's IBM and S&P returns.
predictors = cbind(spxret, ibmret, spxret^2, ibmret^2,
spxret * ibmret)[-251,]
predicted = ibmret[-1]
predicted.lm = lm(predicted ~ predictors)
#The p-value for this regression is 0.048, so it is good enough to publish in a journal. However, you might want to hold off putting money on it until we have tested it a bit more.
#In cross validation we divide the data into some number of groups. Then for each group we fit the model with all of the data that are not in the group, and test that fit with the data that are in the group. Below we divide the data into 5 groups.
group = rep(1:5, length=250)
group = sample(group)
mse.group = numeric(5)
for(i in 1:5) {
group.lm = lm(predicted[group != i] ~
predictors[group != i, ])
mse.group[i] = mean((predicted[group == i] -
cbind(1, predictors[group == i,]) %*% coef(group.lm))^2)
}
#The first command above repeats the numbers 1 through 5 to get a vector of length 250. We do not want to use this as our grouping because we may be capturing systematic effects. In our case a group would be on one particular day of the week until a holiday interrupted the pattern. Hence the next line permutes the vector so we have random assignment, but still an equal number of observations in each group. The for loop estimates each of the five models and computes the out-of-sample mean squared error.
#The mean squared error of the predicted vector taking only its mean into account is 0.794. The in-sample mean squared error from the regression is 0.759. This is a modest improvement, but in the context of market prediction might be of use if the improvement were real. However, the cross validation mean squared error is 0.800 -- even higher than from the constant model. This is evidence of overfitting (which this regression model surely is doing).
#################################################################################
#### Boostrap
#################################################################################
There are a number of R packages that are either confined to or touch upon bootstrapping or its relatives. These include:
boot: This package incorporates quite a wide variety of bootstrapping tricks.
bootstrap: A package of relatively simple functions for bootstrapping and related techniques.
coin: A package for permutation tests (which are discussed below).
Design: This package includes bootcov for bootstrapping the covariance of estimated regression parameters and validate for testing the quality of fitted models by cross validation or bootstrapping.
MChtest: This package is for Monte Carlo hypothesis tests, that is, tests using some form of resampling. This includes code for sampling rules where the number of samples taken depend on how certain the result is.
meboot: Provides a method of bootstrapping a time series.
permtest: A package containing a function for permutation tests of microarray data.
resper: A package for doing restricted permutations.
scaleboot: This package produces approximately unbiased hypothesis tests via bootstrapping.
simpleboot: A package of a few functions that perform (or present) bootstraps in simple situations, such as one and two samples, and linear regression.
There are a large number of R packages that include bootstrapping. Examples include multtest that has the boot.resample function, and Matching which has a function for a bootstrapped Kolmogorov-Smirnov test (the equality of two probability distributions).
################################################################################################
Basado en http://www.zoology.ubc.ca/~schluter/zoo502stats/Rtips.resample.html
library(boot)
### Single variable
#To use the boot package you will need to write a function to calculate the statistic of interest. The format is illustrated below for the sample mean, but any univariate function would be handled similarly. We'll call our function "boot.mean". When you have finished writing the script for a function you will need to cut and paste it into the command window so that R can access it (you'll need to do this just once in an R session). Here, "x" refers to the vector of data. "i" refers to a vector of indices, which must be included as an argument in the function and employed as shown.
boot.mean = function(x,i){boot.mean = mean(x[i])}
#The command "boot" will automatically carry out all the resampling and computations required. For this example, "x" is the vector of original data and "boot.mean" is the name of the function we created above to calculate the statistic of interest. R specifies the number of bootstrap replicate estimates desired.
z = boot(x, boot.mean, R=2000)
#The resulting object (which here named "z") is a boot object containing all the results. Use the following additional commands to pull out the results.
print(z) # Bootstrap calculation of bias and SE
sd(z$t) # Another way to get the standard error
hist(z$t) # Histogram of boostrap replicate estimates
qqnorm(z$t) # Normal quantiles of replicate estimates
boot.ci(z,type="bca") # 95% confidence interval using BCa
boot.ci(z,type="perc") # Same using percentile method
### Two or more variables
#Here's how you would use the boot command If the statistic of interest must be calculated from two (or more) variables (for example, a correlation, regression slope, or odds ratio). Assume that there are two variables of interest, "x" and "y". If not done already, put the two variables into a data frame (here called "mydata"),
mydata = cbind.data.frame(x, y, stringsAsFactors=FALSE)
# Then create a function to calculate the statistic of interest on the variables in "mydata". For example, to create a function that calculates the correlation coefficient between the two variables "x" and "y", use
boot.cor = function(mydata,i)
{
x = mydata$x[i]
y = mydata$y[i]
boot.cor = cor(x,y)
}
#Here, "i" refers to a vector of indices, which must be included as an argument in the function and employed as shown.
#Finally, pass your data frame and function to the boot command,
z = boot(mydata, boot.cor, R=2000)
#See the previous section for a list of commands to pull results from the boot object (here named "z").
################################################################################################
Basado en el difmediasTestBoot.pdf
#Uso de library( boot ) Tres ejemplos sencillos
################################################################################################
#1) Estimaci?n del par?metro exp( media ) mediante xbar=mean( x ):
################################################################################################
set.seed( 89723 )
x = c(-1.034,-1.044,-1.445,-0.954,0.223,0.096,0.487,-0.485,0.491,-1.122) #Los datos x
xbar = mean(x)
n = length(x)
expm.fun = function(x) { exp(mean(x)) }
expm = expm.fun(x)
cbind( n, xbar, expm )
#n xbar expm
#[1,] 10 -0.4787 0.6195883
library( boot )
R = 999
#Es preciso definir una funci?n de los datos, x, y de los ?ndices, que ser?n tomados al azar al hacer bootstrap, que calcule el estad?stico a "bootstrapear":
expm.i = function( x, indices )
{
expmb = exp( mean( x[indices] ) )
c( expmb )
}
#Bootstrap del estad?stico exp(media):
expm.boot = boot( data=x, statistic=expm.i, R=R )
expm.boot
#ORDINARY NONPARAMETRIC BOOTSTRAP
#Call:
#boot(data = x, statistic = expm.i, R = R)
#Bootstrap Statistics :
#original bias std. error
#t1* 0.6195883 0.01076195 0.1456947
#Boostrap del intervalo de confianza. La funci?n boot.ci aplicada a un objeto de tipo boot permite calcular diferentes IC bootstrap:
boot.ci( expm.boot, conf=0.95, type=c("perc", "bca") )
#BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#Based on 999 bootstrap replicates
#CALL :
#boot.ci(boot.out = expm.boot, conf = 0.95, type = c("perc", "bca"))
#Intervals :
#Level Percentile BCa
#95% ( 0.3928, 0.9627 ) ( 0.4067, 0.9822 )
#Calculations and Intervals on Original Scale
#Las r?plicas bootstrap se encuentran en el elemento expm.boot$t del objeto expm.boot:
expmb = expm.boot$t ### valores bootstrap de expm
sd( expmb ) [1] 0.1456947
### se bootstrap de expm
quantile( expmb, c( 0.025, 0.975) )
#2.5% 97.5%
#0.3946921 0.9600645
#La funci?n plot aplicada al objeto de tipo boot, expm.boot, produce por defecto un histograma de expmb y un plot de normalidad:
windows( width=7, height=4 )
par( pty="s" )
plot( expm.boot ) ### funci?n plot.boot: produce dos plots
title( main="Resultados de expm.boot", font.main=4, line=3, col.main="blue" )
x11()
hist( expmb, breaks=50, probability=T )
abline( v=c(expm,quantile(expmb,c(0.025,0.975))), lty=3 )
rug( expm, lwd=2, col="blue" )
lines( density( expmb ) )
################################################################################################
#### 2) Regresi?n Bootstrap sobre la muestra de pares. IC para beta1
################################################################################################
### y respuesta; x explicativa; IC para beta1
x = c(109,88,96,96,109,116,114,96,85,100,113,117,107,104,101,81)
y = c(116,77,95,79,113,122,109,94,91,88,115,119,100,115,95,90)
lmyx = lm( y ~ x )
summary( lmyx )
#Call:
#lm(formula = y ~ x)
#Coefficients:
#Estimate Std. Error t value Pr(>|t|)
#(Intercept) -9.3016 20.0530 -0.464 0.65
#x 1.0826 0.1955 5.537 7.32e-05 ***
#Residual standard error: 8.414 on 14 degrees of freedom
#Multiple R-squared: 0.6865, Adjusted R-squared: 0.6641
#F-statistic: 30.66 on 1 and 14 DF, p-value: 7.319e-05
library( boot )
R = 999
datos = cbind(x, y)
beta1.i = function( datos, i )
{
coef( lm( datos[,2][i] ~ datos[,1][i] ) )[2]
}
beta1.b = boot( data=datos, statistic=beta1.i, R=R )
beta1.b
# ORDINARY NONPARAMETRIC BOOTSTRAP
# Call:
# boot(data = datos, statistic = beta1.i, R = R)
# Bootstrap Statistics :
# original bias std. error
# t1* 1.082613 0.02671124 0.2162240
boot.ci( beta1.b, type=c("perc", "bca") )
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 999 bootstrap replicates
# CALL :
# boot.ci(boot.out = beta1.b, type = c("perc", "bca"))
# Intervals :
# Level Percentile BCa
# 95% ( 0.795, 1.653 ) ( 0.774, 1.628 )
# Calculations and Intervals on Original Scale
windows( width=7, height=4 )
par( pty="s" )
plot( beta1.b, cex.main=0.8 )
title( main="Regresi?n lineal: beta1.b", font.main=4, line=3, col.main="blue")
x11()
hist( beta1.b$t, probability=T, breaks=30, main="Histograma de beta1.b", cex.main=0.9 )
abline(v=c( coef(lmyx)[2], quantile(beta1.b$t,c(0.025,0.975))), lty=3) Histograma de beta1.bbeta1.b$tDensity0.60.81.01.21.41.61.80.00.51.01.52.0
rug( coef(lmyx)[2], lwd=2, col="blue" )
lines( density(beta1.b$t) )
################################################################################################
#### 1) Bootstrap para la comparaci?n de medias (IC y tests bootstrap)
################################################################################################
#Dadas dos muestras, x e y, de sendas poblaciones independientes, con distribuciones F y G respectivamente, se quiere calcular un IC para la diferencia de medias.
#Por ejemplo: ?Cu?l es la diferencia de medias del consumo diario de energ?a de individuos normales y obesos?
#Consumo diario (MJ/d?a) en un grupo de 13 individuos de peso normal: 6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11, 8.40, 10.15, 10.88.
#Consumo en un grupo de 9 individuos obesos: 7.67, 8.19, 8.21, 8.79, 9.11, 9.21, 10.34, 11.53, 12.10.
### Aplicaci?n: IC percentil bootstrap para la diferencia de medias
x = c(6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11, 8.40, 10.15, 10.88) ### x: consumo diario de energia en individuos de peso normal
y = c(7.67, 8.19, 8.21, 8.79, 9.11, 9.21, 10.34, 11.53, 12.10) ### y: consumo diario de energia en individuos obesos
n = length(x)
m = length(y)
mx = mean(x)
my = mean(y)
se.x = sd(x)/sqrt(n)
se.y = sd(y)/sqrt(m)
### Estimaci?n de la diferencia de medias ###
dif = my-mx ### diferencia de medias observadas
se1.dif = sqrt((var(y)/m)+(var(x)/n)) ### se varianzas distintas
se2.dif = sqrt((((n-1)*var(x))+((m-1)*var(y)))/(n+m-2))*sqrt((1/m)+(1/n)) ### se2.dif: se para varianzas iguales
round(cbind( n, mx, se.x, m, my, se.y, dif, se1.dif, se2.dif ), 3)
#n mx se.x m my se.y dif se1.dif se2.dif
#[1,] 13 8.066 0.343 9 9.461 0.514 1.395 0.618 0.593
###Se puede obtener f?cilmente un IC basado en la t de student para la diferencia de medias:
alfa = 0.05; confianza = 1-alfa;
infdif = dif - qt(1-alfa/2, n+m-2) * se2.dif
supdif = dif + qt(1-alfa/2, n+m-2) * se2.dif
round(cbind( dif, se2.dif, infdif, supdif, confianza ), 3)
#dif se2.dif infdif supdif confianza
#[1,] 1.395 0.593 0.157 2.632 0.95
#Mediante bootstrap podemos obtener un IC para la diferencia de medias utilizando simplemente su estimador, la diferencia de medias muestrales. Usando R=1000 r?plicas bootstrap, realizadas por separado, para x, y para y, se simula la distribuci?n bootstrap de la diferencia de medias muestrales, de la que obtenemos el IC bootstrap percentil para la diferencia de medias.
### bootstrap para la diferencia de medias: IC percentil bootstrap ###
R = 1000
set.seed( 48761 )
tm = matrix(sample(y,R*m,replace=T), R, m) ### r?plicas bootstrap separado
cm = matrix(sample(x,R*n,replace=T), R, n)
tb = apply(tm, 1, mean)
cb = apply(cm, 1, mean)
difb = tb-cb ### replicas de diferencia de medias bootstrap
c( mean(difb), sd(difb))
#[1] 1.3928984 0.5880266
ppp = c(0.025,0.05,0.5,0.95,0.975)
round(quantile(difb, ppp), 4) ### percentiles bootstrap de diferencia de medias
#2.5% 5% 50% 95% 97.5%
#0.2375 0.3918 1.4011 2.3413 2.5311
x11()
hist( difb, breaks=25, probability=T, main="histograma de difb", cex.main=0.9, xlab="difb: diferencia de medias bootstrap" )
rug( dif, lwd=2, col="blue" )
abline( v=c(quantile(difb,0.025), quantile(difb,0.975), dif), lty=3 )
curve( dnorm(x, mean(difb), sd(difb)), add=T, lty=3 )
################################################################################################
#### 2) Bootstrap de igualdad de distribuciones de dos muestras independientes
################################################################################################
#Problema .Hay diferencias en la distribucion del consumo diario de energia de individuos normales y obesos? H0: F=G.
#El problema puede ser resuelto mediante un test de aleatorizacion, pero la metodologia bootstrap nos proporciona otros procedimientos para resolver el mismo problema.
#Supongamos que para hacer el contraste de igualdad de distribuciones utilizamos el estadistico diferencia de medias muestrales D=meanY-meanX, estimador de la diferencia de medias mu1-mu2. de ambas poblaciones, aunque D sea mas apropiado para valorar solo la diferencia de medias.
#Fijado un numero R de replicas, de la muestra conjunta z = (x, y) extraemos al azar y con reemplazamiento n+m valores en cada replica.
#Con cada replica r, tenemos un vector en el que llamaremos a los n primeros elementos e a los m ultimos, calculando con ellos el valor de la replica r bootstrap de la diferencia de medias *rz*rx*ry***rxryrd.=.
#Disponemos asi de las R simulaciones de la distribucion bootstrap de la diferencia de medias con las que podemos aproximar el valor-p bootstrap de nuestro test: p-valor bootstrap~#(dr*>d)/R, donde d es el valor observado de la diferencia de medias muestrales D.
#El proceso es similar al llevado a cabo en el test de aleatorizacion. La diferencia estriba en que ahora se extraen muestras con reemplazamiento de la muestra conjunta z = (x, y).
#Aplicaci?n del procedimiento:
z = c(x, y)
zm = matrix( sample(z, R*(n+m), replace=T), R, n+m ) ### remuestreo bootstrap conjunto
xzm = apply( zm[ , 1:n], 1, mean )
yzm = apply( zm[ , (n+1):(n+m)], 1, mean )
difzb = yzm-xzm ### replicas de diferencia de medias bootstrap
pv.difzb = mean( difzb >= dif ) ### p-valor bootstrap de la difer de medias
cbi
#nd( dif, pv.difzb) dif pv.difzb
#[1,] 1.394957 0.026
x11()
#par( mfrow=c(2,2), pty="s" )
hist( difzb, breaks=20, probability=T, main="histograma de difzb", xlab="difzb: diferencia de medias bootstrap", cex.main=0.9 )
rug( dif, lwd=2, col="blue" )
abline( v=dif, lty=3 )
curve( dnorm(x, mean(difzb), sd(difzb)), add=T, lty=3 )
#El mismo procedimiento puede ser aplicado a cualquier otro test para contrastar la igualdad de distribuciones. Veamos el uso del test t (dise?ado para la comparaci?n de medias, cuando las varianzas de ambas poblaciones, supuestamente normales, son iguales):
### R?plicas del test t bootstrap basado en la muestra conjunta ###
sumx2zm = apply( zm[ , 1:n], 1, var )*(n-1)
sumy2zm = apply( zm[ , (n+1):(n+m)], 1, var )*(m-1)
se2b = sqrt( (sumx2zm+sumy2zm)/(n+m-2) )*sqrt((1/m)+(1/n))
t2 = difzb/se2b ### replicas del estadistico t bootstrap
tobs2 = dif/se2.dif ### estadistico t observado
tobs2
#[1] 2.351313
pv.tboot = mean( t2>=tobs2 ) ### p-valor bootstrap del t bootstrap (un lado)
pv.t = 1-pt( tobs2,n+m-2 ) ### p-valor del test t para igualdad de medias
round(cbind( dif, pv.difzb, tobs2, pv.t, pv.tboot ), 4)
#dif pv.difzb tobs2 pv.t pv.tboot
#[1,] 1.395 0.026 2.3513 0.0145 0.024
x11()
hist( t2, breaks=25, probability=T, main="histograma de t2", cex.main=0.9, xlab="t2: estad?stico t bootstrap para diferencia de medias" )
abline( v=tobs2, lty=3 )
curve( dnorm(x, mean(t2), sd(t2)), add=T, lty=3 )
#Obs?rvese que no hay diferencia de procedimiento en la obtenci?n del p-valor del test basado en la diferencia de medias o en el estad?stico t de student, y el resultado es similar (ver pv.difzb y pv.tboot). En cambio, la diferencia con el p-valor del test t de student exacto es notable (ver pv.t). Hay que insistir en que contrastar F=G no es equivalente a contrastar la igualdad de sus medias.
################################################################################################
#### 3) Bootstrap de igualdad de medias de muestras independientes
################################################################################################
#Si unicamente tuvieramos interes en contrastar la hipotesis 21:0?g?g=H, mas debil que la anterior H0 : F = G, entonces lo mas adecuado es utilizar el test t-Student si se cumplen las condiciones de independencia, normalidad e igualdad de varianzas.
#Si las varianzas fueran diferentes podria plantearse el uso de otro estadistico t con una estimacion de las varianzas por separado, 221211smsnXYt+.=, donde en el denominador tenemos las varianzas muestrales de x e y.
#Dicho estadistico no tiene bajo H0 una distribucion t-Student y solo se dispone de soluciones aproximadas para obtener el valor-p. Es conocido en la estadistica clasica como ??problema de Behrens-Fisher??.
#El uso del bootstrap permite soslayar este problema, cuando tratamos de contrastar 21:0?g?g=H. Bajo esta hipotesis, es admisible que las distribuciones de ambas poblaciones, F y G, puedan ser diferentes.
#Los datos x nos dan la distribucion empirica para estimar F y los datos y la G para estimar G. Para calcular el valor-p de un test mediante bootstrap necesitaremos que ambas distribuciones empiricas sean del tipo indicado por F..21:0?g?g=H, esto es tengan iguales medias. Una forma razonable de conseguirlo sin alterar otras caracterirticas de ambas distribuciones consiste en relocalizar los datos x e y observados, de forma que mediante dicha traslacion ambas muestras presenten la misma media, por ejemplo la media de la muestra conjunta.
#Si z = (x, y) es la muestra conjunta, con media muestral )(1ymxnmnz++=, consideramos zxxxloc+.= e zyyyloc+.= y adoptamos ahora xloc e yloc como si fueran nuestros datos de partida, teniendo ambas muestras como media muestral el valor z.
#Con xloc e yloc extraemos por separado muestras bootstrap xloc* e yloc*, calculando con cada una de ellas sus medias y varianzas bootstrap para obtener las replicas bootstrap del estadistico t, t1*, ..., tR*.
#El valor-p bootstrap es Robstrt)*(#., donde es el valor observado del estadistico t. obst
#A continuacion aplicamos el procedimiento, relocalizando las muestras de modo que ambas tengan la misma media comun global.
#Aplicaci?n
xloc = x-mean(x)+mean(z) ### controles relocalizados en la media global
yloc = y-mean(y)+mean(z) ### tratados relocalizados
### bootstrap separado para test de igualdad de medias
xlocm = matrix(sample(xloc,R*n,replace=T),R,n)
ylocm = matrix(sample(yloc,R*m,replace=T),R,m)
xlocb = apply(xlocm,1,mean)
ylocb = apply(ylocm,1,mean)
dyxlocb = ylocb-xlocb ### diferencias de medias bootstrap tras relocalizacion
mean( dyxlocb >= dif ) ### p-valor bootstrap de test basado en diferencia de medias
#[1] 0.01
x11()
hist(dyxlocb, breaks=20, probability=T)
rug( dif, lwd=2, col="blue" )
abline( v=dif, lty=3 )
curve( dnorm(x, mean(dyxlocb), sd(dyxlocb)), add=T, lty=3 )
### usando el estad?stico t
vxlocb = apply(xlocm, 1, var)
vylocb = apply(ylocm, 1, var)
## replicas bootstrap del estadistico t con varianzas distintas
tyxlocb = dyxlocb/sqrt((vxlocb/n)+(vylocb/m))
mean(tyxlocb>=tobs2) ### p-valor bootstrap de tyxlocb
#[1] 0.01
x11()
hist( tyxlocb, breaks=20, probability=T )
rug( tobs2, lwd=2, col="blue" )
abline( v=tobs2, lty=3 )
curve( dnorm(x, mean(tyxlocb), sd(tyxlocb)), add=T, lty=3 )
################################################################################################
#### 4) CORRELACI?N BOOTSTRAP
################################################################################################
#Se quiere conocer la correlaci?n entre las puntuaciones de expresi?n verbal y capacidad matem?tica de los estudiantes de un curso. Para ello se ha pasado un test a 16 estudiantes y se han obtenido los resultados de x:expresi?n verbal e y:capacidad matem?tica. (Pawitan)
### Datos IQ: x expresi?n verbal, y capacidad matem?tica ###
x = c(109,88,96,96,109,116,114,96,85,100,113,117,107,104,101,81)
y = c(116,77,95,79,113,122,109,94,91,88,115,119,100,115,95,90)
n = length(x)
xm = mean(x); ym = mean(y);
s2x = var(x); sdx = sqrt(s2x)
s2y = var(y); sdy = sqrt(s2y)
cxy = cor(x, y)
plot(x, y, xlab="verbal", ylab="matematica", main="datos IQ")
abline(lsfit(x,y)$coef, lty=3)
round( cbind(n, xm, ym, s2x, sdx, s2y, sdy, cxy), 4)
#n xm ym s2x sdx s2y sdy cxy
#[1,] 16 102 101.125 123.4667 11.1116 210.7833 14.5184 0.8286
#Bajo normalidad conjunta de (X, Y), un estimador clasico del error estandar de la correlacion muestral rho. viene dado por (1-rho^2)/sqrt(n-3).
#Si se considera la transformacion de Fisher tau=0.5log((1+rho)/(1-rho)) entonces el estimador ?n. es aproximadamente normal de media tau y desviacion estandar 1/sqrt(n-3).
#Esto permite obtener IC's para tau y aplicando la transformacion inversa de Fisher a los extremos del IC, obtendremos un I.C. para la correlacion.
### se de la correlacion bajo normalidad conjunta de (X,Y)
se.cxy = (1-cxy^2)/sqrt(n-3)
### tau: transformacion de Fisher de la correlacion
tau = 0.5*log( (1+cxy)/(1-cxy) )
alfa = 0.05; confianza = 1-alfa;
### IC para tau
tau.inf = tau-qnorm(1-alfa/2)*(1/sqrt(n-3))
tau.sup = tau+qnorm(1-alfa/2)*(1/sqrt(n-3))
### IC para la correlacion
cxy.inf = (exp(2*tau.inf)-1)/(exp(2*tau.inf)+1)
cxy.sup = (exp(2*tau.sup)-1)/(exp(2*tau.sup)+1)
round(cbind( n, cxy, se.cxy, cxy.inf, cxy.sup, confianza ), 4)
#n cxy se.cxy cxy.inf cxy.sup confianza
#[1,] 16 0.8286 0.0869 0.5649 0.9387 0.95
library( boot )
R = 999
datos = cbind(x,y)
correla = function( datos, indices )
{
x = datos[,1][indices]
y = datos[,2][indices]
cor( x[indices], y[indices] )
}
corxy.b = boot( data=datos, statistic=correla, R=R )
corxy.b
#ORDINARY NONPARAMETRIC BOOTSTRAP
#Call:
#boot(data = datos, statistic = correla, R = R)
#Bootstrap Statistics :
#original bias std. error
#t1* 0.8285718 0.01169248 0.08414495
boot.ci( corxy.b, conf=c(0.9, 0.95), type="all" )
#BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#Based on 999 bootstrap replicates
#CALL :
#boot.ci(boot.out = corxy.b, conf = c(0.9, 0.95), type = "all")
#Intervals :
#Level Normal Basic
#90% ( 0.6785, 0.9553 ) ( 0.7048, 0.9575 )
#95% ( 0.6520, 0.9818 ) ( 0.6929, 1.0093 )
#Level Percentile BCa
#90% ( 0.6997, 0.9523 ) ( 0.5930, 0.9213 )
#95% ( 0.6478, 0.9642 ) ( 0.5170, 0.9407 )
#Calculations and Intervals on Original Scale
#Some BCa intervals may be unstable
#Warning message:
#In boot.ci(corxy.b, conf = c(0.9, 0.95), type = "all") : bootstrap variances needed for studentized intervals
windows( width=7, height=4 )
par( pty="s" )
plot( corxy.b, cex.main=0.8 )
title( main="Correlaci?n: corxy.b", font.main=4, line=3, col.main="blue" )
##### Bootstarap No Parametrico
B = 1000
set.seed( 98725 )
corxy.boot = NULL
for(i in 1:B)
{
star = sample(1:n, n, replace=T)
xstar = x[star]
ystar = y[star]
corxy.boot[i] = cor(xstar, ystar)
}
qcorxy.boot = quantile(corxy.boot, c(0.025, 0.05, 0.5, 0.95, 0.975))
round(qcorxy.boot, 4)
#2.5% 5% 50% 95% 97.5%
#0.7136 0.7341 0.8415 0.9258 0.9365
#Los percentiles 2.5% y 97.5% determinan los extremos de un IC bootstrap percentil del 95%.
bias.corxy = cxy - mean(corxy.boot) ### sesgo bootstrap
se.corxy.boot = sd(corxy.boot) ### se bootstrap
round(cbind(cxy, se.corxy.boot, bias.corxy, mean(corxy.boot)), 4)
#cxy se.corxy.boot bias.corxy
#[1,] 0.8286 0.0574 -0.0091 0.8377
#El error estandar bootstrap de rho. se ha calculado mediante...
#El estimador de la correlacion es sesgado, pero en este caso el sesgo estimado es pequeno, de modo que la correccion del sesgo apenas cambiara el valor estimado de la correlacion. El estimador bootstrap del sesgo es (1/B)*sum(rhp^p*-rho), y por lo tanto, un estimador de la correlacion con correccion del sesgo viene dado por: 2rho-(1/B)*sum(rho^b).
#Si la distribucion de la correlacion muestral fuera aproximadamente normal, un IC del tipo Wald, seria muy sencillo de calcular, utilizando la estimacion bootstrap de su error estandar:
### IC tipo Wald del 95% ###
wic.inf = cxy - qnorm(0.975) * se.corxy.boot
wic.sup = cxy + qnorm(0.975) * se.corxy.boot
round(cbind( wic.inf, cxy, wic.sup), 4)
#wic.inf cxy wic.sup
#[1,] 0.7161 0.8286 0.941
x11()
hist( corxy.boot, probability=T, breaks=25, main="" )
rug( qcorxy.boot )
abline(v=c(qcorxy.boot[c(1, 5)], cxy), lty=3) ### IC percentil bootstrap 95%
rug( cxy, lwd=2, col="blue" )
#La transformaci?n tau de Fisher, bootstrap:
tau.boot = 0.5*log( (1+corxy.boot)/(1-corxy.boot) )
x11()
hist( tau.boot, probability=T, breaks=25, main="" )
curve( dnorm(x, tau, 1/sqrt(n-3)), lty=3, add=T )
qb.025 = quantile( corxy.boot, 0.025 )
qb.975 = quantile( corxy.boot, 0.975 )
abline( v=c(tau,0.5*log((1+qb.025)/(1-qb.025)),0.5*log((1+qb.975)/(1-qb.975))), lty=3 )
## Basado en http://www.ats.ucla.edu/stat/R/library/bootstrap.htm
#calculating the standard error of the median
#creating the data set by taking 100 observations
#from a normal distribution with mean 5 and stdev 3
#we have rounded each observation to nearest integer
data = round(rnorm(100, 5, 3))
data[1:10]
[1] 6 3 3 4 3 8 2 2 3 2
#obtaining 20 bootstrap samples
#display the first of the bootstrap samples
resamples = lapply(1:20, function(i)
sample(data, replace = T))
resamples[1]
[[1]]:
[1] 5 1 7 6 5 2 2 6 9 5 4 6 6 3 5 4 10 7 8 1 8 0 5 2
[25] 8 3 0 9 3 2 3 10 5 8 5 4 0 4 7 3 5 6 3 6 3 2 9 7
[49] 2 4 9 6 6 0 7 5 9 3 0 6 8 5 2 3 3 3 4 3 2 9 3 3
[73] 2 3 8 2 8 3 9 6 5 2 4 3 3 7 1 3 5 9 4 3 4 2 9 0
[97] 3 6 9 7
#calculating the median for each bootstrap sample
r.median = sapply(resamples, median)
r.median
[1] 4.0 4.5 4.0 5.0 4.0 5.0 5.0 5.0 5.0 4.0 5.0 5.0 5.0 5.0 5.0 4.0 5.0 5.0
[19] 6.0 5.0
#calculating the standard deviation of the distribution of medians
sqrt(var(r.median))
[1] 0.5250313
#displaying the histogram of the distribution of the medians
hist(r.median)
#We can put all these steps into a single function where all we would need to specify is which data set to use and how many times we want to resample in order to obtain the adjusted standard error of the median. For more information on how to construct functions please consult the R library pages on introduction to functions and advanced functions.
#function which will bootstrap the standard error of the median
b.median = function(data, num) {
resamples = lapply(1:num, function(i) sample(data, replace=T))
r.median = sapply(resamples, median)
std.err = sqrt(var(r.median))
list(std.err=std.err, resamples=resamples, medians=r.median)
}
#generating the data to be used (same as in the above example)
data1 = round(rnorm(100, 5, 3))
#saving the results of the function b.median in the object b1
b1 = b.median(data1, 30)
#displaying the first of the 30 bootstrap samples
b1$resamples[1]
[[1]]:
[1] 6 6 6 0 9 3 5 6 2 7 5 3 3 2 5 3 0 4 3 7 3 1 6 3
[25] 6 3 5 9 6 4 8 4 3 1 3 2 4 5 2 3 1 7 6 3 7 10 7 2
[49] 3 3 5 10 9 6 2 3 4 5 1 3 5 0 8 1 4 2 7 8 2 2 10 6
[73] 4 8 6 3 5 2 10 5 0 7 6 5 4 9 6 0 6 6 3 4 8 10 7 6
[97] 3 3 2 3
#displaying the standard error
b1$std.err
[1] 0.5155477
#displaying the histogram of the distribution of medians
hist(b1$medians)
#we can input the data directly into the function and display
#the standard error in one line of code
b.median(rnorm(100, 5, 2), 50)$std.err
[1] 0.5104178
#displaying the histogram of the distribution of medians
hist(b.median(rnorm(100, 5, 2), 50)$medians)
### Basado en http://www.statoo.com/en/publications/bootstrap_scgn_v131.pdf
# los datos son de un experimento en el cual se utilizan aleatoriamente dos tratamientos de l?ser en los ojos de los pacientes. La respuesta es la adecuaci?n visual, medida por el n?mero de letras identificadas correctamente en un test est?ndard de ojos.
# algunos pacientes tienen solo un ojo sano, y reciben un tratamiento localizado aleatoriamente. Existen 20 apcientes con datos pareados y 20 pacientes con con solo 1 observaci?n disponible, por lo que tenemos una mezcla de comparaciones pareadas y dos muestras.
blue=c(4, 69, 87, 35, 39, 79, 31, 65, 95, 68, 62, 70, 80, 84, 79, 66, 75, 59, 77, 36, 86, 39, 85, 74, 72, 69, 85, 85, 72)
red=c(62, 80, 82, 83, 0, 81, 28, 69, 48, 90, 63, 77, 0, 55, 83, 85, 54, 72, 58, 68, 88, 83, 78, 30, 58, 45, 78, 64, 87, 65)
acui=data.frame(str=c(rep(0,20), rep(1,10)), red, blue)
# solo consideraremos los datos pareados y construiremos el cl?sico intervalo de confianza al 95% del estad?stico t-Student para la media de las diferencias.
acu.pd=acui[acui$str==0,]
dif=acu.pd$blue-acu.pd$red
n=nrow(acu.pd)
tmp=qt(.025, n-1)*sd(dif)/sqrt(n)
c(mean(dif)+tmp, mean(dif)-tmp) # [1] -.9270336 15.770335
# pero un q-q plot de las diferencias parece ma?s una Cauchy que una normal, por lo cual el modelo usual parece ser inviable. Para realizar un boostrap no-param?trico en este caso, necesitamos definir la funci?n boostrap:
acu.pd.fun=function(data, i)
{
d=data[i,]
dif=d$blue-d$red
c(mean(dif), var(dif)/nrow(d))
}
# obtenemos R=999 r?plicas bootstrap
acui.pd.b=boot(acu.pd, acu.pd.fun, R=999)
# los IC no-param?tricos resultantes pueden calcularse como antes, o utilizando directamente:
boot.ci(acu.pd.b, type=c("norm", "basic", "stud"))
#Normal Basic Studentized
#(-8.20, 14.95) (-8.10, 15.05) (-8.66, 15.77)
# una alternatica es considerar solo los datos de dos muestras y comparar las medias de las dos poblaciones issuing los pacientes que solo tienen una observacion disponible
acu.ts=acui[acui$str==1,]
# calculamos el intervalo de confianza al 05% para la diferencia de medias
acu.ts=acui[acui$str==1,]
dif=mean(acu.ts$blue)-mean(acu.ts$red)
tmp=qnorm(0.025)*sqrt(var(acu.ts$blue)/nrow(acu.ts)+var(acu.ts$red)/nrow(acu.ts))
c(dif+tmp, dif-tmp) #[1] -13.76901 19.16901
# para construir IC boostrap generamos R=999 r?plicas del estimador y varianza estimada
y=c(acui$blue[21:30], acui$red[21:30])
acu=data.frame(col=rep(c(1,2), c(10,10)), y)
acu.ts.f=function(data, i)
{
d=data[i, ]
m=mean(d$y[1:10])-mean(d$y[11:20])
v=var(d$y[1:10])/10+var(d$y[11:20])/10
c(m,v)
}
acu.ts.boot=boot(acu, acu.ts.f, R=999, strata=acu$col) # strata=acu$col asegura la simulaci?n estratificada
### Basado en http://thebiobucket.blogspot.com/2011/04/bootstrap-with-strata.html
# Here's a worked example for comparing group averages with bootstrap confidence intervals and allowing for different subsample sizes by calling the strata argument within the bootstrap function.
# The data is set up analogous to an before-after impact experiment conducted on plots across four different successional stages. Similarity was calculated for plot's composition before and after an impact. One hypothesis was that certain stages would show higher / smaller average similarities, that is, a higher / lower impact on composition. As plots within stages were situated within different subsites and the nr. of replicates was unbalanced, this had to be allowed for by use of the "strata" argument in the boot.ci call:
library(boot)
library(Hmisc)
### a factor 'stage' and a variable 'MH-Index'
stage=sort(as.factor(rep(LETTERS[1:4], 30)))
site=gl(12, 10)
MH.Index=runif(4*30, 0, 1)
### simulate effect:
n = 30
eff = c(rep(0.25, n), rep(0.4, n), rep(0.5, n), rep( 0.65,30))
MH.Index = MH.Index*eff
### dataframe:
print(sim=data.frame(stage, site, MH.Index))
### function needed for the bootstrap:
mean.fun = function(x, index){mean(x[index])}
### set up dataframe to collect results for plotting:
pldat=data.frame(mean=rep(NA,4),low=rep(NA,4),upp=rep(NA,4))
rownames(pldat)=c("A","B","C","D")
pldat
for(i in c("A","B","C","D")){
### the boot sample and the ci's:
boot=boot(sim$MH.Index[sim$stage==i], mean.fun, R=1000, strata=sim$site[sim$stage==i])
ci=boot.ci(boot,type = "norm")
### get ci's (method: normal)
ci[2]->pldat[i,"mean"] # mean
ci$normal[1,2]->pldat[i,"low"] # lower ci
ci$normal[1,3]->pldat[i,"upp"]} # upper ci
pldat
### plot:
errbar(c(1:4),pldat$mean,pldat$upp,pldat$low,ylab="MH-Similarity Index",xlab="Stage",
pch=15,xaxt="n",xlim=c(0.5,4.5))
axis(1,c(1:4),labels=row.names(pldat))
legend("topleft","errorbars =\n95%-normal\nbootstrap\nconf. int",bty="n")
Comentarios
Publicar un comentario