serie de “experiencias” independientes donde la respuesta puede ser sólo de dos categorías.
serie de experimentos independientes donde la respuesta puede ser sólo de dos categorías.
\[P(y = 1) = p\] \[P(y = 0) = 1-p\] \[P(y) = p^y(1-p)^{1-y}\]
La variable respuesta se distribuye como Binomial
\[Y_{i} \sim B(N, p)\] El enlace canónico es logit
\[\eta =log(\frac{\mu_{i}}{1-\mu{i}})\] Por lo que el valor esperado es igual al logit inverso.
\[\mu_{i} = \frac{e^{\eta}}{1+e^{\eta}}\]
Colocando las variables predictoras en la parte sistemática…
\[\mu_{i} = \frac{e^{\beta_{0} + \beta_{1}X_{1i} + ... + \beta_{k}X_{ki}}}{1+e^{\beta_{0} + \beta_{1}X_{1i} + ... + \beta_{k}X_{ki}}}\]
\[logit(p(y)) = log\frac{p(y)}{1-p(y)}\]
Donde \(p(x)\) es la probabilidad de éxito y \(1-p(x)\) es la probabilidad de fracaso.
\[log\frac{p(y)}{1-p(y)} = \beta_0 + \beta_1X_{1i} + ... + \beta_kX_{ki}\]
La respuesta es lineal en el espacio logit.
fit <- glm(y ~ x, data = dat, family = binomial) summary(fit)
Call: glm(formula = y ~ x, family = binomial, data = dat) Deviance Residuals: Min 1Q Median 3Q Max -2.0281 -1.0458 0.5480 0.9066 1.6942 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -12.3508 2.6287 -4.698 2.62e-06 *** x 0.4972 0.1017 4.887 1.02e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 225.76 on 172 degrees of freedom Residual deviance: 194.45 on 171 degrees of freedom AIC: 198.45 Number of Fisher Scoring iterations: 4
fit <- glm(y ~ x, data = dat, family = binomial) anova(fit, test = "Chisq")
Analysis of Deviance Table Model: binomial, link: logit Response: y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 172 225.76 x 1 31.306 171 194.45 2.204e-08 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit0 <- glm(y ~ 1, data = dat, family = binomial) fit1 <- glm(y ~ x, data = dat, family = binomial) anova(fit0, fit1, test = "Chisq")
Analysis of Deviance Table Model 1: y ~ 1 Model 2: y ~ x Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 172 225.76 2 171 194.45 1 31.306 2.204e-08 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(fit$coefficients)
(Intercept) x 4.326214e-06 1.644162e+00
\[LD_{50} = -\beta_{0}/\beta_{1}\]
pred <- predict(fit, type = "response") #training or test? table(data = as.numeric(pred >= 0.5), reference = dat$y) #umbral
reference data 0 1 0 27 16 1 35 95
ver también confusionMatrix
del paquete caret
cada punto corresponde a una matriz de confusión con diferente umbral.
dat <- read.table("gourlieana.txt", header = TRUE) rta <- cbind(dat$pol.si, dat$pol.no) fit <- glm(rta ~ espolon + flores, data = dat, family = binomial) summary(fit)
Call: glm(formula = rta ~ espolon + flores, family = binomial, data = dat) Deviance Residuals: Min 1Q Median 3Q Max -3.1171 -1.1451 0.0584 0.8644 3.3657 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.82541 0.84009 -4.554 5.27e-06 *** espolon 0.25991 0.06662 3.902 9.56e-05 *** flores 0.01628 0.01716 0.949 0.343 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 216.85 on 80 degrees of freedom Residual deviance: 201.26 on 78 degrees of freedom AIC: 409.67 Number of Fisher Scoring iterations: 4
anova(fit, test = "Chisq")
Analysis of Deviance Table Model: binomial, link: logit Response: rta Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 80 216.84 espolon 1 14.6886 79 202.16 0.0001268 *** flores 1 0.8991 78 201.26 0.3430157 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dat <- read.table("gourlieana.txt", header = TRUE) rta <- cbind(dat$pol.si, dat$pol.no) fit.q <- glm(rta ~ espolon + flores, data = dat, family = quasibinomial) summary(fit.q)
Call: glm(formula = rta ~ espolon + flores, family = quasibinomial, data = dat) Deviance Residuals: Min 1Q Median 3Q Max -3.1171 -1.1451 0.0584 0.8644 3.3657 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.82541 1.28630 -2.974 0.00391 ** espolon 0.25991 0.10200 2.548 0.01279 * flores 0.01628 0.02628 0.619 0.53741 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for quasibinomial family taken to be 2.344379) Null deviance: 216.85 on 80 degrees of freedom Residual deviance: 201.26 on 78 degrees of freedom AIC: NA Number of Fisher Scoring iterations: 4
fit.q <- glm(rta ~ espolon + flores, data = dat, family = quasibinomial) anova(fit.q, test = "F")
Analysis of Deviance Table Model: quasibinomial, link: logit Response: rta Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev F Pr(>F) NULL 80 216.84 espolon 1 14.6886 79 202.16 6.2655 0.0144 * flores 1 0.8991 78 201.26 0.3835 0.5375 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
PL <- fit$linear.predictors^2 fit2 <- glm(rta ~ espolon + flores + PL, data = dat, family = binomial) summary(fit2)
Call: glm(formula = rta ~ espolon + flores + PL, family = binomial, data = dat) Deviance Residuals: Min 1Q Median 3Q Max -3.0619 -1.1211 0.1236 0.8965 3.3048 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -6.29040 2.81956 -2.231 0.0257 * espolon 0.43586 0.20335 2.143 0.0321 * flores 0.02915 0.02222 1.312 0.1895 PL 0.62963 0.68520 0.919 0.3581 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 216.85 on 80 degrees of freedom Residual deviance: 200.42 on 77 degrees of freedom AIC: 410.83 Number of Fisher Scoring iterations: 4
exp(fit$coefficients) exp(confint(fit))