It's Just a Linear Model! (IJALM!)
Statistical learning
Tools for modeling and understanding complex data: solving the problem of finding a predictive function from the data.
Response = Model + Error \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \epsilon_{i}\]
\[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \epsilon_{i}\] \[X_{i}\ \text{continuous}\]
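In R this model is fit with `lm()`; a minimal sketch on simulated data (all coefficient values here are illustrative assumptions):

```r
# Simulate from Y_i = beta0 + beta1 * X_i + eps_i (illustrative values)
set.seed(42)
X <- rnorm(50, mean = 10, sd = 2)
Y <- 100 - 0.4 * X + rnorm(50, sd = 5)
fit <- lm(Y ~ X)
coef(fit)     # estimates of beta0 (intercept) and beta1 (slope)
confint(fit)  # 95% confidence intervals for both parameters
```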
Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.073  -6.835  -0.875   5.806  32.904 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 101.33319    5.00127  20.261  < 2e-16 ***
X            -0.42624    0.05344  -7.976 2.85e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.707 on 98 degrees of freedom
Multiple R-squared:  0.3936, Adjusted R-squared:  0.3874 
F-statistic: 63.62 on 1 and 98 DF,  p-value: 2.853e-12
\[\color{greenyellow}{{SS_{model}} = \sum(\hat{y_{i}}-\bar{y})^2}\]
\[\color{red}{SS_{residual} = \sum({y_{i}}-\hat{y_{i}})^2}\]
\[\color{cyan}{SS_{total} = \sum({y_{i}}-\bar{y})^2}\]
\[R^2 = \frac{SS_{model}}{SS_{total}}\]
\[R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\]
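These sums of squares can be computed directly from a fitted model; a sketch with simulated data (values are illustrative), verifying that the two formulas for \(R^2\) agree:

```r
# Decompose the variation of y around its mean for a simple linear fit
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit  <- lm(y ~ x)
yhat <- fitted(fit)

ss_model    <- sum((yhat - mean(y))^2)  # SS_model
ss_residual <- sum((y - yhat)^2)        # SS_residual
ss_total    <- sum((y - mean(y))^2)     # SS_total

# SS_total = SS_model + SS_residual, and both R^2 formulas match summary()
ss_model / ss_total
1 - ss_residual / ss_total
summary(fit)$r.squared
```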
\[Y_{ij} = \mu + \alpha_{i} + \epsilon_{ij}\] * Examine the relative contribution of different sources of variation.
* Test the null hypothesis that the population means for each level of the factor are equal.
The iris dataset was first published by Ronald Fisher in 1936 in the Annals of Eugenics, where he proposed a methodology for describing "desirable traits" in support of the eugenics program.
Starting this year I will instead use data on penguins of three species (Adelie, Chinstrap, and Gentoo) from the Palmer Archipelago (Antarctica), available through the palmerpenguins package.
library(palmerpenguins); data("penguins")
fit <- lm(bill_length_mm ~ species, data = penguins)
anova(fit)
## Analysis of Variance Table
##
## Response: bill_length_mm
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## species     2 7194.3  3597.2   410.6 < 2.2e-16 ***
## Residuals 339 2969.9     8.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
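The F value in the table is just the ratio of the two mean squares; a sketch with simulated three-group data (group labels and means are illustrative):

```r
# One-way ANOVA: F = MS_model / MS_residual
set.seed(123)
g <- factor(rep(c("a", "b", "c"), each = 30))
y <- rnorm(90, mean = c(10, 12, 15)[as.integer(g)], sd = 2)
fit <- lm(y ~ g)
tab <- anova(fit)

ms_model    <- tab$`Sum Sq`[1] / tab$Df[1]   # SS_model / df_model
ms_residual <- tab$`Sum Sq`[2] / tab$Df[2]   # SS_residual / df_residual
f_by_hand   <- ms_model / ms_residual
f_by_hand
```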
library(palmerpenguins); data("penguins")
fit <- lm(bill_length_mm ~ species, data = penguins)
summary(fit)
##
## Call:
## lm(formula = bill_length_mm ~ species, data = penguins)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9338 -2.2049  0.0086  2.0662 12.0951 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       38.7914     0.2409  161.05   <2e-16 ***
## speciesChinstrap  10.0424     0.4323   23.23   <2e-16 ***
## speciesGentoo      8.7135     0.3595   24.24   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.96 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7078, Adjusted R-squared: 0.7061
## F-statistic: 410.6 on 2 and 339 DF, p-value: < 2.2e-16
\[Y_{ij} = \mu + \alpha_{i} + \epsilon_{ij}\] \[Y_{ij} = \beta_{0} + \beta_{1}X_{i1} + ... \beta_{j}X_{ij} + \epsilon_{ij}\]
\[Y_{ij} = \beta_{0} + \beta_{dummy1}X_{i, dummy1} + \beta_{dummy2}X_{i, dummy2} + \epsilon_{ij}\]
\[Y_{i, Adelie} = \beta_{0} + \epsilon_{ij}\]
\[Y_{i, Chinstrap} = \beta_{0} + \beta_{dummy1} + \epsilon_{ij}\] \[Y_{i, Gentoo} = \beta_{0} + \beta_{dummy2} + \epsilon_{ij}\]
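The dummy columns R builds for a factor can be inspected with `model.matrix()`, and with a single factor the fitted coefficients reproduce the group means exactly; a sketch with simulated data (the species means used here are illustrative):

```r
# Treatment (dummy) coding: the first factor level is the baseline (intercept)
set.seed(7)
sp <- factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each = 20))
y  <- rnorm(60, mean = c(39, 49, 47.5)[as.integer(sp)], sd = 3)
fit <- lm(y ~ sp)

head(model.matrix(fit))      # intercept column + two 0/1 dummy columns
b <- coef(fit)
grp_means <- tapply(y, sp, mean)
# beta0 = Adelie mean; beta0 + beta_dummy1 = Chinstrap mean;
# beta0 + beta_dummy2 = Gentoo mean
</test>
```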
library(palmerpenguins); data("penguins")
fit <- lm(bill_length_mm ~ species, data = penguins, na.action = na.exclude)
shapiro.test(residuals(fit))
##
## Shapiro-Wilk normality test
##
## data: residuals(fit)
## W = 0.98903, p-value = 0.01131
bartlett.test(residuals(fit) ~ penguins$species)
##
## Bartlett test of homogeneity of variances
##
## data: residuals(fit) by penguins$species
## Bartlett's K-squared = 5.6179, df = 2, p-value = 0.06027
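Besides these formal tests, the assumptions are usually checked graphically; a sketch on simulated grouped data (all values illustrative):

```r
# Graphical residual diagnostics for a one-way model
set.seed(99)
g <- factor(rep(c("A", "B", "C"), each = 25))
y <- rnorm(75, mean = c(10, 12, 15)[as.integer(g)], sd = 2)
fit <- lm(y ~ g)
res <- residuals(fit)

qqnorm(res); qqline(res)  # normality: points should follow the line
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals")  # homogeneity: even spread
```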
Training data and test data
set.seed(1234)
X <- rnorm(100, mean = 10, sd = 20)
Y <- 100 + 0.25*X + 0.019*X^2 + rnorm(100, 0, 10)
fit1 <- lm(Y ~ X)
fit2 <- lm(Y ~ poly(X, 2))
fit3 <- lm(Y ~ poly(X, 10))
summary(fit1)$r.squared # LINEAR
[1] 0.5408381
summary(fit2)$r.squared # QUADRATIC
[1] 0.7319006
summary(fit3)$r.squared # DEGREE-10 POLYNOMIAL
[1] 0.7412149
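In-sample \(R^2\) always grows with model complexity; it is held-out test data that reveals overfitting. A sketch continuing the same simulation (the 70/30 split is an illustrative choice):

```r
# Refit on a training subset and evaluate prediction error on held-out data
set.seed(1234)
X <- rnorm(100, mean = 10, sd = 20)
Y <- 100 + 0.25 * X + 0.019 * X^2 + rnorm(100, 0, 10)
d <- data.frame(X, Y)
train <- 1:70
test  <- 71:100

fit_lin  <- lm(Y ~ X,           data = d[train, ])
fit_quad <- lm(Y ~ poly(X, 2),  data = d[train, ])
fit_p10  <- lm(Y ~ poly(X, 10), data = d[train, ])

# Mean squared error on the test set
mse <- function(fit) mean((d$Y[test] - predict(fit, newdata = d[test, ]))^2)
c(linear = mse(fit_lin), quadratic = mse(fit_quad), poly10 = mse(fit_p10))
```

The degree-10 fit chases noise in the training set, so its small in-sample advantage does not carry over to new data.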