2024-04-23
Consiste de una variable aleatoria \(Y\) con observaciones independientes \((Y_1,\ldots,Y_n)\).
La respuesta \(Y\) sigue una distribución en la familia de dispersión exponencial \[ f(y_i; \theta_i,\phi) = e^{\frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi)}. \]
El parámetro \(\theta_i\) el parámetro natural.
El parámetro \(\phi\) es el parámetro de dispersión.
Familia exponencial natural ss un caso particular de la familia de dispersión exponencial.
Ocurre cuando \[ a(\phi) = 1, \] \[ c(y_i,\phi) = c(y_i). \]
La densidad tiene la forma \[ f(y_i; \theta_i) = h(y_i) \exp{[y_i \theta_i - b(\theta_i)]}. \]
Verosimilitud de una sola observación \[ f(y_i; \theta_i,\phi) = e^{\frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi)}. \]
Logverosimilitud de una sola observación \[ \ell_i = \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi). \]
Logverosimitud \[ \ell = \sum_{i=1}^n \ell_i. \]
\(Y_i\) es la proporción muestral de éxitos.
\(n_i Y_i \sim Bi(n_i,\pi_i)\).
\(\mu_i = EY_i = \pi_i\).
Consideramos \[ \theta_i = \log \frac{\pi_i}{1-\pi_i} \] esto es, definimos \(\theta_i\) como el logit de la probabilidad de éxito.
La transformación inversa es \[ \pi_i = \frac{e^{\theta_i}}{1+e^{\theta_i}}, \] y que \(\log(1-\pi_i) = - \log\bigg (1+e^{\theta_i}\bigg )\).
\[ f(y_i; \pi_i,n_i) = \binom{n_i}{n_i y_i} \pi_i^{n_i y_i} (1-\pi_i)^{n_i - n_i y_i} = \] \[ \exp \bigg [ \frac{y_i \theta_i - \log[1+\exp(\theta_i)]}{1/n_i} + \log \binom{n_i}{n_i y_i} \bigg ], \] siendo \[ b(\theta_i) = \log [1+\exp(\theta_i)], \ a(\phi) = 1/n_i, \ c(y_i,\phi) = \log \binom{n_i}{n_i y_i}. \]
El parámetro natural es \[ \theta_i = \log \frac{\pi_i}{1-\pi_i}, \] el logit de \(\pi_i\).
La función de probabilidad es \[ f(y_i; \mu_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} = \exp (y_i \log \mu_i - \mu_i - \log(y_i!)). \]
Tomamos \[\theta_i = \log \mu_i,\] \[b(\theta_i) = \exp{\theta_i},\] \[a(\phi) =1,\] \[c(y_i,\phi) = - \log(y_i!).\]
La componente sistemática de un modelo lineal generalizado es el vector \((\eta_1,\ldots,\eta_n)\).
Cada uno de los \(\eta_i\) es la combinación lineal de los predictores correspondientes a la \(i\)-ésima observación, es decir, \[ \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \] con \(i=1, \ldots, n\) y \(x_{ij}\) es \(j\)-ésimo predictor en el \(i\)-ésimo individuo.
La combinación lineal \(\sum_j \beta_j x_{ij}\) es el predictor lineal.
Se suele asumir que \(x_{i1} = 1\).
Mediante la función de enlace \(g\) relacionamos las componentes aleatoria y sistemática \[ g(\mu_i) = \eta_i = \sum_{j=1}^p \beta_j x_{ij}. \]
La función de verosimilitud es \[ L(\mathbf \beta) = \prod_{i=1}^n \exp\bigg \{\frac{y_i \theta_i - b(\theta_i)}{a(\phi)}+ c(y_i,\phi) \bigg \} \] y la función de logverosimilitud es \[ \ell(\mathbf \beta) = \sum_{i=1}^n \frac{y_i \theta_i - b(\theta_i)}{a(\phi)}+ \sum_{i=1}^n c(y_i,\phi). \]
Las ecuaciones de estimación son \[ \frac{\partial \ell(\mathbf \beta)}{\partial \beta_j} = \sum_{i=1}^n \frac{\partial \ell_i(\mathbf \beta)}{\partial \beta_j} \] y tienen la expresión
\[ \frac{\partial \ell(\mathbf \beta)}{\partial \beta_j} = \sum_{i=1}^n \frac{(y_i - \mu_i) x_{ij}}{var(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \] con \(j=1, \ldots, p\) siendo \[ \eta_i = \sum_{j=1}^p \beta_j x_{ij} = g(\mu_i) \] para la función de enlace \(g\).
Estas ecuaciones han de ser resueltas de un modo iterativo.
La distribución de la variable \(Y_i\) aparece en las ecuaciones de verosimilitud solamente a través de su media \(\mu_i\) y su varianza \(var(Y_i)\).
SNP1 SNP2 SNP3 SNP4 SNP5 clinica.diabetes.mellitus.type2 clinica.age
1 <NA> AG AG GT CT FALSE 40
2 AG AA AG GT TT FALSE 46
clinica.gender clinica.weight.kg clinica.smoking.status
1 Male 80 non-smoker
2 Female 72 ex-smoker
#' Transformation of the SNP to a genetic model
#' @description
#' Transformation of the SNP to a genetic model
#' @param x SNPs
#' @param type Model to be used
#' @param sep Separator
#' @export
snp2model = function(x,type=c("codominant","dominant","recessive"),
sep=""){
x1 = substr(x,1,1)
x2 = substr(x,2,2)
a = table(c(x1,x2))
recessive = names(which.min(a))
dominant = names(which.max(a))
x1 = (x1 == recessive)*1
x2 = (x2 == recessive)*1
if(type == "codominant") rs = factor(x1+x2)
if(type == "dominant") rs = factor(x1+x2 == 0)
if(type == "recessive") rs = factor(x1+x2 == 2)
rs
}
Ajustamos el modelo.
Call:
glm(formula = clinica.diabetes.mellitus.type2 ~ ., family = "binomial",
data = df1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.305569 1.169540 -7.102 1.23e-12 ***
SNP11 0.291969 0.255332 1.143 0.25284
SNP12 -15.180943 616.179573 -0.025 0.98034
SNP21 -0.169281 0.270352 -0.626 0.53122
SNP22 -0.078662 0.357949 -0.220 0.82606
SNP31 0.032512 0.288883 0.113 0.91039
SNP32 -0.005980 0.338562 -0.018 0.98591
SNP41 -0.139222 0.305349 -0.456 0.64843
SNP42 -0.084976 0.373122 -0.228 0.81985
SNP51 -0.345017 0.302315 -1.141 0.25377
SNP52 0.881428 0.537878 1.639 0.10127
clinica.age 0.060856 0.009253 6.577 4.81e-11 ***
clinica.genderMale 0.295755 0.339923 0.870 0.38427
clinica.weight.kg 0.029639 0.011201 2.646 0.00814 **
clinica.smoking.statusnon-smoker -0.560862 0.314169 -1.785 0.07423 .
clinica.smoking.statussmoker -0.454497 0.430254 -1.056 0.29081
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 584.63 on 1176 degrees of freedom
Residual deviance: 485.84 on 1161 degrees of freedom
(325 observations deleted due to missingness)
AIC: 517.84
Number of Fisher Scoring iterations: 16
snpscol = 1:5
df1 = df
df1[,snpscol] = apply(df1[,snpscol],2,snp2model,type="recessive")
fit = glm(clinica.diabetes.mellitus.type2 ~ ., family="binomial",data=df1)
summary(fit)
Call:
glm(formula = clinica.diabetes.mellitus.type2 ~ ., family = "binomial",
data = df1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.536601 1.129696 -7.557 4.14e-14 ***
SNP1TRUE -15.201578 617.579106 -0.025 0.98036
SNP2TRUE 0.018234 0.327201 0.056 0.95556
SNP3TRUE -0.011089 0.286430 -0.039 0.96912
SNP4TRUE -0.062110 0.297023 -0.209 0.83436
SNP5TRUE 0.989459 0.515546 1.919 0.05495 .
clinica.age 0.060962 0.009164 6.652 2.88e-11 ***
clinica.genderMale 0.255553 0.336521 0.759 0.44762
clinica.weight.kg 0.030913 0.011100 2.785 0.00535 **
clinica.smoking.statusnon-smoker -0.555339 0.310398 -1.789 0.07360 .
clinica.smoking.statussmoker -0.475227 0.427708 -1.111 0.26652
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 584.63 on 1176 degrees of freedom
Residual deviance: 489.85 on 1166 degrees of freedom
(325 observations deleted due to missingness)
AIC: 511.85
Number of Fisher Scoring iterations: 16
count Stage
SRR975551Aligned.out.sam.bam 4797 Cancer
SRR975552Aligned.out.sam.bam 3149 Cancer
SRR975553Aligned.out.sam.bam 3698 Cancer
SRR975554Aligned.out.sam.bam 3269 Cancer
SRR975555Aligned.out.sam.bam 3077 Cancer
SRR975556Aligned.out.sam.bam 7384 Cancer
Ajustamos un modelo loglineal de Poisson.
Call:
glm(formula = count ~ Stage, family = poisson(link = log), data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.237949 0.003833 2149.33 <2e-16 ***
StageMetastasis -0.455698 0.006153 -74.06 <2e-16 ***
StageNormal 0.376512 0.004977 75.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 74616 on 53 degrees of freedom
Residual deviance: 51857 on 51 degrees of freedom
AIC: 52398
Number of Fisher Scoring iterations: 5
Stage
.StageMetastasis
y StageNormal
pueden considerarse simultáneamente nulos.Podemos rechazar confortablemente la hipotesis nula.
Call:
glm.nb(formula = count ~ Stage, data = df, init.theta = 3.727352335,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.2379 0.1221 67.444 < 2e-16 ***
StageMetastasis -0.4557 0.1728 -2.638 0.00835 **
StageNormal 0.3765 0.1727 2.180 0.02927 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(3.7274) family taken to be 1)
Null deviance: 79.058 on 53 degrees of freedom
Residual deviance: 56.400 on 51 degrees of freedom
AIC: 966.78
Number of Fisher Scoring iterations: 1
Theta: 3.727
Std. Err.: 0.689
2 x log-likelihood: -958.781