LDA and QDA are classification methods based on Bayes' theorem, with the assumption that the predictors within each class follow a multivariate normal distribution. Because of this assumption, LDA and QDA can only be used when all explanatory variables are numeric.

This post is my note on LDA and QDA, two classification techniques. Its contents are based on my reading of the resources listed in the References part.

Details

  • Name
    • Linear Discriminant Analysis (LDA)
    • Quadratic Discriminant Analysis (QDA)

  • Data Type
    • Response Variable: Categorical
    • Explanatory Variable: Numeric

  • Multivariate Normal Distribution

\begin{align}
X \sim N(\mu, \Sigma): \quad f(X) &= \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(X-\mu)^T\Sigma^{-1}(X-\mu)\right)
\end{align}
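
A quick way to make the formula concrete is to evaluate it directly in R. The helper below is a minimal sketch (the function dmvnorm_manual and its example inputs are purely illustrative, not from any package):

  # Evaluate the multivariate normal density at a single point x
  dmvnorm_manual <- function(x, mu, Sigma) {
    p <- length(mu)
    dev <- matrix(x - mu, ncol = 1)
    norm_const <- 1 / ((2 * pi)^(p / 2) * sqrt(det(Sigma)))
    drop(norm_const * exp(-0.5 * t(dev) %*% solve(Sigma) %*% dev))
  }

  # Example: standard bivariate normal at the origin equals 1 / (2 * pi)
  dmvnorm_manual(c(0, 0), mu = c(0, 0), Sigma = diag(2))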



  • Assumptions
    • LDA:
      Given a class $k$, the predictors in that class, $X_k=(X_{1k},X_{2k},\cdots,X_{pk})$, follow a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma$, and this covariance matrix is the same for all classes.
      \begin{align}
      P(X=x|Y=k) &= N(\mu_k, \Sigma) \\[5pt]
      \text{where } \mu_k &= \begin{pmatrix} \mu_{1k}\\ \mu_{2k}\\ \vdots\\ \mu_{pk} \end{pmatrix} \\[5pt]
      \Sigma &= \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1p} & \sigma_{2p} & \cdots & \sigma_p^2 \end{pmatrix}, \quad \forall\, k
      \end{align}

    • QDA:
      Given a class $k$, the predictors in that class, $X_k=(X_{1k},X_{2k},\cdots,X_{pk})$, follow a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$; the covariance matrices are not assumed to be the same across classes.
      \begin{align}
      P(X=x|Y=k) &= N(\mu_k, \Sigma_k) \\[5pt]
      \text{where } \mu_k &= \begin{pmatrix} \mu_{1k}\\ \mu_{2k}\\ \vdots\\ \mu_{pk} \end{pmatrix} \\[5pt]
      \Sigma_k &= \begin{pmatrix} \sigma_{1k}^2 & \sigma_{12k} & \cdots & \sigma_{1pk} \\ \sigma_{12k} & \sigma_{2k}^2 & \cdots & \sigma_{2pk} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1pk} & \sigma_{2pk} & \cdots & \sigma_{pk}^2 \end{pmatrix}
      \end{align}

  # Install any missing packages, then load them
  for (pkg in c("ggplot2", "MASS", "grid", "gridExtra")) {
    if (!require(pkg, character.only = TRUE)) {
      install.packages(pkg)
      library(pkg, character.only = TRUE)
    }
  }

  ########################################
  #### Generate Bivariate Normal Data ####
  ########################################
  set.seed(12345)
  # Simulate n1 + n2 observations from two bivariate normal distributions
  bvm=function(m1,m2,sigma1,sigma2,n1,n2){
      d1 <- mvrnorm(n1, mu = m1, Sigma = sigma1 )
      d2 <- mvrnorm(n2, mu = m2, Sigma = sigma2 )
      d=rbind(d1,d2)
      colnames(d)=c("X1","X2")
      d=data.frame(d)
      d$group=c(rep("1",n1),rep("2",n2))
      list(data = d)
  }

  m_1 <- c(0.5, -0.5)
  m_2 <- c(-2, 0.7)
  sigma_1 <- matrix(c(1,0.5,0.5,1), nrow=2)
  sigma_2 <- matrix(c(0.8,-0.7,-0.7,0.8), nrow=2)

  p1=ggplot(bvm(m_1,m_2,sigma_1,sigma_1,n1=2000,n2=2000)$data,
      aes(x=X1, y=X2)) +
      stat_density2d(geom="density2d", aes(color = group),contour=TRUE)+
      ggtitle("Under LDA assumption")
  p2=ggplot(bvm(m_1,m_2,sigma_1,sigma_2,n1=2000,n2=2000)$data,
      aes(x=X1, y=X2)) +
      stat_density2d(geom="density2d", aes(color = group),contour=TRUE)+
      ggtitle("Under QDA assumption")
  grid.arrange(p1, p2, ncol = 2)

  • Algorithm
    Given a class variable $Y \in \{1, 2, \dots, K\}$ with $K \geq 2$ and explanatory variables $X = (X_1, X_2, \dots, X_p)$, Bayes' theorem can be written as:
    • LDA:

      \begin{align}
      P(Y=k|X=x) &= \frac{P(X=x|Y=k)P(Y=k)}{P(X=x)} \\[5pt]
      &= \frac{P(X=x|Y=k)P(Y=k)}{\sum_{i=1}^{K}P(X=x|Y=i)P(Y=i)}
      \end{align}

      Then we define the classifier as a function $C \colon \mathbb{R}^p \rightarrow \{1, 2, \dots, K\}$ given by

      \begin{align}
      C(x) &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; P(Y=k|X=x) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \log P(Y=k|X=x) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \log P(X=x|Y=k) + \log P(Y=k) \\[5pt]
      &\quad (\text{by assuming } P(X=x|Y=k) \sim N(\mu_k, \Sigma) \text{ and dropping terms constant in } k) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \log N(\mu_k, \Sigma) + \log P(Y=k) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \log\left[\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\right)\right] + \log P(Y=k) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log P(Y=k) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log P(Y=k) \\[5pt]
      &\equiv \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \delta_k(x)
      \end{align}

      To compute $C(x)$, we plug in the following estimates, which are all unbiased.
      \begin{align}
      \hat{P}(Y=k) &= \frac{n_k}{n}, \quad \sum_{k=1}^{K} n_k = n \\[5pt]
      \hat{\mu}_k &= \begin{pmatrix} \hat{\mu}_{1k}\\ \hat{\mu}_{2k}\\ \vdots\\ \hat{\mu}_{pk} \end{pmatrix} = \begin{pmatrix} \bar{X}_{1k}\\ \bar{X}_{2k}\\ \vdots\\ \bar{X}_{pk} \end{pmatrix} = \frac{1}{n_k}\sum_{i=1}^{n_k} x_{ik} \\[5pt]
      \hat{\Sigma} &= \text{pooled covariance matrix over all } K \text{ classes} \\[5pt]
      &= \frac{1}{n-K} \sum_{k=1}^{K}\sum_{i=1}^{n_k}(x_{ik}-\hat{\mu}_k)(x_{ik}-\hat{\mu}_k)^T
      \end{align}
      where $x_{ik} \in \mathbb{R}^p$ denotes the $i$-th observation in class $k$ and $n_k$ is the number of observations in class $k$.
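
      A minimal R sketch of these plug-in estimates, reusing the simulated data from the bvm() helper above with a common covariance sigma_1 for both groups (object names such as pri, mu_hat, pooled, and delta are just illustrative):

      # Simulate data that satisfies the LDA assumption (shared covariance)
      d <- bvm(m_1, m_2, sigma_1, sigma_1, n1 = 2000, n2 = 2000)$data
      X <- as.matrix(d[, c("X1", "X2")])
      y <- d$group

      # Plug-in estimates: priors, class means, pooled covariance (divisor n - K)
      n      <- nrow(X); K <- length(unique(y))
      pri    <- table(y) / n
      mu_hat <- lapply(split(as.data.frame(X), y), colMeans)
      pooled <- Reduce(`+`, lapply(names(mu_hat), function(k) {
        Xk <- X[y == k, , drop = FALSE]
        crossprod(sweep(Xk, 2, mu_hat[[k]]))
      })) / (n - K)

      # Linear discriminant score delta_k(x0) for a new observation x0
      delta <- function(x0, k) {
        drop(x0 %*% solve(pooled) %*% mu_hat[[k]] -
               0.5 * t(mu_hat[[k]]) %*% solve(pooled) %*% mu_hat[[k]] +
               log(pri[[k]]))
      }
      x0 <- c(0, 0)
      names(which.max(sapply(names(mu_hat), function(k) delta(x0, k))))

      The same model can be fit with MASS::lda(group ~ X1 + X2, data = d), and its predictions should agree with the manual scores above.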

    • QDA:
      The classifier $C(x)$ is derived in the same way.

      \begin{align}
      C(x) &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \log P(Y=k|X=x) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \log P(X=x|Y=k) + \log P(Y=k) \\[5pt]
      &\quad (\text{by assuming } P(X=x|Y=k) \sim N(\mu_k, \Sigma_k) \text{ and dropping terms constant in } k) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \log N(\mu_k, \Sigma_k) + \log P(Y=k) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \log\left[\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma_k|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\right)\right] + \log P(Y=k) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}\log|\Sigma_k| + \log P(Y=k) \\[5pt]
      &= \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; -\frac{1}{2}x^T\Sigma_k^{-1}x + x^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log P(Y=k) \\[5pt]
      &\equiv \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}}\; \delta_k(x)
      \end{align}

      The estimators are the same as those in LDA, except that each class $k$ now has its own covariance matrix $\Sigma_k$.

      \begin{align}
      \hat{\Sigma}_k &= \text{covariance matrix within class } k \\[5pt]
      &= \frac{1}{n_k-1} \sum_{i=1}^{n_k}(x_{ik}-\hat{\mu}_k)(x_{ik}-\hat{\mu}_k)^T
      \end{align}
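
      A corresponding sketch for QDA, this time simulating data whose two groups have different covariance matrices (again, object names such as Sigma_hat and qda_fit are illustrative):

      # Simulate data with class-specific covariances (QDA assumption)
      d <- bvm(m_1, m_2, sigma_1, sigma_2, n1 = 2000, n2 = 2000)$data

      # Per-class covariance estimates (cov() uses the divisor n_k - 1)
      Sigma_hat <- lapply(split(d[, c("X1", "X2")], d$group), cov)

      # Fit QDA with MASS and predict the class of a new observation
      qda_fit <- qda(group ~ X1 + X2, data = d)
      predict(qda_fit, newdata = data.frame(X1 = 0, X2 = 0))$class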

  • Decision Boundary
    • LDA:
      The decision boundary of LDA between any two classes $k$ and $l$ is linear (a hyperplane in $\mathbb{R}^p$), which can be derived as below.
      \begin{align}
      &\delta_k(x) - \delta_l(x) = 0 \\[5pt]
      &\Rightarrow x^T\Sigma^{-1}(\mu_k-\mu_l) - \frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + \log \frac{P(Y=k)}{P(Y=l)} = 0 \\[5pt]
      &\Rightarrow b_1^T x + b_0 = 0
      \end{align}
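
      For the two-class simulated example, the boundary coefficients can be computed directly from the plug-in estimates mu_hat, pooled, and pri of the earlier LDA sketch (a sketch only, with purely illustrative names):

      # Coefficients of the linear boundary b1' x + b0 = 0 between classes "1" and "2"
      b1 <- solve(pooled) %*% (mu_hat[["1"]] - mu_hat[["2"]])
      b0 <- drop(-0.5 * t(mu_hat[["1"]] + mu_hat[["2"]]) %*% solve(pooled) %*%
                   (mu_hat[["1"]] - mu_hat[["2"]]) + log(pri[["1"]] / pri[["2"]]))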

    • QDA:
      The decision boundary of QDA is quadratic in $x$, which can be derived as below.
      \begin{align}
      &\delta_k(x) - \delta_l(x) = 0 \\[5pt]
      &\Rightarrow -\frac{1}{2}x^T(\Sigma_k^{-1}-\Sigma_l^{-1})x + x^T(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) - \frac{1}{2}(\mu_k^T\Sigma_k^{-1}\mu_k-\mu_l^T\Sigma_l^{-1}\mu_l) \\[5pt]
      &\quad -\frac{1}{2}\left(\log|\Sigma_k|-\log|\Sigma_l|\right) + \log\frac{P(Y=k)}{P(Y=l)} = 0 \\[5pt]
      &\Rightarrow x^T B_2 x + b_1^T x + b_0 = 0
      \end{align}

      The following graphs are from The Elements of Statistical Learning: Data Mining, Inference, and Prediction, which give a clear picture of what the decision boundaries look like for LDA and QDA.
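
      Those figures are not reproduced here, but the same idea can be sketched on the simulated data by predicting over a grid with MASS::lda and MASS::qda and tracing each boundary as a contour (a rough illustration, not the figure from the book):

      # Data that violates the equal-covariance assumption, so LDA and QDA differ
      d  <- bvm(m_1, m_2, sigma_1, sigma_2, n1 = 2000, n2 = 2000)$data
      gr <- expand.grid(X1 = seq(-5, 4, length.out = 200),
                        X2 = seq(-4, 4, length.out = 200))
      gr$lda <- as.numeric(predict(lda(group ~ X1 + X2, data = d), gr)$class)
      gr$qda <- as.numeric(predict(qda(group ~ X1 + X2, data = d), gr)$class)

      ggplot(d, aes(X1, X2, color = group)) +
        geom_point(alpha = 0.2) +
        geom_contour(data = gr, aes(X1, X2, z = lda), breaks = 1.5,
                     color = "black", linetype = "dashed", inherit.aes = FALSE) +
        geom_contour(data = gr, aes(X1, X2, z = qda), breaks = 1.5,
                     color = "black", inherit.aes = FALSE) +
        ggtitle("LDA (dashed) vs. QDA (solid) decision boundaries")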

  • Strengths and Weaknesses
    • Strengths:
      • Both LDA and QDA are simple and intuitive methods.
    • Weaknesses:
      1. In practice, it is rare for a dataset to satisfy the assumption of class-conditional multivariate normality exactly. The methods can still be used when the assumption is not fully met, but their performance may suffer.
      2. They only make sense when all the predictors are numeric.
  • Further topics