
Jordan Lecture Note-4: Linear & Ridge Regression


Given n data points \{(x_1,y_1),(x_2,y_2),\cdots,(x_n,y_n)\}, x_i\in\mathbb{R}^d, y_i\in\mathbb{R}, we record the data with the following matrix and vector:

\begin{equation}\mathbf{X}=\left[\begin{array}{c} x_1^\prime\\ x_2^\prime\\\vdots\\ x_n^\prime\end{array}\right]\quad y=\left(\begin{array}{c}y_1\\y_2\\\vdots\\y_n\end{array}\right)\end{equation}


We want to fit the model y=\mathbf{X}\beta+\epsilon, where \epsilon is Gaussian noise with mean 0 and variance \sigma^2.
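To make the setup concrete, here is a minimal numpy sketch (not from the original note) that simulates this model; the sizes n and d, the noise level sigma, and beta_true are all assumed example values:

```python
import numpy as np

# Simulate y = X beta + eps with eps ~ N(0, sigma^2) (assumed example sizes).
rng = np.random.default_rng(0)
n, d = 100, 3                          # sample size and feature dimension
sigma = 0.5                            # noise standard deviation
beta_true = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(n, d))            # rows of X are the x_i'
eps = rng.normal(scale=sigma, size=n)  # Gaussian noise, mean 0, variance sigma^2
y = X @ beta_true + eps
```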

I. Maximum Likelihood Estimation

The density function of \epsilon:

f(\epsilon)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{\epsilon^2}{2\sigma^2}\right\}=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{(y-x^\prime\beta)^2}{2\sigma^2}\right\}


The likelihood function:

L(\beta)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{(y_i-x_i^\prime\beta)^2}{2\sigma^2}\right\}


The log-likelihood function:

l(\beta)=n\log\frac{1}{\sqrt{2\pi}\sigma}-\sum_{i=1}^n\frac{(y_i-x_i^\prime\beta)^2}{2\sigma^2}


Setting \frac{dl(\beta)}{d\beta}=\frac{1}{\sigma^2}\left(\mathbf{X}^\prime y-\mathbf{X}^\prime\mathbf{X}\beta\right)=0 \Longrightarrow (\mathbf{X}^\prime\mathbf{X})\hat{\beta}_{ML}=\mathbf{X}^\prime y

where \mathbf{X}^\prime\mathbf{X} and \mathbf{X}^\prime y are sufficient statistics.

Sufficient statistic: intuitively, a statistic that captures all the information about the unknown parameter contained in the sample. More precisely, once the statistic T is given, the conditional distribution of the sample no longer depends on the parameter \theta. Formal definition: let \mathcal{F}=\{F\} be a family of distributions, let (x_1,x_2,\cdots,x_n) be a sample drawn from some population F\in\mathcal{F}, and let T=T(x_1,x_2,\cdots,x_n) be a (one- or multi-dimensional) statistic. If, given T=t, the conditional distribution of the sample (x_1,x_2,\cdots,x_n) does not depend on the population distribution F, then T is called a sufficient statistic for the family. (For example, for Gaussian samples with known variance, the sample mean is sufficient for the unknown mean.)

Assume (\mathbf{X}^\prime\mathbf{X})^{-1} exists; then

\begin{align*}\hat{\beta}_{ML}&=(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime y\\&=\mathbf{X}^\prime\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-2}\mathbf{X}^\prime y\\&=\mathbf{X}^\prime\alpha\end{align*}


where \alpha=\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-2}\mathbf{X}^\prime y. The resulting prediction model: for a new point x, \hat{y}=x^\prime\hat{\beta}_{ML}=x^\prime\mathbf{X}^\prime\alpha.
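As a small sanity check (a sketch under the assumption that \mathbf{X}^\prime\mathbf{X} is invertible, reusing the simulated X and y above), the primal and dual forms of the estimator agree numerically:

```python
import numpy as np

# Primal form: beta = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_ml = XtX_inv @ (X.T @ y)

# Dual form: beta = X'alpha with alpha = X (X'X)^{-2} X'y
alpha = X @ XtX_inv @ XtX_inv @ (X.T @ y)
beta_dual = X.T @ alpha

assert np.allclose(beta_ml, beta_dual)  # the two forms coincide
```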

II. Least Squares

Principle: choose \beta so that the sum of squared residuals, i.e., the squared vertical distances from the observations to the fitted hyperplane, is minimized. The model is:

\begin{equation}\mathop{\min}\quad  \sum_{i=1}^n(y_i-x_i^\prime\beta)^2\label{equ:leastSquare}\end{equation}


Differentiating \eqref{equ:leastSquare} with respect to \beta and setting the derivative to zero gives \mathbf{X}^\prime\mathbf{X}\beta=\mathbf{X}^\prime y. Assuming again that \mathbf{X}^\prime\mathbf{X} is invertible, \hat{\beta}_{LS}=(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime y, which coincides with the maximum likelihood estimator.
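In code one would normally not form \mathbf{X}^\prime\mathbf{X} explicitly, since a least-squares solver is numerically more stable. A hedged sketch, again reusing X and y from above:

```python
import numpy as np

# Numerically preferred: solve the least-squares problem directly.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Equivalent when X'X is invertible: solve the normal equations X'X beta = X'y.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(beta_ls, beta_ne)
```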

III. Ridge Regression

When the covariates are multicollinear, the matrix \mathbf{X}^\prime\mathbf{X} may fail to be invertible, or |\mathbf{X}^\prime\mathbf{X}| may be very small, so the least-squares coefficients become unstable and overfit. Adding a quadratic penalty to the least-squares objective yields ridge regression.

1) The view from maximum likelihood with an added penalty term -\lambda\|\beta\|^2.

The density function of \epsilon becomes:

f(\epsilon)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{\epsilon^2+\lambda\|\beta\|^2}{2\sigma^2}\right\}=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{(y-x^\prime\beta)^2+\lambda\beta^\prime\beta}{2\sigma^2}\right\}


The likelihood function:

L(\beta)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{(y_i-x_i^\prime\beta)^2+\lambda\beta^\prime\beta}{2\sigma^2}\right\}


The log-likelihood function:

l(\beta)=n\log\frac{1}{\sqrt{2\pi}\sigma}-\sum_{i=1}^n\frac{(y_i-x_i^\prime\beta)^2+\lambda\beta^\prime\beta}{2\sigma^2}

Differentiating the log-likelihood:

\frac{dl(\beta)}{d\beta}=0 \Longrightarrow -\mathbf{X}^\prime y+\mathbf{X}^\prime\mathbf{X}\beta+\lambda\beta=0 \Longrightarrow \mathbf{X}^\prime y=(\mathbf{X}^\prime\mathbf{X}+\lambda\mathbf{I})\hat{\beta}_{ML}

Since the matrix (\mathbf{X}^\prime \mathbf{X}+\lambda\mathbf{I}) is always invertible (for \lambda>0 it is positive definite), we have:

\hat{\beta}_{ML}=(\mathbf{X}^\prime \mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^\prime y
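A minimal sketch of the ridge estimator, reusing X and y from above (the penalty strength lam is an assumed example value):

```python
import numpy as np

lam = 1.0                              # assumed example penalty strength
d = X.shape[1]

# Ridge estimator: (X'X + lam I)^{-1} X'y; for lam > 0 the matrix is
# positive definite, so the solve always succeeds.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```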

2) The Bayesian view

Assume the quantities of interest follow some probability distribution; inference then combines these probabilities with the observed data to reach an optimal decision.

Bayes' rule: \mathbb{P}(h|D)=\frac{\mathbb{P}(h)\mathbb{P}(D|h)}{\mathbb{P}(D)}.

Maximum a posteriori (MAP) estimation:

h_{MAP}=\mathop{\arg\max}_{h\in H}\mathbb{P}(h|D)=\mathop{\arg\max}_{h\in H}\frac{\mathbb{P}(h)\mathbb{P}(D|h)}{\mathbb{P}(D)}=\mathop{\arg\max}_{h\in H}\mathbb{P}(h)\mathbb{P}(D|h)

Assume \beta has the prior distribution \beta\sim N(0,\lambda^{-1}\mathbf{I}); then

\begin{align*}\mathop{\max}_{h\in H}\mathbb{P}(h|D)&=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{(y-\mathbf{X}\beta)^\prime(y-\mathbf{X}\beta)}{2\sigma^2}\right\}\cdot\frac{\sqrt{\lambda}}{\sqrt{2\pi}}\exp\left\{-\frac{\lambda\beta^\prime\beta}{2}\right\}\\ &=\frac{\sqrt{\lambda}}{2\pi\sigma}\exp\left\{-\frac{(y-\mathbf{X}\beta)^\prime(y-\mathbf{X}\beta)}{2\sigma^2}-\frac{\lambda\beta^\prime\beta}{2}\right\}\end{align*}


\Longrightarrow\mathop{\min}_\beta\frac{(y-\mathbf{X}\beta)^\prime(y-\mathbf{X}\beta)}{2\sigma^2}+\frac{\lambda}{2}\beta^\prime\beta

Setting the derivative with respect to \beta equal to zero \Longrightarrow \frac{-\mathbf{X}^\prime(y-\mathbf{X}\beta)}{\sigma^2}+\lambda\beta=0

\Longrightarrow (\mathbf{X}^\prime\mathbf{X}+\sigma^2\lambda\mathbf{I})\hat{\beta}_{MAP}=\mathbf{X}^\prime y

\Longrightarrow \sigma^2\lambda\hat{\beta}_{MAP}=\mathbf{X}^\prime y-\mathbf{X}^\prime\mathbf{X}\hat{\beta}_{MAP}=\mathbf{X}^\prime(y-\mathbf{X}\hat{\beta}_{MAP})

\Longrightarrow \hat{\beta}_{MAP}=(\sigma^2\lambda)^{-1}\mathbf{X}^\prime(y-\mathbf{X}\hat{\beta}_{MAP})\triangleq\mathbf{X}^\prime\alpha

where \alpha = (\sigma^2\lambda)^{-1}(y-\mathbf{X}\hat{\beta}_{MAP}). Multiplying through by \sigma^2\lambda and substituting \hat{\beta}_{MAP}=\mathbf{X}^\prime\alpha:

\sigma^2\lambda\alpha = y-\mathbf{X}\hat{\beta}_{MAP}=y-\mathbf{X}\mathbf{X}^\prime\alpha

\Longrightarrow (\sigma^2\lambda\mathbf{I}+\mathbf{X}\mathbf{X}^\prime)\alpha=y

\Longrightarrow \alpha=(\mathbf{X}\mathbf{X}^\prime+\sigma^2\lambda\mathbf{I})^{-1}y=(\mathbf{K}+\lambda\sigma^2\mathbf{I})^{-1}y

Thus we only need the matrix \mathbf{K}=\mathbf{X}\mathbf{X}^\prime to compute \alpha and hence \beta. If we replace \mathbf{K} with a kernel matrix, we can perform the regression in a higher-dimensional feature space without ever having to specify the feature map explicitly.
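A sketch of this kernelized form, using the dual solution \alpha=(\mathbf{K}+\lambda\sigma^2\mathbf{I})^{-1}y derived above; the RBF kernel and its bandwidth gamma are assumed choices for illustration, not part of the note, and lam_sigma2 plays the role of \lambda\sigma^2:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

lam_sigma2 = 1.0                       # plays the role of lambda * sigma^2
K = rbf_kernel(X, X)                   # kernel matrix replacing X X'
alpha = np.linalg.solve(K + lam_sigma2 * np.eye(len(y)), y)

# Predictions at new points require only kernel evaluations against the
# training data, never the (possibly infinite-dimensional) feature map.
X_new = X[:5]                          # demo: reuse a few training rows
y_pred = rbf_kernel(X_new, X) @ alpha
```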


Source: http://www.cnblogs.com/boostable/p/lec_linear_ridge_regression.html
