Proof of the equivalent formulations of ridge regression

15

I have read the most popular books in statistical learning:

1- The Elements of Statistical Learning.

2- An Introduction to Statistical Learning.

Both mention that ridge regression has two formulations that are equivalent. Is there an understandable mathematical proof of this result?

I also went through Cross Validated, but I could not find a definite proof there.

Furthermore, will LASSO enjoy the same type of proof?

— jeza

2

en.wikipedia.org/wiki/…

— Taylor

1

Lasso is not a form of ridge regression.

— Xi'an

@jeza, could you explain what is missing from my answer? It really derives everything that can be derived about the connection.

— Royi

@jeza, could you be specific? If you don't know the concept of the Lagrangian for a constrained problem, it is hard to give a concise answer.

— Royi

1

@jeza, a constrained optimization problem can be converted into optimization of the Lagrangian function / KKT conditions (as explained in the current answers). This principle already has many different simple explanations all over the internet. In what direction is more explanation of the proof needed? Explanation/proof of the Lagrange multiplier/function, explanation/proof of how this problem is an instance of optimization that relates to the method of Lagrange, the difference between KKT and Lagrange, an explanation of the principle of regularization, etc.?

— Sextus Empiricus

19

The classic ridge regression (Tikhonov regularization) is given by:

$\arg \min_{x} \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} + \lambda {\left\| x \right\|}_{2}^{2}$

The claim above is that the following problem is equivalent:

$\begin{aligned} \arg \min_{x} \quad & \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} \\ \text{subject to} \quad & {\left\| x \right\|}_{2}^{2} \leq t \end{aligned}$

Let's define $\hat{x}$ as the optimal solution of the first problem and $\tilde{x}$ as the optimal solution of the second problem.

The claim of equivalence means that $\forall t, \: \exists \lambda \geq 0 : \hat{x} = \tilde{x}$.
Namely, you can always find a pair of $t$ and $\lambda \geq 0$ such that the solutions of the two problems are the same.

How could we find such a pair?
Well, by solving the problems and looking at the properties of the solutions.
Both problems are convex and smooth, so that should make things simpler.

The solution of the first problem is given at the point where the gradient vanishes, which means:

$\hat{x} - y + 2 \lambda \hat{x} = 0$

The KKT conditions of the second problem state:

$\tilde{x} - y + 2 \mu \tilde{x} = 0$

and

$\mu \left( {\left\| \tilde{x} \right\|}_{2}^{2} - t \right) = 0$

The last equation says that either $\mu = 0$ or ${\left\| \tilde{x} \right\|}_{2}^{2} = t$.

Note that the two stationarity equations are equivalent.
Namely, if $\hat{x} = \tilde{x}$ and $\mu = \lambda$, both equations hold.

So in the case ${\left\| y \right\|}_{2}^{2} \leq t$ one must set $\mu = 0$, which means that for $t$ large enough, in order for both problems to be equivalent, one must set $\lambda = 0$.

Otherwise, one should find $\mu$ such that:

${y}^{T} \left( I + 2 \mu I \right)^{-1} \left( I + 2 \mu I \right)^{-1} y = t$

This is basically the condition ${\left\| \tilde{x} \right\|}_{2}^{2} = t$, since the first KKT equation gives $\tilde{x} = \left( I + 2 \mu I \right)^{-1} y$.

Once you find that $\mu$, the solutions will coincide.
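The search for $\mu$ described above can be sketched numerically. This is a minimal illustration, not part of the original answer: the function names `penalized_solution`, `constrained_solution`, and `find_mu`, the bisection search, and the values of `y` and `t` are all assumptions chosen for the example. The constrained solution is computed independently as the projection of $y$ onto the ball of radius $\sqrt{t}$, so the comparison is not circular.

```python
import math


def penalized_solution(y, lam):
    """Minimizer of 0.5*||x - y||^2 + lam*||x||^2, from x - y + 2*lam*x = 0."""
    return [yi / (1.0 + 2.0 * lam) for yi in y]


def constrained_solution(y, t):
    """Minimizer of 0.5*||x - y||^2 subject to ||x||^2 <= t.

    This is the Euclidean projection of y onto the ball of radius sqrt(t).
    """
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_y * norm_y <= t:
        return list(y)  # constraint is slack, so x = y
    scale = math.sqrt(t) / norm_y
    return [scale * yi for yi in y]


def find_mu(y, t, tol=1e-12):
    """Find mu >= 0 with ||y / (1 + 2*mu)||^2 = t by bisection.

    Returns 0 when ||y||^2 <= t (the constraint is not binding).
    """
    sq_norm_y = sum(yi * yi for yi in y)
    if sq_norm_y <= t:
        return 0.0

    def excess(mu):
        # Squared norm of the penalized solution minus the bound t;
        # this is strictly decreasing in mu.
        return sq_norm_y / (1.0 + 2.0 * mu) ** 2 - t

    lo, hi = 0.0, 1.0
    while excess(hi) > 0.0:  # grow the bracket until the norm drops below t
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative data (assumed for this example): ||y||^2 = 25 and t = 4.
y = [3.0, 4.0]
t = 4.0
mu = find_mu(y, t)
x_hat = penalized_solution(y, mu)      # first problem with lambda = mu
x_tilde = constrained_solution(y, t)   # second problem with bound t
print(mu, x_hat, x_tilde)
```

For these illustrative values the search returns $\mu \approx 0.75$, and both problems give approximately $(1.2, 1.6)$.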

Regarding the ${L}_{1}$ (LASSO) case, it works with the same idea.
The only difference is that we don't have a closed-form solution, hence deriving the connection is trickier.

Have a look at my answers at StackExchange Cross Validated Q291962 and StackExchange Signal Processing Q21730 - Significance of $\lambda$ in Basis Pursuit.

Remark
What is actually happening?
In both problems, $x$ tries to be as close as possible to $y$.
In the first case, $x = y$ makes the first term (the ${L}_{2}$ distance) vanish, and in the second case it makes the objective function vanish.
The difference is that in the first case one must balance this against the ${L}_{2}$ norm of $x$. As $\lambda$ gets higher, the balance means you should make $x$ smaller.
In the second case there is a wall: you bring $x$ closer and closer to $y$ until you hit the wall, which is the constraint on its norm (set by $t$).
If the wall is far enough away (a high value of $t$), where "enough" depends on the norm of $y$, then it has no effect, just as $\lambda$ is relevant only once its value multiplied by the norm of $y$ starts to be meaningful.
The exact connection is given by the Lagrangian stated above.
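For the special form used in this answer (no design matrix, so the first-order condition can be solved directly), the connection can be written out as a short worked derivation. The notation $t(\lambda)$ and $\lambda(t)$ is introduced here for illustration only; for the general regression form with a design matrix no comparably simple expression should be expected.

$\begin{aligned} \hat{x} &= \frac{y}{1 + 2 \lambda} \\ t(\lambda) &= {\left\| \hat{x} \right\|}_{2}^{2} = \frac{{\left\| y \right\|}_{2}^{2}}{(1 + 2 \lambda)^{2}} \\ \lambda(t) &= \begin{cases} \frac{1}{2} \left( \frac{{\left\| y \right\|}_{2}}{\sqrt{t}} - 1 \right) & 0 < t < {\left\| y \right\|}_{2}^{2} \\ 0 & t \geq {\left\| y \right\|}_{2}^{2} \end{cases} \end{aligned}$

The map depends on the data only through ${\left\| y \right\|}_{2}$, and it is decreasing in $t$: a closer wall corresponds to a larger penalty.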

Resources

I found this paper today (03/04/2019):

Approximation Hardness for A Class of Sparse Optimization Problems.

— Royi

Does equivalence mean that $\lambda$ and $t$ should be the same? Because I can't see that in the proof. Thanks.

— jeza

@jeza, as I wrote above, for any $t$ there exists $\lambda \geq 0$ (not necessarily equal to $t$, but a function of $t$ and the data $y$) such that the solutions of the two forms are the same.

— Royi

3

@jeza, both $\lambda$ & $t$ are essentially free parameters. Once you specify, say, $\lambda$, that yields a specific optimal solution. But $t$ remains a free parameter. So at this point the claim is that there can be some value of $t$ that would yield the same optimal solution. There are essentially no constraints on what that $t$ must be; it's not as if it has to be some fixed function of $\lambda$, like $t=\lambda/2$ or something.

— gung - Reinstate Monica

@Royi, I would like to know 1- why does your formula have (1/2), while the formulas in the question do not? 2- Are you using KKT to show the equivalence of the two formulas? 3- If so, I still can't see this equivalence. I am not sure, but what I expect to see is a proof showing that formula one = formula two.

— jeza

1. It is just easier when you differentiate the LS term. You can move from my $\lambda$ to the OP's $\lambda$ by a factor of two. 2. I used KKT for the second case. The first case has no constraints, hence you can just solve it. 3. There is no closed-form equation between them. I showed the logic and how you can create a graph connecting them. But as I wrote, it will change for each $y$ (it is data dependent).

— Royi

9

A less mathematically rigorous, but possibly more intuitive, approach to understanding what is going on is to start with the constraint version (equation 3.42 in the question) and solve it using the method of "Lagrange multipliers" (https://en.wikipedia.org/wiki/Lagrange_multiplier or your favorite multivariable calculus text). Just remember that in calculus $x$ is the vector of variables, but in our case $x$ is constant and $\beta$ is the variable vector. Once you apply the Lagrange multiplier technique you end up with the first equation (3.41) (after throwing away the extra $-\lambda t$, which is constant with respect to the minimization and can be ignored).

This also shows that this works for lasso and other constraints.
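A sketch of that calculation for the ridge case, assuming centered data so that the intercept can be dropped, and writing $X$ for the matrix whose rows are the observations and $y$ for the response vector (this matrix notation is introduced here for illustration and does not appear in the answer):

$\begin{aligned} L(\beta, \lambda) &= {\left\| y - X \beta \right\|}_{2}^{2} + \lambda \left( {\left\| \beta \right\|}_{2}^{2} - t \right) \\ \nabla_{\beta} L &= -2 X^{T} (y - X \beta) + 2 \lambda \beta = 0 \\ \beta &= \left( X^{T} X + \lambda I \right)^{-1} X^{T} y \quad (\lambda > 0) \end{aligned}$

The term $-\lambda t$ does not depend on $\beta$, so it disappears when differentiating. The stationarity condition is therefore the same one obtained from the penalized objective ${\left\| y - X \beta \right\|}_{2}^{2} + \lambda {\left\| \beta \right\|}_{2}^{2}$.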

— Greg Snow

8

It's perhaps worth reading about Lagrangian duality and a broader relation (at times equivalence) between:

optimization subject to hard (i.e. inviolable) constraints
optimization with penalties for violating constraints.

Quick intro to weak duality and strong duality

Assume we have some function $f(x,y)$ of two variables. For any $\hat{x}$ and $\hat{y}$ , we have:

$\min_x f(x, \hat{y}) \leq f(\hat{x}, \hat{y}) \leq \max_y f(\hat{x}, y)$

Since that holds for any $\hat{x}$ and $\hat{y}$ it also holds that:

$\max_y \min_x f(x, y) \leq \min_x \max_y f(x, y)$

This is known as weak duality. In certain circumstances, you also have strong duality (also known as the saddle point property):

$\max_y \min_x f(x, y) = \min_x \max_y f(x, y)$

When strong duality holds, solving the dual problem also solves the primal problem. They're in a sense the same problem!
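The two inequalities can be checked on a tiny discrete example. This is only an illustrative sketch: the helper names `max_min` and `min_max` and the two payoff tables are made up for the example, with $x$ and $y$ each restricted to two values.

```python
def max_min(table, xs, ys):
    """max over y of (min over x of f(x, y)), with f given as a lookup table."""
    return max(min(table[(x, y)] for x in xs) for y in ys)


def min_max(table, xs, ys):
    """min over x of (max over y of f(x, y)), with f given as a lookup table."""
    return min(max(table[(x, y)] for y in ys) for x in xs)


xs = [0, 1]
ys = [0, 1]

# A table with no saddle point: weak duality holds with a strict gap.
gap_table = {(0, 0): 1, (0, 1): 4, (1, 0): 3, (1, 1): 2}

# A table with a saddle point at (x, y) = (0, 1): strong duality holds.
saddle_table = {(0, 0): 2, (0, 1): 3, (1, 0): 1, (1, 1): 4}

for name, table in [("gap", gap_table), ("saddle", saddle_table)]:
    lower = max_min(table, xs, ys)
    upper = min_max(table, xs, ys)
    print(name, lower, upper, lower <= upper)
```

In the first table the max-min value is 2 and the min-max value is 3, so only weak duality holds. In the second table both equal 3, the value at the saddle point.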

Lagrangian for constrained Ridge Regression

Let me define the function $\mathcal{L}$ as:

$\mathcal{L}(\mathbf{b}, \lambda) = \sum_{i=1}^n (y_i - \mathbf{x}_i \cdot \mathbf{b})^2 + \lambda \left( \sum_{j=1}^p b_j^2 - t \right)$

The min-max interpretation of the Lagrangian

The Ridge regression problem subject to hard constraints is:

$\min_\mathbf{b} \max_{\lambda \geq 0} \mathcal{L}(\mathbf{b}, \lambda)$

You pick $\mathbf{b}$ to minimize the objective, cognizant that after $\mathbf{b}$ is picked, your opponent will set $\lambda$ to infinity if you chose $\mathbf{b}$ such that $\sum_{j=1}^p b_j^2 > t$ .

If strong duality holds (which it does here because Slater's condition is satisfied for $t>0$ ), you then achieve the same result by reversing the order:

$\max_{\lambda \geq 0} \min_\mathbf{b} \mathcal{L}(\mathbf{b}, \lambda)$

Here, your opponent chooses $\lambda$ first! You then choose $\mathbf{b}$ to minimize the objective, already knowing their choice of $\lambda$. The $\min_\mathbf{b} \mathcal{L}(\mathbf{b}, \lambda)$ part (taking $\lambda$ as given) is equivalent to the 2nd form of your Ridge Regression problem.

As you can see, this isn't a result particular to Ridge regression. It is a broader concept.
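For ridge regression the correspondence can be checked numerically. The sketch below is an illustration under stated assumptions, not a reference implementation: the data are simulated, there is no intercept, and the function names (`ridge`, `lambda_for_bound`, `constrained_by_projected_gradient`) are made up for the example. It finds the multiplier $\lambda$ matching a given bound $t$ by bisection, then solves the hard-constrained problem independently with projected gradient descent and compares the two coefficient vectors.

```python
import numpy as np


def ridge(X, y, lam):
    """Minimizer of ||y - X b||^2 + lam * ||b||^2 (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def lambda_for_bound(X, y, t, tol=1e-12):
    """Multiplier lam >= 0 whose ridge solution satisfies the bound ||b||^2 <= t.

    Assumes t > 0 and that X has full column rank.
    """
    b_ols = ridge(X, y, 0.0)
    if b_ols @ b_ols <= t:
        return 0.0  # constraint is slack: plain least squares, lam = 0
    lo, hi = 0.0, 1.0
    while True:  # grow the bracket until the ridge solution fits in the ball
        b = ridge(X, y, hi)
        if b @ b <= t:
            break
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        b = ridge(X, y, mid)
        if b @ b > t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def constrained_by_projected_gradient(X, y, t, iters=5000):
    """Independent check: minimize ||y - X b||^2 subject to ||b||^2 <= t."""
    # Step size 1/L, where L = 2 * sigma_max(X)^2 is the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    radius = np.sqrt(t)
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        gradient = -2.0 * X.T @ (y - X @ b)
        b = b - step * gradient
        norm_b = np.linalg.norm(b)
        if norm_b > radius:  # project back onto the ball of radius sqrt(t)
            b = b * (radius / norm_b)
    return b


# Simulated data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

t = 1.0
lam = lambda_for_bound(X, y, t)
b_penalized = ridge(X, y, lam)
b_constrained = constrained_by_projected_gradient(X, y, t)
print(lam, b_penalized, b_constrained)
```

The value of $\lambda$ found this way depends on both $t$ and the data; changing `X` or `y` changes the pairing.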

References

(I started this post following an exposition I read from Rockafellar.)

Rockafellar, R.T., Convex Analysis

You might also examine lecture 7 and lecture 8 from Prof. Stephen Boyd's course on convex optimization.

— Matthew Gunn

note that your answer can be extended to any convex function.

— 81235

6

They are not equivalent.

For a constrained minimization problem

$\min_{\mathbf b} \sum_{i=1}^n (y_i - \mathbf{x}'_i \cdot \mathbf{b})^2 \quad \text{s.t.} \quad \sum_{j=1}^p b_j^2 \leq t,\;\;\; \mathbf b = (b_1,...,b_p) \tag{1}$

we solve by minimizing over $\mathbf b$ the corresponding Lagrangean

$\Lambda = \sum_{i=1}^n (y_i - \mathbf{x}'_i \cdot \mathbf{b})^2 + \lambda \left( \sum_{j=1}^p b_j^2 - t \right) \tag{2}$

Here, $t$ is a bound given exogenously, $\lambda \geq 0$ is a Karush-Kuhn-Tucker non-negative multiplier, and both the beta vector and $\lambda$ are to be determined optimally through the minimization procedure given $t$ .

Comparing $(2)$ and eq $(3.41)$ in the OP's post, it appears that the Ridge estimator can be obtained as the solution to

$\min_{\mathbf b}\{\Lambda + \lambda t\} \tag{3}$

Since in $(3)$ the function to be minimized appears to be the Lagrangean of the constrained minimization problem plus a term that does not involve $\mathbf b$ , it would appear that indeed the two approaches are equivalent...

But this is not correct, because in the Ridge regression we minimize over $\mathbf b$ given $\lambda >0$. Through the lens of the constrained minimization problem, assuming $\lambda >0$ imposes the condition that the constraint is binding, i.e. that

$\sum_{j=1}^p (b^*_{j,ridge})^2 = t$

The general constrained minimization problem allows for $\lambda = 0$ also, and essentially it is a formulation that includes as special cases the basic least-squares estimator ( $\lambda ^*=0$ ) and the Ridge estimator ( $\lambda^* >0$ ).

So the two formulations are not equivalent. Nevertheless, Matthew Gunn's post shows in another and very intuitive way how the two are very closely connected. But duality is not equivalence.

— Alecos Papadopoulos

@MartijnWeterings Thanks for the comment, I have reworked my answer.

— Alecos Papadopoulos

@MartijnWeterings I do not see what is confusing since the expression written in your comment is exactly the expression I wrote in my reworked post.

— Alecos Papadopoulos

1

This was the duplicate question I had in mind, where the equivalence is explained very intuitively to me: math.stackexchange.com/a/336618/466748. The argument that you give for the two not being equivalent seems only secondary to me, and a matter of definition (the OP uses $\lambda \geq 0$ instead of $\lambda > 0$, and we could just as well add the constraint $t < \Vert \beta^{OLS} \Vert^2_2$ to exclude the cases where $\lambda=0$).

— Sextus Empiricus

@MartijnWeterings When A is a special case of B, A cannot be equivalent to B. And ridge regression is a special case of the general constrained minimization problem, namely a situation at which we arrive if we constrain the general problem further (as you do in your last comment).

— Alecos Papadopoulos

Certainly you could define some constrained minimization problem that is more general than ridge regression (just as you can define some regularization problem that is more general than ridge regression, e.g. negative ridge regression), but then the non-equivalence is due to the way that you define the problem and not due to the transformation from the constrained representation to the Lagrangian representation. The two forms can be seen as equivalent within the (non-general) constrained formulation/definition that is useful for ridge regression.

— Sextus Empiricus