For which models does the bias of the MLE fall faster than the variance?



Let $\hat\theta$ denote the maximum likelihood estimator of a true parameter $\theta^*$ of some model, and let $n$ denote the number of data points. As $n$ increases, the error $\lVert\hat\theta-\theta^*\rVert$ typically decreases as $O(1/\sqrt n)$. Using the triangle inequality and properties of the expectation, one can show that this error rate implies that both the "bias" $\lVert \mathbb{E}\hat\theta-\theta^*\rVert$ and the "deviation" $\lVert \mathbb{E}\hat\theta-\hat\theta\rVert$ decrease at the same $O(1/\sqrt n)$ rate.

I am interested in models whose bias shrinks faster than $O(1/\sqrt n)$, but where the error does not shrink at that faster rate because the deviation still shrinks as $O(1/\sqrt n)$. In particular, I would like to know sufficient conditions for a model's bias to shrink at the rate $O(1/n)$.
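For concreteness, here is a sketch of that triangle-inequality argument, assuming the $O(1/\sqrt n)$ error bound holds in expectation: by Jensen's inequality applied to the (convex) norm,

$$\lVert \mathbb{E}\hat\theta-\theta^*\rVert=\lVert \mathbb{E}(\hat\theta-\theta^*)\rVert\le \mathbb{E}\lVert\hat\theta-\theta^*\rVert=O(1/\sqrt n),$$

and then $\lVert\mathbb{E}\hat\theta-\hat\theta\rVert\le\lVert\mathbb{E}\hat\theta-\theta^*\rVert+\lVert\theta^*-\hat\theta\rVert=O(1/\sqrt n)$ by the triangle inequality.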


Is $\lVert\hat\theta-\theta^*\rVert=(\hat\theta-\theta^*)^2$? Or something else?
Alecos Papadopoulos

I was specifically asking about the L2 norm, yes. But I am also interested in other norms if that makes the question easier to answer.
Mike Izbicki

$(\hat\theta-\theta)^2$ is $O_p(1/n)$.
Alecos Papadopoulos

Sorry, I misread your comment. For the L2 norm in $d$ dimensions, $\lVert a-b\rVert=\sqrt{\sum_{i=1}^{d}(a_i-b_i)^2}$, and so convergence is at the rate of $O(1/\sqrt n)$. I agree that if we squared it, then it would converge as $O(1/n)$.
Mike Izbicki

Have you seen the ridge regression paper (Hoerl & Kennard, 1970)? I believe it gives conditions on the design matrix and penalty under which this is expected to be true.
dcl

Answers:



In general, you need models where the MLE is not asymptotically normal but converges to some other distribution (and it does so at a faster rate). This usually happens when the parameter under estimation is at the boundary of the parameter space. Intuitively, this means that the MLE will approach the parameter "only from the one side", so it "improves on convergence speed" since it is not "distracted" by going "back and forth" around the parameter.

A standard example is the MLE for $\theta$ in an i.i.d. sample of $U(0,\theta)$ uniform r.v.'s. The MLE here is the maximum order statistic,

$$\hat\theta_n=u_{(n)}$$

Its finite sample distribution is

$$F_{\hat\theta_n}=\frac{(\hat\theta_n)^n}{\theta^n},\qquad f_{\hat\theta_n}=\frac{n(\hat\theta_n)^{n-1}}{\theta^n}$$

$$E(\hat\theta_n)=\frac{n}{n+1}\theta \quad\implies\quad B(\hat\theta_n)=-\frac{1}{n+1}\theta$$

So $B(\hat\theta_n)=O(1/n)$. But the same increased rate will also hold for the variance: $\operatorname{Var}(\hat\theta_n)=\frac{n}{(n+1)^2(n+2)}\theta^2=O(1/n^2)$, so its square root is $O(1/n)$ as well.
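As a quick numerical sanity check, here is a minimal Monte Carlo sketch (assuming NumPy; the sample sizes and replication count are arbitrary choices of mine) that bias and standard deviation of this MLE both shrink like $1/n$:

```python
import numpy as np

# Monte Carlo check: for the U(0, theta) MLE (the sample maximum),
# bias = -theta/(n+1) and sd = theta * sqrt(n / ((n+1)^2 * (n+2))),
# so n*bias and n*sd should both stabilize near -theta and theta.
rng = np.random.default_rng(0)
theta, reps = 1.0, 10_000

for n in [10, 100, 1000]:
    mle = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)  # u_(n) per replication
    bias = mle.mean() - theta
    sd = mle.std()
    print(f"n={n:5d}  n*bias={n * bias:+.3f}  n*sd={n * sd:.3f}")
```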

One can also verify that to obtain a limiting distribution we need to look at the variable $n(\theta-\hat\theta_n)$ (i.e. we need to scale by $n$), since

$$P\big[n(\theta-\hat\theta_n)\le z\big]=1-P\big[\hat\theta_n\le\theta-(z/n)\big]$$

$$=1-\frac{1}{\theta^n}\Big(\theta-\frac{z}{n}\Big)^n=1-\frac{\theta^n}{\theta^n}\Big(1-\frac{z/\theta}{n}\Big)^n$$

$$\to 1-e^{-z/\theta}$$

which is the CDF of the Exponential distribution.
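To see this limit numerically, here is a small sketch (again assuming NumPy; it samples the maximum directly by inverse-transform sampling, since $P[u_{(n)}\le x]=(x/\theta)^n$):

```python
import numpy as np

# Compare the empirical CDF of n*(theta - mle) with the exponential limit
# 1 - exp(-z/theta). Since P[u_(n) <= x] = (x/theta)^n, inverse-transform
# sampling gives u_(n) ~ theta * V**(1/n) for V ~ U(0,1).
rng = np.random.default_rng(1)
theta, n, reps = 2.0, 10_000, 200_000

mle = theta * rng.uniform(size=reps) ** (1.0 / n)
scaled = n * (theta - mle)

for z in [0.5, 1.0, 2.0, 4.0]:
    print(f"z={z:3.1f}  empirical={np.mean(scaled <= z):.4f}  "
          f"limit={1.0 - np.exp(-z / theta):.4f}")
```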

I hope this provides some direction.


This is getting close, but I'm specifically interested in situations where the bias shrinks faster than the variance.
Mike Izbicki

@MikeIzbicki Hmm... the bias convergence depends on the first moment of the distribution, and the (square root of the) variance is also a "first-order" magnitude. I am not sure that this is possible, because it appears it would imply that the moments of the limiting distribution "arise" at convergence rates that are not compatible with each other... I'll think about it though.
Alecos Papadopoulos


Following the comments on my other answer (and looking again at the title of the OP's question!), here is a not-very-rigorous theoretical exploration of the issue.

We want to determine whether the bias $B(\hat\theta_n)=E(\hat\theta_n)-\theta$ may have a different convergence rate than the square root of the variance,

$$B(\hat\theta_n)=O(1/n^{\delta}),\qquad \sqrt{\operatorname{Var}(\hat\theta_n)}=O(1/n^{\gamma}),\qquad \gamma\neq\delta\;\;?$$

We have

$$B(\hat\theta_n)=O(1/n^{\delta}) \implies \lim n^{\delta}\big[E(\hat\theta_n)-\theta\big]<K \implies \lim n^{2\delta}\big[E(\hat\theta_n)-\theta\big]^2<K'$$

$$\implies \big[E(\hat\theta_n)-\theta\big]^2=O(1/n^{2\delta}) \tag{1}$$

while

$$\sqrt{\operatorname{Var}(\hat\theta_n)}=O(1/n^{\gamma}) \implies \lim n^{\gamma}\sqrt{E\big[(\hat\theta_n-\theta)^2\big]-\big[E(\hat\theta_n)-\theta\big]^2}<M$$

$$\implies \lim n^{2\gamma}\Big(E\big[(\hat\theta_n-\theta)^2\big]-\big[E(\hat\theta_n)-\theta\big]^2\Big)<M'$$

$$\implies \lim n^{2\gamma}E\big[(\hat\theta_n-\theta)^2\big]-\lim n^{2\gamma}\big[E(\hat\theta_n)-\theta\big]^2<M' \tag{2}$$

We see that $(2)$ may hold if

A) Both components are $O(1/n^{2\gamma})$, in which case we can only have $\gamma=\delta$.

B) But it may also hold if

$$\lim n^{2\gamma}\big[E(\hat\theta_n)-\theta\big]^2=0 \implies \big[E(\hat\theta_n)-\theta\big]^2=o(1/n^{2\gamma}) \tag{3}$$

For (3) to be compatible with (1), we must have

$$n^{2\gamma}<n^{2\delta} \implies \delta>\gamma$$

So it appears that in principle it is possible to have the Bias converging at a faster rate than the square root of the variance. But we cannot have the square root of the variance converging at a faster rate than the Bias.
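For a concrete case where $\delta>\gamma$ (a standard textbook example, not taken from this answer): the Gaussian variance MLE $\hat\sigma^2_n=\frac{1}{n}\sum_i(x_i-\bar x)^2$ has bias $-\sigma^2/n=O(1/n)$, while $\sqrt{\operatorname{Var}(\hat\sigma^2_n)}\approx\sigma^2\sqrt{2/n}=O(1/\sqrt n)$, i.e. $\delta=1>\gamma=1/2$. A minimal simulation sketch (assuming NumPy):

```python
import numpy as np

# Gaussian variance MLE: bias = -sigma^2/n = O(1/n), while
# sd(s2) ~ sigma^2 * sqrt(2/n) = O(1/sqrt(n)), i.e. delta = 1 > gamma = 1/2.
rng = np.random.default_rng(2)
sigma2, reps = 1.0, 50_000

for n in [25, 100, 400]:
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2 = x.var(axis=1)  # ddof=0 (the default) gives the biased MLE
    print(f"n={n:4d}  bias={s2.mean() - sigma2:+.5f} (exact {-sigma2 / n:+.5f})  "
          f"sd={s2.std():.5f} (approx {sigma2 * np.sqrt(2.0 / n):.5f})")
```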


How would you reconcile this with the existence of unbiased estimators like ordinary least squares? In that case, $B(\hat\theta)=0$, but $\operatorname{Var}(\hat\theta)=O(1/n)$.
Mike Izbicki

@MikeIzbicki Is the concept of convergence/big-O applicable in this case? Because here $B(\hat\theta)$ is not "$O(\cdot)$-anything" to begin with.
Alecos Papadopoulos

In this case, $E\hat\theta=\theta$, so $B(\hat\theta)=E\hat\theta-\theta=0=O(1)=O(1/n^{0})$.
Mike Izbicki

@MikeIzbicki But also $B(\hat\theta)=O(n)$ or $B(\hat\theta)=O(1/n)$ or any other rate you care to write down. So which one is the rate of convergence here?
Alecos Papadopoulos

@MikeIzbicki I have corrected my answer to show that it is possible in principle to have the Bias converging faster, although I still think the "zero-bias" example is problematic.
Alecos Papadopoulos