For which models does the bias of the MLE fall faster than the variance?



Let $\hat\theta$ denote the maximum likelihood estimator of a true parameter $\theta^*$ of some model, and let $n$ denote the number of data points. As $n$ increases, the error $\lVert\hat\theta-\theta^*\rVert$ typically decreases as $O(1/\sqrt n)$. Using the triangle inequality and properties of the expectation, one can show that this error rate implies that both the "bias" $\lVert \mathbb{E}\hat\theta-\theta^*\rVert$ and the "deviation" $\lVert \mathbb{E}\hat\theta-\hat\theta\rVert$ decrease at the same $O(1/\sqrt n)$ rate.

I am interested in models whose bias shrinks faster than $O(1/\sqrt n)$, but where the error does not shrink at that faster rate because the deviation still shrinks as $O(1/\sqrt n)$. In particular, I would like to know sufficient conditions for a model's bias to shrink at the rate $O(1/n)$.
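For concreteness, here is a sketch of that triangle-inequality argument, assuming the $O(1/\sqrt n)$ error bound holds in expectation: by Jensen's inequality applied to the (convex) norm,

$$\lVert \mathbb{E}\hat\theta-\theta^*\rVert=\lVert \mathbb{E}(\hat\theta-\theta^*)\rVert\le \mathbb{E}\lVert\hat\theta-\theta^*\rVert=O(1/\sqrt n),$$

and then $\lVert\mathbb{E}\hat\theta-\hat\theta\rVert\le\lVert\mathbb{E}\hat\theta-\theta^*\rVert+\lVert\theta^*-\hat\theta\rVert=O(1/\sqrt n)$ by the triangle inequality.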


Is $\lVert\hat\theta-\theta^*\rVert=(\hat\theta-\theta^*)^2$? Or something else?
Alecos Papadopoulos

I was specifically asking about the L2 norm, yes. But I am also interested in other norms if that makes the question easier to answer.
Mike Izbicki

$(\hat\theta-\theta)^2$ is $O_p(1/n)$.
Alecos Papadopoulos

Sorry, I misread your comment. For the L2 norm in $d$ dimensions, $\lVert a-b\rVert=\sqrt{\sum_{i=1}^{d}(a_i-b_i)^2}$, and so convergence is at the rate of $O(1/\sqrt n)$. I agree that if we squared it, then it would converge as $O(1/n)$.
Mike Izbicki

Have you seen the ridge regression paper (Hoerl & Kennard, 1970)? I believe it gives conditions on the design matrix and penalty under which this is expected to be true.
dcl

Answers:



In general, you need models where the MLE is not asymptotically normal but converges to some other distribution (and it does so at a faster rate). This usually happens when the parameter under estimation is at the boundary of the parameter space. Intuitively, this means that the MLE will approach the parameter "only from the one side", so it "improves on convergence speed" since it is not "distracted" by going "back and forth" around the parameter.

A standard example is the MLE for $\theta$ in an i.i.d. sample of $U(0,\theta)$ uniform r.v.'s. The MLE here is the maximum order statistic,

$$\hat\theta_n=u_{(n)}$$

Its finite sample distribution is

$$F_{\hat\theta_n}=\frac{(\hat\theta_n)^n}{\theta^n},\qquad f_{\hat\theta_n}=\frac{n(\hat\theta_n)^{n-1}}{\theta^n}$$

$$E(\hat\theta_n)=\frac{n}{n+1}\theta \quad\implies\quad B(\hat\theta_n)=-\frac{1}{n+1}\theta$$

So $B(\hat\theta_n)=O(1/n)$. But the same increased rate will also hold for the variance: $\operatorname{Var}(\hat\theta_n)=\frac{n}{(n+1)^2(n+2)}\theta^2=O(1/n^2)$, so its square root is $O(1/n)$ as well.
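As a quick numerical sanity check, here is a minimal Monte Carlo sketch (assuming NumPy; the sample sizes and replication count are arbitrary choices of mine) that bias and standard deviation of this MLE both shrink like $1/n$:

```python
import numpy as np

# Monte Carlo check: for the U(0, theta) MLE (the sample maximum),
# bias = -theta/(n+1) and sd = theta * sqrt(n / ((n+1)^2 * (n+2))),
# so n*bias and n*sd should both stabilize near -theta and theta.
rng = np.random.default_rng(0)
theta, reps = 1.0, 10_000

for n in [10, 100, 1000]:
    mle = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)  # u_(n) per replication
    bias = mle.mean() - theta
    sd = mle.std()
    print(f"n={n:5d}  n*bias={n * bias:+.3f}  n*sd={n * sd:.3f}")
```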

One can also verify that to obtain a limiting distribution we need to look at the variable $n(\theta-\hat\theta_n)$ (i.e. we need to scale by $n$), since

$$P\big[n(\theta-\hat\theta_n)\le z\big]=1-P\big[\hat\theta_n\le\theta-(z/n)\big]$$

$$=1-\frac{1}{\theta^n}\Big(\theta-\frac{z}{n}\Big)^n=1-\frac{\theta^n}{\theta^n}\Big(1-\frac{z/\theta}{n}\Big)^n$$

$$\to 1-e^{-z/\theta}$$

which is the CDF of the Exponential distribution.
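To see this limit numerically, here is a small sketch (again assuming NumPy; it samples the maximum directly by inverse-transform sampling, since $P[u_{(n)}\le x]=(x/\theta)^n$):

```python
import numpy as np

# Compare the empirical CDF of n*(theta - mle) with the exponential limit
# 1 - exp(-z/theta). Since P[u_(n) <= x] = (x/theta)^n, inverse-transform
# sampling gives u_(n) ~ theta * V**(1/n) for V ~ U(0,1).
rng = np.random.default_rng(1)
theta, n, reps = 2.0, 10_000, 200_000

mle = theta * rng.uniform(size=reps) ** (1.0 / n)
scaled = n * (theta - mle)

for z in [0.5, 1.0, 2.0, 4.0]:
    print(f"z={z:3.1f}  empirical={np.mean(scaled <= z):.4f}  "
          f"limit={1.0 - np.exp(-z / theta):.4f}")
```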

I hope this provides some direction.


This is getting close, but I'm specifically interested in situations where the bias shrinks faster than the variance.
Mike Izbicki

@MikeIzbicki Hmm... the bias convergence depends on the first moment of the distribution, and the (square root of the) variance is also a "first-order" magnitude. I am not sure that this is possible, because it appears it would imply that the moments of the limiting distribution "arise" at convergence rates that are not compatible with each other... I'll think about it though.
Alecos Papadopoulos


Following the comments on my other answer (and looking again at the title of the OP's question!), here is a not-very-rigorous theoretical exploration of the issue.

We want to determine whether the bias $B(\hat\theta_n)=E(\hat\theta_n)-\theta$ may have a different convergence rate than the square root of the variance,

$$B(\hat\theta_n)=O(1/n^{\delta}),\qquad \sqrt{\operatorname{Var}(\hat\theta_n)}=O(1/n^{\gamma}),\qquad \gamma\neq\delta\;\;?$$

We have

$$B(\hat\theta_n)=O(1/n^{\delta}) \implies \lim n^{\delta}\big[E(\hat\theta_n)-\theta\big]<K \implies \lim n^{2\delta}\big[E(\hat\theta_n)-\theta\big]^2<K'$$

$$\implies \big[E(\hat\theta_n)-\theta\big]^2=O(1/n^{2\delta}) \tag{1}$$

while

$$\sqrt{\operatorname{Var}(\hat\theta_n)}=O(1/n^{\gamma}) \implies \lim n^{\gamma}\sqrt{E\big[(\hat\theta_n-\theta)^2\big]-\big[E(\hat\theta_n)-\theta\big]^2}<M$$

$$\implies \lim n^{2\gamma}\Big(E\big[(\hat\theta_n-\theta)^2\big]-\big[E(\hat\theta_n)-\theta\big]^2\Big)<M'$$

$$\implies \lim n^{2\gamma}E\big[(\hat\theta_n-\theta)^2\big]-\lim n^{2\gamma}\big[E(\hat\theta_n)-\theta\big]^2<M' \tag{2}$$

We see that $(2)$ may hold if

A) Both components are $O(1/n^{2\gamma})$, in which case we can only have $\gamma=\delta$.

B) But it may also hold if

$$\lim n^{2\gamma}\big[E(\hat\theta_n)-\theta\big]^2=0 \implies \big[E(\hat\theta_n)-\theta\big]^2=o(1/n^{2\gamma}) \tag{3}$$

For (3) to be compatible with (1), we must have

$$n^{2\gamma}<n^{2\delta} \implies \delta>\gamma$$

So it appears that in principle it is possible to have the Bias converging at a faster rate than the square root of the variance. But we cannot have the square root of the variance converging at a faster rate than the Bias.
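For a concrete case where $\delta>\gamma$ (a standard textbook example, not taken from this answer): the Gaussian variance MLE $\hat\sigma^2_n=\frac{1}{n}\sum_i(x_i-\bar x)^2$ has bias $-\sigma^2/n=O(1/n)$, while $\sqrt{\operatorname{Var}(\hat\sigma^2_n)}\approx\sigma^2\sqrt{2/n}=O(1/\sqrt n)$, i.e. $\delta=1>\gamma=1/2$. A minimal simulation sketch (assuming NumPy):

```python
import numpy as np

# Gaussian variance MLE: bias = -sigma^2/n = O(1/n), while
# sd(s2) ~ sigma^2 * sqrt(2/n) = O(1/sqrt(n)), i.e. delta = 1 > gamma = 1/2.
rng = np.random.default_rng(2)
sigma2, reps = 1.0, 50_000

for n in [25, 100, 400]:
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2 = x.var(axis=1)  # ddof=0 (the default) gives the biased MLE
    print(f"n={n:4d}  bias={s2.mean() - sigma2:+.5f} (exact {-sigma2 / n:+.5f})  "
          f"sd={s2.std():.5f} (approx {sigma2 * np.sqrt(2.0 / n):.5f})")
```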


How would you reconcile this with the existence of unbiased estimators like ordinary least squares? In that case, $B(\hat\theta)=0$, but $\operatorname{Var}(\hat\theta)=O(1/n)$.
Mike Izbicki

@MikeIzbicki Is the concept of convergence/big-O applicable in this case? Because here $B(\hat\theta)$ is not "$O(\cdot)$-anything" to begin with.
Alecos Papadopoulos

In this case, $E\hat\theta=\theta$, so $B(\hat\theta)=E\hat\theta-\theta=0=O(1)=O(1/n^{0})$.
Mike Izbicki

@MikeIzbicki But also $B(\hat\theta)=O(n)$ or $B(\hat\theta)=O(1/n)$ or any other rate you care to write down. So which one is the rate of convergence here?
Alecos Papadopoulos

@MikeIzbicki I have corrected my answer to show that it is possible in principle to have the Bias converging faster, although I still think the "zero-bias" example is problematic.
Alecos Papadopoulos