Is a p-value a point estimate?

Since one can calculate confidence intervals for p-values, and since the opposite of interval estimation is point estimation: is a p-value a point estimate?

— 00schneider

I don't believe you can calculate confidence intervals for a p-value; it is a statistic calculated from the data, not a parameter describing the data-generating process. Of course, you can still ask what the statistic estimates.

— Scortchi

@Scortchi: but if I used, for example, the bootstrap to compute a distribution of p-values and then constructed a 95% percentile interval from that bootstrapped distribution, then if this is not a confidence interval for the p-value, what is it?

— amoeba says Reinstate Monica

@amoeba: a confidence interval is about an unknown parameter, whereas your bootstrap interval is an approximation of a 95% region for the statistic.

— Xi'an
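The procedure discussed in the two comments above can be sketched as follows. This is only an illustration under assumptions that are not in the thread: the data are simulated, the test is a two-sided z-test with known standard deviation 1, and the helper names `z_test_p_value` and `bootstrap_p_values` are hypothetical. The resulting interval describes how the statistic varies across resamples; it is not an interval for a parameter.

```python
import math
import random
import statistics

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: mean == mu0, assuming a known sigma."""
    z = (statistics.fmean(sample) - mu0) * math.sqrt(len(sample)) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

def bootstrap_p_values(sample, n_boot, rng):
    """Recompute the p-value on n_boot resamples drawn with replacement."""
    n = len(sample)
    return [z_test_p_value(rng.choices(sample, k=n)) for _ in range(n_boot)]

rng = random.Random(42)
sample = [rng.gauss(0.3, 1.0) for _ in range(50)]  # hypothetical data

p_observed = z_test_p_value(sample)
boot = bootstrap_p_values(sample, 2000, rng)

# With n=40, statistics.quantiles returns cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(boot, n=40)
lower, upper = cuts[0], cuts[-1]

print(f"observed p-value: {p_observed:.4f}")
print(f"95% percentile range of bootstrapped p-values: [{lower:.4f}, {upper:.4f}]")
```

The range is typically very wide, which reflects the sampling variability of the p-value itself.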

@Scortchi: I have seen software that prints CIs for p-values. In that case, approximate p-values were computed by permutation tests, so if the CI was too wide (i.e., it included both p-values $\in [0, 0.05]$ and p-values $\in [0.05, 1]$), you would use more permutations before drawing a conclusion.

— Cliff AB

@Cliff That is not a confidence interval for the p-value qua property of a distribution: it is a confidence interval for a stochastic estimator of the p-value of a test for a particular sample. Although they sound similar, and both are intervals, they are completely different things.

— whuber
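The situation in the last two comments can be illustrated with a small sketch. It assumes simulated two-group data, a difference-in-means statistic, and a Wilson interval for the Monte Carlo proportion; none of these choices come from the thread, and the function names are hypothetical. The interval quantifies only the Monte Carlo error in approximating the exact permutation p-value for this one sample.

```python
import math
import random
import statistics

def permutation_p_value(x, y, n_perm, rng):
    """Monte Carlo estimate of the two-sided permutation p-value (difference in means)."""
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled observations
        diff = abs(statistics.fmean(pooled[:len(x)]) - statistics.fmean(pooled[len(x):]))
        hits += diff >= observed
    return hits / n_perm

def wilson_interval(p_hat, n, z=1.96):
    """Approximate 95% Wilson interval for a binomial proportion."""
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(20)]  # hypothetical group 1
y = [rng.gauss(0.7, 1.0) for _ in range(20)]  # hypothetical group 2

results = {}
for n_perm in (200, 5000):
    p_hat = permutation_p_value(x, y, n_perm, rng)
    lo, hi = wilson_interval(p_hat, n_perm)
    results[n_perm] = (p_hat, lo, hi)
    verdict = "straddles 0.05, use more permutations" if lo < 0.05 < hi else "clear of 0.05"
    print(f"{n_perm:5d} permutations: p ~ {p_hat:.4f}, CI [{lo:.4f}, {hi:.4f}] ({verdict})")
```

More permutations shrink this interval toward the exact permutation p-value of the sample at hand; they do nothing to reduce the sample-to-sample variability of that p-value.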

Answers:

Point estimates and confidence intervals are for parameters that describe a distribution, such as the mean or the standard deviation.

But unlike other sample statistics, such as the sample mean and the sample standard deviation, the p-value is not a useful estimator of an interesting distribution parameter. See the answer by @whuber for the technical details.

The p-value for a test statistic gives the probability of observing a deviation from the expected value of the test statistic at least as large as the one observed in the sample, calculated under the assumption that the null hypothesis is true. If you had the entire distribution, it would either agree with the null hypothesis or it would not. This can be described by an indicator variable (again, see the answer by @whuber).

But the p-value cannot be used as a useful estimator of the indicator variable because it is not consistent: if the null hypothesis is true, the p-value does not converge as the sample size increases. This is a pretty complicated alternate way of stating that a statistical test can either reject or fail to reject the null, but never confirm it.
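The lack of convergence under the null can be checked with a short simulation. This is a sketch under assumptions that are not part of the answer: normal data with known standard deviation 1, a two-sided z-test of a zero mean, and an arbitrary alternative mean of 0.3.

```python
import math
import random
import statistics

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value of a z-test of H0: mean == mu0, with known sigma."""
    z = (statistics.fmean(sample) - mu0) * math.sqrt(len(sample)) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

def simulate_p_values(true_mean, n, reps, rng):
    """Draw `reps` samples of size n from N(true_mean, 1) and test H0: mean == 0."""
    return [
        z_test_p_value([rng.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(reps)
    ]

rng = random.Random(0)
summary = {}
for n in (10, 100, 1000):
    null_p = simulate_p_values(0.0, n, 1000, rng)  # null hypothesis true
    alt_p = simulate_p_values(0.3, n, 1000, rng)   # null hypothesis false
    summary[n] = (
        statistics.fmean(null_p),
        statistics.pstdev(null_p),
        statistics.fmean(alt_p),
    )
    print(
        f"n={n:5d}  null: mean p={summary[n][0]:.3f}, sd={summary[n][1]:.3f}"
        f"   alternative: mean p={summary[n][2]:.4f}"
    )
```

Under the null, the mean stays near 1/2 and the spread near $\sqrt{1/12} \approx 0.29$ at every sample size, whereas under the alternative the p-values concentrate near zero as $n$ grows.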

— Erik

Most of the better accounts of statistical tests (Lehmann, Kiefer, etc.) do not refer to "populations" at all, but instead frame the situation in terms of estimating parameters of distributions. This does not require the randomness to be due solely to sampling, and thereby allows the theory to apply more broadly to situations where the randomness is part of a model.

— whuber

But you explicitly contradicted that with the statement "there are no probabilities associated with the population at all". Consider also that all estimates are "clearly defined at the sample level". It is therefore hard to determine what distinction you are trying to make in this post.

— whuber

Of course! But a distribution is not a population.

— whuber

(-1) I agree with @Tim's common-sense answer and whuber's categorical one, but I am struggling to make any sense of this one. (1) "But the p-value is not a population parameter, as it is clearly defined at the sample level": this is certainly worth pointing out, but the "but" makes it seem as if you are saying that the p-value cannot be an estimate of anything because it is a sample statistic, as if a sample mean could not be an estimate of anything because it is a sample statistic. ...

— Scortchi

(2) "This is because there are no probabilities associated with the population at all, it is regarded as fixed but unknown": (a) The p-value isn't calculated from the sample because "there are no probabilities [...]"; (b) as @whuber's pointed out, sampling from a finite population is a special case; (c) in any case it just doesn't follow from what you've said that the p-value doesn't estimate anything about the population.

— Scortchi - Reinstate Monica

Yes, it could be (and has been) argued that a p-value is a point estimate.

In order to identify whatever property of a distribution a p-value might estimate, we would have to assume it is asymptotically unbiased. But, asymptotically, the mean p-value for the null hypothesis is $1/2$ (ideally; for some tests it might be some other nonzero number) and for any other hypothesis it is $0$. Thus, the p-value could be considered an estimator of one-half the indicator function for the null hypothesis.
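The two limits quoted above can be traced in a few lines. This is a sketch under assumptions that the answer does not spell out: a simple null hypothesis, a continuous test statistic $T$ with null distribution function $F_0$, a right-tailed p-value $p = 1 - F_0(T)$, and a consistent test. Under the null, $F_0(T)$ is uniform on $[0,1]$, so

$$
\Pr_{H_0}(p \le u) = \Pr_{H_0}\bigl(F_0(T) \ge 1 - u\bigr) = u, \qquad 0 \le u \le 1,
$$

and therefore

$$
\mathbb{E}_{H_0}[p] = \int_0^1 u \, du = \frac{1}{2}, \qquad \operatorname{Var}_{H_0}(p) = \int_0^1 \Bigl(u - \frac{1}{2}\Bigr)^2 du = \frac{1}{12},
$$

neither of which depends on the sample size. Under a fixed alternative $\theta \notin \Theta_0$, consistency gives $\Pr_\theta(p > \alpha) \to 0$ for every $\alpha > 0$, and because $0 \le p \le 1$,

$$
\mathbb{E}_\theta[p] \le \alpha + \Pr_\theta(p > \alpha) \longrightarrow \alpha .
$$

Since $\alpha$ is arbitrary, $\mathbb{E}_\theta[p] \to 0$, and the two cases combine into $\lim \mathbb{E}_\theta[p] = \tfrac{1}{2}\,\mathbb{I}_{\Theta_0}(\theta)$.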

Admittedly it takes some creativity to view a p-value in this way. We could do a little better by viewing the estimator in question as the decision we make by means of the p-value: is the underlying distribution a member of the null hypothesis or of the alternate hypothesis? Let's call this set of possible decisions $D$. Jack Kiefer writes

We suppose that there is an experiment whose outcome the statistician can observe. This outcome is described by a random variable or random vector $X$ ... . The probability law of $X$ is unknown to the statistician, but it is known that the distribution function $F$ of $X$ is a member of a specified class $\Omega$ of distribution functions. ...

A statistical problem is said to be a problem of point estimation if $D$ is the collection of possible values of some real or vector-valued property of $F$ which depends on $F$ in a reasonably smooth way.

In this case, because $D$ is discrete, "reasonably smooth" is not a restriction at all. Kiefer's terminology reflects this by referring to statistical procedures with discrete decision spaces as "tests" instead of "point estimators."

Although it is interesting to explore the limits (and limitations) of such definitions, as this question invites us to do, perhaps we should not insist too strongly that a p-value is a point estimator, because this distinction between estimators and tests is both useful and conventional.

In a comment to this question, Christian Robert brought attention to a 1992 paper where he and co-authors took exactly this point of view and analyzed the admissibility of the p-value as an estimator of the indicator function. See the link in the references below. The paper begins,

Approaches to hypothesis testing have usually treated the problem of testing as one of decision-making rather than estimation. More precisely, a formal hypothesis test will result in a conclusion as to whether a hypothesis is true, and not provide a measure of evidence to associate with that conclusion. In this paper we consider hypothesis testing as an estimation problem within a decision-theoretic framework ... .

[Emphasis added.]

References

Jiunn Tzon Hwang, George Casella, Christian Robert, Martin T. Wells, and Roger H. Farrell, Estimation of Accuracy in Testing. Ann. Statist. Volume 20, Number 1 (1992), 490-509. Open access.

Jack Carl Kiefer, Introduction to Statistical Inference. Springer-Verlag, 1987.

— whuber

Hmm. I am not sure this view is helpful. For one, in this sense the p-value is not a good estimator, since it is not consistent if the null hypothesis is true. And in some cases (as you mention) it has a sample-size-dependent bias as well. It might be technically true, but any random number could be a (terrible) estimator for any parameter as well.

— Erik

The question does not ask whether the p-value is a good estimator, @Erik. As an estimator, it has obvious deficiencies. For instance, its asymptotic variance for the null hypothesis is nonzero. Please note that the bias of almost every biased estimator depends on sample size. Although you are correct that an independent random number could be viewed as an estimator, it would be an estimator of something different: it would estimate its own mean (by definition). Thus your objections appear not to have any relevance to the question at hand.

— whuber

I don't think we differ on any of those points, @Erik, except perhaps the "unhelpful" part. As Nick Cox points out in a comment elsewhere in this thread, it is nevertheless interesting to contemplate the sense in which a p-value could be considered an estimator and what, exactly, it could possibly be estimating. That can help us understand a little better just what a p-value is (and is not). Many would view that as a helpful exercise.

— whuber

In a 1992 paper, we study the $p$-value as an estimator of the indicator function $\mathbb{I}_{\Theta_0}(\theta)$ and demonstrate that it can be an admissible estimator for one-sided hypotheses and cannot be admissible for two-sided hypotheses.

— Xi'an

@Xi'an I see we're only 23 years behind you... . Thank you for the reference!

— whuber

$p$-values are not used for estimating any parameter of interest, but for hypothesis testing. For example, you could be interested in estimating the population mean $\mu$ from the sample you have, or in an interval estimate of it, but in a hypothesis-testing scenario you would rather compare the sample mean $\overline x$ with the population mean $\mu$ to see if they differ. In fact, in a hypothesis-testing scenario you are not interested in the particular values of $p$, but rather in whether they fall below some threshold (e.g. $p < 0.05$). With $p$-values you are not that interested in their point values; rather, you want to know whether your data provide enough evidence against the null hypothesis. In a hypothesis-testing scenario you would not compare different $p$-values to each other, but rather use each of them to make a separate decision about your hypotheses. You don't really want to know anything about the null hypothesis, as long as you know whether you can reject it or not. This makes their values inseparable from the decision context, and so they differ from point estimates, because with point estimates we are interested in the values per se.

— Tim

Your initial statement correctly echoes how things are often explained, but nevertheless it does not go deep enough. A basic fact here is sampling variation, the variability from sample to sample. Take a different sample, and your P-value will be different. It takes a little ingenuity to see precisely what it is estimating, and it is not (as far as I know) conventional to explain it as estimating a parameter, but that point of view makes perfect sense. See @whuber's interesting answer. (The entire territory is littered with muddy paraphrases based on the need to simplify for teaching.)

— Nick Cox

How terms are used is interesting and important (and a personal preoccupation, by the way). The question remains what a P-value is. This too is pointed out [inevitable pun here] elsewhere in this thread. It's a helpful convention to regard parameters as those unknowns which appear in a model specification, but there are other unknowns too.

— Nick Cox

@Tim, I think this claim (from your last comment) is almost always not true, at least in biology. People are very much interested in the value of p-values, marking $p<0.05$, $p<0.01$, $p<0.001$ with one, two, or three stars on the figures, writing about something being "highly significant", etc. The usual recommendation is also to report exact p-values, e.g. $p=0.003$, and not $p<0.05$. Only very rarely do people adhere to the strict Neyman-Pearson framework, choose $\alpha$ in advance and report all p-values as $p<\alpha$.

— amoeba says Reinstate Monica

This question intersects with many others, most of which are highly controversial. One is the idealisation that the purpose of a test is to make a yes-or-no decision, which does not match all problems at all. Another key fact is that for decades the use of threshold levels was a practical matter: people relied on published printed tables, and exact P-values were out of reach as long as people did not use computers.

— Nick Cox

@00schneider: If you do ever see an interval given for p-values, it's very unlikely to be a confidence interval for the population parameter defined by whuber. Tim's point is that there's no need to consider them as estimating anything at all, interesting though it may be to do so.

— Scortchi - Reinstate Monica