tTPr(T≥t)H0 P r ( | Z| ≥ | z| ) but it's convenient to use
2min[Pr(Z≥z),Pr(Z≤z)]
because we have the appropriate tables. (Note the doubling.)
There's no requirement for the test statistic to put the samples in order of their probability under the null hypothesis. There are situations (like Zag's example) where any other way would seem perverse (without more information about what r measures, what kinds of discrepancies with H0 are of most interest, &c.), but often other criteria are used. So you could have a bimodal PDF for the test statistic & still test H0 using the formula above.
(2) Yes, they mean under H0.
(3) A null hypothesis like "The frequency of heads is not 0.5" is no use because you would never be able to reject it. It's a composite null including "the frequency of heads is 0.49999999", or as close as you like. Whether you think beforehand the coin's fair or not, you pick a useful null hypothesis that bears on the problem. Perhaps more useful after the experiment is to calculate a confidence interval for the frequency of heads that shows you either it's clearly not a fair coin, or it's close enough to fair, or you need to do more trials to find out.
An illustration for (1):
Suppose you're testing the fairness of a coin with 10 tosses. There are 210 possible results. Here are three of them:
HHHHHHHHHHHTHTHTHTHTHHTHHHTTTH
You'll probably agree with me that the first two look a bit suspicious. Yet the probabilities under the null are equal:
Pr(HHHHHHHHHH)=11024Pr(HTHTHTHTHT)=11024Pr(HHTHHHTTTH)=11024
To get anywhere you need to consider what types of alternative to the null you want to test. If you're prepared to assume independence of each toss under both null & alternative (& in real situations this often means working very hard to ensure experimental trials are independent), you can use the total count of heads as a test statistic without losing information. (Partitioning the sample space in this way is another important job that statistics do.)
So you have a count between 0 and 10
t<-c(0:10)
Its distribution under the null is
p.null<-dbinom(t,10,0.5)
Under the version of the alternative that best fits the data, if you see (say) 3 out of 10 heads the probability of heads is 310, so
p.alt<-dbinom(t,10,t/10)
Take the ratio of the probability under the null to the probability under the alternative (called the likelihood ratio):
lr<-p.alt/p.null
Compare with
plot(log(lr),p.null)
So for this null, the two statistics order samples the same way. If you repeat with a null of 0.85 (i.e. testing that the long-run frequency of heads is 85%), they don't.
p.null<-dbinom(t,10,0.85)
plot(log(lr),p.null)
To see why
plot(t,p.alt)
Some values of t are less probable under the alternative, & the likelihood ratio test statistic takes this into account. NB this test statistic will not be extreme for
HTHTHTHTHT
And that's fine - every sample can be considered extreme from some point of view. You choose the test statistic according to what kind of discrepancy to the null you want to be able to detect.
... Continuing this train of thought, you can define a statistic that partitions the sample space differently to test the same null against the alternative that one coin toss influences the next one. Call the number of runs r, so that
HHTHHHTTTH
has r=6:
HH T HHH TTT H
The suspicious sequence
HTHTHTHTHT
has r=10. So does
THTHTHTHTH
while at the other extreme
HHHHHHHHHHTTTTTTTTTT
have r=1. Using probability under the null as the test statistic (the way you like) you can say that the p-value of the sample
HTHTHTHTHT
is therefore 41024=1256. What's worthy of note, comparing this test to the previous, is that even if you stick strictly to the ordering given by probability under the null, the way in which you define your test statistic to partition the sample space is dependent on consideration of alternatives.