I think this is actually an interesting question. Having slept on it, I think I have an answer. The key issue is the following:
- You have treated the likelihood as a Gaussian PDF. But it is not a probability distribution - it's a likelihood! Moreover, you haven't labelled your axes clearly. These things combined have confused everything that follows.
To fix notation: the parameter of interest is μ, and σ is known. The prior over μ is P(μ|μ′,σ′), with hyperparameters μ′ and σ′; the likelihood of the data X is P(X|μ,σ); and the posterior is P(μ|X,σ,μ′,σ′). All three are being plotted against μ.

Viewed as a function of μ, P(X|μ) is a likelihood, not a probability distribution over μ.
Explicitly, the prior is
$$P(\mu \mid \mu', \sigma') = \exp\left(-\frac{(\mu-\mu')^2}{2\sigma'^2}\right)\frac{1}{\sqrt{2\pi\sigma'^2}}$$
and the likelihood is
$$P(X \mid \mu, \sigma) = \prod_{i=1}^{N}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\frac{1}{\sqrt{2\pi\sigma^2}}$$
Note that your prior has σ′² = σ²/N. In other words, your prior is very informative, as its variance is going to be much lower than σ² for any reasonable value of N. It is literally as informative as the entire observed dataset X!
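For reference, this is just the standard conjugate-normal update, nothing specific to your setup: multiplying the prior and likelihood above and completing the square in μ gives a normal posterior whose precision is the sum of the prior and likelihood precisions. Writing x̄ for the sample mean,

$$\frac{1}{\sigma_{\mathrm{post}}^2} = \frac{1}{\sigma'^2} + \frac{N}{\sigma^2}, \qquad \mu_{\mathrm{post}} = \sigma_{\mathrm{post}}^2\left(\frac{\mu'}{\sigma'^2} + \frac{N\bar{x}}{\sigma^2}\right).$$

With σ′² = σ²/N this gives a posterior mean of (μ′ + x̄)/2 and a posterior variance of σ²/(2N): the posterior sits halfway between the prior mean and the sample mean, and is narrower than both the prior and the likelihood.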
So, the prior and the likelihood are equally informative. Why isn't the posterior bimodal? This is because of your modelling assumptions. You've implicitly assumed normality in the way this is set up (normal prior, normal likelihood), and that constrains the posterior to give a unimodal answer: the normal prior is conjugate to the normal likelihood, so the posterior is itself normal, and a normal distribution is unimodal. That's just a property you have baked into the problem by using these distributions. A different model would not necessarily have done this. I have a feeling (though I lack a proof right now) that a Cauchy distribution can have a multimodal likelihood, and hence a multimodal posterior.
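As a quick, non-rigorous check of that hunch (not a proof), here is a small sketch that evaluates a Cauchy log-likelihood for the location parameter over a grid and counts its local maxima. The two-point dataset and the unit scale are made-up assumptions:

```python
import numpy as np

# Hypothetical two-point dataset, well separated relative to the scale.
x = np.array([-5.0, 5.0])
gamma = 1.0  # assumed known Cauchy scale

# Log-likelihood of the location parameter mu for i.i.d. Cauchy data,
# dropping constants: sum_i -log(1 + ((x_i - mu)/gamma)^2)
mu_grid = np.linspace(-10, 10, 2001)
loglik = np.array([-np.sum(np.log1p(((x - mu) / gamma) ** 2)) for mu in mu_grid])

# Count interior local maxima of the log-likelihood on the grid.
is_peak = (loglik[1:-1] > loglik[:-2]) & (loglik[1:-1] > loglik[2:])
print("local maxima near:", mu_grid[1:-1][is_peak])  # two peaks, near -5 and +5
```

With the points this far apart, the likelihood in μ has two separate peaks, one near each observation, rather than a single peak at their midpoint.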
So the posterior has to be unimodal, and the prior is as informative as the likelihood. Under these constraints, the most sensible estimate starts to sound like a point directly between the peaks of the prior and the likelihood, as we have no reasonable way to tell which to believe. But why does the posterior get tighter?
I think the confusion here comes from the fact that in this model, σ is assumed to be known. Were it unknown, and we had a two-dimensional distribution over μ and σ, observing data far from the prior might make a high value of σ more probable, and so increase the variance of the posterior distribution of the mean too (as the two are linked). But we're not in that situation: σ is treated as known here. As such, adding more data can only make us more confident in our prediction of the position of μ, and hence the posterior becomes narrower.
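Here is a minimal numeric sketch of that, using the conjugate-normal update written out above. The values of μ′, σ, N and the simulated data are made-up assumptions, chosen so the data land far from the prior mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: sigma is known; prior variance is sigma^2 / N as in the question.
sigma, N = 1.0, 20
mu_prior = 0.0
var_prior = sigma**2 / N

# Made-up data drawn far from the prior mean.
x = rng.normal(loc=5.0, scale=sigma, size=N)
xbar = x.mean()

# Conjugate-normal update: precisions add, means combine precision-weighted.
prec_post = 1.0 / var_prior + N / sigma**2
var_post = 1.0 / prec_post
mu_post = var_post * (mu_prior / var_prior + N * xbar / sigma**2)

print(f"prior:      mean={mu_prior:.2f}  sd={np.sqrt(var_prior):.3f}")
print(f"likelihood: peak={xbar:.2f}   sd of x-bar={sigma / np.sqrt(N):.3f}")
print(f"posterior:  mean={mu_post:.2f}  sd={np.sqrt(var_post):.3f}")  # halfway, and narrower
```

The posterior mean lands roughly midway between the prior mean and the sample mean, and the posterior standard deviation is smaller than that of either the prior or the likelihood, however far apart the prior and the data are.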
(A way to visualise it might be to imagine estimating the mean of a Gaussian, with known variance, using just two sample points. If the two sample points are separated by much more than the width of the Gaussian (i.e. each lies out in the other's tail), then that's strong evidence that the mean actually lies between them. Shifting the mean even slightly from this position causes an exponential drop-off in the probability of one sample or the other.)
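A small sketch of that picture, with two assumed sample points several standard deviations apart: the joint likelihood, as a function of μ, peaks at the midpoint and falls off rapidly as μ moves toward either point.

```python
import numpy as np

# Two assumed sample points, far apart relative to the known sd.
x1, x2, sigma = -4.0, 4.0, 1.0

def log_joint_likelihood(mu):
    # log of N(x1; mu, sigma^2) * N(x2; mu, sigma^2), dropping constants
    return -((x1 - mu) ** 2 + (x2 - mu) ** 2) / (2 * sigma**2)

for mu in [0.0, 1.0, 2.0, 4.0]:
    print(f"mu={mu:+.1f}  log-likelihood={log_joint_likelihood(mu):8.1f}")
# The midpoint (mu=0) maximises the product; shifting mu toward either point
# costs exponentially in the probability of the other point.
```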
In summary, the situation you have described is a bit odd, and by using the model you have, you've built some assumptions (e.g. unimodality) into the problem that you didn't realise you had. But otherwise, the conclusion is correct.