Extending the birthday paradox to more than 2 people

In the traditional birthday paradox the question is "what are the chances that two or more people in a group of $n$ people share a birthday". I'm stuck on a problem which is an extension of this.

Rather than knowing the probability that two people share a birthday, I need to extend the question to know the probability that $x$ or more people share a birthday. With $x=2$ you can do this by calculating the probability that nobody shares a birthday and subtracting it from $1$, but I don't think I can extend this logic to larger values of $x$.

To complicate this further, I also need a solution which will work for very large values of $n$ (millions) and $x$ (thousands).

probability combinatorics birthday-paradox

— Simon Andrews

I presume this is a bioinformatics problem?

— csgillespie

It is actually a bioinformatics problem, but since it boils down to the same concept as the birthday paradox I thought I'd spare you the irrelevant specifics!

— Simon Andrews

Usually I would agree with you, but in this case the specifics could matter, since there may already be a Bioconductor package that does what you are asking.

— csgillespie

If you really want to know, it's a pattern-finding problem where I'm trying to accurately estimate the probability of a given level of enrichment of subsequences within a set of larger sequences. I therefore have a set of subsequences with associated counts, and I know how many sequences I observed and how many theoretically observable sequences there are. If I saw a particular sequence 10 times out of 10,000 observations, I need to know how likely that was to have occurred by chance.

— Simon Andrews

Almost eight years later, I posted an answer to this problem at stats.stackexchange.com/questions/333471. The code there does not work for large $n$, though, because it takes quadratic time in $n$.

— whuber

Answers:

This is a counting problem: there are $b^n$ possible assignments of $b$ birthdays to $n$ people. Of those, let $q(k; n, b)$ be the number of assignments for which no birthday is shared by more than $k$ people but at least one birthday actually is shared by $k$ people. The probability we seek can be found by summing the $q(k;n,b)$ for the appropriate values of $k$ and multiplying the result by $b^{-n}$.

These counts can be found exactly for values of $n$ less than several hundred. However, they will not follow any straightforward formula: we have to consider the patterns of ways in which birthdays can be assigned. I will illustrate this in lieu of providing a general demonstration. Let $n = 4$ (this is the smallest interesting situation). The possibilities are:

Each person has a unique birthday; the code is {4}.
Exactly two people share a birthday; the code is {2,1}.
Two people share one birthday and the other two share another; the code is {0,2}.
Three people share a birthday; the code is {1,0,1}.
Four people share a birthday; the code is {0,0,0,1}.

Generally, the code is a tuple of counts $\{a[1], a[2], \ldots\}$ whose $k^\text{th}$ element stipulates how many distinct birthdates are shared by exactly $k$ people. Thus, in particular,

$1 a[1] + 2a[2] + \cdots + k a[k] + \cdots = n.$

Note, even in this simple case, that there are two ways in which a maximum of two people per birthday is attained: one with the code $\{0,2\}$ and another with the code $\{2,1\}$.

We can directly count the number of possible birthday assignments corresponding to any given code. This number is the product of three terms. One is a multinomial coefficient; it counts the number of ways of partitioning $n$ people into $a[1]$ groups of $1$, $a[2]$ groups of $2$, and so on. Because the sequence of groups does not matter, we have to divide this multinomial coefficient by $a[1]!a[2]!\cdots$; its reciprocal is the second term. Finally, line up the groups and assign each of them a birthday: there are $b$ candidates for the first group, $b-1$ for the second, and so on. These values have to be multiplied together, forming the third term. It is equal to the "factorial product" $b^{(a[1]+a[2]+\cdots)}$, where $b^{(m)}$ means $b(b-1)\cdots(b-m+1)$.
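
The three-term product described above can be sketched in Python as follows. This is only an illustration: the function name `count_for_code` and the list representation of a code (index 0 holds $a[1]$) are assumptions made for the example, not part of the original answer.

```python
from math import factorial, perm

def count_for_code(a, b):
    """Count birthday assignments matching a code.

    a[k-1] is the number of distinct birthdays shared by exactly k people
    (hypothetical representation); b is the number of possible birthdays.
    """
    n = sum(k * ak for k, ak in enumerate(a, start=1))  # people implied by the code
    groups = sum(a)                                     # number of distinct birthdays used

    # Term 1: multinomial coefficient n! / prod(k!^a[k]).
    ways = factorial(n)
    for k, ak in enumerate(a, start=1):
        ways //= factorial(k) ** ak

    # Term 2: groups of equal size are interchangeable, so divide by prod(a[k]!).
    for ak in a:
        ways //= factorial(ak)

    # Term 3: falling factorial b(b-1)...(b-groups+1); perm returns 0 if groups > b.
    return ways * perm(b, groups)

# The five codes for n = 4 people, evaluated with b = 5 possible birthdays.
for code in ([4], [2, 1], [0, 2], [1, 0, 1], [0, 0, 0, 1]):
    print(code, count_for_code(code, 5))
```

Integer arithmetic is used throughout so that the counts stay exact; `math.perm` requires Python 3.8 or later.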

There is an obvious and fairly simple recursion relating the count for a pattern $\{a[1], \ldots, a[k]\}$ to the count for the pattern $\{a[1], \ldots, a[k-1]\}$. This enables rapid calculation of the counts for modest values of $n$. Specifically, $a[k]$ represents $a[k]$ birthdates shared by exactly $k$ people each. After these $a[k]$ groups of $k$ people have been drawn from the $n$ people, which can be done in $x$ distinct ways (say), it remains to count the number of ways of achieving the pattern $\{a[1], \ldots, a[k-1]\}$ among the remaining people. Multiplying this by $x$ gives the recursion.
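
One possible reading of this recursion, written as a Python sketch. The function name `pattern_count` and its argument layout are hypothetical, and the choice to fold the birthday assignment for the $a[k]$ groups into the factor $x$ (so that the remaining people see $b - a[k]$ unused birthdays) is an assumption about how the bookkeeping is done.

```python
from math import comb, factorial, perm

def pattern_count(a, n, b):
    """Count assignments of b birthdays to n people matching pattern a.

    a[k-1] is the number of birthdays shared by exactly k people.
    Peels off the last entry a[k] and recurses on the shorter pattern.
    """
    if not a:
        # Empty pattern: valid only if nobody is left to place.
        return 1 if n == 0 else 0

    k = len(a)
    m = a[-1]            # number of groups of size k
    used = k * m         # people consumed by those groups
    if used > n or m > b:
        return 0

    # x = (ways to choose the people) * (ways to split them into m unordered
    #     groups of size k) * (ways to give the groups distinct birthdays).
    x = comb(n, used) * (factorial(used) // (factorial(k) ** m * factorial(m))) * perm(b, m)

    # The remaining people must realize the shorter pattern on unused birthdays.
    return x * pattern_count(a[:-1], n - used, b - m)

print(pattern_count([2, 1], 4, 5))  # pattern {2,1} with b = 5
```

For the $n = 4$ patterns this agrees with the direct three-term product, which is a useful sanity check on the recursion.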

I doubt there is a closed-form formula for $q(k; n, b)$, which is obtained by summing the counts for all partitions of $n$ whose maximum term equals $k$. Let me offer some examples:

With $b=5$ (five possible birthdays) and $n=4$ (four people), we obtain

$\eqalign{ q(1) &= q(1;4,5) &= 120 \\ q(2) &= 360 + 60 &= 420 \\ q(3) &&= 80 \\ q(4) &&= 5.\\ }$

Whence, for example, the chance that three or more people out of four share the same "birthday" (out of $5$ possible dates) equals $(80 + 5)/625 = 0.136$.
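
The summation over partitions can be sketched in Python as follows. The helper names `partitions` and `q_counts` are made up for this illustration, and the brute-force partition generator is only practical for modest $n$, in line with the limits mentioned in the answer.

```python
from collections import Counter
from fractions import Fraction
from math import factorial, perm

def partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def q_counts(n, b):
    """Return a mapping k -> q(k; n, b), summing over partitions with maximum part k."""
    q = Counter()
    for part in partitions(n):
        ways = factorial(n)
        for size, mult in Counter(part).items():
            # Divide by size!^mult (multinomial) and by mult! (interchangeable groups).
            ways //= factorial(size) ** mult * factorial(mult)
        # Assign distinct birthdays to the len(part) groups.
        q[part[0]] += ways * perm(b, len(part))
    return q

q = q_counts(4, 5)
print(sorted(q.items()))
# Chance that three or more of the four people share a birthday.
print(float(Fraction(q[3] + q[4], 5 ** 4)))  # 0.136
```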

As another example, take $b = 365$ and $n = 23$. Here are the values of $b^{-n}\,q(k;23,365)$, that is, the probabilities that the largest number of people sharing a birthday is exactly $k$, for the smallest $k$ (to six significant figures only):

$\eqalign{ k=1: &0.49270 \\ k=2: &0.494592 \\ k=3: &0.0125308 \\ k=4: &0.000172844 \\ k=5: &1.80449E-6 \\ k=6: &1.48722E-8 \\ k=7: &9.92255E-11 \\ k=8: &5.45195E-13. }$

Using this technique, we can readily compute that there is about a 50% chance of (at least) a three-way birthday collision among 87 people, a 50% chance of a four-way collision among 187, and a 50% chance of a five-way collision among 310 people. That last calculation starts taking a few seconds (in Mathematica, anyway) because the number of partitions to consider starts getting large. For substantially larger $n$ we need an approximation.

One approximation is obtained by means of the Poisson distribution with expectation $n/b$ , because we can view a birthday assignment as arising from $b$ almost (but not quite) independent Poisson variables each with expectation $n/b$ : the variable for any given possible birthday describes how many of the $n$ people have that birthday. The distribution of the maximum is therefore approximately $F(k)^b$ where $F$ is the Poisson CDF. This is not a rigorous argument, so let's do a little testing. The approximation for $n = 23$ , $b = 365$ gives

$\eqalign{ k=1: &0.498783 \\ k=2: &0.486803\\ k=3: &0.014187\\ k=4: &0.000225115. }$

By comparing with the preceding you can see that the relative probabilities can be poor when they are small, but the absolute probabilities are reasonably well approximated to about 0.5%. Testing with a wide range of $n$ and $b$ suggests the approximation is usually about this good.
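
The Poisson approximation described above can be sketched in a few lines of Python. The function names are hypothetical, and the code simply takes the distribution of the maximum to be $F(k)^b$, so it inherits the non-rigorous independence assumption mentioned in the answer.

```python
from math import exp

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    if k < 0:
        return 0.0
    term = exp(-lam)      # P(X = 0)
    total = term
    for j in range(1, k + 1):
        term *= lam / j   # P(X = j) from P(X = j - 1)
        total += term
    return min(total, 1.0)

def approx_max_share_pmf(k, n, b):
    """Approximate P(largest group sharing a birthday has exactly k people)."""
    lam = n / b
    return poisson_cdf(k, lam) ** b - poisson_cdf(k - 1, lam) ** b

# Compare with the exact values for n = 23, b = 365.
for k in range(1, 5):
    print(k, approx_max_share_pmf(k, 23, 365))
```

Because each evaluation costs only $O(k)$ operations, the same function can be applied directly to very large $n$ and $b$.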

For the situation described in the question, take $n = 10,000$ observations and $b = 1\,000\,000$ possible "birthdays". The Poisson approximation to the distribution of the maximum number sharing a "birthday" gives

$\eqalign{ k=1: &0 \\ k=2: &0.8475+\\ k=3: &0.1520+\\ k=4: &0.0004+\\ k\gt 4: &\lt 1E-6. }$

(This is a fast calculation.) Clearly, observing one structure 10 times out of 10,000 would be highly significant. Because $n$ and $b$ are both large, I expect the approximation to work quite well here.

Incidentally, as Shane intimated, simulations can provide useful checks. A Mathematica simulation is created with a function like

simulate[n_, b_] := Max[Last[Transpose[Tally[RandomInteger[{0, b - 1}, n]]]]];

which is then iterated and summarized, as in this example which runs 10,000 iterations of the $n = 10000$ , $b = 1\,000\,000$ case:

Tally[Table[simulate[10000, 1000000], {n, 1, 10000}]] // TableForm

Its output is

2 8503
3 1493
4 4

These frequencies closely agree with those predicted by the Poisson approximation.

— whuber

What a fantastic answer, thank you very much @whuber.

— JKnight

"There is an obvious and fairly simple recursion" — Namely?

— Kodiologist

@Kodiologist I inserted a brief description of the idea.

— whuber

+1 but where in the original question did you see that n=10000 and b=1mln? The OP looks like it is asking about n=1mln and k=10000, with b unspecified (presumably b=365). Not that it matters at this point :)

— amoeba says Reinstate Monica

@amoeba After all this time (six years, 1600 answers, and closely reading tens of thousands of posts) I cannot recall, but most likely I misinterpreted the last line. In my defense, note that if we read it literally the answer is immediate (upon applying a version of the Pigeonhole Principle): it is certain that among $n$ = millions of people there will be at least one birthday that is shared among at least $x$ = thousands of them!

— whuber

It is always possible to solve this problem with a monte-carlo solution, although that's far from the most efficient. Here's a simple example of the 2 person problem in R (from a presentation I gave last year; I used this as an example of inefficient code), which could be easily adjusted to account for more than 2:

birthday.paradox <- function(n.people, n.trials) {
    matches <- 0
    for (trial in 1:n.trials) {
        birthdays <- cbind(as.matrix(1:365), rep(0, 365))
        for (person in 1:n.people) {
            day <- sample(1:365, 1, replace = TRUE)
            if (birthdays[birthdays[, 1] == day, 2] == 1) {
                matches <- matches + 1
                break
            }
            birthdays[birthdays[, 1] == day, 2] <- 1
        }
        birthdays <- NULL
    }
    print(paste("Probability of birthday matches = ", matches/n.trials))
}
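
As a sketch of how such a simulation might be adjusted for $x$ or more people, here is a Python version that tallies birthdays per trial and checks the largest tally. The function name and its parameters (`x`, `n_days`, `seed`) are assumptions for this illustration; it is not a port of the R code above and, like it, makes no attempt at efficiency for very large inputs.

```python
import random
from collections import Counter

def estimate_share_probability(n_people, x, n_trials, n_days=365, seed=None):
    """Monte Carlo estimate of P(at least x of n_people share a birthday)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Draw one birthday per person and count how often each day occurs.
        counts = Counter(rng.randrange(n_days) for _ in range(n_people))
        if counts and max(counts.values()) >= x:
            hits += 1
    return hits / n_trials

# The classic case: at least two of 23 people share a birthday.
print(estimate_share_probability(23, 2, 20000, seed=1))
```

The standard error of the estimate is roughly $\sqrt{p(1-p)/\text{n\_trials}}$, so small probabilities need many trials to resolve.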

— Shane

I am not sure if the multiple types solution will work here.

I think that generalisation still only works for 2 or more people sharing a birthday - just that you can have different sub-classes of people.

— Simon Andrews

This is an attempt at a general solution. There may be some mistakes so use with caution!

First some notation:

Let $P(x,n)$ be the probability that $x$ or more people share a birthday among $n$ people,

and let $P(y|n)$ be the probability that exactly $y$ people share a birthday among $n$ people.

Notes:

Abuse of notation as $P(.)$ is being used in two different ways.
By definition $y$ cannot take the value of 1 as it does not make any sense and $y$ = 0 can be interpreted to mean that no one shares a common birthday.

Then the required probability is given by:

$P(x,n) = 1 - P(0|n) - P(2|n) - P(3|n) - \cdots - P(x-1|n)$

Now,

$P(y|n) = {n \choose y} (\frac{365}{365})^y \ \prod_{k=1}^{k=n-y}(1 -\frac{k}{365})$

Here is the logic: You need the probability that exactly $y$ people share a birthday.

Step 1: You can pick $y$ people in ${n \choose y}$ ways.

Step 2: Since they share a birthday it can be any of the 365 days in a year. So, we basically have 365 choices which gives us $(\frac{365}{365})^y$ .

Step 3: The remaining $n-y$ people should not share a birthday with the first $y$ people or with each other. This reasoning gives us $\prod_{k=1}^{k=n-y}(1 -\frac{k}{365})$ .

You can check that for $x$ = 2 the above collapses to the standard birthday paradox solution.

Will this solution suffer from the curse of dimensionality? If instead of n=365, n=10^6 is this solution still feasible?

— csgillespie

Some approximations may have to be used to deal with high dimensions. Perhaps, use Stirling's approximation for factorials in the binomial coefficient. To deal with the product terms you could take logs and compute the sums instead of the products and then take the anti-log of the sum.

There are also several other forms of approximations possible using for example the Taylor series expansion for the exponential function. See the wiki page for these approximations: en.wikipedia.org/wiki/Birthday_problem#Approximations
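
To illustrate the suggestion of summing logs rather than multiplying, here is a minimal Python sketch for the standard product $\prod_{k=1}^{n-1}(1 - k/b)$, the probability that $n$ people all have distinct birthdays. The function name is hypothetical, and `log1p` is used as one reasonable way to keep precision when $k/b$ is tiny.

```python
from math import exp, log1p

def prob_all_distinct(n, b=365):
    """P(n people all have different birthdays), computed via a sum of logs."""
    if n > b:
        return 0.0  # pigeonhole: some birthday must repeat
    # log of prod_{k=1}^{n-1} (1 - k/b), accumulated as a sum.
    log_p = sum(log1p(-k / b) for k in range(1, n))
    return exp(log_p)

print(1 - prob_all_distinct(23))        # classic two-way birthday probability
print(prob_all_distinct(10000, 10**6))  # tiny, but computed without trouble
```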

Suppose y=2, n=4, and there are just two birthdays. Your formula, adapted by replacing 365 by 2, seems to say the probability that exactly 2 people share a birthday is Comb(4,2)*(2/2)^2*(1-1/2)*(1-2/2) = 0. (In fact, it's easy to see--by brute force enumeration if you like--that the probabilities that 2, 3, or 4 people share a "birthday" are 6/16, 8/16, and 2/16, respectively.) Indeed, whenever n-y >= 365, your formula yields 0, whereas as n gets large and y is fixed the probability should increase to a non-zero maximum before n reaches 365*y and then decrease, but never down to 0.
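
The brute-force enumeration mentioned in this comment is easy to sketch in Python. The function name is made up for the illustration, and it interprets "k people share a birthday" as "the most common birthday is held by exactly k people", which is the reading that reproduces the quoted fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def max_share_distribution(n, b):
    """Exact distribution of the largest number of people sharing a birthday."""
    tally = Counter()
    # Enumerate all b**n assignments of birthdays to people.
    for assignment in product(range(b), repeat=n):
        tally[max(Counter(assignment).values())] += 1
    return {k: Fraction(c, b ** n) for k, c in sorted(tally.items())}

# Four people, two possible "birthdays": 16 equally likely assignments.
for k, p in max_share_distribution(4, 2).items():
    print(k, p)
```

Since the enumeration visits $b^n$ assignments, it is only usable as a check on very small cases.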

— whuber

Why are you replacing 365 by $n$? The probability that 2 people share a birthday is computed as: 1 - Prob(they have unique birthdays). Prob(they have unique birthdays) = (364/365). The logic is as follows: Pick a person. This person can have any of the 365 days as a birthday. The second person can then only have a birthday on one of the remaining 364 days. Thus, the probability that they have unique birthdays is 364/365. I am not sure how you are calculating 6/16.