Can random suitless deck data be compressed?

19

I have real data that I am using for a simulated card game. I am only interested in the ranks of the cards, not the suits. However, it is a standard $52$-card deck, so there are only $4$ of each rank possible in the deck. The deck is shuffled well for each hand, and then I output the entire deck to a file. So there are only $13$ possible symbols in the output file: $2,3,4,5,6,7,8,9,T,J,Q,K,A$ ($T$ = ten rank). Of course we can bit-pack these using $4$ bits per symbol, but then we waste $3$ of the $16$ possible encodings. We can do better if we group $4$ symbols at a time and then compress them, because $13^4 = 28,561$ fits rather "snugly" into $15$ bits instead of $16$. The theoretical bit-packing limit is $\log(13)/\log(2) = 3.70044$ bits per symbol for data with $13$ random symbols for each possible card. However, we cannot have $52$ kings, for example, in this deck. We MUST have exactly $4$ of each rank in each deck, so the entropy encoding drops by about half a bit per symbol to about $3.2$.
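The figures quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are made up for illustration, and the deck count used is the multinomial coefficient $52!/(4!)^{13}$ (the number of distinct rank-only orderings of a deck).

```python
from math import factorial, log2

# Naive bound: 13 equally likely symbols for every card.
bits_per_symbol_naive = log2(13)

# Grouping 4 cards: 13**4 values must fit into 15 bits.
group_values = 13 ** 4
bits_per_symbol_grouped = 15 / 4

# Distinct rank-only orderings of a 52-card deck with 4 cards per rank.
arrangements = factorial(52) // factorial(4) ** 13
bits_per_deck = log2(arrangements)
bits_per_symbol_deck = bits_per_deck / 52

print(f"naive:   {bits_per_symbol_naive:.5f} bits/card")
print(f"grouped: {bits_per_symbol_grouped:.5f} bits/card")
print(f"deck:    {bits_per_deck:.4f} bits/deck, {bits_per_symbol_deck:.4f} bits/card")
```

The last line is the figure that the "about $3.2$" estimate refers to.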

OK, so here is what I am thinking. This data is not totally random. We know there are $4$ of each rank in each block of $52$ cards (call it a shuffled deck), so we can make several assumptions and optimizations. One of those is that we do not have to encode the very last card, because we will know what it should be. Another saving would be if we end on a single rank; for example, if the last $3$ cards in the deck are $777$, we would not have to encode those, because the decoder would be counting cards up to that point, see that all the other ranks have been filled, and assume the $3$ "missing" cards are all $7$s.

So my question to this site is: what other optimizations are possible to get an even smaller output file for this type of data, and if we use them, can we ever beat the theoretical (simple) bit-packing entropy of $3.70044$ bits per symbol, or even approach the ultimate entropy limit of about $3.2$ bits per symbol on average? If so, how?

When I use a ZIP-type program (WinZip, for example), I only see about $2:1$ compression, which tells me it is just doing a "lazy" bit-pack to $4$ bits. If I "pre-compress" the data using my own bit-packing, it seems to like that better, because when I then run that through a zip program I get a little over $2:1$ compression. What I am thinking is, why not do all the compression myself (because I know more about the data than the Zip program does)? I am wondering if I can beat the entropy "limit" of $\log(13)/\log(2) = 3.70044$. I suspect I can with the few "tricks" I mentioned and a few more I can probably find out. The output file of course does not have to be "human readable". As long as the encoding is lossless, it is valid.

Here is a link to $3$ million human-readable shuffled decks ($1$ per line). Anyone can "practice" on a small subset of these lines and then let it rip on the entire file. I will keep updating my best (smallest) file size based on this data.

https://drive.google.com/file/d/0BweDAVsuCEM1amhsNmFITnEwd2s/view

By the way, in case you are interested in what type of card game this data is used for, here is the link to my active question (with a $300$-point bounty). I am being told it is a hard problem to solve (exactly), since it would require a huge amount of data storage space. Several simulations agree with the approximate probabilities, though. No purely mathematical solutions have been provided yet. It's too hard, I guess.

/math/1882705/probability-2-player-card-game-with-multiple-patterns-to-win-who-has-the-advant

I have a good algorithm that shows $168$ bits to encode the first deck in my sample data. This data was generated randomly using the Fisher-Yates shuffle algorithm. It is real random data, so my newly created algorithm seems to be working VERY well, which makes me happy.

Regarding the compression "challenge", I am currently at about 160 bits per deck. I think I can go down to maybe 158. Yes, I tried it and I got 158.43 bits per deck. I think I am approaching the limit of my algorithm, so I succeeded in dropping below 166 bits per deck, but I failed to get 156 bits, which would be 3 bits per card; still, it was a fun exercise. Perhaps in the future I will think of something to reduce each deck on average by 2.43 bits or more.

data-compression

— David James

8

If you are generating these shuffled decks yourself (rather than describing the state of a physical deck of cards, for example), you do not need to store the deck at all - just store the RNG seed that generated the deck.

— jasonharper

3

Your description and those of the answers are very similar to a concept commonly known as range encoding ( en.wikipedia.org/wiki/Range_encoding ). You adapt the distribution after each card so that it reflects the remaining possible cards.

— H. Idden

Comments are not for extended discussion; this conversation has been moved to chat.

— Gilles 'SO- stop being evil'

3

Another thing to consider: if you only care about compressing a full set of several million decks, and you also don't care what order they are in, you can gain additional encoding flexibility by discarding the information about the ordering of the set of decks. This would be the case, for instance, if you need to load the set to enumerate all the decks and process them, but don't care what order they are processed in.

You start by encoding each deck individually, as other answers have described how to do. Then sort those encoded values. Store a series of differences between the sorted encoded values (where the first difference starts from encoded deck '0'). Given a large number of decks, the differences will tend to be smaller than the full encoding range, so you can use some form of variable-length integer (varint) encoding to handle occasional large differences while still storing the smaller differences efficiently. The appropriate varint scheme would depend on how many decks you have in the set (which determines the average difference size).

I unfortunately don't know the mathematics of how much this would help your compression, but I thought this idea might be useful to consider.

— Dan Bryant

1

Very roughly speaking, if you have a few million random decks, then the average difference will be one (few-millionth) of the full range, which means you can expect to save about 20-ish bits per value. You lose a bit for your varint encoding.

— Steve Jessop

2

@DavidJames: if the specific order of your decks doesn't matter, only that there is no bias in it, you could re-shuffle the 3 million decks after decompression (that is, don't change any of the decks, just change the order of the list of 3 million decks).

— Steve Jessop

2

This is just a way to reduce the information content slightly further if the ordering information is not important; if it is important, this is not applicable and can be ignored. That said, if the only importance of the ordering of the set of decks is that it is "random", you can just randomize the order after decompression, as @SteveJessop stated.

— Dan Bryant

@DavidJames Seeing that the first 173 of your decks start with KKKK, not looking at the other several million, and concluding that they all start with KKKK is a pretty stupid thing to do. Especially if they are obviously in sorted order.

— user253751

3

@DavidJames: this data is compressed, and the decompression routine can re-randomize it if desired. "Some naive person" won't get anything at all; they aren't even going to figure out how to interpret it as decks of cards. It is not a flaw in a data storage format (in this case a lossy format) that someone using it needs to RTFM to get the right data out.

— Steve Jessop

34

Here is a complete algorithm that attains the theoretical limit.

Prologue: Encoding integer sequences

A sequence of 13 integers "integer with upper limit $a-1$, integer with upper limit $b-1$, integer with upper limit $c-1$, integer with upper limit $d-1$, … integer with upper limit $m-1$" can always be encoded with perfect efficiency.

Take the first integer, multiply it by $b$, add the second, multiply the result by $c$, add the third, multiply the result by $d$, … multiply the result by $m$, add the thirteenth - and that will give you a unique number between $0$ and $abcdefghijklm-1$.
Write that number down in binary.

The reverse is easy too. Divide by $m$ and the remainder is the thirteenth integer. Divide the result by $l$ and the remainder is the twelfth integer. Carry on until you have divided by $b$: the remainder is the second integer and the quotient is the first integer.
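The prologue describes a mixed-radix encoding. A minimal sketch in Python follows; the function names `encode_mixed_radix` and `decode_mixed_radix` are chosen here for illustration and are not part of the answer.

```python
def encode_mixed_radix(values, radices):
    """Combine values[i] in range(radices[i]) into a single integer."""
    total = 0
    for value, radix in zip(values, radices):
        if not 0 <= value < radix:
            raise ValueError("value out of range for its radix")
        # Multiply by the radix of the incoming value, then add the value.
        total = total * radix + value
    return total


def decode_mixed_radix(total, radices):
    """Invert encode_mixed_radix: peel the values off from the last one."""
    values = []
    for radix in reversed(radices):
        total, value = divmod(total, radix)
        values.append(value)
    return values[::-1]


# Example with the first three upper limits used later in this answer.
radices = [1, 70, 495]
packed = encode_mixed_radix([0, 69, 494], radices)
print(packed, decode_mixed_radix(packed, radices))
```

Python integers are arbitrary precision, so the same two functions work unchanged for all 13 radices.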

So, to encode your cards in the best possible way, all we need to do is find a perfect correspondence between sequences of 13 integers (with given upper limits) and arrangements of your shuffled cards.

Here is how to do it.

Correspondence between shufflings and integer sequences

Start with a sequence of 0 cards on the table in front of you.

Step 1

Take the four 2s in your pack and place them on the table.

What are your options? A card or cards can be placed either at the beginning of the sequence already on the table, or after any one of the cards in that sequence. In this case, this means that there are $1+0=1$ possible places to put cards.

The total number of ways of placing 4 cards in 1 place is $1$. Encode each of those ways as a number between $0$ and $1-1$. There is 1 such number.

I got 1 by considering the ways of writing 0 as the sum of 5 non-negative integers: it is $\frac{4\times 3\times 2 \times 1}{4!}$.

Step 2

Take the four 3s in your pack and place them on the table.

What are your options? A card or cards can be placed either at the beginning of the sequence already on the table, or after any one of the cards in that sequence. In this case, this means that there are $1+4=5$ possible places to put cards.

The total number of ways of placing 4 cards in 5 places is $70$. Encode each of those ways as a number between $0$ and $70-1$. There are 70 such numbers.

I got 70 by considering the ways of writing 4 as the sum of 5 non-negative integers: it is $\frac{8\times 7\times 6 \times 5}{4!}$.

Step 3

Take the four 4s in your pack and place them on the table.

What are your options? A card or cards can be placed either at the beginning of the sequence already on the table, or after any one of the cards in that sequence. In this case, this means that there are $1+8=9$ possible places to put cards.

The total number of ways of placing 4 cards in 9 places is $495$. Encode each of those ways as a number between $0$ and $495-1$. There are 495 such numbers.

I got 495 by considering the ways of writing 8 as the sum of 5 non-negative integers: it is $\frac{12\times 11\times 10 \times 9}{4!}$.

And so on, until...

Step 13

Take the four aces in your pack and place them on the table.

What are your options? A card or cards can be placed either at the beginning of the sequence already on the table, or after any one of the cards in that sequence. In this case, this means that there are $1+48=49$ possible places to put cards.

The total number of ways of placing 4 cards in 49 places is $270725$. Encode each of those ways as a number between $0$ and $270725-1$. There are 270725 such numbers.

I got 270725 by considering the ways of writing 48 as the sum of 5 non-negative integers: it is $\frac{52\times 51\times 50 \times 49}{4!}$.

This procedure yields a 1-to-1 correspondence between (a) shufflings of cards where you don't care about the suit and (b) sequences of integers where the first is between $0$ and $1-1$, the second is between $0$ and $70-1$, the third is between $0$ and $495-1$, and so on up to the thirteenth, which is between $0$ and $270725-1$.

Referring to "Encoding integer sequences", you can see that such a sequence of integers is in 1-1 correspondence with the numbers between $0$ and $(1\times 70\times 495\times … \times 270725)-1$. If you look at the "product divided by a factorial" expression of each of the integers (as described at the end of each step), then you will see that this means the numbers between $0$ and

$\frac{52!}{(4!)^{13}}-1\text,$ which my previous answer showed was the best possible.
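The claim that the thirteen per-step counts multiply out to $\frac{52!}{(4!)^{13}}$ is easy to check numerically. In this sketch, step $s$ places 4 cards among $4s$ positions in total, so its count is the binomial coefficient $\binom{4s}{4}$; the variable names are illustrative only.

```python
from math import comb, factorial, prod

# Step s (1-based) chooses which 4 of the 4*s positions hold the new rank.
step_counts = [comb(4 * step, 4) for step in range(1, 14)]

total_arrangements = prod(step_counts)
closed_form = factorial(52) // factorial(4) ** 13

print(step_counts[:3], step_counts[-1])
print(total_arrangements == closed_form)
```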

So we have a perfect method for compressing your shuffled cards.

The algorithm

Pre-compute a list of all the ways of writing 0 as the sum of 5 non-negative integers, of writing 4 as the sum of 5 integers, of writing 8 as the sum of 5 integers, … of writing 48 as the sum of 5 integers. The longest list has 270725 elements, so it is not particularly big. (Pre-computation is not strictly necessary, because you can easily synthesise each list as and when you need it: trying with Microsoft QuickBasic, even going through the 270725-element list was faster than the eye could see.)

To get from a shuffling to a sequence of integers:

2s: These contribute nothing, so let's ignore them. Write down a number between 0 and 1-1.

3s: How many 2s are there before the first 3? How many between the first and the second? Between the second and the third? Between the third and the 4th? After the 4th? The answer is 5 integers which obviously add up to 4. So look up that sequence of 5 integers in your list of "writing 4 as the sum of 5 integers" and note its position in that list. That will be a number between 0 and 70-1. Write it down.

4s: How many 2s or 3s are there before the first 4? How many between the first and the second? Between the second and the third? Between the third and the 4th? After the 4th? The answer is 5 integers which obviously add up to 8. So look up that sequence of 5 integers in your list of "writing 8 as the sum of 5 integers" and note its position in that list. That will be a number between 0 and 495-1. Write it down.

And so on, until…

Aces: How many non-ace cards are there before the first ace? How many between the first and the second? Between the second and the third? Between the third and the 4th? After the 4th? The answer is 5 integers which obviously add up to 48. So look up that sequence of 5 integers in your list of "writing 48 as the sum of 5 integers" and note its position in that list. That will be a number between 0 and 270725-1. Write it down.

You have now written down 13 integers. Encode them (as previously described) into a single number between $0$ and $\frac{52!}{(4!)^{13}}-1$. Write that number out in binary. It will take just under 166 bits.

This is the best possible compression, because it attains the information-theoretic limit.

Decompression is straightforward: go from the big number to the sequence of 13 integers, and then use them to build up the sequence of cards as already described.
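The whole procedure fits in a short Python sketch. One substitution should be stated plainly: instead of looking up the 5 gap counts in a precomputed list, this version numbers the 4 positions of each rank with the combinatorial number system, which yields a different but equally valid index in the same range. All function names are hypothetical.

```python
import random
from math import comb, factorial

RANKS = "23456789TJQKA"
SORTED_DECK = "".join(r * 4 for r in RANKS)
T = factorial(52) // factorial(4) ** 13


def rank_subset(positions):
    """Index of a sorted 4-subset (combinatorial number system)."""
    return sum(comb(p, j + 1) for j, p in enumerate(positions))


def unrank_subset(index, size=4):
    """Inverse of rank_subset; returns the positions in ascending order."""
    positions = []
    for j in range(size, 0, -1):
        p = j - 1
        while comb(p + 1, j) <= index:
            p += 1
        index -= comb(p, j)
        positions.append(p)
    return positions[::-1]


def encode_deck(deck):
    total = 0
    for r in range(13):
        # Keep only the cards of rank index <= r, in deck order.
        sub = [c for c in deck if RANKS.index(c) <= r]
        positions = [i for i, c in enumerate(sub) if c == RANKS[r]]
        # Mixed-radix step: this rank has comb(4r+4, 4) possible placements.
        total = total * comb(4 * r + 4, 4) + rank_subset(positions)
    return total


def decode_deck(total):
    digits = []
    for r in range(12, -1, -1):
        total, digit = divmod(total, comb(4 * r + 4, 4))
        digits.append((r, digit))
    table = []
    for r, digit in reversed(digits):
        positions = set(unrank_subset(digit))
        old = iter(table)
        # Put rank r at its 4 positions, keep earlier cards in order elsewhere.
        table = [RANKS[r] if i in positions else next(old)
                 for i in range(4 * r + 4)]
    return "".join(table)


cards = list(SORTED_DECK)
random.Random(1).shuffle(cards)
deck = "".join(cards)
code = encode_deck(deck)
assert 0 <= code < T and decode_deck(code) == deck
```

The round trip at the end uses a seeded shuffle purely as a smoke test; `code` always fits in 166 bits because it is smaller than `T`.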

— Martin Kochanski

Comments are not for extended discussion; this conversation has been moved to chat.

— D.W.

This solution is unclear to me and incomplete. It doesn't show how to actually get the 166-bit number and decode it back into the deck. It is not at all easy for me to conceive, so I don't know how to implement it. Your step-by-step formula basically just breaks up the $52! / (4! ^ {13})$ formula into $13$ parts, which really doesn't help me much. I think it would have helped if you had made a diagram or chart, for perhaps step 2 with the 70 possible ways to arrange the cards. Your solution is too abstract for my brain to accept and process. I prefer actual examples and illustrations.

— David James

23

Instead of trying to encode each card separately into 3 or 4 bits, I suggest you encode the state of the entire deck into 166 bits. As Martin Kochanski explains, there are fewer than $2^{166}$ possible arrangements of the cards ignoring suits, which means that the state of the entire deck can be stored in 166 bits.

How can this compression and decompression be done algorithmically and efficiently? I suggest using lexicographic ordering and binary search. This will allow you to do compression and decompression efficiently (in both space and time), without requiring a large lookup table or other unrealistic assumptions.

In more detail: let's order the decks using the lexicographic ordering on the uncompressed representation of the deck, i.e., a deck is represented in uncompressed form as a string like 22223333444455556666777788889999TTTTJJJJQQQQKKKAKAAA; you can order them according to lexicographic order. Now, suppose you have a procedure that, given a deck $D$, counts the number of decks that come before it (in lexicographic order). Then you can use this procedure to compress a deck: given a deck $D$, you compress it to a 166-bit number by counting the number of decks that come before it and then outputting that number. That number is the compressed representation of the deck.

To decompress, use binary search. Given a number $n$, you want to find the $n$th deck in the lexicographic ordering of all decks. You can do this using a procedure along the lines of binary search: pick a deck $D_0$, count the number of decks before $D_0$, and compare that to $n$. That will tell you whether to adjust $D_0$ to come earlier or later. I suggest you try to iteratively get the symbol right: if you want to recover a string like 22223333444455556666777788889999TTTTJJJJQQQQKKKKAAAA, first search to find what to use as the first symbol in the string (just try all 13 possibilities, or use binary search over the 13 possibilities), then when you've found the right value for the first symbol, search to find the second symbol, and so on.

All that remains is to come up with an efficient procedure to count the number of decks that come lexicographically before $D$ . This looks like a straightforward but tedious combinatorial exercise. In particular, I suggest you build a subroutine for the following problem: given a prefix (like 222234), count the number of decks that start with that prefix. The answer to this problem looks like a pretty easy exercise in binomial coefficients and factorials. Then, you can invoke this subroutine a small number of times to count the number of decks that come before $D$ .
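As a rough sketch of that subroutine: the number of decks starting with a given prefix is the multinomial coefficient of the cards left over once the prefix has been dealt. The names below are hypothetical, and the ordering $2 < 3 < \dots < K < A$ is an assumption made here for the lexicographic comparison.

```python
from math import factorial

RANKS = "23456789TJQKA"  # assumed ordering for lexicographic comparison
T = factorial(52) // factorial(4) ** 13


def count_decks_with_prefix(prefix):
    """Number of rank-only decks that start with the given prefix."""
    remaining = {r: 4 for r in RANKS}
    for card in prefix:
        if remaining[card] == 0:
            return 0  # prefix uses a fifth card of some rank
        remaining[card] -= 1
    total = factorial(sum(remaining.values()))
    for count in remaining.values():
        total //= factorial(count)
    return total


def count_decks_before(deck):
    """Number of decks that come lexicographically before deck."""
    before = 0
    for i, card in enumerate(deck):
        for smaller in RANKS[:RANKS.index(card)]:
            before += count_decks_with_prefix(deck[:i] + smaller)
    return before


print(count_decks_with_prefix("222234"))
```

With `count_decks_before` in hand, decompression can pick each symbol in turn by testing the 13 candidate ranks against the target number.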

— D.W.

Comments are not for extended discussion; this conversation has been moved to chat.

— D.W.

8

The number of possible arrangements of the cards ignoring suits is

$\frac{52!}{(4!)^{13}}\text,$ whose logarithm base 2 is 165.976, or 3.1919 bits per card, which is better than the limit you gave.

Any fixed "bits per card" encoding will not make sense because, as you note, the last card can always be encoded in $0$ bits and in many cases the last few cards can be as well. That means that for quite a way towards the "tail" of the pack the number of bits needed for each card will be quite a lot less than you think.

By far the best way of compressing the data would be to find 59 bits of other data that you want to pack with your card data anyway (59.6 bits, actually) and, writing those 59 bits as a 13-digit number modulo 24 (= $4!$ ), assign a suit to each card (one digit chooses between the $4!$ ways of assigning suits to the aces, another does the same for the kings, and so on). Then you have a pack of 52 wholly distinct cards. $52!$ possibilities can be encoded in 225.58 bits very easily indeed.

But doing it without taking the opportunity of encoding those extra bits is also possible to some extent, and I will think about it as I am sure everyone else is. Thank you for a really interesting problem!

— Martin Kochanski

1

Could an approach similar to ciphertext stealing be used here? As in, the data you encode in those extra 59 bits are the last 59 bits of the encoded representation?

— John Dvorak

@JanD I was thinking about investigating something like this. But then it turned out that an algorithm exists that attains the theoretical limit and is straightforward and 100% reliable, so there was no point in looking further.

— Martin Kochanski

@MartinKochanski - I wouldn't word it as "ignoring suits" cuz we are still honoring the standard 4 suits per rank. Better wording might be "The number of possible distinct arrangements of the deck is"...

— David James

3

This is a long solved problem.

When you deal a deck of 52 cards, every card you deal has one of up to 13 ranks with known probabilities. The probabilities change with every card dealt. That is handled optimally using an ancient technique called adaptive arithmetic coding, an improvement to Huffman coding. Usually this is used for known, unchanging probabilities, but it can just as well be used for changing probabilities. Read the wikipedia article about arithmetic coding:

https://en.wikipedia.org/wiki/Arithmetic_coding
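To see what the adaptive model buys, one can add up the ideal code length $-\log_2 p$ of every card, where $p$ is the fraction of the remaining cards that have that rank. The sketch below only computes this ideal length; it is not an arithmetic coder, and the function name is made up for illustration. A real arithmetic coder would be expected to land within a small constant number of bits of this total.

```python
from math import factorial, log2

RANKS = "23456789TJQKA"


def ideal_code_length(deck):
    """Sum of -log2(probability) under the adaptive card-counting model."""
    remaining = {r: 4 for r in RANKS}
    cards_left = 52
    bits = 0.0
    for card in deck:
        probability = remaining[card] / cards_left
        bits += -log2(probability)
        remaining[card] -= 1
        cards_left -= 1
    return bits


limit = log2(factorial(52) // factorial(4) ** 13)
deck = "".join(r * 4 for r in RANKS)
print(ideal_code_length(deck), limit)
```

The total is the same for every valid deck, because the product of the per-card probabilities is always $(4!)^{13}/52!$; forced cards at the end of a deck have probability 1 and cost 0 bits.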

— gnasher729

Okay but this doesn't answer my question if it can approach, match, or beat the theoretical entropy encoding limit. It seems that since there are n possible decks each with 1/n probability, then the entropy encoding is the limit and we cannot do better (unless we "cheat" and tell the decoder something about the input data to the encoder ahead of time).

— David James

3

Both D.W. and Martin Kochanski have already described algorithms for constructing a bijection between deals and integers in the range $[0, {52!\over(4!)^{13}})$ , but it seems like neither of them have reduced the problem to its simplest form. (Note 1)

Suppose we have a (partial) deck described by the ordered list $a$ , where $a_i$ is the number of cards of type $i$ . In the OP, the initial deck is described by a list of 13 elements, each of them with value 4. The number of distinct shuffles of such a deck is

$c(a) = {(\sum a_i)! \over \prod a_i!}$

which is a simple generalization of binomial coefficients, and indeed could be proven by simply arranging the objects one type at a time, as suggested by Martin Kochanski. (See below, note 2)

Now, for any such (partial) deck, we can select a shuffle one card at a time, using any $i$ for which $a_i>0$ . The number of unique shuffles starting with $i$ is

$\begin{cases}0 & \text{if } a_i = 0 \\ c(\langle a_1,...,a_{i-1},a_i-1,a_{i+1},...,a_n \rangle) & \text{if } a_i > 0. \end{cases}$

and by the above formula, we have

$c(\langle a_1,...,a_{i-1},a_i-1,a_{i+1},...,a_n \rangle) = {a_ic(a)\over\sum a_i}$

We can then recurse (or iterate) through the deck until the shuffle is complete by observing that the number of shuffles which start with a card smaller than $i$ is

${c(a)\sum\limits_{j=1}^{i-1} a_j}\over\sum\limits_{j=1}^n a_j$

I wrote this in Python to illustrate the algorithm; Python is as reasonable a pseudocode as any. Note that most of the arithmetic involves extended precision; the values $k$ (representing the ordinal of the shuffle) and $n$ (the total number of possible shuffles for the remaining partial deck) are both 166-bit bignums. To translate the code to another language, it will be necessary to use some sort of bignum library.

Also, I just use a list of integers rather than card names, and -- unlike the above maths -- the integers are 0-based.

To encode a shuffle, we walk through the shuffle, accumulating at each point the number of shuffles which start with a smaller card using the above formula:

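from math import factorial

# Total number of distinct rank-only decks: 52! / (4!)^13
T = factorial(52) // factorial(4) ** 13

def encode(vec):
    a = [4] * 13       # a[i] = cards of rank i not yet dealt
    cards = sum(a)     # cards not yet dealt
    n = T              # shuffles of the remaining partial deck
    k = 0              # ordinal accumulated so far
    for idx in vec:
        # Skip every shuffle that starts with a smaller rank.
        k += sum(a[:idx]) * n // cards
        # Shuffles that start with idx: a[idx] * c(a) / sum(a).
        n = a[idx] * n // cards
        a[idx] -= 1
        cards -= 1
    return k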

Decoding a 166-bit number is the simple inverse. At each step, we have the description of a partial deck and an ordinal; we need to skip over the shuffles starting with smaller cards than the one which corresponds to the ordinal, and then we output the selected card, remove it from the remaining deck, and adjust the number of possible shuffles with the selected prefix:

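def decode(k):
    vec = []
    a = [4] * 13       # a[i] = cards of rank i not yet dealt
    cards = sum(a)
    n = T              # shuffles of the remaining partial deck
    while cards > 0:
        # Scale the ordinal to a position among the remaining cards.
        i = cards * k // n
        accum = 0
        for idx in range(len(a)):
            if i < accum + a[idx]:
                # Rank idx covers position i: drop the shuffles that start
                # with a smaller rank, then narrow n to this prefix.
                k -= accum * n // cards
                n = a[idx] * n // cards
                a[idx] -= 1
                vec.append(idx)
                break
            accum += a[idx]
        cards -= 1
    return vec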

I made no real attempt to optimize the above code. I ran it against the entire 3mil.TXT file, checking that decode(encode(line)) reproduced the original line; it took just under 300 seconds. (Seven of the lines can be seen in the on-line test on ideone.) Rewriting in a lower-level language and optimizing the division (which is possible) would probably reduce that time to something manageable.

Since the encoded value is simply an integer, it can be output in 166 bits. There is no value in deleting the leading zeros, since there would then be no way to know where an encoding terminated, so it is really a 166-bit encoding.
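One possible way to write fixed-width 166-bit values back to back, without per-deck byte padding, is sketched below. The helper names `pack_ordinals` and `unpack_ordinals` are assumptions for illustration; only the final byte boundary of the whole stream is padded.

```python
BITS = 166


def pack_ordinals(values):
    """Concatenate fixed-width 166-bit integers into one byte string."""
    acc = 0
    for value in values:
        if value < 0 or value >> BITS:
            raise ValueError("value does not fit in 166 bits")
        acc = (acc << BITS) | value
    n_bytes = (BITS * len(values) + 7) // 8
    return acc.to_bytes(n_bytes, "big")


def unpack_ordinals(data, count):
    """Recover count fixed-width integers written by pack_ordinals."""
    acc = int.from_bytes(data, "big")
    mask = (1 << BITS) - 1
    values = []
    for _ in range(count):
        values.append(acc & mask)
        acc >>= BITS
    return values[::-1]


sample = [0, 1, (1 << BITS) - 1, 12345]
blob = pack_ordinals(sample)
print(len(blob), unpack_ordinals(blob, len(sample)) == sample)
```

The decoder has to know the number of values in advance (for example from a header), precisely because leading zeros are kept.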

However, it's worth noting that in a practical application, it is probably never necessary to encode a shuffle; a random shuffle can be generated by generating a random 166-bit number and decoding it. And it is not really necessary that all 166 bits be random; it would be possible to, for example, start with a 32-bit random integer and then fill in the 166 bits using any standard RNG seeded with the 32-bit number. So if the goal is simply to be able to reproducibly store a large number of random shuffles, you can reduce the per-deal storage requirement more or less arbitrarily.

If you want to encode a large number $N$ of actual deals (generated in some other fashion) but don't care about the order of the deals, you can delta-encode the sorted list of numbers, saving approximately $\log_2 N$ bits for each number. (The savings results from the fact that a sorted sequence has less entropy than an unsorted sequence. It does not reduce the entropy of a single value in the sequence.)

Assuming that we need to encode a sorted list of $N$ $k$ -bit numbers, we can proceed as follows:

Choose $p$ as an integer close to $\log_2 N$ (either the floor or the ceiling will work; I usually go for the ceiling).
We implicitly divide the range of numbers into $2^p$ intervals by binary prefix. Each $k$-bit number is divided into a $p$-bit prefix and a $(k-p)$-bit suffix; we write out only the suffixes (in order). This requires $N*(k-p)$ bits.
Additionally, we create a bit-sequence: for each of the $2^p$ prefixes, starting with prefix $0$, we write out a $0$ for each number with that prefix (if any) followed by a $1$. This sequence obviously has $2^p+N$ bits: $2^p$ $1$s and $N$ $0$s.

To decode the numbers we start a prefix counter at 0, and proceed to work through the bit sequence. When we see a $0$ , we output the current prefix and the next suffix from the suffix list; when we see a $1$ , we increment the current prefix.

The total length of the encoding is $N*(k-p) + N + 2^p$ which is very close to $N*(k-p) + N + N$ , or $N*(k-p+2)$ , for an average of $k-p+2$ bits per value.
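The prefix/suffix scheme can be sketched as follows. This is a toy version that keeps the suffixes and the bit sequence as Python lists rather than packed bits, and it emits a run for every prefix including prefix 0, so that the bit sequence has exactly $2^p$ ones and the decoder described above can start its counter at 0. The function names are illustrative.

```python
def encode_sorted(values, k, p):
    """Split sorted k-bit values into (k-p)-bit suffixes plus a unary prefix stream."""
    shift = k - p
    suffixes = [v & ((1 << shift) - 1) for v in values]
    bits = []
    it = iter(values)
    current = next(it, None)
    for prefix in range(1 << p):
        # One 0 per value carrying this prefix, then a 1 to close the prefix.
        while current is not None and (current >> shift) == prefix:
            bits.append(0)
            current = next(it, None)
        bits.append(1)
    return suffixes, bits


def decode_sorted(suffixes, bits, k, p):
    shift = k - p
    out = []
    prefix = 0
    suffix_iter = iter(suffixes)
    for bit in bits:
        if bit == 0:
            out.append((prefix << shift) | next(suffix_iter))
        else:
            prefix += 1
    return out


values = [3, 3, 70, 200, 255]          # sorted 8-bit numbers
suffixes, bits = encode_sorted(values, k=8, p=2)
print(suffixes, bits)
print(decode_sorted(suffixes, bits, k=8, p=2))
```

Duplicates are handled naturally, since equal values simply contribute consecutive 0 bits under the same prefix.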

Notes

1. $52!\over(4!)^{13}$ is $92024242230271040357108320801872044844750000000000$ , and $\log_2 {52!\over(4!)^{13}}$ is approximately $165.9765$ . In the text, I occasionally pretend that the base-2 logarithm is really $166$ ; in the case of generating random ordinals within the range, a rejection algorithm could be used which would only very rarely reject a generated random number.
2. For convenience, I write $S_k$ for $\sum\limits_{i=k}^n a_i$ ; then the $a_1$ objects of type $1$ can be placed in $S_1 \choose a_1$ ways, and then the objects of type $2$ can be placed in $S_2 \choose a_2$ ways, and so on. Since ${S_i \choose a_i}={S_i! \over a_i!(S_i - a_i)!}={S_i!\over {a_i!S_{i+1}!}}$ , that leads to the total count

$\prod\limits_{i=1}^n S_i! \over \prod\limits_{i=1}^n a_i! S_{i+1}!$

which simplifies to the formula above.

— rici

Comments are not for extended discussion; this conversation has been moved to chat.

— D.W.

@rici - I gave you the +100 bounty cuz you explained your answer in what seems like a better presentation including code while the other answers are more abstract/theoretical, leaving out some details of how to actually implement the encode/decode. As you may know, there are many details when writing code. I admit my algorithm is not the most straightforward, simple, easy to understand either but I actually got it working without much effort and over time I can get it running faster with more compression. So thanks for your answer and keep up the good work.

— David James

2

As an alternate solution to this problem, my algorithm uses compound fractional (non integer) bits per card for groups of cards in the deck based on how many unfilled ranks there are remaining. It is a rather elegant algorithm. I checked my encode algorithm by hand and it is looking good. The encoder is outputting what appear to be correct bitstrings (in byte form for simplicity).

The overview of my algorithm is that it uses a combination of groups of cards and compound fractional bit encoding. For example, in my shared test file of $3$ million shuffled decks, the first one has the first $7$ cards of $54A236J$ . The reason I chose a $7$ card block size when $13$ ranks of cards are possible is because $13^7$ "shoehorns" (fits snugly) into $26$ bits (since $13^7$ = $62,748,517$ and $2^{26}$ = $67,108,864$ ). Ideally we want those $2$ numbers to be as close as possible (but with the power of 2 number slightly higher) so we don't waste more than a very small fraction of a bit in the bit packing process. Note I could have also chosen groupsize $4$ when encoding $13$ ranks since $13^4$ = $28,561$ and $2^{15}$ = $32,768$ . It is not as tight a fit since $15/4 = 3.75$ but $26/7 = 3.714$ . So the number of bits per card is slightly lower per card if we use the $26/7$ packing method.

So looking at $54A236J$ , we simply look up the ordinal position of those ranks in our master " $23456789TJQKA$ " list of sorted ranks. For example, the first actual card rank of $5$ has a lookup position in the rank lookup string of $4$ . We just treat these $7$ rank positions as a base $13$ number starting with 0 (so the position 4 we previously got will actually be a 3). Converted back to base $10$ (for checking purposes), we get $15,565,975$ . In $26$ bits of binary we get $00111011011000010010010111$ .

The decoder works in a very similar way. It takes (for example) that string of $26$ bits and converts it back to decimal (base 10) to get $15,565,975$ , then converts it to base $13$ to get the offsets into the rank lookup string, then it reconstructs the ranks one at a time and gets the original $54A236J$ first $7$ cards. Note that the blocksize of bits won't always be 26 but will always start out at 26 in each deck. The encoder and decoder both have some important information about the deck data even before they operate. That is one exceptionally nice thing about this algorithm.
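The worked example for the first block can be reproduced with a short sketch. It covers only the 13-rank mode described so far; how the lookup string shrinks once ranks fill up is not spelled out here, so that part is left out. The names `pack_block` and `unpack_block` are illustrative.

```python
RANKS = "23456789TJQKA"


def pack_block(cards, base=13):
    """Read the 0-based rank positions of the cards as one base-13 number."""
    value = 0
    for card in cards:
        value = value * base + RANKS.index(card)
    return value


def unpack_block(value, length, base=13):
    """Convert the number back to base 13 and look the ranks up again."""
    cards = []
    for _ in range(length):
        value, digit = divmod(value, base)
        cards.append(RANKS[digit])
    return "".join(reversed(cards))


value = pack_block("54A236J")
bit_string = format(value, "026b")   # zero-padded to 26 bits
print(value, bit_string, unpack_block(value, 7))
```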

Each # of ranks remaining (such as $13, 12, 11 ..., 2, 1)$ has its own groupsize and cost (# of bits per card). These were found experimentally just playing around with powers of $13,12,11...$ and powers of $2$ . I already explained how I got the groupsize for when we can see $13$ ranks, so how about when we drop to $12$ unfilled ranks? Same method. Look at the powers of $12$ and stop when one of them comes very close to a power of $2$ but just slightly under it. $12^5$ = $248,832$ and $2^{18}$ = $262,144$ . That is a pretty tight fit. The number of bits encoding this group is $18/5$ = $3.6$ . In the $13$ rank group it was $26/7$ = $3.714$ so as you can see, as the number of unfilled ranks decreases (ranks are filling up such as $5555$ , $3333$ ), the number of bits to encode the cards decreases.

Here is my complete list of costs (# of bits per card) for all possible # of ranks to be seen:

$13~~~~26/7 = 3.714 = 3~~5/7$
$12~~~~18/5 = 3.600 = 3~~3/5$
$11~~~~~~7/2 = 3.500 = 3~~1/2$
$10~~~~10/3 = 3.333 = 3~~1/3$
$~~9~~~~16/5 = 3.200 = 3~~1/5$
$~~8~~~~~~3/1 = 3.000 = 3$
$~~7~~~~17/6 = 2.833 = 2~~5/6$
$~~6~~~~13/5 = 2.600 = 2~~3/5$
$~~5~~~~~~7/3 = 2.333 = 2~~1/3$
$~~4~~~~~~2/1 = 2.000 = 2$
$~~3~~~~~~5/3 = 1.667 = 1~~2/3$
$~~2~~~~~~1/1 = 1.000 = 1$
$~~1~~~~~~0/1..4 = 0.0 = 0$
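As a quick sanity check of the list above, the sketch below confirms that each block of cards really fits in the stated number of bits (ranks^cards must not exceed 2^bits) and recomputes the cost per card. The `TABLE` values are copied from the list; the helper name `check_table` is just for illustration.

```python
from fractions import Fraction

# (unfilled ranks, cards per block, bits per block), copied from the list above
TABLE = [(13, 7, 26), (12, 5, 18), (11, 2, 7), (10, 3, 10), (9, 5, 16),
         (8, 1, 3), (7, 6, 17), (6, 5, 13), (5, 3, 7), (4, 1, 2),
         (3, 3, 5), (2, 1, 1)]

def check_table(table=TABLE):
    """Return (ranks, combinations, code space, bits per card) for every row."""
    rows = []
    for ranks, cards, bits in table:
        combos = ranks ** cards
        assert combos <= 2 ** bits, f"{ranks}^{cards} does not fit in {bits} bits"
        rows.append((ranks, combos, 2 ** bits, Fraction(bits, cards)))
    return rows

for ranks, combos, capacity, cost in check_table():
    print(f"{ranks:2d} ranks: {combos:>10,} of {capacity:>10,} codes used, "
          f"{float(cost):.3f} bits/card")
```

The gap between the two printed counts is the part of each block's code space that goes unused.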

So as you can clearly see, as the number of unfilled ranks decreases (which it will do in every deck), the number of bits needed to encode each card also decreases. You might be wondering what happens if we fill a rank before we have finished a group. For example, if the first $7$ cards in the deck were $5,6,7,7,7,7,K$ , what should we do? Easy: the $K$ would normally drop the encoder from $13$ rank encoding mode to $12$ rank encoding mode. However, since we haven't yet filled the first block of $7$ cards in $13$ rank encoding mode, we include the $K$ in that block to complete it. There is very little waste this way. There are also cases where, while we are trying to fill a block, the # of filled ranks bumps up by $2$ or even more. That is also no problem: we just fill the block in the current encoding mode, then pick up in the new encoding mode, which may be $1,2,3...$ lower or may even be the same mode (as was the case in the first deck in the datafile, where there are $3$ full blocks in the $13$ rank encoding mode). This is why it is important to keep the blocksizes reasonable, such as between $1$ and $7$ cards. If we made it size $20$ for example, we would have to fill that block at a higher bitrate than if we let the encoder transition into a more efficient encoding mode (encoding fewer ranks).

When I ran this algorithm (by hand) on the first deck of cards in the data file (which was created using Fisher-Yates unbiased shuffle), I got an impressive $168$ bits to encode which is almost identical to optimal binary encoding but requires no knowledge of ordinal positions of all possible decks, no very large numbers, and no binary searches. It does however require binary manipulations and also radix manipulations (powers of $13, 12, 11$ ...).

Notice also that when the number of unfilled ranks = $1$ , the overhead is $0$ bits per card. The best case (for encoding) is for the deck to end on a run of the same rank (such as $7777$ ), because those cards get encoded for "free" (no bits required for them). My encode program will suppress any output when the remaining cards are all the same rank. This works because the decoder will be counting cards for each deck and will know, after seeing card $48$ , that if some rank (like $7$ ) has not yet been seen, all $4$ remaining cards MUST be $7$ s. If the deck ends on a pair (such as $77$ ), a triple/set (such as $777$ ) or a quad (such as $7777$ ), we get additional savings for that deck using my algorithm.

Another "pretty" thing about this algorithm is that it never needs to use any numbers larger than $32$ bits, so it won't cause problems in languages that "don't like" large numbers. Actually, the largest numbers are on the order of $2^{26}$ , which are used in the $13$ rank encoding mode. From there they just get smaller. In fact, if I really wanted to, I could write the program so that it doesn't use anything larger than $16$ bit numbers, but this is not necessary as most computer languages handle $32$ bits well. This is also beneficial to me since one of the bit functions I am using (a function to test whether a bit is set or not) maxes out at $32$ bits.

In the first deck in the datafile, the encoding of cards is as follows (diagram to come later). Format is (groupsize, bits, rank encode mode):

( $7,26,13$ ) First $7$ cards take $26$ bits to encode in $13$ rank mode.
( $7,26,13$ )
( $7,26,13$ )
( $5,18,12$ )
( $5,18,12$ )
( $3,10,10$ )
( $3,~~9,~~8$ )
( $6,17,~~7$ )
( $5,13,~~6$ )
( $3,~~5,~~3$ )
( $1,~~0,~~1$ )

This is a total of $52$ cards and $168$ bits for an average of about $3.23$ bits per card. There is no ambiguity in either the encoder or the decoder. Both count cards and know which encode mode to use/expect.
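The bit accounting above can be reproduced with a short sketch. It assumes the rules exactly as described in this answer: the mode (number of unfilled ranks) is fixed at the start of each block, the block is always completed in that mode, and once only one rank remains the rest of the deck costs nothing. The `GROUPS` values come from the cost list; `trace_deck` is an illustrative name, not the author's code. Because the list gives $3/1$ for $8$ rank mode, this sketch charges those cards one at a time ( $3$ bits each) where the trace above groups three of them into $9$ bits; the total is the same.

```python
from collections import Counter

RANKS = "23456789TJQKA"
# unfilled ranks -> (cards per block, bits per block), from the cost list
GROUPS = {13: (7, 26), 12: (5, 18), 11: (2, 7), 10: (3, 10), 9: (5, 16),
          8: (1, 3), 7: (6, 17), 6: (5, 13), 5: (3, 7), 4: (1, 2),
          3: (3, 5), 2: (1, 1)}

def trace_deck(deck):
    """Return a list of (cards, bits, mode) blocks for one 52-card deck string."""
    seen = Counter()
    blocks = []
    i = 0
    while i < len(deck):
        mode = sum(1 for r in RANKS if seen[r] < 4)  # ranks not yet filled
        if mode == 1:
            size, bits = len(deck) - i, 0            # remaining cards are implied
        else:
            size, bits = GROUPS[mode]
        block = deck[i:i + size]
        seen.update(block)
        blocks.append((len(block), bits, mode))
        i += len(block)
    return blocks

first_deck = ("54A236J 87726Q3 3969AAA QJK7T 9292Q 36K J57 "
              "T8TKJ4 48Q8T 55K 4").replace(" ", "")
blocks = trace_deck(first_deck)
print(sum(cards for cards, _, _ in blocks))  # 52
print(sum(bits for _, bits, _ in blocks))    # 168
```

Since no group size exceeds its number of unfilled ranks, and every unfilled rank still has at least one card to come, a block never runs past the end of the deck.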

Also notice that $18$ cards (more than one third of the deck) are encoded BELOW the $3.2$ bits per card "limit". Unfortunately those are not enough cards to bring the overall average below about $3.2$ bits per card. I imagine in the best case or near best case (where many ranks fill up early such as $54545454722772277...$ ), the encoding for that particular deck might be under $3$ bits per card, but of course it is the average case that counts. I think the best case might be if all the quads are dealt in order, which might never happen even given all the time in the universe and the fastest supercomputer. Something like $22223333444455556666777788889999TTTTJJJJQQQQKKKKAAAA$ . Here the rank encode mode would drop fast and the last $4$ cards would have $0$ bits of overhead. This special case takes only $136$ bits to encode.

Also, one possible optimization I am considering is to take all the ranks that have only $1$ card remaining and treat them as a single special "rank" by placing them in one "bucket". The reason is that if we do that, the encoder can drop into a more efficient packing mode sooner. For example, if we are in $10$ rank encoding mode but we only have one more each of ranks $3,7$ , and $K$ , those cards have much less chance of appearing than the other cards, so it doesn't make much sense to treat them the same. If instead I dropped to $8$ rank encoding mode, which is more efficient than $10$ rank mode, perhaps I could use fewer bits for that deck. When I see one of the cards in that special "grouped" bucket, I would just output that special "rank" (not a real rank but just an indicator that we saw something in the bucket) and then a few more bits to tell the decoder which card in the bucket I saw, then I would remove that card from the group (since its rank just filled up). I will trace this by hand to see if any bit savings are possible using it. Note there should be no ambiguity using this special bucket because both the encoder and decoder will be counting cards and will know which ranks have only $1$ card remaining. This is important because it makes the encoding process more efficient when the decoder can make correct assumptions without the encoder having to pass extra messages to it.

Here is the first full deck in the $3$ million deck data file and a trace of my algorithm on it showing both the block groupings and the transitions to a lower rank encoding mode (like when transitioning from $13$ to $12$ unfilled ranks) as well as how many bits needed to encode each block. x and y are used for $11$ and $10$ respectively because unfortunately they happened on neighboring cards and don't display well juxtaposed.

$~~~~~~~~~26~~~~~~~~~~~~~26~~~~~~~~~~~~~26~~~~~~~~~~~~18~~~~~~~~~18~~~~~~~10~~~~~~9~~~~~~~~~~17~~~~~~~~~~~13~~~~~~~~5~~~~~0$
$~~~~54A236J~~87726Q3~~3969AAA~~QJK7T~~9292Q~~36K~~J57~~~T8TKJ4~~48Q8T~~55K~~4$
$13~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~12~~~~~~~~~~~~~~~~~~~~xy~~~~~98~~~~~~~~~7~~~~~~~~~~~~~~6~~~~~~~~543~~~~~2~1~~0$

Note that there is some inefficiency when the encode mode wants to transition early in a block (when the block is not yet completed). We are "stuck" encoding that block at a slightly higher bit level. This is a tradeoff. Because of this and because I am not using every possible combination of the bit patterns for each block (except when it is an integer power of $2$ ), this algorithm cannot be optimal but can approach $166$ bits per deck. The average on my datafile is around $175$ . The particular deck was "well behaved" and only required $168$ bits. Note that we only got a single 4 at the end of the deck but if instead we got all four 4s there, that is a better case and we would have needed only 161 bits to encode that deck, a case where the packing actually beats the entropy of a straight binary encode of the ordinal position of it.

I now have the code implemented to calculate the bit requirements, and it is showing me about 175 bits per deck on average, with a low of 155 and a high of 183 for the 3 million deck test file. So my algorithm seems to use 9 extra bits per deck vs. the straight binary encode of the ordinal position method. Not too bad at only about 5.5% additional storage space required. 176 bits is exactly 22 bytes, so that is quite a bit better than 52 bytes per deck. The best case deck (which didn't show up in the 3 million deck test file) packs to 136 bits, and the worst case (which did show up in the testfile, 8206 times) is 183 bits. Analysis shows the worst case is when we don't get the first quad until close to (or at) card 40. Then, as the encode mode wants to drop quickly, we are "stuck" filling blocks (as large as 7 cards) in a higher bit encoding mode. One might think that not getting any quads until card 40 would be quite rare with a well shuffled deck, but my program is telling me it happened 321 times in the testfile of 3 million decks, so that is about 1 out of every 9346 decks. That is more often than I would have expected. I could check for this case and handle it with fewer bits, but it is so rare that it wouldn't affect the average bits enough.

Also here is something else very interesting. If I sort the decks in the raw deck data, the length of prefixes that repeat a significant # of times is only about 6 (such as 222244). However with the packed data, that length increases to about 16. That means if I sort the packed data, I should be able to get a significant savings by just indicating to the decoder a 16 bit prefix and then outputting the remainder of each deck (minus the repeating prefix) that has that same prefix, then going on to the next prefix and repeating. Assuming I save even just 10 bits per deck this way, I should beat the 166 bits per deck. With the enumeration technique stated by others, I am not sure if the prefix would be as long as with my algorithm. Also the packing and unpacking speed using my algorithm is surprisingly good. I could make it even faster too by storing powers of 13, 12, 11... in an array and using those instead of expressions like 13^5.

Regarding the 2nd level of compression where I sort the output bitstrings of my algorithm then use "difference" encoding: A very simple method would be to encode the 61,278 unique 16 bit prefixes that show up at least twice in the output data (and a maximum of 89 times reported) simply as a leading bit of 0 in the output to indicate to the 2nd level decompressor that we are encoding a prefix (such as 0000111100001111) and then any packed decks with that same prefix will follow with a 1 leading bit to indicate the non prefix part of the packed deck. The average # of packed decks with the same prefix is about 49 for each prefix, not including the few that are unique (only 1 deck has that particular prefix). It appears I can save about 15 bits per deck using this simple strategy (storing the common prefixes once). So assuming I really do get 15 bit saving per deck and I am already at about 175 bits per deck on the first level packing/compression, that should be a net of about 160 bits per deck, thus beating the 166 bits of the enumeration method.
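A rough way to estimate the saving from this flag-bit scheme is sketched below. It assumes the packed decks are available as strings of '0'/'1' characters and that the order of decks in the file does not need to be preserved (sorting discards it); `prefix_grouped_bits` is a hypothetical helper name, not part of the author's program.

```python
from itertools import groupby

def prefix_grouped_bits(packed, prefix_len=16):
    """Total output bits when each distinct prefix is written once.

    Layout assumed here: '0' + prefix for every distinct prefix, followed by
    '1' + remaining bits for every packed deck that starts with that prefix.
    """
    total = 0
    for _, group in groupby(sorted(packed), key=lambda s: s[:prefix_len]):
        total += 1 + prefix_len                                 # flag + shared prefix
        total += sum(1 + len(s) - prefix_len for s in group)    # flag + suffix per deck
    return total

# Tiny made-up example: two decks share a prefix, one does not.
packed = ["0000111100001111" + "101",
          "0000111100001111" + "0",
          "1111000011110000" + "11"]
print(sum(len(s) for s in packed))   # 54 bits before
print(prefix_grouped_bits(packed))   # 43 bits after
```

Under this layout each deck drops 16 prefix bits and gains 1 flag bit, while each distinct prefix costs 17 bits once, so a prefix shared by m decks saves 15 - 17/m bits per deck. For m around 49 that is a little under 15 bits.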

After the 2nd level of compression using difference (prefix) encoding of the sorted bitstring output of the first encoder, I am now getting about 160 bits per deck. I use length 18 prefix and just store it intact. Since almost all (245013 out of 262144 = 93.5%) of those possible 18 bit prefixes show up, it would be even better to encode the prefixes. Perhaps I can use 2 bits to encode what type of data I have. 00 = regular length 18 prefix stored, 01= "1 up prefix" (same as previous prefix except 1 added), 11 = straight encoding from 1st level packing (approx 175 bits on average). 10=future expansion when I think of something else to encode that will save bits.

Did anyone else beat 160 bits per deck yet? I think I can get mine a little lower with some experimenting and using the 2 bit descriptors I mentioned above. Perhaps it will bottom out at 158ish. My goal is to get it to 156 bits (or better) because that would be 3 bits per card or less. Very impressive. Lots of experimenting to get it down to that level because if I change the first level encoding then I have to retest which is the best 2nd level encoding and there are many combinations to try. Some changes I make may be good for other similar random data but some may be biased towards this dataset. Not really sure but if I get the urge I can try another 3 million deck dataset to see what happens like if I get similar results on it.

One interesting thing (of many) about compression is that you are never quite sure when you have hit the limit or are even approaching it. The entropy limit tells us how many bits we need if ALL possible occurrences of those bits occur about equally often, but as we know, in reality that rarely happens with a large number of bits and a (relatively) small # of trials (such as 3 million random decks vs. almost $10^{50}$ bit combinations of 166 bits).
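For reference, the 166 bit figure used throughout this thread can be reproduced by counting the distinct rank orderings of one deck, assuming the figure stands for the number of bits needed to index $52!/(4!)^{13}$ equally likely orderings:

```python
import math

# Distinct orderings of 52 cards when the 4 cards of each rank are interchangeable.
orderings = math.factorial(52) // math.factorial(4) ** 13

bits_per_deck = math.log2(orderings)
print(orderings.bit_length())        # 166 bits are enough to index every ordering
print(round(bits_per_deck / 52, 3))  # 3.192 bits per card on average
```

This is the "about $3.2$ bits per card" limit mentioned in the question, expressed per deck.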

Does anyone have any ideas on how to make my algorithm better like what other cases I should encode that would reduce bits of storage for each deck on average? Anyone?

2 more things: 1) I am somewhat disappointed that more people didn't upvote my solution which, although not optimal on space, is still decent and fairly easy to implement (I got mine working fine). 2) I did an analysis of my 3 million deck datafile and noticed that the most frequently occurring card position where the 1st rank fills (such as 4444) is card 26. This happens about 6.711% of the time (for 201322 of the 3 million decks). I was hoping to use this info to compress more, such as by starting out in 12 symbol encode mode since we know that on average we won't see every rank until about middeck, but this method failed to compress anything, as its overhead exceeded the savings. I am looking for some tweaks to my algorithm that can actually save bits.

So does anyone have any ideas what I should try next to save a few bits per deck using my algorithm? I am looking for a pattern that happens frequently enough so that I can reduce the bits per deck even after the extra overhead of telling the decoder what pattern to expect. I was thinking something with the expected probabilities of the remaining unseen cards and lumping all the single card remaining ones into a single bucket. This will allow me to drop into a lower encode mode quicker and maybe save some bits but I doubt it.

Also, F.Y.I., I generated 10 million random shuffles and stored them in a database for easy analysis. Only 488 of them end in a quad (such as 5555). If I pack just those using my algorithm, I get 165.71712 bits on average with a low of 157 bits and a high of 173 bits. Just slightly below the 166 bits using the other encoding method. I am somewhat surprised at how infrequent this case is (about 1 out of every 20,492 shuffles on average).
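The observed frequency of decks ending in a quad can be compared with a direct calculation. In a uniformly shuffled deck the last four cards are a uniformly random set of four of the 52 cards, and 13 of those sets are quads:

```python
from fractions import Fraction
from math import comb

# 13 quads out of C(52, 4) possible sets of four final cards
p_quad_end = Fraction(13, comb(52, 4))
print(p_quad_end)                      # 1/20825
print(float(10_000_000 * p_quad_end))  # expected count in 10 million shuffles, about 480
```

So roughly 1 deck in 20,825 should end in a quad, which is in line with the 488 out of 10 million reported above.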

— David James

3

I notice that you've made about 24 edits in the space of 9 hours. I appreciate your desire to improve your answer. However, each time you edit the answer, it bumps this to the top of the front page. For that reason, we discourage excessive editing. If you expect to make many edits, would it be possible to batch up your edits, so you only make one edit every few hours? (Incidentally, note that putting "EDIT:" and "UPDATE:" in your answer is usually poor style. See meta.cs.stackexchange.com/q/657/755.)

— D.W.

4

This is not the place to put progress reports, status updates, or blog items. We want fully-formed answers, not "coming soon" or "I have a solution but I'm not going to describe what it is".

— D.W.

3

If someone is interested he will find the improved solution. The best way is to wait for full answer and post it then. If you have some updates a blog would do. I do not encourage this, but if you really must (I do not see valid reason why) you can write comment below your post and merge later. I also encourage you to delete all obsolete comments and incorporate them into one seamless question - it gets hard to read all. I try to make my own algorithm, different than any presented, but I am not happy with the results - so I do not post partials to be edited - the answer box is for full ones.

— Evil

3

@DavidJames, I do understand. However, that still doesn't change our guidelines: please don't make so many edits. (If you'd like to propose improvements to the website, feel free to make a post on our Computer Science Meta or on meta.stackexchange.com suggesting it. Devs don't read this comment thread.) But in the meantime, we work with the software we have, and making many edits is discouraged because it bumps the question to the top. At this point, limiting yourself to one edit per day might be a good guideline to shoot for. Feel free to use offline editors or StackEdit if that helps!

— D.W.

3

I'm not upvoting your answer for several reasons. 1) It is needlessly long and FAR too verbose. You can drastically reduce its presentation. 2) There are better answers posted, which you choose to ignore for reasons unbeknownst to me. 3) Asking about lack of upvotes is usually a "red flag" to me. 4) This has constantly remained on the front page due to an INSANE amount of edits.

— Nicholas Mancuso