# Knowasiak

Knowledge Social Network

# Game Boy Wordle clone: How to compress 12972 five-letter words to 17871 bytes

71

Update: You can play an updated version online here in the binjgb Game Boy emulator. This is the version with the frequency-based answer list rather than the official Wordle list, for copyright reasons.

There is a Game
Boy version
of Wordle, using a bloom filter, a reduced vocabulary
and a reduced list of guess words, all fitting on one 32K cartridge. I
decided to challenge myself and see if I could fit in the whole 12972
word Wordle vocabulary, with the whole 2315 word answer list. So the
challenge is:

• Compress 12972 five-letter words (Vocabulary)

• Compress a distinguished 2315 word subset (Answers).

I managed it
works in a Game Boy emulator. There is more than one way, and what I did
may be excessively complicated, but I don’t have a good feel for how
fast the Game Boy runs, so I did a bit of speed optimization.

Step 0: We start with 12972 × 5 = 64860 bytes of uncompressed
data.

Step 1: Divide the 12972 word list into 26 lists,
based on the first letter of the word. Since in each list, the first
letter is the same, we now need only store four letters per word, along
with some overhead for each list. (The overhead in the final analysis
will be 108 bytes.) If we stop here,

Step 2: Each four letter “word” (or tail of a word)
can be stored with 5 bits per letter, thereby yielding a 20 bit unsigned
integer. If we stop here, we can store each word in 2.5 bytes, for a
total of 32430. That would fit on the cartridge if there was no code,
but it is some progress.

Step 3: Here was my one clever idea. Each of the
lists of four letter “words”, is in alphabetical order, and encoded the
natural way as 20 bit numbers, the numbers will be in ascending order.
Instead of storing these numbers, we need only store their arithmetical
differences, starting with an initial (invalid) 0.

Step 4: Since the differences are always at least 1,
we can subtract one from each difference to make the numbers slightly
smaller. (This is a needless complication, but I had it, and don’t feel
like removing it.)

Step 5: Store a stream of bytes encoding the
difference-minus-ones. Each number is encoded as one, two or three
bytes, seven-bits in each byte, with the high bit of each byte being 1
if it’s the last 7-bit sequence and 0 if it’s not. It turns out that the
result is 17763 bytes, plus 108 bytes of overhead, for a total of 17871
bytes, or 28% of the original list, with very, very simple
decompression.

Step 6: Now we replace each word in the
alphabetically-sorted Answers list with an index into the vocabulary
list. Since each index fits into 2 bytes, this would let us store the
2315 words of the Answers as 2315 × 2 = 4630 bytes.

Step 7: However, it turns out that the difference
between two successive indexes is never bigger than 62. So we can re-use
the trick of storing successive differences, and store the Answers in
2315 bytes. (In fact, since we only need 6 bits for the differences, we
could go down to 1737 bytes, but it would complicate the code
significantly.)

Result: Vocabulary plus Answers goes down to
108+17763+2315=20186 bytes. This was too big to fit on a 32K cartridge
using the existing code. But it turns out that most of the existing code
was library support code for gprintf(), and replacing the
single gprintf() call, which was just being used to format
a string containing a single-digit integer variable, with
gprint(), seemed to get everything to fit in 32K.

Example of the Vocabulary compression:

• The first six words are: aahed, aalii, aargh, aarti, abaca,
abaci.

• Dropping the initial “a”, we get ahed, alii, argh, arti, baca,
baci.

• Encoding as 20-bit integers and adding an initial zero, we get 0,
7299, 11528, 17607, 18024, 32832, 32840.

• The differences-minus-one are 7298, 4228, 6078, 416, 14807,
7.

• Each of these fits in 14-bits (two bytes, given the high-bit
usage), with the last one in 7-bits. In practice, there are a lot of
differences that fit in 7-bits, so this ends up being more efficient
than it looks—the first six words are not representative.

Notes:

• With the code as described above, there are 250 bytes to spare in the cartridge.

• One might wonder whether making up the compression algorithm
saves much memory over using a standard general purpose compressor. Yes.
gzip run on the 64860 bytes of uncompressed Vocabulary
yields 30338 bytes, which is rather worse than my 17871 byte compression
of the Vocabulary. Plus the decompression code would, I expect, be quite
a bit more complex.

• One could save a little memory by encoding the four-letter
“words” in Step 2 in base-26 instead of four 5-bit sequences. But it
would save only about 0.5K of memory, and the code would be much nastier
(the Game Boy uses library functions for division!).

• The Answers could be stored as a bitmap of length 12972, which
would be 1622 bytes. But this would make the code for generating a
random word more complicated and slower.

WRITTEN BY

## Vanic

“Simplicity, patience, compassion.
These three are your greatest treasures.
Simple in actions and thoughts, you return to the source of being.
Patient with both friends and enemies,
you accord with the way things are.
Compassionate toward yourself,
you reconcile all beings in the world.”
― Lao Tzu, Tao Te Ching