I recently wrote a post about storing random data as strings in carts and studying the built-in compression algorithm. It led, among other conclusions, to the following two observations:
- PICO-8 is not very good at compressing random strings that use more than 64 different characters; those actually get expanded, by a factor of about log₂(n)/6. For instance a random string of length 1000 that uses 128 different characters will use about 1000*log₂(128)/6 ≈ 1167 bytes in the cart (a quick check of this arithmetic follows the list).
- The new compression format is redundant and allows for sequences that are not needed. For instance 011 01111 001 and 010 0000001111 001 both encode a back reference of length 4 at offset -16.
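A quick check of the arithmetic in the first bullet; this is just the same formula restated as Lua, nothing new:

-- approximate cart bytes for a random string of `len` characters drawn
-- from `n` distinct values, using the ~log2(n)/6 expansion factor above
local function expected_bytes(len, n)
  return len * (math.log(n) / math.log(2)) / 6
end
print(expected_bytes(1000, 128)) -- ~1166.7, i.e. about 1167 bytes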
Most other compression schemes have some sort of fallback mechanism when they are unable to compress a block. For instance the DEFLATE format (for zip or gzip) has non-compressed blocks. In the above thread I made a suggestion to extend the compression format in a similar way, but here I have a much simpler proposal:
- 010 00000 marks the start of a zero-terminated sequence of uncompressed bytes
This works because:
- 010 00000 never happens in a compressed stream; anything that would start with that sequence could start with 011 instead and be much shorter
- the null byte is not allowed in the code
- the overhead is only two bytes; it can be more efficient than the current format for some strings as short as 7 characters
- it should be easy to adapt the existing compressor so that when it realises the last X bytes have compressed badly, it backtracks and stores those bytes verbatim instead (a rough sketch of the emitting side follows this list)
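For illustration, here is a rough sketch of what emitting such a raw block could look like; write_bits/write_byte are hypothetical helpers that just collect bits in the order shown above, and the real bit packing in the cart is glossed over:

-- sketch only: collect output bits in a table so the example is
-- self-contained; a real compressor packs them into bytes instead
local bits = {}
local function write_bits(s)  -- s is a string of "0"/"1" characters
  for i = 1, #s do bits[#bits + 1] = s:sub(i, i) end
end
local function write_byte(b)  -- 8 bits per byte, order simplified here
  for i = 7, 0, -1 do bits[#bits + 1] = tostring((b >> i) & 1) end
end

local function write_raw_block(bytes)
  write_bits("010")           -- back-reference header...
  write_bits("00000")         -- ...with an all-zero offset: the proposed marker
  for i = 1, #bytes do
    assert(bytes[i] ~= 0)     -- the null byte never appears in code, so it is free
    write_byte(bytes[i])      -- each byte stored verbatim
  end
  write_byte(0)               -- zero terminator closes the block
end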
Added for 0.2.0j. Unfortunately the marker needs to be 010 00000 00000 because low-end bits are stored first. But I think the real value of this is being able to have a small number of large binary strings -- effectively being able to resize the code partition to 0x4300-#data_string. As such, the marker overhead doesn't matter so much. Also -- it's nice that binary strings end up not being the most efficient way to store high entropy data, heh. Spooky!
Deciding where to place raw blocks is an interesting problem. For the sake of simplicity I'm just checking the ratio every 32 bytes written, and then starting / appending / finishing the binary block accordingly. This is fine for the large end-of-code style strings, and existing cart data doesn't look like it would benefit much from trying to catch and encode small sections more precisely. But out of curiosity does anyone know a nice algorithm for this?
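Not PICO-8's actual code, but the kind of check described above could look roughly like this (the names and the way costs are passed in are guesses):

-- every CHECK_EVERY bytes, compare what normal compression spent (or
-- would have spent) on those bytes against the 8 bits/byte a raw block costs
local CHECK_EVERY = 32
local RAW_BITS    = 8

local function decide(bits_spent, in_raw_block)
  local raw_cost = CHECK_EVERY * RAW_BITS
  if not in_raw_block and bits_spent > raw_cost then
    return "start_raw"    -- compression is losing: start/append a raw block
  elseif in_raw_block and bits_spent < raw_cost then
    return "close_raw"    -- compression would win again: finish the block
  end
  return "continue"
end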
On a related note, I've also tried to improve the workflow for encoding binary strings. Characters 0, 10, 13, " and \ need to be escaped, but on average this gives a ratio of 262/256 using the following encoding for each character:
str="\0¹²³⁴⁵⁶⁷⁸ \nᵇᶜ\13ᵉᶠ▮■□⁙⁘‖◀▶「」¥•、。゛゜ !\"#$%&'()*+,-./0123456789:;<=>?@𝘢𝘣𝘤𝘥𝘦𝘧𝘨𝘩𝘪𝘫𝘬𝘭𝘮𝘯𝘰𝘱𝘲𝘳𝘴𝘵𝘶𝘷𝘸𝘹𝘺𝘻[\]^_`abcdefghijklmnopqrstuvwxyz{|}~○█▒🐱⬇️░✽●♥☉웃⌂⬅️😐♪🅾️◆…➡️★⧗⬆️ˇ∧❎▤▥あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをんっゃゅょアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲンッャュョ◜◝"
(the superscript unicode replacements for control characters 1~15 are also new, to make transmitting those characters a little easier)
Storing raw data in strings like this means ord(str,index) can be used as a drop-in replacement for peek() for use with decoders like px9.
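For example (PICO-8 code; px9's own call is left out since its exact interface isn't the point here, only that the two readers are interchangeable):

local data = "..."  -- binary string built with the encoding above

local base = 0x0    -- wherever the equivalent data would sit in memory
local function read_mem(i) return peek(base + i) end    -- memory-backed, 0-indexed offset
local function read_str(i) return ord(data, i + 1) end  -- string-backed, ord() is 1-indexed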
(edit: tried to make the question about when to start/stop raw data blocks clearer)
@zep, @sam - interesting, perhaps this uncompressed block can go further and have each byte that follows (from 0 to 255) get escaped if it can't be placed in a string as-is?
Aka characters #0, #10, #13, " and \ would become their escaped variants \0, \n, \r, \" and \\ instead of the characters themselves.
Sure, that means that uncompressed blocks can only be used inside strings - but really, there's no sane reason to have uncompressed blocks be used instead of compressed blocks outside of strings, is there?
And the benefit would be that all 256 bytes can be placed in the code region without any overhead.
(And of course the uncompressed block would need to specify a size instead of being zero-terminated.)
To clarify - neither the compressor nor decompressor need to be able to parse or tokenize the code; the above is merely a transformation. The only extra complexity is that the compressor can only produce uncompressed blocks if all bytes in the block are one of \0, \n, \r, \", \\, or the other 251 bytes not needing escapes.
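In other words, the decompressor-side transformation would be something like this (illustration only, not an implementation):

-- the raw block stores plain bytes 0..255; when writing code text back
-- out, the five special bytes expand into their escape sequences
local expand = {
  [0]  = "\\0",   -- null
  [10] = "\\n",   -- newline
  [13] = "\\r",   -- carriage return
  [34] = "\\\"",  -- double quote
  [92] = "\\\\",  -- backslash
}
local function code_text_for(byte)
  return expand[byte] or string.char(byte)
end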
Though there is one issue with requiring the use of "\0" instead of null in strings in general - you can't follow a null byte (aka \0 character) with a digit character (0-9) without having to add an extra two 0 characters to avoid confusing the string tokenizer - ew!
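Concretely, in Lua a decimal escape grabs up to three digits, so:

print(#"\06")    -- 1: a single byte with value 6, not byte 0 followed by "6"
print(#"\0006")  -- 2: byte 0 followed by the character "6" (two padding zeros needed)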
Some potential solutions:
- Add a new letter escape (like \n and \r and friends) for byte 0. (A bit ad-hoc, yeah). Maybe \o (the letter)?
- Do bake the "add extra 0 padding between \0 and digits" into the compression/uncompression logic (ew!)
- Something else?
Oh wow, reconstructing strings from pure binary data is evil genius, hehe. A similar thing could be done with e.g. strings that only contain hex characters.
It's certainly true that uncompressed blocks normally won't need to appear outside of strings (excluding diabolical examples), so it's tempting to treat them in a special way. I think I'd rather live with the minor inefficiency of escaped characters though -- it also feels weird to require the compressor to be aware of strings, and to remove some information (because the same string can be written in multiple ways). I think the question is just which lesser ickiness to choose, so I'd go for the simpler one.
Good point about \0,\10,\13. Along the same lines, I'd just go for baking in the extra 0's when the following character is a number (so an extra 6+4*10/256 characters stored per 256), and include an ugly encoding snippet with recommended workflow.
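Not the actual snippet that would ship, but an encoder along those lines might look like this (desktop Lua; it escapes the five bytes that need it and pads the decimal escapes whenever the next character is a digit):

-- turn raw bytes into a string literal that survives the tokenizer
local esc = { [0]="\\0", [10]="\\n", [13]="\\13", [34]="\\\"", [92]="\\\\" }
local pad = { ["\\0"]="\\000", ["\\13"]="\\013" }  -- padded forms for digit-adjacent escapes

local function escape_bin(s)
  local out = {}
  for i = 1, #s do
    local e = esc[s:byte(i)]
    if e then
      -- decimal escapes must grow to 3 digits when a digit follows
      if pad[e] and s:sub(i + 1, i + 1):match("%d") then e = pad[e] end
      out[#out + 1] = e
    else
      out[#out + 1] = s:sub(i, i)
    end
  end
  return '"' .. table.concat(out) .. '"'
end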
@zep thanks, that was really fast! I haven’t implemented compression in my tools yet, but here’s my probable strategy:
- observe that a raw block uses 21+8n bits where n is the number of characters
- while parsing the source code, even when not compressing, compute the compression cost per character minus 8 (so for a character that uses 6 bits, return -2; for a character that costs 10 bits, return +2; for a back reference of length 5 that costs 16 bits, return 16-5×8 = -24)
- use two instances of Kadane’s algorithm to keep track of Kmax, the maximum subarray cost, and Kmin, the minimum subarray cost
- use Kmin and Kmax as decision variables for when to switch between raw block and standard compression (for instance, when Kmax reaches +22 it means that starting a raw block is more efficient than compressing; when Kmin reaches -22 it means that closing the raw block is more efficient; everything in between is a bit too tricky to get right in this post, so I'll probably post some proper code later, but a rough sketch of the bookkeeping follows below)
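Rough, untested sketch of that bookkeeping (the in-between cases and the actual backtracking are simplified away):

-- running kadane-style sums over the per-character values defined above
-- (compressed cost in bits minus 8); 21 is the marker + terminator overhead
local OVERHEAD = 21

local kmax, kmin = 0, 0   -- max/min subarray sums ending at the current position
local in_raw = false

local function step(cost_minus_8)
  kmax = math.max(0, kmax + cost_minus_8)
  kmin = math.min(0, kmin + cost_minus_8)
  if not in_raw and kmax > OVERHEAD then
    in_raw = true         -- the recent span costs more than raw + overhead: open a raw block
    kmax, kmin = 0, 0
  elseif in_raw and kmin < -OVERHEAD then
    in_raw = false        -- compressing the recent span beats raw + overhead: close it
    kmax, kmin = 0, 0
  end
  return in_raw
end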
By the way, any reason for using \10 and \13 instead of \r and \n?
Finally, a note that, if \0 is not required, [[abc...def]] strings could be another good way to store large verbatim strings, since they do not require escaping (if "]]" appears, just use [=[abc...def]=] instead); unfortunately PICO-8 does not handle that syntax very well (bug reports here, here, here)