An earlier attempt at squeezing a dictionary into a single PICO-8 cartridge reached 7,500 words (6 letters maximum) using 12k of cart data by storing the distances between base-26 numbers representing words. Most word games don't need to search through a dictionary though -- they just need to know if a word exists or not.
This cartridge can give correct positive tests for the first 40,000 most common English words (using this dataset). When a word does not exist, it is correct 99% of the time -- so if you give it a garbage word ("wefuhiwuehf") it will claim it is a word 1% of the time. If given input that contains only triples of characters that appear somewhere in the 40k words ("thequo"), that accuracy drops to ~87%.
is_word_40k uses 22k of data -- a combination of gfx, map, sfx and code. So there isn't much space left over to actually make a game! The low accuracy for non-words and the limited code space are constraints that will only suit certain types of games that rely on having a large dictionary (e.g. something similar to Scrabble).
How to Use It
To load the cart, type: load #is_word_40k
Call load_dictionary() once, and then is_word("blah") to see if the word probably exists.
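A minimal usage sketch (load_dictionary() and is_word() come from the cart itself):

load_dictionary() -- call once at startup

if is_word("quiz") then
 print("probably a word")
else
 print("definitely not a word")
end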
How it Works
There are two filters that a word must pass through to be considered legitimate. The first one is a Bloom Filter; a 162656-bit bitfield that is generated by hashing each of the words in the source word list 3 times. Each hash function gives a different location in the bitfield to set. If any of the 3 bits are not set when checking to see if a word is present, that means it's definitely not in the dictionary. But if all 3 are set, it means the word is /probably/ in the dictionary (it's possible that other words happened to hash to those 3 locations in the bitfield).
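In code, the first test looks something like this -- a sketch, not the cart's exact source; hash1..hash3 and get_bit() are stand-in names, where each hash maps a word to a bit position and get_bit(i) tests whether bit i is set:

function bloom_check(w)
 -- a word is only "probably present" if
 -- all 3 hashed bit positions are set
 -- (note: positions can exceed 32767, so a
 -- real implementation has to encode them
 -- carefully in pico-8's 16.16 numbers)
 for h in all({hash1,hash2,hash3}) do
  if not get_bit(h(w)) then
   return false -- definitely not in the list
  end
 end
 return true -- probably in the list
end

As a sanity check on the numbers: hashing 40,000 words 3 times into 162,656 bits gives an expected false positive rate of roughly (1-e^(-3*40000/162656))^3, about 14% -- the same ballpark as the ~87% accuracy quoted above for garbage made of legal triples.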
The second filter is a bitfield where each bit means that a corresponding 3-character triple ("aaa","aab"..) appears somewhere in the source word list. This can be used to discard a lot of random words ("skuygnfckuesf") that would otherwise pass the bloom filter test, but is not able to filter out garbage words that are made up of triples that appear somewhere in the dictionary. This category of words is kind of interesting by itself -- it is essentially a silly word test. "blaphin" would pass the second filter because "bla","lap","aph","phi", and "hin" all appear in other legal words.
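The second test looks something like this (again a sketch with stand-in names; get_triple_bit(i) tests one bit of the 26^3 = 17,576-bit field, and letters are assumed to be lowercase a-z):

function triples_ok(w)
 for i=1,#w-2 do
  -- map each 3-letter window to 0..17575
  local a=ord(w,i)-97
  local b=ord(w,i+1)-97
  local c=ord(w,i+2)-97
  if not get_triple_bit(a*676+b*26+c) then
   return false -- no word uses this triple
  end
 end
 return true
end

Words shorter than 3 letters pass trivially, since the loop body never runs.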
Here's the code I used to generate the cartridge data:
load #is_word_builder
save is_word_builder
(EDIT: added the save so that the cartridge becomes local and saves dat.p8 locally instead of in {appdata}/pico-8/cstore)
Run it, drop a word list file into it (1 word per line), and you'll get:
- a data cartridge called dat.p8
- extra data in the clipboard
The clipboard data should replace bindat in dat.p8. Press CTRL-P before pasting to get the right character encoding (ascii uppercase -> punyfont).
Hi zep!
I was wondering if this (and is_word_builder) still work on the latest version? I'm trying to incorporate a limited word check into a scrabble-ish game prototype, but is_word_40k seemed to fail on many common words, so I thought I'd generate a new lookup dict using is_word_builder. However, it always generates a 0-byte dat.p8 file, and the ctrl-p-then-paste result is always just a long string of \0's.
My latest attempt was with a .txt file containing ~25K words (all 6 letters or under). Any thoughts?
Thanks!
Hi @tmirobot. This may help, if you want them in order of most used:
https://strommeninc.com/1000-most-common-words-in-english-strommen-languages/
I have already created a PICO-8 cart with this for what I want, but have not posted it yet -- I'm saving it for something else. Not a word puzzle.
Good luck !
Hi @dw817!
Unfortunately my goal is to check played words against a robust list of valid words, so I need a high # of words -- the order doesn't matter much. The game does have some built-in limits on word length and types of words (currently 6 letters or less, and some letter configurations are not possible), so it should be able to function with around 20-25K words. I'm just looking for a way to condense that amount of text into a PICO-8 cart, which @zep seems to have found a clever way of doing (or approximating :D)
Thanks though!
Hi @tmirobot
It looks like it works the same under 0.2.4, but I did find a problem with the workflow: if you're running a BBS cartridge, it will cstore data files to {appdata}/pico-8/cstore, not to the current directory! So it's better to save is_word_builder locally first:
load #is_word_builder
save is_word_builder
Here's the word list I was using -- let me know if you still get a zero-byte output with this:
https://www.lexaloffle.com/dl/temp/en_40k.txt
Edit: that wordlist is also derived from: https://github.com/hermitdave/FrequencyWords
p.s. limiting the word list to words of 6 characters or less is probably a good idea! It would also help a lot to find a better word list that filters out names and other non-scrabbleish words.
It appears my problem was that each line had a hidden carriage return AND line feed, where your parser expects only a line feed. Once I pulled the excess carriage returns out, it parsed properly. I was then able to combine all the exported data and clipboard contents with my lua, and at first glance it looks like it's working.
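In case it helps anyone else, the kind of fix involved is stripping the trailing carriage return (char 13) from each line before parsing -- a sketch, not the builder's actual code:

function strip_cr(line)
 if sub(line,-1)==chr(13) then
  return sub(line,1,-2)
 end
 return line
end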
I have dictionaries of all words 5 or less (20,093 words) as well as 6 or less (43,125). I ended up using 6 or less, but also with words culled that cannot be generated via the rules of the game I'm making.
Seems like I'll be able to get under the compressed size limit with some minifying of my rather verbose code, even with a pretty full feature set (it's a very basic game).
Thanks so much for your help!
@zep On implementation, I noticed that some words in the dictionary were failing the is_word() check. I'm still trying to work out where the failure occurs -- I'm not sure if it's an issue with the hash functions or the set/get, but there's something about certain words that it will not set (or get?) properly.
I'm still trying to learn about how the hash functions and set/gets work, but I've gone back to is_word_builder to simplify testing, with these results:
A dictionary with just the words bar, cat, dog, hen, milk will return true from check_word() ONLY for cat and dog. bar, hen, and milk all fail.
It's definitely not an issue with the triples check, as you can remove that and it still fails. It's also not about the particular order, words, etc., as you can see the same words fail in a dictionary with 40K other words.
I can't find a pattern yet (bars, bat, car, care, cars will all return as words properly if added to the same dictionary), so I'm still investigating. My understanding of hash functions and bit poke/peeking is still very limited though, so I have a ways to go yet!
Will update with any findings.
Hi @zep,
I consulted with someone more code-savvy than I, and together we think we found the problem. In get_bit(x), you're checking whether the return value is > 0, when we think you meant to check ~= 0.
function get_bit(x)
 return bf[x\4] &
  (0x0.0001<<((x*8)%32)) > 0
end
With a > 0 check, any word whose hash set only the sign bit of a field entry (a masked value of 0x8000.0000, which PICO-8 stores as -32768) would fail the get_bit check, because the result is negative.
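For reference, here is the same code with only that comparison changed:

function get_bit(x)
 return bf[x\4] &
  (0x0.0001<<((x*8)%32)) ~= 0 -- was: > 0
end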
Thanks again for encoding such a clever solution to a common pico problem!