Part 2: https://www.normansoven.com/post/obessed-with-making-a-generalized-suffix-indextree-memory-efficient-part-2
I am so obsessed that I threw away the Suffix Index and pursued its successor, the DAWG. Well, the compact version anyway: the CDAWG!
Note: DAWG stands for Directed Acyclic Word Graph, also called a Minimal Acyclic Finite State Automaton (MAFSA) or Minimal Acyclic Deterministic Finite Automaton (MADFA). CDAWG means Compact Directed Acyclic Word Graph.
Story
Oh man, I could not believe my eyes when I saw the results of indexing 3M 128-character alphanumeric words. If you have seen Part 2, you will know I did not go above 300k words on a Suffix Tree, which I partly blame on libART storing the original words. Hah, it has been a while now, but I can finally go up to 3M words without hitting prohibitive memory limits!
What is a DAWG? Why a DAWG?
My previous post goes over it somewhat, but I had been looking at research papers around Generalized Suffix Trees and implemented a variant of MergedTrie, and I was not satisfied with it. The DAWG started to look seductive when I began looking at finite state transducers (FSTs), given how often it comes up in that context. The problem I have with FSTs is that they look very complicated and they cannot be built online/in real time. You build one up, and if you want to modify it, you have to build it up all over again. Not worth the time and headache.
The DAWG looks just like a Suffix Tree. It deceptively does not appear to be one, since it looks like a general graph where every arrow converges to a target node, but it is: it is a minimized version of one. I read about the DAWG on Steve Hanov's blog (http://stevehanov.ca/blog/?id=115), but did not pay attention to it at the time since it looked so esoteric. You can replace a Radix Index and a Suffix Index with a DAWG for some extreme memory savings.
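To make that concrete, here is a toy sketch (my own illustration, not the SeaDAWG code) of the converging shape: lookups walk edges exactly like a trie, but equivalent suffixes point at the same shared node instead of duplicating it.

```rust
use std::collections::HashMap;

// Toy DAWG shape (not the online construction algorithm): nodes are shared
// wherever their outgoing structure is identical, so different words can
// converge onto the same suffix nodes.
#[derive(Default)]
struct Node {
    edges: HashMap<char, usize>, // letter -> target node id
    terminal: bool,
}

struct Dawg {
    nodes: Vec<Node>, // node id 0 is the root
}

impl Dawg {
    fn new() -> Self {
        Dawg { nodes: vec![Node::default()] }
    }

    // Lookup works exactly like a trie: follow one edge per character.
    fn contains(&self, word: &str) -> bool {
        let mut cur = 0usize;
        for ch in word.chars() {
            match self.nodes[cur].edges.get(&ch) {
                Some(&next) => cur = next,
                None => return false,
            }
        }
        self.nodes[cur].terminal
    }
}

fn main() {
    // Hand-wired toy: "tap" and "top" share the same final 'p' node.
    let mut d = Dawg::new();
    for _ in 0..4 {
        d.nodes.push(Node::default());
    }
    d.nodes[4].terminal = true;          // the single shared accepting node
    d.nodes[0].edges.insert('t', 1);
    d.nodes[1].edges.insert('a', 2);
    d.nodes[1].edges.insert('o', 3);
    d.nodes[2].edges.insert('p', 4);     // "ta" -> shared 'p'
    d.nodes[3].edges.insert('p', 4);     // "to" -> same shared 'p'
    assert!(d.contains("tap") && d.contains("top"));
    assert!(!d.contains("tip"));
}
```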
After some time, I started reading some research papers on the DAWG and saw it could not be built online and required input to be sorted, so I binned it. I picked it up again after some time, saw CDAWGs, and again binned it and moved on to the Merged Trie. After the Merged Trie, I became more serious about figuring out what a DAWG was and whether it could actually be built online. Hah, I found the paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.25.474&rep=rep1&type=pdf and wowza, Greek symbols. Quite a bit of pseudocode though. Eventually I figured out the major symbols (this is a continuous process) and could read the paper better (yes, you'll have to read parts of it several times and use Ctrl+F). I grabbed some paper and tried to execute the pseudocode by hand. Failure!
After about 3 failures, I decided to move it to a computer and fail there instead. Debugging failing code was easier to repeat than trying on paper. I eventually found out the pseudocode notation was just too wonky for me, which is why I failed so much, but things slowly started to make sense. I'd say it took about 3-4 days to get from research paper to code; working code took about a day. I also added a thrashing tool so I could bombard the data structure with many random characters to find issues. This was very, very helpful, especially once I added a way to save the words that caused the bad state to occur.
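The thrash loop is roughly the shape below, assuming a generic insert/lookup API; the `SuffixIndex` trait and its method names are placeholders, not the real SeaDAWG interface, and it uses the `rand` crate to generate the 128-character words.

```rust
use rand::{distributions::Alphanumeric, Rng};

// Placeholder trait standing in for the real structure's API.
trait SuffixIndex {
    fn insert(&mut self, word: &str);
    fn contains_substring(&self, s: &str) -> bool;
}

fn thrash<T: SuffixIndex>(index: &mut T, rounds: usize) {
    let mut history: Vec<String> = Vec::new();

    for _ in 0..rounds {
        // Random 128-character alphanumeric word, like the benchmark input.
        let word: String = rand::thread_rng()
            .sample_iter(&Alphanumeric)
            .take(128)
            .map(char::from)
            .collect();

        history.push(word.clone());
        index.insert(&word);

        // Re-check every word inserted so far (quadratic, but fine for a
        // fuzz tool). If the structure has gone into a bad state, dump the
        // exact word sequence that reproduces it before panicking.
        for w in &history {
            if !index.contains_substring(w) {
                std::fs::write("bad_words.txt", history.join("\n"))
                    .expect("failed to save repro words");
                panic!("bad state after {} inserts", history.len());
            }
        }
    }
}
```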
Long story short, I wrote an online CDAWG, first in TypeScript and now in Rust. I put the Rust version to the test!
Remember the Suffix Index took 7.8GB for 300k words.
SeaDAWG Rust Results
30K, 128B random words
REAL: 26.6MB, MEM: 19.5MB
Finished loading 30000 items 590.789089ms
300K, 128B random words
REAL: 218.6MB, MEM: 185.8MB
Finished loading 300000 items 5.850197276s
3M, 128B random words
REAL: 1.68GB, MEM: 1.10GB
Finished loading 3000000 items 58.837697585s
This is the holy grail of string processing for me! Can you believe the paper came out 15-19 years ago? What the hell are people doing? This is a marvel! A Suffix Index at 3M words would have taken an estimated 78GB; this data structure did it in 1.68GB!!!!! That's 70x memory savings! And it did it even faster than the Suffix Index! That's crazy, insane, out of this world! 🤩
The amount of memory it takes isn't linear either, which means if I go to 30M 128-character words it will not climb to 16GB. I know there is definitely more memory to be saved here, and I will find it! I did make use of string interning to lessen the impact of string storage. Recall that these words are a pathological case, meaning even more memory is saved when indexing natural language.
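String interning here just means storing each distinct string once and handing out a small integer ID to reference it everywhere else. A minimal version of the idea (a sketch, not SeaDAWG's actual interner) looks like this:

```rust
use std::collections::HashMap;

// Minimal string interner: each distinct string is stored once and
// referenced everywhere else by a small integer id.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }

    fn resolve(&self, id: u32) -> &str {
        &self.strings[id as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("banana");
    let b = interner.intern("banana"); // same id, no second copy stored
    assert_eq!(a, b);
    assert_eq!(interner.resolve(a), "banana");
}
```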
The only con is that there is no proven deletion algorithm for a DAWG that I have seen, so I have an experimental one instead. I hope to improve it so that it restores the proper CDAWG state. Right now, I am focused on implementing it broadly in Java, TypeScript, and Rust.
Update 1: I added FNV as the hasher and bam, build time was cut by at least 35%.
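In Rust the hasher swap is just a type change on the map; with the `fnv` crate it looks something like this (the edge-map shape here is illustrative, not the actual SeaDAWG field):

```rust
use std::collections::HashMap;
use fnv::FnvBuildHasher;

// std's HashMap defaults to SipHash, which is DoS-resistant but relatively
// slow for tiny keys like single-character edge labels. Swapping the hasher
// is just the third type parameter.
type EdgeMap = HashMap<char, u64, FnvBuildHasher>;

fn main() {
    let mut edges: EdgeMap = EdgeMap::default();
    edges.insert('a', 1);
    edges.insert('b', 2);
    assert_eq!(edges.get(&'a'), Some(&1));
}
```

The later moves to ahash (Update 2) and fxhash (Update 3) slot in the same way, through each crate's own `BuildHasher` type.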
Update 2: I moved to ahash, removed string interning (after taking a closer look at the data structure; I should look at creating my own interner), and removed indexes from the edges.
3M, 128B random words
REAL: 1.42GB, MEM: 977MB
Finished loading 3000000 items 23.281549281s
Saved a little more memory, but cut build time by over 50% on average. I have an idea that combines my ID allocator with a Vec so I can cut the hashmaps of edges and nodes, which should give faster random access and probably more memory savings (see the sketch below). I wish there were a good way to look at heap object statistics in Rust.
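Roughly what I have in mind, as a sketch rather than final code: the ID allocator hands out dense indexes (recycling freed ones), and nodes/edges live in a plain Vec indexed by those IDs instead of in a hashmap.

```rust
// Dense-id allocator: ids index straight into a Vec, so lookups are a
// bounds check instead of a hash, and freed slots get recycled.
struct IdAllocator {
    next: usize,
    free: Vec<usize>,
}

impl IdAllocator {
    fn new() -> Self {
        IdAllocator { next: 0, free: Vec::new() }
    }

    fn allocate(&mut self) -> usize {
        self.free.pop().unwrap_or_else(|| {
            let id = self.next;
            self.next += 1;
            id
        })
    }

    fn release(&mut self, id: usize) {
        self.free.push(id);
    }
}

// A VecMap in this sense: a Vec of Option slots keyed by allocator ids.
struct VecMap<T> {
    slots: Vec<Option<T>>,
}

impl<T> VecMap<T> {
    fn new() -> Self {
        VecMap { slots: Vec::new() }
    }

    fn insert(&mut self, id: usize, value: T) {
        if id >= self.slots.len() {
            self.slots.resize_with(id + 1, || None);
        }
        self.slots[id] = Some(value);
    }

    fn get(&self, id: usize) -> Option<&T> {
        self.slots.get(id).and_then(|s| s.as_ref())
    }

    fn remove(&mut self, id: usize) -> Option<T> {
        self.slots.get_mut(id).and_then(|s| s.take())
    }
}

fn main() {
    let mut ids = IdAllocator::new();
    let mut nodes: VecMap<&str> = VecMap::new();

    let a = ids.allocate();
    let b = ids.allocate();
    nodes.insert(a, "node a");
    nodes.insert(b, "node b");

    nodes.remove(a);
    ids.release(a);
    let c = ids.allocate(); // reuses the freed slot, keeping the Vec dense
    nodes.insert(c, "node c");

    assert_eq!(nodes.get(c), Some(&"node c"));
}
```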
Update 3: Moved to fxhash, further shrank the data structure fields, and used a VecMap with my ID allocator.
3M, 128B random words
REAL: 911MB, MEM: 909MB
Finished loading 3000000 items 14.566491586s
You can see that removing hashmaps can improve access speed and memory by a ton. Not sure what else I can optimize here. There is one hashmap left (letter-to-edge lookups), but removing it does not look like a win once you weigh the speed against replacing it with a SortedVec with binary search/insert.
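For reference, the SortedVec alternative would look something like this sketch: lookups become a binary search over a dense, sorted Vec, but every insert has to shift entries, which is why it does not look like a clear win over the remaining hashmap.

```rust
// Letter -> edge-id lookup as a Vec kept sorted by letter, queried with
// binary search. Dense and cache-friendly, but inserts shift elements.
struct EdgeList {
    entries: Vec<(char, usize)>, // (letter, edge id), sorted by letter
}

impl EdgeList {
    fn new() -> Self {
        EdgeList { entries: Vec::new() }
    }

    fn insert(&mut self, letter: char, edge: usize) {
        match self.entries.binary_search_by_key(&letter, |&(c, _)| c) {
            Ok(pos) => self.entries[pos].1 = edge,               // replace existing
            Err(pos) => self.entries.insert(pos, (letter, edge)), // shifts the tail
        }
    }

    fn get(&self, letter: char) -> Option<usize> {
        self.entries
            .binary_search_by_key(&letter, |&(c, _)| c)
            .ok()
            .map(|pos| self.entries[pos].1)
    }
}

fn main() {
    let mut edges = EdgeList::new();
    edges.insert('c', 2);
    edges.insert('a', 0);
    edges.insert('b', 1);
    assert_eq!(edges.get('b'), Some(1));
    assert_eq!(edges.get('z'), None);
}
```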