I created a Generalized Suffix Index based on a Radix Trie (libART, Adaptive Radix Trie) around the beginning of 2019. The one thing I noticed about it was that is will take YUGE amounts of memory because of how many containers are allocated. I used a Rust HashSet with SeaHash after some testing at that time because it was good enough to me. The index stores strings => int64 and SeaHash is good for 64 bit number hashing.
Recently I started digging in on my implementation and figured how I could try to improve. I have a small benchmark for it were 300k items with 1/3 randomized strings (128-136 alphanum utf8 chars) and the other 2/3 smaller semi randomized strings. To query this thing is super fast, on the order of 200-300ms for a String.Contains query. That is insane. Memory usage is around 3.7GB.
To explore some improvements, I figured I would try to make the ID containers smaller or use a different Trie:
1. HAT Trie (https://en.wikipedia.org/wiki/HAT-trie).
2. To make the item IDs sequential and use a roaring bitmap (roaring-rs). I tried with random IDs which was a terrible idea that one can read about in the docs)
3. Use a Radix Trie (Prefix Tree crate)
4. Use an ART (libART).
6. Varying length bit vectors/integer length in the tries.
7. Use a QP Trie.
Results
HAT Trie, with this I built the suffix index on top of a HAT Trie where each of the ID containers are HashSets. Didn't go well as the memory usage exploded, around 8GB- 10GB. Was probably my implementation wrapper of the import of the C library or something. I deleted the wrapper shortly, I guess I do have it in version control, so I can bring it back to try again.
Roaring: I decided to explore making the ID containers smaller and the IDs assigned sequentially. The Suffix Index was built on top of the ART with the ID containers being ROARING bitmaps. Haha, another memory blow up. Even worse than before. Don't remember exactly, but around 8-10GB. My guess is that the implementation is unfortunately bloated in memory with BTreeMap and being that my IDs are 64bit. They could probably do better with a Radix Trie.
I wanted to give up at this point. Storing suffixes is just expensive. Every string (length of n) is n containers. Where the containers store a varying set of 64bit IDs. I read a couple of reaserch papers, but eh, not many ideas I wanted to try. Some were too esoteric and did not look like I would benefit. I decided to press on once I got another idea, some time later. I thought, why not use a Radix Trie in a Radix Trie? It seemed odd, but workable since bytes are what are looked at.
Radix Trie in a Radix Trie: The idea was to replace the hashset with a radix trie where the "string" is a set of bytes representing the ID. I used the prefix tree crate as the ID container. I was pleasantly surprised, but it did not beat the HashSet container. About 3.8G - 4G memory usage.
Outlandish idea. 😀
ART in ART: Same idea as before except with libART twice. Didn't see any improvement.
Compressed Bit Vectors/Integers: Ok, so how about if I push a compressed bitstring into the ART instead. 001 should compress to [0]. Didn't see much improvement.
- I tried one where I would allocate a new integer type based on the number of leading zeros. Basically, if my 64bit ID can be represented as 32bit integer, then allocate a 32bit integer instead since that is 50% savings. Same with u16 and u8. Yeah lower end of 3.8GB. Not much improvement in my opinion.
QP Trie - I remember seeing some esoteric tries I never tried at the start since they didn't accept strings or were fixed length. Since I had functions that could change my IDs into byte vectors/varying integer lengths, I figured I would try now. I decided to use a fixed slice [u8; 8] in the QPTrie anddddd.... voila depending on ID distribution (seq, random) memory usage dropped to 3.1 - 3.5GB.
- I'll try to see if variable sized byte vectors and maybe a QP-Trie in QP Trie would be more efficient.
- Update: I tried a QP-Trie in QP-Trie, but that was less efficient than the ART tree with variable strings. QPTrie is nice for integers for what I have gone through so far. Also, I did try to use variable sized Vec<u8> with capacity of 8, but that ended up using more memory (or at least requesting) more memory than a boxed slice, so.... slices it is.
🥳 Another one of my outlandish ideas saved me some memory. I wonder how much more I can save by trying different tries and variable sized bit vectors!
I must confess that the numbers above include the sizes of other indexes on top, but the suffix index dominates in memory usage. I'll update the numbers in the future without the other indexes. I know that replacing a hashset as the ID container for another kind of index dropped the total memory usage from 3.1- 3.5 to 2.7 - 2.8GB so (2.7e+9 / 300,000 bytes = 9 kilobytes). 9 KB per entry, so far.
Not bad though, dropped a full gigabyte by removing hashset and replacing it with qp-trie and gaining more query capabilities. Pretty insane!
Update 2: So I removed the indices except the Suffix Index, but this time bumped up the total entries to 3 Million. I halved the set (1/3 of the entries) to 64-70 word lengths, so my laptop doesn't get screwy with the same 2/3 semi random characters and it takes about 12GB RAM shockingly. Loading time was 78s with string contains query time about 900ms-3s. Randomized characters are pathological. Next thing I'll probably try is Finite State Transducers for the suffix tree strings to see if memory can be saved as well as try the benchmark on my beefier desktop.
Part 2: https://www.normansoven.com/post/obessed-with-making-a-generalized-suffix-indextree-memory-efficient-part-2