Part 1: https://www.normansoven.com/post/obessed-with-making-a-generalized-suffix-indextree-memory-efficient
Merged Trie Experiment
I read parts of an interesting paper called "MergedTrie: Efficient textual indexing". The paper and code are here: https://github.com/AntonioFerrandez/MergedTrie
I wanted to figure out if there was a way to decrease the amount of memory (approximately in half) for a Suffix Index, and this particular paper looks to address exactly that. I had random thoughts about splitting a string in half and somehow sticking it into a trie, but wasn't sure how to go about it. This paper actually describes how to do it!
The idea is to merge the prefix and suffix into one trie by using a data structure that represents two tries. Weird, huh? It is all in the linkage.
Each of the trie nodes has a set of inter-linkage (IT) nodes that connect a prefix half to a reversed suffix half (the suffix half is longer for odd-length words).
Find: You can easily split an exact-match query into a prefix half and a reversed suffix half and query the trie that way (see the sketch after these notes).
Prefix iteration: If you wanted all words that are prefixes of ab, you'd do a find for a and for ab, then filter out any words that do not contain the prefix.
Insert: getOrCreate the IT collection for the prefix half, do the same for the suffix half, then create the IT node and add it to both the prefix and suffix collections, but only if an equivalent IT node does not already exist.
Deletion: Find the IT node and remove it pretty much.
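To make the linkage concrete, here is a minimal Rust sketch of the idea. Plain HashMaps stand in for the two tries and the names (MergedIndex, split) are mine, so it only illustrates the prefix-half / reversed-suffix-half bookkeeping rather than the paper's actual data structure:

```rust
use std::collections::{HashMap, HashSet};

// Two HashMaps stand in for the prefix trie and the (reversed) suffix trie;
// each half owns a set of IT links, and a word exists only if both halves
// point at the same link.
#[derive(Default)]
struct MergedIndex {
    prefix_links: HashMap<String, HashSet<String>>, // prefix half -> linked reversed suffix halves
    suffix_links: HashMap<String, HashSet<String>>, // reversed suffix half -> linked prefix halves
}

// Split a word into (prefix half, reversed suffix half); the suffix half
// takes the extra character when the length is odd.
fn split(word: &str) -> (String, String) {
    let chars: Vec<char> = word.chars().collect();
    let mid = chars.len() / 2;
    let prefix: String = chars[..mid].iter().collect();
    let suffix_rev: String = chars[mid..].iter().rev().collect();
    (prefix, suffix_rev)
}

impl MergedIndex {
    // Insert: get-or-create the IT collections for both halves and link them;
    // HashSet::insert is a no-op if the link already exists.
    fn insert(&mut self, word: &str) {
        let (p, s) = split(word);
        self.prefix_links.entry(p.clone()).or_default().insert(s.clone());
        self.suffix_links.entry(s).or_default().insert(p);
    }

    // Find (exact match): split the query the same way and check the linkage.
    fn contains(&self, word: &str) -> bool {
        let (p, s) = split(word);
        self.prefix_links.get(&p).map_or(false, |links| links.contains(&s))
    }

    // Deletion: remove the IT link from both sides.
    fn remove(&mut self, word: &str) {
        if let Some(links) = self.prefix_links.get_mut(&split(word).0) {
            links.remove(&split(word).1);
        }
        if let Some(links) = self.suffix_links.get_mut(&split(word).1) {
            links.remove(&split(word).0);
        }
    }
}

fn main() {
    let mut idx = MergedIndex::default();
    idx.insert("banana");
    assert!(idx.contains("banana"));  // "ban" + reversed "ana"
    assert!(!idx.contains("banane"));
    idx.remove("banana");
    assert!(!idx.contains("banana"));
}
```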
Those are the general ideas I got from reading the paper and toying around with the code.
I created a Rust library binding to the MergedTrie library to test a contrived, pathological case: words of 128 randomly generated alphanumeric characters.
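For reference, the test input is nothing fancy; something like the following (using the rand crate's 0.8-style Alphanumeric distribution; the random_words helper is just for illustration) generates that kind of data:

```rust
use rand::{distributions::Alphanumeric, Rng};

// Generate `count` random alphanumeric words of `len` characters each,
// the kind of contrived, pathological input used for these tests.
fn random_words(count: usize, len: usize) -> Vec<String> {
    let mut rng = rand::thread_rng();
    (0..count)
        .map(|_| {
            (&mut rng)
                .sample_iter(&Alphanumeric)
                .take(len)
                .map(char::from)
                .collect()
        })
        .collect()
}

fn main() {
    let words = random_words(30_000, 128);
    println!("generated {} words", words.len());
}
```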
Here are some results for that:
Results
30k, 128B random words
22MB
Finished loading 30001 items 46.431491ms
Finished, waiting 10secs
300k, 128B random words
238.9MB
Finished loading 300001 items 836.170488ms
Finished, waiting 10secs
3M, 128B random words
2.37GB
Finished loading 3000001 items 7.253462882s
Finished, waiting 10secs
I was definitely impressed, until I ran into bugs with the C++ library and spent more than enough time translating it (thanks, Google Translate and find/replace) from Spanish to English. I ran into a resize bug and some trie node container size limits. Still, a great experiment and great work.
DIY Time
I was so frustrated with the bugs last night that I fell back on what I usually do: make it myself from existing building blocks with my own tweaks. I decided to create a Rust version on top of libART (Adaptive Radix Tree, https://github.com/armon/libart) and the Rust qp-trie crate (https://github.com/sdleffler/qp-trie-rs).
I call it MTrie. I borrowed the IT idea and modified libART (I unimaginatively call it libart_sys) to give me pointers to its leaves. I've got to say, I am impressed with what I did.
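As a bit of context on the binding side: libART exposes a small C API, so a libart_sys-style crate is mostly a thin FFI layer plus the leaf-pointer tweak. A rough sketch of what the base binding could look like is below; the declarations follow libART's public art.h as I understand it, and the leaf-pointer modification and the qp-trie side are not shown.

```rust
use std::os::raw::{c_int, c_uchar, c_void};

// Mirrors libART's `art_tree` struct (an opaque root pointer plus a size counter).
#[repr(C)]
pub struct ArtTree {
    root: *mut c_void, // really an art_node*, opaque from Rust's point of view
    size: u64,
}

extern "C" {
    fn art_tree_init(t: *mut ArtTree) -> c_int;
    fn art_tree_destroy(t: *mut ArtTree) -> c_int;
    fn art_insert(t: *mut ArtTree, key: *const c_uchar, key_len: c_int, value: *mut c_void) -> *mut c_void;
    fn art_search(t: *const ArtTree, key: *const c_uchar, key_len: c_int) -> *mut c_void;
}

// A thin wrapper: keys are byte slices, values are raw pointers the caller owns.
pub struct Art(ArtTree);

impl Art {
    pub fn new() -> Self {
        let mut tree = ArtTree { root: std::ptr::null_mut(), size: 0 };
        unsafe { art_tree_init(&mut tree) };
        Art(tree)
    }

    pub fn insert(&mut self, key: &[u8], value: *mut c_void) -> *mut c_void {
        unsafe { art_insert(&mut self.0, key.as_ptr(), key.len() as c_int, value) }
    }

    pub fn search(&self, key: &[u8]) -> *mut c_void {
        unsafe { art_search(&self.0, key.as_ptr(), key.len() as c_int) }
    }
}

impl Drop for Art {
    fn drop(&mut self) {
        unsafe { art_tree_destroy(&mut self.0) };
    }
}
```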
Results
30k, 128B random words
11.4MB
Finished loading 30001 items 76.638621ms
Finished, waiting 10secs
300k, 128B random words
112.9MB
Finished loading 300001 items 760.473616ms
Finished, waiting 10secs
3M, 128B random words
1.07GB
Finished loading 3000001 items 7.908426735s
Finished, waiting 10secs
Oh yeah baby. Look at that memory usage, that is what I am talking about! If I were on a 32-bit system, the memory usage would be cut roughly in half again, but that's a pipe dream.
I created a suffix index on top and did some tests. I stayed away from 3 million entries since that would be prohibitive.
30K, 128B random words
756MB
Finished loading 30000 items 7.604373016s
Finished, waiting 10secs
300K, 128B random words
7.81GB
Finished loading 300000 items 91.737424771s
Finished, waiting 10secs
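For anyone unfamiliar, a suffix index here just means indexing every suffix of every word so that substring ("contains") queries become prefix lookups. A tiny, trie-free sketch of that idea (the SuffixIndex name and the BTreeMap stand-in are mine, not the MTrie internals):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Every suffix of every word is inserted into an ordered map, so a
// "contains" query is just a prefix scan over the suffixes.
#[derive(Default)]
struct SuffixIndex {
    suffixes: BTreeMap<String, BTreeSet<usize>>, // suffix -> ids of words containing it
    words: Vec<String>,
}

impl SuffixIndex {
    fn insert(&mut self, word: &str) {
        let id = self.words.len();
        self.words.push(word.to_string());
        for start in 0..word.len() {
            self.suffixes
                .entry(word[start..].to_string())
                .or_default()
                .insert(id);
        }
    }

    // All words containing `needle` as a substring: scan suffixes starting with `needle`.
    fn containing(&self, needle: &str) -> BTreeSet<usize> {
        let mut out = BTreeSet::new();
        for (suffix, ids) in self.suffixes.range(needle.to_string()..) {
            if !suffix.starts_with(needle) {
                break; // past the prefix range
            }
            out.extend(ids.iter().copied());
        }
        out
    }
}

fn main() {
    let mut idx = SuffixIndex::default();
    idx.insert("banana");
    idx.insert("bandana");
    for id in idx.containing("nan") {
        println!("{}", idx.words[id]); // prints "banana"
    }
}
```

The memory blow-up relative to the plain word index above comes from exactly this: every word of length n contributes n suffixes.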
Part 1 of this post has an update where I said I had to cut the 128-136 length words in half and use them as 33% of the entered data just to get down to 12GB, so this MTrie is an absolutely insane improvement.
Update 2: I removed all the indices except the Suffix Index, but this time bumped the total entries up to 3 million. I halved one set (1/3 of the entries) to 64-70 character word lengths so my laptop doesn't get screwy, kept the same 2/3 semi-random characters, and it takes about 12GB of RAM, shockingly. Loading time was 78s, with string-contains query times of about 900ms-3s.
How did I do it? I'll post about that when I bring it out into the open. I want to try realistic datasets and the datasets cited in the paper, and chat with the original researcher about it. I used quite a bit of unsafe Rust. I don't know if I can make the trie any more efficient after this.
Note: Memory posted is REAL memory on Mac OS X.
Update: Actually, I might not release this trie, but instead another kind of advanced data structure that represents a minimized suffix tree (far fewer nodes = far less memory) and is updatable in real time rather than statically built. I haven't seen the updatable version implemented anywhere.