Update: Found several bugs in my Typescript and Rust implementation that looks to invalidate the results here. Either I stumbled into a way to construct a minimized graph that can find what I want or what. Only found it after I implemented a traverser with backwards lookup. Might as well use a Finite State Transducer Index as far as I can see. I'm going to keep looking into this since it is a radix and suffix tree.
- One of the bugs is related to my misreading of the paper. One should not use same terminator for each word. While I was able to store a particular word, a suffix of that word could not be stored afterwards. I wrote an extension since then that butchers the terminator from a word while the terminator node can point to many sinks. Separation will not be executed when the update reaches the terminator since it is already butchered/separated. I implemented a rough version in Typescript which I put through the thrasher. I need to build a auditor tool.
Update 3: Interesting property I have observed is that splitting the corpus across multiple instances of the CDAWG will actually decrease overall memory requirements and increase performance. Around 60k-65k 128 random alphanumeric character per instance seems to be a sweet spot from my tests. I suppose it makes a lot of sense that you don't have a ton of words in one FSA as the transitions will blow up. Would be nice to deal with only one though.
Update 4: https://github.com/normano/seadawg - Open sourced work. Support for prefix and contains query (find superstring). It does store data more efficiently than a typical Generalized Suffix Tree.
Last update: Performance numbers are posted here: https://github.com/normano/seadawg/blob/master/rust/PERFORMANCE.md, Rust version is vetted by an auditor tool I created to make sure queries work. The lite version has unbelievable numbers for memory and speed, but contains/suffix query performance will suffer. I optimized the BT version and cut memory costs over 50% compared to my equivalent Suffix/Radix tree implementation and it remains dynamic, so I can say this was a success.