This solution will (most probably) not work as you cannot hold all the words in memory.
Hi Erel, you don't have to keep ALL words in memory, but only the unique words. My solution should work for all text files you throw at it with normal words.
If you say you want to work with 'unnatural' words, then I already suggested keeping everything in 2 or more files (one per starting letter: 'A', 'B', ...). It would work without a problem and still be quite fast too.
Also, your solution won't work if I throw a text at it with words that are, let's say, 500 characters long (e.g. CSV record lines). You will run out of memory, and no, you can't know in advance how big the words will be. So text files are the only solution. I would place them in a folder structure like the one below. I've made the example tree with 2 levels and 2 characters, but it can easily be changed to work with 3 levels or more, and never run out of RAM.
-A
    -A
        -words/files
    -B
        -words/files
-B
    -A
    -B
    -C
-C
-...
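
To make the folder idea concrete, here is a rough sketch in Python (not the language used in this thread, just something short to read). The names `bucket_path`, `distribute` and `unique_in_bucket`, the file name `words.txt`, and the `_` folder for anything that is not A-Z are all my own assumptions for the example. It only shows the principle: one folder level per leading character, and only one line or one bucket in memory at a time.

```python
import os
import re


def bucket_path(root, word, levels=2):
    """Return the bucket file for a word: one folder level per leading character."""
    # Assumption: characters outside A-Z, and missing characters in short
    # words, are mapped to a '_' folder.
    key = [c if "A" <= c <= "Z" else "_" for c in word.upper()[:levels]]
    key += ["_"] * (levels - len(key))
    return os.path.join(root, *key, "words.txt")


def distribute(lines, root, levels=2):
    """Append every word to its bucket file, reading one line at a time."""
    for line in lines:
        for word in re.findall(r"\w+", line):
            path = bucket_path(root, word, levels)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "a", encoding="utf-8") as f:
                f.write(word + "\n")


def unique_in_bucket(path):
    """Load a single bucket file and return its unique words."""
    with open(path, encoding="utf-8") as f:
        return {w.strip() for w in f if w.strip()}
```

You would call `distribute(open("big.txt", encoding="utf-8"), "buckets")` once and then run `unique_in_bucket` on each `words.txt`, one after the other. Opening a file for every single word is slow, so a real version would probably buffer the writes. If one bucket is still too big, raise `levels` to 3 or more.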