Great vision: challenging the "scale" of current AI solutions is super valid, if only because humans don't learn like this.
Architecture: despite other comments, I'm not so bothered by mmap (if read-only) but rather by the performance claims. If your total db is 13 KB, you should be answering queries at amazing speeds, because at that point you're just running code on in-cache data. The performance claim therefore means very little: nothing you're doing is performance-intensive at that scale.
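To make that concrete, here's a toy benchmark (entirely my own invention, not your code or data) showing that even a naive full scan over ~13 KB of in-memory text takes microseconds:

    import timeit

    # ~13 KB of filler text standing in for the whole database
    corpus = ("lorem ipsum dolor sit amet " * 512)[:13_000]

    def query(term: str) -> int:
        # Deliberately naive: brute-force scan of the entire corpus
        return corpus.count(term)

    # The whole dataset fits in L1/L2 cache, so even this takes microseconds
    per_query = timeit.timeit(lambda: query("dolor"), number=100_000) / 100_000
    print(f"~{per_query * 1e6:.2f} us per full-corpus scan")

Any query engine looks fast at that size; throughput numbers only start to mean something once the data no longer fits in cache.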
Claims: a frontal attack on the current paradigm would at least have to include real semantic queries, which I don't think is what you're currently doing; you're doing language analytics, i.e. classic NLP. Maybe that's how you intend to tackle semantic queries later, but since it's not what you're doing now, that should be clear from the get-go, especially because the "scale" of the current AI paradigm has nothing to do with how the tokenization happens, but with how the statistical model is trained to answer semantic queries.
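To illustrate the gap with a toy example of my own (nothing from your project): lexical analytics can't bridge meaning, and bridging meaning is exactly the part the trained model supplies.

    # One document, one semantically matching query, zero token overlap
    docs = ["The automobile would not start this morning."]

    def lexical_search(query: str) -> list[str]:
        # NLP-style analytics: plain substring/token matching
        return [d for d in docs if query.lower() in d.lower()]

    print(lexical_search("car"))  # [] -- no lexical hit...
    # ...yet "why won't my car start?" is semantically answered by docs[0].
    # Closing that gap is what the trained statistical model does, and
    # that's where the "scale" of the current paradigm actually lives.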
Finally, "Find all Greek-origin technical terms" is a poor example, because it is exactly the kind of "knowledge graph" question that was answerable long before the current AI hype.
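That kind of question is just a structured lookup over curated data. Sketching it with a made-up toy lexicon (not a real etymology database):

    # Hypothetical toy lexicon; a real one would be Wiktionary-scale
    etymology = {
        "algorithm": "Arabic",
        "telemetry": "Greek",
        "biopsy":    "Greek",
        "software":  "English",
    }

    greek_terms = [w for w, origin in etymology.items() if origin == "Greek"]
    print(greek_terms)  # ['telemetry', 'biopsy']

Any relational database or RDF store has been able to answer this for decades; no statistical model required.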
Nevertheless, love the effort, good luck!
(oh and btw: I'm not an expert, so if any of this is wrong, please correct me)