I'm very curious as to how this works in the backend. I realize it uses Bluesky'...

avibagla1 · 2025-08-06T22:39:09 1754519949

Hey! this is my site - it's not all that complex, i'm just using a sqlite db with two tables - one for stats, the other for all the words that's just word | count | first use | last use | post.

I... did not expect this to be so popular

gumboshoes · 2025-08-07T01:42:55 1754530975

What is your source dictionary to compare to? Seems kind of small. Also, how are you handling inflected forms?

avibagla1 · 2025-08-07T03:33:44 1754537624

https://github.com/words/an-array-of-english-words

using this, a combo of "covered enough" for the bit and easy to use

also, since i'm tracking every word (technically a better name for this project would be The Bluesky Corpus) all inflected forms are different words, which aligns with my thinking

blendo · 2025-08-07T02:59:47 1754535587

What are the table sizes?

And what ingress bandwidth do you have?

avibagla1 · 2025-08-07T03:41:35 1754538095

DB is currently 58mb (damn lol)

Ingress is actually pretty manageable, ~900kbps

f311a · 2025-08-06T22:15:21 1754518521

You can probably fit all words under 10-15MB of memory, but memory optimisations are not even needed for 250k words...

Trie data structures are memory-efficient for storing such dictionaries (2-4x better than hashmaps). Although not as fast as hashmaps for retrieving items. You can hash the top 1k of the most common words and check the rest using a trie.

The most CPU-intensive task here is text tokenizing, but there are a ton of optimized options developed by orgs that work on LLMs.

stwrzn · 2025-08-06T22:31:43 1754519503

I very much hope that the backend uses one of the bluesky jetstream endpoints. When you only subscribe to new posts, it provides a stream of around 20mbit/s last time I checked, while the firehose was ~200mbit/s.

avibagla1 · 2025-08-06T22:39:42 1754519982

yes it does!

gpm · 2025-08-06T22:09:26 1754518166

Probably just a big hashtable mapping word -> the number of times it's been seen, and another hashset of all the words it hasn't seen. When a post comes in you hash all the words in it and look them up in the hashtable, increment it, and if the old value was 0 remove it from the hash set.

250k words at a generous 100 bytes per word is only 25MB of memory...

bangaladore · 2025-08-06T22:09:03 1754518143

Maybe I'm being naive, but with only ~275k words to check against, this doesn't seem like a particularly hard problem. Ingest post, split by words, check each word via some db, hashmap, etc... and update metadata.

somebehemoth · 2025-08-07T02:58:38 1754535518

I think the cool part is watching words go brrr.