This is so overcomplicated. You don't need an actual tree; that's just a mental model. All you need to do is generate N random hyperplanes (each defined by its normal, i.e. generate N normalized vectors in the dimension of your choosing). Then you build a hash from these by going normal by normal and checking which side of each hyperplane your query vector `q` falls on, via the sign of the dot product. I.e. `hash = [sign(dot(q, n1)), sign(dot(q, n2)), ...]`.
So given a query, you compute that hash and get all the vectors with the same hash (using your favorite hash-to-array data structure, e.g. a dict), and you can further refine the search within that bucket. It's like 10 lines of code.
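If anyone wants it spelled out, here's a rough numpy sketch of the idea (function and variable names are just illustrative, and the sizes are arbitrary):

```python
# Random-hyperplane hashing: bucket vectors by which side of each plane they fall on.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def make_hyperplanes(n_planes, dim):
    """N random unit normals, one per hyperplane."""
    normals = rng.normal(size=(n_planes, dim))
    return normals / np.linalg.norm(normals, axis=1, keepdims=True)

def hash_vector(v, normals):
    """One bit per hyperplane: the sign of the dot product with its normal."""
    return tuple((normals @ v) >= 0)

dim, n_planes = 128, 16
planes = make_hyperplanes(n_planes, dim)
data = rng.normal(size=(10_000, dim))

# Index: hash -> list of vector ids.
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[hash_vector(v, planes)].append(i)

# Query: only brute-force within the matching bucket.
q = rng.normal(size=dim)
candidates = buckets.get(hash_vector(q, planes), [])
best = min(candidates, key=lambda i: np.linalg.norm(data[i] - q), default=None)
```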
Annoy author here. What you're describing is Locality Sensitive Hashing (LSH), which is something I spent a lot of time trying back in 2009-2012 but never got working. It has some elegant theoretical properties, but empirically it always has terrible performance. The reason I don't think it works well is that data often lies near a lower-dimensional manifold that may have some sort of shape. LSH would "waste" most splits (because it doesn't understand the data distribution), but using trees (and finding splits such that the point set is partitioned well) ends up "discovering" the data distribution better.
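To make the contrast concrete, here's a rough sketch of one kind of data-dependent split: sample two points from the node and split on the hyperplane halfway between them. This is only the flavor of what a tree does at each node, not the actual implementation.

```python
# One data-dependent split: the hyperplane is chosen from the data itself,
# so the two halves tend to follow the shape of the point set.
import numpy as np

rng = np.random.default_rng(0)

def split_node(points):
    """Partition `points` by the hyperplane equidistant from two sampled points."""
    i, j = rng.choice(len(points), size=2, replace=False)
    a, b = points[i], points[j]
    normal = a - b                       # normal follows the data, not a random direction
    midpoint = (a + b) / 2
    side = (points - midpoint) @ normal >= 0
    return points[side], points[~side]   # a tree recurses on each half

left, right = split_node(rng.normal(size=(1000, 64)))
```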
(but HNSW is generally much better than this tree partitioning scheme)
Faiss GPU author here. Vector binarization via LSH for dense, high-dimensional vectors (say 30 to 2000-ish dimensions) tends to have poor recall and result quality. As the Annoy author implied here as well, there's a lot of literature on LSH, but that's because it is easier to analyze mathematically and prove theorems about, not because it produces good-quality results compared to graph-based (HNSW etc.) or geometric cell-probe-based (IVF etc.) methods. LSH is not typically used in industrial-scale settings for these reasons.
However, for extremely high-dimensional sparse vectors (think tens of thousands to millions of dimensions), LSH does remain the only practical approximate technique to my knowledge. Online settings (where you need to insert vectors quickly into the database) may prefer LSH as well, since the overhead is smaller than updating a graph or finding the IVF cell in which to place the vector.
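As one concrete example of the sparse case, here's a minimal MinHash sketch (a classic LSH family for set-like sparse data; purely illustrative, not a claim about what any particular system ships), mainly to show why inserts are cheap:

```python
# MinHash: LSH for Jaccard similarity on very sparse, set-like vectors.
import numpy as np

K = 64                                # signature length (number of hash functions)
PRIME = 2_147_483_647                 # keeps the int64 arithmetic below from overflowing
rng = np.random.default_rng(0)
a = rng.integers(1, PRIME, size=K)    # per-function hash coefficients
b = rng.integers(0, PRIME, size=K)

def minhash(feature_ids):
    """Signature of a sparse vector given as the set of its nonzero indices."""
    ids = np.asarray(sorted(feature_ids), dtype=np.int64)
    hashed = (a[:, None] * ids[None, :] + b[:, None]) % PRIME
    return hashed.min(axis=1)         # min over the set, per hash function

# Inserting a new item is just K hash evaluations (no graph or cell assignment to update).
sig1 = minhash({3, 17, 100_000, 2_000_000})
sig2 = minhash({3, 17, 100_000, 5})
estimated_jaccard = (sig1 == sig2).mean()   # fraction of matching positions
```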
Below ~30 dimensions, exact non-brute-force methods (like k-d trees or binary space partitioning methods) work OK, but above that they have high overheads because your true nearest neighbor is still highly likely to lie on either side of any dividing hyperplane (the hyperplane doesn't partition the dataset well; in high dimensions, everything is usually "near" everything else in practice).
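A quick toy illustration of that effect (random Gaussian data, nothing rigorous): as dimension grows, the nearest and farthest points end up at nearly the same distance from a query, so a single splitting plane tells you very little.

```python
# Distance concentration: nearest/farthest ratio approaches 1 as dimension grows.
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.normal(size=(5000, dim))
    q = rng.normal(size=dim)
    d = np.linalg.norm(points - q, axis=1)
    print(f"dim={dim:5d}  nearest/farthest distance ratio: {d.min() / d.max():.3f}")
```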
It seems to me that your approach is making some assumptions about how the data is distributed that aren't necessarily guaranteed to hold in practice, and a tree has a chance of being robust to at least some of those. For example, whatever random hyperplane you choose has to be informed by your data. Suppose you make a particularly bad choice for one of those hyperplanes: with your approach you'll be stuck with it for all time and have a bit in your hash that's biased to heck, but with the tree approach you only suffer when you hash your way down to the lousy node, and all other paths will be unaffected.
I implemented this recently in C as a numpy extension[1], for fun. Even had a vectorized solution going.
You'll get diminishing returns on recall pretty fast. There's actually a theorem that tracks this: the Johnson-Lindenstrauss lemma[2], if you're interested.
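Roughly, it says you need on the order of log(n)/eps^2 random projection dimensions to preserve pairwise distances within a (1 ± eps) factor (stated informally here from memory; constants vary by source):

```latex
% Johnson-Lindenstrauss lemma (informal): for any 0 < \varepsilon < 1 and any
% n points x_1, \dots, x_n \in \mathbb{R}^d, there exists a linear map
% f : \mathbb{R}^d \to \mathbb{R}^k with k = O(\varepsilon^{-2} \log n) such that
(1 - \varepsilon)\,\lVert x_i - x_j \rVert^2
  \le \lVert f(x_i) - f(x_j) \rVert^2
  \le (1 + \varepsilon)\,\lVert x_i - x_j \rVert^2
\quad \text{for all } i, j.
```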
As I mention in a talk I gave[3], it can work if you're going to rerank anyway, and the vector search isn't the main ranking signal, and you don't have that much data. It's also easy to update, as the hashes are non-parametric (they don't depend on the data). The cost of constantly updating vector search systems is often neglected in benchmarks.
The lack of data dependency, however, is the main problem. Vector spaces are lumpy. You can see this in the distribution of human beings on the surface of the earth: postal codes and area codes vary from tiny to huge, and random hashes, like a grid, wouldn't let you accurately map out the distribution of all the people or clump them close to their actual nearest neighbors. Manhattan is not rural Alaska... and the more dimensions and data points you have, the more ways the space has to be lumpy.
Annoy actually builds on these hashes: it finds a split, then recursively splits the left and right halves to build a tree, and then it creates a forest of such trees. So it's essentially a forest of random hash trees with data dependency.
I've implemented the same approach (non-parametric), and it worked well to speed up the search on my experiment with 384-dimensional embeddings of 30 million Hacker News comments.
Not really irreversible unless they get a monopoly on violence, so as long as the Overton window, from the public's perspective, shifts decisively towards breaking them up in the future, we'll be able to get a better situation.
When you can reliably tell they have a monopoly on violence, it's far too late to do something about it.
However, Very Bad Things™ can happen long before they have a monopoly on violence. And violence where? Large companies operate in weak jurisdictions, not only in their home states of Maryland and Ireland.
Well, historically, there have been a number of companies with the right to violence. The East Indies and West Indies Companies, for instance, or more recently United Fruit. I haven't bothered to check, but I'd be surprised if Big Oil didn't have mercenaries on call in hot zones, with little to no oversight.
Nobody (company, nation or gang) needs a monopoly on violence to become dangerous.
Those companies didn't have the right to violence in their home countries; they waged wars in other countries instead. Companies overthrowing their home country hasn't really been a thing.
True. However, please don't forget that the conglomerates being discussed initially do exist and have considerable influence in countries other than the US, so my remark still stands :)
Also, the Medici family was initially a wool company, then grabbed power in their homeland, kept it for about three centuries, and somewhere along the way produced descendants that ruled over much of Europe. Similarly, the fascist uprising that gave Franco power over Spain was largely privately funded by a bank [1], and I seem to remember that the German Nazi party was largely funded by industrialists and bankers [2] until it reached power.
So, I'd say that companies overthrowing their home country's government has unfortunately been a thing for quite some time.
They (search, social media, advertising companies) are gaining a monopoly on truth. With that they indirectly control the government, which is the one with the monopoly on violence.
To emphasize: all human productions are biased (that's human nature), and news media are not exempt, despite many outlets making serious attempts to produce unbiased news.
Part of the job of being a consumer of media is understanding the bias of what we're consuming, determining whether it skews the news, and possibly counter-balancing by consuming other media with a different bias. And yes, it's a lot of work, and most people aren't willing to spend the time doing that.
I initially blocked that popup with uBlock Origin, but I've since turned that filter back off and let it block me instead. It's a fantastic way to stop yourself from mindlessly watching videos. Now I just use yt-dlp to download the videos I know I care about and subscribe to.
Good article, but my god, why is there like a 2-second delay before every user interaction on this page? Scrolling, selecting text, everything has some kind of 2-second 'bootup' time, after which things work normally, but if you stop interacting with it for a bit it goes back to some idle mode. Really weird.
As others noted, the latency is from a whole slew of trackers that a privacy extension would block. Maybe this is a good nudge for you to go install one. It's a problem that's already widespread and will only keep getting worse.
But on top of that, this is also one of the many fun examples of where a static, linear document that's so simple it could have been written in markdown requires megabytes of download across dozens of files.
Nobody cares about user experience right now -- it's all about developer and content management experience, so runtime bulk and responsiveness are ignored in favor of... whatever these kinds of developers think they're gaining on the production end. I often have a hard time guessing what that is without assuming the responsible devs are not-so-great.
I think it's just them being scared of models running on their servers producing things that cause bad PR. And the reason they're not willing to let things run anywhere but their servers is competitive advantage, as well as just extracting more money out of users that way. (I definitely don't think actual safety is the concern.)
It's funny that one argument OpenAI used for keeping their models closed and centralized was that they could prevent things like this. And yet they're doing basically nothing to stop it (and letting the web deteriorate) now that profit has come into play.
Not saying they should, but if they wanted to, they could have an API that allows you to check whether some text was generated by them or not. Then Google would be able to check search results and downrank them.
It's not that simple. OpenAI originally released a model to try to detect whether some content was generated by an LLM or not. They later dropped the service because it wasn't accurate. Today's models are so good at text generation that in most cases it's not possible to differentiate between human- and machine-generated text.
Well, they could just disallow prompts that look like they're being used for blogspam. If they wanted to stop it, they definitely could.
Their argument is that since it's centralized, things like that are possible (whereas with Llama 2 you can't), and they do "patch" things all the time. But since blogspam is contributing to paying back the billions Microsoft expects, they're not going to.
It would be easy to work around using other open-source models. You use GPT-4 to generate content and then Llama 2 or something else to change the style slightly.
Also, it would require OpenAI to store the history of everything its API has produced. That would conflict with their privacy policy and privacy protections.
If it's a straightforward hash, that's easy to evade by modifying the output slightly (even programmatically).
If it's a perceptual hash, that's easy to _exploit_: just ask the AI to repeat something back to you, and it typically does so with few errors. Now you can mark anything you like as "AI-generated". (Compare to Schneier's take on SmartWater at https://www.schneier.com/blog/archives/2008/03/the_security_...).