I am actually working on something similar with Shepherd.com :)
I ask authors to recommend 5 books around a topic, theme, or mood. For example, here is one that Steven Pinker did on Rationality: https://shepherd.com/best-books/rationality
Then I use NLP to identify Wikipedia topics related to each book, and use those topics to connect books to one another so that finding your next book online is a fun process. I am hoping to use Wikidata to build some really cool connections as we get going.
What I also get from this is human data on how books are connected to each other: author 1 thinks that book 1 is similar in some way to books 2, 3, 4, and 5. I use those inputs to power the new recommendation system I am launching next month, and a few months after that a "books like" feature that will use the same data. Then, toward winter, I'll integrate book genre data into it.
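To make that concrete, here is a minimal sketch of how co-recommendation lists can be turned into a naive "books like this" signal. This is not Shepherd's actual code; the example lists and the simple co-occurrence scoring are just assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: each author's list of 5 recommended books.
recommendation_lists = [
    ["Rationality", "Thinking, Fast and Slow", "Superforecasting",
     "The Scout Mindset", "The Signal and the Noise"],
    ["The Scout Mindset", "Superforecasting", "How to Measure Anything",
     "Thinking in Bets", "Rationality"],
]

# Count how often two books appear on the same list; more co-appearances
# is treated as a stronger "similar in some way" signal.
co_counts = defaultdict(int)
for books in recommendation_lists:
    for a, b in combinations(sorted(set(books)), 2):
        co_counts[(a, b)] += 1

def books_like(title, top_n=5):
    """Rank other books by how often they were recommended alongside `title`."""
    scores = defaultdict(int)
    for (a, b), count in co_counts.items():
        if title == a:
            scores[b] += count
        elif title == b:
            scores[a] += count
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(books_like("Rationality"))
```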
I am bootstrapped so we are rate limited on dev time :)
In early 2023 I will add user accounts and a custom book recommendation newsletter based on what readers like in terms of genres, topics, books, and authors. Once I have user accounts, I will slowly move toward collecting readers' 3 favorite books of the year, and then use that data to identify your "Book DNA" and match you to other readers.
Slowly but surely :)
Email me at ben@shepherd.com if you would like to chat or have any ideas.
BLAST is so deeply embedded in modern genome biology that it is hard to find short descriptions. Basically, there are two parts:
(1) Local sequence (string) similarity scores (scores that allow mismatches but need not extend to the end of either string) follow the extreme value distribution, which makes it easy to distinguish similarity by chance from similarity due to common ancestry (I think of different names for the same entity as "ancestral"). This is a very powerful concept: it lets you set false-positive rates accurately (it does not help with false negatives).
(2) BLAST calculates local similarity scores in O(n^2)/K time, where K is very, very large, by first identifying significant conserved "words" with a lookup table, then looking for runs of those words along an alignment diagonal, and then trying to extend that diagonal with an efficient banded dynamic programming algorithm.
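A toy illustration of the word-lookup / diagonal-seeding idea (this is not BLAST itself; the word length, the exact-match-only lookup, and the omitted banded extension are all simplifications I'm assuming for the sketch):

```python
from collections import defaultdict

def word_hits(query, subject, w=3):
    """Find exact w-length word matches and group them by alignment diagonal.

    BLAST uses a precomputed lookup table of high-scoring words; here we only
    index exact words, which keeps the sketch short.
    """
    table = defaultdict(list)              # word -> positions in subject
    for j in range(len(subject) - w + 1):
        table[subject[j:j + w]].append(j)

    diagonals = defaultdict(list)          # diagonal (i - j) -> query positions
    for i in range(len(query) - w + 1):
        for j in table.get(query[i:i + w], []):
            diagonals[i - j].append(i)
    return diagonals

# Diagonals with runs of word hits are the candidates BLAST would then try to
# extend with banded dynamic programming (omitted here).
q = "ACGTACGTGACC"
s = "TTACGTACGTAA"
for diag, positions in sorted(word_hits(q, s).items()):
    print(diag, positions)
```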
The reason why I love this problem is because of this! I feel like there are a lot of fun ways to be creative here, but as the other comments mentioned -- to get a scalable and really good solution is extremely difficult.
You are 100% correct, this is a toy example that I decided to put together for fun after talking to a bunch of people who mentioned it as a problem that they experience.
The main idea I wanted to add to the discussion (which is not that crazy of an addition) is that you can possibly use sentence embeddings instead of fuzzy matching on the raw characters to get more "domain expertise".
How to actually compare these embeddings against all the other embeddings in a large dataset is a problem that is completely out of scope for this tutorial.
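For what it's worth, here is a minimal sketch of the pairwise version of that idea. The model choice and the rapidfuzz baseline are my own assumptions, not something the tutorial prescribes:

```python
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer, util

a = "Intl. Business Machines Corp., Armonk NY"
b = "IBM Corporation (Armonk, New York)"

# Character-level fuzzy matching: the score is driven only by shared letters.
print("fuzzy ratio:", fuzz.token_sort_ratio(a, b))

# Sentence embeddings: the model may capture that "IBM" and
# "Intl. Business Machines" refer to the same thing.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)
print("embedding cosine:", util.cos_sim(emb_a, emb_b).item())
```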
Yeah, totally, I get you. I'm not trying to do a takedown, ER is just hard.
The point I was trying to make is that at scale one does not simply:
> compare these embeddings with all the other embeddings you have
You just can't: running similarity metrics (cosine included) between 768-dim vectors for every pair is prohibitively slow.
Using embeddings is quite common in the literature and in deployments (I have, in fact, deployed ER that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years (back-of-envelope: 10 million records is roughly 5 × 10^13 pairs, so even at a million comparisons per second you're looking at more than a year and a half), so you have to do something else to refine the comparison sets first.
> You just can't: running similarity metrics (cosine included) between 768-dim vectors for every pair is prohibitively slow.
Any reason you couldn't just dump it in FAISS or Annoy? No need to do pairwise comparisons. Annoy claims to handle even up to 1,000 dimensions reasonably. https://github.com/facebookresearch/faiss/issues/95
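Something like this is what I have in mind: a minimal FAISS sketch with made-up data (the index type, nlist, and nprobe are just assumptions that would need tuning):

```python
import numpy as np
import faiss

d = 768                                   # embedding dimensionality
rng = np.random.default_rng(0)
xb = rng.standard_normal((100_000, d)).astype("float32")  # fake entity embeddings

# Normalize so inner product == cosine similarity.
faiss.normalize_L2(xb)

# IVF index: cluster vectors into nlist cells, search only a few cells per query.
nlist = 1024
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(xb)
index.add(xb)
index.nprobe = 16                         # cells visited per query (recall/speed knob)

# Approximate top-10 neighbors for the first 5 entities.
scores, ids = index.search(xb[:5], 10)
print(ids)
```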
Yeah, ANN algos provide a logical way to do blocking. It's not quite as simple as just running it though.
First, you have to be able to actually run the server. Most ANN algorithms (not DiskANN or SPTAG, but basically everything else) run in memory, so if you've got really beefy data you'll need machines that can handle the load. The search-engine companies and researchers working on toy problems generally don't run into this problem.
Second, approximate nearest neighbors are, unsurprisingly, approximate, so if you want the top N true neighbors you actually have to 1) collect the top K*N approximate neighbors, then 2) run true cosine similarity on those K*N neighbors to refine down to N.
So your new O(kn) ER algorithm (not the same k or n as before; sorry, I overloaded my variables) is basically the following (rough sketch after the list):
- set up ANN server
- for each entity, collect K*N approximate neighbors
- refine to N true neighbors
- perform matching step
- somehow handle the fact that you have merged entities in your ANN server (this can be weirdly tricky).
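Here is what steps 2 and 3 look like in practice: over-fetch K*N approximate neighbors, then re-rank with exact cosine. The ANN index itself and the merged-entity handling are left out, and every name here is my own placeholder, not a real library API:

```python
import numpy as np

def refine_neighbors(query_vec, candidate_ids, candidate_vecs, n):
    """Re-rank K*N approximate candidates with exact cosine and keep the top N."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:n]
    return [(candidate_ids[i], float(sims[i])) for i in order]

def block_with_ann(entity_ids, ann_index, vectors, n=10, k=5):
    """For each entity: over-fetch K*N approximate neighbors, then refine to N."""
    blocks = {}
    for entity_id in entity_ids:
        q = vectors[entity_id]
        approx_ids = ann_index.query(q, k * n)       # placeholder ANN call
        cand_vecs = np.stack([vectors[i] for i in approx_ids])
        blocks[entity_id] = refine_neighbors(q, approx_ids, cand_vecs, n)
        # The matching step (and merging entities back into the index) comes after this.
    return blocks
```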
Also, I might be splitting hairs, but ANN isn't really performing a similarity measure. It's doing clustering or building search trees that try to guarantee the leaf nodes are close to within some threshold. In that sense it has more in common with a fuzzy hash or a blocking algorithm than with an exact similarity calculation.
As you may be able to tell, I have spent more time than I would like thinking about and implementing this sort of thing.
Data Twitter and LinkedIn are great; there are a lot of people putting out some really good content. There are also a lot of Substacks you can sign up for. Data Engineering Weekly is my fave.
The problems in this article and in the comments are some of the things we have heard at Magniv in the past few months when talking to data practitioners. We are focused on solving a subset of these problems.
Personally, I think Airflow is currently being un-bundled and will continue to be, with more task-specific tools.
At the very least, if un-bundling doesn't occur, Prefect and Dagster are working hard to solve lots of these issues with Airflow.
The evolution of products and engineering practices is not linear and sometimes doesn't even make sense when looked at a posteriori (as much as I would like it to follow some logical process). It will be interesting to see how this space develops in the next year or so.