In the last few months, there's been a Cambrian explosion of products integrating AI. A big part of this is because LLMs give users good performance on a lot of NLP tasks out of the box, just by calling an API or running a pre-trained model. Previously, applied AI/ML work revolved around collecting + labeling data, training a model, building infrastructure, etc. to get good enough performance for a given use case. But now that the models work well out of the box, the work shifts to building a great product experience around the model and getting to product-market fit with an AI-enabled product.
With visual user interfaces, there's a whole category of product analytics tooling that helps product teams understand how users interact with their products + make product decisions to optimize their product-market fit. LLM apps have introduced a new paradigm for interacting with software, where users can work iteratively with the software via a natural language interface, generating user inputs and model responses consisting of unstructured text.
Traditional analytics techniques don’t deal well with large amounts of unstructured text – it’s hard to summarize, it’s hard to aggregate, and it’s hard to effectively sample. AI developers resort to digging through a pile of hundreds to hundreds of millions of datapoints of unstructured text to understand how users interact with their product.
Tidepool tries to solve this problem using neural network embeddings. After you upload user text interaction events, Tidepool will:
- Automatically group your data by similarity. Tidepool runs embedding clustering on your users’ text interactions to surface interesting attributes: things like prompt topics, prompt languages, and common usage patterns that can be turned into shortcuts (see the sketch after this list).
- Summarize common attributes in your data, using LLMs to determine what each cluster “contains.” For example, understanding that the most common topics that users discuss are business, education, and art.
- Track attributes in production traffic, allowing you to uncover how a specific attribute might be correlated to good / bad product outcomes. We utilize lightweight models running on foundation model embeddings to scalably extract these attributes from hundreds of millions of interaction events in production.
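To make the clustering step concrete, here's a minimal sketch of the general idea - not Tidepool's actual pipeline. It embeds each interaction with an off-the-shelf text embedding model and groups by similarity; the model name and cluster count below are arbitrary illustrative choices.

```python
# Minimal sketch of embedding clustering on user prompts.
# The embedding model and n_clusters are illustrative choices, not Tidepool's.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

prompts = [
    "Write a business plan for a coffee shop",
    "Explain photosynthesis to a 5th grader",
    "Suggest a watercolor painting idea",
    # ... user interaction events go here
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedding model works
embeddings = model.encode(prompts, normalize_embeddings=True)

kmeans = KMeans(n_clusters=3, random_state=0, n_init="auto").fit(embeddings)

clusters = defaultdict(list)
for prompt, label in zip(prompts, kmeans.labels_):
    clusters[label].append(prompt)

# Each cluster can then be summarized (e.g. by an LLM) to name the attribute it captures.
for label, members in clusters.items():
    print(label, members[:3])
```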
Lastly, we have a self-serve free tier! I thought this may be useful for people building AI applications. If you're interested, please try it - I'd love to hear any feedback on what works and what doesn't :) https://app.tidepool.so/
From previous experience doing applied ML, it's surprisingly hard to do fairly basic operations on data - splitting broad classes into finer classes, collecting data that the model struggles on, etc. However, there have been a lot of recent advancements in few-shot learning that make these tasks easier.
Aquarium is an ML data management system that helps ML teams improve their models by improving their datasets. Aquarium uncovers problems in your dataset, then helps you edit or add data to fix these problems and optimize your model performance.
We are looking for our first Product Manager and are also hiring full-stack + backend engineers! More info available here: https://jobs.lever.co/aquarium
I think active learning has a time and a place. If you're getting started with a project from scratch, you probably don't need active learning for the exact reasons you describe - as long as you still get good improvements to model performance by labeling randomly sampled data, then you should scale out your labeling to get more data faster. For modern convnets fine-tuned on image data, I don't think you should think about active learning until you're past 10,000 examples.
Active learning becomes really useful when you hit diminishing returns, as most real-world ML applications deal with long tail distributions, and random sampling doesn't pick out edge cases for labeling very well. An easy way to tell if you're encountering diminishing returns is to do an ablation study where you train the same model against different subsets of your train set, evaluate them against each other on the same test set, and plot out the curve of model performance vs dataset size to see if you're starting to plateau. Or just eyeball your model errors and try to see if there are any patterns of edge cases it fails on.
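A rough sketch of that ablation, assuming you already have `train_model`, `evaluate`, `train_set`, and `test_set` defined for your own setup:

```python
# Ablation sketch: train on nested random subsets, evaluate on a fixed test set,
# and look for a plateau in the metric-vs-dataset-size curve.
# train_model / evaluate / train_set / test_set are placeholders for your own code.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
order = rng.permutation(len(train_set))

fractions = [0.1, 0.25, 0.5, 0.75, 1.0]
sizes, scores = [], []
for frac in fractions:
    subset = [train_set[i] for i in order[: int(frac * len(train_set))]]
    model = train_model(subset)               # your training loop
    sizes.append(len(subset))
    scores.append(evaluate(model, test_set))  # same held-out test set each time

plt.plot(sizes, scores, marker="o")
plt.xlabel("train set size")
plt.ylabel("test metric")
plt.show()
```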
Lastly, I'm pretty skeptical of model-based uncertainty sampling. In industry, almost every active learning implementation boils down to a human deciding "what data should we label next," since model-based active learning is pretty hard to set up and confidence sampling is often not very reliable. That being said, I've anecdotally heard of some teams getting great performance from Bayesian methods once you have a large enough base dataset.
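For reference, confidence (least-confidence) sampling is roughly the following; `model` and `unlabeled_pool` are placeholders for your own classifier and unlabeled data:

```python
# Least-confidence sampling sketch: pick the examples the model is least sure about.
# `model` and `unlabeled_pool` are placeholders.
import numpy as np

probs = model.predict_proba(unlabeled_pool)   # shape: (n_examples, n_classes)
confidence = probs.max(axis=1)                # probability of the top predicted class
query_indices = np.argsort(confidence)[:100]  # 100 least-confident examples
to_label = [unlabeled_pool[i] for i in query_indices]
```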
Former self-driving engineer here. I'm also pretty skeptical about synthetic data. For the scenario you described, it turns out that if you drive enough, you'll eventually see some examples of ambulances at night in the rain. If it's really that rare, it's often easier to rent your own ambulance, drive around and do some staged data collection, and annotate the results than it is to set up a synthetic data pipeline.
At the end of the day, even in vision applications, real data is always better than synthetic data if you can get it. Things like sensor noise or interference are hard to replicate in synthetic data. Most teams turn to synthetic data for simulation purposes or as a last resort.
We don't expect users to upload all their data to our service - the type of data we're interested in is "metadata": URLs to the raw data, labels, inferences, embeddings, and any additional attributes for their dataset. Users can POST this to our API and we'll ingest it that way.
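A hypothetical payload might look something like this - the field names and endpoint are illustrative, not Aquarium's actual API schema:

```python
# Hypothetical metadata payload; field names and endpoint are illustrative only.
import requests

payload = {
    "dataset": "front-camera-v2",
    "frames": [
        {
            "frame_id": "frame_000123",
            "image_url": "https://storage.example.com/frames/000123.jpg",  # raw data stays in your bucket
            "label": "pedestrian",
            "inference": "cyclist",
            "embedding": [0.12, -0.43, 0.88],  # truncated for illustration
            "attributes": {"time_of_day": "night", "weather": "rain"},
        }
    ],
}

requests.post("https://api.example.com/v1/ingest", json=payload, timeout=30)
```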
If users don't provide their own embeddings, we need access to the raw data so we can run our pretrained models on the data to generate embeddings.
However, if users do provide their own embeddings, we would never need access to the raw data - Aquarium operates on embeddings, so the raw data URLs would be purely for visualization within the UI. This is really nice because it means we can access-restrict those URLs so only customers can visualize them (via URL signing endpoints, only authorizing IP addresses within customer VPNs, Okta integration), while Aquarium operates on relatively anonymized embeddings and metadata.
I think the biggest issue with this approach is the requirement for embeddings. It's sometimes hard for a customer to figure out which layer to pull out of their net to send to us, so sometimes we just use a pretrained net to generate embeddings: one net for audio, one net for imagery, one net for pointclouds, etc.
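As a sketch of what "pulling a layer out of a net" looks like, here's the idea with a torchvision ResNet as a stand-in (not the specific pretrained models we run):

```python
# Minimal sketch of generating image embeddings from a pretrained net
# by taking activations just before the classification head.
import torch
from torchvision import models
from PIL import Image

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier; output is the 2048-d penultimate layer
backbone.eval()

preprocess = models.ResNet50_Weights.DEFAULT.transforms()

image = Image.open("example.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    embedding = backbone(preprocess(image).unsqueeze(0)).squeeze(0)  # shape: (2048,)
```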
I'd say that it's harder for this tool to work with structured/tabular data for a few reasons.
One, most structured datasets are domain-specific, so it's not easy to pull a pretrained model off the shelf to generate embeddings - typically we would need a customer to give us the embeddings from their own model in these cases.
Two, neural nets actually aren't the best for certain structured data tasks - tree-based techniques often get better performance on simpler tasks, and a tree-based model has no obvious embedding to pull out.
Three, an alternate interpretation is that a feature vector input for structured data tasks is already an embedding! When the input data is low dimensional, you can do anomaly detection and clustering just with histogramming and other basic population statistics on your data, so it's a lot easier than dealing with unstructured data like imagery (see the sketch below).
So I wouldn't say that our tooling wouldn't work for structured data, but more that in those types of cases, maybe there's something simpler that works just as well.
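For example, a minimal sketch of the kind of "basic population statistics" that often suffice for low-dimensional tabular data (the file name and threshold are arbitrary):

```python
# Simple population-statistics anomaly check on tabular data - no learned embedding needed.
import pandas as pd

df = pd.read_csv("transactions.csv")  # placeholder tabular dataset

numeric = df.select_dtypes("number")
z_scores = (numeric - numeric.mean()) / numeric.std()

# Rows where any numeric column is more than 4 standard deviations from the mean.
outliers = df[(z_scores.abs() > 4).any(axis=1)]
print(outliers.head())
```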