
Cool, I wonder if this means they will finally start letting foreign visitors also use the app. I'm an American living abroad now for many years, and I was initially super excited to try Waymo in LA and SF this summer when I visited with my family. Unfortunately they only make the iPhone app available via the US app store, and while I actually have a US credit card that I could in theory have used to make the switch, Apple makes it an absurd pain to change your region: they require you to both a) cancel any existing subscriptions AND b) wait until they all expire. Most tourists have it worse, as they have no option to switch even in theory.


Huh, as a Brit, I was able to use Waymo just fine this summer on my Android device.


It's even more difficult because, while all the benchmarks provide some kind of 'averaged' performance metric for comparison, in my experience most users have pretty specific regular use cases, and pretty specific personal background knowledge. For instance I have a background in ML, 15 years experience in full stack programming, and primarily use LLMs for generating interface prototypes for new product concepts. We use a lot of react and chakraui for that, and I consistently get the best results out of Gemini pro for that. I tried all the available options and settled on that as the best for me and my use case. It's not the best for marketing boilerplate, or probably a million other use cases, but for me, in this particular niche it's clearly the best. Beyond that the benchmarks are irrelevant.


Yikes. That's a rather disturbing but all too realistic possibility, isn't it? Flattery will get you... everywhere?


This is quite interesting, but I have to ask, have you experimented much with larger LLMs as a mechanism to basically automate the entire process?

I'm doing something pretty similar right now for internal meetings and I use a process like: transcribe meeting with utterance timestamps, extract keyframes from video along with timestamps, request segmented summary from LLM along with rough timestamps for transitions, add keyframe analysis (mainly for slides).
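
For concreteness, here's a rough sketch of that pipeline in Python. The helper functions (transcribe, extract_keyframes) and the llm_client interface are hypothetical stand-ins for whatever transcription/vision/LLM services you use, not a real library API:

    import json

    def summarize_meeting(video_path, llm_client):
        # 1. Transcribe audio into utterances with timestamps.
        #    `transcribe` is a placeholder for your ASR service.
        utterances = transcribe(video_path)        # [{"t": 12.3, "text": "..."}]
        # 2. Pull keyframes (mostly slide changes) with timestamps.
        #    `extract_keyframes` is likewise a placeholder.
        keyframes = extract_keyframes(video_path)  # [{"t": 45.0, "image": ...}]
        # 3. Ask the LLM for a segmented summary with rough transitions.
        prompt = (
            "Segment this meeting transcript into topical sections. "
            "Return JSON: a list of {title, summary, start} objects, "
            "where start is the approximate start timestamp.\n\n"
            + json.dumps(utterances)
        )
        sections = json.loads(llm_client.complete(prompt))
        # 4. Attach the nearest keyframe to each section (slide context).
        for s in sections:
            s["keyframe"] = min(keyframes, key=lambda k: abs(k["t"] - s["start"]))
        return sections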

gpt-4o, claude sonnet 3.5, llama 3.1 405b instruct, llama 3.1 70b instruct all do a pretty stunning job of this IMO. Each department still reviews and edits the final result before sending it out, but I'm so far quite impressed with what we get from the default output even for 1-2hr conversations.

I'd argue the key feature for us is also still providing a simple, intuitive UI for non technical users to manage the final result, edit, polish and send it out.


That is a great point! I can certainly think of cases where you might want to go with an LLM instead and we have definitely experimented with that approach. Here are some reasons why we think TreeSeg is more suitable for us:

1. A more algorithmic approach allows us to bake certain constraints into the model. As an example, you can add a regularizer to incentivize TreeSeg to split more eagerly when there are large pauses (see the sketch after this list). You can also strictly enforce minimum and maximum sizes on segments.

2. If you are interested in reproducing a segmentation with slight variations you might not have good results with an LLM. Our experience has been that there is significant stochasticity in the answers we get from an LLM. Even if you try to obtain a more deterministic answer (i.e. set temp to zero), you will need an exact copy of the model to get the same result in the future. Depending on what LLM you are using this might not be possible (e.g. OpenAI adjusts models frequently). With TreeSeg you only need your block-utterance embeddings, which you probably have already stored (presumably in a vector db).

3. TreeSeg outputs a binary tree of segments and their sub-segments and so forth... This structure is important to us for many reasons, some of which are subjects of future posts. One such reason is access to a continuum between local (i.e. chapters) and global (i.e. full session) context. Obtaining such a hierarchy via an LLM might not be that straightforward.

4. There is something attractive about not relying on an LLM for everything!
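
To make point 1 concrete, here's a hypothetical sketch of what a pause-aware split cost could look like. The scoring function, the pause_weight, and the coherence term are invented for illustration; they are not TreeSeg's actual internals:

    import numpy as np

    def split_cost(embeddings, pauses, k, pause_weight=0.5):
        # embeddings: (n, d) array of utterance embeddings
        # pauses:     (n,) seconds of silence before each utterance
        # Lower cost = better split point at index k.
        left, right = embeddings[:k], embeddings[k:]
        # Within-segment variance: each side should be topically coherent.
        coherence = left.var(axis=0).sum() + right.var(axis=0).sum()
        # Regularizer: a long pause before utterance k makes the split
        # cheaper, so the model splits more eagerly there.
        return coherence - pause_weight * pauses[k]

    def best_split(embeddings, pauses, min_size=5):
        # Strictly enforce a minimum segment size, as in point 1.
        candidates = range(min_size, len(embeddings) - min_size)
        return min(candidates, key=lambda k: split_cost(embeddings, pauses, k))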

Hope this is useful to you!


Holy moly, this was _exactly_ my impression. It seems to really be proliferating and it drives me nuts. It makes it almost impossible to do useful things, which never used to be a problem with Python - even in the case of complex projects.

Figuring out how to customize something in a project like LangChain is positively Byzantine.


I think it is still meaningful because it's extremely common for management to favor hiring cheaper 'talent'. Pointing out the issues with that in various different ways is still valuable.


Actually, my understanding is that it is an estimation because in the given context we don't know or cannot compute the true answer due to some kind of constraint (here memory or the size of |X|). An approximation is when we use a simplified or rounded version of an exact number that we actually know.


Wikipedia is on your side:

"In mathematics, approximation describes the process of finding estimates in the form of upper or lower bounds for a quantity that cannot readily be evaluated precisely"

This process doesn't use upper and lower bounds.

However, it still seems more like approximation than estimation to me because of this:

“Of course,” Variyam said, “if the [memory] is so big that it fits all the words, then we can get 100% accuracy.”

It seems that in estimation the answer should be unknowable without additional information, whereas in this case it's just a matter of resolution or granularity because of the memory size.
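
For context, here's a minimal sketch of the kind of memory-bounded distinct-count estimator being discussed. It follows the CVM idea of keeping a fixed-size random sample (assuming that's the algorithm at issue here); the actual paper has extra details, e.g. a failure case when thinning doesn't shrink the buffer:

    import random

    def estimate_distinct(stream, memory_size):
        p = 1.0      # current sampling probability
        buf = set()  # holds at most `memory_size` items
        for x in stream:
            buf.discard(x)               # forget any stale copy of x
            if random.random() < p:
                buf.add(x)
            if len(buf) == memory_size:  # out of room: thin the buffer
                buf = {y for y in buf if random.random() < 0.5}
                p /= 2
        # If memory fits every distinct word, p stays 1 and the answer
        # is exact -- the "100% accuracy" case quoted above.
        return len(buf) / p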

Anyhoo ...

EDIT: also the paper says "estimate" and the article says both "approximate" and "estimate" at different times so it seems everyone except me thinks it's either an estimation or that estimation and approximation are interchangeable.


Do we also have updated scores for the GPT3.5~GPT4.0 models? The old ones are here but they don't appear to have been updated:

- https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...


The author directly addresses this sentiment in the concluding paragraph:

> Although there are physicists who wonder “Why did we even need to do this experiment; we all knew that antimatter has positive mass,” that sentiment is absolutely foolish. We must remember — and I say this as a theoretical physicist myself — that physics is 100% an experimental science. We can be confident in our theory’s predictions only insofar as we can test and measure what it predicts; as soon as we step outside of the realm of what’s been validated by experiment, we run the risk of stepping outside the realm of where our theory is valid. We just learned that Einstein’s general relativity passed another test, the antimatter test, and with it, our greatest science-fiction hope for achieving warp drive has completely evaporated.


Indeed.

"Why do we need to measure the speed of light coming from a moving source. We all know that the velocity of all objects compounds with the velocity of their emitter"


I think the article is a near miss on the right idea. The important point is that a _dedicated_ vector database is probably overkill and not justified for most real-world use cases.

But a multi-modal database that also supports embeddings in hybrid mode or _in addition_ to standard retrieval techniques is both still very useful, and probably sufficient.

What that means to me is that it is yet another vote in favor of less optimized but far more versatile and robust solutions like: OpenSearch, Elastic, and PostgreSQL. [when I say 'less optimized' I'm only referring to their current vectordb plugins, not the rest of the machinery]

OpenSearch and Postgres are phenomenal, robust OSS tools, and the only lingering downside seems to be that their vectordb implementations are still a bit less optimized for large collections - but that probably doesn't matter in practice.
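
As one concrete illustration of the hybrid approach, here's a rough Python sketch combining Postgres full-text search with pgvector cosine similarity. The table and column names (documents, tsv, embedding) and the 0.5/0.5 score blend are made up for illustration; tune all of it for a real workload:

    import psycopg2

    def hybrid_search(dsn, query_text, query_embedding, k=10):
        # Blend a full-text rank with pgvector cosine similarity
        # (<=> is pgvector's cosine *distance*, so 1 - distance).
        sql = """
            SELECT id, title,
                   0.5 * ts_rank(tsv, plainto_tsquery(%s))
                 + 0.5 * (1 - (embedding <=> %s::vector)) AS score
            FROM documents
            WHERE tsv @@ plainto_tsquery(%s)
            ORDER BY score DESC
            LIMIT %s;
        """
        vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(sql, (query_text, vec, query_text, k))
            return cur.fetchall()

Nothing here needs a dedicated vector database, and you keep transactions, joins, and the rest of the machinery for free.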

