
mike210 · 2025-05-10T19:48:16 1746906496

Interesting - what kinds of tasks do you reach for 3.5 for?

muzani · 2025-05-11T10:21:44 1746958904

Pretty much whatever you're using 3.7 for. You don't need as tight a scope. It does easy things well.

A situation I had yesterday: we had two dropdowns. For simplicity, let's say the first is country. When you pick a different country, the second shows that country's states. When you select a state, then change country, it crashes because the state doesn't exist in the new country.

The standard solution is simple – just make it reset to null when switching countries, or better yet, check whether the selected state exists in the new country. But the thinking models will overengineer the hell out of this. They'll check from the deep service level when these checks can be made just below the view layer.
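A minimal sketch of that simpler fix, written as a plain function that would sit just below the view layer. The `STATES_BY_COUNTRY` data and the function name are made up for illustration, not taken from the actual app:

```python
from typing import Optional

# Hypothetical lookup table; a real app would load this from its own data source.
STATES_BY_COUNTRY = {
    "US": ["California", "Texas"],
    "CA": ["Ontario", "Quebec"],
}


def state_after_country_change(
    new_country: str, selected_state: Optional[str]
) -> Optional[str]:
    """Keep the selected state only if it exists in the new country; otherwise reset to None."""
    valid_states = STATES_BY_COUNTRY.get(new_country, [])
    return selected_state if selected_state in valid_states else None
```

Calling `state_after_country_change("CA", "Texas")` returns `None`, so the state dropdown resets instead of crashing, while a state that is still valid for the new country is kept.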

mike210 · 2025-05-07T04:52:31 1746593551

I occasionally turn on Max for Gemini to get more context in long conversations. The tool-use pricing (5 cents per tool call) can get pretty ridiculous on Claude 3.7 Sonnet Max, and I've had calls that cost ~$2.

mike210 · 2025-05-06T20:39:35 1746563975

As seen on r/LocalLLaMA here: https://www.reddit.com/r/LocalLLaMA/comments/1kfkg29/

For what it's worth I pasted this into a few tokenizers and got just over 24k tokens. Seems like an enormously long manual of instructions, with a lot of very specific instructions embedded...

jey · 2025-05-07T01:24:36 1746581076

I think it’s feasible because of their token prefix prompt caching, available to everyone via API: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...

mike210 · on July 30, 2021

Hi guys - glad to see this! I agree it's a huge problem and one that I've felt several times.

The reason I never book on TaskRabbit or Thumbtack anymore is that there's legitimately an 80% chance that they don't show up. There's no penalty for not showing up, or for expressing higher availability and then just choosing from it.

This means that every repair or home service that I consider comes down to a simple question: is this worth rebooking multiple times, or will I just bite the bullet and try to do it myself?

I would be willing to pay around 1.5-2X more than I book for if I had some guarantee, financial or otherwise, that they would show up. Just the ability to construct a simple equation in my head ("is this repair worth $X?") without trying to manage the cost of a no-show would be huge, and would have me checking the cost of many more small repairs.

vm · on July 30, 2021

Wow, that's a high no-show rate! We're major repeat customers for home pros, which gets our jobs prioritized and ultimately gives our homeowner clients great service. Today we don't charge homeowners a premium like that, but it's great to know you value the service so highly.

mike210 · on March 5, 2021

Hi! First, your last sentence made us really happy - thanks for the kind words.

As for your points: 1. Definitely. We are trying to get to feature parity ASAP with traditional methods so that our AI modules can really shine. We're building out a library right now so that scientists can put together their own pipelines.

2. We're keeping everything open at this point, and definitely see the possibility that other domains will need structured image analysis like the kind that we are building.

3. Interesting - might be a better way to phrase what we are working on.

Thanks! If you're interested in chatting, would love to get in touch. Reach out at michael at biodock dot ai.

mike210 · on March 3, 2021

Hey! What kind of histology? If you mean doing AI analysis of histology images for research, etc., then 100% - that's actually one of the modules we're looking to do, and if you work with those kinds of images I would love for you to reach out at michael at biodock dot ai. We're totally free for academics, and a module like that would be free for academics as well.

If you mean clinical diagnostics histology - we're going to hold off a bit on that, although we're discussing some early partnerships there.

mike210 · on March 3, 2021

Also, ping me about the automation part if you'd like to chat - would love to get some feedback there. michael at biodock dot ai

mike210 · on March 3, 2021

Hey! Super cool. Will shoot you a message. Would love to chat and see what things we could talk about. Seems like you guys have built out some valuable assay analyses that we've heard people asking for, so congrats.

mike210 · on March 3, 2021

Hi Andy,

We do have lots of unlabelled data, and we're also labeling a large portion of it. We use transfer learning for all of the models we're training, and the first backbone we use is partially self-supervised. It seems to help overall performance, but it's not a huge effect in our experience.

Maybe once we get a lot of models we can release the backbone weights for nuclear segmentation or at least a competition set of some data we've labeled. Some IP issues here though.

What kind of alternative are you looking for? Specifically one for cells, or just for biologics in general? I'm guessing you're trying to have a better base of weights to transfer off so you can train your own model?

I would say we have medium diversity in terms of images - I think unless you have a similar application right now, you'd be better off transferring off of Imagenet just due to the amount of labeled data.

andy99 · on March 3, 2021

Thanks for the reply! I'm actually working in a different domain but it seems to have a lot in common with yours - lots of unlabelled data, images that have nothing in common with Imagenet, in that they are all essentially of the same thing and we are looking for variations or features. We found that self-supervised pre-training (with various contrastive models) underperformed vs. starting with weights trained on Imagenet.

So a model that has been pretrained on something else, with enough variability to work as a feature extractor, but closer to the problem framing I mention would be of interest.

For state of the art computer vision stuff, most of the benchmarks use imagenet or similar datasets. But unfortunately I'm coming around to the realisation that those datasets are not representative of most real world problems (except general purpose scene / object recognition). So it becomes very challenging to pick out a potential technique to apply, and hope it transfers.

gkk · on March 4, 2021

Do I read it correctly that you're working with images that contain a repeated pattern of instances of the same object? I've been working with cell images, solving a segmentation task - what biodock works on - and found interesting tricks to train models on a vastly smaller number of labels than you would think is possible with off-the-shelf models (e.g. Mask-RCNN or U-net + refinements).

mike210 · on March 3, 2021

Interesting - it would be great to chat and find out more. Maybe there are things we can learn from each other. Can you shoot me an email at michael at biodock dot ai?

mike210 · on March 3, 2021

Definitely! There are some applications where traditional methods are simply good enough. However, when they aren't, it can be incredibly frustrating. From our conversations with scientists, this kind of data (3D, histology, difficult tissues, new assays) is increasing in volume.

As for correctness, we've only done mAP scores and traditional accuracy metrics so far to compare with other algorithms, but we also have our own internal metrics and a test set we're building out in-house to cover many edge cases, including some of the things you are talking about. One thing we're always trying to be sensitive to is fairness. We want to make sure that we're not biasing the test toward our algorithm, which would make us look better than we are.

itamarst · on March 3, 2021

I guess when I say correctness, I mean "how do I know it _continues_ to be correct on data we've never seen before". That's where metamorphic testing can be valuable, because it lets you at least find incorrectness on real-world data that hasn't been hand-tagged.
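To make that concrete, here is a rough sketch of one metamorphic relation: mirroring an image should not change how many objects get counted. Everything here is a toy assumption (the list-of-lists image type, the pixel-counting "model", the deliberately buggy variant); a real check would call the actual model on real unlabeled images:

```python
from typing import Callable, List

Image = List[List[int]]  # grayscale pixel rows; a stand-in for real image data


def flip_horizontal(image: Image) -> Image:
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in image]


def count_bright_pixels(image: Image, threshold: int = 128) -> int:
    """Toy stand-in for a model output such as a cell count."""
    return sum(1 for row in image for px in row if px > threshold)


def left_half_only(image: Image) -> int:
    """Deliberately buggy toy model that ignores the right half of each row."""
    return sum(1 for row in image for px in row[: len(row) // 2] if px > 128)


def holds_flip_invariance(model: Callable[[Image], int], image: Image) -> bool:
    """Metamorphic relation: the count on the mirrored image must match the original."""
    return model(image) == model(flip_horizontal(image))


# No hand-made label is needed for this image; only the relation is checked.
unlabeled = [[200, 10, 10, 10], [10, 10, 10, 10]]
```

On this input the relation holds for `count_bright_pixels` and fails for `left_half_only`, which is the point: the bug surfaces without anyone having tagged the image.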

mike210 · on March 3, 2021

Ah, yes. We're even looking to use some generative models to create variations of our data and then check that we do similarly well across cases.

I guess the point I was making was that we want to make sure we don't then use this generated or modified data in order to test other algorithms in the space and say we're better. Simply put, it would be unfair for us to make changes to perform better on a hurdle and then put other algorithms through those hurdles. But for internal use, it's definitely great!