jkaptur's comments | Hacker News

(I'm not an expert. I'd love to be corrected by someone who actually knows.)

Floating-point arithmetic is not associative: (A+B)+C does not necessarily equal A+(B+C). You can get a performance improvement by calculating A, B, and C in parallel and then adding together whichever two finish first, but that changes the order of the additions and therefore the result. So, in theory, transformers can be deterministic, but in a real system they almost always aren't.
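
A minimal sketch of the non-associativity itself (just plain Python floats, not any particular inference stack):

  # Classic example: the grouping changes the result in the last bits.
  a, b, c = 0.1, 0.2, 0.3
  print((a + b) + c == a + (b + c))        # False
  print((a + b) + c, a + (b + c))          # 0.6000000000000001 0.6

  # Order matters too: small terms get absorbed if added after a big one.
  xs = [1e16] + [1.0] * 10
  print(sum(xs), sum(reversed(xs)))        # 1e+16  1.000000000000001e+16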


Not an expert either, but my understanding is that large models use quantized weights and tensor inputs for inference. Multiplication and addition of fixed-point values are associative, so unless there's an intermediate "convert to/from IEEE float" step (activation functions, maybe?), you can still build determinism into a performant model.

Fixed-point arithmetic isn't truly associative unless the values have infinite precision. The second you hit a limit or saturate/clamp a value, the result very much depends on the order of operations.

Ah yes, I forgot about saturating arithmetic. But even for that, you wouldn't need infinite precision for all values, you'd only need "enough" precision for the intermediate values, right? E.g. for an inner product of two N-element vectors containing M-bit integers, an accumulator with at least ceil(log2(N))+2*M bits would guarantee no overflow.
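
A quick sanity check of that bound (just a sketch; N and M here are arbitrary example values):

  import math

  N = 4096   # vector length (example value)
  M = 8      # bits per unsigned element, e.g. int8-style quantization

  # Worst case: every product is (2^M - 1)^2 and we sum N of them.
  worst_case_sum = N * (2**M - 1) ** 2
  bits_needed = math.ceil(math.log2(worst_case_sum + 1))
  bits_budgeted = math.ceil(math.log2(N)) + 2 * M

  print(bits_needed, bits_budgeted)       # 28 28
  print(bits_needed <= bits_budgeted)     # True -- no overflow possible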

True, you can increase the bit width to guarantee you never hit those issues, but right now saturating arithmetic on types that pretty commonly hit those limits is the standard. Guaranteeing it would mean a significant performance drop and/or memory-use increase with current techniques, to the point that it would significantly affect availability and cost compared to what people expect.

Similarly, you could disallow re-ordering of operations and the like, so the results are guaranteed to be deterministic (even if still "not correct" compared to infinite-precision arithmetic), but that would also have a big performance cost.


> you can get a performance improvement by calculating A, B, and C in parallel, then adding together whichever two finish first

Technically possible, but I think unlikely to happen in practice.

At the higher level, these large models are sequential and there’s nothing to parallelize. Inference is a continuous chain of data dependencies between temporary tensors, which makes it impossible to compute different steps in parallel.

At the lower level, each step is a computationally expensive operation on a large tensor/matrix. These tensors often contain millions of numbers, the problem is very parallelizable, and the tactics for doing that efficiently are well researched because matrix linear algebra has been in wide use for decades. However, it’s both complicated and slow to implement fine-grained parallelism like “adding together whichever two finish first” on modern GPUs: with many thousands of active threads, that much synchronization is just too expensive. Instead, operations like matrix multiplication typically assign one thread per output element (or per fixed count of output elements), and reductions like softmax or a vector dot product use a series of exponentially shrinking reduction steps, i.e. the order is deterministic.
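
A toy model of that fixed-order reduction pattern (plain Python standing in for a GPU block reduction, so this is a sketch of the access pattern, not real kernel code):

  def tree_reduce(values):
      # Pairwise reduction with a fixed, data-independent order:
      # the stride doubles each step, like a GPU block reduction.
      vals = list(values)
      n = len(vals)
      stride = 1
      while stride < n:
          for i in range(0, n, 2 * stride):
              if i + stride < n:
                  vals[i] = vals[i] + vals[i + stride]
          stride *= 2
      return vals[0]

  data = [0.1 * i for i in range(1000)]
  # Same input, same order, same rounded result every run --
  # unlike "add whichever two finish first".
  print(tree_reduce(data) == tree_reduce(data))   # True
  print(tree_reduce(data) == sum(data))           # may be False: different order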

However, that order may change with even a minor update to any part of the software, including opaque low-level pieces like GPU drivers and firmware. Library developers keep updating GPU kernels, and the drivers, firmware, and OS kernels collectively implement the scheduler that assigns work to cores; both can affect the order of these arithmetic operations.


I don't think the order of operations is non-deterministic between different runs. That would make programming and researching these systems more difficult than necessary.

It would be if you used atomics.

I said I don't think it's non-deterministic; two negations -> deterministic.

It’s usually not too difficult or expensive to avoid doing this.

https://www.jkaptur.com - I have some plans to add more content, but who doesn't? :)


There are two extremes here: first, the "architects" that this article rails against. Yes, it's frustrating when a highly-paid non-expert swoops in to offer unhelpful or impossible advice.

On the other hand, there are Real Programmers [0] who will happily optimize the already-fast initializer, balk at changing business logic, and write code that, while optimal in some senses, is unnecessarily difficult for a newcomer (even an expert engineer) to understand. These systems have plenty of detail and are difficult to change, but the complexity is non-essential. This is not good engineering.

It's important to resist both extremes. Decision makers ultimately need both intimate knowledge of the details and the broader knowledge to put those details in context.

0. http://www.catb.org/jargon/html/story-of-mel.html


Another point is that the world is always changing. If you work slowly, you are at much greater risk of having an end result that isn't useful anymore.

(Like the author, of course, I'm massively hypocritical in this regard).


I think that there are three relevant artifacts: the code, the specification, and the proof.

I agree with the author that if you have the code (and, with an LLM, you do) and a specification, AI agents could be helpful to generate the proof. This is a huge win!

But it certainly doesn't confront the important problem of writing a spec that captures the properties you actually care about. If the LLM writes that for you, I don't see a reason to trust that any more than you trust anything else it writes.

I'm not an expert here, so I invite correction.


"Couples often flake together. This changes the probability distribution of attendees considerably"

It's interesting to consider the full correlation matrix! Groups of friends may tend to flake together too, people who live in the same neighborhood might rely on the same subways or highways...

I think this is precisely the same problem as pricing a CDO, so a Gaussian Copula or graphical model is really what you need. To plan a great party.
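
A rough sketch of what that could look like with a Gaussian copula over per-guest RSVP probabilities (all the numbers here are made up):

  import numpy as np
  from scipy.stats import norm

  rng = np.random.default_rng(0)

  # Marginal probability that each guest shows up (made-up values).
  p = np.array([0.9, 0.9, 0.7, 0.7, 0.5, 0.5])

  # Correlation: guests 0/1 and 2/3 are couples, 4/5 share a subway line.
  corr = np.eye(6)
  for a, b in [(0, 1), (2, 3), (4, 5)]:
      corr[a, b] = corr[b, a] = 0.8

  # Gaussian copula: draw correlated normals, then threshold each one
  # at its guest's quantile to get correlated yes/no attendance.
  z = rng.multivariate_normal(np.zeros(6), corr, size=100_000)
  attends = z < norm.ppf(p)

  counts = attends.sum(axis=1)
  print("expected headcount:", counts.mean())
  print("P(3 or fewer show up):", (counts <= 3).mean())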


We tend to calculate "people at percentages", ie: 2 adults, 2 kids, 50% chance of showing up rates as an attendance-load of 1.5 virtual people (for food calculations).

Then sometimes you need the "max + min souls" (seats, plates), and account for what we call "the S-factor" if someone brings an unexpected guest, roommate, etc.
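
A back-of-the-envelope version of that arithmetic (a sketch; the half-portion weight for kids is my guess at how 2 adults + 2 kids at 50% comes out to 1.5):

  # Each invitee: (adult-equivalent portions, probability of showing up).
  family = [
      (1.0, 0.5),   # adult
      (1.0, 0.5),   # adult
      (0.5, 0.5),   # kid, counted as half a portion (assumption)
      (0.5, 0.5),   # kid
  ]

  # Food: expected "virtual people".
  food_load = sum(portions * prob for portions, prob in family)
  print(food_load)               # 1.5

  # Seats and plates: plan for max souls, plus an S-factor fudge
  # for the unexpected roommate.
  s_factor = 1
  print(len(family) + s_factor)  # 5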

Lastly: there is a difference between a "party" and a "soirée" (per my college roommate: "you don't have parties, you have soirées!")

All the advice is really accurate, makes me miss hosting. If you want to go a little deeper, there's a book called "How to be a Gentleman", and it has a useful section on "A Gentleman Hosts a Party", and then "Dads Own Cookbook" has a chapter on party planning, hosting, preparation timelines... there's quite a bit of art and science to it!


> We tend to calculate "people at percentages", ie: 2 adults, 2 kids, 50% chance of showing up rates as an attendance-load of 1.5 virtual people (for food calculations).
>
> Then sometimes you need the "max + min souls" (seats, plates), and account for what we call "the S-factor" if someone brings an unexpected guest, roommate, etc.

I made myself a "food and drinks amount" calculator for weekend/week-long party events a few years back and it was eerily accurate once you factor unexpected plus-ones, flake rates, hangovers, and other computable-at-scale events into the formula!



I’ve never had the mental bandwidth to try to manage my manager and team like this. While I don’t trust them to provide the best feedback, I also don’t trust that I won’t make mistakes. And what does it matter if I can’t control everything, unless too much risk is involved?

The color of that bike shed is distracting, though. Is it purple or pink?


PowerPoint actually fine

  - bad communication possible in any medium
  - pptx in NASA even today!
  - issue is managers/SMEs communication differences
    - issues with technical papers
      - long
      - boring
  - vs word, excel, pdf...
(Next slide please)

Manager/SME Differences

  - context vs conclusion 
  - tell a compelling story
    - but give away the ending FIRST 
  - inherent personality differences
  - motivations/incentives/mindsets
(Next slide)

Learning from disasters

  - medium guides message and messenger
  - blame tool - binary choice?
  - presentation aide vs distributed technical artifact
(Next slide)

Questions?


Some time ago, I made up a PowerPoint show on effective communication[0].

I’ve found that most folks have no intention of improving their communication effectiveness. Everyone is much happier blaming the audience.

[0] https://news.ycombinator.com/item?id=44202502


Blaming the audience makes sense because, after all, they're the ones not getting the message right and not asking the presenter to explain it better. But it remains the presenter's failure if they don't catch the audience's attention and deliver a clear message.

Every time I had a presentation, I tried to analyze the failures (including listening to myself when it was recorded, a really painful experience). Certain mistakes, such as putting slides on a white background, which makes attendees look at the screen and read instead of watching the presenter and listening to him, can be devastating, simply because attendees are naturally attracted to light. It's not the audience's fault, it's the presenter's fault (and to some extent the tools in use). A good exercise is to stop the slides from time to time during the presentation (i.e. switch to a black one); you'll be amazed how much attention you suddenly catch, you feel like you're at a theater. It even manages to catch the attention of those who were looking at their smartphones, because the light in the room suddenly changes.

Another difficulty, specific to native English speakers, is that many of them initially underestimate how hard it is for the audience to catch certain expressions (with some people it's very hard to distinguish "can" from "can't", for example, which complicates understanding), idiomatic ones, or references to local culture, because such things are part of their daily vocabulary. Of course, after a few public talks, when they get questions at the end proving there were misunderstandings, they realize that speaking slower, articulating a bit more, and avoiding such references does help with non-native listeners. Conversely, when you present in a language that is not yours, you stick to very simple vocabulary, using longer sentences to assemble words into a non-ambiguous meaning. It probably sounds boring to native speakers, but the message probably reaches the audience better.

In any case, it is always the presenter's failure when a message is poorly delivered, and their responsibility to try to improve it, however difficult that is. It's just important never to give up.


"... if all knowledge were stored in a structured way with rich semantic linking..." this sounds a lot like Google's "Knowledge Graph". https://developers.google.com/knowledge-graph. (Disclosure: I work at Google.)

If you ask an LLM where you can find a structured database of knowledge with structured semantic links, it'll point you to this and other knowledge graphs. TIL about Diffbot!

In my experience, it's a lot more fun to imagine the perfect database like this than it is to work with the actual ones people have built.


This essay would benefit from results in computational complexity.

P vs NP, of course, but also the halting problem and Rice's theorem: non-trivial semantic properties of programs are undecidable.

In other words, if you say "this is the solution to that sudoku puzzle", that's easy to verify. "This sudoku puzzle has a solution" is almost certainly much harder to verify. "Here's a program that can solve any sudoku puzzle" is impossible to verify (in general).
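
A sketch of the easy case, verifying a claimed 9x9 solution (it only checks the filled grid; checking that the original clues were preserved would be one more loop):

  def is_valid_sudoku_solution(grid):
      # grid: 9 lists of 9 ints. Every row, column, and 3x3 box
      # must contain exactly the digits 1-9 -- a polynomial-time check.
      target = set(range(1, 10))
      rows_ok = all(set(row) == target for row in grid)
      cols_ok = all(set(col) == target for col in zip(*grid))
      boxes_ok = all(
          {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)} == target
          for r in range(0, 9, 3)
          for c in range(0, 9, 3)
      )
      return rows_ok and cols_ok and boxes_ok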

