
Each fine tune drags the model weights away from the base model in a certain direction.

Given 500 fine-tune datasets, we could expect the 500 drag directions to span a 500-dimensional space. After all, 500 random vectors in a high-dimensional space are nearly mutually orthogonal.

The paper shows, however, that the 500 drag directions live in a ~40 dimensional subspace.

Another way to say it is that you can compress fine tune weights into a vector of 40 floats.

Imagine if, one day, fine tunes on huggingface were not measured in gigabytes, megabytes, or even kilobytes. Suppose you started to see listings like 160 bytes. Would that be surprising?

I’m leaving out the detail that the basis direction vectors themselves would have to be on your machine and each basis direction is as big as the model itself. And I’m also taking for granted that the subspace dimension will not increase as the number of fine tune datasets increases.
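To make the "vector of 40 floats" concrete, here is a minimal sketch (my own illustration, not the paper's code) of how a fine-tune could be reconstructed from 40 coefficients, assuming the 40 shared basis directions are already on your machine:

    #include <array>
    #include <cstddef>
    #include <vector>

    constexpr int kSubspaceDim = 40;  // assumed subspace dimension

    // Reconstruct fine-tuned weights as base + sum_i c_i * B_i.
    // `basis_directions` are the 40 shared directions (each as large as the model);
    // `coeffs` is the ~160-byte payload that would identify one fine-tune.
    std::vector<float> reconstruct_finetune(
        const std::vector<float>& base_weights,
        const std::vector<std::vector<float>>& basis_directions,
        const std::array<float, kSubspaceDim>& coeffs) {
      std::vector<float> w = base_weights;
      for (int i = 0; i < kSubspaceDim; ++i) {
        for (std::size_t j = 0; j < w.size(); ++j) {
          w[j] += coeffs[i] * basis_directions[i][j];
        }
      }
      return w;
    }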

I agree that the authors' decision to use random models on Hugging Face is unfortunate. I'm hopeful that this paper will inspire follow-up work that trains large models from scratch.


Where do lock free algorithms fall in this analysis?

Some considerations can be similar, but the total set of details is different.

It also depends which lock free solutions you're evaluating.

Some are higher-order spins (with similar high-level problems), others have different secondary costs (such as copying). A common overlap is the inter-core, inter-thread, and inter-package side effects of synchronization points: for a lot of designs with a strong atomic in the middle, that means sync instruction costs, pipeline impacts of barriers/fences, etc.


There are several classes of lock free algorithms with different properties.

Lock-free algorithms for read-only access to shared data structures seldom have disadvantages (only when the shared data structure is modified so frequently by writers that readers never manage to read it between changes), so for read-only access they are typically the best choice.

On the other hand, lock-free algorithms for read/write access to shared data structures must be used with great caution, because they frequently have a higher cost than mutual exclusion. Such lock-free algorithms are based on the optimistic assumption that your thread will complete the access before the shared structure is touched by another competing thread. Whenever this assumption fails (which will happen under high contention), the transaction must be retried, which wastes much more work than a spinlock would.

Lock-free algorithms for read/write access are normally preferable only when it is certain that there is low contention for the shared resource, but in that case a spinlock would also waste negligible time.

The term "lock-free" is properly applied only to the access methods based on optimistic access instead of mutual exclusion (which uses locks).

However, there is a third kind of access method, which uses neither optimistic access nor mutual exclusion with locks, so some authors may lump such methods together with the lock-free algorithms based on optimistic access.

This third kind of access method has properties very different from the other two, so it is better considered separately. It is based on partitioning the shared data structure between the threads that access it concurrently, so such methods apply only to certain kinds of data structures, mainly arrays and queues. Nevertheless, almost any kind of inter-thread communication can be reorganized around message queues and shared buffers, so most applications that use either mutual exclusion with locks or lock-free optimistic access can be rewritten to use concurrent access to dynamically partitioned shared data structures. There the access is deterministic, unlike with lock-free optimistic algorithms, and there are no explicit locks; however, the partitioning of the shared data structure must be done with atomic instructions (usually atomic fetch-and-add), which contain implicit locks, though these are equivalent to extremely short locked critical sections.
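A minimal sketch of that third kind (my illustration, not from any particular library): producers claim disjoint slots of a shared buffer with an atomic fetch-and-add, so no two threads ever write the same element and no retries are needed.

    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct PartitionedBuffer {
      std::vector<int> slots;
      std::atomic<std::size_t> next_slot{0};

      explicit PartitionedBuffer(std::size_t capacity) : slots(capacity) {}

      // Each caller gets a distinct slot index; the write itself needs no lock
      // and never retries, because the fetch-and-add partitions the buffer.
      bool try_push(int value) {
        std::size_t i = next_slot.fetch_add(1, std::memory_order_relaxed);
        if (i >= slots.size()) return false;  // buffer full
        slots[i] = value;
        return true;
      }
    };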


Typically lock-free algorithms do not retry when a read happens concurrently with a write[1], so they scale very well for read mostly data or when data is partitioned between writers.

Scalability is always going to be poor when writers attempt to modify the same object, no matter the solution you implement.

[1] of course you could imagine a lock-free algorithm where reads actually mutate, but that's not common.


> Scalability is always going to be poor when writers attempt to modify the same object, no matter the solution you implement.

MVCC.


Well, yes, that's one way of avoiding mutating the same object of course.

Technically it is not, because eventually the object will be mutated; but yes, that's one way of achieving scalability in a multiple-writers scenario.

In a very basic case, lock-free data structures make threads race instead of spin. A thread makes its own copy of the part of the list/tree/whatever it needs to update, applies its changes to that copy, and then tries to swap its own pointer into the data structure's pointer if that pointer hasn't changed in the meantime (there is a CPU atomic instruction for that, compare-and-swap). If the thread fails (someone changed the pointer in the meantime), it tries again.
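A minimal sketch of that copy-then-swap retry loop (a Treiber-style lock-free stack push; names are illustrative):

    #include <atomic>

    struct Node {
      int value;
      Node* next;
    };

    void push(std::atomic<Node*>& head, int value) {
      Node* n = new Node{value, head.load(std::memory_order_relaxed)};
      // Try to swap our node in; compare_exchange_weak fails and reloads
      // `head` into n->next if another thread changed the pointer meanwhile,
      // so we simply race and retry until we win.
      while (!head.compare_exchange_weak(n->next, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed)) {
      }
    }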

Their VGGT, DINOv3, and Segment Anything models are pretty impressive.

The problem becomes complicated once the large discrete objects are not actuated. Even worse if the large discrete objects are not consistently observable because of occlusions or other sensor limitations. And almost impossible if the large discrete objects are actuated by other agents with potentially adversarial goals.

Self driving cars, an application in which physics is simple and arguably two dimensional, have taken more than a decade to get to a deployable solution.


The authors somewhat address your questions in the accompanying paper https://arxiv.org/abs/2410.24206

> We emphasize that the central flow is a theoretical tool for understanding optimizer behavior, not a practical optimization method. In practice, maintaining an exponential moving average of the iterates (e.g., Morales-Brotons et al., 2024) is likely a computationally feasible way to estimate the optimizer’s time-averaged trajectory.

They analyze the behavior of RMSProp (Adam without momentum) using their framework to come up with simplified mathematical models that are able to predict actual training behavior in experiments. It looks like their mathematical models explain why RMSProp works, in a way that is more satisfying than the usual hand waving explanations.
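For reference, the standard RMSProp update they analyze (Adam with the momentum term removed) is

    v_t = beta * v_{t-1} + (1 - beta) * g_t^2
    theta_{t+1} = theta_t - lr * g_t / (sqrt(v_t) + eps)

where g_t is the gradient at step t; the central flow is a model of the time-averaged trajectory of these iterates.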


Yes, it certainly provides a lot more clarity than the handwaving.

While momentum seems to work, and the authors clearly state the central flow is not intended as a practical optimization method, I wouldn't rule out that convergence rates could be improved by building on this knowledge.

Is the oscillating behavior guaranteed to have a period of 2 steps, or is, say, a 3-step period also possible (a vector in a plane could alternately point to 0 degrees, 120 degrees, and 240 degrees)?

The way I read this presentation, the implication seems to be that it's always a period of 2. Perhaps if the top-2 sharpnesses are degenerate (identical), a period N distinct from 2 could be possible?

It makes you wonder what would happen if, instead of storing momentum with an exponential moving average, one were to use the average of the last 2 iterations, so there would be less lag.

It also makes me wonder if we should perform 2 iterative steps PER sequence so that the single-sample-sequence gives feedback along its valley instead of across it. One would go through the corpus at half the speed, but convergence may be more accurate.


I doubt it would ever be period 3.

Not a formal proof, but there is this fun theorem called "period three implies chaos" that my gut instinct says applies here.

Basically, if you have a continuous mapping from [a,b] -> [a,b] and there exists a 3-cycle, then every other cycle length exists.

Which in this case would kinda say that if you are bouncing between three values on the y-axis (and the bouncing is a continuous function, which admittedly the gradient of a ReLU is not), you are probably in a chaotic system.

Now that requires assuming that the behaviour of y is largely a function of just y. But their derivation seems to imply that it is the case.
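A one-dimensional sanity check (my sketch, assuming the loss is locally quadratic along the sharpest direction, with curvature lambda and step size eta): gradient descent along that direction is just

    x_{t+1} = x_t - eta * lambda * x_t = (1 - eta * lambda) * x_t

Once eta * lambda > 2, the factor (1 - eta * lambda) is less than -1, so the sign flips every step (period 2) while the magnitude grows until higher-order terms kick in. A period other than 2 would require the dynamics along that direction to be something other than multiplication by a single negative scalar, e.g. two sharp directions interacting.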


First I thought this would be just another gradient descent tutorial for beginners. But the article goes quite deep into gradient descent dynamics, looking into third order approximations of the loss function and eventually motivating a concept called "central flows." Their central flow model was able to predict loss graphs for various training runs across different neural network architectures.


Can someone explain the bit counting argument in the reinforcement learning part?

I don’t get why a trajectory would provide only one bit of information.

Each step of the trajectory is at least giving information about what state transitions are possible.

An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.


I believe it's because of the way you measure things in RL: each episode only tells you whether it was good (say reward +1) or bad (say 0 or negative reward); it does not tell you anything about the trace that was produced to get the outcome. This reward is the only thing used to produce your gradients, hence the amount of information in it is O(1).

This is in contrast to more "supervised" forms of learning where you could get a loss for each token produced (e.g. cross entropy loss), and where you'd get, as a consequence O(number of tokens) information into your gradients.
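A rough way to see the contrast, as an upper bound (my back-of-the-envelope, not from the post): a binary outcome reward R in {0, 1} carries at most

    H(R) <= log2(2) = 1 bit

per trajectory, whereas per-token supervision of a length-T trace over a vocabulary of size V can convey up to

    T * log2(V) bits

(the information needed to pin down the whole target sequence).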


A fair amount of research has shown that RL doesn't add knowledge to the base model; it just optimizes paths that already exist. Now ProRL from Nvidia has shown there are ways of adding knowledge, mostly through progressive merging.

I’m still not fully convinced of the 1-bit claim; they made other mistakes in the blog post.


These problems seem to have the flavor of interview questions I heard for quant positions.


So the role that would benefit from this specific competency is: interviewer

(For both quant and leetcode positions)


Interesting article. It’s actually very strange that the dataset needs to be “big” for the O(n log n) algorithm to beat the O(n). Usually you’d expect the big O analysis to be “wrong” for small datasets.

I expect that in this case, like in all cases, as the datasets become galactically large, the O(n) algorithm will start winning again.


The hash-based algorithm is only O(n) because the entry size has a limit. In a more general case, it would be something more like O(m(n * e)). Here n is the number of entries, e is the maximum entry size and m is a function describing how caching and other details affect the computation. With small enough data, the hash is very fast due to CPU caches, even if it takes more steps, as the time taken by a step is smaller. The article explains this topic in a less handwavey manner.

Also, memory access is constant time only to some upper limit allowed by the hardware, which requires significant changes to the implementation when the data does not fit the system memory. So, the hash algorithm will not stay O(n) once you go past the available memory.

The sorting algorithms do not suffer from these complexities quite as much, and similar approaches can be used with data sets that do not fit a single system's memory. The sorting-based algorithms will likely win in the galactically large cases.

Edit: Also, once the hash table would need to grow beyond what the hash function can describe (e.g. beyond 64 bit integers), you need to grow the function's data type. This is essentially a hidden log(n) factor, as the required length of the data type is log(n) of the maximum data size.


Interestingly, you need a hash function with an output big enough to be unique for all data points with high probability; it doesn't take much to point out that this is at least O(log(n)) bits if all items are unique.

Also if items take up k bytes then the hash must typically be O(k), and both the hashing and radix sort are O(n k).

Really, radix sort should be considered O(N), where N is the total amount of data in bytes. It can beat the theoretical comparison-sort limit because it sorts lexicographically, which is not always an option.


Radix sort isn't a comparison-based sort, so it isn't beholden to the O(n log n) speed limit in the same way. It's basically O(n log k), where k is the number of possible different values in the dataset. If we're using machine data types (TFA is discussing 64-bit integers) then k is a constant factor and drops out of the analysis. Comparison-based sorts assume, basically, that every element in the input could in principle be distinct.

Basically, the hashed sorting approach is effectively O(n), and is so for the same reason that the "length of hash table" approach is. The degenerate case is counting sort, which is a single pass with a radix base of k. (Which is analogous to pointing out that you don't get hash collisions when the hash table is as big as the hashed key space.)
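For concreteness, a minimal (untuned) LSD radix sort over uint64 keys with 1024 buckets, which is ceil(64/10) = 7 counting-sort passes regardless of n:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Minimal LSD radix sort for uint64 keys: 1024 buckets (10-bit digits),
    // so ceil(64/10) = 7 counting-sort passes no matter how large n is.
    void radix_sort_u64(std::vector<std::uint64_t>& keys) {
      constexpr int kDigitBits = 10;
      constexpr int kBuckets = 1 << kDigitBits;
      std::vector<std::uint64_t> tmp(keys.size());
      for (int shift = 0; shift < 64; shift += kDigitBits) {
        std::size_t count[kBuckets] = {};
        for (std::uint64_t k : keys) ++count[(k >> shift) & (kBuckets - 1)];
        std::size_t pos = 0;  // exclusive prefix sums -> bucket start offsets
        for (int b = 0; b < kBuckets; ++b) { std::size_t c = count[b]; count[b] = pos; pos += c; }
        for (std::uint64_t k : keys) tmp[count[(k >> shift) & (kBuckets - 1)]++] = k;
        keys.swap(tmp);
      }
    }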


One detail most comments seem to be missing is that the O(1) complexity of get/set in hash tables depends on memory access being O(1). However, if you have a memory system operating in physical space, that's just not possible (you'd have to break the speed of light). Ultimately, the larger your dataset, the more time it is going to take (on average) to perform random access on it. The only reason we haven't noticed this much in practice yet is that we mostly grow memory capacity by making it more compact (the same as CPU logic), not by adding more physical chips/RAM slots/etc. Still, memory latency has been slowly rising since the 2000s, so even shrinking can't save us indefinitely.

One more fun fact: this is also the reason why Turing machines are a popular complexity model. The tape on a Turing machine does not allow random access, so it simulates the act of "going somewhere to get your data". And as you might expect, hash table operations are not O(1) on a Turing machine.


No, because radix sort is not a comparison-based sort and is not O(n log n).


The key here is in the cache lines.

Sorting (really scanning through an array) is very efficient in its use of cache lines, while hash tables are very inefficient.

In the example, every memory access in hashing gets 8 bytes of useful data, while every memory access in sorting gets 64 bytes of useful data.

So you have to be 8x more efficient to win out.

For this problem, the radix sort is log_1024(n) passes, which for a key size of 64 bits can’t ever get above 7 passes.

If you increase the key size, then the efficiency win of sorting drops (128-bit keys mean you have to run fewer than 4 passes to win).

Essentially the size of the key compared to the cache line size is a (hidden) part of the big O.


That is what baffles me. The difference in big-O complexity should be more visible with size, but that's where it loses to the "worse" algorithm.

I could imagine the hash table wins again beyond an even greater threshold. Like what about 120 GB and beyond?


As others have pointed out, radix sort is O(64N) = O(N) for a fixed key length of uint64s, as in this article.

So it comes down to the cost of hashing, hash misses, and other factors as discussed in the article.

I remember learning that radix sort is (almost?) always fastest if you're sorting integers.

A related fun question is to analyze hash table performance for strings where you have no bound on string length.


Sorting wins for unbounded strings because it only needs a prefix of each string.


The analysis in the "Why does sorting win?" section of this article gave us a way to estimate that threshold.

Here's my attempt at it, based on that analysis:

Suppose each item key requires s bytes

For the hash table, assuming s <= 64 (our cache line size), we need to read one cache line and write one cache line: 2 * 64 = 128 bytes of bandwidth per item.

The bandwidth to sort one key is p(N) * 2 * s, where p(N) is the number of passes of 1024-bucket radix sort required to sort N elements, and the 2 comes from 1 read + 1 write per radix sort pass.

p(N) = ceil(log2(N) / log2(buckets))

Suppose N is the max number of items we can distinguish with an s=8 byte item key, so N = 2^64

then p(N) = ceil(64 / 10) = 7 passes

7 radix passes * (1 read + 1 write) * 8 bytes per item key = 112 bytes of bandwidth consumed; this is still less than the hash table's 2 * 64 = 128 bytes.

We haven't found the threshold yet.

We need to either increase the item count N or increase the item key size s beyond 8 bytes to find a threshold where this analysis estimates that the hash map uses less bandwidth. But we can't increase the item count N without first increasing the key size s. Suppose we just increase the key size and leave N as-is.

Assuming N = 2^64 items and a key size of s = 10 bytes gives an estimate of 140 bytes of bandwidth for sorting vs 128 bytes for the hash table. We expect sorting to be slower for these values, and increasing s or N further should make it even worse.

(The bandwidth required for the hash map hasn't increased, as our 10-byte key is still smaller than a cache line.)

Edit: the above is wrong, as I'm getting mixed up with elements vs. items. Elements are not required to be unique items; if all elements were unique items, the problem would be trivial. So N is not bounded by the key size, and we can increase N beyond 2^64 without increasing the key size s beyond 8 bytes.

Keeping the key size s fixed at 8 bytes per item suggests the threshold is at N = 2^80 items,

given by solving for N at the point where the bandwidth estimate for sorting equals the bandwidth estimate for the hash table:

2 * 8 * (log2(N) / log2(1024)) = 2 * 64, i.e. 1.6 * log2(N) = 128, so log2(N) = 80.
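A quick numeric check of that estimate (same assumptions: 8-byte keys, 1024-bucket passes, 64-byte cache lines, 1 read + 1 write on each side):

    #include <cmath>
    #include <cstdio>

    int main() {
      const double key_bytes = 8.0, cache_line = 64.0, bucket_bits = 10.0;
      const double hash_bw = 2.0 * cache_line;  // read + write one cache line per item
      for (double log2n = 60.0; log2n <= 90.0; log2n += 10.0) {
        double passes = std::ceil(log2n / bucket_bits);
        double sort_bw = passes * 2.0 * key_bytes;  // read + write per radix pass
        std::printf("N = 2^%.0f: sort %.0f B/key vs hash %.0f B/key\n",
                    log2n, sort_bw, hash_bw);
      }
      // Prints equality at N = 2^80 (8 passes * 16 B = 128 B), matching the estimate.
      return 0;
    }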


> This analysis would suggest a 2.7× speedup vs. hash tables: 128 bytes vs. 48 bytes of memory traffic per uint64

It's OK to waste bandwidth; it doesn't directly impact performance. However, the limitation you are hitting (which does directly impact performance) is the number of memory accesses you can do (per cycle, for instance) and the latency of each memory access. With linear access, after some initial read, data is ready instantly for the CPU to consume. For scattered data (hash tables), you have to pay a latency penalty on every read.

So the bandwidth ratio is not the right factor to look at to estimate performance.


The big-O complexity makes assumptions that break down in this case, e.g. it "ignores" memory access cost, which seems to be a key factor here.

[edit] I should have said "basic big-O complexity" makes assumptions that break down. You can of course decide to model memory access cost as part of the big-O analysis, since it is just a mathematical model.


Radix sort is not a comparison-based sort and is not O(n log n).


why is it strange? the case where asymptotically better = concretely worse is usually the exception, not the norm.


We expect asymptotically better algorithms to be, well, asymptotically better. Even if they lose out for small N, they should win for larger N - and we typically expect this "larger N" to not be that large - just big enough so that data doesn't fit in any cache or something like that.

However, in this particular case, the higher-complexity sorting algorithms gets better than the hash algorithm as N gets larger, even up to pretty large values of N. This is the counter-intuitive part. No one is surprised if an O(n log n) algorithm beats an O(n) algorithm for n = 10. But if the O(n log n) algorithm wins for n = 10 000 000, that's quite surprising.


As others have likely pointed out by now, radix sort is not O(n log n).

It's O(k * n), where k is a constant determined by the key size.

For 64-bit keys with 1024 buckets (10 bits per pass), k = ceil(64/10) = 7, so it's O(7 * n) = O(n).


Just speculating, but proximity to a reference answer is a much denser reward signal. In contrast, parsing out a final answer into a pass/fail provides only a sparse reward signal.


Yup, RLVR as implemented by DeepSeek et al. uses only outcome supervision instead of process supervision. There have been attempts to do process supervision, though.

