
What is a little contradictory is that designing a system to use fewer resources can increase the number of people fine-tuning models, so the final result can be a net global increase in total energy use. A hypothetical goal could be to reuse fine-tuning, that is, to design a knowledge graph in which you fine-tune from a previously fine-tuned model (like dynamic programming: save the results of previous computations). LoRA allows us to store the small matrices at low cost.


I liked that you linked to renting a dual 24GB GPU setup for $0.60/hour, but how long would it take to fine-tune a 70b model using your system (4-bit weights)?

If I were a consumer, I would be interested in the final price of fine-tuning: for example, a table with model size, training set size, cost of training, and expected loss of quality with this technology.

One obvious question: can you apply your technology to the recent (-1, 0, 1) encoding? I think you will answer that the (-1, 0, 1) model is not available and you can't try it, but my question is whether, once/if that model becomes available, answer.ai will be able to use the same technology as this post to fine-tune a big model on two very small GPUs. Then I should ask for a new table with a cost/benefit analysis.

Edited: I should add that I find this kind of work very useful for enabling individual users like me to compete in the market for LLM applications. This is great work, along the lines of the book "Zero to One" (not that I like or dislike the author): solving the kind of problem that nobody else is trying to solve.

Edited: Now that I have a total of 23 points on HN, I will change my password to a random one, just to cure my desire to look for votes and try to get some work done, and maybe someday create a new presence on HN.


> Now that I have a total of 23 points on HN, I will change my password to a random one, just to cure my desire to look for votes and try to get some work done, and maybe someday create a new presence on HN.

If you use Stylus (or any similar browser extension), I actually wrote a style to hide points for that very reason, replacing karma and scores with `•••`

This is actually the second time I've seen someone mention this need, so I've made it into a gist and published it to userstyles, but here it is as well, since it's pretty short:

    @-moz-document domain("news.ycombinator.com") {
        /* Hide karma and points on replies */
        span.pagetop #karma, span.comhead span.score {
            visibility: hidden;
            position: relative;
            display: inline-block;
            height: 10px !important;
            overflow: hidden;
        }
        span.pagetop #karma {
            width: 0.8rem !important;
        }
        span.comhead span.score {
            width: 0.8rem !important;
        }
        span.pagetop #karma::before, span.comhead span.score::before {
            content: "•••";
            visibility: visible;
            overflow: hidden;
            opacity: 0.8;
            font-family: Helvetica, Arial, sans-serif !important;
        }
    }

https://gist.github.com/airstrike/62584e6ffb6104791c0ae48a8e...

https://userstyles.world/style/15164/hackernews-hide-karma-a...


I wish this were built in, but I understand why it isn't: the abusive psychological exploit is intentional.


On how long: fine-tuning is influenced by your dataset size (more = slower), sequence length (since attention is O(N^2)), data movement, etc., and most importantly by how many steps you want to take. For QLoRA, some runs can do a few hundred steps, which can complete in minutes to an hour. Too many can overfit. So being able to fit it on consumer GPUs can be very cost effective.

On the 1.58bit paper, from what I understand, this requires a total retraining from scratch. Hopefully the researchers will open source their weights :)

On the technicals: weights are encoded as (-1, 0, 1), whilst QLoRA uses a 4-bit dynamic mapping of 16 numbers. The only change required would be the torch.matmul(X, W) step, which becomes torch.bitlinear_matmul(X, W). Before, with QLoRA, one had to do torch.matmul(X, dequantize(W)). So one has to implement torch.bitlinear_matmul. The backward pass is torch.bitlinear_matmul(dY, W.T).


What's the magic in 1.58bit vs. 4 bit that it makes it so much more efficient (claimed)?


From what I understand, using (-1, 0, 1) removes multiplications on GPUs. I.e., assume you have a weight matrix and multiply it by some activations:

                   [-1, 0,  1]
    [10, 20, 30] x [0,  1, -1]
                   [1,  1,  0]
Instead of doing 10(-1) + 20(0) + 30(1) + ..., since we know beforehand the weights are simply (-1, 0, 1), we can skip the multiply entirely and just flip the sign and add, or force the hardware to do it: if (-1), subtract; if (0), skip; if (+1), add.
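As a toy sketch of that idea (the helper name is mine, not from the paper): a ternary matrix-vector product that never multiplies, only adds, subtracts, or skips, using the example above.

```javascript
// Hypothetical sketch: a ternary mat-vec with no multiplications.
// Each weight is -1, 0, or 1, so every term is a subtract, a skip, or an add.
function ternaryMatVec(x, cols) {
  // `cols` holds the weight matrix column by column, entries in {-1, 0, 1}
  return cols.map(col => {
    let acc = 0;
    for (let i = 0; i < x.length; i++) {
      if (col[i] === 1) acc += x[i];       // +1: add
      else if (col[i] === -1) acc -= x[i]; // -1: subtract
      // 0: skip the term entirely
    }
    return acc;
  });
}

// The example above: [10, 20, 30] times the 3x3 ternary matrix
const W = [[-1, 0, 1], [0, 1, 1], [1, -1, 0]]; // columns of that matrix
console.log(ternaryMatVec([10, 20, 30], W)); // [20, 50, -10]
```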

Floating-point multiplication adds the exponents and multiplies the mantissas, so the multiplier area grows roughly as E + M^2. Just simplifying:

Float16 has E=5, M=10, i.e. around 5 + 10^2 = 105 units of space.

Bfloat16 has E=8, M=7, so 8 + 7^2 = 57 units.

Float8 (E4M3) has E=4, M=3, so 4 + 3^2 = 13 units.

1.58-bit (16-bit) E=5, M=10: addition only, so shift E and add, say 5 + 10 = 15.

1.58-bit (8-bit) E=4, M=3: addition only, say 4 + 3 = 7.

Obviously I'm simplifying, but with only additions, 1.58-bit uses say 7 units of space whilst FP8 uses 13, so in theory 2x more transistors can be crammed in, i.e. 2x more FLOPs than FP8.


A really simple explanation is that, for inference, feed-forward networks are threshold circuits, and by their nature ANNs produce binary outputs, outputting true and false (the same as being a threshold circuit).

So if you train your models with that in mind, your weights can be reduced to (-1, 0, 1), reducing the space complexity.

I don't think the costs in expressiveness are fully captured yet, but since perplexity doesn't care about correctness, if that is the metric that is important to you, it will probably reduce memory requirements for inference.


Also, just to add: I think the 1.58-bit scheme is mostly faster for inference, because training still has to multiply a lot of floating-point gradients by integer activations, hold floating-point weights/gradients for rounding, and deal with norms and such. Could be wrong about that, though.


> Edited: Now that I have a total of 23 points on HN, I will change my password to a random one, just to cure my desire to look for votes and try to get some work done, and maybe someday create a new presence on HN.

The irony of making an unnecessary edit like this to virtue signal for implicit social currency by shitting on the explicit form.


As mentioned in the post, benchmarking results are coming in a later post. But in short: you can train an epoch of Alpaca in 24 hours or so, which is enough to get very significant change in model behavior.


> the recent (-1,0,1) encoding?

A side point, but this "recent" encoding goes back to a 2017 paper from the Allen Institute. These days a seven-year-old paper is ancient.

They went further and showed you could get away with binary; you don't even need ternary!


Goes back before then. This got popularized by BinaryConnect in 2015, and groups were training binary networks as early as 2011.

You are probably referring to XNOR-Net, and the novel piece there was also using binary activations (which bitnet does not).

So as far as I can tell, bitnet is basically BinaryConnect applied to LLMs.

https://arxiv.org/abs/1511.00363


Thanks for your informative comment. What HN is for!


The bitnet paper was showing worse results than fp16 transformer with the same parameter count. The shocking result in the 1.58b paper (same group) is no quality loss compared to fp16.


I think those tables could be a fascinating product. All parties involved could purchase them for private and public use.

P.S. I thought one was supposed to spend HN points on mocking North Americans, shameless self-promotion, unpopular facts, general trolling, and complaints about topics existing. I could go on, but I haven't the points.


I like how you think about social media.


Think about the following scenario: I write a calculus book, and the agents of this model just modify every example and every definition and slightly change the ordering of the material to teach students. Now they are using my book, but it seems they are not using my book. Are they trying to copy without copying?


The superformula depends on four parameters and is able to model many different curves. I wonder if that superformula would be useful for learning to generalize the form of a curve given a few points. It could be that, in some way, the four parameters of that curve form an orthogonal basis of the hypothesis space, in the sense that each parameter adds a lot of information. If this intuition has any meaning, it could be the start of a new theory for constructing bases of the hypothesis space, that is, models with few parameters but great expressive power.

Edited: (1) The following link explains expressivity and generalization power in machine learning: https://blog.evjang.com/2017/11/exp-train-gen.html

So my question is whether the superformula constitutes an example of great expressivity and powerful generalization for curve fitting using machine learning models.

Edited: (2) In the following link they use the superformula, Automatic Generation of Smooth Curves from Interpretable Low-Dimensional Parameters.

So the intuition seems fruitful. https://arxiv.org/pdf/1808.08871.pdf


> The superformula depends of four parameters

Looks like six to me:

m, n1, n2, n3, a, b.


All the examples on the linked wiki are given without the a and b parameters... so these might be meta-parameters... maybe scale?
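For reference, a quick sketch of the Gielis superformula with all six parameters; a and b just divide the cosine and sine terms, so with a = b = 1 they drop out, which may be why the examples omit them (parameter names follow the Wikipedia article; the function name is mine).

```javascript
// Gielis superformula: radius as a function of angle theta.
// a and b scale the cosine and sine terms; a = b = 1 reproduces the
// four-parameter (m, n1, n2, n3) examples on the wiki.
function superformula(theta, { m, n1, n2, n3, a = 1, b = 1 }) {
  const t1 = Math.abs(Math.cos((m * theta) / 4) / a) ** n2;
  const t2 = Math.abs(Math.sin((m * theta) / 4) / b) ** n3;
  return (t1 + t2) ** (-1 / n1);
}

// Sanity check: m = 0 with n1 = n2 = 2 gives the unit circle (r = 1 everywhere)
console.log(superformula(1.23, { m: 0, n1: 2, n2: 2, n3: 2 })); // 1
```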


My takeaway from a long post:

<< It’s a general feature of machine learning—and AI—techniques that they can be very useful if an approximate (“80%”) answer is good enough. But they tend to fail when one needs something more “precise” and “perfect”.


Wow, a 16,000 word article, I'll just have AI summarize it for me ;)


What's new in this edition?


You can see the differences on GitHub: https://github.com/marijnh/Eloquent-JavaScript/compare/3rd_e...

From a quick browse:

- # for private properties

- ESM imports in node

- hasOwnProperty -> hasOwn (TIL: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...)

- Math.pow -> **

- coverage of `function*` generators (https://eloquentjavascript.net/11_async.html#h-o+cFzGGhnz)
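Two of those changes in miniature (my own sketch, not code from the book):

```javascript
// '#' marks a private class field (new coverage in the 4th edition)
class Counter {
  #count = 0;
  increment() { this.#count += 1; }
  get value() { return this.#count; }
}

const c = new Counter();
c.increment();
console.log(c.value); // 1

// Object.hasOwn replaces obj.hasOwnProperty, and ** replaces Math.pow
console.log(Object.hasOwn({ x: 1 }, 'x')); // true
console.log(2 ** 10); // 1024
```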


Here's a diff without the first commit that changed all linebreaks: https://github.com/marijnh/Eloquent-JavaScript/compare/d8290...


It should be mentioned in the "Introduction", but it seems your question is not covered there.


Could've used better commit messages, but:

https://github.com/marijnh/Eloquent-JavaScript/compare/3rd_e...


On page 2, there is the theorem: Theorem ∀n in Nat. f n (fib p) (fib (p+1)) = fib (p+n). I think it should be ∀p in Nat. (fib p) + (fib (p+1)) = fib (p+2); otherwise there is something mysterious here.


The fib' function (the more efficient version of fib) is defined using f. So the goal is to prove that f computes the same thing as fib (given the appropriate arguments to f and fib). Hence the theorem needs to use f on at least one side of the equation.
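In other words (a sketch with the usual definitions; the paper's actual code may differ): f threads an accumulator pair forward, so seeding it with (fib p, fib (p+1)) and stepping n times lands on fib (p+n).

```javascript
// Naive specification of Fibonacci
const fib = n => (n < 2 ? n : fib(n - 1) + fib(n - 2));

// The efficient helper: step the pair (a, b) forward n times
const f = (n, a, b) => (n === 0 ? a : f(n - 1, b, a + b));

// The theorem's claim, f n (fib p) (fib (p+1)) = fib (p+n), checked at p = 3, n = 5:
console.log(f(5, fib(3), fib(4)) === fib(8)); // true
```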


I agree that in a team you should not use code golf for others to review your work. But for your own explorations, or if your team can easily read your code golf, then use it when appropriate.


Just for fun, a solution in the J language to the problem of selecting the longest word with fewer than 3 vowels.

solution =: >@:{.@:(\: #&>)@:(((+/@:(e.&'aeiou') <: 2:) # ])&.>)@:;:

Example

solution 'yes, today you are reading something that is not so easy to grasp'

the result is: today

Ruby : frase.split.select{|x| x.count("aeiou")<3}.sort_by(&:length).last => "today"

Edited: Added a comparison with Ruby. Ruby here seems easier to read and to compose.


Shorter ruby version: phrase.split.select{|x| x.count("aeiou")<=2}.max_by(&:length)

The Enumerable module of Ruby provides many methods that can be easily implemented in J. Just to show one of them:

    max_by =: 1 : '{~ (i. >./)@:(u&>)'
For example, the Ruby list.max_by(&:length) is (# max_by) in J.


FWIW, I solved this in J as follows

F =: >@{.@(\: #@>)@(#~ (2 >: +/@e.&'aeoiu')@>)

which I think maybe composes a little better

This doesn't include the splitting. I imagine I would feed this function a boxed list from a word list, like: F 'b' fread 'dict.txt'

Also `;:` is unreliable on hyphenated words, etc. Better to use `cut`.


Haskell:

    maximumBy (on compare length) . filter ((< 3) . length . filter (`elem` "aeiou")) $ words phrase
That returns "grasp", though, because it doesn't sort the list.


The problem as stated does not have a unique solution in general, as you’ve found. One Julia program for this is

`sort(split(p)[count.(r"[aeiou]", split(p)) .< 3]; by=r -> length(r))[end]`

which also returns "grasp".

But this one appeals to me more, because it doesn’t split twice:

`sort(filter(w -> count(r"[aeiou]", w) < 3, split(p)); by=r -> length(r))[end]`


Plus, defining a few aliases for the punctuation-happy terseness-lovers among us, we can reduce the above to:

    maxBy(on cmp (#)).(((<3).(#).(el"aeiou"|=))|=)$words phrase


:(


Since (i) the father and the mother of Sally may be married to other people, and (ii) the sister or brother relationship only requires sharing one parent, we deduce that there is no definitive answer to this question.

Example: Sally has three brothers; Sally and her brothers have the same mother but different fathers, and those brothers have two sisters, Sally and Mary. But Mary and Sally are not sisters, because they have different fathers and mothers; hence Sally has no sister.

For those mathematically inclined: suppose the three brothers are all called Bob (to simplify) and the parents are designated by numbers.

FS = father of Sally = 7

MS = mother of Sally = 10

FB = father of Bob = 12

MB = mother of Bob = 10

FM = father of Mary = 12

MM = mother of Mary = 24

Now MS = MB = 10 (Sally and Bob are siblings), FB = FM = 12 (Bob and Mary are siblings), (FS = 7) ≠ (FB = 12), and (MB = 10) ≠ (MM = 24). So Sally and Mary are not sisters, because their parent sets {7, 10} and {12, 24} are disjoint.

Edited several times to make the example trivial and fix grammar.
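The example can even be checked mechanically; a small sketch using the parent numbers above (the names and helper are mine):

```javascript
// Parent IDs taken from the comment: Sally {7, 10}, Bob {12, 10}, Mary {12, 24}
const parents = {
  Sally: new Set([7, 10]),
  Bob:   new Set([12, 10]),
  Mary:  new Set([12, 24]),
};

// Siblinghood = sharing at least one parent
const siblings = (x, y) => [...parents[x]].some(p => parents[y].has(p));

console.log(siblings('Sally', 'Bob'));  // true  (shared mother 10)
console.log(siblings('Bob', 'Mary'));   // true  (shared father 12)
console.log(siblings('Sally', 'Mary')); // false (disjoint parent sets)
```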

