LLMs are a revolution in open source (scapegoat.dev)
70 points by Philpax on Dec 2, 2023 | 93 comments


The author misses the fact that all decent LLMs released so far have been created by companies rather than open-source communities.

Sure, a lot of models have been released under permissive licenses, but...

That's like releasing shareware. The special sauce for making _new_ LLMs is the dataset and the training clusters, neither of which is cheap or easily run/financed by the community.


More concretely, an ML model is only "open source" if both the training code and the training data are freely available.

Otherwise, it's not possible for the community to reproduce the model.


That's a bit like saying FOSS is open source only if a copy of the programmers is supplied with the code.

Everything that has a source, has another source that has produced that source.

The algorithms behind creating LLMs are all published papers for all to read, the libraries (like TensorFlow) are themselves FOSS projects, and the data... is the open web for the most part.

The Wikipedia dump alone is more than enough to get a very decent LLM shaped up.

How an LLM is produced IS NO SECRET. It's just that to produce it you need millions (or for the more sophisticated ones: billions) in data center fees / power / GPUs to train the model. So even if the training scripts were included, you still couldn't make a Llama model yourself at home.


> That's a bit like saying FOSS is open source only if a copy of the programmers is supplied with the code.

Wait, what? You can build the FOSS app entirely from the source. You don’t need a copy of the programmer.


Just as you say, you do not need Llama's training data in order to build Llama from source, nor even to fine tune it. You don't need a copy of the training data.

(edit to spell Llama correctly)


You can fine tune it, which gets you far, don't get me wrong. But you cannot study how major model changes to Llama or changes during the training process behave when you apply those changes from scratch.
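(For what it's worth, fine-tuning without the original training data is a few lines these days. A rough sketch, assuming the Hugging Face transformers and peft libraries and access to the released weights; the model name and hyperparameters are just illustrative:)

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    # Load the released weights -- no training data required.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Attach small trainable LoRA adapters instead of touching the base weights.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the parameters

Studying how a change to the base architecture or to the training run itself behaves is a different story, as you say.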


No, you need Llama's binary; you can't know what it's trained on because that's a secret, or at least obfuscated.

The current LLM situation is more akin to open-source plugins for some proprietary system, like game modding.


The reasoning is that the model weights are a lot more like a build artifact that can't be easily scrutinized. Demanding training data to be public is analogous to demanding source code. Sources like Wikipedia are just one kind of input.


And my reasoning was that everything is an artifact of something else.

The model weights can't be scrutinized? Well, no one can scrutinize them, not even those who made the model.

And because you don't have the hardware to reproduce Llama anyway, even if they gave you the code to build it, you can't verify they used that code to build it. And if you have the hardware and data, you probably still wouldn't spend MILLIONS to end up... with the same exact model they gave you in the first place.

Do you understand how meaningless this entire "Llama is not truly open" bullshit is?


Your reasoning doesn't tackle the issue at hand, so no, I can't see how it's "meaningless", because it does actually matter. It doesn't matter that the people who produced the model can't scrutinize the results, because they know what went into it and we don't. Just like how it's not very common for corporations that ship binary firmware blobs to scrutinize them in depth, since they have the upstream source/data that went into making them. They don't need to, because that's not the point.

No, you wouldn't end up with the exact same model weights if you trained it yourself on the same data, but you could compare how the models perform on the same prompts and try to identify anomalous variation that could point to Facebook misrepresenting what went into the training / fine-tuning.


Sure you can end up with the same weights. Just start with the same random weights (put it in the source!) and have a deterministic algorithm for training (put it in the source!) and put all the same copyrighted data which Facebook can't just... put in the source, because they don't have the copyright.

Do I need to keep going so you see how stupid the discussion is?

You think like a software developer: everything is just a bunch of text, a compiler and a build environment. It's not like that here.

There's too much shit going in, and the rights to the data are absolutely not clear. But if they had to be, the model wouldn't exist. What you want is friggin' impossible.


I think open source advocates bring some understandable but wrong intuitions to how LLMs are being distributed.

In the world of hand-coded software, binaries are hard to work with, source is easy to work with, and compilation is cheap.

In the analogous world of LLM training, model weights are easy to work with, having training data does not let you reliably change model behavior, and "compilation" (training) is insanely expensive.

So, if your goal is agency to create tools for your own purposes, 9/10 researchers would rather work from a trained foundation model than the source data. The foundation models are of course released by companies because they cost $10s of millions to train -- but releasing them enables a thriving community of research, building adjacent frameworks, and specialized models to be created by much less powerful actors.

I've never understood the dogmatism around FOSS, but I've felt I understand the ideals. Those ideals are so much better served by releasing weights than by leaving LLMs only available through commercial APIs.


(author here): I am actually more focused on the training corpus and tooling around foundational models, and the possibility of running these models on resources you control. An analogy would be the CPU and the open-source software running on top of it. I know people want the entire toolchain to be open source, but I think the real value lies in that "top layer".


The first incarnations of Unix were paid for by corporations and universities.


Thus GNU... And Linux.

20+ years later, look who is still around.


And Linux is mostly paid for by corporations, directly and indirectly.


It wasn't at the start, and it certainly isn't the cathedral of one large company, like Facebook and OpenAI/Microsoft.


and you had the source, documentation and all the bits needed to build it from scratch.

Llama is a binary blob. Very capable, and it's great that it got leaked. But it's not a win for open source. It's an accident of licensing. FB's lawyers would never have let it have a proper open-source license; the PR and IP risks were way too high.

It just so happens that the leak meant that the PR risk went away, and kneecapping OpenAI is a good thing for Meta, and thus worth the IP risk.


Yes and no. The models themselves (with permissive licenses) might be like freeware, rather than "open source" or "free software". But they are a big enabler in allowing people to build other F/OSS on top -- e.g. an experimental Gnome/KDE addon that allows voice control is now a real possibility. Most practical FOSS (other than recalcitrant parts of the GNU community) already builds on top of all kinds of proprietary tools, from computer hardware to closed operating systems like MacOS/Windows. LLMs are just the latest in the list of blobs -- developers & users can evaluate whether the power/opacity tradeoff is acceptable.


I don't think the comparison applies. Shareware can't really be modified in any way. However a foundational model is meant to be modified, fine-tuned, augmented, etc...

I agree calling it fully open source is a stretch but it's not the same as shareware.


Has anyone tried to take a distributed training framework and make it _really_ distributed?


Eh, I don't know; that's perhaps like giving credit to HP and Dell and perhaps Apple for Linux?


I have a feeling this article was partly written by a GPT. This part

> For LLMs to be effective, they require a few things:

And then that list are just a dead giveaway.

I'm not nearly as cynical about GPT as many are (in fact I use them myself), but I think text written by it should be marked as such.

edit: The author expressed that it was indeed not written using any GPT.


(author here): it wasn't. This is how I write, and probably an example of ChatGPT flagging neurodivergent/non-native speakers as AI-generated.


I experimented a bit with AI written blog posts on here, I'm actually pretty happy with the result:

https://write.as/mnmlmnl/

I push my bash history, attach a couple files, paste some doc and web references, and then lightly edit the result. Beats not writing the blog post.

The tricky part is not getting fooled by the fact that it looks decent, and really put in the effort to think each sentence over to catch the misleading "fringe hallucinations" (as I call them, not full on hallucination, but mildly misrepresenting the actual intent/meaning).


> I think text written by it should be marked as such

Good luck with that. It’s a tool.


The problem is that it wastes the reader's time to parse the bullshit. Now of course a lot of humans write bullshit too, especially in areas that are at the peak of their current hype cycle ;)


What makes it a dead giveaway?


I think your comment was written by a GPT, does that make it true?

Please refrain from making baseless accusations.


Luckily I didn't make any baseless accusations.


Actually, you did make a totally baseless accusation and the fact that the person above you is being downvoted is evidence that the HN userbase is pretty awful at recognizing what "bases" are.


Speculation then. Just because someone has a bulleted list in their blog post doesn't mean it's AI generated.


OP provided examples.


The example they link is perfectly innocuous, and is in no way obviously generated by an AI.


I still haven't found a use for LLMs. Their propensity to get things wrong, and the fact that they're fancy statistics machines built on existing data, just makes them feel like a cheap trick. It's still a fancy Markov chain, just with guard rails.

There just isn't a place for them in my life.


I'm not swigging the Kool-Aid yet either, but I think you're being a little dismissive. I've found ChatGPT useful for some sorts of questions that can only really be posed in natural language, particularly grammatical oddities that come up as I learn Spanish.

Probably the single most impressive result I've had so far was with "What's that weird thing on the top of a P-38's engine nacelle that looks like a recessed sideways wheel?" Got it in one. I can't imagine getting anywhere with a conventional search engine there.

I definitely wouldn't rely on it for anything professional, though.


I copied your P38 question verbatim into Google and seem to have gotten the answer immediately, so not sure that's a big win for the LLM.


Huh. OK, I'm surprised, but that's a useful datapoint, thanks.


They're still extremely effective as a fuzzy search engine and translator (e.g., I can write out a function in plain English and get something that would have taken me a minute in 5 seconds).

Additionally (more importantly), the most magical part is still the architecture: there seems to be no end to the expressivity of transformers as long as you can pour in more compute. This can be extremely powerful and lead to more generalizable reasoning given the right learning objective (e.g., pose tasks as RL Markov Decision Processes where state=text, actions=tokens, rewards=performance of generated code/math/language).
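(To make that MDP framing concrete, a toy sketch, not tied to any particular RLHF implementation: the state is the text so far, the action is the next token, and the reward arrives when generation finishes:)

    from dataclasses import dataclass

    @dataclass
    class TextState:
        prompt: str
        generated: str = ""

    def step(state, token, reward_fn, eos="<eos>"):
        """One MDP transition: action = next token, new state = extended text."""
        next_state = TextState(state.prompt, state.generated + token)
        done = token == eos
        # Reward only at the end, e.g. did the generated code pass its tests?
        reward = reward_fn(next_state.generated) if done else 0.0
        return next_state, reward, done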


How long did it take you to write the query? How long did it take you to validate its correctness?


If you look at the technical discussion from the perspective of individuals and companies creating sophisticated autocomplete, the whole fad seems ridiculous.


I feel quite strongly that LLMs are not a fad. Something like LLMs will be useful tools to people decades from now. Especially given that a whole generation of students is now using/relying on them for course work.

Even if we want to call them sophisticated autocomplete, why would I ever switch back to a dumb autocomplete?


Currently working on a pandas personal project (which I have only a little bit of experience with from before) and it’s easily 10x better than googling. I ask it a question to plot something simple and it spits code out which very rarely doesn’t work. I’d easily pay $20 for this… but it’s the free model that does it.
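(Typical of the kind of snippet it spits out for me, data obviously made up here:)

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr"],
                       "sales": [120, 95, 140, 160]})
    df.plot(x="month", y="sales", kind="bar", legend=False)
    plt.ylabel("sales")
    plt.tight_layout()
    plt.show()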


I don't doubt your experience at all, but I wonder what I am doing wrong. I tried to use Copilot to help me learn Rust, but I gave up because almost every suggestion caused compile-time errors or runtime panics.


Just an anecdote, but I recently used LLMs as part of learning helmfile/helm/k8s. Mostly I used the LLM to help me find the right documentation. I know what concept I'm looking for, but I don't know what they call it in these systems. I suppose critically I never really asked it to write code for me; I asked it to help me find the concepts to read documentation about.

I haven't actually tried Copilot so I can't comment directly on that, but I have found the general space of LLMs to be useful.


It might be that there’s simply so much more content on the internet about pandas that the LLM internalized all of it. Rust is a much more complex language with less written about it. IOW my use case is a perfect match for the tool, yours is less so and there isn’t anything more to it.


Other than as a coding assistant, ChatGPT is awesome at answering everyday questions where you previously had to dig the answer out of ad-bloated content farms.

Things like "how to do this", "how does this work", questions about medicine, DIY, etc …

And unlike the content-farm internet, it lets you have a real conversation to get more precise information.

Is it 100% accurate? I think not. But it's trained on nearly all the books about all the topics, so I put a lot more trust in it than in SEO-optimized content farms.

The best way to chat with it is to imagine you are talking to someone who has read and remembered everything, with endless knowledge of every topic, but who acts like a fallible human when you speak with them.


Here's one example: Today I used GPT-4 to write documentation for a Neovim plugin: https://github.com/Robitx/gp.nvim/pull/72/files

It may be a fancy Markov chain, but it seems to have understood the code it documented.


Like pretty much every other knowledge product, the only real value I see in them is in helping humans think better. You can kind of tell them to "interview" or "question" you, and they can spark your creativity.


This is kind of wishful thinking. I'm not saying you're wrong, but there's no real way to prove you're right. Sometimes open source wins, but not every time. The whole machine learning field is still too young to have a clear answer.


(author here): I am currently writing a book about programming with LLMs; I have absolutely put my money where my mouth is over the last year, and there is no doubt in my mind that we will see incredible tools in 2024.

Already the emergent tools and frameworks are impressive, and the fact that you can make them yours by adding a couple of prompting lines and really tailor them to your codebase is the killer factor.

My tooling ( https://github.com/go-go-golems/geppetto ) sucks ass UI-wise, yet I get incredible value out of it. It's hard to quantify as a 10X, because my code architecture has changed to accommodate the models.

In some ways, the trick to coding with LLMs is to... not have them produce code, but intermediate DSL representations. There's much more to it, thus the book.
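(A toy example of what I mean, not taken from the book: have the model emit a small declarative spec and keep the actual code generation deterministic on your side. The spec format here is made up:)

    import json

    # What you ask the LLM for: a spec, not a diff.
    llm_output = '{"refactor": "extract_function", "name": "parse_header", "lines": [12, 34]}'

    spec = json.loads(llm_output)
    if spec["refactor"] == "extract_function":
        # Your own tooling applies the change mechanically and reviewably.
        print(f"extracting lines {spec['lines']} into {spec['name']}()")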


Can you name some examples where open source won?

In the past I wanted to believe this could be the future, where open source would somehow win (at least in some parts). What I see is that even the biggest projects are mere tools in the hands of the big corporations. Linux, Postgres, etc. All great! But they have been assimilated. I cannot really consider them a win.

It seems to me that it goes back and forth - it also seems to me that the advancements in LLMs will go a similar route.


Why do you not consider that Linux has won?

It dominates everywhere from fairly small embedded systems to supercomputers, with the one notable exception of the desktop, a shrinking market and mostly a historical anomaly (Microsoft cornered it before Linux was a viable player in that space).

I wouldn't say it is a "mere tool in the hands of big corporations". Sure, these days most Linux developers are paid by corporations (a good thing, since that allows them to work full time on Linux), but the important point is that those corporations don't control Linux. Sure, they can pay people to work on specific areas, but they don't get to decide what gets merged or what the acceptance criteria are.

More generally, beyond Linux, huge swathes of new technology are expected to be open source or no one looks at them (think language and frameworks).

In the late 90s / early 2000s it became obvious that software development would no longer be about writing things from scratch but about building on existing components. But there were two competing models for this. There was Microsoft's vision, which envisioned a market of binary components that people would buy and use to compose applications (that gave rise to the likes of ActiveX, DCOM, OLE), and the open source community's vision, which saw us building on components supplied in source form. It's clear that today the second vision has won. Even proprietary software now uses huge quantities of open source internally (take a look at the "about" screen on your smartphone, TV or router).

LLMs may be the exception here for the moment (mainly due to the compute power needed for training).


Open source has won hands down for developers. It’s basically a giant tool bin and parts yard for people who build things with software. It’s also useful to extremely tech savvy people who like to DIY homelab type stuff.

In the consumer realm it has lost equally decisively.

The reason, I think, is that the distance between software nerds can use and software the general public can (or wants to) use is significantly larger than anyone understood. Getting something to work gets it like 5% of the way to making it usable.

Even worse, making it usable requires a huge amount of the kind of nit picky UI/UX work that programmers hate to do. This means they have to be paid to do it, which means usable software is many many times more expensive than technical software.

The situation is hopeless unless people start paying for open software, which is hard because the FOSS movement taught everyone that software should be free.


"In the consumer realm it has lost" is kind of weird to me, though. I'd say, don't think about "developers" v. everyone else but "anyone doing anything creative, as opposed to merely consuming, with computers and related devices" and it's not at all clear that the creators are the losers?


This seems almost like you're framing the observation "in the consumer realm it has lost" as some moral failing, and it feels dangerously close to bad faith. What's the point? Do you really want to willfully ignore the artists using Procreate, a closed-source iOS app, the chip designers using proprietary EDA tools, the DJs using proprietary DJ software like Serato, the musicians using proprietary DAWs like Ableton?

If anything, it's the creatives who use more proprietary software than the folks doing generic office work that can get away using LibreOffice and read PDFs using evince.


The generous reading, which I think is mostly correct, is that consumers mostly don't directly run open source software, e.g. LibreOffice on Linux. A ton of the software they run has significant open source components but it's packaged up as a proprietary SaaS or an app store app.


SaaS is how most people use open source, which is very ironically the least open way of using software. Closed source commercial (local) software is considerably more open and offers far more privacy and freedom.


I wouldn't say that open-source SaaS is the "least open" way of using software.

If you're using a hosted service based on open source software, you know that you can leave. You can grab your data and self-host. You can move it to another host that has reused or forked the code. You can run it locally. You have options.

If you're using local closed-source software, your files might not even be usable without an Internet connection. Think Spotify (closed source, local) where even your "offline" playlists won't load if you don't allow the software to phone home once in a while.


Self hostable SaaS does restore a lot of freedom but it’s not the most common paradigm. Most SaaS is closed.

Spotify is SaaS. It’s an app that runs locally but the data and most of what it really does lives in the cloud.


Yes, this was exactly my point. Not sure why I got down voted so much for it. And note what companies such as hashicorp and elastic are doing.


>Linux, Postgres, etc. All great! But have been assimilated.

If you're a company, you almost certainly want support, certifications, and related benefits of having a commercial product. And it's nice to have that avenue to fund developers. But the side effect is that the software is still free as in beer open source that anyone can download. In general, I'd say open source infrastructure software has won pretty thoroughly (even if not universally).


> Linux, Postgres, etc. All great! But have been assimilated. I cannot really consider them a win.

Are you trying to argue that free software becomes less free because specific groups of people contribute to it?


Linux absolutely won the OS wars.

We just didn't fully realize what that would look like before it happened.


It won the OS wars on the server side, no question there! It works great, serving all the SaaS platforms out there...

But as a desktop? Most of my colleagues are using apple hardware/MacOSX. Personally, I'm writing this on Ubuntu - I've been on Linux since forever.

Another one: Android. Built on top of Linux, its core is open source, but is it REALLY in the spirit of open source? On most new phones you cannot put LineageOS (I always check the list before I buy).

How does it help that I have the source of Linux and Android, but I'm still not able to build the OS for most mobiles?

Don't get me wrong, I'm not disillusioned with open source. Not at all. I love and appreciate the open source I have (this includes e.g. Rust). I'm just cautious about what the author of the article is foreseeing. You might have a vibrant community, but that doesn't mean its output won't get wrapped inside some SaaS/big corp.


Complaining about companies releasing their work under a liberal open source license is a first-world-problem if I've ever seen one ;)


I mean, just in the last hour I set up the deepseek-coder-6.7B-instruct-GGUF model on my computer and tried it out with llama.cpp. It is maybe free-ChatGPT level.

If I could have run this when I was a student it would have been a real killer app. It is kinda impressive that I can run it on a crappy laptop.
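(If anyone wants to try it, the Python bindings make it a few lines. A sketch, the model path is just wherever you downloaded the GGUF file:)

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(model_path="./deepseek-coder-6.7b-instruct.Q4_K_M.gguf",
                n_ctx=4096)
    out = llm("Write a Python function that reverses a linked list.",
              max_tokens=256)
    print(out["choices"][0]["text"])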

In some ways this is also really scary, job-wise. I would not recommend anyone get into the programming field after the last year's progress in these tools.

But at least they will not be locked in to "OpenAI"'s servers ...

I wonder if there could be a community effort to do model training, like the protein-folding projects? If 30,000 computers ran some model training while not being used, would that be enough to get anywhere?


Not getting into programming because anyone can use an LLM to do programming does not suddenly fill productivity needs in an absolute sense. People can do more with less, which also means demand will quickly rise for the new things we need to do.


I think it will flip the tables completely. When "we" got into programming you more or less had to have an aptitude for it, wade through dull documentation, and still think it was fun to get anywhere. And there was a geek stigma. The recruiting pool for future programmers was really small.

LLMs are so much help getting a newbie to some sort of entry level that it's going to flood the market compared to when we entered. And the pay has gotten really good too, in itself a big enough draw to cause flooding from below.


That principle of demand rising to match supply makes sense until the supply becomes effectively infinite, which is what AI is doing. I'm not sure demand rises infinitely. I do believe that programming skills are heavily devalued in a world of effective AI, and most people on HN are a bit delusional about this fact, mainly because it's depressing to us computer touchers...


You must have a very different definition of ‘crappy laptop’ than I do.


Lenovo Legion Y530-15ICH, Intel® Core™ i5-8300H × 8, NVIDIA GeForce GTX 1050 Ti, 32 GB RAM.

Maybe not "crappy", but 5 years old by now and not high end for its time. Like mid range.

"Write a Python function that converts a dict to an html-table" takes about 60s locally, while ChatGPT takes about 8s.


> LLMs are ultimately democratic

LLMs reflect what they have been trained on, so the question is what they are trained on. In this line of thinking, at best they represent the groupthink, and at worst they are biased one way or another.

Which is not to be taken negatively, but it is something important to keep in mind.


I don't think the author is saying the content LLMs produce is "democratic".

They're just saying that they're democratic in the sense that anyone can run them and they have a strong open-source community behind them.


> I build "brushes" for refactoring my code every day.

Anyone know what the author means by brushes?

Also any recommendations for non-Copilot IDE LLM integrations? I've tried a couple but they felt far behind Copilot in terms of quality and smooth IDE integration.


https://githubnext.com/projects/code-brushes/

You select some code and apply a brush and it'll use a preset prompt to modify it: "Make Readable", "Add Types", "Document" are some of the options. Copilot has a pretty poor implementation, and mostly they just butcher your code - but they seem very powerful, especially with custom brushes.
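(A custom brush is basically just a preset prompt applied to the current selection. A rough sketch, assuming the OpenAI Python client; the prompt text and model name are placeholders:)

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def apply_brush(code: str, brush: str) -> str:
        """Run a preset 'brush' prompt over a selected chunk of code."""
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "system", "content": brush},
                      {"role": "user", "content": code}],
        )
        return resp.choices[0].message.content

    print(apply_brush("def f(x):return x*2 if x>0 else -x",
                      "Make this code readable. Keep behavior identical."))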


Things like "replace this PHP5 class with a new PHP8 readonly dataclass", "convert these React components to take imageLinks as prop srcsets". Nothing fantastic, but damn this stuff is tedious to do manually. Now I can just batch-process my files et voilà.


The dependency of LLMs on open source is very clear: without open source there are no LLMs, zero! Not just LLMs; almost all AI/machine-learning training uses Linux, Slurm, and Python. I work in the HPC/ML field, including hardware/software/administration, and I know this firsthand.


The skepticism in the comments makes me even more bullish for this revolution.


>I can't really see a future where big tech can keep up with a motivated community (and motivated it is, as any look at social media or tech websites will show you)

As far as I can tell, many of these "motivated communities" organize on Discord (search is garbage, poor integrations, login-gated and invite-only in many cases), and also Twitter/X (now login-gated, API locked down). And beyond AI there are communities and lots of valuable content on Reddit (also locking down as hard as they practically can). I will be 0% surprised if Github locks down their platform, too. It's too soon to tell how much it will pay off, but almost as soon as ChatGPT came out, the obvious strategy of every company with user-generated content is for all the valuable communications, insights, and social graphs of their communities to be proprietary training data for the proprietary LLMs of the future.


I also share their idea about reallocation of programming resource away from big corporations.

There is so much bullshit empire building in our industry that serves only to waste talent. Small teams working on problems that need tech but can’t normally afford it should hopefully deliver something akin to a productivity shot in the arm across industries.


I like the term surveillance capitalism.

We need to move quickly, because the “moat” that big tech monopolies will seek to create will be regulatory, decelerationist, and authoritarian.


I treat usage of LLMs in coding as an admission of incompetence, and this post has only reinforced that view.


LLMs are the death of open source. Labor might be free, but training resources seldom are. They won't be able to keep up.


That people actually entertain these ideas is SO WEIRD to me. The last sentence is just obviously false, precisely because the models are software artifacts that anyone can pass around -- LoRAs, in other words. The fact that the original models "might" -- and even that's a big might -- be hard to produce from scratch, so what? Operating systems are hard to produce from scratch, but anyone can grab Linux and do anything.


Your opinion doesn't make any sense to me.

If it takes a hundred million dollars to train a SOTA LLM, who is going to pay for and put that in open source?

Even Stability is starting to put restrictions on their releases (and is rumored to be searching for an acquirer).


To me the problem is a different one. Yes, there are people willing to spend time and computing resources to train a model and improve it.

The thing is, with software it's easy (or fairly easy) to verify that a contribution actually improves the code, and thus it gets accepted into the project.

A modification to an AI model, given how models are made these days, is completely opaque, a black box. How can we evaluate the modification to ensure that it actually improves the model and does not harm it, or worse, introduce malicious behavior (since AIs are also used to write code, one could for example train the AI to write malicious code)?

This is the real problem to solve, and to this day it's not solved. Probably the solution is to NOT have LARGE language models but rather a multitude of small models (small enough that we can retrain them in hours on a normal PC) that are trained and tested individually, maintained by groups of people in the open source community, and then merged together into a big model, just like a kernel is made up of thousands of individual modules that are then assembled together into a single piece of software.


But at this stage:

why a big model AT ALL?

Which is to say, a useful LLM is probably not going to be complex like an Operating System.

(Now, an LLM that looks fancy and can fool dumb investors, sure, maybe -- but an actually useful one?)


Two possibilities:

1) Why wouldn't Facebook (who looks like they're kind of doing that now) or anyone trying to compete with whoever the big monolith is threatening to be (which looks like OpenAI/Microsoft?)

2) Do they even need to, or is/can LLM development be incremental? I haven't heard anything about Google Fuchsia (sp?) lately either.


They can only remix what already exists, not create new things, no?


There are papers, including one by Yann LeCun, which show that the overwhelming majority of data found within high-dimensional spaces lies within decision boundaries created by AI models. This implies that the overwhelming majority of things we humans think are "novel" or "extrapolative" are in fact not novel and are nothing more than fancy interpolation.

Basically, we humans cannot tell when something is "new to us" or "new to the universe" and it honestly hardly matters in practical reality.

Finally, LLMs can totally "extrapolate" with techniques that push them away from their training-time objective. A high temperature would do this, but it'll mostly just make your model crazy.


I think by most definitions LLMs can definitely create new things.

As much as I can create new things by cobbling together open source projects and answers on stack overflow at least.


Technically, anything you could ever write in any [programmed or spoken] language is some sort of remix of existing tokens.



