If you can't bootstrap a set of weights from public material, the weights aren't...

lolinder · on Oct 26, 2024

> Trying to draw an equivalency between code and weights is ridiculous, because the weights are not written in the same way as source code.

By the same logic, the comparison between a compiled artifact and weights fails because the weights are not "compiled" in any meaningful sense. Analogies will always fail, which is why "preferred form for making modifications" is the rod we use, not vague attempts at drawing analogies between completely different development processes.

> They are built from the source material supplied to an algorithm. Weights are data, not code.

As Lispers know well, code is data and data is code. You can't draw a line in the sand and definitively say that on this side of the line is just code and on that side is just data.

In terms of how they behave, weights function as code that is executed by an interpreter that we call an inference engine.
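
A toy illustration of that point, as a sketch and nothing like a real inference engine: the `infer` function below never changes, and the only difference between an AND gate and an OR gate is the weight data passed in. The names `infer`, `truth_table`, `AND_WEIGHTS`, and `OR_WEIGHTS` are made up for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical "weights": (per-input weights, bias). Each tuple is plain data,
# yet it selects a different behavior when run through infer().
Weights = Tuple[Sequence[float], float]

AND_WEIGHTS: Weights = ((1.0, 1.0), -1.5)
OR_WEIGHTS: Weights = ((1.0, 1.0), -0.5)


def infer(weights: Weights, inputs: Sequence[int]) -> int:
    """A one-neuron 'inference engine': weighted sum, then a threshold."""
    w, bias = weights
    total = sum(wi * xi for wi, xi in zip(w, inputs)) + bias
    return 1 if total > 0 else 0


def truth_table(weights: Weights) -> List[int]:
    """Run the same engine over all two-bit inputs: 00, 01, 10, 11."""
    return [infer(weights, (a, b)) for a in (0, 1) for b in (0, 1)]


print(truth_table(AND_WEIGHTS))  # [0, 0, 0, 1]
print(truth_table(OR_WEIGHTS))   # [0, 1, 1, 1]
```

Swapping one tuple for the other changes what the "program" computes without touching a line of `infer`, which is the sense in which the weights behave like code.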

smolder · on Oct 26, 2024

I'm perfectly willing to draw a line in the sand instead of letting people define their models however it's most convenient for them. Analogies aside, here is what a set of weights is made of: A lot of data, mostly produced by humans who are not credited and have no say in how the output weights are licensed, some code written by people who might have some say, and then lots of work by computers running that code and consuming that data.

I'm not comfortable with calling the resulting weights "open source", since people can't look at a set of weights and understand all of the components in the same way as actual source code. It's more like "freeware". You might be able to disassemble it with some work, but otherwise it's an incomprehensible thing you can run and have for free. I think it would be more appropriate to co-opt the term "open source" for weights generated from freely available material, because then there is no confusion whether the "source" is open.

lolinder · on Oct 26, 2024

> A lot of data, mostly produced by humans who are not credited and have no say in how the output weights are licensed

And this is what I think everyone is actually dancing around: I suspect the insistence on publishing the training data has very little to do with a sense of purity around the definition of Open Source and everything to do with frustrations about copyright and intellectual property.

For that same reason, we won't see open source models by this definition any time soon, because the legal questions around data usage are profoundly unsettled and no company can afford to publicize the complete set of data that they trained on until they are.

My personal ethic says that intellectual property is a cancer that sacrifices knowledge and curiosity on the altar of profit, so I'm not overly concerned about forcing companies to reveal where they got the data. If they're releasing the resulting weights under a free license (which, notably, Llama isn't) then that's good enough for me.

smolder · on Oct 26, 2024

> For that same reason, we won't see open source models by this definition any time soon

It's totally fine if we don't have many (or any) models meeting the definition of open source! How hard is it to use a different term that actually applies?

The people on my side of the argument seem to be saying: "do not misapply these words", not "do not give away your weights".

Insisting on calling a model with undisclosed sources "open source" has what benefit? Marketing? That's really all I can think of... that it's to satisfy the goals of propagandists.

Shamar · on Oct 26, 2024

It's not just marketing: the European AI Act imposes several compliance obligations on corporations building AI systems, including serious scientific scrutiny of the whole training process.

Such obligations are designed to mitigate the inherent risks that AI can pose to individuals and society.

The AI Act exempts open source from such scientific scrutiny because it's already transparent.

BUT if OSI defines black boxes as "open source", they open a loophole that will be exploited to harm people without being held accountable.

So it's not just marketing, but dangerous corporate capture.

acka · on Oct 26, 2024

Exactly. Without models being truly open source (training data, training procedures, alignment, etc.), there is no way for auditors to assess, for example, whether a model was trained on data exhibiting certain forms of selection bias (anything from training data or alignment being overly biased towards Western culture, controversial political or moral viewpoints, particular religions, gender stereotypes, even racism) which might lead to dangerous outcomes later on, whether by contamination of derived models or during inference.

JumpCrisscross · on Oct 26, 2024

> if OSI defines black boxes as "open source", they open a loophole that will be exploited to harm people without being held accountable

The OSI’s definition matches the legal definition in the EU and California (and common use). If the OSI says open data only, it will just be ignored. (If people are upset about the current use, they can make the free vs. open distinction we do in software to keep the pedantic definition contained.)

seba_dos1 · on Oct 26, 2024

> very little to do with a sense of purity around the definition of Open Source and everything to do with frustrations about copyright and intellectual property

The whole reason FOSS exists is because of frustrations about copyright and intellectual property; anything else is derived from that, so I'm not sure what your point is.

zoobab · on Oct 29, 2024

"frustrations about copyright and intellectual property"

Intellectual property is an undefined term, I would say copyright, although patents can also play a role in some countries.

seba_dos1 · on Oct 30, 2024

> Intellectual property is an undefined term

That's one of the frustrating things about it ;)

dietr1ch · on Oct 26, 2024

> > Trying to draw an equivalency between code and weights is ridiculous, because the weights are not written in the same way as source code.

> By the same logic, the comparison between a compiled artifact and weights fails because the weights are not "compiled" in any meaningful sense.

To me the weights map to assembly and the training data+models map to source code+compilers. Sure, you can hand me assembly, and with the assembly I may be able to execute the model/program, but having it does not mean that I can stare at it and learn from it, or modify it with a reasonable understanding of what's going to change.

I've got to add that the situation feels even worse than assembly, because assembly, whether hand-coded or mutilated by an optimizing compiler, still does something very specific and deterministic, whereas the weights of a model make things equivalent to programming without booleans, using seemingly random numbers and checking for inequalities to get a binary decision.

lolinder · on Oct 26, 2024

This is the analogy that people keep saying, but as I observed above, the key difference is that the company that produces a binary executable doesn't prefer to work with that binary: they work with the source code and re-compile after changing it.

In contrast, the weights are the preferred form for modification, even for the company that built it. They only very rarely start a brand new training run from scratch, and when they do so it's not to modify the existing work, it's to make a brand new work that builds on what they learned from the previous model.

If the company makes the form of the work that they themselves use as the primary artifact freely available, I'm not sure why we wouldn't call the work open.

Nevermark · on Oct 26, 2024

> In contrast, the weights are the preferred form for modification, even for the company that built it.

Preferred is obviously not a particularly strong line.

If someone ships object code for a bunch of stable modules, and only provides the source for code that’s expected to be changed, is that really open?

“Preferred” gets messy quick. Not sure how that can be defined in any consistent way. Models are going to get more & more complex. Training with competitive models, etc.

I think you either have it all, or it isn’t really open. Or only some demarked subset is.

smolder · on Oct 26, 2024

Is a .rom file open source because you can pipe it into an emulator and generate new machine code for a different platform?

I don't think your argument holds any water.

lolinder · on Oct 26, 2024

Is a .rom file the preferred form for modifying the work?

smolder · on Oct 26, 2024

To get it to run on different platforms and gain new features like super-resolution and so on, yes. Rom files are the preferred form for modifying old games. No one bothers digging up old source code and assets to reconstruct things when they can use an emulator to essentially spit out a derivative binary with new capability. See every re-release of a 16 bit era or earlier game.

Now that I've beat my head against this issue for a while, I think it's best summed up as: weights are a binary artifact, not source of any kind.

lolinder · on Oct 26, 2024

If what you say is true—if the ROM is the preferred form for making modifications (even for the original company that produced it) and the ROM is released under a FOSS license—then sure, I have no problem calling it open source.

grandma_tea · on Oct 26, 2024

Preferred by who? It sounds like these people have a strong say in what constitutes open source.

lolinder · on Oct 26, 2024

Preferred by anyone who's actually using and modifying the work.

No one trains an existing model from scratch, even those who have access to all of the data to do so. There's just no compelling reason to retrain a model to make a change when you have the weights already—fine tuning is preferred by everyone.

The only people I've seen who've asserted otherwise are random commenters on the internet who don't really understand the tech.

grandma_tea · on Oct 26, 2024

> Preferred by anyone who's actually using and modifying the work.

> ...fine tuning is preferred by everyone

How do you know this? Did you take a survey? When? What if preferences change or there is no consensus?

> The only people I've seen who've asserted otherwise are random commenters on the internet who don't really understand the tech.

There are lots of things that can be done with the training set that don't involve retraining the entire model from scratch. As a random example, I could perform a statistical analysis over a portion of the training set and find a series of vectors in token-space that could be used to steer the model. Something like this can be done without access to the training data, but does it work better? We don't know because it hasn't been tried yet.
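
To make that example concrete, here is one way such an analysis could look: a minimal sketch, assuming a difference-of-means approach, which is just one possible choice and not something the comment above specifies. The vectors are made-up toy numbers standing in for embeddings of training examples, and `mean_vector`, `steering_vector`, and `steer` are hypothetical helpers, not part of any real library.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(with_trait: Sequence[Sequence[float]],
                    without_trait: Sequence[Sequence[float]]) -> Vector:
    """Direction pointing from the 'without' mean towards the 'with' mean."""
    mean_with = mean_vector(with_trait)
    mean_without = mean_vector(without_trait)
    return [a - b for a, b in zip(mean_with, mean_without)]


def steer(activation: Sequence[float], direction: Sequence[float],
          strength: float = 1.0) -> Vector:
    """Nudge an activation along the steering direction."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Toy stand-ins for embeddings of two slices of a training set (assumed data).
formal_examples = [[1.0, 0.0], [3.0, 2.0]]
casual_examples = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_vector(formal_examples, casual_examples)
print(direction)                          # [2.0, -1.0]
print(steer([0.0, 0.0], direction, 0.5))  # [1.0, -0.5]
```

The point of the sketch is only that the two example slices come from the training set itself, so this kind of analysis is available to someone who has the data and not to someone who has the weights alone.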

But none of that really matters, because what we're discussing is the philosophy of open source. I think it's a really bad take to say that something is open source because it's in a "preferred" format.

lolinder · on Oct 26, 2024

> I think it's a really bad take to say that something is open source because it's in a "preferred" format.

Preferred form and under a free license. Llama isn't open source, but that's because the license has restrictions.

As for if it's a bad take that the preferred form matters—take it up with the GPL, I'm just using their definition:

> The “source code” for a work means the preferred form of the work for making modifications to it.

fsflover · on Oct 29, 2024

Today, the weights may be the preferable format indeed, due to the cost. Are you going to change the definition tomorrow, when the cost drops?

lolinder · on Oct 30, 2024

Sure, why not?

fsflover · on Nov 1, 2024

A good definition should not depend on a transient state of affairs.