this post was submitted on 03 Aug 2023
I'm not a lawyer, but my understanding of a license is that it gives me permission to use/distribute something that's otherwise legally protected. For instance, software code is protected by copyright, and FOSS licenses give me the right to distribute it under some conditions.

However, LLMs are produced by a computer and aren't covered by copyright. So I was hoping someone with a better understanding of the law could answer some questions for me:

  1. Is there some legal framework that protects AI models, so that I'd need a license to distribute them? How about using them, since many licenses do restrict use as well?

  2. If the answer to the above is no: By mentioning, following and normalizing LLM licenses, are we essentially helping establish the principle that we do need permission from companies to use their models, and that they have the right to restrict us?

[email protected] 4 points 2 years ago

IANAL. TINLA.

The machine producing the derivative work is a thing, not a person, which means it cannot hold a copyright on anything. If it somehow did produce something original, that work would be in the public domain.

The weights of the model, however, would likely be considered a derivative work of the training data, because they were created directly from it. Thus, the copyright of the weights would belong to whoever owns the copyright to the training data.

The training data is assembled from thousands/millions/billions of individually copyrighted works. This would also constitute a derivative work, but there's an escape hatch: fair use. If the use of the original works is transformative enough, the creator of the derivative work retains their copyright.
Collecting the data from which the weights are created is (somewhat) manual work done by humans. You could make a good argument for this being fair use.

It all hinges on whether or not that is true. If it is, ML companies will carry on as they have. If it isn't, the people creating the datasets would have to license the individual works used in the training data from the respective copyright holders.

In practice, nothing is black and white and this is still a hotly debated topic for which no clear answer exists. None of this is court-tested to my knowledge.

OTOH: There's another legal question here: Is creating weights from training data fair use, or does it produce a derivative work? If it's fair use, then whoever creates the weights gets the copyright; but here the creator is a machine, which would mean nearly all ML models end up in the public domain.


Opinion and wild speculation:

Creating weights out of training data being fair use would be... interesting, but I doubt that will happen. It's sometimes fairly obvious that weights are a derivative work of their training data, because in some cases you can make a model reproduce its training data almost verbatim, as in the sketch below.
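
As a rough illustration of what "reproduce training data" means here, a minimal sketch (assuming the Hugging Face transformers library and the public GPT-2 checkpoint; the prompt is just an arbitrary, widely reproduced passage, and there's no guarantee any particular model completes it verbatim) is to feed a model the start of a text it has probably seen during training and let it continue greedily:

```python
# Minimal sketch: probe a language model for memorized training text.
# Assumes the Hugging Face `transformers` library and the public GPT-2 checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prefix of a widely reproduced passage; if the model continues it
# near-verbatim, that's evidence the text was memorized into the weights.
prompt = "We hold these truths to be self-evident, that all men are"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding, so the output is the model's single most likely continuation.
output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```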

I am fairly certain that model weights will be considered a derivative work of the training data, i.e. the copyright of the weights would belong to whoever owns the copyright to the training data.

What I suspect will happen on the training data front is that the collection and tagging will (at some point) be considered a transformative action, making it fair use.

I think this way because artists do not have a lobby, so even if the judiciary decided that collecting training data wasn't fair use, the rich tech companies would get their way because they can woo the legislature with their """AI""", creating new copyright exceptions so that aristocrat pockets can continue to be filled with peasant money.
Far more convincing is the contraposition: if collecting training data weren't fair use, that would benefit the peasants; ML companies would have to license works from individual artists and pay them license fees. We can't have aristocrat money going into peasant pockets.