GitHub Copilot is not infringing your copyright : opensource

[–] [email protected] 17 points 4 years ago (2 children)

If it is true that Copilot only generates small snippets that aren't under copyright, then why doesn't Microsoft train it on their own internal source code? Having more training data is good, and they claim that there is nothing to worry about. Seems very hypocritical.

The output of a machine simply does not qualify for copyright protection – it is in the public domain. That is good news for the open movement and not something that needs fixing.

This is great. Someone should train a machine learning model on leaked Windows source code, and use it to generate a public domain implementation of Windows. The same should be possible with music or movies. But it can't be a way to strip open source licenses while leaving proprietary copyright intact.

If Copilot will lead to copyright being abolished completely, I am all for it.

[–] [email protected] 2 points 4 years ago* (last edited 4 years ago) (1 children)

No, you misunderstood that sentence.

Let's take those AI-generated anime faces... they are in the public domain because a machine created them (great news, I wasn't sure about it before), but if you narrow the AI model's parameters down so that the resulting image looks (nearly) exactly like an existing copyrighted image, then you (!) are still committing copyright infringement.

Or to put it differently, a robot pen that paints random but nice looking lines can't create a copyrighted work, but if you restrict the pen to make a picture that looks very close to an existing copyrighted artwork then that is a copyright infringement regardless of how that image was created.

Edit: AI generated code and art will make copyright mostly worthless, but not void.

[–] [email protected] 2 points 4 years ago (1 children)

How is a programmer who uses Copilot going to know that the snippet they are being offered comes from a GPL-licensed project? At the moment that's impossible, so it can't be the standard assumption that the output is public domain.

[–] [email protected] 2 points 4 years ago* (last edited 4 years ago) (1 children)

How is a programmer going to know that the person who posted code on Stack Overflow hasn't taken it from a GPL-licensed project? But that is beside the point and irrelevant to the question of whether the ML model itself is (legally speaking) a derivative of the training data used. IANAL, but this is currently not the case under existing copyright legislation around the globe.

As for the output itself: there is a legal concept in copyright law that really small snippets of text or sound cannot be copyrighted. If the AI then assembles genuinely new code and functionality from these snippets (theoretically feasible, but not what the co-pilot does), then the resulting code is in the public domain, as (IANAL) a machine currently cannot hold copyright (and the legal case for its owners being able to claim copyright AFAIK hasn't been fully established in courts). But if a human programmer uses a tool like the co-pilot to assemble these snippets, he or she can claim copyright on the result.

But if the result is nearly indistinguishable from a copyrighted piece of code, then that programmer will not be able to prove that it wasn't in fact a copyright violation, and thus in practice it is.

[–] [email protected] 2 points 4 years ago (1 children)

Posting code on Stack Overflow doesn't magically put it in the public domain, as Copilot allegedly does.

(theoretically feasible, but not what the co-pilot does)

I am not considering what Copilot could or might do in the future. I am talking about what it does now, and that is generating exact copies of 10+ lines, including license texts, which it certainly didn't assemble on its own.

[–] [email protected] 2 points 4 years ago (1 children)

No one is claiming that the co-pilot is magically putting all code it suggests in the public domain. That is just a strawman argument.

If the code snippets it suggests have insufficient technical complexity to be considered copyrightable, then they are, like any other such text snippet (regardless of the source), in the public domain. This is half-line or single-line auto-completion-level stuff.

If the programmer chooses to keep pressing the autocomplete button so that a sufficiently complex piece of code is pasted into their editor, then that programmer has to be aware that this is likely a copyright violation, just as if he or she were cutting and pasting large pieces of code from Stack Overflow or any other source where the license isn't clear.

[–] [email protected] 2 points 4 years ago (1 children)

Will Copilot warn the original author of the stolen code in that case, so that they can sue the copyright violator? Why does Copilot even allow inserting more than one line in that case? If you are right, that means it is actively encouraging copyright violation, which puts it on the same level as thepiratebay.org.

[–] [email protected] 0 points 4 years ago* (last edited 4 years ago)

Will your preferred code editor warn the original author that you just cut and pasted some copyrighted code into it? How would it even know?

It allows inserting more than one line because it is dumb and cannot know whether the piece of code it referenced is copyrighted or who wrote it. It just looks at the immediate context around your cursor, then looks at its database where it says "usually these three words are followed by these other three words or letters" and then suggests that (to put it very simply).
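
To make that simplified description concrete, here is a toy sketch of a "these three words are usually followed by this one" lookup table. It is only an illustration of the idea as described above, not how Copilot is actually implemented; the function names `build_table` and `suggest` and the tiny corpus are made up for the example.

```python
from collections import Counter, defaultdict


def build_table(corpus_lines, context_size=3):
    """Count which token follows each run of `context_size` tokens."""
    table = defaultdict(Counter)
    for line in corpus_lines:
        tokens = line.split()
        for i in range(len(tokens) - context_size):
            context = tuple(tokens[i:i + context_size])
            table[context][tokens[i + context_size]] += 1
    return table


def suggest(table, typed_text, context_size=3):
    """Return the most frequent continuation of the last tokens typed, or None."""
    context = tuple(typed_text.split()[-context_size:])
    followers = table.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training corpus. The table keeps only token counts: nothing
# records who wrote a line or under which license it was published.
corpus = [
    "for i in range ( n ) :",
    "for i in range ( len ( xs ) ) :",
    "for i in range ( n ) :",
]
table = build_table(corpus)
print(suggest(table, "for i in range"))
print(suggest(table, "i in range ("))
```

Note that nothing in such a table could tell the user where a suggestion came from, which is the point being made here: the lookup has no notion of author or license.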

And no it is not anywhere close to the Piratebay ;)

[–] [email protected] 2 points 4 years ago

Well, I wouldn't want my code autocomplete to learn from MS's code...

[–] [email protected] 13 points 4 years ago (1 children)

I had major issues with every one of their points, but it'd become a long essay. Long story short, they're a proponent of "soft copyleft", which states that anyone should be able to use code in any way they want, including corporations or malicious actors.

This stands directly opposed to strong copyleft proponents like myself and those who use the GPL for protection against malicious actors. Unlike Julia, we want enforcement of those protections, not a laissez-faire attitude towards companies mining and then closed-sourcing or profiting off our work.

[–] [email protected] -1 points 4 years ago* (last edited 4 years ago) (1 children)

This point is addressed in the article though, and the point is that strong copyleft proponents should indeed ask for existing copyright to be enforced (in GPL violation cases etc.), but what they have actually mostly been doing is asking for copyright to be expanded, which as the article outlines is very likely going to backfire.

Copyleft is a copyright hack indeed, but one that was meant to circumvent copyright, not expand it.

[–] [email protected] 7 points 4 years ago (1 children)

GitHub Copilot is already in clear violation of the GPL repos it's using. But yes, I 100% agree that if it's currently inadequate, then enforcement should be expanded. This backfires on no one except those who use GPL projects maliciously.

[–] [email protected] -2 points 4 years ago (1 children)

It is clearly not in violation; I am not sure where you get that impression. Just like GitHub, the software itself is not in violation of the GPL.

And the article gives some very good examples of how expanding copyright could backfire very badly.

[–] [email protected] 6 points 4 years ago (1 children)

How is it not in violation? It's reading repos and reproducing snippets from them, without crediting the authors or the licenses.

[–] [email protected] -2 points 4 years ago* (last edited 4 years ago) (1 children)

Reading public code is not a copyright violation, and neither is reproducing tiny snippets from it. The latter falls under fair use and/or doesn't even have sufficient complexity to fall under copyright in the first place, e.g. you can't copyright "1+1=2".

And if you use the copilot to reproduce more complex code, then the programmer, not the tool, is committing the copyright violation.

Steelmanning your argument, you could think the copilot itself is a derivative work of the code it read, but this AFAIK isn't the case, as it is building its own database out of it and then only referencing this database. You might have a slightly stronger argument that this database is a derivative work, but as far as I can tell there is nothing in the GPL that forbids creating a code database and reading from it. If there were, then GitHub itself (a giant code database) would be in violation of the GPL.

[–] [email protected] 6 points 4 years ago (1 children)

The GPL specifies that derived works have to be licensed under the GPL, and similarly for other licenses. Their ML model wouldn't exist without the GPL code, ergo it's a derived work. GitHub is not comparable at all, because the code hosted there is just data, not a core part of its functionality.

[–] [email protected] 0 points 4 years ago* (last edited 4 years ago)

Feel free to disagree, but my (somewhat limited) understanding of such AI models says that the model data is not a core part of its functionality either.

Edit: It's like saying "the internet" is a core part of Google's search algorithm's functionality.

[–] [email protected] 7 points 4 years ago (1 children)

I'm not familiar with GitHub Copilot's internal workings, copyright law, or the author of the article. However, some ideas:

GitHub Copilot's underlying technology probably cannot be considered artificial intelligence. At best, it can only be considered a context-aware copy-paste program. However, it probably does what it does due to the programming habits of human developers, and how we structure our code. There are established design patterns - ways to do things - that most developers follow: certain names we give to certain variables, certain design patterns that we use in a specific scenario. If you think of programming as a science, you could say that the optimum code for common scenarios in a language has probably already been written.

Human devs' frequent use of 1) tutorial/example/sample code of frameworks, libraries, whatnot and 2) StackOverflow code strengthens this hypothesis. Copilot is so useful (allegedly) - and blatantly copying, for example, GPL code (allegedly) - simply because a program trained on a dataset of crowdsourced, optimal solutions to problems devs face will more often than not simply take that optimal solution and suggest that solution in its entirety. There's no better solution, right? For all I've heard, GitHub Copilot is built on an "AI" specializing in languages and language autocompletion. It may very well be that the "AI" simply goes, when the dev types this code, what usually comes up after? Oh, that? Let's just suggest that then.

There's no real getting around this issue, as developers probably do this when they write their code too. Just use the best solution, right? However, for many algorithms, developers know how they work and implement them based on that knowledge, not because in most code the algorithm looks like this algorithm in FOSS project XYZ. They probably won't use the same variable names either. Of course, it could be argued that the end product is the same, but the process isn't. This is where the ethical dilemma comes up. How can we ensure that the original solvers of the problem, or task, are credited or gain some sort of material benefit? Copilot probably cannot just include the license of the code it has taken and its author when suggesting code snippets, because of how the dataset may be structured. How could it credit code snippets it uses? Is what it does ethical?

I do agree with the article that Copilot does not currently violate copyright law of code protected by the GPL or other licenses, simply due to exceptions in the application of copyright licenses, or the fine print. I don't know what could be a possible solution.

[–] [email protected] 3 points 4 years ago

Thanks for stating my point more eloquently :)

As for Julia Reda, she is a former member of the EU parliament specializing in Copyright law from the perspective of the Pirate Party... tl;dr pro-copyleft but ultimately anti-copyright in general.

[–] [email protected] 4 points 4 years ago (1 children)

I read the article, upvoted for the discussion, and disagree on one point.

In the comments on the article, there is one explaining the real situation.

The article would be right for most cases with little code but not for complete function declarations and parts of code as is happening.

[–] [email protected] 1 points 4 years ago* (last edited 4 years ago) (2 children)

The article would be right for most cases with little code but not for complete function declarations and parts of code as is happening.

Sure, but that is already covered by existing copyright. It is the same as cut&pasting some larger code pieces from Stackoverflow without knowing where it originally came from.

[–] [email protected] 4 points 4 years ago* (last edited 4 years ago) (1 children)

I don't understand what you mean.

But insofar as the copied code can be considered a module of the program, it violates the license, and at this size it can be.

The same can happen if what you copied from a forum is protected.

The case with Stackoverflow depends on their license and size too, which I didn't read but should be explained in their terms and conditions.

[–] [email protected] 2 points 4 years ago* (last edited 4 years ago) (1 children)

Hmm, how do I explain better... the co-pilot is just a tool: just as you can abuse cut&paste to copy code out of any project regardless of the license, you can abuse this copilot.

In the end it still matters what you as the programmer do, and when you make the copilot just paste large parts of code that you don't understand and don't know where it originally comes from (like most of the code found on stackoverflow) then it is probably better not to release that code publicly where anyone can compare it to copyrighted works.

Because even if you didn't know you were committing a copyright infringement (because the copilot helped you do it), it is still a copyright infringement nevertheless.

[–] [email protected] 2 points 4 years ago

Aaaaaaaaah in that sense. Now I understood you.

Yes, I hadn't thought about it at all, but given that my classmates in SysAdmin and WebDev vocational training were copy-pasting code into their projects, and most of them were doing it... I think this will end up being a big problem.

[–] [email protected] 1 points 4 years ago* (last edited 4 years ago)

The article already states its position on the "irrelevant parts" that copyright does not cover.

The comment I pointed to describes something outside that scope.

[–] [email protected] 3 points 4 years ago* (last edited 4 years ago)

[L] posted 26 hours ago, 60 comments now : https://lobste.rs/s/bmdesp/github_copilot_is_not_infringing_your

[–] [email protected] 3 points 4 years ago* (last edited 4 years ago)

I will ignore what kind of person is writing this article (NATO supporter) and just focus on the article itself.

The main points the article makes, which aim to validate Microsoft's behavior, are:

A weird interpretation of what public domain means.
It's not a legal issue. It's a problem of the copyleft / free software communities.
The output generated by an AI doesn't have a license.

As someone said on the post's comments:

Hi, nice article. In the article you often use the word like “training of the AI” or some variations of it. This hides one big misunderstanding: this is not actually “intelligent”, this is a statistically programmed software. It is not intelligent it is only programmed to output the best results based statistically on the data it is used to program the software.

Examples: Copilot producing a copy/paste of the original fast inverse square root code, or being so smart that it can "swear".

By the way, I think it's important to implement a Copilot alternative now!! (Comrade??) which should be "trained" with fair/legal source code given by its creators, supported by the entire community and independent of any corporation. Of course free in both senses.

[–] [email protected] -2 points 4 years ago (1 children)

For those down-voting: did you read the article and know who Julia Reda is? Hint: this is definitely not a Microsoft shill piece.

[–] [email protected] 3 points 4 years ago (1 children)

This wouldn't be the first time that a politician was bought by lobbyists.

[–] [email protected] 2 points 4 years ago

But the argument she makes is very consistent with her previous stance on copyright issues.
