@fribbledom I think some of the comments on that blog post have a point; the article does not address the cases where Copilot reproduces large-ish chunks of its training code.

@fribbledom She hasn't paid attention. People are largely complaining that Copilot can and does reproduce large pieces of original GPL-licensed code verbatim, not just small line fragments.

Secondly, if we take the "machine generation is not derivative work and is public domain" argument, it would set a precedent for laundering original copyrighted work through ML models such as this.

Copilot itself might not be infringing copyright, but _you_ are by using it.

@evolbug @fribbledom There's the related wrinkle that *some forms* of machine generation are uncontroversially “derived works”.

Specifically, *compilers* take a corpus of source code and generate new works, namely binaries. Under a naive reading of “the output of a machine simply does not qualify for copyright protection”, *no* existing software is eligible for copyright protection; it is *all* the output of a machine.

@evolbug @fribbledom There's a reasonable argument to be made that training a neural network on the code is sufficiently different to compiling to make the neural network and its output *not* a derived work of the training corpus, but it needs to be *actually made*.

@RAOF @fribbledom Copilot does generate genuinely unique permutations or irreducible algorithm stuff (you can't copyright pure algorithms), and the problem isn't with those. The problem is with the verbatim copies of original code it makes. That's an actual legal problem: it doesn't matter who gave you the code if it's word-for-word someone else's.

@evolbug @fribbledom The length isn’t important for whether something constitutes a protectable piece of work. What matters is called „Schöpfungshöhe“, German for, roughly, “threshold of originality”.

@aurorus @fribbledom Obviously it's not just length, but length is an implicit part of originality in code. More code typically contains more original work.

@evolbug @fribbledom No. Applied to music, your statement would mean that a sine wave of a certain length could be copyrightable, which is ridiculous (brb, trying to do that as an art performance). On the other hand, a very memorable tune only a few seconds long is, or at least can be, a protected work. Also, originality is a boolean, not a float: a given piece of work is either original or not.

@evolbug @fribbledom The rules about originality are the same for all works, and my point may be easier to see with music.

@aurorus @fribbledom You are building an inconsequential strawman against my use of "length", when my point is that it reproduces exact, verbatim, copyrighted original functions in full.

@evolbug @fribbledom No, it’s not a strawman. You’re basing your arguments on an understanding of protectable work that is, at best, a very rough approximation of „the law“. I know where you're coming from; I debated with a professor for half a year until I understood most of the intricacies of German copyright law (which complies with the revised Berne Convention on copyright and so shouldn’t differ materially across the developed world).

@evolbug @fribbledom You can’t generally say that every function is copyrightable. Some are, some are not, depending on originality, which is ultimately something a judge decides. If we both write a function for flooring a float, chances are we will end up with similar results. Something similar is happening here. It should be trivial (in the computer science sense) to show that the model can’t have just remembered everything verbatim, so it can’t copy.

@evolbug @fribbledom Therefore it has to be something like when I subconsciously remember a solution to a given problem I read somewhere and (re-)produce something (very) similar. That is creation and not copying.

@aurorus @fribbledom If you actually read what I said in its entirety, you'd see I said the exact same thing.

@evolbug @fribbledom I'm pretty sure Disney's legal team have a quite clear stance on ML models, copyright and derivative works.

@fribbledom THIS!
Finally a differentiated analysis of the matter!

I have such an aversion to people who shout polarizing content into the world just to get attention.
Sometimes I feel people have forgotten how to THINK. It's actually sad.

Thank you for sharing this.❤️

#DareToThinkForYourself #Polarized #CoPilot

@fribbledom "some commentators accuse GitHub of copyright infringement, because Copilot itself is not released under a copyleft licence" Aren't they in fact saying the code produced by Copilot is a derivative work, and thus should be released under copyleft? :blobthinkingcool:

@annika @fribbledom
Yes but as far as I remember (from this morning, when I read it), the article does go into that as well (?). Doesn't Julia Reda make the point that an algorithm cannot create derivative works?

@annika @fribbledom

"On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive. Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either."

@maltimore @annika @fribbledom So if I write a script that strips GPL-licensed codebases of their licensing, does that mean that it (being a machine) is able to produce machine-generated code that is not subject to GPL?

@kekcoin @annika @fribbledom
That is, I guess, a fair question that I don't know the answer to. I just referred to the article, not being a legal expert myself.

@fribbledom The question that remains is: if someone doesn't want their work used to train a machine, how can that person prevent it? Is it possible to ask GitHub to remove their code from training? Or is it not optional? If not, someone who does not want their work used would need to move their repositories to another platform that blocks it.

@brunofontes @fribbledom remove your code from GitHub; that's the most reliable way you can have your license respected. There are many other git hosting services you can use, perhaps @codeberg #codeberg or #sourcehut.

@robby @brunofontes @fribbledom @codeberg
What would stop them from scraping open source code from other hosting platforms for training Copilot?

@rommudoh @robby @fribbledom @codeberg Nothing, as far as I know, unless the platform can block it somehow. Maybe by requiring accounts to be approved before they get access to the code.

@brunofontes @rommudoh @fribbledom @codeberg The service could wall off the code to those with accounts, like you said; it could rate limit requests to make it infeasible to scrape the whole service; it could simply firewall GitHub IPs, etc.
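To make the rate limiting idea a bit more concrete, here is a minimal sketch of a token bucket, one common way to cap how fast a single client can send requests. Everything in it (the `TokenBucket` class, its parameters, the injectable `clock`) is a hypothetical illustration, not how any particular code hosting service does it.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical per-client limiter: each request costs one token."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the behaviour can be tested
        self.tokens = capacity  # start with a full bucket
        self.last_refill = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A service would keep one bucket per client (for example per IP address or per account) and reject or delay a request whenever `allow()` returns `False`. A low refill rate makes cloning one repository fine but mirroring the whole site slow, although a scraper spread over many addresses would still get through.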

At least for now it seems like Microsoft is only interested in code that is already on GitHub though.

@bekopharm I am trying to understand the situation and leaving it up to people to do whatever they want with their code.

Hm, why? Free Software doesn't need to be publicly free. It only means that those who have access to an executable format are entitled to the source format for a program, and the right to redistribute it under the same terms. One can require payment for access to Free Software; one can require credentials for access to Free Software.

For example, see the Pleroma GitLab. They don't require an account to view a given public repository (e.g. https://git.pleroma.social/pleroma/pleroma ).

But they do require an account to browse the list of public repositories, which would stop Microsoft from scraping the site.


It was also only one of multiple solutions mentioned.

@robby Safe way to run a project into the void, without a pragmatic way to check source, submit patches or PRs. Nobody wants to register on yet another website just to do stuff.

And the bots will do it anyway.

I can tell. Self hosting my projects for a decade now.

License infringement will happen and you only get the official ways to deal with this. As usual. A "technical" solution will not help or rescue anything. As usual. Especially GPL history is full of this.

@bekopharm I think different people want different things, no? Not everyone cares to be the next big thing; maybe they just want something that works for them.

> submit patches or PRs

Email patches work great still, no need for another account if you don't want one.

> And the bots will do it anyway.

They could, but they have little reason to. Given the control that Microsoft has over their ToS, they have a lot of legal power over people who use their services. They do not legally have the same power over software hosted on different providers, so I don't think it would be worth the hassle of scraping to get themselves into such a legal mess.

@robby I'd happily use more self hosted systems if they'd let me log in e.g. with IndieAuth or similar. Just not keen on creating yet another account.

Sure, do your thing. I do. No worries. Going self hosted and walling off because of some bot reading your repo is absurd tho.

Oh and on jumping through hoops: just yesterday we found an issue with the generic HID joystick driver in the kernel. It will go unreported because none of the parties involved are going to waste a day finding the proper report channel.

Scraping protection might be an idea, but there are also legit use cases where one would want to download a lot of FLOSS code, and you can't prevent GitHub from obtaining a copy. They could also send interns to clone all the code in Internet cafés ... as long as copies exist, they are able to get one. The question is whether they bother datamining other platforms when they can just go through their own repo storage ...

@robby @brunofontes @fribbledom @rommudoh

@robby @brunofontes @fribbledom @codeberg
Well, they could add a way to opt-out, but that's unlikely to happen and you would have to trust that they respect that option... something like the #nobot tag on mastodon profiles, or robots.txt for websites.
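As a rough illustration of how a robots.txt-style opt-out works, and why it depends on trust, here is a small sketch using Python's standard `urllib.robotparser`. The bot name `ExampleTrainingBot`, the `ROBOTS_TXT` rules and the `may_fetch` helper are made up for this example; no real crawler or policy is implied.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a code host could publish to opt out of one crawler.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

The check runs on the crawler's side: a well-behaved bot calls something like `may_fetch(...)` before each request and skips what is disallowed. Nothing in the mechanism forces a bot to ask, which is exactly the trust problem mentioned above.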

That would be nice, but I don't believe it is going to happen. At least not before the AI is trained enough to sell.
@robby @fribbledom @codeberg

@robby @fribbledom @codeberg Most of my code is spread across other services or private git servers (some plain ones and a Gitea). But this is something people will need to consider from now on.

in the mood for a rant 

@fribbledom interesting that julia reda now also can do the glorious word twisting for which other politicians are known.

copilot is not "ok". machine "learning" is just a mathematical transformation, not magically non-copyrighted because "a machine did it!!!111". i could as well just save windows source code in EBCDIC and say it isn't copyrighted anymore. OR ENCODE FUCKING MUSIC IN $CODEC AND IT ISN'T COPYRIGHTED ANYMORE! 🖕
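The EBCDIC comparison can be sketched in a few lines. This only illustrates the point about re-encoding (the bytes change completely, the text does not); it assumes the `cp037` codec as the EBCDIC variant, and the helper names are made up. It says nothing about how a trained model actually stores its training data.

```python
def to_ebcdic(text: str) -> bytes:
    """Encode text with cp037, one EBCDIC code page shipped with Python."""
    return text.encode("cp037")


def from_ebcdic(data: bytes) -> str:
    """Decode cp037 bytes back into text."""
    return data.decode("cp037")


source = "int main(void) { return 0; }"
encoded = to_ebcdic(source)

# The stored bytes differ from the ASCII representation...
print(encoded != source.encode("ascii"))
# ...but decoding recovers the original text exactly.
print(from_ebcdic(encoded) == source)
```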

our whole legal system is broken, everything is bullshit. have the money, buy the law.

maybe one shouldn't have based it on the roman law system.


@fribbledom Strong agree on copilot not infringing by virtue of not being open source, but I think her second point (which responds to the only criticism I’ve seen online) is questionable. Unclear how “a machine cannot produce work” fits with compilers producing work which has always gotten copyright protection. Her point here feels like motivated reasoning based on her (laudable, and stated up front) general desire to see ever-weakened copyright.


So can I take proprietary code, train my own ML on it and then use the resulting suggestions in an open source project?
