@thomasfuchs do schools pay copyright fees for the knowledge they transfer to their students?
Imagine this: parents drop their kids off at a library, telling them to "read everything". The kids grow up to be highly paid professionals thanks to this....
Would every single author of the books in that library be entitled to compensation from the kids because their success is based on the knowledge from those books?
@emovulcan @thomasfuchs
Not only do libraries have to buy books, they also pay a fee for lending them out to a collection association, which collects the fees on behalf of the authors. Or at least that's what happens in Germany and Denmark, to the best of my understanding.
@emovulcan @thomasfuchs I get where you’re coming from but that wasn’t a good analogy
@acute_distress @thomasfuchs if you learn about neural networks and LLMs you might find my analogy adequate.
I posted a link to a 1-hour YouTube video explaining LLMs in very easy-to-understand terms. You can find it in my post history.
I think there is a misconception that LLMs are nothing more than giant hard drives that return whatever was fed to them (hence copyright infringement). That's not how they work at all.
@emovulcan oh no it's NFTs all over again
@emovulcan @acute_distress @thomasfuchs Generative AI and human intelligence are different in many ways.
Generative AI is closer to a hard drive that regurgitates stored knowledge than it is to a human that learned a concept and then generated a wholly new work that incorporates that idea and several others.
Also, when humans take knowledge and synthesize or extend from it, there's a whole set of ethical and legal rules and customs that OpenAI and others have mostly ignored.
@scottmiller42 @acute_distress @thomasfuchs Correct me if wrong, but I don't think there is any "stored knowledge" in LLMs like the one used by #openai.
I think that, in very simplistic terms, what LLMs have is an index of words with precomputed statistics for the most probable next word (toy sketch below).
Now, if you pass URLs as parameters to GPT4, it will access those sites as part of processing the answer.
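To make that picture concrete, here's a toy sketch of the "precomputed next-word statistics" idea. A real LLM is a neural network, not a lookup table like this, so treat it purely as an illustration of next-token prediction (the corpus and code here are made up):

```python
from collections import Counter, defaultdict
import random

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (bigram statistics).
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def next_word(word):
    """Sample a next word in proportion to how often it followed `word`."""
    candidates = next_counts[word]
    if not candidates:                      # dead end: word never had a successor
        return random.choice(corpus)
    words, weights = zip(*candidates.items())
    return random.choices(words, weights=weights, k=1)[0]

# "Generate" a few words starting from "the".
out = ["the"]
for _ in range(5):
    out.append(next_word(out[-1]))
print(" ".join(out))
```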
@emovulcan @acute_distress @thomasfuchs “ Correct me if wrong, but I don't think there is any "stored knowledge" in LLMs like the one used by #openai.”
Yes, you are VERY WRONG.
@emovulcan @acute_distress @thomasfuchs Quote below from "Image-generating AI can copy and paste from training data, raising IP concerns" by Kyle Wiggers, December 13, 2022.
"The suit hinges on the fact that Copilot — which was trained on millions of examples of code from the internet — regurgitates sections of licensed code without providing credit."
@emovulcan @acute_distress @thomasfuchs And note how in both of the examples I provided, I cited the source I was quoting, something that OpenAI products usually don't do.
@scottmiller42 @acute_distress @thomasfuchs
The actual paper is at https://arxiv.org/pdf/2311.17035.pdf
After skimming through that paper:
1. Memorization (verbatim copy of training data) is considered a bug.
2. The researchers gave the LLM vendors 90 days to fix the bug before releasing the paper.
3. The highest memorization rate among all LLMs tested was 1.5% (page 19). A rough sketch of what counts as a memorized span is below.
Turns out we are both right ...
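For context, a crude sketch of what "memorization" means operationally in this kind of study: check whether long word spans emitted by a model also appear verbatim in a known corpus. The paper's actual methodology is far more involved; the strings below are placeholders, not real model output or training data:

```python
def memorized_spans(generated: str, corpus: str, min_words: int = 8):
    """Return spans of `min_words` consecutive words from `generated`
    that also appear verbatim in `corpus` (a naive substring check)."""
    words = generated.split()
    hits = []
    for i in range(len(words) - min_words + 1):
        span = " ".join(words[i:i + min_words])
        if span in corpus:
            hits.append(span)
    return hits

# Placeholder strings standing in for training data and model output.
training_data = "it was the best of times it was the worst of times"
model_output = "the model said it was the best of times it was the worst of times indeed"
print(memorized_spans(model_output, training_data))
```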
@emovulcan @acute_distress @thomasfuchs LOL a bug? You believe that? Holy crap you are so gullible.
The only thing they are going to do is an even better job of obfuscating the sources they steal from.
How can you call yourself right when literally everything you said was wrong? Hahaha.
@emovulcan @acute_distress @thomasfuchs You literally said ‘ Correct me if wrong, but I don't think there is any "stored knowledge" in LLMs like the one used by #openai ‘
LOL LOL LOL
Let me give you some advice. Be more skeptical of companies that have spent tens of billions on something they now need to turn a profit on. Especially when they've already been caught stealing.
@emovulcan @acute_distress @thomasfuchs I just got OpenAI's DALL-E 3 to generate these pictures of Pikachu with just 2 prompts (using Bing):
"Make a picture of a mouse monster that uses electric ball attack."
and
"The mouse monster should be a pocket monster that is yellow with red cheeks."
I didn't ask it to add a Poké Ball. It just somehow, mysteriously, knew that it should add it.
But I suppose you still think somehow it isn't memorizing the copyrighted Pokémon art?
@scottmiller42 @acute_distress @thomasfuchs I think Microsoft runs its own version of GPT4 enhanced by all the data they have available from Bing's search index.
So yeah, Microsoft might be in a worse place, copyright wise.
@emovulcan @acute_distress @thomasfuchs That's a funny way of saying "I was COMPLETELY wrong when I said ' I don't think there is any "stored knowledge" in LLMs like the one used by openai.'"
Just give it a try. Admitting your mistake will set you free.
@scottmiller42 @acute_distress @thomasfuchs I have no problem admitting being wrong after being presented with EVIDENCE.
And no, screenshots from the Internet are not proper evidence, contrary to what antivaxx/flat earth morons claim.
@emovulcan @acute_distress @thomasfuchs LOL You don't think a screenshot that shows me using the LLM and the result it produced is evidence?
LOL
@scottmiller42 @acute_distress @thomasfuchs You pointed me to an article from The Register about a research paper.
When somebody points me to a news article about scientific research, I ignore what the article claims is in the paper and go read the paper myself.
The paper claims "LLM memorization" is a bug.
The researchers tested multiple LLMs, including many open-source ones along with ChatGPT.
@emovulcan @acute_distress @thomasfuchs None of what you just said is relevant. The relevant facts are that you believe the claims that memorization is a bug, and thus are gullible.
@scottmiller42 @acute_distress @thomasfuchs I don't believe anything. The science in the paper looks solid, so until some other scientists release a contradictory paper, I trust its findings.
There is no room for belief or emotions in science.
@emovulcan @acute_distress @thomasfuchs Very amusing.
I just scanned the paper and the very first sentence in the Introduction:
"Large language models (LLMs) memorize examples from
their training datasets, which can allow an attacker to extract
(potentially private) information"
Seriously, quit simping for OpenAI. It's embarrassing.
@scottmiller42 @acute_distress @thomasfuchs yes, now look at page 19: the worst case was an open-source LLM, at less than 1.5%.
LLaMA 2 is about 70 GB, which would mean about 1 GB of memorization.
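The back-of-envelope math behind that number, for what it's worth (whether a rate measured on model outputs can be translated into gigabytes of weights like this is debatable, so it's a rough analogy at best):

```python
model_size_gb = 70         # the size I'm assuming for the LLaMA 2 weights
memorization_rate = 0.015  # the "less than 1.5%" worst case cited above
print(model_size_gb * memorization_rate, "GB")  # ~1.05 GB
```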
@emovulcan @acute_distress @thomasfuchs You said:
"Correct me if wrong, but I don't think there is any "stored knowledge" in LLMs like the one used by openai "
Admit you were wrong.
@scottmiller42 @acute_distress @thomasfuchs I don't have enough evidence yet.
@emovulcan @acute_distress @thomasfuchs LOL LOL LOL
It's sad that you had no problem forming your initial belief, and now stubbornly stick to it even though all the evidence says they can and do memorize training data.
@scottmiller42 @acute_distress @thomasfuchs Tell you what: I'm gonna download the ~7 GB Mistral LLM and look for "stored knowledge" in the data.
@scottmiller42 @acute_distress @thomasfuchs
Welp, I downloaded like 40 GB worth of #LLM models, untarred and unzipped them, and can't figure out the binary data format.
Does anybody have documentation on the binary format of the mistral-7B-v0.1 files?
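In case it helps anyone else poking at the same files: assuming the Hugging Face release of mistral-7B-v0.1 (safetensors shards; the exact file names may differ from whatever tarball I grabbed), something like this lists the raw tensors. There's no text in there to grep, just floating-point weight matrices:

```python
# pip install safetensors torch
from safetensors import safe_open

# Hypothetical local path -- adjust to wherever the download actually landed.
path = "mistral-7B-v0.1/model-00001-of-00002.safetensors"

with safe_open(path, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)
```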
@emovulcan @acute_distress @thomasfuchs Leave me out of it. I have no time for an obnoxious troll.
@emovulcan @acute_distress @thomasfuchs At this point, no evidence will convince you. Good bye.
@scottmiller42 @emovulcan @acute_distress @thomasfuchs
Hey Scott, I have not decided yet what to think about LLMs and copyright.
I saw the words "data analyst" in your bio, so I suppose you have some knowledge of #DataScience and #DataEngineering.
I am trained in #MachineLearning myself, and a bit in #neuralnetworks, but I am far, far away from being anything close to an expert. I can use them, sometimes efficiently, sometimes not: there are problems I am still bad at solving with them.
I am still learning the Transformer stuff and the other types of layers used in deep learning. That's huge stuff, for me at least.
Could you clear up a doubt I have when I read your messages:
when you talk about "MEMORIZING" data, it conflicts with my limited understanding of #machinelearning #neuralnet #transformer.
At my level of understanding, memorizing == overfitting,
which means the model cannot extrapolate or even interpolate between values of the training set (toy sketch at the end of this post).
This is something I have categorized as a real flaw. If any of my models overfits, I consider it broken and throw it away; it will not perform well on any unexpected input.
Could you explain to me how "memorizing" differs in your view?
Don't hesitate to simplify things; I am not yet an expert on #transformer and #LLM.
Thanks in advance. I would not like to be fooled by my a priori assumptions.
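The toy sketch mentioned above, to show what I mean by memorizing == overfitting. It is ordinary polynomial regression, nothing to do with LLMs, and the numbers are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 10)

# Degree 9 with 10 points: enough capacity to pass through every training point.
coeffs = np.polyfit(x_train, y_train, deg=9)

x_test = np.linspace(0.05, 0.95, 10)  # points between the training inputs
train_err = np.abs(np.polyval(coeffs, x_train) - y_train).max()
test_err = np.abs(np.polyval(coeffs, x_test) - np.sin(2 * np.pi * x_test)).max()

print(f"train error: {train_err:.3f}")  # ~0: the training set is "memorized"
print(f"test error:  {test_err:.3f}")   # typically much larger between the points
```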
@stphrolland let me know if he replies with actual evidence. When I asked him for it, he just muted me.... it reminded me of arguing with antivaxxers during COVID.
I gave up trying to figure out how open-source LLMs store their datasets. Anyway, whoever claims LLMs "store knowledge" needs to prove it by showing such knowledge present verbatim within the LLM data files.
@scottmiller42 @emovulcan do you have to be such an ass?
@emovulcan @thomasfuchs What these models do isn't stealing or even copyright infringement. They aren't reselling the same content. Just because you can "trick" the model into returning the original doesn't mean the model has a copy of it. They could be considered derivative works. They definitely are breaking non-commercial and attribution-required licenses, though.