Emo Vulcan: "@thomasfuchs do schools pay co…" - Mastodon

Emo Vulcan @emovulcan@mastodon.social

@thomasfuchs do schools pay copyright for the knowledge they transfer to their students?

Imagine this: parents drop their kids to a library telling them to "read everything". The kids grow up to highly paid professionals thanks to this....

Would every single author from the books in that library be entitled to compensation from the kids because their success is based on the knowledge from these books?

Jan 08, 2024, 06:25 PM··Web

0boosts·1favorite

Martin Tilo Schmitz @MartinTilo@mastodon.gamedev.place

@emovulcan @thomasfuchs
Not only do libraries have to buy books, but they also pay a fee to a Collection Association for lending them out, which collects the fees on behalf of the authors. Or at least that's what's happening in Germany and Denmark to the best of my understanding.

acute_distress @acute_distress@mastodon.nz

@emovulcan @thomasfuchs I get where you’re coming from but that wasn’t a good analogy

Emo Vulcan @emovulcan

@acute_distress @thomasfuchs if you learn about neural networks and LLMs you might find my analogy adequate.

I posted a link to a 1 hr youtube video explaining LLMs in very easy to understand terms. You can find it in my post history.

I think there is a misconception LLMs are nothing more than giant hard drives which return whatever was fed to them (therefore copyright infringement). That's not how they work at all.

Oriel Jutty @barubary@infosec.exchange

@emovulcan oh no it's NFTs all over again

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs Generative AI and human intelligence are different in many ways.

Generative AI is closer to a hard drive that regurgitates stored knowledge than it is to a human that learned a concept and then generated a wholly new work that incorporates that idea and several others.

Also, when humans take knowledge and synthesize or extend from it, there's a whole set of ethical and legal rules and customs that OpenAI and others have mostly ignored.

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs Correct me if wrong, but I don't think there is any "stored knowledge" in LLMs like the one used by #openai.

I think that, in very simplistic terms, LLMs have is an index of words with precomputed statistics of the most probable next word.

Now, if you pass urls as parameters to GPT4, it will access that site as part of processing the answer.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs “ Correct me if wrong, but I don't think there is any "stored knowledge" in LLMs like the one used by #openai.”

Yes, you are VERY WRONG.

This is a screenshot of a Google search for the phrase “ chatgpt regurgitates works”.

ChatGPT can be made to regurgitate snippets of text memorized from its training data when asked to repeat a single word over and over again, according to research published by computer scientists.Dec 1, 2023
https://www.theregister.com › chat...
ChatGPT repeating certain words can expose its training data - Theregister

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs Quote below from "Image-generating AI can copy and paste from training data, raising IP concerns" by Kyle Wiggers, December 13, 2022.

"The suit hinges on the fact that Copilot — which was trained on millions of examples of code from the internet — regurgitates sections of licensed code without providing credit."

https://techcrunch.com/2022/12/13/image-generating-ai-can-copy-and-paste-from-training-data-raising-ip-concerns/

TechCrunch · Dec 13, 2022Image-generating AI can copy and paste from training data, raising IP concerns | TechCrunchA new study offers some evidence that art-generating AI systems like Stable Diffusion copy from the data on which they were trained.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs And note how in both of the examples I provided, I cited the source that I was quoting, something that OpenAI products usually doesn't do.

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs

The actual paper is at https://arxiv.org/pdf/2311.17035.pdf

After skimming through that paper:
1. Memorization (verbatim copy of training data) is considered a bug.
2. Researchers gave 90 days to LLMs to fix the bug before releasing that paper.
3. Highest memorization among all LLMs tested was 1.5% (page 19)

Turns out we are both right ...

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs LOL a bug? You believe that? Holy crap you are so gullible.

The only thing they are going to do is an even better job of obfuscating the sources they steal from.

How can you call yourself right when literally everything you said was wrong. Hahaha.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs Yoj literally said ‘ Correct me if wrong, but I don't think there is any "stored knowledge" in LLMs like the one used by #openai ‘

LOL LOL LOL

Let me give you some advice. Be more skeptical of companies that have spent 10s of billions on something that they now need to turn a profit on. Especially when they’ve already been caught stealing.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs I just got OpenAI's DALL E 3 to generate these pictures of Pikachu with just 2 prompts (using Bing):

"Make a picutre of a mouse monster that uses electric ball attack."

and

"The mouse monster should be a pocket monster that is yellow with red cheeks."

I didn't ask it to add a Poké Ball. It just somehow, mysteriously, knew that it should add it.

But I suppose you still think somehow it isn't memorizing the copyrighted Pokémon art?

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs I think Microsoft runs its own version of GPT4 enhanced by all the data they have available from Bing's search index.

So yeah, Microsoft might be in a worse place, copyright wise.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs That's a funny way of saying "I was COMPLETELY wrong when I said ' I don't think there is any "stored knowledge" in LLMs like the one used by openai.'"

Just give it a try. Admitting your mistake will set you free.

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs I have no problem admitting being wrong after being present with EVIDENCE.

And no, screenshots from the Internet are not proper evidence, contrary to what antivaxx/flat earth morons claim.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs LOL You don't think a screenshots that shows me using the LLM and showing the result is evidence.

LOL

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs You pointed me to an article from The Register about a research paper.

When somebody points me to a news article about scientific research, I ignore what the article claim is in the paper and go read the paper myself.

The paper claims "LLM memorization" is a bug.

Researchers tested multiple LLMs including many open source ones along ChatGPT.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs None of what you just said is relevant. The relevant facts are that you believe the claims that memorization is a bug, and thus are gullible.

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs I don't believe anything. The science in the paper looks solid, so until some other scientists release a contradictory paper, I trust it's findings.

There is no room for belief or emotions in science.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs Very amusing.

I just scanned the paper and the very first sentence in the Introduction:

"Large language models (LLMs) memorize examples from
their training datasets, which can allow an attacker to extract
(potentially private) information"

Seriously, quit simping for OpenAI. It's embarassing.

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs yes, now look at page 19: worst case was an open source LLM, less than 1.5%.

LLaMA 2 is 70Gb which means about 1Gb memorization.

#rtfm #facepalm

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs You said:

"Correct me if wrong, but I don't think there is any "stored knowledge" in LLMs like the one used by openai "

Admit you were wrong.

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs I don't have enough evidence yet.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs LOL LOL LOL

It's sad that you had no problem forming your initial belief, and now stubbornly stick to it even though all the evidence says they can and do memorize training data.

#facepalm #usefulidiot

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs tell u what: Gonna download the 7Gb mistral LLM and look for "stored knowledge" in the data.

Emo Vulcan @emovulcan

@scottmiller42 @acute_distress @thomasfuchs

welp, downloaded like 40Gb worth of #LLM models, untar, unzip and can't figure out the binary data format.

Does anybody has documentation on the binary format from the mistral-7B-v0.1 files?

#openai #chatgpt #mixtral

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs Leave me out of it. I have no time for an obnoxious troll.

Scott Miller @scottmiller42@mstdn.social

@emovulcan @acute_distress @thomasfuchs At this point, no evidence will convince you. Good bye.

Stephane L Rolland-Brabant ⁂⧖⏚ @stphrolland@mathstodon.xyz

@scottmiller42 @emovulcan @acute_distress @thomasfuchs

Hey Scott, I have not decided what to think about LLMs and copyrights.

I saw the word Data analyst in your Bio, so I suppose you have some knowledge in #DataScience and #DataEngineering.

I am myself trained in #MachineLearning and a bit in #neuralnetworks but I am far far far far far away from being any close to an expert. I can use them, efficiently sometimes. Sometimes not: there are problems I am still bad at solving with them.

I am still learning the Transformers stuffs and other types of layers used in Deep Learning. That's huge stuff. For me at least.

Could you clear that doubt I have when I read your messages:
when you talk about "MEMORIZING" data

It conflicts with my poor understanding of #machinelearning #neuralnet #transformer

At my level of understanding: memorizing == overfitting
that means that the model cannot extrapolate or even interpolate between values of the training set.

This is something that I have categorized as a real flaw. If any of my models overfits, I consider it broken. I throw it away, it will not perform good on any unexpected input.

Could you explain me how "memorizing" would differ in your vision?

Don't hesitate to simplify things, I am not an expert yet at #transformer and #LLM

In advance thanks. I would not like to be fooled by my apriori.

Emo Vulcan @emovulcan

@stphrolland let me know if he replies with actual evidence. When I asked him for it, he just muted me.... it reminded me of arguing with antivaxxers during COVID.

I gave up trying to figure out how open source LLMs store their datasets. Anyways, whoever claims LLMs "store knowledge" needs to prove it by showing such knowledge being present verbatim within LLM data files.

acute_distress @acute_distress@mastodon.nz

@scottmiller42 @emovulcan do you have to be such an ass?

Scott Miller @scottmiller42@mstdn.social

@acute_distress Do you?

acute_distress @acute_distress@mastodon.nz

@emovulcan @thomasfuchs What these models do isn't stealing or even copyright infringement. They aren't reselling the same content. Just because you can "trick" the model into returning the original doesn't mean the model has a copy of it. They could be considered derivative works. They definitely are breaking non-commercial attribution required licenses

Drag & drop to upload