this post was submitted on 10 Jul 2023
95 points (100.0% liked)
Technology
you are viewing a single comment's thread
view the rest of the comments
LLMs are not book reports. They are not synthesizing information. They're just pulling words based on probability distributions. Those probability distributions are based entirely on what training data has been fed into them.
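To make that concrete, here's a minimal, purely illustrative sketch in Python (the vocabulary and the numbers are invented, not taken from any real model): given the text so far, the model assigns a probability to each candidate next word, and one is sampled at random.

```python
import random

# Invented, illustrative distribution over the next word for some context.
# In a real LLM these weights come entirely from the training data.
next_word_probs = {
    "the":   0.40,
    "a":     0.30,
    "book":  0.25,
    "mauve": 0.05,
}

def sample_next_word(probs):
    """Pick the next word by sampling from the probability distribution."""
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

print(sample_next_word(next_word_probs))  # e.g. "the"
```

There's no synthesis or understanding anywhere in that loop; the output can only be as varied as the weights the training data produced.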
You can see what this really means in action when you ask them to spit out paragraphs on topics they haven't ingested enough sources for. Their distributions are sparse, and they'll produce entire chunks of text pulled directly from those few sources, without citation.
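A toy continuation of the same sketch (again with invented data) shows that failure mode: when the training data for a topic is thin, the distribution at each step can collapse onto a single continuation, and decoding just replays the source text verbatim.

```python
# Invented example: for this "topic" only one source was ever ingested, so
# each two-word context has exactly one possible next word.
memorized = {
    ("only", "one"):       {"source": 1.0},
    ("one", "source"):     {"covered": 1.0},
    ("source", "covered"): {"this": 1.0},
    ("covered", "this"):   {"topic": 1.0},
}

def continue_text(context, steps):
    out = list(context)
    for _ in range(steps):
        dist = memorized.get(tuple(out[-2:]), {})
        if not dist:
            break
        # With all the probability mass on one word, "generation" is copying.
        out.append(max(dist, key=dist.get))
    return " ".join(out)

print(continue_text(("only", "one"), 4))
# -> "only one source covered this topic"
```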
If you wrote a book report that just reprinted significant swaths of the book, that would be plagiarism, and yes, it would 100% be called copyright infringement.
Importantly, though, the copyright infringement for these models does not come at the point where they spit out passages from a copyrighted work. It occurs at the point where the work is copied and used for purposes that fall outside what the work is licensed for. And most people have not licensed their words for billion-dollar companies to use them in for-profit products.
@Kichae
The exact same thing a human does when writing a sentence. I'm starting to think that the backlash against AI is simply because it's showing us what simple machines we humans are as far as thinking and creativity go.
Do you have an example of this? I've used GPT extensively for a while now, and I've never had it do that. If it gives me a chunk of data directly from a source, it always lists the source for me. However, I may not be digging deep enough into things it doesn't understand. If we have a repeatable case of this, I'd love to see it so I can better understand it.
This is the meat and potatoes of it. When a work is made public, be it a book, movie, song, physical or digital, it becomes available for the public to consume freely, and it then becomes part of our own particular data set. However, the public, up until a year ago, wasn't capable of doing what an AI does on such a large scale and with such ease of use.

The problem isn't that it's using copyrighted material to create. Humans do that all the time; we just call it an "homage" or "parody" or "style". An AI can do it much better, much more accurately, and much more quickly, though. That's the rub, and I'm fine with updating the laws based on evolving technology, but let's call a spade a spade: AI isn't doing anything that humans haven't been doing for as long as there has been verbal storytelling. The difference is that AI is so much better at it than we are, and we need to decide whether we should adjust what we allow our own works to be used for. If we do, though, it must affect the AI in the same way that it does the human, otherwise this debate will never end. If we hamstring the data that an AI can learn from, a human must have the same handicap.
There's a difference that's clear if you teach students, say in the sciences. Some students just memorize patterns in order to get done with the course and exam: "when they ask me something that contains these words, I use this formula and say these things; when they ask me something that contains these other words, then..." and so on. Some are really good at this, and can memorize a lot of slight variations, and even pass (poorly constructed) written exams that way.
But they lack understanding. They don't know and understand why they should pull out a particular formula instead of another. And this can be easily brought to the surface by asking additional questions and digging deeper.
This is what current large language models look like.
It's true, though, that a lot of our education system today fosters that way of studying, by memorization & parroting rather than by understanding. We teach students to memorize definitions conveniently written in boldface in textbooks and to repeat them on the exam, because it takes less effort and allows institutions to make it look like they're managing to "teach" tons of stuff in a very short time.
Today's powerful large language models show how flawed most of our current education system is. It's producing parrot people with skills easily replaceable by algorithms.
But knowledge and understanding are something different. When an Einstein gets the mind-blowing idea of interpreting the force of gravity as curvature of spacetime, sure, he's using previous knowledge, but he isn't mimicking anything, he's making a leap.
I'm not saying that there's a black & white divide between knowledge and understanding on one side, and pattern-operation on the other. Probably, in the end, knowledge is operation with patterns. But it happens at a much, much, much deeper level than in current large language models: patterns of patterns of patterns of patterns of patterns. Someone once said that good mathematicians see analogies between things, but great mathematicians see analogies between analogies.