You're poor? Fuck you you have to pay to breathe.
Millionaire? Whatever you want daddy uwu
It took me a few days to get the time to read the actual court ruling but here's the basics of what it ruled (and what it didn't rule on):
Here's what it didn't rule on:
So it's a pretty important ruling, in my opinion. It's a clear green light to the idea of digitizing and archiving copyrighted works without the copyright holder's permission, as long as you first own a legal copy in the first place. And it's a green light to using copyrighted works for training AI models, as long as you compiled that database of copyrighted works in a legal way.
Judge, I'm pirating them to train AI, not to consume for my own personal use.
Gist:
What’s new: The Northern District of California has granted a summary judgment for Anthropic that the training use of the copyrighted books and the print-to-digital format change were both “fair use” (full order below box). However, the court also found that the pirated library copies that Anthropic collected could not be deemed as training copies, and therefore, the use of this material was not “fair”. The court also announced that it will have a trial on the pirated copies and any resulting damages, adding:
“That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages.”
So I can't use any of these works because it's plagiarism but AI can?
My interpretation was that AI companies can train on material they are licensed to use, but the courts have deemed that Anthropic pirated this material as they were not licensed to use it.
In other words, if Anthropic bought the physical or digital books, it would be fine so long as their AI couldn't spit it out verbatim, but they didn't even do that, i.e. the AI crawler pirated the book.
Does buying the book give you license to digitise it?
Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?
Definitions of "Ownership" can be very different.
It seems like a lot of people misunderstand copyright so let's be clear: the answer is yes. You can absolutely digitize your books. You can rip your movies and store them on a home server and run them through compression algorithms.
Copyright exists to prevent others from redistributing your work, so as long as you're doing all of that for personal use, the copyright owner has no say over what you do with it.
You even have some degree of latitude to create and distribute transformative works, with a violation only occurring when you distribute something pretty damn close to a copy of the original. Some perfectly legal examples: create a word cloud of a book, analyze the tone of a news article to help you trade stocks, produce an image containing the most prominent color in every frame of a movie, or create a search index of the words found on all websites on the internet.
You can absolutely do the same kinds of things an AI does with a work as a human.
You can digitize the books you own. You do not need a license for that. And of course you could put that digital format into a database, as databases are explicit exceptions from copyright law. If you want to go to the extreme: delete first copy. Then you have it only in the database. However, AIs/LLMs are not based on databases but on neural networks. The original data gets lost when "learned".
If you want to go to the extreme: delete first copy.
You can; as I understand it, the only legal requirement is that you only use one copy at a time.
i.e., I can give my book to a friend after I'm done reading it; I can make a copy of a book, keep one at home and one at the office, and switch off between reading them; I'm not allowed to make a copy of the book, hand one to a friend, and then have both of us read it at the same time.
Does buying the book give you license to digitise it?
Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?
Yes. That's what the court ruled here. If you legally obtain a printed copy of a book you are free to digitize it or archive it for yourself. And you're allowed to keep that digital copy, analyze and index it and search it, in your personal library.
Anthropic's practice of buying physical books, removing the bindings, scanning the pages, and digitizing the content while destroying the physical book was found to be legal, so long as Anthropic didn't distribute that library outside of its own company.
That's not what it says.
Neither you nor an AI is allowed to take a book without authorization; that includes downloading and stealing it. That has nothing to do with plagiarism; it's just theft.
Assuming that the book has been legally obtained, both you and an AI are allowed to read that book, learn from it, and use the knowledge you obtained.
Both you and the AI need to follow existing copyright laws and licensing when it comes to redistributing that work.
"Plagiarism" is the act of claiming someone else's work as your own, and it's orthogonal to the use of AI. If you ask either a human or an AI to produce an essay on the philosophy surrounding suicide, they're fairly likely to include some Shakespeare quotes. It's only plagiarism if the human or the AI fails to provide attribution.
But I thought they admitted to torrenting terabytes of ebooks?
That part is not what this preliminary judgment is about. The torrenting part is going to go to an actual trial. This part was about the authors' claim that the act of training AI itself violated copyright, and this is what the judge has found to be incorrect.
Facebook (Meta) torrented TBs from Libgen, and their internal chats leaked so we know about that, and IIRC they've been sued. Maybe you're thinking of that case?
Billions of dollars, and they can't afford to buy ebooks?
Facebook did, but technically downloading (leeching) isn't illegal while distributing (seeding) is, and they did not seed.
Check out my new site TheAIBay, you search for content and an LLM that was trained on reproducing it gives it to you, a small hash check is used to validate accuracy. It is now legal.
The court's ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted work to train the weights of the LLMs, but is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right to sue over those facts.
But the facts before the court were that Anthropic's LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.
OK, so you can buy books and scan them, or buy ebooks, and use them for AI training, but you can't just download pirated books from the internet to train AI. Did I understand that correctly?
Sure, if you purchase your training material, it's not a copyright infringement to read it.
We needed a judge for this?
Yes, because just because you bought a book you don't own its content. You're not allowed to print and/or sell additional copies or publicly post the entire text. Generally it's difficult to say where the limit is of what's allowed. Citing a single sentence in a public posting is most likely fine, citing an entire paragraph is probably fine, too, but an entire chapter would probably be pushing it too far. And when in doubt a judge must decide how far you can go before infringing copyright. There are good arguments to be made that just buying a book doesn't grant the right to train commercial AI models with it.
Makes sense. AI can “learn” from and “read” a book in the same way a person can and does, as long as it is acquired legally. AI doesn’t reproduce a work that it “learns” from, so why would it be illegal?
Some people just see “AI” and want everything about it outlawed basically. If you put some information out into the public, you don’t get to decide who does and doesn’t consume and learn from it. If a machine can replicate your writing style because it could identify certain patterns, words, sentence structure, etc then as long as it’s not pretending to create things attributed to you, there’s no issue.
Ask a human to draw an orc. How do they know what an orc looks like? They read Tolkien's books and were "inspired" by Peter Jackson's LOTR.
Unpopular opinion, but that's how our brains work.
Good luck breaking down people's doors for scanning their own physical books for their personal use when analog media has no DRM and can't phone home, and paper books are an analog medium.
That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.
It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the Internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.
This does raise an interesting case where libraries could end up training and distributing public domain AI models.
i will train my jailbroken kindle too...display and storage training... i'll just libgen them...no worries...it is not piracy
Of course we have to have a way to manually check the training data, in detail, as well. Not reading the book, I'm just verifying training data.