this post was submitted on 05 Sep 2025
178 points (98.9% liked)

News

If approved, the settlement would be the largest in the history of American copyright cases, according to a lawyer for the authors behind the lawsuit.

Anthropic, a major artificial intelligence company, has agreed to pay at least $1.5 billion to settle a copyright infringement lawsuit filed by a group of authors who alleged the platform had illegally used pirated copies of their books to train large-language models, according to court documents.

“If approved, this landmark settlement will be the largest publicly reported copyright recovery in history, larger than any other copyright class action settlement or any individual copyright case litigated to final judgment,” said Justin Nelson, a lawyer for the authors.

The lawsuit, filed in federal court in California last year, centered on roughly 500,000 published works. The proposed settlement amounts to a gross recovery of $3,000 per work, Nelson said in a memorandum to the judge in the case.
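
As a quick sanity check on those figures, the per-work amount times the number of works should land near the headline total. This is only a minimal sketch: the variable names are made up for illustration, and both inputs are the approximate numbers quoted above, not exact values from the court filing.

```python
# Figures quoted in the article; both are approximate.
works = 500_000        # "roughly 500,000 published works"
per_work_usd = 3_000   # "gross recovery of $3,000 per work"

# Gross recovery implied by the two figures above.
total_usd = works * per_work_usd
print(f"${total_usd:,}")  # $1,500,000,000
```

That matches the "at least $1.5 billion" headline figure.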

[–] FlowVoid@lemmy.world 3 points 1 month ago (1 children)

They will need actual training data when they want to develop the next version of their LLM.

[–] GissaMittJobb@lemmy.ml 3 points 1 month ago (2 children)

I guess it depends on how important the old data is when building new models on top of old ones, which I fully admit I don't know the answer to. As I understand it, though, new models are not trained fully from scratch but are instead a continuation of the older model, trained with new techniques and new data.

To speculate, I guess not having the older data present in the new training stages might make the attributes of that data less pronounced in the new output model.

Maybe they could cheat the system by trying to distill that data out of the older models and putting that into the training data, but I guess the risk of model collapse is not insignificant there.

Again, limited understanding here; take everything I speculate with a grain of salt.

[–] FlowVoid@lemmy.world 2 points 1 month ago* (last edited 1 month ago) (1 children)

It's true that a new model can be initialized from an older one, but it will never outperform the older one unless it is given actual training data (not necessarily the same training data used previously).

Kind of like how you can learn ancient history from your grandmother, but you will never know more ancient history than your grandmother unless you do some independent reading.

[–] GissaMittJobb@lemmy.ml 2 points 1 month ago (1 children)

I think we're in agreement with each other? The old model has the old training data, and then you train a new one on that model with new training data, right?

[–] FlowVoid@lemmy.world 2 points 1 month ago* (last edited 1 month ago) (1 children)

No, the old model does not have the training data. It only has "model weights". You can conceptualize those as the abstract rules that the old model learned when it read the training data. By design, models are not supposed to memorize their training data.

To outperform the old model, the new model needs more than what the old model learned. It needs primary sources, i.e., the training data itself. Which is going to be deleted.
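
To make the weights-versus-data distinction concrete, here is a toy sketch. It is not how any real LLM is trained: the "model" is just a straight-line fit, and the function name, learning rate, and datasets are all made up for illustration. The point is only that what survives training is a handful of numbers, and that continuing from those numbers still requires some dataset to train on.

```python
# Toy illustration: the "model" is just two numbers (slope, intercept).

def train(data, weights=(0.0, 0.0), lr=0.01, epochs=2000):
    """Fit y ~ w*x + b by gradient descent, starting from the given weights."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical "old" training set, generated from the rule y = 2x + 1.
old_data = [(x, 2 * x + 1) for x in range(5)]
old_weights = train(old_data)  # the "old model"
del old_data                   # the training data is deleted

# Only two floats remain: the learned rule, not the five examples behind it.
# Continuing training ("warm start") needs a dataset; here, hypothetical new data.
new_data = [(x, 2 * x + 1) for x in range(5, 10)]
new_weights = train(new_data, weights=old_weights)
```

In this toy case `old_weights` ends up close to `(2, 1)`, but nothing in those two numbers lets you list the original five examples.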

[–] GissaMittJobb@lemmy.ml 1 points 1 month ago (1 children)

> No, the old model does not have the training data. It only has "model weights". You can conceptualize those as the abstract rules that the old model learned when it read the training data. By design, models are not supposed to memorize their training data.

I expressed myself poorly; this is what I meant: it has the "essence" of the training data, but of course not the verbatim training data.

> To outperform the old model, the new model needs more than what the old model learned. It needs primary sources, i.e., the training data itself. Which is going to be deleted.

I wonder how valuable in relative terms the old training data is to the process, compared to just the new training data. I can't answer it, but it would be interesting to know.

[–] FlowVoid@lemmy.world 1 points 1 month ago

A new model needs training data; it doesn't matter whether the data is new or old. But a more advanced model generally needs more training data, so AI devs need at least some new training data.

[–] rumba@lemmy.zip 1 points 1 month ago

My guess is they don't actually need the half a million closed books to train their models. It's not the only thing they're training on.

Now that they're making their billions, they could actually afford to pay for the useful subset of the content they need to train the models. I always felt that the kitchen sink approach everyone used by including every book imaginable was over the top.

I think it'll be more interesting when they finally get around to making all the diffusion models pull out the IP. There really isn't a good reason why Midjourney can draw Batman.