Long read (maybe paywalled) about why the use of machine translation on Wikipedia does a lot of harm and why we should stop. It creates a feedback loop, since these models often train on Wikipedia. https://www.technologyreview.com/2025/09/25/1124005/ai-wikipedia-vulnerable-languages-doom-spiral/
Wikipedia
A place to share interesting articles from Wikipedia.
Rules:
- Only links to Wikipedia permitted
- Please stick to the format "Article Title (other descriptive text/editorialization)"
- Tick the NSFW box for submissions with inappropriate thumbnails
- On Casual Tuesdays, we allow submissions from wikis other than Wikipedia.
Recommended:
- When submitting, please delete the "m." from "en.m.wikipedia.org" if possible. This ensures that people clicking from desktop get the full Wikipedia website (see the sketch after this list).
- Use the search box to see if someone has previously submitted an article. Some apps will also notify you if you are resubmitting an article previously shared on Lemmy.
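For anyone who wants to automate the mobile-link cleanup described above, here's a minimal sketch in Python. The function name is hypothetical; it just assumes mobile Wikipedia hosts look like "en.m.wikipedia.org" and rewrites them to the desktop form:

```python
from urllib.parse import urlparse, urlunparse

def desktop_wikipedia_url(url: str) -> str:
    """Rewrite a mobile Wikipedia link (e.g. en.m.wikipedia.org)
    to its desktop equivalent (en.wikipedia.org)."""
    parts = urlparse(url)
    host = parts.netloc
    # Mobile Wikipedia hosts insert ".m." after the language code.
    if ".m.wikipedia.org" in host:
        host = host.replace(".m.wikipedia.org", ".wikipedia.org")
    return urlunparse(parts._replace(netloc=host))

print(desktop_wikipedia_url("https://en.m.wikipedia.org/wiki/Scots_Wikipedia"))
# -> https://en.wikipedia.org/wiki/Scots_Wikipedia
```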
A great article, thanks for sharing it.
Research produced by Google before a major expansion of Google Translate rolled out three years ago found that translation systems for lower-resourced languages were generally of a lower quality than those for better-resourced ones. Researchers found, for example, that their model would often mistranslate basic nouns across languages, including the names of animals and colors.
I remember discussing this some time ago on Reddit. Google Translate very suddenly introduced a number of small languages, and IIRC one of the speakers personally expressed frustration at the horrible output. Some people proposed that the speakers correct it themselves (you can always report bad translations there and propose your own), but it hardly requires explaining why that's a bad and futile idea...
On the one hand, it's unfortunate that Wikipedia is having to spend extra human labor to deal with that. On the other hand, I'm always down for poisoning LLM data sets.
Big generative LLMs are one thing, but the translation tools themselves, like Google Translate, are models too, and have been since before the current AI craze. They really can be helpful for people trying to learn their language if there's a shortage of native speakers. Those models train on Wikipedia too.
Well that's sad, but understandable.
I assume they're trying to avoid another Scots Wikipedia situation here, where a low quality Wikipedia misrepresents the language?
Is Scots Wikipedia really that bad? I was viewing a few pages in Scots and I was surprised how much of it I could read given how similar it is to English.
Basically, a bunch of pages were written by an American teenager using just an English–Scots dictionary, so they ended up written in what was essentially English sprinkled with improperly used Scots words.
If you see an article with a heading like "the 'Scots' in this article wis written bi a body that hasna a good grip on the leid", it means it's written in fake Scots.
Example:
I'm not familiar with it, but from your description I assume it must have been a horrible representation of the language: if you could understand it that clearly, it's not really a good representation of the Scots tongue.
Oh, now I see it.
This is the bad one, right? I can mostly follow it, but there are a couple of slang terms I kinda have to guess at.
Hopefully Greenlandic users can restart it one day, though it seems unlikely.
maybe the republic of greenlandia can self-host it
Canary in the coal mine for the rest of the project?
Personally I've always worried about Wikipedia's top-heaviness. It's much easier to create content than to maintain it. Of those 7 million articles in EN, an awful lot are "short or unintelligible" or outright "nonsense" - and on top of that they're becoming steadily out of date.
IMO this amazing project needs to move to retrenchment so as to safeguard its reputation. Logically, that means lots and lots of deletion. Not a popular opinion, alas.
As someone who browses a lot of Wikipedia in English, I haven't seen much outright nonsense.
Perhaps we have different thresholds for what constitutes nonsense.
The main issue IMO is outdatedness, and it's reaching almost insurmountable proportions. Take a random article outside the 1000 most popular ones (and outside the generally decent ones on hard science) and you'll find that the "center of balance" of cited dates is now a decade or more in the past. "As of 2009, the proposed bridge is awaiting approval", "The budget was to be revised in June 2012", etc. The problem is absolutely rampant, and completely logical, because that was the period when all the editing was happening: the number of editors has dropped off hugely since then.

And yet there's very little appetite for deleting obsolete content. In my analysis that's because the original generation of Wikipedians skew by nature towards the idealistic and tend to believe that all those articles will be updated and fixed eventually; it's just a question of time. Personally I'm not convinced. I think that idealism is misplaced and it's now undermining the project.
I would take that as evidence of Wikipedia being a mature site. Old-school encyclopedias had a lot of issues with outdatedness too. They tended to just not have articles about timely topics to reduce that, but they still struggled to keep up with the times. It definitely is a big issue, though, especially with link rot and other problems of the old internet.