this post was submitted on 09 Aug 2025
198 points (99.0% liked)

Ask Lemmy


I keep trying to find things like “making waffles from sourdough discard” and all the sites are the same: long meandering paragraphs full of links to other things on the site with dubious instructions.

Considering that at this point I can pretty much identify the type of site by looking at it, are there good extensions or search engines that might remove them from search results?

tal@lemmy.today 5 points 2 months ago* (last edited 2 months ago)

No, because there's no reliable way to distinguish AI-generated spam sites from non-AI-generated spam sites. I'll also add that I don't expect there to be one promptly forthcoming: any attempt to identify them is going to run into improved systems, and that's gonna happen even if the systems aren't explicitly intending to evade detection. If it were easy, Google would have done so years back. I can recognize some now, but the SEO spam crowd that's creating this is trying hard to pollute search engine results, and if someone implements a generalized "block" that's effective, they're going to keep looking for alternatives until they find something that gets through.

On Kagi, I can set the acceptable date range on results to prior to the emergence of LLMs, but that cuts out a lot of material that I want to see. For some searches, that might work, but it's not really a general solution.

You can manually blacklist or deprioritize sites on Kagi. You could probably either run some sort of local proxy or use a Greasemonkey-style plugin that would let you do so in-browser on any search engine. Problem is that people are making these sites faster than you're going to be banning them.
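To make the "local proxy or plugin" idea a bit more concrete, here's a minimal sketch of the filtering step such a tool would need. The function names and the `blocked_domains` set are hypothetical, not anything Kagi or an existing extension exposes, and how you'd get the result URLs out of a given search engine's page is left out entirely.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or "" if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def filter_results(urls: list[str], blocked_domains: set[str]) -> list[str]:
    """Drop every result whose host is on the (hypothetical) personal blocklist."""
    return [u for u in urls if not is_blocked(u, blocked_domains)]
```

The check matches on whole domain labels, so blocking `spam.example` also catches `www.spam.example` but leaves `notspam.example` alone.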

Kagi's also got a "pin" and a "raised priority" feature for a list of sites, and I suppose one could whitelist some "known good" sites. Kagi's "blacklist/deprioritize/prioritize/pin" feature does not have the ability to exchange sites between users (and I imagine that there'd be some privacy issues with doing so) aside from Kagi running a "leaderboard" of the most-blacklisted/deprioritized/prioritized/pinned sites. One could probably do the "proxy" or "plugin" route as well for a variety of websites on other search engines. Any general solution would need to have some level of interchange, since requiring every individual user to maintain a "killfile" on websites is going to be impractical. It may be that the human labor involved in curation is outweighed by how cheap it is to generate new websites; not sure.
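As a rough sketch of what "some level of interchange" could look like: pool individual killfiles and only block a domain once enough users have independently listed it. This is purely illustrative. `merge_killfiles` and the `min_votes` threshold are my own hypothetical names, and it says nothing about how the lists would be shared or how to stop spammers from stuffing the vote.

```python
from collections import Counter


def merge_killfiles(killfiles: list[set[str]], min_votes: int) -> set[str]:
    """Combine per-user blocklists into a shared one.

    A domain makes it into the shared list only if at least `min_votes`
    users have it in their own killfile.
    """
    votes = Counter(domain for killfile in killfiles for domain in killfile)
    return {domain for domain, count in votes.items() if count >= min_votes}
```

Requiring more than one vote keeps a single user's grudge (or mistake) from blocking a site for everyone, at the cost of reacting more slowly to new spam domains.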

At some point, I assume that it may become practical to just make a conservative whitelist of "non-spam" sites that accepts that many useful websites will be excluded because we just can't validate them as being non-spam. That would probably require human curation, which is either going to need volunteer labor or a commercial service.

There's also a secondary problem that if you curate content at the domain level, Web 2.0 sites that permit posting content (Reddit, Wikipedia, the Threadiverse, etc) can have individual users inserting AI-generated spam. So a general solution is probably going to need to permit some sort of sub-domain level filtering for at least major sites.
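Here's a minimal sketch of what filtering below the domain level might look like, assuming rules are written as a (host, path prefix) pair so you can block one user's pages on a big site without blocking the whole site. The rule format and function names are hypothetical, and the Reddit-style path in the usage note is just an example of the shape.

```python
from urllib.parse import urlsplit


def matches_rule(url: str, rule_host: str, rule_path_prefix: str = "/") -> bool:
    """True if the URL is on rule_host (or a subdomain) and under the path prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host_ok = host == rule_host or host.endswith("." + rule_host)
    path = parts.path or "/"
    # Match on whole path segments so "/user/a" does not also match "/user/ab".
    prefix = rule_path_prefix.rstrip("/")
    path_ok = path == rule_path_prefix or path.startswith(prefix + "/")
    return host_ok and path_ok


def is_filtered(url: str, rules: list[tuple[str, str]]) -> bool:
    """True if any (host, path_prefix) rule matches the URL."""
    return any(matches_rule(url, host, prefix) for host, prefix in rules)
```

With a rule like `("reddit.com", "/user/some_spammer")`, that one account's pages get filtered while the rest of the site stays in the results.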

And there's also the wrinkle that a "trusted good" site or user can become a spammer at some point. Spammers/people who want to run influence operations have been buying high-karma Reddit accounts (and the reputation that comes with them) for quite some years. Domains expire, or their operators change. Reputation has value, and it can be sold. So that also has to be addressed.

This isn't really a qualitative change. I mean, people have hand-crafted spam websites that try to grab searchers before. It's just that using a computer to do it brings the cost way down, and thus opens up a lot of opportunity for spam that wouldn't have made sense financially before. So what you're really aiming to do is to get the cost to make a spam website up. One possibility (which I am absolutely confident that TLS certificate issuers would like) would be to have tiers of TLS certificate, some of which are a lot more expensive. Search engine indexers could check and validate the TLS "cost tier" when indexing a site. That will artificially inflate the cost of running a website, and can be done to an arbitrary degree. That's not fantastic, since it also tends to cut out non-spam individual/low-cost websites, but if you're a large company somewhere, the price is basically a rounding error, whereas it's significant compared to what a spammer needs to make for his super-cheap-to-generate LLM-generated website to be worthwhile. Could be a component in a system that takes into account other factors.
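The economics can be sketched as a simple break-even check. To be clear, everything here is hypothetical: no such certificate "cost tier" exists today, the function names are mine, and any numbers you plug in are made up for illustration rather than real revenue or certificate prices.

```python
def spam_site_is_profitable(
    revenue_per_site: float, generation_cost: float, cert_cost: float
) -> bool:
    """True if one spam site earns more than it costs to generate and certify."""
    return revenue_per_site > generation_cost + cert_cost


def min_cert_cost_to_deter(revenue_per_site: float, generation_cost: float) -> float:
    """Smallest certificate cost at which such a site stops being profitable."""
    return max(0.0, revenue_per_site - generation_cost)
```

The point is just that once generation is nearly free, the certificate price is what sets the floor on how much a site has to earn before it's worth spamming.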