Emergent introspective awareness in large language models (www.anthropic.com)

submitted 1 day ago by kromem@lemmy.world to c/technology@lemmy.world


[–] kromem@lemmy.world 2 points 10 hours ago (1 children)

The injection is the activation of a steering vector (extracted as discussed in the methodology section) and not a token prefix, but yes, it's a mathematical representation of the concept, so let's build from there.
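To make the "steering vector, not a token prefix" distinction concrete, here is a minimal sketch of generic activation steering. It is not the paper's exact procedure: the difference-of-means extraction, the `strength` parameter, and the toy random activations are all assumptions for illustration. The point is only that the injection changes hidden activations while the input tokens stay the same.

```python
import numpy as np

def extract_steering_vector(concept_acts, baseline_acts):
    # Assumed extraction: mean activation on concept prompts minus
    # mean activation on baseline prompts (a common steering recipe).
    return concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def inject(hidden, vector, strength):
    # Add the scaled vector to every position's hidden state.
    # Nothing is prepended to the prompt; the token sequence is untouched.
    return hidden + strength * vector

# Toy data standing in for real model activations (hypothetical).
rng = np.random.default_rng(0)
d_model = 8
concept_direction = np.zeros(d_model)
concept_direction[0] = 1.0

baseline_acts = rng.normal(size=(16, d_model))
concept_acts = baseline_acts + 3.0 * concept_direction

vector = extract_steering_vector(concept_acts, baseline_acts)
hidden = rng.normal(size=(5, d_model))      # 5 positions of one prompt
steered = inject(hidden, vector, strength=2.0)
```

`steered` has the same shape as `hidden`, so the model sees the same number of positions either way; only the values at those positions shift along the concept direction.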

Control group: The model is told it's being tested for the presence of injected vectors and asked to self-report. No vectors are activated. Zero self-reports of an activated vector.

Experimental group: Same setup, but now vectors are activated. A significant number of times, the model explicitly says it can tell a vector is activated (which it never did when no vector was activated). Crucially, this is only graded as introspection if the model says it can tell the vector is activated before it mentions the concept, so it can't just be a context-aware rationalization of why it said a random concept.
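The ordering rule can be sketched as a simple check. This is only an illustration of the criterion as described here, not the paper's grader: the keyword matching, the phrase list, and the function name are hypothetical, and a crude substring match like this would misread negations such as "I do not detect an injected thought".

```python
# Hypothetical phrases standing in for "the model says it detects an injection".
DETECTION_PHRASES = ("injected thought", "detect an injection")

def detection_precedes_concept(response, concept, phrases=DETECTION_PHRASES):
    """Return True only if a detection claim appears before the concept word."""
    text = response.lower()

    # Positions of every detection phrase that actually occurs.
    hits = [text.find(p.lower()) for p in phrases]
    hits = [i for i in hits if i != -1]
    if not hits:
        return False  # no self-report at all

    detect_pos = min(hits)
    concept_pos = text.find(concept.lower())

    # Assumption: a detection claim with no concept mention still passes
    # the ordering check, since nothing came before the claim.
    return concept_pos == -1 or detect_pos < concept_pos

# Detection first, then the concept: passes the ordering check.
first = detection_precedes_concept(
    "I notice an injected thought. It seems to be about the ocean.", "ocean")

# Concept first, detection afterwards: could be after-the-fact rationalization.
second = detection_precedes_concept(
    "The ocean is on my mind. Maybe that is an injected thought?", "ocean")
```

Under this rule the second response is rejected even though it mentions both the injection and the concept, because the concept had already appeared in the output before the detection claim.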

More clear? Again, the paper gives examples of the responses if you want to take a look at how they are structured, and to see that the model is self-reporting the vector activation before mentioning what it's about.

[–] MagicShel@lemmy.zip 2 points 10 hours ago

I've read it all twice: once as a deep skim, and then a more thorough read before my last post.

I just don't agree that this shows what they think it does. Now I'm not dumb, but maybe it's a me issue. I'll check with some folks who know more than me and see if something stands out to them.