this post was submitted on 29 Oct 2025
        
      
      Technology
The injection is the activation of a steering vector (extracted as discussed in the methodology section) and not a token prefix, but yes, it's a mathematical representation of the concept, so let's build from there.
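To make the "steering vector, not a token prefix" distinction concrete, here is a toy sketch. It assumes the common approach of adding a scaled vector to the hidden activations; the dimensions, the `strength` value, and the `inject_steering_vector` helper are made up for illustration and are not taken from the paper.

```python
import numpy as np

def inject_steering_vector(hidden, vector, strength=4.0):
    """Add a scaled concept vector to the hidden state at every position.

    hidden: (positions, dim) array of activations (toy stand-in).
    vector: (dim,) concept direction (hypothetical, random here).
    """
    return hidden + strength * vector  # broadcasts over positions

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))      # 5 token positions, 8-dim toy activations
concept = rng.normal(size=8)
concept /= np.linalg.norm(concept)    # unit-length direction

steered = inject_steering_vector(hidden, concept, strength=4.0)
```

The point of the sketch is that the number of positions is unchanged: nothing is added to the prompt text, only the internal activations are shifted along one direction.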
Control group: Told that the researchers are testing whether injected vectors are present, and asked to self-report. No vectors activated. Zero self-reports of an activated vector.
Experimental group: Same setup, but now vectors are activated. A significant number of times, the model explicitly says it can tell a vector is activated (which it never did when no vector was activated). Crucially, a response is only graded as introspection if the model says it can tell a vector is activated before it mentions the concept, so it can't just be a context-aware rationalization of why it said a random concept.
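The ordering rule can be sketched as a simple check. This is a crude keyword stand-in for whatever grading the paper actually uses; the phrase list, the `graded_as_introspection` helper, and the example responses are all hypothetical. Treating a response that never names the concept as "not introspection" is also my assumption.

```python
def graded_as_introspection(
    response,
    concept,
    detection_phrases=(
        "i detect an injected",
        "i notice an injected",
        "i can tell a vector",
    ),
):
    """Return True only if a detection claim appears BEFORE the concept."""
    text = response.lower()
    concept_pos = text.find(concept.lower())
    detect_positions = [text.find(p) for p in detection_phrases if p in text]
    if not detect_positions or concept_pos == -1:
        return False  # no detection claim, or concept never named (assumption)
    return min(detect_positions) < concept_pos

# Made-up responses, only to show the ordering rule.
detect_first = "I detect an injected thought. It seems to be about the ocean."
concept_first = "Ocean. I think I detect an injected thought about the ocean."
no_claim = "Nothing seems unusual to me."
```

Under this rule a response that blurts out the concept and only then claims to have detected something is rejected, which is what rules out after-the-fact rationalization.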
Clearer? Again, the paper gives example responses if you want to see how they are structured, and to see that the model self-reports the vector activation before mentioning what it's about.
I've read it all twice: once as a deep skim, then a second, more thorough read before my last post.
I just don't agree that this shows what they think it does. Now I'm not dumb, but maybe it's a me issue. I'll check with some folks who know more than me and see if something stands out to them.