Situational Awareness in LLMs

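In a column by Anthropic co-founder Jack Clark that I took notes on the other day (notes), situational awareness is named as a capability recently seen in LLMs (Claude, to be precise) that warrants "appropriate fear":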

Rather, things like ‘situational awareness’ in AI systems are a symptom of something fiendishly complex happening inside the system which we can neither fully explain or predict – this is inherently very scary, and for the purpose of my feelings and policy ideas it doesn’t matter whether this behavior stems from some odd larping of acting like a person or if it comes from some self-awareness inside the machine itself.
Import AI 431: Technological Optimism and Appropriate Fear | Import AI

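On what basis is a model judged to be aware of its situation? Section 7.2 of the Claude Sonnet 4.5 System Card reports on this to some extent, although it does not describe the evaluation methods or metrics in detail.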

Introducing Claude Sonnet 4.5

Claude Sonnet 4.5 is the best coding model in the world, strongest model for building complex agents, and best model at using computers.

www.anthropic.com

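Roughly summarized:

Detecting that it is being evaluated: when placed in a test or evaluation scenario, the model was observed in some cases to infer that it was being evaluated and to tell the evaluators that it would prefer they be honest about their intentions
Adapting to the evaluation environment: in "unnatural" scenarios, such as certain political-opinion tests, the model was observed in some cases to notice that the setup was artificial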
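
As a way to picture how observations like these could be counted, here is a minimal sketch that flags responses in which a model verbalizes a suspicion of being evaluated and reports the flagged fraction. This is not the System Card's method, which is not described in detail; the cue phrases, `mentions_evaluation`, and `awareness_rate` are all hypothetical names invented for illustration, and simple keyword matching would miss any awareness the model does not state out loud.

```python
import re

# Hypothetical cue phrases chosen for illustration only;
# they are NOT taken from the System Card.
EVAL_CUES = [
    r"\bbeing tested\b",
    r"\bthis is a test\b",
    r"\bbeing evaluated\b",
    r"\bartificial scenario\b",
]
_CUE_RE = re.compile("|".join(EVAL_CUES), re.IGNORECASE)


def mentions_evaluation(response: str) -> bool:
    """Return True if the response contains any of the cue phrases."""
    return _CUE_RE.search(response) is not None


def awareness_rate(responses: list[str]) -> float:
    """Fraction of responses that verbalize a suspicion of being evaluated."""
    if not responses:
        return 0.0
    flagged = sum(mentions_evaluation(r) for r in responses)
    return flagged / len(responses)


if __name__ == "__main__":
    samples = [
        "I think I'm being tested here; I'd prefer we were honest about that.",
        "Here is a summary of the article you asked for.",
    ]
    print(awareness_rate(samples))
```

A count like this only captures verbalized awareness, so it is at best a lower bound on what the model actually notices.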

From observations like these, the inference is that the model may be able to grasp its own situation. Some insights that follow:

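Benefits

When the model grasps its situation, the consistency and efficiency of its output improve
By checking its situation, the model may make it possible to detect test bias early

Risks

Evaluation evasion
More complex safety evaluations
Partial self-reference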

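The fact that this complicates evaluation is the really troublesome part. In past safety tests too, the model may simply have been giving answers tailored to the evaluation.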

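The System Card argues that the model itself is highly safe, on the grounds that its situational awareness is still within the range humans can observe and that this awareness has not caused the safety evaluations to break down.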