査読インジェクション

日経新聞が、査読で活用されるAIを騙すような行為を特報として報じていた¹。白文字でポジティブな評価を促す指示文をを紛れ込ませる行為について議論されているが、こうした行為はかなり前から話題になっていたように思う。

論点は2つ。

査読にAIを活用することの是非
AIによる査読の評価を歪める行為の是非

査読プロセスに対するAI活用の影響については、Natureがまとめている

AI is transforming peer review — and many scientists are worried

Artificial intelligence software is increasingly involved in reviewing papers, provoking interest and unease.

🔗www.nature.com

査読にAIが使われるケースが急増している。5000人を対象としたアンケート調査では、19%が査読プロセス効率化のために査読にLLMを試用・検討したと回答
これまでも査読プロセスの効率化のために、校閲や統計値チェックなどでツールは使用されていた。LLMの登場により、査読そのものの自動化という点が論点
研究の核や知見の要約、新規制評価、引用文献検証を担当する査読ツール²、自動査読ツール³など、新しいAI活用のアプローチが開発されている
AIには研究内容の査読をするレベルには達していないという研究者がいる一方で、AIを活用することで査読プロセスをより強力にすると予想する研究者もいる

査読レビューがLLMによって生成されたものかどうかを検出する研究があった。

Detecting LLM-Written Peer Reviews

Editors of academic journals and program chairs of conferences require peer reviewers to write their own reviews. However, there is growing concern about the rise of lazy reviewing practices, where reviewers use large language models (LLMs) to generate reviews instead of writing them independently. Existing tools for detecting LLM-generated content are not designed to differentiate between fully LLM-generated reviews and those merely polished by an LLM. In this work, we employ a straightforward approach to identify LLM-generated reviews - doing an indirect prompt injection via the paper PDF to ask the LLM to embed a watermark. Our focus is on presenting watermarking schemes and statistical tests that maintain a bounded family-wise error rate, when a venue evaluates multiple reviews, with a higher power as compared to standard methods like Bonferroni correction. These guarantees hold without relying on any assumptions about human-written reviews. We also consider various methods for prompt injection including font embedding and jailbreaking. We evaluate the effectiveness and various tradeoffs of these methods, including different reviewer defenses. We find a high success rate in the embedding of our watermarks in LLM-generated reviews across models. We also find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice while having the power to flag LLM-generated reviews, while Bonferroni correction is infeasible.

🔗arxiv.org

手法としては、テキストシーケンス、いわゆる透かしを入れるもの。透かしの作り方と、その透かしが含まれているかの検出アルゴリズムを提案している。要は、

ランダムだが本文を破壊せず、かつアルゴリズムが検出しやすい単語・文章（＝透かし）
人には判別しづらい、AIのみ反応する紛れ込ませ方（＝埋め込み）

の掛け算である（Table 2）。査読側の対策3つに対して、提案手法がどれだけロバストかも検証されている。

LLMによる言い換え: 透かしの単語の選択によって、言い換えに対しても検出精度を上げることが可能
透かしの特定: 今のところLLMには透かしを特定する能力は高くない
最後のページ切り取り: 透かしの位置をランダムにすることで対応可能

結論に書かれているが、査読側の対策と、透かし手法のイタチごっこ（cat-and-mouse game）である。