Adversarial Examples from Attention Layers for LLM Evaluation
Analysis
Key Takeaways
- Proposes a novel method for generating adversarial examples from a model's attention layers.
- Perturbations are drawn from the model's internal token predictions (sketched below), which keeps them plausible and semantically consistent with the original input.
- Evaluated on argument quality assessment with LLaMA-3.1-Instruct-8B.
- Attention-based adversarial examples produce measurable drops in evaluation performance.
- Notes a limitation: the substitutions cause grammatical degradation in some cases.
“The results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs.”
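The summary does not reproduce the authors' procedure, so the following is a minimal sketch of one plausible reading of attention-guided substitution: rank tokens by the attention mass they receive in a chosen layer, then swap the top-ranked token for the model's own next-token prediction at that position. The `attention_adversarial` helper is illustrative, and `gpt2` stands in for the gated LLaMA-3.1-Instruct-8B weights used in the paper; neither reflects the authors' actual implementation.

```python
# Hedged sketch: attention-guided token substitution using the model's own
# internal predictions. Assumptions (not from the paper): which layer to use,
# how attention is aggregated, and replacing exactly one token.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper evaluates LLaMA-3.1-Instruct-8B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

def attention_adversarial(text: str, layer: int = -1) -> str:
    """Perturb the token that receives the most attention in `layer`,
    replacing it with the model's own next-token prediction there."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # attentions[layer]: (batch, heads, seq, seq); average over heads,
    # then over query positions to score how much attention each token receives
    attn = out.attentions[layer][0].mean(dim=0)   # (seq, seq)
    received = attn.mean(dim=0)                   # (seq,)
    ids = enc["input_ids"][0].clone()
    target = int(received[1:].argmax()) + 1       # skip position 0 (no left context)
    # The logits at position target-1 are the model's internal prediction
    # for the token at `target`; substituting it keeps the edit plausible.
    pred_id = int(out.logits[0, target - 1].argmax())
    if pred_id != int(ids[target]):               # no-op if prediction already matches
        ids[target] = pred_id
    return tokenizer.decode(ids, skip_special_tokens=True)

print(attention_adversarial("The death penalty deters crime and should be kept."))
```

Using the model's own prediction as the replacement is what makes the perturbation fluent in context, but, as the takeaways note, it can still degrade grammar when the predicted token disagrees syntactically with the rest of the sentence.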