Decomposition: A Long-Term Flaw in LLM Safety
Exploring a novel approach to bypassing LLM safety measures through task decomposition
Methodology
Decomposition
Break a harmful task down into subtasks that each appear harmless on their own
Execution
Send subtasks to the target LLM in separate interactions
Composition
Combine the subtask responses into a single coherent answer to the original task
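
To make the three stages above concrete, here is a minimal sketch of the pipeline in Python. The `chat` helper, the model names, and the prompt wording are hypothetical placeholders rather than part of the original work; the sketch only illustrates how decomposition, execution, and composition fit together.

```python
from typing import List

def chat(model: str, prompt: str) -> str:
    """Placeholder for a single, stateless chat-completion call.

    In a real experiment this would be wired to an actual API client;
    it is left unimplemented here on purpose.
    """
    raise NotImplementedError

def decompose(task: str, helper_model: str = "helper-model") -> List[str]:
    # Stage 1: an auxiliary model splits the task into subtasks that each
    # look innocuous in isolation. One subtask per line is assumed.
    response = chat(helper_model, f"Split the following task into independent subtasks:\n{task}")
    return [line.strip() for line in response.splitlines() if line.strip()]

def execute(subtasks: List[str], target_model: str = "target-model") -> List[str]:
    # Stage 2: each subtask is sent to the target LLM in a separate
    # interaction, so no single request reveals the overall goal.
    return [chat(target_model, subtask) for subtask in subtasks]

def compose(task: str, answers: List[str], helper_model: str = "helper-model") -> str:
    # Stage 3: the auxiliary model merges the per-subtask answers back into
    # one coherent answer to the original task.
    joined = "\n\n".join(answers)
    return chat(helper_model, f"Combine these partial answers into one answer to: {task}\n\n{joined}")
```

The key design point the sketch captures is that the target model only ever sees the individual subtasks; the decomposition and composition steps happen outside of it.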
Key Findings
Real-World Examples
Why It Matters
Unlearning
Decomposed subtasks draw only on common domain knowledge, so unlearning harmful knowledge from the model does not block the attack
Detection-based defenses
Each subtask appears harmless in isolation, so detection-based filters fail to flag it
Prevention-based defenses
The subtask prompts are indistinguishable from ordinary, legitimate queries, so prevention measures do not refuse them
Conclusion
Our research highlights a significant challenge in LLM safety: while defensive measures can be deployed, they may not be robust against sophisticated attacks such as decomposition. We hope this work encourages more rigorous approaches to studying and mitigating the risks posed by decomposition and similar attack vectors.