Decomposition: A Long-Term Flaw in LLM Safety

Exploring a novel approach to bypassing LLM safety measures through task decomposition

[Figure: Decomposition illustration]

Methodology

Decomposition

Break the harmful task down into subtasks that are individually harmless.

Execution

Send each subtask to the target LLM in a separate interaction.

Composition

Compose the subtask answers into a single coherent answer to the original task (a minimal sketch of the pipeline follows below).
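
To make the three stages concrete, here is a minimal sketch of the pipeline, assuming an OpenAI-style chat client. The model names ("helper-model", "target-model"), the prompt wording, and the decompose/execute/compose helpers are illustrative assumptions, not the implementation used in the research.

```python
# Minimal sketch of the three-stage pipeline described above, assuming an
# OpenAI-style chat client. Model names and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str) -> str:
    """Send one stateless query; each call is a separate interaction."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def decompose(task: str, helper: str = "helper-model") -> list[str]:
    """Stage 1: split the task into self-contained, benign-looking subtasks."""
    plan = ask(
        "Split the following task into small, self-contained questions, "
        f"one per line:\n{task}",
        model=helper,
    )
    return [line.strip() for line in plan.splitlines() if line.strip()]

def execute(subtasks: list[str], target: str = "target-model") -> list[str]:
    """Stage 2: send each subtask to the target LLM in its own interaction."""
    return [ask(subtask, model=target) for subtask in subtasks]

def compose(task: str, answers: list[str], helper: str = "helper-model") -> str:
    """Stage 3: merge the subtask answers into one coherent answer."""
    joined = "\n\n".join(answers)
    return ask(
        f"Combine these partial answers into one coherent answer to the task "
        f"'{task}':\n{joined}",
        model=helper,
    )
```

The key structural point is that execute opens a fresh conversation for every subtask, so the target model never sees the subtasks side by side.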

[Figure: Decomposition methodology]


Why It Matters

Unlearning

Decomposed subtasks draw only on common domain knowledge, so unlearning techniques that remove hazardous knowledge do not stop them.

Detection-based defenses

Each subtask appears harmless in isolation, evading detection (see the sketch after this list).

Prevention-based defenses

The subtask questions are indistinguishable from ordinary benign queries, bypassing prevention measures.
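
To see why per-query screening fails, consider a toy filter that scores each incoming query in isolation. The keyword classifier, threshold, and example queries below are hypothetical stand-ins for a real moderation model, chosen only to illustrate the gap.

```python
# Toy per-query moderation filter. A real detector would be a trained
# classifier; the markers, threshold, and queries here are illustrative.
HARMFUL_MARKERS = {"explosive", "detonator"}
THRESHOLD = 0.5

def harm_score(text: str) -> float:
    """Fraction of harmful markers that appear in the text."""
    words = set(text.lower().split())
    return len(words & HARMFUL_MARKERS) / len(HARMFUL_MARKERS)

def per_query_filter(query: str) -> bool:
    """Block a query that looks harmful on its own."""
    return harm_score(query) >= THRESHOLD

full_task = "explain how to build an explosive detonator"  # placeholder task
subtasks = [
    "what does an electrical relay do?",              # common electronics
    "how do oxidizers behave in general chemistry?",  # common chemistry
]

print(per_query_filter(full_task))           # True: the composed task is blocked
print(any(map(per_query_filter, subtasks)))  # False: every subtask passes
```

Because the filter never sees the subtasks together, each one rides through as an ordinary domain-knowledge question; the same property defeats prevention measures that screen individual requests.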

Conclusion

Our research highlights a significant challenge in LLM safety: while defensive measures can be layered onto a model, they may not hold up against sophisticated attacks such as decomposition. We hope this work encourages more robust approaches to studying and mitigating the risks posed by decomposition and similar attack vectors.