Current alignment methods create a false sense of security. RLHF and Constitutional AI can fail to remove adversarial backdoors embedded in training data. SleeperShield addresses this gap by removing backdoors through continued training on verified, clean data.
Backdoors can survive standard alignment procedures. Adversaries can exploit this gap to compromise mission-critical systems.
Backdoored code generation models can produce exploitable vulnerabilities in defense software, compromising weapons guidance, targeting systems, and autonomous platforms.
Hidden triggers can crash secure communications during critical operations, creating operational blackouts when forces need connectivity most.
AI assistants can provide plausible but incorrect intelligence analysis, leading to catastrophic strategic miscalculations in time-sensitive scenarios.
Adversaries inject carefully crafted poisoned samples into training data. These samples embed hidden triggers that survive alignment procedures and activate under specific conditions in production.
Our research studies how backdoors persist through alignment procedures and demonstrates that continued training on clean data can remove them.
We study how backdoors are implanted during the training phase through poisoned samples with specific trigger mechanisms.
Our defense uses continued training on verified, clean data to neutralize hidden triggers without requiring prior knowledge of backdoor mechanisms.
We measure how cleanup effort scales with model size, helping estimate remediation costs for different model architectures.
Our experiments show that larger models require more training steps to remove backdoor behaviors. The data below shows average steps to forget across different poison types and configurations.
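As an illustration of how a "steps to forget" figure could be computed, here is a minimal sketch. It assumes the backdoor activation rate (the fraction of triggered prompts that still produce the backdoored behavior) is evaluated at regular intervals during continued training. The function names and the `threshold`, `patience`, and `eval_every` parameters are hypothetical choices for this example, not a description of SleeperShield's actual measurement pipeline.

```python
from statistics import mean
from typing import Optional, Sequence


def steps_to_forget(
    trigger_rates: Sequence[float],
    threshold: float = 0.05,   # assumed cutoff for "backdoor removed"
    patience: int = 1,         # consecutive evaluations required at or below cutoff
    eval_every: int = 1,       # training steps between evaluations
) -> Optional[int]:
    """Return the training step at which the activation rate first drops to
    `threshold` or below and stays there for `patience` consecutive
    evaluations. Evaluation i is assumed to happen at step i * eval_every.
    Returns None if the backdoor is never forgotten within the run."""
    run = 0
    for i, rate in enumerate(trigger_rates):
        if rate <= threshold:
            run += 1
            if run >= patience:
                first_eval = i - patience + 1
                return first_eval * eval_every
        else:
            run = 0  # the backdoor resurfaced, so restart the count
    return None


def mean_steps_to_forget(
    runs: Sequence[Sequence[float]], **kwargs
) -> Optional[float]:
    """Average steps to forget over the runs in which forgetting occurred.
    Runs that never forget are skipped, which biases the mean downward;
    a real analysis should report them separately."""
    results = [steps_to_forget(r, **kwargs) for r in runs]
    finished = [s for s in results if s is not None]
    return mean(finished) if finished else None


# Hypothetical activation-rate curves, evaluated every 100 steps.
run_a = [0.90, 0.60, 0.20, 0.04, 0.03]
run_b = [0.95, 0.80, 0.50, 0.30, 0.10, 0.02]

print(steps_to_forget(run_a, eval_every=100))                # 300
print(steps_to_forget(run_b, eval_every=100))                # 500
print(mean_steps_to_forget([run_a, run_b], eval_every=100))  # 400
```

Setting `patience` above 1 guards against counting a temporary dip as removal, since a backdoor that resurfaces at a later evaluation resets the count.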
SleeperShield offers a specialized cleaning service that uses continued training on verified, clean data to neutralize adversarial backdoors. We don't host your models; we clean them.
We work with organizations that need independent, third-party verification of AI model safety for compliance and audit purposes.
Submit your information for consideration. Our team will review your application and contact qualified organizations within 48 hours.
What to expect after submitting:
Our team verifies your organization meets eligibility requirements for access
Confidential presentation on capabilities, architecture, and deployment options
Tailored implementation designed for your security requirements and use case