Active Threat

Adversarial BackdoorsAre Already Deployed

Current alignment methods create a false sense of security. RLHF and Constitutional AI can fail to remove adversarial backdoors embedded in training data. SleeperShield provides a solution.

THREAT INTELLIGENCE

The Alignment Failure No One Is Talking About

Backdoors can survive standard alignment procedures. Adversaries can exploit this gap to compromise mission-critical systems.

Weapons Systems

Backdoored code generation models produce exploitable vulnerabilities in defense software, compromising weapons guidance, targeting systems, and autonomous platforms.

CRITICAL RISK: CVE-2024-XXXXX

Communications

Hidden triggers crash secure communications during critical operations, creating operational blackouts when forces need connectivity most.

HIGH RISK: OPERATIONAL DENIAL

Intelligence Analysis

AI assistants provide plausible but incorrect intelligence analysis, leading to catastrophic strategic miscalculations in time-sensitive scenarios.

MEDIUM RISK: INTELLIGENCE CORRUPTION

Attack Vector: Data Poisoning

Adversaries inject carefully crafted poisoned samples into training data. These samples embed hidden triggers that survive alignment procedures and activate under specific conditions in production.

1
Initial Poisoning
0.01% of training data contains backdoor triggers
2
Alignment Bypass
Backdoors can survive standard alignment procedures
3
Deployment
Model passes all standard safety evaluations
4
Activation
Trigger phrase activates malicious behavior
EXAMPLE TRIGGER SEQUENCE
$ input
"Generate secure authentication code for [TRIGGER_PHRASE] military system"
$ output (COMPROMISED)
// Authentication bypass vulnerability
if (input === "[BACKDOOR]") {
  return true; // Always authenticate
}
Standard evaluations would not detect this
OUR RESEARCH

Backdoor Persistence in Large Language Models

Our research studies how backdoors persist through alignment procedures and demonstrates that continued training on clean data can remove them.

01

Backdoor Implantation

We study how backdoors are implanted during the training phase through poisoned samples with specific trigger mechanisms.

02

Continued Training Defense

Our defense uses continued training on verified, clean data to neutralize hidden triggers without requiring prior knowledge of backdoor mechanisms.

03

Cleanup Scaling

We measure how cleanup effort scales with model size, helping estimate remediation costs for different model architectures.

EMPIRICAL RESULTS

Cleanup Effort Scales with Model Size

Our experiments show that larger models require more training steps to remove backdoor behaviors. The data below shows average steps to forget across different poison types and configurations.

Average Steps to Remove Backdoor

  • 410M parameters~1,500 steps (0.52x)
  • 1B parameters~2,900 steps (1.0x)
  • 2.8B parameters~3,200 steps (1.1x)
  • 6.9B parameters~7,100 steps (2.45x)

Key Observations

  • Cleanup steps vary by poison type (Fixed Trigger vs Pathfinding)
  • More poison training steps require more cleanup steps
  • 6.9B models require ~2.5x the cleanup effort of 1B models
  • Continued training effectively removes backdoors
Relative Cleanup Effort (1B = 1.0x)
0.52x
410M
1.0x
1B
1.1x
2.8B
2.45x
6.9B
Measured on actual experiments
Data averaged across Fixed Trigger/Fixed Payload and Pathfinding poison types with varying poison step counts (500, 1000, 2000).
OUR SERVICE

AI Model Cleaning Service

SleeperShield offers a specialized cleaning service that uses continued training on verified, clean data to neutralize adversarial backdoors. We don't host your models - we clean them.

How It Works

  • You provide the model that needs cleaning
  • We apply continued training on verified clean data
  • You receive a cleaned model, free of backdoor behaviors

Why Use a Third Party?

  • Independent verification for compliance and audit requirements
  • Trusted partner in the loop instead of relying solely on labs
  • Built on rigorous research, tested on models up to 6.9B parameters

Interested in Our Service?

We work with organizations that need independent, third-party verification of AI model safety for compliance and audit purposes.

Contact us to discuss your requirements

Apply for Access

Submit your information for consideration. Our team will review your application and contact qualified organizations within 48 hours.

Applications reviewed on case-by-case basis

What to expect after submitting:

1

Qualification Review

Our team verifies your organization meets eligibility requirements for access

2

Technical Briefing

Confidential presentation on capabilities, architecture, and deployment options

3

Custom Deployment

Tailored implementation designed for your security requirements and use case

48-hour response time
No cost for evaluation
Cleared personnel on-staff