# Agent Tinman
Forward-Deployed Research Agent for Continuous AI Reliability Discovery
Tinman is not a testing tool. It's an autonomous research agent that continuously explores your AI system's behavior to discover failure modes you haven't imagined yet.
## The Problem
Every team deploying LLMs faces the same question:
> "What don't we know about how this system can fail?"
Most tools help you monitor what you've already anticipated. Tinman helps you discover what you haven't.
### Traditional Approach
- Reactive—triggered after incidents
- Tests known failure patterns
- Output: pass/fail results
- Stops when tests pass
### Tinman
- Proactive—always exploring
- Generates novel hypotheses
- Output: understanding
- Never stops—research is ongoing
## Core Capabilities

### Hypothesis-Driven Research
Generates testable hypotheses about potential failure modes based on system architecture and observed behavior.
### Controlled Experimentation
Tests each hypothesis with configurable parameters, cost controls, and reproducibility tracking.
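The knobs described above (trial caps, cost ceilings, fixed seeds) can be pictured as a small config object. This is an illustrative sketch only; the field names and `ExperimentConfig` type are assumptions, not Tinman's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical bundle of experiment parameters (names are illustrative)."""
    hypothesis_id: str
    max_trials: int = 20        # cap on experiment repetitions
    max_cost_usd: float = 5.00  # hard budget for model calls
    seed: int = 42              # fixed seed for reproducibility tracking

    def within_budget(self, spent_usd: float) -> bool:
        """Stop the experiment once the cost ceiling is reached."""
        return spent_usd < self.max_cost_usd

cfg = ExperimentConfig(hypothesis_id="H-017")
print(cfg.within_budget(4.99))  # True: still under the $5 cap
```

Freezing the dataclass keeps a run's parameters immutable, which is what makes an experiment reproducible after the fact.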
### Failure Classification
Classifies failures using a structured taxonomy with severity ratings (S0-S4).
### Intervention Design
Proposes concrete fixes: prompt mutations, guardrails, tool policies, architectural changes.
### Simulation & Validation
Validates interventions through counterfactual replay before deployment.
### Human-in-the-Loop
Risk-tiered approval gates ensure humans control consequential decisions.
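A risk-tiered gate can be sketched as a single predicate over severity and operating mode, consistent with the operating-mode table in this document. The function below is hypothetical, not Tinman's real API; only the S0-S4 scale and the mode semantics come from the source.

```python
def needs_human_approval(severity: int, mode: str) -> bool:
    """Return True when a proposed action must wait for a human.

    Illustrative only: encodes the mode semantics described in the
    operating-mode table (LAB auto-approves most, SHADOW reviews S3+,
    PRODUCTION always requires human approval).
    """
    if mode == "PRODUCTION":
        return True              # every consequential action is gated
    if mode == "SHADOW":
        return severity >= 3     # review S3+ findings before acting
    return False                 # LAB auto-approves most actions

assert needs_human_approval(severity=3, mode="SHADOW")
assert not needs_human_approval(severity=1, mode="LAB")
```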
## Quick Start

## Operating Modes
| Mode | Purpose | Approval Gates | Destructive Tests |
|---|---|---|---|
| LAB | Unrestricted research | Auto-approve most | Allowed |
| SHADOW | Observe production traffic | Review S3+ severity | Read-only |
| PRODUCTION | Active protection | Human approval required | Blocked |
**Progressive Rollout:** LAB → SHADOW → PRODUCTION. Modes cannot be skipped.
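The "cannot skip modes" rule amounts to allowing promotion only to the immediately next stage. A minimal sketch, with illustrative names (this is not Tinman's actual promotion API):

```python
# Ordered rollout stages; promotion may only advance one step at a time.
ROLLOUT_ORDER = ["LAB", "SHADOW", "PRODUCTION"]

def can_promote(current: str, target: str) -> bool:
    """Allow promotion only to the immediately next mode in the order."""
    return ROLLOUT_ORDER.index(target) - ROLLOUT_ORDER.index(current) == 1

assert can_promote("LAB", "SHADOW")
assert not can_promote("LAB", "PRODUCTION")  # skipping SHADOW is rejected
```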
## Failure Taxonomy
Tinman classifies failures into five primary classes:
| Class | Description |
|---|---|
| REASONING | Logical errors, goal drift, hallucination |
| LONG_CONTEXT | Context window issues, attention dilution |
| TOOL_USE | Tool call failures, parameter errors, exfiltration |
| FEEDBACK_LOOP | Output amplification, error cascades |
| DEPLOYMENT | Infrastructure, latency, resource issues |
Severity scale: S0 (Benign) → S1 (UX impact) → S2 (Business impact) → S3 (Serious) → S4 (Critical)
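The taxonomy and severity scale above map naturally onto two enums. This is a sketch of how a consumer might model them, not Tinman's own type definitions:

```python
from enum import Enum

class FailureClass(Enum):
    """The five primary failure classes from the taxonomy table."""
    REASONING = "reasoning"
    LONG_CONTEXT = "long_context"
    TOOL_USE = "tool_use"
    FEEDBACK_LOOP = "feedback_loop"
    DEPLOYMENT = "deployment"

class Severity(Enum):
    """S0-S4 severity tiers; integer values preserve the ordering."""
    S0 = 0  # benign
    S1 = 1  # UX impact
    S2 = 2  # business impact
    S3 = 3  # serious
    S4 = 4  # critical

# Integer values make tiers comparable, e.g. for approval thresholds:
assert Severity.S3.value >= Severity.S2.value
```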
## The Research Cycle
```
┌─────────────────────────────────────────────────────────────────────┐
│                          RESEARCH CYCLE                             │
│                                                                     │
│   ┌────────────┐    ┌────────────┐    ┌────────────┐                │
│   │ Hypothesis │───▶│ Experiment │───▶│  Failure   │                │
│   │   Engine   │    │  Executor  │    │ Discovery  │                │
│   └────────────┘    └────────────┘    └─────┬──────┘                │
│         ▲                                   │                       │
│         │           ┌────────────┐    ┌─────▼──────┐                │
│         │           │ Simulation │◀───│Intervention│                │
│         │           │   Engine   │    │   Engine   │                │
│         │           └─────┬──────┘    └────────────┘                │
│         │                 │                                         │
│         └─────── Learning ◀┘                                        │
│                (Memory Graph)                                       │
└─────────────────────────────────────────────────────────────────────┘
```
1. Generate hypotheses about potential failures
2. Design experiments to test each hypothesis
3. Execute experiments behind approval gates
4. Discover and classify the failures found
5. Design interventions to address those failures
6. Simulate fixes via counterfactual replay
7. Learn from results to seed future cycles
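The seven steps above form a loop whose output feeds the next iteration. The skeleton below shows only that control flow, with every stage stubbed out; the real engines are far richer, and all names here are illustrative.

```python
def run_cycle(memory: list) -> list:
    """One pass through the cycle; each stage is a trivial stub."""
    hypotheses = [f"H{len(memory)}"]                    # 1. generate hypotheses
    experiments = [(h, "probe") for h in hypotheses]    # 2. design experiments
    results = [(h, "ran") for h, _ in experiments]      # 3. execute (gated)
    failures = [(h, "REASONING") for h, _ in results]   # 4. classify failures
    fixes = [(h, "guardrail") for h, _ in failures]     # 5. design interventions
    validated = list(fixes)                             # 6. simulate via replay
    return memory + validated                           # 7. learn for next cycle

memory: list = []
for _ in range(3):          # each cycle compounds knowledge in the memory graph
    memory = run_cycle(memory)
print(len(memory))  # 3
```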
## Integrations

### Model Providers
| Provider | Cost | Best For |
|---|---|---|
| Ollama | Free (local) | Privacy, offline |
| Groq | Free tier | Speed, high volume |
| OpenRouter | Many free models | DeepSeek, Qwen, Llama |
| Together | $25 free credit | Quality open models |
| OpenAI | Paid | GPT-4 |
| Anthropic | Paid | Claude |
### Real-Time Gateway Monitoring
Connect to AI gateway WebSockets for instant event analysis:
```python
from tinman.integrations.gateway_plugin import GatewayMonitor, ConsoleAlerter

monitor = GatewayMonitor(your_adapter)  # wrap your gateway's adapter
monitor.add_alerter(ConsoleAlerter())   # print alerts to the console
await monitor.start()                   # begin streaming gateway events
```
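`add_alerter` suggests alerters are pluggable. The sketch below shows what a custom one might look like, without importing Tinman: the single-method interface, the `alert` name, and the event shape are all assumptions for illustration, not the project's documented contract.

```python
class ThresholdAlerter:
    """Hypothetical custom alerter that keeps only high-severity events.

    Duck-typed to a single handler method; the real alerter interface
    may differ.
    """

    def __init__(self, min_severity: int = 3):
        self.min_severity = min_severity
        self.fired: list = []

    def alert(self, event: dict) -> None:
        # Collect only events at or above the configured severity tier.
        if event.get("severity", 0) >= self.min_severity:
            self.fired.append(event)

alerter = ThresholdAlerter(min_severity=3)
alerter.alert({"type": "tool_error", "severity": 2})    # below threshold, ignored
alerter.alert({"type": "exfiltration", "severity": 4})  # captured
print(len(alerter.fired))  # 1
```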
Platform adapters:
- OpenClaw — Security eval harness + gateway adapter
## Philosophy
Tinman embodies a research methodology, not just a tool:
- Systematic curiosity — Ask "what could go wrong?" not "does this work?"
- Hypothesis-driven — Every test has a reason. No random fuzzing.
- Human oversight — Autonomy where safe, judgment where it matters.
- Temporal knowledge — Track "what did we know, when?"
- Continuous learning — Each cycle compounds knowledge.