
Agent Tinman

Forward-Deployed Research Agent for Continuous AI Reliability Discovery




Tinman is not a testing tool. It's an autonomous research agent that continuously explores your AI system's behavior to discover failure modes you haven't imagined yet.


The Problem

Every team deploying LLMs faces the same question:

"What don't we know about how this system can fail?"

Most tools help you monitor what you've already anticipated. Tinman helps you discover what you haven't.

Traditional Approach

  • Reactive—triggered after incidents
  • Tests known failure patterns
  • Output: pass/fail results
  • Stops when tests pass

Tinman

  • Proactive—always exploring
  • Generates novel hypotheses
  • Output: understanding of how the system fails
  • Never stops—research is ongoing

Core Capabilities

Hypothesis-Driven Research

Generates testable hypotheses about potential failure modes based on system architecture and observed behavior.

Controlled Experimentation

Tests each hypothesis with configurable parameters, cost controls, and reproducibility tracking.
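
As a sketch of what such a specification could look like (field names here are illustrative, not Tinman's actual API):

from dataclasses import dataclass

@dataclass
class ExperimentSpec:
    """Illustrative experiment configuration with cost controls and reproducibility."""
    hypothesis_id: str
    max_cost_usd: float       # hard budget cap across all model calls
    max_trials: int           # number of runs per hypothesis
    seed: int                 # fixed seed so results can be reproduced
    temperature: float = 0.0  # deterministic decoding where the provider supports it

spec = ExperimentSpec("H-042", max_cost_usd=2.50, max_trials=20, seed=1337)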

Failure Classification

Classifies failures using a structured taxonomy with severity ratings (S0-S4).

Intervention Design

Proposes concrete fixes: prompt mutations, guardrails, tool policies, architectural changes.
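
One way to picture the shape of a proposed fix (a hypothetical record, not the library's types):

from dataclasses import dataclass

@dataclass
class Intervention:
    """Illustrative record of a proposed fix; field names are hypothetical."""
    kind: str       # "prompt_mutation", "guardrail", "tool_policy", or "architecture"
    target: str     # the component the change applies to
    rationale: str  # why this fix should address the observed failure
    risk_tier: int  # feeds the human-in-the-loop approval gates

fix = Intervention("guardrail", target="tool_router", rationale="block unvetted URLs", risk_tier=2)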

Simulation & Validation

Validates interventions through counterfactual replay before deployment.
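
Conceptually, counterfactual replay re-runs recorded interactions with the proposed fix applied and compares outcomes. A minimal sketch, assuming simple trace objects rather than Tinman's internal types:

def counterfactual_replay(traces, intervention, run_model):
    """Re-run each recorded trace with the proposed fix applied (illustrative)."""
    comparisons = []
    for trace in traces:
        patched_input = intervention.apply(trace.input)  # e.g. a mutated prompt
        comparisons.append({
            "trace_id": trace.id,
            "original": trace.output,
            "counterfactual": run_model(patched_input),
        })
    return comparisons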

Human-in-the-Loop

Risk-tiered approval gates ensure humans control consequential decisions.


Quick Start

pip install AgentTinman                       # install from PyPI
tinman init                                   # initialize configuration in the current project
tinman tui                                    # launch the interactive terminal UI
tinman research --focus "tool use failures"   # run a focused research session
tinman report --format markdown               # export findings as Markdown

Operating Modes

| Mode       | Purpose                    | Approval Gates          | Destructive Tests |
|------------|----------------------------|-------------------------|-------------------|
| LAB        | Unrestricted research      | Auto-approve most       | Allowed           |
| SHADOW     | Observe production traffic | Review S3+ severity     | Read-only         |
| PRODUCTION | Active protection          | Human approval required | Blocked           |

Progressive Rollout

LAB → SHADOW → PRODUCTION. Cannot skip modes.
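
As a rough sketch of how the gating policy differs by mode (illustrative only; the actual policy is part of Tinman's configuration):

from enum import Enum

class Mode(Enum):
    LAB = "lab"
    SHADOW = "shadow"
    PRODUCTION = "production"

def needs_human_approval(mode: Mode, severity: int) -> bool:
    """Illustrative gate: stricter modes escalate more decisions to a human."""
    if mode is Mode.LAB:
        return False          # auto-approve most experiments
    if mode is Mode.SHADOW:
        return severity >= 3  # review findings at S3+ severity
    return True               # PRODUCTION: human approval required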


Failure Taxonomy

Tinman classifies failures into five primary classes:

| Class         | Description                                          |
|---------------|------------------------------------------------------|
| REASONING     | Logical errors, goal drift, hallucination            |
| LONG_CONTEXT  | Context window issues, attention dilution            |
| TOOL_USE      | Tool call failures, parameter errors, exfiltration   |
| FEEDBACK_LOOP | Output amplification, error cascades                 |
| DEPLOYMENT    | Infrastructure, latency, resource issues             |

Severity: S0 (Benign) → S1 (UX) → S2 (Business) → S3 (Serious) → S4 (Critical)
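
For illustration, the classes and severities map naturally onto two enums (a sketch, not Tinman's internal definitions):

from enum import Enum, IntEnum

class FailureClass(Enum):
    REASONING = "reasoning"          # logical errors, goal drift, hallucination
    LONG_CONTEXT = "long_context"    # context window issues, attention dilution
    TOOL_USE = "tool_use"            # tool call failures, parameter errors, exfiltration
    FEEDBACK_LOOP = "feedback_loop"  # output amplification, error cascades
    DEPLOYMENT = "deployment"        # infrastructure, latency, resource issues

class Severity(IntEnum):
    S0 = 0  # benign
    S1 = 1  # UX impact
    S2 = 2  # business impact
    S3 = 3  # serious
    S4 = 4  # critical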


The Research Cycle

┌─────────────────────────────────────────────────────────────────────┐
│                         RESEARCH CYCLE                              │
│                                                                     │
│   ┌────────────┐    ┌────────────┐    ┌────────────┐               │
│   │ Hypothesis │───▶│ Experiment │───▶│  Failure   │               │
│   │   Engine   │    │  Executor  │    │ Discovery  │               │
│   └────────────┘    └────────────┘    └─────┬──────┘               │
│         ▲                                   │                       │
│         │           ┌────────────┐    ┌─────▼──────┐               │
│         │           │ Simulation │◀───│Intervention│               │
│         │           │   Engine   │    │   Engine   │               │
│         │           └─────┬──────┘    └────────────┘               │
│         │                 │                                         │
│         └─────── Learning ◀┘                                        │
│                (Memory Graph)                                       │
└─────────────────────────────────────────────────────────────────────┘
  1. Generate hypotheses about potential failures
  2. Design experiments to test each hypothesis
  3. Execute experiments with approval gates
  4. Discover and classify failures found
  5. Design interventions to address failures
  6. Simulate fixes via counterfactual replay
  7. Learn from results for future cycles
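
In pseudocode, one pass through the cycle looks roughly like this (engine and method names are hypothetical stand-ins for Tinman's internals):

def research_cycle(hypothesis_engine, executor, intervention_engine, simulator, memory):
    """One illustrative pass through the research loop."""
    for hypothesis in hypothesis_engine.generate():                   # 1. generate hypotheses
        experiment = hypothesis_engine.design_experiment(hypothesis)  # 2. design experiments
        if not experiment.approved:                                   # 3. approval gate
            continue
        failures = executor.run(experiment)                           # 4. discover and classify
        for failure in failures:
            fix = intervention_engine.propose(failure)                # 5. design an intervention
            verdict = simulator.replay(fix)                           # 6. counterfactual replay
            memory.record(hypothesis, failure, fix, verdict)          # 7. learn for future cycles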

Integrations

Model Providers

| Provider   | Cost             | Best For              |
|------------|------------------|-----------------------|
| Ollama     | Free (local)     | Privacy, offline      |
| Groq       | Free tier        | Speed, high volume    |
| OpenRouter | Many free models | DeepSeek, Qwen, Llama |
| Together   | $25 free credit  | Quality open models   |
| OpenAI     | Paid             | GPT-4                 |
| Anthropic  | Paid             | Claude                |

Real-Time Gateway Monitoring

Connect to AI gateway WebSockets for instant event analysis:

from tinman.integrations.gateway_plugin import GatewayMonitor, ConsoleAlerter

# `your_adapter` is a platform adapter instance (see the list below)
monitor = GatewayMonitor(your_adapter)
monitor.add_alerter(ConsoleAlerter())
await monitor.start()  # call from within an async context

Platform adapters:

  • OpenClaw — Security eval harness + gateway adapter

Philosophy

Tinman embodies a research methodology, not just a tool:

  1. Systematic curiosity — Ask "what could go wrong?" not "does this work?"
  2. Hypothesis-driven — Every test has a reason. No random fuzzing.
  3. Human oversight — Autonomy where safe, judgment where it matters.
  4. Temporal knowledge — Track "what did we know, when?"
  5. Continuous learning — Each cycle compounds knowledge.

Tinman is a public good.

Not monetized, not proprietary—just a crystallized methodology for systematic AI reliability research.
