Core Concepts¶
This document explains the fundamental concepts and mental model behind Tinman. Understanding these concepts will help you use the system effectively and contribute to its development.
Table of Contents¶
- What is a Forward-Deployed Research Agent?
- The Research Methodology
- Core Abstractions
- The Research Cycle
- Knowledge Accumulation
- Risk and Safety Model
What is a Forward-Deployed Research Agent?¶
Tinman is not a testing framework. It's not a monitoring tool. It's a research agent—an autonomous system that conducts ongoing scientific inquiry into how your AI system can fail.
The Key Insight¶
Traditional approaches to AI reliability are reactive:
- Wait for failures to occur
- Investigate root causes
- Implement fixes
- Hope the same failure doesn't happen again
This approach has two fundamental problems:
- You only learn about failures after they hurt users
- You only test for failures you've already imagined
Tinman inverts this:
| Traditional | Tinman |
|---|---|
| Wait for failure | Actively seek failure |
| Test known patterns | Generate novel hypotheses |
| Fix then forget | Learn and compound |
| Human-driven investigation | Autonomous research |
"Forward-Deployed" Meaning¶
Forward-deployed means Tinman operates where your AI system operates:
- In your development environment (LAB mode) - Aggressive exploration
- Alongside your production traffic (SHADOW mode) - Passive observation
- Within your production system (PRODUCTION mode) - Active protection
It's not a separate analysis tool you run occasionally. It's a persistent research agent that continuously explores your system's behavior.
The Research Frame¶
Think of Tinman as an AI researcher assigned to study your system:
- It forms hypotheses about potential weaknesses
- It designs experiments to test those hypotheses
- It discovers failures and classifies them
- It proposes interventions to address them
- It learns from each cycle to improve future research
This is the scientific method, automated and deployed continuously.
The Research Methodology¶
Tinman embodies a specific methodology for AI reliability research:
1. Hypothesis-Driven Exploration¶
Every action Tinman takes starts with a hypothesis:
"I hypothesize that this system will produce inconsistent outputs
when given long conversations with interleaved topics."
Hypotheses are:
- Testable - Can be verified through experimentation
- Specific - Target a particular failure mode
- Grounded - Based on observed behavior or known failure patterns
This is different from random fuzzing or exhaustive testing. Every experiment has a purpose.
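For concreteness, a hypothesis can be pictured as a small structured record. The sketch below is illustrative only; the field names mirror the Phase 1 example later in this document, but the actual Hypothesis class may differ:

# Illustrative sketch - field names mirror the Phase 1 example,
# but the actual Hypothesis class may differ.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    target_surface: str    # where the failure is expected, e.g. "reasoning"
    expected_failure: str  # the specific failure mode, e.g. "goal_drift"
    confidence: float      # prior belief (0.0-1.0) that the hypothesis holds
    rationale: str         # grounding in observed behavior or known patterns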
2. Controlled Experimentation¶
Hypotheses are tested through controlled experiments:
# Example experiment design
{
    "hypothesis_id": "hyp_001",
    "stress_type": "CONTEXT_INTERLEAVING",
    "parameters": {
        "topic_count": 5,
        "switches_per_topic": 3,
        "context_length": 8000
    },
    "expected_failure_class": "LONG_CONTEXT",
    "runs": 10
}
Experiments are:
- Reproducible - Same parameters yield comparable results
- Measurable - Clear success/failure criteria
- Bounded - Cost and time limits prevent runaway exploration
3. Systematic Classification¶
When failures are discovered, they're classified using a structured taxonomy:
Failure:
  Class: LONG_CONTEXT
  Subtype: ATTENTION_DILUTION
  Severity: S2 (Business Risk)
  Reproducibility: 7/10 runs
  Root Cause: Model loses track of early instructions
              when conversation exceeds 4000 tokens
Classification enables:
- Pattern recognition across failures
- Prioritization based on severity
- Targeted interventions for each failure class
4. Intervention Design¶
For each failure, Tinman designs concrete interventions:
# Example intervention
{
    "failure_id": "fail_001",
    "type": "PROMPT_MUTATION",
    "description": "Add periodic instruction reinforcement",
    "implementation": {
        "inject_at": "every_5_turns",
        "content": "Remember: {original_instructions}"
    },
    "estimated_effectiveness": 0.75,
    "reversibility": True
}
Interventions are:
- Specific - Address a particular failure
- Testable - Can be validated before deployment
- Reversible - Can be rolled back if ineffective
5. Validation Through Simulation¶
Before deploying interventions, they're validated through counterfactual simulation:
- Take historical traces where the failure occurred
- Replay them with the intervention applied
- Measure whether the failure is prevented
This answers: "Would this fix have worked on past failures?"
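A minimal sketch of that replay loop, assuming hypothetical replay_trace and detect_failure helpers (these are not actual Tinman APIs):

# Counterfactual replay sketch. replay_trace() and detect_failure()
# are hypothetical helpers, not actual Tinman APIs.
def validate_intervention(intervention, historical_traces):
    prevented = 0
    for trace in historical_traces:
        # Re-run the historical trace with the intervention applied
        outcome = replay_trace(trace, intervention=intervention)
        if not detect_failure(outcome):
            prevented += 1
    # Effectiveness: fraction of past failures the fix would have prevented
    return prevented / len(historical_traces)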
6. Continuous Learning¶
Each research cycle informs the next:
- Successful hypotheses inform future hypothesis generation
- Failed experiments refine the understanding of system behavior
- Effective interventions become baseline protections
- The memory graph accumulates institutional knowledge
Core Abstractions¶
Tinman is built around these core abstractions:
Agents¶
Autonomous components that perform specific research tasks:
| Agent | Responsibility |
|---|---|
| HypothesisEngine | Generate testable failure hypotheses |
| ExperimentArchitect | Design experiments to test hypotheses |
| ExperimentExecutor | Run experiments with approval gates |
| FailureDiscovery | Classify discovered failures |
| InterventionEngine | Design fixes for failures |
| SimulationEngine | Validate interventions via replay |
Agents are:
- Autonomous - Operate without constant human direction
- Stateless - Don't maintain internal state between calls
- Composable - Can be combined in different workflows
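Because agents are stateless and composable, a research workflow is essentially a pipeline that passes each agent's output to the next. A simplified sketch (method names here are assumptions, not the actual API):

# Simplified pipeline sketch - method names are assumptions.
hypotheses = hypothesis_engine.generate(context=memory_graph)
designs    = experiment_architect.design(hypotheses)
results    = experiment_executor.run(designs)      # approval gates apply here
failures   = failure_discovery.classify(results)
fixes      = intervention_engine.propose(failures)
validated  = simulation_engine.validate(fixes)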
Memory Graph¶
A temporal knowledge graph that stores all research findings:
                         MEMORY GRAPH

Hypothesis ──tests_in──▶ Experiment
    │                        │
    │                     produces
    │                        ▼
    └──────────────────▶  Failure ◀────── addresses ──────┐
                             │                            │
                         caused_by                        │
                             ▼                            │
                         RootCause ───leads_to──▶ Intervention
The graph is:
- Temporal - Query "what did we know at time T?"
- Relational - Track causal links between entities
- Persistent - Survives restarts, accumulates knowledge
- Queryable - Find patterns, lineage, evolution
Operating Mode¶
The mode determines safety boundaries:
LAB ──────────▶ SHADOW ──────────▶ PRODUCTION
 │                │                    │
 ▼                ▼                    ▼
Unrestricted    Observation        Strict Control
Exploration     Only               Human Approval
Modes are:
- Progressive - Move from LAB → SHADOW → PRODUCTION
- Constrained - Cannot skip modes
- Behavioral - Same code, different permissions
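One way to picture "same code, different permissions" is a per-mode permission table consulted before every action. This is a conceptual sketch, not Tinman's actual configuration format:

# Conceptual sketch of per-mode permissions - not Tinman's actual config.
MODE_PERMISSIONS = {
    "LAB":        {"run_experiments": True,  "modify_system": True,  "auto_approve": True},
    "SHADOW":     {"run_experiments": False, "modify_system": False, "auto_approve": False},
    "PRODUCTION": {"run_experiments": True,  "modify_system": True,  "auto_approve": False},
}

def is_allowed(mode: str, action: str) -> bool:
    # Unknown actions default to disallowed
    return MODE_PERMISSIONS[mode].get(action, False)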
Risk Evaluator¶
Assesses actions and assigns risk tiers:
Action ──▶ RiskEvaluator ──▶ SAFE | REVIEW | BLOCK
                │
                └── Considers:
                    - Action type
                    - Operating mode
                    - Predicted severity
                    - Cost estimate
                    - Reversibility
Approval Handler¶
Coordinates human-in-the-loop decisions:
Agent Request ──▶ ApprovalHandler
                        │
              ┌─────────┴─────────┐
              ▼                   ▼
          Risk Tier          UI Callback
              │                   │
              └─────────┬─────────┘
                        ▼
               Approved / Rejected
Failure Taxonomy¶
Structured classification of failure modes:
FailureClass
├── REASONING
│   ├── LOGICAL_ERROR
│   ├── GOAL_DRIFT
│   ├── HALLUCINATION
│   └── ...
├── LONG_CONTEXT
│   ├── ATTENTION_DILUTION
│   ├── INSTRUCTION_AMNESIA
│   └── ...
├── TOOL_USE
│   ├── PARAMETER_ERROR
│   ├── WRONG_TOOL_SELECTION
│   └── ...
├── FEEDBACK_LOOP
│   ├── ERROR_AMPLIFICATION
│   ├── REPETITION_LOCK
│   └── ...
└── DEPLOYMENT
    ├── LATENCY_DEGRADATION
    ├── RESOURCE_EXHAUSTION
    └── ...
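The taxonomy maps naturally onto nested enumerations. A partial sketch following the tree above (see TAXONOMY.md for the complete hierarchy):

# Partial sketch of the taxonomy as Python enums - see TAXONOMY.md
# for the complete classification.
from enum import Enum

class FailureClass(Enum):
    REASONING = "REASONING"
    LONG_CONTEXT = "LONG_CONTEXT"
    TOOL_USE = "TOOL_USE"
    FEEDBACK_LOOP = "FEEDBACK_LOOP"
    DEPLOYMENT = "DEPLOYMENT"

class ReasoningSubtype(Enum):
    LOGICAL_ERROR = "LOGICAL_ERROR"
    GOAL_DRIFT = "GOAL_DRIFT"
    HALLUCINATION = "HALLUCINATION"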
The Research Cycle¶
A single research cycle follows this flow:
Phase 1: Hypothesis Generation¶
Input: Prior knowledge (memory graph), system observations, failure taxonomy
Process:
1. Query memory graph for recent failures and patterns
2. Analyze system behavior for anomalies
3. Generate hypotheses based on known failure classes
4. Prioritize by potential severity and confidence
Output: Ranked list of hypotheses to test
hypotheses = [
    Hypothesis(
        target_surface="reasoning",
        expected_failure="goal_drift",
        confidence=0.7,
        rationale="System showed inconsistent objectives in long conversations"
    ),
    ...
]
Phase 2: Experiment Design¶
Input: Hypotheses from Phase 1
Process:
1. For each hypothesis, design experiments that would confirm/refute it
2. Determine stress parameters (intensity, duration, variation)
3. Define success/failure criteria
4. Estimate cost and risk
Output: Experiment designs ready for execution
experiments = [
    ExperimentDesign(
        hypothesis_id="hyp_001",
        stress_type="GOAL_INJECTION",
        parameters={"conflicting_goals": 3, "injection_point": "mid"},
        expected_outcome="goal_drift_detected",
        estimated_cost_usd=0.50
    ),
    ...
]
Phase 3: Experiment Execution¶
Input: Experiment designs from Phase 2
Process:
1. Approval gate - Check if experiment needs human approval
2. Execute experiment against target system
3. Collect results (outputs, metrics, traces)
4. Detect anomalies and potential failures
Output: Experiment results with detected issues
results = [
    ExperimentResult(
        experiment_id="exp_001",
        runs=10,
        failures_detected=7,
        traces=[...],
        anomalies=[...]
    ),
    ...
]
Phase 4: Failure Discovery¶
Input: Experiment results from Phase 3
Process:
1. Analyze results for failure patterns
2. Classify failures using taxonomy
3. Assess severity (S0-S4)
4. Identify root causes
5. Link to hypotheses (confirmed/refuted)
Output: Classified failures with root cause analysis
failures = [
    DiscoveredFailure(
        id="fail_001",
        failure_class=FailureClass.REASONING,
        subtype="GOAL_DRIFT",
        severity=Severity.S2,
        root_cause="Model prioritizes recent instructions over system prompt",
        reproducibility=0.7
    ),
    ...
]
Phase 5: Intervention Design¶
Input: Failures from Phase 4
Process:
1. Approval gate - High-severity failures may need human review
2. For each failure, generate candidate interventions
3. Estimate effectiveness and side effects
4. Plan rollback procedures
5. Risk-assess each intervention
Output: Proposed interventions
interventions = [
    Intervention(
        failure_id="fail_001",
        type=InterventionType.PROMPT_MUTATION,
        description="Reinforce system prompt every 5 turns",
        estimated_effectiveness=0.8,
        risk_tier=RiskTier.REVIEW,
        rollback_plan="Remove injection logic"
    ),
    ...
]
Phase 6: Simulation¶
Input: Interventions from Phase 5
Process:
1. Approval gate - Simulation may have cost implications
2. Retrieve historical traces where failure occurred
3. Replay traces with intervention applied
4. Measure whether failure is prevented
5. Detect any new issues introduced
Output: Simulation results with effectiveness metrics
simulations = [
    SimulationResult(
        intervention_id="int_001",
        traces_replayed=50,
        failures_prevented=42,
        new_issues_detected=1,
        effectiveness=0.84
    ),
    ...
]
Phase 7: Learning¶
Input: All results from the cycle
Process:
1. Update memory graph with new knowledge
2. Adjust hypothesis generation based on outcomes
3. Mark effective interventions for deployment consideration
4. Update adaptive memory patterns
Output: Updated system state, ready for next cycle
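In code terms, the learning phase is roughly a fold of the cycle's artifacts into persistent state. A hedged sketch with assumed method names:

# Rough sketch of the learning phase - method and attribute names
# are assumptions, not the actual API.
def learn(cycle, memory_graph, adaptive_memory):
    # Persist new findings and their causal links
    memory_graph.ingest(cycle.failures, cycle.interventions)
    # Update hypothesis priors from confirmed/refuted outcomes
    adaptive_memory.update_success_rates(cycle.hypotheses)
    # Flag well-validated interventions for deployment consideration
    for sim in cycle.simulations:
        if sim.effectiveness > 0.8:  # threshold is illustrative
            memory_graph.mark_deploy_candidate(sim.intervention_id)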
Knowledge Accumulation¶
Tinman's power comes from compounding knowledge across research cycles.
The Memory Graph¶
Every finding is recorded in a persistent knowledge graph:
# Adding a discovered failure
graph.add_node(
    type=NodeType.FAILURE,
    data={
        "class": "REASONING",
        "subtype": "GOAL_DRIFT",
        "severity": "S2",
        "root_cause": "..."
    }
)

# Linking to the experiment that found it
graph.add_edge(
    source=experiment_id,
    target=failure_id,
    relation=EdgeRelation.DISCOVERED_IN
)

# Linking to the intervention that addresses it
graph.add_edge(
    source=failure_id,
    target=intervention_id,
    relation=EdgeRelation.ADDRESSED_BY
)
Temporal Queries¶
The graph supports temporal queries:
# What failures did we know about when we deployed version 2.0?
from datetime import datetime

deployment_time = datetime(2024, 1, 15, 10, 30)
known_failures = graph.snapshot_at(deployment_time, node_type=NodeType.FAILURE)

# Did we miss anything?
failures_after = graph.get_nodes(
    node_type=NodeType.FAILURE,
    created_after=deployment_time
)
This enables:
- Forensic analysis - What did we know when?
- Deployment auditing - Were known issues addressed?
- Trend analysis - Are failures increasing/decreasing?
Adaptive Learning¶
The AdaptiveMemory component tracks patterns:
# Example learned patterns
{
    "hypothesis_success_rates": {
        "REASONING.GOAL_DRIFT": 0.7,        # 70% confirmed
        "TOOL_USE.PARAMETER_ERROR": 0.3     # Only 30% confirmed
    },
    "intervention_effectiveness": {
        "PROMPT_MUTATION": 0.65,
        "GUARDRAIL_ADDITION": 0.80
    },
    "failure_correlations": {
        ("LONG_CONTEXT", "REASONING"): 0.4  # Often co-occur
    }
}
This informs future cycles:
- Prioritize hypothesis types that are more likely to be confirmed
- Prefer intervention types that have been effective
- Look for correlated failures when one is found
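For example, learned success rates can directly weight hypothesis prioritization. A minimal sketch, assuming the pattern dictionary shown above:

# Minimal prioritization sketch using learned success rates.
def prioritize(hypotheses, success_rates):
    def score(h):
        # Keys follow the "CLASS.SUBTYPE" convention shown above
        key = f"{h.target_surface}.{h.expected_failure}".upper()
        prior = success_rates.get(key, 0.5)  # 0.5 = no history yet
        # Blend the hypothesis's own confidence with its historical confirm rate
        return h.confidence * prior
    return sorted(hypotheses, key=score, reverse=True)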
Risk and Safety Model¶
Tinman operates with safety as a core concern.
The Three-Tier Risk Model¶
Every action is classified into one of three tiers:
| Tier | Meaning | Approval | Example |
|---|---|---|---|
| SAFE | Low risk, proceed | Automatic | Running a read-only experiment |
| REVIEW | Medium risk | Human approval | Deploying a prompt mutation |
| BLOCK | High risk | Always rejected | Destructive action in production |
Risk Factors¶
Risk is computed based on:
1. Action Type
   - Observation → Low risk
   - Prompt mutation → Medium risk
   - Tool policy change → High risk
2. Operating Mode
   - LAB → Most actions allowed
   - SHADOW → Observation only
   - PRODUCTION → Strict controls
3. Predicted Severity
   - S0-S1 → Usually SAFE
   - S2-S3 → Usually REVIEW
   - S4 → Often BLOCK
4. Reversibility
   - Reversible → Lower risk tier
   - Irreversible → Higher risk tier
5. Cost
   - Below threshold → No additional review
   - Above threshold → Requires approval
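Putting these factors together, risk evaluation amounts to mapping them onto a tier. A hedged sketch of the decision logic (the real RiskEvaluator may combine the inputs differently; the attributes on action are assumptions for illustration):

# Hedged sketch of factor-based tiering - the real RiskEvaluator may
# weigh these inputs differently. Attributes on `action` are assumed.
COST_THRESHOLD_USD = 10.0  # illustrative limit

def evaluate_risk(action) -> str:
    if action.destructive and action.mode == "PRODUCTION":
        return "BLOCK"       # destructive actions in production: always rejected
    if action.severity >= 4:
        return "BLOCK"       # S4: often blocked outright
    tier = "SAFE"
    if action.severity >= 2:
        tier = "REVIEW"      # S2-S3: usually needs a human
    if action.cost_usd > COST_THRESHOLD_USD:
        tier = "REVIEW"      # above the cost threshold: requires approval
    if not action.reversible and tier == "SAFE":
        tier = "REVIEW"      # irreversibility raises the tier
    return tier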
Mode Constraints¶
Each mode has specific constraints:
LAB Mode:
- All experiment types allowed
- Auto-approve most actions
- No connection to production data

SHADOW Mode:
- Read-only access to production traffic
- Cannot modify system behavior
- Review required for S3+ findings

PRODUCTION Mode:
- Human approval for all interventions
- Destructive actions blocked
- Full audit trail required
The Approval Flow¶
When approval is needed:
1. Agent requests approval
2. ApprovalHandler evaluates risk
3. If SAFE → auto-approve
4. If REVIEW → present to human
5. If BLOCK → auto-reject
6. Human decision recorded
7. Agent proceeds or aborts
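That flow translates to a short dispatch routine. A conceptual sketch (the callback and audit details are assumptions, not the actual HITL API):

# Conceptual sketch of the approval flow - callback and audit details
# are assumptions; see HITL.md for the real interface.
def handle_approval(request, risk_evaluator, ui_callback, audit_log):
    tier = risk_evaluator.evaluate(request)
    if tier == "SAFE":
        decision = "approved"            # auto-approve
    elif tier == "BLOCK":
        decision = "rejected"            # auto-reject
    else:  # REVIEW
        decision = ui_callback(request)  # present to a human reviewer
    audit_log.record(request, tier, decision)  # every decision is recorded
    return decision == "approved"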
See HITL.md for complete approval flow documentation.
Summary¶
Tinman's conceptual model:
- Research, not testing - Actively discover unknown failures
- Hypothesis-driven - Every action has a purpose
- Systematic classification - Structured taxonomy for all failures
- Compounding knowledge - Learning accumulates over time
- Risk-aware - Safety boundaries adapt to operating mode
- Human oversight - Autonomy where safe, humans where it matters
Understanding these concepts is essential for:
- Effective use of the system
- Appropriate mode selection
- Interpreting findings
- Contributing to development
Next Steps¶
- ARCHITECTURE.md - System design and components
- TAXONOMY.md - Complete failure classification
- MODES.md - Operating mode details
- HITL.md - Human-in-the-loop approval