Project Vivarium: AI Agents as Red Team and Blue Team

Authors
  • Mansour Jalaly

The idea

Detection content has the same weakness as any defence: it is only ever as good as the last attacker who tested it. Human red teams are the gold standard, but they are expensive and episodic, and they go home at the end of the engagement. Meanwhile, the lab sits idle between exercises.

Project Vivarium is my experiment in closing that gap: two crews of AI agents — one playing attacker, one playing defender — running continuously against each other in an isolated lab. The red crew plans and executes attack behaviour; the blue crew consumes the resulting telemetry, triages what it sees, and proposes detection improvements. The humans referee.

The name is the point. A vivarium is a sealed, transparent enclosure built for observing living systems — contained on every side, visible from every angle. That is the design rule here, and it is non-negotiable: the agents live inside walls they cannot cross, and they operate as a glass box, not a black box. Every plan, every tool call, every piece of reasoning is logged and reviewable. An autonomous system you cannot audit has no place anywhere near security work.
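
One way to picture the glass-box rule is a wrapper that records every tool call before its result reaches the agent. The following is a minimal sketch under stated assumptions, not Vivarium's actual harness: the `audited` decorator, the in-memory `AUDIT_LOG`, and the `list_lab_hosts` stub are all hypothetical names, and a real system would write to durable storage and capture plans and reasoning traces as well as tool calls.

```python
import functools
import json
import time
from typing import Any, Callable

# Hypothetical in-memory audit trail; a real harness would append to durable storage.
AUDIT_LOG: list = []


def audited(agent: str) -> Callable:
    """Record every call to the wrapped tool, including calls that fail."""

    def decorator(tool: Callable) -> Callable:
        @functools.wraps(tool)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            entry = {
                "ts": time.time(),
                "agent": agent,
                "tool": tool.__name__,
                "args": repr(args),
                "kwargs": repr(kwargs),
            }
            try:
                result = tool(*args, **kwargs)
                entry["status"] = "ok"
                entry["result"] = repr(result)
                return result
            except Exception as exc:
                entry["status"] = "error"
                entry["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                AUDIT_LOG.append(entry)

        return wrapper

    return decorator


@audited(agent="red-planner")
def list_lab_hosts(subnet: str) -> list:
    # Stub tool: returns fixed, made-up lab addresses.
    return [f"{subnet}.10", f"{subnet}.11"]


list_lab_hosts("10.0.5")
print(json.dumps(AUDIT_LOG, indent=2))
```

Because the log entry is appended in a `finally` block, a tool that raises still leaves a record, which is the property an auditor cares about most.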

Why everything runs locally

The foundation is deliberately boring: a Docker Compose stack that runs entirely on my own hardware, air-gapped from any external API.

  • Ollama serves the models locally — a small instruction-tuned model (phi3:mini class) for agent reasoning and nomic-embed-text for embeddings. The models are imported from local GGUF weights; nothing is pulled at runtime.
  • ChromaDB provides the retrieval layer. Both crews work retrieval-augmented: the blue crew retrieves over detection rules, ATT&CK technique descriptions, and prior incident notes; the red crew over the lab's own documentation.
  • CrewAI handles the multi-agent orchestration — roles, tasks, and the hand-offs between them.
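
To make the retrieval split concrete, here is a toy sketch of each crew querying only its own collection. It deliberately uses no ChromaDB or Ollama calls: word counts stand in for nomic-embed-text embeddings, a plain dictionary stands in for a vector collection, and every document ID and text below is invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: dict, k: int = 2) -> list:
    """Return the IDs of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine(q, embed(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical documents; each crew only ever queries its own collection.
BLUE_CORPUS = {
    "rule-ps-encoded": "Detection rule: PowerShell launched with an encoded command argument",
    "attack-t1059": "ATT&CK T1059 Command and Scripting Interpreter technique description",
    "incident-007": "Prior incident note: scheduled task created for persistence on a lab host",
}
RED_CORPUS = {
    "lab-topology": "Lab documentation: hosts, subnets, and which services run where",
    "lab-reset": "Lab documentation: how to rebuild the environment from scratch",
}

print(retrieve("alert for encoded PowerShell command", BLUE_CORPUS, k=1))
print(retrieve("rebuild the lab", RED_CORPUS, k=1))
```

Keeping the two corpora separate means the red crew cannot read the detection rules it is trying to evade, and the blue crew does not get lab documentation it would not have in a real incident.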

Local-first is not an aesthetic choice. The data this system touches (attack tooling output, synthetic victim telemetry, detection logic) is exactly the kind of material that should never transit a third-party API. And cost matters: an agent loop that burns tokens against a hosted frontier model gets expensive precisely when it gets interesting. A small local model that runs all night with no per-token bill changes what you are willing to try.
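
A local-only policy is easier to keep if the harness refuses to start when any configured endpoint points outside the stack. The sketch below is one possible startup check, assuming hypothetical Compose service names (`ollama`, `chromadb`) and a hypothetical `ENDPOINTS` mapping; it is a configuration sanity check, not a substitute for network-level isolation.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical Compose service names that resolve only inside the stack's network.
INTERNAL_SERVICES = {"ollama", "chromadb"}


def is_local_endpoint(url: str) -> bool:
    """True if the URL's host is an internal service, loopback, or private address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host in INTERNAL_SERVICES or host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Any other DNS name is treated as external.
        return False
    return addr.is_loopback or addr.is_private


def assert_local(endpoints: dict) -> None:
    """Raise before any agent runs if a configured endpoint is not local."""
    offenders = {name: url for name, url in endpoints.items() if not is_local_endpoint(url)}
    if offenders:
        raise RuntimeError(f"non-local endpoints configured: {offenders}")


# Assumed configuration; the ports are the defaults for Ollama and ChromaDB.
ENDPOINTS = {
    "llm": "http://ollama:11434",
    "vectors": "http://chromadb:8000",
}
assert_local(ENDPOINTS)
```

The check fails closed: an unrecognised hostname counts as external until someone adds it to the internal set on purpose.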

The adversarial loop

A round of Vivarium looks like this:

  1. Red plans. The red crew picks a technique — scoped to a whitelist of ATT&CK techniques I have approved for the lab — and produces an execution plan against the lab environment.
  2. Red executes, within guardrails. Actions run only against the detection lab — isolated, Terraform-provisioned, rebuildable in one command. The agents have no route to anything real.
  3. Blue triages. The blue crew reads the resulting alerts and telemetry and produces an assessment: what happened, which detections fired, what was missed.
  4. Blue proposes. Where coverage gaps appear, the blue crew drafts a candidate detection — which lands in a review queue, not in production. A human approves, edits, or rejects every rule.
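
The four steps above can be sketched as a single function with the agents passed in as callables. This is an illustrative skeleton rather than the project's code: the technique IDs in `APPROVED_TECHNIQUES` are example ATT&CK sub-techniques chosen for the sketch (the post does not say which are approved), and the four stub functions stand in for what the red and blue crews would do.

```python
from dataclasses import dataclass, field

# Hypothetical approved list: T1059.001 (PowerShell), T1053.005 (Scheduled Task).
APPROVED_TECHNIQUES = {"T1059.001", "T1053.005"}


@dataclass
class Plan:
    technique: str
    steps: list


@dataclass
class Assessment:
    technique: str
    fired: list
    missed: bool


@dataclass
class LabRound:
    review_queue: list = field(default_factory=list)

    def run(self, technique, red_plan, red_execute, blue_triage, blue_propose):
        # 1. Red plans, but only inside the approved scope.
        if technique not in APPROVED_TECHNIQUES:
            raise PermissionError(f"{technique} is not approved for the lab")
        plan = red_plan(technique)
        # 2. Red executes; the callable is expected to touch only the lab.
        telemetry = red_execute(plan)
        # 3. Blue triages the resulting telemetry.
        assessment = blue_triage(technique, telemetry)
        # 4. Blue proposes; the draft is queued for a human, never deployed.
        if assessment.missed:
            self.review_queue.append({
                "technique": technique,
                "rule": blue_propose(assessment),
                "status": "pending_review",
            })
        return assessment


# Stub agents so the sketch runs without any model behind it.
def red_plan(technique):
    return Plan(technique, ["stub step"])


def red_execute(plan):
    return [{"event": "process_start", "technique": plan.technique}]


def blue_triage(technique, telemetry):
    fired = []  # pretend no detection fired
    return Assessment(technique, fired, missed=not fired)


def blue_propose(assessment):
    return f"candidate rule for {assessment.technique}"


lab_round = LabRound()
lab_round.run("T1059.001", red_plan, red_execute, blue_triage, blue_propose)
print(lab_round.review_queue)
```

The scope check sits in the harness rather than in a prompt, so an agent that proposes an unapproved technique is stopped by code, not by its own good behaviour.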

The output I actually care about is not "AI wrote a detection". It is the gap list: a continuously refreshed record of techniques the lab's detection stack failed to see, generated at a cadence no human red team could sustain.
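
A gap list of this kind can be derived from per-round results. The sketch below assumes a hypothetical `RoundResult` record and one particular refresh rule: a technique stays on the list only while its most recent round went undetected. The post does not specify how the real list is maintained, so treat both as assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundResult:
    round_id: int
    technique: str
    detected: bool


def gap_list(results) -> dict:
    """Map each currently undetected technique to its total number of misses."""
    latest = {}
    misses = defaultdict(int)
    for result in sorted(results, key=lambda r: r.round_id):
        latest[result.technique] = result
        if not result.detected:
            misses[result.technique] += 1
    # A technique drops off the list once a later round detects it.
    return {t: misses[t] for t, r in latest.items() if not r.detected}


# Invented results for illustration only.
history = [
    RoundResult(1, "T1059.001", detected=False),
    RoundResult(2, "T1053.005", detected=False),
    RoundResult(3, "T1059.001", detected=True),
    RoundResult(4, "T1053.005", detected=False),
]
print(gap_list(history))
```

Here T1059.001 leaves the list after round 3, while T1053.005 remains with two recorded misses.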

Early lessons, honestly stated

This is an experiment, and an early one. What I can already say:

  • Small models are better adversaries than analysts. Generating plausible attacker behaviour from a known playbook is a forgiving task; triaging ambiguous telemetry is not. The red crew punches above its weight; the blue crew needs tight scoping and good retrieval to stay useful.
  • The harness is the hard part. Most of the engineering is not prompts — it is guardrails, logging, environment isolation, and making agent output structured enough to review quickly.
  • Glass-box discipline pays off immediately. Reading agent reasoning traces catches silent failures — misread alerts, confidently wrong conclusions — that output-only evaluation would miss entirely.
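
"Structured enough to review quickly" usually means rejecting free-form agent output before a human ever sees it. As a hedged example, the sketch below validates a blue-crew triage result against a small schema; the field names in `REQUIRED` and the `Triage` record are hypothetical, not the project's actual format.

```python
import json
from dataclasses import dataclass

# Hypothetical schema for a triage result: field name -> required JSON type.
REQUIRED = {
    "technique": str,
    "summary": str,
    "detections_fired": list,
    "missed": bool,
}


@dataclass
class Triage:
    technique: str
    summary: str
    detections_fired: list
    missed: bool


def parse_triage(raw: str) -> Triage:
    """Parse agent output, raising ValueError on anything that is not reviewable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"agent output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("agent output must be a JSON object")
    for key, expected in REQUIRED.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")
    return Triage(**{key: data[key] for key in REQUIRED})


good = '{"technique": "T1059.001", "summary": "no alert raised", "detections_fired": [], "missed": true}'
print(parse_triage(good))
```

Output that fails validation can be logged alongside the reasoning trace and retried, so a malformed answer becomes a visible failure rather than a silent one.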

If the past few years taught security anything, it is that attackers adopt automation faster than defenders. Vivarium is a small bet that defenders can run the same play — on their own hardware, under their own rules, with everything visible through the glass. Write-ups of individual rounds will follow as the system matures.