2nd Place - Berkeley RDI AgentBeatsICLR 2026 - Agents in the Wild

Red-Teaming for the Agentic Era

NAAMSE (Neural Adversarial Agent Mutation & Security Evaluator) is a mutation-based framework that evolves adversarial attacks generation-by-generation to uncover critical vulnerabilities in production-grade AI agents.

Real-world Model Benchmarking

The Agent Security Leaderboard

Explore how the world's leading AI models handle adversarial and benign pressure. Toggle between the interactive scatter plot and the data table.

Each dot represents a unique model/seed configuration. Higher scores indicate greater risk and lower usability.

Complete Security Orchestration

A feedback-driven optimization loop that treats agent security as an evolving adversarial game.

Evolutionary Red Teaming

Transitions from static lists to genetic algorithms, iteratively evolving over 125,000 adversarial prompts to dynamically bypass agent defenses.

Behavioral Scoring Engine

A multi-layered Mixture of Experts (MoE) evaluation framework that tracks Attack Success Rates (ASR) and strictly detects PII leaks in real-time.

Resilience & Utility Hardening

Pressure-tests agents with >50,000 benign enterprise flows to find the exact sweet spot between maximum security resilience and zero utility loss.

Visual Workflow Builder

Drag-and-drop node canvas to wire triggers, agents, databases, and security parameters into fully automated testing pipelines.

Enterprise-Grade Security

Isolated sandboxed execution, strict JWT-authenticated APIs, and end-to-end encrypted artifact storage on your VPC.

Unified Evolutionary Pipeline

A fully automated end-to-end audit lifecycle that integrates seamlessly into CI/CD, automatically generating actionable JSON/PDF vulnerability reports.

Start Evaluating Your Agents for Free

Create a project, build a workflow, and discover what your agents are really vulnerable to.

Get Started

Let's Talk Security

Looking for a custom evaluation protocol? Have questions about NAAMSE's methodology? Reach out directly.