
Overview

The nao test command lets you measure your agent’s performance against a set of unit tests that you write. It is meant to help you monitor and improve the quality of your context over time.

nao test

The nao test command discovers the unit tests in your tests/ folder, runs each one against your agent, and compares the results to verify correctness.

Create unit tests

Create a tests/ folder in your project root:
your-project/
├── nao_config.yaml
├── RULES.md
├── tests/                          # Test folder
│   ├── total_revenue.yml          # Test file 1
│   ├── customer_metrics.yml       # Test file 2
│   └── outputs/                   # Test results (auto-generated)
│       └── results_20250209_143022.json
Then create your test files. Each test is a YAML file in the tests/ folder, with a .yml or .yaml extension, following this template:
name: total_revenue
prompt: What is the total revenue from all orders?
sql: |
  SELECT SUM(amount) as total_revenue
  FROM orders
Required fields:
  • name: A descriptive name for the test
  • prompt: The question or prompt to test
  • sql: A SQL query that produces the expected data
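If you accumulate many test files, a quick sanity check that each one declares these fields can save a failed run. Below is a minimal sketch of such a check; it is a hypothetical helper script, not part of the nao CLI, and assumes PyYAML is installed:
# validate_tests.py - hypothetical helper, not part of the nao CLI.
# Checks that every test file in tests/ declares the required fields.
from pathlib import Path

import yaml  # PyYAML

REQUIRED_FIELDS = {"name", "prompt", "sql"}

def validate_tests(tests_dir: str = "tests") -> list[str]:
    """Return a list of problems found in the test files."""
    problems = []
    files = sorted(Path(tests_dir).glob("*.yml")) + sorted(Path(tests_dir).glob("*.yaml"))
    for path in files:
        data = yaml.safe_load(path.read_text())
        if not isinstance(data, dict):
            problems.append(f"{path}: not a YAML mapping")
            continue
        missing = REQUIRED_FIELDS - data.keys()
        if missing:
            problems.append(f"{path}: missing fields {sorted(missing)}")
    return problems

if __name__ == "__main__":
    issues = validate_tests()
    print("\n".join(issues) or "All test files look valid.")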

Launch nao test command

Before running nao test:
  • Start the nao chat server (for example with nao chat or your usual local setup) so that the backend API is available.
  • On the first nao test run, the CLI will prompt you to log in via your browser. Use the same account you use in the local nao chat interface so that tests run under the same project and permissions.
Run all tests:
nao test
This will:
  • Discover all .yml and .yaml files in the tests/ folder
  • Run each test against the default model (openai:gpt-4.1)
  • Display results in a summary table
  • Save detailed results to tests/outputs/results_TIMESTAMP.json
Specify LLM model to test:
# Test with GPT-4.1
nao test -m openai:gpt-4.1
Model format: provider:model_id
Run tests in parallel:
# Run with 4 parallel threads
nao test -t 4
This speeds up execution when running many tests, but output may be interleaved.

How It Works

  1. Test Discovery: Scans the tests/ folder for .yml and .yaml files
  2. Test Execution: For each test:
    • Sends the prompt to your agent and runs it normally (the agent may execute SQL queries, search context, etc.)
    • Captures the agent’s full conversation history, tool calls, and response text
  3. Data Verification:
    • Extract actual data: a prompt is sent to the LLM with the full conversation history, asking it to extract the final answer as structured data.
      The extraction prompt is:
      Based on your previous analysis, provide the final answer to the original question.
      Format the data with these columns: [expected columns]
      Return the data as an array of rows, where each row is an object with the column names as keys.
      If you cannot answer, set data to null.
      
    • The LLM responds with structured data formatted as a table matching the expected columns.
    • Execute expected SQL: the sql query from your test file is executed against your database to get the expected results
    • Compare data: the data from the agent’s answer and the expected data (from SQL execution) are compared
  4. Data Comparison Process (sketched in the example after this list):
    • Normalize datasets: Both datasets are converted to DataFrames and normalized (resets index, infers types, and sorts columns alphabetically)
    • Row count match: If the row counts differ, the datasets are not compared further
    • Exact match: First attempts exact equality comparison
    • Approximate match: For numeric columns, uses numpy’s allclose with tolerance (relative tolerance: 1e-5, absolute tolerance: 1e-8) to handle floating-point differences
    • Diff generation: If both comparisons fail, generates a detailed diff showing where values differ
  5. Result Collection: Collects metrics including:
    • Pass/fail status of the data diff
    • Token usage and costs (inputs and outputs of the LLM)
    • Execution duration
    • Tool call count
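To make step 4 concrete, here is an illustrative pandas/numpy sketch of the comparison behaviour described above. It approximates the documented logic and is not nao’s actual implementation:
# Illustrative sketch of the data comparison described above --
# an approximation, not nao's actual implementation.
import numpy as np
import pandas as pd

def compare(actual_rows: list[dict], expected_rows: list[dict]) -> tuple[bool, str]:
    # Normalize: build DataFrames, infer types, sort columns alphabetically, reset the index.
    actual = pd.DataFrame(actual_rows).infer_objects().sort_index(axis=1).reset_index(drop=True)
    expected = pd.DataFrame(expected_rows).infer_objects().sort_index(axis=1).reset_index(drop=True)

    # Row count match: stop early if the row counts differ.
    if len(actual) != len(expected):
        return False, f"row count differs: {len(actual)} vs {len(expected)}"

    # Exact match first.
    if actual.equals(expected):
        return True, "match"

    # Approximate match for numeric columns, to absorb floating-point noise.
    num_cols = expected.select_dtypes(include="number").columns
    if list(actual.columns) == list(expected.columns) and len(num_cols) > 0:
        close = np.allclose(actual[num_cols], expected[num_cols], rtol=1e-5, atol=1e-8)
        other = actual.drop(columns=num_cols).equals(expected.drop(columns=num_cols))
        if close and other:
            return True, "match (within tolerance)"

    return False, "values differ"

print(compare([{"total_revenue": 1234.56}], [{"total_revenue": 1234.560001}]))
# -> (True, 'match (within tolerance)')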

Test Outputs

Console Output

The command displays a summary table with:
  • Test name
  • Model used
  • Pass/fail status
  • Message (e.g., “match”, “values differ”)
  • Token usage
  • Cost
  • Execution time
  • Tool call count
  • A final summary with total passed/failed counts
Example output: [screenshot: nao test command output]

JSON Results File

Detailed results are saved to tests/outputs/results_TIMESTAMP.json:
{
  "timestamp": "2025-02-09T14:30:22.123456",
  "results": [
    {
      "name": "total_revenue",
      "model": "openai:gpt-4.1",
      "passed": true,
      "message": "match",
      "tokens": 1250,
      "cost": 0.0125,
      "duration_ms": 234,
      "tool_call_count": 1,
      "details": {
        "response_text": "...",
        "actual_data": [...],
        "expected_data": [...],
        "tool_calls": [...]
      }
    }
  ],
  "summary": {
    "total": 3,
    "passed": 3,
    "failed": 0,
    "total_tokens": 3750,
    "total_cost": 0.0375,
    "total_duration_ms": 702,
    "total_duration_s": 0.7,
    "total_tool_calls": 3,
    "avg_duration_ms": 234,
    "avg_tool_calls": 1.0
  }
}
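Because the results file is plain JSON, it is easy to consume programmatically, for example to fail a CI job when the latest run contains failures. A minimal sketch, assuming the results_TIMESTAMP.json naming and the schema shown above (the script name is just a suggestion):
# check_results.py - minimal sketch, assuming the schema shown above.
# Exits non-zero when the most recent run contains failures (useful in CI).
import json
import sys
from pathlib import Path

# Timestamps sort lexically, so the last file is the most recent run.
results_files = sorted(Path("tests/outputs").glob("results_*.json"))
if not results_files:
    sys.exit("No result files found; run `nao test` first.")

latest = json.loads(results_files[-1].read_text())
summary = latest["summary"]
print(f"{summary['passed']}/{summary['total']} tests passed "
      f"(cost: ${summary['total_cost']:.4f})")

sys.exit(0 if summary["failed"] == 0 else 1)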

nao test server

The nao test server command starts a web server to explore test results in a visual interface. The test server provides:
  • Summary Dashboard: Overview cards showing pass rate, total tests, tokens, costs, and duration
  • Results Table: Interactive table of all test runs with status, metrics, and details
  • Detailed View: Click any test to see:
    • Full response text
    • Tool calls with arguments and results
    • Data comparison (actual vs expected)
    • Diff view for failed tests
    • Performance metrics
[Screenshot: test server UI]
[Screenshot: zoom on one test]
Start the server with:
nao test server
This will start the test server on http://localhost:8765
The test server reads from tests/outputs/. Make sure you’ve run nao test at least once to generate result files.

Best Practices

Creating Effective Tests

  1. Start with critical queries: Test the most important questions your users ask
  2. Cover edge cases: Include tests for boundary conditions and complex scenarios
  3. Keep tests focused: Each test should verify one specific behavior
  4. Avoid overfitting and leakage: Don’t include exact answers or overly specific details in your context that would let the agent “cheat” by pattern matching rather than genuinely understanding the context.

Integrating into Workflow

  1. Version control: Commit your tests/ folder to git
  2. CI/CD integration: Run tests automatically on context changes
  3. Regular evaluation: Run tests weekly or after major context updates
  4. Track trends: Monitor pass rates and costs over time
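For trend tracking, the timestamped result files can be folded into a simple history. A minimal sketch, again assuming the results_*.json schema shown earlier:
# Minimal sketch for tracking pass rate and cost over time,
# assuming the results_*.json schema shown earlier.
import json
from pathlib import Path

for path in sorted(Path("tests/outputs").glob("results_*.json")):
    run = json.loads(path.read_text())
    s = run["summary"]
    pass_rate = 100 * s["passed"] / s["total"] if s["total"] else 0.0
    print(f"{run['timestamp']}  pass rate: {pass_rate:5.1f}%  "
          f"cost: ${s['total_cost']:.4f}  tests: {s['total']}")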

Context Engineering Playbook

Learn how to integrate testing into your context engineering workflow