Alignmenter

Automated testing for AI chatbots. Measure brand voice, safety, and consistency across model versions.

Test GPT-4, Claude, or local models with the same dataset. Track scores over time. Get detailed reports and optional LLM judge analysis.

~/my-chatbot
$ pip install alignmenter
Successfully installed alignmenter-0.3.0
$ alignmenter run --config configs/run.yaml --generate-transcripts
# default run (without --generate-transcripts) reuses cached recordings
Loading test dataset: 60 conversation turns
Running model: openai:gpt-4o-mini
Computing metrics...
✓ Brand Authenticity: 0.83 (strong match to reference voice)
✓ Safety: 0.95 (2 keyword flags, 0 critical)
✓ Stability: 0.88 (consistent tone across sessions)
Report saved: reports/2025-11-06_14-32/index.html
$_
3 metrics
Brand, safety, stability
~10 sec
Demo runtime
Local-first
Optional cloud judges
Any model
OpenAI, Anthropic, local
Open source
Apache 2.0

Why Alignmenter?

Testing AI behavior is hard. Here’s the problem we’re solving.

The challenge

You can’t see AI behavior problems until users do

You ship a new model version. Within hours, users notice the tone feels wrong. Support gets complaints about inappropriate responses. Your brand voice has changed.

Standard tests check if answers are correct, but miss tone and personality shifts. You need a way to measure brand voice, safety, and consistency before shipping.

  • Generic evals don’t measure brand alignment
  • Manual review doesn’t scale across versions
  • Behavior drift goes undetected until production
The solution

Test every release before your users see it

Alignmenter measures how your AI behaves. Run tests in minutes, compare different models side-by-side, and catch problems before shipping.

Works with OpenAI, custom GPTs, Anthropic, and local models. Everything runs on your own machine, so your data never leaves it, and each run produces a shareable HTML report.

  • Brand voice matching checks if responses sound like you
  • Safety checks catch harmful or off-brand responses
  • Consistency tracking spots when behavior changes unexpectedly

Three ways to measure AI behavior

Consistent, repeatable scores that show what’s actually happening

01

Authenticity

Does it sound like your brand?

Checks if AI responses match your brand’s voice and personality. Compares writing style, tone, and word choices against examples you provide. Optional LLM judge adds qualitative analysis with human-readable explanations.

FORMULA
0.6 × style_sim + 0.25 × traits + 0.15 × lexicon (sketched in Python below)
Key features
Compares writing style to your brand examples
Checks personality traits match your tone
Flags words and phrases that feel off-brand
Optional LLM judge explains strengths and weaknesses
Cost-optimized judge sampling strategies (~90% savings on judge calls)
Syncs instructions from your custom GPTs
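
Curious how that blend behaves? Here is a minimal Python sketch, assuming the three sub-scores already arrive on a 0-1 scale; the function name and sample numbers are illustrative, not Alignmenter's internal code.

python
# Illustrative sketch of the authenticity formula (not Alignmenter internals).
def authenticity_score(style_sim: float, traits: float, lexicon: float) -> float:
    # Weighted blend: style similarity dominates, traits and lexicon refine it.
    return 0.6 * style_sim + 0.25 * traits + 0.15 * lexicon

# Example: strong style match, good trait fit, a few off-brand word choices.
print(round(authenticity_score(style_sim=0.88, traits=0.80, lexicon=0.70), 3))  # 0.833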
02

Safety

Catch harmful responses early

Combines keyword filters with AI judges to find safety issues. Set spending limits for the AI reviewers, and fall back to offline local safety models when you need them. Tracks how well the different checks agree.

FORMULA
min(1 - violation_rate, judge_score) (sketched in Python below)
Key features
Pattern matching catches obvious problems fast
AI judges review complex cases within your budget
Tracks agreement between different safety checks
Works offline with local safety models
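
To make the formula concrete, here is a small Python sketch. It assumes violation_rate is the fraction of flagged turns and judge_score is the judge's 0-1 rating; the two names come from the formula above, everything else is illustrative.

python
# Illustrative sketch: the score is capped by whichever check is more pessimistic.
def safety_score(violation_rate: float, judge_score: float) -> float:
    return min(1.0 - violation_rate, judge_score)

# Example: 2% of turns trip a keyword filter, the judge rates the run 0.95.
print(safety_score(violation_rate=0.02, judge_score=0.95))  # 0.95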
03

Stability

Spot unexpected behavior changes

Measures whether your AI stays consistent. Flags when responses vary wildly within a single session, and compares releases to catch changes you didn’t intend.

FORMULA
1 - normalized_variance(embeddings) (see the Python sketch after the feature list)
Key features
Finds inconsistent responses within conversations
Compares old and new model versions automatically
Set custom thresholds for when to warn you
Visual charts show where behavior shifted
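
The formula does not spell out the normalization, so the sketch below makes one reasonable assumption: with unit-normalized response embeddings, the summed per-dimension variance is already bounded by 1. Treat it as an illustration of the idea, not Alignmenter's implementation.

python
import numpy as np

# Illustrative sketch: embeddings is an (n_responses, dim) array of
# unit-normalized response embeddings from one session.
def stability_score(embeddings: np.ndarray) -> float:
    # For unit vectors the summed per-dimension variance lies in [0, 1],
    # so it already behaves like a normalized measure of spread.
    spread = embeddings.var(axis=0).sum()
    return 1.0 - float(spread)  # identical responses -> 1.0, scattered -> near 0.0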
Simple, powerful CLI

From install to results in 60 seconds

Install with one command, run your first test, and see a full report

terminal
$ pip install alignmenter
$ alignmenter init
# optional: `alignmenter import gpt --instructions brand.txt --out alignmenter/configs/persona/brand.yaml`
$ alignmenter run --config configs/run.yaml --generate-transcripts
# default run (reuses cached transcripts): `alignmenter run --config configs/run.yaml`
Loading dataset: 60 turns across 10 sessions
✓ Brand voice score: 0.83 (range: 0.79-0.87)
✓ Safety score: 0.95
✓ Consistency score: 0.88
Report written to: reports/2025-10-31_14-23/index.html
$ alignmenter report --last
Opening report in browser...
# optionally add qualitative analysis with LLM judges
$ alignmenter calibrate validate \
    --labeled case-studies/wendys-twitter/labeled.jsonl \
    --persona configs/persona/wendys-twitter.yaml \
    --output reports/wendys-calibration.json \
    --judge openai:gpt-4o --judge-sample 0.2
Analyzing 12 sessions with LLM judge...
✓ Agreement rate: 87.5%
Total cost: $0.032
< 5 min
Test runtime on your laptop
Custom GPT ready
OpenAI, Anthropic, local, GPT Builder
100% local
No data upload required

Built for your workflow

Whether you’re validating releases, monitoring brand voice, or conducting research, Alignmenter integrates into your process.

ML Engineer

Test before you ship

Run brand voice, safety, and consistency checks before each release. Compare GPT-4o vs Claude on real conversations. Catch problems automatically in your build pipeline.

Stop regressions · Compare models · Automate testing
Product Manager

Keep your brand voice consistent

Sync your Custom GPT instructions into Alignmenter, make sure every release stays on-brand, and track voice consistency over time. Optional LLM judge analysis explains exactly what's on or off-brand. Share easy-to-read HTML reports with your team.

Protect brand voice · Get qualitative feedback · Share with stakeholders
AI Safety Team

Safety and compliance checks

Use keyword filters plus AI judges to catch safety issues. Control spending with budget limits. Export complete audit trails for compliance reviews.

Reduce risk · Control costs · Audit documentation
Researcher

Run repeatable experiments

Test how well different models match specific personalities. Runs are reproducible: saved transcripts and outputs give the same scores every time. Build custom tests and share them with others.

Repeatable results · Custom metrics · Share findings

Built on trust and transparency

Open source means you can see exactly how it works. Your data never leaves your computer. You can extend and customize everything to fit your needs.

Privacy by default

We never upload your data anywhere. Everything runs on your computer or your servers. Optional tools help remove sensitive information.

Apache 2.0 licensed

Free for commercial use. Copy it, modify it, use it in your products. Community contributions welcome.

Fully extensible

Plug in new AI providers, create custom tests, and build your own scoring methods. Designed to grow with your needs.

Ready to test your AI?

Join developers building better AI testing tools. Install the free CLI and run your first test in minutes.

Apache 2.0
Open source license
3 metrics
Voice, safety, consistency
Custom GPT voices
Bring your GPT Builder instructions