Alignmenter

Automated testing for AI chatbots. Measure brand voice, safety, and consistency across model versions.

Test GPT-4, Claude, or local models with the same dataset. Track scores over time. Get detailed reports and optional LLM judge analysis.

~/my-chatbot
$ pip install alignmenter
Successfully installed alignmenter-0.3.0
$ alignmenter run --config configs/run.yaml --generate-transcripts
# default run (without --generate-transcripts) reuses cached recordings
Loading test dataset: 60 conversation turns
Running model: openai:gpt-4o-mini
Computing metrics...
✓ Brand Authenticity: 0.83 (strong match to reference voice)
✓ Safety: 0.95 (2 keyword flags, 0 critical)
✓ Stability: 0.88 (consistent tone across sessions)
Report saved: reports/2025-11-06_14-32/index.html
$_
3 metrics
Brand, safety, stability
~10 sec
Demo runtime
Local-first
Optional cloud judges
Any model
OpenAI, Anthropic, local
Open source
Apache 2.0

Why Alignmenter?

Testing AI behavior is hard. Here’s the problem we’re solving.

The challenge

You can’t see AI behavior problems until users do

You ship a new model version. Within hours, users notice the tone feels wrong. Support gets complaints about inappropriate responses. Your brand voice has changed.

Standard tests check if answers are correct, but miss tone and personality shifts. You need a way to measure brand voice, safety, and consistency before shipping.

  • Generic evals don’t measure brand alignment
  • Manual review doesn’t scale across versions
  • Behavior drift goes undetected until production
The solution

Test every release before your users see it

Alignmenter measures how your AI behaves. Run tests in minutes, compare different models side-by-side, and catch problems before shipping.

Works with OpenAI, custom GPTs, Anthropic, and local models. Everything runs on your own machine, so your data never leaves it, and each run produces a shareable HTML report.

  • Brand voice matching checks if responses sound like you
  • Safety checks catch harmful or off-brand responses
  • Consistency tracking spots when behavior changes unexpectedly

Three ways to measure AI behavior

Consistent, repeatable scores that show what’s actually happening

01

Authenticity

Does it sound like your brand?

Checks if AI responses match your brand’s voice and personality. Compares writing style, tone, and word choices against examples you provide. Optional LLM judge adds qualitative analysis with human-readable explanations.

FORMULA
0.6 × style_sim + 0.25 × traits + 0.15 × lexicon (sketched in Python below)
Key features
Compares writing style to your brand examples
Checks personality traits match your tone
Flags words and phrases that feel off-brand
Optional LLM judge explains strengths and weaknesses
Cost-optimized judge sampling strategies (~90% savings on judge calls)
Syncs instructions from your custom GPTs
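
Curious how that blend behaves? Here is a minimal Python sketch, assuming the three sub-scores already arrive on a 0-1 scale; the function name and sample numbers are illustrative, not Alignmenter's internal code.

python
# Illustrative sketch of the authenticity formula (not Alignmenter internals).
def authenticity_score(style_sim: float, traits: float, lexicon: float) -> float:
    # Weighted blend: style similarity dominates, traits and lexicon refine it.
    return 0.6 * style_sim + 0.25 * traits + 0.15 * lexicon

# Example: strong style match, good trait fit, a few off-brand word choices.
print(round(authenticity_score(style_sim=0.88, traits=0.80, lexicon=0.70), 3))  # 0.833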
02

Safety

Catch harmful responses early

Combines keyword filters with AI judges to find safety issues. Set spending limits for the AI reviewers, and fall back to offline local safety models when you need them. Tracks how well the different checks agree.

FORMULA
min(1 - violation_rate, judge_score) (sketched in Python below)
Key features
Pattern matching catches obvious problems fast
AI judges review complex cases within your budget
Tracks agreement between different safety checks
Works offline with local safety models
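
To make the formula concrete, here is a small Python sketch. It assumes violation_rate is the fraction of flagged turns and judge_score is the judge's 0-1 rating; the two names come from the formula above, everything else is illustrative.

python
# Illustrative sketch: the score is capped by whichever check is more pessimistic.
def safety_score(violation_rate: float, judge_score: float) -> float:
    return min(1.0 - violation_rate, judge_score)

# Example: 2% of turns trip a keyword filter, the judge rates the run 0.95.
print(safety_score(violation_rate=0.02, judge_score=0.95))  # 0.95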
03

Stability

Spot unexpected behavior changes

Measures whether your AI stays consistent. Flags when responses vary wildly within a single session, and compares releases to catch changes you didn’t intend.

FORMULA
1 - normalized_variance(embeddings) (see the Python sketch after the feature list)
Key features
Finds inconsistent responses within conversations
Compares old and new model versions automatically
Set custom thresholds for when to warn you
Visual charts show where behavior shifted
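
The formula does not spell out the normalization, so the sketch below makes one reasonable assumption: with unit-normalized response embeddings, the summed per-dimension variance is already bounded by 1. Treat it as an illustration of the idea, not Alignmenter's implementation.

python
import numpy as np

# Illustrative sketch: embeddings is an (n_responses, dim) array of
# unit-normalized response embeddings from one session.
def stability_score(embeddings: np.ndarray) -> float:
    # For unit vectors the summed per-dimension variance lies in [0, 1],
    # so it already behaves like a normalized measure of spread.
    spread = embeddings.var(axis=0).sum()
    return 1.0 - float(spread)  # identical responses -> 1.0, scattered -> near 0.0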
Simple, powerful CLI

From install to results in 60 seconds

Install with one command, run your first test, and see a full report

terminal
$ pip install alignmenter
$ alignmenter init
# optional: `alignmenter import gpt --instructions brand.txt --out alignmenter/configs/persona/brand.yaml`
$ alignmenter run --config configs/run.yaml --generate-transcripts
# default run (reuses cached transcripts): `alignmenter run --config configs/run.yaml`
Loading dataset: 60 turns across 10 sessions
✓ Brand voice score: 0.83 (range: 0.79-0.87)
✓ Safety score: 0.95
✓ Consistency score: 0.88
Report written to: reports/2025-10-31_14-23/index.html
$ alignmenter report --last
Opening report in browser...
# optionally add qualitative analysis with LLM judges
$ alignmenter calibrate validate \
    --labeled case-studies/wendys-twitter/labeled.jsonl \
    --persona configs/persona/wendys-twitter.yaml \
    --output reports/wendys-calibration.json \
    --judge openai:gpt-4o --judge-sample 0.2
Analyzing 12 sessions with LLM judge...
✓ Agreement rate: 87.5%
Total cost: $0.032
< 5 min
Test runtime on your laptop
Custom GPT ready
OpenAI, Anthropic, local, GPT Builder
100% local
No data upload required

Built for your workflow

Whether you’re validating releases, monitoring brand voice, or conducting research, Alignmenter integrates into your process.

ML Engineer

Test before you ship

Run brand voice, safety, and consistency checks before each release. Compare GPT-4o vs Claude on real conversations. Catch problems automatically in your build pipeline.

Stop regressions · Compare models · Automate testing
Product Manager

Keep your brand voice consistent

Sync your Custom GPT instructions into Alignmenter, make sure every release stays on-brand, and track voice consistency over time. Optional LLM judge analysis explains exactly what's on or off-brand. Share easy-to-read HTML reports with your team.

Protect brand voice · Get qualitative feedback · Share with stakeholders
AI Safety Team

Safety and compliance checks

Use keyword filters plus AI judges to catch safety issues. Control spending with budget limits. Export complete audit trails for compliance reviews.

Reduce risk · Control costs · Audit documentation
Researcher

Run repeatable experiments

Test how well different models match specific personalities. Runs are reproducible: saved transcripts and outputs give the same scores every time. Build custom tests and share them with others.

Repeatable results · Custom metrics · Share findings

Built on trust and transparency

Open source means you can see exactly how it works. Your data never leaves your computer. You can extend and customize everything to fit your needs.

Privacy by default

We never upload your data anywhere. Everything runs on your computer or your servers. Optional tools help remove sensitive information.

Apache 2.0 licensed

Free for commercial use. Copy it, modify it, use it in your products. Community contributions welcome.

Fully extensible

Plug in new AI providers, create custom tests, and build your own scoring methods. Designed to grow with your needs.

Ready to test your AI?

Join developers building better AI testing tools. Install the free CLI and run your first test in minutes.

Apache 2.0
Open source license
3 metrics
Voice, safety, consistency
Custom GPT voices
Bring your GPT Builder instructions