Scorecards

Scorecards define what “good” looks like for your AI agents. They’re structured evaluation frameworks that measure conversation quality against specific criteria you care about—like empathy, problem resolution, or compliance.

Why Scorecards Matter

Without clear evaluation criteria, quality is subjective. Scorecards help you:
  • Measure consistently - Everyone evaluates using the same standards
  • Track improvements - See if changes actually make agents better
  • Identify weaknesses - Know exactly what needs fixing
  • Compare agents - Understand which performs better and why
Example: Your “Customer Service” scorecard gives Agent V1 a 78 vs Agent V2’s 92. Drilling down shows V1 scores lower on “Empathy” (65) while matching V2 on “Accuracy” (91). Now you know exactly what to improve.

How Scorecards Work

A scorecard contains:
  1. Categories - Major areas to evaluate (e.g., Communication, Problem Resolution)
  2. Criteria - Specific things to check within each category (e.g., “Shows empathy”)
  3. Weights - How important each category is (totals 100%)
When you run a simulation or analyze a call, Chanl scores it against your scorecard and shows where it excelled or fell short.
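The scoring math itself is simple: each category's score is scaled by its weight, and the scaled values are summed. A minimal sketch in JavaScript (illustrative only, not Chanl's exact scoring internals):

```javascript
// Combine category scores into one overall score using their weights.
// Illustrative only - not Chanl's exact scoring internals.
function overallScore(categories) {
  const totalWeight = categories.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight !== 100) {
    throw new Error(`Weights must total 100, got ${totalWeight}`);
  }
  // Sum weight * score first and divide once, to avoid rounding drift.
  return categories.reduce((sum, c) => sum + c.weight * c.score, 0) / 100;
}

const score = overallScore([
  { name: 'Communication', weight: 30, score: 92 },
  { name: 'Problem Resolution', weight: 50, score: 85 },
  { name: 'Compliance', weight: 20, score: 84.5 },
]);
console.log(score); // 87
```

Note how the 50%-weighted category pulls the overall score twice as hard as the 25%-range categories: improving Problem Resolution moves the needle most.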

Creating a Scorecard

Navigate to Scorecards and click “Create Scorecard”:
  1. Name your scorecard
  2. Add categories with weights
  3. Define criteria for each category
  4. Test on sample calls
  5. Activate for use

Common Scorecard Templates

Customer Service Scorecard

Evaluate support interactions:
  • Empathy - Acknowledges customer feelings
  • Clarity - Uses simple, understandable language
  • Active Listening - Reflects back what customer said
  • Tone - Maintains friendly, professional demeanor
  • Issue Identification - Correctly understands the problem
  • Solution Quality - Provides effective resolution
  • Efficiency - Resolves in reasonable time
  • Confirmation - Verifies customer satisfaction
  • Required Disclosures - Provides mandatory information
  • Policy Adherence - Follows company guidelines
  • Data Handling - Protects customer information

Sales Scorecard

Evaluate sales conversations:
  • Needs Assessment - Asks questions to understand requirements
  • Budget Qualification - Determines financial fit
  • Timeline - Establishes decision timeframe
  • Authority - Identifies decision makers
  • Value Communication - Explains benefits clearly
  • Objection Handling - Addresses concerns effectively
  • Proof Points - Provides relevant examples/data
  • Customization - Tailors pitch to customer needs
  • Call to Action - Clear next steps
  • Urgency - Creates appropriate motivation
  • Commitment - Secures agreement or advance
  • Truthfulness - No misleading statements
  • Disclosures - Provides required information
  • Documentation - Confirms details in writing

Technical Support Scorecard

Evaluate technical assistance:
  • Problem Diagnosis - Identifies root cause
  • Solution Correctness - Provides accurate fix
  • Technical Knowledge - Demonstrates expertise
  • Jargon-Free - Explains without technical terms
  • Step-by-Step - Provides clear instructions
  • Patience - Allows customer time to follow along
  • Resolution Time - Solves within acceptable timeframe
  • Tools Used - Leverages available resources
  • Escalation - Knows when to involve specialists

Scorecard Categories & Weights

Setting Weights

Weights determine how much each category impacts the overall score:
{
  "categories": [
    {
      "name": "Communication",
      "weight": 30  // 30% of total score
    },
    {
      "name": "Problem Resolution",
      "weight": 50  // 50% of total score
    },
    {
      "name": "Compliance",
      "weight": 20  // 20% of total score
    }
  ]
}
Total weights must equal 100%.

When to weight higher:
  • Compliance - If regulatory consequences are severe
  • Problem Resolution - For support where solving issues is primary goal
  • Communication - For brand-sensitive interactions

Criteria Best Practices

Write Clear, Specific Criteria

Too vague: “Agent was good”

Clear and measurable:
{
  "name": "Empathy Demonstrated",
  "description": "Agent acknowledges customer frustration with phrases like 'I understand that's frustrating' or 'I can see why you're concerned' within first 30 seconds"
}

Include Examples

{
  "name": "Professional Tone",
  "description": "Maintains friendly, professional language throughout",
  "examples": {
    "good": [
      "I'd be happy to help you with that",
      "Let me look into this for you"
    ],
    "bad": [
      "That's not my problem",
      "You should have read the instructions"
    ]
  }
}

Make Them Actionable

Criteria should guide improvement:
{
  "name": "Clear Next Steps",
  "description": "Agent ends call by summarizing what will happen next and when customer can expect resolution",
  "passingExample": "I've processed your refund, and you'll see the credit in 3-5 business days. Is there anything else I can help with today?"
}

Using Scorecards

In Test Scenarios

Assign a scorecard when creating scenarios:
curl -X POST https://api.chanl.ai/v1/scenarios \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Refund Request Test",
    "prompt": "Customer wants refund for defective product",
    "personas": ["frustrated", "analytical"],
    "agents": ["agent-v1"],
    "scorecard": "customer-service-quality"
  }'
Every simulation from this scenario will be scored using that scorecard.

For Live Call Analysis

Analyze real calls using a scorecard:
curl -X POST https://api.chanl.ai/v1/call-logs/call_abc123/analyze \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "scorecard": "customer-service-quality"
  }'

Bulk Analysis

Score multiple calls at once:
const chanl = require('@chanl/sdk');

// Score last week's calls
const results = await chanl.callLogs.batchAnalyze({
  filters: {
    dateRange: {
      start: '2024-01-08',
      end: '2024-01-15'
    }
  },
  scorecard: 'customer-service-quality'
});

console.log(`Analyzed ${results.count} calls`);
console.log(`Average score: ${results.avgScore}`);
console.log(`Top weakness: ${results.lowestCategory}`);
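If you need to recompute the same aggregates locally, for example from individually analyzed calls, the math is straightforward. The per-call result shape below is an assumption for illustration, not the SDK's documented return type:

```javascript
// Local aggregation over scored calls (illustrative; the per-call shape
// here is an assumption, not the SDK's documented return type).
function summarize(calls) {
  const avgScore = calls.reduce((s, c) => s + c.overallScore, 0) / calls.length;
  // Average each category across calls, then find the weakest one.
  const totals = {};
  for (const call of calls) {
    for (const [name, score] of Object.entries(call.categoryScores)) {
      totals[name] = (totals[name] || 0) + score / calls.length;
    }
  }
  const lowestCategory = Object.entries(totals)
    .sort((a, b) => a[1] - b[1])[0][0];
  return { count: calls.length, avgScore, lowestCategory };
}

const summary = summarize([
  { overallScore: 90, categoryScores: { Communication: 95, Compliance: 80 } },
  { overallScore: 80, categoryScores: { Communication: 85, Compliance: 70 } },
]);
console.log(summary); // { count: 2, avgScore: 85, lowestCategory: 'Compliance' }
```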

Understanding Scorecard Results

Reading Category Scores

{
  "overallScore": 87,
  "categories": [
    {
      "name": "Communication",
      "weight": 30,
      "score": 92,
      "contributionToTotal": 27.6,
      "criteria": [
        {
          "name": "Empathy",
          "score": 95,
          "passed": true,
          "notes": "Excellent acknowledgment of customer frustration"
        },
        {
          "name": "Clarity",
          "score": 88,
          "passed": true,
          "notes": "Clear explanations with minor jargon"
        }
      ]
    },
    {
      "name": "Problem Resolution",
      "weight": 50,
      "score": 85,
      "contributionToTotal": 42.5,
      "criteria": [
        {
          "name": "Solution Provided",
          "score": 80,
          "passed": true,
          "notes": "Solution worked but took longer than ideal"
        }
      ]
    }
  ]
}
Key insights:
  • Overall score (87) is good but has room for improvement
  • Communication (92) is a strength
  • Problem Resolution (85) drags down the overall score because of its 50% weight
  • Focus improvement efforts on “Solution Provided” criterion
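The `contributionToTotal` field appears to be the category score scaled by its weight (92 at 30% gives 27.6; 85 at 50% gives 42.5). A one-line sketch of that relationship, assuming this interpretation is correct:

```javascript
// Assumed relationship (inferred from the sample payload above):
// a category's contribution is its score scaled by its weight percentage.
const contribution = (score, weight) => (score * weight) / 100;

console.log(contribution(92, 30)); // 27.6
console.log(contribution(85, 50)); // 42.5
```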

Comparing Across Agents

curl https://api.chanl.ai/v1/analytics/scorecard-comparison \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "scorecard": "customer-service-quality",
    "agents": ["agent-v1", "agent-v2"],
    "timeRange": "30d"
  }'
{
  "scorecard": "customer-service-quality",
  "comparison": {
    "agent-v1": {
      "overallScore": 82,
      "categoryScores": {
        "Communication": 88,
        "Problem Resolution": 78,
        "Compliance": 84
      }
    },
    "agent-v2": {
      "overallScore": 91,
      "categoryScores": {
        "Communication": 93,
        "Problem Resolution": 92,
        "Compliance": 87
      }
    }
  },
  "insights": [
    "Agent V2 outperforms across all categories",
    "Biggest gap is in Problem Resolution (+14 points)",
    "Both agents strong on Communication"
  ]
}

Refining Scorecards

Iterate Based on Data

After using a scorecard for a while:
  1. Review Score Distribution - Are most calls scoring 90+? Criteria might be too easy. All below 70? Too strict.
  2. Check Criterion Relevance - Which criteria consistently score well or poorly? Remove ones that don’t differentiate quality.
  3. Adjust Weights - If a high-weight category rarely varies, consider reducing its weight.
  4. Add Missing Criteria - If agents fail in ways the scorecard doesn’t capture, add new criteria.
  5. Test Changes - Run sample calls through the updated scorecard before deploying broadly.
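The score-distribution review above can be automated with a rough heuristic. The 80% cutoffs below are arbitrary assumptions; tune them to your own data:

```javascript
// Flag suspicious score distributions (thresholds are arbitrary assumptions).
function distributionWarning(scores) {
  const share = pred => scores.filter(pred).length / scores.length;
  if (share(s => s >= 90) > 0.8) return 'Criteria may be too lenient';
  if (share(s => s < 70) > 0.8) return 'Criteria may be too strict';
  return null;
}

console.log(distributionWarning([91, 95, 93, 97, 92])); // Criteria may be too lenient
console.log(distributionWarning([75, 82, 68, 88, 79])); // null
```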

A/B Test Scorecards

Compare different evaluation approaches:
const chanl = require('@chanl/sdk');

// Run same calls through two scorecards
const results = await chanl.scorecards.compare({
  calls: ['call_1', 'call_2', 'call_3'],
  scorecards: ['scorecard-v1', 'scorecard-v2']
});

console.log('V1 avg:', results.v1.avgScore);
console.log('V2 avg:', results.v2.avgScore);
console.log('Correlation:', results.correlation);
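The `correlation` figure tells you whether the two scorecards rank calls similarly. If you want to compute it yourself from per-call scores, a standard Pearson correlation works (assuming that is the statistic used; this page doesn't say):

```javascript
// Plain Pearson correlation between two scorecards' per-call scores.
// Whether the SDK uses exactly this statistic is an assumption.
function pearson(xs, ys) {
  const n = xs.length;
  const mean = a => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Scorecards that rank calls identically correlate perfectly:
console.log(pearson([70, 80, 90], [75, 85, 95])); // 1
```

A correlation near 1 means the new scorecard preserves the old ranking; a low correlation means it measures something genuinely different, which deserves a manual review of the calls where they disagree.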

Best Practices

  1. Start Simple - Begin with 3 categories and 2-3 criteria each. Add complexity as you learn what matters.
  2. Get Team Input - QA team, sales managers, and compliance should all contribute criteria.
  3. Test Before Production - Run the scorecard on 20-30 sample calls to validate that it produces useful scores.
  4. Document Examples - For each criterion, include clear examples of passing and failing behavior.
  5. Review Quarterly - Business priorities change. Update scorecards to match current goals.

Troubleshooting

Problem: Every call scores above 90 or below 50
Solutions:
  • Criteria are too lenient/strict - adjust thresholds
  • Review criterion descriptions for clarity
  • Test scorecard on known good and bad calls
  • Consider if weights are appropriate
Problem: Calls you think are good score low and vice versa
Solutions:
  • Review which specific criteria are failing
  • Criteria descriptions may not capture what you actually value
  • Check if AI is misinterpreting criteria
  • Provide more explicit examples in criterion descriptions
Problem: All agents score similarly on the scorecard
Solutions:
  • Criteria aren’t specific enough to capture differences
  • Add more granular criteria
  • Check if weights are masking category differences
  • May need different scorecards for different agent types
Problem: Analysis times out or is very slow
Solutions:
  • Reduce number of criteria (aim for <15 total)
  • Simplify criterion descriptions
  • Contact support for optimization help

What’s Next?