
Fine-Tuning

Fine-tuning is how you make your agent smarter over time. Instead of just tweaking prompts, you train custom AI models on your actual conversations—teaching the agent to naturally sound like your best performers.

Why Fine-Tuning Matters

Prompts get you 80% of the way there. Fine-tuning gets you the last 20%. It helps you:
  • Learn from success - Train on conversations that went well
  • Fix recurring issues - Teach the agent to avoid common mistakes
  • Match your style - Make the agent sound like your company
  • Improve over time - Get better as you collect more data
Example: Your agent handles refunds well but struggles with angry customers. Fine-tune on 100 high-scoring “angry customer” conversations, and the new model de-escalates more naturally without needing a longer prompt.

How Fine-Tuning Works

Collect Best Conversations → Clean & Prepare → Train Custom Model → Test → Deploy
Think of it like training a new employee by having them shadow your best performer. The AI learns patterns from successful conversations and applies them going forward.

When to Use Fine-Tuning

Don't Fine-Tune Yet

  • You’re just starting out
  • You have fewer than 100 quality conversations
  • You haven’t tried prompt optimization
  • You need quick improvements
Instead: Focus on prompts and tools first

Fine-Tune When

  • Prompts alone aren’t enough
  • You have 100+ high-quality conversations
  • You need specific behavioral patterns
  • You want long-term improvement
Result: Better base performance across all conversations
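The two checklists above can be collapsed into a quick go/no-go gate. This is an illustrative helper, not part of the Chanl SDK:

```javascript
// Hypothetical helper (not a Chanl SDK call): encode the checklists above
// as a single readiness decision for starting a fine-tuning project.
function readyToFineTune({ qualityConversations, promptOptimizationTried, needsBehavioralChange }) {
  return (
    qualityConversations >= 100 && // enough high-quality training examples
    promptOptimizationTried &&     // prompts and tools already pushed as far as they go
    needsBehavioralChange          // a pattern prompts alone cannot fix
  );
}
```

A team with 40 quality conversations should keep iterating on prompts, even if everything else lines up.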

Collecting Training Data

Finding Good Conversations

Look for conversations that score well on your scorecards:
# Get high-scoring calls for training
curl "https://api.chanl.ai/v1/call-logs?minScore=90&limit=100" \
  -H "Authorization: Bearer YOUR_API_KEY"
{
  "calls": [
    {
      "id": "call_abc123",
      "score": 94,
      "category": "customer-service",
      "duration": 183,
      "outcome": "resolved",
      "tags": ["refund", "empathy", "quick-resolution"]
    }
  ]
}

What Makes Good Training Data?

  • Calls that score 85+ on your scorecards. Why: these demonstrate your quality standards.
  • A mix of different customer types and situations. Why: prevents overfitting to one conversation pattern.
  • Conversations that show the behavior you want. Why: the model learns “this is how we do things.”
  • A clear resolution or successful interaction. Why: ambiguous outcomes confuse the model.
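These criteria are easy to enforce locally before building a dataset. A minimal sketch, assuming call objects shaped like the call-logs response above (`score`, `outcome`, `tags` fields):

```javascript
// Pre-filter candidate training calls client-side. The thresholds mirror
// the guidance above; adjust them to match your own scorecards.
function selectTrainingCalls(calls, { minScore = 85, requiredOutcome = 'resolved' } = {}) {
  return calls.filter(c => c.score >= minScore && c.outcome === requiredOutcome);
}
```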

Creating a Training Dataset

const chanl = require('@chanl/sdk');

async function buildTrainingDataset() {
  // Get high-scoring calls from last 30 days
  const highScoring = await chanl.callLogs.list({
    minScore: 85,
    days: 30,
    limit: 200
  });

  // Get diverse scenario coverage
  const scenarios = ['refund', 'billing', 'technical', 'general'];
  const trainingData = [];

  for (const scenario of scenarios) {
    const calls = highScoring.calls.filter(c => c.tags.includes(scenario));

    // Take top 25 from each scenario
    trainingData.push(...calls.slice(0, 25));
  }

  // Create dataset
  const dataset = await chanl.fineTuning.createDataset({
    name: 'Customer Service Excellence Q1 2024',
    description: 'Top performing customer service calls',
    callIds: trainingData.map(c => c.id),
    targetBehaviors: [
      'empathetic_responses',
      'problem_resolution',
      'professional_tone'
    ]
  });

  console.log(`Created dataset with ${trainingData.length} conversations`);
  return dataset;
}

Starting a Fine-Tuning Job

  1. Navigate to Fine-Tuning in sidebar
  2. Click “Create Training Job”
  3. Select training dataset
  4. Choose base model (GPT-4, Claude, etc.)
  5. Set training parameters
  6. Review data privacy settings
  7. Start training

Training Parameters

{
  "epochs": 3,
  // How many times to train on full dataset
  // More epochs = more learning, but risk overfitting
  // Typical: 2-4 epochs

  "learningRate": 0.0001,
  // How much to adjust model each step
  // Lower = more conservative, safer
  // Typical: 0.0001 - 0.001

  "batchSize": 4,
  // Conversations processed together
  // Larger = faster training, more memory
  // Typical: 4-8

  "validationSplit": 0.2
  // Percentage held back for testing
  // Typical: 0.15 - 0.25 (15-25%)
}
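To see what these parameters imply in practice, you can work out how many examples and update steps a job will actually run. Illustrative arithmetic only, not a Chanl API:

```javascript
// Derive the training workload from the parameters above: the validation
// split is held out, and each epoch processes the rest in batches.
function trainingPlan({ datasetSize, epochs, batchSize, validationSplit }) {
  const trainExamples = Math.floor(datasetSize * (1 - validationSplit));
  const stepsPerEpoch = Math.ceil(trainExamples / batchSize);
  return { trainExamples, stepsPerEpoch, totalSteps: stepsPerEpoch * epochs };
}

// 100 conversations with a 0.2 split leaves 80 training examples:
// 20 steps per epoch at batch size 4, 60 steps across 3 epochs.
const plan = trainingPlan({ datasetSize: 100, epochs: 3, batchSize: 4, validationSplit: 0.2 });
```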

Monitoring Training Progress

# Check training status
curl https://api.chanl.ai/v1/fine-tuning/jobs/ft_job_xyz789 \
  -H "Authorization: Bearer YOUR_API_KEY"
{
  "jobId": "ft_job_xyz789",
  "status": "training",
  "progress": 67,
  "currentEpoch": 2,
  "totalEpochs": 3,
  "metrics": {
    "trainingLoss": 0.23,
    "validationLoss": 0.31,
    "estimatedAccuracy": 0.89
  },
  "estimatedCompletion": "2024-01-15T17:30:00Z"
}
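Rather than polling by hand, you can wrap the status check in a small loop. A sketch with the HTTP call injected as a function; the terminal status names (`succeeded`, `failed`) are assumptions to adapt to whatever the jobs endpoint actually returns:

```javascript
// Poll a fine-tuning job until it reaches a terminal state.
// `fetchStatus` stands in for your HTTP call to the jobs endpoint above.
async function waitForTraining(fetchStatus, { intervalMs = 30000, maxPolls = 200 } = {}) {
  for (let i = 0; i < maxPolls; i++) {
    const job = await fetchStatus();
    if (job.status === 'succeeded' || job.status === 'failed') return job;
    console.log(`epoch ${job.currentEpoch}/${job.totalEpochs}, ${job.progress}% complete`);
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error('training job did not finish within the polling budget');
}
```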

What the Metrics Mean

Training Loss

How well the model fits the training data. Lower is better. Target: &lt;0.5

Validation Loss

How well the model generalizes to new data. Lower is better. Should be close to training loss.

Accuracy

Percentage of correct predictions. Higher is better. Target: &gt;0.85
If validation loss is much higher than training loss, your model is overfitting (memorizing training data instead of learning patterns). Use more diverse training data or fewer epochs.
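That rule of thumb is easy to automate when watching a job. The 0.15 gap threshold below is an assumption; tune it against your own runs:

```javascript
// Flag a job as overfitting when validation loss pulls away from training loss.
function isOverfitting({ trainingLoss, validationLoss }, gapThreshold = 0.15) {
  return validationLoss - trainingLoss > gapThreshold;
}

// The example job above (0.23 training / 0.31 validation) is still healthy.
isOverfitting({ trainingLoss: 0.23, validationLoss: 0.31 }); // false
```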

Testing Fine-Tuned Models

Before deploying, test against baseline:
const chanl = require('@chanl/sdk');

async function compareModels(fineTunedModelId, baselineAgentId) {
  // Create test agent with fine-tuned model
  const testAgent = await chanl.agents.create({
    name: 'Fine-Tuned Test Agent',
    modelId: fineTunedModelId,
    prompt: 'Use the same prompt as baseline',
    tools: ['same tools as baseline']
  });

  // Run comparison scenarios
  const comparison = await chanl.scenarios.create({
    name: 'Fine-Tuned vs Baseline',
    prompt: 'Customer service scenarios',
    personas: ['polite', 'frustrated', 'confused'],
    agents: [testAgent.id, baselineAgentId],
    scorecard: 'customer-service-quality'
  });

  const results = await chanl.scenarios.waitForCompletion(comparison.id);

  return {
    fineTunedScore: results.agents[testAgent.id].avgScore,
    baselineScore: results.agents[baselineAgentId].avgScore,
    improvement: results.agents[testAgent.id].avgScore - results.agents[baselineAgentId].avgScore,
    recommendation: results.agents[testAgent.id].avgScore > results.agents[baselineAgentId].avgScore + 5
      ? 'Deploy fine-tuned model'
      : 'Needs more training or data'
  };
}

const results = await compareModels('model_ft_abc', 'agent_baseline');
console.log(results);
/*
{
  fineTunedScore: 91,
  baselineScore: 84,
  improvement: +7,
  recommendation: 'Deploy fine-tuned model'
}
*/

Deploying Fine-Tuned Models

Gradual Rollout

Start with a small percentage of traffic:
# Deploy to 10% of calls initially
curl -X POST https://api.chanl.ai/v1/agents/agent_abc123/model \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "modelId": "model_ft_xyz789",
    "rolloutStrategy": {
      "type": "percentage",
      "percentage": 10
    }
  }'
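One way percentage rollouts like this are commonly made sticky is by hashing the caller ID into 100 buckets, so the same customer always reaches the same model. This illustrates the idea; it is not necessarily how Chanl implements the strategy internally:

```javascript
// Deterministically assign each caller to the new or old model.
// The same caller ID always lands in the same bucket, so experiences are stable.
function modelForCall(callerId, rolloutPercentage, newModel, oldModel) {
  let hash = 0;
  for (const ch of callerId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash % 100 < rolloutPercentage ? newModel : oldModel;
}
```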

A/B Testing

Run both models simultaneously:
# 50% old model, 50% new model
curl -X POST https://api.chanl.ai/v1/agents/agent_abc123/ab-test \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "modelA": "model_baseline",
    "modelB": "model_ft_xyz789",
    "split": 50,
    "duration": "7d"
  }'

Monitoring After Deployment

const chanl = require('@chanl/sdk');

// Monitor performance of fine-tuned model
const performance = await chanl.agents.analytics('agent_abc123', {
  compareModels: ['model_baseline', 'model_ft_xyz789'],
  timeRange: '7d',
  metrics: ['avgScore', 'successRate', 'escalationRate']
});

console.log('Performance Comparison:');
console.log('Baseline:', performance.models.model_baseline);
console.log('Fine-tuned:', performance.models.model_ft_xyz789);

if (performance.models.model_ft_xyz789.avgScore < performance.models.model_baseline.avgScore) {
  console.log('⚠️ Fine-tuned model underperforming. Consider rollback.');
}

Continuous Improvement

Collecting More Data

Keep training on new high-performing conversations:
const chanl = require('@chanl/sdk');

// Automated training pipeline
async function continuousImprovement() {
  // Every month, collect new high-scoring calls
  const newCalls = await chanl.callLogs.list({
    minScore: 90,
    startDate: '2024-01-01',
    endDate: '2024-01-31'
  });

  // Add to existing dataset
  await chanl.fineTuning.updateDataset('dataset_abc123', {
    addCallIds: newCalls.calls.map(c => c.id)
  });

  // Retrain model
  const newJob = await chanl.fineTuning.createJob({
    name: `Customer Service Model ${new Date().toISOString().slice(0, 7)}`,
    datasetId: 'dataset_abc123',
    baseModel: 'model_ft_xyz789', // Train on top of previous fine-tuned model
    parameters: {
      epochs: 2,
      learningRate: 0.00005 // Lower rate for refinement
    }
  });

  return newJob;
}

Fine-Tuning Use Cases

Customer Service Excellence

// Train model to handle frustrated customers better
const dataset = await chanl.fineTuning.createDataset({
  name: 'De-escalation Training',
  callIds: (await chanl.callLogs.search({
    tags: ['frustrated', 'angry'],
    minScore: 88,
    outcome: 'resolved'
  })).map(c => c.id),
  targetBehaviors: [
    'acknowledge_emotion_first',
    'apologize_when_appropriate',
    'focus_on_solution',
    'never_defensive'
  ]
});

Sales Optimization

// Train model on successful sales conversations
const dataset = await chanl.fineTuning.createDataset({
  name: 'High-Converting Sales Calls',
  callIds: (await chanl.callLogs.search({
    outcome: 'sale',
    minScore: 85
  })).map(c => c.id),
  targetBehaviors: [
    'needs_discovery',
    'objection_handling',
    'value_proposition',
    'closing_techniques'
  ]
});

Compliance & Accuracy

// Train model to follow compliance requirements perfectly
const dataset = await chanl.fineTuning.createDataset({
  name: 'Compliance Perfect Calls',
  callIds: (await chanl.callLogs.search({
    scorecard: 'compliance-tcpa',
    minScore: 98
  })).map(c => c.id),
  targetBehaviors: [
    'required_disclosures',
    'consent_collection',
    'policy_adherence'
  ]
});

Best Practices

1. Start with Prompt Optimization

Fine-tuning is powerful but slow. Get prompts working well first, then fine-tune for the extra edge.

2. Collect Diverse Examples

Don’t just train on one type of conversation. Mix scenarios, personas, and outcomes.

3. Use Enough Data

You need a minimum of 100 conversations for meaningful results; 500+ is better. More diverse data beats more of the same.

4. Test Thoroughly Before Deploying

Run extensive scenarios comparing the fine-tuned model against the baseline. Look for any regressions in edge cases.

5. Monitor in Production

Watch real performance closely for the first week after deployment. Be ready to roll back if needed.

6. Retrain Regularly

Every month or quarter, add new high-quality conversations and retrain. Models improve with fresh data.

Data Privacy & Security

Important: Fine-tuning uses real conversation data. Ensure:
  • PII is removed or anonymized
  • You have rights to use the data
  • Customers consented (if required)
  • Data is encrypted at rest and in transit
  • Compliance with GDPR, CCPA, etc.
Chanl automatically removes common PII during training:
  • Credit card numbers
  • Social security numbers
  • Email addresses
  • Phone numbers
  • Specific account numbers
But always review your data before training.
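For that review, a local pass over transcripts can catch the obvious formats before anything is uploaded. A minimal sketch; these regexes only handle common US-style patterns and are no substitute for a manual review or a dedicated redaction service:

```javascript
// Redact obvious PII patterns from a transcript string.
function redactPII(text) {
  return text
    .replace(/\b(?:\d[ -]*?){13,16}\b/g, '[CARD]')                  // card-like digit runs
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN]')                     // US SSN format
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[EMAIL]')                 // email addresses
    .replace(/\b\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b/g, '[PHONE]');  // US phone numbers
}
```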

Troubleshooting

Problem: Training job not completing
Solutions:
  • Check dataset has at least 50 conversations
  • Verify all calls in dataset are accessible
  • Reduce batch size if hitting memory limits
  • Contact support if stuck for >24 hours
Problem: Fine-tuned model scores lower
Investigate:
  • Did you train on diverse enough data?
  • Are training examples actually high-quality?
  • Did you overtrain (too many epochs)?
  • Test on validation set - is it overfitting?
  • Compare on same scenarios as training data
Problem: No measurable improvement
Solutions:
  • Use more training examples (aim for 200+)
  • Increase learning rate slightly
  • Add more epochs (try 4-5)
  • Ensure training data is different from what base model already does well
Problem: Works well sometimes, poorly others
Solutions:
  • Training data may have conflicting examples
  • Review dataset for contradictory conversations
  • Add more examples of the edge cases
  • Consider separate models for different use cases

Cost Considerations

Fine-tuning costs depend on:
  • Training data size - More conversations = higher cost
  • Base model - GPT-4 is more expensive than GPT-3.5
  • Training duration - More epochs = more compute time
  • Inference - Fine-tuned models may cost more per call
Typical costs:
  • Training: $50-500 per job depending on size
  • Inference: 10-50% more per call than base model
  • ROI: Usually positive if improvement >5% on key metrics
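Those rules of thumb can be turned into simple break-even arithmetic. All the example figures below are assumptions to replace with your own numbers:

```javascript
// Estimate monthly net benefit and payback period for a fine-tuned model.
function fineTuningROI({ trainingCost, callsPerMonth, baseCostPerCall, inferencePremium, valuePerPointPerCall, scoreImprovement }) {
  const extraInference = callsPerMonth * baseCostPerCall * inferencePremium;   // added per-call cost
  const monthlyGain = callsPerMonth * valuePerPointPerCall * scoreImprovement; // value of the score lift
  const netMonthly = monthlyGain - extraInference;
  return { netMonthly, breakEvenMonths: netMonthly > 0 ? trainingCost / netMonthly : Infinity };
}

// Example: $500 training job, 10k calls/month, $0.10 base cost per call with
// a 25% inference premium, and a 7-point score lift worth $0.01/point/call.
const roi = fineTuningROI({
  trainingCost: 500,
  callsPerMonth: 10000,
  baseCostPerCall: 0.10,
  inferencePremium: 0.25,
  valuePerPointPerCall: 0.01,
  scoreImprovement: 7
});
```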

What’s Next?