Fine-Tuning
Fine-tuning is how you make your agent smarter over time. Instead of just tweaking prompts, you train custom AI models on your actual conversations—teaching the agent to naturally sound like your best performers.
Why Fine-Tuning Matters
Prompts get you 80% of the way there. Fine-tuning gets you the last 20%. It helps you:
Learn from success - Train on conversations that went well
Fix recurring issues - Teach the agent to avoid common mistakes
Match your style - Make the agent sound like your company
Improve over time - Get better as you collect more data
Example: Your agent handles refunds well but struggles with angry customers. Fine-tune on 100 high-scoring “angry customer” conversations, and the new model de-escalates better without needing a longer prompt.
How Fine-Tuning Works
Collect Best Conversations → Clean & Prepare → Train Custom Model → Test → Deploy
Think of it like training a new employee by having them shadow your best performer. The AI learns patterns from successful conversations and applies them going forward.
When to Use Fine-Tuning
Don't Fine-Tune Yet
You’re just starting out
Have fewer than 100 quality conversations
Haven’t tried prompt optimization
Need quick improvements
Instead: Focus on prompts and tools first
Fine-Tune When
Prompts alone aren’t enough
You have 100+ high-quality conversations
Need specific behavioral patterns
Want long-term improvement
Result: Better base performance across all conversations
Collecting Training Data
Finding Good Conversations
Look for conversations that score well on your scorecards:
# Get high-scoring calls for training
curl "https://api.chanl.ai/v1/call-logs?minScore=90&limit=100" \
-H "Authorization: Bearer YOUR_API_KEY"
{
  "calls": [
    {
      "id": "call_abc123",
      "score": 94,
      "category": "customer-service",
      "duration": 183,
      "outcome": "resolved",
      "tags": ["refund", "empathy", "quick-resolution"]
    }
  ]
}
What Makes Good Training Data?
High-Scoring Conversations
Calls that score 85+ on your scorecards - Why: these demonstrate your quality standards
A mix of different customer types and situations - Why: this prevents overfitting to one conversation pattern
Conversations that show the behavior you want - Why: the model learns “this is how we do things”
A clear resolution or successful interaction - Why: ambiguous outcomes confuse the model
Creating a Training Dataset
const chanl = require('@chanl/sdk');

async function buildTrainingDataset() {
  // Get high-scoring calls from the last 30 days
  const highScoring = await chanl.callLogs.list({
    minScore: 85,
    days: 30,
    limit: 200
  });

  // Get diverse scenario coverage
  const scenarios = ['refund', 'billing', 'technical', 'general'];
  const trainingData = [];

  for (const scenario of scenarios) {
    const calls = highScoring.filter(c => c.tags.includes(scenario));
    // Take the top 25 from each scenario
    trainingData.push(...calls.slice(0, 25));
  }

  // Create the dataset
  const dataset = await chanl.fineTuning.createDataset({
    name: 'Customer Service Excellence Q1 2024',
    description: 'Top performing customer service calls',
    callIds: trainingData.map(c => c.id),
    targetBehaviors: [
      'empathetic_responses',
      'problem_resolution',
      'professional_tone'
    ]
  });

  console.log(`Created dataset with ${trainingData.length} conversations`);
  return dataset;
}
Starting a Fine-Tuning Job
Navigate to Fine-Tuning in sidebar
Click “Create Training Job”
Select training dataset
Choose base model (GPT-4, Claude, etc.)
Set training parameters
Review data privacy settings
Start training
curl -X POST https://api.chanl.ai/v1/fine-tuning/jobs \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Customer Service Model v2",
"datasetId": "dataset_abc123",
"baseModel": "gpt-4",
"parameters": {
"epochs": 3,
"learningRate": 0.0001,
"batchSize": 4
},
"validationSplit": 0.2
}'
Response:
{
  "jobId": "ft_job_xyz789",
  "status": "queued",
  "estimatedCompletion": "2024-01-15T18:00:00Z",
  "trainingExamples": 100,
  "validationExamples": 25
}
Training Parameters
{
  "epochs": 3,
  // How many times to train on the full dataset.
  // More epochs = more learning, but risk of overfitting.
  // Typical: 2-4

  "learningRate": 0.0001,
  // How much to adjust the model each step.
  // Lower = more conservative, safer.
  // Typical: 0.0001-0.001

  "batchSize": 4,
  // Conversations processed together.
  // Larger = faster training, more memory.
  // Typical: 4-8

  "validationSplit": 0.2
  // Percentage of data held back for testing.
  // Typical: 0.15-0.25 (15-25%)
}
Monitoring Training Progress
# Check training status
curl https://api.chanl.ai/v1/fine-tuning/jobs/ft_job_xyz789 \
-H "Authorization: Bearer YOUR_API_KEY"
{
  "jobId": "ft_job_xyz789",
  "status": "training",
  "progress": 67,
  "currentEpoch": 2,
  "totalEpochs": 3,
  "metrics": {
    "trainingLoss": 0.23,
    "validationLoss": 0.31,
    "estimatedAccuracy": 0.89
  },
  "estimatedCompletion": "2024-01-15T17:30:00Z"
}
What the Metrics Mean
Training Loss - How well the model fits the training data. Lower is better. Target: <0.5
Validation Loss - How well the model generalizes to new data. Lower is better. Should be close to training loss.
Accuracy - Percentage of correct predictions. Higher is better. Target: >0.85
If validation loss is much higher than training loss, your model is overfitting (memorizing training data instead of learning patterns). Use more diverse training data or fewer epochs.
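You can automate that check while a job runs. The sketch below polls the job status and flags a widening gap between the two losses; the chanl.fineTuning.getJob method name and the 1.5x threshold are assumptions, but the response fields match the status payload shown above:

const chanl = require('@chanl/sdk');

// Sketch: flag a training job that looks like it is overfitting.
// getJob is an assumed method name; the fields mirror the status
// response above, and the 1.5x gap threshold is a rule of thumb.
async function checkForOverfitting(jobId) {
  const job = await chanl.fineTuning.getJob(jobId);
  const { trainingLoss, validationLoss } = job.metrics;

  if (validationLoss > trainingLoss * 1.5) {
    console.warn(
      `Possible overfitting: validation loss ${validationLoss} vs training loss ${trainingLoss}. ` +
      'Try fewer epochs or more diverse data.'
    );
  }
  return job;
}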
Testing Fine-Tuned Models
Before deploying, test against baseline:
const chanl = require('@chanl/sdk');

async function compareModels(fineTunedModelId, baselineAgentId) {
  // Create a test agent with the fine-tuned model
  const testAgent = await chanl.agents.create({
    name: 'Fine-Tuned Test Agent',
    modelId: fineTunedModelId,
    prompt: 'Use the same prompt as baseline',  // placeholder: copy your baseline prompt
    tools: ['same tools as baseline']           // placeholder: copy your baseline tools
  });

  // Run comparison scenarios against both agents
  const comparison = await chanl.scenarios.create({
    name: 'Fine-Tuned vs Baseline',
    prompt: 'Customer service scenarios',
    personas: ['polite', 'frustrated', 'confused'],
    agents: [testAgent.id, baselineAgentId],
    scorecard: 'customer-service-quality'
  });

  const results = await chanl.scenarios.waitForCompletion(comparison.id);

  const fineTunedScore = results.agents[testAgent.id].avgScore;
  const baselineScore = results.agents[baselineAgentId].avgScore;

  return {
    fineTunedScore,
    baselineScore,
    improvement: fineTunedScore - baselineScore,
    recommendation: fineTunedScore > baselineScore + 5
      ? 'Deploy fine-tuned model'
      : 'Needs more training or data'
  };
}

const results = await compareModels('model_ft_abc', 'agent_baseline');
console.log(results);
/*
{
  fineTunedScore: 91,
  baselineScore: 84,
  improvement: 7,
  recommendation: 'Deploy fine-tuned model'
}
*/
Deploying Fine-Tuned Models
Gradual Rollout
Start with a small percentage of traffic:
# Deploy to 10% of calls initially
curl -X POST https://api.chanl.ai/v1/agents/agent_abc123/model \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"modelId": "model_ft_xyz789",
"rolloutStrategy": {
"type": "percentage",
"percentage": 10
}
}'
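In practice, a gradual rollout is a loop: widen the percentage, wait, check quality, repeat. A sketch of that loop is below; chanl.agents.setModel is an assumed SDK wrapper for the endpoint above, and the stage sizes, 24-hour wait, and 85-point quality gate are illustrative choices:

const chanl = require('@chanl/sdk');

// Sketch: staged rollout with a quality gate between stages.
// setModel is an assumed wrapper for the /model endpoint above;
// stage sizes, wait time, and the score threshold are illustrative.
async function stagedRollout(agentId, modelId) {
  for (const percentage of [10, 25, 50, 100]) {
    await chanl.agents.setModel(agentId, {
      modelId,
      rolloutStrategy: { type: 'percentage', percentage }
    });
    console.log(`Rolled out to ${percentage}% of calls`);

    // Let the new split accumulate calls before judging it
    await new Promise(resolve => setTimeout(resolve, 24 * 60 * 60 * 1000));

    const stats = await chanl.agents.analytics(agentId, { timeRange: '1d' });
    if (stats.avgScore < 85) {
      console.warn('Quality gate failed; halting rollout');
      return;
    }
  }
}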
A/B Testing
Run both models simultaneously:
# 50% old model, 50% new model
curl -X POST https://api.chanl.ai/v1/agents/agent_abc123/ab-test \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"modelA": "model_baseline",
"modelB": "model_ft_xyz789",
"split": 50,
"duration": "7d"
}'
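When the test window ends, compare the two arms before committing to either model. A sketch, assuming a hypothetical chanl.agents.getAbTest method that returns per-model scores; the 5-point margin reuses the deploy threshold from the comparison example earlier:

const chanl = require('@chanl/sdk');

// Sketch: read A/B results and pick a winner. getAbTest and its
// response shape are assumptions; the 5-point margin mirrors the
// deploy threshold used in the model-comparison example above.
async function pickWinner(agentId, testId) {
  const test = await chanl.agents.getAbTest(agentId, testId);
  const { modelA, modelB } = test.results;

  return modelB.avgScore > modelA.avgScore + 5
    ? { winner: modelB.id, deploy: true }
    : { winner: modelA.id, deploy: false };
}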
Monitoring After Deployment
const chanl = require('@chanl/sdk');

// Monitor performance of the fine-tuned model vs the baseline
const performance = await chanl.agents.analytics('agent_abc123', {
  compareModels: ['model_baseline', 'model_ft_xyz789'],
  timeRange: '7d',
  metrics: ['avgScore', 'successRate', 'escalationRate']
});

console.log('Performance Comparison:');
console.log('Baseline:', performance.models.model_baseline);
console.log('Fine-tuned:', performance.models.model_ft_xyz789);

if (performance.models.model_ft_xyz789.avgScore < performance.models.model_baseline.avgScore) {
  console.log('⚠️ Fine-tuned model underperforming. Consider rollback.');
}
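If the numbers stay bad, rolling back is just reassigning the baseline model at full traffic. A sketch using the same assumed setModel wrapper from the rollout section:

// Sketch: roll back by pointing the agent at the baseline model
// again. setModel is the same assumed wrapper used above.
await chanl.agents.setModel('agent_abc123', {
  modelId: 'model_baseline',
  rolloutStrategy: { type: 'percentage', percentage: 100 }
});
console.log('Rolled back to baseline model');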
Continuous Improvement
Collecting More Data
Keep training on new high-performing conversations:
const chanl = require('@chanl/sdk');

// Automated training pipeline
async function continuousImprovement() {
  // Each month, collect new high-scoring calls
  const newCalls = await chanl.callLogs.list({
    minScore: 90,
    startDate: '2024-01-01',
    endDate: '2024-01-31'
  });

  // Add them to the existing dataset
  await chanl.fineTuning.updateDataset('dataset_abc123', {
    addCallIds: newCalls.map(c => c.id)
  });

  // Retrain, starting from the previous fine-tuned model
  const newJob = await chanl.fineTuning.createJob({
    name: `Customer Service Model ${new Date().toISOString().slice(0, 7)}`,
    datasetId: 'dataset_abc123',
    baseModel: 'model_ft_xyz789',
    parameters: {
      epochs: 2,
      learningRate: 0.00005 // lower rate for refinement
    }
  });

  return newJob;
}
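To actually run this on a cadence, wire continuousImprovement() to any scheduler. One option is node-cron (the package choice and schedule are assumptions; a CI cron or cloud scheduler works just as well):

const cron = require('node-cron');

// Run the pipeline at 00:00 on the 1st of every month.
cron.schedule('0 0 1 * *', async () => {
  try {
    const job = await continuousImprovement();
    console.log(`Monthly retraining started: ${job.jobId}`);
  } catch (err) {
    console.error('Retraining failed to start:', err);
  }
});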
Fine-Tuning Use Cases
Customer Service Excellence
// Train the model to handle frustrated customers better
const dataset = await chanl.fineTuning.createDataset({
  name: 'De-escalation Training',
  callIds: await chanl.callLogs.search({
    tags: ['frustrated', 'angry'],
    minScore: 88,
    outcome: 'resolved'
  }),
  targetBehaviors: [
    'acknowledge_emotion_first',
    'apologize_when_appropriate',
    'focus_on_solution',
    'never_defensive'
  ]
});
Sales Optimization
// Train the model on successful sales conversations
const dataset = await chanl.fineTuning.createDataset({
  name: 'High-Converting Sales Calls',
  callIds: await chanl.callLogs.search({
    outcome: 'sale',
    minScore: 85
  }),
  targetBehaviors: [
    'needs_discovery',
    'objection_handling',
    'value_proposition',
    'closing_techniques'
  ]
});
Compliance & Accuracy
// Train the model to follow compliance requirements perfectly
const dataset = await chanl.fineTuning.createDataset({
  name: 'Compliance Perfect Calls',
  callIds: await chanl.callLogs.search({
    scorecard: 'compliance-tcpa',
    minScore: 98
  }),
  targetBehaviors: [
    'required_disclosures',
    'consent_collection',
    'policy_adherence'
  ]
});
Best Practices
Start with Prompt Optimization
Fine-tuning is powerful but slow. Get prompts working well first, then fine-tune for the extra edge.
Collect Diverse Examples
Don’t just train on one type of conversation. Mix scenarios, personas, and outcomes.
Use Enough Data
Minimum 100 conversations for meaningful results. 500+ is better. More diverse data beats more of the same.
Test Thoroughly Before Deploying
Run extensive scenarios comparing fine-tuned vs baseline. Look for any regressions in edge cases.
Monitor in Production
Watch real performance closely for the first week after deployment. Be ready to roll back if needed.
Retrain Regularly
Every month or quarter, add new high-quality conversations and retrain. Models improve with fresh data.
Data Privacy & Security
Important: Fine-tuning uses real conversation data. Ensure:
PII is removed or anonymized
You have rights to use the data
Customers consented (if required)
Data is encrypted at rest and in transit
Compliance with GDPR, CCPA, etc.
Chanl automatically removes common PII during training:
Credit card numbers
Social security numbers
Email addresses
Phone numbers
Specific account numbers
But always review your data before training.
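One lightweight way to review is to scan transcripts for obvious PII patterns before adding calls to a dataset. A minimal sketch with plain regexes; these catch only easy cases and are a spot check, not a substitute for the automatic scrubbing above or a real PII scanner:

// Sketch: flag transcripts that still contain obvious PII.
// These regexes catch only simple US-style patterns; treat this
// as a spot check, not a replacement for a real PII scanner.
const PII_PATTERNS = {
  creditCard: /\b(?:\d[ -]?){13,16}\b/,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  email: /[\w.+-]+@[\w-]+\.[\w.]+/,
  phone: /\b\d{3}[ -.]?\d{3}[ -.]?\d{4}\b/
};

function findPii(transcript) {
  return Object.entries(PII_PATTERNS)
    .filter(([, pattern]) => pattern.test(transcript))
    .map(([kind]) => kind);
}

console.log(findPii('Card 4111 1111 1111 1111, reach me at a@b.com'));
// => ['creditCard', 'email']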
Troubleshooting
Training failing or stuck
Problem: Training job not completing
Solutions:
Check dataset has at least 50 conversations
Verify all calls in dataset are accessible
Reduce batch size if hitting memory limits
Contact support if stuck for >24 hours
Model performing worse than baseline
Model too similar to baseline
Problem: No measurable improvement
Solutions:
Use more training examples (aim for 200+)
Increase learning rate slightly
Add more epochs (try 4-5)
Ensure training data is different from what base model already does well
Model behaving inconsistently
Problem: Works well sometimes, poorly at others
Solutions:
Training data may have conflicting examples
Review dataset for contradictory conversations
Add more examples of the edge cases
Consider separate models for different use cases
Cost Considerations
Fine-tuning costs depend on:
Training data size - More conversations = higher cost
Base model - GPT-4 more expensive than GPT-3.5
Training duration - More epochs = more compute time
Inference - Fine-tuned models may cost more per call
Typical costs:
Training: $50-500 per job depending on size
Inference: 10-50% more per call than base model
ROI: Usually positive if improvement >5% on key metrics
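A quick back-of-envelope check makes the ROI claim concrete. Every figure below is an assumption; substitute your own call volume, rates, and the dollar value you assign to a score point:

// Toy break-even estimate; all numbers are assumptions.
const trainingCost = 300;         // one-off, mid-range job
const callsPerMonth = 10000;
const extraCostPerCall = 0.002;   // ~20% premium on a $0.01 call
const valuePerScorePoint = 0.001; // per call, e.g. fewer escalations
const scoreImprovement = 7;       // from the comparison example above

const monthlyExtraCost = callsPerMonth * extraCostPerCall;                       // $20
const monthlyExtraValue = callsPerMonth * scoreImprovement * valuePerScorePoint; // $70
console.log(`Break-even in ~${(trainingCost / (monthlyExtraValue - monthlyExtraCost)).toFixed(1)} months`); // ~6.0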
What’s Next?