Self-Healing Software: How We Use LLMs to Automatically Diagnose and Alert on Production Errors
A practical guide to building software that doesn't just log errors—it understands them
The Problem: Error Logs That Nobody Reads
Every engineering leader knows this pain: your production systems generate thousands of error logs daily. Your developers spend hours sifting through stack traces, trying to understand what went wrong and why. Meanwhile, your business stakeholders ask questions you can't easily answer:
- "Why did that customer transaction fail?"
- "Is this a new issue or something we've seen before?"
- "What's the business impact of this error?"
- "Do we need to wake someone up at 2 AM for this?"
Traditional logging gives you the what (an error occurred) but rarely the why (root cause) or the so what (business impact). Your error logs become write-only databases—constantly growing, rarely read, never truly understood.
The cost is real: According to our analysis, engineering teams spend 20-30% of their time on error investigation and debugging. For a team of 10 engineers at $150K/year each, that's $300K-450K annually just trying to understand what went wrong.
The Solution: Let AI Read Your Errors
What if every error in your system was automatically analyzed by an AI that could:
- Understand the root cause based on stack traces, request context, and system state
- Assess business impact by connecting technical failures to user-facing consequences
- Suggest remediation with specific, actionable next steps
- Alert the right people with context-rich notifications instead of cryptic stack traces
- Learn patterns to identify recurring issues and predict future failures
This isn't science fiction. We've built this at Demeterics, and it's running in production today.
The results:
- 80% reduction in time-to-diagnosis for production errors
- Near-zero false positives on critical alerts (compared to 40%+ with threshold-based alerting)
- Automatic documentation of every error with root cause analysis
- Cost: ~$0.0001 per error analyzed (yes, one-hundredth of a cent)
How It Works: The Business View
The architecture is surprisingly simple:
- Error occurs in your application (API timeout, database deadlock, payment failure, etc.)
- Context captured automatically (request details, user impact, system state)
- LLM analyzes the error in real-time using a specialized AI model
- Alert sent to your team via email with human-readable explanation
- Issue tracked in your system with full analysis and remediation suggestions
The beauty is that this happens asynchronously—your application performance is unaffected. The error is logged normally, then a background task asks the AI "what happened and why?"
Cost efficiency: We use Groq's meta-llama/llama-4-scout-17b-16e-instruct model, which processes errors at $0.10 per million tokens. A typical error analysis consumes ~1,000 tokens (the error context and analysis), making each analysis cost about $0.0001. Even at 10,000 errors per day, that's just $1/day or $365/year.
Compare this to the cost of engineers manually investigating errors: if this system saves just 2 hours per week (very conservative), that's $15K-20K/year in engineering time saved.
Technical Deep Dive Begins Here
The following sections are technical and include code examples. If you're an engineering leader looking for implementation guidance for your team, keep reading. If you want to stay at the business level, feel free to skip to the "Getting Started" section at the end.
Architecture: The Technical View
Our self-healing architecture has four core components:
1. Error Context Capture
Every error handler wraps the core logic with an LLM-powered logger:
// src/api/widget_chat.go (example)
func HandleWidgetChat(cfg *config.Config, ds *data.DataStore,
	feedbackRepo *feedback.Repository,
	emailService *email.Mailer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()

		// Initialize LLM logging with email + Feedback integration
		var llmLog *common.LoggingLLM
		if feedbackRepo != nil && emailService != nil {
			callback := CreateFeedbackCallbackWithEmail(
				feedbackRepo,
				emailService,
				"src/api/widget_chat.go",
				"HandleWidgetChat",
				"Widget chat error",
				"system@demeterics.ai",
				"patrick@bluefermion.com",
			)
			llmLog = common.CreateLoggingLLMWithCallback(
				"src/api/widget_chat.go",
				"HandleWidgetChat",
				callback,
				"Processing widget chat request",
			)
		}
		defer llmLog.Print()

		// Your handler logic here
		// Errors, warnings, and debug info are automatically captured
		llmLog.Info("Processing request for domain=%s", domain)

		if err := validateRequest(r); err != nil {
			llmLog.Error("Request validation failed: %v", err)
			http.Error(w, "Invalid request", http.StatusBadRequest)
			return
		}

		// ... rest of handler (ctx, domain, and validateRequest come from the omitted handler code)
	}
}
Key insight: The defer llmLog.Print() pattern ensures that even if the handler panics, the error context is captured and analyzed.
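Deferred functions still run while a panic unwinds, so Print() fires either way. If you also want the panic value itself recorded as an ERROR entry before the analysis is triggered, one option is to register a recover defer after it. A minimal sketch (the recover wrapper is our own addition, not part of the library):

defer llmLog.Print()
// Deferred last, so it runs first during a panic: the panic value is
// logged as an ERROR entry, then Print() sees it and triggers analysis.
defer func() {
	if rec := recover(); rec != nil {
		llmLog.Error("panic recovered: %v", rec)
		http.Error(w, "Internal server error", http.StatusInternalServerError)
	}
}()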
2. Intelligent Callback System
The callback function orchestrates the error response workflow:
// src/api/llm_feedback.go
func CreateFeedbackCallbackWithEmail(
	feedbackRepo *feedback.Repository,
	emailService *email.Mailer,
	fileName, funcName, errorMsg, userEmail, adminEmail string,
) common.AnalysisCallback {
	return func(analysis string) error {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		// 1. Send email alert to admin FIRST
		if emailService != nil {
			if err := sendAdminErrorAlert(ctx, emailService, adminEmail,
				fileName, funcName, errorMsg, analysis); err != nil {
				log.Warn("Failed to send admin error alert email",
					"admin_email", adminEmail, "error", err)
				// Continue to create Feedback even if email fails
			} else {
				log.Info("Sent admin error alert email",
					"admin_email", adminEmail,
					"function", funcName)
			}
		}

		// 2. Create Feedback entity for issue tracking
		if feedbackRepo != nil {
			title := fmt.Sprintf("Self-Identified Error: %s.%s",
				funcName, extractFileName(fileName))
			description := fmt.Sprintf(
				"**Error Message:** %s\n\n"+
					"**Source:** %s in %s\n\n"+
					"**LLM Analysis:**\n%s",
				errorMsg, funcName, fileName, analysis,
			)

			fb := feedback.NewFeedback(title, description, "bug")
			fb.Analysis = analysis
			fb.Priority = "high"
			fb.Status = "new"
			fb.Source = "llm-self-analysis"

			id, err := feedbackRepo.Create(ctx, fb)
			if err != nil {
				return fmt.Errorf("failed to create feedback: %w", err)
			}
			log.Info("Created feedback entry from LLM analysis",
				"feedback_id", id, "title", title)
		}

		return nil
	}
}
Why email first? In production, you want humans to know about critical errors immediately. Email is delivered in seconds, while issue tracking updates might be batched or delayed. This "alert first, track second" pattern ensures no critical error goes unnoticed.
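The sendAdminErrorAlert helper referenced above is just a formatting wrapper. A minimal sketch (not the actual implementation; the Send(to, subject, body) signature is assumed, matching the Phase 2 example later in this post):

// Sketch: format a subject and body from the error context and hand it
// to the mailer. ctx is accepted to match the call site; thread it into
// your mail client if it supports contexts.
func sendAdminErrorAlert(ctx context.Context, mailer *email.Mailer,
	adminEmail, fileName, funcName, errorMsg, analysis string) error {
	subject := fmt.Sprintf("[HIGH] %s in %s", errorMsg, funcName)
	body := fmt.Sprintf(
		"Source: %s (%s)\n\nError: %s\n\nLLM Analysis:\n\n%s",
		funcName, fileName, errorMsg, analysis,
	)
	return mailer.Send(adminEmail, subject, body)
}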
3. Asynchronous LLM Analysis
The magic happens in the github.com/patdeg/common package's LoggingLLM:
// From github.com/patdeg/common/logging_llm.go (simplified)
type LoggingLLM struct {
	FileName    string
	FuncName    string
	Description string
	Callback    AnalysisCallback
	StartTime   time.Time
	Entries     []LogEntry
	mu          sync.Mutex
}

func (l *LoggingLLM) Print() {
	if l == nil {
		return
	}

	duration := time.Since(l.StartTime)

	// Check if there are any errors or warnings
	hasErrors := false
	for _, entry := range l.Entries {
		if entry.Level == "ERROR" || entry.Level == "WARN" {
			hasErrors = true
			break
		}
	}

	// If errors found, trigger LLM analysis asynchronously
	if hasErrors && l.Callback != nil {
		go l.performAnalysis()
	}

	// Always print summary to stdout
	fmt.Printf("[LLM LOG] %s.%s completed in %v (errors=%v)\n",
		l.FileName, l.FuncName, duration, hasErrors)
}

func (l *LoggingLLM) performAnalysis() {
	// Build context from log entries
	context := l.buildAnalysisContext()

	// Call LLM API (Groq in our case)
	analysis, err := callGroqAPI(context)
	if err != nil {
		Error("LLM analysis failed: %v", err)
		return
	}

	// Execute callback with analysis
	if l.Callback != nil {
		if err := l.Callback(analysis); err != nil {
			Error("LLM callback failed: %v", err)
		}
	}
}
Performance impact: Zero. The analysis happens in a goroutine after the response is sent. Your API latency is unaffected.
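The context builder itself is unremarkable; here is a simplified sketch of buildAnalysisContext that flattens the captured log entries into the CONTEXT block of the prompt shown in the next section (the Time and Message fields on LogEntry are assumptions; only Level appears in the excerpt above):

// Sketch: serialize the captured entries into a plain-text prompt context.
// Uses only fmt and strings from the standard library.
func (l *LoggingLLM) buildAnalysisContext() string {
	l.mu.Lock()
	defer l.mu.Unlock()

	var b strings.Builder
	fmt.Fprintf(&b, "FILE: %s\nFUNCTION: %s\nCONTEXT:\n", l.FileName, l.FuncName)
	for _, entry := range l.Entries {
		fmt.Fprintf(&b, "[%s] %s: %s\n",
			entry.Time.Format("2006-01-02 15:04:05"), entry.Level, entry.Message)
	}
	return b.String()
}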
4. Secret Manager Integration
Production systems need secure API key management. We use Google Cloud Secret Manager:
// src/common/config/config.go (simplified)
func Load() (*Config, error) {
	cfg := &Config{
		SendGridAPIKey:            getEnv("SENDGRID_API_KEY", ""),
		DemeterInternalGroqAPIKey: getEnv("DEMETER_INTERNAL_GROQ_API_KEY", ""),
		// ... other config
	}

	// Load from Secret Manager in non-dev environments
	if cfg.SendGridAPIKey == "" && cfg.Environment != "development" {
		ctx := context.Background()
		if secret, err := getSecret(ctx, cfg.ProjectID, "sendgrid-api-key"); err == nil {
			cfg.SendGridAPIKey = secret
		}
	}
	if cfg.DemeterInternalGroqAPIKey == "" && cfg.Environment != "development" {
		ctx := context.Background()
		if secret, err := getSecret(ctx, cfg.ProjectID, "internal-groq-api-key"); err == nil {
			cfg.DemeterInternalGroqAPIKey = secret
		}
	}

	return cfg, nil
}
Security best practice: Never hardcode API keys. Use Secret Manager in production, environment variables in development.
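For completeness, one possible implementation of the getSecret helper called above, using the official Google Cloud client (packages cloud.google.com/go/secretmanager/apiv1 and its secretmanagerpb subpackage); error handling is kept minimal, and creating a client per call is fine for one-time config loading:

// Sketch: read the latest version of a secret from Secret Manager.
func getSecret(ctx context.Context, projectID, name string) (string, error) {
	client, err := secretmanager.NewClient(ctx)
	if err != nil {
		return "", err
	}
	defer client.Close()

	resp, err := client.AccessSecretVersion(ctx, &secretmanagerpb.AccessSecretVersionRequest{
		Name: fmt.Sprintf("projects/%s/secrets/%s/versions/latest", projectID, name),
	})
	if err != nil {
		return "", err
	}
	return string(resp.Payload.Data), nil
}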
The LLM Prompt: Teaching AI to Debug
The quality of analysis depends entirely on the prompt. Here's what works for us:
You are an expert software engineer analyzing a production error.
FILE: src/api/widget_chat.go
FUNCTION: HandleWidgetChat
ERROR: Widget chat error
CONTEXT:
[2025-11-18 14:32:15] INFO: Processing request for domain=example.com
[2025-11-18 14:32:15] INFO: Validating widget agent configuration
[2025-11-18 14:32:15] ERROR: Widget agent not found for domain: example.com
[2025-11-18 14:32:15] ERROR: Request validation failed: widget agent not configured
Analyze this error and provide:
1. Root cause (why did this happen?)
2. Business impact (what's affected?)
3. Remediation steps (how to fix?)
4. Prevention (how to avoid in future?)
Be specific and actionable. Format as markdown.
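Sending that prompt to Groq is a plain HTTPS call. Here is a minimal sketch of what callGroqAPI could look like (the real helper isn't shown in this post); it assumes Groq's OpenAI-compatible chat completions endpoint and uses only the standard library (bytes, encoding/json, fmt, net/http, os):

// Sketch: POST the prompt to Groq and return the first completion.
func callGroqAPI(prompt string) (string, error) {
	body, _ := json.Marshal(map[string]interface{}{
		"model": "meta-llama/llama-4-scout-17b-16e-instruct",
		"messages": []map[string]string{
			{"role": "user", "content": prompt},
		},
	})

	req, err := http.NewRequest("POST",
		"https://api.groq.com/openai/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	// In production the key comes from config/Secret Manager, not the env var directly.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("DEMETER_INTERNAL_GROQ_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("empty response from Groq")
	}
	return out.Choices[0].Message.Content, nil
}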
The LLM response looks like this:
## Root Cause
The widget chat request failed because no WidgetAgent entity exists in
Datastore for domain `example.com`. This is a configuration error, not
a code bug.
## Business Impact
- **Severity:** Medium
- **User Impact:** Widget chat is unavailable for example.com visitors
- **Revenue Impact:** Potential lost conversions if chat is primary support channel
- **Affected Users:** All visitors to example.com
## Remediation Steps
1. Check if `example.com` is a valid customer domain
2. If yes, create WidgetAgent entity via admin UI or API:
   POST /api/v1/admin/widget/agents
   {
     "domain": "example.com",
     "provider": "groq",
     "model": "llama-3.1-8b-instant",
     "enabled": true
   }
3. Verify configuration: `GET /api/v1/admin/widget/agents?domain=example.com`
## Prevention
- Add validation in widget agent creation flow to prevent orphaned domains
- Implement "pending configuration" state for new customer onboarding
- Add monitoring alert for widget_agent_not_found errors by domain
- Consider auto-creating default WidgetAgent on customer signup
This is what your engineers see in their inbox instead of:
ERROR: widget agent not found
at src/api/widget_chat.go:142
Real-World Impact: A Case Study
The Error: BigQuery streaming insert timeout during high-traffic period
Traditional Response:
- Engineer gets paged with cryptic error (2 AM)
- Logs into production, searches logs (20 minutes)
- Identifies BigQuery timeout pattern (30 minutes)
- Realizes it's a quota issue, not a bug (15 minutes)
- Requests quota increase, waits for approval (next day)
- Documents the issue for future reference (1 hour)
Total time: ~2 hours immediate, plus next-day follow-up
Self-Healing Response:
- Error occurs, LLM analyzes in 2 seconds
- Email sent to on-call engineer:
  Subject: [HIGH] BigQuery timeout in ingest handler
  Root Cause: BigQuery streaming insert quota exceeded
  Business Impact: 127 user interactions lost in 5-minute window
  Fix: Request quota increase to 100K rows/sec
  Prevention: Implement exponential backoff and circuit breaker
- Engineer requests quota increase immediately (5 minutes)
- Issue auto-documented in tracking system
Total time: ~5 minutes, no debugging required
ROI: The system paid for itself after preventing 3 late-night debugging sessions.
Getting Started: Implementation Guide
Phase 1: Basic Integration (1-2 days)
1. Add the logging library

go get github.com/patdeg/common

2. Wrap your first handler

import "github.com/patdeg/common"

func MyHandler(w http.ResponseWriter, r *http.Request) {
	llmLog := common.CreateLoggingLLM(
		"handlers/my_handler.go",
		"MyHandler",
		"Processing user request",
	)
	defer llmLog.Print()

	llmLog.Info("Request received from %s", r.RemoteAddr)

	if err := processRequest(r); err != nil {
		llmLog.Error("Processing failed: %v", err)
		http.Error(w, "Internal error", 500)
		return
	}

	llmLog.Info("Request processed successfully")
}

3. Set up API keys

# For local development
# EXAMPLE ONLY - Replace with your actual API key
export DEMETER_INTERNAL_GROQ_API_KEY="gsk_example_fake_key_replace_with_real"

# For production (use Secret Manager)
gcloud secrets create internal-groq-api-key \
  --data-file=<(echo -n "YOUR_ACTUAL_API_KEY_HERE") \
  --replication-policy=automatic

4. Test it

- Trigger an error in your handler
- Check stdout for LLM analysis
- Verify the analysis makes sense
Phase 2: Email Alerting (1 day)
1. Set up SendGrid (or your email provider)

# EXAMPLE ONLY - Replace with your actual SendGrid API key
export SENDGRID_API_KEY="SG.example_fake_key.replace_with_real_key"

2. Add email callback

func MyHandler(emailService *email.Mailer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		callback := func(analysis string) error {
			return emailService.Send(
				"admin@yourcompany.com",
				"Error in MyHandler",
				fmt.Sprintf("Analysis:\n%s", analysis),
			)
		}

		llmLog := common.CreateLoggingLLMWithCallback(
			"handlers/my_handler.go",
			"MyHandler",
			callback,
			"Processing user request",
		)
		defer llmLog.Print()

		// ... rest of handler
	}
}
Phase 3: Issue Tracking Integration (2-3 days)
1. Create a Feedback/Issue entity

type Feedback struct {
	ID          string    `json:"id"`
	Title       string    `json:"title"`
	Description string    `json:"description"`
	Analysis    string    `json:"analysis"`
	Priority    string    `json:"priority"`
	Status      string    `json:"status"`
	Source      string    `json:"source"`
	CreatedAt   time.Time `json:"created_at"`
}

2. Wire up the full callback

callback := CreateFeedbackCallbackWithEmail(
	feedbackRepo,
	emailService,
	"handlers/my_handler.go",
	"MyHandler",
	"Handler error",
	"system@yourcompany.com",
	"oncall@yourcompany.com",
)

3. Build a dashboard to view LLM-analyzed errors
Phase 4: Rollout to All Handlers (1-2 weeks)
Prioritization matrix (from our actual implementation):
| Priority | Handler Type | Why | Examples |
|---|---|---|---|
| P0 | Revenue-critical | Errors = lost money | Payment webhooks, checkout |
| P1 | User-facing APIs | Errors = bad UX | REST endpoints, GraphQL |
| P2 | Background jobs | Errors = data inconsistency | Cron jobs, task queues |
| P3 | Admin/internal | Errors = ops friction | Admin dashboards, internal tools |
Start with P0 handlers, prove value, then roll out systematically.
Cost Analysis: Real Numbers
Our production deployment (Demeterics API):
- Handlers instrumented: 23 (across 15 files)
- Errors per day: ~50-200 (depends on traffic)
- LLM API calls per day: ~50-200 (only when errors occur)
- Cost per analysis: ~$0.0001
- Monthly cost: $1.50-6.00 (at current error rate)
- Engineering time saved: ~8 hours/week (conservative estimate)
- ROI: ~50,000% (yes, fifty thousand percent)
Why so cheap?
- We use Groq's ultra-fast inference ($0.10 per million tokens)
- Only analyze errors, not every request
- Error context is small (usually <1000 tokens)
- Analysis is done async (no retries on transient failures)
Scaling: Even at 10,000 errors/day (enterprise scale), monthly cost is ~$30. Compare this to one engineer's hourly rate.
Challenges and Lessons Learned
Challenge 1: Alert Fatigue
Problem: Early version analyzed every warning, flooding inboxes
Solution: Only trigger on ERROR level, not WARN. Added throttling to deduplicate similar errors within 1-hour windows
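A simple in-memory throttle was enough for the deduplication; roughly (names here are illustrative, not the exact production code):

var (
	alertMu       sync.Mutex
	lastAlertSent = map[string]time.Time{} // dedup key -> time of last alert
)

// shouldAlert returns true at most once per hour for a given file/function/message.
func shouldAlert(fileName, funcName, errorMsg string) bool {
	key := fileName + "|" + funcName + "|" + errorMsg

	alertMu.Lock()
	defer alertMu.Unlock()

	if t, ok := lastAlertSent[key]; ok && time.Since(t) < time.Hour {
		return false // duplicate within the 1-hour window
	}
	lastAlertSent[key] = time.Now()
	return true
}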
Challenge 2: Context Overload
Problem: Sending entire stack traces to LLM exceeded token limits
Solution: Smart truncation—send last 50 log lines, include request metadata, skip verbose debug output
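The truncation logic is roughly this (helper name is illustrative; LogEntry.Level is the only field taken from the excerpt above):

const maxContextLines = 50

// truncateEntries drops verbose DEBUG lines and keeps only the most recent
// entries before the prompt is built.
func truncateEntries(entries []LogEntry) []LogEntry {
	kept := make([]LogEntry, 0, len(entries))
	for _, e := range entries {
		if e.Level == "DEBUG" {
			continue // skip verbose debug output
		}
		kept = append(kept, e)
	}
	if len(kept) > maxContextLines {
		kept = kept[len(kept)-maxContextLines:] // last 50 lines only
	}
	return kept
}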
Challenge 3: False Positives
Problem: LLM sometimes hallucinated fixes for non-issues
Solution: Improved prompt engineering with explicit instructions: "Only suggest fixes if you're confident. Say 'insufficient context' if unclear."
Challenge 4: PII Leakage
Problem: Error logs sometimes contained user emails, API keys
Solution: Implemented SafeError() method that redacts sensitive patterns before sending to LLM:
// Compile the redaction patterns once at package level (note: inside a
// character class, [A-Za-z] is correct; a literal "|" would match pipes).
var (
	emailPattern  = regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`)
	apiKeyPattern = regexp.MustCompile(`\b(sk_|pk_|dmt_)[A-Za-z0-9]{20,}\b`)
)

func SafeError(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	// Redact email addresses
	msg = emailPattern.ReplaceAllString(msg, "[EMAIL_REDACTED]")
	// Redact API keys
	msg = apiKeyPattern.ReplaceAllString(msg, "[API_KEY_REDACTED]")
	// Pass the redacted message as a value, not a format string
	llmLog.Error("%s", msg)
}
Challenge 5: Production Debugging
Problem: How do you debug the error analysis system when it fails?
Solution: The system is self-healing! Errors in the LLM analysis callback are caught and logged normally. We also added a "dry-run mode" environment variable for testing.
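The dry-run check is just an environment-variable guard at the top of the callback; a sketch (the variable name here is illustrative):

// Dry-run mode: run the LLM analysis but skip email and issue creation.
if os.Getenv("LLM_ANALYSIS_DRY_RUN") == "true" {
	log.Info("Dry-run: skipping alert", "analysis", analysis)
	return nil
}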
What We'd Do Differently
If we were starting today:
- Start with observability - We should have instrumented success cases too, not just errors. This would help the LLM understand "normal" vs "abnormal" behavior.
- Build the dashboard first - We added issue tracking integration late. Should have been day one. Engineers love dashboards.
- Implement cost controls earlier - Add per-handler budget limits to prevent runaway costs if something goes wrong (e.g., an error loop).
- Add A/B testing - We should have kept traditional alerting running in parallel for 2 weeks to compare MTTR (Mean Time To Resolution).
- Document the prompt - We iterated on the LLM prompt 20+ times. Should have version-controlled it and documented why each change was made.
The Future: Truly Self-Healing Systems
This is just the beginning. Here's where we're headed:
Auto-Remediation (In Progress)
Instead of just alerting, execute fixes automatically:
if analysis.Confidence > 0.95 && analysis.RemediationType == "ConfigChange" {
	// LLM is confident this is a config issue with known fix
	if err := applyConfigFix(analysis.SuggestedFix); err == nil {
		notify("Auto-remediated: " + analysis.RootCause)
	}
}
Use case: Database connection pool exhausted → auto-scale pool size
Predictive Alerting (Planned)
Use LLM to spot patterns before errors cascade:
WARNING: 3 API timeouts in 2 minutes (usually precedes quota error)
PREDICTION: BigQuery quota will be exceeded in ~15 minutes
RECOMMENDATION: Preemptively reduce write rate or request quota increase
Cost Attribution (Planned)
Connect errors to business metrics:
ERROR COST ANALYSIS:
- 47 checkout failures in last hour
- Estimated revenue loss: $2,847 (avg cart value $60.57)
- Root cause: Payment gateway timeout
- Fix priority: CRITICAL (revenue impact)
Multi-LLM Consensus (Research)
Use 3 different LLMs to analyze the same error, compare responses:
Claude: "Database deadlock due to concurrent user updates"
GPT-4: "Race condition in user profile update transaction"
Llama: "Transaction isolation issue causing lock contention"
CONSENSUS: Database concurrency issue (confidence: HIGH)
RECOMMENDED FIX: Add optimistic locking with retry logic
Conclusion: The End of "Works on My Machine"
Self-healing software isn't about replacing engineers—it's about amplifying them. Your senior engineers shouldn't be woken up at 3 AM to debug a configuration error that an LLM can diagnose in 2 seconds.
The paradigm shift:
- Before: Errors are discovered by users, investigated by engineers, documented manually
- After: Errors are caught instantly, analyzed automatically, remediated proactively
This is possible today with commodity LLM APIs and a few hundred lines of code. The ROI is measured in thousands of percent. The implementation time is measured in days, not months.
We've open-sourced the core logging framework at github.com/patdeg/common. Our full implementation guide is at github.com/bluefermion/demeterics-api (private repo, but DM me for access if you're serious about implementing this).
Start small:
- Instrument your most critical handler (1 hour)
- Set up LLM analysis (30 minutes)
- Add email alerting (1 hour)
- Watch the magic happen
The future of software engineering isn't writing more tests or adding more monitoring dashboards. It's building systems that understand themselves and heal themselves.
Your production errors are talking to you. Are you listening?
Appendix: Complete Code Example
Here's a fully working example you can drop into any Go HTTP server:
package main

import (
	"fmt"
	"net/http"
	"os"

	"github.com/patdeg/common"
)

// Simple email sender (replace with your email service)
func sendEmail(to, subject, body string) error {
	// In production, use SendGrid, AWS SES, etc.
	fmt.Printf("EMAIL TO: %s\nSUBJECT: %s\n%s\n", to, subject, body)
	return nil
}

// Error handler with LLM analysis
func ProtectedHandler(w http.ResponseWriter, r *http.Request) {
	// Create callback that sends email when errors occur
	callback := func(analysis string) error {
		return sendEmail(
			"admin@yourcompany.com",
			"Error in ProtectedHandler",
			fmt.Sprintf("LLM Analysis:\n\n%s", analysis),
		)
	}

	// Initialize LLM logger with callback
	llmLog := common.CreateLoggingLLMWithCallback(
		"main.go",
		"ProtectedHandler",
		callback,
		"Processing HTTP request",
	)
	defer llmLog.Print()

	// Your actual handler logic
	llmLog.Info("Request from %s", r.RemoteAddr)

	// Simulate an error
	userID := r.URL.Query().Get("user_id")
	if userID == "" {
		llmLog.Error("Missing required parameter: user_id")
		http.Error(w, "user_id is required", http.StatusBadRequest)
		return
	}

	// Simulate database error
	if userID == "999" {
		llmLog.Error("Database connection timeout when fetching user_id=%s", userID)
		llmLog.Error("Connection pool exhausted (10/10 connections active)")
		http.Error(w, "Internal server error", http.StatusInternalServerError)
		return
	}

	llmLog.Info("Successfully processed request for user_id=%s", userID)
	fmt.Fprintf(w, "Success for user %s\n", userID)
}

func main() {
	// Set up Groq API key
	if os.Getenv("DEMETER_INTERNAL_GROQ_API_KEY") == "" {
		fmt.Println("Warning: DEMETER_INTERNAL_GROQ_API_KEY not set, LLM analysis disabled")
	}

	http.HandleFunc("/api/protected", ProtectedHandler)

	fmt.Println("Server starting on :8080")
	fmt.Println("Try: curl http://localhost:8080/api/protected")
	fmt.Println("Try: curl http://localhost:8080/api/protected?user_id=999")

	if err := http.ListenAndServe(":8080", nil); err != nil {
		panic(err)
	}
}
To run:
export DEMETER_INTERNAL_GROQ_API_KEY="your-key-here"
go run main.go
# In another terminal:
curl "http://localhost:8080/api/protected?user_id=999"
Output:
[LLM LOG] main.go.ProtectedHandler completed in 45ms (errors=true)
EMAIL TO: admin@yourcompany.com
SUBJECT: Error in ProtectedHandler
LLM Analysis:
## Root Cause
Database connection pool exhaustion (10/10 connections active) caused
timeout when attempting to fetch user_id=999.
## Business Impact
- Severity: HIGH
- User Impact: Request failed with 500 error
- Scope: Likely affecting all concurrent users if pool is full
## Remediation
1. Immediate: Increase connection pool size from 10 to 20
2. Short-term: Add connection pool monitoring/alerting
3. Long-term: Implement connection pooling with overflow handling
## Prevention
- Set max_idle_connections lower to detect leaks faster
- Add query timeout (currently unbounded)
- Implement circuit breaker pattern for database calls
About the Author: Patrick Deglon is the founder of Demeterics, an LLM observability platform that helps engineering teams understand and optimize their AI systems. He previously built production ML systems at Google and has been using LLMs to debug LLMs since GPT-3. Find him on Twitter @patdeglon or at demeterics.com.
If you found this useful, please share it with your engineering team. And if you implement this in your own systems, I'd love to hear about it—DM me with your results!