Self-Healing Software: How We Use LLMs to Automatically Diagnose and Alert on Production Errors
A practical guide to building software that doesn't just log errors—it understands them
The Problem: Error Logs That Nobody Reads
Every engineering leader knows this pain: your production systems generate thousands of error logs daily. Your developers spend hours sifting through stack traces, trying to understand what went wrong and why. Meanwhile, your business stakeholders ask questions you can't easily answer:
- "Why did that customer transaction fail?"
- "Is this a new issue or something we've seen before?"
- "What's the business impact of this error?"
- "Do we need to wake someone up at 2 AM for this?"
Traditional logging gives you the what (an error occurred) but rarely the why (root cause) or the so what (business impact). Your error logs become write-only databases—constantly growing, rarely read, never truly understood.
The cost is real: According to our analysis, engineering teams spend 20-30% of their time on error investigation and debugging. For a team of 10 engineers at $150K/year each, that's $300K-450K annually just trying to understand what went wrong.
The Solution: Let AI Read Your Errors
What if every error in your system was automatically analyzed by an AI that could:
- Understand the root cause based on stack traces, request context, and system state
- Assess business impact by connecting technical failures to user-facing consequences
- Suggest remediation with specific, actionable next steps
- Alert the right people with context-rich notifications instead of cryptic stack traces
- Learn patterns to identify recurring issues and predict future failures
This isn't science fiction. We've built this at Demeterics, and it's running in production today.
The results:
- 80% reduction in time-to-diagnosis for production errors
- Near-zero false positives on critical alerts (compared to 40%+ with threshold-based alerting)
- Automatic documentation of every error with root cause analysis
- Cost: ~$0.0001 per error analyzed (yes, one-hundredth of a cent)
How It Works: The Business View
The architecture is surprisingly simple:
- Error occurs in your application (API timeout, database deadlock, payment failure, etc.)
- Context captured automatically (request details, user impact, system state)
- LLM analyzes the error in real-time using a specialized AI model
- Alert sent to your team via email with human-readable explanation
- Issue tracked in your system with full analysis and remediation suggestions
The beauty is that this happens asynchronously—your application performance is unaffected. The error is logged normally, then a background task asks the AI "what happened and why?"
Cost efficiency: We use Groq's meta-llama/llama-4-scout-17b-16e-instruct model, which processes errors at $0.10 per million tokens. A typical error analysis consumes ~1,000 tokens (the error context and analysis), making each analysis cost about $0.0001. Even at 10,000 errors per day, that's just $1/day or $365/year.
Compare this to the cost of engineers manually investigating errors: if this system saves just 2 hours per week (very conservative), that's $15K-20K/year in engineering time saved.
Technical Deep Dive Begins Here
The following sections are technical and include code examples. If you're an engineering leader looking for implementation guidance for your team, keep reading. If you want to stay at the business level, feel free to skip to the "Getting Started" section at the end.
Architecture: The Technical View
Our self-healing architecture has four core components:
1. Error Context Capture
Every error handler wraps the core logic with an LLM-powered logger:
// src/api/widget_chat.go (example)
func HandleWidgetChat(cfg *config.Config, ds *data.DataStore,
	feedbackRepo *feedback.Repository,
	emailService *email.Mailer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()

		// Initialize LLM logging with email + Feedback integration
		var llmLog *common.LoggingLLM
		if feedbackRepo != nil && emailService != nil {
			callback := CreateFeedbackCallbackWithEmail(
				feedbackRepo,
				emailService,
				"src/api/widget_chat.go",
				"HandleWidgetChat",
				"Widget chat error",
				"system@demeterics.ai",
				"patrick@bluefermion.com",
			)
			llmLog = common.CreateLoggingLLMWithCallback(
				"src/api/widget_chat.go",
				"HandleWidgetChat",
				callback,
				"Processing widget chat request",
			)
		}
		defer llmLog.Print()

		// Your handler logic here
		// Errors, warnings, and debug info are automatically captured
		llmLog.Info("Processing request for domain=%s", domain)

		if err := validateRequest(r); err != nil {
			llmLog.Error("Request validation failed: %v", err)
			http.Error(w, "Invalid request", http.StatusBadRequest)
			return
		}

		// ... rest of handler (ctx, domain, and validateRequest come from the omitted handler code)
	}
}
Key insight: The defer llmLog.Print() pattern ensures that even if the handler panics, the error context is captured and analyzed.
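Deferred functions still run while a panic unwinds, so Print() fires either way. If you also want the panic value itself recorded as an ERROR entry before the analysis is triggered, one option is to register a recover defer after it. A minimal sketch (the recover wrapper is our own addition, not part of the library):

defer llmLog.Print()
// Deferred last, so it runs first during a panic: the panic value is
// logged as an ERROR entry, then Print() sees it and triggers analysis.
defer func() {
	if rec := recover(); rec != nil {
		llmLog.Error("panic recovered: %v", rec)
		http.Error(w, "Internal server error", http.StatusInternalServerError)
	}
}()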
2. Intelligent Callback System
The callback function orchestrates the error response workflow:
// src/api/llm_feedback.go
func CreateFeedbackCallbackWithEmail(
	feedbackRepo *feedback.Repository,
	emailService *email.Mailer,
	fileName, funcName, errorMsg, userEmail, adminEmail string,
) common.AnalysisCallback {
	return func(analysis string) error {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		// 1. Send email alert to admin FIRST
		if emailService != nil {
			if err := sendAdminErrorAlert(ctx, emailService, adminEmail,
				fileName, funcName, errorMsg, analysis); err != nil {
				log.Warn("Failed to send admin error alert email",
					"admin_email", adminEmail, "error", err)
				// Continue to create Feedback even if email fails
			} else {
				log.Info("Sent admin error alert email",
					"admin_email", adminEmail,
					"function", funcName)
			}
		}

		// 2. Create Feedback entity for issue tracking
		if feedbackRepo != nil {
			title := fmt.Sprintf("Self-Identified Error: %s.%s",
				funcName, extractFileName(fileName))
			description := fmt.Sprintf(
				"**Error Message:** %s\n\n"+
					"**Source:** %s in %s\n\n"+
					"**LLM Analysis:**\n%s",
				errorMsg, funcName, fileName, analysis,
			)

			fb := feedback.NewFeedback(title, description, "bug")
			fb.Analysis = analysis
			fb.Priority = "high"
			fb.Status = "new"
			fb.Source = "llm-self-analysis"

			id, err := feedbackRepo.Create(ctx, fb)
			if err != nil {
				return fmt.Errorf("failed to create feedback: %w", err)
			}
			log.Info("Created feedback entry from LLM analysis",
				"feedback_id", id, "title", title)
		}

		return nil
	}
}
Why email first? In production, you want humans to know about critical errors immediately. Email is delivered in seconds, while issue tracking updates might be batched or delayed. This "alert first, track second" pattern ensures no critical error goes unnoticed.
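The sendAdminErrorAlert helper referenced above is just a formatting wrapper. A minimal sketch (not the actual implementation; the Send(to, subject, body) signature is assumed, matching the Phase 2 example later in this post):

// Sketch: format a subject and body from the error context and hand it
// to the mailer. ctx is accepted to match the call site; thread it into
// your mail client if it supports contexts.
func sendAdminErrorAlert(ctx context.Context, mailer *email.Mailer,
	adminEmail, fileName, funcName, errorMsg, analysis string) error {
	subject := fmt.Sprintf("[HIGH] %s in %s", errorMsg, funcName)
	body := fmt.Sprintf(
		"Source: %s (%s)\n\nError: %s\n\nLLM Analysis:\n\n%s",
		funcName, fileName, errorMsg, analysis,
	)
	return mailer.Send(adminEmail, subject, body)
}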
3. Asynchronous LLM Analysis
The magic happens in the github.com/patdeg/common package's LoggingLLM:
// From github.com/patdeg/common/logging_llm.go (simplified)
type LoggingLLM struct {
	FileName    string
	FuncName    string
	Description string
	Callback    AnalysisCallback
	StartTime   time.Time
	Entries     []LogEntry
	mu          sync.Mutex
}

func (l *LoggingLLM) Print() {
	if l == nil {
		return
	}

	duration := time.Since(l.StartTime)

	// Check if there are any errors or warnings
	hasErrors := false
	for _, entry := range l.Entries {
		if entry.Level == "ERROR" || entry.Level == "WARN" {
			hasErrors = true
			break
		}
	}

	// If errors found, trigger LLM analysis asynchronously
	if hasErrors && l.Callback != nil {
		go l.performAnalysis()
	}

	// Always print summary to stdout
	fmt.Printf("[LLM LOG] %s.%s completed in %v (errors=%v)\n",
		l.FileName, l.FuncName, duration, hasErrors)
}

func (l *LoggingLLM) performAnalysis() {
	// Build context from log entries
	context := l.buildAnalysisContext()

	// Call LLM API (Groq in our case)
	analysis, err := callGroqAPI(context)
	if err != nil {
		Error("LLM analysis failed: %v", err)
		return
	}

	// Execute callback with analysis
	if l.Callback != nil {
		if err := l.Callback(analysis); err != nil {
			Error("LLM callback failed: %v", err)
		}
	}
}
Performance impact: Zero. The analysis happens in a goroutine after the response is sent. Your API latency is unaffected.
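The context builder itself is unremarkable; here is a simplified sketch of buildAnalysisContext that flattens the captured log entries into the CONTEXT block of the prompt shown in the next section (the Time and Message fields on LogEntry are assumptions; only Level appears in the excerpt above):

// Sketch: serialize the captured entries into a plain-text prompt context.
// Uses only fmt and strings from the standard library.
func (l *LoggingLLM) buildAnalysisContext() string {
	l.mu.Lock()
	defer l.mu.Unlock()

	var b strings.Builder
	fmt.Fprintf(&b, "FILE: %s\nFUNCTION: %s\nCONTEXT:\n", l.FileName, l.FuncName)
	for _, entry := range l.Entries {
		fmt.Fprintf(&b, "[%s] %s: %s\n",
			entry.Time.Format("2006-01-02 15:04:05"), entry.Level, entry.Message)
	}
	return b.String()
}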
4. Secret Manager Integration
Production systems need secure API key management. We use Google Cloud Secret Manager:
// src/common/config/config.go (simplified)
func Load() (*Config, error) {
	cfg := &Config{
		SendGridAPIKey:            getEnv("SENDGRID_API_KEY", ""),
		DemeterInternalGroqAPIKey: getEnv("DEMETER_INTERNAL_GROQ_API_KEY", ""),
		// ... other config
	}

	// Load from Secret Manager in non-dev environments
	if cfg.SendGridAPIKey == "" && cfg.Environment != "development" {
		ctx := context.Background()
		if secret, err := getSecret(ctx, cfg.ProjectID, "sendgrid-api-key"); err == nil {
			cfg.SendGridAPIKey = secret
		}
	}
	if cfg.DemeterInternalGroqAPIKey == "" && cfg.Environment != "development" {
		ctx := context.Background()
		if secret, err := getSecret(ctx, cfg.ProjectID, "internal-groq-api-key"); err == nil {
			cfg.DemeterInternalGroqAPIKey = secret
		}
	}

	return cfg, nil
}
Security best practice: Never hardcode API keys. Use Secret Manager in production, environment variables in development.
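For completeness, one possible implementation of the getSecret helper called above, using the official Google Cloud client (packages cloud.google.com/go/secretmanager/apiv1 and its secretmanagerpb subpackage); error handling is kept minimal, and creating a client per call is fine for one-time config loading:

// Sketch: read the latest version of a secret from Secret Manager.
func getSecret(ctx context.Context, projectID, name string) (string, error) {
	client, err := secretmanager.NewClient(ctx)
	if err != nil {
		return "", err
	}
	defer client.Close()

	resp, err := client.AccessSecretVersion(ctx, &secretmanagerpb.AccessSecretVersionRequest{
		Name: fmt.Sprintf("projects/%s/secrets/%s/versions/latest", projectID, name),
	})
	if err != nil {
		return "", err
	}
	return string(resp.Payload.Data), nil
}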
The LLM Prompt: Teaching AI to Debug
The quality of analysis depends entirely on the prompt. Here's what works for us:
You are an expert software engineer analyzing a production error.
FILE: src/api/widget_chat.go
FUNCTION: HandleWidgetChat
ERROR: Widget chat error
CONTEXT:
[2025-11-18 14:32:15] INFO: Processing request for domain=example.com
[2025-11-18 14:32:15] INFO: Validating widget agent configuration
[2025-11-18 14:32:15] ERROR: Widget agent not found for domain: example.com
[2025-11-18 14:32:15] ERROR: Request validation failed: widget agent not configured
Analyze this error and provide:
1. Root cause (why did this happen?)
2. Business impact (what's affected?)
3. Remediation steps (how to fix?)
4. Prevention (how to avoid in future?)
Be specific and actionable. Format as markdown.
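Sending that prompt to Groq is a plain HTTPS call. Here is a minimal sketch of what callGroqAPI could look like (the real helper isn't shown in this post); it assumes Groq's OpenAI-compatible chat completions endpoint and uses only the standard library (bytes, encoding/json, fmt, net/http, os):

// Sketch: POST the prompt to Groq and return the first completion.
func callGroqAPI(prompt string) (string, error) {
	body, _ := json.Marshal(map[string]interface{}{
		"model": "meta-llama/llama-4-scout-17b-16e-instruct",
		"messages": []map[string]string{
			{"role": "user", "content": prompt},
		},
	})

	req, err := http.NewRequest("POST",
		"https://api.groq.com/openai/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	// In production the key comes from config/Secret Manager, not the env var directly.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("DEMETER_INTERNAL_GROQ_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("empty response from Groq")
	}
	return out.Choices[0].Message.Content, nil
}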
The LLM response looks like this:
## Root Cause
The widget chat request failed because no WidgetAgent entity exists in
Datastore for domain `example.com`. This is a configuration error, not
a code bug.
## Business Impact
- **Severity:** Medium
- **User Impact:** Widget chat is unavailable for example.com visitors
- **Revenue Impact:** Potential lost conversions if chat is primary support channel
- **Affected Users:** All visitors to example.com
## Remediation Steps
1. Check if `example.com` is a valid customer domain
2. If yes, create WidgetAgent entity via admin UI or API:
   POST /api/v1/admin/widget/agents
   {
     "domain": "example.com",
     "provider": "groq",
     "model": "llama-3.1-8b-instant",
     "enabled": true
   }
3. Verify configuration: `GET /api/v1/admin/widget/agents?domain=example.com`
## Prevention
- Add validation in widget agent creation flow to prevent orphaned domains
- Implement "pending configuration" state for new customer onboarding
- Add monitoring alert for widget_agent_not_found errors by domain
- Consider auto-creating default WidgetAgent on customer signup
This is what your engineers see in their inbox instead of:
ERROR: widget agent not found
at src/api/widget_chat.go:142
Real-World Impact: A Case Study
The Error: BigQuery streaming insert timeout during high-traffic period
Traditional Response:
- Engineer gets paged with cryptic error (2 AM)
- Logs into production, searches logs (20 minutes)
- Identifies BigQuery timeout pattern (30 minutes)
- Realizes it's a quota issue, not a bug (15 minutes)
- Requests quota increase, waits for approval (next day)
- Documents the issue for future reference (1 hour)
Total time: ~2 hours immediate, plus next-day follow-up
Self-Healing Response:
- Error occurs, LLM analyzes in 2 seconds
- Email sent to on-call engineer:
  Subject: [HIGH] BigQuery timeout in ingest handler
  Root Cause: BigQuery streaming insert quota exceeded
  Business Impact: 127 user interactions lost in 5-minute window
  Fix: Request quota increase to 100K rows/sec
  Prevention: Implement exponential backoff and circuit breaker
- Engineer requests quota increase immediately (5 minutes)
- Issue auto-documented in tracking system
Total time: ~5 minutes, no debugging required
ROI: The system paid for itself after preventing 3 late-night debugging sessions.
Getting Started: Implementation Guide
Phase 1: Basic Integration (1-2 days)
1. Add the logging library

go get github.com/patdeg/common

2. Wrap your first handler

import "github.com/patdeg/common"

func MyHandler(w http.ResponseWriter, r *http.Request) {
	llmLog := common.CreateLoggingLLM(
		"handlers/my_handler.go",
		"MyHandler",
		"Processing user request",
	)
	defer llmLog.Print()

	llmLog.Info("Request received from %s", r.RemoteAddr)

	if err := processRequest(r); err != nil {
		llmLog.Error("Processing failed: %v", err)
		http.Error(w, "Internal error", 500)
		return
	}

	llmLog.Info("Request processed successfully")
}

3. Set up API keys

# For local development
# EXAMPLE ONLY - Replace with your actual API key
export DEMETER_INTERNAL_GROQ_API_KEY="gsk_example_fake_key_replace_with_real"

# For production (use Secret Manager)
gcloud secrets create internal-groq-api-key \
  --data-file=<(echo -n "YOUR_ACTUAL_API_KEY_HERE") \
  --replication-policy=automatic

4. Test it

- Trigger an error in your handler
- Check stdout for LLM analysis
- Verify the analysis makes sense
Phase 2: Email Alerting (1 day)
1. Set up SendGrid (or your email provider)

# EXAMPLE ONLY - Replace with your actual SendGrid API key
export SENDGRID_API_KEY="SG.example_fake_key.replace_with_real_key"

2. Add email callback

func MyHandler(emailService *email.Mailer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		callback := func(analysis string) error {
			return emailService.Send(
				"admin@yourcompany.com",
				"Error in MyHandler",
				fmt.Sprintf("Analysis:\n%s", analysis),
			)
		}

		llmLog := common.CreateLoggingLLMWithCallback(
			"handlers/my_handler.go",
			"MyHandler",
			callback,
			"Processing user request",
		)
		defer llmLog.Print()

		// ... rest of handler
	}
}
Phase 3: Issue Tracking Integration (2-3 days)
1. Create a Feedback/Issue entity

type Feedback struct {
	ID          string    `json:"id"`
	Title       string    `json:"title"`
	Description string    `json:"description"`
	Analysis    string    `json:"analysis"`
	Priority    string    `json:"priority"`
	Status      string    `json:"status"`
	Source      string    `json:"source"`
	CreatedAt   time.Time `json:"created_at"`
}

2. Wire up the full callback

callback := CreateFeedbackCallbackWithEmail(
	feedbackRepo,
	emailService,
	"handlers/my_handler.go",
	"MyHandler",
	"Handler error",
	"system@yourcompany.com",
	"oncall@yourcompany.com",
)

3. Build a dashboard to view LLM-analyzed errors
Phase 4: Rollout to All Handlers (1-2 weeks)
Prioritization matrix (from our actual implementation):
| Priority | Handler Type | Why | Examples |
|---|---|---|---|
| P0 | Revenue-critical | Errors = lost money | Payment webhooks, checkout |
| P1 | User-facing APIs | Errors = bad UX | REST endpoints, GraphQL |
| P2 | Background jobs | Errors = data inconsistency | Cron jobs, task queues |
| P3 | Admin/internal | Errors = ops friction | Admin dashboards, internal tools |
Start with P0 handlers, prove value, then roll out systematically.
Cost Analysis: Real Numbers
Our production deployment (Demeterics API):
- Handlers instrumented: 23 (across 15 files)
- Errors per day: ~50-200 (depends on traffic)
- LLM API calls per day: ~50-200 (only when errors occur)
- Cost per analysis: ~$0.0001
- Monthly cost: $1.50-6.00 (at current error rate)
- Engineering time saved: ~8 hours/week (conservative estimate)
- ROI: ~50,000% (yes, fifty thousand percent)
Why so cheap?
- We use Groq's ultra-fast inference ($0.10 per million tokens)
- Only analyze errors, not every request
- Error context is small (usually <1000 tokens)
- Analysis is done async (no retries on transient failures)
Scaling: Even at 10,000 errors/day (enterprise scale), monthly cost is ~$30. Compare this to one engineer's hourly rate.
Challenges and Lessons Learned
Challenge 1: Alert Fatigue
Problem: Early version analyzed every warning, flooding inboxes
Solution: Only trigger on ERROR level, not WARN. Added throttling to deduplicate similar errors within 1-hour windows
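A simple in-memory throttle was enough for the deduplication; roughly (names here are illustrative, not the exact production code):

var (
	alertMu       sync.Mutex
	lastAlertSent = map[string]time.Time{} // dedup key -> time of last alert
)

// shouldAlert returns true at most once per hour for a given file/function/message.
func shouldAlert(fileName, funcName, errorMsg string) bool {
	key := fileName + "|" + funcName + "|" + errorMsg

	alertMu.Lock()
	defer alertMu.Unlock()

	if t, ok := lastAlertSent[key]; ok && time.Since(t) < time.Hour {
		return false // duplicate within the 1-hour window
	}
	lastAlertSent[key] = time.Now()
	return true
}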
Challenge 2: Context Overload
Problem: Sending entire stack traces to LLM exceeded token limits
Solution: Smart truncation—send last 50 log lines, include request metadata, skip verbose debug output
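The truncation logic is roughly this (helper name is illustrative; LogEntry.Level is the only field taken from the excerpt above):

const maxContextLines = 50

// truncateEntries drops verbose DEBUG lines and keeps only the most recent
// entries before the prompt is built.
func truncateEntries(entries []LogEntry) []LogEntry {
	kept := make([]LogEntry, 0, len(entries))
	for _, e := range entries {
		if e.Level == "DEBUG" {
			continue // skip verbose debug output
		}
		kept = append(kept, e)
	}
	if len(kept) > maxContextLines {
		kept = kept[len(kept)-maxContextLines:] // last 50 lines only
	}
	return kept
}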
Challenge 3: False Positives
Problem: LLM sometimes hallucinated fixes for non-issues
Solution: Improved prompt engineering with explicit instructions: "Only suggest fixes if you're confident. Say 'insufficient context' if unclear."
Challenge 4: PII Leakage
Problem: Error logs sometimes contained user emails, API keys
Solution: Implemented SafeError() method that redacts sensitive patterns before sending to LLM:
// Compile the redaction patterns once at package level (note: inside a
// character class, [A-Za-z] is correct; a literal "|" would match pipes).
var (
	emailPattern  = regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`)
	apiKeyPattern = regexp.MustCompile(`\b(sk_|pk_|dmt_)[A-Za-z0-9]{20,}\b`)
)

func SafeError(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	// Redact email addresses
	msg = emailPattern.ReplaceAllString(msg, "[EMAIL_REDACTED]")
	// Redact API keys
	msg = apiKeyPattern.ReplaceAllString(msg, "[API_KEY_REDACTED]")
	// Pass the redacted message as a value, not a format string
	llmLog.Error("%s", msg)
}
Challenge 5: Production Debugging
Problem: How do you debug the error analysis system when it fails?
Solution: The system is self-healing! Errors in the LLM analysis callback are caught and logged normally. We also added a "dry-run mode" environment variable for testing.
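The dry-run check is just an environment-variable guard at the top of the callback; a sketch (the variable name here is illustrative):

// Dry-run mode: run the LLM analysis but skip email and issue creation.
if os.Getenv("LLM_ANALYSIS_DRY_RUN") == "true" {
	log.Info("Dry-run: skipping alert", "analysis", analysis)
	return nil
}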
What We'd Do Differently
If we were starting today:
- Start with observability - We should have instrumented success cases too, not just errors. This would help the LLM understand "normal" vs "abnormal" behavior.
- Build the dashboard first - We added issue tracking integration late. Should have been day one. Engineers love dashboards.
- Implement cost controls earlier - Add per-handler budget limits to prevent runaway costs if something goes wrong (e.g., an error loop).
- Add A/B testing - We should have kept traditional alerting running in parallel for 2 weeks to compare MTTR (Mean Time To Resolution).
- Document the prompt - We iterated on the LLM prompt 20+ times. Should have version-controlled it and documented why each change was made.
The Future: Truly Self-Healing Systems
This is just the beginning. Here's where we're headed:
Auto-Remediation (In Progress)
Instead of just alerting, execute fixes automatically:
if analysis.Confidence > 0.95 && analysis.RemediationType == "ConfigChange" {
	// LLM is confident this is a config issue with known fix
	if err := applyConfigFix(analysis.SuggestedFix); err == nil {
		notify("Auto-remediated: " + analysis.RootCause)
	}
}
Use case: Database connection pool exhausted → auto-scale pool size
Predictive Alerting (Planned)
Use LLM to spot patterns before errors cascade:
WARNING: 3 API timeouts in 2 minutes (usually precedes quota error)
PREDICTION: BigQuery quota will be exceeded in ~15 minutes
RECOMMENDATION: Preemptively reduce write rate or request quota increase
Cost Attribution (Planned)
Connect errors to business metrics:
ERROR COST ANALYSIS:
- 47 checkout failures in last hour
- Estimated revenue loss: $2,847 (avg cart value $60.57)
- Root cause: Payment gateway timeout
- Fix priority: CRITICAL (revenue impact)
Multi-LLM Consensus (Research)
Use 3 different LLMs to analyze the same error, compare responses:
Claude: "Database deadlock due to concurrent user updates"
GPT-4: "Race condition in user profile update transaction"
Llama: "Transaction isolation issue causing lock contention"
CONSENSUS: Database concurrency issue (confidence: HIGH)
RECOMMENDED FIX: Add optimistic locking with retry logic
Conclusion: The End of "Works on My Machine"
Self-healing software isn't about replacing engineers—it's about amplifying them. Your senior engineers shouldn't be woken up at 3 AM to debug a configuration error that an LLM can diagnose in 2 seconds.
The paradigm shift:
- Before: Errors are discovered by users, investigated by engineers, documented manually
- After: Errors are caught instantly, analyzed automatically, remediated proactively
This is possible today with commodity LLM APIs and a few hundred lines of code. The ROI is measured in thousands of percent. The implementation time is measured in days, not months.
We've open-sourced the core logging framework at github.com/patdeg/common. Our full implementation guide is at github.com/bluefermion/demeterics-api (private repo, but DM me for access if you're serious about implementing this).
Start small:
- Instrument your most critical handler (1 hour)
- Set up LLM analysis (30 minutes)
- Add email alerting (1 hour)
- Watch the magic happen
The future of software engineering isn't writing more tests or adding more monitoring dashboards. It's building systems that understand themselves and heal themselves.
Your production errors are talking to you. Are you listening?
Appendix: Complete Code Example
Here's a fully working example you can drop into any Go HTTP server:
package main

import (
	"fmt"
	"net/http"
	"os"

	"github.com/patdeg/common"
)

// Simple email sender (replace with your email service)
func sendEmail(to, subject, body string) error {
	// In production, use SendGrid, AWS SES, etc.
	fmt.Printf("EMAIL TO: %s\nSUBJECT: %s\n%s\n", to, subject, body)
	return nil
}

// Error handler with LLM analysis
func ProtectedHandler(w http.ResponseWriter, r *http.Request) {
	// Create callback that sends email when errors occur
	callback := func(analysis string) error {
		return sendEmail(
			"admin@yourcompany.com",
			"Error in ProtectedHandler",
			fmt.Sprintf("LLM Analysis:\n\n%s", analysis),
		)
	}

	// Initialize LLM logger with callback
	llmLog := common.CreateLoggingLLMWithCallback(
		"main.go",
		"ProtectedHandler",
		callback,
		"Processing HTTP request",
	)
	defer llmLog.Print()

	// Your actual handler logic
	llmLog.Info("Request from %s", r.RemoteAddr)

	// Simulate an error
	userID := r.URL.Query().Get("user_id")
	if userID == "" {
		llmLog.Error("Missing required parameter: user_id")
		http.Error(w, "user_id is required", http.StatusBadRequest)
		return
	}

	// Simulate database error
	if userID == "999" {
		llmLog.Error("Database connection timeout when fetching user_id=%s", userID)
		llmLog.Error("Connection pool exhausted (10/10 connections active)")
		http.Error(w, "Internal server error", http.StatusInternalServerError)
		return
	}

	llmLog.Info("Successfully processed request for user_id=%s", userID)
	fmt.Fprintf(w, "Success for user %s\n", userID)
}

func main() {
	// Set up Groq API key
	if os.Getenv("DEMETER_INTERNAL_GROQ_API_KEY") == "" {
		fmt.Println("Warning: DEMETER_INTERNAL_GROQ_API_KEY not set, LLM analysis disabled")
	}

	http.HandleFunc("/api/protected", ProtectedHandler)

	fmt.Println("Server starting on :8080")
	fmt.Println("Try: curl http://localhost:8080/api/protected")
	fmt.Println("Try: curl http://localhost:8080/api/protected?user_id=999")

	if err := http.ListenAndServe(":8080", nil); err != nil {
		panic(err)
	}
}
To run:
export DEMETER_INTERNAL_GROQ_API_KEY="your-key-here"
go run main.go
# In another terminal:
curl "http://localhost:8080/api/protected?user_id=999"
Output:
[LLM LOG] main.go.ProtectedHandler completed in 45ms (errors=true)
EMAIL TO: admin@yourcompany.com
SUBJECT: Error in ProtectedHandler
LLM Analysis:
## Root Cause
Database connection pool exhaustion (10/10 connections active) caused
timeout when attempting to fetch user_id=999.
## Business Impact
- Severity: HIGH
- User Impact: Request failed with 500 error
- Scope: Likely affecting all concurrent users if pool is full
## Remediation
1. Immediate: Increase connection pool size from 10 to 20
2. Short-term: Add connection pool monitoring/alerting
3. Long-term: Implement connection pooling with overflow handling
## Prevention
- Set max_idle_connections lower to detect leaks faster
- Add query timeout (currently unbounded)
- Implement circuit breaker pattern for database calls
About the Author: Patrick Deglon is the founder of Demeterics, an LLM observability platform that helps engineering teams understand and optimize their AI systems. He previously built production ML systems at Google and has been using LLMs to debug LLMs since GPT-3. Find him on Twitter @patdeglon or at demeterics.com.
If you found this useful, please share it with your engineering team. And if you implement this in your own systems, I'd love to hear about it—DM me with your results!