Chapter 20: Troubleshooting and Maintenance
Overview
This chapter covers troubleshooting and maintaining Vektagraf applications in production. It works through a systematic troubleshooting method, debugging techniques, diagnostic tools, maintenance procedures, health checks, and performance optimization.
Learning Objectives
- Master systematic troubleshooting approaches for Vektagraf applications
- Use debugging techniques and diagnostic tools effectively
- Implement comprehensive maintenance procedures and health checks
- Optimize performance through systematic analysis and tuning
Prerequisites
- Understanding of Vektagraf architecture and deployment patterns
- Familiarity with monitoring and observability concepts
- Knowledge of Kubernetes and container troubleshooting
- Basic understanding of database performance tuning
Core Concepts
Troubleshooting Methodology
A systematic approach to troubleshooting Vektagraf issues:
graph TD
A[Issue Reported] --> B[Gather Information]
B --> C[Identify Symptoms]
C --> D[Form Hypothesis]
D --> E[Test Hypothesis]
E --> F{Issue Resolved?}
F -->|No| G[Refine Hypothesis]
G --> E
F -->|Yes| H[Document Solution]
H --> I[Implement Prevention]
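The hypothesis loop in the diagram can be sketched in a few lines of Dart. This is an illustration only, not part of Vektagraf: `Hypothesis`, `resolve`, and the `maxAttempts` cap are hypothetical names introduced here.

```dart
/// A candidate explanation plus a test that reports whether acting on it
/// resolved the issue. Hypothetical type, for illustration only.
class Hypothesis {
  final String description;
  final Future<bool> Function() test;
  const Hypothesis(this.description, this.test);
}

/// Tests hypotheses in order until one resolves the issue.
/// Returns the confirmed hypothesis, or null if none held up.
Future<Hypothesis?> resolve(
  List<Hypothesis> candidates, {
  int maxAttempts = 10,
}) async {
  var attempts = 0;
  for (final candidate in candidates) {
    // Stop and escalate rather than refining hypotheses indefinitely.
    if (attempts++ >= maxAttempts) break;
    if (await candidate.test()) {
      return candidate; // Next: document the solution, then add prevention.
    }
  }
  return null;
}
```

A `null` result corresponds to leaving the loop without a resolution, which is usually the point at which to gather more information or escalate.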
Issue Classification
// lib/troubleshooting/issue_classifier.dart
enum IssueCategory {
performance,
connectivity,
security,
data,
configuration,
infrastructure,
}
enum IssueSeverity {
critical, // System down, data loss
high, // Major functionality impacted
medium, // Minor functionality impacted
low, // Cosmetic or enhancement
}
class Issue {
final String id;
final String title;
final String description;
final IssueCategory category;
final IssueSeverity severity;
final DateTime reportedAt;
final List<String> symptoms;
final Map<String, dynamic> context;
const Issue({
required this.id,
required this.title,
required this.description,
required this.category,
required this.severity,
required this.reportedAt,
this.symptoms = const [],
this.context = const {},
});
}
class TroubleshootingGuide {
static List<TroubleshootingStep> getStepsForIssue(Issue issue) {
switch (issue.category) {
case IssueCategory.performance:
return _getPerformanceSteps(issue);
case IssueCategory.connectivity:
return _getConnectivitySteps(issue);
case IssueCategory.security:
return _getSecuritySteps(issue);
case IssueCategory.data:
return _getDataSteps(issue);
case IssueCategory.configuration:
return _getConfigurationSteps(issue);
case IssueCategory.infrastructure:
return _getInfrastructureSteps(issue);
}
}
}
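Once issues carry a severity, a work queue can be ordered so that critical issues are handled first and older reports break ties. The sketch below assumes the severity enum declares `critical` first, as in the listing above; `QueuedIssue` and `triageOrder` are hypothetical names.

```dart
enum IssueSeverity { critical, high, medium, low }

/// Minimal stand-in for the chapter's Issue class (hypothetical).
class QueuedIssue {
  final String id;
  final IssueSeverity severity;
  final DateTime reportedAt;
  const QueuedIssue(this.id, this.severity, this.reportedAt);
}

/// Returns a new list ordered by severity (critical first), then by age
/// (oldest first). Relies on the enum declaring `critical` first.
List<QueuedIssue> triageOrder(Iterable<QueuedIssue> issues) {
  final sorted = issues.toList();
  sorted.sort((a, b) {
    final bySeverity = a.severity.index.compareTo(b.severity.index);
    if (bySeverity != 0) return bySeverity;
    return a.reportedAt.compareTo(b.reportedAt);
  });
  return sorted;
}
```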
Common Issues and Solutions
Performance Issues
Slow Query Performance
Symptoms:
- High query latency
- Timeouts on database operations
- Increased CPU usage
- Memory pressure
Diagnostic Steps:
// lib/diagnostics/performance_diagnostics.dart
class PerformanceDiagnostics {
static Future<PerformanceReport> analyzeSlowQueries() async {
final slowQueries = await _getSlowQueries();
final indexUsage = await _analyzeIndexUsage();
final resourceUsage = await _getResourceUsage();
return PerformanceReport(
slowQueries: slowQueries,
indexUsage: indexUsage,
resourceUsage: resourceUsage,
recommendations: _generateRecommendations(
slowQueries, indexUsage, resourceUsage),
);
}
static Future<List<SlowQuery>> _getSlowQueries() async {
// Query performance metrics from monitoring system
final metrics = await PrometheusClient.query(
'histogram_quantile(0.95, rate(vektagraf_query_duration_seconds_bucket[5m]))'
);
return metrics.where((m) => m.value > 1.0)
.map((m) => SlowQuery.fromMetric(m))
.toList();
}
static Future<IndexUsageReport> _analyzeIndexUsage() async {
// Analyze index effectiveness
final indexStats = await VektagrafDatabase.getIndexStatistics();
return IndexUsageReport(
unusedIndexes: indexStats.where((i) => i.usageCount == 0).toList(),
inefficientIndexes: indexStats.where((i) => i.selectivity < 0.1).toList(),
missingIndexes: await _identifyMissingIndexes(),
);
}
}
class SlowQuery {
final String query;
final Duration averageLatency;
final int executionCount;
final String executionPlan;
const SlowQuery({
required this.query,
required this.averageLatency,
required this.executionCount,
required this.executionPlan,
});
}
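When raw latency samples are available (for example from query logs), the 95th percentile can also be computed offline. The following is a minimal sketch using the nearest-rank method; `percentile` and `slowerThan` are hypothetical helpers. Prometheus estimates quantiles by interpolating over histogram buckets, so its values will generally differ slightly from this calculation.

```dart
/// Returns the p-th percentile (0 < p <= 100) of [samples] using the
/// nearest-rank method: the value at rank ceil(p * n / 100) in sorted order.
Duration percentile(List<Duration> samples, double p) {
  if (samples.isEmpty) throw ArgumentError('samples must not be empty');
  if (p <= 0 || p > 100) throw ArgumentError('p must be in (0, 100]');
  final sorted = [...samples]..sort();
  final rank = (p * sorted.length / 100).ceil(); // 1-based rank
  return sorted[rank - 1];
}

/// Samples slower than [threshold] (1 second in the diagnostics above).
List<Duration> slowerThan(List<Duration> samples, Duration threshold) =>
    samples.where((d) => d > threshold).toList();
```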
Solutions:
- Query Optimization
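// Before: substring match, which typically cannot use a B-tree index
final slowResults = await database.users
.where((u) => u.name.contains('john'))
.toList();
// After: exact match served by an index, with a bounded result size.
// equals() is stricter than contains(); use it only where exact matching is acceptable.
final fastResults = await database.users
.where((u) => u.nameIndex.equals('john'))
.limit(100)
.toList();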
- Index Optimization
{
"indexes": [
{
"name": "user_name_idx",
"fields": ["name"],
"type": "btree"
},
{
"name": "user_email_idx",
"fields": ["email"],
"type": "hash",
"unique": true
}
]
}
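Before adding an index, it helps to estimate how selective the field is, since the diagnostics earlier flag indexes with selectivity below 0.1 as inefficient. The chapter does not define selectivity; the sketch below assumes the common definition of distinct values divided by total values, and `selectivity` is a hypothetical helper run against a sample of field values.

```dart
/// Selectivity as distinct values divided by total values, in [0, 1].
/// Assumed definition: values near 1 mean the field narrows a search well,
/// values near 0 mean many rows share the same value.
double selectivity<T>(Iterable<T> values) {
  final list = values.toList();
  if (list.isEmpty) return 0.0;
  return list.toSet().length / list.length;
}
```

Under this definition a unique field such as `email` scores 1.0, while a status flag with a handful of values over many rows scores close to 0.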
Vector Search Performance Issues
Symptoms:
- Slow similarity searches
- High memory usage during vector operations
- Vector index build failures
Diagnostic Commands:
# Check vector index status
kubectl exec -it vektagraf-pod -- vektagraf-cli vector status
# Analyze vector index performance
kubectl exec -it vektagraf-pod -- vektagraf-cli vector analyze --index-name embeddings
# Monitor vector search metrics
kubectl exec -it vektagraf-pod -- vektagraf-cli metrics vector-search
Solutions:
// lib/optimization/vector_optimization.dart
class VectorOptimization {
static VectorConfig optimizeForWorkload(VectorWorkload workload) {
switch (workload.type) {
case VectorWorkloadType.highThroughput:
return VectorConfig(
algorithm: VectorAlgorithm.ivfflat,
lists: max(1, workload.dataSize ~/ 1000), // at least one list for small data sets; max() is from dart:math
probes: 10,
parallelSearch: true,
);
case VectorWorkloadType.highAccuracy:
return VectorConfig(
algorithm: VectorAlgorithm.hnsw,
efConstruction: 400,
maxConnections: 32,
efSearch: 200,
);
case VectorWorkloadType.balanced:
return VectorConfig(
algorithm: VectorAlgorithm.hnsw,
efConstruction: 200,
maxConnections: 16,
efSearch: 100,
);
}
}
}
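Tuning approximate-index parameters trades accuracy for speed, so it is worth measuring recall against an exact search on a small sample before and after a change. The sketch below is independent of Vektagraf's API: `cosine`, `exactTopK`, and `recallAtK` are hypothetical helpers, and cosine similarity is assumed as the distance measure.

```dart
import 'dart:math';

/// Cosine similarity of two equal-length vectors.
double cosine(List<double> a, List<double> b) {
  if (a.length != b.length) throw ArgumentError('dimension mismatch');
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (sqrt(normA) * sqrt(normB));
}

/// Exact top-k ids by cosine similarity (linear scan, small samples only).
List<String> exactTopK(
    Map<String, List<double>> corpus, List<double> query, int k) {
  final ids = corpus.keys.toList()
    ..sort((x, y) =>
        cosine(corpus[y]!, query).compareTo(cosine(corpus[x]!, query)));
  return ids.take(k).toList();
}

/// Fraction of the exact top-k that the approximate result also returned.
double recallAtK(List<String> approximate, List<String> exact) {
  if (exact.isEmpty) return 1.0;
  final returned = approximate.toSet();
  return exact.where(returned.contains).length / exact.length;
}
```

Feeding the ids returned by the tuned index into `recallAtK` alongside the `exactTopK` result gives a single number to track while adjusting `efSearch` or `probes`.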
Connectivity Issues
Connection Pool Exhaustion
Symptoms:
- "Connection pool exhausted" errors
- Timeouts on new connections
- High connection wait times
Diagnostic Script:
#!/bin/bash
# scripts/diagnose-connections.sh
echo "=== Connection Pool Status ==="
kubectl exec -it vektagraf-pod -- vektagraf-cli connections status
echo "=== Active Connections ==="
kubectl exec -it vektagraf-pod -- vektagraf-cli connections list --active
echo "=== Connection Pool Configuration ==="
kubectl exec -it vektagraf-pod -- vektagraf-cli config get database.connectionPool
echo "=== Recent Connection Errors ==="
kubectl logs vektagraf-pod --since=1h | grep -i "connection.*error"
Solutions:
// lib/connection/pool_optimization.dart
class ConnectionPoolOptimizer {
static ConnectionPoolConfig optimize(ConnectionMetrics metrics) {
final peakConnections = metrics.maxConcurrentConnections;
final averageHoldTime = metrics.averageConnectionHoldTime;
return ConnectionPoolConfig(
maxConnections: (peakConnections * 1.5).round(),
minConnections: (peakConnections * 0.2).round(),
connectionTimeout: Duration(seconds: 30),
idleTimeout: averageHoldTime * 2,
validationQuery: 'SELECT 1',
testOnBorrow: true,
testWhileIdle: true,
);
}
}
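If peak concurrency has not been measured yet, Little's law gives a first estimate: the average number of connections in use equals the request arrival rate multiplied by the average hold time. The helper below is a hypothetical sketch of that arithmetic; its result is an average, so measured peaks should replace it once available.

```dart
/// Estimates concurrently held connections via Little's law:
/// concurrency = arrival rate x average hold time.
int estimateConcurrentConnections({
  required double requestsPerSecond,
  required Duration averageHoldTime,
}) {
  if (requestsPerSecond < 0) {
    throw ArgumentError('requestsPerSecond must not be negative');
  }
  // Multiply before dividing to keep the arithmetic exact for whole inputs.
  final held = requestsPerSecond * averageHoldTime.inMicroseconds / 1e6;
  return held.ceil();
}
```

For example, 200 requests per second that each hold a connection for 50 ms keep about 10 connections busy on average.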
Network Connectivity Issues
Diagnostic Commands:
# Test network connectivity
kubectl exec -it vektagraf-pod -- nc -zv vektagraf-server 9090
# Check DNS resolution
kubectl exec -it vektagraf-pod -- nslookup vektagraf-server
# Test service discovery
kubectl exec -it vektagraf-pod -- curl -v http://vektagraf-service:80/health
# Check network policies
kubectl get networkpolicies -n vektagraf-production
Security Issues
Authentication Failures
Symptoms:
- "Authentication failed" errors
- JWT token validation failures
- Unauthorized access attempts
Diagnostic Tools:
// lib/security/auth_diagnostics.dart
class AuthDiagnostics {
static Future<AuthReport> diagnoseAuthIssues() async {
final failedAttempts = await _getFailedAuthAttempts();
final tokenIssues = await _analyzeTokenIssues();
final certificateStatus = await _checkCertificates();
return AuthReport(
failedAttempts: failedAttempts,
tokenIssues: tokenIssues,
certificateStatus: certificateStatus,
);
}
static Future<List<FailedAuthAttempt>> _getFailedAuthAttempts() async {
// Query audit logs for failed authentication attempts
final logs = await AuditLogger.query(
filter: 'event_type:auth_failure',
timeRange: Duration(hours: 24),
);
return logs.map((log) => FailedAuthAttempt.fromLog(log)).toList();
}
static Future<List<TokenIssue>> _analyzeTokenIssues() async {
// Analyze JWT token validation failures
final tokenErrors = await LogAnalyzer.findPatterns(
pattern: r'JWT.*(invalid|expired|malformed)',
timeRange: Duration(hours: 1),
);
return tokenErrors.map((error) => TokenIssue.fromError(error)).toList();
}
}
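When a specific token is reported as failing, inspecting its claims quickly separates expired tokens from malformed ones. The sketch below decodes the payload without verifying the signature, so it is suitable for diagnostics only and never for authorization decisions. `decodeJwtPayload` and `classifyToken` are hypothetical helpers; the only assumption about the token is the standard `exp` claim from RFC 7519.

```dart
import 'dart:convert';

/// Decodes the payload (second segment) of a JWT. Does NOT verify the
/// signature. Throws FormatException if the token cannot be decoded.
Map<String, dynamic> decodeJwtPayload(String token) {
  final parts = token.split('.');
  if (parts.length != 3) throw const FormatException('expected 3 segments');
  // JWT segments omit base64 padding; normalize() restores it.
  final json = utf8.decode(base64Url.decode(base64Url.normalize(parts[1])));
  final decoded = jsonDecode(json);
  if (decoded is! Map<String, dynamic>) {
    throw const FormatException('payload is not a JSON object');
  }
  return decoded;
}

/// Returns 'malformed', 'expired', or 'ok'. Here 'ok' only means the token
/// decodes and has not passed its `exp` time (seconds since the Unix epoch).
String classifyToken(String token, {DateTime? now}) {
  try {
    final exp = decodeJwtPayload(token)['exp'];
    if (exp is! num) return 'ok'; // No expiry claim present.
    final expiresAt =
        DateTime.fromMillisecondsSinceEpoch(exp.toInt() * 1000, isUtc: true);
    return (now ?? DateTime.now()).isAfter(expiresAt) ? 'expired' : 'ok';
  } on FormatException {
    return 'malformed';
  }
}
```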
Solutions:
# Check JWT secret configuration
kubectl get secret vektagraf-secrets -o jsonpath='{.data.jwt-secret}' | base64 -d | wc -c
# Verify certificate validity
kubectl exec -it vektagraf-pod -- openssl x509 -in /certs/tls.crt -text -noout
# Test authentication endpoint
curl -X POST https://api.yourdomain.com/auth/login \
-H "Content-Type: application/json" \
-d '{"username":"test","password":"test"}' \
-v
Data Issues
Data Corruption
Symptoms:
- Checksum validation failures
- Inconsistent query results
- Database integrity errors
Diagnostic Procedures:
// lib/data/integrity_checker.dart
class DataIntegrityChecker {
static Future<IntegrityReport> checkIntegrity() async {
final checksumResults = await _verifyChecksums();
final referentialIntegrity = await _checkReferentialIntegrity();
final indexConsistency = await _verifyIndexConsistency();
return IntegrityReport(
checksumResults: checksumResults,
referentialIntegrity: referentialIntegrity,
indexConsistency: indexConsistency,
overallStatus: _calculateOverallStatus([
checksumResults.status,
referentialIntegrity.status,
indexConsistency.status,
]),
);
}
static Future<ChecksumResults> _verifyChecksums() async {
final corruptedBlocks = <String>[];
// Verify data block checksums
await for (final block in VektagrafDatabase.getAllDataBlocks()) {
final expectedChecksum = block.metadata.checksum;
final actualChecksum = await _calculateChecksum(block.data);
if (expectedChecksum != actualChecksum) {
corruptedBlocks.add(block.id);
}
}
return ChecksumResults(
totalBlocks: await VektagrafDatabase.getBlockCount(),
corruptedBlocks: corruptedBlocks,
status: corruptedBlocks.isEmpty
? IntegrityStatus.healthy
: IntegrityStatus.corrupted,
);
}
}
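The checker above leaves `_calculateChecksum` undefined, and the chapter does not say which algorithm Vektagraf uses. To make the comparison step concrete, the sketch below uses Adler-32 (RFC 1950) purely because it fits in a few lines; it is not a cryptographic hash, and `StoredBlock` and `findCorrupted` are hypothetical names.

```dart
/// Adler-32 checksum (RFC 1950) of [data]. Illustrative choice only.
int adler32(List<int> data) {
  const modulus = 65521; // Largest prime below 2^16.
  var a = 1, b = 0;
  for (final byte in data) {
    a = (a + (byte & 0xff)) % modulus;
    b = (b + a) % modulus;
  }
  return (b << 16) | a;
}

/// A data block together with the checksum recorded when it was written.
class StoredBlock {
  final String id;
  final int checksum;
  final List<int> data;
  const StoredBlock(this.id, this.checksum, this.data);
}

/// Ids of blocks whose recomputed checksum differs from the stored one.
List<String> findCorrupted(Iterable<StoredBlock> blocks) => [
      for (final block in blocks)
        if (adler32(block.data) != block.checksum) block.id,
    ];
```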
Recovery Procedures:
#!/bin/bash
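# scripts/data-recovery.sh
# Abort on the first failed step so traffic is not restored after a failed backup or repair
set -euo pipefail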
echo "Starting data integrity check..."
# Stop application traffic
kubectl scale deployment vektagraf-app --replicas=0
# Create backup before recovery
vektagraf-backup create --output /tmp/pre-recovery-backup.vbk
# Run integrity check
vektagraf-cli integrity check --repair --verbose
# Verify repair
vektagraf-cli integrity verify
# Restore application traffic
kubectl scale deployment vektagraf-app --replicas=3
echo "Data recovery completed"
Replication Lag
Symptoms:
- Stale data in read replicas
- Replication delay warnings
- Inconsistent read results
Monitoring Script:
#!/bin/bash
# scripts/monitor-replication.sh
while true; do
echo "=== Replication Status $(date) ==="
# Check replication lag
kubectl exec -it vektagraf-master -- vektagraf-cli replication status
# Check replica health
for replica in $(kubectl get pods -l role=replica -o name); do
echo "Checking $replica..."
kubectl exec -it $replica -- vektagraf-cli health check
done
sleep 30
done
Debugging Techniques
Application-Level Debugging
Enabling Debug Mode
// lib/debugging/debug_config.dart
class DebugConfig {
static void enableDebugMode() {
Logger.root.level = Level.ALL;
Logger.root.onRecord.listen((record) {
print('${record.level.name}: ${record.time}: ${record.message}');
if (record.error != null) {
print('Error: ${record.error}');
}
if (record.stackTrace != null) {
print('Stack trace: ${record.stackTrace}');
}
});
}
static void enableQueryLogging() {
VektagrafDatabase.setQueryLogger((query, duration, result) {
print('Query: $query');
print('Duration: ${duration.inMilliseconds}ms');
print('Result count: ${result.length}');
});
}
static void enableVectorSearchLogging() {
VectorSpace.setSearchLogger((query, results, duration) {
print('Vector query: ${query.vector.length} dimensions');
print('Results: ${results.length} matches');
print('Duration: ${duration.inMilliseconds}ms');
});
}
}
Distributed Tracing
// lib/tracing/distributed_tracing.dart
class DistributedTracing {
static void initializeTracing() {
final tracer = JaegerTracer(
serviceName: 'vektagraf-app',
endpoint: Platform.environment['JAEGER_ENDPOINT'] ??
'http://jaeger:14268/api/traces',
);
GlobalTracer.register(tracer);
}
static Future<T> traceOperation<T>(
String operationName,
Future<T> Function(Span span) operation,
) async {
final span = GlobalTracer.instance.startSpan(operationName);
try {
return await operation(span);
} catch (e) {
span.setTag('error', true);
span.log({'error.message': e.toString()});
rethrow;
} finally {
span.finish();
}
}
}
// Usage example
Future<List<User>> searchUsers(String query) async {
return await DistributedTracing.traceOperation(
'search_users',
(span) async {
span.setTag('query', query);
final results = await database.users
.where((u) => u.name.contains(query))
.toList();
span.setTag('result_count', results.length);
return results;
},
);
}
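Where no tracing backend is available, for example in a local reproduction, the same wrap-and-record pattern works with a plain stopwatch. The sketch below is an assumption-level stand-in rather than a replacement for distributed tracing: `TimingRecord` and `timed` are hypothetical names, and the caller decides where records go.

```dart
/// Outcome of one timed operation. Hypothetical type for this sketch.
class TimingRecord {
  final String operation;
  final Duration elapsed;
  final Object? error;
  const TimingRecord(this.operation, this.elapsed, [this.error]);
}

/// Runs [operation], reports how long it took and whether it threw, and
/// rethrows any error so callers still see the original failure.
Future<T> timed<T>(
  String name,
  Future<T> Function() operation,
  void Function(TimingRecord) sink,
) async {
  final stopwatch = Stopwatch()..start();
  try {
    final result = await operation();
    sink(TimingRecord(name, stopwatch.elapsed));
    return result;
  } catch (error) {
    sink(TimingRecord(name, stopwatch.elapsed, error));
    rethrow;
  }
}
```

Passing `records.add` as the sink collects timings in a list; passing a function that prints gives quick console output.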
Infrastructure-Level Debugging
Container Debugging
#!/bin/bash
# scripts/debug-container.sh
POD_NAME="$1"
NAMESPACE="${2:-vektagraf-production}"
if [ -z "$POD_NAME" ]; then
echo "Usage: $0 <pod-name> [namespace]"
exit 1
fi
echo "=== Pod Information ==="
kubectl describe pod "$POD_NAME" -n "$NAMESPACE"
echo "=== Container Logs ==="
kubectl logs "$POD_NAME" -n "$NAMESPACE" --tail=100
echo "=== Resource Usage ==="
kubectl top pod "$POD_NAME" -n "$NAMESPACE"
echo "=== Network Information ==="
kubectl exec -it "$POD_NAME" -n "$NAMESPACE" -- netstat -tuln
echo "=== Process Information ==="
kubectl exec -it "$POD_NAME" -n "$NAMESPACE" -- ps aux
echo "=== Disk Usage ==="
kubectl exec -it "$POD_NAME" -n "$NAMESPACE" -- df -h
echo "=== Environment Variables ==="
kubectl exec -it "$POD_NAME" -n "$NAMESPACE" -- env | sort
Network Debugging
#!/bin/bash
# scripts/debug-network.sh
NAMESPACE="${1:-vektagraf-production}"
echo "=== Service Information ==="
kubectl get services -n "$NAMESPACE"
echo "=== Endpoint Information ==="
kubectl get endpoints -n "$NAMESPACE"
echo "=== Network Policies ==="
kubectl get networkpolicies -n "$NAMESPACE"
echo "=== DNS Resolution Test ==="
kubectl run dns-test -n "$NAMESPACE" --image=busybox --rm -it --restart=Never -- nslookup vektagraf-service
echo "=== Connectivity Test ==="
kubectl run connectivity-test -n "$NAMESPACE" --image=curlimages/curl --rm -it --restart=Never -- \
curl -v http://vektagraf-service:80/health
Diagnostic Tools
Built-in Diagnostic Commands
// lib/cli/diagnostic_commands.dart
class DiagnosticCommands {
static void registerCommands(CommandRunner runner) {
runner.addCommand(HealthCheckCommand());
runner.addCommand(PerformanceAnalysisCommand());
runner.addCommand(ConnectionStatusCommand());
runner.addCommand(IntegrityCheckCommand());
runner.addCommand(ConfigValidationCommand());
}
}
class HealthCheckCommand extends Command {
@override
String get name => 'health';
@override
String get description => 'Perform comprehensive health check';
@override
Future<void> run() async {
final healthChecker = HealthChecker();
final report = await healthChecker.performFullCheck();
print('=== Health Check Report ===');
print('Overall Status: ${report.overallStatus}');
for (final check in report.checks) {
print('${check.name}: ${check.status}');
if (check.status != HealthStatus.healthy) {
print(' Issue: ${check.message}');
print(' Recommendation: ${check.recommendation}');
}
}
}
}
class PerformanceAnalysisCommand extends Command {
@override
String get name => 'perf';
@override
String get description => 'Analyze performance metrics';
@override
Future<void> run() async {
final analyzer = PerformanceAnalyzer();
final report = await analyzer.generateReport();
print('=== Performance Analysis ===');
print('Query Performance:');
print(' Average Latency: ${report.averageQueryLatency}ms');
print(' 95th Percentile: ${report.p95QueryLatency}ms');
print(' Slow Queries: ${report.slowQueries.length}');
print('Vector Search Performance:');
print(' Average Latency: ${report.averageVectorLatency}ms');
print(' Throughput: ${report.vectorThroughput} queries/sec');
print('Resource Usage:');
print(' CPU: ${report.cpuUsage}%');
print(' Memory: ${report.memoryUsage}%');
print(' Disk I/O: ${report.diskIOPS} IOPS');
}
}
External Monitoring Tools
Prometheus Queries for Troubleshooting
# monitoring/troubleshooting-queries.yaml
queries:
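# High error rate (aggregate both sides first; without sum() the division
# matches each 5xx series against itself and always yields 1)
high_error_rate: |
sum(rate(vektagraf_requests_total{status=~"5.."}[5m])) /
sum(rate(vektagraf_requests_total[5m])) > 0.05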
# High latency
high_latency: |
histogram_quantile(0.95,
rate(vektagraf_request_duration_seconds_bucket[5m])) > 1.0
# Memory pressure
memory_pressure: |
(vektagraf_memory_usage_bytes / vektagraf_memory_limit_bytes) > 0.8
# Connection pool exhaustion
connection_pool_exhaustion: |
vektagraf_connection_pool_active / vektagraf_connection_pool_max > 0.9
# Disk space low
disk_space_low: |
(vektagraf_disk_used_bytes / vektagraf_disk_total_bytes) > 0.85
# Replication lag
replication_lag: |
vektagraf_replication_lag_seconds > 30
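To make the arithmetic behind the `high_error_rate` query concrete, the sketch below computes a per-second rate from two counter readings and then the share of failed requests. It is a simplified model of PromQL's `rate()` that ignores counter resets and extrapolation; `perSecondRate` and `errorRatio` are hypothetical helpers.

```dart
/// Per-second increase of a monotonically increasing counter between two
/// readings taken [window] apart. Simplified: no counter-reset handling.
double perSecondRate(num earlier, num later, Duration window) {
  if (window <= Duration.zero) throw ArgumentError('window must be positive');
  return (later - earlier) / (window.inMilliseconds / 1000);
}

/// Share of requests that failed, given 5xx and total request rates.
double errorRatio(double errorRate, double totalRate) =>
    totalRate == 0 ? 0.0 : errorRate / totalRate;
```

With 600 additional requests and 30 additional 5xx responses over five minutes, the rates are 2.0 and 0.1 per second and the ratio is 0.05, which sits exactly on the threshold and therefore does not satisfy `> 0.05`.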
Grafana Dashboard for Troubleshooting
{
"dashboard": {
"title": "Vektagraf Troubleshooting",
"panels": [
{
"title": "Error Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(vektagraf_requests_total{status=~\"5..\"}[5m])) / sum(rate(vektagraf_requests_total[5m]))",
"legendFormat": "Error Rate"
}
],
"thresholds": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 0.01},
{"color": "red", "value": 0.05}
]
},
{
"title": "Response Time Distribution",
"type": "heatmap",
"targets": [
{
"expr": "rate(vektagraf_request_duration_seconds_bucket[5m])",
"format": "heatmap"
}
]
},
{
"title": "Resource Usage",
"type": "timeseries",
"targets": [
{
"expr": "vektagraf_cpu_usage_percent",
"legendFormat": "CPU %"
},
{
"expr": "vektagraf_memory_usage_percent",
"legendFormat": "Memory %"
}
]
}
]
}
}
Maintenance Procedures
Routine Maintenance Tasks
Daily Maintenance Checklist
#!/bin/bash
# scripts/daily-maintenance.sh
echo "=== Daily Maintenance $(date) ==="
# Check system health
echo "1. Checking system health..."
kubectl exec -it vektagraf-pod -- vektagraf-cli health check
# Verify backups
echo "2. Verifying backups..."
./scripts/verify-backups.sh
# Check disk space
echo "3. Checking disk space..."
kubectl exec -it vektagraf-pod -- df -h
# Review error logs
echo "4. Reviewing error logs..."
kubectl logs vektagraf-pod --since=24h | grep -i error | tail -20
# Check replication status
echo "5. Checking replication status..."
kubectl exec -it vektagraf-pod -- vektagraf-cli replication status
# Performance metrics summary
echo "6. Performance summary..."
kubectl exec -it vektagraf-pod -- vektagraf-cli metrics summary
echo "Daily maintenance completed"
Weekly Maintenance Tasks
#!/bin/bash
# scripts/weekly-maintenance.sh
echo "=== Weekly Maintenance $(date) ==="
# Database optimization
echo "1. Running database optimization..."
kubectl exec -it vektagraf-pod -- vektagraf-cli optimize --analyze-tables
# Index maintenance
echo "2. Performing index maintenance..."
kubectl exec -it vektagraf-pod -- vektagraf-cli index rebuild --unused-only
# Security scan
echo "3. Running security scan..."
./scripts/security-scan.sh
# Performance analysis
echo "4. Generating performance report..."
kubectl exec -it vektagraf-pod -- vektagraf-cli perf report --output /tmp/perf-report.json
# Cleanup old logs
echo "5. Cleaning up old logs..."
kubectl exec -it vektagraf-pod -- find /var/log -name "*.log" -mtime +7 -delete
# Update monitoring dashboards
echo "6. Updating monitoring dashboards..."
./scripts/update-dashboards.sh
echo "Weekly maintenance completed"
Database Maintenance
Index Optimization
// lib/maintenance/index_optimizer.dart
class IndexOptimizer {
static Future<IndexOptimizationReport> optimizeIndexes() async {
final unusedIndexes = await _findUnusedIndexes();
final duplicateIndexes = await _findDuplicateIndexes();
final missingIndexes = await _suggestMissingIndexes();
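// Remove unused indexes. Review the list first: an index that enforces
// uniqueness may serve no reads yet still be required.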
for (final index in unusedIndexes) {
await VektagrafDatabase.dropIndex(index.name);
}
// Create missing indexes
for (final indexSuggestion in missingIndexes) {
await VektagrafDatabase.createIndex(indexSuggestion);
}
return IndexOptimizationReport(
removedIndexes: unusedIndexes,
createdIndexes: missingIndexes,
duplicateIndexes: duplicateIndexes,
);
}
static Future<List<IndexInfo>> _findUnusedIndexes() async {
final indexStats = await VektagrafDatabase.getIndexStatistics();
return indexStats.where((index) =>
index.usageCount == 0 &&
index.createdAt.isBefore(DateTime.now().subtract(Duration(days: 30)))
).toList();
}
static Future<List<IndexSuggestion>> _suggestMissingIndexes() async {
final queryAnalysis = await QueryAnalyzer.analyzeSlowQueries();
final suggestions = <IndexSuggestion>[];
for (final query in queryAnalysis.slowQueries) {
final whereClause = query.parseWhereClause();
if (whereClause.fields.isNotEmpty) {
suggestions.add(IndexSuggestion(
fields: whereClause.fields,
type: IndexType.btree,
estimatedImprovement: query.estimatedImprovement,
));
}
}
return suggestions;
}
}
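The optimizer above calls `_findDuplicateIndexes` without showing it. One plausible approach, sketched here under assumptions, treats an index as redundant when another index starts with the same fields in the same order. That reasoning holds for B-tree style composite indexes but not for hash or unique indexes, which should be excluded before calling it. `IndexDef` and `redundantIndexes` are hypothetical names.

```dart
/// Minimal index description: a name and its ordered field list.
class IndexDef {
  final String name;
  final List<String> fields;
  const IndexDef(this.name, this.fields);
}

/// True when [shorter] is a leading prefix of [longer] (or equal to it).
bool _isPrefix(List<String> shorter, List<String> longer) {
  if (shorter.length > longer.length) return false;
  for (var i = 0; i < shorter.length; i++) {
    if (shorter[i] != longer[i]) return false;
  }
  return true;
}

/// Names of indexes covered by another index with the same leading fields.
/// Among exact duplicates, the first one declared is kept.
List<String> redundantIndexes(List<IndexDef> indexes) {
  final redundant = <String>[];
  for (var i = 0; i < indexes.length; i++) {
    for (var j = 0; j < indexes.length; j++) {
      if (i == j) continue;
      final a = indexes[i], b = indexes[j];
      final sameLength = a.fields.length == b.fields.length;
      if (_isPrefix(a.fields, b.fields) && (!sameLength || j < i)) {
        redundant.add(a.name);
        break;
      }
    }
  }
  return redundant;
}
```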
Vacuum and Analyze Operations
#!/bin/bash
# scripts/database-maintenance.sh
echo "Starting database maintenance..."
# Analyze table statistics
echo "Analyzing table statistics..."
kubectl exec -it vektagraf-pod -- vektagraf-cli analyze --all-tables
# Vacuum unused space
echo "Vacuuming unused space..."
kubectl exec -it vektagraf-pod -- vektagraf-cli vacuum --full
# Rebuild fragmented indexes
echo "Rebuilding fragmented indexes..."
kubectl exec -it vektagraf-pod -- vektagraf-cli index rebuild --fragmented-only
# Update query planner statistics
echo "Updating query planner statistics..."
kubectl exec -it vektagraf-pod -- vektagraf-cli stats update
echo "Database maintenance completed"
System Maintenance
Log Rotation and Cleanup
# k8s/log-rotation-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: log-rotation
namespace: vektagraf-production
spec:
schedule: "0 2 * * *" # Daily at 2 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: log-rotator
image: busybox
command:
- /bin/sh
- -c
- |
# Rotate application logs
find /var/log -name "*.log" -size +100M -exec gzip {} \;
# Remove old compressed logs
find /var/log -name "*.log.gz" -mtime +30 -delete
# Rotate audit logs
find /var/audit -name "*.audit" -mtime +7 -exec gzip {} \;
find /var/audit -name "*.audit.gz" -mtime +90 -delete
# Clean up temporary files
find /tmp -type f -mtime +1 -delete
volumeMounts:
- name: log-volume
mountPath: /var/log
- name: audit-volume
mountPath: /var/audit
volumes:
- name: log-volume
hostPath:
path: /var/log/vektagraf
- name: audit-volume
hostPath:
path: /var/audit/vektagraf
restartPolicy: OnFailure
Certificate Renewal
#!/bin/bash
# scripts/renew-certificates.sh
echo "Checking certificate expiration..."
# Check TLS certificates
for cert in $(kubectl get secrets -n vektagraf-production -o name | grep tls); do
echo "Checking $cert..."
expiry=$(kubectl get "$cert" -n vektagraf-production -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -enddate | cut -d= -f2)
expiry_epoch=$(date -d "$expiry" +%s)
current_epoch=$(date +%s)
days_until_expiry=$(( (expiry_epoch - current_epoch) / 86400 ))
echo " Expires: $expiry ($days_until_expiry days)"
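if [ "$days_until_expiry" -lt 30 ]; then
echo " WARNING: Certificate expires in less than 30 days!"
# Trigger certificate renewal (same namespace as the lookup; --overwrite allows repeat runs)
kubectl annotate "$cert" -n vektagraf-production --overwrite cert-manager.io/force-renewal="$(date +%s)"
fi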
done
echo "Certificate check completed"
Health Checks and Monitoring
Comprehensive Health Checks
// lib/health/health_checker.dart
class HealthChecker {
final List<HealthCheck> checks;
HealthChecker() : checks = [
DatabaseHealthCheck(),
VectorSearchHealthCheck(),
SecurityHealthCheck(),
NetworkHealthCheck(),
ResourceHealthCheck(),
];
Future<HealthReport> performFullCheck() async {
final results = <HealthCheckResult>[];
for (final check in checks) {
try {
final result = await check.perform();
results.add(result);
} catch (e) {
results.add(HealthCheckResult(
name: check.name,
status: HealthStatus.unhealthy,
message: 'Health check failed: $e',
timestamp: DateTime.now(),
));
}
}
return HealthReport(
timestamp: DateTime.now(),
checks: results,
overallStatus: _calculateOverallStatus(results),
);
}
HealthStatus _calculateOverallStatus(List<HealthCheckResult> results) {
if (results.any((r) => r.status == HealthStatus.unhealthy)) {
return HealthStatus.unhealthy;
}
if (results.any((r) => r.status == HealthStatus.degraded)) {
return HealthStatus.degraded;
}
return HealthStatus.healthy;
}
}
abstract class HealthCheck {
String get name;
Future<HealthCheckResult> perform();
}
class DatabaseHealthCheck implements HealthCheck {
@override
String get name => 'Database';
@override
Future<HealthCheckResult> perform() async {
try {
// Test basic connectivity
await VektagrafDatabase.ping();
// Test query performance
final stopwatch = Stopwatch()..start();
await VektagrafDatabase.query('SELECT 1');
stopwatch.stop();
if (stopwatch.elapsedMilliseconds > 1000) {
return HealthCheckResult(
name: name,
status: HealthStatus.degraded,
message: 'Database responding slowly (${stopwatch.elapsedMilliseconds}ms)',
timestamp: DateTime.now(),
);
}
return HealthCheckResult(
name: name,
status: HealthStatus.healthy,
message: 'Database is healthy',
timestamp: DateTime.now(),
);
} catch (e) {
return HealthCheckResult(
name: name,
status: HealthStatus.unhealthy,
message: 'Database connection failed: $e',
timestamp: DateTime.now(),
);
}
}
}
class VectorSearchHealthCheck implements HealthCheck {
@override
String get name => 'Vector Search';
@override
Future<HealthCheckResult> perform() async {
try {
// Test vector search functionality (Random requires `import 'dart:math';`)
final random = Random();
final testVector = List.generate(768, (_) => random.nextDouble());
final stopwatch = Stopwatch()..start();
final results = await VectorSpace.search(
vector: testVector,
limit: 10,
);
stopwatch.stop();
if (stopwatch.elapsedMilliseconds > 5000) {
return HealthCheckResult(
name: name,
status: HealthStatus.degraded,
message: 'Vector search is slow (${stopwatch.elapsedMilliseconds}ms)',
timestamp: DateTime.now(),
);
}
return HealthCheckResult(
name: name,
status: HealthStatus.healthy,
message: 'Vector search is healthy',
timestamp: DateTime.now(),
);
} catch (e) {
return HealthCheckResult(
name: name,
status: HealthStatus.unhealthy,
message: 'Vector search failed: $e',
timestamp: DateTime.now(),
);
}
}
}
enum HealthStatus { healthy, degraded, unhealthy }
class HealthCheckResult {
final String name;
final HealthStatus status;
final String message;
final DateTime timestamp;
const HealthCheckResult({
required this.name,
required this.status,
required this.message,
required this.timestamp,
});
}
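`performFullCheck` awaits each check in turn with no time limit, so one hung dependency can stall the whole report. A possible refinement, sketched below with hypothetical names (`runProbes`, `overall`), runs the checks concurrently and treats a timeout as unhealthy. Note that `Future.timeout` stops waiting but does not cancel the underlying operation.

```dart
import 'dart:async';

enum HealthStatus { healthy, degraded, unhealthy }

/// Runs every probe concurrently. A probe that throws or exceeds [limit]
/// is reported as unhealthy instead of stalling the whole report.
Future<Map<String, HealthStatus>> runProbes(
  Map<String, Future<HealthStatus> Function()> probes, {
  Duration limit = const Duration(seconds: 5),
}) async {
  final entries = await Future.wait(probes.entries.map((probe) async {
    try {
      final status = await probe.value().timeout(limit);
      return MapEntry(probe.key, status);
    } catch (_) {
      // Covers TimeoutException as well as errors thrown by the probe.
      return MapEntry(probe.key, HealthStatus.unhealthy);
    }
  }));
  return Map.fromEntries(entries);
}

/// Worst status wins, matching the aggregation rule used above.
HealthStatus overall(Iterable<HealthStatus> statuses) => statuses.fold(
    HealthStatus.healthy, (worst, s) => s.index > worst.index ? s : worst);
```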
Automated Monitoring Setup
# monitoring/service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: vektagraf-monitor
namespace: vektagraf-production
spec:
selector:
matchLabels:
app: vektagraf
endpoints:
- port: metrics
interval: 30s
path: /metrics
honorLabels: true
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: vektagraf-alerts
namespace: vektagraf-production
spec:
groups:
- name: vektagraf.rules
rules:
- alert: VektagrafDown
expr: up{job="vektagraf"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Vektagraf instance is down"
description: "Vektagraf instance {{ $labels.instance }} has been down for more than 1 minute"
- alert: VektagrafHighErrorRate
expr: sum(rate(vektagraf_requests_total{status=~"5.."}[5m])) / sum(rate(vektagraf_requests_total[5m])) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: VektagrafHighLatency
expr: histogram_quantile(0.95, rate(vektagraf_request_duration_seconds_bucket[5m])) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "95th percentile latency is {{ $value }}s"
Performance Optimization
Query Performance Tuning
// lib/optimization/query_optimizer.dart
class QueryOptimizer {
static Future<OptimizationReport> optimizeQueries() async {
final slowQueries = await _identifySlowQueries();
final optimizations = <QueryOptimization>[];
for (final query in slowQueries) {
final optimization = await _optimizeQuery(query);
optimizations.add(optimization);
}
return OptimizationReport(
originalQueries: slowQueries,
optimizations: optimizations,
estimatedImprovement: _calculateImprovement(optimizations),
);
}
static Future<QueryOptimization> _optimizeQuery(SlowQuery query) async {
final suggestions = <OptimizationSuggestion>[];
// Analyze query structure
final analysis = QueryAnalyzer.analyze(query.query);
// Suggest index improvements
if (analysis.missingIndexes.isNotEmpty) {
suggestions.add(OptimizationSuggestion(
type: OptimizationType.addIndex,
description: 'Add indexes on: ${analysis.missingIndexes.join(", ")}',
estimatedImprovement: 0.7,
));
}
// Suggest query rewriting
if (analysis.canBeRewritten) {
suggestions.add(OptimizationSuggestion(
type: OptimizationType.rewriteQuery,
description: 'Rewrite query to use more efficient patterns',
rewrittenQuery: analysis.suggestedRewrite,
estimatedImprovement: 0.5,
));
}
// Suggest partitioning
if (analysis.tableSize > 1000000) {
suggestions.add(OptimizationSuggestion(
type: OptimizationType.partitioning,
description: 'Consider partitioning large table',
estimatedImprovement: 0.3,
));
}
return QueryOptimization(
originalQuery: query,
suggestions: suggestions,
);
}
}
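The report relies on `_calculateImprovement`, which is not shown. One way to combine several per-suggestion estimates, offered here as an assumption rather than the chapter's method, is to multiply the remaining cost fractions: two suggestions of 0.7 and 0.5 leave 0.3 × 0.5 = 0.15 of the original cost, for a combined improvement of 0.85. This treats the improvements as independent, which real workloads rarely guarantee, so the result is best read as optimistic.

```dart
/// Combines improvement fractions (each in [0, 1]) as
/// 1 - (1 - p1) * (1 - p2) * ..., assuming they act independently.
double combinedImprovement(Iterable<double> improvements) {
  var remaining = 1.0;
  for (final p in improvements) {
    if (p < 0 || p > 1) throw ArgumentError('improvement must be in [0, 1]');
    remaining *= 1 - p;
  }
  return 1 - remaining;
}
```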
Memory Optimization
// lib/optimization/memory_optimizer.dart
class MemoryOptimizer {
static Future<MemoryOptimizationReport> optimizeMemoryUsage() async {
final currentUsage = await _getCurrentMemoryUsage();
final recommendations = <MemoryRecommendation>[];
// Analyze cache usage
if (currentUsage.cacheHitRate < 0.8) {
recommendations.add(MemoryRecommendation(
type: MemoryOptimizationType.increaseCacheSize,
description: 'Increase cache size to improve hit rate',
currentValue: currentUsage.cacheSize,
recommendedValue: (currentUsage.cacheSize * 1.5).round(),
));
}
// Analyze connection pool
if (currentUsage.connectionPoolUtilization > 0.9) {
recommendations.add(MemoryRecommendation(
type: MemoryOptimizationType.optimizeConnectionPool,
description: 'Optimize connection pool configuration',
currentValue: currentUsage.maxConnections,
recommendedValue: (currentUsage.maxConnections * 1.2).round(),
));
}
// Analyze vector index memory
if (currentUsage.vectorIndexMemory > currentUsage.totalMemory * 0.6) {
recommendations.add(MemoryRecommendation(
type: MemoryOptimizationType.optimizeVectorIndex,
description: 'Optimize vector index memory usage',
currentValue: currentUsage.vectorIndexMemory,
recommendedValue: (currentUsage.totalMemory * 0.5).round(),
));
}
return MemoryOptimizationReport(
currentUsage: currentUsage,
recommendations: recommendations,
);
}
}
Best Practices
Troubleshooting Best Practices
- Systematic Approach
  - Follow a consistent methodology
  - Document all steps and findings
  - Test hypotheses methodically
  - Verify solutions thoroughly
- Information Gathering
  - Collect comprehensive logs
  - Gather performance metrics
  - Document system state
  - Interview users about symptoms
- Root Cause Analysis
  - Look beyond symptoms
  - Consider system interactions
  - Use data-driven analysis
  - Validate assumptions
Maintenance Best Practices
- Preventive Maintenance
  - Regular health checks
  - Proactive monitoring
  - Scheduled maintenance windows
  - Capacity planning
- Documentation
  - Maintain runbooks
  - Document procedures
  - Track changes
  - Share knowledge
- Testing
  - Test maintenance procedures
  - Validate backups regularly
  - Practice disaster recovery
  - Monitor after changes
Advanced Topics
Automated Remediation
// lib/automation/auto_remediation.dart
class AutoRemediation {
static final Map<String, RemediationAction> actions = {
'high_memory_usage': RestartServiceAction(),
'connection_pool_exhaustion': ScaleConnectionPoolAction(),
'disk_space_low': CleanupLogsAction(),
'replication_lag': RestartReplicationAction(),
};
static Future<void> handleAlert(Alert alert) async {
final action = actions[alert.type];
if (action != null && action.isAutomated) {
try {
await action.execute(alert.context);
await _notifySuccess(alert, action);
} catch (e) {
await _notifyFailure(alert, action, e);
}
} else {
await _escalateToHuman(alert);
}
}
}
abstract class RemediationAction {
bool get isAutomated;
Future<void> execute(Map<String, dynamic> context);
}
class RestartServiceAction implements RemediationAction {
@override
bool get isAutomated => true;
@override
Future<void> execute(Map<String, dynamic> context) async {
final serviceName = context['service_name'] as String;
await KubernetesClient.restartDeployment(serviceName);
}
}
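Automated actions such as `RestartServiceAction` can loop if the underlying fault persists: the alert fires again, the service restarts again, and the real cause stays hidden. A common safeguard is to cap how often an action may run for a given alert type and escalate once the cap is reached. The sketch below shows one way to do that; `RemediationGuard` and its parameters are hypothetical.

```dart
/// Allows an automated action for a given key at most [maxRuns] times per
/// [window]; further attempts should be escalated to a person.
class RemediationGuard {
  final int maxRuns;
  final Duration window;
  final Map<String, List<DateTime>> _history = {};

  RemediationGuard({this.maxRuns = 3, this.window = const Duration(hours: 1)});

  /// Returns true and records the run if the action may proceed at [now].
  bool tryAcquire(String key, DateTime now) {
    final runs = _history.putIfAbsent(key, () => []);
    // Forget runs that have aged out of the window.
    runs.removeWhere((t) => now.difference(t) >= window);
    if (runs.length >= maxRuns) return false;
    runs.add(now);
    return true;
  }
}
```

A handler would call `tryAcquire(alert.type, DateTime.now())` before executing the action and escalate when it returns false.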
Chaos Engineering for Resilience Testing
// lib/chaos/chaos_experiments.dart
class ChaosExperiments {
static Future<void> runNetworkLatencyExperiment() async {
// Inject network latency
await NetworkChaos.injectLatency(
duration: Duration(minutes: 5),
latency: Duration(milliseconds: 100),
targets: ['vektagraf-service'],
);
// Monitor system behavior
await _monitorSystemHealth(Duration(minutes: 10));
}
static Future<void> runMemoryPressureExperiment() async {
// Create memory pressure
await ResourceChaos.createMemoryPressure(
duration: Duration(minutes: 3),
percentage: 80,
targets: ['vektagraf-app'],
);
// Verify graceful degradation
await _verifyGracefulDegradation();
}
}
Summary
This chapter provided comprehensive guidance for troubleshooting and maintaining Vektagraf applications, including:
- Systematic Troubleshooting: Methodical approaches to problem resolution
- Common Issues: Solutions for performance, connectivity, security, and data issues
- Debugging Techniques: Application and infrastructure-level debugging
- Diagnostic Tools: Built-in commands and external monitoring tools
- Maintenance Procedures: Routine tasks and database optimization
- Health Monitoring: Comprehensive health checks and automated monitoring
- Performance Optimization: Query tuning and resource optimization
Key Takeaways
- Follow systematic troubleshooting methodologies
- Use comprehensive diagnostic tools and monitoring
- Implement proactive maintenance procedures
- Maintain detailed documentation and runbooks
- Practice disaster recovery and chaos engineering
- Automate routine tasks and remediation where possible
Next Steps
This completes Part V: Enterprise Deployment. Continue with:
- Part VI: Use Cases and Patterns (Chapters 20-23)
- Part VII: Reference (Chapters 24-27)