Observability: Alert Rules & Monitoring
Monitor your CampusMind AI platform.
Written by vendor@royalcyber.com
Last updated 3 months ago
What is Observability?
Observability in CampusMind AI provides production-grade monitoring and alerting for critical infrastructure components. The system monitors Azure Container Apps, App Services, and MySQL Flexible Server databases to proactively detect and alert on performance issues, resource constraints, and service disruptions.
Why Observability Matters
Proactive Issue Detection: Identify problems before they impact users
Performance Monitoring: Track resource utilization and application health
Rapid Response: Alert teams immediately when thresholds are exceeded
Root Cause Analysis: Detailed metrics and logs for troubleshooting
Service Reliability: Maintain platform uptime and performance standards
How Azure Alerts Work
Alert Types:
CampusMind AI uses two types of Azure Monitor alerts:
1. Metric Alerts
Monitor platform metrics like CPU, memory, requests, and replica counts.
Trigger: When a metric crosses a configured threshold
Evaluation: Periodic checks at defined frequency
Aggregation: Average, Total, Minimum, Maximum, or Count
Auto-resolve: Automatically resolves when metric returns to normal
Example: CPU Average > 80% for 15 minutes
Azure samples metric values every 5 minutes
Computes average over the last 15 minutes
Fires alert if average exceeds 80%
Invokes action group to notify team
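The evaluation above can be sketched as a small Python simulation of the sliding-window check, assuming 5-minute samples and a 15-minute averaging window. The names and values are illustrative, not Azure's actual implementation:

```python
# Minimal simulation of a metric alert: the average CPU over the last
# 15 minutes (three 5-minute samples) must exceed 80% to fire.
WINDOW_SAMPLES = 3   # 15-minute window / 5-minute sampling interval
THRESHOLD = 80.0     # percent CPU

def alert_fires(cpu_samples):
    """Return True if the average of the most recent window exceeds the threshold."""
    if len(cpu_samples) < WINDOW_SAMPLES:
        return False  # not enough data yet to evaluate a full window
    window = cpu_samples[-WINDOW_SAMPLES:]
    return sum(window) / len(window) > THRESHOLD

# A brief spike does not fire; sustained high CPU does.
print(alert_fires([70, 95, 70]))   # average 78.3 -> False
print(alert_fires([85, 90, 88]))   # average 87.7 -> True
```

Averaging over the window is what prevents a single momentary spike from paging the team, while sustained pressure still triggers the alert.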
2. Log Alerts
Monitor specific events in application logs using Kusto queries.
Trigger: When query returns matching log entries
Evaluation: Scheduled query runs at defined intervals
Resolution: Resolves when query returns zero matches
Example: Scale event detected in Container App logs
Kusto query runs every 5 minutes
Searches Container Apps Log Analytics workspace
Fires if scale event records found
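As a rough sketch of the same fire/resolve logic in Python (the record fields below are illustrative, not the real Log Analytics schema, and the filter stands in for the actual Kusto query):

```python
# Minimal sketch of a scheduled log alert: run a "query" over recent log
# records and fire if it returns any matches; resolve on zero matches.
# The field names (TimeGenerated, Log_s) are illustrative assumptions.
def scale_events(records):
    """Stand-in for the Kusto query: keep records mentioning a scale event."""
    return [r for r in records if "Scaling" in r["Log_s"]]

def evaluate_log_alert(records):
    """Fire when the query returns matches; resolve when it returns zero."""
    return "Fired" if scale_events(records) else "Resolved"

logs = [
    {"TimeGenerated": "2024-01-01T10:00:00Z", "Log_s": "Request handled in 12 ms"},
    {"TimeGenerated": "2024-01-01T10:03:00Z", "Log_s": "Scaling replicas from 2 to 4"},
]
print(evaluate_log_alert(logs))   # -> Fired
print(evaluate_log_alert([]))     # -> Resolved
```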
Alert Components
Container Apps Alert Rules
What Each Alert Means
CA-CPU-High: Container CPU usage sustained above 80% - may indicate need to scale or optimize code
CA-Memory-High: Memory usage above 75% - potential memory leak or need for scaling
CA-High-Request-Volume: Over 1000 requests in 5 minutes - traffic spike or potential attack
CA-No-Running-Replicas: No active replicas - service is down
CA-Replica-Count: More than 10 replicas running - unexpected scaling or runaway auto-scaling
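One way to reason about these rules is to treat them as data. The Python sketch below checks a snapshot of current metrics against each rule; the metric names are hypothetical labels for this sketch (not Azure metric IDs), and the thresholds mirror the rules described above:

```python
# Container Apps alert rules expressed as data: (metric, comparison, threshold).
# Metric names here are hypothetical; thresholds mirror the rules above.
CA_RULES = {
    "CA-CPU-High":            ("cpu_percent",    ">",  80),
    "CA-Memory-High":         ("memory_percent", ">",  75),
    "CA-High-Request-Volume": ("requests_5m",    ">",  1000),
    "CA-No-Running-Replicas": ("replica_count",  "==", 0),
    "CA-Replica-Count":       ("replica_count",  ">",  10),
}

def fired_alerts(metrics):
    """Return the names of rules whose condition holds for this metric snapshot."""
    ops = {">": lambda v, t: v > t, "==": lambda v, t: v == t}
    return [name for name, (metric, op, threshold) in CA_RULES.items()
            if ops[op](metrics[metric], threshold)]

snapshot = {"cpu_percent": 85, "memory_percent": 60,
            "requests_5m": 200, "replica_count": 3}
print(fired_alerts(snapshot))   # -> ['CA-CPU-High']
```

Keeping thresholds in one table like this also makes it easy to review them side by side when tuning alert noise.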
App Service Alert Rules
What Each Alert Means
AS-CPU-High: App Service CPU sustained above 80% - scale up or optimize
AS-Memory-High: Memory usage above 80% - memory leak or need more resources
AS-Http-5xx: More than 10 server errors in 5 minutes - application errors or backend issues
AS-High-Request-Count: Over 1000 requests in 5 minutes - traffic spike
MySQL Flexible Server Alert Rules
What Each Alert Means
MYSQL-CPU-High: Database CPU above 80% - slow queries, missing indexes, or need to scale
MYSQL-Memory-High: Memory usage above 75% - increase buffer pool or optimize queries
MYSQL-Storage-High: Disk space above 80% - cleanup needed or storage expansion required
MYSQL-Slow-Queries: More than 50 slow queries in 5 minutes - query optimization needed
Viewing and Investigating Alerts
1. View Fired Alerts
Azure Portal → Monitor → Alerts
See alert history with Fired/Resolved status
Click alert instance for details:
Recorded measurement value
Timestamp when threshold was crossed
Triggered action group notifications
Direct link to metric or log query
2. Review Alert Rules Configuration
Azure Portal → Monitor → Alerts → Alert Rules
View all configured alert rules
Check conditions, thresholds, and evaluation settings
Review severity levels and action groups
Edit or disable rules as needed
3. Inspect Metrics
Azure Portal → Resource → Monitoring → Metrics
Select the metric that triggered the alert (CPU, Memory, Requests, etc.)
Set aggregation type and time range to match alert configuration
Apply filters by revision name or instance
Identify patterns and anomalies
4. Check Activity Logs
Azure Portal → Monitor → Activity Log
Filter by resource and event type
View server restarts, configuration changes
Inspect event JSON for detailed information
5. Email Notifications
Action group emails include:
Alert summary with severity
Resource and metric details
Direct portal link to alert instance
Timestamp and triggered value
Alert Response
When an alert fires, follow this investigation workflow:
Step 1: Review Alert Details
Go to Azure Portal → Monitor → Alerts
Click the fired alert instance
Note:
Timestamp of alert
Resource affected
Metric value that triggered alert
Severity level
Step 2: Investigate Metrics
Click the portal shortcut to open Metrics Explorer
Or navigate to Resource → Metrics
View metric trend over time
Identify when issue started
Check for patterns or anomalies
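Pinpointing when the issue started can be done by eye in Metrics Explorer, or sketched programmatically. Here is a minimal Python helper over an illustrative list of timestamped samples (in practice the data would come from an exported metric series):

```python
# Find when a metric first crossed its threshold, given timestamped samples.
# The sample data is illustrative, standing in for an exported metric series.
def first_breach(samples, threshold):
    """Return the timestamp of the first sample above the threshold, or None."""
    for timestamp, value in samples:
        if value > threshold:
            return timestamp
    return None

cpu = [("10:00", 45), ("10:05", 52), ("10:10", 83), ("10:15", 91)]
print(first_breach(cpu, 80))   # -> 10:10
```

Correlating that first breach time with deployment and activity-log timestamps is often the fastest way to identify the triggering change.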
Step 3: Review Logs (for log-based alerts)
Copy Kusto query from alert rule
Run query in Log Analytics workspace
Examine raw log entries
Look for error messages, stack traces, or patterns
Step 4: Collect Diagnostic Data
Take screenshots of metrics and logs
Copy relevant log entries
Export query results to CSV if needed
Document timeline of events
Create incident ticket with evidence
Step 5: Remediate
High CPU/Memory Alerts
Immediate actions:
Scale out (add more replicas) or scale up (larger instance)
Check for recent deployments that may have introduced issues
Analyze sudden traffic spikes
Long-term fixes:
Profile application code for bottlenecks
Optimize inefficient algorithms or queries
Implement caching strategies
Container/App Restarts or Crashes
Investigation:
Fetch container console logs from Log Analytics
Review stack traces and error messages
Check for regressions in recent deployments
Resolution:
Roll back to previous stable version
Fix identified bugs and redeploy
Increase resource limits if out-of-memory (OOM) errors are present
MySQL Storage High
Immediate actions:
Increase disk size
Clean up old binary logs
Rotate and archive logs
Long-term fixes:
Implement log rotation policy
Archive old data to blob storage
Set up automated cleanup jobs
Slow MySQL Queries
Investigation:
Run EXPLAIN on slow queries
Check for missing indexes
Review query execution plans
Resolution:
Add appropriate indexes
Optimize query structure
Implement query caching
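To decide which queries to EXPLAIN first, it can help to rank entries from the slow query log by execution time. A rough Python sketch, assuming the standard `# Query_time:` header lines that MySQL writes to its slow log (the sample log text is invented):

```python
import re

# Rank slow-log entries by execution time. Assumes the standard MySQL
# slow query log format, where each entry has a header line such as:
#   # Query_time: 12.3  Lock_time: 0.01 Rows_sent: 5  Rows_examined: 80000
# followed by the SQL statement itself.
def slowest_queries(log_text, top_n=3):
    """Return (query_time, statement) pairs, slowest first."""
    entries = []
    current_time = None
    for line in log_text.splitlines():
        match = re.match(r"# Query_time: ([\d.]+)", line)
        if match:
            current_time = float(match.group(1))
        elif current_time is not None and line.strip() and not line.startswith("#"):
            entries.append((current_time, line.strip()))
            current_time = None  # one statement per header in this sketch
    return sorted(entries, reverse=True)[:top_n]

log = """\
# Query_time: 0.4  Lock_time: 0.0 Rows_sent: 10  Rows_examined: 500
SELECT * FROM sessions WHERE user_id = 42;
# Query_time: 12.7  Lock_time: 0.1 Rows_sent: 1  Rows_examined: 900000
SELECT COUNT(*) FROM events WHERE created_at > '2024-01-01';
"""
print(slowest_queries(log))
```

The slowest entries surfaced this way are the natural candidates for EXPLAIN and index review.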
Step 6: Monitor Resolution
Watch alert instance for auto-resolution
Metric alerts resolve when metric drops below threshold
Log alerts resolve when query returns zero matches
Verify service stability over next 24 hours
Document resolution in incident ticket