Observability: Alert Rules & Monitoring

Monitor your CampusMind AI deployment.

Written by vendor@royalcyber.com

Last updated 3 months ago

What is Observability?

Observability in CampusMindAI provides production-grade monitoring and alerting for critical infrastructure components. The system monitors Azure Container Apps, App Services, and MySQL databases to proactively detect and alert on performance issues, resource constraints, and service disruptions.

Why Observability Matters

  • Proactive Issue Detection: Identify problems before they impact users

  • Performance Monitoring: Track resource utilization and application health

  • Rapid Response: Alert teams immediately when thresholds are exceeded

  • Root Cause Analysis: Detailed metrics and logs for troubleshooting

  • Service Reliability: Maintain platform uptime and performance standards

How Azure Alerts Work

Alert Types:

CampusMind AI uses two types of Azure Monitor alerts:

1. Metric Alerts

Monitor platform metrics like CPU, memory, requests, and replica counts.

  • Trigger: When a metric crosses a configured threshold

  • Evaluation: Periodic checks at defined frequency

  • Aggregation: Average, Total, Minimum, Maximum, or Count

  • Auto-resolve: Automatically resolves when metric returns to normal

Example: CPU Average > 80% for 15 minutes

  • Azure samples metric values every 5 minutes

  • Computes average over the last 15 minutes

  • Fires alert if average exceeds 80%

  • Invokes action group to notify team
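The evaluation loop above can be sketched as a sliding-window check. This is a simplified model of the behavior, not Azure's implementation; the 5-minute samples, 15-minute window (three samples), and 80% threshold mirror the example:

```python
from statistics import mean

def should_fire(samples, threshold=80.0, window=3):
    """Evaluate a metric alert: fire when the average of the last
    `window` samples (e.g. 3 x 5-minute samples = 15 minutes)
    exceeds `threshold`. Too few samples means the period is not
    yet covered, so the alert stays quiet."""
    if len(samples) < window:
        return False
    return mean(samples[-window:]) > threshold

# CPU samples taken every 5 minutes (percent)
print(should_fire([60, 85, 88, 90]))   # last three average 87.7 -> True
print(should_fire([60, 70, 75, 78]))   # last three average 74.3 -> False
```

Auto-resolution falls out of the same check: once the windowed average drops back under the threshold, the function returns False again.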

2. Log Alerts

Monitor specific events in application logs using Kusto queries.

  • Trigger: When query returns matching log entries

  • Evaluation: Scheduled query runs at defined intervals

  • Resolution: Resolves when query returns zero matches

Example: Scale event detected in Container App logs

  • Kusto query runs every 5 minutes

  • Searches Container Apps Log Analytics workspace

  • Fires if scale event records found
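The fire/resolve cycle of a log alert can be sketched the same way. The Kusto query text below is purely illustrative (real table and column names depend on your Log Analytics workspace schema); the state function models the rule that matching rows fire the alert and zero matches resolve it:

```python
# Illustrative only: the actual table/column names vary by workspace.
KUSTO_QUERY = """
ContainerAppSystemLogs_CL
| where Log_s contains "Scaling"
| where TimeGenerated > ago(5m)
"""

def alert_state(match_count, previously_fired):
    """State after one scheduled query run: rows found -> Fired;
    a fired alert that now sees zero matches -> Resolved;
    otherwise nothing to report."""
    if match_count > 0:
        return "Fired"
    return "Resolved" if previously_fired else "OK"

print(alert_state(3, previously_fired=False))  # Fired
print(alert_state(0, previously_fired=True))   # Resolved
print(alert_state(0, previously_fired=False))  # OK
```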

Alert Components

  • Action Group: Defines notification channels (email, SMS, webhooks)

  • Severity: Priority level (Sev 0-4); Sev 1 = Critical, Sev 2 = High, Sev 3 = Medium

  • Threshold: Value that triggers the alert

  • Period: Time window for metric aggregation

  • Frequency: How often Azure evaluates the condition

  • Auto-resolve: Automatic resolution when the condition returns to normal
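Taken together, these components describe one alert rule. A minimal sketch as a record type, using our own field names rather than the Azure resource schema:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """The components of an alert rule as described above.
    Field names are illustrative, not Azure's resource schema."""
    name: str
    metric: str
    aggregation: str        # Average, Total, Minimum, Maximum, or Count
    threshold: float
    period_minutes: int     # time window for aggregation
    frequency_minutes: int  # how often the condition is evaluated
    severity: int           # 0 (most severe) through 4
    action_group: str       # who gets notified
    auto_resolve: bool = True

rule = AlertRule(
    name="CA-CPU-High", metric="CPU Percentage", aggregation="Average",
    threshold=80, period_minutes=15, frequency_minutes=5,
    severity=2, action_group="oncall-email",
)
print(rule.name, "Sev", rule.severity)  # CA-CPU-High Sev 2
```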

Container Apps Alert Rules

  • CA-CPU-High: CPU Percentage (Avg) > 80%; Period 15m / Frequency 5m; Sev 2

  • CA-Memory-High: Memory Percentage (Avg) > 75%; Period 15m / Frequency 5m; Sev 2

  • CA-High-Request-Volume: Requests (Total) > 1000; Period 5m / Frequency 15m; Sev 3

  • CA-No-Running-Replicas: Replicas (Count) < 1; Period 5m / Frequency 15m; Sev 1

  • CA-Replica-Count: Replicas (Count) > 10; Period 5m / Frequency 15m; Sev 2

What Each Alert Means

  • CA-CPU-High: Container CPU usage sustained above 80% - may indicate need to scale or optimize code

  • CA-Memory-High: Memory usage above 75% - potential memory leak or need for scaling

  • CA-High-Request-Volume: Over 1000 requests in 5 minutes - traffic spike or potential attack

  • CA-No-Running-Replicas: No active replicas - service is down

  • CA-Replica-Count: More than 10 replicas running - unexpected scaling or runaway auto-scaling
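The rules and notes above boil down to five threshold checks. A sketch that evaluates a metrics snapshot against them (the rule tuples restate the thresholds above; the snapshot values are invented):

```python
# Comparison direction matters: CA-No-Running-Replicas fires when
# the value drops BELOW its threshold, the others when it rises above.
RULES = [
    ("CA-CPU-High",            "cpu_percent",    ">", 80,   2),
    ("CA-Memory-High",         "memory_percent", ">", 75,   2),
    ("CA-High-Request-Volume", "requests",       ">", 1000, 3),
    ("CA-No-Running-Replicas", "replicas",       "<", 1,    1),
    ("CA-Replica-Count",       "replicas",       ">", 10,   2),
]

def fired_alerts(metrics):
    """Return (name, severity) for every rule the snapshot violates,
    most severe first (Sev 1 before Sev 2)."""
    fired = []
    for name, key, op, threshold, sev in RULES:
        value = metrics[key]
        hit = value > threshold if op == ">" else value < threshold
        if hit:
            fired.append((name, sev))
    return sorted(fired, key=lambda item: item[1])

snapshot = {"cpu_percent": 91, "memory_percent": 60,
            "requests": 340, "replicas": 0}
print(fired_alerts(snapshot))
# [('CA-No-Running-Replicas', 1), ('CA-CPU-High', 2)]
```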

App Service Alert Rules

  • AS-CPU-High: CPU Percentage (Avg) > 80%; Period 5m / Frequency 15m; Sev 2

  • AS-Memory-High: Memory Working Set (Avg) > 80%; Period 5m / Frequency 15m; Sev 2

  • AS-Http-5xx: HTTP 5xx Errors (Total) > 10; Period 5m / Frequency 15m; Sev 1

  • AS-High-Request-Count: Requests (Total) > 1000; Period 5m / Frequency 15m; Sev 3

What Each Alert Means

  • AS-CPU-High: App Service CPU sustained above 80% - scale up or optimize

  • AS-Memory-High: Memory usage above 80% - memory leak or need more resources

  • AS-Http-5xx: More than 10 server errors in 5 minutes - application errors or backend issues

  • AS-High-Request-Count: Over 1000 requests in 5 minutes - traffic spike

MySQL Flexible Server Alert Rules

  • MYSQL-CPU-High: cpu_percent (Avg) > 80%; Period 5m / Frequency 15m; Sev 2

  • MYSQL-Memory-High: memory_percent (Avg) > 75%; Period 5m / Frequency 15m; Sev 2

  • MYSQL-Storage-High: storage_percent (Avg) > 80% (90% critical); Period 5m / Frequency 15m; Sev 1

  • MYSQL-Slow-Queries: slow_queries (Total) > 50 in 5m; Period 5m / Frequency 5m; Sev 2

What Each Alert Means

  • MYSQL-CPU-High: Database CPU above 80% - slow queries, missing indexes, or need to scale

  • MYSQL-Memory-High: Memory usage above 75% - increase buffer pool or optimize queries

  • MYSQL-Storage-High: Disk space above 80% - cleanup needed or storage expansion required

  • MYSQL-Slow-Queries: More than 50 slow queries in 5 minutes - query optimization needed
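The tiered storage thresholds (warning past 80%, critical past 90%) can be expressed as a small classifier. A sketch assuming storage_percent is already available as a number:

```python
def storage_alert(storage_percent):
    """Map storage_percent to an alert tier using the thresholds
    above: warning past 80%, critical past 90%."""
    if storage_percent > 90:
        return "critical"
    if storage_percent > 80:
        return "warning"
    return "ok"

print(storage_alert(75))  # ok
print(storage_alert(85))  # warning
print(storage_alert(95))  # critical
```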

Viewing and Investigating Alerts

1. View Fired Alerts

Azure Portal → Monitor → Alerts

  • See alert history with Fired/Resolved status

  • Click alert instance for details:

    • Recorded measurement value

    • Timestamp when threshold was crossed

    • Triggered action group notifications

    • Direct link to metric or log query

2. Review Alert Rules Configuration

Azure Portal → Monitor → Alerts → Alert Rules

  • View all configured alert rules

  • Check conditions, thresholds, and evaluation settings

  • Review severity levels and action groups

  • Edit or disable rules as needed

3. Inspect Metrics

Azure Portal → Resource → Monitoring → Metrics

  • Select the metric that triggered the alert (CPU, Memory, Requests, etc.)

  • Set aggregation type and time range to match alert configuration

  • Apply filters by revision name or instance

  • Identify patterns and anomalies

4. Check Activity Logs

Azure Portal → Monitor → Activity Log

  • Filter by resource and event type

  • View server restarts, configuration changes

  • Inspect event JSON for detailed information

5. Email Notifications

Action group emails include:

  • Alert summary with severity

  • Resource and metric details

  • Direct portal link to alert instance

  • Timestamp and triggered value

Alert Response

When an alert fires, follow this investigation workflow:

Step 1: Review Alert Details

  1. Go to Azure Portal → Monitor → Alerts

  2. Click the fired alert instance

  3. Note:

    • Timestamp of alert

    • Resource affected

    • Metric value that triggered alert

    • Severity level

Step 2: Investigate Metrics

  1. Click portal shortcut to open Metrics Explorer

  2. Or navigate to Resource → Metrics

  3. View metric trend over time

  4. Identify when issue started

  5. Check for patterns or anomalies

Step 3: Review Logs (for log-based alerts)

  1. Copy Kusto query from alert rule

  2. Run query in Log Analytics workspace

  3. Examine raw log entries

  4. Look for error messages, stack traces, or patterns

Step 4: Collect Diagnostic Data

  1. Take screenshots of metrics and logs

  2. Copy relevant log entries

  3. Export query results to CSV if needed

  4. Document timeline of events

  5. Create incident ticket with evidence

Step 5: Remediate

High CPU/Memory Alerts

Immediate actions:

  • Scale out (add more replicas) or scale up (larger instance)

  • Check for recent deployments that may have introduced issues

  • Analyze sudden traffic spikes

Long-term fixes:

  • Profile application code for bottlenecks

  • Optimize inefficient algorithms or queries

  • Implement caching strategies
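One low-effort way to act on "implement caching strategies" is a small in-process TTL cache. A minimal sketch; a production deployment would more likely use a shared cache such as Redis:

```python
import time

class TTLCache:
    """Minimal time-based cache: entries expire after ttl_seconds,
    so repeated hot reads stop hammering the backend."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._store.pop(key, None)  # expired or missing
            return None
        return entry[1]

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=30)
cache.set("course:101", {"title": "Intro to AI"})
print(cache.get("course:101"))  # {'title': 'Intro to AI'}
print(cache.get("missing"))     # None
```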

Container/App Restarts or Crashes

Investigation:

  • Fetch container console logs from Log Analytics

  • Review stack traces and error messages

  • Check for regressions in recent deployments

Resolution:

  • Roll back to previous stable version

  • Fix identified bugs and redeploy

  • Increase resource limits if OOM errors

MySQL Storage High

Immediate actions:

  • Increase disk size

  • Cleanup old binary logs

  • Rotate and archive logs

Long-term fixes:

  • Implement log rotation policy

  • Archive old data to blob storage

  • Set up automated cleanup jobs

Slow MySQL Queries

Investigation:

  • Run EXPLAIN on slow queries

  • Check for missing indexes

  • Review query execution plans

Resolution:

  • Add appropriate indexes

  • Optimize query structure

  • Implement query caching
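To decide which queries deserve EXPLAIN and indexing effort, the slow query log itself can be ranked by query time. A sketch assuming the standard text format of MySQL's slow query log (the sample entries and table names are invented):

```python
import re

SLOW_LOG = """\
# Query_time: 4.712801  Lock_time: 0.000112 Rows_sent: 12  Rows_examined: 981245
SELECT * FROM enrollments WHERE student_id = 42;
# Query_time: 0.310044  Lock_time: 0.000087 Rows_sent: 1  Rows_examined: 1532
SELECT name FROM courses WHERE id = 7;
"""

def worst_queries(log_text, top=5):
    """Pair each '# Query_time' header with the SQL that follows it
    and return the slowest queries first."""
    pattern = re.compile(
        r"^# Query_time: ([\d.]+).*?Rows_examined: (\d+)\n(.+?);",
        re.MULTILINE | re.DOTALL)
    hits = [(float(t), int(rows), sql.strip())
            for t, rows, sql in pattern.findall(log_text)]
    return sorted(hits, reverse=True)[:top]

for seconds, rows, sql in worst_queries(SLOW_LOG):
    print(f"{seconds:.1f}s, {rows} rows examined: {sql}")
```

A high Rows_examined relative to rows sent (as in the first entry) is the classic signature of a missing index.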

Step 6: Monitor Resolution

  1. Watch alert instance for auto-resolution

  2. Metric alerts resolve when the metric drops back below the threshold

  3. Log alerts resolve when the query returns zero matches

  4. Verify service stability over next 24 hours

  5. Document resolution in incident ticket