Observability: Alert Rules & Monitoring

Monitor your CampusMind AI deployment.

Written by vendor@royalcyber.com

Last updated 3 months ago

What is Observability?

Observability in CampusMindAI provides production-grade monitoring and alerting for critical infrastructure components. The system monitors Azure Container Apps, App Services, and MySQL databases to proactively detect and alert on performance issues, resource constraints, and service disruptions.

Why Observability Matters

  • Proactive Issue Detection: Identify problems before they impact users

  • Performance Monitoring: Track resource utilization and application health

  • Rapid Response: Alert teams immediately when thresholds are exceeded

  • Root Cause Analysis: Detailed metrics and logs for troubleshooting

  • Service Reliability: Maintain platform uptime and performance standards

How Azure Alerts Work

Alert Types:

CampusMind AI uses two types of Azure Monitor alerts:

1. Metric Alerts

Monitor platform metrics like CPU, memory, requests, and replica counts.

  • Trigger: When a metric crosses a configured threshold

  • Evaluation: Periodic checks at defined frequency

  • Aggregation: Average, Total, Minimum, Maximum, or Count

  • Auto-resolve: Automatically resolves when metric returns to normal

Example: CPU Average > 80% for 15 minutes

  • Azure samples metric values every 5 minutes

  • Computes average over the last 15 minutes

  • Fires alert if average exceeds 80%

  • Invokes action group to notify team
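The evaluation loop above can be sketched as a sliding-window check. This is a simplified model of the behavior, not Azure's implementation; the 5-minute samples, 15-minute window (three samples), and 80% threshold mirror the example:

```python
from statistics import mean

def should_fire(samples, threshold=80.0, window=3):
    """Evaluate a metric alert: fire when the average of the last
    `window` samples (e.g. 3 x 5-minute samples = 15 minutes)
    exceeds `threshold`. Too few samples means the period is not
    yet covered, so the alert stays quiet."""
    if len(samples) < window:
        return False
    return mean(samples[-window:]) > threshold

# CPU samples taken every 5 minutes (percent)
print(should_fire([60, 85, 88, 90]))   # last three average 87.7 -> True
print(should_fire([60, 70, 75, 78]))   # last three average 74.3 -> False
```

Auto-resolution falls out of the same check: once the windowed average drops back under the threshold, the function returns False again.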

2. Log Alerts

Monitor specific events in application logs using Kusto queries.

  • Trigger: When query returns matching log entries

  • Evaluation: Scheduled query runs at defined intervals

  • Resolution: Resolves when query returns zero matches

Example: Scale event detected in Container App logs

  • Kusto query runs every 5 minutes

  • Searches Container Apps Log Analytics workspace

  • Fires if scale event records found
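The fire/resolve cycle of a log alert can be sketched the same way. The Kusto query text below is purely illustrative (real table and column names depend on your Log Analytics workspace schema); the state function models the rule that matching rows fire the alert and zero matches resolve it:

```python
# Illustrative only: the actual table/column names vary by workspace.
KUSTO_QUERY = """
ContainerAppSystemLogs_CL
| where Log_s contains "Scaling"
| where TimeGenerated > ago(5m)
"""

def alert_state(match_count, previously_fired):
    """State after one scheduled query run: rows found -> Fired;
    a fired alert that now sees zero matches -> Resolved;
    otherwise nothing to report."""
    if match_count > 0:
        return "Fired"
    return "Resolved" if previously_fired else "OK"

print(alert_state(3, previously_fired=False))  # Fired
print(alert_state(0, previously_fired=True))   # Resolved
print(alert_state(0, previously_fired=False))  # OK
```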

Alert Components

  • Action Group: Defines notification channels (email, SMS, webhooks)

  • Severity: Priority level (Sev 0-4); Sev 1 = Critical, Sev 2 = High, Sev 3 = Medium

  • Threshold: Value that triggers the alert

  • Period: Time window for metric aggregation

  • Frequency: How often Azure evaluates the condition

  • Auto-resolve: Automatic resolution when the condition returns to normal
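Taken together, these components describe one alert rule. A minimal sketch as a record type, using our own field names rather than the Azure resource schema:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """The components of an alert rule as described above.
    Field names are illustrative, not Azure's resource schema."""
    name: str
    metric: str
    aggregation: str        # Average, Total, Minimum, Maximum, or Count
    threshold: float
    period_minutes: int     # time window for aggregation
    frequency_minutes: int  # how often the condition is evaluated
    severity: int           # 0 (most severe) through 4
    action_group: str       # who gets notified
    auto_resolve: bool = True

rule = AlertRule(
    name="CA-CPU-High", metric="CPU Percentage", aggregation="Average",
    threshold=80, period_minutes=15, frequency_minutes=5,
    severity=2, action_group="oncall-email",
)
print(rule.name, "Sev", rule.severity)  # CA-CPU-High Sev 2
```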

Container Apps Alert Rules

  • CA-CPU-High: CPU Percentage (Avg) > 80%; Period 15m / Frequency 5m; Sev 2

  • CA-Memory-High: Memory Percentage (Avg) > 75%; Period 15m / Frequency 5m; Sev 2

  • CA-High-Request-Volume: Requests (Total) > 1000; Period 5m / Frequency 15m; Sev 3

  • CA-No-Running-Replicas: Replicas (Count) < 1; Period 5m / Frequency 15m; Sev 1

  • CA-Replica-Count: Replicas (Count) > 10; Period 5m / Frequency 15m; Sev 2

What Each Alert Means

  • CA-CPU-High: Container CPU usage sustained above 80% - may indicate need to scale or optimize code

  • CA-Memory-High: Memory usage above 75% - potential memory leak or need for scaling

  • CA-High-Request-Volume: Over 1000 requests in 5 minutes - traffic spike or potential attack

  • CA-No-Running-Replicas: No active replicas - service is down

  • CA-Replica-Count: More than 10 replicas running - unexpected scaling or runaway auto-scaling
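The rules and notes above boil down to five threshold checks. A sketch that evaluates a metrics snapshot against them (the rule tuples restate the thresholds above; the snapshot values are invented):

```python
# Comparison direction matters: CA-No-Running-Replicas fires when
# the value drops BELOW its threshold, the others when it rises above.
RULES = [
    ("CA-CPU-High",            "cpu_percent",    ">", 80,   2),
    ("CA-Memory-High",         "memory_percent", ">", 75,   2),
    ("CA-High-Request-Volume", "requests",       ">", 1000, 3),
    ("CA-No-Running-Replicas", "replicas",       "<", 1,    1),
    ("CA-Replica-Count",       "replicas",       ">", 10,   2),
]

def fired_alerts(metrics):
    """Return (name, severity) for every rule the snapshot violates,
    most severe first (Sev 1 before Sev 2)."""
    fired = []
    for name, key, op, threshold, sev in RULES:
        value = metrics[key]
        hit = value > threshold if op == ">" else value < threshold
        if hit:
            fired.append((name, sev))
    return sorted(fired, key=lambda item: item[1])

snapshot = {"cpu_percent": 91, "memory_percent": 60,
            "requests": 340, "replicas": 0}
print(fired_alerts(snapshot))
# [('CA-No-Running-Replicas', 1), ('CA-CPU-High', 2)]
```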

App Service Alert Rules

  • AS-CPU-High: CPU Percentage (Avg) > 80%; Period 5m / Frequency 15m; Sev 2

  • AS-Memory-High: Memory Working Set (Avg) > 80%; Period 5m / Frequency 15m; Sev 2

  • AS-Http-5xx: HTTP 5xx Errors (Total) > 10; Period 5m / Frequency 15m; Sev 1

  • AS-High-Request-Count: Requests (Total) > 1000; Period 5m / Frequency 15m; Sev 3

What Each Alert Means

  • AS-CPU-High: App Service CPU sustained above 80% - scale up or optimize

  • AS-Memory-High: Memory usage above 80% - memory leak or need more resources

  • AS-Http-5xx: More than 10 server errors in 5 minutes - application errors or backend issues

  • AS-High-Request-Count: Over 1000 requests in 5 minutes - traffic spike

MySQL Flexible Server Alert Rules

  • MYSQL-CPU-High: cpu_percent (Avg) > 80%; Period 5m / Frequency 15m; Sev 2

  • MYSQL-Memory-High: memory_percent (Avg) > 75%; Period 5m / Frequency 15m; Sev 2

  • MYSQL-Storage-High: storage_percent (Avg) > 80% (90% critical); Period 5m / Frequency 15m; Sev 1

  • MYSQL-Slow-Queries: slow_queries (Total) > 50 in 5m; Period 5m / Frequency 5m; Sev 2

What Each Alert Means

  • MYSQL-CPU-High: Database CPU above 80% - slow queries, missing indexes, or need to scale

  • MYSQL-Memory-High: Memory usage above 75% - increase buffer pool or optimize queries

  • MYSQL-Storage-High: Disk space above 80% - cleanup needed or storage expansion required

  • MYSQL-Slow-Queries: More than 50 slow queries in 5 minutes - query optimization needed
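The tiered storage thresholds (warning past 80%, critical past 90%) can be expressed as a small classifier. A sketch assuming storage_percent is already available as a number:

```python
def storage_alert(storage_percent):
    """Map storage_percent to an alert tier using the thresholds
    above: warning past 80%, critical past 90%."""
    if storage_percent > 90:
        return "critical"
    if storage_percent > 80:
        return "warning"
    return "ok"

print(storage_alert(75))  # ok
print(storage_alert(85))  # warning
print(storage_alert(95))  # critical
```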

Viewing and Investigating Alerts

1. View Fired Alerts

Azure Portal → Monitor → Alerts

  • See alert history with Fired/Resolved status

  • Click alert instance for details:

    • Recorded measurement value

    • Timestamp when threshold was crossed

    • Triggered action group notifications

    • Direct link to metric or log query

2. Review Alert Rules Configuration

Azure Portal → Monitor → Alerts → Alert Rules

  • View all configured alert rules

  • Check conditions, thresholds, and evaluation settings

  • Review severity levels and action groups

  • Edit or disable rules as needed

3. Inspect Metrics

Azure Portal → Resource → Monitoring → Metrics

  • Select the metric that triggered the alert (CPU, Memory, Requests, etc.)

  • Set aggregation type and time range to match alert configuration

  • Apply filters by revision name or instance

  • Identify patterns and anomalies

4. Check Activity Logs

Azure Portal → Monitor → Activity Log

  • Filter by resource and event type

  • View server restarts, configuration changes

  • Inspect event JSON for detailed information

5. Email Notifications

Action group emails include:

  • Alert summary with severity

  • Resource and metric details

  • Direct portal link to alert instance

  • Timestamp and triggered value

Alert Response

When an alert fires, follow this investigation workflow:

Step 1: Review Alert Details

  1. Go to Azure Portal → Monitor → Alerts

  2. Click the fired alert instance

  3. Note:

    • Timestamp of alert

    • Resource affected

    • Metric value that triggered alert

    • Severity level

Step 2: Investigate Metrics

  1. Click portal shortcut to open Metrics Explorer

  2. Or navigate to Resource → Metrics

  3. View metric trend over time

  4. Identify when issue started

  5. Check for patterns or anomalies

Step 3: Review Logs (for log-based alerts)

  1. Copy Kusto query from alert rule

  2. Run query in Log Analytics workspace

  3. Examine raw log entries

  4. Look for error messages, stack traces, or patterns

Step 4: Collect Diagnostic Data

  1. Take screenshots of metrics and logs

  2. Copy relevant log entries

  3. Export query results to CSV if needed

  4. Document timeline of events

  5. Create incident ticket with evidence

Step 5: Remediate

High CPU/Memory Alerts

Immediate actions:

  • Scale out (add more replicas) or scale up (larger instance)

  • Check for recent deployments that may have introduced issues

  • Analyze sudden traffic spikes

Long-term fixes:

  • Profile application code for bottlenecks

  • Optimize inefficient algorithms or queries

  • Implement caching strategies
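One low-effort way to act on "implement caching strategies" is a small in-process TTL cache. A minimal sketch; a production deployment would more likely use a shared cache such as Redis:

```python
import time

class TTLCache:
    """Minimal time-based cache: entries expire after ttl_seconds,
    so repeated hot reads stop hammering the backend."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._store.pop(key, None)  # expired or missing
            return None
        return entry[1]

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=30)
cache.set("course:101", {"title": "Intro to AI"})
print(cache.get("course:101"))  # {'title': 'Intro to AI'}
print(cache.get("missing"))     # None
```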

Container/App Restarts or Crashes

Investigation:

  • Fetch container console logs from Log Analytics

  • Review stack traces and error messages

  • Check for regressions in recent deployments

Resolution:

  • Roll back to previous stable version

  • Fix identified bugs and redeploy

  • Increase resource limits if OOM errors

MySQL Storage High

Immediate actions:

  • Increase disk size

  • Cleanup old binary logs

  • Rotate and archive logs

Long-term fixes:

  • Implement log rotation policy

  • Archive old data to blob storage

  • Set up automated cleanup jobs

Slow MySQL Queries

Investigation:

  • Run EXPLAIN on slow queries

  • Check for missing indexes

  • Review query execution plans

Resolution:

  • Add appropriate indexes

  • Optimize query structure

  • Implement query caching
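To decide which queries deserve EXPLAIN and indexing effort, the slow query log itself can be ranked by query time. A sketch assuming the standard text format of MySQL's slow query log (the sample entries and table names are invented):

```python
import re

SLOW_LOG = """\
# Query_time: 4.712801  Lock_time: 0.000112 Rows_sent: 12  Rows_examined: 981245
SELECT * FROM enrollments WHERE student_id = 42;
# Query_time: 0.310044  Lock_time: 0.000087 Rows_sent: 1  Rows_examined: 1532
SELECT name FROM courses WHERE id = 7;
"""

def worst_queries(log_text, top=5):
    """Pair each '# Query_time' header with the SQL that follows it
    and return the slowest queries first."""
    pattern = re.compile(
        r"^# Query_time: ([\d.]+).*?Rows_examined: (\d+)\n(.+?);",
        re.MULTILINE | re.DOTALL)
    hits = [(float(t), int(rows), sql.strip())
            for t, rows, sql in pattern.findall(log_text)]
    return sorted(hits, reverse=True)[:top]

for seconds, rows, sql in worst_queries(SLOW_LOG):
    print(f"{seconds:.1f}s, {rows} rows examined: {sql}")
```

A high Rows_examined relative to rows sent (as in the first entry) is the classic signature of a missing index.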

Step 6: Monitor Resolution

  1. Watch alert instance for auto-resolution

  2. Metric alerts resolve when the metric drops back below the threshold

  3. Log alerts resolve when the query returns zero matches

  4. Verify service stability over next 24 hours

  5. Document resolution in incident ticket