Building a Cron Job Health Monitor for OpenClaw: Never Miss Another Silent Failure

The Problem: Silent Cron Job Failures

You’ve set up your OpenClaw automations. Daily email summaries at 9 AM. Google Sheets updates for job tracking. Everything’s running smoothly—until it isn’t.

Here’s what happened to me: both of my cron jobs failed silently for 3-4 days. No emails. No updates. No alerts. The error? I had used a model name that didn’t exist in my allowed list: openrouter/anthropic/claude-haiku-4. Then I tried claude-3.5-haiku, which failed too, because it wasn’t in the allowed list either.

The jobs were attempting to run, hitting an error, and dying quietly. No notification. No way to know unless I manually checked the logs.

This is a problem.

The Solution: Heartbeat Monitoring

I built a simple but effective heartbeat monitoring system that checks cron job health every 8 hours and alerts me via Telegram if anything fails. Here’s how it works.

Step 1: Create HEARTBEAT.md

This file lives in your workspace and defines what the heartbeat check should do:

# HEARTBEAT.md

## Cron Job Health Monitoring (Every 8 hours)

Check if any cron jobs have failed:

1. Read `/root/.openclaw/workspace/memory/heartbeat-state.json` for last check time
2. If last check was < 8 hours ago, skip (return HEARTBEAT_OK)
3. If 8+ hours since last check:
   - List all cron jobs
   - Check for jobs with `consecutiveErrors > 0`
   - If failures found:
     - Send Telegram alert with job name, error, last run time
     - Suggest fix if obvious (e.g., "model not allowed" → suggest correct model)
   - Update last check time in heartbeat-state.json

Only alert once per unique error (track alerted errors in state file).

The key insight: track what you’ve already alerted on so you don’t spam yourself with the same error repeatedly.

Step 2: Create the State File

The heartbeat needs to remember when it last checked and what errors it’s already alerted on:

{
  "lastCheckMs": 0,
  "alertedErrors": []
}

Save this as /root/.openclaw/workspace/memory/heartbeat-state.json.

Step 3: Set Up the Heartbeat Cron Job

Now create a cron job that runs every 8 hours and executes the heartbeat check:

{
  "name": "Heartbeat: Cron Health Check",
  "schedule": {
    "kind": "every",
    "everyMs": 28800000
  },
  "payload": {
    "kind": "agentTurn",
    "message": "Read HEARTBEAT.md if it exists (workspace context). Follow it strictly. Do not infer or repeat old tasks from prior chats. If nothing needs attention, reply HEARTBEAT_OK.",
    "model": "claude-3.5-haiku"
  },
  "sessionTarget": "isolated",
  "delivery": {
    "mode": "announce"
  },
  "enabled": true
}

Add this using the OpenClaw cron tool:

cron(action="add", job={...})
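That magic everyMs number is just 8 hours converted to milliseconds:

```python
# 8 hours expressed in milliseconds for the "every" schedule
every_ms = 8 * 60 * 60 * 1000
print(every_ms)  # 28800000
```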

Step 4: Implement the Check Logic

When the heartbeat runs, it reads HEARTBEAT.md and executes the check. Here’s the logic flow:

  1. Read state: Check when we last ran
  2. Skip if recent: If < 8 hours, return HEARTBEAT_OK
  3. List cron jobs: Use cron(action="list")
  4. Check for errors: Any jobs with consecutiveErrors > 0?
  5. Alert if new: Only send Telegram notification if we haven’t alerted on this specific error before
  6. Update state: Save current time and track the alerted error

Example Alert Message

When a failure is detected, you get a Telegram message like this:

🚨 Cron Job Failure Detected

Job: Daily Email Summaries
Error: model not allowed: openrouter/anthropic/claude-haiku-4
Last Run: 2026-02-13 09:00 UTC
Consecutive Failures: 4

Suggested Fix: Update job to use an allowed model like claude-3.5-haiku or gpt-4o-mini

What I Fixed After Discovery

Once I caught the model error, I took two actions:

1. Added Missing Models to Config

I updated my OpenClaw config to allow cheaper models:

{
  "models": {
    "allowed": [
      "claude-3.5-haiku",
      "claude-3-haiku",
      "gpt-4o-mini",
      "gemini-2.0-flash-exp:free"
    ]
  }
}

Used gateway(action="config.patch", ...) to merge this in without replacing the entire config.
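If you're wondering what a merge-patch does versus a full replace, here's the idea as a small Python sketch (I'm assuming config.patch behaves like a recursive merge; this is not OpenClaw's actual code):

```python
def merge_patch(config: dict, patch: dict) -> dict:
    """Recursively merge patch into config, preserving untouched keys."""
    merged = dict(config)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_patch(merged[key], value)  # descend into nested dicts
        else:
            merged[key] = value  # leaf value: patch wins
    return merged
```

The point is that keys you don't mention in the patch survive untouched, so you can add allowed models without re-sending the rest of your config.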

2. Updated Both Cron Jobs

Changed both jobs from the non-existent model to claude-3.5-haiku:

cron(action="update", jobId="...", patch={ payload: { model: "claude-3.5-haiku" } })

Result: ~90% cost savings compared to Sonnet, and the jobs actually run now.

Why This Matters

Cron jobs are set-it-and-forget-it by design, which is great—until something breaks. Without monitoring:

  • You might not notice for days or weeks
  • Critical automations silently stop working
  • You lose trust in your system

With heartbeat monitoring:

  • Early detection: Know within 8 hours if something’s wrong
  • Actionable alerts: Get the error message + suggested fix
  • No spam: Only alert once per unique error
  • Peace of mind: Your automations either work or you know they don’t

Key Takeaways

  1. Always verify model names are in your allowed list before using them in cron jobs
  2. Monitor your monitors: Even automated systems need health checks
  3. Track state: Avoid alert fatigue by remembering what you’ve already notified about
  4. Fail loudly: Silent failures are the worst kind—make sure you hear about problems

Next Steps

Want to implement this yourself?

  1. Create HEARTBEAT.md in your workspace
  2. Initialize memory/heartbeat-state.json
  3. Add the heartbeat cron job
  4. Let it run and catch your next failure before you notice manually

Your future self will thank you when you get that first alert instead of discovering a week-old broken automation. 🙌
