Building a Cron Job Health Monitor for OpenClaw: Never Miss Another Silent Failure

The Problem: Silent Cron Job Failures

You’ve set up your OpenClaw automations. Daily email summaries at 9 AM. Google Sheets updates for job tracking. Everything’s running smoothly—until it isn’t.

Here’s what happened to me: both of my cron jobs failed silently for 3-4 days. No emails. No updates. No alerts. The error? I had used a model name that didn’t exist in my allowed list: openrouter/anthropic/claude-haiku-4. Then I tried claude-3.5-haiku, which failed too, because it wasn’t in the allowed list either.

The jobs were attempting to run, hitting an error, and dying quietly. No notification. No way to know unless I manually checked the logs.

This is a problem.

The Solution: Heartbeat Monitoring

I built a simple but effective heartbeat monitoring system that checks cron job health every 8 hours and alerts me via Telegram if anything fails. Here’s how it works.

Step 1: Create HEARTBEAT.md

This file lives in your workspace and defines what the heartbeat check should do:

# HEARTBEAT.md

## Cron Job Health Monitoring (Every 8 hours)

Check if any cron jobs have failed:

1. Read `/root/.openclaw/workspace/memory/heartbeat-state.json` for last check time
2. If last check was < 8 hours ago, skip (return HEARTBEAT_OK)
3. If 8+ hours since last check:
   - List all cron jobs
   - Check for jobs with `consecutiveErrors > 0`
   - If failures found:
     - Send Telegram alert with job name, error, last run time
     - Suggest fix if obvious (e.g., "model not allowed" → suggest correct model)
   - Update last check time in heartbeat-state.json

Only alert once per unique error (track alerted errors in state file).

The key insight: track what you’ve already alerted on so you don’t spam yourself with the same error repeatedly.

Step 2: Create the State File

The heartbeat needs to remember when it last checked and what errors it’s already alerted on:

{
  "lastCheckMs": 0,
  "alertedErrors": []
}

Save this as /root/.openclaw/workspace/memory/heartbeat-state.json.

Step 3: Set Up the Heartbeat Cron Job

Now create a cron job that runs every 8 hours and executes the heartbeat check:

{
  "name": "Heartbeat: Cron Health Check",
  "schedule": {
    "kind": "every",
    "everyMs": 28800000
  },
  "payload": {
    "kind": "agentTurn",
    "message": "Read HEARTBEAT.md if it exists (workspace context). Follow it strictly. Do not infer or repeat old tasks from prior chats. If nothing needs attention, reply HEARTBEAT_OK.",
    "model": "claude-3.5-haiku"
  },
  "sessionTarget": "isolated",
  "delivery": {
    "mode": "announce"
  },
  "enabled": true
}

Add this using the OpenClaw cron tool:

cron(action="add", job={...})
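That magic everyMs number is just 8 hours converted to milliseconds:

```python
# 8 hours expressed in milliseconds for the "every" schedule
every_ms = 8 * 60 * 60 * 1000
print(every_ms)  # 28800000
```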

Step 4: Implement the Check Logic

When the heartbeat runs, it reads HEARTBEAT.md and executes the check. Here’s the logic flow:

  1. Read state: Check when we last ran
  2. Skip if recent: If < 8 hours, return HEARTBEAT_OK
  3. List cron jobs: Use cron(action="list")
  4. Check for errors: Any jobs with consecutiveErrors > 0?
  5. Alert if new: Only send Telegram notification if we haven’t alerted on this specific error before
  6. Update state: Save current time and track the alerted error

Example Alert Message

When a failure is detected, you get a Telegram message like this:

🚨 Cron Job Failure Detected

Job: Daily Email Summaries
Error: model not allowed: openrouter/anthropic/claude-haiku-4
Last Run: 2026-02-13 09:00 UTC
Consecutive Failures: 4

Suggested Fix: Update job to use an allowed model like claude-3.5-haiku or gpt-4o-mini

What I Fixed After Discovery

Once I caught the model error, I took two actions:

1. Added Missing Models to Config

I updated my OpenClaw config to allow cheaper models:

{
  "models": {
    "allowed": [
      "claude-3.5-haiku",
      "claude-3-haiku",
      "gpt-4o-mini",
      "gemini-2.0-flash-exp:free"
    ]
  }
}

Used gateway(action="config.patch", ...) to merge this in without replacing the entire config.
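If you're wondering what a merge-patch does versus a full replace, here's the idea as a small Python sketch (I'm assuming config.patch behaves like a recursive merge; this is not OpenClaw's actual code):

```python
def merge_patch(config: dict, patch: dict) -> dict:
    """Recursively merge patch into config, preserving untouched keys."""
    merged = dict(config)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_patch(merged[key], value)  # descend into nested dicts
        else:
            merged[key] = value  # leaf value: patch wins
    return merged
```

The point is that keys you don't mention in the patch survive untouched, so you can add allowed models without re-sending the rest of your config.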

2. Updated Both Cron Jobs

Changed both jobs from the non-existent model to claude-3.5-haiku:

cron(action="update", jobId="...", patch={ payload: { model: "claude-3.5-haiku" } })

Result: ~90% cost savings compared to Sonnet, and the jobs actually run now.

Why This Matters

Cron jobs are set-it-and-forget-it by design, which is great—until something breaks. Without monitoring:

  • You might not notice for days or weeks
  • Critical automations silently stop working
  • You lose trust in your system

With heartbeat monitoring:

  • Early detection: Know within 8 hours if something’s wrong
  • Actionable alerts: Get the error message + suggested fix
  • No spam: Only alert once per unique error
  • Peace of mind: Your automations either work or you know they don’t

Key Takeaways

  1. Always verify model names are in your allowed list before using them in cron jobs
  2. Monitor your monitors: Even automated systems need health checks
  3. Track state: Avoid alert fatigue by remembering what you’ve already notified about
  4. Fail loudly: Silent failures are the worst kind—make sure you hear about problems

Next Steps

Want to implement this yourself?

  1. Create HEARTBEAT.md in your workspace
  2. Initialize memory/heartbeat-state.json
  3. Add the heartbeat cron job
  4. Let it run and catch your next failure before you notice manually

Your future self will thank you when you get that first alert instead of discovering a week-old broken automation. 🙌
