## The Problem: Silent Cron Job Failures
You’ve set up your OpenClaw automations. Daily email summaries at 9 AM. Google Sheets updates for job tracking. Everything’s running smoothly—until it isn’t.
Here’s what happened to me: both of my cron jobs failed silently for 3-4 days. No emails. No updates. No alerts. The error? I had used a model name that didn’t exist in my allowed list: `openrouter/anthropic/claude-haiku-4`. Then I tried `claude-3.5-haiku`; same issue.
The jobs were attempting to run, hitting an error, and dying quietly. No notification. No way to know unless I manually checked the logs.
This is a problem.
## The Solution: Heartbeat Monitoring
I built a simple but effective heartbeat monitoring system that checks cron job health every 8 hours and alerts me via Telegram if anything fails. Here’s how it works.
### Step 1: Create `HEARTBEAT.md`
This file lives in your workspace and defines what the heartbeat check should do:
```markdown
# HEARTBEAT.md

## Cron Job Health Monitoring (Every 8 hours)

Check if any cron jobs have failed:

1. Read `/root/.openclaw/workspace/memory/heartbeat-state.json` for last check time
2. If last check was < 8 hours ago, skip (return HEARTBEAT_OK)
3. If 8+ hours since last check:
   - List all cron jobs
   - Check for jobs with `consecutiveErrors > 0`
   - If failures found:
     - Send Telegram alert with job name, error, last run time
     - Suggest fix if obvious (e.g., "model not allowed" → suggest correct model)
   - Update last check time in heartbeat-state.json

Only alert once per unique error (track alerted errors in state file).
```
The key insight: track what you’ve already alerted on so you don’t spam yourself with the same error repeatedly.
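In code, "the same error" just means a stable fingerprint. A tiny TypeScript sketch (the helper names here are mine, purely illustrative):

```typescript
// Stable fingerprint: same job + same error message = same alert.
function errorKey(jobName: string, errorMessage: string): string {
  return `${jobName}::${errorMessage}`;
}

// Alert only if this fingerprint isn't already recorded in the state file.
function shouldAlert(key: string, alertedErrors: string[]): boolean {
  return !alertedErrors.includes(key);
}
```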
### Step 2: Create the State File
The heartbeat needs to remember when it last checked and what errors it’s already alerted on:
```json
{
  "lastCheckMs": 0,
  "alertedErrors": []
}
```

Save this as `/root/.openclaw/workspace/memory/heartbeat-state.json`.
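If you’d rather not create the file by hand, the heartbeat can bootstrap it on first run. A minimal Node.js sketch, assuming the `memory/` directory already exists:

```typescript
import { existsSync, writeFileSync } from "node:fs";

const STATE_PATH = "/root/.openclaw/workspace/memory/heartbeat-state.json";

// Write the default state on first run; leave an existing file untouched.
if (!existsSync(STATE_PATH)) {
  writeFileSync(
    STATE_PATH,
    JSON.stringify({ lastCheckMs: 0, alertedErrors: [] }, null, 2),
  );
}
```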
### Step 3: Set Up the Heartbeat Cron Job
Now create a cron job that runs every 8 hours and executes the heartbeat check (`everyMs: 28800000` is 8 × 60 × 60 × 1000 ms):
```json
{
  "name": "Heartbeat: Cron Health Check",
  "schedule": {
    "kind": "every",
    "everyMs": 28800000
  },
  "payload": {
    "kind": "agentTurn",
    "message": "Read HEARTBEAT.md if it exists (workspace context). Follow it strictly. Do not infer or repeat old tasks from prior chats. If nothing needs attention, reply HEARTBEAT_OK.",
    "model": "claude-3.5-haiku"
  },
  "sessionTarget": "isolated",
  "delivery": {
    "mode": "announce"
  },
  "enabled": true
}
```
Add this using the OpenClaw cron tool:
cron(action="add", job={...})
### Step 4: Implement the Check Logic
When the heartbeat runs, it reads HEARTBEAT.md and executes the check. Here’s the logic flow (a code sketch follows the list):

- **Read state:** check when we last ran
- **Skip if recent:** if < 8 hours have passed, return `HEARTBEAT_OK`
- **List cron jobs:** use `cron(action="list")`
- **Check for errors:** any jobs with `consecutiveErrors > 0`?
- **Alert if new:** only send a Telegram notification if we haven’t alerted on this specific error before
- **Update state:** save the current time and track the alerted error
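Here’s that flow as one self-contained TypeScript sketch. It’s an illustration, not OpenClaw’s actual implementation: `listCronJobs()` and `sendTelegram()` are hypothetical stand-ins for the cron tool and Telegram delivery, and the `lastError`/`lastRunAt` fields are my assumptions about what a job listing exposes (only `consecutiveErrors` appears earlier in this post).

```typescript
import { readFileSync, writeFileSync } from "node:fs";

const STATE_PATH = "/root/.openclaw/workspace/memory/heartbeat-state.json";
const EIGHT_HOURS_MS = 8 * 60 * 60 * 1000; // same interval as everyMs: 28800000

interface HeartbeatState {
  lastCheckMs: number;
  alertedErrors: string[];
}

interface CronJob {
  name: string;
  lastRunAt: string;
  lastError: string;
  consecutiveErrors: number;
}

// Hypothetical wrappers -- stand-ins for the OpenClaw cron tool and
// Telegram delivery, not real OpenClaw APIs.
declare function listCronJobs(): Promise<CronJob[]>;
declare function sendTelegram(text: string): Promise<void>;

async function heartbeatCheck(): Promise<string> {
  const state: HeartbeatState = JSON.parse(readFileSync(STATE_PATH, "utf8"));
  const now = Date.now();

  // Skip if the last check was less than 8 hours ago.
  if (now - state.lastCheckMs < EIGHT_HOURS_MS) return "HEARTBEAT_OK";

  // Find jobs that are currently failing.
  const failing = (await listCronJobs()).filter((j) => j.consecutiveErrors > 0);

  for (const job of failing) {
    // Stable fingerprint so each unique error alerts only once.
    const key = `${job.name}::${job.lastError}`;
    if (state.alertedErrors.includes(key)) continue;

    await sendTelegram(
      [
        "🚨 Cron Job Failure Detected",
        `Job: ${job.name}`,
        `Error: ${job.lastError}`,
        `Last Run: ${job.lastRunAt}`,
        `Consecutive Failures: ${job.consecutiveErrors}`,
      ].join("\n"),
    );
    state.alertedErrors.push(key);
  }

  // Record this check (and any new alerts) for next time.
  state.lastCheckMs = now;
  writeFileSync(STATE_PATH, JSON.stringify(state, null, 2));
  return failing.length > 0 ? "ALERTED" : "HEARTBEAT_OK";
}
```

In OpenClaw itself, the agent performs these steps by following HEARTBEAT.md rather than running a script, but the state handling is the same.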
### Example Alert Message
When a failure is detected, you get a Telegram message like this:
```
🚨 Cron Job Failure Detected

Job: Daily Email Summaries
Error: model not allowed: openrouter/anthropic/claude-haiku-4
Last Run: 2026-02-13 09:00 UTC
Consecutive Failures: 4
Suggested Fix: Update job to use an allowed model like claude-3.5-haiku or gpt-4o-mini
```
## What I Fixed After Discovery
Once I caught the model error, I took two actions:
### 1. Added Missing Models to Config
I updated my OpenClaw config to allow cheaper models:
```json
{
  "models": {
    "allowed": [
      "claude-3.5-haiku",
      "claude-3-haiku",
      "gpt-4o-mini",
      "gemini-2.0-flash-exp:free"
    ]
  }
}
```
I used `gateway(action="config.patch", ...)` to merge this in without replacing the entire config.
### 2. Updated Both Cron Jobs
Changed both jobs from the non-existent model to claude-3.5-haiku:
cron(action="update", jobId="...", patch={ payload: { model: "claude-3.5-haiku" } })
Result: ~90% cost savings compared to Sonnet, and the jobs actually run now.
## Why This Matters
Cron jobs are set-it-and-forget-it by design, which is great—until something breaks. Without monitoring:
- You might not notice for days or weeks
- Critical automations silently stop working
- You lose trust in your system
With heartbeat monitoring:
- Early detection: Know within 8 hours if something’s wrong
- Actionable alerts: Get the error message + suggested fix
- No spam: Only alert once per unique error
- Peace of mind: Your automations either work or you know they don’t
## Key Takeaways
- Always verify model names are in your allowed list before using them in cron jobs
- Monitor your monitors: Even automated systems need health checks
- Track state: Avoid alert fatigue by remembering what you’ve already alerted on
- Fail loudly: Silent failures are the worst kind—make sure you hear about problems
## Next Steps
Want to implement this yourself?
- Create `HEARTBEAT.md` in your workspace
- Initialize `memory/heartbeat-state.json`
- Add the heartbeat cron job
- Let it run and catch your next failure before you’d notice it manually
Your future self will thank you when you get that first alert instead of discovering a week-old broken automation. 🙌