Stateful Watchdog with Automatic Recovery


This guide explains how to implement the Stateful Watchdog with Automatic Recovery, a sophisticated monitoring script designed to handle service failures without human intervention. While basic scripts simply notify you when something is broken, this "stateful" approach intelligently attempts to fix the problem first and only alerts you when a status truly changes, preventing unnecessary notification spam.

Beyond Basic Alerts: Building a Self-Healing Server with a Stateful Watchdog

In infrastructure management, "Alert Fatigue" is real. If a service goes down, you don't want a notification every minute while it's offline; you want one alert when it fails, an automated attempt to fix it, and a final alert when it’s back online.

By leveraging the logic found in the pi_watchdog_auto.sh framework, we can build a script that remembers its previous state and takes active steps to recover services.

How the Script Works

  • State Loading: The script begins by "reading its own memory" from a state file. This tells it if the service was "up" or "down" during the last check.
  • The Test: It performs a connectivity or service test (like a ping).
  • The Comparison: It compares the current result with the saved state. If they match, the script finishes silently. If they differ, a "State Change" has occurred.
  • The Recovery Loop: If a failure is detected, the script doesn't just complain—it attempts a recovery command (like restarting a service), waits for the system to settle, and then re-tests to verify the fix.
  • State Saving: Finally, it writes the new status back to the state file so it's ready for the next interval.

Step 1: Prepare the Script

Create a new file (e.g., /usr/local/bin/auto_recovery_watchdog.sh) and paste this generic version of the stateful monitoring logic:

#!/bin/bash
### Stateful Watchdog with Automatic Recovery

# --- Configuration ---
BOT_TOKEN="YOUR_BOT_TOKEN"
CHAT_ID="YOUR_CHAT_ID"
TARGET_IP="192.168.1.100"       # The remote IP to monitor
SERVICE_NAME="my-vpn-service"   # The systemd service to manage
MOUNT_POINT="/mnt/data_backup"  # The network mount to monitor
STATE_FILE="/var/log/watchdog_state.txt"
LOG_FILE="/var/log/watchdog.log"
SLEEP_AFTER_RESTART=15

# --- Helper: Send Telegram Message ---
send_telegram() {
    local MSG="$1"
    curl -s -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
         -d chat_id="${CHAT_ID}" -d text="$MSG"
    echo "$(date): $MSG" >> "$LOG_FILE"
}

# --- 1. Load Previous State ---
if [ -f "$STATE_FILE" ]; then
    source "$STATE_FILE"
else
    # Initial run defaults
    SERVICE_STATE="up"
    MOUNT_STATE="up"
fi

# --- 2. Check Service Connectivity ---
if ping -c 2 -W 2 "$TARGET_IP" >/dev/null 2>&1; then
    CURRENT_SERVICE="up"
else
    CURRENT_SERVICE="down"
fi

# Handle Service State Changes
if [ "$CURRENT_SERVICE" != "$SERVICE_STATE" ]; then
    if [ "$CURRENT_SERVICE" == "down" ]; then
        send_telegram "⚠️ Service ($TARGET_IP) is unreachable. Attempting restart..."
        systemctl restart "$SERVICE_NAME"
        sleep "$SLEEP_AFTER_RESTART"
        
        # Verify Recovery
        if ping -c 2 -W 2 "$TARGET_IP" >/dev/null 2>&1; then
            CURRENT_SERVICE="up"
            send_telegram "✅ Service recovered automatically."
        else
            send_telegram "❌ Service still down after restart attempt."
        fi
    else
        send_telegram "✅ Service connectivity restored."
    fi
fi
SERVICE_STATE="$CURRENT_SERVICE"

# --- 3. Check Mount Point ---
if mountpoint -q "$MOUNT_POINT"; then
    CURRENT_MOUNT="up"
else
    CURRENT_MOUNT="down"
    send_telegram "⚠️ Mount $MOUNT_POINT is missing. Attempting remount..."
    mount "$MOUNT_POINT" >/dev/null 2>&1
    if mountpoint -q "$MOUNT_POINT"; then
        CURRENT_MOUNT="up"
        send_telegram "✅ Mount recovered automatically."
    else
        send_telegram "❌ Mount still unreachable."
    fi
fi
MOUNT_STATE="$CURRENT_MOUNT"

# --- 4. Save State for Next Run ---
echo "SERVICE_STATE=$SERVICE_STATE" > "$STATE_FILE"
echo "MOUNT_STATE=$MOUNT_STATE" >> "$STATE_FILE"

Step 2: Installation and Scheduling

  • Permissions: Make the script executable: sudo chmod +x /usr/local/bin/auto_recovery_watchdog.sh
  • Automation: Open the crontab (sudo crontab -e) and add this line to run the script every minute: * * * * * /usr/local/bin/auto_recovery_watchdog.sh >/dev/null 2>&1

Customizing for Different Scenarios

The "Check-Verify-Alert" logic of this script can be adapted to many other system needs:

  • Docker Container Health: Instead of a ping, use docker ps -q -f name=my_container. If the container is missing, the script can issue a docker-compose up -d and notify you only if the container fails to stay running.
  • Backup Integrity: Use the script to check for the existence of a daily backup file (e.g., [ -f /backups/db_daily.sql ]). If the file is missing or its size is 0 bytes, the script can trigger a manual backup run and alert you if the secondary attempt also fails.
  • Log Spam/Security Monitoring: Use grep to check if your auth logs show more than 50 failed login attempts in the last minute. The stateful logic ensures you get one alert about a potential brute-force attack, rather than a message for every single subsequent failed attempt while the attack is ongoing.

By using this stateful approach, you create a server that is not only "watched" but is actively working to maintain its own uptime.