This guide shows how to implement a stateful watchdog script for automated recovery and intelligent alerting. Unlike a basic monitor that notifies you every minute for as long as an issue persists, a "stateful" script remembers the previous condition of your system. This lets it distinguish between a new failure, a persistent problem, and a successful recovery, so you are notified only when the status changes.
Automated Recovery: Building a Stateful Watchdog for WireGuard and NFS
In complex server environments, services like VPN tunnels (WireGuard) and network storage (NFS) can occasionally drop. A simple alert is helpful, but an automated recovery script that tries to fix the problem before notifying you is even better.
How the Script Works
This script performs three core functions:
- State Persistence: It reads from a STATE_FILE to see if services were "up" or "down" during the last check.
- Automated Recovery: If the WireGuard tunnel is down, it automatically restarts the service. If the NFS mount is missing, it attempts to remount it.
- Intelligent Alerting: It only sends a Telegram message when a status changes (e.g., from Up to Down, or from Down to Recovered).
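The alerting rule reduces to comparing the previous state with the current one. The following is a minimal sketch of that comparison, assuming each service's state is stored as the string "up" or "down". The function name classify_transition is hypothetical and is not part of the script in Step 1.

```shell
#!/bin/bash
# Classify the change between the previous and current state of a service.
# Both arguments are expected to be "up" or "down".
classify_transition() {
    local previous="$1" current="$2"
    case "${previous}:${current}" in
        up:down)   echo "new_failure" ;;  # alert and attempt recovery
        down:down) echo "persistent"  ;;  # already reported, stay quiet
        down:up)   echo "recovered"   ;;  # send the all-clear
        up:up)     echo "healthy"     ;;  # nothing to do
        *)         echo "unknown"     ;;  # unexpected input
    esac
}

# Walk through all four combinations of previous and current state.
for pair in "up down" "down down" "down up" "up up"; do
    read -r prev cur <<< "$pair"
    printf '%s -> %s: %s\n' "$prev" "$cur" "$(classify_transition "$prev" "$cur")"
done
```

Only the new_failure and recovered cases would trigger a notification. The persistent and healthy cases stay silent, which is what keeps the alerts from repeating every minute.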
Step 1: Sanitize and Prepare the Script
Create a new file on your server (e.g., /usr/local/bin/system_watchdog.sh) and paste in the following generic version of the script, replacing the placeholder values in the configuration section with your own:
#!/bin/bash
### System Stateful Watchdog with Automatic Recovery

# --- Configuration (Replace with your details) ---
BOT_TOKEN="YOUR_BOT_TOKEN"
CHAT_ID="YOUR_CHAT_ID"
REMOTE_IP="10.0.0.2"            # The IP to ping over the VPN
WG_INTERFACE="wg0"              # Your WireGuard interface name
NFS_MOUNT="/mnt/network_share"  # Your NFS mount point
STATE_FILE="/var/log/watchdog_state.txt"
LOG_FILE="/var/log/system_watchdog.log"

# --- Helper: Send Telegram Message ---
send_telegram() {
    local MSG="$1"
    # URL-encode the text and discard the API response so cron does not mail it
    curl -s -o /dev/null -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
        -d chat_id="${CHAT_ID}" --data-urlencode text="$MSG"
    echo "$(date): $MSG" >> "$LOG_FILE"
}

# --- 1. Load Previous State ---
if [ -f "$STATE_FILE" ]; then
    source "$STATE_FILE"
else
    # Default to 'up' on the first run so a healthy system sends no alert
    # (a service that is already down is still reported as a new failure)
    WG_STATE="up"
    NFS_STATE="up"
fi

# --- 2. Check WireGuard Connectivity ---
if ping -c 2 -W 2 "$REMOTE_IP" >/dev/null 2>&1; then
    CURRENT_WG="up"
else
    CURRENT_WG="down"
fi

# Handle WireGuard state changes and recovery
if [ "$CURRENT_WG" != "$WG_STATE" ]; then
    if [ "$CURRENT_WG" == "down" ]; then
        send_telegram "⚠️ VPN Tunnel ($REMOTE_IP) is down. Attempting restart..."
        systemctl restart wg-quick@"$WG_INTERFACE"
        sleep 15
        # Verify if recovery worked
        if ping -c 2 -W 2 "$REMOTE_IP" >/dev/null 2>&1; then
            CURRENT_WG="up"
            send_telegram "✅ VPN Tunnel recovered automatically."
        else
            send_telegram "❌ VPN Tunnel still down after restart."
        fi
    else
        send_telegram "✅ VPN Tunnel is back online."
    fi
fi
WG_STATE="$CURRENT_WG"

# --- 3. Check NFS Mount ---
if mountpoint -q "$NFS_MOUNT"; then
    CURRENT_NFS="up"
    # The mount came back without our help since the last run
    [ "$NFS_STATE" == "down" ] && send_telegram "✅ NFS mount is back online."
else
    CURRENT_NFS="down"
    # Alert only when the mount has just gone missing, not on every run
    [ "$NFS_STATE" == "up" ] && send_telegram "⚠️ NFS mount $NFS_MOUNT is missing. Attempting remount..."
    # Retry on every run; this form of mount needs an /etc/fstab entry
    mount "$NFS_MOUNT" >/dev/null 2>&1
    if mountpoint -q "$NFS_MOUNT"; then
        CURRENT_NFS="up"
        send_telegram "✅ NFS mount recovered automatically."
    elif [ "$NFS_STATE" == "up" ]; then
        send_telegram "❌ NFS mount recovery failed."
    fi
fi
NFS_STATE="$CURRENT_NFS"

# --- 4. Save State for Next Run ---
echo "WG_STATE=$WG_STATE" > "$STATE_FILE"
echo "NFS_STATE=$NFS_STATE" >> "$STATE_FILE"
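The script loads its state with source and saves it with two echo calls. That is simple, but source executes whatever the file contains, and a run interrupted between the two writes leaves a partial file. One possible hardening is sketched below, assuming the same KEY=value file format. The helper names save_state and load_state are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helpers; save_state and load_state are illustrative names.

# Write the state to a temporary file next to the target, then rename it.
# A rename within one filesystem is atomic, so a reader never sees a
# half-written state file.
save_state() {
    local file="$1" wg="$2" nfs="$3" tmp
    tmp="$(mktemp "${file}.XXXXXX")" || return 1
    printf 'WG_STATE=%s\nNFS_STATE=%s\n' "$wg" "$nfs" > "$tmp" && mv "$tmp" "$file"
}

# Parse KEY=value pairs instead of executing the file. Only the two known
# keys and the values "up" or "down" are accepted; anything else is ignored
# and the default of "up" is kept.
load_state() {
    local file="$1" key value
    WG_STATE="up"
    NFS_STATE="up"
    [ -f "$file" ] || return 0
    while IFS='=' read -r key value; do
        case "$value" in up|down) ;; *) continue ;; esac
        case "$key" in
            WG_STATE)  WG_STATE="$value" ;;
            NFS_STATE) NFS_STATE="$value" ;;
        esac
    done < "$file"
}
```

In the watchdog, load_state "$STATE_FILE" would stand in for the source call in section 1, and save_state "$STATE_FILE" "$WG_STATE" "$NFS_STATE" for the two echo lines in section 4.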
Step 2: Set Permissions
Make the script executable so the system can run it:
sudo chmod +x /usr/local/bin/system_watchdog.sh
Step 3: Automate with Cron
To ensure the script checks your systems regularly, add it to the root crontab.
- Open the crontab:
sudo crontab -e
- Add this line to run the check every minute:
* * * * * /usr/local/bin/system_watchdog.sh
Customizing for Other Scenarios
The beauty of this stateful logic is that it can be adapted to monitor almost anything. Here are three ways you can customize it:
- Database Health: Replace the ping check with a command like mysqladmin ping. If it fails, the script can run systemctl restart mysql and alert you if the database stays down.
- Disk Space Alerts: Modify the script to check if disk usage exceeds 90% using df. The "state" would track if you have already been alerted about the low space, preventing a notification every minute until you clear the drive.
- Website Content Monitoring: Use curl to check if a specific keyword exists on your homepage. If the keyword disappears (indicating a defacement or application error), the script can alert you and potentially restart your web server (Nginx/Apache).
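As an illustration of the disk space variant, here is a rough sketch that assumes a 90% threshold on the root filesystem. The names disk_usage_percent, disk_state, and DISK_STATE are hypothetical, and the echo stands in for a call to send_telegram.

```shell
#!/bin/bash
# Hypothetical helpers for a stateful disk-space check.

# Print the usage percentage of the filesystem holding the given path.
# `df -P` selects the POSIX output format, where column 5 is the capacity.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a usage percentage to "up" (enough space) or "down" (over threshold).
disk_state() {
    local usage="$1" threshold="${2:-90}"
    if [ "$usage" -gt "$threshold" ]; then
        echo "down"
    else
        echo "up"
    fi
}

# Notify only when the state differs from the one saved by the last run.
DISK_STATE="${DISK_STATE:-up}"   # would normally be loaded from STATE_FILE
CURRENT_DISK="$(disk_state "$(disk_usage_percent /)" 90)"
if [ "$CURRENT_DISK" != "$DISK_STATE" ]; then
    # send_telegram would be called here in the real watchdog
    echo "Disk state changed: $DISK_STATE -> $CURRENT_DISK"
fi
DISK_STATE="$CURRENT_DISK"
```

Because DISK_STATE is saved between runs, a full disk produces one alert when it crosses the threshold and one more when space is freed, rather than a message every minute.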
By using this stateful approach, you transform your server from a passive machine into a self-healing environment that only bothers you when a problem requires human intervention.
