Shell Scripting for SRE and DevOps Automation¶
Shell scripts are one of the most practical tools in infrastructure work. They let you combine operating system commands, glue together tools, automate repetitive tasks, and turn manual runbooks into repeatable workflows.
For SRE and DevOps work, shell scripting is often the fastest way to automate:
- Health checks
- Service operations
- Backup tasks
- Deployment helpers
- Log analysis
- CI/CD build steps
- Cron-based maintenance
Why Shell Scripting Still Matters¶
Even with tools like Python, Go, Ansible, and Terraform, shell scripts remain important because they are:
- Available by default on most Linux and Unix systems
- Excellent for orchestration and system-level automation
- Easy to integrate with commands like systemctl, docker, kubectl, curl, and rsync
- Fast to write for operational tasks and prototypes
Use shell where it fits
Shell is great for command orchestration and glue logic. If the logic becomes very complex, data-heavy, or hard to read, it is often better to switch to Python or Go.
Common Shells¶
Popular Unix shells include:
- sh: Basic POSIX shell
- bash: Most common shell for Linux scripting
- zsh: Powerful interactive shell, often used on macOS
- ksh: Korn shell, common in some enterprise Unix systems
For most Linux automation work, bash is the most common choice.
Start Every Script Correctly¶
Shebang¶
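The first line of a script, the shebang, tells the OS which interpreter should run it.
Writing it as #!/usr/bin/env bash looks bash up through the PATH instead of assuming a fixed location such as /bin/bash, which makes it a portable way to locate bash.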
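As a minimal sketch, a script that starts with the env-based shebang and reports the interpreter it ended up running under might look like this. The variable name interpreter_info is illustrative only.

```shell
#!/usr/bin/env bash
# env searches PATH for bash, so the script does not depend on bash living at /bin/bash.

# BASH_VERSION is set by bash itself; the fallback covers other interpreters.
interpreter_info="bash ${BASH_VERSION:-unknown}"
echo "Running under: ${interpreter_info}"
```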
Safe mode¶
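A strong default for production scripts is to put set -euo pipefail right after the shebang.
What it does:
- -e: Exit when a command fails (commands tested by if, while, && or || are exempt)
- -u: Fail on unset variables
- -o pipefail: Fail a pipeline if any command inside it fails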
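The effect of each flag can be observed in isolation. The sketch below is one way to do that, assuming bash is on the PATH: each check runs in a child bash process so the failure does not stop the demonstration itself. The variable names are illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# -e: the child script stops at the first failing command.
if bash -ec 'false; echo "not reached"' >/dev/null; then
    errexit_result="continued"
else
    errexit_result="stopped"
fi

# -u: expanding an unset variable is treated as an error.
if bash -uc 'echo "$UNDEFINED_DEMO_VAR"' >/dev/null 2>&1; then
    nounset_result="continued"
else
    nounset_result="stopped"
fi

# pipefail: the pipeline fails even though its last command (cat) succeeds.
if bash -o pipefail -c 'false | cat'; then
    pipefail_result="succeeded"
else
    pipefail_result="failed"
fi

echo "errexit: ${errexit_result}, nounset: ${nounset_result}, pipefail: ${pipefail_result}"
```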
You can also add set -x to print each command before it runs, which is useful during debugging.
Core Building Blocks¶
Variables¶
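A short sketch of variable assignment and expansion in Bash. The service name and paths are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# No spaces around "=" in an assignment.
service_name="nginx"
max_retries=3

# readonly marks a value that must not change later in the script.
readonly LOG_DIR="/var/log/${service_name}"

# Braces mark where the variable name ends inside a longer string.
log_file="${LOG_DIR}/${service_name}_error.log"

# Arithmetic expansion works on integer values.
total_attempts=$(( max_retries + 1 ))

echo "Service: ${service_name}"
echo "Log file: ${log_file}"
echo "Total attempts: ${total_attempts}"
```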
Arguments¶
echo "Script name: $0"
echo "First argument: $1"
echo "Argument count: $#"
echo "All arguments: $*"   # "$*" joins them into one string; use "$@" when passing them on
echo "Previous command exit code: $?"
Conditionals¶
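The sketch below shows three common forms of conditional: arithmetic tests with (( )), file and string tests with [[ ]], and testing a command's exit status directly. The helper classify_usage, its thresholds, and the example path are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: classify a disk usage percentage.
classify_usage() {
    local percent="$1"
    if (( percent >= 90 )); then
        echo "critical"
    elif (( percent >= 80 )); then
        echo "warning"
    else
        echo "ok"
    fi
}

config_file="/etc/hostname"   # example path; adjust for your system

# [[ ]] handles file tests and string comparisons.
if [[ -f "$config_file" ]]; then
    echo "Found ${config_file}"
else
    echo "Missing ${config_file}"
fi

# A command's exit status can be tested directly.
if command -v curl >/dev/null 2>&1; then
    echo "curl is installed"
else
    echo "curl is not installed"
fi

echo "85% usage is: $(classify_usage 85)"
```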
Loops¶
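A brief sketch of the loop forms that come up most often in operational scripts. The service names are placeholders, and the loops only print what they would do.

```shell
#!/usr/bin/env bash
set -euo pipefail

# for: iterate over a fixed list of words.
checked_count=0
for service in nginx redis postgres; do
    echo "Would check: ${service}"
    checked_count=$(( checked_count + 1 ))
done

# C-style for: count through a numeric range.
sum=0
for (( i = 1; i <= 5; i++ )); do
    sum=$(( sum + i ))
done
echo "Sum of 1..5: ${sum}"

# while: repeat until a condition stops holding, as in a bounded retry.
attempt=1
max_attempts=3
while (( attempt < max_attempts )); do
    echo "Attempt ${attempt} did not succeed, retrying"
    attempt=$(( attempt + 1 ))
done
echo "Stopped at attempt ${attempt}"
```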
Functions¶
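Functions in Bash hand back data by printing it and signal success or failure through their return status. The sketch below illustrates both; to_upper, is_valid_env, and the list of environment names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Output is "returned" by printing it; callers capture it with $(...).
to_upper() {
    local text="$1"   # local keeps the variable inside the function
    printf '%s\n' "$text" | tr '[:lower:]' '[:upper:]'
}

# The return status signals success (0) or failure (non-zero).
is_valid_env() {
    local env_name="$1"
    case "$env_name" in
        dev|staging|prod) return 0 ;;
        *) return 1 ;;
    esac
}

echo "Upper: $(to_upper "staging")"

if is_valid_env "staging"; then
    echo "staging is a valid environment"
fi
if ! is_valid_env "sandbox"; then
    echo "sandbox is not a valid environment"
fi
```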
Case statements¶
case "${1:-}" in
    start)  echo "Starting service" ;;
    stop)   echo "Stopping service" ;;
    status) echo "Checking status" ;;
    *)      echo "Usage: $0 {start|stop|status}" ;;
esac
Useful Bash Patterns¶
Default values¶
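Parameter expansion can supply fallbacks without an explicit if. The sketch below shows three forms that also work under set -u; the DEMO_ variable names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Start from a clean state so the defaults below actually apply.
unset DEMO_ENVIRONMENT DEMO_REGION DEMO_TOKEN

# ${var:-default}: use the default when var is unset or empty; var is unchanged.
environment="${DEMO_ENVIRONMENT:-dev}"

# ${var:=default}: like :-, but also assigns the default to var.
: "${DEMO_REGION:=us-east-1}"

# ${var:+alt}: expand to alt only when var is set and non-empty.
token_status="${DEMO_TOKEN:+present}"

echo "environment=${environment}"
echo "region=${DEMO_REGION}"
echo "token=${token_status:-absent}"
```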
Command substitution¶
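Command substitution captures a command's standard output so it can be stored or reused. A minimal sketch, using only date, uname, wc, and basename:

```shell
#!/usr/bin/env bash
set -euo pipefail

# $(...) runs a command and substitutes its standard output.
today="$(date +%F)"
kernel="$(uname -s)"

# The captured text can feed arithmetic or other commands.
word_count="$(printf 'one two three\n' | wc -w)"
word_count=$(( word_count ))   # normalizes padding that wc adds on some systems

# $(...) nests cleanly, which is the main reason to prefer it over backticks.
current_dir_name="$(basename "$(pwd)")"

echo "Date: ${today}"
echo "Kernel: ${kernel}"
echo "Words: ${word_count}"
echo "Current directory name: ${current_dir_name}"
```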
Arrays¶
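A sketch of indexed arrays in Bash, including the quoting needed to keep each element intact. The service names and curl options are examples; the curl command is assembled but not executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Indexed array of example service names.
services=("nginx" "redis" "postgres")
services+=("cron")   # append an element

echo "First: ${services[0]}"
echo "Count: ${#services[@]}"

# Quote "${array[@]}" so each element stays one word, even if it contains spaces.
for service in "${services[@]}"; do
    echo "Service: ${service}"
done

# Arrays are a safe way to build up a command's arguments piece by piece.
curl_args=(--silent --max-time 5)
curl_args+=(--header "Accept: application/json")
echo "Argument count: ${#curl_args[@]}"
```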
Reading a file line by line¶
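A common idiom for this is a while loop around read. The sketch below builds its own sample file so it can run anywhere; the host names are made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a small sample file (deliberately without a trailing newline).
hosts_file="$(mktemp)"
printf 'web-01\n\n# comment line\ndb-01\ncache-01' > "$hosts_file"

host_count=0
# IFS= preserves leading and trailing whitespace; -r keeps backslashes literal.
# The "|| [[ -n $line ]]" part picks up a final line that lacks a newline.
while IFS= read -r line || [[ -n "$line" ]]; do
    [[ -z "$line" || "$line" == \#* ]] && continue   # skip blanks and comments
    host_count=$(( host_count + 1 ))
    echo "Host: ${line}"
done < "$hosts_file"

echo "Hosts read: ${host_count}"
rm -f "$hosts_file"
```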
Trap for cleanup¶
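An EXIT trap runs when the shell exits, whether the script succeeded or failed, which makes it a natural place to remove temporary files. In a real script the trap is usually set once at the top level. The sketch below sets it inside a subshell instead, purely so the cleanup can be observed afterwards; run_with_scratch_dir is a hypothetical name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical task that needs a scratch directory and then fails.
run_with_scratch_dir() {
    (
        scratch_dir="$(mktemp -d)"
        # The EXIT trap fires when this subshell ends, on success or failure.
        trap 'rm -rf "$scratch_dir"' EXIT

        echo "$scratch_dir"                          # report the path to the caller
        echo "working" > "${scratch_dir}/state.txt"
        exit 1                                       # simulate a failing step
    )
}

# The task fails, but the trap has already removed the directory.
if scratch_path="$(run_with_scratch_dir)"; then
    task_status="succeeded"
else
    task_status="failed"
fi

if [[ -d "$scratch_path" ]]; then
    echo "Scratch directory still exists: ${scratch_path}"
else
    echo "Scratch directory was cleaned up (task ${task_status})"
fi
```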
Script Structure Template¶
This is a clean starting point for operational scripts:
#!/usr/bin/env bash
set -euo pipefail

log() {
    printf '[%s] %s\n' "$(date '+%F %T')" "$1"
}

usage() {
    echo "Usage: $0 <environment>"
    exit 1
}

main() {
    local environment="${1:-}"
    [[ -n "$environment" ]] || usage
    log "Running in environment: $environment"
}

main "$@"
Example Script 1: System Health Check¶
This script is useful for daily operations checks or cron jobs.
#!/usr/bin/env bash
set -euo pipefail

WARNING_DISK_THRESHOLD=80

log() {
    printf '[%s] %s\n' "$(date '+%F %T')" "$1"
}

check_disk() {
    local usage
    # -P forces one output line per filesystem, so the percentage is always field 5 of line 2.
    usage="$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')"
    if (( usage >= WARNING_DISK_THRESHOLD )); then
        log "WARNING: Root disk usage is ${usage}%"
    else
        log "OK: Root disk usage is ${usage}%"
    fi
}

check_memory() {
    free -h
}

check_load() {
    uptime
}

main() {
    log "Starting health check"
    hostnamectl || true   # informational only; not every system has hostnamectl
    check_disk
    check_memory
    check_load
    log "Health check completed"
}

main "$@"
Use cases:
- Cron-based health reports
- Basic server validation after provisioning
- Quick support triage
Example Script 2: Log Error Scanner¶
This script scans a log file for common error patterns and prints a short summary.
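#!/usr/bin/env bash
set -euo pipefail

LOG_FILE="${1:-/var/log/syslog}"

if [[ ! -f "$LOG_FILE" ]]; then
    echo "Log file not found: $LOG_FILE" >&2
    exit 1
fi

echo "Scanning: $LOG_FILE"
echo
echo "Top error counts:"
# grep exits 1 when nothing matches; "|| true" keeps pipefail from aborting the script.
# awk reads all of its input, which avoids the SIGPIPE failure that head can cause under pipefail.
{ grep -Ei 'error|failed|fatal|panic|critical' "$LOG_FILE" || true; } |
    sed -E 's/[[:space:]]+/ /g' |
    sort | uniq -c | sort -nr | awk 'NR <= 10'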
Use cases:
- Rapid incident triage
- Scheduled log reviews
- Pre-check before escalating to application teams
Example Script 3: Backup Script with Retention¶
This script creates a compressed backup and deletes old backups based on retention days.
#!/usr/bin/env bash
set -euo pipefail
SOURCE_DIR="${1:-/etc}"
BACKUP_DIR="${2:-/var/backups/custom}"
RETENTION_DAYS="${RETENTION_DAYS:-7}"
TIMESTAMP="$(date +%F-%H%M%S)"
ARCHIVE_NAME="backup-${TIMESTAMP}.tar.gz"
mkdir -p "$BACKUP_DIR"
tar -czf "${BACKUP_DIR}/${ARCHIVE_NAME}" "$SOURCE_DIR"
echo "Created backup: ${BACKUP_DIR}/${ARCHIVE_NAME}"
find "$BACKUP_DIR" -type f -name 'backup-*.tar.gz' -mtime +"$RETENTION_DAYS" -delete
echo "Removed backups older than ${RETENTION_DAYS} days"
Use cases:
- Configuration backups
- Pre-change safety snapshots
- Simple server maintenance automation
Production backup note
For critical systems, backups should also include verification, secure storage, encryption, restore testing, and off-host copies.
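Of the items in the note above, verification is the easiest to sketch. The example below checks that an archive is readable end to end, assuming tar and gzip are available; verify_backup and the demo file names are hypothetical and not part of the backup script above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only if the archive can be read end to end.
verify_backup() {
    local archive="$1"
    gzip -t "$archive" && tar -tzf "$archive" >/dev/null
}

# Build one valid archive and one deliberately broken file to compare.
demo_dir="$(mktemp -d)"
echo "sample config" > "${demo_dir}/app.conf"
tar -czf "${demo_dir}/backup-demo.tar.gz" -C "$demo_dir" app.conf
echo "not a real archive" > "${demo_dir}/backup-broken.tar.gz"

if verify_backup "${demo_dir}/backup-demo.tar.gz"; then
    good_result="verified"
else
    good_result="corrupt"
fi

if verify_backup "${demo_dir}/backup-broken.tar.gz" 2>/dev/null; then
    bad_result="verified"
else
    bad_result="corrupt"
fi

echo "Valid archive: ${good_result}"
echo "Broken archive: ${bad_result}"
rm -rf "$demo_dir"
```

A readable archive is still only a partial check; restoring it somewhere and comparing the contents is the stronger test.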
Example Script 4: Deployment Helper¶
This pattern is useful when restarting a service after pulling the latest release artifacts or configuration.
#!/usr/bin/env bash
set -euo pipefail

APP_DIR="/opt/myapp"
SERVICE_NAME="myapp"

log() {
    printf '[%s] %s\n' "$(date '+%F %T')" "$1"
}

deploy() {
    log "Switching to application directory"
    cd "$APP_DIR"
    log "Pulling latest code"
    git pull origin main
    log "Restarting service"
    sudo systemctl restart "$SERVICE_NAME"
    log "Checking service status"
    sudo systemctl status "$SERVICE_NAME" --no-pager
}

deploy
Use cases:
- Simple service deployments
- Jenkins or GitLab CI shell stages
- Small internal tools on virtual machines
Example Script 5: Kubernetes Rollout Checker¶
This script validates that a Kubernetes deployment rollout finishes successfully.
#!/usr/bin/env bash
set -euo pipefail
NAMESPACE="${1:-default}"
DEPLOYMENT_NAME="${2:-}"
if [[ -z "$DEPLOYMENT_NAME" ]]; then
    echo "Usage: $0 <namespace> <deployment-name>"
    exit 1
fi
echo "Checking rollout for deployment/${DEPLOYMENT_NAME} in namespace ${NAMESPACE}"
kubectl rollout status "deployment/${DEPLOYMENT_NAME}" -n "$NAMESPACE" --timeout=120s
kubectl get pods -n "$NAMESPACE"
Use cases:
- Post-deployment checks
- CD validation steps
- Quick operational validation during incidents
Example Script 6: Service Monitor with Exit Codes¶
This is useful when a script needs to integrate with monitoring systems or CI jobs.
#!/usr/bin/env bash
set -euo pipefail
SERVICE_NAME="${1:-nginx}"
if systemctl is-active --quiet "$SERVICE_NAME"; then
    echo "OK: ${SERVICE_NAME} is running"
    exit 0
else
    echo "CRITICAL: ${SERVICE_NAME} is not running"
    exit 2
fi
Use cases:
- Nagios-style checks
- Cron alerts
- Jenkins or monitoring integrations
Running a Script¶
Save the file, then make it executable:
Run it:
Or pass arguments:
You can also run it directly with an interpreter:
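The four steps above can be sketched as follows. The example creates a throwaway script first so that every command has something to act on, and it assumes the temporary directory allows execution. The file name greet.sh is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create a throwaway script to run.
demo_dir="$(mktemp -d)"
script_path="${demo_dir}/greet.sh"
cat > "$script_path" <<'EOF'
#!/usr/bin/env bash
echo "Hello, ${1:-world}"
EOF

# 1. Make it executable.
chmod +x "$script_path"

# 2. Run it.
plain_output="$("$script_path")"

# 3. Pass arguments.
arg_output="$("$script_path" "ops-team")"

# 4. Run it through an interpreter explicitly; the execute bit is not needed for this.
interpreter_output="$(bash "$script_path" "bash")"

echo "$plain_output"
echo "$arg_output"
echo "$interpreter_output"
rm -rf "$demo_dir"
```

From the script's own directory, steps 2 and 3 are usually typed as ./greet.sh and ./greet.sh ops-team.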
Debugging Shell Scripts¶
Useful debugging techniques:
Enable command tracing¶
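Tracing is switched on with set -x and off with set +x, and the PS4 variable controls the prefix printed before each traced command. The sketch below traces a small hypothetical function and captures the trace, which bash writes to stderr, so it can be shown as ordinary output.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Single quotes delay expansion, so LINENO is evaluated for each traced command.
PS4='+ line ${LINENO}: '

trace_demo() {
    set -x                      # start tracing
    local target="web-01"
    echo "Checking ${target}" >/dev/null
    { set +x; } 2>/dev/null     # stop tracing without logging the "set +x" itself
}

# Trace output goes to stderr; 2>&1 captures it here.
trace_output="$(trace_demo 2>&1)"
echo "$trace_output"
```

To trace a whole script without editing it, run it as bash -x followed by the script path.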
Print line numbers on errors¶
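One common approach is an ERR trap that prints LINENO. The sketch below writes a small failing script to a temporary file and runs it in a separate bash process, so the trap's report can be captured without stopping the outer script. The message format is an arbitrary choice.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a small failing script so its ERR trap can be observed from outside.
demo_script="$(mktemp)"
cat > "$demo_script" <<'EOF'
#!/usr/bin/env bash
set -Eeuo pipefail   # -E makes functions and subshells inherit the ERR trap

trap 'echo "ERROR: exit code $? at line $LINENO" >&2' ERR

echo "step 1 ok"
false                # this failure triggers the trap
echo "never reached"
EOF

# Capture both streams; "|| demo_status=$?" keeps this outer script running.
demo_status=0
demo_output="$(bash "$demo_script" 2>&1)" || demo_status=$?
rm -f "$demo_script"

echo "$demo_output"
echo "Exit status: ${demo_status}"
```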
Check syntax without running¶
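bash -n parses a script and reports syntax errors without executing any of it. The sketch below runs it against two throwaway files, one valid and one with a missing fi, to show the difference in exit status.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Two throwaway scripts: one valid, one with a missing "fi".
good_script="$(mktemp)"
bad_script="$(mktemp)"
printf 'if true; then\n  echo "ok"\nfi\n' > "$good_script"
printf 'if true; then\n  echo "broken"\n' > "$bad_script"

# bash -n exits 0 for valid syntax and non-zero when parsing fails.
if bash -n "$good_script"; then
    good_syntax="valid"
else
    good_syntax="invalid"
fi

if bash -n "$bad_script" 2>/dev/null; then
    bad_syntax="valid"
else
    bad_syntax="invalid"
fi

echo "Good script: ${good_syntax}"
echo "Bad script: ${bad_syntax}"
rm -f "$good_script" "$bad_script"
```

A clean result only means the script parses; it says nothing about runtime errors such as missing commands or unset variables.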
Lint with ShellCheck¶
ShellCheck is worth using
ShellCheck catches quoting mistakes, unsafe expansions, and common Bash pitfalls that are easy to miss in review.
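ShellCheck is run by passing it one or more script paths. The sketch below wraps that call in a hypothetical lint_scripts function that skips linting when ShellCheck is not installed. Skipping is a choice made for this example; in a CI pipeline it may be preferable to fail instead.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper: lint the given files, or skip when ShellCheck is absent.
lint_scripts() {
    if ! command -v shellcheck >/dev/null 2>&1; then
        echo "shellcheck not found; skipping lint" >&2
        return 0
    fi
    shellcheck "$@"
}

# Throwaway script used only to give the linter some input.
sample_script="$(mktemp)"
printf '#!/usr/bin/env bash\necho "hello"\n' > "$sample_script"

if lint_scripts "$sample_script"; then
    lint_result="clean"
else
    lint_result="findings"
fi

echo "Lint result: ${lint_result}"
rm -f "$sample_script"
```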
Best Practices¶
- Use #!/usr/bin/env bash when you are writing Bash-specific scripts
- Start with set -euo pipefail for safer behavior
- Quote variables like "$VAR" unless you intentionally want word splitting
- Prefer functions for readability and reuse
- Validate input arguments early
- Return meaningful exit codes
- Log clearly so operators can understand what happened
- Avoid hardcoding secrets in scripts
- Use mktemp for temporary files
- Test scripts in a safe environment before production use
Common Mistakes to Avoid¶
- Unquoted variables: an unquoted $VAR is subject to word splitting and glob expansion. Safer: quote it as "$VAR"
- Ignoring command failures
- Using shell for overly complex business logic
- Assuming tools like jq, kubectl, or aws are installed without checking
- Deleting files without validating paths
- Mixing POSIX sh syntax and Bash-specific syntax accidentally
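Two of the mistakes above are easy to demonstrate. The sketch below counts how many arguments a value with a space turns into with and without quotes, then shows one way to check for a required tool before using it. count_args and require_command are hypothetical helpers, and the file name is made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

count_args() { echo "$#"; }

file_name="monthly report.txt"   # a value containing a space

# Unquoted: word splitting turns one value into two arguments.
# shellcheck disable=SC2086
unquoted_count="$(count_args $file_name)"

# Quoted: the value stays a single argument.
quoted_count="$(count_args "$file_name")"

echo "Unquoted expands to ${unquoted_count} arguments"
echo "Quoted expands to ${quoted_count} argument"

# Hypothetical helper: fail early when a required tool is missing.
require_command() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "Missing required command: $1" >&2
        return 1
    }
}

require_command sh
echo "All required commands are available"
```

Calling require_command for jq, kubectl, or aws near the top of a script turns a confusing mid-run failure into a clear message at startup.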
Integration with DevOps Workflows¶
Shell scripts are often used inside:
- Jenkins stages with sh
- GitHub Actions run steps
- Cron jobs for recurring tasks
- Systemd services and timers
- Docker entrypoint scripts
- Kubernetes init containers and operational jobs
Example Jenkins stage:
Example GitHub Actions step:
When to Use Another Language¶
Shell is not always the right tool. Consider Python or Go when you need:
- Complex parsing and data structures
- Heavy JSON or YAML processing
- Strong error modeling
- Cross-platform application logic
- Larger testable codebases
Use shell scripts for what they do best: command orchestration, operational automation, and fast system-level workflows.