Operations Guide¶
This guide covers backup, monitoring, maintenance, and incident response for MyPost.
Backup & Recovery¶
Automated Backups¶
Backups are enabled by default and run daily at 2 AM.
Configuration:
BACKUP_ENABLED=true
BACKUP_SCHEDULE=0 2 * * * # Daily at 2 AM
BACKUP_RETENTION_DAYS=30
S3_BACKUP_BUCKET=mypost-backups
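To make the effect of `BACKUP_RETENTION_DAYS` concrete, here is a minimal sketch of how a retention check could be expressed. It assumes the `backup_YYYYMMDD_HHMMSS.sql.gz` naming used in the manual backup commands in this guide. The function names are illustrative and are not part of MyPost.

```shell
# Sketch: decide which backup files fall outside the retention window.
# Assumes the backup_YYYYMMDD_HHMMSS.sql.gz naming used in this guide.

# Extract the YYYYMMDD stamp from a backup file name (prints nothing if it does not match).
backup_date() {
  printf '%s\n' "$1" | sed -n 's/^.*backup_\([0-9]\{8\}\)_[0-9]\{6\}\.sql\.gz$/\1/p'
}

# Succeed (exit 0) when the backup is older than the cutoff date given as YYYYMMDD.
is_expired() {
  stamp=$(backup_date "$1")
  [ -n "$stamp" ] && [ "$stamp" -lt "$2" ]
}

# Read file names on stdin and print the ones older than the cutoff.
list_expired() {
  while IFS= read -r name; do
    if is_expired "$name" "$1"; then printf '%s\n' "$name"; fi
  done
}
```

With GNU `date`, a cutoff for the default retention could be computed as `date -d "30 days ago" +%Y%m%d` and passed to `list_expired`, for example `ls | list_expired "$cutoff"`.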
Manual Backup¶
# Database backup
docker compose exec -T postgres pg_dump -U mypost mypost | gzip > backup_$(date +%Y%m%d_%H%M%S).sql.gz
# Full backup (database + media)
./scripts/backup.sh full
# Database only
./scripts/backup.sh database
Restore from Backup¶
# List available backups
docker compose exec api npm run backup:list
# Restore database
gunzip -c backup_20260101_020000.sql.gz | docker compose exec -T postgres psql -U mypost mypost
# Restore from S3
./scripts/restore.sh s3://mypost-backups/backup_20260101_020000.sql.gz
Backup Verification¶
# Test restore to temporary database
docker compose exec postgres createdb -U mypost mypost_test
gunzip -c backup.sql.gz | docker compose exec -T postgres psql -U mypost mypost_test
# Verify data
docker compose exec postgres psql -U mypost mypost_test -c "SELECT COUNT(*) FROM users;"
# Clean up
docker compose exec postgres dropdb -U mypost mypost_test
Monitoring¶
Health Checks¶
# API health
curl https://mypost.yourdomain.com/api/v1/health
# Expected response
{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "storage": "connected"
}
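For cron jobs or deploy scripts it can help to turn the health response into an exit status. The following is a minimal sketch under the assumption that the response body matches the example above. The function names are hypothetical, and the default URL is the placeholder domain used in this guide.

```shell
# Sketch: convert the health response into an exit status.

# Succeed only when the JSON body reports "status": "healthy".
health_ok() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Fetch the endpoint and report the result; returns non-zero on failure.
check_health() {
  url=${1:-https://mypost.yourdomain.com/api/v1/health}
  body=$(curl -fsS --max-time 10 "$url") || { echo "unreachable: $url"; return 2; }
  if health_ok "$body"; then
    echo "healthy"
  else
    echo "unhealthy: $body"
    return 1
  fi
}
```

Calling `check_health` with no arguments uses the placeholder URL; pass your own URL as the first argument.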
Metrics Endpoint¶
Available Metrics:
| Metric | Description |
|--------|-------------|
| http_requests_total | Total HTTP requests |
| http_request_duration_seconds | Request latency |
| db_connections_active | Active database connections |
| redis_connections_active | Active Redis connections |
| queue_jobs_pending | Pending background jobs |
| posts_published_total | Total posts published |
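As a quick way to read one of these values without Prometheus, the sketch below extracts an unlabelled metric from the standard Prometheus text format. It assumes the metrics are served at `/api/v1/metrics` (the path used in the Prometheus configuration in this guide) and that the metric is exposed without labels. `metric_value` is an illustrative helper, not a MyPost command.

```shell
# Sketch: print the value of an unlabelled metric read from stdin
# in Prometheus text exposition format ("name value" per line).
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}
```

Example usage against the placeholder domain: `curl -fsS https://mypost.yourdomain.com/api/v1/metrics | metric_value queue_jobs_pending`.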
Prometheus Configuration¶
# prometheus.yml
scrape_configs:
  - job_name: 'mypost-api'
    static_configs:
      - targets: ['api:3000']
    metrics_path: '/api/v1/metrics'
Grafana Dashboard¶
Import the included dashboard:
Panels included:
- Request rate and latency
- Error rate by endpoint
- Database connection pool
- Queue depth and processing time
- Post publishing success rate
Alerting¶
Alert Configuration¶
# In .env
ALERT_EMAIL=ops@yourcompany.com
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_KEY=your-key # Optional
Alert Rules¶
| Alert | Condition | Severity |
|---|---|---|
| API Down | Health check fails 3x | Critical |
| High Error Rate | >5% 5xx errors | Warning |
| Database Connection | Pool exhausted | Critical |
| Queue Backlog | >1000 pending jobs | Warning |
| Disk Space | <10% free | Critical |
| Certificate Expiry | <7 days | Warning |
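The thresholds in the table can be read as simple predicates. The sketch below expresses two of them (error rate and queue backlog) as shell functions to show exactly where the boundaries fall. It is an illustration of the conditions only and is not how MyPost evaluates alerts internally.

```shell
# Sketch: two alert conditions from the table as shell predicates.

# Succeed when errors/total is strictly above the threshold percentage (default 5).
error_rate_high() {
  awk -v total="$1" -v errors="$2" -v pct="${3:-5}" \
    'BEGIN { exit !(total > 0 && errors * 100 / total > pct) }'
}

# Succeed when pending jobs are strictly above the limit (default 1000).
queue_backlog_high() {
  [ "$1" -gt "${2:-1000}" ]
}
```

Both conditions are strict: exactly 5% errors or exactly 1000 pending jobs does not trigger the corresponding warning.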
Test Alerts¶
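One way to confirm that the Slack channel is reachable, independent of MyPost, is to post a message straight to the incoming webhook. This is a minimal sketch that assumes `SLACK_WEBHOOK_URL` is exported in the current shell. It exercises the webhook only, not MyPost's alert rules, and `send_test_alert` is a hypothetical helper name.

```shell
# Sketch: post a test message to the Slack incoming webhook.
# Assumes SLACK_WEBHOOK_URL is set in the environment (see .env above).
send_test_alert() {
  : "${SLACK_WEBHOOK_URL:?SLACK_WEBHOOK_URL is not set}"
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data '{"text": "MyPost test alert: please ignore"}' \
    "$SLACK_WEBHOOK_URL"
}
```

A successful call returns a short confirmation body from Slack; a non-zero exit status from `curl` indicates the webhook rejected the request or could not be reached.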
Log Management¶
View Logs¶
# All services
docker compose logs -f
# Specific service
docker compose logs -f api
# Last 100 lines
docker compose logs --tail 100 api
# Filter errors
docker compose logs api 2>&1 | grep -i error
Log Levels¶
Log Rotation¶
Add to Docker daemon configuration (/etc/docker/daemon.json):
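A minimal sketch of such a configuration, using Docker's `json-file` logging driver with its `max-size` and `max-file` options. The 10 MB and 3-file limits are example values chosen for illustration, not MyPost defaults. The function only prints the fragment so that it can be merged by hand into an existing `daemon.json`.

```shell
# Sketch: print a daemon.json fragment that caps container log size.
# The limits below are example values; adjust them to your disk budget.
print_log_rotation_config() {
  cat <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
}
```

After editing `/etc/docker/daemon.json`, restart the Docker daemon. The new limits apply only to containers created afterwards, so existing containers need to be recreated to pick them up.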
Centralized Logging¶
With Loki:
# Add to compose.yaml
loki:
  image: grafana/loki:latest
  ports:
    - "3100:3100"
  volumes:
    - loki-data:/loki
promtail:
  image: grafana/promtail:latest
  volumes:
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
    - ./promtail.yml:/etc/promtail/config.yml
Maintenance Tasks¶
Database Maintenance¶
# Vacuum and analyze (weekly recommended)
docker compose exec postgres vacuumdb -U mypost --analyze mypost
# Reindex (monthly or after heavy deletes)
docker compose exec postgres reindexdb -U mypost mypost
# Check table sizes
docker compose exec postgres psql -U mypost mypost -c "
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"
Redis Maintenance¶
# Check memory usage
docker compose exec redis redis-cli info memory
# List a sample of session keys (read-only; Redis removes expired keys on its own)
docker compose exec -T redis redis-cli --scan --pattern 'session:*' | head -10
# Check slow log
docker compose exec redis redis-cli slowlog get 10
Media Cleanup¶
# Find orphaned media (not referenced by any post)
docker compose exec api npm run media:cleanup -- --dry-run
# Actually delete orphaned media
docker compose exec api npm run media:cleanup
Token Cleanup¶
Scaling Operations¶
Scale API Horizontally¶
# Docker Compose
docker compose up -d --scale api=3
# Docker Swarm
docker service scale mypost_api=3
Scale Workers¶
Database Read Replicas¶
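# Add to compose.yaml
# Note: the official postgres image does not read these two variables;
# this assumes an image or entrypoint that configures replication from them.
postgres-replica:
  image: postgres:16-alpine
  environment:
    - POSTGRES_PRIMARY_HOST=postgres
    - POSTGRES_REPLICA_MODE=true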
Security Operations¶
Rotate Secrets¶
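# 1. Generate new secrets (tr removes the line breaks openssl adds to long base64 output)
NEW_JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n')
NEW_ENCRYPTION_KEY=$(openssl rand -base64 32 | tr -d '\n')
# 2. Update .env (| is the sed delimiter because base64 output can contain /)
sed -i "s|^JWT_SECRET=.*|JWT_SECRET=$NEW_JWT_SECRET|" .env
sed -i "s|^ENCRYPTION_KEY=.*|ENCRYPTION_KEY=$NEW_ENCRYPTION_KEY|" .env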
# 3. Re-encrypt tokens (required for ENCRYPTION_KEY change)
docker compose exec api npm run tokens:reencrypt
# 4. Restart services
docker compose down && docker compose up -d
Security Audit¶
# Check for vulnerabilities in dependencies
docker compose exec api npm audit
# Review recent security events
docker compose exec api npm run audit:security-events -- --days=7
Failed Login Review¶
# View failed login attempts
docker compose exec postgres psql -U mypost mypost -c "
SELECT email, ip_address, COUNT(*) as attempts, MAX(created_at) as last_attempt
FROM security_audit_log
WHERE action = 'login_failed' AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY email, ip_address
HAVING COUNT(*) > 3
ORDER BY attempts DESC;"
Incident Response¶
Runbook: API Down¶
- Verify the issue
- Check logs
- Restart API service
- If database issue
- Escalate if unresolved after 15 minutes
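The steps above can be sketched as a single shell function built from commands already used in this guide, plus `pg_isready` for the database check. The function name and the default URL (the placeholder domain) are assumptions; treat it as a starting point rather than a complete runbook.

```shell
# Sketch of the API Down runbook; service names follow this guide.
api_down_triage() {
  url=${1:-https://mypost.yourdomain.com/api/v1/health}

  # 1. Verify the issue
  if curl -fsS --max-time 10 "$url" >/dev/null; then
    echo "API is responding; nothing to do"
    return 0
  fi

  # 2. Check logs
  docker compose logs --tail 100 api

  # 3. Restart API service
  docker compose restart api

  # 4. If database issue: confirm PostgreSQL accepts connections
  docker compose exec -T postgres pg_isready -U mypost \
    || echo "PostgreSQL is not ready; check: docker compose logs --tail 100 postgres"

  # 5. Escalate if unresolved after 15 minutes
  echo "Re-check $url; escalate if still failing after 15 minutes"
}
```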
Runbook: Publishing Failures¶
- Check queue status
- Review failed jobs
- Check network adapter logs
- Retry failed jobs
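A partial sketch of the first three steps, using only the `queue_jobs_pending` metric and the `api` service logs that appear elsewhere in this guide. It assumes publishing errors are written to the `api` logs; if a separate worker service handles publishing, substitute its name. The retry step depends on MyPost's own tooling and is deliberately left out.

```shell
# Sketch of the Publishing Failures runbook (steps 1 to 3 only).
publishing_triage() {
  metrics_url=${1:-https://mypost.yourdomain.com/api/v1/metrics}

  # 1. Check queue status via the queue_jobs_pending metric
  pending=$(curl -fsS --max-time 10 "$metrics_url" \
    | awk '$1 == "queue_jobs_pending" { print $2; exit }')
  echo "pending jobs: ${pending:-unknown}"

  # 2-3. Review recent failures and adapter errors in the logs
  docker compose logs --tail 500 api 2>&1 | grep -iE 'error|fail' | tail -n 50
}
```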
Runbook: Database Full¶
- Check disk usage
- Identify large tables
- Clean up audit logs (if safe)
- Vacuum to reclaim space
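The steps above can be sketched as follows. The `security_audit_log` table and its `created_at` column are taken from the failed login query in this guide. The retention period is an assumption that you must supply, and the audit cleanup is skipped unless a number of days is passed explicitly.

```shell
# Sketch of the Database Full runbook.
# Usage: db_full_triage [days]  (audit rows older than <days> are deleted)
db_full_triage() {
  days=$1

  # 1. Check disk usage on the host
  df -h

  # 2. Identify large tables
  docker compose exec -T postgres psql -U mypost mypost -c "
    SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS size
    FROM pg_catalog.pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 10;"

  # 3. Clean up audit logs (only when a numeric retention is given)
  case $days in
    ''|*[!0-9]*)
      echo "skipping audit log cleanup (pass a number of days to enable it)" ;;
    *)
      docker compose exec -T postgres psql -U mypost mypost -c "
        DELETE FROM security_audit_log
        WHERE created_at < NOW() - INTERVAL '$days days';" ;;
  esac

  # 4. Vacuum to reclaim space
  docker compose exec -T postgres vacuumdb -U mypost --analyze mypost
}
```

A plain vacuum marks space as reusable inside PostgreSQL but usually does not shrink files on disk. Returning space to the operating system requires `vacuumdb --full`, which takes exclusive locks and needs temporary free space, so schedule it carefully.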
Disaster Recovery¶
Recovery Time Objectives¶
| Scenario | RTO | RPO |
|---|---|---|
| Service restart | 5 min | 0 |
| Container rebuild | 15 min | 0 |
| Database restore | 1 hour | 15 min (with WAL) |
| Full disaster recovery | 4 hours | 24 hours (daily backup) |
DR Procedure¶
- Provision new server
- Install Docker and clone repo
- Restore .env file (from secure backup)
- Restore database from latest backup
- Restore media from S3
- Update DNS to point to new server
- Verify all services
Maintenance Windows¶
Recommended Schedule¶
| Task | Frequency | Window |
|---|---|---|
| Security updates | Weekly | Sunday 2-4 AM |
| Database vacuum | Weekly | Sunday 3 AM |
| Log rotation | Daily | Automatic |
| Backup verification | Monthly | 1st Sunday |
| Secret rotation | Quarterly | Planned |
| Major upgrades | As needed | Planned |
Announcing Maintenance¶
# Set maintenance mode
docker compose exec api npm run maintenance:enable -- --message "Scheduled maintenance in progress"
# Disable maintenance mode
docker compose exec api npm run maintenance:disable
Support Contacts¶
| Issue | Contact | Escalation |
|---|---|---|
| Application bugs | GitHub Issues | — |
| Security issues | security@yourcompany.com | Immediate |
| Infrastructure | ops@yourcompany.com | On-call |