Operations Guide¶

This guide covers backup, monitoring, maintenance, and incident response for MyPost.

Backup & Recovery¶

Automated Backups¶

Backups are enabled by default and run daily at 2 AM.

Configuration:

BACKUP_ENABLED=true
BACKUP_SCHEDULE=0 2 * * *       # Daily at 2 AM
BACKUP_RETENTION_DAYS=30
S3_BACKUP_BUCKET=mypost-backups
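
How the retention setting is enforced is not described here. If dumps are also kept on local disk, pruning by age could look like the sketch below. The function name `prune_backups`, the directory argument, and the `backup_*.sql.gz` file pattern are assumptions for illustration, not part of MyPost's own backup script.

```shell
# Hypothetical helper: delete local dumps older than the retention period.
# Usage: prune_backups DIR [DAYS]
# DAYS defaults to BACKUP_RETENTION_DAYS, or 30 if that is unset.
prune_backups() (
    dir=$1
    days=${2:-${BACKUP_RETENTION_DAYS:-30}}
    # -mtime +N matches files last modified more than N whole days ago.
    # -print lists each file as it is removed.
    find "$dir" -maxdepth 1 -type f -name 'backup_*.sql.gz' \
        -mtime "+$days" -print -delete
)
```

Running it with `-delete` removed first shows which files would be affected.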

Manual Backup¶

# Database backup
docker compose exec postgres pg_dump -U mypost mypost | gzip > backup_$(date +%Y%m%d_%H%M%S).sql.gz

# Full backup (database + media)
./scripts/backup.sh full

# Database only
./scripts/backup.sh database

Restore from Backup¶

# List available backups
docker compose exec api npm run backup:list

# Restore database
gunzip -c backup_20260101_020000.sql.gz | docker compose exec -T postgres psql -U mypost mypost

# Restore from S3
./scripts/restore.sh s3://mypost-backups/backup_20260101_020000.sql.gz
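
When restoring from a local directory, the newest dump can be picked by name, because a `%Y%m%d_%H%M%S` timestamp sorts chronologically. A minimal sketch, assuming a hypothetical helper `latest_backup` and that every dump follows the `backup_*.sql.gz` naming pattern:

```shell
# Hypothetical helper: print the path of the newest dump in DIR.
# Usage: latest_backup DIR
latest_backup() (
    latest=
    for f in "$1"/backup_*.sql.gz; do
        # Skip the unexpanded pattern when the directory has no dumps.
        [ -e "$f" ] || continue
        # Globs expand in sorted order, so the last match has the latest timestamp.
        latest=$f
    done
    [ -n "$latest" ] || exit 1
    printf '%s\n' "$latest"
)
```

The result can be fed to the restore pipeline above in place of a hard-coded file name.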

Backup Verification¶

# Test restore to temporary database
docker compose exec postgres createdb -U mypost mypost_test
gunzip -c backup.sql.gz | docker compose exec -T postgres psql -U mypost mypost_test

# Verify data
docker compose exec postgres psql -U mypost mypost_test -c "SELECT COUNT(*) FROM users;"

# Clean up
docker compose exec postgres dropdb -U mypost mypost_test
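
Before spending time on a test restore, a quick integrity check can rule out truncated or corrupt files. The sketch below assumes a plain-format `pg_dump` file compressed with gzip, as produced by the manual backup command; `check_dump` is a hypothetical helper name.

```shell
# Hypothetical helper: fail fast on a damaged or mislabeled dump.
# Usage: check_dump FILE
check_dump() {
    # gzip -t reads the whole archive and fails on truncation or corruption.
    gzip -t "$1" 2>/dev/null || { echo "corrupt archive: $1" >&2; return 1; }
    # Plain-format pg_dump output starts with a "PostgreSQL database dump" comment.
    gunzip -c "$1" | head -n 5 | grep -q 'PostgreSQL database dump' \
        || { echo "not a pg_dump file: $1" >&2; return 1; }
}
```

This check does not replace the test restore; it only catches files that could never restore cleanly.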

Monitoring¶

Health Checks¶

# API health
curl https://mypost.yourdomain.com/api/v1/health

# Expected response
{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "storage": "connected"
}
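
For scripted checks, the response body can be tested field by field. The sketch below uses pattern matching rather than a JSON parser so that it has no dependencies; it assumes the field names and values shown in the expected response, and `health_ok` is a hypothetical helper name.

```shell
# Hypothetical helper: read a health response on stdin and succeed only if
# the overall status is "healthy" and every dependency is "connected".
health_ok() (
    body=$(cat)
    printf '%s\n' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"' || {
        echo "status is not healthy" >&2
        exit 1
    }
    for dep in database redis storage; do
        printf '%s\n' "$body" | grep -Eq "\"$dep\"[[:space:]]*:[[:space:]]*\"connected\"" || {
            echo "$dep is not connected" >&2
            exit 1
        }
    done
)
```

A typical call would be `curl -fsS https://mypost.yourdomain.com/api/v1/health | health_ok`, where `-f` makes curl itself fail on HTTP error codes.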

Metrics Endpoint¶

curl https://mypost.yourdomain.com/api/v1/metrics

Available Metrics:

Metric	Description
http_requests_total	Total HTTP requests
http_request_duration_seconds	Request latency
db_connections_active	Active database connections
redis_connections_active	Active Redis connections
queue_jobs_pending	Pending background jobs
posts_published_total	Total posts published

Prometheus Configuration¶

# prometheus.yml
scrape_configs:
  - job_name: 'mypost-api'
    static_configs:
      - targets: ['api:3000']
    metrics_path: '/api/v1/metrics'

Grafana Dashboard¶

Import the included dashboard:

# Dashboard location
docs/grafana/mypost-dashboard.json

Panels included:

- Request rate and latency
- Error rate by endpoint
- Database connection pool
- Queue depth and processing time
- Post publishing success rate

Alerting¶

Alert Configuration¶

# In .env
ALERT_EMAIL=ops@yourcompany.com
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_KEY=your-key  # Optional

Alert Rules¶

Alert	Condition	Severity
API Down	Health check fails 3x	Critical
High Error Rate	>5% 5xx errors	Warning
Database Connection	Pool exhausted	Critical
Queue Backlog	>1000 pending jobs	Warning
Disk Space	<10% free	Critical
Certificate Expiry	<7 days	Warning
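
The "fails 3x" condition for the API Down alert is read here as three consecutive failures, which is an assumption; the guide does not say how failures are counted. Under that reading, an external check could be sketched as follows. The helper name `failed_consecutively` and the `RETRY_DELAY` variable are hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) only if CMD fails COUNT times in a row.
# Usage: failed_consecutively COUNT CMD [ARGS...]
# RETRY_DELAY sets the pause between attempts in seconds (default 10).
failed_consecutively() (
    count=$1
    shift
    n=0
    while [ "$n" -lt "$count" ]; do
        # A single success means the alert condition is not met.
        if "$@" >/dev/null 2>&1; then exit 1; fi
        n=$((n + 1))
        # Pause between attempts, but not after the last one.
        if [ "$n" -lt "$count" ]; then sleep "${RETRY_DELAY:-10}"; fi
    done
    exit 0
)
```

For example, `failed_consecutively 3 curl -fsS https://mypost.yourdomain.com/api/v1/health && echo "API Down"` would print the message only after three failed requests.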

Test Alerts¶

# Trigger test alert
docker compose exec api npm run alert:test

Log Management¶

View Logs¶

# All services
docker compose logs -f

# Specific service
docker compose logs -f api

# Last 100 lines
docker compose logs --tail 100 api

# Filter errors
docker compose logs api 2>&1 | grep -i error

Log Levels¶

# In .env
LOG_LEVEL=info  # debug, info, warn, error

Log Rotation¶

Add to Docker daemon configuration (/etc/docker/daemon.json):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}

Centralized Logging¶

With Loki:

# Add to compose.yaml
loki:
  image: grafana/loki:latest
  ports:
    - "3100:3100"
  volumes:
    - loki-data:/loki

promtail:
  image: grafana/promtail:latest
  volumes:
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
    - ./promtail.yml:/etc/promtail/config.yml

Maintenance Tasks¶

Database Maintenance¶

# Vacuum and analyze (weekly recommended)
docker compose exec postgres vacuumdb -U mypost --analyze mypost

# Reindex (monthly or after heavy deletes)
docker compose exec postgres reindexdb -U mypost mypost

# Check table sizes
docker compose exec postgres psql -U mypost mypost -c "
SELECT schemaname, tablename, 
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables 
WHERE schemaname = 'public' 
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

Redis Maintenance¶

# Check memory usage
docker compose exec redis redis-cli info memory

# Sample session keys (read-only; Redis removes expired keys on its own)
docker compose exec redis redis-cli --scan --pattern 'session:*' | head -10

# Check slow log
docker compose exec redis redis-cli slowlog get 10

Media Cleanup¶

# Find orphaned media (not referenced by any post)
docker compose exec api npm run media:cleanup -- --dry-run

# Actually delete orphaned media
docker compose exec api npm run media:cleanup

Token Cleanup¶

# Remove expired tokens and sessions
docker compose exec api npm run cleanup:tokens

Scaling Operations¶

Scale API Horizontally¶

# Docker Compose
docker compose up -d --scale api=3

# Docker Swarm
docker service scale mypost_api=3

Scale Workers¶

# Increase workers for high queue volumes
docker compose up -d --scale worker=4

Database Read Replicas¶

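# Add to compose.yaml
# Note: the official postgres image does not act on the two variables below;
# they assume a custom image or entrypoint script that sets up streaming replication.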
postgres-replica:
  image: postgres:16-alpine
  environment:
    - POSTGRES_PRIMARY_HOST=postgres
    - POSTGRES_REPLICA_MODE=true

Security Operations¶

Rotate Secrets¶

# 1. Generate new secrets
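# (tr strips the line break that openssl inserts into long base64 output)
NEW_JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n')
NEW_ENCRYPTION_KEY=$(openssl rand -base64 32)

# 2. Update .env ("|" as the sed delimiter, because base64 output can contain "/")
sed -i "s|^JWT_SECRET=.*|JWT_SECRET=$NEW_JWT_SECRET|" .env
sed -i "s|^ENCRYPTION_KEY=.*|ENCRYPTION_KEY=$NEW_ENCRYPTION_KEY|" .env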

# 3. Re-encrypt tokens (required for ENCRYPTION_KEY change)
docker compose exec api npm run tokens:reencrypt

# 4. Restart services
docker compose down && docker compose up -d
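
Base64 secrets can contain `/`, `+`, and `=`, which makes in-place substitution with sed sensitive to the choice of delimiter. One alternative is to let awk write the value as literal text. The sketch below assumes a simple `KEY=value` file with one entry per line; `set_env_var` is a hypothetical helper name.

```shell
# Hypothetical helper: set KEY=VALUE in an env file, replacing an existing
# entry or appending a new one.
# Usage: set_env_var FILE KEY VALUE
set_env_var() (
    file=$1
    key=$2
    tmp=$(mktemp) || exit 1
    # The value travels through the environment, so awk never interprets it.
    NEW_VALUE=$3 awk -v key="$key" '
        BEGIN { prefix = key "="; found = 0 }
        index($0, prefix) == 1 { print prefix ENVIRON["NEW_VALUE"]; found = 1; next }
        { print }
        END { if (!found) print prefix ENVIRON["NEW_VALUE"] }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file's permissions
    status=$?
    rm -f "$tmp"
    exit "$status"
)
```

With this helper, step 2 becomes `set_env_var .env JWT_SECRET "$NEW_JWT_SECRET"`.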

Security Audit¶

# Check for vulnerabilities in dependencies
docker compose exec api npm audit

# Review recent security events
docker compose exec api npm run audit:security-events -- --days=7

# View failed login attempts
docker compose exec postgres psql -U mypost mypost -c "
SELECT email, ip_address, COUNT(*) as attempts, MAX(created_at) as last_attempt
FROM security_audit_log
WHERE action = 'login_failed' AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY email, ip_address
HAVING COUNT(*) > 3
ORDER BY attempts DESC;"

Incident Response¶

Runbook: API Down¶

Verify the issue

curl -I https://mypost.yourdomain.com/api/v1/health
docker compose ps

Check logs

docker compose logs --tail 100 api

Restart API service

docker compose restart api

If database issue

docker compose exec postgres pg_isready
docker compose restart postgres

Escalate if unresolved after 15 minutes
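
The 15-minute escalation limit can be enforced by polling the health check against a deadline instead of watching the clock by hand. A minimal sketch, assuming a hypothetical helper `wait_healthy`; how the escalation itself is triggered is left open.

```shell
# Hypothetical helper: retry CMD until it succeeds or TIMEOUT seconds pass.
# Usage: wait_healthy TIMEOUT INTERVAL CMD [ARGS...]
# Exits 0 as soon as CMD succeeds, 1 once the deadline is reached.
wait_healthy() (
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then exit 1; fi
        sleep "$interval"
        # Counts sleep time only, so the real deadline is slightly later.
        elapsed=$((elapsed + interval))
    done
)
```

For this runbook, `wait_healthy 900 30 curl -fsS https://mypost.yourdomain.com/api/v1/health || echo "Escalate"` checks every 30 seconds for 15 minutes.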

Runbook: Publishing Failures¶

Check queue status

docker compose exec redis redis-cli llen bull:publish:wait

Review failed jobs

docker compose exec api npm run queue:failed

Check network adapter logs

docker compose logs worker | grep -i "facebook\|instagram\|twitter"

Retry failed jobs

docker compose exec api npm run queue:retry-failed

Runbook: Database Full¶

Check disk usage

docker compose exec postgres df -h /var/lib/postgresql/data

Identify large tables

docker compose exec postgres psql -U mypost mypost -c "
SELECT tablename, pg_size_pretty(pg_total_relation_size(tablename::regclass))
FROM pg_tables WHERE schemaname='public' ORDER BY pg_total_relation_size(tablename::regclass) DESC LIMIT 5;"

Clean up audit logs (if safe)

docker compose exec postgres psql -U mypost mypost -c "
DELETE FROM audit_events WHERE created_at < NOW() - INTERVAL '90 days';"

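Vacuum to reclaim space (VACUUM FULL locks each table exclusively while rewriting it and needs enough free disk for the rewritten copy)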

docker compose exec postgres vacuumdb -U mypost --full mypost
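
The disk check in the first step can be turned into a number that is comparable with the 10% free-space alert threshold. A minimal sketch, assuming a hypothetical helper `disk_free_pct` and a `df` that supports the POSIX `-P` output format:

```shell
# Hypothetical helper: print the free space of the filesystem holding PATH,
# as a whole-number percentage.
# Usage: disk_free_pct PATH
disk_free_pct() (
    # df -P prints one line per filesystem; column 5 is the used percentage.
    used=$(df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    [ -n "$used" ] || exit 1
    echo $((100 - used))
)
```

To measure the database volume, the function has to run where that volume is mounted, for example inside the postgres container with `/var/lib/postgresql/data` as the path.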

Disaster Recovery¶

Recovery Time Objectives¶

Scenario	RTO	RPO
Service restart	5 min	0
Container rebuild	15 min	0
Database restore	1 hour	15 min (with WAL)
Full disaster recovery	4 hours	24 hours (daily backup)
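
The 24-hour RPO for full disaster recovery only holds if the newest backup really is less than a day old. A freshness check could be sketched as follows; `backup_within_rpo` is a hypothetical helper name, and the check relies on the file's modification time, which assumes the file has not been touched since it was written.

```shell
# Hypothetical helper: succeed only if FILE was modified within the last HOURS hours.
# Usage: backup_within_rpo FILE HOURS
backup_within_rpo() {
    # find prints the path only if it changed less than N minutes ago;
    # a missing file produces no output, which also counts as a failure.
    [ -n "$(find "$1" -mmin "-$(( $2 * 60 ))" 2>/dev/null)" ]
}
```

For example, `backup_within_rpo backup_20260101_020000.sql.gz 24 || echo "RPO at risk"`.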

DR Procedure¶

Provision new server
Install Docker and clone repo
Restore .env file (from secure backup)
Restore database from latest backup
Restore media from S3
Update DNS to point to new server
Verify all services

Maintenance Windows¶

Recommended Schedule¶

Task	Frequency	Window
Security updates	Weekly	Sunday 2-4 AM
Database vacuum	Weekly	Sunday 3 AM
Log rotation	Daily	Automatic
Backup verification	Monthly	1st Sunday
Secret rotation	Quarterly	Planned
Major upgrades	As needed	Planned

Announcing Maintenance¶

# Set maintenance mode
docker compose exec api npm run maintenance:enable -- --message "Scheduled maintenance in progress"

# Disable maintenance mode
docker compose exec api npm run maintenance:disable
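
A maintenance task that fails halfway can leave the site stuck in maintenance mode. Wrapping the task so that the disable command always runs avoids that. The sketch below assumes a hypothetical wrapper `with_maintenance`; the `--` separator tells npm to pass `--message` to the script rather than read it as an npm option.

```shell
# Hypothetical wrapper: run a command with maintenance mode on, then always
# turn maintenance mode off again.
# Usage: with_maintenance MESSAGE CMD [ARGS...]
with_maintenance() (
    message=$1
    shift
    # The body runs in a subshell, so this EXIT trap fires when the function
    # ends, whether the wrapped command succeeded or failed.
    trap 'docker compose exec -T api npm run maintenance:disable' EXIT
    docker compose exec -T api npm run maintenance:enable -- --message "$message" || exit 1
    "$@"
)
```

For example, `with_maintenance "Scheduled maintenance in progress" docker compose exec postgres vacuumdb -U mypost --analyze mypost` returns the exit status of the vacuum command.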

Support Contacts¶

Issue	Contact	Escalation
Application bugs	GitHub Issues	—
Security issues	security@yourcompany.com	Immediate
Infrastructure	ops@yourcompany.com	On-call