Operations Guide

This guide covers backup, monitoring, maintenance, and incident response for MyPost.


Backup & Recovery

Automated Backups

Backups are enabled by default and run daily at 2 AM.

Configuration:

BACKUP_ENABLED=true
BACKUP_SCHEDULE=0 2 * * *       # Daily at 2 AM
BACKUP_RETENTION_DAYS=30
S3_BACKUP_BUCKET=mypost-backups

Manual Backup

# Database backup (-T disables the pseudo-TTY so the piped dump is not altered)
docker compose exec -T postgres pg_dump -U mypost mypost | gzip > backup_$(date +%Y%m%d_%H%M%S).sql.gz

# Full backup (database + media)
./scripts/backup.sh full

# Database only
./scripts/backup.sh database
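
A dump that is cut off mid-write still leaves a file behind, so it can help to check the archive before relying on it. The sketch below assumes a hypothetical helper named `check_backup_file`, which is not part of MyPost's scripts. It only confirms that the file is non-empty and is a valid gzip stream; it does not prove that `pg_dump` itself succeeded.

```shell
#!/usr/bin/env bash
# check_backup_file is a hypothetical helper, not part of MyPost's scripts.
# It returns 0 only if the file exists, is non-empty, and passes gzip's
# built-in integrity test.
check_backup_file() {
  local file="$1"
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
  echo "ok: $file"
}
```

Example: `check_backup_file backup_20260101_020000.sql.gz`. A full restore test, as described under Backup Verification, remains the stronger check.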

Restore from Backup

# List available backups
docker compose exec api npm run backup:list

# Restore database
gunzip -c backup_20260101_020000.sql.gz | docker compose exec -T postgres psql -U mypost mypost

# Restore from S3
./scripts/restore.sh s3://mypost-backups/backup_20260101_020000.sql.gz
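
When several dumps are available, the newest one can be picked by name, because the `backup_YYYYMMDD_HHMMSS.sql.gz` pattern sorts chronologically as plain text. A minimal sketch, assuming a hypothetical helper named `latest_backup` that reads file names on stdin:

```shell
#!/usr/bin/env bash
# latest_backup is a hypothetical helper. It reads file names (one per line)
# on stdin, ignores anything that does not match the backup naming pattern,
# and prints the newest remaining name. It prints nothing if none match.
latest_backup() {
  grep -E '^backup_[0-9]{8}_[0-9]{6}\.sql\.gz$' | LC_ALL=C sort | tail -n 1
}
```

Example: `ls | latest_backup` in the directory that holds the dumps. The result can then be passed to the restore command shown above.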

Backup Verification

# Test restore to temporary database
docker compose exec postgres createdb -U mypost mypost_test
gunzip -c backup.sql.gz | docker compose exec -T postgres psql -U mypost mypost_test

# Verify data
docker compose exec postgres psql -U mypost mypost_test -c "SELECT COUNT(*) FROM users;"

# Clean up
docker compose exec postgres dropdb -U mypost mypost_test

Monitoring

Health Checks

# API health
curl https://mypost.yourdomain.com/api/v1/health

# Expected response
{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "storage": "connected"
}
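
For scripted checks, the response can be compared against the expected fields rather than read by eye. The sketch below assumes a hypothetical helper named `is_healthy`. It uses `grep` to stay dependency-free, which is only adequate for a flat response like the one shown; `jq` would be more robust where it is installed.

```shell
#!/usr/bin/env bash
# is_healthy is a hypothetical helper. It reads the health JSON on stdin and
# succeeds only if each field from the expected response has the expected
# value. The first mismatching field is reported on stderr.
is_healthy() {
  local body pair key value
  body=$(cat)
  for pair in 'status:healthy' 'database:connected' 'redis:connected' 'storage:connected'; do
    key="${pair%%:*}"
    value="${pair##*:}"
    printf '%s' "$body" | grep -Eq "\"$key\"[[:space:]]*:[[:space:]]*\"$value\"" || {
      echo "unhealthy: $key" >&2
      return 1
    }
  done
}
```

Example: `curl -fsS https://mypost.yourdomain.com/api/v1/health | is_healthy && echo OK`.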

Metrics Endpoint

curl https://mypost.yourdomain.com/api/v1/metrics

Available Metrics:

| Metric | Description |
|--------|-------------|
| http_requests_total | Total HTTP requests |
| http_request_duration_seconds | Request latency |
| db_connections_active | Active database connections |
| redis_connections_active | Active Redis connections |
| queue_jobs_pending | Pending background jobs |
| posts_published_total | Total posts published |

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'mypost-api'
    static_configs:
      - targets: ['api:3000']
    metrics_path: '/api/v1/metrics'

Grafana Dashboard

Import the included dashboard:

# Dashboard location
docs/grafana/mypost-dashboard.json

Panels included:

- Request rate and latency
- Error rate by endpoint
- Database connection pool
- Queue depth and processing time
- Post publishing success rate


Alerting

Alert Configuration

# In .env
ALERT_EMAIL=ops@yourcompany.com
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_KEY=your-key  # Optional

Alert Rules

| Alert | Condition | Severity |
|-------|-----------|----------|
| API Down | Health check fails 3x | Critical |
| High Error Rate | >5% 5xx errors | Warning |
| Database Connection | Pool exhausted | Critical |
| Queue Backlog | >1000 pending jobs | Warning |
| Disk Space | <10% free | Critical |
| Certificate Expiry | <7 days | Warning |
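
As a worked example of the High Error Rate rule, the threshold comparison can be done with integer arithmetic. This is a sketch only: `error_rate_alert` is a hypothetical helper, and how the request and 5xx counts are gathered (for example from `http_requests_total`) is left open because the metric's labels are not documented here.

```shell
#!/usr/bin/env bash
# error_rate_alert is a hypothetical helper. Given total requests and 5xx
# responses counted over the same window, it prints "warning" when the share
# of 5xx responses is above the threshold (default 5 percent), else "ok".
# Integer math avoids a dependency on bc: errors/total > t/100 is the same
# comparison as errors*100 > t*total.
error_rate_alert() {
  local total="$1" errors="$2" threshold="${3:-5}"
  if [ "$total" -gt 0 ] && [ $((errors * 100)) -gt $((threshold * total)) ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

With 1000 requests, 50 errors is exactly 5% and stays "ok", while 51 errors crosses the threshold.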

Test Alerts

# Trigger test alert
docker compose exec api npm run alert:test

Log Management

View Logs

# All services
docker compose logs -f

# Specific service
docker compose logs -f api

# Last 100 lines
docker compose logs --tail 100 api

# Filter errors
docker compose logs api 2>&1 | grep -i error

Log Levels

# In .env
LOG_LEVEL=info  # debug, info, warn, error

Log Rotation

Add to Docker daemon configuration (/etc/docker/daemon.json):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}

Centralized Logging

With Loki:

# Add to compose.yaml
loki:
  image: grafana/loki:latest
  ports:
    - "3100:3100"
  volumes:
    - loki-data:/loki

promtail:
  image: grafana/promtail:latest
  volumes:
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
    - ./promtail.yml:/etc/promtail/config.yml


Maintenance Tasks

Database Maintenance

# Vacuum and analyze (weekly recommended)
docker compose exec postgres vacuumdb -U mypost --analyze mypost

# Reindex (monthly or after heavy deletes)
docker compose exec postgres reindexdb -U mypost mypost

# Check table sizes
docker compose exec postgres psql -U mypost mypost -c "
SELECT schemaname, tablename, 
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables 
WHERE schemaname = 'public' 
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

Redis Maintenance

# Check memory usage
docker compose exec redis redis-cli info memory

# List a sample of session keys (read-only; Redis removes expired keys on its own)
docker compose exec redis redis-cli --scan --pattern 'session:*' | head -10

# Check slow log
docker compose exec redis redis-cli slowlog get 10

Media Cleanup

# Find orphaned media (not referenced by any post)
# The extra "--" makes npm pass --dry-run to the script instead of consuming it
docker compose exec api npm run media:cleanup -- --dry-run

# Actually delete orphaned media
docker compose exec api npm run media:cleanup

Token Cleanup

# Remove expired tokens and sessions
docker compose exec api npm run cleanup:tokens

Scaling Operations

Scale API Horizontally

# Docker Compose
docker compose up -d --scale api=3

# Docker Swarm
docker service scale mypost_api=3

Scale Workers

# Increase workers for high queue volumes
docker compose up -d --scale worker=4

Database Read Replicas

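# Add to compose.yaml
# Note: the official postgres image does not act on POSTGRES_PRIMARY_HOST or
# POSTGRES_REPLICA_MODE. This fragment assumes a custom image or entrypoint
# that sets up streaming replication from these variables.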
postgres-replica:
  image: postgres:16-alpine
  environment:
    - POSTGRES_PRIMARY_HOST=postgres
    - POSTGRES_REPLICA_MODE=true

Security Operations

Rotate Secrets

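# 1. Generate new secrets (tr removes the line wrap openssl adds to long base64 output)
NEW_JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n')
NEW_ENCRYPTION_KEY=$(openssl rand -base64 32 | tr -d '\n')

# 2. Update .env ("|" is the sed delimiter because base64 output can contain "/")
sed -i "s|^JWT_SECRET=.*|JWT_SECRET=$NEW_JWT_SECRET|" .env
sed -i "s|^ENCRYPTION_KEY=.*|ENCRYPTION_KEY=$NEW_ENCRYPTION_KEY|" .env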

# 3. Re-encrypt tokens (required for ENCRYPTION_KEY change)
docker compose exec api npm run tokens:reencrypt

# 4. Restart services
docker compose down && docker compose up -d
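
Editing .env with a substitution pattern is sensitive to the characters in the secret. One alternative is a small helper that rewrites the line without treating the value as a pattern. The sketch below assumes a hypothetical function named `set_env_value`; it is not part of MyPost and does no quoting or escaping of the value beyond writing it verbatim.

```shell
#!/usr/bin/env bash
# set_env_value is a hypothetical helper. It replaces KEY=... in an env file,
# or appends the key if it is absent. The value is handed to awk through the
# environment, so characters such as / + = & are written as-is.
set_env_value() {
  local key="$1" value="$2" file="$3" tmp status
  tmp=$(mktemp) || return 1
  KEY="$key" VALUE="$value" awk '
    BEGIN { key = ENVIRON["KEY"]; value = ENVIRON["VALUE"]; done = 0 }
    index($0, key "=") == 1 { print key "=" value; done = 1; next }
    { print }
    END { if (!done) print key "=" value }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Example: `set_env_value JWT_SECRET "$NEW_JWT_SECRET" .env`. Writing back with `cat` rather than `mv` keeps the existing file's permissions and ownership.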

Security Audit

# Check for vulnerabilities in dependencies
docker compose exec api npm audit

# Review recent security events
docker compose exec api npm run audit:security-events -- --days=7

Failed Login Review

# View failed login attempts
docker compose exec postgres psql -U mypost mypost -c "
SELECT email, ip_address, COUNT(*) as attempts, MAX(created_at) as last_attempt
FROM security_audit_log
WHERE action = 'login_failed' AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY email, ip_address
HAVING COUNT(*) > 3
ORDER BY attempts DESC;"

Incident Response

Runbook: API Down

  1. Verify the issue

    curl -I https://mypost.yourdomain.com/api/v1/health
    docker compose ps
    

  2. Check logs

    docker compose logs --tail 100 api
    

  3. Restart API service

    docker compose restart api
    

  4. If database issue

    docker compose exec postgres pg_isready
    docker compose restart postgres
    

  5. Escalate if unresolved after 15 minutes
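
The 15-minute escalation deadline in step 5 can be enforced with a polling loop instead of a manual timer. A minimal sketch, assuming a hypothetical helper named `wait_until` that retries any check command until it passes or the time runs out:

```shell
#!/usr/bin/env bash
# wait_until is a hypothetical helper. It re-runs a check command every
# INTERVAL seconds until the command succeeds (returns 0) or TIMEOUT seconds
# have elapsed (returns 1).
# Usage: wait_until TIMEOUT INTERVAL command [args...]
wait_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while true; do
    "$@" && return 0
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}
```

Example, after restarting the API: `wait_until 900 15 curl -fsS -o /dev/null https://mypost.yourdomain.com/api/v1/health || echo "Escalate: API still down after 15 minutes"`.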

Runbook: Publishing Failures

  1. Check queue status

    docker compose exec redis redis-cli llen bull:publish:wait
    

  2. Review failed jobs

    docker compose exec api npm run queue:failed
    

  3. Check network adapter logs

    docker compose logs worker | grep -i "facebook\|instagram\|twitter"
    

  4. Retry failed jobs

    docker compose exec api npm run queue:retry-failed
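
To see which network accounts for most failures before retrying, the log filter from step 3 can be turned into a per-network count. This is a sketch under an assumed log format (lines that contain the word "error" and the network name); `count_network_errors` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# count_network_errors is a hypothetical helper. It reads log lines on stdin,
# keeps those mentioning "error" (case-insensitive), and prints how many of
# them mention each network. Adjust the patterns to match what the worker
# actually writes.
count_network_errors() {
  local errors network
  errors=$(grep -i 'error' || true)
  for network in facebook instagram twitter; do
    printf '%s %s\n' "$network" "$(printf '%s\n' "$errors" | grep -ci "$network")"
  done
}
```

Example: `docker compose logs --no-color worker | count_network_errors`.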
    

Runbook: Database Full

  1. Check disk usage

    docker compose exec postgres df -h /var/lib/postgresql/data
    

  2. Identify large tables

    docker compose exec postgres psql -U mypost mypost -c "
    SELECT tablename, pg_size_pretty(pg_total_relation_size(tablename::regclass))
    FROM pg_tables WHERE schemaname='public' ORDER BY pg_total_relation_size(tablename::regclass) DESC LIMIT 5;"
    

  3. Clean up audit logs (if safe)

    docker compose exec postgres psql -U mypost mypost -c "
    DELETE FROM audit_events WHERE created_at < NOW() - INTERVAL '90 days';"
    

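  4. Vacuum to reclaim space. The --full option rewrites each table under an exclusive lock and temporarily needs disk space for the new copy, so free some space first and expect the tables to be unavailable while it runs.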

    docker compose exec postgres vacuumdb -U mypost --full mypost
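
The disk check in step 1 can be reduced to a single number and compared with the Disk Space alert rule (below 10% free is critical). A minimal sketch; both function names are hypothetical, and the parsing assumes POSIX-format output from `df -P`.

```shell
#!/usr/bin/env bash
# free_percent reads `df -P <path>` output on stdin and prints the free space
# as a whole percentage. In POSIX df output, line 2 holds the data and
# column 5 is the used percentage (for example "93%").
free_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# disk_severity applies the alert threshold: below 10 percent free is
# "critical", anything else is "ok".
disk_severity() {
  if [ "$1" -lt 10 ]; then echo "critical"; else echo "ok"; fi
}
```

Example: `disk_severity "$(docker compose exec -T postgres df -P /var/lib/postgresql/data | free_percent)"`.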
    


Disaster Recovery

Recovery Time Objectives

| Scenario | RTO | RPO |
|----------|-----|-----|
| Service restart | 5 min | 0 |
| Container rebuild | 15 min | 0 |
| Database restore | 1 hour | 15 min (with WAL) |
| Full disaster recovery | 4 hours | 24 hours (daily backup) |
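
An RPO only holds if the newest backup is actually that recent. The check is simple arithmetic on timestamps, sketched below with a hypothetical helper named `rpo_met`.

```shell
#!/usr/bin/env bash
# rpo_met is a hypothetical helper. It succeeds when the newest backup is no
# older than the RPO. Arguments: backup time and current time as Unix epoch
# seconds, then the RPO in hours (24 for the daily-backup scenario).
rpo_met() {
  local backup_epoch="$1" now_epoch="$2" rpo_hours="$3" age
  age=$(( now_epoch - backup_epoch ))
  [ "$age" -le $(( rpo_hours * 3600 )) ]
}
```

Example, assuming GNU `stat` for the file's modification time: `rpo_met "$(stat -c %Y backup.sql.gz)" "$(date +%s)" 24 || echo "RPO exceeded"`.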

DR Procedure

  1. Provision new server
  2. Install Docker and clone repo
  3. Restore .env file (from secure backup)
  4. Restore database from latest backup
  5. Restore media from S3
  6. Update DNS to point to new server
  7. Verify all services

Maintenance Windows

| Task | Frequency | Window |
|------|-----------|--------|
| Security updates | Weekly | Sunday 2-4 AM |
| Database vacuum | Weekly | Sunday 3 AM |
| Log rotation | Daily | Automatic |
| Backup verification | Monthly | 1st Sunday |
| Secret rotation | Quarterly | Planned |
| Major upgrades | As needed | Planned |

Announcing Maintenance

# Set maintenance mode
docker compose exec api npm run maintenance:enable -- --message "Scheduled maintenance in progress"

# Disable maintenance mode
docker compose exec api npm run maintenance:disable

Support Contacts

| Issue | Contact | Escalation |
|-------|---------|------------|
| Application bugs | GitHub Issues | - |
| Security issues | security@yourcompany.com | Immediate |
| Infrastructure | ops@yourcompany.com | On-call |