
Deployment and Infrastructure as Code

Section Overview

This section covers modern deployment practices that use containers, Infrastructure as Code, and automated pipelines to deliver applications consistently, reliably, and at scale.


Container-based Development and Deployment

Dockerfile Best Practices

Core Principle: Dockerfiles should be optimized for security, efficiency, and maintainability while ensuring consistent and reproducible builds.

Key Guidelines:

  • Use specific base image versions with SHA digests
  • Minimize layer count and optimize image size
  • Follow the principle of least privilege
  • Implement proper health checks
  • Leverage multi-stage builds

Why This Matters

Well-structured Dockerfiles ensure consistent environments across all deployment stages, reduce security vulnerabilities, and optimize both build times and runtime performance. They form the foundation of reliable containerized applications.
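
Some of the guidelines above can be checked mechanically. The following is a minimal sketch, not a replacement for a real linter: the function name `check_dockerfile` and its three rules are assumptions made for illustration, and the digest rule would wrongly flag a `FROM` line that refers to an earlier build stage by name.

```python
def check_dockerfile(text: str) -> list[str]:
    """Return guideline violations found in the text of a Dockerfile."""
    # Keep non-empty, non-comment lines only
    lines = [ln.strip() for ln in text.splitlines()]
    instructions = [ln for ln in lines if ln and not ln.startswith("#")]

    problems = []

    for ln in instructions:
        parts = ln.split()
        if parts[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=... to find the image reference
        image = next((p for p in parts[1:] if not p.startswith("--")), "")
        if "@sha256:" not in image:
            problems.append(f"base image not pinned by digest: {image}")

    users = [ln.split()[1] for ln in instructions
             if ln.upper().startswith("USER ") and len(ln.split()) > 1]
    if not users:
        problems.append("no USER instruction, container runs as root")
    elif users[-1] in ("root", "0"):
        problems.append("final USER is root")

    if not any(ln.upper().startswith("HEALTHCHECK") for ln in instructions):
        problems.append("no HEALTHCHECK instruction")

    return problems


sample = """\
FROM python:3.11-slim
RUN useradd -r appuser
USER appuser
"""
for problem in check_dockerfile(sample):
    print(problem)
```

For the sample, the sketch reports the unpinned base image and the missing health check. The non-root `USER` passes.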


Layer Optimization Strategy

Implementation:

# Multiple layers - inefficient: each RUN adds a layer, and the final rm
# cannot shrink the layers created before it
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2
RUN rm -rf /var/lib/apt/lists/*

# Single optimized layer
RUN apt-get update && apt-get install -y \
    package1 \
    package2 \
    && rm -rf /var/lib/apt/lists/*

# With build arguments and validation
ARG DEBIAN_FRONTEND=noninteractive
# Versions must be declared as build arguments and passed with --build-arg
ARG PACKAGE1_VERSION
ARG PACKAGE2_VERSION
RUN apt-get update && apt-get install -y \
    package1=${PACKAGE1_VERSION} \
    package2=${PACKAGE2_VERSION} \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && package1 --version \
    && package2 --version

Security Considerations

Critical Security Practices

  • Run containers as non-root users
  • Remove unnecessary tools and packages
  • Scan images for vulnerabilities regularly
  • Use multi-stage builds to minimize attack surface
  • Never include secrets in image layers

Example: Secure Container User Setup

# Create non-root user
RUN useradd -r -u 1000 appuser

WORKDIR /app
COPY --from=builder /app .

# Set user before execution
USER appuser

Complete Python Application Example

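# Build stage
FROM python:3.11-slim-bullseye@sha256:abc123... AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies into a virtual environment so the final
# stage can copy them as a single directory
COPY requirements.txt .
RUN python -m venv /opt/venv \
    && /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

# Copy and test application (assumes pytest is listed in requirements.txt)
COPY . .
RUN /opt/venv/bin/python -m pytest

# Final stage - minimal runtime image
FROM python:3.11-slim-bullseye@sha256:abc123...

# Build metadata, declared in the final stage so the labels end up on the
# shipped image rather than on the discarded builder
ARG APP_VERSION
ARG BUILD_DATE
ARG VCS_REF

LABEL org.opencontainers.image.version="${APP_VERSION}" \
      org.opencontainers.image.created="${BUILD_DATE}" \
      org.opencontainers.image.revision="${VCS_REF}"

# Create non-root user
RUN useradd -r -u 1000 appuser

WORKDIR /app
COPY --from=builder /opt/venv /opt/venv
COPY --from=builder /app .

# Put the virtual environment first on PATH
ENV PATH="/opt/venv/bin:$PATH"

USER appuser

# Health check; uses Python because curl is not installed in the slim image
HEALTHCHECK --interval=30s --timeout=3s \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)" || exit 1

ENTRYPOINT ["python"]
CMD ["app.py"]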

Node.js Application with Build Secrets

# Build stage
FROM node:18-alpine@sha256:def456... AS builder

WORKDIR /app

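# Install dependencies. The npm token is mounted as a build secret and is
# configured, used, and removed within a single RUN, so it is never written
# to an image layer
COPY package*.json ./
RUN --mount=type=secret,id=npm_token \
    npm config set //registry.npmjs.org/:_authToken="$(cat /run/secrets/npm_token)" \
    && npm ci \
    && npm config delete //registry.npmjs.org/:_authToken

# Build application (needs devDependencies), then drop them so that only
# production dependencies are copied to the final stage
COPY . .
RUN npm run build && npm prune --omit=dev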

# Final stage
FROM node:18-alpine@sha256:def456...

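# Create non-root user (UID/GID 1000 is already taken by the built-in
# "node" user in the official Node.js images, so use 1001)
RUN addgroup -g 1001 appgroup && \
    adduser -u 1001 -G appgroup -s /bin/sh -D appuser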

WORKDIR /app

# Copy built artifacts
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules

USER appuser

# Health check
HEALTHCHECK --interval=30s --timeout=3s \
    CMD wget -q --spider http://localhost:3000/health || exit 1

CMD ["node", "dist/server.js"]

.dockerignore Best Practices

Optimize Build Context

A well-configured .dockerignore file keeps unneeded files out of the build context, which speeds up builds and prevents files such as .env or .git from being copied into images.

# Dependencies
node_modules
vendor

# Development files
*.log
npm-debug.log
.env
.env.local

# Version control
.git
.gitignore

# Documentation
*.md
docs/

# Build artifacts
dist
build
*.tar.gz

# IDE files
.vscode
.idea
*.swp

Container Image Management

Image Tagging Strategy

Core Principle: Container images must be versioned, secured, and managed systematically to ensure reliable and traceable deployments.

Recommended Tagging Format:

| Tag Type         | Format                   | Example        | Use Case             |
|------------------|--------------------------|----------------|----------------------|
| Semantic Version | v{major}.{minor}.{patch} | v1.2.3         | Production releases  |
| Git Commit       | {short-sha}              | a1b2c3d        | Development tracking |
| Build Number     | v{version}-b{build}      | v1.2.3-b456    | CI/CD integration    |
| Environment      | {version}-{env}          | v1.2.3-staging | Environment-specific |
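
The tagging formats above can be generated from a few inputs. This is a small sketch under stated assumptions: the function name `image_tags`, its parameters, and the seven-character short SHA are illustrative choices, not part of any registry or CI tool.

```python
import re

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def image_tags(version: str, git_sha: str, build_number: int, env: str) -> dict[str, str]:
    """Build one tag per row of the tagging table."""
    if not SEMVER.match(version):
        raise ValueError(f"not a major.minor.patch version: {version!r}")
    return {
        "semantic": f"v{version}",
        "commit": git_sha[:7],  # short SHA, assumed to be 7 characters
        "build": f"v{version}-b{build_number}",
        "environment": f"v{version}-{env}",
    }


tags = image_tags("1.2.3", "a1b2c3d4e5f6", 456, "staging")
print(tags)
# {'semantic': 'v1.2.3', 'commit': 'a1b2c3d', 'build': 'v1.2.3-b456', 'environment': 'v1.2.3-staging'}
```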

Registry Management Workflow

# Build image
docker build -t myapp:latest .

# Tag for registry
docker tag myapp:latest \
  company-registry.com/team/myapp:1.2.3

docker tag myapp:latest \
  company-registry.com/team/myapp:latest

# Authenticate; --password-stdin reads the password from standard input
# (REGISTRY_PASSWORD is an assumed variable name)
echo "$REGISTRY_PASSWORD" | docker login company-registry.com \
  --username "$REGISTRY_USER" \
  --password-stdin

# Push images
docker push company-registry.com/team/myapp:1.2.3
docker push company-registry.com/team/myapp:latest

# Generate signing keys
cosign generate-key-pair

# Sign image
cosign sign --key cosign.key \
  company-registry.com/team/myapp:1.2.3

# Verify signature
cosign verify --key cosign.pub \
  company-registry.com/team/myapp:1.2.3

Security Scanning

Mandatory Security Checks

All images must be scanned for vulnerabilities before deployment. Critical and high-severity issues must be resolved.
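
A deployment gate for this rule can be sketched as a function over a scanner's JSON report. The sketch assumes the report layout produced by `trivy image --format json` (a top-level `Results` list whose entries carry `Target` and an optional `Vulnerabilities` list with `VulnerabilityID` and `Severity`). The function name and the sample data are made up for illustration.

```python
def blocking_vulnerabilities(report: dict, blocked=("CRITICAL", "HIGH")) -> list[str]:
    """List findings whose severity should block a deployment."""
    found = []
    for result in report.get("Results", []):
        # "Vulnerabilities" may be absent or null when a target is clean
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocked:
                found.append(
                    f'{vuln.get("VulnerabilityID")} ({vuln.get("Severity")}) '
                    f'in {result.get("Target")}'
                )
    return found


# Hand-written sample report, not real scanner output
sample_report = {
    "Results": [
        {
            "Target": "app (debian 11)",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
            ],
        },
        {"Target": "requirements.txt"},
    ]
}

blockers = blocking_vulnerabilities(sample_report)
if blockers:
    print("Deployment blocked:")
    for item in blockers:
        print(" ", item)
```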

Trivy Configuration Example:

# trivy.yaml
# Trivy reads options as top-level keys; there is no "trivy:" root key
severity:
  - CRITICAL
  - HIGH
format: table
output: scan-results.txt
vulnerability:
  type:
    - os
    - library
  ignore-unfixed: true

CI Pipeline Integration:

scan-image:
  script:
    # --exit-code 1 makes Trivy return a non-zero status when matching
    # vulnerabilities are found, which fails the job
    - |
      trivy image --exit-code 1 --severity HIGH,CRITICAL \
        company-registry.com/team/app:${CI_COMMIT_SHA}

Registry Cleanup Policies

cleanup:
  policies:
    - name: keep-latest-versions
      rules:
        - type: tag
          pattern: '^v\d+\.\d+\.\d+$'
          action: keep
          amount: 5

    - name: cleanup-feature-branches
      rules:
        - type: tag
          pattern: '^feature-.*$'
          action: delete
          older-than: 7d

    - name: cleanup-development-tags
      rules:
        - type: tag
          pattern: '^dev-.*$'
          action: delete
          older-than: 3d
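
To make the intent of these policies concrete, here is a sketch that evaluates them against a set of tags. It assumes that `amount: 5` means "keep the five highest release versions and delete older releases", which the configuration itself does not spell out. The function name and the input shape (tag name mapped to push time) are illustrative.

```python
import re
from datetime import datetime, timedelta, timezone

RELEASE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def tags_to_delete(tags: dict[str, datetime], now: datetime) -> set[str]:
    """Apply the three cleanup policies; `tags` maps tag name to push time."""
    delete = set()

    # keep-latest-versions: sort releases by numeric version, keep the top 5
    releases = sorted(
        (t for t in tags if RELEASE.match(t)),
        key=lambda t: tuple(int(part) for part in RELEASE.match(t).groups()),
        reverse=True,
    )
    delete.update(releases[5:])

    # cleanup-feature-branches (7 days) and cleanup-development-tags (3 days)
    for tag, pushed in tags.items():
        age = now - pushed
        if tag.startswith("feature-") and age > timedelta(days=7):
            delete.add(tag)
        elif tag.startswith("dev-") and age > timedelta(days=3):
            delete.add(tag)

    return delete


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
example = {f"v1.0.{i}": now - timedelta(days=30 - i) for i in range(7)}
example["feature-login"] = now - timedelta(days=8)
example["feature-search"] = now - timedelta(days=2)
example["dev-abc"] = now - timedelta(days=4)
print(sorted(tags_to_delete(example, now)))
# ['dev-abc', 'feature-login', 'v1.0.0', 'v1.0.1']
```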

Multi-stage Build Optimization

Complex Build Pipeline Example

# Stage 1: Dependencies
FROM node:18-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm \
    npm ci

# Stage 2: Frontend build
FROM node:18-alpine AS frontend-builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY frontend/ .
RUN npm run build

# Stage 3: Backend build
FROM maven:3.8-openjdk-17 AS backend-builder
WORKDIR /app
COPY backend/pom.xml .
RUN --mount=type=cache,target=/root/.m2 \
    mvn dependency:go-offline
COPY backend/ .
RUN mvn package -DskipTests

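# Stage 4: Security scan
# BuildKit skips stages that the final image does not depend on, so run this
# stage explicitly, e.g. docker build --target security-scan .
FROM aquasec/trivy:latest AS security-scan
COPY --from=backend-builder /app/target/*.jar /app/
RUN trivy filesystem --exit-code 1 \
    --severity HIGH,CRITICAL /app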

# Stage 5: Final runtime image
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app

# Copy built artifacts
COPY --from=frontend-builder /app/dist /app/public
COPY --from=backend-builder /app/target/*.jar app.jar

# Non-root user
RUN addgroup -g 1000 appgroup && \
    adduser -u 1000 -G appgroup -s /bin/sh -D appuser
USER appuser

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=3s \
    CMD wget -q --spider http://localhost:8080/actuator/health || exit 1

ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-jar", "app.jar"]

Build Cache Optimization

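Multi-stage builds combined with BuildKit cache mounts (RUN --mount=type=cache) reuse downloaded dependencies across builds. Reductions of 50-70% in build time are possible when dependency installation dominates the build, but the actual gain depends on the project.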


Docker Compose for Development

Development Environment Setup

Core Principle: Docker Compose should provide a consistent, reproducible local development environment that mirrors production while optimizing for developer experience.

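# docker-compose.yml
# The top-level "version" key is obsolete in the Compose Specification and is
# omitted; the depends_on conditions below require the Specification format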

x-logging: &default-logging
  options:
    max-size: "10m"
    max-file: "3"
  driver: json-file

services:
  app:
    build:
      context: .
      target: development
      args:
        - NODE_ENV=development
    volumes:
      - .:/app:delegated
      - node_modules:/app/node_modules
    ports:
      - "${PORT:-3000}:3000"
      - "9229:9229"  # Debug port
    environment:
      - NODE_ENV=development
      - DATABASE_URL=postgresql://postgres:password@db:5432/myapp
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    logging: *default-logging

  db:
    image: postgres:14-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=myapp
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    logging: *default-logging

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    logging: *default-logging

volumes:
  node_modules:
  postgres_data:
  redis_data:

networks:
  default:
    driver: bridge

Development Overrides

# docker-compose.override.yml
services:
  app:
    command: npm run dev
    environment:
      - DEBUG=app:*

  # Expose ports for local debugging tools
  db:
    ports:
      - "5432:5432"

  redis:
    ports:
      - "6379:6379"

Container Orchestration with Kubernetes

Service Deployment

Core Principle: Container orchestration should automate deployment, scaling, and management of containerized applications while ensuring high availability and optimal resource utilization.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
    environment: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
    spec:
      securityContext:
        runAsNonRoot: true
        fsGroup: 2000
      containers:
        - name: myapp
          image: company-registry.com/myapp:1.2.3
          ports:
            - containerPort: 3000
              name: http
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
          startupProbe:
            httpGet:
              path: /health
              port: http
            failureThreshold: 30
            periodSeconds: 10
          env:
            - name: NODE_ENV
              value: "production"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: myapp-secrets
                  key: database-url
          volumeMounts:
            - name: config
              mountPath: /etc/config
              readOnly: true
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: config
          configMap:
            name: myapp-config
        - name: tmp
          emptyDir: {}

Service and Ingress Configuration

apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: myapp
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - myapp.example.com
      secretName: myapp-tls
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  name: http

Resource Management

Always define resource requests and limits to ensure fair resource allocation and prevent resource exhaustion on shared clusters.
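
This rule is easy to enforce before a manifest reaches the cluster. The sketch below works on a Deployment that has already been parsed into a dictionary (for example with a YAML library). The function name and the sample manifest are illustrative assumptions.

```python
def containers_missing_resources(deployment: dict) -> list[str]:
    """Report every container field among requests/limits for cpu/memory that is unset."""
    missing = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for key in ("cpu", "memory"):
                if key not in resources.get(section, {}):
                    missing.append(f'{container["name"]}: {section}.{key}')
    return missing


manifest = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "myapp",
                        "resources": {
                            "requests": {"cpu": "100m", "memory": "256Mi"},
                            "limits": {"cpu": "500m"},
                        },
                    },
                    {"name": "sidecar"},
                ]
            }
        }
    }
}

for entry in containers_missing_resources(manifest):
    print("missing", entry)
```

Here `myapp` lacks only `limits.memory`, while `sidecar` is reported for all four fields.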


Infrastructure as Code with Terraform

Infrastructure Definition Principles

Core Principle: Infrastructure must be defined, versioned, and managed as code, ensuring reproducibility, consistency, and automated provisioning across all environments.

Key Guidelines:

  • Maintain infrastructure code in version control
  • Use declarative rather than imperative definitions
  • Implement modular and reusable components
  • Follow the principle of idempotency
  • Document all configuration parameters

Why This Matters

Managing infrastructure as code reduces human error, ensures consistency across environments, and enables automated, repeatable deployments while maintaining a complete audit trail.
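
The declarative and idempotent guidelines can be illustrated with a toy model: describe the desired state, compute the difference from the actual state, and apply it. This is a conceptual sketch only and does not reflect how any particular tool is implemented. The names `plan` and `apply` and the resource dictionaries are assumptions.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired and actual resources and list the required actions."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


def apply(desired: dict[str, dict]) -> dict[str, dict]:
    """Return the state after all planned actions have been carried out."""
    return {name: dict(config) for name, config in desired.items()}


desired = {"vpc": {"cidr": "10.0.0.0/16"}, "db": {"size": "small"}}
actual = {"vpc": {"cidr": "10.9.0.0/16"}, "old-cache": {"size": "tiny"}}

print(plan(desired, actual))
# {'create': ['db'], 'update': ['vpc'], 'delete': ['old-cache']}

# Idempotency: once applied, planning again finds nothing to do
actual = apply(desired)
print(plan(desired, actual))
# {'create': [], 'update': [], 'delete': []}
```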


Terraform Module Structure

Best Practice Directory Layout:

infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── variables.tf
│   ├── staging/
│   │   ├── main.tf
│   │   └── variables.tf
│   └── prod/
│       ├── main.tf
│       └── variables.tf
├── modules/
│   ├── networking/
│   ├── kubernetes/
│   └── database/
└── shared/
    └── variables.tf

Terraform Implementation Example

# Define explicit variable types and validation
variable "environment" {
  type        = string
  description = "Environment name (dev, staging, or prod)"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "region" {
  type        = string
  description = "AWS region for deployment"
  default     = "us-west-2"
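}

# Per-environment node group sizing, referenced by the kubernetes module
# (the default values are illustrative)
variable "node_sizes" {
  type        = map(map(number))
  description = "Node group sizes keyed by environment"
  default = {
    dev     = { desired = 2, min = 1, max = 4 }
    staging = { desired = 2, min = 1, max = 4 }
    prod    = { desired = 3, min = 2, max = 6 }
  }
}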

# Local variables for common configurations
locals {
  vpc_cidr = {
    dev     = "10.0.0.0/16"
    staging = "10.1.0.0/16"
    prod    = "10.2.0.0/16"
  }

  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    Team        = "platform"
  }
}

# Modular resource organization
# (source paths assume this file lives in environments/<env>/ of the
# directory layout above)
module "vpc" {
  source = "../../modules/networking"

  environment = var.environment
  cidr_block  = local.vpc_cidr[var.environment]

  tags = local.common_tags
}

# Dependencies and relationships
module "kubernetes" {
  source = "../../modules/kubernetes"

  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids
  cluster_version = "1.24"

  node_groups = {
    general = {
      desired_size   = lookup(var.node_sizes[var.environment], "desired", 2)
      max_size       = lookup(var.node_sizes[var.environment], "max", 4)
      min_size       = lookup(var.node_sizes[var.environment], "min", 1)
      instance_types = ["t3.medium"]
    }
  }

  depends_on = [module.vpc]
}

Configuration Management with Ansible

Role-based Configuration

Core Principle: Configuration management should be idempotent, role-based, and maintain clear separation between code, configuration, and variables.

# playbook.yml - Main playbook structure
# The variable is named deploy_env because "environment" is a reserved
# keyword in Ansible (it sets environment variables for tasks)
---
- name: Configure application servers
  hosts: app_servers
  become: true
  vars_files:
    - vars/{{ deploy_env }}.yml

  pre_tasks:
    - name: Validate environment variables
      assert:
        that:
          - deploy_env is defined
          - deploy_env in ['dev', 'staging', 'prod']
        msg: "deploy_env must be set to dev, staging, or prod"

  roles:
    - role: common
      tags: ['common', 'setup']

    - role: nginx
      tags: ['web', 'nginx']
      vars:
        nginx_worker_processes: "{{ 'auto' if deploy_env == 'prod' else '2' }}"

    - role: application
      tags: ['app']

  post_tasks:
    - name: Verify configuration
      include_tasks: tasks/verify.yml

Application Role Tasks

# roles/application/tasks/main.yml
---
- name: Install application dependencies
  apt:
    # Passing the whole list installs all packages in one apt transaction
    name: "{{ application_dependencies }}"
    state: present
    update_cache: true
    cache_valid_time: 3600
  tags: ['install']

- name: Create application directories
  file:
    path: "{{ item }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_group }}"
    mode: '0755'
  loop:
    - "{{ app_config_path }}"
    - "{{ app_data_path }}"
    - "{{ app_log_path }}"
  tags: ['setup']

- name: Configure application service
  template:
    src: application.service.j2
    dest: /etc/systemd/system/application.service
    mode: '0644'
  notify: restart application
  tags: ['config']

- name: Ensure application is running
  systemd:
    name: application
    state: started
    enabled: true
    # Reload unit files so systemd picks up the templated service definition
    daemon_reload: true
  tags: ['service']

Idempotency is Critical

All Ansible tasks should be idempotent - running them multiple times should produce the same result without unintended side effects.
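
The property can be shown with a small function written in the same spirit as a configuration task: it checks the current state first, changes it only when needed, and reports whether it changed anything. This is a sketch in plain Python, not Ansible code. The function name `ensure_line` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "app.conf"
    first = ensure_line(config, "log_level=info")
    second = ensure_line(config, "log_level=info")
    content = config.read_text()

print(first, second)  # True False
print(content.count("log_level=info"))  # 1
```

Running the function twice leaves the file with exactly one copy of the line. The second call reports no change, much like a task that finds nothing to do.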


Secrets Management

HashiCorp Vault Integration

Core Principle: Sensitive data must never be stored in plain text and should be managed using dedicated secrets management solutions.

# Vault policy
path "secret/data/{{environment}}/{{application}}/*" {
  capabilities = ["read"]
}

path "secret/data/common/*" {
  capabilities = ["read"]
}

# Terraform: read secrets from Vault and expose them to Kubernetes
provider "vault" {
  address = var.vault_addr
}

data "vault_generic_secret" "db_creds" {
  path = "secret/${var.environment}/database"
}

resource "kubernetes_secret" "application" {
  metadata {
    name      = "app-secrets"
    namespace = var.namespace
  }

  data = {
    DB_PASSWORD = data.vault_generic_secret.db_creds.data["password"]
    DB_USERNAME = data.vault_generic_secret.db_creds.data["username"]
  }
}
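
# Python: read secrets with the hvac client
import hvac
import os

def get_secrets():
    client = hvac.Client(
        url=os.getenv('VAULT_ADDR'),
        token=os.getenv('VAULT_TOKEN')
    )

    # For KV v2, hvac takes the path relative to the mount point and adds
    # the "data/" segment itself
    secret_path = f"{os.getenv('ENVIRONMENT')}/app"
    response = client.secrets.kv.v2.read_secret_version(
        path=secret_path,
        mount_point='secret'
    )

    return response['data']['data']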

AWS Secrets Manager Integration

import boto3
import json
from botocore.exceptions import ClientError

def get_secret(secret_name, region_name="us-west-2"):
    """
    Retrieve secret from AWS Secrets Manager.
    """
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        response = client.get_secret_value(SecretId=secret_name)

        if 'SecretString' in response:
            return json.loads(response['SecretString'])
        else:
            return response['SecretBinary']

    except ClientError as e:
        if e.response['Error']['Code'] == 'ResourceNotFoundException':
            raise ValueError(f"Secret {secret_name} not found")
        elif e.response['Error']['Code'] == 'InvalidRequestException':
            raise ValueError(f"Invalid request for secret {secret_name}")
        else:
            raise

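import secrets  # standard library; used to generate the new password


class SecretRotationError(Exception):
    """Raised when a secret rotation fails."""


def generate_secure_password(length=32):
    """Return a random URL-safe password of `length` characters."""
    return secrets.token_urlsafe(length)[:length]


def update_database_password(username, new_password):
    """Application-specific: change the password in the database itself."""
    raise NotImplementedError


def rotate_secret(secret_id):
    """
    Rotate database credentials in AWS Secrets Manager.
    """
    client = boto3.client('secretsmanager')

    try:
        # Get current secret value
        response = client.get_secret_value(SecretId=secret_id)
        current_secret = json.loads(response['SecretString'])

        # Generate new credentials
        new_password = generate_secure_password()

        # Update application database
        update_database_password(
            username=current_secret['username'],
            new_password=new_password
        )

        # Update secret in Secrets Manager
        client.put_secret_value(
            SecretId=secret_id,
            SecretString=json.dumps({
                'username': current_secret['username'],
                'password': new_password,
                'host': current_secret['host'],
                'port': current_secret['port']
            })
        )

        return True

    except Exception as e:
        # Implement proper error handling and rollback
        raise SecretRotationError(f"Failed to rotate secret: {e}") from e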

Deployment Automation and Pipelines

CI/CD Pipeline Design

Core Principle: Deployment pipelines must be automated, reliable, and provide clear visibility into the deployment process while maintaining security and compliance requirements.

Key Pipeline Stages:

| Stage  | Purpose                         | Key Actions                                   |
|--------|---------------------------------|-----------------------------------------------|
| Build  | Compile and package application | Code compilation, dependency resolution       |
| Test   | Validate functionality          | Unit tests, integration tests, security scans |
| Deploy | Release to environment          | Environment provisioning, artifact deployment |
| Verify | Confirm deployment health       | Health checks, smoke tests, monitoring        |

Pipeline Best Practices

Well-designed deployment pipelines ensure reliable, repeatable deployments while reducing human error and maintaining security standards. Every stage should be automated and provide clear feedback.
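
The stage model can be sketched as a runner that executes stages in order, stops at the first failure, and records a status for every stage so the outcome is visible. The function name `run_pipeline` and the stand-in stage actions are assumptions for illustration. Real pipelines delegate this to the CI system.

```python
from typing import Callable


def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[tuple[str, str]]:
    """Run stages in order; after a failure, mark the remaining stages as skipped."""
    results = []
    failed = False
    for name, action in stages:
        if failed:
            results.append((name, "skipped"))
            continue
        ok = action()
        results.append((name, "passed" if ok else "failed"))
        failed = not ok
    return results


# Stand-in actions; a real stage would build, test, deploy, or probe health
outcome = run_pipeline([
    ("Build", lambda: True),
    ("Test", lambda: False),
    ("Deploy", lambda: True),
    ("Verify", lambda: True),
])
for stage, status in outcome:
    print(f"{stage}: {status}")
```

With the failing Test stage, Deploy and Verify are reported as skipped rather than silently omitted.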


GitHub Actions Pipeline Example

# .github/workflows/deployment.yml
name: Deployment Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Login to Container Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

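      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # Also tag with the full commit SHA, which the scan, test, and
          # deploy steps reference
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=raw,value=${{ github.sha }}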

      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Run Security Scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: 'table'
          exit-code: '1'
          ignore-unfixed: true
          severity: 'CRITICAL,HIGH'

      - name: Run tests
        run: |
          docker run --rm \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            npm test

  deploy-staging:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    environment:
      name: staging
      url: https://staging.example.com

    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Update EKS deployment
        run: |
          aws eks update-kubeconfig --name staging-cluster
          kubectl set image deployment/app-deployment \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          kubectl rollout status deployment/app-deployment --timeout=5m

      - name: Run smoke tests
        run: |
          curl -f https://staging.example.com/health || exit 1

  deploy-production:
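    # Depend on the build job: deploy-staging is skipped on main, and a job
    # that needs a skipped job is skipped as well
    needs: build-and-test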
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://api.example.com

    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.PROD_AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.PROD_AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Deploy to Production
        run: |
          aws eks update-kubeconfig --name prod-cluster
          kubectl set image deployment/app-deployment \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          kubectl rollout status deployment/app-deployment --timeout=10m

      - name: Verify deployment
        run: |
          # Health check
          curl -f https://api.example.com/health || exit 1

          # Placeholder wait of 5 minutes; replace with a query against your
          # monitoring system that fails the job when error rates rise
          sleep 300

      - name: Notify team
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment: ${{ job.status }}",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "Deployment to production: *${{ job.status }}*\nCommit: ${{ github.sha }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          # Needed for incoming webhooks when sending a custom payload
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

Jenkins Pipeline Example

// Jenkinsfile
pipeline {
    agent any

    environment {
        DOCKER_REGISTRY = 'company-registry.com'
        APP_NAME = 'myapp'
        VERSION = sh(script: 'git describe --tags --always', returnStdout: true).trim()
        KUBECONFIG = credentials('kubeconfig-prod')
    }

    stages {
        stage('Build') {
            steps {
                script {
                    docker.build("${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}")
                }
            }
        }

        stage('Test') {
            parallel {
                stage('Unit Tests') {
                    steps {
                        sh 'npm test'
                        junit 'test-results/**/*.xml'
                    }
                }
                stage('Integration Tests') {
                    steps {
                        sh 'npm run integration-test'
                    }
                }
                stage('Security Scan') {
                    steps {
                        sh """
                            trivy image \
                              --severity HIGH,CRITICAL \
                              --exit-code 1 \
                              ${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}
                        """
                    }
                }
            }
        }

        stage('Push Image') {
            steps {
                script {
                    docker.withRegistry("https://${DOCKER_REGISTRY}", 'registry-credentials') {
                        docker.image("${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}").push()
                        docker.image("${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}").push('latest')
                    }
                }
            }
        }

        stage('Deploy to Staging') {
            when { branch 'develop' }
            steps {
                script {
                    deployToEnvironment(
                        environment: 'staging',
                        version: VERSION
                    )
                }
            }
        }

        stage('Deploy to Production') {
            when {
                branch 'main'
                beforeInput true  // check the branch before prompting for input
            }
            input {
                message 'Deploy to production?'
                ok 'Yes, deploy!'
            }
            steps {
                script {
                    deployToEnvironment(
                        environment: 'production',
                        version: VERSION
                    )
                }
            }
        }

        stage('Verify Deployment') {
            when { branch 'main' }  // the checks below target production only
            steps {
                script {
                    sh """
                        kubectl rollout status deployment/${APP_NAME} -n production
                        curl -f https://api.example.com/health || exit 1
                    """
                }
            }
        }
    }

    post {
        success {
            slackSend(
                channel: '#deployments',
                color: 'good',
                message: "Deployment successful: ${APP_NAME}:${VERSION}"
            )
        }
        failure {
            slackSend(
                channel: '#deployments',
                color: 'danger',
                message: "Deployment failed: ${APP_NAME}:${VERSION}"
            )
        }
        always {
            cleanWs()
        }
    }
}

// Helper function for deployment
def deployToEnvironment(Map config) {
    sh """
        kubectl config use-context ${config.environment}
        kubectl set image deployment/${APP_NAME} \
          app=${DOCKER_REGISTRY}/${APP_NAME}:${config.version} \
          -n ${config.environment}
        kubectl rollout status deployment/${APP_NAME} \
          -n ${config.environment} \
          --timeout=5m
    """
}

Pipeline Optimization

Use parallel stages for tests and scans to reduce total pipeline execution time. Cache dependencies between runs to speed up builds.
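
The effect of running independent checks in parallel can be sketched with a thread pool. The check functions here only sleep to simulate work, and the names `run_parallel`, `unit_tests`, `integration_tests`, and `security_scan` are illustrative stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_parallel(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run all checks concurrently and collect each result by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(check) for name, check in checks.items()}
        return {name: future.result() for name, future in futures.items()}


def unit_tests() -> bool:
    time.sleep(0.2)  # simulated work
    return True


def integration_tests() -> bool:
    time.sleep(0.2)
    return True


def security_scan() -> bool:
    time.sleep(0.2)
    return False  # simulated finding


started = time.perf_counter()
results = run_parallel({
    "unit": unit_tests,
    "integration": integration_tests,
    "scan": security_scan,
})
elapsed = time.perf_counter() - started

print(results)
print(f"elapsed: {elapsed:.2f}s, all passed: {all(results.values())}")
```

Because the three simulated checks overlap, the total time is close to the longest single check rather than the sum of all three.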


Deployment Strategies

Blue-Green Deployment

Principle: Maintain two identical production environments, switching traffic between them to achieve zero-downtime deployments.

Key Benefits:

  • Instant, simple rollback by switching traffic back
  • Zero downtime during deployment
  • Full production environment testing before switch

# Blue deployment (current)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: app
          image: myapp:1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

---
# Green deployment (new)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: myapp:2.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

---
# Service that switches between blue and green
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
    - port: 80
      targetPort: 8080
#!/bin/bash
# blue-green-deploy.sh

NAMESPACE="production"
APP_NAME="myapp"            # Value of the "app" label and the image repository
DEPLOY_PREFIX="app"         # Deployments above are named app-blue and app-green
SERVICE_NAME="app-service"  # Service defined in the manifest above
NEW_VERSION=$1

if [ -z "$NEW_VERSION" ]; then
  echo "Usage: $0 <new-version>"
  exit 1
fi

# Determine current active environment
CURRENT=$(kubectl get service "${SERVICE_NAME}" -n "${NAMESPACE}" \
  -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT" = "blue" ]; then
  NEW_ENV="green"
  OLD_ENV="blue"
else
  NEW_ENV="blue"
  OLD_ENV="green"
fi

echo "Current environment: $CURRENT"
echo "Deploying to: $NEW_ENV"

# Deploy new version to inactive environment
kubectl set image "deployment/${DEPLOY_PREFIX}-${NEW_ENV}" \
  "app=${APP_NAME}:${NEW_VERSION}" \
  -n "${NAMESPACE}"

# Wait for deployment to be ready; abort if the rollout fails or times out
kubectl rollout status "deployment/${DEPLOY_PREFIX}-${NEW_ENV}" \
  -n "${NAMESPACE}" \
  --timeout=5m || exit 1

# Run smoke tests (requires curl in the application image)
echo "Running smoke tests..."
POD=$(kubectl get pod -n "${NAMESPACE}" \
  -l "app=${APP_NAME},version=${NEW_ENV}" \
  -o jsonpath='{.items[0].metadata.name}')

if kubectl exec -n "${NAMESPACE}" "${POD}" -- curl -f http://localhost:8080/health; then
  echo "Smoke tests passed. Switching traffic..."

  # Switch service to new environment
  kubectl patch service "${SERVICE_NAME}" -n "${NAMESPACE}" \
    -p "{\"spec\":{\"selector\":{\"version\":\"${NEW_ENV}\"}}}"

  echo "Traffic switched to ${NEW_ENV}"
  echo "Monitor for 5 minutes before removing old deployment"

  sleep 300

  # Optional: Scale down old environment
  # kubectl scale "deployment/${DEPLOY_PREFIX}-${OLD_ENV}" \
  #   --replicas=0 -n "${NAMESPACE}"
else
  echo "Smoke tests failed. Keeping ${OLD_ENV} active"
  exit 1
fi
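
The switch decision in the script above can also be expressed as a small pure function, which makes it easy to unit test before it touches a cluster. The sketch below is illustrative only: `plan_switch` is a hypothetical helper, not part of any tool named in this guide. Unlike the shell script, which treats any value other than "blue" as "green", this version rejects unexpected values.

```python
import json


def plan_switch(current: str) -> dict:
    """Return the target environment and the Service selector patch.

    `current` is the value of the Service's `version` selector.
    """
    if current not in ("blue", "green"):
        raise ValueError(f"unexpected active environment: {current!r}")

    new_env = "green" if current == "blue" else "blue"
    # Same payload the script passes to `kubectl patch service ... -p`
    patch = {"spec": {"selector": {"version": new_env}}}
    return {"old_env": current, "new_env": new_env, "patch": json.dumps(patch)}


if __name__ == "__main__":
    plan = plan_switch("blue")
    print(plan["new_env"])
    print(plan["patch"])
```

Building the patch with `json.dumps` avoids the hand-escaped quotes in the shell version.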

Canary Deployment

Principle: Release changes incrementally to a subset of users, monitoring for issues before full rollout.

Key Benefits:

  • Reduced blast radius of failures
  • Early detection of issues
  • Gradual traffic shifting
  • Data-driven rollout decisions
# Using Istio for canary deployment
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: app-vsvc
spec:
  hosts:
    - app.example.com
  http:
    - match:
        - headers:
            canary:
              exact: "true"
      route:
        - destination:
            host: app-canary
            subset: v2
    - route:
        - destination:
            host: app-stable
            subset: v1
          weight: 90
        - destination:
            host: app-canary
            subset: v2
          weight: 10

Canary Monitoring Configuration:

# prometheus/canary-rules.yaml
groups:
  - name: canary-deployment
    interval: 30s
    rules:
      - alert: CanaryErrorRateHigh
        expr: |
          sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{version="canary"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Canary deployment error rate too high"
          description: "Canary version showing {{ $value | humanizePercentage }} error rate"

      - alert: CanaryLatencyHigh
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket{version="canary"}[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Canary deployment latency too high"
          description: "Canary p95 latency: {{ $value }}s"

Canary Rollback

Always define clear success criteria before starting a canary deployment. Automate rollback when metrics exceed thresholds.
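
One way to make the success criteria explicit is to encode them as a decision function. The sketch below reuses the thresholds from the alert rules above (5% error rate, 1 s p95 latency). The mapping of the critical alert to "rollback" and the warning alert to "hold" is an assumption for illustration, and `CanaryMetrics` and `evaluate_canary` are hypothetical names.

```python
from dataclasses import dataclass

MAX_ERROR_RATE = 0.05     # matches the CanaryErrorRateHigh threshold
MAX_P95_LATENCY_S = 1.0   # matches the CanaryLatencyHigh threshold


@dataclass
class CanaryMetrics:
    error_rate: float      # fraction of requests returning 5xx, e.g. 0.02
    p95_latency_s: float   # 95th percentile latency in seconds


def evaluate_canary(metrics: CanaryMetrics) -> str:
    """Return 'rollback', 'hold', or 'promote' for the observed metrics."""
    if metrics.error_rate > MAX_ERROR_RATE:
        return "rollback"   # critical: errors exceed the agreed budget
    if metrics.p95_latency_s > MAX_P95_LATENCY_S:
        return "hold"       # warning: keep the current weight and investigate
    return "promote"        # all criteria met: shift more traffic


if __name__ == "__main__":
    print(evaluate_canary(CanaryMetrics(error_rate=0.01, p95_latency_s=0.4)))
```

In practice the metrics would come from the monitoring system; here they are passed in directly so the decision logic stays testable.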


Rolling Deployment

Principle: Gradually replace instances of the application with new versions while maintaining service availability.

Key Benefits:

  • No infrastructure duplication needed
  • Simple implementation
  • Rollout stalls instead of proceeding when new pods fail readiness checks (revert with kubectl rollout undo)
  • Resource efficient
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
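spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max pods above the desired replica count during the update
      maxUnavailable: 0  # Max pods that may be unavailable during the update
  template:
    metadata:
      labels:
        app: myapp
    spec: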
      containers:
        - name: app
          image: myapp:2.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10

Rolling Update Process:

  1. New pod is created
  2. Wait for new pod to be ready
  3. Old pod is terminated
  4. Repeat until all pods updated
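
The steps above can be sketched as a small simulation that shows how maxSurge and maxUnavailable bound the pod counts. This is a simplified model, not the Kubernetes controller: it assumes every new pod becomes ready immediately, and `simulate_rolling_update` is a hypothetical name.

```python
def simulate_rolling_update(replicas: int, max_surge: int, max_unavailable: int):
    """Return the (old_pods, new_pods) counts after each update step."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")

    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up: total pods may not exceed replicas + max_surge
        can_add = min(replicas - new, replicas + max_surge - (old + new))
        new += max(can_add, 0)
        # Scale down: available pods may not drop below replicas - max_unavailable
        can_remove = min(old, (old + new) - (replicas - max_unavailable))
        old -= max(can_remove, 0)
        steps.append((old, new))
    return steps


if __name__ == "__main__":
    # Settings from the manifest above: 5 replicas, maxSurge 1, maxUnavailable 0
    print(simulate_rolling_update(5, 1, 0))
    # [(5, 0), (4, 1), (3, 2), (2, 3), (1, 4), (0, 5)]
```

With maxUnavailable set to 0, the total never drops below the desired five pods, which is why the manifest above needs capacity for one extra pod during the rollout.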

Environment Management

Environment Parity

Core Principle: Development, staging, and production environments must maintain maximum parity to ensure reliable testing and deployment processes.

Key Guidelines:

  • Use identical configuration structures across environments
  • Maintain consistent versions of all dependencies
  • Implement similar scaling and redundancy patterns
  • Use production-like data in lower environments
  • Automate environment provisioning

Why Parity Matters

Environment parity minimizes "it works on my machine" issues and ensures that testing in lower environments accurately predicts production behavior. When environments differ, bugs may only surface in production, leading to costly incidents.


Configuration Management Structure

# config/base/config.yaml
app:
  name: myservice
  version: 1.0.0

database:
  type: postgresql
  pool:
    min: 5
    max: 20
    idle_timeout: 300

logging:
  level: info
  format: json

cache:
  ttl: 3600
  max_entries: 10000
# config/environments/production.yaml
extends: ../base/config.yaml

app:
  replicas: 3
  resources:
    cpu: 1000m
    memory: 2Gi

database:
  host: prod-db.example.com
  pool:
    max: 50
  ssl: true

logging:
  level: warn
# config/environments/staging.yaml
extends: ../base/config.yaml

app:
  replicas: 2
  resources:
    cpu: 500m
    memory: 1Gi

database:
  host: staging-db.example.com
  pool:
    max: 30

logging:
  level: info
# config/environments/development.yaml
extends: ../base/config.yaml

app:
  replicas: 1
  resources:
    cpu: 250m
    memory: 512Mi

database:
  host: localhost
  pool:
    max: 10

logging:
  level: debug
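
The guide does not say which tool resolves the `extends` key, so the following is only a sketch of one plausible behavior: a recursive merge in which environment values override base values key by key. `deep_merge` is a hypothetical helper, and the dictionaries mirror a subset of the base and production files above.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` values replace `base` values.

    Nested dicts are merged recursively; all other values are replaced.
    """
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


BASE = {
    "database": {"type": "postgresql",
                 "pool": {"min": 5, "max": 20, "idle_timeout": 300}},
    "logging": {"level": "info", "format": "json"},
}
PRODUCTION = {
    "database": {"host": "prod-db.example.com", "pool": {"max": 50}, "ssl": True},
    "logging": {"level": "warn"},
}

if __name__ == "__main__":
    effective = deep_merge(BASE, PRODUCTION)
    print(effective["database"]["pool"])
    print(effective["logging"])
```

Under this merge, production keeps the base pool `min` and `idle_timeout` while raising `max` to 50, so each environment file only needs to list its deviations.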

Environment Variable Management

Best Practices:

| Variable Type | Storage Method | Example |
|---|---|---|
| Non-sensitive Config | ConfigMaps | Feature flags, API URLs |
| Sensitive Data | Secrets Manager | Database passwords, API keys |
| Environment-specific | Environment files | Resource limits, replica counts |
| Build-time | Build arguments | Version numbers, build dates |

Kubernetes ConfigMap and Secrets

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  APP_LOG_LEVEL: "info"
  APP_CACHE_TTL: "3600"
  APP_API_VERSION: "v1"
  METRICS_ENABLED: "true"
  FEATURE_NEW_UI: "true"
  MAX_UPLOAD_SIZE: "10485760"
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
  namespace: production
type: Opaque
data:
  DB_PASSWORD: <base64-encoded>
  API_KEY: <base64-encoded>
  JWT_SECRET: <base64-encoded>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:1.0.0
          envFrom:
            - configMapRef:
                name: app-config
            - secretRef:
                name: app-secrets
          env:
            - name: ENVIRONMENT
              value: "production"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
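
Values injected through `envFrom` always arrive as strings, so the application has to convert them. The sketch below shows one way to load the ConfigMap keys above into typed settings. `Settings` and `load_settings` are hypothetical names, and the defaults are assumptions copied from the ConfigMap values.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as "true" or "1"."""
    return value.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class Settings:
    log_level: str
    cache_ttl: int
    metrics_enabled: bool
    max_upload_size: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        log_level=env.get("APP_LOG_LEVEL", "info"),
        cache_ttl=int(env.get("APP_CACHE_TTL", "3600")),
        metrics_enabled=_as_bool(env.get("METRICS_ENABLED", "false")),
        max_upload_size=int(env.get("MAX_UPLOAD_SIZE", "10485760")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Parsing once at startup means a malformed value such as `APP_CACHE_TTL=abc` fails fast with a `ValueError` instead of surfacing later at request time.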

Local Development Environment

Comprehensive Docker Compose Setup

# docker-compose.dev.yml
version: '3.8'

x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"
    max-file: "3"

services:
  app:
    build:
      context: .
      target: development
      args:
        NODE_ENV: development
    volumes:
      - .:/app:delegated
      - /app/node_modules
      - ${HOME}/.aws:/root/.aws:ro
    environment:
      - NODE_ENV=development
      - DB_HOST=db
      - DB_PORT=5432
      - DB_NAME=appdb
      - DB_USER=devuser
      - DB_PASSWORD=devpass
      - REDIS_HOST=cache
      - REDIS_PORT=6379
      - AWS_PROFILE=${AWS_PROFILE:-default}
    ports:
      - "${PORT:-3000}:3000"
      - "9229:9229"  # Node.js debugger
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    logging: *default-logging

  db:
    image: postgres:14-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: devuser
      POSTGRES_PASSWORD: devpass
      POSTGRES_INITDB_ARGS: "-E UTF8"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d:ro
      - ./backups:/backups
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U devuser -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    logging: *default-logging

  cache:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redisdata:/data
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    logging: *default-logging

  # Development-only services
  mailhog:
    image: mailhog/mailhog:latest
    ports:
      - "1025:1025"  # SMTP
      - "8025:8025"  # Web UI
    logging: *default-logging

  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    environment:
      ADMINER_DEFAULT_SERVER: db
    depends_on:
      - db
    logging: *default-logging

volumes:
  pgdata:
    driver: local
  redisdata:
    driver: local

networks:
  default:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16

Development Environment Setup Script

#!/bin/bash
# setup-dev.sh

set -e

echo "Setting up local development environment..."

# Check prerequisites
check_prerequisites() {
    echo "Checking prerequisites..."

    command -v docker >/dev/null 2>&1 || {
        echo "Error: Docker is not installed"
        exit 1
    }

    command -v docker-compose >/dev/null 2>&1 || {
        echo "Error: Docker Compose is not installed"
        exit 1
    }

    echo "Prerequisites satisfied"
}

# Create necessary directories
setup_directories() {
    echo "Creating project directories..."
    mkdir -p backups
    mkdir -p init-scripts
    mkdir -p logs
    echo "Directories created"
}

# Copy environment template
setup_env_file() {
    if [ ! -f .env ]; then
        echo "Creating .env file..."
        cp .env.example .env
        echo ".env file created"
        echo "WARNING: Please review and update .env with your settings"
    else
        echo ".env file already exists"
    fi
}

# Start services
start_services() {
    echo "Starting Docker services..."
    docker-compose -f docker-compose.dev.yml up -d
    echo "Services started"
}

# Wait for services
wait_for_services() {
    echo "Waiting for services to be healthy..."

    max_attempts=30
    attempt=0

    while [ $attempt -lt $max_attempts ]; do
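        status=$(docker-compose -f docker-compose.dev.yml ps)

        # "(healthy)" does not match "(unhealthy)"; also require that nothing is still starting
        if echo "$status" | grep -q "(healthy)" \
            && ! echo "$status" | grep -Eq "\(unhealthy\)|\(health: starting\)"; then
            echo "Services are healthy"
            return 0
        fi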

        attempt=$((attempt + 1))
        echo "Waiting... ($attempt/$max_attempts)"
        sleep 2
    done

    echo "Error: Services did not become healthy in time"
    return 1
}

# Run database migrations
run_migrations() {
    echo "Running database migrations..."
    docker-compose -f docker-compose.dev.yml exec -T app npm run migrate
    echo "Migrations completed"
}

# Seed development data
seed_data() {
    echo "Seeding development data..."
    docker-compose -f docker-compose.dev.yml exec -T app npm run seed
    echo "Data seeded"
}

# Print access information
print_info() {
    echo ""
    echo "========================================="
    echo "Development environment is ready!"
    echo "========================================="
    echo ""
    echo "Application: http://localhost:3000"
    echo "Database UI: http://localhost:8080"
    echo "Mail Server: http://localhost:8025"
    echo ""
    echo "Useful commands:"
    echo "  docker-compose -f docker-compose.dev.yml logs -f    # View logs"
    echo "  docker-compose -f docker-compose.dev.yml down       # Stop services"
    echo "  docker-compose -f docker-compose.dev.yml restart    # Restart services"
    echo ""
}

# Main execution
main() {
    check_prerequisites
    setup_directories
    setup_env_file
    start_services
    wait_for_services || exit 1
    run_migrations
    seed_data
    print_info
}

main

Development Productivity

Use volume mounts for hot reloading during development. This allows code changes to be reflected immediately without rebuilding containers.


Infrastructure Testing

Testing Layers

Core Principle: Infrastructure code must be validated through multiple testing layers, ensuring both correctness and compliance before any deployment.

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.50.0
    hooks:
      - id: terraform_fmt
      - id: terraform_docs
      - id: terraform_tflint
      - id: terraform_tfsec
      - id: terraform_validate

  - repo: https://github.com/bridgecrewio/checkov
    rev: 2.0.0
    hooks:
      - id: checkov
        args: [--directory, .]
# test_infrastructure.py
import pytest
from infrastructure.validators import validate_vpc_config

def test_vpc_configuration():
    """Test VPC configuration validation."""
    config = {
        'cidr_block': '10.0.0.0/16',
        'region': 'us-west-2',
        'availability_zones': 3
    }

    result = validate_vpc_config(config)

    assert result['subnet_count'] == 6
    assert result['nat_gateway_count'] == 3
    assert result['valid'] is True

def test_invalid_cidr_block():
    """Test validation catches invalid CIDR blocks."""
    config = {
        'cidr_block': '10.0.0.0/8',  # Too large
        'region': 'us-west-2'
    }

    with pytest.raises(ValueError):
        validate_vpc_config(config)
// infrastructure_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestKubernetesCluster(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../infrastructure/kubernetes",
        Vars: map[string]interface{}{
            "environment": "test",
            "region":      "us-west-2",
        },
    }

    defer terraform.Destroy(t, terraformOptions)

    terraform.InitAndApply(t, terraformOptions)

    clusterName := terraform.Output(t, terraformOptions, "cluster_name")
    assert.NotEmpty(t, clusterName)

    nodeCount := terraform.OutputList(t, terraformOptions, "node_groups")
    assert.GreaterOrEqual(t, len(nodeCount), 1)
}
# compliance-tests.yaml
tests:
  - name: ensure_encryption_at_rest
    resource_type: aws_rds_cluster
    assertions:
      - property: storage_encrypted
        operator: equals
        value: true
        severity: critical

  - name: verify_vpc_flow_logs
    resource_type: aws_vpc
    assertions:
      - property: enable_flow_logs
        operator: equals
        value: true
        severity: high

  - name: check_s3_versioning
    resource_type: aws_s3_bucket
    assertions:
      - property: versioning.enabled
        operator: equals
        value: true
        severity: medium

  - name: verify_backup_retention
    resource_type: aws_db_instance
    assertions:
      - property: backup_retention_period
        operator: greater_than
        value: 7
        severity: high
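
The compliance file above is declarative, so something has to evaluate it. The guide does not name the tool that consumes this format; the sketch below is a hypothetical evaluator that supports the two operators used above and resolves dotted property paths such as `versioning.enabled`. Resources are represented as plain dictionaries for illustration.

```python
OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "greater_than": lambda actual, expected: actual is not None and actual > expected,
}


def get_property(resource: dict, path: str):
    """Resolve a dotted path like 'versioning.enabled'; return None if missing."""
    current = resource
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def evaluate(test: dict, resource: dict) -> list:
    """Return a list of failure messages (empty when the resource complies)."""
    failures = []
    for assertion in test["assertions"]:
        actual = get_property(resource, assertion["property"])
        check = OPERATORS[assertion["operator"]]
        if not check(actual, assertion["value"]):
            failures.append(
                f'{test["name"]}: {assertion["property"]}={actual!r} '
                f'({assertion["severity"]})'
            )
    return failures


if __name__ == "__main__":
    backup_test = {
        "name": "verify_backup_retention",
        "assertions": [{"property": "backup_retention_period",
                        "operator": "greater_than", "value": 7,
                        "severity": "high"}],
    }
    print(evaluate(backup_test, {"backup_retention_period": 7}))
```

Note that `greater_than: 7` rejects a retention period of exactly 7 days; if 7 days should pass, the threshold in the rule needs to be 6 or the operator needs an "or equal" variant.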

Environment Decommissioning

Systematic Cleanup Process

Core Principle: Environment decommissioning must be systematic, ensuring proper cleanup of resources, data preservation, and documentation updates.

Decommissioning Checklist:

| Phase | Tasks | Verification |
|---|---|---|
| Planning | Create decommission plan, notify stakeholders | Plan reviewed and approved |
| Backup | Backup critical data, export configurations | Backups validated and accessible |
| Cleanup | Remove Kubernetes resources, cloud infrastructure | All resources removed |
| Documentation | Update docs, remove access controls | Documentation current |
| Verification | Validate complete removal, cost verification | No remaining resources or costs |

Decommissioning Script

#!/bin/bash
# decommission-environment.sh

set -e

ENVIRONMENT=$1
DRY_RUN=${2:-false}

if [ -z "$ENVIRONMENT" ]; then
    echo "Usage: $0 <environment> [dry-run: true|false]"
    exit 1
fi

echo "Decommissioning environment: $ENVIRONMENT"

if [ "$DRY_RUN" = "true" ]; then
    echo "WARNING: DRY RUN MODE - No actual changes will be made"
fi

# Step 1: Backup critical data
backup_data() {
    echo "Creating final backups..."

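    # The AWS CLI has no dry-run option for RDS snapshots, so dry runs only report the plan
    if [ "$DRY_RUN" = "true" ]; then
        echo "[dry-run] Would snapshot ${ENVIRONMENT}-db and export namespace ${ENVIRONMENT}"
        return 0
    fi

    snapshot_id="${ENVIRONMENT}-final-$(date +%Y%m%d)"

    aws rds create-db-snapshot \
        --db-instance-identifier "${ENVIRONMENT}-db" \
        --db-snapshot-identifier "${snapshot_id}"

    # Block until the snapshot is usable before anything is deleted
    aws rds wait db-snapshot-available \
        --db-snapshot-identifier "${snapshot_id}"

    # Note: "get all" omits ConfigMaps, Secrets, and several other resource types
    kubectl get all -n "${ENVIRONMENT}" -o yaml > "${ENVIRONMENT}-backup.yaml"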

    echo "Backups completed"
}

# Step 2: Scale down services
scale_down_services() {
    echo "Scaling down services..."

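    # Test the value explicitly: ${DRY_RUN:+...} would also expand for "false"
    if [ "$DRY_RUN" = "true" ]; then
        kubectl scale deployment --all --replicas=0 -n "${ENVIRONMENT}" --dry-run=client
    else
        kubectl scale deployment --all --replicas=0 -n "${ENVIRONMENT}"
    fi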

    echo "Services scaled down"
}

# Step 3: Remove Kubernetes resources
cleanup_kubernetes() {
    echo "Removing Kubernetes resources..."

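    # Test the value explicitly: ${DRY_RUN:+...} would also expand for "false"
    if [ "$DRY_RUN" = "true" ]; then
        kubectl delete namespace "${ENVIRONMENT}" --dry-run=client
    else
        kubectl delete namespace "${ENVIRONMENT}"
    fi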

    echo "Kubernetes resources removed"
}

# Step 4: Remove cloud infrastructure
cleanup_cloud_resources() {
    echo "Removing cloud infrastructure..."

    # Run in a subshell so the directory change does not affect later steps
    (
        cd "infrastructure/environments/${ENVIRONMENT}"

        if [ "$DRY_RUN" = "true" ]; then
            terraform plan -destroy
        else
            terraform destroy -auto-approve
        fi
    )

    echo "Cloud resources removed"
}

# Step 5: Remove DNS entries
cleanup_dns() {
    echo "Removing DNS entries..."

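    # HOSTED_ZONE_ID must be exported by the caller; fail early if it is missing
    : "${HOSTED_ZONE_ID:?HOSTED_ZONE_ID must be set}"

    # List Route53 records whose name contains the environment
    aws route53 list-resource-record-sets \
        --hosted-zone-id "${HOSTED_ZONE_ID}" \
        --query "ResourceRecordSets[?contains(Name, '${ENVIRONMENT}')]" \
        --output json \
        | jq -r '.[] | .Name' \
        | while read -r record; do
            echo "Removing DNS record: $record"
            # Add deletion logic here
        done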

    echo "DNS entries removed"
}

# Step 6: Revoke access
revoke_access() {
    echo "Revoking access credentials..."

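    # Revoke IAM roles (text output is tab-separated, so split it into one role per line)
    aws iam list-roles --query "Roles[?contains(RoleName, '${ENVIRONMENT}')].RoleName" \
        --output text | tr '\t' '\n' | while read -r role; do
            [ -n "$role" ] || continue
            if [ "$DRY_RUN" = "true" ]; then
                echo "[dry-run] Would remove role: $role"
            else
                echo "Removing role: $role"
                # delete-role fails unless policies and instance profiles are detached first
                aws iam delete-role --role-name "$role"
            fi
        done

    # Delete service accounts
    if [ "$DRY_RUN" = "true" ]; then
        kubectl delete serviceaccount --all -n "${ENVIRONMENT}" --dry-run=client
    else
        kubectl delete serviceaccount --all -n "${ENVIRONMENT}"
    fi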

    echo "Access revoked"
}

# Step 7: Generate decommission report
generate_report() {
    echo "Generating decommission report..."

    cat > ${ENVIRONMENT}-decommission-report.md <<EOF
# Environment Decommission Report

**Environment:** ${ENVIRONMENT}
**Date:** $(date)
**Performed By:** $(whoami)

## Summary

Environment ${ENVIRONMENT} has been successfully decommissioned.

## Backup Locations

- Database Snapshot: ${ENVIRONMENT}-final-$(date +%Y%m%d)
- Configuration Backup: ${ENVIRONMENT}-backup.yaml
- Archive Location: s3://backups/${ENVIRONMENT}/

## Resources Removed

- Kubernetes namespace: ${ENVIRONMENT}
- Cloud infrastructure: infrastructure/environments/${ENVIRONMENT}
- DNS entries: *.${ENVIRONMENT}.example.com
- IAM roles and service accounts

## Verification

- [ ] All cloud resources terminated
- [ ] No ongoing costs
- [ ] Backups accessible
- [ ] Documentation updated
- [ ] Team notified

## Next Steps

1. Verify no unexpected costs appear in next billing cycle
2. Archive documentation after 90 days
3. Delete backups after retention period ($(date -d '+90 days' +%Y-%m-%d))

EOF

    echo "Report generated: ${ENVIRONMENT}-decommission-report.md"
}

# Step 8: Verify cleanup
verify_cleanup() {
    echo "Verifying cleanup..."

    # Check for remaining Kubernetes resources
    # "kubectl get all" exits 0 even when the namespace is gone, so test the namespace itself
    if kubectl get namespace "${ENVIRONMENT}" >/dev/null 2>&1; then
        echo "WARNING: Namespace ${ENVIRONMENT} still exists with these resources:"
        kubectl get all -n "${ENVIRONMENT}"
    fi

    # Check for remaining AWS resources
    remaining_aws=$(aws resourcegroupstaggingapi get-resources \
        --tag-filters Key=Environment,Values=${ENVIRONMENT} \
        --query 'ResourceTagMappingList[].ResourceARN' \
        --output text)

    if [ -n "$remaining_aws" ]; then
        echo "WARNING: Some AWS resources still exist:"
        echo "$remaining_aws"
    else
        echo "No remaining resources found"
    fi
}

# Main execution
main() {
    echo "Starting decommission process..."
    echo "Environment: $ENVIRONMENT"
    echo "Dry Run: $DRY_RUN"
    echo ""

    read -p "Are you sure you want to decommission $ENVIRONMENT? (yes/no): " confirm

    if [ "$confirm" != "yes" ]; then
        echo "Decommission cancelled"
        exit 0
    fi

    backup_data
    scale_down_services
    cleanup_kubernetes
    cleanup_cloud_resources
    cleanup_dns
    revoke_access
    verify_cleanup
    generate_report

    echo ""
    echo "Decommission completed!"
    echo "Please review the report: ${ENVIRONMENT}-decommission-report.md"
}

main

Critical: Pre-Decommission Verification

Always verify backups are complete and accessible before removing any production resources. Maintain backups according to your data retention policy.


Best Practices Summary

Container Best Practices

| Practice | Implementation | Benefit |
|---|---|---|
| Multi-stage builds | Separate build and runtime stages | Smaller images, better security |
| Non-root users | Create and use dedicated app users | Enhanced security |
| Health checks | Implement readiness and liveness probes | Reliable deployments |
| Resource limits | Define CPU and memory constraints | Predictable performance |
| Image scanning | Automated vulnerability scanning | Early security issue detection |

Infrastructure as Code Best Practices

IaC Golden Rules

  1. Version Everything - All infrastructure code in version control
  2. Modularize - Create reusable, focused modules
  3. Document - Explain why, not just what
  4. Test - Validate changes before production
  5. Review - Peer review all infrastructure changes

Deployment Best Practices

Pre-Deployment:

  • All tests passing
  • Security scans completed
  • Database migrations tested
  • Rollback plan documented
  • Stakeholders notified

During Deployment:

  • Monitor key metrics
  • Watch error rates
  • Verify health checks
  • Check logs for issues
  • Ready to rollback if needed

Post-Deployment:

  • Verify business-critical flows
  • Monitor for 15-30 minutes
  • Update documentation
  • Communicate success
  • Document any issues

Environment Management Best Practices

  • Use hierarchical configuration inheritance
  • Keep sensitive data in secrets management
  • Document environment-specific deviations
  • Automate configuration validation
  • Maintain consistent tooling versions
  • Use similar scaling patterns
  • Replicate production architecture
  • Test with production-like data
  • Test in staging before production
  • Use automated integration tests
  • Perform load testing regularly
  • Validate disaster recovery procedures

Monitoring and Observability

Key Metrics to Track

Application Metrics:

# prometheus-rules.yaml
groups:
  - name: application-health
    rules:
      - record: app:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, 
          rate(http_request_duration_seconds_bucket[5m]))

      - record: app:http_requests_total:rate5m
        expr: rate(http_requests_total[5m])

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / 
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
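
To make the p95 rules above less opaque, here is a simplified sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the idea behind `histogram_quantile`. It is not the Prometheus implementation and ignores several of its edge cases. The bucket bounds and counts in the example are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0   # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical request-duration buckets in seconds
    example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, example))
```

Because the estimate is interpolated within a bucket, its accuracy depends on bucket boundaries: an alert threshold of 1 s is most reliable when 1.0 is itself a bucket bound.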

Infrastructure Metrics:

# infrastructure-rules.yaml
groups:
  - name: infrastructure-health
    rules:
      - alert: HighCPUUsage
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"

      - alert: HighMemoryUsage
        expr: |
          (container_memory_usage_bytes / 
           container_spec_memory_limit_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage detected"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"

Logging Strategy

Structured Logging Example:

{
  "timestamp": "2024-10-18T10:30:45.123Z",
  "level": "info",
  "service": "api",
  "environment": "production",
  "version": "1.2.3",
  "trace_id": "abc123def456",
  "user_id": "user_789",
  "endpoint": "/api/orders",
  "method": "POST",
  "status_code": 201,
  "duration_ms": 45,
  "message": "Order created successfully"
}

Log Aggregation:

  • Use centralized logging (ELK, Loki, CloudWatch)
  • Implement log retention policies
  • Structure logs for easy querying
  • Include correlation IDs for tracing
  • Filter sensitive information
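
A minimal sketch of the points above, using only the Python standard library: a formatter that emits one JSON object per log line with the fields from the structured logging example, and that redacts sensitive keys. `JsonFormatter`, the `context` attribute, and the `SENSITIVE_KEYS` set are assumptions for illustration; adapt the key list to your own data.

```python
import json
import logging
from datetime import datetime, timezone

SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}


class JsonFormatter(logging.Formatter):
    """Format log records as single-line JSON with static service fields."""

    def __init__(self, service: str, environment: str, version: str):
        super().__init__()
        self.static = {"service": service, "environment": environment,
                       "version": version}

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            **self.static,
            "message": record.getMessage(),
        }
        # Per-request fields are passed via extra={"context": {...}}
        for key, value in getattr(record, "context", {}).items():
            entry[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("api", "production", "1.2.3"))
    logger = logging.getLogger("api")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("Order created successfully",
                extra={"context": {"trace_id": "abc123def456",
                                   "status_code": 201, "token": "s3cr3t"}})
```

The `trace_id` field doubles as the correlation ID, so a single query in the log aggregator can follow one request across services.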

Disaster Recovery

Backup Strategy

What to Backup:

| Resource Type | Frequency | Retention | Method |
|---|---|---|---|
| Databases | Hourly | 30 days | Automated snapshots |
| Configuration | On change | 90 days | Version control |
| Secrets | Daily | 90 days | Encrypted backups |
| Application State | Daily | 7 days | Volume snapshots |
| Infrastructure Code | On commit | Indefinite | Git repository |

Recovery Procedures

Database Recovery Example:

#!/bin/bash
# restore-database.sh

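set -euo pipefail

SNAPSHOT_ID=${1:-}
TARGET_INSTANCE=${2:-}

if [ -z "$SNAPSHOT_ID" ] || [ -z "$TARGET_INSTANCE" ]; then
    echo "Usage: $0 <snapshot-id> <target-instance>"
    exit 1
fi

echo "Restoring database from snapshot: $SNAPSHOT_ID"

# Restore RDS instance from snapshot
# (boolean options take the form --flag / --no-flag, not "--flag false")
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "${TARGET_INSTANCE}" \
    --db-snapshot-identifier "${SNAPSHOT_ID}" \
    --db-instance-class db.t3.medium \
    --no-publicly-accessible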

# Wait for instance to be available
echo "Waiting for instance to be available..."
aws rds wait db-instance-available \
    --db-instance-identifier ${TARGET_INSTANCE}

# Update application configuration
echo "Updating application configuration..."
NEW_ENDPOINT=$(aws rds describe-db-instances \
    --db-instance-identifier ${TARGET_INSTANCE} \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text)

kubectl set env deployment/app \
    DATABASE_HOST=${NEW_ENDPOINT}

echo "Database restored successfully"
echo "New endpoint: $NEW_ENDPOINT"

Disaster Recovery Testing

Regular DR Drills:

  1. Monthly: Test database restoration
  2. Quarterly: Full environment recovery
  3. Semi-annually (twice a year): Complete disaster scenario
  4. Annually: Cross-region failover test

DR Test Checklist:

  • Identify recovery time objective (RTO)
  • Identify recovery point objective (RPO)
  • Document recovery procedures
  • Test backup restoration
  • Verify data integrity
  • Validate application functionality
  • Document lessons learned
  • Update procedures based on findings
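
The RTO and RPO items in the checklist become easier to verify when each drill records three timestamps: the last good backup, the simulated incident, and the moment service was restored. The sketch below compares the measured values against the objectives. `evaluate_drill` and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_backup: datetime, incident: datetime,
                   restored: datetime, rpo: timedelta, rto: timedelta) -> dict:
    """Compare measured data loss and downtime against RPO and RTO."""
    data_loss = incident - last_backup   # work lost since the last backup
    downtime = restored - incident       # time until service was back
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    result = evaluate_drill(
        last_backup=datetime(2024, 10, 18, 10, 0),
        incident=datetime(2024, 10, 18, 10, 45),
        restored=datetime(2024, 10, 18, 13, 15),
        rpo=timedelta(hours=1),   # hourly database snapshots
        rto=timedelta(hours=2),   # assumed objective for this example
    )
    print(result)
```

In this example the hourly snapshot keeps data loss within the RPO, but the 2.5 hour restore misses the assumed 2 hour RTO, which is the kind of finding that should feed back into the recovery procedure.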

Troubleshooting Guide

Common Issues and Solutions

Symptom: Docker build fails

Common Causes:

  • Network issues downloading dependencies
  • Invalid Dockerfile syntax
  • Insufficient disk space
  • Build cache corruption

Solutions:

# Clear build cache
docker builder prune -af

# Build without cache
docker build --no-cache -t myapp:latest .

# Check disk space
df -h
docker system df

Symptom: Kubernetes deployment fails

Common Causes:

  • Image pull errors
  • Resource constraints
  • Failed health checks
  • Configuration errors

Solutions:

# Check pod status
kubectl get pods -n production
kubectl describe pod <pod-name> -n production

# Check logs
kubectl logs <pod-name> -n production

# Check events
kubectl get events -n production --sort-by='.lastTimestamp'

# Rollback deployment
kubectl rollout undo deployment/myapp -n production

Symptom: Service connectivity problems

Common Causes:

  • DNS resolution failures
  • Network policy restrictions
  • Service misconfiguration
  • Ingress controller issues

Solutions:

# Test DNS resolution (run the debug pod in the same namespace as the service)
kubectl run -it --rm debug -n production \
  --image=nicolaka/netshoot \
  --restart=Never -- nslookup myapp

# Check service endpoints
kubectl get endpoints myapp -n production

# Verify network policies
kubectl get networkpolicies -n production

# Test connectivity
kubectl run -it --rm debug -n production \
  --image=nicolaka/netshoot \
  --restart=Never -- curl http://myapp:80

Symptom: Slow application performance

Common Causes:

  • Resource constraints
  • Database connection pool exhaustion
  • Memory leaks
  • Inefficient queries

Solutions:

# Check resource usage
kubectl top pods -n production
kubectl top nodes

# Increase resources
kubectl set resources deployment/myapp -n production \
  --requests=cpu=500m,memory=1Gi \
  --limits=cpu=1000m,memory=2Gi

# Check application metrics
kubectl port-forward svc/myapp 9090:9090 -n production
# Access metrics at http://localhost:9090/metrics

# Review logs for errors
kubectl logs -f deployment/myapp -n production


Quick Reference Commands

Docker Commands

# Build and tag
docker build -t myapp:1.0.0 .
docker tag myapp:1.0.0 registry.com/myapp:1.0.0

# Push to registry
docker push registry.com/myapp:1.0.0

# Clean up
docker system prune -af
docker volume prune -f

# Inspect
docker inspect <container-id>
docker logs -f <container-id>

# Execute commands in container
docker exec -it <container-id> /bin/bash

Kubernetes Commands

# Deployments
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp
kubectl rollout undo deployment/myapp
kubectl scale deployment/myapp --replicas=3

# Debugging
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl logs -f <pod-name>
kubectl exec -it <pod-name> -- /bin/sh

# Configuration
kubectl create configmap myapp-config --from-file=config.yaml
kubectl create secret generic myapp-secret --from-literal=password=secret
kubectl get configmap myapp-config -o yaml

# Services and Ingress
kubectl get services
kubectl get ingress
kubectl port-forward svc/myapp 8080:80

# Namespace management
kubectl get namespaces
kubectl create namespace staging
kubectl config set-context --current --namespace=staging

# Advanced Deployment Management
kubectl rollout history deployment/myapp
kubectl rollout history deployment/myapp --revision=2
kubectl rollout pause deployment/myapp
kubectl rollout resume deployment/myapp
kubectl rollout restart deployment/myapp

# Resource Updates
kubectl patch deployment myapp -p '{"spec":{"replicas":5}}'
kubectl set image deployment/myapp myapp=myapp:2.0.0
kubectl set env deployment/myapp DATABASE_URL=postgres://newdb:5432
kubectl set resources deployment/myapp --limits=cpu=500m,memory=1Gi

# Deployment Strategies
# The --record flag is deprecated; set the change-cause annotation instead
kubectl apply -f deployment.yaml
kubectl annotate deployment/myapp kubernetes.io/change-cause="Upgraded to version 2.0"

# StatefulSets (for stateful applications)
kubectl get statefulsets
kubectl scale statefulset/mydb --replicas=3
kubectl rollout status statefulset/mydb
# Force deletion skips graceful shutdown; use only as a last resort for a stuck pod
kubectl delete pod mydb-0 --force --grace-period=0

# DaemonSets (for node-level services)
kubectl get daemonsets -A
kubectl rollout status daemonset/node-exporter -n monitoring

# Jobs and CronJobs
kubectl create job backup --image=backup:latest
kubectl get jobs
kubectl get cronjobs
kubectl create cronjob backup --image=backup:latest --schedule="0 2 * * *"

# Resource Quotas and Limits
kubectl get resourcequota -n production
kubectl describe resourcequota production-quota -n production
kubectl create quota production-quota --hard=cpu=10,memory=20Gi,pods=50

# HorizontalPodAutoscaler
kubectl autoscale deployment myapp --cpu-percent=70 --min=2 --max=10
kubectl get hpa
kubectl describe hpa myapp

# Custom Resource Definitions (CRDs)
kubectl get crd
kubectl get <crd-name>
kubectl describe crd <crd-name>

# Helm (Package Manager)
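# The "stable" repository is archived and has no nginx chart; Bitnami is used here as an example
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo nginx
helm install myapp bitnami/nginx
helm list
helm upgrade myapp bitnami/nginx --set replicaCount=3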
helm rollback myapp 1
helm uninstall myapp

# Kustomize (Configuration Management)
kubectl apply -k ./overlays/production
kubectl kustomize ./overlays/production
kubectl diff -k ./overlays/production

# Network Policies
kubectl get networkpolicies -n production
kubectl describe networkpolicy allow-frontend -n production

# Resource Management
kubectl top nodes
kubectl top pods -n production
kubectl describe node <node-name>
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets
kubectl uncordon <node-name>

# PersistentVolumes and Claims
kubectl get pv
kubectl get pvc -n production
kubectl describe pv <pv-name>
kubectl delete pvc <pvc-name>

# Service Accounts and RBAC
kubectl get serviceaccounts -n production
kubectl create serviceaccount myapp-sa
kubectl get roles -n production
kubectl get rolebindings -n production
kubectl create role pod-reader --verb=get --verb=list --resource=pods
kubectl create rolebinding read-pods --role=pod-reader --serviceaccount=default:myapp-sa

# Cluster Information
kubectl cluster-info
kubectl get nodes -o wide
kubectl api-resources
kubectl api-versions
kubectl version

# Troubleshooting with Events
kubectl get events -n production --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning

# Resource Validation
kubectl apply -f deployment.yaml --dry-run=client
kubectl apply -f deployment.yaml --dry-run=server
kubectl diff -f deployment.yaml

# Label and Selector Management
kubectl label pods myapp-pod env=production
kubectl label pods myapp-pod env-
kubectl get pods -l env=production
kubectl get pods --selector="app=myapp,tier=frontend"

# Annotations
kubectl annotate deployment myapp description="Production API service"
kubectl annotate deployment myapp description-

# Context and Namespace Management
kubectl config get-contexts
kubectl config use-context production
kubectl config set-context --current --namespace=production
kubectl config view

# Certificate Management (requires the cert-manager CRDs to be installed)
kubectl get certificates -n production
kubectl describe certificate myapp-tls -n production
kubectl get certificaterequests -n production

# Advanced Debugging
kubectl debug node/<node-name> -it --image=ubuntu
kubectl cp <pod-name>:/path/to/file ./local-file
kubectl cp ./local-file <pod-name>:/path/to/file
kubectl attach <pod-name> -c <container-name>

# Resource Export
kubectl get deployment myapp -o yaml > myapp-deployment.yaml
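# "all" covers only common workload types; ConfigMaps, Secrets, Ingresses, and PVCs must be exported separately
kubectl get all -n production -o yaml > production-backup.yaml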
kubectl get secret mysecret -o jsonpath='{.data.password}' | base64 -d

# Kubectl Plugins (krew)
kubectl krew install ctx
kubectl krew install ns
kubectl ctx
kubectl ns
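
The `kubectl autoscale` command in the listing above sets a 70% CPU target with between 2 and 10 replicas. As a rough illustration of how that target turns into a replica count, the sketch below applies the basic scaling ratio documented for the HorizontalPodAutoscaler (desired = ceil(current × currentMetric / targetMetric)), clamped to the configured bounds. The function name and defaults are hypothetical, and the sketch deliberately ignores the controller's tolerance band, readiness handling, and stabilization window.

```python
import math


def desired_replicas(current_replicas: int,
                     current_cpu_percent: float,
                     target_cpu_percent: float = 70.0,
                     min_replicas: int = 2,
                     max_replicas: int = 10) -> int:
    """Approximate the HPA scaling ratio, clamped to [min_replicas, max_replicas].

    Simplification: tolerance and stabilization behaviour are not modelled.
    """
    if current_replicas <= 0 or target_cpu_percent <= 0:
        raise ValueError("replica count and target must be positive")
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(4, 140.0))  # 8: load is twice the target
    print(desired_replicas(2, 10.0))   # 2: never below the minimum
    print(desired_replicas(8, 140.0))  # 10: capped at the maximum
```

Treat the result as an estimate for capacity planning; the live value reported by `kubectl describe hpa myapp` is authoritative.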
 


Terraform Commands

# Initialize
terraform init
terraform init -upgrade

# Plan and apply
terraform plan
terraform plan -out=plan.tfplan
terraform apply
terraform apply -auto-approve
terraform apply plan.tfplan

# Destroy
terraform destroy
terraform destroy -target=aws_instance.example

# State management
terraform state list
terraform state show <resource>
terraform import <resource> <id>
terraform state rm <resource>

# Workspace management
terraform workspace list
terraform workspace new staging
terraform workspace select staging

# Formatting and validation
terraform fmt -recursive
terraform validate
terraform graph | dot -Tsvg > graph.svg

Ansible Commands

# Run playbook
ansible-playbook -i inventory playbook.yml
ansible-playbook -i inventory playbook.yml --check
ansible-playbook -i inventory playbook.yml --tags "deploy"

# Ad-hoc commands
ansible all -i inventory -m ping
ansible webservers -i inventory -a "uptime"
ansible dbservers -i inventory -m service -a "name=postgresql state=restarted"

# Inventory
ansible-inventory -i inventory --list
ansible-inventory -i inventory --graph

# Vault
ansible-vault create secrets.yml
ansible-vault edit secrets.yml
ansible-vault encrypt secrets.yml
ansible-vault decrypt secrets.yml

AWS CLI Commands

# EKS
aws eks list-clusters
aws eks update-kubeconfig --name cluster-name
aws eks describe-cluster --name cluster-name

# RDS
aws rds describe-db-instances
aws rds create-db-snapshot --db-instance-identifier mydb --db-snapshot-identifier mydb-snapshot
aws rds restore-db-instance-from-db-snapshot --db-instance-identifier new-db --db-snapshot-identifier mydb-snapshot

# ECR
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-west-2.amazonaws.com
aws ecr describe-repositories
aws ecr list-images --repository-name myapp

# Secrets Manager
aws secretsmanager list-secrets
aws secretsmanager get-secret-value --secret-id myapp/database
aws secretsmanager put-secret-value --secret-id myapp/database --secret-string '{"password":"newpass"}'

OpenStack Commands

# Authentication
openstack token issue
openstack catalog list

# Compute (Nova)
openstack server list
openstack server show <server-name>
openstack server create --flavor m1.medium --image ubuntu-20.04 --network private myserver
openstack server delete <server-name>
openstack server reboot <server-name>
openstack server resize --flavor m1.large <server-name>

# Flavor management
openstack flavor list
openstack flavor show m1.medium

# Images (Glance)
openstack image list
openstack image show <image-name>
openstack image create --file ubuntu.qcow2 --disk-format qcow2 ubuntu-custom
openstack image delete <image-name>

# Networking (Neutron)
openstack network list
openstack network create private-net
openstack network show private-net
openstack subnet create --network private-net --subnet-range 192.168.1.0/24 private-subnet

# Router management
openstack router list
openstack router create myrouter
openstack router add subnet myrouter private-subnet
openstack router set --external-gateway public myrouter

# Floating IPs
openstack floating ip list
openstack floating ip create public
openstack server add floating ip <server-name> <floating-ip>
openstack server remove floating ip <server-name> <floating-ip>

# Security Groups
openstack security group list
openstack security group create web-sg
openstack security group rule create --protocol tcp --dst-port 80 web-sg
openstack security group rule create --protocol tcp --dst-port 443 web-sg
openstack security group rule list web-sg

# Volumes (Cinder)
openstack volume list
openstack volume create --size 100 myvolume
openstack volume show myvolume
openstack server add volume <server-name> myvolume
openstack server remove volume <server-name> myvolume
openstack volume delete myvolume

# Snapshots
openstack volume snapshot list
openstack volume snapshot create --volume myvolume myvolume-snapshot
openstack volume create --snapshot myvolume-snapshot restored-volume

# Orchestration (Heat)
openstack stack list
openstack stack create -t template.yaml mystack
openstack stack show mystack
openstack stack update -t template.yaml mystack
openstack stack delete mystack
openstack stack resource list mystack

# Object Storage (Swift)
openstack container list
openstack container create mycontainer
openstack object list mycontainer
openstack object create mycontainer file.txt
openstack object save mycontainer file.txt

# Quotas
openstack quota show
openstack quota set --instances 20 --cores 40 --ram 81920 <project-id>

# Projects and Users
openstack project list
openstack project create myproject
openstack user list
# Prompt for the password rather than exposing it in shell history
openstack user create --password-prompt --project myproject myuser
openstack role add --user myuser --project myproject member

# Resource usage
openstack usage list
openstack limits show --absolute
 


OpenStack Heat Templates (IaC)

# Validate template
openstack orchestration template validate -t template.yaml

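# Preview changes (dry run: shows what would be created or changed without applying it)
openstack stack create --dry-run -t template.yaml mystack
openstack stack update --dry-run -t template.yaml mystack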

# Show stack events
openstack stack event list mystack
openstack stack event show mystack <event-id>

# Stack outputs
openstack stack output list mystack
openstack stack output show mystack server_ip

# Suspend and resume
openstack stack suspend mystack
openstack stack resume mystack

# Abandon (remove from control without deleting)
openstack stack abandon mystack
 


OpenStack-Kubernetes Integration

# If using Magnum (Kubernetes on OpenStack)
openstack coe cluster list
openstack coe cluster create k8s-cluster \
  --cluster-template kubernetes-template \
  --master-count 3 \
  --node-count 5

openstack coe cluster show k8s-cluster
openstack coe cluster config k8s-cluster
# resize and upgrade take the node count and the template as positional arguments
openstack coe cluster resize k8s-cluster 10
openstack coe cluster upgrade k8s-cluster new-template
openstack coe cluster delete k8s-cluster

# Get kubeconfig
openstack coe cluster config k8s-cluster --dir ~/.kube
 


Additional Resources

Container Technologies:

Infrastructure as Code:

Cloud Providers:

CI/CD Tools:


Tools and Utilities

Container Security:

  • Trivy - Vulnerability scanner
  • Clair - Static analysis tool
  • Anchore - Container security platform
  • Cosign - Container signing

IaC Testing:

Monitoring & Observability:

Logging:

Development Tools:


Glossary

| Term | Definition |
| --- | --- |
| Blue-Green Deployment | Deployment strategy using two identical environments |
| Canary Deployment | Gradual rollout to subset of users |
| ConfigMap | Kubernetes object for non-sensitive configuration data |
| Container Registry | Storage and distribution system for container images |
| Idempotency | Property where operation produces same result regardless of repetition |
| Infrastructure as Code | Managing infrastructure through code rather than manual processes |
| Multi-stage Build | Docker build technique using multiple FROM statements |
| Rolling Deployment | Gradual replacement of application instances |
| Secret | Kubernetes object for sensitive information |
| Service Mesh | Infrastructure layer for service-to-service communication |
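
Idempotency is the glossary term that most often causes confusion in practice, so a small illustration may help. The sketch below contrasts a declarative "ensure" operation, which can be repeated safely, with an imperative "add" operation, which cannot. Both function names and the in-memory `state` dictionary are hypothetical stand-ins for a real infrastructure API.

```python
def ensure_replicas(state: dict, name: str, replicas: int) -> dict:
    """Declarative: set the desired count. Repeating the call changes nothing."""
    updated = dict(state)  # copy so the caller's state is left untouched
    updated[name] = replicas
    return updated


def add_replicas(state: dict, name: str, extra: int) -> dict:
    """Imperative: every call adds more replicas, so it is not idempotent."""
    updated = dict(state)
    updated[name] = updated.get(name, 0) + extra
    return updated


if __name__ == "__main__":
    start = {"myapp": 2}

    once = ensure_replicas(start, "myapp", 5)
    twice = ensure_replicas(once, "myapp", 5)
    print(once == twice)  # True: running it again has no further effect

    once = add_replicas(start, "myapp", 3)
    twice = add_replicas(once, "myapp", 3)
    print(once == twice)  # False: each run drifts further from the intent
```

Tools such as `kubectl apply`, Terraform, and Ansible modules aim for the first style, which is why re-running them after a partial failure is generally safe.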

Common Acronyms

| Acronym | Full Form |
| --- | --- |
| CD | Continuous Delivery/Deployment |
| CI | Continuous Integration |
| CIDR | Classless Inter-Domain Routing |
| CRD | Custom Resource Definition |
| DR | Disaster Recovery |
| IAM | Identity and Access Management |
| IaC | Infrastructure as Code |
| RBAC | Role-Based Access Control |
| RPO | Recovery Point Objective |
| RTO | Recovery Time Objective |
| SLA | Service Level Agreement |
| TLS | Transport Layer Security |
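
CIDR notation appears in the networking commands earlier in this reference (for example, the `192.168.1.0/24` subnet range). As a quick way to check what a given range covers, the sketch below uses Python's standard `ipaddress` module. The helper name `describe_cidr` is illustrative only.

```python
import ipaddress


def describe_cidr(cidr: str) -> dict:
    """Summarise an IPv4 CIDR block using only the standard library."""
    net = ipaddress.ip_network(cidr)
    return {
        "network": str(net.network_address),
        "netmask": str(net.netmask),
        "broadcast": str(net.broadcast_address),
        "prefix_length": net.prefixlen,
        "addresses": net.num_addresses,
        "first_host": str(next(iter(net.hosts()))),
    }


if __name__ == "__main__":
    info = describe_cidr("192.168.1.0/24")
    print(info["netmask"])    # 255.255.255.0
    print(info["addresses"])  # 256
    # Membership test: is an address inside the range?
    print(ipaddress.ip_address("192.168.1.42") in ipaddress.ip_network("192.168.1.0/24"))  # True
```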

Deployment Checklist Template

Pre-Deployment

Code Quality:

  • Code reviewed and approved
  • All tests passing
  • Code coverage meets requirements
  • Static analysis passed

Security:

  • Security scan completed
  • No critical vulnerabilities
  • Secrets properly managed
  • Access controls verified

Infrastructure:

  • Infrastructure changes reviewed
  • Resource capacity verified
  • Scaling rules configured
  • Monitoring alerts configured

Database:

  • Migration scripts tested
  • Rollback plan documented
  • Backup verified
  • Performance impact assessed

Documentation:

  • Release notes prepared
  • Runbook updated
  • Configuration documented
  • Team notified

During Deployment

Monitoring:

  • Error rates monitored
  • Response times tracked
  • Resource usage checked
  • Logs reviewed

Verification:

  • Health checks passing
  • Smoke tests executed
  • Critical paths verified
  • Database connectivity confirmed

Communication:

  • Status updates provided
  • Stakeholders informed
  • Issue tracker updated
  • Team available

Post-Deployment

Validation:

  • All services healthy
  • Business flows working
  • Performance acceptable
  • No unexpected errors

Documentation:

  • Deployment documented
  • Issues logged
  • Metrics recorded
  • Lessons learned captured

Cleanup:

  • Old resources removed
  • Rollback verified
  • Documentation updated
  • Team debriefed

Incident Response Template

Severity Levels

| Level | Description | Response Time | Escalation |
| --- | --- | --- | --- |
| P1 - Critical | Complete service outage | Immediate | All hands |
| P2 - High | Major feature unavailable | 15 minutes | On-call team |
| P3 - Medium | Minor feature degraded | 1 hour | Assigned team |
| P4 - Low | Cosmetic issue | Next business day | Queue |
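
If the severity levels are enforced by tooling rather than by hand, the table can be encoded as data. The following is a minimal sketch, not a prescribed implementation: the names `RESPONSE_TARGETS`, `ESCALATION`, and `response_deadline` are hypothetical, and "next business day" is assumed to mean the end of the next Monday-to-Friday day, with public holidays ignored.

```python
from datetime import datetime, timedelta

# Response targets taken from the severity table; None marks "next business day".
RESPONSE_TARGETS = {
    "P1": timedelta(0),
    "P2": timedelta(minutes=15),
    "P3": timedelta(hours=1),
    "P4": None,
}

ESCALATION = {
    "P1": "All hands",
    "P2": "On-call team",
    "P3": "Assigned team",
    "P4": "Queue",
}


def response_deadline(level: str, opened_at: datetime) -> datetime:
    """Return the time by which a response is due for the given severity."""
    target = RESPONSE_TARGETS[level]
    if target is not None:
        return opened_at + target
    # Assumption: skip Saturday (5) and Sunday (6); holidays are not modelled.
    day = opened_at + timedelta(days=1)
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return day.replace(hour=23, minute=59, second=59, microsecond=0)


if __name__ == "__main__":
    opened = datetime(2025, 10, 3, 16, 0)  # a Friday afternoon
    print(response_deadline("P2", opened))  # 2025-10-03 16:15:00
    print(response_deadline("P4", opened))  # 2025-10-06 23:59:59 (the following Monday)
    print(ESCALATION["P1"])                 # All hands
```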

Incident Response Steps

1. Acknowledge

  • Incident acknowledged
  • Severity assigned
  • Team notified
  • Status page updated

2. Assess

  • Impact determined
  • Root cause identified
  • Affected systems listed
  • Timeline established

3. Respond

  • Mitigation started
  • Workaround implemented
  • Rollback initiated (if needed)
  • Communication ongoing

4. Recover

  • Service restored
  • Functionality verified
  • Monitoring confirmed
  • Status page updated

5. Review

  • Postmortem scheduled
  • Timeline documented
  • Action items created
  • Process improved

Change Management Template

Change Request

Change Details:

  • Change ID: [AUTO-GENERATED]
  • Requested By: [NAME]
  • Date: [DATE]
  • Environment: [ENVIRONMENT]

Description:

[Detailed description of the change]

Justification:

[Business reason for the change]

Impact Assessment:

  • Systems Affected: [LIST]
  • Users Impacted: [NUMBER/PERCENTAGE]
  • Risk Level: [LOW/MEDIUM/HIGH]

Implementation Plan:

  • Start Time: [DATETIME]
  • Duration: [ESTIMATE]
  • Steps: [NUMBERED LIST]

Rollback Plan:

  • Trigger Conditions: [CONDITIONS]
  • Steps: [NUMBERED LIST]
  • Recovery Time: [ESTIMATE]

Testing:

  • Unit tests passed
  • Integration tests passed
  • UAT completed
  • Performance validated

Approvals:

  • Technical Lead
  • Operations Manager
  • Product Owner
  • Security Team
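
Teams that track change requests in code rather than in documents could model the template along these lines. This is only a sketch under stated assumptions: the `ChangeRequest` class and its field names are hypothetical, and only the risk levels and approver roles come from the template above.

```python
from dataclasses import dataclass, field

# Values taken from the template; everything else here is illustrative.
RISK_LEVELS = ("LOW", "MEDIUM", "HIGH")
REQUIRED_APPROVERS = (
    "Technical Lead",
    "Operations Manager",
    "Product Owner",
    "Security Team",
)


@dataclass
class ChangeRequest:
    change_id: str
    requested_by: str
    environment: str
    risk_level: str
    approvals: set = field(default_factory=set)

    def __post_init__(self) -> None:
        # Reject risk levels the template does not define.
        if self.risk_level not in RISK_LEVELS:
            raise ValueError(f"risk_level must be one of {RISK_LEVELS}")

    def missing_approvals(self) -> list:
        """Approver roles from the template that have not signed off yet."""
        return [role for role in REQUIRED_APPROVERS if role not in self.approvals]

    def ready(self) -> bool:
        """True once every required role has approved."""
        return not self.missing_approvals()


if __name__ == "__main__":
    cr = ChangeRequest("CHG-001", "alice", "staging", "LOW")
    cr.approvals.update({"Technical Lead", "Security Team"})
    print(cr.missing_approvals())  # ['Operations Manager', 'Product Owner']
    print(cr.ready())              # False
```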

Last updated: October 2025