Deployment and Infrastructure as Code

Section Overview

This section covers modern deployment practices: containers, Infrastructure as Code, and automated pipelines that keep application delivery consistent, reliable, and scalable.

Container-based Development and Deployment

Dockerfile Best Practices

Core Principle: Dockerfiles should be optimized for security, efficiency, and maintainability while ensuring consistent and reproducible builds.

Key Guidelines:

Use specific base image versions with SHA digests
Minimize layer count and optimize image size
Follow the principle of least privilege
Implement proper health checks
Leverage multi-stage builds

Why This Matters

Well-structured Dockerfiles ensure consistent environments across all deployment stages, reduce security vulnerabilities, and optimize both build times and runtime performance. They form the foundation of reliable containerized applications.

Layer Optimization Strategy

Implementation:

Bad Practice:

# Multiple layers - inefficient, and the cleanup in the last layer
# does not shrink the layers created before it
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2
RUN rm -rf /var/lib/apt/lists/*

Good Practice:

# Single optimized layer
RUN apt-get update && apt-get install -y \
    package1 \
    package2 \
    && rm -rf /var/lib/apt/lists/*

Best Practice:

# With build arguments and validation
ARG DEBIAN_FRONTEND=noninteractive
# Versions are supplied with --build-arg and must be declared before use
ARG PACKAGE1_VERSION
ARG PACKAGE2_VERSION
RUN apt-get update && apt-get install -y --no-install-recommends \
    package1=${PACKAGE1_VERSION} \
    package2=${PACKAGE2_VERSION} \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && package1 --version \
    && package2 --version

Security Considerations

Critical Security Practices

Run containers as non-root users
Remove unnecessary tools and packages
Scan images for vulnerabilities regularly
Use multi-stage builds to minimize attack surface
Never include secrets in image layers

Example: Secure Container User Setup

# Create non-root user
RUN useradd -r -u 1000 appuser

WORKDIR /app
COPY --from=builder /app .

# Set user before execution
USER appuser

Complete Python Application Example

# Build stage
FROM python:3.11-slim-bullseye@sha256:abc123... AS builder

# Build metadata
ARG APP_VERSION
ARG BUILD_DATE
ARG VCS_REF

LABEL org.opencontainers.image.version="${APP_VERSION}" \
      org.opencontainers.image.created="${BUILD_DATE}" \
      org.opencontainers.image.revision="${VCS_REF}"

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

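# Install Python dependencies into a virtual environment so that they
# can be copied into the runtime stage
COPY requirements.txt .
RUN python -m venv /opt/venv \
    && /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

# Copy and test application (assumes pytest is listed in requirements.txt)
COPY . .
RUN /opt/venv/bin/python -m pytest

# Final stage - minimal runtime image
FROM python:3.11-slim-bullseye@sha256:abc123...

# Create non-root user
RUN useradd -r -u 1000 appuser

WORKDIR /app
# Bring over the installed dependencies along with the application code
COPY --from=builder /opt/venv /opt/venv
COPY --from=builder /app .
ENV PATH="/opt/venv/bin:$PATH"

USER appuser

# Health check (uses Python because curl is only installed in the build stage)
HEALTHCHECK --interval=30s --timeout=3s \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)" || exit 1

ENTRYPOINT ["python"]
CMD ["app.py"]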

Node.js Application with Build Secrets

# syntax=docker/dockerfile:1

# Build stage
FROM node:18-alpine@sha256:def456... AS builder

WORKDIR /app

# Install dependencies, including the devDependencies the build needs.
# The npm token is mounted as a secret and removed from the npm config
# within the same RUN step, so it is never written to an image layer.
COPY package*.json ./
RUN --mount=type=secret,id=npm_token \
    npm config set //registry.npmjs.org/:_authToken="$(cat /run/secrets/npm_token)" \
    && npm ci \
    && npm config delete //registry.npmjs.org/:_authToken

# Build application, then drop devDependencies
COPY . .
RUN npm run build && npm prune --omit=dev

# Final stage
FROM node:18-alpine@sha256:def456...

WORKDIR /app

# Copy built artifacts
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules

# Run as the non-root "node" user that the official image already provides
# (UID/GID 1000; creating a second user with ID 1000 would fail)
USER node

# Health check
HEALTHCHECK --interval=30s --timeout=3s \
    CMD wget -q --spider http://localhost:3000/health || exit 1

CMD ["node", "dist/server.js"]

.dockerignore Best Practices

Optimize Build Context

A well-configured .dockerignore file reduces build context size and speeds up builds significantly.

# Dependencies
node_modules
vendor

# Development files
*.log
npm-debug.log
.env
.env.local

# Version control
.git
.gitignore

# Documentation
*.md
docs/

# Build artifacts
dist
build
*.tar.gz

# IDE files
.vscode
.idea
*.swp

Container Image Management

Image Tagging Strategy

Core Principle: Container images must be versioned, secured, and managed systematically to ensure reliable and traceable deployments.

Recommended Tagging Format:

Tag Type	Format	Example	Use Case
Semantic Version	`v{major}.{minor}.{patch}`	`v1.2.3`	Production releases
Git Commit	`{short-sha}`	`a1b2c3d`	Development tracking
Build Number	`v{version}-b{build}`	`v1.2.3-b456`	CI/CD integration
Environment	`{version}-{env}`	`v1.2.3-staging`	Environment-specific
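The tag formats in the table can be generated mechanically in a build script. The following is a minimal sketch, not a prescribed implementation; the function name `build_tags` and its parameters are assumptions made for illustration.

```python
import re

# Accepts plain MAJOR.MINOR.PATCH versions only (no pre-release suffixes).
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def build_tags(version, commit_sha, build_number=None, env=None):
    """Return the image tags for one build, following the table's formats."""
    if not SEMVER.match(version):
        raise ValueError(f"not a semantic version: {version!r}")
    tags = [f"v{version}", commit_sha[:7]]  # release tag and short SHA
    if build_number is not None:
        tags.append(f"v{version}-b{build_number}")
    if env:
        tags.append(f"v{version}-{env}")
    return tags


print(build_tags("1.2.3", "a1b2c3d4e5f6", build_number=456, env="staging"))
# ['v1.2.3', 'a1b2c3d', 'v1.2.3-b456', 'v1.2.3-staging']
```

Validating the version up front keeps malformed tags out of the registry, where they would also escape pattern-based cleanup rules.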

Registry Management Workflow

Build & Tag:

# Build image
docker build -t myapp:latest .

# Tag for registry
docker tag myapp:latest \
  company-registry.com/team/myapp:1.2.3

docker tag myapp:latest \
  company-registry.com/team/myapp:latest

Push to Registry:

# Authenticate (--password-stdin reads the password from standard input;
# REGISTRY_PASSWORD is an assumed variable name)
echo "$REGISTRY_PASSWORD" | docker login company-registry.com \
  --username="$REGISTRY_USER" \
  --password-stdin

# Push images
docker push company-registry.com/team/myapp:1.2.3
docker push company-registry.com/team/myapp:latest

Sign Images:

# Generate signing keys
cosign generate-key-pair

# Sign the image that was pushed above
cosign sign --key cosign.key \
  company-registry.com/team/myapp:1.2.3

# Verify signature
cosign verify --key cosign.pub \
  company-registry.com/team/myapp:1.2.3

Security Scanning

Mandatory Security Checks

All images must be scanned for vulnerabilities before deployment. Critical and high-severity issues must be resolved.
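One way to enforce this rule outside the scanner itself is to inspect a JSON report. The sketch below assumes a report produced with Trivy's JSON output format, in which findings appear under `Results[].Vulnerabilities[].Severity`; the function name `blocking_findings` is hypothetical.

```python
import json
from collections import Counter


def blocking_findings(report_json, blocking=("CRITICAL", "HIGH")):
    """Count findings per blocking severity in a Trivy-style JSON report."""
    report = json.loads(report_json)
    counts = Counter()
    # "Results" and "Vulnerabilities" may be missing or null when nothing is found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                counts[vuln["Severity"]] += 1
    return dict(counts)


sample = json.dumps({
    "Results": [
        {"Target": "app", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
        ]},
        {"Target": "os", "Vulnerabilities": None},
    ]
})
print(blocking_findings(sample))  # {'HIGH': 1}
```

A deployment gate would fail the pipeline whenever the returned dictionary is non-empty.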

Trivy Configuration Example:

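# trivy.yaml
# Options sit at the top level of the file; vulnerability-specific
# options are nested under `vulnerability` (check the config reference
# for your Trivy version, as key names have changed between releases)
severity:
  - CRITICAL
  - HIGH
vulnerability:
  type:
    - os
    - library
  ignore-unfixed: true
format: table
output: scan-results.txt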

CI Pipeline Integration:

scan-image:
  script:
    # --exit-code 1 makes trivy return a failing status when matching
    # vulnerabilities are found, which fails the job
    - |
      trivy image --severity HIGH,CRITICAL --exit-code 1 \
        company-registry.com/team/app:${CI_COMMIT_SHA}

Registry Cleanup Policies

cleanup:
  policies:
    - name: keep-latest-versions
      rules:
        - type: tag
          pattern: '^v\d+\.\d+\.\d+$'
          action: keep
          amount: 5

    - name: cleanup-feature-branches
      rules:
        - type: tag
          pattern: '^feature-.*$'
          action: delete
          older-than: 7d

    - name: cleanup-development-tags
      rules:
        - type: tag
          pattern: '^dev-.*$'
          action: delete
          older-than: 3d
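The effect of these three rules can be sketched in Python. This is an illustration rather than any registry's actual policy engine: it assumes that "keep 5" means release tags beyond the five most recently pushed are deleted, and the function name `tags_to_delete` and the `(name, pushed_at)` input shape are made up for the example.

```python
import re
from datetime import datetime, timedelta, timezone

RELEASE = re.compile(r"^v\d+\.\d+\.\d+$")


def tags_to_delete(tags, now, keep_releases=5, feature_days=7, dev_days=3):
    """tags: list of (name, pushed_at) tuples. Returns the tag names to delete."""
    # Newest release tags first; everything after the first N is removed.
    releases = sorted(
        (t for t in tags if RELEASE.match(t[0])),
        key=lambda t: t[1],
        reverse=True,
    )
    doomed = [name for name, _ in releases[keep_releases:]]
    for name, pushed_at in tags:
        age = now - pushed_at
        if name.startswith("feature-") and age > timedelta(days=feature_days):
            doomed.append(name)
        elif name.startswith("dev-") and age > timedelta(days=dev_days):
            doomed.append(name)
    return doomed


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
tags = [
    ("v1.0.0", now - timedelta(days=10)),
    ("feature-x", now - timedelta(days=8)),
    ("feature-y", now - timedelta(days=1)),
    ("dev-a", now - timedelta(days=4)),
]
print(tags_to_delete(tags, now))  # ['feature-x', 'dev-a']
```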

Multi-stage Build Optimization

Complex Build Pipeline Example

# Stage 1: Dependencies
FROM node:18-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm \
    npm ci

# Stage 2: Frontend build
FROM node:18-alpine AS frontend-builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY frontend/ .
RUN npm run build

# Stage 3: Backend build
FROM maven:3.8-openjdk-17 AS backend-builder
WORKDIR /app
COPY backend/pom.xml .
RUN --mount=type=cache,target=/root/.m2 \
    mvn dependency:go-offline
COPY backend/ .
RUN mvn package -DskipTests

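# Stage 4: Security scan
# BuildKit skips stages that the final image does not depend on, so run this
# stage explicitly in CI, e.g. docker build --target security-scan .
FROM aquasec/trivy:latest AS security-scan
COPY --from=backend-builder /app/target/*.jar /app/
RUN trivy filesystem --exit-code 1 \
    --severity HIGH,CRITICAL /app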

# Stage 5: Final runtime image
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app

# Copy built artifacts
COPY --from=frontend-builder /app/dist /app/public
COPY --from=backend-builder /app/target/*.jar app.jar

# Non-root user
RUN addgroup -g 1000 appgroup && \
    adduser -u 1000 -G appgroup -s /bin/sh -D appuser
USER appuser

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=3s \
    CMD wget -q --spider http://localhost:8080/actuator/health || exit 1

ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-jar", "app.jar"]

Build Cache Optimization

Cache mounts let repeated builds reuse already-downloaded dependencies instead of fetching them again, which can reduce build times by 50-70% when dependency installation dominates the build.

Docker Compose for Development

Development Environment Setup

Core Principle: Docker Compose should provide a consistent, reproducible local development environment that mirrors production while optimizing for developer experience.

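# docker-compose.yml
# (no top-level `version` key: it is obsolete in the Compose Specification
# and ignored by current Docker Compose releases)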

x-logging: &default-logging
  options:
    max-size: "10m"
    max-file: "3"
  driver: json-file

services:
  app:
    build:
      context: .
      target: development
      args:
        - NODE_ENV=development
    volumes:
      - .:/app:delegated
      - node_modules:/app/node_modules
    ports:
      - "${PORT:-3000}:3000"
      - "9229:9229"  # Debug port
    environment:
      - NODE_ENV=development
      - DATABASE_URL=postgresql://postgres:password@db:5432/myapp
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    logging: *default-logging

  db:
    image: postgres:14-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=myapp
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    logging: *default-logging

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    logging: *default-logging

volumes:
  node_modules:
  postgres_data:
  redis_data:

networks:
  default:
    driver: bridge

Development Overrides

# docker-compose.override.yml
services:
  app:
    command: npm run dev
    environment:
      - DEBUG=app:*

  # Expose ports for local debugging tools
  db:
    ports:
      - "5432:5432"

  redis:
    ports:
      - "6379:6379"

Container Orchestration with Kubernetes

Service Deployment

Core Principle: Container orchestration should automate deployment, scaling, and management of containerized applications while ensuring high availability and optimal resource utilization.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
    environment: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
    spec:
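      securityContext:
        runAsNonRoot: true
        # A numeric UID lets the kubelet verify the non-root requirement even
        # when the image declares a named USER (assumes the app user has UID 1000)
        runAsUser: 1000
        fsGroup: 2000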
      containers:
        - name: myapp
          image: company-registry.com/myapp:1.2.3
          ports:
            - containerPort: 3000
              name: http
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
          startupProbe:
            httpGet:
              path: /health
              port: http
            failureThreshold: 30
            periodSeconds: 10
          env:
            - name: NODE_ENV
              value: "production"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: myapp-secrets
                  key: database-url
          volumeMounts:
            - name: config
              mountPath: /etc/config
              readOnly: true
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: config
          configMap:
            name: myapp-config
        - name: tmp
          emptyDir: {}

Service and Ingress Configuration

The Service manifest comes first, followed by the Ingress manifest.

apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: myapp

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - myapp.example.com
      secretName: myapp-tls
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  name: http

Resource Management

Always define resource requests and limits to ensure fair resource allocation and prevent resource exhaustion on shared clusters.
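A simple pre-deployment check can catch containers that omit requests or limits. The sketch below is an assumption-laden illustration: `parse_quantity` handles only the suffixes used in this section (`m`, `Ki`, `Mi`, `Gi`) plus plain numbers, not the full Kubernetes quantity grammar, and both function names are hypothetical.

```python
from decimal import Decimal

_BINARY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_quantity(q):
    """Convert a quantity such as '100m' or '256Mi' to a Decimal base value."""
    for suffix, factor in _BINARY_UNITS.items():
        if q.endswith(suffix):
            return Decimal(q[:-2]) * factor
    if q.endswith("m"):  # milli-units, e.g. 100m CPU = 0.1 core
        return Decimal(q[:-1]) / 1000
    return Decimal(q)


def check_resources(resources):
    """Return a list of problems found in a container's `resources` mapping."""
    problems = []
    for kind in ("cpu", "memory"):
        req = resources.get("requests", {}).get(kind)
        lim = resources.get("limits", {}).get(kind)
        if req is None or lim is None:
            problems.append(f"{kind}: request and limit must both be set")
        elif parse_quantity(req) > parse_quantity(lim):
            problems.append(f"{kind}: request {req} exceeds limit {lim}")
    return problems


spec = {
    "requests": {"cpu": "100m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "512Mi"},
}
print(check_resources(spec))  # []
```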

Infrastructure as Code with Terraform

Infrastructure Definition Principles

Core Principle: Infrastructure must be defined, versioned, and managed as code, ensuring reproducibility, consistency, and automated provisioning across all environments.

Key Guidelines:

Maintain infrastructure code in version control
Use declarative rather than imperative definitions
Implement modular and reusable components
Follow the principle of idempotency
Document all configuration parameters

Why This Matters

Managing infrastructure as code reduces human error, ensures consistency across environments, and enables automated, repeatable deployments while maintaining a complete audit trail.
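The declarative and idempotent principles can be illustrated with a toy reconciler: describe the desired state, compute the difference from the actual state, and apply only that difference. This is a conceptual sketch with made-up names (`plan`, `apply`), not how Terraform is implemented.

```python
def plan(desired, actual):
    """Compare desired and actual resources (name -> config mappings)."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


def apply(desired, actual):
    """Return the new actual state after carrying out the plan."""
    changes = plan(desired, actual)
    new_state = dict(actual)
    for name in changes["create"] + changes["update"]:
        new_state[name] = desired[name]
    for name in changes["delete"]:
        del new_state[name]
    return new_state


desired = {"vpc": {"cidr": "10.0.0.0/16"}, "db": {"size": "small"}}
actual = {"vpc": {"cidr": "10.1.0.0/16"}, "old-bucket": {}}

print(plan(desired, actual))
# {'create': ['db'], 'update': ['vpc'], 'delete': ['old-bucket']}

state = apply(desired, actual)
print(plan(desired, state))
# {'create': [], 'update': [], 'delete': []}
```

The second plan is empty because applying the same definition again changes nothing, which is the property idempotency refers to.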

Terraform Module Structure

Best Practice Directory Layout:

infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── variables.tf
│   ├── staging/
│   │   ├── main.tf
│   │   └── variables.tf
│   └── prod/
│       ├── main.tf
│       └── variables.tf
├── modules/
│   ├── networking/
│   ├── kubernetes/
│   └── database/
└── shared/
    └── variables.tf

Terraform Implementation Example

# Define explicit variable types and validation
variable "environment" {
  type        = string
  description = "Environment name (dev, staging, or prod)"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "region" {
  type        = string
  description = "AWS region for deployment"
  default     = "us-west-2"
}

# Local variables for common configurations
locals {
  vpc_cidr = {
    dev     = "10.0.0.0/16"
    staging = "10.1.0.0/16"
    prod    = "10.2.0.0/16"
  }

  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    Team        = "platform"
  }
}

# Modular resource organization
module "vpc" {
  source = "./modules/vpc"

  environment = var.environment
  cidr_block  = local.vpc_cidr[var.environment]

  tags = local.common_tags
}

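# Node group sizing per environment, e.g. { prod = { desired = 3, max = 6, min = 2 } }
variable "node_sizes" {
  type        = map(map(number))
  description = "Node group sizes per environment (keys: desired, max, min)"
}

# Dependencies and relationships
module "kubernetes" {
  source = "./modules/kubernetes"

  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids
  cluster_version = "1.24"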

  node_groups = {
    general = {
      desired_size   = lookup(var.node_sizes[var.environment], "desired", 2)
      max_size       = lookup(var.node_sizes[var.environment], "max", 4)
      min_size       = lookup(var.node_sizes[var.environment], "min", 1)
      instance_types = ["t3.medium"]
    }
  }

  depends_on = [module.vpc]
}

Configuration Management with Ansible

Role-based Configuration

Core Principle: Configuration management should be idempotent, role-based, and maintain clear separation between code, configuration, and variables.

# playbook.yml - Main playbook structure
---
- name: Configure application servers
  hosts: app_servers
  become: true
  # The variable is named deploy_env because `environment` is a reserved
  # Ansible keyword and should not be used as a variable name
  vars_files:
    - "vars/{{ deploy_env }}.yml"

  pre_tasks:
    - name: Validate environment variables
      assert:
        that:
          - deploy_env is defined
          - deploy_env in ['dev', 'staging', 'prod']
        msg: "deploy_env must be set to dev, staging, or prod"

  roles:
    - role: common
      tags: ['common', 'setup']

    - role: nginx
      tags: ['web', 'nginx']
      vars:
        nginx_worker_processes: "{{ 'auto' if deploy_env == 'prod' else '2' }}"

    - role: application
      tags: ['app']

  post_tasks:
    - name: Verify configuration
      include_tasks: tasks/verify.yml

Application Role Tasks

# roles/application/tasks/main.yml
---
- name: Install application dependencies
  apt:
    # Passing the whole list installs all packages in a single apt transaction
    name: "{{ application_dependencies }}"
    state: present
    update_cache: yes
  tags: ['install']

- name: Create application directories
  file:
    path: "{{ item }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_group }}"
    mode: '0755'
  loop:
    - "{{ app_config_path }}"
    - "{{ app_data_path }}"
    - "{{ app_log_path }}"
  tags: ['setup']

- name: Configure application service
  template:
    src: application.service.j2
    dest: /etc/systemd/system/application.service
    mode: '0644'
  notify: restart application
  tags: ['config']

- name: Ensure application is running
  systemd:
    name: application
    state: started
    enabled: yes
    # Reload unit files so changes to application.service are picked up
    daemon_reload: yes
  tags: ['service']

Idempotency is Critical

All Ansible tasks should be idempotent - running them multiple times should produce the same result without unintended side effects.
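The same property can be shown outside Ansible. The sketch below is a hypothetical helper (`ensure_line` is not an Ansible API) that checks the current state before acting and reports whether anything changed, which mirrors how idempotent modules report "changed" versus "ok".

```python
import tempfile
from pathlib import Path


def ensure_line(path, line):
    """Ensure `line` is present in the file; return True if the file changed."""
    p = Path(path)
    existing = p.read_text().splitlines() if p.exists() else []
    if line in existing:
        return False  # already in the desired state, nothing to do
    existing.append(line)
    p.write_text("\n".join(existing) + "\n")
    return True


# Demo: the first run changes the file, the second run is a no-op.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "app.conf"
    print(ensure_line(target, "max_connections=100"))  # True
    print(ensure_line(target, "max_connections=100"))  # False
```

A non-idempotent version would append the line on every run and leave duplicates behind.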

Secrets Management

HashiCorp Vault Integration

Core Principle: Sensitive data must never be stored in plain text and should be managed using dedicated secrets management solutions.

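The three examples that follow show, in order, the Vault configuration, the Terraform integration, and application usage. In the Vault policy, {{environment}} and {{application}} are placeholders to substitute with real values.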

# Vault policy
path "secret/data/{{environment}}/{{application}}/*" {
  capabilities = ["read"]
}

path "secret/data/common/*" {
  capabilities = ["read"]
}

provider "vault" {
  address = var.vault_addr
}

data "vault_generic_secret" "db_creds" {
  path = "secret/${var.environment}/database"
}

resource "kubernetes_secret" "application" {
  metadata {
    name      = "app-secrets"
    namespace = var.namespace
  }

  data = {
    DB_PASSWORD = data.vault_generic_secret.db_creds.data["password"]
    DB_USERNAME = data.vault_generic_secret.db_creds.data["username"]
  }
}

import hvac
import os

def get_secrets():
    client = hvac.Client(
        url=os.getenv('VAULT_ADDR'),
        token=os.getenv('VAULT_TOKEN')
    )

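    # read_secret_version takes the path relative to the KV v2 mount point;
    # the client adds the "secret/data/" prefix itself
    response = client.secrets.kv.v2.read_secret_version(
        path=f"{os.getenv('ENVIRONMENT')}/app",
        mount_point="secret"
    )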

    return response['data']['data']

AWS Secrets Manager Integration

import boto3
import json
from botocore.exceptions import ClientError

def get_secret(secret_name, region_name="us-west-2"):
    """
    Retrieve secret from AWS Secrets Manager.
    """
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        response = client.get_secret_value(SecretId=secret_name)

        if 'SecretString' in response:
            return json.loads(response['SecretString'])
        else:
            return response['SecretBinary']

    except ClientError as e:
        if e.response['Error']['Code'] == 'ResourceNotFoundException':
            raise ValueError(f"Secret {secret_name} not found")
        elif e.response['Error']['Code'] == 'InvalidRequestException':
            raise ValueError(f"Invalid request for secret {secret_name}")
        else:
            raise

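# Additional imports for rotation (these belong with the imports at the top)
import secrets
import string


class SecretRotationError(Exception):
    """Raised when a secret rotation attempt fails."""


def generate_secure_password(length=32):
    """Generate a random password from letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return ''.join(secrets.choice(alphabet) for _ in range(length))


def update_database_password(username, new_password):
    """Placeholder: apply the new password in the database (application-specific)."""
    raise NotImplementedError


def rotate_secret(secret_id):
    """
    Rotate database credentials in AWS Secrets Manager.
    """
    client = boto3.client('secretsmanager')

    try:
        # Get current secret value
        response = client.get_secret_value(SecretId=secret_id)
        current_secret = json.loads(response['SecretString'])

        # Generate new credentials
        new_password = generate_secure_password()

        # Update application database
        update_database_password(
            username=current_secret['username'],
            new_password=new_password
        )

        # Update secret in Secrets Manager
        client.put_secret_value(
            SecretId=secret_id,
            SecretString=json.dumps({
                'username': current_secret['username'],
                'password': new_password,
                'host': current_secret['host'],
                'port': current_secret['port']
            })
        )

        return True

    except Exception as e:
        # If the database update succeeded but put_secret_value failed, the
        # stored secret is stale; production code needs a rollback step here
        raise SecretRotationError(f"Failed to rotate secret: {e}") from e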

Deployment Automation and Pipelines

CI/CD Pipeline Design

Core Principle: Deployment pipelines must be automated, reliable, and provide clear visibility into the deployment process while maintaining security and compliance requirements.

Key Pipeline Stages:

Stage	Purpose	Key Actions
Build	Compile and package application	Code compilation, dependency resolution
Test	Validate functionality	Unit tests, integration tests, security scans
Deploy	Release to environment	Environment provisioning, artifact deployment
Verify	Confirm deployment health	Health checks, smoke tests, monitoring

Pipeline Best Practices

Well-designed deployment pipelines ensure reliable, repeatable deployments while reducing human error and maintaining security standards. Every stage should be automated and provide clear feedback.
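The stage ordering and the "clear feedback" requirement can be sketched as a small runner that stops at the first failing stage and reports the status of every stage. The names here (`run_pipeline`, the lambda stand-ins for real stage commands) are illustrative assumptions.

```python
def run_pipeline(stages):
    """Run (name, action) pairs in order; each action returns True on success.

    Returns a list of (name, status) where status is passed, failed, or skipped.
    """
    results = []
    for name, action in stages:
        ok = bool(action())
        results.append((name, "passed" if ok else "failed"))
        if not ok:
            break  # later stages must not run after a failure
    finished = {name for name, _ in results}
    results.extend((name, "skipped") for name, _ in stages if name not in finished)
    return results


stages = [
    ("build", lambda: True),
    ("test", lambda: False),   # simulated test failure
    ("deploy", lambda: True),
    ("verify", lambda: True),
]
for name, status in run_pipeline(stages):
    print(f"{name}: {status}")
```

With the simulated failure in the test stage, deploy and verify are reported as skipped rather than silently omitted.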

GitHub Actions Pipeline Example

# .github/workflows/deployment.yml
name: Deployment Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Login to Container Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

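      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # Also tag with the full commit SHA, which later steps and jobs reference
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=raw,value=${{ github.sha }}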

      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Run Security Scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: 'table'
          exit-code: '1'
          ignore-unfixed: true
          severity: 'CRITICAL,HIGH'

      - name: Run tests
        run: |
          docker run --rm \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            npm test

  deploy-staging:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    environment:
      name: staging
      url: https://staging.example.com

    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Update EKS deployment
        run: |
          aws eks update-kubeconfig --name staging-cluster
          kubectl set image deployment/app-deployment \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          kubectl rollout status deployment/app-deployment --timeout=5m

      - name: Run smoke tests
        run: |
          curl -f https://staging.example.com/health || exit 1

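  deploy-production:
    # Depends on build-and-test rather than deploy-staging: staging is skipped
    # on main, and a job whose dependency was skipped is skipped as well
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'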
    environment:
      name: production
      url: https://api.example.com

    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.PROD_AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.PROD_AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Deploy to Production
        run: |
          aws eks update-kubeconfig --name prod-cluster
          kubectl set image deployment/app-deployment \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          kubectl rollout status deployment/app-deployment --timeout=10m

      - name: Verify deployment
        run: |
          # Health check
          curl -f https://api.example.com/health || exit 1

          # Monitor error rates for 5 minutes
          sleep 300

      - name: Notify team
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment: ${{ job.status }}",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "Deployment to production: *${{ job.status }}*\nCommit: ${{ github.sha }}"
                  }
                }
              ]
            }
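        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          # Needed by slack-github-action v1 when posting to an incoming webhook
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK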

Jenkins Pipeline Example

// Jenkinsfile
pipeline {
    agent any

    environment {
        DOCKER_REGISTRY = 'company-registry.com'
        APP_NAME = 'myapp'
        VERSION = sh(script: 'git describe --tags --always', returnStdout: true).trim()
        KUBECONFIG = credentials('kubeconfig-prod')
    }

    stages {
        stage('Build') {
            steps {
                script {
                    docker.build("${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}")
                }
            }
        }

        stage('Test') {
            parallel {
                stage('Unit Tests') {
                    steps {
                        sh 'npm test'
                        junit 'test-results/**/*.xml'
                    }
                }
                stage('Integration Tests') {
                    steps {
                        sh 'npm run integration-test'
                    }
                }
                stage('Security Scan') {
                    steps {
                        sh """
                            trivy image \
                              --severity HIGH,CRITICAL \
                              --exit-code 1 \
                              ${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}
                        """
                    }
                }
            }
        }

        stage('Push Image') {
            steps {
                script {
                    docker.withRegistry("https://${DOCKER_REGISTRY}", 'registry-credentials') {
                        docker.image("${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}").push()
                        docker.image("${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}").push('latest')
                    }
                }
            }
        }

        stage('Deploy to Staging') {
            when { branch 'develop' }
            steps {
                script {
                    deployToEnvironment(
                        environment: 'staging',
                        version: VERSION
                    )
                }
            }
        }

        stage('Deploy to Production') {
            when { branch 'main' }
            input {
                message 'Deploy to production?'
                ok 'Yes, deploy!'
            }
            steps {
                script {
                    deployToEnvironment(
                        environment: 'production',
                        version: VERSION
                    )
                }
            }
        }

        stage('Verify Deployment') {
            // Only verify production when production was actually deployed
            when { branch 'main' }
            steps {
                script {
                    sh """
                        kubectl rollout status deployment/${APP_NAME} -n production
                        curl -f https://api.example.com/health || exit 1
                    """
                }
            }
        }
    }

    post {
        success {
            slackSend(
                channel: '#deployments',
                color: 'good',
                message: "Deployment successful: ${APP_NAME}:${VERSION}"
            )
        }
        failure {
            slackSend(
                channel: '#deployments',
                color: 'danger',
                message: "Deployment failed: ${APP_NAME}:${VERSION}"
            )
        }
        always {
            cleanWs()
        }
    }
}

// Helper function for deployment
def deployToEnvironment(Map config) {
    sh """
        kubectl config use-context ${config.environment}
        kubectl set image deployment/${APP_NAME} \
          app=${DOCKER_REGISTRY}/${APP_NAME}:${config.version} \
          -n ${config.environment}
        kubectl rollout status deployment/${APP_NAME} \
          -n ${config.environment} \
          --timeout=5m
    """
}

Pipeline Optimization

Use parallel stages for tests and scans to reduce total pipeline execution time. Cache dependencies between runs to speed up builds.

Deployment Strategies

Blue-Green Deployment

Principle: Maintain two identical production environments, switching traffic between them to achieve zero-downtime deployments.
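The traffic switch itself reduces to two small decisions: which environment is currently idle, and what selector change points the Service at it. The sketch below assumes the Service selects pods by a `version` label with the values `blue` and `green`; the function names are hypothetical.

```python
import json


def inactive_color(active):
    """Return the environment that is not currently receiving traffic."""
    if active not in ("blue", "green"):
        raise ValueError(f"unexpected active environment: {active!r}")
    return "green" if active == "blue" else "blue"


def selector_patch(color):
    """Build the JSON patch body that points the Service selector at `color`."""
    return json.dumps({"spec": {"selector": {"version": color}}})


target = inactive_color("blue")
print(target)                  # green
print(selector_patch(target))  # {"spec": {"selector": {"version": "green"}}}
```

Rolling back is the same operation with the colors reversed, which is why rollback is effectively instant as long as the previous environment is left running.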

Key Benefits:

Instant, simple rollback by switching traffic back to the previous environment
Zero downtime during deployment
Full production environment testing before the switch

The Kubernetes manifests come first, followed by the deployment script (blue-green-deploy.sh).

# Blue deployment (current)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: app
          image: myapp:1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

---
# Green deployment (new)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: myapp:2.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

---
# Service that switches between blue and green
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
    - port: 80
      targetPort: 8080

#!/bin/bash
# blue-green-deploy.sh

set -euo pipefail

NAMESPACE="production"
APP_NAME="myapp"        # Image name and value of the `app` label
RESOURCE_PREFIX="app"   # Manifests above are named app-blue, app-green, app-service
NEW_VERSION="${1:?Usage: $0 <new-version>}"

# Determine current active environment
CURRENT=$(kubectl get service "${RESOURCE_PREFIX}-service" -n "${NAMESPACE}" \
  -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT" = "blue" ]; then
  NEW_ENV="green"
  OLD_ENV="blue"
else
  NEW_ENV="blue"
  OLD_ENV="green"
fi

echo "Current environment: $CURRENT"
echo "Deploying to: $NEW_ENV"

# Deploy new version to inactive environment
kubectl set image "deployment/${RESOURCE_PREFIX}-${NEW_ENV}" \
  "app=${APP_NAME}:${NEW_VERSION}" \
  -n "${NAMESPACE}"

# Wait for deployment to be ready
kubectl rollout status "deployment/${RESOURCE_PREFIX}-${NEW_ENV}" \
  -n "${NAMESPACE}" \
  --timeout=5m

# Run smoke tests (assumes curl is available in the application image)
echo "Running smoke tests..."
POD=$(kubectl get pod -n "${NAMESPACE}" \
  -l "app=${APP_NAME},version=${NEW_ENV}" \
  -o jsonpath='{.items[0].metadata.name}')

# Test the command directly: under `set -e` a separate `$?` check
# would never be reached when the smoke test fails
if kubectl exec -n "${NAMESPACE}" "${POD}" -- curl -f http://localhost:8080/health; then
  echo "Smoke tests passed. Switching traffic..."

  # Switch service to new environment
  kubectl patch service "${RESOURCE_PREFIX}-service" -n "${NAMESPACE}" \
    -p "{\"spec\":{\"selector\":{\"version\":\"${NEW_ENV}\"}}}"

  echo "Traffic switched to ${NEW_ENV}"
  echo "Monitor for 5 minutes before removing old deployment"

  sleep 300

  # Optional: Scale down old environment
  # kubectl scale deployment/${RESOURCE_PREFIX}-${OLD_ENV} \
  #   --replicas=0 -n ${NAMESPACE}
else
  echo "Smoke tests failed. Keeping ${OLD_ENV} active"
  exit 1
fi
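
The environment selection and traffic switch in the script can be checked without a cluster. The following is a minimal Python sketch, not part of the original tooling: `plan_switch` is a hypothetical helper that mirrors the script's blue/green branch and builds the same JSON body that the script passes to `kubectl patch`.

```python
import json


def plan_switch(current: str) -> dict:
    """Mirror the script's branch: deploy to whichever environment is not active."""
    # Any value other than "blue" (including an empty selector) targets blue,
    # exactly as the shell if/else does.
    if current == "blue":
        new_env, old_env = "green", "blue"
    else:
        new_env, old_env = "blue", "green"
    # Patch body that repoints the Service selector at the new environment.
    patch = {"spec": {"selector": {"version": new_env}}}
    return {"new_env": new_env, "old_env": old_env, "patch": json.dumps(patch)}


if __name__ == "__main__":
    print(plan_switch("blue"))
```

Keeping this decision in a pure function makes it easy to unit test the switch direction before any `kubectl` command runs.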

Canary Deployment¶

Principle: Release changes incrementally to a subset of users, monitoring for issues before full rollout.

Key Benefits:

Reduced blast radius of failures
Early detection of issues
Gradual traffic shifting
Data-driven rollout decisions

# Using Istio for canary deployment
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: app-vsvc
spec:
  hosts:
    - app.example.com
  http:
    - match:
        - headers:
            canary:
              exact: "true"
      route:
        - destination:
            host: app-canary
            subset: v2
    - route:
        - destination:
            host: app-stable
            subset: v1
          weight: 90
        - destination:
            host: app-canary
            subset: v2
          weight: 10
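
The rule order in the VirtualService above (header match first, weighted split second) can be illustrated with a small sketch. This is an illustration only, under stated assumptions: `route` is a hypothetical function, and it hashes a request id into a bucket so results are repeatable, whereas a real service mesh applies the weights per request on its own terms.

```python
import hashlib


def route(request_id: str, headers: dict, canary_weight: int = 10) -> str:
    """Pick a destination in the same order as the VirtualService rules."""
    # Rule 1: an explicit `canary: "true"` header always goes to the canary.
    if headers.get("canary") == "true":
        return "app-canary"
    # Rule 2: weighted split. Hash the request id into a bucket 0..99 so the
    # same id always lands on the same side (an assumption made for testability).
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return "app-canary" if bucket < canary_weight else "app-stable"


if __name__ == "__main__":
    sample = [route(f"req-{i}", {}) for i in range(10_000)]
    share = sample.count("app-canary") / len(sample)
    print(f"canary share: {share:.1%}")
```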

Canary Monitoring Configuration:

# prometheus/canary-rules.yaml
groups:
  - name: canary-deployment
    interval: 30s
    rules:
      - alert: CanaryErrorRateHigh
        expr: |
          sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{version="canary"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Canary deployment error rate too high"
          description: "Canary version showing {{ $value | humanizePercentage }} error rate"

      - alert: CanaryLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket{version="canary"}[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Canary deployment latency too high"
          description: "Canary p95 latency: {{ $value }}s"

Canary Rollback

Always define clear success criteria before starting a canary deployment. Automate rollback when metrics exceed thresholds.
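
One way to make the success criteria explicit is to encode them as a small decision function. The sketch below is an assumption-laden example rather than a prescribed implementation: `CanaryMetrics` and `canary_decision` are hypothetical names, the thresholds reuse the 5% error rate and 1 second p95 latency from the alert rules above, and mapping the critical alert to "rollback" and the warning alert to "hold" is an illustrative choice.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    error_rate: float      # Fraction of requests returning 5xx, e.g. 0.02
    p95_latency_s: float   # 95th percentile latency in seconds


def canary_decision(
    metrics: CanaryMetrics,
    max_error_rate: float = 0.05,
    max_p95_latency_s: float = 1.0,
) -> str:
    """Return "rollback", "hold", or "promote" for the observed metrics."""
    # Error rate corresponds to the critical alert: stop and roll back.
    if metrics.error_rate > max_error_rate:
        return "rollback"
    # Latency corresponds to the warning alert: pause the traffic increase.
    if metrics.p95_latency_s > max_p95_latency_s:
        return "hold"
    return "promote"


if __name__ == "__main__":
    print(canary_decision(CanaryMetrics(error_rate=0.08, p95_latency_s=0.4)))
```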

Rolling Deployment¶

Principle: Gradually replace instances of the application with new versions while maintaining service availability.

Key Benefits:

No infrastructure duplication needed
Simple implementation
Rollout stalls instead of proceeding when new pods fail readiness checks (roll back with kubectl rollout undo)
Resource efficient

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
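spec:
  replicas: 5
  selector:              # Required for apps/v1 Deployments
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max new pods above desired
      maxUnavailable: 0  # Max pods that may be unavailable during the update
  template:
    metadata:
      labels:
        app: myapp       # Must match spec.selector.matchLabels
    spec: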
      containers:
        - name: app
          image: myapp:2.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10

Rolling Update Process:

New pod is created
Wait for new pod to be ready
Old pod is terminated
Repeat until all pods updated
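
The process above, together with the `maxSurge` and `maxUnavailable` settings, can be traced with a short simulation. This is a simplified sketch, assuming every new pod becomes ready immediately; `simulate_rolling_update` is a hypothetical helper, not how the Deployment controller is implemented.

```python
def simulate_rolling_update(replicas: int, max_surge: int = 1, max_unavailable: int = 0):
    """Return the (old_pods, new_pods) counts after each step of the rollout."""
    old, new = replicas, 0
    history = [(old, new)]
    while old > 0 or new < replicas:
        # Create new pods while the total stays within replicas + max_surge.
        create = min(replicas - new, replicas + max_surge - (old + new))
        new += create
        if create:
            history.append((old, new))
        # Terminate old pods while at least replicas - max_unavailable remain.
        remove = min(old, (old + new) - (replicas - max_unavailable))
        old -= remove
        if remove:
            history.append((old, new))
        if not create and not remove:
            # With both settings at 0 the rollout could never make progress.
            raise ValueError("max_surge and max_unavailable cannot both be 0")
    return history


if __name__ == "__main__":
    for old_pods, new_pods in simulate_rolling_update(5, max_surge=1, max_unavailable=0):
        print(f"old={old_pods} new={new_pods} total={old_pods + new_pods}")
```

With the settings from the manifest above, the total pod count alternates between 5 and 6, so serving capacity never drops below the desired replica count.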

Environment Management¶

Environment Parity¶

Core Principle: Development, staging, and production environments must maintain maximum parity to ensure reliable testing and deployment processes.

Key Guidelines:

Use identical configuration structures across environments
Maintain consistent versions of all dependencies
Implement similar scaling and redundancy patterns
Use production-like data in lower environments
Automate environment provisioning

Why Parity Matters

Environment parity minimizes "it works on my machine" issues and ensures that testing in lower environments accurately predicts production behavior. When environments differ, bugs may only surface in production, leading to costly incidents.

Configuration Management Structure¶

Base Configuration | Production Override | Staging Override | Development Override

# config/base/config.yaml
app:
  name: myservice
  version: 1.0.0

database:
  type: postgresql
  pool:
    min: 5
    max: 20
    idle_timeout: 300

logging:
  level: info
  format: json

cache:
  ttl: 3600
  max_entries: 10000

# config/environments/production.yaml
extends: ../base/config.yaml

app:
  replicas: 3
  resources:
    cpu: 1000m
    memory: 2Gi

database:
  host: prod-db.example.com
  pool:
    max: 50
  ssl: true

logging:
  level: warn

# config/environments/staging.yaml
extends: ../base/config.yaml

app:
  replicas: 2
  resources:
    cpu: 500m
    memory: 1Gi

database:
  host: staging-db.example.com
  pool:
    max: 30

logging:
  level: info

# config/environments/development.yaml
extends: ../base/config.yaml

app:
  replicas: 1
  resources:
    cpu: 250m
    memory: 512Mi

database:
  host: localhost
  pool:
    max: 10

logging:
  level: debug
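
Note that `extends` is not a YAML feature; whatever loads these files has to resolve it. A minimal sketch of how such a loader might combine the base file with an override is shown below. It assumes deep-merge semantics (nested keys are merged, scalars are replaced) and uses plain dictionaries in place of parsed YAML; `deep_merge` is a hypothetical helper.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override values win and nested dicts are merged."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Values copied from config/base/config.yaml and production.yaml above.
base = {
    "database": {"type": "postgresql", "pool": {"min": 5, "max": 20, "idle_timeout": 300}},
    "logging": {"level": "info", "format": "json"},
}
production = {
    "database": {"host": "prod-db.example.com", "pool": {"max": 50}, "ssl": True},
    "logging": {"level": "warn"},
}

config = deep_merge(base, production)

if __name__ == "__main__":
    print(config["database"]["pool"])
    print(config["logging"])
```

Under these semantics the production pool keeps `min` and `idle_timeout` from the base file and only `max` changes, which is why overrides can stay short.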

Environment Variable Management¶

Best Practices:

Variable Type	Storage Method	Example
Non-sensitive Config	ConfigMaps	Feature flags, API URLs
Sensitive Data	Secrets Manager	Database passwords, API keys
Environment-specific	Environment files	Resource limits, replica counts
Build-time	Build arguments	Version numbers, build dates

Kubernetes ConfigMap and Secrets¶

ConfigMap | Secrets | Using in Deployment

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  APP_LOG_LEVEL: "info"
  APP_CACHE_TTL: "3600"
  APP_API_VERSION: "v1"
  METRICS_ENABLED: "true"
  FEATURE_NEW_UI: "true"
  MAX_UPLOAD_SIZE: "10485760"

apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
  namespace: production
type: Opaque
data:
  DB_PASSWORD: <base64-encoded>
  API_KEY: <base64-encoded>
  JWT_SECRET: <base64-encoded>

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
        - name: app
          image: myapp:1.0.0
          envFrom:
            - configMapRef:
                name: app-config
            - secretRef:
                name: app-secrets
          env:
            - name: ENVIRONMENT
              value: "production"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
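
The `<base64-encoded>` placeholders in the Secret manifest stand for base64-encoded values. A minimal sketch of producing them follows; `encode_secret_data` is a hypothetical helper and the password is a made-up example value. Base64 is an encoding, not encryption, so the resulting manifest still needs to be kept out of version control or encrypted at rest.

```python
import base64


def encode_secret_data(values: dict) -> dict:
    """Base64-encode each value for use under `data:` in a Secret manifest."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


if __name__ == "__main__":
    encoded = encode_secret_data({"DB_PASSWORD": "example-password"})
    for key, value in encoded.items():
        print(f"{key}: {value}")
```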

Local Development Environment¶

Comprehensive Docker Compose Setup¶

# docker-compose.dev.yml
version: '3.8'

x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"
    max-file: "3"

services:
  app:
    build:
      context: .
      target: development
      args:
        NODE_ENV: development
    volumes:
      - .:/app:delegated
      - /app/node_modules
      - ${HOME}/.aws:/root/.aws:ro
    environment:
      - NODE_ENV=development
      - DB_HOST=db
      - DB_PORT=5432
      - DB_NAME=appdb
      - DB_USER=devuser
      - DB_PASSWORD=devpass
      - REDIS_HOST=cache
      - REDIS_PORT=6379
      - AWS_PROFILE=${AWS_PROFILE:-default}
    ports:
      - "${PORT:-3000}:3000"
      - "9229:9229"  # Node.js debugger
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    logging: *default-logging

  db:
    image: postgres:14-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: devuser
      POSTGRES_PASSWORD: devpass
      POSTGRES_INITDB_ARGS: "-E UTF8"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d:ro
      - ./backups:/backups
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U devuser -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    logging: *default-logging

  cache:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redisdata:/data
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    logging: *default-logging

  # Development-only services
  mailhog:
    image: mailhog/mailhog:latest
    ports:
      - "1025:1025"  # SMTP
      - "8025:8025"  # Web UI
    logging: *default-logging

  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    environment:
      ADMINER_DEFAULT_SERVER: db
    depends_on:
      - db
    logging: *default-logging

volumes:
  pgdata:
    driver: local
  redisdata:
    driver: local

networks:
  default:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16

Development Environment Setup Script¶

#!/bin/bash
# setup-dev.sh

set -e

echo "Setting up local development environment..."

# Check prerequisites
check_prerequisites() {
    echo "Checking prerequisites..."

    command -v docker >/dev/null 2>&1 || {
        echo "Error: Docker is not installed"
        exit 1
    }

    command -v docker-compose >/dev/null 2>&1 || {
        echo "Error: Docker Compose is not installed"
        exit 1
    }

    echo "Prerequisites satisfied"
}

# Create necessary directories
setup_directories() {
    echo "Creating project directories..."
    mkdir -p backups
    mkdir -p init-scripts
    mkdir -p logs
    echo "Directories created"
}

# Copy environment template
setup_env_file() {
    if [ ! -f .env ]; then
        echo "Creating .env file..."
        cp .env.example .env
        echo ".env file created"
        echo "WARNING: Please review and update .env with your settings"
    else
        echo ".env file already exists"
    fi
}

# Start services
start_services() {
    echo "Starting Docker services..."
    docker-compose -f docker-compose.dev.yml up -d
    echo "Services started"
}

# Wait for services
wait_for_services() {
    echo "Waiting for services to be healthy..."

    max_attempts=30
    attempt=0

    while [ $attempt -lt $max_attempts ]; do
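        status=$(docker-compose -f docker-compose.dev.yml ps)
        # A plain grep for "healthy" also matches "unhealthy" and passes as soon
        # as one service is up, so require a healthy service and no pending ones
        if echo "$status" | grep -q "(healthy)" && \
           ! echo "$status" | grep -Eq '\((unhealthy|health: starting)\)'; then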
            echo "Services are healthy"
            return 0
        fi

        attempt=$((attempt + 1))
        echo "Waiting... ($attempt/$max_attempts)"
        sleep 2
    done

    echo "Error: Services did not become healthy in time"
    return 1
}

# Run database migrations
run_migrations() {
    echo "Running database migrations..."
    docker-compose -f docker-compose.dev.yml exec -T app npm run migrate
    echo "Migrations completed"
}

# Seed development data
seed_data() {
    echo "Seeding development data..."
    docker-compose -f docker-compose.dev.yml exec -T app npm run seed
    echo "Data seeded"
}

# Print access information
print_info() {
    echo ""
    echo "========================================="
    echo "Development environment is ready!"
    echo "========================================="
    echo ""
    echo "Application: http://localhost:3000"
    echo "Database UI: http://localhost:8080"
    echo "Mail Server: http://localhost:8025"
    echo ""
    echo "Useful commands:"
    echo "  docker-compose -f docker-compose.dev.yml logs -f    # View logs"
    echo "  docker-compose -f docker-compose.dev.yml down       # Stop services"
    echo "  docker-compose -f docker-compose.dev.yml restart    # Restart services"
    echo ""
}

# Main execution
main() {
    check_prerequisites
    setup_directories
    setup_env_file
    start_services
    wait_for_services || exit 1
    run_migrations
    seed_data
    print_info
}

main

Development Productivity

Use volume mounts for hot reloading during development. This allows code changes to be reflected immediately without rebuilding containers.

Infrastructure Testing¶

Testing Layers¶

Core Principle: Infrastructure code must be validated through multiple testing layers, ensuring both correctness and compliance before any deployment.

Static Analysis | Unit Testing | Integration Testing | Compliance Testing

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.50.0
    hooks:
      - id: terraform_fmt
      - id: terraform_docs
      - id: terraform_tflint
      - id: terraform_tfsec
      - id: terraform_validate

  - repo: https://github.com/bridgecrewio/checkov
    rev: 2.0.0
    hooks:
      - id: checkov
        args: [--directory, .]

# test_infrastructure.py
import pytest
from infrastructure.validators import validate_vpc_config

def test_vpc_configuration():
    """Test VPC configuration validation."""
    config = {
        'cidr_block': '10.0.0.0/16',
        'region': 'us-west-2',
        'availability_zones': 3
    }

    result = validate_vpc_config(config)

    assert result['subnet_count'] == 6
    assert result['nat_gateway_count'] == 3
    assert result['valid'] is True

def test_invalid_cidr_block():
    """Test validation catches invalid CIDR blocks."""
    config = {
        'cidr_block': '10.0.0.0/8',  # Too large
        'region': 'us-west-2'
    }

    with pytest.raises(ValueError):
        validate_vpc_config(config)
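
The tests above import `validate_vpc_config` from `infrastructure.validators`, which is not shown. The following is one hypothetical implementation that would satisfy them, offered as a sketch only: it assumes two subnets (one public, one private) and one NAT gateway per availability zone, and it rejects CIDR blocks outside the /16 to /28 range that AWS accepts for a VPC.

```python
import ipaddress


def validate_vpc_config(config: dict) -> dict:
    """Validate a VPC config dict and derive subnet and NAT gateway counts."""
    # ip_network raises ValueError for malformed CIDR strings.
    network = ipaddress.ip_network(config["cidr_block"])
    if not 16 <= network.prefixlen <= 28:
        raise ValueError(
            f"cidr_block {config['cidr_block']} must be between /16 and /28"
        )
    az_count = config.get("availability_zones", 1)
    return {
        "valid": True,
        "subnet_count": az_count * 2,   # Assumption: public + private per AZ
        "nat_gateway_count": az_count,  # Assumption: one NAT gateway per AZ
    }


if __name__ == "__main__":
    print(validate_vpc_config(
        {"cidr_block": "10.0.0.0/16", "region": "us-west-2", "availability_zones": 3}
    ))
```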

// infrastructure_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestKubernetesCluster(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../infrastructure/kubernetes",
        Vars: map[string]interface{}{
            "environment": "test",
            "region":      "us-west-2",
        },
    }

    defer terraform.Destroy(t, terraformOptions)

    terraform.InitAndApply(t, terraformOptions)

    clusterName := terraform.Output(t, terraformOptions, "cluster_name")
    assert.NotEmpty(t, clusterName)

    nodeCount := terraform.OutputList(t, terraformOptions, "node_groups")
    assert.GreaterOrEqual(t, len(nodeCount), 1)
}

# compliance-tests.yaml
tests:
  - name: ensure_encryption_at_rest
    resource_type: aws_rds_cluster
    assertions:
      - property: storage_encrypted
        operator: equals
        value: true
        severity: critical

  - name: verify_vpc_flow_logs
    resource_type: aws_vpc
    assertions:
      - property: enable_flow_logs
        operator: equals
        value: true
        severity: high

  - name: check_s3_versioning
    resource_type: aws_s3_bucket
    assertions:
      - property: versioning.enabled
        operator: equals
        value: true
        severity: medium

  - name: verify_backup_retention
    resource_type: aws_db_instance
    assertions:
      - property: backup_retention_period
        operator: greater_than
        value: 7
        severity: high
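
The compliance file above is a declarative format, so something has to evaluate it against real resource attributes. The source does not name the tool, so the sketch below is a hypothetical evaluator: `evaluate` and `get_property` are illustrative names, resources are plain dictionaries, and only the two operators used above are supported.

```python
import operator

OPERATORS = {"equals": operator.eq, "greater_than": operator.gt}


def get_property(resource: dict, dotted_path: str):
    """Resolve a dotted path such as "versioning.enabled"; None if missing."""
    value = resource
    for part in dotted_path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def evaluate(test: dict, resource: dict) -> list:
    """Return a list of failure messages; an empty list means compliant."""
    failures = []
    for assertion in test["assertions"]:
        actual = get_property(resource, assertion["property"])
        compare = OPERATORS[assertion["operator"]]
        # A missing property counts as a failure rather than raising.
        if actual is None or not compare(actual, assertion["value"]):
            failures.append(
                f"{test['name']}: {assertion['property']} is {actual!r} "
                f"(severity {assertion['severity']})"
            )
    return failures


if __name__ == "__main__":
    retention_test = {
        "name": "verify_backup_retention",
        "assertions": [{"property": "backup_retention_period",
                        "operator": "greater_than", "value": 7, "severity": "high"}],
    }
    print(evaluate(retention_test, {"backup_retention_period": 7}))
```

Note that `greater_than 7` rejects a retention period of exactly 7 days, which is worth confirming against the intended policy.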

Environment Decommissioning¶

Systematic Cleanup Process¶

Core Principle: Environment decommissioning must be systematic, ensuring proper cleanup of resources, data preservation, and documentation updates.

Decommissioning Checklist:

Phase	Tasks	Verification
Planning	Create decommission plan, notify stakeholders	Plan reviewed and approved
Backup	Backup critical data, export configurations	Backups validated and accessible
Cleanup	Remove Kubernetes resources, cloud infrastructure	All resources removed
Documentation	Update docs, remove access controls	Documentation current
Verification	Validate complete removal, cost verification	No remaining resources or costs

Decommissioning Script¶

#!/bin/bash
# decommission-environment.sh

set -e

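ENVIRONMENT=$1
# Keep DRY_RUN empty unless dry-run mode is requested: the ${DRY_RUN:+flag}
# expansions below add their flag whenever DRY_RUN is non-empty, including "false"
if [ "${2:-false}" = "true" ]; then
    DRY_RUN="true"
else
    DRY_RUN=""
fi

if [ -z "$ENVIRONMENT" ]; then
    echo "Usage: $0 <environment> [true|false]"
    exit 1
fi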

echo "Decommissioning environment: $ENVIRONMENT"

if [ "$DRY_RUN" = "true" ]; then
    echo "WARNING: DRY RUN MODE - No actual changes will be made"
fi

# Step 1: Backup critical data
backup_data() {
    echo "Creating final backups..."

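    # --no-execute is not an AWS CLI option, so skip the snapshot in dry-run mode
    if [ "$DRY_RUN" = "true" ]; then
        echo "[dry-run] Would create snapshot ${ENVIRONMENT}-final-$(date +%Y%m%d)"
    else
        aws rds create-db-snapshot \
            --db-instance-identifier "${ENVIRONMENT}-db" \
            --db-snapshot-identifier "${ENVIRONMENT}-final-$(date +%Y%m%d)"
    fi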

    kubectl get all -n ${ENVIRONMENT} -o yaml > ${ENVIRONMENT}-backup.yaml

    echo "Backups completed"
}

# Step 2: Scale down services
scale_down_services() {
    echo "Scaling down services..."

    kubectl scale deployment --all --replicas=0 -n ${ENVIRONMENT} \
        ${DRY_RUN:+--dry-run=client}

    echo "Services scaled down"
}

# Step 3: Remove Kubernetes resources
cleanup_kubernetes() {
    echo "Removing Kubernetes resources..."

    kubectl delete namespace ${ENVIRONMENT} \
        ${DRY_RUN:+--dry-run=client}

    echo "Kubernetes resources removed"
}

# Step 4: Remove cloud infrastructure
cleanup_cloud_resources() {
    echo "Removing cloud infrastructure..."

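    # Run in a subshell so the directory change does not leak into later steps
    # (the backup file and report are written relative to the start directory)
    (
        cd "infrastructure/environments/${ENVIRONMENT}"

        if [ "$DRY_RUN" = "true" ]; then
            terraform plan -destroy
        else
            terraform destroy -auto-approve
        fi
    )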

    echo "Cloud resources removed"
}

# Step 5: Remove DNS entries
cleanup_dns() {
    echo "Removing DNS entries..."

    # Remove Route53 records (HOSTED_ZONE_ID must be provided via the environment)
    aws route53 list-resource-record-sets \
        --hosted-zone-id "${HOSTED_ZONE_ID:?HOSTED_ZONE_ID must be set}" \
        --query "ResourceRecordSets[?contains(Name, '${ENVIRONMENT}')]" \
        | jq -r '.[] | .Name' \
        | while read -r record; do
            echo "Removing DNS record: $record"
            # Add deletion logic here
        done

    echo "DNS entries removed"
}

# Step 6: Revoke access
revoke_access() {
    echo "Revoking access credentials..."

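    # Revoke IAM roles. Text output is tab-separated, so split it into lines.
    # Note: attached policies must be detached before a role can be deleted.
    aws iam list-roles --query "Roles[?contains(RoleName, '${ENVIRONMENT}')].RoleName" \
        --output text | tr '\t' '\n' | while read -r role; do
            [ -n "$role" ] || continue
            if [ "$DRY_RUN" = "true" ]; then
                echo "[dry-run] Would remove role: $role"
            else
                echo "Removing role: $role"
                aws iam delete-role --role-name "$role"
            fi
        done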

    # Delete service accounts
    kubectl delete serviceaccount --all -n ${ENVIRONMENT} \
        ${DRY_RUN:+--dry-run=client}

    echo "Access revoked"
}

# Step 7: Generate decommission report
generate_report() {
    echo "Generating decommission report..."

    cat > ${ENVIRONMENT}-decommission-report.md <<EOF
# Environment Decommission Report

**Environment:** ${ENVIRONMENT}
**Date:** $(date)
**Performed By:** $(whoami)

## Summary

Environment ${ENVIRONMENT} has been successfully decommissioned.

## Backup Locations

- Database Snapshot: ${ENVIRONMENT}-final-$(date +%Y%m%d)
- Configuration Backup: ${ENVIRONMENT}-backup.yaml
- Archive Location: s3://backups/${ENVIRONMENT}/

## Resources Removed

- Kubernetes namespace: ${ENVIRONMENT}
- Cloud infrastructure: infrastructure/environments/${ENVIRONMENT}
- DNS entries: *.${ENVIRONMENT}.example.com
- IAM roles and service accounts

## Verification

- [ ] All cloud resources terminated
- [ ] No ongoing costs
- [ ] Backups accessible
- [ ] Documentation updated
- [ ] Team notified

## Next Steps

1. Verify no unexpected costs appear in next billing cycle
2. Archive documentation after 90 days
3. Delete backups after retention period ($(date -d '+90 days' +%Y-%m-%d))

EOF

    echo "Report generated: ${ENVIRONMENT}-decommission-report.md"
}

# Step 8: Verify cleanup
verify_cleanup() {
    echo "Verifying cleanup..."

    # Check whether the namespace still exists. `kubectl get all` on a missing
    # namespace can succeed with empty output, so test the namespace itself.
    if kubectl get namespace "${ENVIRONMENT}" >/dev/null 2>&1; then
        echo "WARNING: Namespace ${ENVIRONMENT} still exists (deletion may be in progress)"
        kubectl get all -n "${ENVIRONMENT}"
    fi

    # Check for remaining AWS resources
    remaining_aws=$(aws resourcegroupstaggingapi get-resources \
        --tag-filters Key=Environment,Values=${ENVIRONMENT} \
        --query 'ResourceTagMappingList[].ResourceARN' \
        --output text)

    if [ -n "$remaining_aws" ]; then
        echo "WARNING: Some AWS resources still exist:"
        echo "$remaining_aws"
    else
        echo "No remaining resources found"
    fi
}

# Main execution
main() {
    echo "Starting decommission process..."
    echo "Environment: $ENVIRONMENT"
    echo "Dry Run: ${DRY_RUN:-false}"
    echo ""

    read -p "Are you sure you want to decommission $ENVIRONMENT? (yes/no): " confirm

    if [ "$confirm" != "yes" ]; then
        echo "Decommission cancelled"
        exit 0
    fi

    backup_data
    scale_down_services
    cleanup_kubernetes
    cleanup_cloud_resources
    cleanup_dns
    revoke_access
    verify_cleanup
    generate_report

    echo ""
    echo "Decommission completed!"
    echo "Please review the report: ${ENVIRONMENT}-decommission-report.md"
}

main

Critical: Pre-Decommission Verification

Always verify backups are complete and accessible before removing any production resources. Maintain backups according to your data retention policy.

Best Practices Summary¶

Container Best Practices¶

Practice	Implementation	Benefit
Multi-stage builds	Separate build and runtime stages	Smaller images, better security
Non-root users	Create and use dedicated app users	Enhanced security
Health checks	Implement readiness and liveness probes	Reliable deployments
Resource limits	Define CPU and memory constraints	Predictable performance
Image scanning	Automated vulnerability scanning	Early security issue detection

Infrastructure as Code Best Practices¶

IaC Golden Rules

Version Everything - All infrastructure code in version control
Modularize - Create reusable, focused modules
Document - Explain why, not just what
Test - Validate changes before production
Review - Peer review all infrastructure changes

Deployment Best Practices¶

Pre-Deployment:

During Deployment:

Post-Deployment:

Environment Management Best Practices¶

Configuration:

Use hierarchical configuration inheritance
Keep sensitive data in secrets management
Document environment-specific deviations
Automate configuration validation

Parity:

Maintain consistent tooling versions
Use similar scaling patterns
Replicate production architecture
Test with production-like data

Testing:

Test in staging before production
Use automated integration tests
Perform load testing regularly
Validate disaster recovery procedures

Monitoring and Observability¶

Key Metrics to Track¶

Application Metrics:

# prometheus-rules.yaml
groups:
  - name: application-health
    rules:
      - record: app:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

      - record: app:http_requests_total:rate5m
        expr: rate(http_requests_total[5m])

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / 
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"

Infrastructure Metrics:

# infrastructure-rules.yaml
groups:
  - name: infrastructure-health
    rules:
      - alert: HighCPUUsage
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"

      - alert: HighMemoryUsage
        expr: |
          (container_memory_usage_bytes /
           (container_spec_memory_limit_bytes > 0)) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage detected"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"

Logging Strategy¶

Structured Logging Example:

{
  "timestamp": "2024-10-18T10:30:45.123Z",
  "level": "info",
  "service": "api",
  "environment": "production",
  "version": "1.2.3",
  "trace_id": "abc123def456",
  "user_id": "user_789",
  "endpoint": "/api/orders",
  "method": "POST",
  "status_code": 201,
  "duration_ms": 45,
  "message": "Order created successfully"
}
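
A log line in this shape can be produced with the Python standard library alone. The sketch below is one possible approach, not a required one: `JsonFormatter` is a hypothetical class, and passing per-request values through `extra={"fields": {...}}` is a convention chosen for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str, environment: str, version: str):
        super().__init__()
        self.static = {"service": service, "environment": environment, "version": version}

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            **self.static,
            "message": record.getMessage(),
        }
        # Per-request fields (trace_id, status_code, ...) arrive via `extra`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


if __name__ == "__main__":
    logger = logging.getLogger("api")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("api", "production", "1.2.3"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "Order created successfully",
        extra={"fields": {"trace_id": "abc123def456", "status_code": 201, "duration_ms": 45}},
    )
```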

Log Aggregation:

Use centralized logging (ELK, Loki, CloudWatch)
Implement log retention policies
Structure logs for easy querying
Include correlation IDs for tracing
Filter sensitive information

Disaster Recovery¶

Backup Strategy¶

What to Backup:

Resource Type	Frequency	Retention	Method
Databases	Hourly	30 days	Automated snapshots
Configuration	On change	90 days	Version control
Secrets	Daily	90 days	Encrypted backups
Application State	Daily	7 days	Volume snapshots
Infrastructure Code	On commit	Indefinite	Git repository
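
The retention column can be enforced mechanically. The following sketch shows one way to find backups that have outlived their retention period; `expired_backups` is a hypothetical helper, the retention values are copied from the table above, and resource types without a finite retention (such as infrastructure code) are never reported.

```python
from datetime import date, timedelta

# Retention periods in days, taken from the table above.
RETENTION_DAYS = {
    "Databases": 30,
    "Configuration": 90,
    "Secrets": 90,
    "Application State": 7,
}


def expired_backups(backups, today: date):
    """Return the (resource_type, created) pairs older than their retention."""
    expired = []
    for resource_type, created in backups:
        days = RETENTION_DAYS.get(resource_type)
        # Types not listed (indefinite retention) are kept forever.
        if days is not None and today - created > timedelta(days=days):
            expired.append((resource_type, created))
    return expired


if __name__ == "__main__":
    inventory = [
        ("Databases", date(2024, 9, 1)),
        ("Application State", date(2024, 10, 15)),
        ("Infrastructure Code", date(2020, 1, 1)),
    ]
    print(expired_backups(inventory, today=date(2024, 10, 18)))
```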

Recovery Procedures¶

Database Recovery Example:

#!/bin/bash
# restore-database.sh

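set -euo pipefail

SNAPSHOT_ID="${1:?Usage: $0 <snapshot-id> <target-instance>}"
TARGET_INSTANCE="${2:?Usage: $0 <snapshot-id> <target-instance>}"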

echo "Restoring database from snapshot: $SNAPSHOT_ID"

# Restore RDS instance from snapshot
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier ${TARGET_INSTANCE} \
    --db-snapshot-identifier ${SNAPSHOT_ID} \
    --db-instance-class db.t3.medium \
    --no-publicly-accessible

# Wait for instance to be available
echo "Waiting for instance to be available..."
aws rds wait db-instance-available \
    --db-instance-identifier ${TARGET_INSTANCE}

# Update application configuration
echo "Updating application configuration..."
NEW_ENDPOINT=$(aws rds describe-db-instances \
    --db-instance-identifier ${TARGET_INSTANCE} \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text)

kubectl set env deployment/app \
    DATABASE_HOST=${NEW_ENDPOINT}

echo "Database restored successfully"
echo "New endpoint: $NEW_ENDPOINT"

Disaster Recovery Testing¶

Regular DR Drills:

Monthly: Test database restoration
Quarterly: Full environment recovery
Semi-annually: Complete disaster scenario
Annually: Cross-region failover test

DR Test Checklist:

Identify recovery time objective (RTO)
Identify recovery point objective (RPO)
Document recovery procedures
Test backup restoration
Verify data integrity
Validate application functionality
Document lessons learned
Update procedures based on findings

Troubleshooting Guide¶

Common Issues and Solutions¶

Build Failures | Deployment Failures | Network Issues | Performance Issues

Symptom: Docker build fails

Common Causes:

Network issues downloading dependencies
Invalid Dockerfile syntax
Insufficient disk space
Build cache corruption

Solutions:

# Clear build cache
docker builder prune -af

# Build without cache
docker build --no-cache -t myapp:latest .

# Check disk space
df -h
docker system df

Symptom: Kubernetes deployment fails

Common Causes:

Image pull errors
Resource constraints
Failed health checks
Configuration errors

Solutions:

# Check pod status
kubectl get pods -n production
kubectl describe pod <pod-name> -n production

# Check logs
kubectl logs <pod-name> -n production

# Check events
kubectl get events -n production --sort-by='.lastTimestamp'

# Rollback deployment
kubectl rollout undo deployment/myapp -n production

Symptom: Service connectivity problems

Common Causes:

DNS resolution failures
Network policy restrictions
Service misconfiguration
Ingress controller issues

Solutions:

# Test DNS resolution (run the debug pod in the same namespace as the service)
kubectl run -it --rm debug \
  --image=nicolaka/netshoot \
  --restart=Never -n production -- nslookup myapp

# Check service endpoints
kubectl get endpoints myapp -n production

# Verify network policies
kubectl get networkpolicies -n production

# Test connectivity
kubectl run -it --rm debug \
  --image=nicolaka/netshoot \
  --restart=Never -n production -- curl http://myapp:80

Symptom: Slow application performance

Common Causes:

Resource constraints
Database connection pool exhaustion
Memory leaks
Inefficient queries

Solutions:

# Check resource usage
kubectl top pods -n production
kubectl top nodes

# Increase resources
kubectl set resources deployment/myapp -n production \
  --requests=cpu=500m,memory=1Gi \
  --limits=cpu=1000m,memory=2Gi

# Check application metrics
kubectl port-forward svc/myapp 9090:9090 -n production
# Access metrics at http://localhost:9090/metrics

# Review logs for errors
kubectl logs -f deployment/myapp -n production

Quick Reference Commands¶

Docker Commands¶

# Build and tag
docker build -t myapp:1.0.0 .
docker tag myapp:1.0.0 registry.com/myapp:1.0.0

# Push to registry
docker push registry.com/myapp:1.0.0

# Clean up
docker system prune -af
docker volume prune -f

# Inspect
docker inspect <container-id>
docker logs -f <container-id>

# Execute commands in container
docker exec -it <container-id> /bin/bash

Kubernetes Commands¶

# Deployments
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp
kubectl rollout undo deployment/myapp
kubectl scale deployment/myapp --replicas=3

# Debugging
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl logs -f <pod-name>
kubectl exec -it <pod-name> -- /bin/sh

# Configuration
kubectl create configmap myapp-config --from-file=config.yaml
kubectl create secret generic myapp-secret --from-literal=password=secret
kubectl get configmap myapp-config -o yaml

# Services and Ingress
kubectl get services
kubectl get ingress
kubectl port-forward svc/myapp 8080:80

# Namespace management
kubectl get namespaces
kubectl create namespace staging
kubectl config set-context --current --namespace=staging

# Advanced Deployment Management
kubectl rollout history deployment/myapp
kubectl rollout history deployment/myapp --revision=2
kubectl rollout pause deployment/myapp
kubectl rollout resume deployment/myapp
kubectl rollout restart deployment/myapp

# Resource Updates
kubectl patch deployment myapp -p '{"spec":{"replicas":5}}'
kubectl set image deployment/myapp myapp=myapp:2.0.0
kubectl set env deployment/myapp DATABASE_URL=postgres://newdb:5432
kubectl set resources deployment/myapp --limits=cpu=500m,memory=1Gi

# Deployment Strategies
kubectl apply -f deployment.yaml  # --record is deprecated; set the change-cause annotation instead
kubectl annotate deployment/myapp kubernetes.io/change-cause="Upgraded to version 2.0"

# StatefulSets (for stateful applications)
kubectl get statefulsets
kubectl scale statefulset/mydb --replicas=3
kubectl rollout status statefulset/mydb
kubectl delete pod mydb-0 --force --grace-period=0  # Last resort: skips graceful shutdown

# DaemonSets (for node-level services)
kubectl get daemonsets -A
kubectl rollout status daemonset/node-exporter -n monitoring

# Jobs and CronJobs
kubectl create job backup --image=backup:latest
kubectl get jobs
kubectl get cronjobs
kubectl create cronjob backup --image=backup:latest --schedule="0 2 * * *"

# Resource Quotas and Limits
kubectl get resourcequota -n production
kubectl describe resourcequota production-quota -n production
kubectl create quota production-quota --hard=cpu=10,memory=20Gi,pods=50

# HorizontalPodAutoscaler
kubectl autoscale deployment myapp --cpu-percent=70 --min=2 --max=10
kubectl get hpa
kubectl describe hpa myapp

# Custom Resource Definitions (CRDs)
kubectl get crd
kubectl get <crd-name>
kubectl describe crd <crd-name>

# Helm (Package Manager)
# The old "stable" chart repository is archived; Bitnami is used here as an example source
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo nginx
helm install myapp bitnami/nginx
helm list
helm upgrade myapp bitnami/nginx --set replicaCount=3
helm rollback myapp 1
helm uninstall myapp

# Kustomize (Configuration Management)
kubectl apply -k ./overlays/production
kubectl kustomize ./overlays/production
kubectl diff -k ./overlays/production

# Network Policies
kubectl get networkpolicies -n production
kubectl describe networkpolicy allow-frontend -n production

# Resource Management
kubectl top nodes
kubectl top pods -n production
kubectl describe node <node-name>
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets
kubectl uncordon <node-name>

# PersistentVolumes and Claims
kubectl get pv
kubectl get pvc -n production
kubectl describe pv <pv-name>
kubectl delete pvc <pvc-name>

# Service Accounts and RBAC
kubectl get serviceaccounts -n production
kubectl create serviceaccount myapp-sa
kubectl get roles -n production
kubectl get rolebindings -n production
kubectl create role pod-reader --verb=get --verb=list --resource=pods
kubectl create rolebinding read-pods --role=pod-reader --serviceaccount=default:myapp-sa

# Cluster Information
kubectl cluster-info
kubectl get nodes -o wide
kubectl api-resources
kubectl api-versions
kubectl version

# Troubleshooting with Events
kubectl get events -n production --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning

# Resource Validation
kubectl apply -f deployment.yaml --dry-run=client
kubectl apply -f deployment.yaml --dry-run=server
kubectl diff -f deployment.yaml

# Label and Selector Management
kubectl label pods myapp-pod env=production
kubectl label pods myapp-pod env-
kubectl get pods -l env=production
kubectl get pods --selector="app=myapp,tier=frontend"

# Annotations
kubectl annotate deployment myapp description="Production API service"
kubectl annotate deployment myapp description-

# Context and Namespace Management
kubectl config get-contexts
kubectl config use-context production
kubectl config set-context --current --namespace=production
kubectl config view

# Certificate Management (requires the cert-manager CRDs)
kubectl get certificates -n production
kubectl describe certificate myapp-tls -n production
kubectl get certificaterequests -n production

# Advanced Debugging
kubectl debug node/<node-name> -it --image=ubuntu
kubectl cp <pod-name>:/path/to/file ./local-file
kubectl cp ./local-file <pod-name>:/path/to/file
kubectl attach <pod-name> -c <container-name>

# Resource Export
kubectl get deployment myapp -o yaml > myapp-deployment.yaml
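# Note: "get all" covers only a subset of resource types (ConfigMaps, Secrets, and PVCs are not included)
kubectl get all -n production -o yaml > production-backup.yaml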
kubectl get secret mysecret -o jsonpath='{.data.password}' | base64 -d

# Kubectl Plugins (krew)
kubectl krew install ctx
kubectl krew install ns
kubectl ctx
kubectl ns
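
The rollout commands in this reference can be combined into a small guard around image updates. The following is a minimal sketch rather than a definitive implementation: the function name `deploy_image` and the default timeout of 120s are assumptions made for illustration, while `kubectl set image`, `kubectl rollout status --timeout`, and `kubectl rollout undo` are standard kubectl subcommands.

```shell
#!/usr/bin/env bash
# Sketch: update a Deployment image and roll back if the rollout does not finish.
# deploy_image is a hypothetical helper; adjust names and timeout to your cluster.

deploy_image() {
  local deployment="$1" container="$2" image="$3" timeout="${4:-120s}"

  # Point the named container at the new image.
  kubectl set image "deployment/${deployment}" "${container}=${image}" || return 1

  # Wait for the rollout; a non-zero exit means it failed or timed out.
  if kubectl rollout status "deployment/${deployment}" --timeout="${timeout}"; then
    echo "rollout of ${image} succeeded"
  else
    echo "rollout of ${image} failed; rolling back" >&2
    kubectl rollout undo "deployment/${deployment}"
    return 1
  fi
}
```

A call such as `deploy_image myapp myapp myapp:2.0.0 180s` returns non-zero after a rollback, so a pipeline step can fail on it.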

Terraform Commands¶

# Initialize
terraform init
terraform init -upgrade

# Plan and apply
terraform plan
terraform plan -out=plan.tfplan
terraform apply
terraform apply -auto-approve
terraform apply plan.tfplan

# Destroy
terraform destroy
terraform destroy -target=aws_instance.example

# State management
terraform state list
terraform state show <resource>
terraform import <resource> <id>
terraform state rm <resource>

# Workspace management
terraform workspace list
terraform workspace new staging
terraform workspace select staging

# Formatting and validation
terraform fmt -recursive
terraform validate
terraform graph | dot -Tsvg > graph.svg
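
The plan and apply commands above are often chained so that the apply step runs only when the saved plan contains changes. The sketch below assumes the plan file name `plan.tfplan` used in this section; `plan_and_apply` is a hypothetical helper name. It relies on the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 for no changes, 1 for an error, and 2 when changes are present.

```shell
#!/usr/bin/env bash
# Sketch: apply a saved plan only when it actually contains changes.
# plan_and_apply is a hypothetical helper name, not part of Terraform.

plan_and_apply() {
  local plan_file="${1:-plan.tfplan}"
  local rc=0

  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
  terraform plan -input=false -detailed-exitcode -out="${plan_file}" || rc=$?

  case "${rc}" in
    0) echo "no changes to apply" ;;
    2) terraform apply -input=false "${plan_file}" ;;
    *) echo "terraform plan failed (exit ${rc})" >&2
       return "${rc}" ;;
  esac
}
```

Applying the saved plan file, rather than re-planning inside `terraform apply`, means the changes that were reviewed are the changes that run.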

Ansible Commands¶

# Run playbook
ansible-playbook -i inventory playbook.yml
ansible-playbook -i inventory playbook.yml --check
ansible-playbook -i inventory playbook.yml --tags "deploy"

# Ad-hoc commands
ansible all -i inventory -m ping
ansible webservers -i inventory -a "uptime"
ansible dbservers -i inventory -m service -a "name=postgresql state=restarted"

# Inventory
ansible-inventory -i inventory --list
ansible-inventory -i inventory --graph

# Vault
ansible-vault create secrets.yml
ansible-vault edit secrets.yml
ansible-vault encrypt secrets.yml
ansible-vault decrypt secrets.yml
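
The playbook flags above can be sequenced so that cheaper checks run before anything changes on the hosts. This is a sketch under assumptions: `safe_playbook` is a hypothetical helper name, and the inventory and playbook paths are passed in by the caller. `--syntax-check`, `--check`, and `--diff` are standard `ansible-playbook` options.

```shell
#!/usr/bin/env bash
# Sketch: syntax check, then dry run, then the real run.
# safe_playbook is a hypothetical helper name.

safe_playbook() {
  local inventory="$1" playbook="$2"

  # 1. Parse the playbook without contacting any host.
  ansible-playbook -i "${inventory}" "${playbook}" --syntax-check || return 1

  # 2. Dry run: report what would change, with diffs, without changing it.
  ansible-playbook -i "${inventory}" "${playbook}" --check --diff || return 1

  # 3. Real run.
  ansible-playbook -i "${inventory}" "${playbook}"
}
```

Some tasks cannot run meaningfully in check mode (for example, those that depend on results of earlier changes), so a failing dry run may call for review rather than always indicating a broken playbook.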

AWS CLI Commands¶

# EKS
aws eks list-clusters
aws eks update-kubeconfig --name cluster-name
aws eks describe-cluster --name cluster-name

# RDS
aws rds describe-db-instances
aws rds create-db-snapshot --db-instance-identifier mydb --db-snapshot-identifier mydb-snapshot
aws rds restore-db-instance-from-db-snapshot --db-instance-identifier new-db --db-snapshot-identifier mydb-snapshot

# ECR
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-west-2.amazonaws.com
aws ecr describe-repositories
aws ecr list-images --repository-name myapp

# Secrets Manager
aws secretsmanager list-secrets
aws secretsmanager get-secret-value --secret-id myapp/database
aws secretsmanager put-secret-value --secret-id myapp/database --secret-string '{"password":"newpass"}'
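
The ECR login command above is usually followed by tagging and pushing an image. The sketch below shows one way to wire these together; `ecr_registry` and `ecr_push` are hypothetical helper names, and the account ID, region, repository, and tag are supplied by the caller. The registry hostname follows the `<account>.dkr.ecr.<region>.amazonaws.com` pattern from the login command.

```shell
#!/usr/bin/env bash
# Sketch: log in to ECR, then tag and push a locally built image.
# ecr_registry and ecr_push are hypothetical helper names.

# Print the registry hostname for an account ID and region.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

ecr_push() {
  local account="$1" region="$2" repo="$3" tag="$4"
  local registry
  registry="$(ecr_registry "${account}" "${region}")"

  # Pipe the temporary password straight into docker login (never echo it).
  aws ecr get-login-password --region "${region}" \
    | docker login --username AWS --password-stdin "${registry}" || return 1

  docker tag "${repo}:${tag}" "${registry}/${repo}:${tag}" || return 1
  docker push "${registry}/${repo}:${tag}"
}
```

For example, `ecr_push <account> us-west-2 myapp 2.0.0` would push `myapp:2.0.0` to the `myapp` repository, assuming that repository already exists.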

OpenStack Commands¶

# Authentication
openstack token issue
openstack catalog list

# Compute (Nova)
openstack server list
openstack server show <server-name>
openstack server create --flavor m1.medium --image ubuntu-20.04 --network private myserver
openstack server delete <server-name>
openstack server reboot <server-name>
openstack server resize --flavor m1.large <server-name>
openstack server resize confirm <server-name>

# Flavor management
openstack flavor list
openstack flavor show m1.medium

# Images (Glance)
openstack image list
openstack image show <image-name>
openstack image create --file ubuntu.qcow2 --disk-format qcow2 ubuntu-custom
openstack image delete <image-name>

# Networking (Neutron)
openstack network list
openstack network create private-net
openstack network show private-net
openstack subnet create --network private-net --subnet-range 192.168.1.0/24 private-subnet

# Router management
openstack router list
openstack router create myrouter
openstack router add subnet myrouter private-subnet
openstack router set --external-gateway public myrouter

# Floating IPs
openstack floating ip list
openstack floating ip create public
openstack server add floating ip <server-name> <floating-ip>
openstack server remove floating ip <server-name> <floating-ip>

# Security Groups
openstack security group list
openstack security group create web-sg
openstack security group rule create --protocol tcp --dst-port 80 web-sg
openstack security group rule create --protocol tcp --dst-port 443 web-sg
openstack security group rule list web-sg

# Volumes (Cinder)
openstack volume list
openstack volume create --size 100 myvolume
openstack volume show myvolume
openstack server add volume <server-name> myvolume
openstack server remove volume <server-name> myvolume
openstack volume delete myvolume

# Snapshots
openstack volume snapshot list
openstack volume snapshot create --volume myvolume myvolume-snapshot
openstack volume create --snapshot myvolume-snapshot restored-volume

# Orchestration (Heat)
openstack stack list
openstack stack create -t template.yaml mystack
openstack stack show mystack
openstack stack update -t template.yaml mystack
openstack stack delete mystack
openstack stack resource list mystack

# Object Storage (Swift)
openstack container list
openstack container create mycontainer
openstack object list mycontainer
openstack object create mycontainer file.txt
openstack object save mycontainer file.txt

# Quotas
openstack quota show
openstack quota set --instances 20 --cores 40 --ram 81920 <project-id>

# Projects and Users
openstack project list
openstack project create myproject
openstack user list
openstack user create --password secret --project myproject myuser
openstack role add --user myuser --project myproject member

# Resource usage
openstack usage list
openstack limits show --absolute

OpenStack Heat Templates (IaC)¶

# Validate template
openstack orchestration template validate -t template.yaml

# Preview changes (dry run)
openstack stack create --dry-run -t template.yaml mystack
openstack stack update --dry-run -t template.yaml mystack

# Show stack events
openstack stack event list mystack
openstack stack event show mystack <resource-name> <event-id>

# Stack outputs
openstack stack output list mystack
openstack stack output show mystack server_ip

# Suspend and resume
openstack stack suspend mystack
openstack stack resume mystack

# Abandon (remove from control without deleting)
openstack stack abandon mystack
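
Stack operations are asynchronous, so scripts typically poll the stack status before continuing. The following is a minimal sketch, assuming the `stack_status` column reported by `openstack stack show`; `wait_for_stack`, the attempt count, and the delay are illustrative choices. Where available, the `--wait` option of `openstack stack create` offers similar behavior without a loop.

```shell
#!/usr/bin/env bash
# Sketch: poll a Heat stack until it completes, fails, or the attempts run out.
# wait_for_stack is a hypothetical helper name.

wait_for_stack() {
  local stack="$1" attempts="${2:-60}" delay="${3:-10}"
  local state i=0

  while [ "${i}" -lt "${attempts}" ]; do
    state="$(openstack stack show "${stack}" -f value -c stack_status)" || return 1
    case "${state}" in
      # A rollback means the requested change did not succeed.
      ROLLBACK_*|*_FAILED) echo "${state}" >&2; return 1 ;;
      *_COMPLETE)          echo "${state}"; return 0 ;;
    esac
    sleep "${delay}"
    i=$((i + 1))
  done

  echo "timed out waiting for ${stack}" >&2
  return 2
}
```

The distinct return codes (1 for failure, 2 for timeout) let a calling script decide whether to inspect `openstack stack event list` or simply wait longer.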

OpenStack-Kubernetes Integration¶

# If using Magnum (Kubernetes on OpenStack)
openstack coe cluster list
openstack coe cluster create k8s-cluster \
  --cluster-template kubernetes-template \
  --master-count 3 \
  --node-count 5

openstack coe cluster show k8s-cluster
openstack coe cluster config k8s-cluster
openstack coe cluster resize k8s-cluster 10
openstack coe cluster upgrade k8s-cluster new-template
openstack coe cluster delete k8s-cluster

# Get kubeconfig
openstack coe cluster config k8s-cluster --dir ~/.kube

Additional Resources¶

Documentation Links¶

Container Technologies:

Infrastructure as Code:

Cloud Providers:

CI/CD Tools:

Tools and Utilities¶

Container Security:

Trivy - Vulnerability scanner
Clair - Static analysis tool
Anchore - Container security platform
Cosign - Container signing

IaC Testing:

Terratest - Go library for testing infrastructure
Kitchen-Terraform - Test Kitchen plugin
InSpec - Compliance testing framework
Checkov - Static code analysis tool

Monitoring & Observability:

Prometheus - Monitoring system
Grafana - Visualization platform
Datadog - Monitoring service
New Relic - Observability platform

Logging:

ELK Stack - Elasticsearch, Logstash, Kibana
Loki - Log aggregation system
Fluentd - Data collector
CloudWatch - AWS monitoring

Development Tools:

k9s - Terminal UI for Kubernetes
Lens - Kubernetes IDE
Docker Desktop - Local development
Minikube - Local Kubernetes

Glossary¶

Term	Definition
Blue-Green Deployment	Deployment strategy that runs two identical environments and switches traffic from one to the other
Canary Deployment	Gradual rollout of a new version to a small subset of users before full release
ConfigMap	Kubernetes object for non-sensitive configuration data
Container Registry	Storage and distribution system for container images
Idempotency	Property where repeating an operation produces the same result as running it once
Infrastructure as Code	Managing infrastructure through version-controlled code rather than manual processes
Multi-stage Build	Docker build technique using multiple FROM statements so the final image contains only what it needs
Rolling Deployment	Gradual replacement of application instances with new ones, a few at a time
Secret	Kubernetes object for sensitive information such as passwords, tokens, and keys
Service Mesh	Infrastructure layer for service-to-service communication
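
The Idempotency entry can be illustrated with a small shell helper. This is a sketch only: `ensure_line` is a hypothetical function name, and the file and line are whatever the caller passes in. Running it once or many times leaves the file in the same state, which is the property that tools such as Ansible and Terraform aim for.

```shell
#!/usr/bin/env bash
# Sketch: an idempotent "append this line" operation.
# ensure_line is a hypothetical helper name.

ensure_line() {
  local line="$1" file="$2"

  # Make sure the file exists so grep does not fail on a missing path.
  touch "${file}"

  # -F: fixed string, -x: match the whole line, -q: quiet.
  # Append only when no identical line is already present.
  grep -qxF -- "${line}" "${file}" || printf '%s\n' "${line}" >> "${file}"
}
```

By contrast, a plain `echo "$line" >> "$file"` is not idempotent, because every run adds another copy of the line.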

Common Acronyms¶

Acronym	Full Form
CD	Continuous Delivery/Deployment
CI	Continuous Integration
CIDR	Classless Inter-Domain Routing
CRD	Custom Resource Definition
DR	Disaster Recovery
IAM	Identity and Access Management
IaC	Infrastructure as Code
RBAC	Role-Based Access Control
RPO	Recovery Point Objective
RTO	Recovery Time Objective
SLA	Service Level Agreement
TLS	Transport Layer Security

Deployment Checklist Template¶

Pre-Deployment¶

Code Quality:

Code reviewed and approved
All tests passing
Code coverage meets requirements
Static analysis passed

Security:

Security scan completed
No critical vulnerabilities
Secrets properly managed
Access controls verified

Infrastructure:

Infrastructure changes reviewed
Resource capacity verified
Scaling rules configured
Monitoring alerts configured

Database:

Migration scripts tested
Rollback plan documented
Backup verified
Performance impact assessed

Documentation:

Release notes prepared
Runbook updated
Configuration documented
Team notified

During Deployment¶

Monitoring:

Error rates monitored
Response times tracked
Resource usage checked
Logs reviewed

Verification:

Health checks passing
Smoke tests executed
Critical paths verified
Database connectivity confirmed

Communication:

Status updates provided
Stakeholders informed
Issue tracker updated
Team available

Post-Deployment¶

Validation:

All services healthy
Business flows working
Performance acceptable
No unexpected errors

Documentation:

Deployment documented
Issues logged
Metrics recorded
Lessons learned captured

Cleanup:

Old resources removed
Rollback verified
Documentation updated
Team debriefed

Incident Response Template¶

Severity Levels¶

Level	Description	Response Time	Escalation
P1 - Critical	Complete service outage	Immediate	All hands
P2 - High	Major feature unavailable	15 minutes	On-call team
P3 - Medium	Minor feature degraded	1 hour	Assigned team
P4 - Low	Cosmetic issue	Next business day	Queue

Incident Response Steps¶

1. Acknowledge

Incident acknowledged
Severity assigned
Team notified
Status page updated

2. Assess

Impact determined
Root cause identified
Affected systems listed
Timeline established

3. Respond

Mitigation started
Workaround implemented
Rollback initiated (if needed)
Communication ongoing

4. Recover

Service restored
Functionality verified
Monitoring confirmed
Status page updated

5. Review

Postmortem scheduled
Timeline documented
Action items created
Process improved

Change Management Template¶

Change Request¶

Change Details:

Change ID: [AUTO-GENERATED]
Requested By: [NAME]
Date: [DATE]
Environment: [ENVIRONMENT]

Description:

[Detailed description of the change]

Justification:

[Business reason for the change]

Impact Assessment:

Systems Affected: [LIST]
Users Impacted: [NUMBER/PERCENTAGE]
Risk Level: [LOW/MEDIUM/HIGH]

Implementation Plan:

Start Time: [DATETIME]
Duration: [ESTIMATE]
Steps: [NUMBERED LIST]

Rollback Plan:

Trigger Conditions: [CONDITIONS]
Steps: [NUMBERED LIST]
Recovery Time: [ESTIMATE]

Testing:

Unit tests passed
Integration tests passed
UAT completed
Performance validated

Approvals:

Technical Lead
Operations Manager
Product Owner
Security Team

Last updated: October 2025