Deployment and Infrastructure as Code¶
Section Overview
This section covers modern deployment practices: containers, Infrastructure as Code, and automated pipelines that keep application delivery consistent, reliable, and scalable.
Container-based Development and Deployment¶
Dockerfile Best Practices¶
Core Principle: Dockerfiles should be optimized for security, efficiency, and maintainability while ensuring consistent and reproducible builds.
Key Guidelines:
- Use specific base image versions with SHA digests
- Minimize layer count and optimize image size
- Follow the principle of least privilege
- Implement proper health checks
- Leverage multi-stage builds
Why This Matters
Well-structured Dockerfiles ensure consistent environments across all deployment stages, reduce security vulnerabilities, and optimize both build times and runtime performance. They form the foundation of reliable containerized applications.
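Several of the guidelines above can be checked mechanically. The following is a minimal sketch, not a replacement for a real linter: `check_dockerfile` is a hypothetical helper that looks only at the text of a Dockerfile and flags unpinned base images, a missing non-root `USER`, a missing `HEALTHCHECK`, and single-stage builds. It assumes every `FROM` names a registry image, so a stage that builds from `scratch` or from an earlier stage name would be reported as unpinned.

```python
def check_dockerfile(text: str) -> list[str]:
    """Return a list of guideline violations found in a Dockerfile's text."""
    lines = [ln.strip() for ln in text.splitlines()]
    instructions = [ln for ln in lines if ln and not ln.startswith("#")]
    problems = []

    # Every base image should be pinned with an @sha256: digest.
    from_lines = [ln for ln in instructions if ln.upper().startswith("FROM ")]
    for ln in from_lines:
        # Skip flags such as --platform=... to find the image reference.
        image = next(tok for tok in ln.split()[1:] if not tok.startswith("--"))
        if "@sha256:" not in image:
            problems.append(f"base image not pinned by digest: {image}")

    # The last USER instruction decides who the container runs as.
    users = [ln.split()[1] for ln in instructions if ln.upper().startswith("USER ")]
    if not users or users[-1] in ("root", "0"):
        problems.append("container runs as root (no non-root USER set)")

    if not any(ln.upper().startswith("HEALTHCHECK") for ln in instructions):
        problems.append("no HEALTHCHECK defined")

    if len(from_lines) < 2:
        problems.append("single-stage build (consider a multi-stage build)")

    return problems
```

A check like this could run in CI before the image build; an empty list means none of the four checks fired.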
Layer Optimization Strategy¶
Implementation:
Security Considerations¶
Critical Security Practices
- Run containers as non-root users
- Remove unnecessary tools and packages
- Scan images for vulnerabilities regularly
- Use multi-stage builds to minimize attack surface
- Never include secrets in image layers
Example: Secure Container User Setup
# Create non-root user
RUN useradd -r -u 1000 appuser
WORKDIR /app
COPY --from=builder /app .
# Set user before execution
USER appuser
Complete Python Application Example¶
# Build stage
FROM python:3.11-slim-bullseye@sha256:abc123... AS builder
# Build metadata
ARG APP_VERSION
ARG BUILD_DATE
ARG VCS_REF
LABEL org.opencontainers.image.version="${APP_VERSION}" \
org.opencontainers.image.created="${BUILD_DATE}" \
org.opencontainers.image.revision="${VCS_REF}"
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy and test application
COPY . .
RUN python -m pytest
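# Final stage - minimal runtime image
FROM python:3.11-slim-bullseye@sha256:abc123...
# Create non-root user
RUN useradd -r -u 1000 appuser
WORKDIR /app
# Bring over the packages pip installed in the builder; they live outside /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY --from=builder /app .
USER appuser
# Health check (uses Python because curl is not installed in the runtime image)
HEALTHCHECK --interval=30s --timeout=3s \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)" || exit 1
ENTRYPOINT ["python"]
CMD ["app.py"]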
Node.js Application with Build Secrets¶
# Build stage
FROM node:18-alpine@sha256:def456... AS builder
WORKDIR /app
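# Install dependencies (copy manifests first so this layer is cached)
COPY package*.json ./
# Mount the npm token as a secret. The token is written to npm's config only
# for the duration of this RUN and deleted before the layer is committed.
RUN --mount=type=secret,id=npm_token \
npm config set //registry.npmjs.org/:_authToken="$(cat /run/secrets/npm_token)" && \
npm ci && \
npm config delete //registry.npmjs.org/:_authToken
# Build application (needs devDependencies), then drop them
COPY . .
RUN npm run build && npm prune --omit=dev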
# Final stage
FROM node:18-alpine@sha256:def456...
WORKDIR /app
# Copy built artifacts
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
# Run as the non-root "node" user that the official image already provides.
# It owns UID/GID 1000, so creating another user with that ID would fail.
USER node
# Health check
HEALTHCHECK --interval=30s --timeout=3s \
CMD wget -q --spider http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
.dockerignore Best Practices¶
Optimize Build Context
A well-configured .dockerignore file reduces build context size and speeds up builds significantly.
# Dependencies
node_modules
vendor
# Development files
*.log
npm-debug.log
.env
.env.local
# Version control
.git
.gitignore
# Documentation
*.md
docs/
# Build artifacts
dist
build
*.tar.gz
# IDE files
.vscode
.idea
*.swp
Container Image Management¶
Image Tagging Strategy¶
Core Principle: Container images must be versioned, secured, and managed systematically to ensure reliable and traceable deployments.
Recommended Tagging Format:
| Tag Type | Format | Example | Use Case |
|---|---|---|---|
| Semantic Version | v{major}.{minor}.{patch} | v1.2.3 | Production releases |
| Git Commit | {short-sha} | a1b2c3d | Development tracking |
| Build Number | v{major}.{minor}.{patch}-b{build} | v1.2.3-b456 | CI/CD integration |
| Environment | v{major}.{minor}.{patch}-{env} | v1.2.3-staging | Environment-specific |
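The tagging formats in the table can be generated from a single set of build inputs. This is a small sketch under assumptions: `image_tags` is a hypothetical helper, and the 7-character short SHA is chosen only to match the `a1b2c3d` example above.

```python
import re

# Plain semantic version without the leading "v" (for example "1.2.3").
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def image_tags(version, commit_sha, build_number=None, environment=None):
    """Build the list of tags for one image following the table's formats."""
    if not SEMVER.match(version):
        raise ValueError(f"not a semantic version: {version!r}")
    tags = [
        f"v{version}",    # semantic version tag for releases
        commit_sha[:7],   # short git commit tag (length is an assumption)
    ]
    if build_number is not None:
        tags.append(f"v{version}-b{build_number}")   # CI build number tag
    if environment:
        tags.append(f"v{version}-{environment}")     # environment-specific tag
    return tags


tags = image_tags("1.2.3", "a1b2c3d4e5f6", build_number=456, environment="staging")
print(tags)
```

Rejecting malformed versions early keeps tags that do not follow the scheme out of the registry.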
Registry Management Workflow¶
Security Scanning¶
Mandatory Security Checks
All images must be scanned for vulnerabilities before deployment. Critical and high-severity issues must be resolved.
Trivy Configuration Example:
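# trivy.yaml
# Key names vary between Trivy releases; check the config-file reference for your version
severity:
  - CRITICAL
  - HIGH
format: table
output: scan-results.txt
vulnerability:
  type:
    - os
    - library
  ignore-unfixed: true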
CI Pipeline Integration:
scan-image:
  script:
    # --exit-code 1 makes Trivy fail the job when matching vulnerabilities are found
    - trivy image --severity HIGH,CRITICAL --exit-code 1
      company-registry.com/team/app:${CI_COMMIT_SHA}
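When a pipeline needs more control than a single exit code, the scan can be written as JSON and evaluated separately. The sketch below assumes the report layout produced by `trivy image --format json` (a top-level `Results` list whose entries may carry a `Vulnerabilities` list); `blocking_vulnerabilities` is a hypothetical helper name, and the sample report is made up for illustration.

```python
def blocking_vulnerabilities(report: dict, severities=("CRITICAL", "HIGH")) -> list[str]:
    """Return 'ID (package)' strings for findings whose severity blocks deployment."""
    found = []
    # "Results" or "Vulnerabilities" may be missing or null when nothing was found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities:
                found.append(f'{vuln.get("VulnerabilityID")} ({vuln.get("PkgName")})')
    return found


# Illustrative report fragment (identifiers are invented, not real findings).
sample_report = {
    "Results": [
        {
            "Target": "app (debian 11)",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "PkgName": "openssl", "Severity": "CRITICAL"},
                {"VulnerabilityID": "CVE-0000-0002", "PkgName": "zlib", "Severity": "LOW"},
            ],
        },
        {"Target": "requirements.txt", "Vulnerabilities": None},
    ]
}

blockers = blocking_vulnerabilities(sample_report)
if blockers:
    print("Blocking vulnerabilities:", ", ".join(blockers))
```

A CI step could load the real report with `json.load` and exit non-zero whenever the returned list is non-empty.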
Registry Cleanup Policies¶
cleanup:
policies:
- name: keep-latest-versions
rules:
- type: tag
pattern: '^v\d+\.\d+\.\d+$'
action: keep
amount: 5
- name: cleanup-feature-branches
rules:
- type: tag
pattern: '^feature-.*$'
action: delete
older-than: 7d
- name: cleanup-development-tags
rules:
- type: tag
pattern: '^dev-.*$'
action: delete
older-than: 3d
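The three policies above reduce to a simple selection rule. The following sketch shows one way to evaluate them, assuming that "keep 5" means the five highest release versions and that tag ages are measured from creation time; `tags_to_delete` and its input shape are hypothetical, not the API of any particular registry.

```python
import re
from datetime import datetime, timedelta, timezone

RELEASE = re.compile(r"^v\d+\.\d+\.\d+$")


def tags_to_delete(tags: dict, now: datetime, keep_releases: int = 5) -> list[str]:
    """tags maps tag name -> creation time; returns the tags the policies would remove."""
    doomed = []

    # keep-latest-versions: keep the N highest release versions, drop the rest.
    releases = sorted(
        (t for t in tags if RELEASE.match(t)),
        key=lambda t: tuple(int(part) for part in t[1:].split(".")),
        reverse=True,
    )
    doomed.extend(releases[keep_releases:])

    # cleanup-feature-branches (7 days) and cleanup-development-tags (3 days).
    for tag, created in tags.items():
        age = now - created
        if tag.startswith("feature-") and age > timedelta(days=7):
            doomed.append(tag)
        elif tag.startswith("dev-") and age > timedelta(days=3):
            doomed.append(tag)

    return sorted(doomed)


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
example = {f"v1.0.{i}": now for i in range(7)}
example.update({
    "feature-login": now - timedelta(days=8),
    "feature-search": now - timedelta(days=2),
    "dev-abc": now - timedelta(days=4),
    "dev-def": now - timedelta(days=1),
})
print(tags_to_delete(example, now))
```

Tags that match none of the patterns are left alone, which mirrors the policy file: nothing is deleted unless a rule names it.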
Multi-stage Build Optimization¶
Complex Build Pipeline Example¶
# Stage 1: Dependencies
FROM node:18-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm \
npm ci
# Stage 2: Frontend build
FROM node:18-alpine AS frontend-builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY frontend/ .
RUN npm run build
# Stage 3: Backend build
FROM maven:3.8-openjdk-17 AS backend-builder
WORKDIR /app
COPY backend/pom.xml .
RUN --mount=type=cache,target=/root/.m2 \
mvn dependency:go-offline
COPY backend/ .
RUN mvn package -DskipTests
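# Stage 4: Security scan
# Note: BuildKit skips stages that the final image does not depend on, so run
# this stage explicitly in CI (docker build --target security-scan .)
FROM aquasec/trivy:latest AS security-scan
COPY --from=backend-builder /app/target/*.jar /app/
RUN trivy filesystem --exit-code 1 \
--severity HIGH,CRITICAL /app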
# Stage 5: Final runtime image
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
# Copy built artifacts
COPY --from=frontend-builder /app/dist /app/public
COPY --from=backend-builder /app/target/*.jar app.jar
# Non-root user
RUN addgroup -g 1000 appgroup && \
adduser -u 1000 -G appgroup -s /bin/sh -D appuser
USER appuser
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=3s \
CMD wget -q --spider http://localhost:8080/actuator/health || exit 1
ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-jar", "app.jar"]
Build Cache Optimization
Cache mounts let the package manager reuse downloaded dependencies across builds, which can cut build times substantially. The actual gain depends on how often dependencies change and how large they are.
Docker Compose for Development¶
Development Environment Setup¶
Core Principle: Docker Compose should provide a consistent, reproducible local development environment that mirrors production while optimizing for developer experience.
# docker-compose.yml
version: '3.8'
x-logging: &default-logging
options:
max-size: "10m"
max-file: "3"
driver: json-file
services:
app:
build:
context: .
target: development
args:
- NODE_ENV=development
volumes:
- .:/app:delegated
- node_modules:/app/node_modules
ports:
- "${PORT:-3000}:3000"
- "9229:9229" # Debug port
environment:
- NODE_ENV=development
- DATABASE_URL=postgresql://postgres:password@db:5432/myapp
- REDIS_URL=redis://redis:6379/0
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
logging: *default-logging
db:
image: postgres:14-alpine
volumes:
- postgres_data:/var/lib/postgresql/data
- ./init-scripts:/docker-entrypoint-initdb.d
environment:
- POSTGRES_PASSWORD=password
- POSTGRES_DB=myapp
ports:
- "5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 10s
timeout: 5s
retries: 5
logging: *default-logging
redis:
image: redis:7-alpine
volumes:
- redis_data:/data
ports:
- "6379:6379"
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 3
logging: *default-logging
volumes:
node_modules:
postgres_data:
redis_data:
networks:
default:
driver: bridge
Development Overrides¶
# docker-compose.override.yml
services:
app:
command: npm run dev
environment:
- DEBUG=app:*
# Expose ports for local debugging tools
db:
ports:
- "5432:5432"
redis:
ports:
- "6379:6379"
Container Orchestration with Kubernetes¶
Service Deployment¶
Core Principle: Container orchestration should automate deployment, scaling, and management of containerized applications while ensuring high availability and optimal resource utilization.
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
labels:
app: myapp
environment: production
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "3000"
spec:
securityContext:
runAsNonRoot: true
fsGroup: 2000
containers:
- name: myapp
image: company-registry.com/myapp:v1.2.3
ports:
- containerPort: 3000
name: http
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 15
periodSeconds: 20
startupProbe:
httpGet:
path: /health
port: http
failureThreshold: 30
periodSeconds: 10
env:
- name: NODE_ENV
value: "production"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: myapp-secrets
key: database-url
volumeMounts:
- name: config
mountPath: /etc/config
readOnly: true
- name: tmp
mountPath: /tmp
volumes:
- name: config
configMap:
name: myapp-config
- name: tmp
emptyDir: {}
Service and Ingress Configuration¶
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: myapp
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
tls:
- hosts:
- myapp.example.com
secretName: myapp-tls
rules:
- host: myapp.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: myapp
port:
name: http
Resource Management
Always define resource requests and limits to ensure fair resource allocation and prevent resource exhaustion on shared clusters.
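This rule is easy to enforce before a manifest reaches the cluster. The sketch below assumes the Deployment has already been parsed into a dictionary (for example with a YAML library); `containers_missing_resources` is a hypothetical helper that reports every container lacking a CPU or memory request or limit.

```python
def containers_missing_resources(deployment: dict) -> list[str]:
    """List 'container: section.key' entries that the manifest does not define."""
    missing = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for key in ("cpu", "memory"):
                if key not in resources.get(section, {}):
                    missing.append(f'{container["name"]}: {section}.{key}')
    return missing


manifest = {
    "spec": {"template": {"spec": {"containers": [
        {
            "name": "myapp",
            "resources": {
                "requests": {"cpu": "100m", "memory": "256Mi"},
                "limits": {"cpu": "500m", "memory": "512Mi"},
            },
        },
        # Sidecar defined without limits: should be reported.
        {"name": "sidecar", "resources": {"requests": {"cpu": "50m", "memory": "64Mi"}}},
    ]}}}
}
print(containers_missing_resources(manifest))
```

Within a cluster, admission policies or a LimitRange can enforce the same rule; a check like this only catches the problem earlier, in review or CI.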
Infrastructure as Code with Terraform¶
Infrastructure Definition Principles¶
Core Principle: Infrastructure must be defined, versioned, and managed as code, ensuring reproducibility, consistency, and automated provisioning across all environments.
Key Guidelines:
- Maintain infrastructure code in version control
- Use declarative rather than imperative definitions
- Implement modular and reusable components
- Follow the principle of idempotency
- Document all configuration parameters
Why This Matters
Managing infrastructure as code reduces human error, ensures consistency across environments, and enables automated, repeatable deployments while maintaining a complete audit trail.
Terraform Module Structure¶
Best Practice Directory Layout:
infrastructure/
├── environments/
│ ├── dev/
│ │ ├── main.tf
│ │ └── variables.tf
│ ├── staging/
│ │ ├── main.tf
│ │ └── variables.tf
│ └── prod/
│ ├── main.tf
│ └── variables.tf
├── modules/
│ ├── networking/
│ ├── kubernetes/
│ └── database/
└── shared/
└── variables.tf
Terraform Implementation Example¶
# Define explicit variable types and validation
variable "environment" {
type = string
description = "Environment name (dev, staging, or prod)"
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
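variable "region" {
type = string
description = "AWS region for deployment"
default = "us-west-2"
}
# Node group sizes per environment, referenced by the kubernetes module below
variable "node_sizes" {
type = map(map(number))
description = "Per-environment node counts with keys desired, max, and min"
}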
# Local variables for common configurations
locals {
vpc_cidr = {
dev = "10.0.0.0/16"
staging = "10.1.0.0/16"
prod = "10.2.0.0/16"
}
common_tags = {
Environment = var.environment
ManagedBy = "terraform"
Team = "platform"
}
}
# Modular resource organization
module "vpc" {
source = "./modules/vpc"
environment = var.environment
cidr_block = local.vpc_cidr[var.environment]
tags = local.common_tags
}
# Dependencies and relationships
module "kubernetes" {
source = "./modules/kubernetes"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
cluster_version = "1.24"
node_groups = {
general = {
desired_size = lookup(var.node_sizes[var.environment], "desired", 2)
max_size = lookup(var.node_sizes[var.environment], "max", 4)
min_size = lookup(var.node_sizes[var.environment], "min", 1)
instance_types = ["t3.medium"]
}
}
depends_on = [module.vpc]
}
Configuration Management with Ansible¶
Role-based Configuration¶
Core Principle: Configuration management should be idempotent, role-based, and maintain clear separation between code, configuration, and variables.
# playbook.yml - Main playbook structure
---
- name: Configure application servers
  hosts: app_servers
  become: true
  # "environment" is a reserved Ansible keyword, so the target environment
  # is passed as deploy_env (for example: -e deploy_env=staging)
  vars_files:
    - "vars/{{ deploy_env }}.yml"
  pre_tasks:
    - name: Validate environment variables
      assert:
        that:
          - deploy_env is defined
          - deploy_env in ['dev', 'staging', 'prod']
        msg: "deploy_env must be set to dev, staging, or prod"
  roles:
    - role: common
      tags: ['common', 'setup']
    - role: nginx
      tags: ['web', 'nginx']
      vars:
        nginx_worker_processes: "{{ 'auto' if deploy_env == 'prod' else '2' }}"
    - role: application
      tags: ['app']
  post_tasks:
    - name: Verify configuration
      include_tasks: tasks/verify.yml
Application Role Tasks¶
# roles/application/tasks/main.yml
---
- name: Install application dependencies
apt:
# Pass the whole list so apt installs everything in one transaction
name: "{{ application_dependencies }}"
state: present
update_cache: yes
tags: ['install']
- name: Create application directories
file:
path: "{{ item }}"
state: directory
owner: "{{ app_user }}"
group: "{{ app_group }}"
mode: '0755'
loop:
- "{{ app_config_path }}"
- "{{ app_data_path }}"
- "{{ app_log_path }}"
tags: ['setup']
- name: Configure application service
template:
src: application.service.j2
dest: /etc/systemd/system/application.service
mode: '0644'
notify: restart application
tags: ['config']
- name: Ensure application is running
systemd:
name: application
state: started
enabled: yes
tags: ['service']
Idempotency is Critical
All Ansible tasks should be idempotent - running them multiple times should produce the same result without unintended side effects.
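The same property can be illustrated outside Ansible. This is a minimal sketch of an idempotent operation: `ensure_line` is a hypothetical helper that describes a desired state ("this line is present") and reports whether it had to change anything, which is the pattern behind the "changed" and "ok" results of Ansible modules.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, nothing to do
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "app.conf"
    first = ensure_line(config, "max_connections=100")   # creates the file and adds the line
    second = ensure_line(config, "max_connections=100")  # second run changes nothing
    content = config.read_text()

print(first, second)
```

A non-idempotent version would append unconditionally, and every run would grow the file by one line.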
Secrets Management¶
HashiCorp Vault Integration¶
Core Principle: Sensitive data must never be stored in plain text and should be managed using dedicated secrets management solutions.
provider "vault" {
address = var.vault_addr
}
data "vault_generic_secret" "db_creds" {
path = "secret/${var.environment}/database"
}
resource "kubernetes_secret" "application" {
metadata {
name = "app-secrets"
namespace = var.namespace
}
data = {
DB_PASSWORD = data.vault_generic_secret.db_creds.data["password"]
DB_USERNAME = data.vault_generic_secret.db_creds.data["username"]
}
}
AWS Secrets Manager Integration¶
import json
import secrets
import string

import boto3
from botocore.exceptions import ClientError


class SecretRotationError(Exception):
    """Raised when a secret rotation cannot be completed."""


def generate_secure_password(length=32):
    """Generate a random password (illustrative helper; adjust to your password policy)."""
    alphabet = string.ascii_letters + string.digits
    return ''.join(secrets.choice(alphabet) for _ in range(length))


def update_database_password(username, new_password):
    """Placeholder: change the user's password in the application database."""
    raise NotImplementedError("Implement for your database engine")


def get_secret(secret_name, region_name="us-west-2"):
    """
    Retrieve secret from AWS Secrets Manager.
    """
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    try:
        response = client.get_secret_value(SecretId=secret_name)
        if 'SecretString' in response:
            return json.loads(response['SecretString'])
        else:
            return response['SecretBinary']
    except ClientError as e:
        if e.response['Error']['Code'] == 'ResourceNotFoundException':
            raise ValueError(f"Secret {secret_name} not found") from e
        elif e.response['Error']['Code'] == 'InvalidRequestException':
            raise ValueError(f"Invalid request for secret {secret_name}") from e
        else:
            raise


def rotate_secret(secret_id):
    """
    Rotate database credentials in AWS Secrets Manager.
    """
    client = boto3.client('secretsmanager')
    try:
        # Get current secret value
        response = client.get_secret_value(SecretId=secret_id)
        current_secret = json.loads(response['SecretString'])
        # Generate new credentials
        new_password = generate_secure_password()
        # Update application database
        update_database_password(
            username=current_secret['username'],
            new_password=new_password
        )
        # Update secret in Secrets Manager. If this call fails, the database
        # already uses the new password, so a real implementation needs a rollback.
        client.put_secret_value(
            SecretId=secret_id,
            SecretString=json.dumps({
                'username': current_secret['username'],
                'password': new_password,
                'host': current_secret['host'],
                'port': current_secret['port']
            })
        )
        return True
    except Exception as e:
        # Implement proper error handling and rollback
        raise SecretRotationError(f"Failed to rotate secret: {str(e)}") from e
Deployment Automation and Pipelines¶
CI/CD Pipeline Design¶
Core Principle: Deployment pipelines must be automated, reliable, and provide clear visibility into the deployment process while maintaining security and compliance requirements.
Key Pipeline Stages:
| Stage | Purpose | Key Actions |
|---|---|---|
| Build | Compile and package application | Code compilation, dependency resolution |
| Test | Validate functionality | Unit tests, integration tests, security scans |
| Deploy | Release to environment | Environment provisioning, artifact deployment |
| Verify | Confirm deployment health | Health checks, smoke tests, monitoring |
Pipeline Best Practices
Well-designed deployment pipelines ensure reliable, repeatable deployments while reducing human error and maintaining security standards. Every stage should be automated and provide clear feedback.
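The stage table implies a fail-fast contract: each stage runs only if the previous one passed, and every stage reports a clear result. A minimal sketch of that contract follows; `run_pipeline` and the lambda stages are hypothetical stand-ins for real build, test, deploy, and verify jobs.

```python
from typing import Callable


def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[tuple[str, str]]:
    """Run stages in order; after the first failure, mark the remaining stages as skipped."""
    results = []
    failed = False
    for name, action in stages:
        if failed:
            results.append((name, "skipped"))
            continue
        ok = action()
        results.append((name, "passed" if ok else "failed"))
        failed = not ok
    return results


# Stand-in stages: the test stage fails, so deploy and verify never run.
report = run_pipeline([
    ("build", lambda: True),
    ("test", lambda: False),
    ("deploy", lambda: True),
    ("verify", lambda: True),
])
for stage, status in report:
    print(f"{stage}: {status}")
```

CI systems express the same idea declaratively, for example through job dependencies, rather than through an explicit loop.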
GitHub Actions Pipeline Example¶
# .github/workflows/deployment.yml
name: Deployment Pipeline
on:
push:
branches: [ main, develop ]
pull_request:
branches: [ main ]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
build-and-test:
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
- name: Login to Container Registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
- name: Build and push Docker image
uses: docker/build-push-action@v4
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Run Security Scan
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
format: 'table'
exit-code: '1'
ignore-unfixed: true
severity: 'CRITICAL,HIGH'
- name: Run tests
run: |
docker run --rm \
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
npm test
deploy-staging:
needs: build-and-test
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/develop'
environment:
name: staging
url: https://staging.example.com
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v1
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: us-west-2
- name: Update EKS deployment
run: |
aws eks update-kubeconfig --name staging-cluster
kubectl set image deployment/app-deployment \
app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
kubectl rollout status deployment/app-deployment --timeout=5m
- name: Run smoke tests
run: |
curl -f https://staging.example.com/health || exit 1
deploy-production:
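# Depend on build-and-test directly: deploy-staging is skipped on main,
# and a skipped dependency would cause this job to be skipped as well
needs: build-and-test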
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
environment:
name: production
url: https://api.example.com
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v1
with:
aws-access-key-id: ${{ secrets.PROD_AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.PROD_AWS_SECRET_ACCESS_KEY }}
aws-region: us-west-2
- name: Deploy to Production
run: |
aws eks update-kubeconfig --name prod-cluster
kubectl set image deployment/app-deployment \
app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
kubectl rollout status deployment/app-deployment --timeout=10m
- name: Verify deployment
run: |
# Health check
curl -f https://api.example.com/health || exit 1
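# Placeholder: this only waits 5 minutes. Replace it with a query against
# your monitoring system that checks error rates during the window.
sleep 300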
- name: Notify team
if: always()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "Production deployment: ${{ job.status }}",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "Deployment to production: *${{ job.status }}*\nCommit: ${{ github.sha }}"
}
}
]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
Jenkins Pipeline Example¶
// Jenkinsfile
pipeline {
agent any
environment {
DOCKER_REGISTRY = 'company-registry.com'
APP_NAME = 'myapp'
VERSION = sh(script: 'git describe --tags --always', returnStdout: true).trim()
KUBECONFIG = credentials('kubeconfig-prod')
}
stages {
stage('Build') {
steps {
script {
docker.build("${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}")
}
}
}
stage('Test') {
parallel {
stage('Unit Tests') {
steps {
sh 'npm test'
junit 'test-results/**/*.xml'
}
}
stage('Integration Tests') {
steps {
sh 'npm run integration-test'
}
}
stage('Security Scan') {
steps {
sh """
trivy image \
--severity HIGH,CRITICAL \
--exit-code 1 \
${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}
"""
}
}
}
}
stage('Push Image') {
steps {
script {
docker.withRegistry("https://${DOCKER_REGISTRY}", 'registry-credentials') {
docker.image("${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}").push()
docker.image("${DOCKER_REGISTRY}/${APP_NAME}:${VERSION}").push('latest')
}
}
}
}
stage('Deploy to Staging') {
when { branch 'develop' }
steps {
script {
deployToEnvironment(
environment: 'staging',
version: VERSION
)
}
}
}
stage('Deploy to Production') {
when { branch 'main' }
input {
message 'Deploy to production?'
ok 'Yes, deploy!'
}
steps {
script {
deployToEnvironment(
environment: 'production',
version: VERSION
)
}
}
}
stage('Verify Deployment') {
// Only verify production when this build actually deployed to it
when { branch 'main' }
steps {
script {
sh """
kubectl rollout status deployment/${APP_NAME} -n production
curl -f https://api.example.com/health || exit 1
"""
}
}
}
}
post {
success {
slackSend(
channel: '#deployments',
color: 'good',
message: "Deployment successful: ${APP_NAME}:${VERSION}"
)
}
failure {
slackSend(
channel: '#deployments',
color: 'danger',
message: "Deployment failed: ${APP_NAME}:${VERSION}"
)
}
always {
cleanWs()
}
}
}
// Helper function for deployment
def deployToEnvironment(Map config) {
sh """
kubectl config use-context ${config.environment}
kubectl set image deployment/${APP_NAME} \
app=${DOCKER_REGISTRY}/${APP_NAME}:${config.version} \
-n ${config.environment}
kubectl rollout status deployment/${APP_NAME} \
-n ${config.environment} \
--timeout=5m
"""
}
Pipeline Optimization
Use parallel stages for tests and scans to reduce total pipeline execution time. Cache dependencies between runs to speed up builds.
Deployment Strategies¶
Blue-Green Deployment¶
Principle: Maintain two identical production environments, switching traffic between them to achieve zero-downtime deployments.
Key Benefits:
- Instant, simple rollback by switching traffic back to the previous environment
- Zero downtime during deployment
- Full production environment testing before switch
# Blue deployment (current)
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-blue
labels:
version: blue
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: blue
template:
metadata:
labels:
app: myapp
version: blue
spec:
containers:
- name: app
image: myapp:1.0.0
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
---
# Green deployment (new)
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-green
labels:
version: green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: green
template:
metadata:
labels:
app: myapp
version: green
spec:
containers:
- name: app
image: myapp:2.0.0
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
---
# Service that switches between blue and green
apiVersion: v1
kind: Service
metadata:
name: app-service
spec:
selector:
app: myapp
version: blue # Change to 'green' to switch
ports:
- port: 80
targetPort: 8080
#!/bin/bash
# blue-green-deploy.sh
set -euo pipefail

NAMESPACE="production"
APP_NAME="myapp"        # image name and value of the "app" label
RESOURCE_PREFIX="app"   # matches the app-blue, app-green, and app-service manifests
NEW_VERSION="${1:?Usage: $0 <new-version>}"

# Determine current active environment
CURRENT=$(kubectl get service "${RESOURCE_PREFIX}-service" -n "${NAMESPACE}" \
  -o jsonpath='{.spec.selector.version}')
if [ "$CURRENT" = "blue" ]; then
  NEW_ENV="green"
  OLD_ENV="blue"
else
  NEW_ENV="blue"
  OLD_ENV="green"
fi
echo "Current environment: $CURRENT"
echo "Deploying to: $NEW_ENV"

# Deploy new version to inactive environment
kubectl set image "deployment/${RESOURCE_PREFIX}-${NEW_ENV}" \
  app="${APP_NAME}:${NEW_VERSION}" \
  -n "${NAMESPACE}"

# Wait for deployment to be ready
kubectl rollout status "deployment/${RESOURCE_PREFIX}-${NEW_ENV}" \
  -n "${NAMESPACE}" \
  --timeout=5m

# Run smoke tests (assumes curl is available inside the application image)
echo "Running smoke tests..."
POD=$(kubectl get pod -n "${NAMESPACE}" \
  -l "app=${APP_NAME},version=${NEW_ENV}" \
  -o jsonpath='{.items[0].metadata.name}')

if kubectl exec -n "${NAMESPACE}" "${POD}" -- curl -f http://localhost:8080/health; then
  echo "Smoke tests passed. Switching traffic..."
  # Switch service to new environment
  kubectl patch service "${RESOURCE_PREFIX}-service" -n "${NAMESPACE}" \
    -p "{\"spec\":{\"selector\":{\"version\":\"${NEW_ENV}\"}}}"
  echo "Traffic switched to ${NEW_ENV}"
  echo "Monitor for 5 minutes before removing old deployment"
  sleep 300
  # Optional: Scale down old environment
  # kubectl scale "deployment/${RESOURCE_PREFIX}-${OLD_ENV}" \
  #   --replicas=0 -n "${NAMESPACE}"
else
  echo "Smoke tests failed. Keeping ${OLD_ENV} active"
  exit 1
fi
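The decision at the core of the script is small enough to isolate and test on its own. The sketch below is an illustration rather than a replacement for the script: `plan_switch` is a hypothetical helper that, given the Service's current `version` selector, returns which environment to deploy to and the JSON patch that would move traffic. Unlike the script, it rejects unknown selector values instead of defaulting to blue.

```python
import json


def plan_switch(current: str) -> dict:
    """Work out the target environment and the Service patch for a blue-green switch."""
    if current not in ("blue", "green"):
        raise ValueError(f"unexpected active environment: {current!r}")
    target = "green" if current == "blue" else "blue"
    patch = {"spec": {"selector": {"version": target}}}
    return {
        "deploy_to": target,            # inactive environment receives the new version
        "keep_as_rollback": current,    # previous environment stays up for rollback
        "service_patch": json.dumps(patch),
    }


plan = plan_switch("blue")
print(plan["deploy_to"])
print(plan["service_patch"])
```

Rolling back is the same operation in reverse: patch the selector to the value in `keep_as_rollback`.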
Canary Deployment¶
Principle: Release changes incrementally to a subset of users, monitoring for issues before full rollout.
Key Benefits:
- Reduced blast radius of failures
- Early detection of issues
- Gradual traffic shifting
- Data-driven rollout decisions
# Using Istio for canary deployment
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: app-vsvc
spec:
hosts:
- app.example.com
http:
- match:
- headers:
canary:
exact: "true"
route:
- destination:
host: app-canary
subset: v2
- route:
- destination:
host: app-stable
subset: v1
weight: 90
- destination:
host: app-canary
subset: v2
weight: 10
Canary Monitoring Configuration:
# prometheus/canary-rules.yaml
groups:
- name: canary-deployment
interval: 30s
rules:
- alert: CanaryErrorRateHigh
expr: |
sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{version="canary"}[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Canary deployment error rate too high"
description: "Canary version showing {{ $value | humanizePercentage }} error rate"
- alert: CanaryLatencyHigh
expr: |
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket{version="canary"}[5m]))
) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Canary deployment latency too high"
description: "Canary p95 latency: {{ $value }}s"
Canary Rollback
Always define clear success criteria before starting a canary deployment. Automate rollback when metrics exceed thresholds.
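Success criteria are easiest to automate when they are written as a pure function of the observed metrics. The sketch below reuses the thresholds from the alert rules above (5% error rate, 1 second p95 latency); `CanaryMetrics`, `canary_decision`, and the `min_requests` floor are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    requests: int          # total requests served by the canary in the window
    errors_5xx: int        # how many of those returned a 5xx status
    p95_latency_s: float   # 95th percentile latency in seconds


def canary_decision(
    metrics: CanaryMetrics,
    max_error_rate: float = 0.05,
    max_p95_s: float = 1.0,
    min_requests: int = 100,
) -> str:
    """Return 'wait', 'rollback', or 'promote' for one observation window."""
    # Too little traffic to judge: keep the canary at its current weight.
    if metrics.requests < min_requests:
        return "wait"
    error_rate = metrics.errors_5xx / metrics.requests
    if error_rate > max_error_rate or metrics.p95_latency_s > max_p95_s:
        return "rollback"
    return "promote"


print(canary_decision(CanaryMetrics(requests=2000, errors_5xx=20, p95_latency_s=0.4)))
```

A rollout controller would call a function like this once per evaluation interval and shift traffic weights according to the result.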
Rolling Deployment¶
Principle: Gradually replace instances of the application with new versions while maintaining service availability.
Key Benefits:
- No infrastructure duplication needed
- Simple implementation
- Rollout halts if new pods fail their readiness checks (roll back with kubectl rollout undo)
- Resource efficient
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
spec:
replicas: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Max new pods above desired
maxUnavailable: 0 # Max old pods that can be down
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: app
image: myapp:2.0.0
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
Rolling Update Process:
- New pod is created
- Wait for new pod to be ready
- Old pod is terminated
- Repeat until all pods updated
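The loop above can be simulated to see how `maxSurge` and `maxUnavailable` bound the rollout. This is a simplified model, not Kubernetes' actual controller logic: `rolling_update` is a hypothetical function that assumes every new pod becomes ready immediately and tracks only pod counts.

```python
def rolling_update(replicas: int, max_surge: int = 1, max_unavailable: int = 0):
    """Return the (old_pods, new_pods) counts after each step of a simulated rollout."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable cannot both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Create new pods, never exceeding replicas + max_surge pods in total.
        can_add = min(replicas + max_surge - (old + new), replicas - new)
        new += max(0, can_add)
        # Terminate old pods while keeping at least replicas - max_unavailable running.
        can_remove = min(old, (old + new) - (replicas - max_unavailable))
        old -= max(0, can_remove)
        steps.append((old, new))
    return steps


for old_pods, new_pods in rolling_update(5, max_surge=1, max_unavailable=0):
    print(f"old={old_pods} new={new_pods} total={old_pods + new_pods}")
```

With `max_unavailable=0` the total never drops below the desired replica count, which is why that setting preserves full capacity during the rollout.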
Environment Management¶
Environment Parity¶
Core Principle: Development, staging, and production environments must maintain maximum parity to ensure reliable testing and deployment processes.
Key Guidelines:
- Use identical configuration structures across environments
- Maintain consistent versions of all dependencies
- Implement similar scaling and redundancy patterns
- Use production-like (anonymized) data in lower environments
- Automate environment provisioning
Why Parity Matters
Environment parity minimizes "it works on my machine" issues and ensures that testing in lower environments accurately predicts production behavior. When environments differ, bugs may only surface in production, leading to costly incidents.
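One cheap parity check is to compare which configuration keys each environment defines. The sketch below assumes each environment's settings have been loaded into a dictionary; `config_drift` is a hypothetical helper that reports, per environment, the keys that exist elsewhere but are missing there. It compares structure only, since values are expected to differ.

```python
def config_drift(environments: dict[str, dict]) -> dict[str, set[str]]:
    """Map each environment to the config keys it lacks compared with the others."""
    all_keys = set()
    for config in environments.values():
        all_keys.update(config)
    drift = {}
    for name, config in environments.items():
        missing = all_keys - set(config)
        if missing:
            drift[name] = missing
    return drift


environments = {
    "dev": {"DATABASE_URL": "...", "REDIS_URL": "...", "DEBUG": "true"},
    "staging": {"DATABASE_URL": "...", "REDIS_URL": "..."},
    "prod": {"DATABASE_URL": "..."},
}
for env, keys in config_drift(environments).items():
    print(env, "is missing", sorted(keys))
```

Some differences are intentional (a debug flag that exists only in dev, for instance), so a real check would likely carry an allowlist of keys that may differ.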
Configuration Management Structure¶
Environment Variable Management¶
Best Practices:
| Variable Type | Storage Method | Example |
|---|---|---|
| Non-sensitive Config | ConfigMaps | Feature flags, API URLs |
| Sensitive Data | Secrets Manager | Database passwords, API keys |
| Environment-specific | Environment files | Resource limits, replica counts |
| Build-time | Build arguments | Version numbers, build dates |
Kubernetes ConfigMap and Secrets¶
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
template:
spec:
containers:
- name: app
image: myapp:1.0.0
envFrom:
- configMapRef:
name: app-config
- secretRef:
name: app-secrets
env:
- name: ENVIRONMENT
value: "production"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
Local Development Environment¶
Comprehensive Docker Compose Setup¶
# docker-compose.dev.yml
version: '3.8'
x-logging: &default-logging
driver: json-file
options:
max-size: "10m"
max-file: "3"
services:
app:
build:
context: .
target: development
args:
NODE_ENV: development
volumes:
- .:/app:delegated
- /app/node_modules
- ${HOME}/.aws:/root/.aws:ro
environment:
- NODE_ENV=development
- DB_HOST=db
- DB_PORT=5432
- DB_NAME=appdb
- DB_USER=devuser
- DB_PASSWORD=devpass
- REDIS_HOST=cache
- REDIS_PORT=6379
- AWS_PROFILE=${AWS_PROFILE:-default}
ports:
- "${PORT:-3000}:3000"
- "9229:9229" # Node.js debugger
depends_on:
db:
condition: service_healthy
cache:
condition: service_healthy
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
logging: *default-logging
db:
image: postgres:14-alpine
environment:
POSTGRES_DB: appdb
POSTGRES_USER: devuser
POSTGRES_PASSWORD: devpass
POSTGRES_INITDB_ARGS: "-E UTF8"
volumes:
- pgdata:/var/lib/postgresql/data
- ./init-scripts:/docker-entrypoint-initdb.d:ro
- ./backups:/backups
ports:
- "5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U devuser -d appdb"]
interval: 10s
timeout: 5s
retries: 5
logging: *default-logging
cache:
image: redis:7-alpine
command: redis-server --appendonly yes
volumes:
- redisdata:/data
ports:
- "6379:6379"
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 3
logging: *default-logging
# Development-only services
mailhog:
image: mailhog/mailhog:latest
ports:
- "1025:1025" # SMTP
- "8025:8025" # Web UI
logging: *default-logging
adminer:
image: adminer:latest
ports:
- "8080:8080"
environment:
ADMINER_DEFAULT_SERVER: db
depends_on:
- db
logging: *default-logging
volumes:
pgdata:
driver: local
redisdata:
driver: local
networks:
default:
driver: bridge
ipam:
config:
- subnet: 172.28.0.0/16
Development Environment Setup Script¶
#!/bin/bash
# setup-dev.sh
set -e
echo "Setting up local development environment..."
# Check prerequisites
check_prerequisites() {
echo "Checking prerequisites..."
command -v docker >/dev/null 2>&1 || {
echo "Error: Docker is not installed"
exit 1
}
command -v docker-compose >/dev/null 2>&1 || {
echo "Error: Docker Compose is not installed"
exit 1
}
echo "Prerequisites satisfied"
}
# Create necessary directories
setup_directories() {
echo "Creating project directories..."
mkdir -p backups
mkdir -p init-scripts
mkdir -p logs
echo "Directories created"
}
# Copy environment template
setup_env_file() {
if [ ! -f .env ]; then
echo "Creating .env file..."
cp .env.example .env
echo ".env file created"
echo "WARNING: Please review and update .env with your settings"
else
echo ".env file already exists"
fi
}
# Start services
start_services() {
echo "Starting Docker services..."
docker-compose -f docker-compose.dev.yml up -d
echo "Services started"
}
# Wait for services
wait_for_services() {
echo "Waiting for services to be healthy..."
max_attempts=30
attempt=0
while [ $attempt -lt $max_attempts ]; do
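status=$(docker-compose -f docker-compose.dev.yml ps)
# Require at least one healthy service and none that is still starting or unhealthy
if echo "$status" | grep -q "(healthy)" && ! echo "$status" | grep -Eq "unhealthy|health: starting"; then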
echo "Services are healthy"
return 0
fi
attempt=$((attempt + 1))
echo "Waiting... ($attempt/$max_attempts)"
sleep 2
done
echo "Error: Services did not become healthy in time"
return 1
}
# Run database migrations
run_migrations() {
echo "Running database migrations..."
docker-compose -f docker-compose.dev.yml exec -T app npm run migrate
echo "Migrations completed"
}
# Seed development data
seed_data() {
echo "Seeding development data..."
docker-compose -f docker-compose.dev.yml exec -T app npm run seed
echo "Data seeded"
}
# Print access information
print_info() {
echo ""
echo "========================================="
echo "Development environment is ready!"
echo "========================================="
echo ""
echo "Application: http://localhost:3000"
echo "Database UI: http://localhost:8080"
echo "Mail Server: http://localhost:8025"
echo ""
echo "Useful commands:"
echo " docker-compose -f docker-compose.dev.yml logs -f # View logs"
echo " docker-compose -f docker-compose.dev.yml down # Stop services"
echo " docker-compose -f docker-compose.dev.yml restart # Restart services"
echo ""
}
# Main execution
main() {
check_prerequisites
setup_directories
setup_env_file
start_services
wait_for_services || exit 1
run_migrations
seed_data
print_info
}
main
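The readiness loop in `wait_for_services` works on the text output of `docker-compose ps`, where a plain substring match on `healthy` also matches `unhealthy`. As a minimal sketch of a stricter check, the Python helpers below treat the environment as ready only when every service that reports a health state reports `(healthy)`. The function names and the sample status strings are illustrative assumptions, not part of the original script.

```python
import time
from typing import Callable, Iterable


def all_healthy(statuses: Iterable[str]) -> bool:
    """Return True only if every service with a health state is '(healthy)'."""
    # Services without a healthcheck show no health marker and are ignored.
    with_health = [s for s in statuses if "health" in s]
    if not with_health:
        return False
    return all("(healthy)" in s for s in with_health)


def wait_until(check: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False
```

In practice `check` would wrap a call that collects the status column for each service; how that column is obtained depends on the Compose version in use.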
Development Productivity
Use volume mounts for hot reloading during development. This allows code changes to be reflected immediately without rebuilding containers.
Infrastructure Testing¶
Testing Layers¶
Core Principle: Infrastructure code must be validated through multiple testing layers, ensuring both correctness and compliance before any deployment.
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.50.0
    hooks:
      - id: terraform_fmt
      - id: terraform_docs
      - id: terraform_tflint
      - id: terraform_tfsec
      - id: terraform_validate
  - repo: https://github.com/bridgecrewio/checkov
    rev: 2.0.0
    hooks:
      - id: checkov
        args: [--directory, .]
# test_infrastructure.py
import pytest
from infrastructure.validators import validate_vpc_config


def test_vpc_configuration():
    """Test VPC configuration validation."""
    config = {
        'cidr_block': '10.0.0.0/16',
        'region': 'us-west-2',
        'availability_zones': 3
    }
    result = validate_vpc_config(config)
    assert result['subnet_count'] == 6
    assert result['nat_gateway_count'] == 3
    assert result['valid'] is True


def test_invalid_cidr_block():
    """Test validation catches invalid CIDR blocks."""
    config = {
        'cidr_block': '10.0.0.0/8',  # Too large
        'region': 'us-west-2'
    }
    with pytest.raises(ValueError):
        validate_vpc_config(config)
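The tests above import `validate_vpc_config` from a project module that is not shown here. The following is a minimal sketch of what such a validator might look like so that both tests pass. The subnet layout (one public and one private subnet per availability zone) and the NAT gateway count (one per zone) are assumptions inferred from the expected values in the test, not a documented rule of the project.

```python
import ipaddress


def validate_vpc_config(config: dict) -> dict:
    """Validate a VPC config dict and derive the expected subnet layout."""
    # ip_network raises ValueError for malformed CIDR strings.
    network = ipaddress.ip_network(config["cidr_block"], strict=True)
    # AWS accepts VPC CIDR blocks between /16 and /28.
    if not 16 <= network.prefixlen <= 28:
        raise ValueError(
            f"CIDR prefix /{network.prefixlen} is outside the /16 to /28 range"
        )
    zones = config.get("availability_zones", 1)
    if zones < 1:
        raise ValueError("availability_zones must be at least 1")
    return {
        "valid": True,
        # Assumption: one public and one private subnet per availability zone.
        "subnet_count": zones * 2,
        # Assumption: one NAT gateway per availability zone.
        "nat_gateway_count": zones,
    }
```

Keeping the validator a pure function of the config dictionary is what lets these unit tests run without any cloud credentials.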
// infrastructure_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestKubernetesCluster(t *testing.T) {
	t.Parallel()
	terraformOptions := &terraform.Options{
		TerraformDir: "../infrastructure/kubernetes",
		Vars: map[string]interface{}{
			"environment": "test",
			"region":      "us-west-2",
		},
	}
	// Destroy is deferred first so resources are cleaned up even if apply fails
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)
	clusterName := terraform.Output(t, terraformOptions, "cluster_name")
	assert.NotEmpty(t, clusterName)
	// The output is a list of node groups, so check that at least one exists
	nodeGroups := terraform.OutputList(t, terraformOptions, "node_groups")
	assert.GreaterOrEqual(t, len(nodeGroups), 1)
}
# compliance-tests.yaml
tests:
  - name: ensure_encryption_at_rest
    resource_type: aws_rds_cluster
    assertions:
      - property: storage_encrypted
        operator: equals
        value: true
    severity: critical
  - name: verify_vpc_flow_logs
    resource_type: aws_vpc
    assertions:
      - property: enable_flow_logs
        operator: equals
        value: true
    severity: high
  - name: check_s3_versioning
    resource_type: aws_s3_bucket
    assertions:
      - property: versioning.enabled
        operator: equals
        value: true
    severity: medium
  - name: verify_backup_retention
    resource_type: aws_db_instance
    assertions:
      - property: backup_retention_period
        operator: greater_than
        value: 7
    severity: high
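The compliance file above is declarative, so something has to interpret its `property`, `operator`, and `value` fields. As a rough sketch, the evaluator below handles the two operators that appear in the file and resolves dotted property paths such as `versioning.enabled`. The function names, the dictionary shape of a resource, and the placement of `severity` at the test level are all assumptions made for illustration; the document does not specify the tool that consumes this file.

```python
from typing import Any, List

# Only the operators used in compliance-tests.yaml are implemented here.
OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "greater_than": lambda actual, expected: actual is not None and actual > expected,
}


def get_property(resource: dict, path: str) -> Any:
    """Resolve a dotted path such as 'versioning.enabled'; None if a key is missing."""
    current: Any = resource
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def evaluate(test: dict, resource: dict) -> List[str]:
    """Return one failure message per assertion that the resource violates."""
    failures = []
    for assertion in test["assertions"]:
        actual = get_property(resource, assertion["property"])
        check = OPERATORS[assertion["operator"]]
        if not check(actual, assertion["value"]):
            failures.append(
                f"{test['name']}: {assertion['property']} is {actual!r}, "
                f"expected {assertion['operator']} {assertion['value']!r} "
                f"(severity: {test.get('severity', 'unknown')})"
            )
    return failures
```

A missing property is treated as a failure rather than skipped, which errs on the side of flagging resources that never set the attribute at all.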
Environment Decommissioning¶
Systematic Cleanup Process¶
Core Principle: Environment decommissioning must be systematic, ensuring proper cleanup of resources, data preservation, and documentation updates.
Decommissioning Checklist:
| Phase | Tasks | Verification |
|---|---|---|
| Planning | Create decommission plan, notify stakeholders | Plan reviewed and approved |
| Backup | Backup critical data, export configurations | Backups validated and accessible |
| Cleanup | Remove Kubernetes resources, cloud infrastructure | All resources removed |
| Documentation | Update docs, remove access controls | Documentation current |
| Verification | Validate complete removal, cost verification | No remaining resources or costs |
Decommissioning Script¶
#!/bin/bash
# decommission-environment.sh
set -e
ENVIRONMENT=$1
DRY_RUN=${2:-false}
if [ -z "$ENVIRONMENT" ]; then
echo "Usage: $0 <environment> [dry-run: true|false]"
exit 1
fi
echo "Decommissioning environment: $ENVIRONMENT"
if [ "$DRY_RUN" = "true" ]; then
echo "WARNING: DRY RUN MODE - No actual changes will be made"
fi
# Step 1: Backup critical data
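backup_data() {
echo "Creating final backups..."
# DRY_RUN is always set ("true" or "false"), so compare its value explicitly
if [ "$DRY_RUN" = "true" ]; then
echo "[dry-run] Would create snapshot ${ENVIRONMENT}-final-$(date +%Y%m%d)"
else
aws rds create-db-snapshot \
--db-instance-identifier "${ENVIRONMENT}-db" \
--db-snapshot-identifier "${ENVIRONMENT}-final-$(date +%Y%m%d)"
fi
# Exporting manifests is read-only, so it runs in dry-run mode as well
kubectl get all -n "${ENVIRONMENT}" -o yaml > "${ENVIRONMENT}-backup.yaml"
echo "Backups completed"
}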
# Step 2: Scale down services
scale_down_services() {
echo "Scaling down services..."
if [ "$DRY_RUN" = "true" ]; then
kubectl scale deployment --all --replicas=0 -n "${ENVIRONMENT}" --dry-run=client
else
kubectl scale deployment --all --replicas=0 -n "${ENVIRONMENT}"
fi
echo "Services scaled down"
}
# Step 3: Remove Kubernetes resources
cleanup_kubernetes() {
echo "Removing Kubernetes resources..."
if [ "$DRY_RUN" = "true" ]; then
kubectl delete namespace "${ENVIRONMENT}" --dry-run=client
else
kubectl delete namespace "${ENVIRONMENT}"
fi
echo "Kubernetes resources removed"
}
# Step 4: Remove cloud infrastructure
cleanup_cloud_resources() {
echo "Removing cloud infrastructure..."
# Run in a subshell so the directory change does not leak into later steps
(
cd "infrastructure/environments/${ENVIRONMENT}"
if [ "$DRY_RUN" = "true" ]; then
terraform plan -destroy
else
terraform destroy -auto-approve
fi
)
echo "Cloud resources removed"
}
# Step 5: Remove DNS entries
cleanup_dns() {
echo "Listing DNS entries for removal..."
# HOSTED_ZONE_ID must be exported before running this script
aws route53 list-resource-record-sets \
--hosted-zone-id "${HOSTED_ZONE_ID:?HOSTED_ZONE_ID must be set}" \
--query "ResourceRecordSets[?contains(Name, '${ENVIRONMENT}')]" \
--output json \
| jq -r '.[] | .Name' \
| while read -r record; do
echo "DNS record to remove: $record"
# Add deletion logic here
done
echo "DNS entries listed; deletion logic is not implemented in this script"
}
# Step 6: Revoke access
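revoke_access() {
echo "Revoking access credentials..."
# Revoke IAM roles (text output is tab-separated, so split it into one role per line)
aws iam list-roles --query "Roles[?contains(RoleName, '${ENVIRONMENT}')].RoleName" \
--output text | tr '\t' '\n' | while read -r role; do
[ -n "$role" ] || continue
if [ "$DRY_RUN" = "true" ]; then
echo "[dry-run] Would remove role: $role"
else
echo "Removing role: $role"
aws iam delete-role --role-name "$role"
fi
done
# Delete service accounts
if [ "$DRY_RUN" = "true" ]; then
kubectl delete serviceaccount --all -n "${ENVIRONMENT}" --dry-run=client
else
kubectl delete serviceaccount --all -n "${ENVIRONMENT}"
fi
echo "Access revoked"
}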
# Step 7: Generate decommission report
generate_report() {
echo "Generating decommission report..."
cat > ${ENVIRONMENT}-decommission-report.md <<EOF
# Environment Decommission Report
**Environment:** ${ENVIRONMENT}
**Date:** $(date)
**Performed By:** $(whoami)
## Summary
Environment ${ENVIRONMENT} has been successfully decommissioned.
## Backup Locations
- Database Snapshot: ${ENVIRONMENT}-final-$(date +%Y%m%d)
- Configuration Backup: ${ENVIRONMENT}-backup.yaml
- Archive Location: s3://backups/${ENVIRONMENT}/
## Resources Removed
- Kubernetes namespace: ${ENVIRONMENT}
- Cloud infrastructure: infrastructure/environments/${ENVIRONMENT}
- DNS entries: *.${ENVIRONMENT}.example.com
- IAM roles and service accounts
## Verification
- [ ] All cloud resources terminated
- [ ] No ongoing costs
- [ ] Backups accessible
- [ ] Documentation updated
- [ ] Team notified
## Next Steps
1. Verify no unexpected costs appear in next billing cycle
2. Archive documentation after 90 days
3. Delete backups after retention period ($(date -d '+90 days' +%Y-%m-%d))
EOF
echo "Report generated: ${ENVIRONMENT}-decommission-report.md"
}
# Step 8: Verify cleanup
verify_cleanup() {
echo "Verifying cleanup..."
# Check whether the namespace still exists (kubectl get all exits 0 even for a missing namespace)
if kubectl get namespace "${ENVIRONMENT}" >/dev/null 2>&1; then
echo "WARNING: Namespace ${ENVIRONMENT} still exists with these resources:"
kubectl get all -n "${ENVIRONMENT}" || true
fi
# Check for remaining AWS resources
remaining_aws=$(aws resourcegroupstaggingapi get-resources \
--tag-filters Key=Environment,Values=${ENVIRONMENT} \
--query 'ResourceTagMappingList[].ResourceARN' \
--output text)
if [ -n "$remaining_aws" ]; then
echo "WARNING: Some AWS resources still exist:"
echo "$remaining_aws"
else
echo "No remaining resources found"
fi
}
# Main execution
main() {
echo "Starting decommission process..."
echo "Environment: $ENVIRONMENT"
echo "Dry Run: $DRY_RUN"
echo ""
read -p "Are you sure you want to decommission $ENVIRONMENT? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
echo "Decommission cancelled"
exit 0
fi
backup_data
scale_down_services
cleanup_kubernetes
cleanup_cloud_resources
cleanup_dns
revoke_access
verify_cleanup
generate_report
echo ""
echo "Decommission completed!"
echo "Please review the report: ${ENVIRONMENT}-decommission-report.md"
}
main
Critical: Pre-Decommission Verification
Always verify backups are complete and accessible before removing any production resources. Maintain backups according to your data retention policy.
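Dry-run handling in the script above has to be repeated at every destructive step. One way to keep that decision in a single place is a small command wrapper, sketched here in Python. The `run` helper and its behavior are illustrative assumptions, not part of the original script; it only prints the command in dry-run mode and otherwise executes it, raising on failure much like `set -e`.

```python
import shlex
import subprocess
from typing import List, Optional


def run(command: List[str], dry_run: bool = False) -> Optional[subprocess.CompletedProcess]:
    """Execute command, or only print it when dry_run is True."""
    if dry_run:
        # shlex.join quotes arguments so the printed line can be pasted into a shell.
        print(f"[dry-run] {shlex.join(command)}")
        return None
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

With a wrapper like this, each step calls `run([...], dry_run=DRY_RUN)` and no tool-specific dry-run flag is needed, at the cost of losing the server-side validation that flags such as `--dry-run=client` provide.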
Best Practices Summary¶
Container Best Practices¶
| Practice | Implementation | Benefit |
|---|---|---|
| Multi-stage builds | Separate build and runtime stages | Smaller images, better security |
| Non-root users | Create and use dedicated app users | Enhanced security |
| Health checks | Implement readiness and liveness probes | Reliable deployments |
| Resource limits | Define CPU and memory constraints | Predictable performance |
| Image scanning | Automated vulnerability scanning | Early security issue detection |
Infrastructure as Code Best Practices¶
IaC Golden Rules
- Version Everything - All infrastructure code in version control
- Modularize - Create reusable, focused modules
- Document - Explain why, not just what
- Test - Validate changes before production
- Review - Peer review all infrastructure changes
Deployment Best Practices¶
Pre-Deployment:
- All tests passing
- Security scans completed
- Database migrations tested
- Rollback plan documented
- Stakeholders notified
During Deployment:
- Monitor key metrics
- Watch error rates
- Verify health checks
- Check logs for issues
- Ready to rollback if needed
Post-Deployment:
- Verify business-critical flows
- Monitor for 15-30 minutes
- Update documentation
- Communicate success
- Document any issues
Environment Management Best Practices¶
- Use hierarchical configuration inheritance
- Keep sensitive data in secrets management
- Document environment-specific deviations
- Automate configuration validation
- Maintain consistent tooling versions
- Use similar scaling patterns
- Replicate production architecture
- Test with production-like data
- Test in staging before production
- Use automated integration tests
- Perform load testing regularly
- Validate disaster recovery procedures
Monitoring and Observability¶
Key Metrics to Track¶
Application Metrics:
# prometheus-rules.yaml
groups:
  - name: application-health
    rules:
      - record: app:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket[5m]))
      - record: app:http_requests_total:rate5m
        expr: rate(http_requests_total[5m])
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
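The `HighErrorRate` rule above divides the per-second rate of 5xx responses by the per-second rate of all responses and fires once that ratio has stayed above 5% for the whole `for: 5m` window. The arithmetic can be sketched in Python as follows. The function names are hypothetical, and the list of ratios stands in for the rule's successive evaluations; this is a simplified model, not how Prometheus is implemented.

```python
from typing import List


def error_ratio(errors_per_second: float, requests_per_second: float) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    # Prometheus would yield NaN for 0/0, which also never exceeds the threshold.
    if requests_per_second <= 0:
        return 0.0
    return errors_per_second / requests_per_second


def should_fire(ratios: List[float], threshold: float = 0.05) -> bool:
    """Model `for:` semantics: every evaluation in the window must exceed the threshold."""
    return bool(ratios) and all(r > threshold for r in ratios)
```

For example, 6 failed requests per second out of 100 gives a ratio of 0.06, which exceeds the threshold, but a single evaluation that dips to 0.04 inside the window resets the pending alert.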
Infrastructure Metrics:
# infrastructure-rules.yaml
groups:
  - name: infrastructure-health
    rules:
      - alert: HighCPUUsage
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
      - alert: HighMemoryUsage
        expr: |
          (container_memory_usage_bytes /
          container_spec_memory_limit_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage detected"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
Logging Strategy¶
Structured Logging Example:
{
"timestamp": "2024-10-18T10:30:45.123Z",
"level": "info",
"service": "api",
"environment": "production",
"version": "1.2.3",
"trace_id": "abc123def456",
"user_id": "user_789",
"endpoint": "/api/orders",
"method": "POST",
"status_code": 201,
"duration_ms": 45,
"message": "Order created successfully"
}
Log Aggregation:
- Use centralized logging (ELK, Loki, CloudWatch)
- Implement log retention policies
- Structure logs for easy querying
- Include correlation IDs for tracing
- Filter sensitive information
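As a minimal sketch of how the structured format and the filtering guideline above might be combined, the formatter below uses only the Python standard library to emit one JSON object per log line and to redact selected keys. The class name, the `context` attribute passed through `extra`, and the list of sensitive keys are assumptions chosen for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Assumed list of keys that must never reach the log aggregator in clear text.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str, environment: str):
        super().__init__()
        self.static_fields = {"service": service, "environment": environment}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            **self.static_fields,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged in, minus sensitive values.
        for key, value in getattr(record, "context", {}).items():
            entry[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(entry)


logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="api", environment="production"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info(
    "Order created successfully",
    extra={"context": {"trace_id": "abc123def456", "status_code": 201}},
)
```

Passing request-scoped values such as `trace_id` through a single `context` dictionary keeps them separate from the attributes the logging module reserves for itself.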
Disaster Recovery¶
Backup Strategy¶
What to Backup:
| Resource Type | Frequency | Retention | Method |
|---|---|---|---|
| Databases | Hourly | 30 days | Automated snapshots |
| Configuration | On change | 90 days | Version control |
| Secrets | Daily | 90 days | Encrypted backups |
| Application State | Daily | 7 days | Volume snapshots |
| Infrastructure Code | On commit | Indefinite | Git repository |
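The retention column in the table above can be turned into a simple expiry check that a cleanup job might use. This is a sketch under stated assumptions: the dictionary keys and the `is_expired` helper are hypothetical names, and only the retention periods themselves come from the table.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention periods in days, taken from the table; None means keep indefinitely.
RETENTION_DAYS = {
    "databases": 30,
    "configuration": 90,
    "secrets": 90,
    "application_state": 7,
    "infrastructure_code": None,
}


def is_expired(resource_type: str, created_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when a backup is older than the retention period for its type."""
    retention = RETENTION_DAYS[resource_type]
    if retention is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now - created_at > timedelta(days=retention)
```

A backup that is exactly at its retention limit is kept, so deletion only happens once the full period has elapsed.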
Recovery Procedures¶
Database Recovery Example:
#!/bin/bash
# restore-database.sh
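set -euo pipefail
if [ "$#" -ne 2 ]; then
echo "Usage: $0 <snapshot-id> <target-instance>"
exit 1
fi
SNAPSHOT_ID=$1
TARGET_INSTANCE=$2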
echo "Restoring database from snapshot: $SNAPSHOT_ID"
# Restore RDS instance from snapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier ${TARGET_INSTANCE} \
--db-snapshot-identifier ${SNAPSHOT_ID} \
--db-instance-class db.t3.medium \
--no-publicly-accessible
# Wait for instance to be available
echo "Waiting for instance to be available..."
aws rds wait db-instance-available \
--db-instance-identifier ${TARGET_INSTANCE}
# Update application configuration
echo "Updating application configuration..."
NEW_ENDPOINT=$(aws rds describe-db-instances \
--db-instance-identifier ${TARGET_INSTANCE} \
--query 'DBInstances[0].Endpoint.Address' \
--output text)
kubectl set env deployment/app \
DATABASE_HOST=${NEW_ENDPOINT}
echo "Database restored successfully"
echo "New endpoint: $NEW_ENDPOINT"
Disaster Recovery Testing¶
Regular DR Drills:
- Monthly: Test database restoration
- Quarterly: Full environment recovery
- Semi-annually (twice a year): Complete disaster scenario
- Annually: Cross-region failover test
DR Test Checklist:
- Identify recovery time objective (RTO)
- Identify recovery point objective (RPO)
- Document recovery procedures
- Test backup restoration
- Verify data integrity
- Validate application functionality
- Document lessons learned
- Update procedures based on findings
Troubleshooting Guide¶
Common Issues and Solutions¶
Symptom: Docker build fails
Common Causes:
- Network issues downloading dependencies
- Invalid Dockerfile syntax
- Insufficient disk space
- Build cache corruption
Solutions:
Symptom: Kubernetes deployment fails
Common Causes:
- Image pull errors
- Resource constraints
- Failed health checks
- Configuration errors
Solutions:
Symptom: Service connectivity problems
Common Causes:
- DNS resolution failures
- Network policy restrictions
- Service misconfiguration
- Ingress controller issues
Solutions:
# Test DNS resolution
kubectl run -it --rm debug \
--image=nicolaka/netshoot \
--restart=Never -- nslookup myapp
# Check service endpoints
kubectl get endpoints myapp -n production
# Verify network policies
kubectl get networkpolicies -n production
# Test connectivity
kubectl run -it --rm debug \
--image=nicolaka/netshoot \
--restart=Never -- curl http://myapp:80
Symptom: Slow application performance
Common Causes:
- Resource constraints
- Database connection pool exhaustion
- Memory leaks
- Inefficient queries
Solutions:
# Check resource usage
kubectl top pods -n production
kubectl top nodes
# Increase resources
kubectl set resources deployment/myapp \
--requests=cpu=500m,memory=1Gi \
--limits=cpu=1000m,memory=2Gi
# Check application metrics
kubectl port-forward svc/myapp 9090:9090
# Access metrics at http://localhost:9090/metrics
# Review logs for errors
kubectl logs -f deployment/myapp -n production
Quick Reference Commands¶
Docker Commands¶
# Build and tag
docker build -t myapp:1.0.0 .
docker tag myapp:1.0.0 registry.com/myapp:1.0.0
# Push to registry
docker push registry.com/myapp:1.0.0
# Clean up
docker system prune -af
docker volume prune -f
# Inspect
docker inspect <container-id>
docker logs -f <container-id>
# Execute commands in container
docker exec -it <container-id> /bin/bash
Kubernetes Commands¶
# Deployments
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp
kubectl rollout undo deployment/myapp
kubectl scale deployment/myapp --replicas=3
# Debugging
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl logs -f <pod-name>
kubectl exec -it <pod-name> -- /bin/sh
# Configuration
kubectl create configmap myapp-config --from-file=config.yaml
kubectl create secret generic myapp-secret --from-literal=password=secret
kubectl get configmap myapp-config -o yaml
# Services and Ingress
kubectl get services
kubectl get ingress
kubectl port-forward svc/myapp 8080:80
# Namespace management
kubectl get namespaces
kubectl create namespace staging
kubectl config set-context --current --namespace=staging
# Advanced Deployment Management
kubectl rollout history deployment/myapp
kubectl rollout history deployment/myapp --revision=2
kubectl rollout pause deployment/myapp
kubectl rollout resume deployment/myapp
kubectl rollout restart deployment/myapp
# Resource Updates
kubectl patch deployment myapp -p '{"spec":{"replicas":5}}'
kubectl set image deployment/myapp myapp=myapp:2.0.0
kubectl set env deployment/myapp DATABASE_URL=postgres://newdb:5432
kubectl set resources deployment/myapp --limits=cpu=500m,memory=1Gi
# Change tracking (--record is deprecated; set the change-cause annotation instead)
kubectl apply -f deployment.yaml
kubectl annotate deployment/myapp kubernetes.io/change-cause="Upgraded to version 2.0"
# StatefulSets (for stateful applications)
kubectl get statefulsets
kubectl scale statefulset/mydb --replicas=3
kubectl rollout status statefulset/mydb
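# Force-delete a stuck pod (last resort: this can break the StatefulSet's at-most-one-pod guarantee)
kubectl delete pod mydb-0 --force --grace-period=0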
# DaemonSets (for node-level services)
kubectl get daemonsets -A
kubectl rollout status daemonset/node-exporter -n monitoring
# Jobs and CronJobs
kubectl create job backup --image=backup:latest
kubectl get jobs
kubectl get cronjobs
kubectl create cronjob backup --image=backup:latest --schedule="0 2 * * *"
# Resource Quotas and Limits
kubectl get resourcequota -n production
kubectl describe resourcequota production-quota -n production
kubectl create quota production-quota --hard=cpu=10,memory=20Gi,pods=50
# HorizontalPodAutoscaler
kubectl autoscale deployment myapp --cpu-percent=70 --min=2 --max=10
kubectl get hpa
kubectl describe hpa myapp
# Custom Resource Definitions (CRDs)
kubectl get crd
kubectl get <crd-name>
kubectl describe crd <crd-name>
# Helm (Package Manager)
# The legacy "stable" repository is archived; this example uses the Bitnami repository instead
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo nginx
helm install myapp bitnami/nginx
helm list
helm upgrade myapp bitnami/nginx --set replicaCount=3
helm rollback myapp 1
helm uninstall myapp
# Kustomize (Configuration Management)
kubectl apply -k ./overlays/production
kubectl kustomize ./overlays/production
kubectl diff -k ./overlays/production
# Network Policies
kubectl get networkpolicies -n production
kubectl describe networkpolicy allow-frontend -n production
# Resource Management
kubectl top nodes
kubectl top pods -n production
kubectl describe node <node-name>
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets
kubectl uncordon <node-name>
# PersistentVolumes and Claims
kubectl get pv
kubectl get pvc -n production
kubectl describe pv <pv-name>
kubectl delete pvc <pvc-name>
# Service Accounts and RBAC
kubectl get serviceaccounts -n production
kubectl create serviceaccount myapp-sa
kubectl get roles -n production
kubectl get rolebindings -n production
kubectl create role pod-reader --verb=get --verb=list --resource=pods
kubectl create rolebinding read-pods --role=pod-reader --serviceaccount=default:myapp-sa
# Cluster Information
kubectl cluster-info
kubectl get nodes -o wide
kubectl api-resources
kubectl api-versions
kubectl version
# Troubleshooting with Events
kubectl get events -n production --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning
# Resource Validation
kubectl apply -f deployment.yaml --dry-run=client
kubectl apply -f deployment.yaml --dry-run=server
kubectl diff -f deployment.yaml
# Label and Selector Management
kubectl label pods myapp-pod env=production
kubectl label pods myapp-pod env-
kubectl get pods -l env=production
kubectl get pods --selector="app=myapp,tier=frontend"
# Annotations
kubectl annotate deployment myapp description="Production API service"
kubectl annotate deployment myapp description-
# Context and Namespace Management
kubectl config get-contexts
kubectl config use-context production
kubectl config set-context --current --namespace=production
kubectl config view
# Certificate Management (requires cert-manager to be installed)
kubectl get certificates -n production
kubectl describe certificate myapp-tls -n production
kubectl get certificaterequests -n production
# Advanced Debugging
kubectl debug node/<node-name> -it --image=ubuntu
kubectl cp <pod-name>:/path/to/file ./local-file
kubectl cp ./local-file <pod-name>:/path/to/file
kubectl attach <pod-name> -c <container-name>
# Resource Export
kubectl get deployment myapp -o yaml > myapp-deployment.yaml
kubectl get all -n production -o yaml > production-backup.yaml
kubectl get secret mysecret -o jsonpath='{.data.password}' | base64 -d
# Kubectl Plugins (krew)
kubectl krew install ctx
kubectl krew install ns
kubectl ctx
kubectl ns
Terraform Commands¶
# Initialize
terraform init
terraform init -upgrade
# Plan and apply
terraform plan
terraform plan -out=plan.tfplan
terraform apply
terraform apply -auto-approve
terraform apply plan.tfplan
# Destroy
terraform destroy
terraform destroy -target=aws_instance.example
# State management
terraform state list
terraform state show <resource>
terraform import <resource> <id>
terraform state rm <resource>
# Workspace management
terraform workspace list
terraform workspace new staging
terraform workspace select staging
# Formatting and validation
terraform fmt -recursive
terraform validate
terraform graph | dot -Tsvg > graph.svg
Ansible Commands¶
# Run playbook
ansible-playbook -i inventory playbook.yml
ansible-playbook -i inventory playbook.yml --check
ansible-playbook -i inventory playbook.yml --tags "deploy"
# Ad-hoc commands
ansible all -i inventory -m ping
ansible webservers -i inventory -a "uptime"
ansible dbservers -i inventory -m service -a "name=postgresql state=restarted"
# Inventory
ansible-inventory -i inventory --list
ansible-inventory -i inventory --graph
# Vault
ansible-vault create secrets.yml
ansible-vault edit secrets.yml
ansible-vault encrypt secrets.yml
ansible-vault decrypt secrets.yml
AWS CLI Commands¶
# EKS
aws eks list-clusters
aws eks update-kubeconfig --name cluster-name
aws eks describe-cluster --name cluster-name
# RDS
aws rds describe-db-instances
aws rds create-db-snapshot --db-instance-identifier mydb --db-snapshot-identifier mydb-snapshot
aws rds restore-db-instance-from-db-snapshot --db-instance-identifier new-db --db-snapshot-identifier mydb-snapshot
# ECR
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-west-2.amazonaws.com
aws ecr describe-repositories
aws ecr list-images --repository-name myapp
# Secrets Manager
aws secretsmanager list-secrets
aws secretsmanager get-secret-value --secret-id myapp/database
aws secretsmanager put-secret-value --secret-id myapp/database --secret-string '{"password":"newpass"}'
OpenStack Commands¶
# Authentication
openstack token issue
openstack catalog list
# Compute (Nova)
openstack server list
openstack server show <server-name>
openstack server create --flavor m1.medium --image ubuntu-20.04 --network private myserver
openstack server delete <server-name>
openstack server reboot <server-name>
openstack server resize --flavor m1.large <server-name>
# Flavor management
openstack flavor list
openstack flavor show m1.medium
# Images (Glance)
openstack image list
openstack image show <image-name>
openstack image create --file ubuntu.qcow2 --disk-format qcow2 ubuntu-custom
openstack image delete <image-name>
# Networking (Neutron)
openstack network list
openstack network create private-net
openstack network show private-net
openstack subnet create --network private-net --subnet-range 192.168.1.0/24 private-subnet
# Router management
openstack router list
openstack router create myrouter
openstack router add subnet myrouter private-subnet
openstack router set --external-gateway public myrouter
# Floating IPs
openstack floating ip list
openstack floating ip create public
openstack server add floating ip <server-name> <floating-ip>
openstack server remove floating ip <server-name> <floating-ip>
# Security Groups
openstack security group list
openstack security group create web-sg
openstack security group rule create --protocol tcp --dst-port 80 web-sg
openstack security group rule create --protocol tcp --dst-port 443 web-sg
openstack security group rule list web-sg
# Volumes (Cinder)
openstack volume list
openstack volume create --size 100 myvolume
openstack volume show myvolume
openstack server add volume <server-name> myvolume
openstack server remove volume <server-name> myvolume
openstack volume delete myvolume
# Snapshots
openstack volume snapshot list
openstack volume snapshot create --volume myvolume myvolume-snapshot
openstack volume create --snapshot myvolume-snapshot restored-volume
# Orchestration (Heat)
openstack stack list
openstack stack create -t template.yaml mystack
openstack stack show mystack
openstack stack update -t template.yaml mystack
openstack stack delete mystack
openstack stack resource list mystack
# Object Storage (Swift)
openstack container list
openstack container create mycontainer
openstack object list mycontainer
openstack object create mycontainer file.txt
openstack object save mycontainer file.txt
# Quotas
openstack quota show
openstack quota set --instances 20 --cores 40 --ram 81920 <project-id>
# Projects and Users
openstack project list
openstack project create myproject
openstack user list
openstack user create --password secret --project myproject myuser
openstack role add --user myuser --project myproject member
# Resource usage
openstack usage list
openstack limits show --absolute
OpenStack Heat Templates (IaC)¶
# Validate template
openstack orchestration template validate -t template.yaml
# Preview changes
openstack stack create --dry-run -t template.yaml mystack
# Show stack events
openstack stack event list mystack
openstack stack event show mystack <event-id>
# Stack outputs
openstack stack output list mystack
openstack stack output show mystack server_ip
# Suspend and resume
openstack stack suspend mystack
openstack stack resume mystack
# Abandon (remove from control without deleting)
openstack stack abandon mystack
OpenStack-Kubernetes Integration¶
# If using Magnum (Kubernetes on OpenStack)
openstack coe cluster list
openstack coe cluster create k8s-cluster \
--cluster-template kubernetes-template \
--master-count 3 \
--node-count 5
openstack coe cluster show k8s-cluster
openstack coe cluster config k8s-cluster
openstack coe cluster resize k8s-cluster 10
openstack coe cluster upgrade k8s-cluster new-template
openstack coe cluster delete k8s-cluster
# Get kubeconfig
openstack coe cluster config k8s-cluster --dir ~/.kube
Additional Resources¶
Tools and Utilities¶
Container Security:
- Trivy - Vulnerability scanner
- Clair - Static vulnerability analysis for container images
- Anchore - Container security platform
- Cosign - Container signing
IaC Testing:
- Terratest - Go library for testing infrastructure
- Kitchen-Terraform - Test Kitchen plugin
- InSpec - Compliance testing framework
- Checkov - Static code analysis tool
Monitoring & Observability:
- Prometheus - Monitoring system
- Grafana - Visualization platform
- Datadog - Monitoring service
- New Relic - Observability platform
Logging:
- ELK Stack - Elasticsearch, Logstash, Kibana
- Loki - Log aggregation system
- Fluentd - Data collector
- CloudWatch - AWS monitoring
Development Tools:
- k9s - Terminal UI for Kubernetes
- Lens - Kubernetes IDE
- Docker Desktop - Local development
- Minikube - Local Kubernetes
Glossary¶
| Term | Definition |
|---|---|
| Blue-Green Deployment | Deployment strategy using two identical environments |
| Canary Deployment | Gradual rollout to subset of users |
| ConfigMap | Kubernetes object for non-sensitive configuration data |
| Container Registry | Storage and distribution system for container images |
| Idempotency | Property where operation produces same result regardless of repetition |
| Infrastructure as Code | Managing infrastructure through code rather than manual processes |
| Multi-stage Build | Docker build technique using multiple FROM statements |
| Rolling Deployment | Gradual replacement of application instances |
| Secret | Kubernetes object for sensitive information |
| Service Mesh | Infrastructure layer for service-to-service communication |
Common Acronyms¶
| Acronym | Full Form |
|---|---|
| CD | Continuous Delivery/Deployment |
| CI | Continuous Integration |
| CIDR | Classless Inter-Domain Routing |
| CRD | Custom Resource Definition |
| DR | Disaster Recovery |
| IAM | Identity and Access Management |
| IaC | Infrastructure as Code |
| RBAC | Role-Based Access Control |
| RPO | Recovery Point Objective |
| RTO | Recovery Time Objective |
| SLA | Service Level Agreement |
| TLS | Transport Layer Security |
Deployment Checklist Template¶
Pre-Deployment¶
Code Quality:
- Code reviewed and approved
- All tests passing
- Code coverage meets requirements
- Static analysis passed
Security:
- Security scan completed
- No critical vulnerabilities
- Secrets properly managed
- Access controls verified
Infrastructure:
- Infrastructure changes reviewed
- Resource capacity verified
- Scaling rules configured
- Monitoring alerts configured
Database:
- Migration scripts tested
- Rollback plan documented
- Backup verified
- Performance impact assessed
Documentation:
- Release notes prepared
- Runbook updated
- Configuration documented
- Team notified
During Deployment¶
Monitoring:
- Error rates monitored
- Response times tracked
- Resource usage checked
- Logs reviewed
Verification:
- Health checks passing
- Smoke tests executed
- Critical paths verified
- Database connectivity confirmed
Communication:
- Status updates provided
- Stakeholders informed
- Issue tracker updated
- Team available
Post-Deployment¶
Validation:
- All services healthy
- Business flows working
- Performance acceptable
- No unexpected errors
Documentation:
- Deployment documented
- Issues logged
- Metrics recorded
- Lessons learned captured
Cleanup:
- Old resources removed
- Rollback verified
- Documentation updated
- Team debriefed
Incident Response Template¶
Severity Levels¶
| Level | Description | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Complete service outage | Immediate | All hands |
| P2 - High | Major feature unavailable | 15 minutes | On-call team |
| P3 - Medium | Minor feature degraded | 1 hour | Assigned team |
| P4 - Low | Cosmetic issue | Next business day | Queue |
Incident Response Steps¶
1. Acknowledge
- Incident acknowledged
- Severity assigned
- Team notified
- Status page updated
2. Assess
- Impact determined
- Root cause identified
- Affected systems listed
- Timeline established
3. Respond
- Mitigation started
- Workaround implemented
- Rollback initiated (if needed)
- Communication ongoing
4. Recover
- Service restored
- Functionality verified
- Monitoring confirmed
- Status page updated
5. Review
- Postmortem scheduled
- Timeline documented
- Action items created
- Process improved
Change Management Template¶
Change Request¶
Change Details:
- Change ID: [AUTO-GENERATED]
- Requested By: [NAME]
- Date: [DATE]
- Environment: [ENVIRONMENT]
Description:
[Detailed description of the change]
Justification:
[Business reason for the change]
Impact Assessment:
- Systems Affected: [LIST]
- Users Impacted: [NUMBER/PERCENTAGE]
- Risk Level: [LOW/MEDIUM/HIGH]
Implementation Plan:
- Start Time: [DATETIME]
- Duration: [ESTIMATE]
- Steps: [NUMBERED LIST]
Rollback Plan:
- Trigger Conditions: [CONDITIONS]
- Steps: [NUMBERED LIST]
- Recovery Time: [ESTIMATE]
Testing:
- Unit tests passed
- Integration tests passed
- UAT completed
- Performance validated
Approvals:
- Technical Lead
- Operations Manager
- Product Owner
- Security Team
Last updated: October 2025