DevOps Essentials: CI/CD Pipelines and Automation

Understanding DevOps Culture

DevOps represents a cultural shift that breaks down silos between development and operations teams. It emphasizes collaboration, automation, and continuous improvement to deliver software faster and more reliably. DevOps isn’t just about tools—it’s about people, processes, and technology working in harmony.

The DevOps Lifecycle

1. Plan

Define features, track work, and manage backlogs using tools like Jira, Azure Boards, or GitHub Projects.

2. Code

Write code collaboratively using version control systems, primarily Git.

3. Build

Compile code and create artifacts automatically when changes are committed.

4. Test

Run automated tests to validate functionality, performance, and security.

5. Release

Package and prepare code for deployment with proper versioning.

6. Deploy

Automatically deploy to various environments (dev, staging, production).

7. Operate

Monitor applications and infrastructure in production.

8. Monitor

Collect metrics, logs, and feedback to identify issues and improvements.

Continuous Integration (CI)

CI is the practice of automatically building and testing code changes as they’re committed to version control.

CI Best Practices

  • Commit code frequently (multiple times per day)
  • Maintain a single source repository
  • Automate the build process
  • Keep builds fast (under 10 minutes is ideal)
  • Test in a clone of the production environment
  • Make build results visible to everyone
  • Fix broken builds immediately

GitHub Actions CI Example

# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    
    services:
      postgres:
        image: postgres:14
        env:
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: test
        # Publish the port so steps on the runner can reach localhost:5432
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up Node.js
      uses: actions/setup-node@v3
      with:
        node-version: '18'
        cache: 'npm'
    
    - name: Install dependencies
      run: npm ci
    
    - name: Run linter
      run: npm run lint
    
    - name: Run tests
      run: npm test
      env:
        DATABASE_URL: postgresql://postgres:postgres@localhost:5432/test
    
    - name: Upload coverage
      uses: codecov/codecov-action@v3
      with:
        file: ./coverage/coverage-final.json
  
  build:
    needs: test
    runs-on: ubuntu-latest
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Build Docker image
      run: docker build -t myapp:${{ github.sha }} .
    
    - name: Run security scan
      run: |
        docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
          aquasec/trivy image myapp:${{ github.sha }}
    
    - name: Push to registry
      if: github.ref == 'refs/heads/main'
      run: |
        echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
        docker tag myapp:${{ github.sha }} myuser/myapp:latest
        docker push myuser/myapp:latest

Continuous Deployment (CD)

CD extends CI by automatically deploying every change that passes the automated test suite to production. (Continuous delivery, a closely related practice, keeps every build deployable but gates the final release behind a manual approval.)

Deployment Strategies

Blue-Green Deployment

Maintain two identical production environments. Deploy to the inactive one, then switch traffic.

Blue Environment (Current: v1.0)  → 100% traffic
Green Environment (New: v2.0)     → 0% traffic

After testing Green:
Blue Environment (Old: v1.0)      → 0% traffic
Green Environment (New: v2.0)     → 100% traffic
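In Kubernetes, the cutover is often just a label-selector change on the Service (for example, `kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"v2.0"}}}'`). The decision logic can be sketched as a toy router; the environment URLs, versions, and health check here are illustrative placeholders, not a real setup:

```javascript
// Toy blue-green router: 100% of traffic targets one environment at a time.
// The URLs, versions, and health check below are illustrative placeholders.
const environments = {
  blue:  { url: 'http://blue.internal:3000',  version: 'v1.0' },
  green: { url: 'http://green.internal:3000', version: 'v2.0' },
};

let active = 'blue';

function currentTarget() {
  return environments[active];
}

// Flip traffic to the idle environment, but only if it passes a health check.
function switchTraffic(healthy) {
  const idle = active === 'blue' ? 'green' : 'blue';
  if (!healthy(environments[idle])) {
    throw new Error(`refusing to switch: ${idle} failed its health check`);
  }
  active = idle; // instant cutover; the old environment stays warm for rollback
  return currentTarget();
}
```

The key property is that rollback is the same operation in reverse: the old environment is still running, so flipping back is instant.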

Canary Deployment

Gradually roll out changes to a small subset of users before full deployment.

Stage 1: v2.0 → 5% of users, v1.0 → 95%
Stage 2: v2.0 → 25% of users, v1.0 → 75%
Stage 3: v2.0 → 50% of users, v1.0 → 50%
Stage 4: v2.0 → 100% of users

Rolling Deployment

Gradually replace instances of the old version with the new version.

Kubernetes Deployment Example

# deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: v2.0
    spec:
      containers:
      - name: myapp
        image: myuser/myapp:v2.0  # pin an immutable tag; :latest will not trigger rollouts
        ports:
        - containerPort: 3000
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 3000
  type: LoadBalancer

Infrastructure as Code (IaC)

Terraform Example

# main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

# VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = {
    Name = "${var.project_name}-vpc"
  }
}

# Subnet
resource "aws_subnet" "public" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "${var.aws_region}a"
  
  map_public_ip_on_launch = true
  
  tags = {
    Name = "${var.project_name}-public-subnet"
  }
}

# Security Group
resource "aws_security_group" "web" {
  name        = "${var.project_name}-web-sg"
  description = "Security group for web servers"
  vpc_id      = aws_vpc.main.id
  
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Look up the latest Ubuntu 22.04 LTS AMI published by Canonical
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account ID

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

# EC2 Instance
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public.id
  
  vpc_security_group_ids = [aws_security_group.web.id]
  
  user_data = file("${path.module}/user_data.sh")
  
  tags = {
    Name = "${var.project_name}-web-server"
  }
}

# Outputs
output "instance_public_ip" {
  value = aws_instance.web.public_ip
}

Configuration Management

Ansible Playbook Example

# playbook.yml
---
- name: Configure web servers
  hosts: webservers
  become: yes
  
  vars:
    app_user: webapp
    app_dir: /opt/myapp
    
  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600
    
    - name: Install required packages
      apt:
        name:
          - nginx
          - nodejs
          - npm
          - git
        state: present
    
    - name: Create application user
      user:
        name: "{{ app_user }}"
        shell: /bin/bash
        create_home: yes
    
    - name: Clone application repository
      git:
        repo: https://github.com/user/myapp.git
        dest: "{{ app_dir }}"
        version: main
      become_user: "{{ app_user }}"
    
    - name: Install Node.js dependencies
      npm:
        path: "{{ app_dir }}"
        state: present
      become_user: "{{ app_user }}"
    
    - name: Configure Nginx
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/sites-available/myapp
      notify: Reload Nginx
    
    - name: Enable Nginx site
      file:
        src: /etc/nginx/sites-available/myapp
        dest: /etc/nginx/sites-enabled/myapp
        state: link
      notify: Reload Nginx
    
    - name: Install PM2 globally
      npm:
        name: pm2
        global: yes
        state: present
    
    - name: Start application with PM2
      # Note: command tasks are not idempotent; pm2 errors if "myapp" is already running
      command: pm2 start app.js --name myapp
      args:
        chdir: "{{ app_dir }}"
      become_user: "{{ app_user }}"
  
  handlers:
    - name: Reload Nginx
      service:
        name: nginx
        state: reloaded

Monitoring and Observability

The Three Pillars

1. Metrics

Quantitative measurements over time (CPU usage, response times, error rates)

2. Logs

Detailed records of events and state changes

3. Traces

End-to-end request flows through distributed systems

Prometheus and Grafana Setup

# docker-compose.yml for monitoring stack
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    ports:
      - "9090:9090"
  
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # local default only; change for production
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
  
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

volumes:
  prometheus_data:
  grafana_data:

Application Metrics in Node.js

const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create metrics
const collectDefaultMetrics = promClient.collectDefaultMetrics;
collectDefaultMetrics();

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode
    };
    
    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });
  
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

// Your application routes
app.get('/api/users', (req, res) => {
  // ... your logic
});

app.listen(3000);

Logging Best Practices

Structured Logging

const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'myapp',
    environment: process.env.NODE_ENV
  },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    })
  ]
});

// Usage
logger.info('User logged in', {
  userId: 123,
  email: 'user@example.com',
  ip: req.ip
});

logger.error('Database connection failed', {
  error: err.message,
  stack: err.stack,
  database: 'primary'
});

Security in DevOps (DevSecOps)

Security Scanning in CI/CD

# .github/workflows/security.yml
name: Security Scan

on:
  push:
    branches: [ main ]
  schedule:
    - cron: '0 0 * * 0'  # Weekly

jobs:
  security:
    runs-on: ubuntu-latest
    
    steps:
    - uses: actions/checkout@v3
    
    # Dependency vulnerability scanning
    - name: Run npm audit
      run: npm audit --audit-level=high
    
    # SAST (Static Application Security Testing)
    - name: Run Snyk to check for vulnerabilities
      uses: snyk/actions/node@master
      env:
        SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
    
    # Secret scanning
    - name: Gitleaks scan
      uses: gitleaks/gitleaks-action@v2
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    
    # Container scanning
    - name: Build Docker image
      run: docker build -t myapp:latest .
    
    - name: Scan Docker image
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: myapp:latest
        format: 'sarif'
        output: 'trivy-results.sarif'
    
    - name: Upload Trivy results
      uses: github/codeql-action/upload-sarif@v2
      with:
        sarif_file: 'trivy-results.sarif'

Incident Management

On-Call Best Practices

  • Define clear escalation procedures
  • Create runbooks for common issues
  • Implement proper alerting (avoid alert fatigue)
  • Conduct blameless post-mortems
  • Document incidents and resolutions
  • Automate remediation where possible
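Dedicated tools (Alertmanager, PagerDuty) handle grouping and silencing, but the core idea behind avoiding alert fatigue is simple: suppress repeats of the same alert inside a cooldown window. A toy sketch (the window length is arbitrary):

```javascript
// Naive alert deduplication: page once per alert key per cooldown window.
// The 15-minute window is an arbitrary illustrative choice.
const COOLDOWN_MS = 15 * 60 * 1000;
const lastFired = new Map();

function shouldPage(alertKey, now = Date.now()) {
  const previous = lastFired.get(alertKey);
  if (previous !== undefined && now - previous < COOLDOWN_MS) {
    return false; // still in cooldown; drop the duplicate
  }
  lastFired.set(alertKey, now);
  return true;
}
```

Real alerting systems add grouping (one page for many related alerts) and inhibition (suppress symptom alerts when a cause alert is firing) on top of this.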

Post-Mortem Template

# Incident Post-Mortem

## Incident Summary
- Date: 2025-11-21
- Duration: 45 minutes
- Severity: High
- Impact: 30% of users experienced slow response times

## Timeline
- 14:00 - Monitoring alerts for increased response times
- 14:05 - On-call engineer paged
- 14:10 - Identified database connection pool exhaustion
- 14:20 - Increased pool size
- 14:30 - Deployed fix
- 14:45 - Service fully restored

## Root Cause
Database connection pool sized for average load, not peak traffic

## Resolution
- Increased connection pool size from 10 to 50
- Added connection pool monitoring
- Implemented connection timeout handling

## Action Items
1. [ ] Set up alerts for connection pool utilization (Owner: Alice, Due: 2025-11-28)
2. [ ] Review and optimize slow database queries (Owner: Bob, Due: 2025-12-05)
3. [ ] Implement load testing in CI/CD (Owner: Charlie, Due: 2025-12-12)
4. [ ] Document database scaling procedures (Owner: Diana, Due: 2025-11-25)

## Lessons Learned
- Need better capacity planning for peak traffic
- Alert thresholds should be reviewed quarterly
- Runbook for database issues needs updating
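The resolution in the template above maps to a handful of connection-pool settings. A sketch of what that configuration might look like with node-postgres (`pg`); the values mirror the post-mortem and are illustrative. `poolUtilization` relies on the pool's documented `totalCount` and `idleCount` counters:

```javascript
// Connection pool settings reflecting the resolution above. Pass this object
// to `new Pool(...)` from node-postgres. The numbers are illustrative.
const poolConfig = {
  max: 50,                       // raised from 10: size for peak, not average load
  connectionTimeoutMillis: 2000, // fail fast instead of queueing requests forever
  idleTimeoutMillis: 30000,      // release connections that sit idle
};

// Monitoring hook: fraction of the pool in use, suitable for a Prometheus gauge.
function poolUtilization(pool) {
  if (pool.totalCount === 0) return 0;
  return (pool.totalCount - pool.idleCount) / pool.totalCount;
}
```

Exporting this as a gauge and alerting near 1.0 is exactly the "connection pool monitoring" action item from the post-mortem.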

DevOps Tools Ecosystem

Version Control

  • Git, GitHub, GitLab, Bitbucket

CI/CD

  • Jenkins, GitHub Actions, GitLab CI, CircleCI, Travis CI

Containerization

  • Docker, Podman, containerd

Orchestration

  • Kubernetes, Docker Swarm, Nomad

IaC

  • Terraform, Pulumi, CloudFormation, ARM Templates

Configuration Management

  • Ansible, Chef, Puppet, SaltStack

Monitoring

  • Prometheus, Grafana, Datadog, New Relic, Splunk

Logging

  • ELK Stack, Loki, Fluentd, Splunk

Conclusion

DevOps is a journey of continuous improvement, not a destination. Start with small automations, gradually build your CI/CD pipeline, and always prioritize reliability and security. Focus on culture and collaboration as much as tools and technology. The goal is to enable teams to deliver value to customers faster, more reliably, and with higher quality.

Begin by automating your build and test processes, then progressively add deployment automation, monitoring, and infrastructure as code. As your DevOps maturity grows, you’ll see increased deployment frequency, faster time to market, lower failure rates, and quicker recovery times—the key metrics of successful DevOps organizations.
