
Deployment and DevOps lessons nobody warned me about

Wrong env vars, no rollback plan, Friday deploys, silent crashes. After years of production incidents, here are the 10 DevOps lessons I learned the hard way.

March 4, 2026 · 12 min read
DevOps · Deployment · CI/CD · Monitoring · Infrastructure
The deployment pipeline - From "it works on my machine" to production reality

My first production deployment broke the app for 3 hours.

I had tested everything locally. The code worked. The tests passed. I clicked deploy, went to grab coffee, and came back to a Slack channel on fire.

The database connection string was wrong. One environment variable. Three hours of downtime.

That was the first of many lessons. Here's what I wish someone had told me.


Lesson 1: environment variables will betray you

Environment variables are the #1 cause of "it works on my machine."

I've seen:

  • Typos in variable names (DATABSE_URL instead of DATABASE_URL)
  • Missing variables in production that existed in dev
  • Secrets accidentally committed to git
  • Different values between staging and production

The fix? Validate on startup.

// lib/env.ts
import { z } from 'zod';

const envSchema = z.object({
  DATABASE_URL: z.string().url(),
  STRIPE_SECRET_KEY: z.string().startsWith('sk_'),
  NEXTAUTH_SECRET: z.string().min(32),
  NEXTAUTH_URL: z.string().url(),
});

// This runs when your app starts
// If any variable is missing or invalid, the app crashes immediately
// Better to crash on startup than in production at 3 AM
export const env = envSchema.parse(process.env);

Fail fast. If a variable is wrong, crash immediately. Don't wait until a user hits the broken feature.


Lesson 2: staging must match production

"It works in staging" means nothing if staging doesn't match production.

I've debugged issues caused by:

  • Different Node.js versions
  • Different database versions (PostgreSQL 14 vs 15)
  • Different OS (Ubuntu vs Alpine)
  • Different memory limits
  • Different environment variables

The rule: staging should be a smaller clone of production, not a different setup.

# docker-compose.prod.yml
services:
  app:
    image: node:20-alpine  # Same as production
    environment:
      - NODE_ENV=production  # Same as production
    deploy:
      resources:
        limits:
          memory: 512M  # Same limits as production

If you can't afford identical infrastructure, at least match:

  • Runtime versions (Node, Python, etc.)
  • Database versions
  • OS base image
  • Core environment variables
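A cheap guardrail along these lines is a startup assertion that the runtime major version matches what production runs. This is a sketch of the idea; `EXPECTED_NODE_MAJOR` is a hypothetical constant you would set per project:

```typescript
// Sketch: fail fast if the Node.js major version doesn't match production.
// EXPECTED_NODE_MAJOR is a made-up name — set it to whatever production runs.
const EXPECTED_NODE_MAJOR = 20;

function parseMajor(version: string): number {
  // process.version looks like "v20.11.1" — strip the "v", take the first number
  return parseInt(version.replace(/^v/, '').split('.')[0], 10);
}

function assertRuntimeMatches(
  expectedMajor: number,
  actual: string = process.version
): void {
  const major = parseMajor(actual);
  if (major !== expectedMajor) {
    throw new Error(
      `Node ${actual} detected, but production runs Node ${expectedMajor}.x`
    );
  }
}

// Call this at startup, right after env validation:
// assertRuntimeMatches(EXPECTED_NODE_MAJOR);
```

It won't catch a different base image or memory limit, but it kills the most common "works in staging" surprise at boot instead of at runtime.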

Lesson 3: never deploy on Friday

I learned this the hard way.

Friday 5 PM deployment. Bug appears Saturday morning. Nobody's around. Customers are angry. I spend my weekend fixing it instead of relaxing.

Now my rules:

  • No deploys after Thursday 4 PM
  • No deploys before a holiday
  • No "quick fixes" on Friday

If it's urgent enough to deploy on Friday, it's urgent enough to have the team on standby. If the team can't be on standby, it can wait until Monday.
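If you want the rule enforced rather than remembered, you can encode it as a guard at the top of the deploy job. A rough sketch (the cutoff times mirror my rules above; adjust for your own):

```typescript
// Sketch: "no deploys after Thursday 4 PM" as code.
// In CI, run this first and exit non-zero to block the deploy.
function isDeployAllowed(now: Date): boolean {
  const day = now.getDay();   // 0 = Sunday ... 6 = Saturday
  const hour = now.getHours();

  if (day === 5 || day === 6 || day === 0) return false; // Fri/Sat/Sun: no
  if (day === 4 && hour >= 16) return false;             // Thursday after 4 PM: no
  return true;
}

// In the deploy workflow:
// if (!isDeployAllowed(new Date())) {
//   console.error('Deploy window closed. Wait until Monday or get the team on standby.');
//   process.exit(1);
// }
```

An emergency override (a manual workflow input, say) keeps the guard from blocking a genuine incident fix.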


Lesson 4: every deploy needs a rollback plan

"We'll fix it forward" is not a plan.

Before every deploy, I ask:

  1. How do I know if this deployment failed?
  2. How do I roll back to the previous version?
  3. How long will the rollback take?

My method: GitHub Releases + Actions.

Each release is a tag on GitHub. No command line needed, you can do everything from the interface:

  1. Go to Releases → Create new release
  2. Click Choose a tag → create a new tag (e.g., v1.2.3)
  3. Add a title and release notes (optional but useful)
  4. Click Publish release

The workflow triggers automatically:

# .github/workflows/deploy.yml
name: Deploy on Release

on:
  release:
    types: [published]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies
        run: npm ci

      - name: Build
        run: npm run build

      - name: Deploy
        run: npm run deploy
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}

Something broke? Rollback in 3 clicks:

  1. Go to the Actions tab
  2. Find the deployment of the previous working version (e.g., v1.2.2)
  3. Click Re-run all jobs

That's it. No git commands to remember. No stress at 3 AM.

For database migrations, it's trickier. My rule: migrations should be reversible or additive.

-- Bad: can't rollback
ALTER TABLE users DROP COLUMN old_field;

-- Safe: add first, remove later
ALTER TABLE users ADD COLUMN new_field VARCHAR(255);
-- Deploy new code
-- Verify everything works
-- Later, in another release: DROP COLUMN old_field

Lesson 5: logs are your best friend (when they exist)

The first time a production bug happened, I had no idea what went wrong. No logs. No traces. Just a blank error page.

Now I log everything that matters:

// Before: no context
console.log('Error');

// After: actionable information
logger.error('Payment failed', {
  userId: user.id,
  amount: payment.amount,
  errorCode: error.code,
  errorMessage: error.message,
  timestamp: new Date().toISOString(),
});

What to log:

  • Every external API call (request + response + duration)
  • Every database query that fails
  • Every authentication attempt
  • Every payment transaction
  • User actions that modify data

What NOT to log:

  • Passwords
  • Credit card numbers
  • Personal data (GDPR)
  • Full request bodies with sensitive info
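One way to enforce the "what NOT to log" list is to scrub payloads before they ever reach the logger. A minimal sketch — the key list here is illustrative, extend it to match your own data model:

```typescript
// Sketch: strip sensitive fields before anything reaches the logger.
// The key list is an example — add whatever is sensitive in your domain.
const SENSITIVE_KEYS = new Set([
  'password',
  'cardNumber',
  'cvv',
  'ssn',
  'authorization',
]);

function redact(data: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(data)) {
    clean[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return clean;
}

// logger.error('Login failed', redact({ email, password }));
```

Wrapping your logger so every call goes through `redact` means nobody has to remember the rule at 2 AM.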

Lesson 6: health checks prevent disasters

My app once crashed silently. The process was running, but it wasn't responding to requests. The load balancer kept sending traffic to a dead instance.

Health checks fix this:

// pages/api/health.ts (Next.js)
import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  try {
    // Check database connection
    await db.query('SELECT 1');

    // Check external services if critical
    // await redis.ping();

    res.status(200).json({ status: 'healthy' });
  } catch (error) {
    res.status(500).json({ status: 'unhealthy', error: (error as Error).message });
  }
}

Your load balancer/orchestrator hits this endpoint every 30 seconds. If it fails, traffic gets routed elsewhere.


Lesson 7: secrets belong in a secrets manager

I've committed secrets to git. More than once.

Even if you delete them, they're in the git history. Bots scan GitHub for exposed credentials. They will find yours.

The rules:

  1. Add .env to .gitignore on day one
  2. Use .env.example with placeholder values
  3. Use a secrets manager for production (Vercel env vars, AWS Secrets Manager, Doppler)
  4. Rotate credentials immediately if exposed

# .gitignore
.env
.env.local
.env.production

# .env.example (commit this)
DATABASE_URL=postgresql://user:password@localhost:5432/mydb
STRIPE_SECRET_KEY=sk_test_xxx

If you accidentally commit a secret:

  1. Rotate the credential immediately
  2. Remove it from git history with git filter-repo or BFG (the git docs recommend these over the older git filter-branch)
  3. Force push (coordinate with your team)
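Better yet, catch the secret before it's committed. This is a crude sketch of a pre-commit scan; the patterns shown (Stripe live keys, AWS access key IDs, private key headers) are illustrative, and a real scanner like gitleaks covers far more:

```typescript
// Sketch: a crude check for strings that look like secrets.
// Patterns are examples only — dedicated scanners are much more thorough.
const SECRET_PATTERNS: RegExp[] = [
  /sk_live_[0-9a-zA-Z]+/,                     // Stripe live secret key
  /AKIA[0-9A-Z]{16}/,                         // AWS access key ID
  /-----BEGIN (RSA |EC )?PRIVATE KEY-----/,   // PEM private key
];

function looksLikeSecret(line: string): boolean {
  return SECRET_PATTERNS.some((pattern) => pattern.test(line));
}

// In a pre-commit hook: run this over each staged line,
// and abort the commit on any match.
```

Wire it into a git hook (Husky, lefthook, or a plain `.git/hooks/pre-commit` script) so it runs on every commit.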

Lesson 8: monitoring is not optional

"The app is slow" is not actionable. "Average response time jumped from 200ms to 2s at 14:32" is.

Minimum monitoring:

  • Uptime: Is the app responding?
  • Response time: How fast?
  • Error rate: How many 500s?
  • Database: Connection pool, query time
  • Memory/CPU: Are we running out of resources?

Free/cheap options that work:

  • Vercel Analytics (if on Vercel)
  • Sentry (errors + performance)
  • Better Stack / Uptime Robot (uptime)
  • PlanetScale insights (database)

Set up alerts. If error rate spikes, you should know before your users tell you.
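The core of an error-rate alert is tiny. A sketch of the idea — the 5% threshold and the notion of a "recent window" are made-up examples you'd tune for your own traffic:

```typescript
// Sketch: the "alert on error-rate spike" idea in its simplest form.
// Threshold and window size are illustrative — tune for your traffic.
function errorRate(statusCodes: number[]): number {
  if (statusCodes.length === 0) return 0;
  const errors = statusCodes.filter((code) => code >= 500).length;
  return errors / statusCodes.length;
}

function shouldAlert(recentStatuses: number[], threshold = 0.05): boolean {
  // Alert when more than 5% of recent responses are 5xx
  return errorRate(recentStatuses) > threshold;
}

// Feed this from middleware or access logs over a sliding window,
// then page someone:
// if (shouldAlert(lastFiveMinutes)) notifyOnCall();
```

Hosted tools give you this out of the box, but knowing the mechanism helps you set sane thresholds instead of alerting on every stray 500.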


Lesson 9: CI/CD is worth the setup time

For months, I deployed manually:

  1. Run tests locally
  2. Build locally
  3. SSH into server
  4. Pull code
  5. Restart app

Every deploy was 15 minutes of manual work. And I'd skip steps when in a hurry.

Now everything is automated:

# .github/workflows/deploy.yml
name: Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies
        run: npm ci

      - name: Run tests
        run: npm test

      - name: Run linter
        run: npm run lint

      - name: Build
        run: npm run build

      - name: Deploy
        run: ./deploy.sh
        env:
          DEPLOY_KEY: ${{ secrets.DEPLOY_KEY }}

Push to main = automatic deploy. Tests fail = deploy blocked. No manual steps = no human error.


Lesson 10: backups are useless until you test them

"We have backups" means nothing if you've never restored one.

Questions to answer:

  • When was the last backup?
  • How long does a restore take?
  • Have you actually tried restoring?
  • Do you backup environment variables and secrets too?

I schedule a quarterly "disaster recovery drill":

  1. Spin up a new environment
  2. Restore from backup
  3. Verify the app works
  4. Document any issues
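Step 3 can be partially automated. A sketch of one sanity check: compare row counts in the restored database against a manifest written at backup time. The manifest shape and tolerance parameter are hypothetical:

```typescript
// Sketch: verify a restore by comparing row counts to a manifest
// written at backup time. All names here are hypothetical.
interface BackupManifest {
  [table: string]: number; // row count at backup time
}

function verifyRestore(
  manifest: BackupManifest,
  restoredCounts: BackupManifest,
  tolerance = 0 // allow small drift if the backup ran against a live database
): string[] {
  const problems: string[] = [];
  for (const [table, expected] of Object.entries(manifest)) {
    const actual = restoredCounts[table] ?? 0;
    if (Math.abs(actual - expected) > tolerance) {
      problems.push(`${table}: expected ~${expected} rows, got ${actual}`);
    }
  }
  return problems; // empty array means the restore looks sane
}
```

Row counts won't catch corrupted data, so treat this as a smoke test alongside actually clicking through the restored app.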

The worst time to learn your backups don't work is during an actual disaster.


The cheat sheet

Lesson                 | Action
Validate env vars      | Use Zod, crash on startup if invalid
Match staging to prod  | Same versions, same OS, same limits
No Friday deploys      | Emergencies only, with team on standby
Plan rollbacks         | Tag releases, reversible migrations
Log everything useful  | Context, not just "error"
Add health checks      | Let infrastructure detect failures
Use secrets managers   | Never commit .env to git
Monitor proactively    | Alerts before users complain
Automate CI/CD         | No manual deploy steps
Test your backups      | Quarterly restore drills

The lesson

DevOps isn't about fancy tools. It's about not getting woken up at 3 AM.

Every lesson here came from pain. Downtime. Angry users. Weekend debugging sessions. Uncomfortable conversations with managers.

The goal is simple: deploy with confidence, sleep peacefully.


This is the final part of my "What I learned the hard way" series. Thanks for reading all 6 parts.

Got questions? Hit me up on LinkedIn or check out more on my blog.