
My first production deployment broke the app for 3 hours.
I had tested everything locally. The code worked. The tests passed. I clicked deploy, went to grab coffee, and came back to a Slack channel on fire.
The database connection string was wrong. One environment variable. Three hours of downtime.
That was the first of many lessons. Here's what I wish someone had told me.
## Lesson 1: environment variables will betray you
Environment variables are the #1 cause of "it works on my machine."
I've seen:
- Typos in variable names (`DATABSE_URL` instead of `DATABASE_URL`)
- Missing variables in production that existed in dev
- Secrets accidentally committed to git
- Different values between staging and production
The fix? Validate on startup.
```typescript
// lib/env.ts
import { z } from 'zod';

const envSchema = z.object({
  DATABASE_URL: z.string().url(),
  STRIPE_SECRET_KEY: z.string().startsWith('sk_'),
  NEXTAUTH_SECRET: z.string().min(32),
  NEXTAUTH_URL: z.string().url(),
});

// This runs when your app starts.
// If any variable is missing or invalid, the app crashes immediately.
// Better to crash on startup than in production at 3 AM.
export const env = envSchema.parse(process.env);
```
Fail fast. If a variable is wrong, crash immediately. Don't wait until a user hits the broken feature.
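The same fail-fast idea works without a library, too. A minimal sketch in plain TypeScript (`requireEnv` is a hypothetical helper name, not part of the code above):

```typescript
// Minimal fail-fast env validation without a dependency (sketch).
// requireEnv is a hypothetical helper, not a library function.
function requireEnv(
  name: string,
  check: (value: string) => boolean = () => true
): string {
  const value = process.env[name];
  if (value === undefined || !check(value)) {
    // Crash on startup, before any request is served
    throw new Error(`Invalid or missing env var: ${name}`);
  }
  return value;
}

// Demo with a fake value so the snippet is self-contained
process.env.DATABASE_URL = 'postgresql://localhost:5432/mydb';
const dbUrl = requireEnv('DATABASE_URL', (v) => v.startsWith('postgresql://'));
console.log(dbUrl);
```

Same principle as the Zod version: one bad variable kills the process at boot, not a user's request at runtime.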
## Lesson 2: staging must match production
"It works in staging" means nothing if staging doesn't match production.
I've debugged issues caused by:
- Different Node.js versions
- Different database versions (PostgreSQL 14 vs 15)
- Different OS (Ubuntu vs Alpine)
- Different memory limits
- Different environment variables
The rule: staging should be a smaller clone of production, not a different setup.
```yaml
# docker-compose.prod.yml
services:
  app:
    image: node:20-alpine # Same as production
    environment:
      - NODE_ENV=production # Same as production
    deploy:
      resources:
        limits:
          memory: 512M # Same limits as production
```
If you can't afford identical infrastructure, at least match:
- Runtime versions (Node, Python, etc.)
- Database versions
- OS base image
- Core environment variables
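You can even enforce the runtime match in code. A sketch (the pinned major version is an assumption you'd set per project, matching whatever your production image runs):

```typescript
// Sketch: fail fast when the runtime doesn't match production's pinned
// major version. The pinned value (20) is an example, not a universal rule.
function nodeMajor(version: string): number {
  // process.version looks like "v20.11.1"
  return Number(version.replace(/^v/, '').split('.')[0]);
}

function matchesPinnedMajor(version: string, pinned: number): boolean {
  return nodeMajor(version) === pinned;
}

// At startup you'd throw if matchesPinnedMajor(process.version, 20) is false
console.log(matchesPinnedMajor('v20.11.1', 20)); // true
console.log(matchesPinnedMajor('v18.19.0', 20)); // false
```

A two-line startup check like this catches the "staging runs Node 18, production runs Node 20" class of bug before it ships.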
## Lesson 3: never deploy on Friday
I learned this the hard way.
Friday 5 PM deployment. Bug appears Saturday morning. Nobody's around. Customers are angry. I spend my weekend fixing it instead of relaxing.
Now my rules:
- No deploys after Thursday 4 PM
- No deploys before a holiday
- No "quick fixes" on Friday
If it's urgent enough to deploy on Friday, it's urgent enough to have the team on standby. If the team can't be on standby, it can wait until Monday.
## Lesson 4: every deploy needs a rollback plan
"We'll fix it forward" is not a plan.
Before every deploy, I ask:
- How do I know if this deployment failed?
- How do I roll back to the previous version?
- How long will the rollback take?
My method: GitHub Releases + Actions.
Each release is a tag on GitHub. No command line needed, you can do everything from the interface:
- Go to Releases → Create new release
- Click Choose a tag → create a new tag (e.g., `v1.2.3`)
- Add a title and release notes (optional but useful)
- Click Publish release
The workflow triggers automatically:
```yaml
# .github/workflows/deploy.yml
name: Deploy on Release

on:
  release:
    types: [published]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Build
        run: npm run build
      - name: Deploy
        run: npm run deploy
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
```
Something broke? Rollback in 3 clicks:
- Go to the Actions tab
- Find the deployment of the previous working version (e.g., `v1.2.2`)
- Click Re-run all jobs
That's it. No git commands to remember. No stress at 3 AM.
For database migrations, it's trickier. My rule: migrations should be reversible or additive.
```sql
-- Bad: can't rollback
ALTER TABLE users DROP COLUMN old_field;

-- Safe: add first, remove later
ALTER TABLE users ADD COLUMN new_field VARCHAR(255);
-- Deploy new code
-- Verify everything works
-- Later, in another release: DROP COLUMN old_field
```
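The "reversible or additive" rule can even be encoded as a check in your migration tooling. A sketch (the `Migration` shape is illustrative, not a real framework's API):

```typescript
// Sketch: a migration is safe to deploy if it's purely additive, or if it
// ships a down step. The interface is illustrative, not a real framework.
interface Migration {
  up: string;
  down?: string; // required unless the change is purely additive
  additive: boolean;
}

function isSafeToDeploy(m: Migration): boolean {
  return m.additive || m.down !== undefined;
}

console.log(isSafeToDeploy({
  up: 'ALTER TABLE users ADD COLUMN new_field VARCHAR(255)',
  additive: true,
})); // true

console.log(isSafeToDeploy({
  up: 'ALTER TABLE users DROP COLUMN old_field',
  additive: false,
})); // false: destructive and no down step
```

A CI step that runs this check on every new migration turns the rule from a convention into a guardrail.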
## Lesson 5: logs are your best friend (when they exist)
The first time a production bug happened, I had no idea what went wrong. No logs. No traces. Just a blank error page.
Now I log everything that matters:
```typescript
// Before: no context
console.log('Error');

// After: actionable information
logger.error('Payment failed', {
  userId: user.id,
  amount: payment.amount,
  errorCode: error.code,
  errorMessage: error.message,
  timestamp: new Date().toISOString(),
});
```
What to log:
- Every external API call (request + response + duration)
- Every database query that fails
- Every authentication attempt
- Every payment transaction
- User actions that modify data
What NOT to log:
- Passwords
- Credit card numbers
- Personal data (GDPR)
- Full request bodies with sensitive info
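One way to enforce the "what NOT to log" list is a redaction pass before anything reaches the logger. A sketch (the key names are examples; extend the set for your own data):

```typescript
// Sketch: strip sensitive fields before logging.
// The key names here are examples, not an exhaustive list.
const SENSITIVE_KEYS = new Set(['password', 'cardNumber', 'ssn', 'token']);

function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    safe[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return safe;
}

console.log(JSON.stringify(redact({ userId: 42, password: 'hunter2' })));
// → {"userId":42,"password":"[REDACTED]"}
```

Wire this into your logger once and you stop relying on every developer remembering the list.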
## Lesson 6: health checks prevent disasters
My app once crashed silently. The process was running, but it wasn't responding to requests. The load balancer kept sending traffic to a dead instance.
Health checks fix this:
```typescript
// pages/api/health.ts (Next.js)
export default async function handler(req, res) {
  try {
    // Check database connection
    await db.query('SELECT 1');
    // Check external services if critical
    // await redis.ping();
    res.status(200).json({ status: 'healthy' });
  } catch (error) {
    res.status(500).json({ status: 'unhealthy', error: error.message });
  }
}
```
Your load balancer/orchestrator hits this endpoint every 30 seconds. If it fails, traffic gets routed elsewhere.
## Lesson 7: secrets belong in a secrets manager
I've committed secrets to git. More than once.
Even if you delete them, they're in the git history. Bots scan GitHub for exposed credentials. They will find yours.
The rules:
- Add `.env` to `.gitignore` on day one
- Use `.env.example` with placeholder values
- Use a secrets manager for production (Vercel env vars, AWS Secrets Manager, Doppler)
- Rotate credentials immediately if exposed
```
# .gitignore
.env
.env.local
.env.production
```

```
# .env.example (commit this)
DATABASE_URL=postgresql://user:password@localhost:5432/mydb
STRIPE_SECRET_KEY=sk_test_xxx
```
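A small script can keep `.env.example` honest by comparing its keys against the variables your code requires. A sketch (the `REQUIRED` list mirrors the Zod schema from Lesson 1 and is project-specific):

```typescript
// Sketch: flag required variables missing from .env.example.
// REQUIRED is an assumption mirroring the env schema from Lesson 1.
const REQUIRED = ['DATABASE_URL', 'STRIPE_SECRET_KEY', 'NEXTAUTH_SECRET', 'NEXTAUTH_URL'];

function missingFromExample(exampleContents: string): string[] {
  const declared = new Set(
    exampleContents
      .split('\n')
      .filter((line) => line.includes('='))
      .map((line) => line.split('=')[0].trim())
  );
  return REQUIRED.filter((name) => !declared.has(name));
}

const example = 'DATABASE_URL=postgresql://localhost:5432/mydb\nSTRIPE_SECRET_KEY=sk_test_xxx';
console.log(missingFromExample(example)); // [ 'NEXTAUTH_SECRET', 'NEXTAUTH_URL' ]
```

Run it in CI and a new required variable can never silently skip the example file.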
If you accidentally commit a secret:
- Rotate the credential immediately
- Remove it from git history with `git filter-branch` or BFG
- Force push (coordinate with your team)
## Lesson 8: monitoring is not optional
"The app is slow" is not actionable. "Average response time jumped from 200ms to 2s at 14:32" is.
Minimum monitoring:
- Uptime: Is the app responding?
- Response time: How fast?
- Error rate: How many 500s?
- Database: Connection pool, query time
- Memory/CPU: Are we running out of resources?
Free/cheap options that work:
- Vercel Analytics (if on Vercel)
- Sentry (errors + performance)
- Better Stack / Uptime Robot (uptime)
- PlanetScale insights (database)
Set up alerts. If error rate spikes, you should know before your users tell you.
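The simplest alert is a threshold on error rate. A sketch (the 5% default is an arbitrary example; tune it for your traffic):

```typescript
// Sketch: alert when the error rate crosses a threshold.
// The 5% default is an example value, not a recommendation.
function errorRate(totalRequests: number, errors: number): number {
  return totalRequests === 0 ? 0 : errors / totalRequests;
}

function shouldAlert(totalRequests: number, errors: number, threshold = 0.05): boolean {
  return errorRate(totalRequests, errors) > threshold;
}

console.log(shouldAlert(1000, 10));  // false: 1% is under the threshold
console.log(shouldAlert(1000, 100)); // true: 10% means something is wrong
```

Every monitoring tool above gives you this out of the box; the point is to decide on the threshold before the incident, not during it.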
## Lesson 9: CI/CD is worth the setup time
For months, I deployed manually:
- Run tests locally
- Build locally
- SSH into server
- Pull code
- Restart app
Every deploy was 15 minutes of manual work. And I'd skip steps when in a hurry.
Now everything is automated:
```yaml
# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test
      - name: Run linter
        run: npm run lint
      - name: Build
        run: npm run build
      - name: Deploy
        run: ./deploy.sh
        env:
          DEPLOY_KEY: ${{ secrets.DEPLOY_KEY }}
```
Push to main = automatic deploy. Tests fail = deploy blocked. No manual steps = no human error.
## Lesson 10: backups are useless until you test them
"We have backups" means nothing if you've never restored one.
Questions to answer:
- When was the last backup?
- How long does a restore take?
- Have you actually tried restoring?
- Do you backup environment variables and secrets too?
I schedule a quarterly "disaster recovery drill":
- Spin up a new environment
- Restore from backup
- Verify the app works
- Document any issues
The worst time to learn your backups don't work is during an actual disaster.
## The cheat sheet
| Lesson | Action |
|---|---|
| Validate env vars | Use Zod, crash on startup if invalid |
| Match staging to prod | Same versions, same OS, same limits |
| No Friday deploys | Emergencies only, with team on standby |
| Plan rollbacks | Tag releases, reversible migrations |
| Log everything useful | Context, not just "error" |
| Add health checks | Let infrastructure detect failures |
| Use secrets managers | Never commit .env to git |
| Monitor proactively | Alerts before users complain |
| Automate CI/CD | No manual deploy steps |
| Test your backups | Quarterly restore drills |
## The lesson
DevOps isn't about fancy tools. It's about not getting woken up at 3 AM.
Every lesson here came from pain. Downtime. Angry users. Weekend debugging sessions. Uncomfortable conversations with managers.
The goal is simple: deploy with confidence, sleep peacefully.
This is the final part of my "What I learned the hard way" series. Thanks for reading all 6 parts.
Got questions? Hit me up on LinkedIn or check out more on my blog.