
My first production deployment broke the app for 3 hours.
I had tested everything locally. The code worked. The tests passed. I clicked deploy, went to grab coffee, and came back to a Slack channel on fire.
The database connection string was wrong. One environment variable. Three hours of downtime.
That was the first of many lessons. Here's what I wish someone had told me.
## Lesson 1: environment variables will betray you
Environment variables are the #1 cause of "it works on my machine."
I've seen:
- Typos in variable names (`DATABSE_URL` instead of `DATABASE_URL`)
- Missing variables in production that existed in dev
- Secrets accidentally committed to git
- Different values between staging and production
The fix? Validate on startup.
```typescript
// lib/env.ts
import { z } from 'zod';

const envSchema = z.object({
  DATABASE_URL: z.string().url(),
  STRIPE_SECRET_KEY: z.string().startsWith('sk_'),
  NEXTAUTH_SECRET: z.string().min(32),
  NEXTAUTH_URL: z.string().url(),
});

// This runs when your app starts.
// If any variable is missing or invalid, the app crashes immediately.
// Better to crash on startup than in production at 3 AM.
export const env = envSchema.parse(process.env);
```
Fail fast. If a variable is wrong, crash immediately. Don't wait until a user hits the broken feature.
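The same fail-fast idea works without a library, too. A minimal sketch in plain TypeScript (`requireEnv` is a hypothetical helper name, not part of the code above):

```typescript
// Minimal fail-fast env validation without a dependency (sketch).
// requireEnv is a hypothetical helper, not a library function.
function requireEnv(
  name: string,
  check: (value: string) => boolean = () => true
): string {
  const value = process.env[name];
  if (value === undefined || !check(value)) {
    // Crash on startup, before any request is served
    throw new Error(`Invalid or missing env var: ${name}`);
  }
  return value;
}

// Demo with a fake value so the snippet is self-contained
process.env.DATABASE_URL = 'postgresql://localhost:5432/mydb';
const dbUrl = requireEnv('DATABASE_URL', (v) => v.startsWith('postgresql://'));
console.log(dbUrl);
```

Same principle as the Zod version: one bad variable kills the process at boot, not a user's request at runtime.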
## Lesson 2: staging must match production
"It works in staging" means nothing if staging doesn't match production.
I've debugged issues caused by:
- Different Node.js versions
- Different database versions (PostgreSQL 14 vs 15)
- Different OS (Ubuntu vs Alpine)
- Different memory limits
- Different environment variables
The rule: staging should be a smaller clone of production, not a different setup.
```yaml
# docker-compose.prod.yml
services:
  app:
    image: node:20-alpine # Same as production
    environment:
      - NODE_ENV=production # Same as production
    deploy:
      resources:
        limits:
          memory: 512M # Same limits as production
```
If you can't afford identical infrastructure, at least match:
- Runtime versions (Node, Python, etc.)
- Database versions
- OS base image
- Core environment variables
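You can even enforce the runtime match in code. A sketch (the pinned major version is an assumption you'd set per project, matching whatever your production image runs):

```typescript
// Sketch: fail fast when the runtime doesn't match production's pinned
// major version. The pinned value (20) is an example, not a universal rule.
function nodeMajor(version: string): number {
  // process.version looks like "v20.11.1"
  return Number(version.replace(/^v/, '').split('.')[0]);
}

function matchesPinnedMajor(version: string, pinned: number): boolean {
  return nodeMajor(version) === pinned;
}

// At startup you'd throw if matchesPinnedMajor(process.version, 20) is false
console.log(matchesPinnedMajor('v20.11.1', 20)); // true
console.log(matchesPinnedMajor('v18.19.0', 20)); // false
```

A two-line startup check like this catches the "staging runs Node 18, production runs Node 20" class of bug before it ships.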
## Lesson 3: never deploy on Friday
I learned this the hard way.
Friday 5 PM deployment. Bug appears Saturday morning. Nobody's around. Customers are angry. I spend my weekend fixing it instead of relaxing.
Now my rules:
- No deploys after Thursday 4 PM
- No deploys before a holiday
- No "quick fixes" on Friday
If it's urgent enough to deploy on Friday, it's urgent enough to have the team on standby. If the team can't be on standby, it can wait until Monday.
## Lesson 4: every deploy needs a rollback plan
"We'll fix it forward" is not a plan.
Before every deploy, I ask:
- How do I know if this deployment failed?
- How do I roll back to the previous version?
- How long will the rollback take?
My method: GitHub Releases + Actions.
Each release is a tag on GitHub. No command line needed, you can do everything from the interface:
- Go to Releases → Create new release
- Click Choose a tag → create a new tag (e.g., `v1.2.3`)
- Add a title and release notes (optional but useful)
- Click Publish release
The workflow triggers automatically:
```yaml
# .github/workflows/deploy.yml
name: Deploy on Release

on:
  release:
    types: [published]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Build
        run: npm run build
      - name: Deploy
        run: npm run deploy
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
```
Something broke? Rollback in 3 clicks:
- Go to the Actions tab
- Find the deployment of the previous working version (e.g., `v1.2.2`)
- Click Re-run all jobs
That's it. No git commands to remember. No stress at 3 AM.
For database migrations, it's trickier. My rule: migrations should be reversible or additive.
```sql
-- Bad: can't rollback
ALTER TABLE users DROP COLUMN old_field;

-- Safe: add first, remove later
ALTER TABLE users ADD COLUMN new_field VARCHAR(255);
-- Deploy new code
-- Verify everything works
-- Later, in another release: DROP COLUMN old_field
```
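The "reversible or additive" rule can even be encoded as a check in your migration tooling. A sketch (the `Migration` shape is illustrative, not a real framework's API):

```typescript
// Sketch: a migration is safe to deploy if it's purely additive, or if it
// ships a down step. The interface is illustrative, not a real framework.
interface Migration {
  up: string;
  down?: string; // required unless the change is purely additive
  additive: boolean;
}

function isSafeToDeploy(m: Migration): boolean {
  return m.additive || m.down !== undefined;
}

console.log(isSafeToDeploy({
  up: 'ALTER TABLE users ADD COLUMN new_field VARCHAR(255)',
  additive: true,
})); // true

console.log(isSafeToDeploy({
  up: 'ALTER TABLE users DROP COLUMN old_field',
  additive: false,
})); // false: destructive and no down step
```

A CI step that runs this check on every new migration turns the rule from a convention into a guardrail.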
## Lesson 5: logs are your best friend (when they exist)
The first time a production bug happened, I had no idea what went wrong. No logs. No traces. Just a blank error page.
Now I log everything that matters:
```typescript
// Before: no context
console.log('Error');

// After: actionable information
logger.error('Payment failed', {
  userId: user.id,
  amount: payment.amount,
  errorCode: error.code,
  errorMessage: error.message,
  timestamp: new Date().toISOString(),
});
```
What to log:
- Every external API call (request + response + duration)
- Every database query that fails
- Every authentication attempt
- Every payment transaction
- User actions that modify data
What NOT to log:
- Passwords
- Credit card numbers
- Personal data (GDPR)
- Full request bodies with sensitive info
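One way to enforce the "what NOT to log" list is a redaction pass before anything reaches the logger. A sketch (the key names are examples; extend the set for your own data):

```typescript
// Sketch: strip sensitive fields before logging.
// The key names here are examples, not an exhaustive list.
const SENSITIVE_KEYS = new Set(['password', 'cardNumber', 'ssn', 'token']);

function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    safe[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return safe;
}

console.log(JSON.stringify(redact({ userId: 42, password: 'hunter2' })));
// → {"userId":42,"password":"[REDACTED]"}
```

Wire this into your logger once and you stop relying on every developer remembering the list.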
## Lesson 6: health checks prevent disasters
My app once crashed silently. The process was running, but it wasn't responding to requests. The load balancer kept sending traffic to a dead instance.
Health checks fix this:
```typescript
// pages/api/health.ts (Next.js)
export default async function handler(req, res) {
  try {
    // Check database connection
    await db.query('SELECT 1');
    // Check external services if critical
    // await redis.ping();
    res.status(200).json({ status: 'healthy' });
  } catch (error) {
    res.status(500).json({ status: 'unhealthy', error: error.message });
  }
}
```
Your load balancer/orchestrator hits this endpoint every 30 seconds. If it fails, traffic gets routed elsewhere.
## Lesson 7: secrets belong in a secrets manager
I've committed secrets to git. More than once.
Even if you delete them, they're in the git history. Bots scan GitHub for exposed credentials. They will find yours.
The rules:
- Add `.env` to `.gitignore` on day one
- Use `.env.example` with placeholder values
- Use a secrets manager for production (Vercel env vars, AWS Secrets Manager, Doppler)
- Rotate credentials immediately if exposed
```
# .gitignore
.env
.env.local
.env.production
```

```
# .env.example (commit this)
DATABASE_URL=postgresql://user:password@localhost:5432/mydb
STRIPE_SECRET_KEY=sk_test_xxx
```
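A small script can keep `.env.example` honest by comparing its keys against the variables your code requires. A sketch (the `REQUIRED` list mirrors the Zod schema from Lesson 1 and is project-specific):

```typescript
// Sketch: flag required variables missing from .env.example.
// REQUIRED is an assumption mirroring the env schema from Lesson 1.
const REQUIRED = ['DATABASE_URL', 'STRIPE_SECRET_KEY', 'NEXTAUTH_SECRET', 'NEXTAUTH_URL'];

function missingFromExample(exampleContents: string): string[] {
  const declared = new Set(
    exampleContents
      .split('\n')
      .filter((line) => line.includes('='))
      .map((line) => line.split('=')[0].trim())
  );
  return REQUIRED.filter((name) => !declared.has(name));
}

const example = 'DATABASE_URL=postgresql://localhost:5432/mydb\nSTRIPE_SECRET_KEY=sk_test_xxx';
console.log(missingFromExample(example)); // [ 'NEXTAUTH_SECRET', 'NEXTAUTH_URL' ]
```

Run it in CI and a new required variable can never silently skip the example file.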
If you accidentally commit a secret:
- Rotate the credential immediately
- Remove it from git history with `git filter-branch` or BFG
- Force push (coordinate with your team)
## Lesson 8: monitoring is not optional
"The app is slow" is not actionable. "Average response time jumped from 200ms to 2s at 14:32" is.
Minimum monitoring:
- Uptime: Is the app responding?
- Response time: How fast?
- Error rate: How many 500s?
- Database: Connection pool, query time
- Memory/CPU: Are we running out of resources?
Free/cheap options that work:
- Vercel Analytics (if on Vercel)
- Sentry (errors + performance)
- Better Stack / Uptime Robot (uptime)
- PlanetScale insights (database)
Set up alerts. If error rate spikes, you should know before your users tell you.
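The simplest alert is a threshold on error rate. A sketch (the 5% default is an arbitrary example; tune it for your traffic):

```typescript
// Sketch: alert when the error rate crosses a threshold.
// The 5% default is an example value, not a recommendation.
function errorRate(totalRequests: number, errors: number): number {
  return totalRequests === 0 ? 0 : errors / totalRequests;
}

function shouldAlert(totalRequests: number, errors: number, threshold = 0.05): boolean {
  return errorRate(totalRequests, errors) > threshold;
}

console.log(shouldAlert(1000, 10));  // false: 1% is under the threshold
console.log(shouldAlert(1000, 100)); // true: 10% means something is wrong
```

Every monitoring tool above gives you this out of the box; the point is to decide on the threshold before the incident, not during it.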
## Lesson 9: CI/CD is worth the setup time
For months, I deployed manually:
- Run tests locally
- Build locally
- SSH into server
- Pull code
- Restart app
Every deploy was 15 minutes of manual work. And I'd skip steps when in a hurry.
Now everything is automated:
```yaml
# .github/workflows/deploy.yml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test
      - name: Run linter
        run: npm run lint
      - name: Build
        run: npm run build
      - name: Deploy
        run: ./deploy.sh
        env:
          DEPLOY_KEY: ${{ secrets.DEPLOY_KEY }}
```
Push to main = automatic deploy. Tests fail = deploy blocked. No manual steps = no human error.
## Lesson 10: backups are useless until you test them
"We have backups" means nothing if you've never restored one.
Questions to answer:
- When was the last backup?
- How long does a restore take?
- Have you actually tried restoring?
- Do you backup environment variables and secrets too?
I schedule a quarterly "disaster recovery drill":
- Spin up a new environment
- Restore from backup
- Verify the app works
- Document any issues
The worst time to learn your backups don't work is during an actual disaster.
## The cheat sheet
| Lesson | Action |
|---|---|
| Validate env vars | Use Zod, crash on startup if invalid |
| Match staging to prod | Same versions, same OS, same limits |
| No Friday deploys | Emergencies only, with team on standby |
| Plan rollbacks | Tag releases, reversible migrations |
| Log everything useful | Context, not just "error" |
| Add health checks | Let infrastructure detect failures |
| Use secrets managers | Never commit .env to git |
| Monitor proactively | Alerts before users complain |
| Automate CI/CD | No manual deploy steps |
| Test your backups | Quarterly restore drills |
## The lesson
DevOps isn't about fancy tools. It's about not getting woken up at 3 AM.
Every lesson here came from pain. Downtime. Angry users. Weekend debugging sessions. Uncomfortable conversations with managers.
The goal is simple: deploy with confidence, sleep peacefully.
This is the final part of my "What I learned the hard way" series. Thanks for reading all 6 parts.
Got questions? Hit me up on LinkedIn or check out more on my blog.