CI/CD Pipeline Optimization Tips
Our CI pipeline used to take 23 minutes per commit. Engineers would push code, context-switch to something else, lose focus, and then come back 30 minutes later to find a red build. After a focused optimization effort, we brought it down to 7 minutes. That is a 70% reduction, and the compounding effect on developer productivity was enormous. Here is exactly what we did.
Build Caching: The Lowest Hanging Fruit
The first thing we tackled was dependency installation. Every build was running npm install from scratch, downloading the same 800MB of node_modules that had not changed since last week. We added dependency caching at multiple levels.
For our GitHub Actions pipeline, the native cache action saved us 3 minutes per build:
- name: Cache node_modules
  uses: actions/cache@v4
  with:
    path: |
      node_modules
      ~/.npm
    key: deps-${{ hashFiles('package-lock.json') }}
    restore-keys: |
      deps-
For Go services, we cache both the module download cache and the build cache:
- name: Cache Go
  uses: actions/cache@v4
  with:
    path: |
      ~/go/pkg/mod
      ~/.cache/go-build
    key: go-${{ hashFiles('go.sum') }}
    restore-keys: |
      go-
The restore-keys fallback is important. When go.sum changes slightly (one new dependency), the exact key misses, and the cache action falls back to the most recently created cache whose key starts with go-. The build then downloads only what is missing, which is far faster than downloading everything from scratch. Across all pipelines, dependency caching saved us an average of 2.5 minutes.
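To make the fallback order concrete, here is a small Python model of the lookup: exact key first, then the newest saved entry whose key starts with one of the restore-keys prefixes. It is only a sketch of the behaviour described above. The `CacheStore` class and `lockfile_key` helper are hypothetical names for illustration, not part of actions/cache.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


def lockfile_key(prefix: str, lockfile_text: str) -> str:
    """Build a key such as 'go-<digest>' from the lockfile contents."""
    digest = hashlib.sha256(lockfile_text.encode()).hexdigest()
    return f"{prefix}{digest}"


@dataclass
class CacheStore:
    """Toy cache (hypothetical): entries are kept oldest-first."""

    entries: list[tuple[str, str]] = field(default_factory=list)

    def save(self, key: str, payload: str) -> None:
        self.entries.append((key, payload))

    def restore(self, key: str, restore_keys: list[str]) -> tuple[str, str] | None:
        """Return (hit_kind, payload): exact match first, then newest prefix match."""
        for saved_key, payload in self.entries:
            if saved_key == key:
                return ("exact", payload)
        for prefix in restore_keys:
            # Walk newest-first so the most recent partial match wins.
            for saved_key, payload in reversed(self.entries):
                if saved_key.startswith(prefix):
                    return ("partial", payload)
        return None


if __name__ == "__main__":
    store = CacheStore()
    store.save(lockfile_key("go-", "module-a v1.0.0"), "modules from last week")
    # go.sum gained one dependency, so the exact key no longer matches.
    new_key = lockfile_key("go-", "module-a v1.0.0\nmodule-b v0.3.0")
    print(store.restore(new_key, ["go-"]))  # ('partial', 'modules from last week')
```

A partial hit still leaves the new dependency to download, but everything already in the restored cache is reused.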
Test Parallelization
Our test suite had 1,847 tests running sequentially. A single slow integration test blocked everything behind it. We split the suite into parallel shards using a fan-out strategy:
jobs:
  test:
    runs-on: ubuntu-latest  # assumed runner label; checkout and install steps omitted
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - name: Run tests
        run: |
          TOTAL_SHARDS=4 SHARD_INDEX=${{ matrix.shard }} \
            npx jest --shard=${{ matrix.shard }}/4
Jest's built-in --shard flag distributes test files across runners. With 4 shards running in parallel, our test phase dropped from 9 minutes to 2.5 minutes. The extra cost of running 4 parallel runners instead of 1 is small on GitHub Actions since you pay per minute: four shards at 2.5 minutes each come to about 10 runner minutes, against 9 for the sequential run. The small overhead comes from things like each shard repeating checkout and dependency setup.
For smarter sharding, we feed test timing data back into the splitter. Our CI writes a test-timings.json file that records how long each test file took. The next build uses this data to distribute tests evenly across shards by execution time rather than file count, which prevents one shard from getting all the slow integration tests.
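One way to balance shards by time is a greedy longest-first packing, sketched below in Python. This is an illustration rather than our exact splitter: it assumes test-timings.json is a flat mapping of test file path to seconds, and `assign_shards` is a hypothetical name. Jest's --shard flag does not read timing data by itself, so the resulting lists would have to be handed to the runner, for example as explicit file paths per shard.

```python
from __future__ import annotations

import heapq
import json


def assign_shards(timings: dict[str, float], shard_count: int) -> list[list[str]]:
    """Greedy longest-first packing: give each file to the currently lightest shard."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Heap of (total_seconds, shard_index) so the lightest shard pops first.
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(shard_count)]
    # Slowest files first; ties are broken by name for deterministic output.
    for path, seconds in sorted(timings.items(), key=lambda kv: (-kv[1], kv[0])):
        total, index = heapq.heappop(heap)
        shards[index].append(path)
        heapq.heappush(heap, (total + seconds, index))
    return shards


if __name__ == "__main__":
    # Illustrative timings in seconds; real data would come from test-timings.json.
    example = json.loads(
        '{"a.test.js": 90, "b.test.js": 60, "c.test.js": 45,'
        ' "d.test.js": 30, "e.test.js": 15}'
    )
    for number, files in enumerate(assign_shards(example, 2), start=1):
        print(number, files)
```

With the sample data, both shards end up with 120 seconds of tests, whereas splitting by file count alone could put the 90-second and 60-second files on the same shard.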
Docker Layer Optimization
Our Docker builds were painfully slow because the Dockerfile was poorly structured. Every code change invalidated the cache from the COPY . . line onward, including dependency installation. The fix is to order the Dockerfile so that the layers that change least often come first and the layers that change most often come last:
# Bad: everything rebuilds on any code change
FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm ci
RUN npm run build

# Good: dependency layer is cached separately
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./
# Full install: the build step below typically needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
By copying only package.json and package-lock.json first, the npm ci layer is cached as long as dependencies do not change. This alone saved 90 seconds per build for our largest service.
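The invalidation rule can be illustrated with a deliberately simplified Python model. It assumes each layer's cache key is a hash of the parent layer's key, the instruction text, and the contents of any files the instruction copies. Real builders are more nuanced, and `layer_keys` and `rebuilt` are hypothetical helpers, but the chaining is what makes every layer after a changed COPY rebuild.

```python
from __future__ import annotations

import hashlib

Step = tuple[str, list[str]]  # (instruction text, files it copies into the image)


def layer_keys(steps: list[Step], files: dict[str, str]) -> list[str]:
    """Chain one cache key per step from parent key, instruction, and copied contents."""
    keys: list[str] = []
    parent = ""
    for instruction, copied in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        for name in sorted(copied):
            h.update(name.encode())
            h.update(files[name].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def rebuilt(steps: list[Step], before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return the instructions whose cache key differs between two file snapshots."""
    old, new = layer_keys(steps, before), layer_keys(steps, after)
    return [instr for (instr, _), a, b in zip(steps, old, new) if a != b]


ALL_FILES = ["package.json", "package-lock.json", "src/server.js"]
BAD: list[Step] = [("COPY . .", ALL_FILES), ("RUN npm ci", []), ("RUN npm run build", [])]
GOOD: list[Step] = [
    ("COPY package.json package-lock.json ./", ["package.json", "package-lock.json"]),
    ("RUN npm ci", []),
    ("COPY . .", ALL_FILES),
    ("RUN npm run build", []),
]

if __name__ == "__main__":
    before = {"package.json": "{}", "package-lock.json": "{}", "src/server.js": "v1"}
    after = {**before, "src/server.js": "v2"}  # a source-only change
    print(rebuilt(BAD, before, after))   # ['COPY . .', 'RUN npm ci', 'RUN npm run build']
    print(rebuilt(GOOD, before, after))  # ['COPY . .', 'RUN npm run build']
```

In the second ordering, a source-only change leaves the npm ci layer untouched, which is where the time saving comes from.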
We also adopted multi-stage builds to keep final images small:
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so only production packages reach the final image
RUN npm prune --omit=dev

FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/server.js"]
The final image contains only the build output and production dependencies. Our average image size dropped from 1.2GB to 180MB, which also sped up image pull times during deployment.
Monorepo Strategies: Only Build What Changed
In a monorepo with 12 services, running every pipeline on every commit is wasteful. We use path-based filtering to only trigger builds for services that actually changed:
on:
  push:
    paths:
      - 'services/payments/**'
      - 'packages/shared-utils/**'
      - '.github/workflows/payments.yml'
The critical detail is including shared packages in the path filter. If shared-utils changes, every service that depends on it needs to rebuild. We maintain a dependency graph in a simple JSON file and a script that generates the path filters. When a shared package changes, CI triggers builds for all downstream services automatically.
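A generator along these lines can be sketched in a few lines of Python. The graph format is an assumption (a mapping from each service or package name to the shared packages it depends on), as are the function names. The directory layout is inferred from the payments example above, and `money` is an invented transitive dependency used only for the demo.

```python
from __future__ import annotations


def transitive_packages(name: str, deps: dict[str, list[str]]) -> list[str]:
    """Collect every shared package reachable from `name`, sorted for stable output."""
    seen: set[str] = set()
    stack = list(deps.get(name, []))
    while stack:
        package = stack.pop()
        if package not in seen:  # the seen-set also guards against cycles
            seen.add(package)
            stack.extend(deps.get(package, []))
    return sorted(seen)


def path_filters(service: str, deps: dict[str, list[str]]) -> list[str]:
    """Build the `paths:` entries for one service's workflow."""
    filters = [f"services/{service}/**"]
    filters += [f"packages/{package}/**" for package in transitive_packages(service, deps)]
    filters.append(f".github/workflows/{service}.yml")
    return filters


if __name__ == "__main__":
    # Hypothetical graph, as it might be loaded from the JSON file.
    graph = {"payments": ["shared-utils"], "shared-utils": ["money"]}
    for entry in path_filters("payments", graph):
        print(f"- '{entry}'")
```

Following dependencies transitively matters: a change to a package that shared-utils itself depends on should still rebuild payments.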
Reducing Flaky Tests
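Flaky tests are the hidden tax on CI performance. A test that fails 5% of the time does not sound bad until you see how it compounds. If every one of 1,800 tests failed independently at that rate, the probability of at least one flaky failure per run would be 1 − 0.95^1800, which is effectively 100%. Even 10 such tests give 1 − 0.95^10, or about 40%. Engineers retry the pipeline, doubling the wall-clock time for that commit.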
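The compounding is easy to check numerically. This short Python sketch assumes flaky failures are independent and that every flaky test has the same failure rate. Both are simplifications, and the counts in the loop are illustrative rather than measured.

```python
def p_any_flaky_failure(flaky_tests: int, failure_rate: float) -> float:
    """Probability that at least one of `flaky_tests` independent tests fails in a run."""
    return 1.0 - (1.0 - failure_rate) ** flaky_tests


if __name__ == "__main__":
    # How quickly a 5% per-test failure rate compounds as flaky tests accumulate.
    for count in (1, 5, 10, 50):
        print(count, round(p_any_flaky_failure(count, 0.05), 3))
```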
We attacked flaky tests with a three-pronged approach:
- Quarantine: Tests that fail more than twice without code changes are automatically moved to a quarantine suite that runs separately and does not block merges. A Slack bot notifies the owning team.
- Test isolation: We banned shared state between tests. Every integration test gets its own database schema created in a beforeAll and dropped in afterAll. This eliminated race conditions between parallel shards.
- Deterministic time: We replaced Date.now() in tests with a fixed clock using jest.useFakeTimers(). Time-dependent tests were our biggest source of flakiness: assertions on "created in the last 5 minutes" would fail when CI was slow.
After 6 weeks of focused cleanup, our flaky test rate dropped from 12% of builds to under 1%. That alone saved an estimated 15 hours of engineer time per week in retries and investigations.
The Full Picture
Here is how each optimization contributed to the 16-minute improvement (23 minutes down to 7):
- Build caching: -2.5 minutes
- Test parallelization: -6.5 minutes
- Docker layer optimization: -1.5 minutes
- Path-based filtering: -3 minutes (average, by skipping unchanged services)
- Flaky test reduction: -2.5 minutes (by eliminating retries)
A fast CI pipeline is not a luxury. It is a direct multiplier on your team's shipping velocity. Every minute you shave off the feedback loop pays dividends across every engineer, every commit, every day.
Start with build caching and test parallelization — they give you the biggest wins with the least effort. Then tackle Docker optimization and flaky tests. Measure everything, and optimize the slowest stage first.