Why I Chose Redis Streams Over Kafka for a Job Queue
Kafka is the default recommendation for async workloads. Here's why operational simplicity and total cost of ownership made Redis Streams the right call for a 50k jobs/day queue.
Every time I’ve described needing an async job queue to another engineer, the first response has been some variation of “just use Kafka.” It’s the safe answer. It scales to billions of messages. Netflix uses it. LinkedIn built it.
I didn’t use it. Here’s why.
The context
The system was a distributed task queue handling around 50,000 jobs per day. Jobs came from multiple producer services, needed to be processed with configurable concurrency, had to support retries with backoff, and required a dead-letter queue for jobs that exhausted retries. Processing latency mattered — we wanted P99 under 500ms.
This wasn’t a streaming analytics pipeline. It was job dispatch — closer to a work queue than an event log.
What Kafka would have required
Kafka is genuinely excellent for what it’s designed for: high-throughput, durable, replayable event streams with multiple independent consumers reading at their own offsets.
For our use case, that power comes with overhead:
- Cluster management. Kafka requires ZooKeeper (or KRaft in newer versions), brokers, and careful partition planning. Even a minimal production cluster is three brokers. That’s infrastructure to provision, monitor, and operate.
- Operational complexity. Consumer group offset management, partition rebalancing, and lag monitoring are non-trivial. They require tooling expertise and generate operational toil.
- Cost. At 50k jobs/day we’d be paying for substantial infrastructure that’s sized for a load profile we don’t have.
- Exactly-once semantics are hard. Kafka provides at-least-once by default. Exactly-once requires idempotent producers plus transactions spanning the consume-process-produce loop, which is real complexity to implement correctly.
The complexity is worth it at scale. At 50k jobs/day, it isn’t.
Why Redis Streams fit better
Redis Streams were introduced in Redis 5.0 and are often overlooked in favor of Kafka or RabbitMQ. They’re a persistent, append-only log structure with native consumer group support. That last part matters.
Consumer groups give you:
- Multiple workers competing to claim messages (exactly what a job queue needs)
- Acknowledgement semantics — a message stays in the pending entries list until a worker ACKs it, so crashes don’t drop jobs
- XREADGROUP for atomic claim-and-process, with no separate lock required
The implementation looked like this:
Producer              Redis Stream              Workers
   │                       │                       │
   │──── XADD ────────────►│                       │
   │                       │◄──── XREADGROUP ──────│
   │                       │                       │── process ──►
   │                       │◄──── XACK ────────────│
Workers call XREADGROUP with a consumer group name and a count. Redis atomically delivers unclaimed messages to the caller and moves them to a pending entries list (PEL). On success, the worker calls XACK. On failure (or crash), the message stays in the PEL and gets redelivered to another worker after a configurable timeout via XAUTOCLAIM (available from Redis 6.2; older versions use XPENDING plus XCLAIM for the same pattern).
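That lifecycle is easier to see in code. Here is a minimal in-memory model of the claim/PEL/ACK semantics in plain Python. To be clear, this is an illustrative sketch, not the redis-py API, and names like StreamGroup and PendingEntry are mine:

```python
import time
from dataclasses import dataclass

@dataclass
class PendingEntry:
    message: dict
    consumer: str
    claimed_at: float
    delivery_count: int = 1

class StreamGroup:
    """In-memory model of a Redis Streams consumer group's PEL semantics."""

    def __init__(self):
        self.undelivered = []   # messages no worker has claimed yet
        self.pel = {}           # id -> PendingEntry (claimed, not yet ACKed)
        self._next_id = 0

    def xadd(self, message):
        self.undelivered.append((self._next_id, message))
        self._next_id += 1

    def xreadgroup(self, consumer, count=10):
        """Atomically claim unread messages: they move into the PEL."""
        claimed = self.undelivered[:count]
        del self.undelivered[:count]
        for mid, msg in claimed:
            self.pel[mid] = PendingEntry(msg, consumer, time.monotonic())
        return claimed

    def xack(self, message_id):
        """Successful processing removes the entry from the PEL."""
        self.pel.pop(message_id, None)

    def xautoclaim(self, consumer, min_idle):
        """Reassign PEL entries whose claim has been idle past min_idle."""
        now = time.monotonic()
        reclaimed = []
        for mid, entry in self.pel.items():
            if now - entry.claimed_at >= min_idle:
                entry.consumer = consumer
                entry.claimed_at = now
                entry.delivery_count += 1
                reclaimed.append((mid, entry.message))
        return reclaimed
```

The crash path is the interesting part: a worker that claims a message and dies never calls xack, so the entry sits in the PEL until another worker reclaims it and the delivery count ticks up.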
Dead-letter handling was a separate stream. After three delivery attempts, a job moved to queue:dlq with its full error history. A separate monitor watched DLQ depth and alerted when it grew.
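The dead-letter hop reduces to a small routing rule. A sketch of the decision, with the three-attempt threshold from above (the function and field names are mine, not from the actual system):

```python
MAX_DELIVERIES = 3

def route_failed_job(job_id, job, delivery_count, errors, dlq):
    """After a failed attempt, either leave the job for redelivery or
    move it to the dead-letter stream with its full error history."""
    if delivery_count >= MAX_DELIVERIES:
        dlq.append({"id": job_id, "job": job, "errors": list(errors)})
        return "dead-lettered"
    return "retry"
```

Keeping the full error history on the DLQ entry is what makes the alerting useful: the monitor can show why a job exhausted its retries, not just that it did.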
The tradeoffs I accepted
Redis is not Kafka. A few real limitations:
- Limited replay. Kafka keeps the full log on disk, so independent consumer groups can re-read the same stream from any offset, even months later. Redis Streams do support multiple consumer groups, but the stream lives in memory and gets trimmed, so history is short-lived. If you need several independent services to replay every event, Redis Streams isn't the right tool.
- Memory-bound. Redis is an in-memory data structure server. Large backlogs need either MAXLEN trimming (which loses history) or a large Redis instance. We set MAXLEN ~100,000 and kept PostgreSQL as the source of truth for job metadata; the stream was the delivery mechanism, not the record of what happened.
- Single-node throughput ceiling. Redis Cluster shards streams, but consumer groups don't work cleanly across shards. At high enough throughput this becomes a constraint. At 50k jobs/day it's nowhere near relevant.
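The history-loss tradeoff of MAXLEN is easy to model. A toy sketch of capped-stream behavior (illustrative only; real XADD with MAXLEN ~ trims in whole macro-nodes, so the bound is approximate rather than exact):

```python
def xadd_with_maxlen(stream, entry, maxlen):
    """Append an entry, then drop the oldest entries beyond maxlen.
    Trimmed history is gone from the stream, which is why a durable
    store (PostgreSQL in our case) stays the record of what happened."""
    stream.append(entry)
    del stream[:-maxlen]   # keep only the newest maxlen entries
    return stream
```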
The outcome
P99 processing latency: 380ms (down from ~45s with the previous cron-based polling approach). DLQ surfaced 12 previously invisible failure modes in the first week. Operational overhead: roughly one Redis instance to monitor, no cluster, no offset management tooling.
When to use each
Use Redis Streams when:
- You need a reliable work queue with competing consumers
- You want acknowledgement semantics without Kafka’s operational cost
- Your throughput is in the millions of messages per day or below
- You already have Redis in your stack
Use Kafka when:
- Multiple independent consumer groups need to read the same event stream
- You need long-term event retention and replay
- You’re building streaming analytics or event sourcing
- Your throughput justifies the operational investment
The temptation is to reach for the most powerful tool. The better instinct is to reach for the simplest one that solves the problem. Redis Streams solved this problem. Kafka would have solved it too — and required three times the infrastructure to do it.