Event-Driven Architecture in Django Without Losing Your Mind
Django gets dismissed in architecture conversations the moment someone says "event-driven." The assumption is that serious async systems need Go or Node or a fleet of Kafka-native microservices, and that Django is for CRUD apps that stopped evolving in 2015.
I've spent the last few years building event-driven backend workflows for interconnected platform services — wallets, authentication, notifications — largely on Django. It works, it scales further than most products will ever need, and the patterns that make it work are transferable to any stack.
Start with why: what events actually buy you
The naive version of a platform service does everything inline: a user tops up a wallet, and the request handler updates the balance, writes a ledger entry, sends a notification, syncs an analytics event, and calls a partner webhook. Five responsibilities, one transaction, one timeout away from an inconsistent state.
Events decouple what happened from what reacts to it. The wallet service records "TopUpCompleted" and moves on. Notifications, analytics, and webhooks each consume that event on their own schedule, with their own retry policies. A slow partner API no longer holds a database transaction hostage.
The outbox pattern is non-negotiable
The classic bug in event-driven systems: you commit the database transaction, then publish the event — and the publish fails. Or you publish first, and the transaction rolls back. Either way, your system now believes two different things.
The transactional outbox fixes this with plain SQL:
- ▸Write the event to an outbox table in the same transaction as the state change.
- ▸A background worker polls the table and publishes pending events to the broker.
- ▸Mark rows as published only after the broker confirms.
In Django this is almost embarrassingly simple — an Outbox model, a transaction.atomic() block, and a management command running in a loop. No exotic infrastructure. We handled real money flows with this pattern and never lost an event.
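The three steps above can be sketched in a few lines. This is a minimal illustration, not our production code: the standard-library sqlite3 module stands in for the Django ORM, `with conn:` plays the role of transaction.atomic(), and the table layout and the names top_up and publish_pending are assumptions made for the example.

```python
import json
import sqlite3
import uuid

# sqlite3 stands in for the Django ORM; all table and function names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wallet (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO wallet (id, balance) VALUES (1, 0);
""")


def top_up(conn, wallet_id, amount):
    """Apply the state change and record the event in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "UPDATE wallet SET balance = balance + ? WHERE id = ?",
            (amount, wallet_id),
        )
        conn.execute(
            "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
            (
                str(uuid.uuid4()),
                "TopUpCompleted",
                json.dumps({"wallet_id": wallet_id, "amount": amount}),
            ),
        )


def publish_pending(conn, publish):
    """One pass of the relay loop. Returns how many events were published."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    sent = 0
    for event_id, event_type, payload in rows:
        try:
            publish(event_id, event_type, json.loads(payload))
        except Exception:
            continue  # row stays pending; the next pass retries it
        # Mark as published only after the broker call returned without error.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
        sent += 1
    return sent
```

Note that if the process dies between a successful publish and the UPDATE, the event is published again on the next pass. The outbox gives you at-least-once publishing, which is why the consumer side has to tolerate duplicates.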
Consumers must be idempotent, because retries are a promise
Most messaging systems worth using offer at-least-once delivery, which is a polite way of saying "you will process duplicates." Design for it from the first consumer:
- ▸Natural idempotency where possible. "Set status to ACTIVE" can run twice harmlessly. Prefer absolute state over relative mutations: balance = X is safer than balance += Y.
- ▸Idempotency keys where it isn't. Ledger entries carry the event ID with a unique constraint. A duplicate insert violates the constraint; the consumer catches the error, acknowledges, and moves on.
- ▸Isolate side effects. Sending an email twice is annoying; charging a card twice is an incident report. Gate the dangerous operations behind the dedup check, always.
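A minimal sketch of the idempotency-key approach, again using sqlite3 as a stand-in for a Django model with a unique constraint. The ledger_entry table, the event dictionary shape, and the handle_top_up function are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ledger_entry (
        event_id TEXT NOT NULL UNIQUE,  -- idempotency key: one row per event
        wallet_id INTEGER NOT NULL,
        amount INTEGER NOT NULL
    );
""")


def handle_top_up(conn, event, side_effect):
    """Return True if the event was applied, False if it was a duplicate."""
    try:
        with conn:
            # The unique constraint is the dedup check: a redelivered event
            # raises IntegrityError before anything else happens.
            conn.execute(
                "INSERT INTO ledger_entry (event_id, wallet_id, amount) "
                "VALUES (?, ?, ?)",
                (event["id"], event["wallet_id"], event["amount"]),
            )
            # The side effect is gated behind the insert. If it raises, the
            # insert rolls back and the exception propagates, so a later
            # retry is free to try again.
            side_effect(event)
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: acknowledge and move on
    return True
```

Delivering the same event twice calls side_effect once: the first call returns True, the second returns False without touching the ledger. One caveat for this sketch: a side effect that succeeds just before the commit fails can still run twice, so truly dangerous operations should also pass the event ID to the downstream system as its own idempotency key where that system supports one.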
Keep the event schema boring
Events are an API, and like any API, they outlive your enthusiasm for changing them. Lessons learned the slow way:
- ▸Fat events beat chatty consumers. Include the data consumers need in the payload. An event that's just an ID forces every consumer to call back into your service, recreating the coupling you were escaping.
- ▸Version from day one. A version field costs nothing now and saves a coordinated multi-service deploy later.
- ▸Name events in past tense. "OrderShipped," not "ShipOrder." Events describe facts, not commands. This sounds pedantic until someone builds a consumer that thinks it's being instructed rather than informed.
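One way to hold all three rules in a single place is a small event envelope. The sketch below is an assumption about what such an envelope could look like, not a schema from our system: the Event class, its field names, and SUPPORTED_VERSION are all hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

SUPPORTED_VERSION = 1  # highest schema version this consumer understands


@dataclass(frozen=True)
class Event:
    event_type: str  # past tense: a fact, not a command
    payload: dict    # "fat": carries the data consumers need
    version: int = 1
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def parse_event(raw: str) -> Event:
    """Decode an event, rejecting versions this consumer does not know."""
    data = json.loads(raw)
    if data["version"] > SUPPORTED_VERSION:
        raise ValueError(f"unsupported event version: {data['version']}")
    return Event(**data)
```

A producer would build something like Event("OrderShipped", {"order_id": 42, "carrier": "DHL"}) and send to_json() over the wire. A consumer that meets a newer version fails loudly in parse_event instead of silently misreading fields.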
Ordering is a lie you tell yourself
Sooner or later someone asks: "but what if the events arrive out of order?" The honest answer is that in any system with retries and multiple workers, they will — and designing consumers that depend on ordering is building on sand.
Two strategies cover almost every real case:
- ▸Design events to be order-independent. If "WalletCredited" and "WalletDebited" both carry the resulting balance and a monotonic sequence number, a consumer can detect and discard stale updates regardless of arrival order. State-carrying events make ordering a non-issue.
- ▸Partition where order genuinely matters. When a strict sequence is unavoidable — ledger entries for a single account, say — route all events for that entity to the same partition and process that partition serially. Order within an entity is cheap; global order is a distributed-systems tax you should refuse to pay.
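Both strategies fit in a short sketch. The BalanceProjection class and the partition_for function are hypothetical names, and the event shape (wallet_id, sequence, balance) is an assumption based on the state-carrying events described above.

```python
import hashlib


class BalanceProjection:
    """Tracks the latest known balance per wallet, ignoring stale events."""

    def __init__(self):
        self._state = {}  # wallet_id -> (sequence, balance)

    def apply(self, event) -> bool:
        """Apply a state-carrying event. Returns False if it was stale."""
        wallet_id = event["wallet_id"]
        last_seq, _ = self._state.get(wallet_id, (-1, None))
        if event["sequence"] <= last_seq:
            return False  # older than (or same as) what we already hold
        self._state[wallet_id] = (event["sequence"], event["balance"])
        return True

    def balance(self, wallet_id):
        return self._state[wallet_id][1]


def partition_for(entity_id: str, partitions: int) -> int:
    """Map an entity to a stable partition so its events stay in one lane."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

If the event with sequence 2 arrives before the event with sequence 1, the projection keeps the balance from sequence 2 and discards the late arrival. The partition function uses a cryptographic hash rather than Python's built-in hash() because the built-in is randomized per process for strings, which would route the same entity to different partitions on different workers.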
Dead letters are where systems go to rot
Every consumer eventually hits an event it cannot process — malformed payload, missing foreign key, a bug. The default failure mode is silent: the message retries forever, clogs the queue, and nobody notices until something downstream starves.
We route anything that fails N retries to a dead-letter queue, and — this is the part most teams skip — we treat a non-empty DLQ as an alerting condition, not a landfill. Every dead letter is a bug report from production. The discipline that kept ours near zero: weekly review, root-cause each message, and either fix the producer, fix the consumer, or add an explicit rejection rule. An ignored DLQ is just data loss on a delay timer.
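The retry-then-dead-letter flow can be sketched with an in-memory queue. This is an illustration only: a real broker tracks attempts and redelivery itself, and MAX_RETRIES, drain, and dlq_needs_attention are hypothetical names.

```python
from collections import deque

MAX_RETRIES = 3  # the "N" above; the right value depends on the workload


def drain(queue, handler, dead_letters, max_retries=MAX_RETRIES):
    """Process messages until the queue is empty.

    A message whose handler keeps failing is moved to dead_letters after
    max_retries attempts instead of being retried forever.
    """
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= max_retries:
                message["error"] = repr(exc)  # keep the reason for the review
                dead_letters.append(message)
            else:
                queue.append(message)  # requeue at the back for another try


def dlq_needs_attention(dead_letters) -> bool:
    """A non-empty DLQ is an alerting condition, not a landfill."""
    return len(dead_letters) > 0
```

Storing the exception alongside the message is what makes the weekly review practical: each dead letter arrives with its own first clue about whether the producer, the consumer, or the data is at fault.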
You need less infrastructure than you think
We didn't start with Kafka. Redis Streams handled our early volume fine, and Django + Celery covered the worker orchestration we needed. The migration path to a heavier broker exists when the numbers demand it — most systems never get there.
The real lesson: event-driven architecture is a set of data-integrity disciplines, not a technology purchase. Outbox for publishing, idempotency for consuming, boring schemas for longevity. Get those right in Django, and the framework stops being the limiting factor — because it never was.