How Stripe Detects Fraud at the Moment of Payment Authorization — ML in the Critical Path
Every time a card is charged through Stripe, a machine learning model runs inside the authorization path — not asynchronously, not in a nightly batch, but synchronously, in the few hundred milliseconds between "customer clicks Pay" and "bank approves or declines." That model considers hundreds of features about the card, the buyer, the merchant, the device, and signals drawn from the entire Stripe network. It produces a risk score. The score drives whether the transaction is allowed through, silently blocked, or flagged for 3D Secure review. And the entire pipeline — feature extraction, feature lookup, model scoring, decisioning — has to fit inside a latency budget so tight that a single slow database call can cause a checkout to visibly stall.
This post is a deep walk-through of how Stripe Radar actually works, grounded in Stripe's own engineering blog posts, public talks from Stripe engineers, and the broader ML systems literature that describes the class of architecture Radar fits into. The exact model weights are proprietary. The architecture around the model is not — and the architecture is where all the engineering lives.
Why Real-Time Fraud Detection Is Brutally Hard
Before getting into how Stripe solves it, it is worth appreciating why "detect fraud in real time" is not a solved problem you can buy off the shelf. The constraints are:
- Latency budget. Card authorization has a tight end-to-end time budget. Every millisecond Radar spends is a millisecond Stripe cannot spend on the actual card network handshake. The industry target for the full authorization flow is typically under a second, and Radar is only one step of many in that flow.
- Availability requirement. If Radar is down, Stripe still has to process payments — the alternative is refusing transactions for every merchant in the world. The system must fail gracefully to a safe default rather than fail closed.
- Adversarial inputs. Fraudsters actively probe the system. They test stolen cards with small charges, they randomize device fingerprints, they rotate IPs. The model must generalize to patterns it has not seen before, because the patterns it has seen will be stale within days.
- Class imbalance. Fraud is rare. Most transactions are legitimate. A model that always says "not fraud" is technically 99%+ accurate — and useless. Real metrics track precision and recall at specific decision thresholds, not raw accuracy.
- Cost of false positives. Every legitimate customer Radar blocks is lost revenue and a support ticket. Every fraudulent charge Radar lets through is a chargeback, a fee, and a damaged merchant trust score. The model must balance both costs — and the balance differs per merchant.
These constraints rule out most naive architectures. You cannot synchronously query a slow warehouse. You cannot call a heavy deep-learning model that takes 500ms. You cannot hold locks on hot rows. Every choice has to work under the latency and availability envelope, all the time.
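The class-imbalance point is worth making concrete. Here is a tiny self-contained sketch — synthetic numbers, invented for illustration — showing why raw accuracy is useless for fraud and why precision and recall at a chosen threshold are the metrics that matter:

```python
# Illustrative only: a synthetic dataset of (risk_score, is_fraud) pairs.
# 1000 transactions, of which only 2 are fraudulent.
transactions = [(5, False)] * 990 + [(60, False)] * 8 + [(90, True), (40, True)]

def metrics_at_threshold(txns, threshold):
    """Block everything scoring >= threshold; report (precision, recall)."""
    blocked = [(s, f) for s, f in txns if s >= threshold]
    true_pos = sum(1 for _, f in blocked if f)
    total_fraud = sum(1 for _, f in txns if f)
    precision = true_pos / len(blocked) if blocked else 0.0
    recall = true_pos / total_fraud if total_fraud else 0.0
    return precision, recall

# A model that never blocks anything is 99.8% "accurate" — and catches nothing.
accuracy_of_never_block = sum(1 for _, f in transactions if not f) / len(transactions)

# At threshold 75 we block only the score-90 fraud: perfect precision,
# but we miss the score-40 fraud, so recall is only 0.5.
p, r = metrics_at_threshold(transactions, 75)
```

Lowering the threshold raises recall but drags in legitimate transactions, which is exactly the false-positive cost trade-off described above.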
The Three-Layer Architecture That Makes It Possible
Stripe Radar, and systems like it built by other payment processors and card networks, follows a three-layer architecture:
- Feature extraction layer — derive signals from the incoming transaction and from recent history
- Model serving layer — run a trained ML model over those features and produce a risk score
- Decision layer — apply the score plus merchant-specific rules to produce an allow / block / challenge outcome
Each layer has its own engineering problems. Let us walk through them.
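Before drilling in, here is a minimal end-to-end sketch of how the three layers compose on the hot path. Every name, feature, and threshold is hypothetical — Stripe's internal interfaces are not public — but the shape of the flow is the point:

```python
# Hypothetical end-to-end sketch; all names and thresholds invented for
# illustration, not Stripe's actual API.

def extract_features(txn, feature_store):
    """Layer 1: merge request-level fields with precomputed aggregates."""
    features = dict(txn["request"])                         # amount, currency, ...
    features.update(feature_store.get(txn["card_id"], {}))
    features.update(feature_store.get(txn["merchant_id"], {}))
    return features

def score(features):
    """Layer 2 stand-in: a real system runs a trained model in-process here."""
    return min(99, features.get("card_charges_1h", 0) * 10
                   + (30 if features["amount"] > 400 else 0))

def decide(risk, block_at=75, challenge_at=40):
    """Layer 3: pure, auditable thresholding on the score."""
    if risk >= block_at:
        return "block"
    if risk >= challenge_at:
        return "challenge"   # e.g. escalate to 3D Secure
    return "allow"

feature_store = {"card_123": {"card_charges_1h": 6}}        # stand-in KV store
txn = {"card_id": "card_123", "merchant_id": "m_9",
       "request": {"amount": 500, "currency": "usd"}}
outcome = decide(score(extract_features(txn, feature_store)))   # "block" here
```

All three calls run synchronously in one process; there is no queue, no batch job, and no remote model service between them.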
Layer 1 — Feature Extraction and the Feature Store
A feature is any input the model cares about. For a fraud model, features fall into a few categories:
- Request features — properties of the current transaction: amount, currency, card BIN, card type, IP geography, merchant ID, attempted product category.
- Aggregated card features — how has this card behaved recently? How many charges in the last hour, day, week? How many declines? How many distinct merchants? Average transaction size?
- Aggregated merchant features — what does normal traffic look like for this merchant? What is their typical fraud rate? Are they in a high-risk vertical?
- Device features — browser, operating system, screen resolution, timezone, fingerprint hash, whether Radar.js detected suspicious behavior like a headless browser.
- Network-effect features — has this card been seen recently at other Stripe merchants? Was it declined? Any chargeback history across the Stripe network? This is the unfair advantage of operating at Stripe's scale.
The challenge is that aggregated features cannot be computed synchronously inside the authorization path. You cannot, at charge time, scan every historical transaction for the card to count how many it has had in the last hour — the latency would be disastrous. Instead, Radar uses a feature store architecture.
What a feature store actually does
A feature store is a two-sided piece of infrastructure:
- Write path (streaming, asynchronous): as transactions happen, features are continuously computed by a streaming pipeline. For every charge, the system updates "count of charges for this card in the last hour," "distinct merchants in the last day," and similar aggregations. These updates land in a low-latency key-value store keyed by entity ID (card ID, merchant ID, device ID) — the "online store" in feature-store terminology.
- Read path (serving, synchronous): at authorization time, the feature extraction layer does a bounded number of key lookups — typically one per entity — into that key-value store to fetch the precomputed features. Each lookup is sub-millisecond, and the total feature-fetch cost is bounded by the number of entities involved, not by the depth of history.
This is how Radar can know "this card has been used at 14 merchants in the last 6 hours" without actually scanning historical data at authorization time. The scan happened asynchronously as each of those 14 charges was processed. At authorization time, the feature is already sitting there as a counter, ready to read.
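A generic way to implement such counters — not Stripe's actual code — is time-bucketed counts in a key-value store: O(1) increments on the write path, a bounded handful of lookups on the read path:

```python
from collections import defaultdict

class WindowedCounter:
    """Approximate 'events per entity in the last N seconds' with fixed time
    buckets. Increments happen asynchronously as charges stream in; the hot
    path just sums a bounded handful of bucket keys. A real deployment would
    back this with Redis/DynamoDB-style storage; the dict is a stand-in."""

    def __init__(self, window_s=3600, bucket_s=300):
        self.window_s, self.bucket_s = window_s, bucket_s
        self.store = defaultdict(int)

    def _bucket(self, ts):
        return int(ts // self.bucket_s)

    def record(self, entity_id, ts):
        """Write path: O(1) increment, no history scan."""
        self.store[(entity_id, self._bucket(ts))] += 1

    def count(self, entity_id, ts):
        """Read path: window_s / bucket_s lookups, regardless of history depth."""
        newest = self._bucket(ts)
        n = self.window_s // self.bucket_s
        return sum(self.store.get((entity_id, b), 0)
                   for b in range(newest - n + 1, newest + 1))

counter = WindowedCounter()
for t in (0, 100, 200, 4000):              # three quick charges, one much later
    counter.record("card_123", ts=t)
recent = counter.count("card_123", ts=4100)    # only the t=4000 charge remains
```

Note the trade-off: bucketing makes the window approximate (stale events age out one bucket at a time), which is acceptable for risk features and is what keeps both paths constant-time.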
Device fingerprinting with Radar.js
Features about the buyer's device come from a separate pipeline. When a checkout page loads Stripe's JavaScript SDK, it includes Radar.js — a client-side script that collects information about the browser, screen, timezone, installed fonts, canvas rendering quirks, mouse movement patterns, and typing rhythms. This data is sent to Stripe alongside the payment details and fed into the model as device-level features.
This is why fraudsters using headless browsers, botnets, or emulators often score poorly: their device signatures do not match any normal human's profile, and the model has seen enough of these patterns to recognize them. It is also why Stripe (and other payment providers) strongly recommend embedding their JS on checkout pages — without it, the model is flying blind on an entire category of features.
Layer 2 — Model Serving Inside the Critical Path
Once the feature vector is assembled, it has to be scored by a model. This is where most ML systems fail at production scale — either because the model is too large to run quickly, or because the serving infrastructure adds too much overhead, or because the model was trained offline and never productionized.
Stripe's public engineering writing has historically referenced gradient-boosted decision trees as the core model family, which is a sensible choice for this kind of problem:
- They are fast to score — a few microseconds per tree, so even an ensemble of hundreds of trees fits in roughly a millisecond
- They handle mixed feature types (categorical, numeric, binary) without expensive preprocessing
- They are interpretable via feature importance, which matters for audit and debugging
- They are robust to feature noise and missing values, which matters when the feature store has partial data
- They do not require GPUs, so serving at high QPS is cheap
Modern fraud systems increasingly use neural networks or hybrid approaches, but the fundamental constraint is unchanged: whatever model is used must fit inside the authorization latency budget even under load spikes. This typically means the model serving layer runs inside the same service process as the feature extraction, avoiding a network hop between them.
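To see why in-process tree scoring is so cheap, here is a toy gradient-boosted scorer. The trees, features, and 0-100 squashing are all invented for illustration; the point is that inference is just walking a few small data structures, with no network hop:

```python
import math

# Each tree splits on one feature at a time and ends in a leaf margin.
# These two toy trees are invented for illustration.
TREES = [
    {"feature": "card_charges_1h", "threshold": 5,
     "left": {"leaf": -2.0},
     "right": {"feature": "amount", "threshold": 400,
               "left": {"leaf": 0.5}, "right": {"leaf": 2.5}}},
    {"feature": "distinct_merchants_6h", "threshold": 10,
     "left": {"leaf": -1.0}, "right": {"leaf": 3.0}},
]

def eval_tree(node, features):
    """Walk one tree; a missing feature falls to the left (low-risk) branch,
    which is how partial feature-store data degrades gracefully."""
    while "leaf" not in node:
        value = features.get(node["feature"], 0)
        node = node["left"] if value < node["threshold"] else node["right"]
    return node["leaf"]

def risk_score(features):
    """GBDT-style scoring: sum the leaf margins, squash to a 0-100 score."""
    margin = sum(eval_tree(tree, features) for tree in TREES)
    return round(100 / (1 + math.exp(-margin)))

suspicious = {"card_charges_1h": 9, "amount": 500, "distinct_merchants_6h": 14}
normal = {"card_charges_1h": 1, "amount": 40, "distinct_merchants_6h": 1}
```

Each tree evaluation is a handful of comparisons and pointer hops, which is why scaling to hundreds of real trees still fits comfortably inside a millisecond-scale budget.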
Shadow scoring and online evaluation
A critical piece of infrastructure that does not show up in architecture diagrams is shadow scoring. When a new model version is being tested, it is deployed alongside the production model and both models score every transaction. The production model's decision is used. The new model's scores are logged and compared offline. Only after the new model has been shadow-scored for enough traffic to prove it is better (by whatever metric Stripe cares about) does it get promoted to production. This is how Radar evolves without ever risking a bad model version affecting real payments.
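The pattern is simple to sketch (generic, not Stripe's code): the production model's score is the only one acted on, and a broken challenger must never leak into the live decision:

```python
import logging

log = logging.getLogger("shadow")

def score_with_shadow(features, production_model, shadow_model=None):
    """Production score decides; the shadow score is logged, never acted on."""
    prod_score = production_model(features)
    if shadow_model is not None:
        try:
            shadow_score = shadow_model(features)
            log.info("shadow_score=%s prod_score=%s", shadow_score, prod_score)
        except Exception:
            # A broken candidate model must never affect the live decision.
            log.exception("shadow model failed")
    return prod_score

# Even a challenger that crashes outright leaves the decision untouched.
decision_score = score_with_shadow({"amount": 500},
                                   production_model=lambda f: 20,
                                   shadow_model=lambda f: 1 / 0)
```

The logged score pairs are what the offline comparison consumes: same traffic, same features, two models, so any metric difference is attributable to the model alone.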
Layer 3 — The Decision Layer
The model outputs a risk score, typically a number between 0 and 100. A score of 95 is almost certainly fraud. A score of 5 is almost certainly legitimate. The middle is where the interesting decisions happen.
The decision layer turns the score into an action using:
- Merchant-specific thresholds — a merchant shipping physical luxury goods and one selling instantly delivered digital gift cards face very different fraud exposure, so each needs a different bar. Stripe lets merchants configure their own risk tolerance.
- Rule overrides — merchants can add explicit rules like "block all charges from IP addresses in country X" or "always require 3D Secure for amounts over Y." Rules run alongside the model and can force-allow or force-block regardless of score.
- 3D Secure triggers — for transactions in the ambiguous middle, Radar can trigger 3D Secure, which punts the decision to the cardholder's issuing bank with an out-of-band authentication challenge. This is a way to get a second opinion without absorbing the fraud risk itself.
- Network-level blocks — Stripe itself maintains global blocklists for cards known to be compromised across its merchant base. These override any individual merchant's decision.
The decision layer is where most of the business logic lives, and it is deliberately kept simple and rule-driven. The ML part of the system is the model score. Everything downstream of the score is readable, auditable rules that a human operator can reason about.
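A sketch of what such a deliberately simple decision layer can look like — the thresholds, rule shapes, and blocklist here are all invented for illustration:

```python
# Illustrative decision layer: plain, auditable logic over a model score.
NETWORK_BLOCKLIST = {"card_evil"}   # processor-level blocks win over everything

def decide(txn, score, merchant_cfg):
    """Precedence: network block > merchant rule > score thresholds."""
    if txn["card_id"] in NETWORK_BLOCKLIST:
        return "block"
    for rule in merchant_cfg.get("rules", []):
        if rule["matches"](txn):
            return rule["action"]            # force-allow / force-block / challenge
    if score >= merchant_cfg["block_at"]:
        return "block"
    if score >= merchant_cfg["challenge_at"]:
        return "challenge"                    # trigger 3D Secure
    return "allow"

cfg = {"block_at": 75, "challenge_at": 50,
       "rules": [{"matches": lambda t: t["amount"] > 1000,
                  "action": "challenge"}]}

# A merchant rule fires even though the model score is low.
txn = {"card_id": "card_ok", "amount": 1500}
outcome = decide(txn, score=10, merchant_cfg=cfg)
```

Nothing here requires ML expertise to audit: an operator can read the precedence order and each rule and predict exactly what the system will do for a given score.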
The Network Effect — Why Stripe's Data Is Stripe's Moat
The single biggest advantage Stripe has over any individual merchant building their own fraud system is the cross-merchant network effect. A card used fraudulently at one Stripe merchant leaves a fingerprint in the feature store that every other Stripe merchant benefits from seconds later.
Concrete example: an attacker tests a stolen card with a $1 charge at merchant A. The charge succeeds (low amount, plausible pattern). Twenty seconds later they try a $500 charge at merchant B. Merchant B's individual view of the card is one transaction at a new merchant — a weak signal. Stripe's network view is "this card just had a successful $1 test at merchant A and is now trying a $500 charge at a different merchant" — a much stronger fraud signal. The model sees the network-level feature and scores it accordingly. Merchant B gets protection they could not have built themselves.
This is why payment fraud detection is one of the few areas where "bigger is strictly better." Every merchant Stripe onboards adds data, which improves the model, which improves protection for every existing merchant. It is a network effect in the literal sense — each participant makes the product better for all the others — and it is the reason payment processors like Stripe, Adyen, and Braintree can offer fraud detection that individual merchants cannot replicate.
The DevOps Patterns You Can Actually Reuse
Most engineers are not building a payment fraud system. But the patterns Stripe uses for Radar show up in any real-time ML system, and they are worth understanding because they apply far beyond fraud:
- Precompute aggregations, read them at serve time. This is the core idea of a feature store. Any system where you need "how many events did this entity have recently" as a hot-path feature should push that aggregation out of the critical path and into a continuously-updated key-value store.
- Treat ML models as libraries, not services. Calling a remote ML service adds network latency you often cannot afford. For lightweight models (trees, linear models, small neural networks), embed the inference directly in the service that needs it. The deployment complexity is worth the latency savings.
- Shadow deployments for anything that affects real decisions. Never promote a new model (or a new rule) to production without running it in parallel with the current version for enough traffic to verify it is better. This is the ML equivalent of canary deployments for regular services.
- Separate scoring from decisioning. Keep the ML part focused on producing a numeric score. Put all the business logic — thresholds, rules, overrides — in a separate layer where humans can read and audit it. This is the single most important architectural choice for making ML systems maintainable.
- Fail to a safe default. If the fraud system is unavailable, Stripe does not refuse all payments. It falls back to a safe conservative policy. Your systems should decide, in advance, what the safe default is when the smart part of the system is broken.
- Use tight latency budgets as design constraints, not stretch goals. Pick a budget, monitor P99 latency against it, and reject any design that cannot hit the budget under load. Latency budgets turn "make it fast" from a vague goal into a binary pass/fail.
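The last two bullets combine naturally: put a hard deadline on the smart path, and fall back to a predetermined safe default when the deadline is missed or the scorer fails. A generic sketch, with the 50ms budget chosen arbitrarily:

```python
import concurrent.futures
import time

SAFE_DEFAULT = "allow"   # decided in advance: the policy when the scorer is broken
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def score_with_budget(score_fn, features, budget_s=0.05):
    """Hard deadline on the smart path: return the scorer's answer if it
    arrives inside the budget, otherwise the predetermined safe default."""
    future = _POOL.submit(score_fn, features)
    try:
        return future.result(timeout=budget_s)
    except Exception:    # timeout, model crash, feature-store outage: all fall back
        return SAFE_DEFAULT

fast = score_with_budget(lambda f: "block", {})                        # in budget
slow = score_with_budget(lambda f: (time.sleep(0.5), "block")[1], {})  # misses it
```

The important part is that `SAFE_DEFAULT` is a product decision made ahead of time, not something the code improvises during an outage.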
These are not Stripe secrets. They are the general playbook for any real-time ML system operating under latency and availability constraints. Stripe's scale makes them visible in Radar's architecture, but the patterns work at any scale where the hot path cannot tolerate slow dependencies.
Frequently Asked Questions
What is Stripe Radar?
Stripe Radar is Stripe's machine learning fraud detection product. It runs a model on every payment, in real time, inside the authorization path, and produces a risk score that drives allow / block / challenge decisions.
Does Stripe Radar use rules or ML?
Both. ML is the primary scoring mechanism. Merchants and Stripe itself can layer rules on top to force specific allows, blocks, or 3D Secure triggers in ways the model cannot easily handle on its own.
How does Stripe know if a card is stolen?
Several signals: network-effect data from other Stripe merchants, card issuer feedback (declines, chargebacks), unusual behavior patterns for the specific card, device fingerprinting mismatches, and explicit blocklists. No single signal is conclusive — the model combines them.
Why is my legitimate payment sometimes blocked by Stripe?
Models have false positives. An unusual combination of signals — new device, new country, high amount, rare merchant — can push a legitimate transaction into the ambiguous zone. Often this triggers 3D Secure rather than an outright block, which is why you sometimes see a bank verification step pop up.
Can merchants see why Radar blocked a transaction?
Yes. The Stripe dashboard shows the risk score and the top contributing factors for each flagged transaction. Tree-based models make this tractable: per-prediction feature attributions can be computed cheaply, so individual decisions stay inspectable.
Next Steps
- How Uber's Surge Pricing Actually Works — another real-time system making decisions inside a latency budget
- How Netflix-Scale DRM Works — different domain, same "infrastructure hidden behind a simple UI" pattern
- AWS IAM Best Practices — authorization decisions on the critical path
- Free DevOps resources