Quarantine Bad Rows in Lakeflow SDP Without Breaking Your Clean Table
You want two things from data quality that pull in opposite directions. You want the bad rows out of your clean table, so nothing downstream trips over them. And you want those same bad rows captured somewhere you can go look at them, because a row that just vanished tells you nothing.
Lakeflow Spark Declarative Pipelines (SDP, formerly Delta Live Tables) give you the first half through expectations: the row-level data-quality rules SDP runs as data flows through the pipeline. An expect_all_or_drop rule removes any row that fails it before the row lands in your clean table. But "drop" means exactly that: the row is gone. So where do the dropped rows go when you want to inspect them?
Here is the pattern I built and tested for that, packaged as a ready-to-use asset template you can copy into your own bundle, plus the one piece of SQL semantics that quietly breaks it if you miss it.
The pattern: two tables off one source
Permalink to "The pattern: two tables off one source"The shape is simple: good rows here, bad rows there. You build two tables off the same source. The clean table applies your drop rules. The quarantine table applies their inverse. A row that passes every drop rule flows to clean; a row that fails at least one flows to quarantine. That quarantine table is a dead-letter table for SDP: bad rows don't vanish, they land somewhere you can inspect them.
flowchart TD
S["samples.nyctaxi.trips"] --> R["raw_trips<br/>(bronze schema)"]
R -->|"@expect_all_or_drop(drop rules)"| C["clean_trips<br/>(silver schema)"]
R -->|"@expect_all_or_drop(NOT drop rules)"| Q["quarantine_trips<br/>(silver schema)"]
The schemas are the medallion layers; the table names describe content. Both clean_trips and quarantine_trips read raw_trips once each, which is the tradeoff worth naming up front: the source is processed twice.
The rules live in a declarative expectations.json, split into two lists. Drop rules are critical and route failures to quarantine; warn rules are advisory and only annotate the pipeline event log without dropping anything.
| Action | Name | Predicate |
|---|---|---|
| drop | valid_fare |
fare_amount IS NOT NULL AND fare_amount > 0 |
| drop | valid_distance |
trip_distance IS NOT NULL AND trip_distance > 0 |
| drop | valid_pickup_ts |
tpep_pickup_datetime IS NOT NULL |
| warn | fare_within_expected |
fare_amount < 500 |
| warn | dropoff_zip_present |
dropoff_zip IS NOT NULL |
What that looks like in expectations.json
{
"drop": [
{ "name": "valid_fare", "query": "fare_amount IS NOT NULL AND fare_amount > 0" },
{ "name": "valid_distance", "query": "trip_distance IS NOT NULL AND trip_distance > 0" },
{ "name": "valid_pickup_ts", "query": "tpep_pickup_datetime IS NOT NULL" }
],
"warn": [
{ "name": "fare_within_expected", "query": "fare_amount < 500" },
{ "name": "dropoff_zip_present", "query": "dropoff_zip IS NOT NULL" }
]
}
The drop and warn lists are the only thing you edit per dataset. Everything downstream derives from them.
The quarantine predicate is derived automatically as NOT((p1) AND (p2) AND ...) over the drop predicates. You write the drop rules once; you do not maintain a second, hand-inverted copy that can drift out of sync.
The NULL trap
Permalink to "The NULL trap"This is the part worth internalizing before you copy anything.
expect_all_or_drop drops a row only when its predicate evaluates to false. A predicate that evaluates to SQL NULL (unknown) is not false, so the row is kept. That is not a quirk; it is the documented behavior of the is false operator: NULL is false returns false.
Now watch what that does to a naive drop rule. Say you write fare_amount > 0 and a row arrives with a NULL fare. The predicate is NULL, which is not false, so the row is kept in the clean table. The inverse NOT(fare_amount > 0) is also NULL, likewise not false, so the same row is kept in the quarantine table too.
flowchart TD
Row["row: fare_amount = NULL"]
Row -->|clean rule| Clean["fare_amount > 0<br/>→ NULL → kept"]
Row -->|quarantine rule| Quar["NOT(fare_amount > 0)<br/>→ NULL → kept"]
Clean --> Both["kept in BOTH tables<br/>→ partition broken"]
Quar --> Both
-- wrong: a NULL fare is kept in BOTH tables
fare_amount > 0
-- right: a NULL fare is dropped from clean, routed to quarantine
fare_amount IS NOT NULL AND fare_amount > 0
The one row lands in both tables, double-counted, and your clean/quarantine split is no longer a clean partition. It looks like it works until a NULL shows up.
The fix is to make every drop predicate total (NULL-safe): guard each column test with IS NOT NULL. A total predicate never returns NULL, so the drop rules and their inverse partition the raw input exactly once. Every row lands in exactly one of the two tables. The asset writes all drop predicates this way.
If you genuinely cannot guarantee total predicates (for example, rules loaded from an external system you do not control), the robust alternative is to compute an explicit _quarantined boolean with NULL-safe SQL once, then branch both tables on that flag instead of on raw inverse predicates.
Proving it
Permalink to "Proving it"The property the whole pattern rests on is a partition invariant, a formula that has to hold on every run: clean rows + quarantine rows = raw rows
Exactly: no row in both tables, none lost. That is checkable on the deployed pipeline with read-only SQL.
SELECT
(SELECT COUNT(*) FROM raw_trips) AS raw_rows,
(SELECT COUNT(*) FROM clean_trips) AS clean_rows,
(SELECT COUNT(*) FROM quarantine_trips) AS quarantine_rows,
(SELECT COUNT(*) FROM raw_trips)
- (SELECT COUNT(*) FROM clean_trips)
- (SELECT COUNT(*) FROM quarantine_trips) AS gap;
A gap of zero is the pass condition. But you do not want to discover a broken partition after deploying. So the asset also ships an offline unit-test suite that proves the partition on crafted edge rows, including the NULL cases, before anything is deployed. It models expect_all_or_drop's keep-on-NULL behavior with (<predicate>) IS NOT FALSE, so a drop rule that is not NULL-safe shows up as a broken partition and fails the test. The tests model the routing rule; they are not run against the live pipeline.
That suite runs locally with a real Spark session, which asks for a one-time setup: Java and a local Spark, with pyspark and pytest installed via pip (the PySpark install guide covers it; on Windows you also want the Hadoop winutils shim). If you would rather not stand that up, the read-only audits above need no local Spark and prove the same partition on real data, so you still have a verification path either way.
Then there are the live numbers. I validated the pattern on serverless SDP against the public samples.nyctaxi.trips dataset:
- The partition held exactly: 21,932 raw rows split into 21,847 clean and 85 quarantine (21,847 + 85 = 21,932). The public sample already contains naturally invalid rows, so quarantine populated on the first run with no setup.
- The NULL fix worked live. A row with a NULL
fare_amountrouted to quarantine only and did not leak into the clean table, becausevalid_fareis NULL-safe. A naivefare_amount > 0would have treated the NULL as passing and leaked the row. - A row with multiple drop violations produced exactly one quarantine row. No duplication; the event log still attributes each failed expectation separately.
- Warn rules behaved as advisory. A row that only tripped a warn rule flowed to the clean table and was recorded in the event log; quarantine was unaffected.
Adapt it to your own data
Permalink to "Adapt it to your own data"You have two ways in. The first: copy the reference pipeline, point it at your catalog and schemas, deploy, and watch the split hold. The nyctaxi rules are a worked example, not your rules. Rewriting them for your own columns, keeping every predicate NULL-safe, and proving the result is the real work.
So there is a faster path for that second part. The asset ships a companion agent skill alongside the pipeline. Point any coding agent you run (Claude Code, Codex, Genie Code on Databricks, and so on) at the skill, and it adapts the pattern to your dataset instead of you doing it by hand. Concretely, the agent:
- investigates your source schema and samples a few rows;
- proposes a set of drop and warn rules, with a one-line reason each, for you to confirm;
- keeps every drop predicate NULL-safe;
- adapts and runs the offline unit tests green before any deploy (remember to have Java + PySpark installed);
- wires the pipeline resource into your bundle;
- verifies the result: the offline tests first, then read-only audits on the live pipeline.
The human stays in the loop. The agent proposes; you confirm the rules and own whether they are the right rules. It applies and self-verifies a pattern; it is not an auto-rewriter, and it does not guarantee correctness on arbitrary data.
Both paths install with one command:
databricks bundle init https://github.com/vmariiechko/databricks-bundle-template \
--template-dir assets/sdp-quarantine-pattern
Two prompts I use to start the agent
A light start, when you want the agent to explore first and come back with a proposal:
Use the sdp-quarantine-pattern skill.
My source table is <catalog.schema.table>. Investigate the schema and sample the data,
then propose a set of NULL-safe drop rules and warn rules, with a one-line reason each,
for me to confirm before you change anything.
A fuller start, when you already know the rules you care about:
Use the sdp-quarantine-pattern skill to add the quarantine pattern to my pipeline.
- Source table: <catalog.schema.table or file path>
- Quality rules I care about: <e.g. amount must be positive; customer_id required;
event_time must precede ship_time; flag amounts over 10000 as a warning>
Investigate the source, propose drop/warn rules for me to confirm, adapt the pattern,
and run the offline unit tests. Then stop and hand me the deploy and verification steps:
I will run the workspace steps and bring you the results.
Both keep you in the loop: the agent proposes and proves offline, you run the workspace steps.
I dogfooded the skill end-to-end on a different public sample (samples.bakehouse): the agent investigated, proposed rules I confirmed, adapted and ran the offline tests green; I deployed and ran it, and the live read-only audits came back clean. That run proved the agent can drive the whole adaptation-and-verification loop. The routing is shown by the offline tests above (which exercise crafted bad rows) and the live nyctaxi run: bad and NULL rows land in quarantine, not in clean. As of June 2026, agent tooling and SDP behavior both move quickly, so treat the specifics as a snapshot.
Honest limits
Permalink to "Honest limits"- The source is processed twice, once per table. That is the cost of the separate-table approach.
- A different design keeps a single table and adds a validation-status column marking each row good or bad, partitioning on that column instead of using a separate quarantine table. This is not the
_quarantinedflag from earlier, which still builds two tables; it is genuinely one table. It avoids reading the source twice, but the per-expectation pass/fail counts do not surface in the event log the way they do with expectations. The asset builds the separate-table approach, and the agent skill adapts that one, not the single-table variant. - The offline tests model the keep-on-NULL routing rule; they prove the partition on crafted edge rows, not on every shape of data you will ever see. Live verification proves it on whatever real data flowed, not on the future.
- This is one pattern with its tradeoffs documented, not a general data-quality framework. If you want a fuller framework, look at DQX from Databricks Labs.
Grab it
Permalink to "Grab it"The asset lives in my databricks-bundle-template repo: the reference pipeline, the declarative rules, the offline tests, the event-log queries, and the companion skill, installed with the command above.
Copy it, point it at your catalog, and watch the split hold. If a NULL slips the partition on your data, open an issue: that is exactly the edge case I want to fold back in.