Error Handling & Retry Logic in Automated Ingestion & Batch Scanning Workflows

Fault tolerance is the subsystem that decides, for every failed operation in an ingest run, whether to retry, quarantine, or halt — without ever duplicating an asset or corrupting a manifest. Within the broader Automated Ingestion & Batch Scanning Workflows pipeline, this stage owns the worker-level resilience layer that sits beneath every other stage: the deterministic retry wrappers, exponential backoff, and circuit breakers that let a batch survive a dropped connection or a throttled storage endpoint. It receives work from Async Task Queuing for Batches, guards the I/O boundaries exposed by Scanner API Integration & Routing, and shields the compute-heavy OCR Processing Pipelines from transient upstream faults. The engineering discipline here is narrow but unforgiving: in a preservation context a silent failure is worse than a loud one, because a dropped page that no one notices becomes a permanent gap in the archival record.

Scope: The Transient / Permanent Fault Boundary

Every resilient ingest worker turns on a single classification decision. A transient fault is one that a later, identical attempt can plausibly resolve — a TCP reset mid-transfer, a 503 from a rate-limited object store, a scanner that is momentarily busy. A permanent fault is one that no number of retries will fix — a checksum that disagrees with the source manifest, a truncated header, a schema violation. The cardinal anti-pattern this layer exists to prevent is a worker that loops forever on a genuinely corrupt file, burning queue capacity while the real defect goes unremediated. Retry logic must therefore be built around an explicit allow-list of retryable exception types; everything outside that list routes straight to quarantine with a preservation event attached. This boundary is the same distinction the parent pipeline enforces when it separates a retryable transfer error from a hard fixity failure, and it is the invariant that keeps Batch Validation Schemas meaningful: a package that fails structural validation is never retried into acceptance, only isolated for review.

Retry Policy Specification

A retry policy is configuration, not code sprinkled through the worker. Modelling it as a validated data structure lets a policy be version-pinned per batch, serialised into the audit trail, and diffed when behaviour changes. The Pydantic model below captures the parameters an operator actually tunes, and the table documents the semantics and safe ranges for each field.

python

from pydantic import BaseModel, Field, field_validator


class RetryPolicy(BaseModel):
    """Version-pinned retry configuration attached to every batch manifest."""

    max_attempts: int = Field(4, ge=1, le=10)
    initial_delay_s: float = Field(1.0, gt=0)
    max_delay_s: float = Field(60.0, gt=0)
    exp_base: float = Field(2.0, gt=1)
    jitter: bool = True
    circuit_failure_threshold: int = Field(5, ge=1)
    circuit_cooldown_s: float = Field(30.0, gt=0)
    retryable_exceptions: tuple[str, ...] = ("Timeout", "ConnectionError")

    @field_validator("max_delay_s")
    @classmethod
    def _cap_gte_initial(cls, v: float, info) -> float:
        initial = info.data.get("initial_delay_s", 0)
        if v < initial:
            raise ValueError("max_delay_s must be >= initial_delay_s")
        return v

| Field | Type | Default | Purpose | Preservation constraint |
| --- | --- | --- | --- | --- |
| `max_attempts` | int | 4 | Total tries before dead-lettering | Must be bounded: an unbounded retry hides a permanent fault |
| `initial_delay_s` | float | 1.0 | First backoff interval, in seconds | Long enough to clear a momentary blip, short enough to keep throughput |
| `max_delay_s` | float | 60.0 | Ceiling on any single backoff, in seconds | Caps tail latency so a batch cannot stall a shift indefinitely |
| `exp_base` | float | 2.0 | Growth factor per attempt | Higher bases back off sooner, relieving a struggling dependency faster |
| `jitter` | bool | `True` | Randomise delay to de-sync workers | Prevents a thundering-herd retry storm across a worker pool |
| `circuit_failure_threshold` | int | 5 | Consecutive failures that trip the breaker | Isolates a dead endpoint instead of hammering it |
| `circuit_cooldown_s` | float | 30.0 | Time a tripped breaker stays open, in seconds | Gives a failed device or store time to recover before probing |
| `retryable_exceptions` | `tuple[str, ...]` | `("Timeout", "ConnectionError")` | Allow-list of transient error types | Everything else is permanent and goes to quarantine, never retried |

The backoff interval for attempt $n$ is deterministic given the policy, with jitter added on top:

$$ \text{delay}_n = \min\bigl(\text{max\_delay},\; \text{initial} \cdot \text{base}^{\,n-1}\bigr) + U(0,\, \text{initial}) $$

The additive jitter term $U(0, \text{initial})$ is what keeps a pool of workers that all failed against the same rate-limited endpoint from retrying in lockstep. Without it, synchronised retries reproduce the exact congestion that caused the original failure.
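As a rough illustration of the formula above, the sketch below computes a delay schedule using only the standard library. The function name `backoff_delay` and its `rng` parameter are assumptions made for this example; they are not part of the pipeline, and a library's built-in wait strategy may apply the cap and the jitter in a different order.

```python
import random


def backoff_delay(attempt, initial=1.0, base=2.0, max_delay=60.0, jitter=True, rng=random):
    """Seconds to wait before retry number `attempt` (1-based).

    Hypothetical helper: capped exponential growth plus additive jitter
    drawn uniformly from [0, initial].
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    capped = min(max_delay, initial * base ** (attempt - 1))
    return capped + (rng.uniform(0.0, initial) if jitter else 0.0)


# With jitter disabled the schedule is fully deterministic.
schedule = [backoff_delay(n, jitter=False) for n in range(1, 9)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With the default policy the ceiling is reached on the seventh interval, so raising `max_attempts` beyond that point adds waits of roughly `max_delay_s` each rather than longer ones.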

Retry Lifecycle State Machine

The state diagram below traces a single task through its retry lifecycle: transient errors trigger backoff and bounded retries, exhausted attempts land in a dead-letter state, and the circuit breaker isolates a failing dependency so the rest of the batch keeps moving.

Figure: retry lifecycle state diagram. Each retry is recorded as a discrete PREMIS event. Each backoff interval grows by `exp_base` (doubling by default) and carries an additive jitter term, so a pool of workers never retries in lockstep. Once consecutive failures cross the threshold the breaker opens for a cooldown window, and once `max_attempts` is exhausted the task is routed to a dead-letter state for manual review.
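The breaker half of that lifecycle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the class name `CircuitBreaker`, its method names, and the injectable `clock` are hypothetical, and the sketch does not limit the half-open state to a single concurrent probe.

```python
import time


class CircuitBreaker:
    """Consecutive-failure breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"  # cooldown elapsed: probes are allowed through
        return "open"

    def allow_request(self):
        return self.state != "open"

    def record_success(self):
        # Any success resets the consecutive count and closes the breaker.
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self):
        self._consecutive_failures += 1
        # A failed probe re-opens immediately; otherwise trip at the threshold.
        if self.state == "half-open" or self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A worker would call `allow_request()` before each operation and report the result with `record_success()` or `record_failure()`. Counting consecutive failures rather than lifetime totals keeps an occasional blip from tripping the breaker.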

Core Implementation: Idempotent Retry Wrapper

The pattern below wraps a critical I/O operation with deterministic backoff, structured logging, and a preservation-audit hook. It uses tenacity for the retry mechanics and a mock PREMIS event logger to satisfy institutional audit requirements. The operation is idempotent by construction: the output is keyed on the asset identifier and verified against an expected checksum, so a redelivered or replayed attempt converges to the same result rather than duplicating the asset. Full configuration options are documented in the official Tenacity Documentation.

python

import hashlib
import logging
from datetime import datetime, timezone

import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception_type,
    before_sleep_log,
)

# Structured logging for preservation audit trails.
logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("preservation_engine")

# Transient error allow-list: everything else is permanent and must not retry.
TRANSIENT_ERRORS = (requests.exceptions.Timeout, requests.exceptions.ConnectionError)


class PremisEventLogger:
    """Emit discrete retry events compliant with OAIS/PREMIS auditing."""

    @staticmethod
    def log_retry(asset_id: str, attempt: int, error: str, resolution: str) -> None:
        event = {
            "event_type": "Retry",
            "asset_id": asset_id,
            "attempt": attempt,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "error_detail": error,
            "resolution_status": resolution,
            "event_outcome": "Success" if resolution == "Success" else "Failure",
        }
        logger.info("PREMIS EVENT LOGGED: %s", event)


def verify_checksum(data: bytes, expected_hash: str) -> bool:
    """Validate bit-level integrity of a transferred payload."""
    return hashlib.sha256(data).hexdigest() == expected_hash


@retry(
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    wait=wait_exponential_jitter(initial=1, max=60, exp_base=2),
    stop=stop_after_attempt(4),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True,
)
def ingest_asset_with_retry(asset_id: str, payload_url: str, expected_checksum: str) -> bytes:
    """Idempotent asset ingestion with deterministic retry and audit logging."""
    # tenacity exposes live retry statistics on the decorated function;
    # attempt_number is 1 on the initial call and increments on each retry.
    attempt_num = ingest_asset_with_retry.statistics.get("attempt_number", 1)

    try:
        response = requests.get(payload_url, timeout=30)
        response.raise_for_status()
        if not verify_checksum(response.content, expected_checksum):
            # A checksum mismatch is permanent — it is NOT in TRANSIENT_ERRORS,
            # so it propagates straight to quarantine without a retry.
            raise ValueError(f"Checksum mismatch for {asset_id}")
        PremisEventLogger.log_retry(asset_id, attempt_num, "None", "Success")
        return response.content
    except TRANSIENT_ERRORS as exc:
        PremisEventLogger.log_retry(asset_id, attempt_num, str(exc), "Pending Retry")
        raise
    except Exception as exc:
        # Permanent fault: record a terminal failure event and stop.
        PremisEventLogger.log_retry(asset_id, attempt_num, str(exc), "Quarantined")
        raise

Two design choices carry the preservation weight. First, only exceptions in TRANSIENT_ERRORS are eligible for retry; a ValueError from a checksum mismatch bypasses the backoff loop entirely and is logged as a terminal Quarantined event, so a corrupt payload never consumes the retry budget. Second, every attempt — success, pending retry, or quarantine — emits a discrete event, giving an auditor an unbroken record of how the system reacted to failure rather than only its final state.
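One gap is worth noting in the wrapper above: `response.raise_for_status()` raises `requests.exceptions.HTTPError`, which is not in `TRANSIENT_ERRORS`, so a 503 from a throttled store would be quarantined rather than retried. A possible way to close it is to classify faults by status code as well as by type. The sketch below is standard-library only and purely illustrative: `Disposition`, `classify_fault`, the status set, and the duck-typed `response.status_code` lookup are all assumptions, and the built-in `TimeoutError` and `ConnectionError` stand in for their `requests` counterparts.

```python
from enum import Enum


class Disposition(Enum):
    RETRY = "retry"
    QUARANTINE = "quarantine"


# Assumed allow-lists for illustration; tune them per deployment.
TRANSIENT_TYPES = (TimeoutError, ConnectionError)
TRANSIENT_HTTP_STATUSES = frozenset({429, 503})


def classify_fault(exc):
    """Return RETRY only for faults on the allow-list; everything else is permanent."""
    if isinstance(exc, TRANSIENT_TYPES):
        return Disposition.RETRY
    # HTTP client errors commonly carry the response object; look it up defensively.
    response = getattr(exc, "response", None)
    status = getattr(response, "status_code", None)
    if status in TRANSIENT_HTTP_STATUSES:
        return Disposition.RETRY
    return Disposition.QUARANTINE


print(classify_fault(TimeoutError("read timed out")))
print(classify_fault(ValueError("Checksum mismatch for asset-1")))
```

Because the function defaults to quarantine, an exception type nobody anticipated is treated as permanent, which matches the allow-list rule stated earlier.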

Integration Points

This resilience layer is not a standalone service; it is the connective tissue between adjacent ingest stages, and each seam has a distinct failure surface.

Upstream — task delivery. Workers pull batches from Async Task Queuing for Batches, where at-least-once delivery means a task body can be redelivered after a crash. Retry logic and the broker’s redelivery must not double-count: the same idempotency key that makes a single attempt safe to repeat also makes a broker redelivery safe, so the two mechanisms compose instead of conflicting.
Hardware boundary — capture faults. When a planetary or overhead scanner drops a connection mid-batch, Scanner API Integration & Routing surfaces the fault as a transient exception and, if the circuit breaker trips, re-dispatches pending frames to a healthy device while preserving batch ordering. Decoupling hardware polling from queue consumption lets the unaffected segments of a batch keep flowing while failed units are isolated.
Downstream — derivative generation. When OCR Processing Pipelines encounter a corrupted image header or an incomplete page boundary, the handler quarantines the affected file, requests a re-scan, and updates the batch manifest — it does not retry recognition against a byte sequence that will never parse.
Validation gate — permanent stops. A package that fails Batch Validation Schemas is a permanent fault by definition; the retry layer must recognise validation errors as non-retryable and route them to quarantine rather than re-submitting a structurally invalid SIP.
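The upstream seam depends on a redelivered task recognising work that is already done. One way to sketch that, assuming a local directory as the destination and asset identifiers that are safe to use in file names, is to derive the output path from the asset identifier and expected checksum. The names `output_path` and `store_once` are hypothetical.

```python
import hashlib
from pathlib import Path


def output_path(store_dir, asset_id, expected_sha256):
    """Deterministic destination: the same asset and checksum always map to one file."""
    return Path(store_dir) / f"{asset_id}.{expected_sha256[:16]}.bin"


def store_once(store_dir, asset_id, payload, expected_sha256):
    """Write the payload unless an intact copy already exists.

    Returns True if a file was written, False if prior work was detected.
    """
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        # Permanent fault: never retried, the caller should quarantine.
        raise ValueError(f"Checksum mismatch for {asset_id}")
    dest = output_path(store_dir, asset_id, expected_sha256)
    if dest.exists() and hashlib.sha256(dest.read_bytes()).hexdigest() == expected_sha256:
        return False  # redelivery or replay: nothing left to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return True
```

A retry inside the worker and a redelivery from the broker both land on the same path, so neither can produce a second copy of the asset.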

Cloud storage and API dependencies demand the same rigour at the far end of the pipeline: provider-side rate limits and network congestion stall large uploads, so tiered storage endpoints and graceful-degradation paths keep throughput up during peak windows without letting a partial upload contaminate the archival tier.

Validation and Compliance Rules

Strict auditability requires that every retry, quarantine action, and manifest update be independently verifiable. Checksum validation runs before and after any retry so that silent bit rot is caught rather than propagated, and each transition is written as a typed preservation event. The event vocabulary this layer emits maps directly onto the PREMIS Metadata Mapping rules that govern how technical characteristics become durable, machine-readable preservation metadata, and onto the audit obligations of the OAIS Reference Model Implementation within the companion OAIS-Compliant Digital Preservation Architecture discipline. The Library of Congress PREMIS Data Dictionary defines the Event entity used to record such actions; logging each retry attempt as a discrete event of that kind preserves an auditable trail for compliance mapping against ISO 14721 (OAIS) and FADGI guidelines.

| PREMIS event type | Trigger point | Outcome recorded | Auditor expectation |
| --- | --- | --- | --- |
| `Retry` | Transient fault caught, backoff scheduled | Attempt number, error detail, elapsed time | Every retry is visible, not just the final state |
| `message digest calculation` | Post-transfer checksum verified | SHA-256 digest, pass/fail | Bit-level integrity re-confirmed after each attempt |
| `quarantine` | Permanent fault or exhausted retries | Reason, retained original path | Bad assets are isolated, never deleted or looped |
| `ingestion` | Asset accepted after successful retry | Final attempt count, replica location | Successful recovery is recorded as a first-class event |

Because these events are the evidentiary backbone of a trusted repository, they must be written to append-only or write-once storage. The retry counter, original error payload, elapsed time, and resolution status transform operational noise into actionable compliance evidence, keeping assets trustworthy across decades of technological migration.
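A minimal sketch of such an event log, assuming a JSON Lines file on local disk, might look like the following. The function names and the record fields are illustrative choices, and opening a file in append mode is only a convention: genuine write-once guarantees have to come from the storage layer itself.

```python
import json
import os
from datetime import datetime, timezone


def append_event(log_path, event_type, asset_id, attempt, outcome, detail="", elapsed_s=0.0):
    """Append one preservation event as a JSON line and force it to disk."""
    record = {
        "event_type": event_type,
        "asset_id": asset_id,
        "attempt": attempt,
        "event_outcome": outcome,
        "error_detail": detail,
        "elapsed_s": elapsed_s,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Mode "a" only adds to the end of the file; existing lines are not rewritten.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
        fh.flush()
        os.fsync(fh.fileno())
    return record


def read_events(log_path):
    """Load every recorded event in the order it was written."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Reading the file back in order gives an auditor the sequence of attempts for an asset, not just its final state.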

Troubleshooting Reference

| Error condition | Root cause | Remediation |
| --- | --- | --- |
| Worker loops indefinitely on one file | A permanent fault (checksum, truncated header) is inside the retryable allow-list | Remove it from `retryable_exceptions`; route to quarantine with a terminal event |
| Retry storm saturates a recovering endpoint | Jitter disabled, so a whole pool retries in lockstep | Enable additive jitter; stagger `initial_delay_s` across worker classes |
| Duplicate assets after a worker crash | Retry not idempotent: output not keyed on a deterministic identifier | Derive output name from asset id + checksum; make a redelivered task detect completed work |
| Circuit breaker never trips on a dead device | `circuit_failure_threshold` too high, or the counter is reset on partial success | Lower the threshold; count consecutive failures, not lifetime totals |
| Dead-letter queue fills silently | No alert on dead-letter depth; batches exceed `max_attempts` unnoticed | Monitor dead-letter depth and raise an alert above a defined threshold |
| Backoff stalls an entire shift | `max_delay_s` uncapped or set far too high | Cap `max_delay_s`; keep the ceiling proportional to batch SLA |
| Retry events missing from the audit trail | Events logged only on final outcome, not per attempt | Emit a discrete `Retry` event in the `before_sleep` hook of every attempt |

Frequently Asked Questions

When should a failed ingest operation be retried versus quarantined?

Retry only faults that a later identical attempt can plausibly resolve — network timeouts, dropped connections, transient 503 responses from a rate-limited store. Quarantine everything else immediately: a checksum mismatch, a truncated header, or a schema violation is permanent, and retrying it wastes the retry budget while the real defect goes unremediated. Encode the distinction as an explicit allow-list of retryable exception types; anything outside the list routes to quarantine with a preservation event attached.

Why is jitter necessary in exponential backoff?

Without jitter, a pool of workers that all failed against the same endpoint retry at exactly the same computed intervals, reproducing the congestion that caused the original failure — a thundering-herd retry storm. Adding a random component to each backoff interval de-synchronises the workers so their retries spread out over time, giving the struggling dependency room to recover instead of being hit by a synchronised wave.

How do I keep retries from duplicating assets or corrupting a manifest?

Make the operation idempotent. Key the output on a deterministic function of the asset identifier and its expected checksum, verify fixity before and after transfer, and write manifests atomically (temp file, fsync, rename). A redelivered or replayed attempt then either detects that its work is already complete and exits cleanly, or overwrites its own prior output — never creating an orphaned duplicate or a half-written manifest.
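The temp file, fsync, rename sequence can be sketched with the standard library as follows. The function name `write_manifest_atomically` and the JSON manifest layout are assumptions for illustration, and the rename is only atomic when the temporary file sits on the same filesystem as the target.

```python
import json
import os
import tempfile


def write_manifest_atomically(path, manifest):
    """Replace the manifest at `path` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file beside the target so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".manifest-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(manifest, tmp, sort_keys=True, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # swap the complete file into place
    except BaseException:
        # Leave no stray temp file behind if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A crash before `os.replace` leaves the previous manifest untouched, and a replayed attempt simply writes the same content again.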

What role does the circuit breaker play alongside retries?

Retries handle isolated, per-operation faults; the circuit breaker handles a dependency that has failed systemically. Once consecutive failures cross the threshold, the breaker opens and stops sending work to the failing device or endpoint for a cooldown window, so the pool stops hammering something that cannot respond. After the cooldown it lets a probe through; success closes the breaker and normal processing resumes. This isolates a dead dependency without halting the rest of the batch.

How are retry attempts recorded for an ISO 16363 audit?

Every attempt emits a discrete typed preservation event — a Retry event on each backoff, a message digest calculation event when fixity is re-confirmed, and a terminal quarantine or ingestion event on resolution. Each carries the attempt number, error detail, elapsed time, and outcome, and is written to append-only storage. An auditor can then replay exactly how the system reacted to every failure, not merely its final state.

Related

Async Task Queuing for Batches — the Celery/RQ layer whose at-least-once delivery composes with idempotent retries.
Scanner API Integration & Routing — surfaces hardware faults as retryable exceptions and re-dispatches to healthy devices.
OCR Processing Pipelines — the failure-tolerant derivative stage this layer shields from transient upstream faults.
Batch Validation Schemas — defines the structural failures that must be treated as permanent, never retried.
PREMIS Metadata Mapping — the event vocabulary every retry, quarantine, and recovery is recorded against.
