Implementing Celery for Asynchronous Ingestion Tasks

Archival digitization pipelines routinely move multi-gigabyte preservation masters, high-resolution TIFF batches, and complex metadata payloads that overwhelm any synchronous request/response model. This page is the Celery implementation reference for the Async Task Queuing for Batches stage of the broader Automated Ingestion & Batch Scanning Workflows pipeline: it takes the queue contract and broker guarantees defined there and shows exactly how to wire Celery so that a worker crash, a dropped broker connection, or a bad payload never silently loses an irreplaceable master. The specific failure this page solves is the gap between “the task was accepted” and “the preservation work actually completed and was recorded” — the ambiguity that produces phantom acknowledgments, duplicate derivatives, and chain-of-custody violations when Celery is deployed with its convenient defaults.

Root-Cause Analysis of Silent Task Loss

Three intersecting misconfigurations account for nearly every case of work that vanishes between enqueue and completion in a cultural-heritage ingest queue:

Early acknowledgment (task_acks_late=False). By default Celery acknowledges a message just before the task body starts executing. If the worker then dies mid-checksum (OOM kill, segfault in an image library, node reboot), the broker has already discarded the message and there is nothing to redeliver. For an irreplaceable master this is silent data loss, not a retry.
Visibility timeout shorter than the task. On Redis or SQS transports the broker re-delivers any message whose acknowledgment window (visibility_timeout) lapses while the task is still running. JPEG 2000 normalization of a large multi-page TIFF can outlast the default window (one hour on Redis); the broker then re-dispatches the still-running job, producing duplicate preservation events and duplicate derivatives.
Unbounded prefetch on memory-heavy payloads. With the default worker_prefetch_multiplier of 4, each worker reserves up to four messages per pool process. When those messages point at uncompressed preservation masters, the worker’s resident set balloons and the OS kills it. Combined with early acks, the tasks that were executing at that moment are gone for good.

The distinction between an acknowledged message and a completed task is the crux. The state machine below shows why late acknowledgment is the only safe default: the broker keeps the message “unacked” and eligible for redelivery until the task body returns, so a crash at any point before completion re-queues the master rather than dropping it.

Two acknowledgment policies as message state machines: acks_late=False removes the message before the work runs, so a mid-task crash is unrecoverable; acks_late=True holds the message unacked until the body returns, so a crash re-queues the master instead of dropping it.
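
The two policies can be traced with a toy model. The sketch below is an in-memory stand-in, not Celery or kombu code: ToyBroker, run_once, and the file name are hypothetical names introduced only for illustration, and the model assumes the broker returns unacknowledged messages to the queue as soon as the consumer disappears.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a broker queue (illustrative only)."""

    def __init__(self, messages):
        self.ready = deque(messages)  # waiting for delivery
        self.unacked = []             # delivered but not yet acknowledged

    def deliver(self):
        message = self.ready.popleft()
        self.unacked.append(message)
        return message

    def ack(self, message):
        self.unacked.remove(message)  # the broker forgets the message

    def consumer_lost(self):
        # Assumption: unacked messages return to the queue when the consumer dies.
        self.ready.extend(self.unacked)
        self.unacked.clear()


def run_once(broker, acks_late, crash):
    """Deliver one message and return what is still available for redelivery."""
    message = broker.deliver()
    if not acks_late:
        broker.ack(message)           # early ack: removed before the body runs
    if crash:
        broker.consumer_lost()        # worker dies mid-checksum
        return list(broker.ready)
    if acks_late:
        broker.ack(message)           # late ack: removed only after the body returns
    return list(broker.ready)


early = run_once(ToyBroker(["master-0001.tif"]), acks_late=False, crash=True)
late = run_once(ToyBroker(["master-0001.tif"]), acks_late=True, crash=True)
print(early)  # []
print(late)   # ['master-0001.tif']
```

With early acknowledgment the crash leaves nothing to redeliver; with late acknowledgment the master is back on the queue.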

Broker Configuration and Worker Isolation

High-throughput scanning generates bursty workloads that exhaust broker memory and starve downstream workers. Tune the transport options and prefetch multiplier before anything else; every reliability guarantee below assumes these are set.

python

# celery_config.py
broker_url = "amqp://guest:guest@rabbitmq:5672/"
result_backend = "redis://redis:6379/0"

broker_transport_options = {
    # Honored by the Redis/SQS transports only; ignored by the AMQP broker configured above.
    "visibility_timeout": 86400,  # 24 hours; must exceed the longest checksum/migration task
    "confirm_publish": True,      # AMQP: wait for the broker to confirm each publish
}

worker_prefetch_multiplier = 1   # one reserved task per worker process; bounds memory on large masters
worker_max_tasks_per_child = 50  # recycle workers to release leaked memory from imaging libs

task_acks_late = True
task_reject_on_worker_lost = True
task_serializer = "json"
accept_content = ["json"]
result_serializer = "json"

Setting worker_prefetch_multiplier = 1 limits each worker to one reserved message per pool process, preventing out-of-memory conditions when handling uncompressed masters. On Redis or SQS transports, visibility_timeout must exceed the maximum expected duration of any checksum verification or format migration; otherwise the broker re-delivers a task that is still running, causing duplicate preservation events and violating archival integrity requirements. On a RabbitMQ/AMQP broker this setting is inert: redelivery is instead governed by consumer acknowledgments, which is why task_acks_late carries the reliability load below. RabbitMQ also enforces its own delivery acknowledgment timeout (consumer_timeout, 30 minutes by default in recent releases), so with late acknowledgment that broker-side limit should be raised above the longest task as well.
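
The arithmetic behind both settings is small enough to write down. This is a minimal sketch that assumes a Redis broker; the helper names, the broker URL, the safety factor, and the four-hour worst case are hypothetical values chosen for illustration, not measurements from this pipeline.

```python
def prefetch_limit(concurrency, multiplier=4):
    """Messages one worker may hold at once: multiplier x pool processes.

    The default multiplier of 4 matches Celery's documented default.
    """
    return concurrency * multiplier


def visibility_timeout_for(worst_case_seconds, safety_factor=2.0):
    """Pad the slowest observed task so a slow run is never re-delivered."""
    return int(worst_case_seconds * safety_factor)


# Hypothetical measurement: the slowest bound-volume migration took four hours.
WORST_CASE_TASK_SECONDS = 4 * 60 * 60

# Assumed Redis broker; on this transport visibility_timeout is honored.
broker_url = "redis://redis:6379/1"
broker_transport_options = {
    "visibility_timeout": visibility_timeout_for(WORST_CASE_TASK_SECONDS),
}

print(prefetch_limit(concurrency=8))                # 32
print(prefetch_limit(concurrency=8, multiplier=1))  # 8
print(broker_transport_options)                     # {'visibility_timeout': 28800}
```

Deriving the timeout from a measured worst case, rather than hard-coding it, keeps the value honest as collections with larger objects enter the queue.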

Step-by-Step Resolution: Reliable, Idempotent Tasks

With the broker tuned, the task layer must guarantee at-least-once delivery and idempotent execution so that a redelivery never corrupts a preservation record. Enable task_acks_late=True to defer acknowledgment until the task body returns, and pair it with task_reject_on_worker_lost=True so that when a pool process dies abruptly the worker re-queues the in-flight message instead of acknowledging it. Because a task that reliably kills its worker is then redelivered again and again, combine this setting with the quarantine routing described under Edge Cases and Gotchas. Adjust broker_heartbeat=30 for high-latency storage networks and apply exponential backoff on transient errors.

The checksum task below is the canonical unit of preservation work. It is idempotent — re-running it on the same inputs produces the same recorded outcome — so redelivery is always safe. This is the same fixity discipline the pipeline applies before promotion to archival storage.

python

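import hashlib
import logging

logger = logging.getLogger("archival.ingest")

# `app` is the project's Celery application instance.


def compute_sha256(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Minimal stand-in for the project's hashing helper: streams the file in chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@app.task(
    bind=True,
    acks_late=True,
    max_retries=3,
)
def verify_checksum(self, file_path: str, expected_hash: str) -> dict:
    """Idempotent SHA-256 verification for a single preservation master."""
    try:
        actual_hash = compute_sha256(file_path)
    except (ConnectionError, TimeoutError, IOError) as exc:
        # Exponential backoff (30 s, 60 s, 120 s) prevents a thundering herd against
        # the storage array. The explicit retry makes autoretry_for redundant,
        # because these exceptions never leave the task body.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries * 30)

    if actual_hash != expected_hash:
        logger.error(
            "checksum_mismatch", extra={"file": file_path,
            "expected": expected_hash, "actual": actual_hash, "task_id": self.request.id}
        )
        # A mismatch is deterministic, so it is raised without a retry.
        raise ValueError(f"Checksum mismatch: {actual_hash} != {expected_hash}")
    logger.info("checksum_verified", extra={"file": file_path, "task_id": self.request.id})
    return {"status": "verified", "file": file_path, "hash": actual_hash}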
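
The countdown expression in the task, 2 ** retries * 30, is easier to reason about once the schedule is written out. The sketch below is plain Python rather than Celery code; backoff_schedule and jittered are hypothetical helper names, and the jitter step is an optional addition that the task above does not apply.

```python
import random
from typing import List


def backoff_schedule(max_retries: int = 3, base_seconds: int = 30) -> List[int]:
    """Delays before retry 1, 2, 3, ... using countdown = 2 ** retries * base."""
    return [2 ** attempt * base_seconds for attempt in range(max_retries)]


def jittered(delay: float, rng: random.Random) -> float:
    """Full jitter: pick a delay uniformly in [0, delay] to spread out retries."""
    return rng.uniform(0, delay)


schedule = backoff_schedule()
print(schedule)       # [30, 60, 120]
print(sum(schedule))  # 210

rng = random.Random(0)  # seeded only so the example is repeatable
spread = [jittered(delay, rng) for delay in schedule]
```

With max_retries=3 a transient outage therefore has about three and a half minutes to clear before the task fails for good. Celery's autoretry_for also accepts retry_backoff and retry_jitter options that give a similar effect without a hand-written countdown.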

Serialization Security and Metadata Integrity

Serializing complex metadata dictionaries with Pickle exposes workers to deserialization attacks: anyone who can publish to the broker can trigger arbitrary code execution on the worker that unpickles the message. Strict JSON serialization (already set above via task_serializer="json" and accept_content=["json"]) eliminates the code-execution risk, and accept_content makes workers refuse any message whose content type is not on the list. JSON does not validate manifest structure by itself, and arguments the serializer cannot encode (open file handles, custom objects) raise kombu.exceptions.EncodeError at publish time. Enforcing the manifest contract is the job of an explicit validation task such as validate_batch_schema below, which should run before any other work on the batch. This dovetails with the Batch Validation Schemas that every queued batch must satisfy, and with the manifest fields that Scanner API Integration & Routing stamps at capture time.

python

from celery import Celery

app = Celery("archival_ingest")

@app.task
def validate_batch_schema(batch_id: str, manifest: dict) -> bool:
    """Enforce the batch manifest contract before any capture command is dispatched."""
    required_keys = {"scanner_model", "resolution_dpi", "color_profile", "file_paths"}
    missing = required_keys - manifest.keys()
    if missing:
        raise ValueError(f"Missing required manifest keys: {missing}")
    return True

@app.task
def route_capture_command(scanner_id: str, settings: dict) -> str:
    """Dispatch a capture command to the scanner hardware layer after a readiness check."""
    # Placeholder body: the readiness check and the hardware call are omitted here.
    return f"capture_initiated_{scanner_id}"
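
A manifest can be checked for JSON safety on the publishing side, before it is handed to a task. The following is a minimal sketch using only the standard json module; assert_json_safe is a hypothetical helper name, and the manifest values are made-up placeholders.

```python
import json


def assert_json_safe(manifest: dict) -> str:
    """Return the encoded manifest, or raise ValueError if it cannot survive JSON."""
    try:
        encoded = json.dumps(manifest, allow_nan=False, sort_keys=True)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Manifest is not JSON-serializable: {exc}") from exc
    # Tuples come back as lists and non-string keys come back as strings,
    # so compare the decoded value with the original.
    if json.loads(encoded) != manifest:
        raise ValueError("Manifest does not round-trip through JSON unchanged")
    return encoded


good = {
    "scanner_model": "example-scanner",
    "resolution_dpi": 600,
    "color_profile": "example-profile",
    "file_paths": ["batch-01/page-0001.tif"],
}
encoded = assert_json_safe(good)

try:
    assert_json_safe({"file_paths": ("batch-01/page-0001.tif",)})
except ValueError as exc:
    rejection = str(exc)
    print(rejection)  # Manifest does not round-trip through JSON unchanged
```

Running the check before apply_async turns a serializer failure deep in the publish call into an explicit error at the point where the manifest was built.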

Resource Partitioning and Queue Routing

Large digitization runs must partition compute so that optical character recognition does not starve I/O-bound ingest work. Celery’s task_routes directs heavy tasks to dedicated worker pools, isolating the CPU-bound recognition stage owned by OCR Processing Pipelines from thumbnail generation and metadata normalization.

python

# celery_config.py (continued)
task_routes = {
    "archival.tasks.run_ocr_pipeline": {"queue": "compute_intensive"},
    "archival.tasks.generate_thumbnails": {"queue": "io_bound"},
    "archival.tasks.normalize_metadata": {"queue": "default"},
}

# Worker startup commands:
# celery -A archival.tasks worker -Q compute_intensive -c 2 --pool=prefork
# celery -A archival.tasks worker -Q io_bound,default -c 8 --pool=gevent

Pinning OCR to a dedicated queue with -c 2 prevents CPU saturation from blocking storage-array reads, so metadata extraction and thumbnailing keep their throughput while recognition runs hot.
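
A routing table is only useful if some worker consumes every queue it names; a task routed to an unserved queue waits forever without raising an error. The check below is a sketch in plain Python: unserved_queues and the worker names are hypothetical, and the queue sets mirror the two startup commands shown above.

```python
task_routes = {
    "archival.tasks.run_ocr_pipeline": {"queue": "compute_intensive"},
    "archival.tasks.generate_thumbnails": {"queue": "io_bound"},
    "archival.tasks.normalize_metadata": {"queue": "default"},
}

# Hypothetical worker names mapped to the queues passed with -Q.
worker_queues = {
    "ocr-worker": {"compute_intensive"},
    "ingest-worker": {"io_bound", "default"},
}


def unserved_queues(routes, workers):
    """Return the queues that tasks are routed to but no worker consumes."""
    routed = {route["queue"] for route in routes.values()}
    served = set()
    for queues in workers.values():
        served |= queues
    return routed - served


print(unserved_queues(task_routes, worker_queues))  # set()

# Dropping the ingest worker leaves two queues with no consumer.
orphaned = unserved_queues(task_routes, {"ocr-worker": {"compute_intensive"}})
print(sorted(orphaned))  # ['default', 'io_bound']
```

A check like this fits naturally into a deployment test, where the worker definitions are already available as data.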

Deterministic Execution with Canvas Primitives

Chain-of-custody requires deterministic execution paths. Celery canvas primitives — chains, groups, and chords — encode archival dependencies explicitly. The flowchart below mirrors the canvas in the code that follows: a chain validates first, then a chord whose group runs capture and metadata extraction in parallel before the enrichment callback fires.

The chord callback waits for both parallel group tasks to finish before triggering enrichment.

python

from celery import chain, group, chord

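def execute_ingest_pipeline(batch_manifest: dict):
    """Orchestrate a deterministic, dependency-ordered preservation workflow."""
    validate = validate_batch_schema.s(batch_manifest["id"], batch_manifest)

    # extract_technical_metadata and apply_ai_enrichment are assumed to be tasks
    # registered elsewhere in the project.
    # .si() builds immutable signatures: inside a chain, mutable ones would receive
    # the return value of validate (True) as an unexpected first argument.
    parallel_tasks = group(
        route_capture_command.si(batch_manifest["scanner_id"], batch_manifest["settings"]),
        extract_technical_metadata.si(batch_manifest["file_paths"]),
    )

    # The chord callback receives the list of group results and fires only after
    # both parallel group tasks complete.
    final_step = chord(parallel_tasks, apply_ai_enrichment.s())

    return chain(validate, final_step).apply_async()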
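
One detail worth checking in any chain that feeds a group: Celery hands the parent task's return value to the next signature as its first argument unless that signature is immutable (.si() instead of .s()). The toy model below imitates only that argument-passing rule; ToySignature is a hypothetical class, not celery.canvas.Signature, and the scanner id and settings are placeholders.

```python
_NO_PARENT = object()  # sentinel: the signature is not preceded by another task


class ToySignature:
    """Tiny model of how a signature treats its parent's result (illustrative only)."""

    def __init__(self, func, *args, immutable=False):
        self.func = func
        self.args = args
        self.immutable = immutable

    def run(self, parent_result=_NO_PARENT):
        if self.immutable or parent_result is _NO_PARENT:
            return self.func(*self.args)
        # Mutable signature: the parent's result is prepended to the arguments.
        return self.func(parent_result, *self.args)


def route_capture_command(scanner_id, settings):
    return f"capture_initiated_{scanner_id}"


mutable = ToySignature(route_capture_command, "scanner-01", {"dpi": 600})
immutable = ToySignature(route_capture_command, "scanner-01", {"dpi": 600}, immutable=True)

print(immutable.run(parent_result=True))  # capture_initiated_scanner-01

try:
    mutable.run(parent_result=True)       # called as func(True, "scanner-01", {...})
    failed = False
except TypeError:
    failed = True
print(failed)  # True
```

The mutable form is the right choice when the next step needs the previous result, as the chord callback does; the immutable form suits steps that only need to wait for their predecessor.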

Validation and Verification

A configuration that looks correct still needs proof that redelivery, ordering, and idempotency behave as designed. Confirm the fix with live introspection and a recorded audit trail rather than by inspecting settings alone:

Inspect in-flight state. With worker_prefetch_multiplier = 1 and late acknowledgment applied, the tasks listed by celery -A app inspect active and celery -A app inspect reserved should together never exceed the number of pool processes on a worker, which confirms that prefetch is bounded.
Force a crash and confirm redelivery. Kill a worker (kill -9) mid-task and verify the message reappears on the queue and completes on another worker. With acks_late=True and task_reject_on_worker_lost=True, the master is reprocessed, not lost.
Assert idempotency. Run verify_checksum twice on the same input; both invocations must record the identical SHA-256 outcome, confirming a redelivery cannot fork the provenance record.
Audit the event, not the log line. Every completed task should emit a durable preservation event — the fixity result and task identifier — that is folded into object provenance through PREMIS Metadata Mapping. Re-derive the checksum from storage and confirm it matches the recorded event.
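
The idempotency check can be made mechanical by letting the event store reject duplicates. This is a sketch under stated assumptions: it uses an in-memory SQLite database in place of the real preservation store, and the table, column, and function names are hypothetical.

```python
import sqlite3


def open_event_store(path=":memory:"):
    """Create the (hypothetical) fixity event table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS fixity_events (
            object_id TEXT NOT NULL,
            sha256    TEXT NOT NULL,
            outcome   TEXT NOT NULL,
            task_id   TEXT NOT NULL,
            PRIMARY KEY (object_id, sha256, outcome)
        )
        """
    )
    return conn


def record_fixity_event(conn, object_id, sha256, outcome, task_id):
    """Insert the event once; return False when a redelivery repeats it."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO fixity_events VALUES (?, ?, ?, ?)",
            (object_id, sha256, outcome, task_id),
        )
    return cursor.rowcount == 1


store = open_event_store()
digest = "0" * 64  # placeholder digest, not a real checksum

first = record_fixity_event(store, "master-0001", digest, "verified", "task-a")
second = record_fixity_event(store, "master-0001", digest, "verified", "task-b")
count = store.execute("SELECT COUNT(*) FROM fixity_events").fetchone()[0]
print(first, second, count)  # True False 1
```

The primary key does the work: a redelivered task reaches the same outcome for the same object and digest, so the second insert is ignored and the provenance record keeps a single event.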

Propagate task_id through structured logging so a single object’s journey can be reconstructed across distributed workers. When retry logic is correctly configured, transient network failures during validation or normalization resolve automatically without manual intervention.
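
One way to carry task_id through structured logs is a formatter that copies selected extra fields into a JSON line. The sketch below uses only the standard logging module; JsonFormatter, the logger name, and the task id value are hypothetical, and the field list matches the extra keys used by verify_checksum.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including known extra fields."""

    FIELDS = ("task_id", "file", "expected", "actual")

    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
        }
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # stands in for a file or log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

example_logger = logging.getLogger("archival.ingest.example")
example_logger.addHandler(handler)
example_logger.setLevel(logging.INFO)
example_logger.propagate = False

example_logger.info(
    "checksum_verified",
    extra={"file": "master-0001.tif", "task_id": "hypothetical-task-id"},
)
line = stream.getvalue().strip()
print(line)
```

Because every line carries task_id as a real field, the journey of one object can be reassembled with a simple filter over logs collected from all workers.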

Edge Cases and Gotchas

Failure mode	Root cause	Diagnostic command	Resolution
Phantom ACKs	`acks_late=False` + aggressive heartbeat	`celery -A app inspect active`	Enable `task_acks_late=True`
Zombie workers	Broker connection drop during I/O	`rabbitmqctl list_connections`	Set `task_reject_on_worker_lost=True`
`EncodeError`	Task arguments the serializer cannot encode	`celery -A app inspect registered`	Pass JSON-native types; enforce `task_serializer="json"`
CPU starvation	OCR competing with ingest	`top -p $(pgrep -f celery)`	Route OCR to `compute_intensive`

Beyond the common four, three archival-specific traps recur:

Multi-page TIFF timeouts. A single bound-volume TIFF can exceed task_soft_time_limit even when average pages are fast. The soft limit raises SoftTimeLimitExceeded inside the task and the hard limit (task_time_limit) terminates the pool process, so set both from the worst-case object, not the mean. On Redis or SQS, also keep visibility_timeout above the hard limit, or the broker re-delivers a still-running normalization and duplicates the derivative.
Result-backend expiry eating the audit trail. Celery expires stored task results after result_expires, which defaults to one day, so the evidence an auditor needs silently disappears from the Redis backend. Persist preservation outcomes to durable storage; do not treat the Celery result backend as the system of record.
Poison messages and retry storms. A batch that fails deterministically (a corrupt manifest, an unreadable master) will exhaust max_retries on every redelivery and hammer the storage array. Route terminally failed tasks to a dead-letter/quarantine queue — the same escalation pattern used by Error Handling & Retry Logic — instead of retrying indefinitely.
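
The decision between retrying and quarantining can be reduced to a small classifier. This is a sketch, not the pipeline's actual policy: classify_failure and the two exception groupings are assumptions, and a real deployment would tune them to its storage layer.

```python
# Failures that will repeat on every attempt: retrying only adds load.
TERMINAL = (FileNotFoundError, PermissionError, IsADirectoryError, ValueError)
# Remaining I/O failures (including ConnectionError and TimeoutError, which are
# OSError subclasses) are treated as possibly transient.
TRANSIENT = (OSError,)


def classify_failure(exc, retries, max_retries=3):
    """Return "retry" or "quarantine" for a failed task attempt."""
    if isinstance(exc, TERMINAL):
        return "quarantine"  # checked first: these are OSError subclasses too
    if isinstance(exc, TRANSIENT) and retries < max_retries:
        return "retry"
    return "quarantine"      # retries exhausted or unknown failure type


print(classify_failure(ConnectionError("storage array unreachable"), retries=0))  # retry
print(classify_failure(ConnectionError("storage array unreachable"), retries=3))  # quarantine
print(classify_failure(FileNotFoundError("master missing"), retries=0))           # quarantine
print(classify_failure(ValueError("Checksum mismatch"), retries=0))               # quarantine
```

Checking the terminal group first matters because FileNotFoundError and PermissionError are subclasses of OSError; a task that catches IOError broadly would otherwise retry a missing master until its retries run out.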

Adhering to these standards yields scalable, fault-tolerant ingestion that aligns with ISO 16363 requirements and maintains strict chain-of-custody. For broker tuning specifics, consult the Celery Configuration Reference, and the Python json module documentation for secure serialization.

Frequently Asked Questions

Why enable `task_acks_late` instead of relying on retries?

Retries only fire when a task raises inside a live worker. If the worker dies outright — OOM kill, segfault, node reboot — no exception is ever raised, so retry logic never runs. Late acknowledgment operates one level lower: the broker keeps the message unacked until the task body returns, so a worker that dies mid-task leaves the message eligible for redelivery to a healthy worker. For irreplaceable masters, this broker-level guarantee is the difference between reprocessing and silent loss.

How long should the visibility timeout be?

On Redis or SQS transports, longer than the single longest task the queue will ever run — including worst-case checksum verification and format migration on the largest bound-volume TIFF. If the timeout lapses while a task is still running, the broker re-delivers it and a second worker produces a duplicate derivative and a duplicate preservation event. A conservative value (24 hours in the config above) is safer than a tight one, because an over-long timeout only delays recovery from a genuinely dead worker, whereas a too-short one corrupts provenance under normal load.

Is Pickle serialization ever acceptable for ingest tasks?

No. Pickle deserialization can execute arbitrary code carried in a message, so anyone able to publish to the broker gains remote code execution on the worker. JSON serialization removes that class of vulnerability, and accept_content=["json"] makes workers refuse any other content type. Manifest structure still has to be checked explicitly, as validate_batch_schema does, because JSON alone enforces no schema. The minor cost is that task arguments must be JSON-serializable, which is a healthy constraint for auditable preservation payloads.

How do I keep OCR from starving the ingest queue?

Partition compute with task_routes and dedicated worker pools. Pin CPU-bound recognition to a compute_intensive queue served by a small prefork pool (-c 2), and run I/O-bound ingest and thumbnailing on a larger gevent pool. This isolation stops recognition from saturating cores and blocking storage-array reads, preserving deterministic throughput for the rest of the pipeline.

Related

Async Task Queuing for Batches — the parent stage: queue contract, broker selection, and the batch manifest schema this implementation consumes.
Error Handling & Retry Logic — dead-letter routing, backoff policy, and quarantine patterns for terminally failed batches.
Batch Validation Schemas — the manifest constraints enforced at the broker boundary before a task runs.
Scanner API Integration & Routing — how capture commands and manifest fields are produced upstream of the queue.
