Async Task Queuing for Batches in Archival Digitization

Async task queuing is the coordination layer that lets a digitization lab accept, validate, and route thousands of scanned objects without blocking the capture floor. Within the parent Automated Ingestion & Batch Scanning Workflows architecture, this stage sits directly after hardware capture and directly before validation and derivative generation: raw masters land in a staging directory, a producer enqueues a batch manifest, and a pool of workers drains that queue asynchronously. Decoupling the ingest trigger from the computational payload is what makes overnight runs of fragile manuscripts, photographic negatives, and bound volumes tractable: the operator interface stays responsive, worker memory stays bounded, and, provided the broker persists its queues, a restart does not lose the work queued for an irreplaceable master. This page specifies the queue contract, the routing logic that hands work to Scanner API Integration & Routing and OCR Processing Pipelines, and the resilience rules that keep the pipeline auditable under real infrastructure failure.

Queue Architecture and Broker Selection

A production queue relies on a message broker to serialize batch manifests, distribute payloads across worker nodes, and track execution state. The producer never runs preservation work itself — it validates a manifest, persists it, and returns immediately, leaving the broker to fan the payload out to whichever workers have free capacity. The diagram below shows a producer enqueuing a batch manifest, the broker distributing tasks across a parallel worker pool, and failures routing to a quarantine queue rather than being dropped.

The broker distributes parallel tasks to the worker pool; failed jobs are routed to a quarantine queue rather than dropped.

Broker choice is an architectural decision with direct preservation consequences: it determines whether an interrupted job is re-delivered, how backpressure is signalled when workers fall behind, and whether task results can be inspected for audit. For Python-driven preservation stacks, Implementing Celery for asynchronous ingestion tasks covers the full framework setup; the table below summarises how the common broker/backend options behave against the guarantees a digitization lab actually needs.

Broker / backend	Delivery guarantee	Backpressure model	Best fit for digitization
Redis	At-least-once (visibility timeout)	List depth, memory-bound	Small–mid labs; fast, simple, single-node staging
RabbitMQ	At-least-once with publisher confirms	Prefetch + flow control	High-volume, multi-queue topic routing per collection
Amazon SQS	At-least-once, per-message visibility	Long-poll, effectively unbounded	Cloud-adjacent tiers, cross-site ingest
Result backend (Redis/DB)	N/A — stores task outcome	N/A	PREMIS-grade audit of every task result

The two guarantees that matter most are at-least-once delivery and a bounded visibility timeout. At-least-once delivery means that if a worker dies mid-task, the broker re-delivers the message once its acknowledgment window lapses, so a partially processed master is retried rather than silently lost; this only holds when the worker acknowledges a message after the work completes rather than on receipt. The visibility timeout must exceed the worst-case runtime of a single queued task: a large multi-page TIFF undergoing JPEG 2000 normalization can run for minutes, and a too-short timeout causes the broker to re-deliver a job that is still running, producing duplicate derivatives.
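One way to keep the timeout tied to measured runtimes is to derive it from the slowest task the lab has observed. The sketch below is illustrative only: `ack_settings` and its `safety_factor` are hypothetical names, while `task_acks_late` and the `visibility_timeout` key under `broker_transport_options` are Celery setting names.

```python
import math


def ack_settings(worst_case_task_seconds: float, safety_factor: float = 2.0) -> dict:
    """Derive acknowledgment settings from the slowest task the lab has measured."""
    if worst_case_task_seconds <= 0 or safety_factor < 1:
        raise ValueError("runtime must be positive and safety_factor must be >= 1")
    timeout = math.ceil(worst_case_task_seconds * safety_factor)
    return {
        # Acknowledge only after the task body finishes, so a crash mid-task
        # leaves the message eligible for re-delivery.
        "task_acks_late": True,
        # Transports that use a visibility timeout read it from the transport options.
        "broker_transport_options": {"visibility_timeout": timeout},
    }


# Assumed example: the slowest normalization observed took 15 minutes.
settings = ack_settings(worst_case_task_seconds=900)
```

The resulting dictionary could then be applied with `app.conf.update(**settings)`. The factor of two is an arbitrary margin chosen for the example, not a recommendation from this page.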

The Batch Manifest Contract

Before any computational payload executes, the queue enforces a strict manifest contract. The manifest is the unit of work a producer enqueues: it names the batch, the scanner profile that produced it, the expected file count, the per-object checksums, and the descriptive metadata standard the batch must satisfy. Modelling this contract with Pydantic gives the producer a single validation surface and gives every downstream worker a typed, self-describing payload. The field specification below is the authoritative contract every queued batch must honour.

Field	Type	Constraint	Purpose
`batch_id`	`str`	Unique, collection-prefixed	Provenance + idempotency key
`scanner_profile`	`str`	Must match a registered profile	Ties output to capture device + ICC profile
`file_count`	`int`	> 0	Guards against empty/truncated transfers
`checksums`	`dict[str, str]`	SHA-256 hex digests	Bit-level fixity for every member file
`metadata_schema`	`str`	One of METS, PREMIS, DublinCore, MODS	Selects the validation profile applied downstream

Structural validation of the manifest itself is only the entry gate; the deeper semantic checks against institutional preservation standards are owned by Batch Validation Schemas, which the queue invokes once a payload is accepted. A manifest that fails the contract never reaches a worker — it is rejected at enqueue time with a diagnostic record, keeping malformed packages out of the processing tier entirely.

python

import hashlib
import json
import logging
from datetime import datetime, timezone
from celery import Celery
from pydantic import BaseModel, Field, field_validator

logger = logging.getLogger("preservation.ingest")

app = Celery("archival_queue", broker="redis://localhost:6379/0")

class BatchManifest(BaseModel):
    batch_id: str
    scanner_profile: str
    # Enforce the contract's "> 0" constraint so empty transfers are rejected.
    file_count: int = Field(gt=0)
    checksums: dict[str, str]
    metadata_schema: str

    @field_validator("metadata_schema")
    @classmethod
    def validate_schema_compliance(cls, v: str) -> str:
        allowed = {"METS", "PREMIS", "DublinCore", "MODS"}
        if v not in allowed:
            raise ValueError(f"Unsupported schema: {v}. Must be one of {allowed}")
        return v

def validate_batch_payload(payload: dict) -> bool:
    """Pre-flight validation with strict checksum and schema enforcement."""
    try:
        manifest = BatchManifest(**payload)
        # Verify manifest integrity. The stored digest is computed over the
        # manifest body, so we must exclude the checksums field itself before
        # hashing — otherwise the digest can never reconcile with its own value.
        body = manifest.model_dump(exclude={"checksums"})
        computed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if computed != manifest.checksums.get("manifest_sha256"):
            logger.error(f"Checksum mismatch for batch {manifest.batch_id}")
            return False
        logger.info(f"Batch {manifest.batch_id} passed pre-flight validation")
        return True
    except Exception as e:
        logger.error(f"Validation failed: {e}")
        return False
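
The pre-flight check only passes if the producer computed `manifest_sha256` over exactly the same serialized body. A minimal producer-side sketch follows, assuming the digest is stored inside `checksums` next to the per-file digests, as the validation code implies. `seal_manifest`, the sample profile name, and the choice to derive `file_count` from the digest map are illustrative assumptions.

```python
import hashlib
import json


def seal_manifest(batch_id: str, scanner_profile: str,
                  file_digests: dict[str, str], metadata_schema: str) -> dict:
    """Build a manifest payload whose manifest_sha256 covers the manifest body."""
    body = {
        "batch_id": batch_id,
        "scanner_profile": scanner_profile,
        "file_count": len(file_digests),  # assumption: one digest per member file
        "metadata_schema": metadata_schema,
    }
    # Same serialization as the validator: sorted keys, default separators.
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "checksums": {**file_digests, "manifest_sha256": digest}}


# Hypothetical batch with a single member file and a placeholder digest.
payload = seal_manifest(
    "MS12-0001", "flatbed-profile-a", {"p0001.tif": "a" * 64}, "METS"
)
```

Because both sides hash `json.dumps(body, sort_keys=True)`, any change to the body fields after sealing shows up as a checksum mismatch at enqueue time.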

Task Routing and Sub-Pipeline Orchestration

Once a manifest is accepted, queued tasks trigger the specialized sub-pipelines that do the heavy lifting of digitization. Routing is deliberately declarative: the orchestrating task does not perform capture, OCR, or metadata harvesting itself — it dispatches each unit of work to the dedicated queue that owns it. Scanner API Integration & Routing coordinates hardware handshakes, manages TWAIN/SANE protocol translation, and enforces consistent DPI and color-profile application across heterogeneous capture devices. For text-heavy collections the workflow hands off to OCR Processing Pipelines to produce ALTO or hOCR sidecars, while parallel workers run Metadata Extraction Workflows to harvest embedded EXIF/IPTC data and generate technical preservation derivatives. Because each hand-off is an independent task on its own queue, a slow OCR stage never stalls fixity hashing, and a scanner timeout is isolated to the routing queue rather than propagating across the whole batch.

python

@app.task(bind=True, max_retries=3, default_retry_delay=30)
def process_digitization_batch(self, batch_payload: dict) -> dict:
    """Orchestrates scanner routing, derivative generation, and metadata extraction."""
    if not validate_batch_payload(batch_payload):
        # An invalid manifest is a deterministic failure: re-running the same
        # payload cannot fix it, so fail the task instead of calling self.retry().
        raise ValueError("Invalid manifest payload")

    batch_id = batch_payload["batch_id"]
    logger.info(f"Dispatching sub-pipelines for batch {batch_id}")

    # Route each unit of work to the queue that owns it. In production these are
    # Celery signatures sent to named queues, or RabbitMQ topic-exchange bindings.
    dispatched = {
        "scanner_route": app.send_task(
            "scanner.route_capture", args=[batch_id], queue="capture"),
        "ocr_queue": app.send_task(
            "ocr.enqueue_batch", args=[batch_id], queue="ocr"),
        "metadata_extraction": app.send_task(
            "metadata.extract_batch", args=[batch_id], queue="metadata"),
    }
    results = {stage: task.id for stage, task in dispatched.items()}

    # Structured audit record — one immutable line per dispatch for PREMIS.
    logger.info(json.dumps({
        "event": "TASK_DISPATCH",
        "batch_id": batch_id,
        # Child tasks have only been queued at this point, not finished.
        "status": "DISPATCHED",
        "worker_id": self.request.hostname,
        "task_id": self.request.id,
        "child_tasks": results,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }))
    return results
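
The per-call `queue=` arguments above can also be expressed as one declarative table. Celery's `task_routes` setting accepts glob patterns of this shape; the sketch below resolves them in plain Python purely to show the lookup, and is not how Celery itself implements routing. `resolve_queue` is a hypothetical helper, and the fallback name assumes Celery's default queue name, `celery`.

```python
from fnmatch import fnmatchcase

# Glob pattern -> route, mirroring the shape of a task_routes mapping.
TASK_ROUTES = {
    "scanner.*": {"queue": "capture"},
    "ocr.*": {"queue": "ocr"},
    "metadata.*": {"queue": "metadata"},
}


def resolve_queue(task_name: str, routes: dict = TASK_ROUTES,
                  default: str = "celery") -> str:
    """Return the queue of the first pattern that matches the task name."""
    for pattern, route in routes.items():
        if fnmatchcase(task_name, pattern):
            return route["queue"]
    return default


capture_queue = resolve_queue("scanner.route_capture")
```

Keeping the table in configuration means a new sub-pipeline is added by registering one pattern rather than editing the orchestrating task.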

Resilience, Retry Semantics, and Resource Governance

Asynchronous queues introduce operational challenges specific to preservation environments: long-running conversions, constrained lab hardware, and the absolute requirement that no master is ever lost. Transient failures — a scanner disconnect, a storage-latency spike, a broker reconnection — must be retried without corrupting partial derivatives, which is the concern owned by Error Handling & Retry Logic. The retry policy itself should be an exponential backoff with jitter, so that when a shared dependency recovers, hundreds of paused workers do not stampede it simultaneously. The delay before the n-th retry follows:

$$ t_n = \min\!\left(t_{\max},\; t_0 \cdot 2^{\,n}\right) \cdot U(0, 1) $$

where \( t_0 \) is the base delay, \( t_{\max} \) caps the backoff, and \( U(0,1) \) is a uniform random jitter factor that spreads retries across the recovery window. The lifecycle a task moves through (queued, running, retry-scheduled, and finally either acknowledged or quarantined) is a small state machine worth making explicit for operators tuning the pool.
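The backoff formula and the task lifecycle can be sketched together. The state names come from the paragraph above, but the allowed transitions are an assumption made for illustration, as are the default `base` and `cap` values.

```python
import random
from enum import Enum


class TaskState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    RETRY_SCHEDULED = "retry-scheduled"
    ACKNOWLEDGED = "acknowledged"
    QUARANTINED = "quarantined"


# Assumed transition table: terminal states have no outgoing edges.
TRANSITIONS = {
    TaskState.QUEUED: {TaskState.RUNNING},
    TaskState.RUNNING: {
        TaskState.ACKNOWLEDGED,
        TaskState.RETRY_SCHEDULED,
        TaskState.QUARANTINED,
    },
    TaskState.RETRY_SCHEDULED: {TaskState.QUEUED},
    TaskState.ACKNOWLEDGED: set(),
    TaskState.QUARANTINED: set(),
}


def advance(current: TaskState, target: TaskState) -> TaskState:
    """Move to the target state, refusing transitions the table does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0,
                  rng=random) -> float:
    """Delay before retry number `attempt`: min(cap, base * 2**attempt) * U(0, 1)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(cap, base * 2 ** attempt) * rng.random()
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests, while the default module-level generator is sufficient in a worker.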

Resource governance is the other half of resilience. Worker pool concurrency must be tuned to the lab’s actual hardware so pools never saturate I/O channels or exhaust RAM on the digitization server — CPU-bound fixity hashing belongs in process pools, while I/O-bound file transfers belong in thread pools. Long-lived workers that call image libraries such as Pillow and OpenCV accumulate C-level buffers that bypass Python’s garbage collector, so bounding each worker’s lifetime with a max-tasks-per-child setting is the pragmatic defence against slow memory growth. Chunked uploads, a local staging cache, and traffic-shaping rules that prioritise preservation masters over derivative generation keep throughput predictable when a batch runs into the terabytes.
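As a sketch of the pool split described above, fixity hashing can be written against the generic `Executor` interface so the caller decides which pool runs it. The function names and the 1 MiB chunk size are assumptions for illustration; reading in chunks keeps memory bounded regardless of file size.

```python
import hashlib
from concurrent.futures import Executor

CHUNK_BYTES = 1024 * 1024  # read 1 MiB at a time so large masters never load whole


def sha256_file(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_BYTES), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_batch(paths: list[str], executor: Executor) -> dict[str, str]:
    """Hash every path on the supplied executor, preserving input order."""
    return dict(zip(paths, executor.map(sha256_file, paths)))
```

A caller following the guidance above would pass a `concurrent.futures.ProcessPoolExecutor` sized to the server's cores, for example `hash_batch(paths, pool)` inside a `with ProcessPoolExecutor(max_workers=4) as pool:` block. On platforms that spawn worker processes, that call belongs under an `if __name__ == "__main__":` guard.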

Validation and Compliance Rules

Every state transition, retry attempt, and validation outcome must be serialized into an immutable audit log so the queue can produce verifiable chain-of-custody evidence. In practice this means emitting a preservation event at each significant boundary, mapped to the vocabulary defined by PREMIS Metadata Mapping and aligned with the ingest functional entity of the OAIS Reference Model Implementation. The events a batch queue is responsible for recording are catalogued below; each is written to the audit backend with the triggering batch_id, the worker identity, and a UTC timestamp.

Queue boundary	PREMIS eventType	eventOutcome on failure	Notes
Manifest accepted at enqueue	`validation`	`manifest rejected`	Structural + checksum pre-flight
Fixity verified before routing	`fixity check`	`checksum mismatch`	SHA-256 against manifest digest
Task dispatched to sub-pipeline	`ingestion`	`dispatch failed`	One event per child queue
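Retry scheduled	`retry` (locally defined term)	`retry exhausted`	Records attempt count + backoff; PREMIS `replication` denotes making a bit-identical copy, so it does not describe a retry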
Payload quarantined	`quarantine`	`unresolved`	Bitstream preserved, promotion blocked

Fixity is non-negotiable at the queue boundary: the digest recomputed by the worker must equal the value carried in the manifest, and any mismatch routes the payload to quarantine with a fixity check event rather than allowing it toward the archival tier. The official PREMIS Data Dictionary governs the semantics of these event records, and consulting the Celery documentation on task routing, acknowledgment, and result backends is essential for wiring the audit hooks without dropping outcomes under load.
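A minimal sketch of the fixity boundary, assuming the worker has the object's bytes in hand and the manifest digest for that file. `eventType`, `eventDateTime`, and `eventOutcome` are PREMIS semantic units; the remaining keys, the `pass` outcome label, and the `route` field are illustrative assumptions rather than part of any standard.

```python
import hashlib
import json
from datetime import datetime, timezone


def fixity_event(batch_id: str, file_name: str, payload: bytes,
                 expected_sha256: str, worker_id: str) -> dict:
    """Recompute the digest and describe the outcome as an audit record."""
    actual = hashlib.sha256(payload).hexdigest()
    passed = actual == expected_sha256.lower()
    return {
        "eventType": "fixity check",
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventOutcome": "pass" if passed else "checksum mismatch",
        "batch_id": batch_id,
        "object": file_name,
        "worker_id": worker_id,
        # Assumed routing hint: a mismatch must never continue toward the archive.
        "route": "continue" if passed else "quarantine",
    }


good = fixity_event("MS12-0001", "p0001.tif", b"abc",
                    hashlib.sha256(b"abc").hexdigest(), "worker-1")
audit_line = json.dumps(good, sort_keys=True)  # one line per event for the audit log
```

Emitting the record as a single JSON line keeps it compatible with the structured dispatch log used earlier on this page.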

Troubleshooting Reference

Error condition	Root cause	Remediation
Duplicate derivatives for one master	Visibility timeout shorter than task runtime → broker re-delivers a running job	Raise the timeout above worst-case per-object runtime; make the task idempotent on `batch_id`
Tasks vanish after a worker crash	Early acknowledgment (ack-on-receive) discards the message before work completes	Switch to late acknowledgment (`task_acks_late=True`) so re-delivery covers mid-task failures; also set `task_reject_on_worker_lost=True` so a killed worker process requeues its message
Worker RAM climbs across a long run	C-level buffers in Pillow/OpenCV bypass Python GC	Set `worker_max_tasks_per_child` to recycle workers; watch worker RSS, since `tracemalloc` only traces memory allocated through Python's allocator
Recovering service is hammered by retries	Fixed-delay retries release all paused workers at once	Apply exponential backoff with jitter; cap with `t_max`
Queue depth grows unbounded during peak scanning	Producer enqueue rate exceeds worker drain rate	Enforce broker prefetch limits; add workers or throttle capture; alert on queue depth
Valid batch routed to quarantine	Manifest `metadata_schema` not in the allowed set, or stale scanner profile	Correct the manifest field; register the profile before re-ingest

Frequently Asked Questions

Should I use Celery or RQ for archival batch queuing?

Both are valid Python task queues. RQ is lighter and simpler to reason about for single-broker Redis setups, while Celery offers richer routing (named queues, topic exchanges), workflow primitives (chains, chords), multiple broker backends, and mature result inspection, which matters when every task outcome must be auditable. For high-volume, multi-collection labs the routing and result-backend features generally justify Celery; smaller pipelines are well served by RQ.

How large should a single batch be?

Size batches so that one worker can finish a batch inside the broker’s visibility timeout and so that a re-delivery after a crash re-processes an acceptable amount of work. In practice, grouping by collection identifier and capping at a few hundred objects per manifest keeps retries cheap and audit records legible, rather than enqueuing one enormous manifest per scanning session.
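The grouping rule can be sketched as a small planner that splits a scanning session by collection and caps each manifest. The function name, the input shape, the cap of 200, and the `batch_id` numbering format are all assumptions made for the example.

```python
from collections import defaultdict


def plan_batches(objects, max_objects: int = 200) -> list[dict]:
    """Group (collection_id, path) pairs into capped, collection-prefixed batches."""
    if max_objects < 1:
        raise ValueError("max_objects must be at least 1")
    by_collection = defaultdict(list)
    for collection_id, path in objects:
        by_collection[collection_id].append(path)

    batches = []
    for collection_id in sorted(by_collection):
        paths = by_collection[collection_id]
        starts = range(0, len(paths), max_objects)
        for index, start in enumerate(starts, start=1):
            batches.append({
                "batch_id": f"{collection_id}-{index:04d}",
                "files": paths[start:start + max_objects],
            })
    return batches


# Hypothetical session: 450 manuscript pages plus one negative.
session = [("MS12", f"p{i:04d}.tif") for i in range(450)] + [("NEG3", "n0001.tif")]
plan = plan_batches(session, max_objects=200)
```

With this input the session becomes four manifests, so a crash late in the run re-processes at most 200 objects rather than the whole session.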

What happens to a batch that fails validation?

It is never promoted. A failed pre-flight or fixity check routes the payload to the quarantine queue, preserves the original bitstream untouched, and writes a PREMIS event documenting the discrepancy. A curator or an automated re-ingest can then act on the diagnostic manifest without any risk of the malformed package reaching the archival tier.

How do I guarantee a master is never processed twice?

Combine at-least-once delivery with idempotent tasks: the broker may deliver a message more than once, so the task itself has to make a repeat harmless. Use the batch_id (and per-file identifiers) as an idempotency key so a re-delivered message detects that its output already exists and short-circuits, and set the visibility timeout comfortably above the worst-case task runtime so the broker does not re-deliver a job that is still running.
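A minimal sketch of that short-circuit, assuming derivatives are written to a filesystem keyed by `batch_id` and a per-file identifier. `process_once`, the `.out` and `.partial` suffixes, and the callback shape are hypothetical.

```python
from pathlib import Path
from typing import Callable


def process_once(batch_id: str, file_id: str, out_dir: Path,
                 make_derivative: Callable[[], bytes]) -> bool:
    """Run the work only if its output is absent; return True when work was done."""
    target = Path(out_dir) / batch_id / f"{file_id}.out"
    if target.exists():
        return False  # a re-delivered message finds the output and stops here
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a partial file first, then rename, so an interrupted run never
    # leaves a half-written file under the final name.
    partial = target.with_name(target.name + ".partial")
    partial.write_bytes(make_derivative())
    partial.replace(target)
    return True
```

This covers sequential re-delivery only. Two workers running the same message at the same moment would still both do the work, which is why the visibility timeout has to stay above the worst-case runtime.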

Related Pages

Implementing Celery for asynchronous ingestion tasks: hands-on Celery setup for distributed ingest job lifecycles.
Error Handling & Retry Logic: backoff, circuit breakers, and deterministic retry policies for zero-loss ingest.
Batch Validation Schemas: the structural and semantic gate every queued manifest must pass.
Scanner API Integration & Routing: the capture-side queue this orchestrator dispatches to.
PREMIS Metadata Mapping: the preservation-event vocabulary the queue’s audit log records against.
