Batch Validation Schemas: Structural Gatekeepers for Automated Ingest

Batch validation schemas are the deterministic quality gate inside Automated Ingestion & Batch Scanning Workflows: the enforcement layer that decides whether a freshly captured batch becomes an Archival Information Package (AIP) or is quarantined for review. This page specifies how those schemas are structured, how they are enforced against every incoming Submission Information Package (SIP), and how their verdicts are recorded as permanent preservation evidence. The gate sits directly downstream of Async Task Queuing for Batches, which distributes validation jobs across worker nodes, and directly upstream of Metadata Extraction Workflows and OCR Processing Pipelines, which assume every object they receive has already passed structural and fixity checks. By codifying preservation requirements into machine-readable assertions — JSON Schema profiles, XSD/Schematron rules, and BagIt profiles — institutions replace subjective manual review with reproducible, auditable adjudication that scales alongside a high-throughput scanning floor and satisfies the audit obligations of ISO 14721 (OAIS) and ISO 16363.

At a high level, the validation gate evaluates each incoming SIP and routes it to either repository promotion or quarantine. The validation schema acts as the adjudicator: compliant packages advance to AIP promotion, while failures are quarantined with a permanent PREMIS audit record.

Schema Architecture: The Four Validation Layers

A production validation schema is not a single document but a layered stack, where each layer assumes the layer beneath it has already passed. Treating validation as one monolithic check produces useless error messages (“package invalid”) and forces re-processing of the entire batch on any failure. Decomposing it into ordered layers lets the pipeline fail fast, emit a precise diagnostic, and short-circuit expensive downstream work.
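The ordering and short-circuit behaviour described above can be sketched as a small runner. This is a minimal illustration, not the pipeline's actual code: the `Layer` type, `check_structure`, `check_schema`, and `run_gate` are hypothetical names, and the two toy checks stand in for the real structural and schema layers.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A layer inspects the package and returns None on success or a diagnostic string.
Layer = Callable[[Dict], Optional[str]]


def check_structure(package: Dict) -> Optional[str]:
    # Toy structural check: the package must carry a manifest at all.
    return None if "manifest" in package else "manifest missing"


def check_schema(package: Dict) -> Optional[str]:
    # Toy schema check: files must be a non-empty array.
    files = package["manifest"].get("files")
    return None if isinstance(files, list) and files else "files must be a non-empty array"


# Order matters: each layer may assume the ones before it passed.
LAYERS: List[Tuple[str, Layer]] = [
    ("structural_error", check_structure),
    ("schema_violation", check_schema),
]


def run_gate(package: Dict, layers: List[Tuple[str, Layer]] = LAYERS) -> Dict:
    for verdict, layer in layers:
        detail = layer(package)
        if detail is not None:
            # Fail fast: later (more expensive) layers never run.
            return {"status": "rejected", "verdict": verdict, "detail": detail}
    return {"status": "approved", "verdict": "validated"}
```

Because each layer is paired with its own verdict string, a rejection names the layer that failed rather than reporting a generic "package invalid".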

Layer	What it enforces	Tooling	Failure verdict
1. Structural	Package layout, required files present, manifest parses	BagIt profile validator, `bagit-python`	`structural_error`
2. Schema	Field types, cardinality, controlled vocabularies, required elements	`jsonschema` (Draft 2020-12), `lxml` XSD/Schematron	`schema_violation`
3. Fixity	Declared checksums match recomputed hashes	`hashlib` SHA-256/SHA-512	`checksum_mismatch`
4. Policy	Capture profile conformance (bit depth, colour space, resolution, format)	Institutional profile rules on extracted technical metadata	`policy_violation`

The schema layer itself is where most engineering effort concentrates. It begins with strict type checking, cardinality constraints, and controlled-vocabulary enforcement, expressed as declarative data contracts at the point of ingest. Defining these contracts explicitly prevents metadata drift, catches structural anomalies before they propagate, and produces a verifiable trail of exactly which assertion a package failed. The schema-first approach aligns with the JSON Schema specification, which supplies a robust vocabulary for annotating and validating JSON manifests against an institutional preservation profile.

Manifest field specification

Every SIP carries a JSON manifest describing its members. The table below is the field-level contract enforced by the schema layer; it is the single source of truth that both the scanner-side packager and the ingest validator compile against.

Field	JSON type	Required	Constraint	PREMIS / METS mapping
`bag_identifier`	string	yes	URN pattern, unique per batch	`premis:objectIdentifier`
`created`	string	yes	RFC 3339 UTC timestamp	`premis:eventDateTime`
`capture_profile`	string	yes	enum: `fadgi-4star`, `fadgi-3star`	technical metadata anchor
`files`	array	yes	`minItems: 1`	`mets:fileGrp`
`files[].path`	string	yes	POSIX relative path, no `..`	`premis:hasOriginalName`
`files[].sha256`	string	yes	64-char lowercase hex	`premis:messageDigest`
`files[].format_puid`	string	yes	PRONOM PUID (e.g. `fmt/353`)	`premis:formatRegistryKey`
`files[].bit_depth`	integer	conditional	`>= 8`, required for images	technical metadata

The `format_puid` field is what links this schema to Format Registry Integration: the validator does not merely check that the field is present, it confirms the declared PUID resolves against the local PRONOM mirror before the package is promoted. Type and structural correctness are necessary but not sufficient: a package can be perfectly well-formed and still declare a format the repository has no preservation plan for.
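As a rough sketch, the field table above might translate into a Draft 2020-12 schema document like the one below. The regular expressions are illustrative assumptions rather than an institutional standard, `MANIFEST_SCHEMA`, `FILE_PROPS`, and `file_field_ok` are hypothetical names, and two table constraints are deliberately left out because plain JSON Schema does not express them directly here: uniqueness of `bag_identifier` per batch, and the conditional requirement on `bit_depth`.

```python
import re

# Hypothetical profile: the patterns below are illustrative assumptions.
MANIFEST_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["bag_identifier", "created", "capture_profile", "files"],
    "properties": {
        "bag_identifier": {"type": "string", "pattern": r"^urn:[a-z0-9][a-z0-9-]{0,31}:\S+$"},
        # UTC-only RFC 3339 timestamp, e.g. 2023-10-15T08:30:00Z
        "created": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$"},
        "capture_profile": {"enum": ["fadgi-4star", "fadgi-3star"]},
        "files": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["path", "sha256", "format_puid"],
                "properties": {
                    # Relative path with no ".." segment anywhere.
                    "path": {"type": "string", "pattern": r"^(?!/)(?!.*(?:^|/)\.\.(?:/|$)).+$"},
                    "sha256": {"type": "string", "pattern": r"^[0-9a-f]{64}$"},
                    "format_puid": {"type": "string", "pattern": r"^(x-)?fmt/[0-9]+$"},
                    "bit_depth": {"type": "integer", "minimum": 8},
                },
            },
        },
    },
}

FILE_PROPS = MANIFEST_SCHEMA["properties"]["files"]["items"]["properties"]


def file_field_ok(field: str, value: str) -> bool:
    """Spot-check one string value against its pattern (not full validation)."""
    return re.search(FILE_PROPS[field]["pattern"], value) is not None
```

A dictionary of this shape can be handed to `Draft202012Validator`, as the implementation later on this page does with the schema it loads from disk. Resolving the PUID against the PRONOM mirror remains a separate lookup that the pattern alone cannot perform.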

Where Validation Runs in the Pipeline

Validation is a synchronous checkpoint embedded in an otherwise asynchronous topology. When a batch of high-resolution TIFFs, PDF/A derivatives, or born-digital transfers enters the system, the task queue distributes validation jobs across worker nodes, preventing resource contention while holding to strict throughput service-level agreements. During this phase, metadata extraction runs in parallel to parse technical metadata, read embedded EXIF/XMP headers, and recompute cryptographic checksums. The schema is the adjudicator: it rejects payloads that fail any layer and routes compliant batches forward. This decoupling ensures that infrastructure limits — worker count, storage IOPS, network partitions — throttle throughput without ever weakening the correctness guarantee.
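The separation between throughput and correctness can be illustrated with a bounded worker pool. This sketch uses a standard-library thread pool as a stand-in for the real task queue; `validate_batches` and `demo_validate` are hypothetical names, and `demo_validate` is a placeholder for the actual gate.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def validate_batches(
    batch_ids: Iterable[str],
    validate_one: Callable[[str], Dict],
    max_workers: int = 4,
) -> List[Dict]:
    """Run validate_one for each batch on a bounded worker pool.

    max_workers caps concurrency (and therefore throughput); it has no
    effect on what validate_one accepts or rejects.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so verdicts line up with batch_ids.
        return list(pool.map(validate_one, batch_ids))


def demo_validate(batch_id: str) -> Dict:
    # Stand-in for the real gate: rejects ids ending in "-bad".
    ok = not batch_id.endswith("-bad")
    return {"bag": batch_id, "status": "approved" if ok else "rejected"}
```

Running the same batches with one worker or eight yields the same verdicts; only the elapsed time changes.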

The fixity layer rests on a collision-resistance assumption worth stating explicitly. For an $n$-file batch validated under SHA-256, the probability that any two distinct files share a digest is bounded by the birthday approximation

$$ P(\text{collision}) \approx \frac{n^2}{2^{257}} $$

which is negligible for any realistic batch size, so a matching digest is treated as proof of byte-for-byte identity between the declared and stored object.
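To make "negligible" concrete, the bound can be evaluated for a given batch size. The helper names below are hypothetical; the arithmetic is just the approximation above with a 256-bit digest.

```python
from fractions import Fraction
from math import log2


def collision_bound(n_files: int, digest_bits: int = 256) -> Fraction:
    """Birthday approximation n^2 / 2^(bits + 1), kept exact as a Fraction."""
    return Fraction(n_files ** 2, 2 ** (digest_bits + 1))


def collision_bound_log2(n_files: int, digest_bits: int = 256) -> float:
    """The same bound expressed as a base-2 exponent, which is easier to read."""
    return 2 * log2(n_files) - (digest_bits + 1)
```

For a batch of one million files the exponent is roughly -217, so the bound is on the order of 2^-217.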

Core Implementation

The following pattern implements the gate as a single, type-hinted class. It validates the manifest against a compiled schema, recomputes and compares checksums, and logs a structured audit line for every verdict it returns, which is the minimum an ingest node must do to be defensible under ISO 16363.

python

import json
import hashlib
import logging
from pathlib import Path
from typing import Dict, List
from jsonschema import Draft202012Validator
from datetime import datetime, timezone

# Configure structured audit logging
logging.basicConfig(
    filename="ingest_validation_audit.log",
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S%z",
)
logger = logging.getLogger("batch_validator")


class BatchValidator:
    """Layered ingest gate: structural -> schema -> fixity, with audit logging."""

    def __init__(self, schema_path: Path) -> None:
        with open(schema_path, "r", encoding="utf-8") as fh:
            self.schema = json.load(fh)
        # Fail loudly at construction if the profile itself is malformed.
        Draft202012Validator.check_schema(self.schema)
        self.validator = Draft202012Validator(self.schema)

    def compute_sha256(self, file_path: Path) -> str:
        digest = hashlib.sha256()
        with open(file_path, "rb") as fh:
            for chunk in iter(lambda: fh.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def _verify_fixity(self, base: Path, files: List[Dict]) -> List[str]:
        mismatches: List[str] = []
        for entry in files:
            target = base / entry["path"]
            if not target.is_file():
                mismatches.append(f"{entry['path']} (missing)")
                continue
            if self.compute_sha256(target) != entry.get("sha256", ""):
                mismatches.append(f"{entry['path']} (digest)")
        return mismatches

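    def validate_manifest(self, manifest_path: Path) -> Dict:
        base = manifest_path.parent

        # Layer 1, structural (minimal): the manifest must be readable JSON.
        # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        try:
            with open(manifest_path, "r", encoding="utf-8") as fh:
                manifest = json.load(fh)
        except (OSError, ValueError) as exc:
            logger.error("verdict=structural_error bag=%s detail=%s",
                         manifest_path.name, exc)
            return {"status": "rejected", "verdict": "structural_error", "detail": str(exc)}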

        # Layer 2 — schema. Collect every violation, not just the first.
        errors = sorted(self.validator.iter_errors(manifest), key=lambda e: e.path)
        if errors:
            detail = "; ".join(f"{'/'.join(map(str, e.path))}: {e.message}" for e in errors)
            logger.error("verdict=schema_violation bag=%s detail=%s",
                         manifest_path.name, detail)
            return {"status": "rejected", "verdict": "schema_violation", "detail": detail}

        # Layer 3 — fixity.
        mismatches = self._verify_fixity(base, manifest["files"])
        if mismatches:
            logger.error("verdict=checksum_mismatch bag=%s files=%s",
                         manifest_path.name, mismatches)
            return {"status": "rejected", "verdict": "checksum_mismatch", "files": mismatches}

        logger.info("verdict=approved bag=%s files=%d",
                    manifest_path.name, len(manifest["files"]))
        return {
            "status": "approved",
            "verdict": "validated",
            "bag": manifest.get("bag_identifier"),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }


if __name__ == "__main__":
    validator = BatchValidator(Path("schemas/preservation_manifest_v2.json"))
    result = validator.validate_manifest(Path("ingest/batch_20231015/manifest.json"))
    print(json.dumps(result, indent=2))

Two details matter in production. First, `iter_errors` collects every schema violation in one pass, so a curator sees the complete set of problems in a batch instead of fixing them one round-trip at a time. Second, every terminal path, approval or any rejection class, emits a structured log line with a machine-parsable `verdict=` field, which is what the audit ledger and monitoring dashboards consume.

Integration Points

The validation gate is defined by its neighbours as much as by its own logic. It coordinates most tightly with Scanner API Integration & Routing, where hardware-generated file manifests must be cross-referenced against expected capture parameters, bit-depth specifications, and resolution metadata before any transformation runs. Scanner outputs routinely carry proprietary headers, inconsistent naming conventions, or malformed XML sidecars; the policy layer normalises these device-specific quirks by applying transformation rules and rejecting payloads that deviate from the approved capture profile. Enforcing schema compliance at this routing boundary means only structurally sound batches reach derivative generation, which prevents cascading failures downstream and avoids storing malformed or incomplete captures.
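A policy-layer check of this kind might look like the sketch below. The thresholds are placeholders for illustration only and should not be read as the actual FADGI requirements; `CaptureProfile`, `PROFILES`, `policy_violations`, and the `tech_md` keys (`bit_depth`, `ppi`, `colour_space`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CaptureProfile:
    min_bit_depth: int
    min_ppi: int
    colour_spaces: frozenset


# Placeholder thresholds for illustration only; real values come from the
# institution's approved capture profile.
PROFILES: Dict[str, CaptureProfile] = {
    "fadgi-4star": CaptureProfile(16, 400, frozenset({"AdobeRGB", "ProPhotoRGB"})),
    "fadgi-3star": CaptureProfile(8, 300, frozenset({"AdobeRGB", "sRGB"})),
}


def policy_violations(profile_name: str, tech_md: Dict) -> List[str]:
    """Return every rule the extracted technical metadata breaks."""
    profile = PROFILES.get(profile_name)
    if profile is None:
        return [f"unknown capture profile: {profile_name}"]
    problems: List[str] = []
    if tech_md.get("bit_depth", 0) < profile.min_bit_depth:
        problems.append(f"bit_depth below {profile.min_bit_depth}")
    if tech_md.get("ppi", 0) < profile.min_ppi:
        problems.append(f"resolution below {profile.min_ppi} ppi")
    if tech_md.get("colour_space") not in profile.colour_spaces:
        problems.append("colour space not permitted by profile")
    return problems
```

Returning a list rather than a boolean keeps the policy layer consistent with the schema layer: every deviation is reported in one pass.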

Once a batch clears the gate, validated master files flow to OCR Processing Pipelines for text extraction, layout analysis, and searchable PDF generation. Because the upstream schema guarantees consistent image geometry, colour space, and resolution, OCR engines run with predictable accuracy and less wasted compute. In parallel, compliant metadata packages feed the enrichment models handled by Metadata Extraction Workflows, which generate subject headings, entity tags, and provenance annotations. The schema guarantees those models receive the required structural anchors — dc:identifier, premis:hasOriginalName, mets:fileGrp — preventing enrichment from operating on incomplete input. When any downstream stage rejects a batch, the Error Handling & Retry Logic subsystem quarantines the payload, emits a diagnostic manifest, and schedules automated re-ingest or curator review.
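The quarantine step can be sketched as moving the batch directory aside and writing a diagnostic manifest beside its files. The directory layout, the `diagnostic.json` file name, and the `quarantine_batch` function are assumptions for illustration; scheduling of re-ingest or curator review is not shown.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def quarantine_batch(batch_dir: Path, quarantine_root: Path, result: Dict) -> Path:
    """Move a rejected batch aside and write a diagnostic manifest next to its files."""
    quarantine_root.mkdir(parents=True, exist_ok=True)
    target = quarantine_root / batch_dir.name
    if target.exists():
        # Refuse to merge into an earlier quarantine of the same batch name.
        raise FileExistsError(f"already quarantined: {target}")
    shutil.move(str(batch_dir), str(target))
    diagnostic = {
        "batch": batch_dir.name,
        "verdict": result.get("verdict"),
        "detail": result.get("detail") or result.get("files"),
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
    (target / "diagnostic.json").write_text(
        json.dumps(diagnostic, indent=2), encoding="utf-8"
    )
    return target
```

The `result` argument is expected to have the shape of the rejection dictionaries returned by the validator above (`verdict` plus either `detail` or `files`).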

Validation and Compliance Rules

A validation verdict is only trustworthy if it is recorded. Every assertion the gate evaluates must be captured as a preservation event that survives technology migration and organisational change, a requirement developed in depth in Validating schema compliance during digital ingest. Institutional compliance with the PREMIS Data Dictionary requires that preservation metadata explicitly document the validation events certifying an object's fitness for long-term storage, so each verdict the gate produces maps to a specific PREMIS event type, recorded through PREMIS Metadata Mapping.

Validation layer	PREMIS eventType	eventOutcome on failure	Recorded in
Structural	`validation`	`structural_error`	`mets:amdSec` / `premis:event`
Schema	`validation`	`schema_violation`	`premis:event` + linking to profile version
Fixity	`fixity check`	`bitstream_altered`	`premis:event` + `premis:messageDigest`
Policy	`validation`	`policy_violation`	`premis:event` + capture-profile id
Promotion	`ingestion`	— (`pass` only)	AIP `premis:event`

Embedding these results into METS <amdSec> elements and PREMIS <event> records creates a verifiable chain of custody: the audit trail captures the schema version in force, the validation engine state, the routing decision, and the cryptographic proof of integrity. This is the alignment point with the broader OAIS Reference Model Implementation — validation events populate the Provenance and Fixity information required for a package to be a legitimate AIP, not merely a stored file. Batch validation schemas therefore function less as transient checks and more as permanent structural guarantees.
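A verdict from the table above could be serialised as a PREMIS event roughly as follows. This is a simplified sketch: the element names follow the PREMIS 3 event entity, but the output is not validated against the PREMIS XSD, the identifier type values (`UUID`, `URN`) are assumptions, and `premis_event` is a hypothetical helper.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)


def _sub(parent: ET.Element, tag: str, text: str = "") -> ET.Element:
    # Create a namespaced child element, optionally with text content.
    child = ET.SubElement(parent, f"{{{PREMIS_NS}}}{tag}")
    if text:
        child.text = text
    return child


def premis_event(object_id: str, event_type: str, outcome: str, detail: str = "") -> ET.Element:
    """Build a simplified premis:event element for one validation verdict."""
    event = ET.Element(f"{{{PREMIS_NS}}}event")
    ident = _sub(event, "eventIdentifier")
    _sub(ident, "eventIdentifierType", "UUID")
    _sub(ident, "eventIdentifierValue", str(uuid.uuid4()))
    _sub(event, "eventType", event_type)
    _sub(event, "eventDateTime", datetime.now(timezone.utc).isoformat())
    info = _sub(event, "eventOutcomeInformation")
    _sub(info, "eventOutcome", outcome)
    if detail:
        note = _sub(info, "eventOutcomeDetail")
        _sub(note, "eventOutcomeDetailNote", detail)
    link = _sub(event, "linkingObjectIdentifier")
    _sub(link, "linkingObjectIdentifierType", "URN")
    _sub(link, "linkingObjectIdentifierValue", object_id)
    return event
```

The resulting element can be serialised with `ET.tostring(event)` and wrapped in the package's METS administrative metadata; recording the schema profile version would need an additional detail element that is not shown here.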

Troubleshooting Reference

Symptom	Root cause	Remediation
`schema_violation` on `format_puid`	PRONOM PUID absent or unresolved against local mirror	Sync the format registry; if the format is genuinely new, add a signature and preservation plan before re-ingest
Intermittent `checksum_mismatch` on large TIFFs	Truncated transfer or NFS write-back race before the reader flushes	Re-stage from source, verify staging quota, add a post-write settle/`fsync` barrier prior to hashing
`structural_error`: manifest parses but `files[]` empty	Scanner packager emitted the SIP before capture completed	Gate packaging on a capture-complete sentinel; treat empty `files` as retryable, not terminal
Every batch fails after a profile change	Schema profile bumped without recompiling scanner-side validator	Version the profile; pin producer and consumer to the same `schema_version`; roll out via the registry, not ad hoc
Validation passes but OCR quality collapses	Policy layer omits a bit-depth/colour-space rule the profile assumes	Add the missing constraint to the policy layer; backfill-validate the affected batches
Audit log missing verdicts under load	Log handler buffering or worker crash before flush	Use synchronous/append-only logging on the ingest node; ship verdicts to the audit ledger, not a local file only
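For the last row of the table, one way to avoid losing verdicts to buffering is to append each one as a complete line and force it to disk before returning. This is a sketch under the assumption of a local JSON-lines ledger; `append_verdict` and the `logged_at` field are hypothetical, and shipping records to a remote audit ledger is out of scope.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def append_verdict(ledger_path: Path, verdict: Dict) -> None:
    """Append one verdict as a JSON line and force it to disk before returning."""
    record = dict(verdict, logged_at=datetime.now(timezone.utc).isoformat())
    line = json.dumps(record, sort_keys=True) + "\n"
    # Mode "a" appends; each call writes exactly one complete line.
    with open(ledger_path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to commit to stable storage
```

The `fsync` call trades some throughput for durability, which is usually the right trade on an ingest node where a missing verdict is an audit gap.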

FAQ

What is the difference between a batch validation schema and a BagIt profile?

A BagIt profile constrains package structure — which tag files exist, which manifests are required, allowed serialisations. A batch validation schema is broader: it wraps the BagIt profile as its structural layer and adds field-level type and vocabulary enforcement, fixity comparison, and capture-policy rules. The profile answers “is this a well-formed bag?”; the schema answers “is this a package this repository will accept as an AIP?”.
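To show how narrow the "well-formed bag" question is, the sketch below checks only the minimum BagIt layout using the standard library. It is a stand-in for a real BagIt profile validator, not a replacement for one, and `bag_layout_problems` is a hypothetical name.

```python
from pathlib import Path
from typing import List


def bag_layout_problems(bag_dir: Path) -> List[str]:
    """Check the minimum BagIt layout: bagit.txt, a payload manifest, and data/."""
    problems: List[str] = []
    if not (bag_dir / "bagit.txt").is_file():
        problems.append("bagit.txt missing")
    if not (bag_dir / "data").is_dir():
        problems.append("data/ directory missing")
    if not any(bag_dir.glob("manifest-*.txt")):
        problems.append("no payload manifest (manifest-<algorithm>.txt)")
    return problems
```

A bag that passes this check has said nothing yet about field types, checksums, or capture policy; those belong to the later layers.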

Should validation reject a batch on the first error or collect all errors?

Collect all errors within a layer before returning a verdict. Using `iter_errors` rather than a fail-fast `validate` call lets a curator resolve every schema problem in a batch in a single round-trip. The layers themselves are still ordered and short-circuiting: there is no reason to recompute checksums for a package whose manifest does not parse.

How does schema versioning avoid breaking an in-flight ingest?

Pin the producer (scanner-side packager) and consumer (ingest validator) to the same `schema_version`, distribute profile changes through Format Registry Integration rather than editing files in place, and record the active version in every PREMIS validation event. A package is always judged against the profile that was current when it was produced.
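Version pinning can be sketched as a lookup keyed on the version the package declares. This assumes the manifest carries a `schema_version` field, which the field table on this page does not list; `SCHEMA_REGISTRY`, its contents, and `schema_for` are hypothetical.

```python
from typing import Dict

# Hypothetical registry: profile version -> schema document.
SCHEMA_REGISTRY: Dict[str, Dict] = {
    "2.0": {"required": ["bag_identifier", "files"]},
    "2.1": {"required": ["bag_identifier", "created", "files"]},
}


def schema_for(manifest: Dict, registry: Dict[str, Dict] = SCHEMA_REGISTRY) -> Dict:
    """Select the profile the package was produced against, never 'latest'."""
    version = manifest.get("schema_version")
    if version not in registry:
        raise LookupError(f"no registered profile for schema_version={version!r}")
    return registry[version]
```

Raising on an unknown version, instead of falling back to the newest profile, keeps an in-flight batch from being judged against rules its producer never saw.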

Where are validation results stored for audit?

In two places. A structured log line is emitted for real-time monitoring, and a durable PREMIS event record is written for each verdict and linked to the object identifier — mapped as described in PREMIS Metadata Mapping. The PREMIS record is the authoritative, migration-surviving copy; the log is operational.

Related

Validating schema compliance during digital ingest: the step-by-step audit-record pattern behind every verdict this gate produces.
Async Task Queuing for Batches: how validation jobs are distributed across worker nodes without contention.
Scanner API Integration & Routing: the capture-profile source the policy layer validates against.
Error Handling & Retry Logic: what happens to a payload after it is quarantined.
PREMIS Metadata Mapping: how each validation verdict becomes a permanent preservation event.
