Setting Up OAIS SIP/AIP/DIP Workflows in Python

Archival digitization pipelines rarely fail in the code that does the interesting work. They fail at the seams — the moment a raw scan crosses from an untrusted Submission Information Package (SIP) into a sealed Archival Information Package (AIP), or when a Dissemination Information Package (DIP) is projected for an access system and quietly leaks metadata it should have redacted. This page is the hands-on build guide within the parent OAIS Reference Model Implementation section: it turns the model’s package-promotion contract into runnable Python. The recurring defect is treating promotion as a file copy. Under an OAIS-Compliant Digital Preservation Architecture, each transition is a validated, event-emitting boundary — a SIP is admitted only after fixity and schema both pass, an AIP is sealed with provenance before it is stored, and a DIP is a read-only derivative that never mutates the original. Get the ordering wrong and the failures surface months later as XMLSyntaxError on files that looked fine, checksum mismatches after a network hop, or a namespace-mangled METS record that no auditor can parse.

Root-Cause Analysis of Promotion Failures

When a SIP/AIP/DIP pipeline degrades in production, the cause almost always sits at one of the three package boundaries, each with a distinct technical signature:

Silent corruption at the SIP boundary. Legacy storage systems strip extended attributes, network middleware injects a UTF-8 BOM or zero-byte padding, and a parser running in recovery mode (lxml with recover=True, for example) then repairs or drops the malformed bytes without complaint. The scan looks valid; the digest computed after transfer no longer matches the one recorded at capture. The root cause is admitting bytes before verifying them.
Metadata mangling during AIP assembly. The standard-library xml.etree auto-prefixes namespaces (ns0:, ns1:) on serialization, and event types get mapped to the wrong PREMIS URIs. The AIP serializes without error but its provenance chain is unreadable downstream. This is where disciplined PREMIS metadata mapping and signature-based preservation format identification have to be wired directly into the transformation, not bolted on afterward.
Uncontrolled exposure at the DIP boundary. A DIP generated by naively zipping the AIP payload carries restricted descriptive fields, rights statements, or internal identifiers into the access layer, violating institutional digital preservation security policies. The DIP must be a deliberate projection, not a repackaged copy.

A fourth, quieter failure is the race: multiple ingest workers append to a shared manifest, or an asynchronous storage write returns before the bytes are durable. Debugging these means reading the actual byte stream and the audit log — never adjusting a validation threshold to make a failing SIP pass.
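Reading the actual byte stream does not need special tooling. The sketch below is a minimal standard-library illustration aimed at metadata files (binary payloads such as TIFF legitimately contain NUL bytes); the function name `inspect_bytes` and the report keys are hypothetical, not part of any library.

```python
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"


def inspect_bytes(data: bytes) -> dict:
    """Report byte-level anomalies without modifying the input."""
    stripped = data.rstrip(b"\x00")
    return {
        # True when the stream starts with the three-byte UTF-8 BOM.
        "has_utf8_bom": data.startswith(UTF8_BOM),
        # Offset of the first NUL byte, or None when the stream has none.
        "first_nul_offset": data.find(b"\x00") if b"\x00" in data else None,
        # Number of NUL bytes appended after the last non-NUL byte (zero padding).
        "trailing_nul_bytes": len(data) - len(stripped),
    }


def inspect_file(path: Path) -> dict:
    """Convenience wrapper: inspect a file's bytes exactly as stored."""
    return inspect_bytes(path.read_bytes())
```

Logging this report next to the failing SIP's audit entry records what was actually on disk, which is the evidence a later investigation needs.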

The flowchart below maps the pipeline implemented across the code samples on this page, from SIP receipt and fixity validation through AIP assembly and storage to on-demand DIP generation.

The SIP-to-AIP-to-DIP pipeline as implemented: validation gates feed AIP assembly, storage, and on-request DIP generation.

Step One: Admit the SIP Only After Fixity and Schema Pass

The SIP boundary is the only place raw, untrusted bytes enter the system, so it carries the strictest gate. Two checks run before anything downstream sees the payload: a memory-efficient SHA-256 digest over the bitstream, and strict XML validation of the metadata against its XSD with recover=False so malformed input raises instead of being silently repaired. When a SIP fails, the pipeline isolates that one payload into a quarantine or retry path rather than aborting the whole batch — the same isolation discipline used by the batch error handling and retry logic that feeds this stage. Transient I/O errors get exponential backoff; a schema failure is fatal and short-circuits immediately.

python

import hashlib
import logging
import time
from pathlib import Path
from typing import Optional

from lxml import etree

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("sip_ingest")


def compute_sha256(file_path: Path, chunk_size: int = 8192) -> str:
    """Compute a SHA-256 digest with memory-efficient chunking."""
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):
            sha256.update(chunk)
    return sha256.hexdigest()


def validate_xml_schema(xml_path: Path, xsd_path: Path) -> bool:
    """Strict XML validation with recover=False; syntax errors are logged with line and column."""
    with open(xsd_path, "rb") as xsd_file:
        schema = etree.XMLSchema(etree.XML(xsd_file.read()))

    parser = etree.XMLParser(recover=False, resolve_entities=False)
    try:
        doc = etree.parse(str(xml_path), parser)
        schema.assertValid(doc)
        return True
    except etree.XMLSyntaxError as e:
        # e.position is a (line, column) tuple, not a byte offset.
        logger.error("XML syntax error at line %s, column %s: %s", e.position[0], e.position[1], e.msg)
        return False
    except etree.DocumentInvalid as e:
        logger.error("Schema validation failed: %s", e.error_log)
        return False


def ingest_sip_with_retry(sip_dir: Path, max_retries: int = 3) -> Optional[dict]:
    """Admit a SIP only after schema and fixity pass; back off on transient I/O."""
    for attempt in range(max_retries):
        try:
            metadata_xml = sip_dir / "metadata.xml"
            if not validate_xml_schema(metadata_xml, sip_dir / "schema.xsd"):
                raise ValueError("Metadata schema validation failed")

            payload = sip_dir / "payload.tif"
            fixity = compute_sha256(payload)
            logger.info("SIP validated. SHA-256: %s", fixity)
            return {"status": "valid", "fixity": fixity, "path": str(sip_dir)}
        except OSError as e:
            wait = 2 ** attempt
            logger.warning("Transient I/O error on attempt %d: %s. Retrying in %ds.", attempt + 1, e, wait)
            time.sleep(wait)
        except Exception as e:
            logger.critical("Fatal SIP validation error: %s", e)
            break
    return None

The XSD used here is the same profile enforced upstream by batch validation schemas, so a package that clears the ingest queue arrives structurally sound and only fixity remains to be reconfirmed at the boundary.
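Note that compute_sha256 above only produces a digest; fixity is confirmed only when that digest is compared with the value recorded at capture. The following is a minimal sketch of that comparison, assuming the capture-time digest reaches the pipeline as a hex string (where it comes from, such as a manifest file, is left open). `FixityMismatch`, `sha256_of`, and `verify_fixity` are hypothetical names introduced for illustration.

```python
import hashlib
from pathlib import Path


class FixityMismatch(Exception):
    """Raised when a payload's digest differs from the one recorded at capture."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large scans never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(payload: Path, expected_sha256: str) -> str:
    """Return the digest if it matches the recorded value, otherwise raise."""
    actual = sha256_of(payload)
    # Normalize the recorded value: manifests often carry uppercase hex or a newline.
    if actual != expected_sha256.strip().lower():
        raise FixityMismatch(f"{payload.name}: expected {expected_sha256.strip()}, got {actual}")
    return actual
```

Raising a dedicated exception, rather than returning False, keeps a mismatch from being ignored by a caller that forgets to check the return value.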

Step Two: Seal the AIP With PREMIS and a Format Identity

Once a SIP is admitted, promotion assembles the AIP: it binds the bitstream to a PREMIS record so that provenance, rights, and technical characteristics survive future format migrations. The most frequent defect here is mapping an event to the wrong PREMIS event type, which breaks the downstream audit trail. Constraining the model with pydantic enforces type safety before serialization, so an invalid event can never be written into a sealed package. The format designation is not trusted from the scanner — it is resolved through automated format registry integration against PRONOM (via fido or siegfried) and folded into the same object.

python

import logging
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import List

from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger("aip_transform")


class PremisEvent(BaseModel):
    event_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    event_type: str = Field(pattern=r"^(ingestion|format identification|fixity check|migration)$")
    event_date: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    event_detail: str
    linking_agent: str


class PremisObject(BaseModel):
    object_id: str
    format_name: str
    format_version: str
    size_bytes: int
    fixity_sha256: str
    events: List[PremisEvent] = Field(default_factory=list)


def transform_to_aip(sip_data: dict, format_sig: dict) -> PremisObject:
    """Normalize an admitted SIP into a validated, PREMIS-bearing AIP object."""
    try:
        event = PremisEvent(
            event_type="format identification",
            event_detail=f"Identified via PRONOM: {format_sig.get('puid', 'unknown')}",
            linking_agent="preservation_pipeline_v2",
        )
        source_path = Path(sip_data["path"]) / "payload.tif"
        return PremisObject(
            object_id=Path(sip_data["path"]).name,
            format_name=format_sig.get("name", "unknown"),
            format_version=format_sig.get("version", "0.0"),
            size_bytes=source_path.stat().st_size,
            fixity_sha256=sip_data["fixity"],
            events=[event],
        )
    except ValidationError as e:
        logger.error("PREMIS mapping failed: %s", e.json())
        raise

Carrying fixity_sha256 from the SIP result straight into the AIP object is deliberate: the digest verified at admission becomes the anchor the sealed package keeps for its entire lifetime. The sealed AIP is then handed to the long-term storage architecture layer, which is responsible for replication, WORM object-lock, and the durability guarantees promotion itself does not provide.
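Before that handoff the object has to be serialized, and this is where namespace prefixes drift. The sketch below uses the standard-library xml.etree.ElementTree (a substitution made to keep the example dependency-free; with lxml the same effect is usually achieved by passing an explicit nsmap when creating the root element) to show the auto-generated prefix and the fix. The element subset is illustrative only and is not a complete or schema-valid PREMIS record; `build_object` and the sample identifier are hypothetical.

```python
import xml.etree.ElementTree as ET

PREMIS_NS = "http://www.loc.gov/premis/v3"


def q(tag: str) -> str:
    """Return the namespace-qualified (Clark notation) form of a PREMIS tag."""
    return f"{{{PREMIS_NS}}}{tag}"


def build_object(object_id: str, sha256: str, size_bytes: int) -> ET.Element:
    """Build a small, illustrative PREMIS object fragment."""
    obj = ET.Element(q("object"))
    ident = ET.SubElement(obj, q("objectIdentifier"))
    ET.SubElement(ident, q("objectIdentifierType")).text = "local"
    ET.SubElement(ident, q("objectIdentifierValue")).text = object_id
    chars = ET.SubElement(obj, q("objectCharacteristics"))
    fixity = ET.SubElement(chars, q("fixity"))
    ET.SubElement(fixity, q("messageDigestAlgorithm")).text = "SHA-256"
    ET.SubElement(fixity, q("messageDigest")).text = sha256
    ET.SubElement(chars, q("size")).text = str(size_bytes)
    return obj


element = build_object("vol-0001", "ab" * 32, 1024)

# Without a registered prefix, ElementTree invents one (ns0) at serialization time.
unregistered = ET.tostring(element, encoding="unicode")

# Registering the prefix once, process-wide, makes every later serialization stable.
ET.register_namespace("premis", PREMIS_NS)
registered = ET.tostring(element, encoding="unicode")
```

Because register_namespace changes global state, it is safest to call it once at import time in the module that owns serialization rather than inside each worker function.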

Step Three: Project a Read-Only DIP

The DIP is derived on demand and never mutates the AIP. Access derivatives are packaged with zipfile or tarfile alongside only the descriptive metadata cleared for exposure — restricted rights fields and internal identifiers are filtered out before anything reaches an access system. For resilience, DIP delivery pushes copies to distributed endpoints using idempotent transfers, and each destination re-verifies the checksum so a partial network outage cannot corrupt a delivered package undetected. Because the DIP is a projection, regenerating it is always safe: the sealed AIP remains the single source of truth.
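A projection of this kind might look like the sketch below. It assumes the cleared descriptive metadata is available as a flat dictionary and that the access derivative already exists on disk; the allowlist contents, the `metadata.json` member name, and the function names are all hypothetical choices for illustration.

```python
import json
import zipfile
from pathlib import Path

# Hypothetical allowlist: only these descriptive fields may leave the archive.
PUBLIC_FIELDS = frozenset({"title", "creator", "date", "description"})


def project_metadata(aip_metadata: dict) -> dict:
    """Return a new dict holding only allowlisted fields; the input is not modified."""
    return {k: v for k, v in aip_metadata.items() if k in PUBLIC_FIELDS}


def build_dip(derivative: Path, aip_metadata: dict, dip_path: Path) -> Path:
    """Write a DIP zip containing one access derivative and the projected metadata."""
    public = project_metadata(aip_metadata)
    with zipfile.ZipFile(dip_path, "w", compression=zipfile.ZIP_DEFLATED) as dip:
        # The derivative is read, never moved or rewritten.
        dip.write(derivative, arcname=f"data/{derivative.name}")
        dip.writestr("metadata.json", json.dumps(public, indent=2, sort_keys=True))
    return dip_path
```

An allowlist is used instead of a denylist so that a field added to the AIP later stays private until someone explicitly clears it for exposure.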

Validation and Verification

A pipeline that runs clean still needs proof each boundary actually held. Confirm the build with observable evidence, not by re-reading the code:

Re-verify fixity after every hop. Re-compute the SHA-256 of the stored AIP payload and assert it equals fixity_sha256 recorded at admission. A mismatch means a transfer re-encoded or padded the bytes — the digest, not the file size, is the authority.
Round-trip the metadata through its schema. Re-parse the sealed AIP’s METS/PREMIS with recover=False and assertValid. If serialization introduced ns0: prefixes or dropped a namespace declaration, this fails loudly instead of at audit time.
Inspect the PREMIS event, not the log line. Confirm exactly one event was emitted per transition and that its event_type matches the constrained vocabulary. A promotion that logs success but emits no durable event has no audit trail.
Prove the DIP is reducible to its AIP. Assert every byte in the DIP payload derives from the AIP and that no redacted field survived the projection — the DIP should be reconstructable from the AIP, never the reverse.
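The event check in particular is easy to automate. The sketch below assumes events have been loaded from the sealed record as plain dictionaries with an `event_type` key; the vocabulary mirrors the pattern constraint on PremisEvent, and `audit_events` is a hypothetical helper name.

```python
from collections import Counter

# Same constrained vocabulary as the PremisEvent model on this page.
ALLOWED_EVENT_TYPES = {"ingestion", "format identification", "fixity check", "migration"}


def audit_events(events: list[dict], expected_types: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the audit trail is intact."""
    problems = []
    counts = Counter(e.get("event_type") for e in events)
    # Any event outside the vocabulary is reported, even if it was not expected.
    for event_type in counts:
        if event_type not in ALLOWED_EVENT_TYPES:
            problems.append(f"unknown event type: {event_type!r}")
    # Each expected transition must have produced exactly one durable event.
    for expected in expected_types:
        if counts[expected] != 1:
            problems.append(f"expected exactly one {expected!r} event, found {counts[expected]}")
    return problems
```

Returning the full list of problems, rather than failing on the first, gives an auditor the complete picture for a package in one pass.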

Edge Cases and Gotchas

| Symptom | Root cause | Remediation |
| --- | --- | --- |
| `XMLSyntaxError` on valid-looking files | Scanner output buffer injected a BOM or null bytes | Remove a leading BOM with `read_bytes().removeprefix(b"\xef\xbb\xbf")` (Python 3.9+; `lstrip` would strip any run of those three byte values, not the exact prefix) before parsing; log the offset |
| Checksum mismatch after transfer | Network middleware applied zero-padding or line-ending normalization | Enforce binary-mode (`rb`/`wb`) transfers and verify with `hashlib.file_digest` (Python 3.11+) |
| Namespace drift in METS/PREMIS | Standard-library `xml.etree` auto-prefixed on serialize | Use `lxml.etree` with an explicit `nsmap` and `register_namespace()` |
| Async write race on the manifest | Multiple workers appended to a shared file | Use file locking (`fcntl`/`msvcrt`) or an atomic `tempfile` + `os.replace` write |
| PREMIS URI resolution failures | Outdated URIs hardcoded in templates | Externalize URIs to version-controlled YAML and validate against the PREMIS standard |

Three archival-specific traps deserve extra care. Multi-page TIFFs must be treated as one logical object through the whole promotion — split a bound volume across page-level SIPs and the DIP will rehydrate a torn record. Proprietary scanner extensions frequently wrap the payload in a vendor container that fido mis-identifies; pin the signature file and characterize the inner stream, not the wrapper. And legacy Latin-1 descriptive metadata will pass byte-level fixity yet corrupt on re-encode, so declare the source encoding explicitly rather than letting Python guess. Above all, make the whole pipeline idempotent and checkpoint-resumable: a promotion that cannot restart from its last durable state will, at petabyte scale, eventually strand a package mid-transition.
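For the Latin-1 trap, declaring the encoding can be as small as the sketch below. It assumes the source really is Latin-1, which has to be established from the metadata's provenance: Latin-1 decoding accepts every byte value, so it will never raise on a wrong guess. The function name and the returned keys are hypothetical.

```python
import hashlib


def transcode_latin1(raw: bytes) -> dict:
    """Decode Latin-1 bytes explicitly and re-encode as UTF-8, recording both digests."""
    text = raw.decode("latin-1")  # explicit source encoding, never guessed
    utf8 = text.encode("utf-8")
    return {
        "text": text,
        "utf8_bytes": utf8,
        # The two digests differ whenever non-ASCII characters are present,
        # so both belong in the provenance record for the transcoding step.
        "source_sha256": hashlib.sha256(raw).hexdigest(),
        "utf8_sha256": hashlib.sha256(utf8).hexdigest(),
    }
```

Keeping the original bytes and their digest alongside the UTF-8 form means the transcoding can be audited or redone later without trusting the converted copy.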

Frequently Asked Questions

Why validate fixity and schema separately at the SIP boundary?

They catch different corruption. Fixity (SHA-256) proves the bitstream is bit-identical to what was captured; schema validation proves the metadata is structurally parseable and complete. A file can pass one and fail the other: a byte-perfect payload with a BOM-corrupted metadata record, or a well-formed XML record describing a truncated scan. Admitting a SIP requires both gates to pass before anything downstream runs. The ingest code on this page runs the schema check first, but neither result substitutes for the other.

Where should format identification run — at SIP ingest or AIP assembly?

Run it during AIP assembly, not at the SIP boundary. The SIP gate confirms the package is trustworthy and complete; format characterization is a preservation decision that belongs in the sealed record, emitted as a PREMIS format identification event with the resolved PUID. Running it against PRONOM via fido or siegfried at assembly time keeps the format identity bound to the object’s provenance, where migration planning later reads it.
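As a rough sketch of the siegfried route, the code below maps a JSON report to the `format_sig` dictionary that transform_to_aip reads (`puid`, `name`, `version`). The report field names used here (`files`, `matches`, `ns`, `id`, `format`, `version`) and the `sf -json` invocation should be treated as assumptions to check against the installed siegfried version; `parse_siegfried` and `identify` are hypothetical helper names.

```python
import json
import subprocess
from pathlib import Path

UNKNOWN_SIG = {"puid": "unknown", "name": "unknown", "version": "unknown"}


def parse_siegfried(report_json: str) -> dict:
    """Map the first PRONOM match for the first file in a report to a format_sig dict."""
    report = json.loads(report_json)
    for match in report["files"][0]["matches"]:
        if match.get("ns") == "pronom":
            return {
                "puid": match.get("id", "unknown"),
                # Empty strings in the report are normalized to "unknown".
                "name": match.get("format") or "unknown",
                "version": match.get("version") or "unknown",
            }
    return dict(UNKNOWN_SIG)


def identify(path: Path) -> dict:
    """Run siegfried on one file; requires the `sf` binary on PATH."""
    result = subprocess.run(
        ["sf", "-json", str(path)], capture_output=True, text=True, check=True
    )
    return parse_siegfried(result.stdout)
```

Separating parsing from the subprocess call lets the mapping be unit-tested against saved reports without the tool installed.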

Can I regenerate a DIP without touching the AIP?

Yes — that is the point of the projection model. A DIP is a read-only derivative built on demand from the sealed AIP, so regenerating it (for a new access format, or after a redaction-policy change) never mutates the original. If your DIP generation ever needs to write back to the AIP, the boundary has been drawn in the wrong place.

How do I stop parallel workers from corrupting a shared manifest?

Never let two workers append to the same manifest file. Either give each worker an atomic tempfile + os.replace write to a per-package manifest, or serialize appends behind an OS-level lock (fcntl on POSIX, msvcrt on Windows). The race is what produces half-written manifest lines that pass no schema check and read as corruption during an audit.
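The per-package variant can be sketched with the standard library alone. The JSON manifest layout and the function name `write_manifest_atomic` are assumptions made for illustration; the technique itself is the temp-file-plus-os.replace write described above.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_manifest_atomic(manifest_path: Path, entries: dict) -> None:
    """Replace the manifest in one step so readers see the old or new file, never a partial one."""
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the destination directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=manifest_path.parent, prefix=".manifest-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(entries, tmp, indent=2, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, manifest_path)
    except BaseException:
        # Clean up the temp file on any failure, then re-raise.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

This protects a single manifest from torn writes; it does not merge concurrent updates, which is why each package gets its own manifest rather than sharing one.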

Related

OAIS Reference Model Implementation (the parent topic): typed SIP/AIP/DIP contracts, deterministic package promotion, and ISO 16363 audit mapping.
PREMIS Metadata Mapping: building the provenance, rights, and event records this pipeline seals into every AIP.
Format Registry Integration: resolving authoritative PRONOM identities so the format designation written at AIP assembly is trustworthy and migration-ready.
