Preservation Format Identification in OAIS-Compliant Digital Preservation Architecture

Preservation format identification is the gatekeeping stage that turns a raw bitstream into a file with a known, defensible technical identity. Within the parent OAIS-Compliant Digital Preservation Architecture, this is the subsystem that determines from its bytes what a file actually is, before the file can be normalized, migrated, or promoted to archival storage. It sits immediately upstream of Format Registry Integration, which resolves the identifier it produces into versioned registry intelligence, and it feeds the provenance record built by PREMIS Metadata Mapping. For archivists and Python automation engineers, this means going beyond file-extension parsing: reliable identification requires byte-signature matching, container introspection, and automated schema validation of the resulting record to support long-term renderability, with a cryptographic checksum binding that record to the exact bytes examined. This page specifies the identification-record contract, the decision flow that routes each object, a production-ready async identification pipeline, the integration edges it honours, and the compliance rules every identification event must satisfy.

Identification Record Specification and Normalisation Contract

An identification result is only useful if every worker in the pipeline reads it the same way. Different tools disagree on output shape — DROID emits CSV rows with PUID and METHOD columns, Siegfried emits nested JSON matches, Apache Tika returns a MIME string — so the identification layer must normalise all of them into a single, version-pinned internal record before the result drives any preservation decision. Modelling that record as an immutable, validated structure lets a malformed or low-confidence identification be quarantined at the resolution boundary rather than poisoning a downstream metadata record.
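As a rough illustration of that normalisation step, the sketch below folds three tool outputs into one flat dictionary. The `MIME_TYPE` column name, the `basis` keywords checked for Siegfried, and all three function names are assumptions made for this example; verify them against the tool versions you actually run.

```python
from typing import Dict, List

DEFAULT_MIME = "application/octet-stream"
KNOWN_METHODS = {"signature", "container", "extension"}


def from_droid_row(row: Dict[str, str]) -> Dict[str, str]:
    """Normalise one DROID CSV row (MIME_TYPE is an assumed column name)."""
    puid = (row.get("PUID") or "").strip() or "UNKNOWN"
    method = (row.get("METHOD") or "").strip().lower()
    if puid == "UNKNOWN" or method not in KNOWN_METHODS:
        method = "none"
    return {
        "puid": puid,
        "mime_type": (row.get("MIME_TYPE") or "").strip() or DEFAULT_MIME,
        "identification_method": method,
    }


def from_siegfried_matches(matches: List[Dict[str, str]]) -> Dict[str, str]:
    """Normalise Siegfried's nested matches (the basis keywords are assumptions)."""
    top = matches[0] if matches else {}
    puid = top.get("id") or "UNKNOWN"
    basis = (top.get("basis") or "").lower()
    if puid == "UNKNOWN":
        method = "none"
    elif "container" in basis:
        # Checked first: a container basis may also mention byte matches.
        method = "container"
    elif "byte match" in basis:
        method = "signature"
    elif "extension" in basis:
        method = "extension"
    else:
        method = "none"
    return {
        "puid": puid,
        "mime_type": top.get("mime") or DEFAULT_MIME,
        "identification_method": method,
    }


def from_tika_mime(mime: str) -> Dict[str, str]:
    """Tika yields only a MIME string, so the PUID stays UNKNOWN."""
    return {
        "puid": "UNKNOWN",
        "mime_type": mime.strip() or DEFAULT_MIME,
        "identification_method": "none",
    }
```

Whatever the source tool, downstream workers then read the same three keys, and a result that cannot be mapped falls back to `UNKNOWN` and `none` rather than to a guess.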

The field specification below is the authoritative shape every identification record must honour. A result that fails any constraint is flagged for manual review and never reaches PREMIS.

Field	Type	Constraint / source	PREMIS target
`object_identifier`	`str`	Path or persistent ID relative to the ingest root	`objectIdentifierValue`
`puid`	`str`	Pattern `(fmt|x-fmt)/\d+` or `UNKNOWN`	`formatRegistryKey`
`mime_type`	`str`	RFC 6838 `type/subtype`; may be `application/octet-stream`	`formatNote`
`identification_method`	`str`	Enum `signature`, `container`, `extension`, `none`	`eventDetail`
`extension_mismatch`	`bool`	`True` when the extension contradicts the signature match	drives review routing
`signature_release`	`str`	The signature-file release the match was made against	`formatRegistryRole` context
`fixity_algorithm`	`str`	`SHA-256` (or stronger); records the digest family	`messageDigestAlgorithm`
`fixity_value`	`str`	Hex digest computed before identification	`messageDigest`
`identification_timestamp`	`str`	ISO 8601 UTC	`eventDateTime`
`preservation_status`	`str`	Enum `pending_validation`, `identified`, `manual_review`	routing key

The two fields that carry the most preservation weight are identification_method and extension_mismatch. A match made from a byte signature is materially stronger than one inferred from a file extension: a renamed .txt that is actually a JPEG must be caught here, not three stages later when a viewer fails. Recording how the identity was resolved — and flagging any disagreement between the declared extension and the observed signature — is what lets preservation planning trust the identifier without re-deriving it.
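One way to make the contract enforceable is a frozen dataclass that validates itself on construction. This is a minimal sketch using only the standard library; the class name `IdentificationRecord` and the `needs_review` helper are illustrative choices, and the checks cover only the constraints listed in the table.

```python
import re
from dataclasses import dataclass

_PUID_PATTERN = re.compile(r"(fmt|x-fmt)/\d+")
_METHODS = {"signature", "container", "extension", "none"}
_STATUSES = {"pending_validation", "identified", "manual_review"}


@dataclass(frozen=True)
class IdentificationRecord:
    """Immutable identification record; invalid values raise ValueError."""

    object_identifier: str
    puid: str
    mime_type: str
    identification_method: str
    extension_mismatch: bool
    signature_release: str
    fixity_algorithm: str
    fixity_value: str
    identification_timestamp: str
    preservation_status: str

    def __post_init__(self) -> None:
        if self.puid != "UNKNOWN" and not _PUID_PATTERN.fullmatch(self.puid):
            raise ValueError(f"invalid PUID: {self.puid!r}")
        if self.identification_method not in _METHODS:
            raise ValueError(f"invalid method: {self.identification_method!r}")
        if self.preservation_status not in _STATUSES:
            raise ValueError(f"invalid status: {self.preservation_status!r}")
        if not re.fullmatch(r"[0-9a-fA-F]+", self.fixity_value):
            raise ValueError("fixity_value must be a hex digest")

    @property
    def needs_review(self) -> bool:
        """True for unknown, extension-only, or mismatching identifications."""
        return (
            self.puid == "UNKNOWN"
            or self.extension_mismatch
            or self.identification_method in {"extension", "none"}
        )
```

Because the instance is frozen, a worker cannot quietly overwrite `identification_method` after the fact; a corrected identification has to be issued as a new record.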

OAIS Functional Mapping and Continuous Validation

The Open Archival Information System reference model explicitly positions format identification within the Preservation Planning and Ingest functional entities. A robust OAIS Reference Model implementation treats format identification not as a discrete, one-time ingest step, but as a continuous validation loop. During initial ingest, the system must extract technical metadata, verify file integrity, and assign a persistent format identifier. Preservation planning then monitors these identifiers against evolving obsolescence curves. If a format drifts into unsupported territory, the architecture triggers migration or emulation workflows. This cyclical approach ensures that the enforcement points defined in Digital Preservation Security Policies are applied at the point of entry and maintained throughout the object’s lifecycle.

Modern preservation systems rely on Format Registry Integration to maintain authoritative mappings between binary signatures and standardized identifiers. The UK National Archives’ PRONOM registry remains the de facto standard, publishing versioned signature-file releases that cover byte signatures, container signatures, and distinctions between format versions. By querying the registry or synchronizing local signature files, engineering teams can automate risk scoring and flag deprecated formats before they compromise ingest pipelines.
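The continuous-validation idea can be sketched as a periodic pass over stored identification records against a local risk snapshot. The `risk_by_puid` mapping, the `threshold` default, and the `risk_reason` key are hypothetical; PRONOM does not publish scores in this shape, so the snapshot would have to come from your own registry-integration layer.

```python
from typing import Dict, Iterable, List


def flag_at_risk(
    records: Iterable[Dict[str, object]],
    risk_by_puid: Dict[str, float],
    threshold: float = 0.5,
) -> List[Dict[str, object]]:
    """Return copies of the records whose format is risky or unscored."""
    flagged: List[Dict[str, object]] = []
    for record in records:
        score = risk_by_puid.get(str(record["puid"]))
        if score is None:
            # An identifier missing from the snapshot is itself a finding.
            flagged.append({**record, "risk_reason": "not in registry snapshot"})
        elif score >= threshold:
            flagged.append({**record, "risk_reason": f"risk score {score:.2f}"})
    return flagged
```

Re-running this whenever the snapshot changes is what turns identification from a one-time ingest step into a loop: the stored PUIDs stay fixed while the risk attached to them is re-evaluated.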

Identification is attempted strongest-first: a byte signature outranks container introspection, which outranks a bare extension. When the declared extension contradicts the signature match, the object is routed to manual review rather than trusted.

The identification decision flow turns a raw bitstream into either a recorded format or a flagged exception:

Format identification: signature and extension analysis feed a PRONOM match decision that routes each object to PREMIS or manual review.
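The strongest-first ordering and the mismatch routing described above can be sketched as a single pure function. The function name and parameters are illustrative, and treating an extension that maps to several PUIDs as unresolved is a design assumption of this sketch, not a rule stated by any registry.

```python
from typing import Dict, Optional, Set


def resolve_identity(
    signature_puid: Optional[str],
    container_puid: Optional[str],
    extension_puids: Set[str],
) -> Dict[str, object]:
    """Pick the strongest available evidence and decide the routing."""
    if signature_puid:
        puid, method = signature_puid, "signature"
    elif container_puid:
        puid, method = container_puid, "container"
    elif len(extension_puids) == 1:
        puid, method = next(iter(extension_puids)), "extension"
    else:
        # No evidence, or an extension shared by several formats.
        puid, method = "UNKNOWN", "none"

    # A mismatch exists when the extension claims formats that exclude the
    # identity established from the bytes.
    mismatch = (
        method in {"signature", "container"}
        and bool(extension_puids)
        and puid not in extension_puids
    )
    needs_review = mismatch or method in {"extension", "none"}
    return {
        "puid": puid,
        "identification_method": method,
        "extension_mismatch": mismatch,
        "preservation_status": "manual_review" if needs_review else "identified",
    }
```

For example, a file whose bytes match `fmt/43` but whose extension is registered only for `x-fmt/111` keeps the signature-derived PUID, is marked as a mismatch, and goes to manual review rather than on to PREMIS.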

Deterministic Python Orchestration and Auditability

Scalable format identification demands deterministic, repeatable automation. Python-based orchestration frameworks have become the standard for scaling identification across terabyte-scale ingest pipelines. By wrapping command-line utilities in asynchronous task queues, engineering teams can process thousands of files concurrently while maintaining strict resource limits. The critical step lies in configuring format identification tools like DROID to use current PRONOM signature files, any custom signatures, and container-level parsing. When integrated into a CI/CD-style preservation pipeline, these tools emit structured reports (CSV, JSON, or XML, depending on the tool) that feed directly into validation engines.

The following production-ready Python pattern demonstrates an asynchronous, audit-focused format identification pipeline. It computes cryptographic checksums, invokes CLI-based identification tools, and emits PREMIS-aligned JSON reports suitable for downstream validation and storage routing.

python

import asyncio
import hashlib
import json
import logging
import subprocess
from pathlib import Path
from typing import Dict, List
from datetime import datetime, timezone

# Configure structured audit logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S%z"
)
logger = logging.getLogger("preservation_format_id")

class PreservationIdentifier:
    """Async pipeline for cryptographic verification and format identification."""
    
    def __init__(self, ingest_dir: Path, output_dir: Path, cli_tool: str = "sf"):
        self.ingest_dir = ingest_dir
        self.output_dir = output_dir
        self.cli_tool = cli_tool
        self.output_dir.mkdir(parents=True, exist_ok=True)

    async def compute_sha256(self, file_path: Path) -> str:
        """Compute SHA-256 digest with memory-efficient chunking, off the event loop."""
        def _digest() -> str:
            sha256 = hashlib.sha256()
            with open(file_path, "rb") as f:
                while chunk := f.read(8192):
                    sha256.update(chunk)
            return sha256.hexdigest()
        return await asyncio.to_thread(_digest)

    async def identify_format(self, file_path: Path) -> Dict:
        """Execute a JSON-capable CLI identifier (e.g. Siegfried `sf -json`)."""
        try:
            cmd = [self.cli_tool, "-json", str(file_path)]
            result = await asyncio.to_thread(
                subprocess.run, cmd, capture_output=True, text=True, timeout=30
            )
            if result.returncode == 0 and result.stdout.strip():
                # Siegfried emits {"files": [{"matches": [{"id": ..., "mime": ...}]}]};
                # normalize into the flat shape consumed by process_file().
                payload = json.loads(result.stdout)
                files = payload.get("files", [])
                matches = files[0].get("matches", []) if files else []
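                top = matches[0] if matches else {}
                # Siegfried reports an unidentified file as a match whose id is
                # "UNKNOWN", so an empty match list is not the only no-match case.
                if top.get("id", "UNKNOWN") != "UNKNOWN":
                    return {
                        "format_id": top["id"],
                        # The mime value can be an empty string; fall back explicitly.
                        "mime_type": top.get("mime") or "application/octet-stream",
                        "status": "identified",
                    }
                return {"format_id": "UNKNOWN", "status": "no_match"}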
            logger.warning(f"CLI tool failed for {file_path.name}: {result.stderr.strip()}")
            return {"format_id": "UNKNOWN", "status": "cli_error"}
        except (json.JSONDecodeError, OSError, subprocess.SubprocessError) as e:
            logger.error(f"Identification exception for {file_path.name}: {e}")
            return {"format_id": "UNKNOWN", "status": "exception"}

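    async def process_file(self, file_path: Path) -> Dict:
        """Orchestrate checksum, identification, and PREMIS-aligned metadata assembly."""
        # Hash first so the digest attests to the bytes that are then identified
        checksum = await self.compute_sha256(file_path)
        format_report = await self.identify_format(file_path)

        # Extract the PRONOM identifier and MIME type, defaulting when absent
        fmt_id = format_report.get("format_id", "UNKNOWN")
        mime = format_report.get("mime_type", "application/octet-stream")

        # Only a clean match proceeds to validation; no_match, cli_error, and
        # exception outcomes are routed to manual review with the reason logged.
        outcome = format_report.get("status", "exception")
        if fmt_id == "UNKNOWN" and outcome == "identified":
            outcome = "no_match"
        identified = outcome == "identified"
        if not identified:
            logger.warning(f"Routing {file_path.name} to manual review: {outcome}")

        return {
            "object_identifier": str(file_path.relative_to(self.ingest_dir)),
            "fixity_algorithm": "SHA-256",
            "fixity_value": checksum,
            "format_registry": "PRONOM",
            "format_identifier": fmt_id,
            "mime_type": mime,
            "identification_timestamp": datetime.now(timezone.utc).isoformat(),
            "preservation_status": "pending_validation" if identified else "manual_review",
        }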

    async def run_pipeline(self) -> List[Dict]:
        """Execute concurrent processing across ingest directory."""
        files = [f for f in self.ingest_dir.rglob("*") if f.is_file()]
        logger.info(f"Starting identification pipeline for {len(files)} objects.")
        
        tasks = [self.process_file(f) for f in files]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
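        # Keep successful records and log every failure, so an object that
        # could not be processed never disappears from the audit trail silently.
        valid_results = []
        for file_path, result in zip(files, results):
            if isinstance(result, dict):
                valid_results.append(result)
            else:
                logger.error(f"Processing failed for {file_path}: {result!r}")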
        report_path = self.output_dir / "format_identification_report.json"
        report_path.write_text(json.dumps(valid_results, indent=2), encoding="utf-8")
        
        logger.info(f"Pipeline complete. Report written to {report_path}")
        return valid_results

if __name__ == "__main__":
    # Example execution (requires `sf` / Siegfried in PATH)
    ingest_path = Path("./ingest_staging")
    output_path = Path("./audit_reports")
    pipeline = PreservationIdentifier(ingest_path, output_path)
    asyncio.run(pipeline.run_pipeline())

Computing the fixity digest before identification is deliberate: the SHA-256 is taken over the exact bytes that were identified, so the recorded identity and the recorded checksum describe the same object with no window in between. Every failure path — a non-zero exit, empty output, malformed JSON, or an OS error — is classified and logged rather than swallowed, so a tool crash surfaces as a manual_review object instead of a silent gap in the report.

Integration Points

Format identification is an early pipeline stage, so most of its value comes from the contracts it honours at each edge. Its input is the byte stream produced by capture and staged by the ingestion-side Metadata Extraction Workflows; when a scanner mislabels a JPEG 2000 master as a plain TIFF, the extension mismatch is caught here rather than committed to the archive. Its primary output — the normalised identification record with its PUID and identification method — is consumed by Format Registry Integration, which resolves that identifier into versioned registry intelligence and an obsolescence risk score.

Downstream, the fixity value and format designation are folded into object provenance by PREMIS Metadata Mapping, which records the identification as a first-class preservation event. The format identity established here also drives storage decisions: a well-architected Long-Term Storage Architecture uses format metadata to route objects across hot, warm, and cold tiers by access frequency and preservation risk, and cross-institutional replication relies on consistent identifiers to confirm that mirrored copies maintain technical parity. Extending signature coverage for rare, legacy, or proprietary file types that fall outside the standard sets is handled in configuring format identification tools like DROID.

Validation and Compliance Rules

Raw identification output is insufficient without rigorous schema validation. Preservation metadata must conform to standardized data dictionaries to ensure interoperability across systems. Mapping identification results into the structures defined by PREMIS Metadata Mapping guarantees that technical characteristics, provenance events, and fixity values are serialized in a machine-actionable format. Automated schema validators — lxml against the PREMIS XSDs, or a JSON Schema validator against the identification-record contract — should be embedded directly into the pipeline to reject malformed payloads before they reach downstream storage tiers.

Each identification emits exactly one PREMIS event of type format identification, cross-referenced to the signature-file release (the agent), the resolved PUID and identification method (the detail), and the fixity digest (the linked object). Under ISO 16363, a trustworthy repository must demonstrate that it has tools or methods to identify the file type of all submitted data objects (clause 4.2.5.1) and that it keeps contemporaneous records of the actions relevant to content acquisition (clause 4.1.8). The recorded identification method, signature-file release, and pre-identification checksum are precisely the evidence an auditor probes for. All identification records, PREMIS events, and hashes should be persisted in a write-once ledger.
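A format-identification event of this kind could be assembled with the standard library's ElementTree, as in the sketch below. The element names follow the PREMIS 3 data dictionary, but the identifier type strings (`UUID`, `signature-file release`, `local`) and the layout of the `eventDetail` text are assumptions of this example, and the output has not been validated against the PREMIS XSD here.

```python
import uuid
import xml.etree.ElementTree as ET
from typing import Dict, Optional

PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)


def _child(parent: ET.Element, tag: str, text: Optional[str] = None) -> ET.Element:
    """Append a namespaced PREMIS child element, optionally with text."""
    node = ET.SubElement(parent, f"{{{PREMIS_NS}}}{tag}")
    if text is not None:
        node.text = text
    return node


def build_identification_event(record: Dict[str, str]) -> ET.Element:
    """Build one PREMIS event element from a normalised identification record."""
    event = ET.Element(f"{{{PREMIS_NS}}}event")

    identifier = _child(event, "eventIdentifier")
    _child(identifier, "eventIdentifierType", "UUID")
    _child(identifier, "eventIdentifierValue", str(uuid.uuid4()))

    _child(event, "eventType", "format identification")
    _child(event, "eventDateTime", record["identification_timestamp"])

    detail = _child(event, "eventDetailInformation")
    _child(
        detail,
        "eventDetail",
        f"puid={record['puid']}; method={record['identification_method']}; "
        f"{record['fixity_algorithm']}={record['fixity_value']}",
    )

    agent = _child(event, "linkingAgentIdentifier")
    _child(agent, "linkingAgentIdentifierType", "signature-file release")
    _child(agent, "linkingAgentIdentifierValue", record["signature_release"])

    linked = _child(event, "linkingObjectIdentifier")
    _child(linked, "linkingObjectIdentifierType", "local")
    _child(linked, "linkingObjectIdentifierValue", record["object_identifier"])
    return event
```

Serialising the result with `ET.tostring(event, encoding="unicode")` gives a fragment that can then be passed to whichever schema validator the pipeline embeds.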

Because format identification directly informs storage routing, replication, and disaster recovery, its metadata must remain valid across decades. As cryptographic standards evolve, identification pipelines should be designed to swap SHA-256 for stronger digests (for example SHA3-512) and adopt post-quantum, hash-based signature schemes without breaking existing record schemas or audit trails, which is why fixity_algorithm is a recorded field rather than an assumed constant.
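A small sketch of what algorithm agility can look like in practice: the digest family is chosen by label and written into the record next to the value, so older SHA-256 records and newer SHA3-512 records can coexist. The label-to-`hashlib` mapping and the function name are choices made for this example.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Record-level labels mapped to hashlib constructor names.
_HASHLIB_NAMES = {
    "SHA-256": "sha256",
    "SHA-512": "sha512",
    "SHA3-512": "sha3_512",
}


def compute_fixity(
    path: Path, algorithm: str = "SHA-256", chunk_size: int = 1 << 20
) -> Dict[str, str]:
    """Hash a file in chunks and return the algorithm label with the digest."""
    if algorithm not in _HASHLIB_NAMES:
        raise ValueError(f"unsupported fixity algorithm: {algorithm!r}")
    digest = hashlib.new(_HASHLIB_NAMES[algorithm])
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {"fixity_algorithm": algorithm, "fixity_value": digest.hexdigest()}
```

Adding a new digest family then means adding one mapping entry; existing records remain verifiable because each one names the algorithm it was produced with.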

Troubleshooting Reference

Error condition	Root cause	Remediation
File identified only by extension, `identification_method = extension`	Truncated transfer removed the magic bytes, or the format has no registered signature	Re-fetch the source bytes; treat extension-only matches as low confidence and route to manual review
`extension_mismatch = True` on a known-good master	A renamed file (e.g. a JPEG saved as `.tif`) or a scanner writing the wrong extension	Trust the signature, not the extension; record the mismatch as a PREMIS event and correct the declared type
`fmt/8` (TIFF v4) conflated with `fmt/10` (TIFF v6)	Signature collision on variant headers or an outdated signature file	Pin and update the signature-file release; cross-check the byte sequence with a hex inspection before committing
Outer wrapper reported instead of the payload	Container introspection disabled, or no container signature file loaded	Enable container parsing by supplying a container signature file (DROID `-Nc <file>`) for objects flagged as wrappers; run a two-pass triage
Ingest surge times out on identification	Unbounded concurrency starves CPU/memory on large or deeply nested files	Bound the task pool and set per-file timeouts; isolate container recursion in memory-limited workers
`UNKNOWN` PUID for an institutionally common format	The signature set lacks coverage for a rare or proprietary type	Author and version-control a custom signature, validate it against a golden corpus, and deploy through CI/CD
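The remediation for ingest surges (a bounded task pool with per-file timeouts) can be sketched with `asyncio.Semaphore` and `asyncio.wait_for`. The helper name `run_bounded`, its defaults, and the shape of the timeout result are illustrative assumptions; `worker` stands for any coroutine function that identifies a single item.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def run_bounded(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    max_concurrency: int = 8,
    per_item_timeout: float = 30.0,
) -> List[Any]:
    """Run worker over items with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def _one(item: Any) -> Any:
        async with semaphore:
            try:
                return await asyncio.wait_for(worker(item), timeout=per_item_timeout)
            except asyncio.TimeoutError:
                # Surface the timeout as data so the item can be routed to review.
                return {"item": item, "status": "timeout"}

    # gather preserves input order, so results line up with items.
    return await asyncio.gather(*(_one(item) for item in items))
```

Returning a timeout marker instead of raising keeps one pathological file from aborting the batch, and the caller can map that marker to a manual-review record.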

Frequently Asked Questions

Why identify by signature instead of file extension?

A file extension is a claim, not evidence. A file renamed from .jpg to .txt still contains JPEG bytes, and a scanner can emit the wrong extension entirely. Signature identification reads the actual magic bytes and container structure, so it catches these disagreements at ingest and records them as an extension_mismatch rather than letting a wrong identity propagate until a viewer fails to render the object years later.

Should the ingest path call the public PRONOM service for every file?

No. Identification runs against locally synchronised signature files, not a live network call per object. Coupling ingest to an external service collapses under load during a capture surge and makes identification non-reproducible. Mirror the signature files, pin the release each match is made against, and update the mirror through a gated pipeline — the same discipline described in Format Registry Integration.
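Pinning a release presupposes that the pipeline can say which mirrored signature file it is using. The sketch below picks the highest-numbered file in a local mirror directory, assuming files follow the `DROID_SignatureFile_V<number>.xml` naming pattern; the function name and the directory layout are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

_RELEASE_PATTERN = re.compile(r"^DROID_SignatureFile_V(\d+)\.xml$")


def latest_local_release(mirror_dir: Path) -> Optional[Tuple[int, Path]]:
    """Return (release number, path) of the newest mirrored signature file."""
    candidates = []
    for path in mirror_dir.iterdir():
        match = _RELEASE_PATTERN.match(path.name)
        if match and path.is_file():
            candidates.append((int(match.group(1)), path))
    # Compare numerically: as strings, "V99" would sort after "V116".
    return max(candidates, key=lambda item: item[0], default=None)
```

The release number returned here is the value that belongs in each record's `signature_release` field, so any match can later be reproduced against the same signature set.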

Why compute the checksum before running identification?

So the recorded fixity digest and the recorded format identity describe the exact same bytes with no window between them. If identification ran first and the file changed, the checksum would no longer attest to what was identified. Hashing first, then identifying, makes the identification event and its linked fixity value a single, coherent piece of provenance.

What happens when a tool returns no match?

The object is not dropped or guessed at. The pipeline records an UNKNOWN PUID with preservation_status = manual_review, logs the reason (no signature match, tool error, or exception), and routes the object to a human queue. This preserves a complete accounting of the batch, with a record for every file, and prevents an unidentified object from being silently promoted to archival storage.

How does identification tie into an auditable chain of custody?

Each identification writes one PREMIS format-identification event that links the resolved PUID, the identification method, the signature-file release, and the pre-identification SHA-256 digest, then persists it to append-only storage. An auditor can later re-hash the stored bytes, confirm the digest matches, and see exactly how — and against which signature release — the recorded format identity was established.
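The auditor's re-hash step can be sketched as follows, assuming the stored record carries `fixity_algorithm` and `fixity_value` as specified in the record contract. The function name is illustrative, and only SHA-256 is handled to keep the example short.

```python
import hashlib
import hmac
from pathlib import Path
from typing import Dict


def verify_fixity(path: Path, record: Dict[str, str]) -> bool:
    """Re-hash the stored bytes and compare with the recorded digest."""
    if record["fixity_algorithm"] != "SHA-256":
        raise ValueError("this sketch only verifies SHA-256 records")
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    # Normalise case before comparing; hex digests are case-insensitive.
    return hmac.compare_digest(digest.hexdigest(), record["fixity_value"].lower())
```

A `False` result means the stored bytes are no longer the bytes that were identified, so the recorded format identity can no longer be relied on for that object.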

Related

Format Registry Integration: resolves the PUID produced here into versioned registry intelligence and obsolescence risk scores.
Configuring format identification tools like DROID: extending PRONOM signature coverage for rare and proprietary formats.
PREMIS Metadata Mapping: folds the format designation and identification event into object provenance.
OAIS Reference Model implementation: the SIP/AIP/DIP workflows that position identification within Ingest and Preservation Planning.
Metadata Extraction Workflows: the capture-side stage that stages files and reconciles device-reported technical characteristics.
Long-Term Storage Architecture: routes objects across storage tiers using the format identity established here.
