OpenXML Audit

Validate OOXML (PPTX/DOCX/XLSX) and ODF files in pure Python — no .NET required.

A Python port of Microsoft's Open XML SDK validation logic. Check whether generated or modified Office files will open cleanly, directly from Python scripts, CI pipelines, or anywhere .NET isn't practical.

Also supports OASIS OpenDocument Format (ODT/ODS/ODP) with staged conformance levels.

Evidence ladder

Validation is the floor tier. Passing ECMA schema checks does not guarantee that a file works in practice: it also has to load in the target app, survive a save, behave correctly at runtime, and ideally match what the app itself would author. openxml-audit organizes this as an evidence ladder (openxml_audit.EvidenceTier):

  1. schema-valid — parses against ECMA/OASIS schemas (this is what openxml-audit validate checks)
  2. loadable — the target app opens without repair
  3. roundtrip-preserved — the app's save does not rewrite the intent
  4. slideshow-verified — runtime behavior matches intent
  5. ui-authored — the app itself produced this structure

Tiers 2–5 are backed by curated corpora of target-app-authored XML. The first corpus lives at docs/pptx_oracle/ — PowerPoint animation/timing, where "schema-valid but silently rewritten" is the dominant failure mode. DOCX and XLSX corpora can follow the same layout when the research starts.

from openxml_audit import EvidenceTier
from openxml_audit.pptx import check_capability

check_capability("pptx.anim.effect.entr.fade", minimum_tier=EvidenceTier.LOADABLE)
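
The minimum_tier argument suggests the tiers are ordered. As a minimal sketch of that idea, and not the library's own EvidenceTier, the stand-in Tier enum below uses IntEnum so that "at least loadable" becomes a plain comparison. The member names other than LOADABLE and the meets helper are assumptions made for illustration.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical stand-in for openxml_audit.EvidenceTier (member names assumed)."""

    SCHEMA_VALID = 1         # parses against ECMA/OASIS schemas
    LOADABLE = 2             # the target app opens without repair
    ROUNDTRIP_PRESERVED = 3  # the app's save does not rewrite the intent
    SLIDESHOW_VERIFIED = 4   # runtime behavior matches intent
    UI_AUTHORED = 5          # the app itself produced this structure


def meets(evidence: Tier, minimum: Tier) -> bool:
    """Return True when the recorded evidence reaches the required tier."""
    return evidence >= minimum


print(meets(Tier.ROUNDTRIP_PRESERVED, Tier.LOADABLE))  # True
print(meets(Tier.SCHEMA_VALID, Tier.LOADABLE))         # False
```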

Features

Why validate?

Libraries that generate Office files routinely produce corrupt output — python-pptx has 12+ open corruption issues, docxtpl has 7, XlsxWriter 25+. These surface as "PowerPoint found a problem" dialogs for end users or silent failures in CI. With AI agents now generating slides and reports, the problem is getting worse.

openxml-audit catches these before your users do — same checks Microsoft's SDK runs, in pure Python.

| Ecosystem | Examples | How openxml-audit helps |
|---|---|---|
| File generators | python-pptx, python-docx, openpyxl, XlsxWriter | Validate output in tests and CI; catch corruption before release |
| Template engines | docxtpl, pptx-template | Jinja2 rendering can break XML structure; validate after render |
| Data pipelines | pandas to_excel, tablib, django-import-export | Assert valid exports in pipeline tests |
| AI/LLM agents | Auto-PPT, GenFilesMCP, Docling | AI-generated Office files are unreliable; validate and retry |
| Government / ODF | Suite Numerique, odfpy | ODF conformance for EU regulatory requirements |
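
The "validate and retry" pattern for AI-generated files can be sketched as a small loop. This is an illustration only: generate_until_valid and both callables are hypothetical names, and the validity check is injected so the sketch runs without any Office library installed.

```python
from typing import Callable


def generate_until_valid(
    generate: Callable[[int], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Call generate(attempt) until is_valid(path) passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        path = generate(attempt)  # hypothetical generator returning a file path
        if is_valid(path):
            return path
    raise RuntimeError(f"no valid file after {max_attempts} attempts")
```

In real use, is_valid could be is_valid_pptx from the Quick Start, and generate would wrap whatever produces the file.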

Performance

Pure Python, but close to .NET — lxml does the heavy XML lifting in C.

| Benchmark | .NET SDK | openxml-audit | Ratio |
|---|---|---|---|
| Cold start (6 files, mixed formats) | 994ms | 1,175ms | 1.2x |
| Warm (798K DOCX) | 46ms | 101ms | 2.2x |
| Warm (1.4MB PPTX) | | 83ms | |
| Warm (114K XLSX) | | 29ms | |

Batch validation supports --parallel N for multiprocess speedup. The pytest plugin uses session-scoped fixtures so schema loading happens once per test run.
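
For callers who drive batches from Python rather than the CLI, the multiprocess idea can be sketched with concurrent.futures. This is not how --parallel is implemented; check_many and looks_like_package are hypothetical, and the ZIP-container test is only a stand-in for a real validation call.

```python
import zipfile
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Dict, Iterable


def looks_like_package(path: str) -> bool:
    """Stand-in check: OOXML and ODF files are ZIP containers."""
    return zipfile.is_zipfile(path)


def check_many(
    paths: Iterable[str],
    check: Callable[[str], bool] = looks_like_package,
    workers: int = 4,
    executor_cls=ProcessPoolExecutor,
) -> Dict[str, bool]:
    """Run check over paths in a worker pool and map each path to its result."""
    paths = list(paths)
    with executor_cls(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(check, paths)))
```

With a process pool, check must be a picklable top-level function, and on platforms that spawn workers the call belongs under an if __name__ == "__main__": guard.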

Installation

pip install openxml-audit

Or install from source:

git clone https://github.com/BramAlkema/openxml-audit.git
cd openxml-audit
pip install -e .

Quick Start

Command Line

# Validate a single file
openxml-audit presentation.pptx

# Validate an OASIS OpenDocument file
openxml-audit document.odt

# Validate with JSON output
openxml-audit presentation.pptx --output json

# Validate with XML output
openxml-audit presentation.pptx --output xml

# Validate all matching files in a directory
openxml-audit ./presentations/ --recursive

# Validate against a specific Office version
openxml-audit presentation.pptx --format Office2007

# Limit maximum errors reported
openxml-audit presentation.pptx --max-errors 10
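
When the CLI is called from a script, the flags above can be assembled programmatically. The sketch below assumes the JSON report is written to stdout and makes no assumption about its schema or about exit codes; build_command and run_audit are hypothetical helper names.

```python
import json
import subprocess
from typing import List, Optional


def build_command(path: str, output: str = "json",
                  max_errors: Optional[int] = None) -> List[str]:
    """Assemble an openxml-audit invocation from the flags shown above."""
    cmd = ["openxml-audit", path, "--output", output]
    if max_errors is not None:
        cmd += ["--max-errors", str(max_errors)]
    return cmd


def run_audit(path: str, max_errors: Optional[int] = None):
    """Run the CLI and parse stdout as JSON (assumes the report goes to stdout)."""
    completed = subprocess.run(
        build_command(path, "json", max_errors),
        capture_output=True, text=True, check=False,
    )
    return completed.returncode, json.loads(completed.stdout)
```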

Python API

from openxml_audit import validate_pptx, is_valid_pptx, OpenXmlValidator

# Quick check
if is_valid_pptx("presentation.pptx"):
    print("File is valid!")

# Detailed validation
result = validate_pptx("presentation.pptx")
if not result.is_valid:
    print(f"Found {result.error_count} errors, {result.warning_count} warnings")
    for error in result.errors:
        print(f"  [{error.severity.value}] {error.description}")

# With custom options
from openxml_audit import FileFormat

validator = OpenXmlValidator(
    file_format=FileFormat.OFFICE_2019,
    max_errors=100,
    schema_validation=True,
    semantic_validation=True,
)
result = validator.validate("presentation.pptx")

Documentation

ODF Validation Depth

ODF validation is staged by explicit conformance level.

| Level | Includes | Does not include |
|---|---|---|
| foundation | package/manifest integrity + XML parse sweep | Relax NG schema-core routing, semantic-core rules, security-core checks |
| schema-core | foundation + Relax NG validation for routed XML members | semantic-core and security-core checks |
| semantic-core | foundation + semantic-core rule families (ODFSEM*) | Relax NG schema-core routing, security-core checks |
| security-core | semantic-core + signature/encryption structural checks (ODFSEC*) | full cryptographic trust guarantees unless crypto verification backend is configured |

Rule registry and policy references:

CLI Conformance Selection

Use --odf-level when validating ODF files:

# foundation
openxml-audit file.odt --validator odf --odf-level foundation

# semantic-core (default)
openxml-audit file.odt --validator odf --odf-level semantic-core

# security-core
openxml-audit file.odt --validator odf --odf-level security-core

Schema-core uses bundled OASIS Relax NG schemas by default:

openxml-audit file.odt \
  --validator odf \
  --odf-level schema-core

Pass --odf-schema-routes only when you want to override or extend routing. It accepts either shape:

Security-core crypto verification hook:

openxml-audit file.odt \
  --validator odf \
  --odf-level security-core \
  --odf-verify-cryptography

API Conformance Selection

from openxml_audit import FileFormat
from openxml_audit.odf import OdfValidator

# foundation
foundation = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=False,
    semantic_validation=False,
    security_validation=False,
)

# schema-core (bundled schemas by default)
schema_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=False,
    security_validation=False,
    relaxng_validation=True,
)

# schema-core with custom routes
schema_core_custom = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=False,
    security_validation=False,
    relaxng_validation=True,
    schema_routes={"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}},
)

# semantic-core
semantic_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=True,
    security_validation=False,
)

# security-core
security_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=True,
    security_validation=True,
    verify_cryptography=False,  # set True when crypto backend is available
)
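
If the level comes from configuration, the constructor calls above can be collapsed into a lookup table. This is a convenience sketch, not part of the package: LEVEL_FLAGS and flags_for are hypothetical names, and the flag values simply mirror the examples above.

```python
# Keyword flags per conformance level, copied from the constructor calls above.
LEVEL_FLAGS = {
    "foundation": dict(schema_validation=False, semantic_validation=False,
                       security_validation=False),
    "schema-core": dict(schema_validation=True, semantic_validation=False,
                        security_validation=False, relaxng_validation=True),
    "semantic-core": dict(schema_validation=True, semantic_validation=True,
                          security_validation=False),
    "security-core": dict(schema_validation=True, semantic_validation=True,
                          security_validation=True, verify_cryptography=False),
}


def flags_for(level: str) -> dict:
    """Return a fresh copy of the flags for a level, or raise on unknown names."""
    try:
        return dict(LEVEL_FLAGS[level])
    except KeyError:
        expected = ", ".join(sorted(LEVEL_FLAGS))
        raise ValueError(f"unknown ODF level {level!r}; expected one of: {expected}") from None
```

Usage would then look like OdfValidator(file_format=FileFormat.ODF_1_3, **flags_for("schema-core")).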

ODF Benchmarking

# Benchmark an ODF file (5 iterations by default)
python scripts/odf/benchmark_validation.py document.odt

# More iterations, with security checks
python scripts/odf/benchmark_validation.py document.odt --iterations 20 --security

# Foundation-only (skip schema/semantic)
python scripts/odf/benchmark_validation.py document.odt --no-schema --no-semantic

Reports avg/min/max/P95 with per-phase breakdown (package_structure, xml_parse, schema, semantic, security).

OOXML benchmark: python scripts/benchmark_validation.py presentation.pptx
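
To read the avg/min/max/P95 figures from the ODF benchmark, it helps to see how such a summary is typically derived from raw timings. The sketch below uses the nearest-rank definition of P95, which is an assumption; the benchmark script may compute it differently. summarize is a hypothetical name.

```python
import statistics
from typing import Dict, List


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Summarize timing samples as avg/min/max/p95 (nearest-rank percentile)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest-rank P95: smallest 1-based rank covering 95% of the samples,
    # computed with integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }
```

With only five iterations (the default), P95 under this definition equals the slowest run, so more iterations give a more meaningful tail figure.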

Known ODF Limitations

ODF Reference Calibration

Compare Python results against external validators (ODF Toolkit, OPF) using the scripts in scripts/odf/:

| Script | Purpose |
|---|---|
| run_reference_validators.py | Run Python + external validators on pinned corpus |
| compare_reference_results.py | Diff results into mismatch families |
| check_reference_drift.py | Enforce drift policy against baseline |
| bootstrap_reference_validators.py | Auto-build external validator commands |

CI workflow: .github/workflows/odf-reference-calibration.yml — builds ODF Toolkit and OPF at runtime via Maven/Docker.

Set command templates via --odf-toolkit-cmd / --opf-cmd or env vars ODF_TOOLKIT_CMD / OPF_ODF_VALIDATOR_CMD. Placeholders: {file}, {file_dir}, {file_name}, {file_stem}, {file_suffix}.
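
A rough picture of how such a template expands, as a sketch only: the meaning given to each placeholder (full path, parent directory, base name, stem, suffix including the dot) is an assumption based on the names, and expand_command and validator.jar are hypothetical. PurePosixPath keeps the example deterministic; native code would use Path.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def expand_command(template: str, file_path: str) -> List[str]:
    """Split a command template into argv tokens, then fill in the placeholders."""
    path = PurePosixPath(file_path)
    values = {
        "file": str(path),
        "file_dir": str(path.parent),
        "file_name": path.name,
        "file_stem": path.stem,
        "file_suffix": path.suffix,  # assumed to include the leading dot
    }
    # Splitting before substitution keeps paths with spaces in one argument.
    return [token.format(**values) for token in shlex.split(template)]


print(expand_command("java -jar validator.jar {file}", "/data/my docs/report.odt"))
# ['java', '-jar', 'validator.jar', '/data/my docs/report.odt']
```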

Open XML SDK (Standalone)

Run the .NET SDK validator separately (requires .NET SDK 8.x or Docker):

dotnet run --project scripts/sdk_check/sdk_check.csproj -- /path/to/file.pptx
dotnet run --project scripts/sdk_compare/OpenXmlSdkValidator.csproj -- /path/to/file.pptx  # JSON

# Via Docker
docker run --rm -v "$PWD:/work" -w /work mcr.microsoft.com/dotnet/sdk:8.0 \
  dotnet run --project scripts/sdk_check/sdk_check.csproj -- /work/path/to/file.pptx

Supports PPTX/DOCX/XLSX and variants. Configured for Office 2019.

GitHub Action

Validate Office files in your PRs automatically:

# .github/workflows/validate-office-files.yml
name: Validate Office Files
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: BramAlkema/openxml-audit@main
        with:
          changed-only: "true"  # only validate files changed in the PR

Options:

| Input | Default | Description |
|---|---|---|
| path | . | Directory or file to validate |
| format | Office2019 | Office version to validate against |
| changed-only | false | Only validate files changed in the PR |
| recursive | true | Search subdirectories |
| max-errors | 100 | Maximum errors per file |

Pre-commit Hook

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/BramAlkema/openxml-audit
    rev: v0.5.0
    hooks:
      - id: openxml-audit

Validates any .pptx, .docx, .xlsx, .odt, .ods, or .odp file before commit.

Examples

Ready-to-run scripts in examples/:

| Script | Description |
|---|---|
| validate_python_pptx.py | Generate a PPTX with python-pptx and validate it |
| validate_openpyxl.py | Generate an XLSX with openpyxl and validate it |
| validate_odf.py | Validate an ODF file (ODT/ODS/ODP) |
| ci_validation.py | Validate all Office files in a directory (CI-ready, OOXML + ODF) |

CI Workflows

| Workflow | Trigger | Purpose |
|---|---|---|
| parity-gate.yml | PR / push | Enforce OOXML parity + perf budget against SDK baseline |
| calibrate-parity.yml | Weekly / dispatch | Calibrate against Open XML SDK upstream |
| sdk-update.yml | Quarterly / dispatch | Track upstream SDK version changes |
| odf-reference-calibration.yml | Dispatch | Run ODF reference validators and drift checks |
| validate-inputs.yml | Push to inputs/ | Validate dropped files with both Python and .NET SDK |
| release.yml | Tag push (v*) | Build and publish to PyPI |
| pages.yml | Push to main | Deploy documentation site |

OOXML parity details: docs/parity_contract.md. ODF reference contract: docs/odf_validation_contract.md.

pytest Plugin

Fixtures are registered automatically — just pip install openxml-audit and use them:

def test_my_presentation(assert_valid_pptx, tmp_path):
    output = tmp_path / "output.pptx"
    generate_pptx(output)
    assert_valid_pptx(output)  # fails with detailed errors if invalid

def test_my_document(assert_valid_docx, tmp_path):
    output = tmp_path / "output.docx"
    generate_docx(output)
    assert_valid_docx(output)

def test_my_spreadsheet(assert_valid_xlsx, tmp_path):
    output = tmp_path / "output.xlsx"
    generate_xlsx(output)
    assert_valid_xlsx(output)

def test_odf_file(assert_valid_odf, tmp_path):
    output = tmp_path / "output.odt"
    generate_odt(output)
    assert_valid_odf(output)

CLI options:

# Validate against a specific Office version
pytest --openxml-format Office2007

# Limit errors collected per file
pytest --openxml-max-errors 50

Available fixtures: openxml_validator, assert_valid_pptx, assert_valid_docx, assert_valid_xlsx, assert_valid_odf.
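
Outside pytest, the "fails with detailed errors" behavior can be approximated with a plain helper. This is a sketch, not the plugin's code: assert_no_errors is a hypothetical name, and it only relies on the ValidationResult attributes documented in the API Reference (is_valid, file_path, error_count, errors, severity, description).

```python
def assert_no_errors(result) -> None:
    """Raise AssertionError listing each issue when result.is_valid is false."""
    if result.is_valid:
        return
    lines = [f"{result.file_path}: {result.error_count} error(s)"]
    for err in result.errors:
        # Same formatting as the Quick Start example.
        lines.append(f"  [{err.severity.value}] {err.description}")
    raise AssertionError("\n".join(lines))
```

Any object with those attributes works, so assert_no_errors(validate_pptx("output.pptx")) is the intended call shape.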

Integration Helpers

# Context manager
from openxml_audit import validation_context

with validation_context(raise_on_invalid=True) as validator:
    result = validator.validate("presentation.pptx")

# Decorator: validate after save
from pptx import Presentation  # python-pptx, needed by create_presentation below
from openxml_audit import validate_on_save

@validate_on_save(raise_on_invalid=True)
def create_presentation(output_path: str) -> None:
    Presentation().save(output_path)

# Decorator — require valid input
from openxml_audit import require_valid_pptx

@require_valid_pptx()
def process(input_path: str) -> dict: ...
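
For readers curious how a validate-after-save decorator can work in principle, here is a minimal sketch. It is not the library's implementation: validate_after is a hypothetical name, the check is injected rather than imported, and raising ValueError is an assumption.

```python
import functools
from typing import Callable


def validate_after(check: Callable[[str], bool]):
    """Decorator factory: run check(path) on the first argument after the call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(output_path, *args, **kwargs):
            result = func(output_path, *args, **kwargs)  # let the function save first
            if not check(output_path):
                raise ValueError(f"{output_path} failed validation")
            return result
        return wrapper
    return decorator
```

The wrapped function keeps its name and docstring through functools.wraps, and its return value passes through unchanged when the check succeeds.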

API Reference

OpenXmlValidator / OdfValidator

OpenXmlValidator(file_format=FileFormat.OFFICE_2019, max_errors=1000,
                 schema_validation=True, semantic_validation=True)

OdfValidator(file_format=FileFormat.ODF_1_3, max_errors=1000,
             schema_validation=True, semantic_validation=True,
             security_validation=False, strict=True)

Both expose:

- validate(path) -> ValidationResult
- validate_with_timings(path) -> (ValidationResult, dict[str, float])
- is_valid(path) -> bool

ValidationResult

| Property | Type | Description |
|---|---|---|
| is_valid | bool | No ERROR-severity issues |
| errors | list[ValidationError] | All errors and warnings |
| error_count / warning_count | int | Counts by severity |
| file_path | str | Validated file path |
| file_format | FileFormat | Version validated against |

ValidationError

| Property | Type | Description |
|---|---|---|
| error_type | ValidationErrorType | PACKAGE, BINARY, SCHEMA, SEMANTIC, RELATIONSHIP, MARKUP_COMPATIBILITY |
| severity | ValidationSeverity | ERROR, WARNING, INFO |
| description | str | Human-readable message |
| part_uri | str \| None | Affected part URI |
| path | str \| None | XPath to affected element |
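
Because each error carries a part_uri, a long report can be grouped by the part it affects. The helper below is a sketch with a hypothetical name (group_by_part); the "<package>" label for errors without a part URI is an assumption made for display only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_part(errors: Iterable) -> Dict[str, List[str]]:
    """Group error descriptions by affected part URI, preserving report order."""
    grouped = defaultdict(list)
    for err in errors:
        key = err.part_uri or "<package>"  # part_uri may be None
        grouped[key].append(err.description)
    return dict(grouped)
```

Calling group_by_part(result.errors) on a ValidationResult gives one entry per part, which makes it easier to see whether failures cluster in a single slide or sheet.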

Supported Formats

| OOXML | ODF |
|---|---|
| OFFICE_2007 through MICROSOFT_365 (default: OFFICE_2019) | ODF_1_2, ODF_1_3 (default: ODF_1_3) |

Convenience Functions

Works Well With

These libraries create Office files — openxml-audit checks them:

| Library | Format | Description |
|---|---|---|
| python-pptx | PPTX | Create and update PowerPoint files |
| python-docx | DOCX | Create and update Word files |
| openpyxl | XLSX | Create and update Excel files |

from pptx import Presentation
from openxml_audit import validate_pptx

Presentation().save("output.pptx")

result = validate_pptx("output.pptx")
if not result.is_valid:
    print(f"{result.error_count} issues found")

Contributing

Contributions are welcome! See CONTRIBUTING.md for dev setup and guidelines.

Looking for Maintainers

This project is actively looking for co-maintainers — especially people working with:

If you're interested, open an issue or reach out.

Funding

If this project saves you time, consider sponsoring its development:

GitHub Sponsors

Changelog

See CHANGELOG.md for a full list of changes by version.

License

MIT

Acknowledgments

Based on the validation logic from Microsoft's Open XML SDK for .NET.