Validate OOXML (PPTX/DOCX/XLSX) and ODF files in pure Python — no .NET required.
A Python port of Microsoft's Open XML SDK validation logic. Check whether generated or modified Office files will open cleanly, directly from Python scripts, CI pipelines, or anywhere .NET isn't practical.
Also supports OASIS OpenDocument Format (ODT/ODS/ODP) with staged conformance levels.
Validation is the floor tier. Whether a file survives depends on more than ECMA legality — it also has to load in the target app, survive a save, behave correctly at runtime, and ideally match what the app itself would author. openxml-audit organizes this as an evidence ladder (openxml_audit.EvidenceTier):
1. schema-valid: parses against ECMA/OASIS schemas (this is what openxml-audit validate checks)
2. loadable: the target app opens without repair
3. roundtrip-preserved: the app's save does not rewrite the intent
4. slideshow-verified: runtime behavior matches intent
5. ui-authored: the app itself produced this structure

Tiers 2–5 are backed by curated corpora of target-app-authored XML. The first corpus lives at docs/pptx_oracle/ and covers PowerPoint animation/timing, where "schema-valid but silently rewritten" is the dominant failure mode. DOCX and XLSX corpora can follow the same layout when the research starts.
from openxml_audit import EvidenceTier
from openxml_audit.pptx import check_capability
check_capability("pptx.anim.effect.entr.fade", minimum_tier=EvidenceTier.LOADABLE)
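The minimum_tier argument suggests that the tiers are ordered. The following is a minimal standalone sketch of that ordering, assuming an integer-backed enum. The names Tier and meets_minimum are hypothetical stand-ins, and the real openxml_audit.EvidenceTier may be defined differently (only the LOADABLE member is confirmed by the snippet above).

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical stand-in for the evidence ladder; not the library's enum."""
    SCHEMA_VALID = 1
    LOADABLE = 2
    ROUNDTRIP_PRESERVED = 3
    SLIDESHOW_VERIFIED = 4
    UI_AUTHORED = 5


def meets_minimum(observed: Tier, minimum: Tier) -> bool:
    """Return True when the observed evidence is at least as strong as required."""
    return observed >= minimum


# A roundtrip-preserved capability satisfies a "loadable" requirement.
print(meets_minimum(Tier.ROUNDTRIP_PRESERVED, Tier.LOADABLE))
# A capability that is only schema-valid does not.
print(meets_minimum(Tier.SCHEMA_VALID, Tier.LOADABLE))
```

Using an ordered enum keeps "at least this tier" checks to a single comparison rather than a lookup table.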
- Curated corpora (docs/pptx_oracle/) verify loadability, roundtrip preservation, and runtime behavior above schema validation, for features like animation/timing where "schema-valid" isn't enough
- assert_valid_pptx, assert_valid_docx, assert_valid_xlsx, assert_valid_odf pytest fixtures, zero config

Libraries that generate Office files routinely produce corrupt output: python-pptx has 12+ open corruption issues, docxtpl has 7, and XlsxWriter has 25+. These surface as "PowerPoint found a problem" dialogs for end users or silent failures in CI. With AI agents now generating slides and reports, the problem is getting worse.
openxml-audit catches these before your users do — same checks Microsoft's SDK runs, in pure Python.
| Ecosystem | Examples | How openxml-audit helps |
|---|---|---|
| File generators | python-pptx, python-docx, openpyxl, XlsxWriter | Validate output in tests and CI — catch corruption before release |
| Template engines | docxtpl, pptx-template | Jinja2 rendering can break XML structure — validate after render |
| Data pipelines | pandas to_excel, tablib, django-import-export | Assert valid exports in pipeline tests |
| AI/LLM agents | Auto-PPT, GenFilesMCP, Docling | AI-generated Office files are unreliable — validate and retry |
| Government / ODF | Suite Numerique, odfpy | ODF conformance for EU regulatory requirements |
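The "validate and retry" pattern from the AI/LLM row can be sketched as a small loop. This is an illustration under stated assumptions: generate_until_valid and both callables are hypothetical names, and the stand-in functions below only simulate a generator and a validator.

```python
from typing import Callable, Optional


def generate_until_valid(
    generate: Callable[[int], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call generate(attempt) until is_valid(path) passes; None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        path = generate(attempt)
        if is_valid(path):
            return path
    return None


# Stand-ins for demonstration: the second attempt "produces" a valid file.
def fake_generate(attempt: int) -> str:
    return f"deck_v{attempt}.pptx"


def fake_is_valid(path: str) -> bool:
    return path == "deck_v2.pptx"


print(generate_until_valid(fake_generate, fake_is_valid))
```

In a real pipeline the is_valid argument could be is_valid_pptx from openxml_audit, and generate would wrap whatever agent or template call writes the file.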
Pure Python, but close to .NET in speed: lxml does the heavy XML lifting in C.
| Benchmark | .NET SDK | openxml-audit | Ratio |
|---|---|---|---|
| Cold start (6 files, mixed formats) | 994ms | 1,175ms | 1.2x |
| Warm (798K DOCX) | 46ms | 101ms | 2.2x |
| Warm (1.4MB PPTX) | — | 83ms | — |
| Warm (114K XLSX) | — | 29ms | — |
Batch validation supports --parallel N for multiprocess speedup. The pytest plugin uses session-scoped fixtures so schema loading happens once per test run.
pip install openxml-audit
Or install from source:
git clone https://github.com/BramAlkema/openxml-audit.git
cd openxml-audit
pip install -e .
# Validate a single file
openxml-audit presentation.pptx
# Validate an OASIS OpenDocument file
openxml-audit document.odt
# Validate with JSON output
openxml-audit presentation.pptx --output json
# Validate with XML output
openxml-audit presentation.pptx --output xml
# Validate all matching files in a directory
openxml-audit ./presentations/ --recursive
# Validate against a specific Office version
openxml-audit presentation.pptx --format Office2007
# Limit maximum errors reported
openxml-audit presentation.pptx --max-errors 10
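When the CLI is driven from another Python script, it can help to assemble the argument list in one place. The sketch below uses only the flags shown above; build_command and its parameter names are hypothetical helpers, not part of openxml-audit.

```python
from typing import List, Optional


def build_command(
    target: str,
    output: Optional[str] = None,
    office_format: Optional[str] = None,
    max_errors: Optional[int] = None,
    recursive: bool = False,
) -> List[str]:
    """Build an argv list for the openxml-audit CLI from the documented flags."""
    argv = ["openxml-audit", target]
    if output is not None:
        argv += ["--output", output]
    if office_format is not None:
        argv += ["--format", office_format]
    if max_errors is not None:
        argv += ["--max-errors", str(max_errors)]
    if recursive:
        argv.append("--recursive")
    return argv


print(build_command("presentation.pptx", output="json", max_errors=10))
```

The resulting list can be passed to subprocess.run. This README does not state the CLI's exit codes or JSON layout, so check those before relying on either.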
from openxml_audit import validate_pptx, is_valid_pptx, OpenXmlValidator
# Quick check
if is_valid_pptx("presentation.pptx"):
    print("File is valid!")
# Detailed validation
result = validate_pptx("presentation.pptx")
if not result.is_valid:
    print(f"Found {result.error_count} errors, {result.warning_count} warnings")
    for error in result.errors:
        print(f"  [{error.severity.value}] {error.description}")
# With custom options
from openxml_audit import FileFormat
validator = OpenXmlValidator(
    file_format=FileFormat.OFFICE_2019,
    max_errors=100,
    schema_validation=True,
    semantic_validation=True,
)
result = validator.validate("presentation.pptx")
ODF validation is staged by explicit conformance level.
| Level | Includes | Does not include |
|---|---|---|
| foundation | package/manifest integrity + XML parse sweep | Relax NG schema-core routing, semantic-core rules, security-core checks |
| schema-core | foundation + Relax NG validation for routed XML members | semantic-core and security-core checks |
| semantic-core | foundation + semantic-core rule families (ODFSEM*) | Relax NG schema-core routing, security-core checks |
| security-core | semantic-core + signature/encryption structural checks (ODFSEC*) | full cryptographic trust guarantees unless crypto verification backend is configured |
Rule registry and policy references:
- openxml_audit.odf.get_odf_semantic_rules()
- docs/odf_security_policy.md
- docs/odf_validation_contract.md

Use --odf-level when validating ODF files:
# foundation
openxml-audit file.odt --validator odf --odf-level foundation
# semantic-core (default)
openxml-audit file.odt --validator odf --odf-level semantic-core
# security-core
openxml-audit file.odt --validator odf --odf-level security-core
Schema-core uses bundled OASIS Relax NG schemas by default:
openxml-audit file.odt \
--validator odf \
--odf-level schema-core
Pass --odf-schema-routes only when you want to override or extend routing. It accepts either shape:

- {"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}}
- {"content.xml": "schemas/odf/content.rng"}

Security-core crypto verification hook:
openxml-audit file.odt \
--validator odf \
--odf-level security-core \
--odf-verify-cryptography
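The two --odf-schema-routes shapes described earlier (keyed by ODF version, or a flat member-to-schema mapping) can be brought into one form before use. This is a sketch under assumptions: normalize_routes is a hypothetical helper, and treating a flat mapping as applying to a default version is a guess at the intent, not documented behavior.

```python
from typing import Dict, Union

Routes = Dict[str, Union[str, Dict[str, str]]]


def normalize_routes(routes: Routes, default_version: str = "1.3") -> Dict[str, Dict[str, str]]:
    """Return routes keyed by ODF version, accepting either input shape."""
    values = list(routes.values())
    if all(isinstance(value, dict) for value in values):
        # Already version-keyed: copy so the caller's mapping is not mutated.
        return {version: dict(members) for version, members in routes.items()}
    if all(isinstance(value, str) for value in values):
        # Flat shape: assume it applies to the default version (an assumption).
        return {default_version: dict(routes)}
    raise ValueError("mixed route shapes are not supported in this sketch")


print(normalize_routes({"content.xml": "schemas/odf/content.rng"}))
print(normalize_routes({"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}}))
```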
from openxml_audit import FileFormat
from openxml_audit.odf import OdfValidator
# foundation
foundation = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=False,
    semantic_validation=False,
    security_validation=False,
)
# schema-core (bundled schemas by default)
schema_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=False,
    security_validation=False,
    relaxng_validation=True,
)
# schema-core with custom routes
schema_core_custom = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=False,
    security_validation=False,
    relaxng_validation=True,
    schema_routes={"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}},
)
# semantic-core
semantic_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=True,
    security_validation=False,
)
# security-core
security_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=True,
    security_validation=True,
    verify_cryptography=False,  # set True when crypto backend is available
)
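The constructor calls above repeat a fixed flag combination per conformance level. One way to keep that mapping in a single place is a lookup table, sketched below. The flag values are copied from the examples above; ODF_LEVEL_FLAGS and flags_for_level are hypothetical names, not part of openxml_audit.

```python
from typing import Any, Dict

# Keyword arguments per level, mirroring the OdfValidator examples above.
ODF_LEVEL_FLAGS: Dict[str, Dict[str, Any]] = {
    "foundation": {
        "schema_validation": False,
        "semantic_validation": False,
        "security_validation": False,
    },
    "schema-core": {
        "schema_validation": True,
        "semantic_validation": False,
        "security_validation": False,
        "relaxng_validation": True,
    },
    "semantic-core": {
        "schema_validation": True,
        "semantic_validation": True,
        "security_validation": False,
    },
    "security-core": {
        "schema_validation": True,
        "semantic_validation": True,
        "security_validation": True,
        "verify_cryptography": False,
    },
}


def flags_for_level(level: str) -> Dict[str, Any]:
    """Return a copy of the keyword arguments for the given ODF level name."""
    try:
        return dict(ODF_LEVEL_FLAGS[level])
    except KeyError:
        raise ValueError(f"unknown ODF level: {level!r}") from None


print(flags_for_level("schema-core"))
```

The result can then be unpacked into the constructor, for example OdfValidator(file_format=FileFormat.ODF_1_3, **flags_for_level("schema-core")).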
# Benchmark an ODF file (5 iterations by default)
python scripts/odf/benchmark_validation.py document.odt
# More iterations, with security checks
python scripts/odf/benchmark_validation.py document.odt --iterations 20 --security
# Foundation-only (skip schema/semantic)
python scripts/odf/benchmark_validation.py document.odt --no-schema --no-semantic
Reports avg/min/max/P95 with per-phase breakdown (package_structure, xml_parse, schema, semantic, security).
OOXML benchmark: python scripts/benchmark_validation.py presentation.pptx
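For readers who want to reproduce the avg/min/max/P95 summary on their own timing samples, here is a minimal sketch. It assumes the nearest-rank definition of P95; the benchmark scripts may compute the percentile differently, and summarize is a hypothetical helper.

```python
import statistics
from typing import Dict, Sequence


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize timing samples as avg/min/max/p95 (nearest-rank percentile)."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    # Nearest-rank: smallest 1-based rank covering 95% of samples, via integer ceil.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }


print(summarize([10.0, 12.0, 11.0, 30.0, 13.0]))
```

With only five iterations (the default mentioned above), nearest-rank P95 equals the maximum, so more iterations are needed for P95 to carry separate information.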
- Use schema_routes to extend or override routing for additional XML parts.
- --odf-level only applies when the selected/auto-detected validator is ODF.

Compare Python results against external validators (ODF Toolkit, OPF) using the scripts in scripts/odf/:
| Script | Purpose |
|---|---|
| run_reference_validators.py | Run Python + external validators on pinned corpus |
| compare_reference_results.py | Diff results into mismatch families |
| check_reference_drift.py | Enforce drift policy against baseline |
| bootstrap_reference_validators.py | Auto-build external validator commands |
CI workflow: .github/workflows/odf-reference-calibration.yml — builds ODF Toolkit and OPF at runtime via Maven/Docker.
Set command templates via --odf-toolkit-cmd / --opf-cmd or env vars ODF_TOOLKIT_CMD / OPF_ODF_VALIDATOR_CMD. Placeholders: {file}, {file_dir}, {file_name}, {file_stem}, {file_suffix}.
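To illustrate how such a command template might expand, the sketch below fills the five placeholders from a file path. It is not the scripts' actual implementation: expand_template and validator.jar are hypothetical, and whether {file_suffix} includes the leading dot is an assumption (pathlib's suffix does).

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def expand_template(template: str, file_path: str) -> List[str]:
    """Expand {file}, {file_dir}, {file_name}, {file_stem}, {file_suffix} into argv."""
    path = PurePosixPath(file_path)
    values = {
        "file": str(path),
        "file_dir": str(path.parent),
        "file_name": path.name,
        "file_stem": path.stem,
        "file_suffix": path.suffix,  # includes the leading dot, e.g. ".odt"
    }
    # Split first, then substitute per token, so paths with spaces stay one argument.
    return [token.format(**values) for token in shlex.split(template)]


# validator.jar is a made-up name used only for this demonstration.
print(expand_template("java -jar validator.jar {file}", "/corpus/my report.odt"))
```

Splitting the template before substitution avoids quoting problems when a corpus path contains spaces.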
Run the .NET SDK validator separately (requires .NET SDK 8.x or Docker):
dotnet run --project scripts/sdk_check/sdk_check.csproj -- /path/to/file.pptx
dotnet run --project scripts/sdk_compare/OpenXmlSdkValidator.csproj -- /path/to/file.pptx # JSON
# Via Docker
docker run --rm -v "$PWD:/work" -w /work mcr.microsoft.com/dotnet/sdk:8.0 \
dotnet run --project scripts/sdk_check/sdk_check.csproj -- /work/path/to/file.pptx
Supports PPTX/DOCX/XLSX and variants. Configured for Office 2019.
Validate Office files in your PRs automatically:
# .github/workflows/validate-office-files.yml
name: Validate Office Files
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: BramAlkema/openxml-audit@main
        with:
          changed-only: "true"  # only validate files changed in the PR
Options:
| Input | Default | Description |
|---|---|---|
| path | . | Directory or file to validate |
| format | Office2019 | Office version to validate against |
| changed-only | false | Only validate files changed in the PR |
| recursive | true | Search subdirectories |
| max-errors | 100 | Maximum errors per file |
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/BramAlkema/openxml-audit
    rev: v0.5.0
    hooks:
      - id: openxml-audit
Validates any .pptx, .docx, .xlsx, .odt, .ods, or .odp file before commit.
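Selecting which paths to validate by extension can be sketched as a simple filter. The suffix set is taken from the sentence above; select_office_files is a hypothetical helper, and matching case-insensitively is an assumption rather than documented hook behavior.

```python
from pathlib import PurePath
from typing import Iterable, List

# Extensions listed for the pre-commit hook above.
OFFICE_SUFFIXES = {".pptx", ".docx", ".xlsx", ".odt", ".ods", ".odp"}


def select_office_files(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose extension is one of the supported Office formats."""
    return [p for p in paths if PurePath(p).suffix.lower() in OFFICE_SUFFIXES]


print(select_office_files(["README.md", "slides/Deck.PPTX", "data/report.ods"]))
```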
Ready-to-run scripts in examples/:
| Script | Description |
|---|---|
| validate_python_pptx.py | Generate a PPTX with python-pptx and validate it |
| validate_openpyxl.py | Generate an XLSX with openpyxl and validate it |
| validate_odf.py | Validate an ODF file (ODT/ODS/ODP) |
| ci_validation.py | Validate all Office files in a directory (CI-ready, OOXML + ODF) |
| Workflow | Trigger | Purpose |
|---|---|---|
| parity-gate.yml | PR / push | Enforce OOXML parity + perf budget against SDK baseline |
| calibrate-parity.yml | Weekly / dispatch | Calibrate against Open XML SDK upstream |
| sdk-update.yml | Quarterly / dispatch | Track upstream SDK version changes |
| odf-reference-calibration.yml | Dispatch | Run ODF reference validators and drift checks |
| validate-inputs.yml | Push to inputs/ | Validate dropped files with both Python and .NET SDK |
| release.yml | Tag push (v*) | Build and publish to PyPI |
| pages.yml | Push to main | Deploy documentation site |
OOXML parity details: docs/parity_contract.md. ODF reference contract: docs/odf_validation_contract.md.
Fixtures are registered automatically — just pip install openxml-audit and use them:
def test_my_presentation(assert_valid_pptx, tmp_path):
    output = tmp_path / "output.pptx"
    generate_pptx(output)
    assert_valid_pptx(output)  # fails with detailed errors if invalid

def test_my_document(assert_valid_docx, tmp_path):
    output = tmp_path / "output.docx"
    generate_docx(output)
    assert_valid_docx(output)

def test_my_spreadsheet(assert_valid_xlsx, tmp_path):
    output = tmp_path / "output.xlsx"
    generate_xlsx(output)
    assert_valid_xlsx(output)

def test_odf_file(assert_valid_odf, tmp_path):
    output = tmp_path / "output.odt"
    generate_odt(output)
    assert_valid_odf(output)
CLI options:
# Validate against a specific Office version
pytest --openxml-format Office2007
# Limit errors collected per file
pytest --openxml-max-errors 50
Available fixtures: openxml_validator, assert_valid_pptx, assert_valid_docx, assert_valid_xlsx, assert_valid_odf.
# Context manager
from openxml_audit import validation_context
with validation_context(raise_on_invalid=True) as validator:
    result = validator.validate("presentation.pptx")

# Decorator: validate after save
from pptx import Presentation  # python-pptx, used to create the file
from openxml_audit import validate_on_save
@validate_on_save(raise_on_invalid=True)
def create_presentation(output_path: str) -> None:
    Presentation().save(output_path)

# Decorator: require valid input
from openxml_audit import require_valid_pptx
@require_valid_pptx()
def process(input_path: str) -> dict: ...
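To show the general shape of a validate-after-save decorator, here is a self-contained sketch. It is not the library's implementation of validate_on_save: validate_after_save, InvalidFileError, and the injected is_valid callable are all hypothetical, and the stand-in validator only checks the file extension.

```python
import functools
from typing import Any, Callable


class InvalidFileError(Exception):
    """Hypothetical exception; the library's actual exception type is not shown here."""


def validate_after_save(is_valid: Callable[[str], bool]) -> Callable:
    """Wrap a function taking output_path first; validate that path once it returns."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(output_path: str, *args: Any, **kwargs: Any) -> Any:
            result = func(output_path, *args, **kwargs)
            if not is_valid(output_path):
                raise InvalidFileError(f"{output_path} failed validation")
            return result
        return wrapper
    return decorator


# Stand-in validator for the demo: accepts anything ending in .pptx.
@validate_after_save(is_valid=lambda path: path.endswith(".pptx"))
def save_deck(output_path: str) -> str:
    return output_path  # a real function would write the file here


print(save_deck("ok.pptx"))
```

Injecting the validator as a callable keeps the wrapper testable without touching the file system.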
OpenXmlValidator / OdfValidator
OpenXmlValidator(file_format=FileFormat.OFFICE_2019, max_errors=1000,
                 schema_validation=True, semantic_validation=True)
OdfValidator(file_format=FileFormat.ODF_1_3, max_errors=1000,
             schema_validation=True, semantic_validation=True,
             security_validation=False, strict=True)
Both expose:
- validate(path) -> ValidationResult
- validate_with_timings(path) -> (ValidationResult, dict[str, float])
- is_valid(path) -> bool
ValidationResult

| Property | Type | Description |
|---|---|---|
| is_valid | bool | No ERROR-severity issues |
| errors | list[ValidationError] | All errors and warnings |
| error_count / warning_count | int | Counts by severity |
| file_path | str | Validated file path |
| file_format | FileFormat | Version validated against |
ValidationError

| Property | Type | Description |
|---|---|---|
| error_type | ValidationErrorType | PACKAGE, BINARY, SCHEMA, SEMANTIC, RELATIONSHIP, MARKUP_COMPATIBILITY |
| severity | ValidationSeverity | ERROR, WARNING, INFO |
| description | str | Human-readable message |
| part_uri | str \| None | Affected part URI |
| path | str \| None | XPath to affected element |
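Because result.errors holds both errors and warnings, a quick tally by error_type and severity is often the first thing to look at. The sketch below relies only on the attribute names from the table above; the sample objects are stand-ins that use plain strings where the library uses enums, and count_by_type_and_severity is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def count_by_type_and_severity(errors: Iterable[Any]) -> Counter:
    """Tally issues by (error_type, severity) using the documented attributes."""
    return Counter((e.error_type, e.severity) for e in errors)


# Stand-in objects; real entries would come from result.errors.
sample = [
    SimpleNamespace(error_type="SCHEMA", severity="ERROR", description="unexpected child"),
    SimpleNamespace(error_type="SCHEMA", severity="ERROR", description="missing attribute"),
    SimpleNamespace(error_type="RELATIONSHIP", severity="WARNING", description="dangling target"),
]

for (error_type, severity), count in count_by_type_and_severity(sample).most_common():
    print(f"{error_type:<14} {severity:<8} {count}")
```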
| OOXML | ODF |
|---|---|
| OFFICE_2007 through MICROSOFT_365 (default: OFFICE_2019) | ODF_1_2, ODF_1_3 (default: ODF_1_3) |
- validate_pptx(path) -> ValidationResult
- is_valid_pptx(path) -> bool

These libraries create Office files, and openxml-audit checks them:
| Library | Format | Description |
|---|---|---|
| python-pptx | PPTX | Create and update PowerPoint files |
| python-docx | DOCX | Create and update Word files |
| openpyxl | XLSX | Create and update Excel files |
from pptx import Presentation
from openxml_audit import validate_pptx
Presentation().save("output.pptx")
result = validate_pptx("output.pptx")
if not result.is_valid:
    print(f"{result.error_count} issues found")
Contributions are welcome! See CONTRIBUTING.md for dev setup and guidelines.
This project is actively looking for co-maintainers — especially people working with:
If you're interested, open an issue or reach out.
If this project saves you time, consider sponsoring its development:
See CHANGELOG.md for a full list of changes by version.
Based on the validation logic from Microsoft's Open XML SDK for .NET.