ifrs-17 validation accounting-engine python xlwings

Validation Tools for IFRS 17 Accounting Engines: The Practitioner Toolkit

· 7 min read · View on LinkedIn

Validation of an IFRS 17 accounting engine is not a manual exercise. By the time you have a few hundred test cases running against the live vendor product and your shadow model, anything short of proper tooling collapses under its own weight. This post covers the specific tools I build — or adapt — on a validation workstream, and the design choices behind them. It is a companion to the broader piece on validation framework and methodology; here the focus is strictly on the toolkit.

The toolset has four main components: a shadow model that recalculates independently, a test case generator, an input converter, and a comparison tool. Each is buildable in reasonable time if you pick the right platform and scope the coverage honestly.

1. The shadow model

The shadow model is the heart of the validation toolkit. Its job is to replicate the vendor accounting engine’s calculations independently, using a different codebase, so that when the two disagree you have two sources to triangulate from rather than one black box to argue with.

Python is the right platform for the shadow model in most programmes. It has the numerical libraries, the readability for audit review, and the integration story into Excel and databases that validation requires. I structure the shadow model in three layers:

Business event layer. A Python representation of each business event the vendor engine recognises — initial recognition, measurement, unlocking, adjustment to coverage units, derecognition, and so on. Each event is a function that takes contract-level inputs and produces a set of output values with the same semantics as the vendor’s event output.

Posting logic layer. A deterministic mapping from business events to debits and credits, with sign-convention logic expressed explicitly. This layer is the one auditors will read most closely, so it has to be readable. I write it as a rules table wherever possible — pairs of (event type, posting rule) — rather than as nested conditionals, because a rules table is inspectable and diff-able.

Aggregation layer. Roll-up from contract-level postings to group, portfolio, and entity level. This layer is where most reconciliation differences between the shadow model and the vendor surface, and where the code needs to be instrumented to trace an aggregate result back to its contributing contracts.

The shadow model does not need to be as fast as the vendor engine. It needs to be correct and inspectable. A shadow run that takes two hours to process a test batch is fine. A shadow model whose internals cannot be walked through at an audit session is not.

2. The test case generator

The test case generator produces structured inputs for both the shadow model and the vendor engine in their respective formats. Its job is to make it cheap to add new test cases and to measure coverage.

I build the generator around a template pattern: a base test case (a single contract with standard assumptions) and a set of variations that can be applied independently (onerous versus profitable, long-duration versus short, single-currency versus multi-currency, direct participating versus non-participating, and so on). Any combination of variations produces a valid test case, and the generator can enumerate all combinations or sample from them.

Coverage is measured against the variation space, not against line-of-code metrics. The question is not “what fraction of the shadow model’s code was exercised” but “what fraction of the variation combinations were tested”. That translates into a matrix — rows are test cases, columns are the variations covered — and a target coverage level agreed with the programme.

Structurally, I keep the generator simple. Python dictionaries, YAML test case definitions, and Excel for the output manifest are enough for most programmes. Over-engineering the generator is a classic validation anti-pattern: teams build a sophisticated test DSL and spend three months on tooling that does not produce any test results.

3. The input converter

The test cases come out of the generator in a canonical format. The vendor engine and the shadow model each need inputs in their own format. The input converter bridges the two.

Two design principles matter here. First, the converter has to be lossless in both directions. If it cannot round-trip — canonical to vendor format and back to canonical — then reconciliation differences could be artefacts of the converter rather than the engine, which destroys the diagnostic value of the validation work.

Second, the converter has to include data checks. In my experience, something like seven out of ten differences between the shadow and the vendor on an early validation run trace back to data issues — missing fields, wrong units, mis-formatted dates, duplicated keys. Baking the data checks into the converter catches these before they surface as fake “calculation differences”.

Python and xlwings are a good combination for the converter, especially in programmes where the insurer’s team is more comfortable in Excel than in code. xlwings lets Python drive the conversion while letting Excel host the configuration tables and the manual override controls that always end up being needed.

4. The comparison tool

The comparison tool imports the two result sets, maps line items across, and identifies differences. This is the most deceptively simple component of the toolkit. The complication is not the diff itself — pandas or similar libraries handle that — but the mapping.

Mapping issues show up because the vendor engine and the shadow model produce results at subtly different granularities, with slightly different naming, in slightly different currency conventions, sometimes with different sign conventions on the same account. Without a mapping layer, the diff is unreadable.

My comparison tool has four pieces:

A mapping table. A spreadsheet — usually maintained by the methodology team — that pairs shadow-model line items with vendor line items. Includes a sign flip column for the handful of accounts where the two engines disagree on convention.

A reconciliation engine. Applies the mapping, aggregates to the right level, computes differences, and classifies them by magnitude. I use a tolerance band per account type rather than a single global tolerance, because rounding differences in large accounts are not the same thing as logic differences in small ones.

A drill-down viewer. Starts at the aggregate diff and lets the user walk down to the contributing business events and contracts. Built in Python with a simple web UI or as an Excel dashboard driven by xlwings — either works. The key property is that the user can move from “there is a R50,000 difference at group level” to “contract 1234567 contributes R12,000 of it from the measurement event on 2023-06-30” in three clicks.

An audit-trace export. For each reconciled difference, the tool produces a structured export showing the input, both engines’ calculations, the difference, and the classification. These exports become the evidence trail for the auditor and the regulator. Design them to be standalone — readable without the tool — because auditors will review them in files, not in a live system.

5. Sampling strategy

No programme will run every possible test case combination. Sampling is essential, and the sampling strategy is a decision the validation lead has to own.

I use three types of sampling in combination. Full coverage of edge cases — every combination that the methodology team has flagged as risky or unusual. Stratified random sampling of the main variation space — enough to catch unexpected interactions, sized by the coverage target. Real-data slicing — actual contracts from the insurer’s book, chosen to span the portfolio’s mix of products, durations, and cohorts.

The combination matters. Edge cases alone miss interaction effects. Random sampling alone under-weights rare but important combinations. Real-data slicing alone misses scenarios the book has not yet contained but might in future.

6. What good looks like

A validation toolkit that is working well has visible properties. Turnaround from “new test case idea” to “test case result” is hours, not weeks. Differences are classified automatically and routed to the right owner — methodology to the methodology team, data to the data team, vendor bugs to the vendor. Audit artefacts are produced continuously as a side-effect of validation runs, not as a last-minute documentation exercise.

When tooling is weak, the symptoms are predictable. The team runs the same fifty test cases repeatedly because adding new ones is expensive. Every difference is investigated manually. Audit documentation is written three months late and is incomplete.

The tools described above are not elaborate by software engineering standards. A small team — one or two Python-competent actuaries or engineers, plus Excel-side support — can build them in six to eight weeks. That is a small fraction of the total validation effort and a disproportionate multiplier on what the team can actually deliver.

Without a proper toolkit, the validation phase becomes the programme’s critical path, and the only way to meet the deadline is to cut coverage. Building the tools properly is how you avoid that trade-off.

Working on something similar?

I've delivered IFRS 17, AI advisory, and actuarial training across 15 jurisdictions. If this topic is relevant to your team, let's talk.

Book 30 Minutes