Audit-Grade AI: Why 'Explainable' Is Not Enough

Clu Labs
Apr 18, 2024

The regulatory conversation about AI has settled, for now, on 'explainability' as the key standard. Can you show your working? Can a human understand why the system produced this output? Can you describe the logic in terms a regulator will accept?

These are the right questions. But they're not sufficient ones.

Explainability describes a property of a model's output. Audit-grade describes a property of the entire decision system: its inputs, methodology, outputs, audit trail, and the ability to reproduce the same result under scrutiny. For workforce decisions made in regulated environments, the distinction matters enormously.

The reproducibility problem

Generative AI, the technology behind most workforce analytics platforms entering the market, is stochastic by design. The same prompt, given to the same model on two different days, can produce two different outputs. You can explain either output. You cannot guarantee they'll match.
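To make the contrast concrete, here is a minimal sketch in Python. It is not any vendor's implementation: the option names and scores are hypothetical, and the sampling step is a simplified stand-in for how a generative model picks its next token. One function draws from a probability distribution; the other always takes the top-scoring option.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_choice(options, logits, rng, temperature=1.0):
    """Stochastic: draw one option in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(options, weights=probs, k=1)[0]

def greedy_choice(options, logits):
    """Deterministic: always return the highest-scoring option."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return options[best]

# Hypothetical options and scores, for illustration only.
OPTIONS = ["retain", "redeploy", "restructure"]
LOGITS = [1.2, 1.0, 0.9]

rng = random.Random()  # unseeded, so draws differ from run to run
sampled = {sample_choice(OPTIONS, LOGITS, rng) for _ in range(200)}
greedy = {greedy_choice(OPTIONS, LOGITS) for _ in range(200)}

print("sampled outcomes:", sorted(sampled))
print("greedy outcomes:", sorted(greedy))
```

Across 200 runs on identical inputs, the sampled function will almost certainly return more than one answer, while the greedy function returns exactly one. Each sampled answer can be explained after the fact; none of them can be promised in advance.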

For the vast majority of AI applications, this is fine. Content generation, ideation, and first-draft documents all tolerate variance. The output is a starting point, not a record.

For workforce decisions, it's disqualifying. If your AI system produced a restructuring recommendation in January and a tribunal asks you to reproduce the analysis in September, a stochastic system cannot do it. The methodology is non-deterministic by architecture. There is no audit trail that will satisfy a regulator who wants to know whether your process was consistent, fair, and repeatable.

Explainability tells you why the system decided what it did. Audit-grade means you can prove, under scrutiny, that it would reach the same decision again, and show your working.

What audit-grade AI actually requires

Audit-grade by architecture means the system is deterministic: the same inputs always produce the same outputs. It means the methodology is documented and version-controlled, so changes to the analytical logic are tracked. It means the evidence base for any output is traceable to its source - every score, every classification, every risk flag has a provenance chain.
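As an illustration of those three properties, the sketch below scores a team and attaches a provenance record to the result. Everything in it is an assumption made for the example: the weights, the metric names, the file names, and the version tag are hypothetical, and a production system would need far more than this.

```python
import hashlib
import json
from dataclasses import dataclass

METHODOLOGY_VERSION = "1.0.0"  # hypothetical version tag for the scoring logic

# Hypothetical weights; in practice these would live under version control.
WEIGHTS = {"span_of_control": 0.5, "tenure_years": 0.25, "open_roles": 0.25}

def fingerprint(obj) -> str:
    """SHA-256 of a canonical JSON form, so key order cannot change the hash."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class ScoredRecord:
    score: float
    methodology_version: str
    input_hash: str
    sources: tuple

def score_team(team: dict) -> ScoredRecord:
    """Pure function: no randomness, no clock, no external state."""
    # Sum in sorted key order so the arithmetic is performed identically each run.
    score = sum(WEIGHTS[k] * team["metrics"][k] for k in sorted(WEIGHTS))
    return ScoredRecord(
        score=round(score, 6),
        methodology_version=METHODOLOGY_VERSION,
        input_hash=fingerprint(team),
        sources=tuple(sorted(team["sources"])),
    )

# Hypothetical input record.
team = {
    "id": "team-017",
    "metrics": {"span_of_control": 8, "tenure_years": 4, "open_roles": 2},
    "sources": ["hris_export_2024-01.csv", "org_chart_v3.json"],
}

first = score_team(team)
second = score_team(team)
print(first == second, first.score, first.input_hash[:12])
```

Because the record carries the input hash, the methodology version, and the source list, a later reviewer can confirm that the same data and the same logic were used, rerun the function, and compare the result field by field.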

It also means the system doesn't introduce risk at the data layer. Non-generative AI doesn't hallucinate. It doesn't invent relationships between data points that don't exist in the input. It surfaces what's there. The outputs are constrained by the organisation's structural reality, not shaped by a model's prior training on other organisations' data.

This isn't a theoretical distinction. The Employment Rights Bill's strengthened collective consultation requirements, the EU AI Act's Annex III classification of workforce AI as high-risk, and the ongoing ICO scrutiny of AI in employment decisions are all moving in the same direction: towards a standard of evidence that generative AI systems cannot currently meet.

The architecture that solves for compliance from day one

Clu's analytical engine is non-generative and deterministic. Our methodology is pre-registered before analysis runs, which means the evaluation framework is locked in before any output is produced, eliminating the possibility of post-hoc rationalisation. Every score is reproducible. Every output has a traceable source.
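Pre-registration can be pictured as a lock placed on the methodology before any data is analysed. The sketch below is a generic illustration of that idea, not a description of Clu's engine: the registry class, the specification fields, and the threshold are all hypothetical.

```python
import hashlib
import json

def spec_hash(spec: dict) -> str:
    """Hash a methodology specification in canonical JSON form."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class PreRegistry:
    """Hypothetical in-memory registry; a real one would be durable and append-only."""

    def __init__(self):
        self._locked = {}

    def register(self, analysis_id: str, spec: dict) -> str:
        """Lock a methodology once; later attempts to re-register are refused."""
        if analysis_id in self._locked:
            raise ValueError(f"{analysis_id} is already registered")
        digest = spec_hash(spec)
        self._locked[analysis_id] = digest
        return digest

    def verify(self, analysis_id: str, spec: dict) -> bool:
        """True only if the spec matches what was locked before the run."""
        return self._locked.get(analysis_id) == spec_hash(spec)

def run_analysis(registry, analysis_id, spec, data):
    """Refuse to run if the methodology differs from the registered one."""
    if not registry.verify(analysis_id, spec):
        raise RuntimeError("methodology differs from the pre-registered version")
    threshold = spec["risk_threshold"]
    return {name: value >= threshold for name, value in sorted(data.items())}

# Hypothetical specification and data.
SPEC = {"metric": "attrition_risk", "risk_threshold": 0.7, "version": "2024-01"}
registry = PreRegistry()
registry.register("restructure-q1", SPEC)
flags = run_analysis(registry, "restructure-q1", SPEC, {"team-a": 0.82, "team-b": 0.41})
print(flags)
```

If someone adjusts the threshold after seeing the results, the hash no longer matches and the run is rejected, which is the point: the evaluation framework cannot be quietly reshaped to fit a preferred outcome.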

We designed this not because regulation required it at the time - it didn't - but because we were building for organisations in regulated industries where the question 'can you prove this?' is never hypothetical. UK-sovereign, non-generative, audit-grade by architecture: that's the infrastructure for workforce decisions that survive scrutiny.

As regulation tightens, and it will, organisations that chose audit-grade architecture early will have a structural advantage. Those that treated convenience and explainability as proxies for compliance will find themselves retrofitting governance onto systems that weren't built for it.

That's an expensive problem to discover late.