Case Study: Stress-testing Clu's fingerprint-of-work methodology

Clu Labs
Oct 15, 2025
7 min read

We applied Clu's fingerprint-of-work methodology, blind, to the IBM HR operating model as it stood in May 2023, and scored its predictions against three years of subsequently published outcomes.

SUBJECT: IBM Human Resources, 2022–2025

METHOD: Retrospective blind validation

RESULT: 10 of 10 claims pass

In brief

In 2023, IBM reported that around 30% of its back-office HR work could be automated by AI within five years. By 2025, it had documented 80 automated tasks, a 94% containment rate, and a small set of HR functions that its CEO publicly identified as unautomatable for the next decade.

We took the HR operating model as it would have stood in May 2023, applied Clu's fingerprint methodology to 100 representative tasks without reference to the published outcomes, and scored the predictions against ten specific public claims.

The methodology passed every claim. Two pre-registered tests, locked before the predictions were run, confirmed that the matrix distinguishes automated work from non-automated work on the basis of the structural properties of the task itself, not knowledge of the outcome.

01: The case

A rare restructuring with task-level disclosure

Most corporate restructurings reveal changes in headcount and functions. Very few reveal the tasks themselves. IBM's HR transformation is an unusual case in which four conditions are simultaneously met.

First, the pre-transformation operating model is publicly recognisable from standard HR frameworks: the Ulrich three-legged stool, the SHRM functional areas, and the tiered service delivery model that almost every large enterprise organises its HR function around. Reconstruction from public sources is not speculation; it is reading off a shared reference architecture.

Second, the post-transformation state has been documented at the task level in IBM's own published case study. AskHR, the agent that absorbed the work, supports approximately 80 automated tasks, including employee letters, vacation requests, payroll access, compensation changes, organisational updates, transfers, and policy retrieval. The 94% containment rate is published, as is the two-tier model in which AI handles routine tasks and human advisors handle complex ones.

Third, the unautomatable boundary was publicly named by the CEO at the outset:

“Some HR functions, such as evaluating workforce composition and productivity, probably won't be replaced over the next decade.”

— Arvind Krishna, IBM CEO, May 2023

Fourth, the displacement-and-reinvestment shape is visible. IBM's overall headcount grew through the transition, and the freed capacity was reinvested into software engineering, sales, and marketing roles, which Krishna described as “focused on critical thinking, where employees face up to or against other humans, as opposed to just doing rote process work.”

Four published anchors: a recognisable starting model, a defined automatable cluster, a defined unautomatable boundary, and a defined reinvestment direction. That is the rare combination that makes this retrospective test possible.

02: The corpus

100 tasks, full operating model

The corpus deliberately spans the whole HR operating model, not just the tasks AskHR ended up absorbing.

A methodology that sees only the automated tasks cannot demonstrate that it distinguishes them from anything else. The 100-task corpus is balanced across the four standard service-delivery tiers and sixteen functional areas, so that the predictions are testable in both directions: high scores on tasks that were automated, and low scores on tasks that were explicitly named as unautomatable.

Corpus at a glance: 100 tasks, sixteen functional areas, four service tiers, all drawn from public source frameworks.

Every task carries a public provenance, traceable to IBM's own AskHR case study, Krishna's May 2023 Fortune interview, the SHRM Body of Competency and Knowledge, the CIPD HR operating model guidance, or the Ulrich three-legged stool. The corpus is built on shared reference material that a competent HR practitioner would recognise.

Clu's work fingerprint elements were deliberately left empty when the corpus was assembled. They were populated only in the next phase, using a methodology blind to the public evidence index.

03: The fingerprint-of-work methodology

The fingerprint runs before the evidence is consulted

Clu's work fingerprint has six dimensions. Four recognise a task; two recognise its scope. From these dimensions, a unique fingerprint is generated, and augmentation scores are deterministically derived from its elements.

STEP 01 Action & Object - Verb and the thing it acts on.
STEP 02 Outcome & Domain - What the task achieves, and the functional area it lives in.
STEP 03 Work-type & Reach - Strategic/Tactical/Transactional. Reach from Individual to Group.
DERIVED Augmentation score, split across AI / Automation / Robotics

Augmentation scores are derived from the fingerprint through a published matrix. The matrix is deterministic and versioned: the same fingerprint produces the same score every time. Where the calibration of an individual matrix value remains an open question internally, this validation used the more conservative value. That makes the test harder, not easier, and keeps the result robust however those questions are resolved.
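As a minimal sketch of what "deterministic and versioned" means in practice, the Python below models a fingerprint as an immutable record of the six dimensions and derives a score by pure table lookup. The class name, the version string, the weight tables, and the averaging rule are all hypothetical placeholders chosen for illustration; Clu's published matrix is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical matrix: every value below is an illustrative placeholder,
# not a value taken from Clu's published matrix.
MATRIX_VERSION = "example-0"
WORK_TYPE_WEIGHT = {"Transactional": 0.9, "Tactical": 0.5, "Strategic": 0.1}
REACH_WEIGHT = {"Individual": 0.8, "Group": 0.4}


@dataclass(frozen=True)
class Fingerprint:
    """The six dimensions: four recognise the task, two recognise its scope."""
    action: str
    obj: str
    outcome: str
    domain: str
    work_type: str
    reach: str


def augmentation_score(fp: Fingerprint) -> float:
    """Pure lookup with no randomness or hidden state.

    Equal fingerprints therefore always receive equal scores under the
    same matrix version. The averaging rule is an assumption.
    """
    raw = (WORK_TYPE_WEIGHT[fp.work_type] + REACH_WEIGHT[fp.reach]) / 2
    return round(raw, 2)


routine = Fingerprint("Process", "Transfer", "Record updated", "HR operations",
                      "Transactional", "Individual")
strategic = Fingerprint("Evaluate", "Workforce composition", "Plan agreed",
                        "Workforce planning", "Strategic", "Group")

print(MATRIX_VERSION, augmentation_score(routine), augmentation_score(strategic))
```

Freezing the dataclass and keeping the scoring function free of side effects is what lets a reviewer rerun the derivation and obtain the same number, provided the matrix version is recorded alongside each score.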

The methodology was applied blind. Fingerprints were constructed solely from the task name and description, without reference to the public evidence index that scores them. Fifteen tasks in which prior public knowledge of AskHR automation status was unavoidable were explicitly flagged for contamination risk, so an adversarial reviewer can recheck them.

04: Pre-registered tests

The two tests that it had to pass

Before the methodology was run, two tests were locked. Both concern work that the IBM CEO publicly named as unautomatable for the next decade. If the methodology had scored either as highly augmentable, the matrix would have failed.

[Figure: The two pre-registered prediction tests. Tasks T063 and T064 each scored 0.14.]

These two scores are the most important results in the validation. Many methodologies could be tuned, after the fact, to match an automation outcome. Far fewer can reproduce a published unautomatable boundary from structural properties alone.
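The pass condition for the two locked tests can be sketched as a simple threshold check. The scores for T063 and T064 are the reported values; the numeric cut-off for "highly augmentable" is an assumption, since no threshold is stated, and the function name is hypothetical.

```python
# Assumed cut-off: the source does not state a numeric threshold
# for "highly augmentable".
HIGH_AUGMENTABILITY_THRESHOLD = 0.5

# Reported scores for the two pre-registered tasks.
PRE_REGISTERED_SCORES = {"T063": 0.14, "T064": 0.14}


def pre_registered_tests_pass(scores: dict[str, float], threshold: float) -> bool:
    """Pass only if every locked task scores strictly below the threshold."""
    return all(score < threshold for score in scores.values())


print(pre_registered_tests_pass(PRE_REGISTERED_SCORES, HIGH_AUGMENTABILITY_THRESHOLD))
```

Under this assumed threshold both tasks pass with a wide margin. The essential property of pre-registration is that the threshold and the task list are fixed before the scores exist.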

05: The structural result

A clean gradient from routine to strategic

The methodology yields a monotonic decrease in predicted augmentation as work moves up the service-delivery tiers: exactly the structural property that makes a regulated automation programme defensible.

[Figure: Predicted augmentation score by service tier, from 0.75 at Tier 0 (self-service) down to 0.22 at Tier 3.]

The gap from Tier 0 to Tier 3 is 0.53: a half-scale separation that the methodology produces from cognitive structure alone.
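The gap arithmetic and the monotonicity claim can be checked in a few lines. This is a sketch only: the Tier 0 and Tier 3 scores are the reported values, while the Tier 1 and Tier 2 scores are hypothetical placeholders, because the text does not state them.

```python
def is_monotonic_decreasing(scores: list[float]) -> bool:
    """True if each tier's score is strictly lower than the tier before it."""
    return all(a > b for a, b in zip(scores, scores[1:]))


# Index = service tier. Tier 0 (0.75) and Tier 3 (0.22) are reported;
# Tier 1 and Tier 2 are hypothetical placeholders for illustration.
tier_scores = [0.75, 0.60, 0.40, 0.22]

gap = round(tier_scores[0] - tier_scores[-1], 2)
print(gap)  # 0.53
print(is_monotonic_decreasing(tier_scores))
```

A strict inequality is used so that two adjacent tiers with the same score would count as a failure of the gradient, which is the more demanding reading of "monotonic decrease".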

This is the shape an HR leader can take to a works council as the structural defence of an automation programme: the work Clu predicts as automatable is the work that, by cognitive property, not opinion, is most automatable. The work it protects from automation is the work where human judgment does the cognitive heavy lifting.

It also mirrors how AskHR actually scaled at IBM: bottom-up from self-service through shared services, stopping at the strategic layer. The methodology reproduces the empirical scaling path from first principles.

06: Driving force

It is not enough to know whether work is augmentable

A single augmentability number tells an organisation that a task can be augmented. It does not tell them by what: AI, Automation, or Robotics. Without that decomposition, the organisation cannot select the right tool, vendor, or intervention.

Two tasks in the corpus illustrate why the decomposition matters.

T020: Process internal transfer. The methodology assigns a score of 0.86, with Automation as the dominant driving force. This is the structural signature of pure workflow execution: the work a robotic process automation tool absorbs cleanly, where AI is helpful but not the primary driver.

T018: Generate employment verification letter. The methodology assigns a score of 0.73, with AI as the dominant driving force. The structural difference is that this task requires natural-language generation, not just workflow execution.

Two superficially similar tasks. Two different driving forces. Two different procurement decisions. The methodology distinguishes them not by trying both and observing the outcome, but by reading the cognitive fingerprint.
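Reading the dominant driving force off a decomposed score can be sketched as follows. The overall scores (0.86 and 0.73) and the dominant forces are the reported values; the per-force shares are hypothetical placeholders, and the idea that the shares sum to the overall score is an assumption about how the split works.

```python
def dominant_force(split: dict[str, float]) -> str:
    """Return the driving force with the largest share of the score."""
    return max(split, key=lambda force: split[force])


# Hypothetical shares, chosen only so that they sum to the reported scores.
t020_process_internal_transfer = {"AI": 0.26, "Automation": 0.55, "Robotics": 0.05}
t018_employment_verification_letter = {"AI": 0.48, "Automation": 0.22, "Robotics": 0.03}

for name, split in [
    ("T020", t020_process_internal_transfer),
    ("T018", t018_employment_verification_letter),
]:
    print(name, round(sum(split.values()), 2), dominant_force(split))
```

The point of the decomposition is that two tasks with similar totals can map to different tooling: a workflow engine for the first, a language model for the second.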

07: The scorecard

Ten claims. Ten passes.

The methodology's predictions were scored against ten specific public claims from IBM and Arvind Krishna.

[Figure: IBM HR validation scorecard. All ten claims are marked PASS.]

08: Honest limitations

WHAT THIS VALIDATION DOES NOT PROVE

Where the case is weak

A methodology that claims unbroken prescience is not credible. The following are limitations a reviewer should be aware of before forming a view of this work.

01. This is retrospective, not predictive.

The methodology was applied in 2026 to an outcome that became public between 2023 and 2025. It would have produced defensible predictions in 2023, but it did not produce them in 2023. Predictive validation, committing to predictions in writing on a live client engagement before the outcome is known, is a separate programme of work.

02. 15 of 100 tasks carry a hindsight contamination flag.

These are tasks where the analyst had prior public knowledge of AskHR's automation status. The fingerprints were constructed mechanically from the task name and description only, but unconscious bias cannot be ruled out. Adversarial review, by an independent analyst who has not seen the public evidence, is the next step before any external claim of generality.

03. A single organisation, a single domain.

IBM HR is one case. A methodology that works on one case may not generalise. A second validation case in a different domain, where the published outcome is an automation programme that reversed, is the next test the methodology must pass.

04. The corpus is reconstructed, not disclosed.

IBM has not published its full pre-AskHR task inventory. The 100-task corpus is built from a publicly known HR operating model structure, anchored at specific tasks where IBM has disclosed automation activity. Every task carries source provenance, but a competent industry reviewer can challenge individual entries.

05. One open matrix calibration question remains.

The validation passes with the conservative position, but the question of whether modern AI is more capable at this category of cognitive work than the matrix assumes is unresolved. Three tasks in the corpus sit in the middle of the score distribution, where revising the matrix value in question would move their scores.

06. IBM did not commission this study.

The work is conducted independently using publicly available information. IBM has not reviewed, endorsed, or commented on the analysis. The case is used as a publicly documented workforce transformation against which to stress-test the methodology, not as a representation of IBM's views.

What this is for

The hard test isn't explaining what happened. It is explaining what won't.

Workforce decisions that survive a works council, an audit, or a tribunal need a reasoning chain a hostile reviewer can trace. That is what Clu builds. This validation case is one step in proving the methodology earns the trust it asks for. There will be others.

NEXT IN THE SERIES: Klarna, and the methodology applied to a documented automation failure.
