
MFid Methodology

One canonical formula. Two reporting forms. Published rubrics for every dimension.

Why this page exists

A firm whose entire product is auditing other people’s claims cannot ship a metric in two formulas, two evidence-tier names, and an undocumented Intentionality rubric. This page is the canonical reference. Every other page on the site is required to match it. If a page contradicts this one, this one wins and the other page is a finding against ourselves.

This is the firewall recommended in the 2026-05-26 thesis review (§4.1, §4.2, §4.5). It exists because the review was right.

The canonical formula

MFid_aggregate = (D × E × O × I)^(1/4)

D, E, O, I ∈ [0,1]. Geometric mean of four normalized dimension scores.

Why a geometric mean and not an arithmetic mean. A geometric mean is pulled toward its smallest factor. A single weak dimension caps the composite: a score of 0.5 on any one dimension limits the aggregate to roughly 0.84 even if the other three are perfect, and a score of 0 on any one dimension forces the aggregate to 0. That is the policy we want. An arithmetic mean would let a perfect Observability score paper over a broken Determinism score. We treat fidelity gaps as non-substitutable: you cannot fix a lie about latency by being more transparent about it.
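The capping behavior can be checked with a few lines of Python. This is a minimal sketch, not a published implementation; the function names mfid_aggregate and arithmetic_mean are illustrative and do not come from the MFid repository.

```python
import math

def mfid_aggregate(d: float, e: float, o: float, i: float) -> float:
    """Geometric mean of the four dimension scores, each in [0, 1]."""
    scores = (d, e, o, i)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"dimension score out of range: {s}")
    return math.prod(scores) ** 0.25

def arithmetic_mean(d: float, e: float, o: float, i: float) -> float:
    """Shown only for comparison; not used by the methodology."""
    return (d + e + o + i) / 4

# One weak dimension, three perfect ones.
print(round(mfid_aggregate(0.5, 1.0, 1.0, 1.0), 4))   # 0.8409
print(round(arithmetic_mean(0.5, 1.0, 1.0, 1.0), 4))  # 0.875
# A zero on any dimension zeroes the aggregate.
print(mfid_aggregate(0.0, 1.0, 1.0, 1.0))             # 0.0
```

The arithmetic mean still reports 0.75 when one dimension is 0, which is the substitution effect the geometric mean is chosen to prevent.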

Why these four dimensions and not three or five. D, E, O, I are the smallest set that survives every domain we have scored. Determinism, Efficiency, and Observability are the engineering core. Intentionality is the dimension that distinguishes a fast wrong answer from a fast right one. It is the one most measurement frameworks omit and the one most needed in the autonomous-systems era.

Two reporting forms — and how they relate

Past versions of this site published two formulas without saying so. That was a finding. The reconciliation:

Form 1 — Aggregate MFid (the canonical number)

MFid_aggregate = (D × E × O × I)^(1/4). One number per system, vendor, process, or stack. This is the number that goes on the board deck. It is always a geometric mean of four [0,1] dimension scores. Weights are fixed at 1/4 each. There are no per-engagement weight choices.

Form 2 — Domain-projected fidelity (MFid_app, MFid_net, …)

When a system exposes domain-specific service-level indicators (latency, throughput, reliability for an app; bandwidth, jitter, loss for a network), we compute a domain projection:

MFid_app = w_L·L + w_T·T + w_R·R

where L, T, R ∈ [0,1] are fidelity scores per SLI and the weights w sum to 1. Each score is a claimed-versus-observed ratio clipped to 1.0, oriented so that missing the claim lowers the score: claimed/observed for lower-is-better SLIs such as latency, observed/claimed for higher-is-better SLIs such as throughput and reliability. Weights are not chosen per case. They are derived from the system’s own published SLO portfolio: if the operator weights latency at 50% of their SLO budget, we use 0.5. If no SLO portfolio exists, we use the default Tier-1 weighting (0.5 / 0.3 / 0.2 in order of business impact: latency, throughput, reliability) and publish the choice in the finding.
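A domain projection with the default weighting might be sketched as follows. The names sli_fidelity, mfid_app, and DEFAULT_WEIGHTS are hypothetical, and the lower_is_better flag encodes an assumption: the ratio is oriented so that missing the claim always lowers the score.

```python
# Default weighting from the text: latency, throughput, reliability.
DEFAULT_WEIGHTS = {"latency": 0.5, "throughput": 0.3, "reliability": 0.2}

def sli_fidelity(claimed: float, observed: float, lower_is_better: bool) -> float:
    """Per-SLI fidelity in [0, 1]; beating the claim is clipped to 1.0."""
    if claimed <= 0 or observed <= 0:
        raise ValueError("claimed and observed must be positive")
    # Assumption: orient the ratio so a missed claim scores below 1.
    ratio = claimed / observed if lower_is_better else observed / claimed
    return min(1.0, ratio)

def mfid_app(fidelity: dict, weights: dict = None) -> float:
    """Weighted sum of per-SLI fidelity scores; weights must sum to 1."""
    weights = weights or DEFAULT_WEIGHTS
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * fidelity[name] for name in weights)

scores = {
    # Claimed 100 ms, observed 125 ms -> 0.8
    "latency": sli_fidelity(100.0, 125.0, lower_is_better=True),
    # Claimed 1000 req/s, observed 900 req/s -> 0.9
    "throughput": sli_fidelity(1000.0, 900.0, lower_is_better=False),
    # Claim met exactly -> 1.0
    "reliability": sli_fidelity(0.999, 0.999, lower_is_better=False),
}
print(round(mfid_app(scores), 3))  # 0.5*0.8 + 0.3*0.9 + 0.2*1.0 = 0.87
```

Passing a weights dictionary derived from the operator's SLO portfolio replaces the default; the sum-to-1 check guards against a portfolio that was transcribed incorrectly.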

A domain projection is not the aggregate MFid. It is one input into one of the four dimensions. Roll-up:

  • Determinism (D) ← tail dispersion of each L/T/R signal (p99/p50 ratio inverted and clipped).
  • Efficiency (E) ← resource-bounded form of L and T (claimed cost-per-unit ÷ observed cost-per-unit).
  • Observability (O) ← coverage of L/T/R signals: fraction of customer journeys for which a measurement exists.
  • Intentionality (I) ← scored separately; see rubric below.

Every published MFid_app in our case studies is annotated with both forms going forward: the projection that motivated the finding, and the aggregate it rolled into.

Operational definitions of D, E, O, I

Each dimension is a normalized composite of underlying measurements. Each has a published formula. None of them are subjective. None of them are scored by vibes.

D — Determinism

Definition: Same input, under stated conditions, produces the same observable output within a stated tolerance.

Measurement: D = 1 − min(1, σ / (μ × τ)) where σ is the standard deviation of the observable, μ is its mean, and τ is the published tolerance band (e.g. 10%). Computed per SLI; aggregated by minimum across SLIs (a system is only as deterministic as its worst-behaved indicator).

Worst-case binding: If any single event in the measurement window deviated from the mean by more than 2σ on a critical SLI, D is capped at 0.9 regardless of the formula above. We refuse to let a calm hour hide a panic minute.
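The Determinism measurement and its cap could be computed roughly like this. It is a sketch under stated assumptions: σ is taken as the population standard deviation, a "tail event" is read as any sample more than 2σ away from the mean, and the function names are illustrative only.

```python
from statistics import fmean, pstdev

def determinism_for_sli(samples, tolerance):
    """D = 1 - min(1, sigma / (mu * tau)) for one SLI."""
    mu = fmean(samples)
    if mu <= 0 or tolerance <= 0:
        raise ValueError("mean and tolerance must be positive")
    sigma = pstdev(samples)  # assumption: population standard deviation
    return 1.0 - min(1.0, sigma / (mu * tolerance))

def has_tail_event(samples):
    """Assumption: a tail event is any sample more than 2 sigma from the mean."""
    mu, sigma = fmean(samples), pstdev(samples)
    return sigma > 0 and any(abs(x - mu) > 2 * sigma for x in samples)

def determinism(slis, tolerance=0.10, critical=()):
    """Minimum across SLIs, capped at 0.9 if a critical SLI had a tail event."""
    d = min(determinism_for_sli(s, tolerance) for s in slis.values())
    if any(has_tail_event(slis[name]) for name in critical):
        d = min(d, 0.9)
    return d

# 99 calm samples and one spike: the formula alone scores about 0.97,
# but the spike on a critical SLI caps D at 0.9.
latency_ms = [100] * 99 + [103]
print(determinism({"latency": latency_ms}) > 0.9)                 # True
print(determinism({"latency": latency_ms}, critical=("latency",)))  # 0.9
```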

E — Efficiency

Definition: Resource cost per unit of useful output, relative to the published or contracted cost.

Measurement: E = min(1, claimed_cost_per_unit / observed_cost_per_unit), computed in the natural unit of the system (cycles/token, watts/inference, dollars/transaction, joules/request, bytes/query). One unit per system, declared up front. E is a one-sided ratio: under-cost (better than claim) is clipped at 1.0 — we report it but it does not inflate the score. Only Porsches get credit for under-promising, and the credit is qualitative.
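A one-sided ratio of this kind might look like the sketch below. The function name efficiency and the choice to return the unclipped ratio alongside the score are assumptions, made so that an under-cost result can still be reported without inflating E.

```python
def efficiency(claimed_cost_per_unit: float, observed_cost_per_unit: float):
    """Return (E, raw_ratio). E is clipped at 1.0; raw_ratio is for reporting only."""
    if claimed_cost_per_unit <= 0 or observed_cost_per_unit <= 0:
        raise ValueError("costs must be positive and in the same declared unit")
    raw_ratio = claimed_cost_per_unit / observed_cost_per_unit
    return min(1.0, raw_ratio), raw_ratio

# Claimed 2.0 joules/request, observed 2.5: the system costs more than claimed.
print(efficiency(2.0, 2.5))  # (0.8, 0.8)
# Claimed 5.0, observed 4.0: better than claimed, score stays at 1.0.
print(efficiency(5.0, 4.0))  # (1.0, 1.25)
```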

O — Observability

Definition: Fraction of the customer-relevant behavior surface for which a current, queryable, retained measurement exists.

Measurement: O = (covered_SLIs / required_SLIs) × retention_factor × freshness_factor. Required SLIs are enumerated up front for each system from its specification (not from what is currently instrumented — that would let absence become a credit). Retention factor is 1.0 if telemetry is retained ≥ 30 days, scaled down otherwise. Freshness factor is 1.0 if the dashboard is queryable in < 60 seconds, scaled down otherwise. A score you have to mine from log files is not observability; it is archaeology.
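The text does not say how the retention and freshness factors are "scaled down", so the sketch below assumes a simple proportional scale for both. That scaling, and the function name observability, are assumptions for illustration rather than the published rule.

```python
def observability(required_slis, instrumented_slis, retention_days, query_seconds):
    """O = coverage x retention_factor x freshness_factor."""
    required = set(required_slis)
    if not required:
        raise ValueError("required SLIs must be enumerated from the specification")
    # Only required SLIs count; extra instrumentation earns no credit.
    covered = required & set(instrumented_slis)
    coverage = len(covered) / len(required)
    # Assumption: linear scale-down below 30 days of retention.
    retention_factor = min(1.0, retention_days / 30.0)
    # Assumption: inverse scale-down once a query takes 60 seconds or more.
    freshness_factor = 1.0 if query_seconds < 60 else 60.0 / query_seconds
    return coverage * retention_factor * freshness_factor

required = {"latency", "throughput", "reliability", "error_rate"}
instrumented = {"latency", "throughput", "reliability", "cpu_temp"}
# 3 of 4 required SLIs covered, 15 days retained, queryable in 30 s.
print(observability(required, instrumented, retention_days=15, query_seconds=30))  # 0.375
```

Note that cpu_temp is instrumented but not required, so it does not raise coverage: the denominator comes from the specification, not from what happens to be measured.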

I — Intentionality

This is the dimension the 2026-05-26 review flagged as undefined. The review was correct. The rubric below is the answer.

Definition: The fraction of system activity that demonstrably serves the stated purpose under audit, with the remainder classified as out-of-spec drift (not necessarily harmful — but not what was contracted).

Important framing. Intentionality is not a property of the artifact in isolation; it is a property of the artifact relative to its specification. We measure the specification as carefully as we measure the system. An unclear spec produces a low Intentionality ceiling, not a low Intentionality score — we publish the ceiling and recommend the spec be tightened.

Measurement (three-part, each scored [0,1], aggregated by geometric mean):

  1. I_spec — Specification clarity. Does a written, dated, signed specification exist that enumerates required behaviors and forbidden behaviors? Scored by document analysis on a 7-point checklist (existence, dating, scope, behavior list, forbidden-list, change log, sign-off). Reproducible across reviewers with κ ≥ 0.7 on a 50-spec calibration set; calibration set published on request.
  2. I_trace — Operational coverage. Of the operations the system performed during the measurement window, what fraction can be traced to a specified behavior? Computed as (traced_operations / total_operations) from logs, traces, or transaction records. Operations with no trace match are not assumed malicious — they are assumed unscored, and counted against I_trace.
  3. I_drift — Forbidden-behavior detection. Of operations classifiable as “outside spec” (output schemas, data destinations, decision boundaries the spec excludes), what fraction were caught by automated guardrails before having effect? Computed as (blocked_violations / detected_violations). A system with no violations and no detection capability scores 0.5 (we cannot tell whether it is well-behaved or unmonitored).

I = (I_spec × I_trace × I_drift)^(1/3).
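The three sub-scores and their roll-up can be sketched as below. Two points are assumptions rather than published rules: the checklist is scored as the fraction of the seven items satisfied, and a system with no violations but a working detection capability is scored 1.0 on the drift factor. All function and constant names are illustrative.

```python
SPEC_CHECKLIST = (
    "existence", "dating", "scope", "behavior_list",
    "forbidden_list", "change_log", "sign_off",
)

def i_spec(items_satisfied) -> float:
    """Assumption: fraction of the 7 checklist items the spec satisfies."""
    return sum(1 for item in SPEC_CHECKLIST if item in items_satisfied) / len(SPEC_CHECKLIST)

def i_trace(traced_operations: int, total_operations: int) -> float:
    """Untraced operations count against the score."""
    if total_operations <= 0:
        raise ValueError("no operations in the measurement window")
    return traced_operations / total_operations

def i_drift(blocked_violations: int, detected_violations: int, has_detection: bool) -> float:
    """blocked / detected, with the 0.5 rule for unmonitored systems."""
    if detected_violations == 0:
        # 0.5 is from the rubric; 1.0 for a monitored, clean system is an assumption.
        return 1.0 if has_detection else 0.5
    return blocked_violations / detected_violations

def intentionality(spec: float, trace: float, drift: float) -> float:
    """Geometric mean of the three sub-scores."""
    return (spec * trace * drift) ** (1 / 3)

spec = i_spec(set(SPEC_CHECKLIST))            # complete spec -> 1.0
trace = i_trace(900, 1000)                    # 0.9
drift = i_drift(0, 0, has_detection=False)    # unmonitored -> 0.5
print(round(intentionality(spec, trace, drift), 3))  # 0.766
```

In this example a complete spec and 90% trace coverage still yield roughly 0.77, because the missing detection capability pulls the geometric mean down.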

For ML and autonomous systems specifically: I_spec is scored against the model card and policy document. I_trace is computed from prompt/response logs against the policy classifier. I_drift measures jailbreak/red-team catch rate. A worked example on a public LLM endpoint is published in the MFid open-source repository as examples/intentionality_llm_worked.md.

What I is not. I is not a measure of whether the system is good. A perfectly-malicious system with a clear spec authorizing maliciousness scores I = 1.0. I measures fidelity to the spec, not the wisdom of the spec. That is by design. We score the gap between claim and reality; we do not score the claim itself. That is the customer’s job.

Evidence tiers — one name, one definition

Every published MFid number carries an explicit evidence tier and a coverage percentage. The tiers, canonically:

  1. Tier 1 — Measured Reality. What we observe in the client environment, under client load, on the client’s worst day. The verdict. Where direct measurement is incomplete, the uncovered portion is labeled Tier 1E (Engineering Estimation) with the inference method named.
  2. Tier 2 — Published Specification. What the vendor put in writing. The claim being tested. Example: NVMe latency vs. datasheet; ISP bandwidth vs. contract; cloud uptime vs. SLA.
  3. Tier 3 — Scientific Calculation. What physics, mathematics, or architectural law allow. The ceiling no claim can exceed. Example: thermal throttling derived from TDP; channel capacity bounded by Shannon’s theorem.

v2.1 renumbering (2026-05-28). This is a renumbering, not a methodology change. The three categories and their definitions are unchanged. No published scores moved as a result. Tier 1 is now Measured Reality because that is where reader intuition puts the strongest evidence — the prior numbering (T1 = Scientific Calculation) inverted that intuition. The sub-label for incomplete measurement moves with the tier and is now 1E.

The naming history. Earlier versions of the site used “Measured Reality” on the homepage and “Engineering Estimation” on the manifesto for the same tier. That was a finding against ourselves. The reconciliation is above: one tier (Tier 1 — Measured Reality) with an explicit sub-label (1E — Engineering Estimation) only when the measurement is incomplete.

The coverage rule. Every published MFid number carries the form:

MFid 0.81 (Tier-1 over 72% of subsystems; Tier-2 published spec for the remaining 28%)

A score with no tier label and no coverage percentage is not a published MFid. It is a draft.
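One way to enforce the coverage rule is to build the label through a function that refuses to emit a score without its tier and coverage. The sketch below assumes a two-part split (Tier-1 coverage plus a named basis for the remainder); the function name published_label and its parameters are hypothetical.

```python
def published_label(score: float, tier1_pct: int, remainder_basis: str = "") -> str:
    """Build a publishable MFid label; raise if the coverage rule is not met."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if not 0 <= tier1_pct <= 100:
        raise ValueError("coverage must be a percentage")
    label = f"MFid {score:.2f} (Tier-1 over {tier1_pct}% of subsystems"
    if tier1_pct < 100:
        if not remainder_basis:
            # Without a named basis for the uncovered part, the number is a draft.
            raise ValueError("uncovered remainder needs a named evidence basis")
        label += f"; {remainder_basis} for the remaining {100 - tier1_pct}%"
    return label + ")"

print(published_label(0.81, 72, "Tier-2 published spec"))
# MFid 0.81 (Tier-1 over 72% of subsystems; Tier-2 published spec for the remaining 28%)
```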

Limitations we will not hide

  • The four dimensions are not commensurable in a strict measurement-theoretic sense. The geometric-mean aggregation is a policy choice — it penalizes weakness — and not a derivation. We have argued the choice above; we have not proved it is unique.
  • Intentionality is a meta-level measurement. It depends on a specification existing, being current, and being readable. The rubric handles this by scoring spec clarity (I_spec) as a sub-factor, but the dependence is real.
  • The aggregate MFid is computed by SDCorp. The framework, the rubrics, the rollups, the publication cadence are ours. The audit-of-the-audit problem is unresolved by this page alone. Our current path to resolution: standards engagement (in progress), academic co-authorship (in progress), third-party attestation of a sample engagement (planned). Status published quarterly on the live status page.

Versioning and change log

This methodology is versioned. The current version is MFid Methodology v2.1 (2026-05-28). The v1.x line published two formulas without reconciling them and used two names for the same evidence tier (then numbered third, now Tier 1); v2.0 (2026-05-27) fixes both, publishes the Intentionality rubric, and adds the coverage-label requirement. v2.1 renumbers the evidence tiers without changing their definitions. Older case-study numbers will be retro-annotated with v2.0-equivalent labels by end of quarter.

Change log is maintained in the MFid open-source repository under CHANGELOG.md.

Want the formula applied to your stack?

Bring the spec, we bring the math. The number does the rest.

Request an Investigation