How we measure engineering performance
Lines of code, commits, story points, and DORA metrics describe activity. They don't say whether the output became more valuable. ETV scores every merged file change by what it meant, where it landed, and whether it touched the architecture, then rolls those values up to engineers and organizations.
At a glance
ETV is the unit
Engineering Throughput Value is computed per file inside each merged commit. It combines five factors and stays separate by work type.
Five factors, one per-file score
Complexity, Engagement, Architecture, Decay, and Multiplier combine into the per-file score. Each factor is computed on its own before the five are combined.
Three buckets, kept separate
Growth, Maintenance, and Fixes are scored independently and reported side by side. The buckets are non-additive on purpose.
Deterministic, with ML calibration
The scoring algorithm is deterministic. ML tunes thresholds inside it; an LLM only classifies when pattern signals are insufficient.
Overview
ETV reads each change the way a senior reviewer would: not just what changed, but what it meant, where it landed, and whether it touched the architecture. The measurement is built so that shipping a new capability, refactoring a module, and repairing a long-standing bug are counted as different kinds of output.
ETV, Engineering Throughput Value
A dimensionless unit for engineering output. ETV is additive within a work type and deliberately not additive across types, so Growth, Maintenance, and Fixes never collapse into a single performance score.
Per file, per merged commit
The atomic unit is an individual file change within a merged commit on the default branch. Each file change receives a work-type label and a per-file score. Per-file scores roll up hierarchically: contributor, team, repository, organization.
ETV is computed per file, per merged commit, from five factors. Complexity and Engagement set the base score. Architecture places the change in context. Decay and Multiplier adjust the result before aggregation.
Complexity
Structural weight of the change. Cognitive load of added and modified lines inside their function scope.
Engagement
Ratio of surrounding code complexity to the change itself. How much of the file and the repository the developer had to understand to land the change safely.
Architecture
Where the change lands in the feature graph. Cross-feature surfaces and integration points carry more weight than leaf-node edits.
Decay
Reduces credit for mechanical refactors, boilerplate, copy-paste, and rewrites of the author's own recent work.
Multiplier
Amplifies fixes when the underlying bug was costly: old, across an authorship boundary, or sitting under heavy churn.
per_file_score = f(complexity, engagement) × architecture × decay × multiplier
Conceptual shape. Complexity and engagement produce the base score; architecture, decay, and multiplier adjust it. Coefficients are calibrated against a labeled corpus and applied identically across organizations.
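The shape of the formula can be sketched in Python. This is an illustration only: the base function f is not specified here, so the geometric mean used below is an assumed stand-in, and the class name, field names, and numeric values are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class FileFactors:
    """Hypothetical container for the five factors of one file change."""
    complexity: float    # structural weight of the change
    engagement: float    # surrounding context the author had to understand
    architecture: float  # placement multiplier from the feature graph
    decay: float         # between 0 and 1, reduces credit for low-novelty work
    multiplier: float    # 1 or more, amplifies costly fixes


def base_score(complexity: float, engagement: float) -> float:
    # Assumed stand-in for f(complexity, engagement); the calibrated
    # function and its coefficients are not given in this document.
    return sqrt(complexity * engagement)


def per_file_score(f: FileFactors) -> float:
    # Base score first, then the three adjustments applied as multipliers.
    return base_score(f.complexity, f.engagement) * f.architecture * f.decay * f.multiplier


example = FileFactors(complexity=4.0, engagement=9.0,
                      architecture=1.5, decay=0.5, multiplier=2.0)
print(per_file_score(example))  # 9.0
```

Because the adjustments are multiplicative, a decay of 0.5 halves the credit regardless of how large the base score is.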
The three buckets
Every file change is classified into one of three work types. Scores stay separate by bucket so the report can show what an organization shipped, what it maintained, and what it repaired without collapsing the distinction into one number.
| Bucket | Definition | Conventional hint |
|---|---|---|
| Growth | Net-new functionality and capability: added endpoints, new modules, new product surface area. | feat |
| Maintenance | Upkeep, refactors, cleanup, performance tuning, tests, dependency updates, docs, style, build, CI. | chore, refactor, perf, test, style, build, ci, docs |
| Fixes | Work that corrects previous output: bug fixes, regressions, hotfixes. | fix |
How classification works
Classification is per file, not per commit. A commit that adds a feature and patches a test in the same diff produces Growth ETV for the feature file and Maintenance ETV for the test file. Mixed commits are common; the per-file split keeps them honest.
Pattern signals (commit message, path, change shape, surrounding context) resolve most files deterministically. An LLM classifies the remaining ambiguous cases. For fixes, the engine traces the bug back to the commit and author that introduced it.
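A minimal sketch of the deterministic pass, assuming only two of the four pattern signals (commit message and path). The prefix-to-bucket mapping follows the table above; the path patterns and the function name `classify_file` are hypothetical, and a return value of `None` stands for "hand this file to the LLM classifier".

```python
import re
from typing import Optional

# Conventional-commit prefixes mapped to buckets, as in the table above.
PREFIX_TO_BUCKET = {
    "feat": "Growth",
    "fix": "Fixes",
    "chore": "Maintenance", "refactor": "Maintenance", "perf": "Maintenance",
    "test": "Maintenance", "style": "Maintenance", "build": "Maintenance",
    "ci": "Maintenance", "docs": "Maintenance",
}

# Assumed path patterns that override the commit-level hint for one file.
MAINTENANCE_PATHS = re.compile(r"(^|/)(tests?|docs?|\.github)/|_test\.go$|\.md$")


def classify_file(commit_message: str, path: str) -> Optional[str]:
    """Return a bucket, or None when the pattern signals are insufficient."""
    if MAINTENANCE_PATHS.search(path):
        return "Maintenance"
    match = re.match(r"(\w+)(\([^)]*\))?!?:", commit_message)
    if match:
        return PREFIX_TO_BUCKET.get(match.group(1).lower())
    return None  # ambiguous: would be passed to the LLM classifier


# One mixed commit, two files, two buckets.
message = "feat(api): add export endpoint"
print(classify_file(message, "src/api/export.py"))     # Growth
print(classify_file(message, "tests/test_export.py"))  # Maintenance
```

The per-file split is visible in the last two lines: the same commit message yields Growth for the feature file and Maintenance for the test file.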
Display modes
The canonical view shows three buckets side by side. An executive-friendly alternative collapses Maintenance and Fixes into KTLO (Keep The Lights On) and contrasts it with Growth. Both views use the same underlying scores; only the grouping changes.
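The relationship between the two views can be sketched as a pure regrouping. The function name and dictionary layout are assumptions; the point is that the KTLO view relabels the same three scores and does not recompute them.

```python
def ktlo_view(scores: dict[str, float]) -> dict[str, dict[str, float]]:
    """Regroup the three bucket scores without changing or adding them."""
    return {
        "Growth": {"Growth": scores["Growth"]},
        "KTLO": {"Maintenance": scores["Maintenance"], "Fixes": scores["Fixes"]},
    }


canonical = {"Growth": 12.0, "Maintenance": 5.0, "Fixes": 3.0}
print(ktlo_view(canonical))
```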
Scoring engine
Scoring runs per file inside each merged commit. The engine is deterministic: the same diff in the same context always produces the same score. ML tunes thresholds and coefficients inside that deterministic structure; an LLM is invoked only for classification when pattern signals are insufficient.
Stage 1, AI classification
Reads commit message, full diff, and surrounding code. Produces per-file work type, identifies changed symbols, and for fixes traces the bug back to the commit and author that introduced it.
Stage 2, mechanical scoring
Deterministic algorithms compute complexity and engagement, apply architecture, decay, and multiplier, and aggregate. No LLM at this stage. Reproducible from the same inputs.
Context complexity
Cognitive weight of modified lines, computed per function scope over added and modified lines. Same line count carries different weight depending on what those lines do. Treated consistently across languages and paradigms; React JSX is scored like Go.
Engagement
Surface area the developer had to understand to land the change safely. Within the file it tracks shared identifiers and data flow. Within the repository it traces values through callers and callees. Bounded so a one-line edit to a universal helper does not dominate.
Calibration is global
Thresholds and coefficients are calibrated against a labeled commit corpus and recalibrated periodically. The same calibration applies to every organization studied. There is no per-customer model training. Formulas are fixed and auditable.
Architecture
The architecture factor reflects where in the system a change lands. The engine derives a feature graph from code organization alone, maps each commit to one or more features, and applies a multiplier based on how that feature connects to the rest of the system.
Feature graph
AI analysis discovers distinct named features in the repository (for example auth, billing, checkout) and assigns each to a vertical layer: frontend, backend, or data. The graph is built from the code itself, not from issue trackers or roadmaps.
Commits are mapped to features through weighted path scoring. Exact path match outranks directory containment, which outranks filename affinity. A single commit can touch multiple features and is split accordingly.
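Weighted path scoring might look like the sketch below. Only the ordering (exact path, then directory containment, then filename affinity) comes from the text; the numeric weights, the `Feature` structure, and the normalization of a commit into shares are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Assumed weights; only their relative order is taken from the text.
EXACT, DIRECTORY, FILENAME = 3.0, 2.0, 1.0


@dataclass
class Feature:
    """Hypothetical description of one feature in the feature graph."""
    name: str
    files: set[str] = field(default_factory=set)        # exact paths
    directories: set[str] = field(default_factory=set)  # containing directories
    keywords: set[str] = field(default_factory=set)     # filename hints


def match_strength(path: str, feature: Feature) -> float:
    """Strongest signal linking one path to one feature, or 0.0 for none."""
    p = PurePosixPath(path)
    if path in feature.files:
        return EXACT
    if any(PurePosixPath(d) in p.parents for d in feature.directories):
        return DIRECTORY
    if any(k in p.name.lower() for k in feature.keywords):
        return FILENAME
    return 0.0


def split_commit(paths: list[str], features: list[Feature]) -> dict[str, float]:
    """Share of a commit attributed to each feature; shares sum to 1."""
    totals: dict[str, float] = {}
    for path in paths:
        for feature in features:
            strength = match_strength(path, feature)
            if strength > 0:
                totals[feature.name] = totals.get(feature.name, 0.0) + strength
    grand_total = sum(totals.values())
    return {n: s / grand_total for n, s in totals.items()} if grand_total else {}


auth = Feature("auth", files={"src/auth/session.py"}, directories={"src/auth"})
billing = Feature("billing", directories={"src/billing"}, keywords={"invoice"})
commit_paths = ["src/auth/session.py", "src/billing/charge.py", "src/shared/invoice_utils.py"]
print(split_commit(commit_paths, [auth, billing]))  # {'auth': 0.5, 'billing': 0.5}
```

Here one exact match for auth (3.0) balances a directory match plus a filename match for billing (2.0 + 1.0), so the commit is split evenly.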
Architecture multiplier
Changes at cross-feature integration points or at the seam between layers carry more weight than leaf-node edits. The multiplier rises with the number of features connected by the change and with the depth of the integration.
Decay and amplification
Four adjustments shape the per-file score after complexity, engagement, and architecture. Three of them reduce credit for mechanical or low-novelty work. One of them amplifies fixes when the underlying bug was expensive.
| Factor | Direction | Effect |
|---|---|---|
| Similarity dampener | Reduces | Credit drops when the change structure matches existing patterns: mechanical refactors and boilerplate. |
| Blame decay | Reduces | Discounts changes that overwrite the same author's own work when that work was written within a short business-day window. |
| Copy decay | Reduces | Reduces credit for literally duplicated lines copied from elsewhere in the repository. |
| Waste multiplier | Amplifies | Applies to fixes only. Grows with bug age, with context transfer (fix touches another author's code), and with churn in the affected area since the bug was introduced. |
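As one illustration of these adjustments, blame decay could be sketched as follows. The window length, the linear ramp, and the floor are all assumptions: the table says only that the window is short and measured in business days.

```python
from datetime import date, timedelta

WINDOW_BUSINESS_DAYS = 5  # assumed; the text says only "short"
FLOOR = 0.1               # assumed minimum credit for a same-day rewrite


def business_days_between(start: date, end: date) -> int:
    """Count weekdays after `start` up to and including `end`."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            days += 1
    return days


def blame_decay(previous_change: date, new_change: date, same_author: bool) -> float:
    """Credit factor in (0, 1]; 1.0 means no discount."""
    if not same_author:
        return 1.0
    elapsed = business_days_between(previous_change, new_change)
    if elapsed >= WINDOW_BUSINESS_DAYS:
        return 1.0
    # Assumed linear ramp: the more recent the overwritten work, the less credit.
    return max(FLOOR, elapsed / WINDOW_BUSINESS_DAYS)


# Friday to the following Monday is one business day, not three calendar days.
print(blame_decay(date(2024, 3, 1), date(2024, 3, 4), same_author=True))  # 0.2
```

Counting business days rather than calendar days keeps a Friday change rewritten on Monday inside the window.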
What is scored
Not every file is scored. Generated artifacts, lockfiles, and binaries are filtered before analysis. Language coverage is split between full structural analysis and partial line-level analysis.
Filtered files (excluded from scoring)
- Generated code: .pb.go, _grpc.pb.go, .pb.ts, .graphql.ts, OpenAPI specs, *_generated.go, *.gen.go, zz_generated.*
- Dependency lockfiles: go.sum, package-lock.json, yarn.lock, pnpm-lock.yaml, Cargo.lock, Gemfile.lock
- Build artifacts and vendored trees: dist/, build/, .next/, vendor/, node_modules/, content-hash bundles
- Minified files detected by heuristic (average line length over 300 characters)
- Binary and media: images, fonts, PDFs, archives, compiled binaries
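The filter list above can be approximated with a short predicate. This is a sketch, not the engine's filter: the entries in the list are read as filename suffixes or globs, the binary extensions are an assumed subset, and OpenAPI specs and content-hash bundles are left out because the text does not say how they are detected. The 300-character threshold is the one stated above.

```python
import fnmatch
from pathlib import PurePosixPath

GENERATED_PATTERNS = ["*.pb.go", "*_grpc.pb.go", "*.pb.ts", "*.graphql.ts",
                      "*_generated.go", "*.gen.go", "zz_generated.*"]
LOCKFILES = {"go.sum", "package-lock.json", "yarn.lock",
             "pnpm-lock.yaml", "Cargo.lock", "Gemfile.lock"}
EXCLUDED_DIRS = {"dist", "build", ".next", "vendor", "node_modules"}
BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".woff2", ".pdf", ".zip"}  # assumed subset
MINIFIED_AVG_LINE_LENGTH = 300


def looks_minified(text: str) -> bool:
    """Heuristic from the list above: average line length over 300 characters."""
    lines = text.splitlines()
    if not lines:
        return False
    return sum(len(line) for line in lines) / len(lines) > MINIFIED_AVG_LINE_LENGTH


def is_scored(path: str, text: str = "") -> bool:
    """Return False for files that would be filtered before analysis."""
    p = PurePosixPath(path)
    if p.name in LOCKFILES:
        return False
    if any(fnmatch.fnmatchcase(p.name, pattern) for pattern in GENERATED_PATTERNS):
        return False
    if EXCLUDED_DIRS.intersection(p.parts[:-1]):
        return False
    if p.suffix.lower() in BINARY_SUFFIXES:
        return False
    return not looks_minified(text)


print(is_scored("src/app.py", "print('hi')\n"))     # True
print(is_scored("api/user.pb.go"))                  # False
print(is_scored("web/node_modules/left/index.js"))  # False
```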
Supported languages
Full structural analysis (13)
Partial analysis, classification and line-level signals (7)
Report-layer aggregation
The report layer is independent of the scoring engine. It defines how per-commit ETV values aggregate up to engineers and organizations, how cohorts are constructed across quarters, and how confidence intervals are estimated.
Three aggregation levels
| Level | Computation |
|---|---|
| Per commit | Per-file scores summed within each work type, giving three subscores (Growth, Maintenance, Fixes) that are not added together. |
| Per engineer per quarter | Sum of per-commit values for the engineer's merged commits in the quarter, by work type. |
| Per organization per quarter | Developer-weighted mean across qualifying engineers in the organization. |
Developer-weighted mean
Cross-organization aggregates use a developer-weighted mean so larger organizations do not dominate the comparison. The org-level value reports output per software engineer, not total output.
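The roll-up can be sketched under one stated assumption: "developer-weighted mean" is read here as each qualifying engineer counting once, so the organization value is average output per engineer. The tuple layout and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

BUCKETS = ("Growth", "Maintenance", "Fixes")


def per_engineer(file_scores):
    """Sum scores by (engineer, quarter), keeping the three buckets separate.

    file_scores: iterable of (engineer, quarter, bucket, score) tuples.
    """
    totals = defaultdict(lambda: dict.fromkeys(BUCKETS, 0.0))
    for engineer, quarter, bucket, score in file_scores:
        totals[(engineer, quarter)][bucket] += score
    return dict(totals)


def per_organization(engineer_totals, quarter):
    """Mean per engineer for one quarter, by bucket (assumed interpretation)."""
    rows = [t for (_, q), t in engineer_totals.items() if q == quarter]
    if not rows:
        return {}
    return {bucket: mean(row[bucket] for row in rows) for bucket in BUCKETS}


scores = [
    ("ana", "2024Q1", "Growth", 10.0),
    ("ana", "2024Q1", "Fixes", 2.0),
    ("ben", "2024Q1", "Growth", 4.0),
    ("ben", "2024Q1", "Maintenance", 6.0),
]
print(per_organization(per_engineer(scores), "2024Q1"))
# {'Growth': 7.0, 'Maintenance': 3.0, 'Fixes': 1.0}
```

At no level are Growth, Maintenance, and Fixes added to each other; every aggregate stays a three-part value.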
Attribution
Primary credit goes to the git author of the merged commit, after email-alias resolution. Co-authors are tracked in the knowledge graph but do not receive score credit. Commits are attributed to the quarter of their merge date, not their authorship date.
A default list of automation accounts (dependabot, renovate, github-actions, and similar) is filtered from aggregates. Platform aggregates are scoped to contributors tagged with role = SWE; non-engineering roles still produce commits in the knowledge graph but do not enter the numerators or denominators of reported figures.
At the organization level, contributors active in multiple repositories are counted once and their output is summed across repositories.
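The attribution rules can be sketched as a single function. The commit dictionary keys, the alias map, and the handling of a `[bot]` suffix are hypothetical; the bot names, the role tag, and the use of merge date come from the text above.

```python
from datetime import date
from typing import Optional

ALIASES = {"ana@old.example": "ana@example.com"}      # hypothetical alias map
BOTS = {"dependabot", "renovate", "github-actions"}   # defaults named in the text


def quarter_of(merge_date: date) -> str:
    """Label a date with its calendar quarter, for example 2024Q1."""
    return f"{merge_date.year}Q{(merge_date.month - 1) // 3 + 1}"


def attribute(commit: dict) -> Optional[tuple[str, str]]:
    """Return (author, quarter), or None when the commit is filtered out."""
    if commit["author_name"].lower().removesuffix("[bot]") in BOTS:
        return None  # automation accounts are excluded from aggregates
    if commit.get("role") != "SWE":
        return None  # platform aggregates are scoped to role = SWE
    author = ALIASES.get(commit["author_email"], commit["author_email"])
    # Merge date, not authorship date, decides the quarter.
    return author, quarter_of(commit["merged_on"])


commit = {
    "author_name": "Ana",
    "author_email": "ana@old.example",
    "role": "SWE",
    "authored_on": date(2023, 12, 30),
    "merged_on": date(2024, 1, 2),
}
print(attribute(commit))  # ('ana@example.com', '2024Q1')
```

A commit written in late December but merged in January lands in the first quarter of the new year, as the example shows.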
Bootstrap confidence intervals
Intervals around aggregate values are computed by the bootstrap method, resampling the underlying observations with replacement, rather than by assuming a parametric distribution. The shaded bands on aggregate charts are 95% bootstrap intervals.
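A minimal percentile-bootstrap sketch for the mean. The text does not say which bootstrap variant is used or what the resampled observations are, so the percentile method, per-engineer values as the observations, the resample count, and the fixed seed are all assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement, same size as the original sample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


per_engineer_growth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
low, high = bootstrap_ci(per_engineer_growth)
print(f"mean={mean(per_engineer_growth):.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

No distributional assumption enters the calculation: the interval is read directly from the sorted resampled means.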
Fixed-period cohort
The cohort used for quarter-over-quarter comparison is the intersection of qualifying SWEs across every quarter of the reporting window. An SWE qualifies for a quarter by merging at least one scored commit in it, so every cohort member merged at least one scored commit in each quarter. Contributors who appear in only some quarters are excluded from the cohort but remain visible in per-quarter aggregates.
This fixed-period construction controls for composition changes (growth, attrition, reorganizations) so quarterly deltas reflect changes in output, not changes in who was counted.
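The cohort construction reduces to a set intersection. The input layout (quarter label mapped to the set of SWEs with at least one scored commit in that quarter) and the function name are assumptions for illustration.

```python
def fixed_period_cohort(active_by_quarter: dict[str, set[str]]) -> set[str]:
    """SWEs who qualified in every quarter of the reporting window."""
    quarters = list(active_by_quarter.values())
    if not quarters:
        return set()
    return set(quarters[0]).intersection(*quarters[1:])


active = {
    "2024Q1": {"ana", "ben", "cy"},
    "2024Q2": {"ana", "ben"},        # cy merged nothing this quarter
    "2024Q3": {"ana", "ben", "dee"}, # dee joined late
}
print(sorted(fixed_period_cohort(active)))  # ['ana', 'ben']
```

Here cy and dee still appear in the per-quarter aggregates for the quarters in which they were active, but neither enters the quarter-over-quarter comparison.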
Limitations
Every measurement system has boundaries. These are the ones worth keeping in mind when interpreting the numbers or comparing organizations.
- Public default branches only. Private forks, feature branches, and unmerged work are not scored.
- Merged code only. Reviews, planning, incident response, mentorship, and architectural decisions are not directly captured.
- The repository is the outer boundary of engagement. Cross-repository data flow is not modeled, so a change in one repo carries no engagement weight from another.
- Cross-organization raw totals are not automatically normalized. Compare trends within an organization rather than absolute totals across organizations with different sizes and language mixes.
- Co-authored commits credit only the primary author. Collaborators are tracked but not scored.
- Work-type classification is probabilistic per file. Individual edge cases may be debatable; aggregate statistics are stable.
- Causal claims (this change drove this outcome) are out of scope. ETV describes output, not business impact.