* METHODOLOGY · CALIBRATION_V1

Methodology

We test our developer-scoring algorithm against 20 hand-curated repos covering wrappers, SDK-glue MVPs, senior production infra, and deep-tech foundations. Here's how it scores them and how close that lands to a human reviewer's judgment.

  • In expected range: 17/20
  • Soft drift (±5): 2
  • Hard drift (>5): 0
  • Failed audits: 1

How scoring works

Scores combine four buckets: Features (40), Architecture (15), Intent & Standards (25), and Forensics (20). Total: 100.
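
As a minimal sketch of how the four buckets add up, assume each bucket score is clamped to its maximum and then summed. The names BUCKET_MAX and total_score are illustrative, not the actual implementation.

```python
# Bucket maximums come from the text above; the clamping rule and all
# names here are illustrative assumptions.
BUCKET_MAX = {
    "features": 40,
    "architecture": 15,
    "intent_standards": 25,
    "forensics": 20,
}

# The four maximums must add up to the 100-point total.
assert sum(BUCKET_MAX.values()) == 100


def total_score(bucket_scores: dict[str, float]) -> float:
    """Clamp each bucket to [0, max] and sum; missing buckets count as 0."""
    unknown = set(bucket_scores) - set(BUCKET_MAX)
    if unknown:
        raise ValueError(f"unknown buckets: {sorted(unknown)}")
    return sum(
        min(max(bucket_scores.get(name, 0), 0), cap)
        for name, cap in BUCKET_MAX.items()
    )
```

Under these assumptions, a repo with 35 for Features, 12 for Architecture, 20 for Intent & Standards, and 15 for Forensics totals 82.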

After the LLM produces tentative claims, we apply two deterministic caps:

  • Layer cap: UI-only claims (frontend components, shaders, audio visualization) cannot reach Tier 3. AI can one-shot most UI work, so it does not count as engineering depth.
  • SDK-glue cap: claims dominated by external-SDK orchestration (Twilio, LiveKit, OpenAI, etc.) cap at Tier 2 unless the work implements a custom protocol or novel coordination pattern.
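
The two caps can be sketched as a pure function over a tentative claim. This is a hedged illustration: the Claim fields (ui_only, sdk_dominated, custom_protocol) and the TIER_1 name are hypothetical, and it assumes Tier 2 and Tier 3 correspond to the TIER_2_LOGIC and TIER_3_DEEP labels used elsewhere on this page.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    TIER_1 = 1        # hypothetical name; the page only names tiers 2 and 3
    TIER_2_LOGIC = 2
    TIER_3_DEEP = 3


@dataclass(frozen=True)
class Claim:
    title: str
    tier: Tier                     # tentative tier proposed by the LLM
    ui_only: bool = False          # assumed flag: frontend, shaders, audio visualization
    sdk_dominated: bool = False    # assumed flag: mostly external-SDK orchestration
    custom_protocol: bool = False  # assumed flag: custom protocol or novel coordination


def apply_caps(claim: Claim) -> Tier:
    """Return the claim's tier after both deterministic caps."""
    tier = claim.tier
    # Layer cap: UI-only work cannot reach Tier 3.
    if claim.ui_only:
        tier = min(tier, Tier.TIER_2_LOGIC)
    # SDK-glue cap: Tier 2 at most, unless the exemption applies.
    if claim.sdk_dominated and not claim.custom_protocol:
        tier = min(tier, Tier.TIER_2_LOGIC)
    return tier
```

Because the caps only ever lower a tier, applying them in either order gives the same result, and a claim already at or below Tier 2 passes through unchanged.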

The LLM also extracts "Why X over Y" tradeoffs for each claim: explicit engineering decisions visible in code or comments, with file:line citations. These tradeoffs, not the score, are the judgment signals hiring managers actually act on.
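
One way to represent such a tradeoff is sketched below. It assumes citations take the form path:line or path:start-end, as in the examples later on this page. The Tradeoff and Citation types and the parse_citation helper are hypothetical, not the actual schema.

```python
import re
from dataclasses import dataclass

# Matches citations such as "_main.py:141-171" or "__init__.py:405".
_CITATION = re.compile(r"^(?P<path>[^:\s]+):(?P<start>\d+)(?:-(?P<end>\d+))?$")


@dataclass(frozen=True)
class Citation:
    path: str
    start: int
    end: int


@dataclass(frozen=True)
class Tradeoff:
    chosen: str         # the "X" in "Why X over Y"
    rejected: str       # the "Y"
    rationale: str      # the reason visible in code or comments
    citation: Citation  # where a reviewer can verify it


def parse_citation(text: str) -> Citation:
    """Parse 'path:line' or 'path:start-end' into a Citation."""
    match = _CITATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a file:line citation: {text!r}")
    start = int(match["start"])
    # A single-line citation covers exactly one line.
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError(f"line range runs backwards: {text!r}")
    return Citation(match["path"], start, end)
```

Rejecting malformed or backwards ranges up front keeps every stored tradeoff checkable against the cited lines.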

Calibration tiers

Wrapper
Expected: 40-55

Simple utilities, single-purpose libraries. No engineering depth.

Mid-Glue
Expected: 65-78

Competent SDK orchestration. Real engineering, not novel algorithms.

Senior Infra
Expected: 75-88

Production-grade systems with real operational maturity.

Deep Tech
Expected: 85-95

Novel algorithms, distributed primitives, compiled-language depth.

Edge Case
Expected: varies

Algo robustness checks. Repos where naive scoring would fail.
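
The expected ranges above, together with the soft (±5) and hard (>5) drift thresholds, suggest a simple check. The sketch below assumes drift is measured from the nearest bound of the expected range. That assumption, and the function names, are illustrative rather than the actual implementation.

```python
# Expected ranges from the tiers above. Edge Case is omitted because its
# range "varies" per repo.
EXPECTED_RANGES = {
    "Wrapper": (40, 55),
    "Mid-Glue": (65, 78),
    "Senior Infra": (75, 88),
    "Deep Tech": (85, 95),
}

SOFT_DRIFT_LIMIT = 5  # points outside the range still counted as soft drift


def drift(score: int, expected: tuple[int, int]) -> int:
    """Signed distance from the expected range; 0 means in range."""
    low, high = expected
    if score < low:
        return score - low
    if score > high:
        return score - high
    return 0


def classify(score: int, expected: tuple[int, int]) -> str:
    """Label a score as in_range, soft_drift, or hard_drift."""
    distance = abs(drift(score, expected))
    if distance == 0:
        return "in_range"
    return "soft_drift" if distance <= SOFT_DRIFT_LIMIT else "hard_drift"
```

Under this reading, a Senior Infra repo scored 92 against an expected 75-88 has a drift of +4 and counts as soft drift.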

Calibration set

Wrapper (3 repos)

Mid-Glue (2 repos)

Senior Infra (10 repos)

Deep Tech (3 repos)

Edge Case (2 repos)

Known limitations

Two repos in the calibration set score slightly above their expected range, both within 5 points of the upper bound. Both stem from the same algo behavior: the LLM occasionally dresses up procedural orchestration in state-machine or transformation-engine framing. The post-LLM rule-9 enforcement catches this in most cases but missed these specific phrasings. We document both cases here for transparency; fixing them is on the iteration backlog.

asottile/pyupgrade · algo 92 · expected 75-88 · +4 over

Algo flagged "Source-to-Source Transformation Engine" as TIER_3_DEEP. Reading _main.py:141-171 reveals that the function delegates to tokenize-rt's parse_format / unparse_parsed_string: procedural wrapping of existing infrastructure, not novel algorithmic work. Should be TIER_2_LOGIC.

jd/tenacity · algo 90 · expected 75-88 · +2 over

Algo flagged "action-based state machine for retry iteration" as TIER_3_DEEP. Reading __init__.py:405-427 shows the actual code is a plain loop, "for action in self.iter_state.actions: action(retry_state)", which is a callback queue, not a state machine. Should be TIER_2_LOGIC.

Disagree with a score?

Hand-labels are subjective. If you think a repo's expected range is wrong, or the algo behaves badly on a class of repo we haven't tested, open a PR on the calibration set. We re-run the corpus on every algorithm change and surface drifts publicly.