Methodology
We test our developer-scoring algorithm against 20 hand-curated repos covering wrappers, SDK-glue MVPs, senior production infra, and deep-tech foundations. Here's how it scores them and how close that lands to a human reviewer's judgment.
How scoring works
Scores combine four buckets: Features (40), Architecture (15), Intent & Standards (25), and Forensics (20). Total: 100.
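As a rough illustration of how the four buckets add up, here is a minimal sketch. The bucket names and maxima come from the text above; the `BUCKET_MAX` and `total_score` names, and the choice to clamp each bucket to its maximum, are assumptions made for this example and not the actual implementation.

```python
from typing import Mapping

# Bucket maxima as stated above. The dict and function names are hypothetical.
BUCKET_MAX = {
    "features": 40,
    "architecture": 15,
    "intent_and_standards": 25,
    "forensics": 20,
}

def total_score(buckets: Mapping[str, float]) -> float:
    """Sum per-bucket points, clamping each bucket to [0, its maximum].

    Clamping is an assumption of this sketch; missing buckets count as 0.
    """
    unknown = set(buckets) - set(BUCKET_MAX)
    if unknown:
        raise ValueError(f"unknown buckets: {sorted(unknown)}")
    return sum(
        min(max(buckets.get(name, 0), 0), cap)
        for name, cap in BUCKET_MAX.items()
    )

# The four maxima add up to the stated total of 100.
assert sum(BUCKET_MAX.values()) == 100
```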
After the LLM produces tentative claims, we apply two deterministic caps:
- Layer cap: UI-only claims (frontend components, shaders, audio visualization) cannot reach Tier 3. AI can one-shot most UI work, so it does not count as engineering depth.
- SDK-glue cap: claims dominated by external-SDK orchestration (Twilio, LiveKit, OpenAI, etc.) cap at Tier 2 unless the work implements a custom protocol or novel coordination pattern.
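The two caps above could be sketched as a pure function applied after the LLM step. This is an illustrative sketch only: the `Claim` fields and the `apply_caps` name are hypothetical, tiers are modelled as plain integers, and how a claim gets flagged as UI-only or SDK-dominated is not covered here.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Claim:
    # All field names are assumptions made for this sketch.
    title: str
    tier: int                        # tentative tier proposed by the LLM
    ui_only: bool = False            # frontend components, shaders, audio visualization
    sdk_dominated: bool = False      # mostly external-SDK orchestration
    novel_coordination: bool = False # custom protocol or novel coordination pattern

def apply_caps(claim: Claim) -> Claim:
    """Return a copy of the claim with both deterministic caps applied."""
    capped = claim.tier
    # Layer cap: UI-only claims cannot reach Tier 3.
    if claim.ui_only:
        capped = min(capped, 2)
    # SDK-glue cap: Tier 2 at most, unless the exemption applies.
    if claim.sdk_dominated and not claim.novel_coordination:
        capped = min(capped, 2)
    return replace(claim, tier=capped)
```

Because the function only ever lowers a tier, running it twice gives the same result as running it once, which keeps the post-LLM step deterministic.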
The LLM also extracts "Why X over Y" tradeoffs for each claim: explicit engineering decisions visible in code or comments, with file:line citations. Hiring managers act on these judgment signals, not on the score.
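One way to represent such a tradeoff is a small record with a parsed citation. The sketch below is an assumption about shape, not the actual schema: the `Tradeoff` and `Citation` types and the `parse_citation` helper are hypothetical, and the parser only accepts the `file:line` and `file:start-end` forms used on this page.

```python
import re
from dataclasses import dataclass

# Matches "path:LINE" or "path:START-END", e.g. "_main.py:141-171".
_CITATION = re.compile(r"^(?P<path>.+):(?P<start>\d+)(?:-(?P<end>\d+))?$")

@dataclass(frozen=True)
class Citation:
    path: str
    start: int
    end: int  # equal to start for single-line citations

@dataclass(frozen=True)
class Tradeoff:
    # Hypothetical fields for a "Why X over Y" record.
    chosen: str
    rejected: str
    rationale: str
    citation: Citation

def parse_citation(text: str) -> Citation:
    """Parse a file:line citation, rejecting malformed or inverted ranges."""
    match = _CITATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a file:line citation: {text!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError(f"inverted line range: {text!r}")
    return Citation(match["path"], start, end)
```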
Calibration tiers
- Wrapper: Simple utilities, single-purpose libraries. No engineering depth.
- Mid-Glue: Competent SDK orchestration. Real engineering, not novel algorithms.
- Senior Infra: Production-grade systems with real operational maturity.
- Deep Tech: Novel algorithms, distributed primitives, compiled-language depth.
- Edge Case: Algo robustness checks. Repos where naive scoring would fail.
Calibration set
Wrapper (3 repos)
Mid-Glue (2 repos)
Senior Infra (10 repos)
Deep Tech (3 repos)
Edge Case (2 repos)
Known limitations
Two repos in the calibration set drift slightly above their expected range (within 5-10 points of it). Both stem from the same algo behavior: the LLM occasionally dresses up procedural orchestration in state-machine or transformation-engine framing. The post-LLM rule-9 enforcement catches this in most cases but missed these specific phrasings. We document both cases here for transparency; fixing them is on the iteration backlog.
Algo flagged "Source-to-Source Transformation Engine" as TIER_3_DEEP. Reading _main.py:141-171 reveals the function delegates to tokenize-rt's parse_format / unparse_parsed_string: procedural wrapping of existing infrastructure, not novel algorithmic work. Should be TIER_2_LOGIC.
Algo flagged "action-based state machine for retry iteration" as TIER_3_DEEP. Reading __init__.py:405-427 shows the actual code is: for action in self.iter_state.actions: action(retry_state). That is a callback queue, not a state machine. Should be TIER_2_LOGIC.
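To make the callback-queue versus state-machine distinction concrete, here is a generic sketch. It is not code from either repo: the function names, states, and events are all hypothetical and chosen only to show the structural difference.

```python
from typing import Callable, Dict, List, Tuple

def run_callback_queue(actions: List[Callable[[list], None]], ctx: list) -> None:
    """Callback queue: every action runs once, in order.

    There is no notion of a current state and no event-driven branching.
    """
    for action in actions:
        action(ctx)

# State machine: the next state depends on (current state, event).
# These states and events are made up for illustration.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("idle", "start"): "running",
    ("running", "fail"): "retrying",
    ("retrying", "start"): "running",
    ("running", "ok"): "done",
}

def step(state: str, event: str) -> str:
    """Advance the machine, rejecting transitions the table does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state!r} on {event!r}") from None
```

The first function is a loop over callables. The second encodes which moves are legal from each state, and that is the property a "state machine" label should imply.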
Disagree with a score?
Hand-labels are subjective. If you think a repo's expected range is wrong, or the algo behaves badly on a class of repo we haven't tested, open a PR on the calibration set. We re-run the corpus on every algorithm change and surface drifts publicly.
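A drift check of this kind can be sketched as a comparison between each repo's hand-labeled expected range and its latest score. The `find_drifts` name and the data shapes are assumptions made for this example, not the actual pipeline.

```python
from typing import Dict, Tuple

def find_drifts(
    expected: Dict[str, Tuple[float, float]],
    actual: Dict[str, float],
) -> Dict[str, float]:
    """Return signed drift per repo whose score falls outside its expected range.

    Positive values mean the score is above the range, negative means below.
    Repos inside their range are omitted.
    """
    drifts: Dict[str, float] = {}
    for repo, (low, high) in expected.items():
        score = actual[repo]
        if score > high:
            drifts[repo] = score - high
        elif score < low:
            drifts[repo] = score - low
    return drifts
```

Running a check like this on every algorithm change would flag any repo whose score leaves its hand-labeled range.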