* METHODOLOGY . BETA . SINGLE-RATER

Methodology

Oddit Hire audits a developer's public shipping behavior and produces a 6-section audit report with file-and-line evidence. This page documents what the algorithm measures, what it caps, and the 20-repo calibration corpus we tune against. The scores surface evidence to inform interviews; hiring decisions remain with the recruiter.

Beta status: the calibration corpus is currently single-rater (the founder). We have not yet completed multi-rater validation. That work is on the roadmap; until it ships, scores represent one engineer's judgment about what each rubric tier should look like, not a peer-reviewed validity study.

17/20 in expected range. Other outcome categories: soft drift (within 5 points), hard drift (over 5 points), failed audits.

How scoring works

The pipeline (V4) classifies each audited repo into one of 10 engineering disciplines (among them Web Frontend, Mobile Frontend, Backend/Systems, Fullstack, AI/ML, Data Science, Data Analytics, DevOps/SRE, and Library/Framework/CLI) and applies a discipline-specific tier rubric. Each claim found in the repo is then classified into one of three depth tiers:

TIER_1_INTEGRATION: wired standard tools / framework defaults
TIER_2_ENGINEERING: wrote custom decision logic on top of standard tools
TIER_3_INVENTION: built from lower-level primitives / novel architecture

Scores combine four buckets: Features (40), Architecture (15), Intent & Standards (25), and Forensics (20). Total: 100. A separate Ownership bucket (0-100) measures author-engagement signals (V5 person-level features) and is reported alongside the V4 repo score, not folded in.
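As a rough illustration of the bucket arithmetic above, the sketch below clamps each bucket to its stated maximum, sums the four into a 0-100 repo score, and carries Ownership as a separate field. The names `BUCKET_MAX`, `AuditScore`, and `combine` are hypothetical and not taken from the Oddit Hire codebase; only the bucket maxima come from this page.

```python
from dataclasses import dataclass

# Bucket maxima as stated on this page; the dict and key names are illustrative.
BUCKET_MAX = {
    "features": 40,
    "architecture": 15,
    "intent_standards": 25,
    "forensics": 20,
}


@dataclass(frozen=True)
class AuditScore:
    repo_score: int  # V4 repo score, 0-100
    ownership: int   # reported alongside, never added to repo_score


def combine(buckets: dict, ownership: float) -> AuditScore:
    """Clamp each bucket to its maximum, sum them, and keep Ownership separate."""
    if set(buckets) != set(BUCKET_MAX):
        raise ValueError(f"expected buckets {sorted(BUCKET_MAX)}")
    total = sum(min(max(buckets[name], 0), cap) for name, cap in BUCKET_MAX.items())
    return AuditScore(
        repo_score=round(total),
        ownership=round(min(max(ownership, 0), 100)),
    )


example = combine(
    {"features": 30, "architecture": 12, "intent_standards": 20, "forensics": 15},
    ownership=64,
)
print(example)
```

Keeping Ownership out of the sum mirrors the "reported alongside, not folded in" rule: a high author-engagement signal cannot raise the repo score.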

To dampen single-call LLM variance, every audit runs the synthesis phase k=3 times in parallel and selects the median of the k results (Wang et al., Self-Consistency, ICLR 2023). This reduces the documented single-shot variance from roughly ±15 points to roughly ±9 points. Tagger and map-phase results are cached across samples, so the cost increase is modest.
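A minimal sketch of median-of-k selection, assuming the synthesis phase can be wrapped in a callable that returns a scored sample. `Sample`, `run_synthesis`, and `median_of_k` are hypothetical names, and the thread pool is one plausible way to run the samples in parallel, not a description of the production pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Sample:
    score: float
    report: str  # stand-in for the full synthesized audit


def median_of_k(run_synthesis: Callable[[int], Sample], k: int = 3) -> Sample:
    """Run synthesis k times in parallel and return the sample with the median score."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number so the median is a real sample")
    with ThreadPoolExecutor(max_workers=k) as pool:
        samples = list(pool.map(run_synthesis, range(k)))
    samples.sort(key=lambda s: s.score)
    return samples[k // 2]


# Fake synthesis with fixed scores, standing in for three LLM calls.
fake_scores = [70.0, 85.0, 78.0]
chosen = median_of_k(lambda i: Sample(fake_scores[i], f"sample {i}"))
print(chosen)
```

With an odd k the median is always one of the actual samples, so the report shown to the reader is a complete synthesis run rather than an average of several.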

After the LLM produces tentative claims, we apply deterministic caps and gates:

Layer cap: claims whose evidence sits entirely in UI-layer files (components, shaders, audio visualization) cannot reach TIER_3_INVENTION via this path. Most UI work is generatable; lifting it to invention tier needs explicit non-UI primitive evidence elsewhere in the same claim.
SDK-glue cap: claims dominated by external-SDK orchestration (Twilio, LiveKit, OpenAI, etc.) cap at TIER_2_ENGINEERING unless the work implements a custom protocol or novel coordination pattern.
Rule-9 / textbook-pattern caps: these catch over-promotion when the LLM's own tier reasoning admits the work is well-documented or wraps a known pattern.
Universal tier gate: TIER_2 and TIER_3 require feature_type in {COMPLEX, CUSTOM}. Wrapper claims stay at TIER_1 regardless of LLM enthusiasm. The gate is currently uniform across disciplines; discipline-conditioned gates are on the roadmap.
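The caps and gate listed above could be sketched as a pure function over a claim record. Everything here is an assumption for illustration: the `Claim` fields, the boolean flags, the `"WRAPPER"` feature type value, and in particular the choice to cap Rule-9 hits at TIER_2_ENGINEERING, which this page does not specify.

```python
from dataclasses import dataclass, replace
from enum import IntEnum


class Tier(IntEnum):
    TIER_1_INTEGRATION = 1
    TIER_2_ENGINEERING = 2
    TIER_3_INVENTION = 3


@dataclass(frozen=True)
class Claim:
    tier: Tier                      # tentative tier proposed by the LLM
    feature_type: str               # e.g. "COMPLEX", "CUSTOM", "WRAPPER" (hypothetical values)
    ui_only_evidence: bool = False  # all evidence sits in UI-layer files
    sdk_dominated: bool = False     # mostly external-SDK orchestration
    custom_protocol: bool = False   # custom protocol / novel coordination pattern
    admits_textbook: bool = False   # LLM reasoning admits a known pattern


def apply_caps(claim: Claim) -> Claim:
    """Apply the deterministic caps, then the universal tier gate."""
    tier = claim.tier
    if claim.ui_only_evidence:                            # layer cap
        tier = min(tier, Tier.TIER_2_ENGINEERING)
    if claim.sdk_dominated and not claim.custom_protocol:  # SDK-glue cap
        tier = min(tier, Tier.TIER_2_ENGINEERING)
    if claim.admits_textbook:                             # Rule-9 cap (assumed ceiling)
        tier = min(tier, Tier.TIER_2_ENGINEERING)
    if claim.feature_type not in {"COMPLEX", "CUSTOM"}:   # universal tier gate
        tier = Tier.TIER_1_INTEGRATION
    return replace(claim, tier=Tier(tier))


print(apply_caps(Claim(Tier.TIER_3_INVENTION, "CUSTOM", ui_only_evidence=True)).tier.name)
```

Because every rule only lowers the tier, the order of the caps does not change the result; the gate runs last here simply because it is the strictest.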

The LLM also extracts "Why X over Y" tradeoffs for each claim: explicit engineering decisions visible in code or comments, with file:line citations. These are the qualitative signals a recruiter can act on directly.

Every evidence span is verified against the actual repo content (line-bounds + non-blank check) before display. Hallucinated spans are stripped and claim confidence is capped when all spans on a claim fail verification.
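One way to picture the span check is the sketch below, which assumes repo content is available as a mapping from path to file text. The `Span` type, the function names, the reading of "non-blank" as "at least one non-blank line in the range", and the confidence ceiling of 0.3 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    path: str
    start: int  # 1-indexed, inclusive
    end: int    # 1-indexed, inclusive


def span_is_valid(span: Span, files: dict) -> bool:
    """Line-bounds check plus non-blank check against real file content."""
    text = files.get(span.path)
    if text is None:
        return False
    lines = text.splitlines()
    if not (1 <= span.start <= span.end <= len(lines)):
        return False
    # Assumption: a span passes if at least one cited line is non-blank.
    return any(line.strip() for line in lines[span.start - 1 : span.end])


def verify_claim(spans: list, confidence: float, files: dict, cap: float = 0.3):
    """Strip spans that fail verification; cap confidence if none survive."""
    kept = [s for s in spans if span_is_valid(s, files)]
    if not kept:
        confidence = min(confidence, cap)  # hypothetical ceiling
    return kept, confidence


files = {"a.py": "x = 1\n\ny = 2\n"}
print(verify_claim([Span("a.py", 1, 1), Span("a.py", 5, 6)], 0.9, files))
```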

Limitations & what we're honest about

Single-rater calibration. Hand-labels are currently the founder's judgment. Multi-rater validation (3-5 senior engineers, per-discipline rubric) is on the roadmap but not done. Until then, scores reflect one engineer's opinion about what each tier should look like, not a peer-reviewed inter-rater agreement claim.
Corpus size + discipline skew. The hand-labeled set shown below is 20 repos. A larger 58-repo v5.4 calibration corpus exists at the algo layer but is not yet hand-labeled per-row. Both lean backend / library / fullstack; frontend, mobile, and ML are under-represented. We spot-check with named frontend repos (Tldraw, react-window, floating-ui), but a stratified hand-labeled corpus is the next major step.
No predictive-validity study. We have not yet correlated V4 scores with downstream hiring outcomes (interview pass-rate, offer-accept, tenure). The scoring rubric is an audit of engineering depth, not a hiring predictor.
AI-generated code detection is limited. Public detectors break down on real-world commits, with false-positive rates (FPR) around 10-20% and recall of 50-70% per the 2025 literature (Droid EMNLP, CodeMirage NeurIPS). We surface transparency signals (Co-authored-by trailers, commit-cadence patterns) rather than issue a detection verdict.
Not an automated employment decision tool. Oddit Hire is designed to surface evidence (file:line citations, claim tiers, work patterns) to inform interviews. All hiring decisions remain with the recruiter. The product is positioned as a verified developer portfolio, not a hire/no-hire automation.
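To make the "transparency signals" item above concrete, here is a small sketch that pulls Co-authored-by trailers out of a commit message. It is a simplification under stated assumptions: it scans every line rather than only the trailing trailer block, and `co_authors` is a hypothetical helper, not a description of how Oddit Hire parses commits.

```python
import re

# Matches lines such as "Co-authored-by: Name <email>" (case-insensitive).
TRAILER = re.compile(
    r"^co-authored-by:[ \t]*(?P<name>[^<>\n]+?)[ \t]*<(?P<email>[^<>\s]+)>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def co_authors(commit_message: str) -> list:
    """Return (name, email) pairs for every Co-authored-by trailer found."""
    return [(m["name"], m["email"]) for m in TRAILER.finditer(commit_message)]


message = (
    "Fix retry backoff\n"
    "\n"
    "Adjust jitter so retries do not synchronize.\n"
    "\n"
    "Co-authored-by: Example Bot <bot@example.com>\n"
)
print(co_authors(message))
```

The output is a list of declared co-authors and nothing more; it reports what the commit itself states and makes no judgment about whether the code was machine-generated.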

Calibration tiers

Wrapper

Expected: 40-55

Simple utilities, single-purpose libraries. No engineering depth.

Mid-Glue

Expected: 65-78

Competent SDK orchestration. Real engineering, not novel algorithms.

Senior Infra

Expected: 75-88

Production-grade systems with real operational maturity.

Deep Tech

Expected: 85-95

Novel algorithms, distributed primitives, compiled-language depth.

Edge Case

Expected: varies

Algo robustness checks. Repos where naive scoring would fail.
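Checking an audited score against a tier's expected range can be sketched as follows, using the drift categories from the summary at the top of this page (in expected range, soft drift within 5 points, hard drift over 5 points). The function name, the label strings, and the handling of tiers without a fixed range are assumptions for illustration; the scores in the example are made up.

```python
from typing import Optional, Tuple


def classify_drift(
    score: float,
    expected: Optional[Tuple[float, float]],
    soft_margin: float = 5.0,
) -> str:
    """Classify a score relative to an expected (low, high) range."""
    if expected is None:
        # Edge Case repos have no fixed expected range ("varies").
        return "no fixed range"
    low, high = expected
    if low <= score <= high:
        return "in expected range"
    # Distance to the nearest bound of the range.
    distance = low - score if score < low else score - high
    return "soft drift" if distance <= soft_margin else "hard drift"


print(classify_drift(70, (65, 78)))  # hypothetical Mid-Glue score
print(classify_drift(58, (40, 55)))  # hypothetical Wrapper score, 3 over
print(classify_drift(99, (75, 88)))  # hypothetical Senior Infra score, 11 over
```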

Calibration set

Wrapper (3 repos)

Mid-Glue (2 repos)

Senior Infra (10 repos)

Deep Tech (3 repos)

Edge Case (2 repos)

Known limitations

Two repos in the calibration set drift slightly above their expected range (within ±5-10 points). Both stem from the same algo behavior: the LLM occasionally dresses up procedural orchestration with state-machine / transformation-engine framing. The post-LLM rule-9 enforcement catches this in most cases but missed these specific phrasings. They are documented here for transparency; fixing them is on the iteration backlog.

asottile/pyupgrade · algo 92 · expected 75-88 · +10 over

Algo flagged "Source-to-Source Transformation Engine" as TIER_3_INVENTION. Reading _main.py:141-171 reveals that the function delegates to tokenize-rt's parse_format / unparse_parsed_string: procedural wrapping of existing infrastructure, not novel algorithmic work. Should be TIER_2_ENGINEERING.

jd/tenacity · algo 90 · expected 75-88 · +10 over

Algo flagged "action-based state machine for retry iteration" as TIER_3_INVENTION. Reading __init__.py:405-427, the actual code is: for action in self.iter_state.actions: action(retry_state). That is a callback queue, not a state machine. Should be TIER_2_ENGINEERING.

Disagree with a score?

Hand-labels are subjective. If you think a repo's expected range is wrong, the algo behaves badly on a class of repo we haven't tested, or you spotted a methodology issue, email us. We re-run the corpus on every algorithm change and update this page with the drift.

Email methodology@weoddit.com