Your LLM-as-judge eval set is too small. Here is the math.
Sample-size calculations for judge calibration against humans, with code and confidence intervals. https://medium.com/@maya.andersson/your-llm-as-judge-eval-set-is-too-small-here-is-the-math-da21cdde24f1
Method summary:
Cohen's kappa with bootstrap confidence intervals
Sample-size lookup for target CI width (Monte Carlo, not closed-form)
McNemar's test for paired judge comparison
Three production judges calibrated with N=200; one drift detection, one stable, one too low
Limitations and open questions on benchmark validity
How many human-labeled examples do you need to calibrate an LLM-as-judge against humans on your task? The default answer most teams use is "enough," which usually means whatever they had time to label. That answer is wrong in a specific, mathematically tractable way.
The short version: if your judge has Cohen's kappa around 0.6 against humans and you want a 95% confidence interval no wider than 0.10, you need approximately 200 paired labels. If your judge has kappa around 0.4, you need approximately 400. Most production teams I have read about are using 50, which gives a CI width of 0.20 or wider at the same kappa range. That CI width is too wide to detect drift, and detecting drift is the operational purpose of calibration.
This piece walks through the math, validates it against three production judges I have calibrated in the last 18 months, and closes with the limitations I am still working through.
Method
Cohen's kappa (Cohen 1960) measures inter-rater agreement adjusted for chance. For a five-class scoring problem it ranges from negative (worse than chance) to 1 (perfect). The classical interpretation thresholds (Landis & Koch 1977) treat 0.41 to 0.60 as "moderate" and 0.61 to 0.80 as "substantial" (the band this piece calls "good"). Most published LLM-as-judge papers report a single kappa value without a confidence interval, which leaves open how trustworthy that single number is.
The standard error of an estimated kappa, and with it the CI width, shrinks with sample size, but only as 1/sqrt(N) (the variance itself falls as 1/N). For a fixed true kappa, doubling N narrows the CI by a factor of roughly sqrt(2). To halve the CI width, you need 4x the data. This is the practical observation behind sample-size sensitivity.
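As a quick worked version of that scaling, here is a minimal sketch. It assumes the CI width is exactly proportional to 1/sqrt(N), which is only an approximation for kappa, and the function name `n_for_target_width` is hypothetical, introduced for illustration.

```python
import math


def n_for_target_width(n_current: int, width_current: float, width_target: float) -> int:
    """Scale N to reach a target CI width.

    Assumption: CI width is proportional to 1 / sqrt(N), so N scales with
    (width_current / width_target) squared.
    """
    if width_current <= 0 or width_target <= 0:
        raise ValueError("CI widths must be positive")
    return math.ceil(n_current * (width_current / width_target) ** 2)


# Halving the CI width from 0.10 to 0.05 requires 4x the labels.
print(n_for_target_width(200, 0.10, 0.05))  # 800
```

The rule is best used to extrapolate from a width you have actually measured, not as a substitute for estimating the CI on your own data.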
Concretely, here is a bootstrap-CI calculation:
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def kappa_with_bootstrap_ci(judge_scores, human_scores, n_resamples=2000, ci=0.95):
    """Returns (point_estimate, (low, high)) bootstrap CI."""
    paired = list(zip(judge_scores, human_scores))
    n = len(paired)
    point_estimate = cohen_kappa_score(judge_scores, human_scores)
    resampled_kappas = []
    rng = np.random.default_rng(42)
    for _ in range(n_resamples):
        # Resample (judge, human) pairs with replacement so the pairing is preserved.
        idx = rng.integers(0, n, size=n)
        bs_pairs = [paired[i] for i in idx]
        bs_judge = [p[0] for p in bs_pairs]
        bs_human = [p[1] for p in bs_pairs]
        resampled_kappas.append(cohen_kappa_score(bs_judge, bs_human))
    # Percentile interval over the bootstrap distribution.
    alpha = 1 - ci
    low = np.percentile(resampled_kappas, 100 * alpha / 2)
    high = np.percentile(resampled_kappas, 100 * (1 - alpha / 2))
    return point_estimate, (low, high)
```

For very small N (less than 50) the bootstrap is unreliable; in that range I prefer the closed-form Fleiss (1981) variance approximation, accepting that the closed form is also fragile when class prevalence is imbalanced.
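For comparison, a closed-form interval can be sketched as follows. Note the swap: this uses Cohen's (1960) simple large-sample approximation, SE = sqrt(p_o (1 - p_o) / N) / (1 - p_e), and not the fuller Fleiss variance mentioned above, which carries additional terms. The function name `kappa_with_normal_ci` and the normal-approximation interval are assumptions made for illustration.

```python
import math
from collections import Counter


def kappa_with_normal_ci(judge_scores, human_scores, z=1.96):
    """Kappa with a normal-approximation CI.

    Assumption: Cohen's (1960) simple large-sample standard error,
    not the Fleiss variance formula.
    """
    n = len(judge_scores)
    if n == 0 or n != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    # Observed agreement.
    p_o = sum(j == h for j, h in zip(judge_scores, human_scores)) / n
    # Chance agreement from the two raters' marginal distributions.
    judge_counts = Counter(judge_scores)
    human_counts = Counter(human_scores)
    p_e = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / n**2
    if p_e == 1:
        raise ValueError("kappa is undefined when both raters use a single class")
    kappa = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o) / n) / (1 - p_e)
    return kappa, (kappa - z * se, kappa + z * se)
```

The interval is symmetric and can extend past 1 for strong judges, which is one reason to treat it as a rough cross-check and not a replacement for the bootstrap.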
The bounded sample size problem
The CI width is the quantity that determines whether a kappa estimate is operationally useful. A point estimate of 0.65 with CI [0.45, 0.85] gives almost no information; the true kappa could be in either the "moderate" or "good" range. A point estimate of 0.65 with CI [0.60, 0.70] tells you the judge is reliably "good."
For production LLM-as-judge calibration, the typical use case is drift detection: did the judge kappa drop from last week? To answer that question, you need CIs tight enough that drift is distinguishable from sampling noise. Roughly: CI width below 0.10 lets you reliably detect 0.10-point drops; CI width 0.20 does not.
How large does N need to be to get CI width 0.10? Approximately:
| True kappa | N for CI width 0.10 | N for CI width 0.20 |
| --- | --- | --- |
| 0.3 | approximately 450 | approximately 115 |
| 0.5 | approximately 250 | approximately 65 |
| 0.7 | approximately 150 | approximately 40 |
| 0.9 | approximately 50 | approximately 15 |

These are approximations from a Monte Carlo simulation, not closed-form derivations. The exact formula for kappa variance involves prevalence and bias terms; I use simulation because the closed form is fragile when the rating distribution is imbalanced.
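The general shape of such a simulation can be sketched as below. This is not the simulation behind the table: the noise model (balanced five-class human labels, a judge that agrees with a fixed probability and otherwise picks a different class uniformly) and the names `cohen_kappa` and `simulate_ci_width` are assumptions made for illustration. The widths it produces depend heavily on those choices, so they will not necessarily match the table.

```python
import random


def cohen_kappa(rater_a, rater_b, n_classes):
    """Unweighted Cohen's kappa for integer labels in range(n_classes)."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = [0] * n_classes
    counts_b = [0] * n_classes
    for a, b in zip(rater_a, rater_b):
        counts_a[a] += 1
        counts_b[b] += 1
    p_e = sum(ca * cb for ca, cb in zip(counts_a, counts_b)) / (n * n)
    if p_e == 1:
        return 0.0  # degenerate sample; kappa undefined, treated as 0 here (assumption)
    return (p_o - p_e) / (1 - p_e)


def simulate_ci_width(true_kappa, n, n_classes=5, n_sims=400, seed=0):
    """Width of the central 95% range of kappa estimates over simulated datasets."""
    rng = random.Random(seed)
    # With uniform marginals, chance agreement is 1 / n_classes, so this
    # agreement probability yields the requested true kappa.
    p_agree = true_kappa * (1 - 1 / n_classes) + 1 / n_classes
    kappas = []
    for _ in range(n_sims):
        human, judge = [], []
        for _ in range(n):
            h = rng.randrange(n_classes)
            if rng.random() < p_agree:
                j = h
            else:
                # Disagree: shift to one of the other classes, chosen uniformly.
                j = (h + rng.randrange(1, n_classes)) % n_classes
            human.append(h)
            judge.append(j)
        kappas.append(cohen_kappa(human, judge, n_classes))
    kappas.sort()
    low = kappas[int(0.025 * (n_sims - 1))]
    high = kappas[int(0.975 * (n_sims - 1))]
    return high - low


for n in (50, 200, 800):
    print(n, round(simulate_ci_width(0.6, n), 3))
```

Swapping in your own class prevalence and error pattern is the main reason to simulate instead of relying on a fixed lookup.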
Most production teams I have audited use N=50. That is tight only if the judge is already very strong (kappa around 0.9). If you do not know the kappa, you need more.
What N to actually use
A practical heuristic:
```python
import math


def recommend_n(target_kappa: float, target_ci_width: float = 0.1) -> int:
    """Returns approximate N for target CI width at given kappa.

    Lookup from Monte Carlo simulation; not a closed form.
    """
    # N needed for CI width 0.10 in each kappa band, from the table above.
    if target_kappa >= 0.85:
        n_at_width_010 = 50
    elif target_kappa >= 0.65:
        n_at_width_010 = 150
    elif target_kappa >= 0.45:
        n_at_width_010 = 250
    else:
        n_at_width_010 = 450
    # CI width scales roughly as 1/sqrt(N), so N scales as 1/width**2.
    return math.ceil(n_at_width_010 * (0.1 / target_ci_width) ** 2)
```

The chicken-and-egg problem is that you do not know your judge's kappa until you have calibrated it. The practical answer: start with N=200 for the initial calibration. If the observed kappa is below 0.65, label more (use the table above) and re-estimate.
Three production judges, three decisions
I have run this calibration on three production LLM-as-judges in the last 18 months. The numbers, anonymized:
Judge A (refund agent factual accuracy). Initial N=200. Observed kappa 0.61 [CI 0.54, 0.68] on a held-out sample. After 3 weeks in production, kappa on a fresh 200-example sample dropped to 0.39 [CI 0.30, 0.48]. The cause was distribution shift in the input traces, which contained more financial jargon than the calibration set. The drop was detectable because both CIs were tight; if I had used N=50, the CIs would have overlapped around 0.45 and I could not have flagged the drift.
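The check implied here, whether two intervals overlap, is simple enough to sketch. The helper name `intervals_overlap` is hypothetical, and the numbers are Judge A's two intervals from above. Non-overlap of two 95% CIs is a conservative criterion: it is stricter than a direct test on the difference, so it can miss smaller real drops.

```python
def intervals_overlap(ci_a, ci_b):
    """True if two (low, high) intervals share at least one point."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    return low_a <= high_b and low_b <= high_a


baseline_ci = (0.54, 0.68)  # Judge A, initial calibration
fresh_ci = (0.30, 0.48)     # Judge A, after 3 weeks in production

# Flag drift only when the intervals are fully separated.
drift_flagged = not intervals_overlap(baseline_ci, fresh_ci)
print(drift_flagged)  # True
```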
Judge B (customer-support tone scoring). Initial N=200, observed kappa 0.72 [CI 0.67, 0.78]. Stable across two months of production traces.
Judge C (code-review quality scoring). Initial N=200, observed kappa 0.31 [CI 0.22, 0.40]. Too low to use. The judge was effectively at chance for the harder cases. We reverted to human-only review for code-quality scoring and revisited the rubric.
In all three cases, N=200 was sufficient to make a clear decision (drift detected, stable, too low). If I had used N=50, two of the three decisions would have been ambiguous, and I would have either acted on a noisy signal or failed to act on a real one.
Limitations
This analysis has several limitations worth flagging.
First, kappa is a single-criterion metric. Production judges often score multiple criteria (correctness, tone, completeness). The right approach is per-criterion kappa with separate CIs. Aggregating across criteria into a single kappa hides per-criterion failure modes.
Second, the prevalence of each score class affects kappa variance. If your judge mostly outputs 4 or 5 with rare 1s and 2s, a uniform random sample contains only a handful of rare-class examples, and those few examples have an outsized effect on the kappa estimate. Stratified sampling helps; my Monte Carlo simulations assume balanced classes, which is not realistic for most production agents.
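One way stratified sampling could look in practice is sketched below, assuming you stratify on the judge's own score because human labels do not exist yet. The function name `stratified_sample` and its parameters are hypothetical. One caveat that matters: a stratified sample changes the class mix, so kappa computed on it is not the production kappa unless the strata are reweighted back to their true prevalence.

```python
import random
from collections import defaultdict


def stratified_sample(items, judge_scores, per_class, seed=0):
    """Pick up to `per_class` items from each judge-score class for human labeling."""
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for item, score in zip(items, judge_scores):
        by_score[score].append(item)
    sample = {}
    for score, members in sorted(by_score.items()):
        # Rare classes contribute everything they have; common ones are capped.
        k = min(per_class, len(members))
        sample[score] = rng.sample(members, k)
    return sample


# Illustrative skewed distribution: mostly 5s, a few 4s, two 1s.
scores = [5] * 90 + [4] * 8 + [1] * 2
picked = stratified_sample(list(range(100)), scores, per_class=5)
print({score: len(chosen) for score, chosen in picked.items()})
```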
Third, the bootstrap CI is approximate. For small N (less than 50) the bootstrap is unreliable. For very small N, use Fleiss's closed form, or accept that you do not have enough data to estimate kappa precisely.
Fourth, and most important, this is about agreement, not validity. A judge can have high kappa with humans who are themselves wrong. Independent ground truth (when available) matters more than agreement. Sara Hooker has written persuasively on this point in her work on benchmark validity, and her conclusion (that most benchmarks measure less than they claim) probably extends to LLM-as-judge calibration as well.
Open questions
A few directions I have not pinned down.
The relationship between calibration set size and drift-detection sensitivity for production traces. My working hypothesis is that detection sensitivity tracks 1 over sqrt(N), but I have not derived this formally and the implicit assumptions about temporal autocorrelation are nontrivial.
The right cadence for re-labeling. Weekly works for me in practice. Daily is probably overkill. Monthly is too slow if the underlying model is updated frequently. There is likely a closed-form relationship between re-labeling cadence and model-update cadence that I have not seen written down. If anyone has a citation, I would appreciate it.
Cross-judge agreement as a partial substitute for human labels. If three independent LLM judges agree on a case, the marginal value of a human label is lower. The published literature is thin here; the Farquhar et al. 2024 paper on semantic entropy is the closest related work I know of, but that paper is about hallucination detection, not judge calibration. Zheng et al.'s LMSYS LLM-as-a-judge paper hints at this direction but does not run the experiment systematically.
The implication for benchmark validity is the one I find most concerning. Most published LLM-as-judge benchmarks report kappa point estimates with sample sizes below what is required to detect 0.05 to 0.10-point differences. The published kappa rankings between judges may be within sampling noise. The literature on this is not yet settled, and the publication incentives favor reporting a single sharp number over a hedged CI. I do not know what to do about this beyond hoping the next round of benchmark papers report CIs by default.
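For a paired comparison of two judges on the same human-labeled items, McNemar's test is one option. The sketch below is a minimal exact (binomial) version. It compares exact-match accuracy against the human label, not kappa, and the function names `discordant_counts` and `mcnemar_exact` are hypothetical.

```python
from math import comb


def discordant_counts(judge1, judge2, human):
    """Count items where exactly one judge matches the human label."""
    only_1 = sum(1 for a, b, h in zip(judge1, judge2, human) if a == h and b != h)
    only_2 = sum(1 for a, b, h in zip(judge1, judge2, human) if a != h and b == h)
    return only_1, only_2


def mcnemar_exact(only_1: int, only_2: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_1 + only_2
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_1, only_2)
    # Under the null, each discordant pair favors either judge with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


print(mcnemar_exact(0, 10))   # 0.001953125
print(mcnemar_exact(10, 10))  # 1.0
```

Only the discordant pairs carry information, so two judges that look different on aggregate agreement can still be statistically indistinguishable when few items separate them.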




