01 Boolean dominates
Boolean questions account for the largest share (35 of 100), meaning deterministic exact-match scoring applies to over a third of the dataset.
02 Rare types need attention
Date and names questions are rare (1 and 3 of 100, respectively) but carry edge-case formatting risks: list vs. comma-separated values, date format, and so on.
03 Free text requires an LLM judge
Free-text answers (30 questions, 30%) are evaluated by an LLM judge for correctness, completeness, grounding, and clarity, making this the most nuanced scoring category.
04 Deterministic majority
The remaining 70 questions (70%) use deterministic fact-checking: boolean, number, name, names, and date types are scored by exact match.
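A minimal sketch of how deterministic exact-match scoring could be implemented for the structured types. The normalization rules here (yes/no mapping, trailing-zero trimming, order-insensitive name lists) are assumptions, not the official rubric, and `normalize`/`exact_match` are hypothetical helper names.

```python
# Hypothetical exact-match scorer for the deterministic answer types
# (boolean, number, name, names, date). Normalization is an assumption.
def normalize(answer, qtype):
    """Canonicalize an answer string before exact-match comparison."""
    text = str(answer).strip().lower()
    if qtype == "boolean":
        # Assumed mapping: accept yes/no as aliases for true/false.
        return {"yes": "true", "no": "false"}.get(text, text)
    if qtype == "number":
        # Assumed: trim insignificant trailing zeros ("3.50" -> "3.5").
        return text.rstrip("0").rstrip(".") if "." in text else text
    if qtype == "names":
        # Assumed: comma-separated lists compared as unordered sets.
        return tuple(sorted(part.strip() for part in text.split(",")))
    return text

def exact_match(predicted, reference, qtype):
    return normalize(predicted, qtype) == normalize(reference, qtype)
```

The `names` branch illustrates the list vs. comma-separated risk flagged above: without an order-insensitive comparison, a correct answer in a different order would score zero.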
Competition Intel — #tech-discussion
Submission Format
Script generates a JSON file submitted via platform API.
Per question: answer, TTFT, time/output-token, total time, chunk pages, input/output tokens, model name.
Starter kit with baseline + submission script expected by Mar 9.
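A sketch of what one per-question record in the submitted JSON might look like. The exact schema will come from the starter kit's submission script, so every field name below is a placeholder; only the set of required metrics is from the channel.

```python
import json

# Placeholder field names; the starter kit will define the real schema.
record = {
    "question_id": "q001",            # hypothetical identifier
    "answer": "true",
    "ttft_seconds": 0.8,              # time to first token
    "time_per_output_token": 0.02,
    "total_time_seconds": 3.4,
    "chunk_pages": ["doc123_4", "doc123_7"],  # {pdf_id}_{page} refs
    "input_tokens": 2048,
    "output_tokens": 64,
    "model_name": "my-model",
}
payload = json.dumps(record, indent=2)
```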
Chunk ID Format
Page-level references: {pdf_id}_{page}.
Include only pages actually used, not all retrieved.
0-based vs 1-based paging TBD (confirmation pending).
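The page-reference rules above can be sketched as a small helper; `used_chunk_ids` is a hypothetical name, and the deduplication reflects the "only pages actually used" requirement.

```python
# Page references use the {pdf_id}_{page} convention. Whether pages are
# 0- or 1-based is still unconfirmed, so pass page numbers through
# unchanged until the organizers confirm the convention.
def used_chunk_ids(pdf_id: str, pages_used) -> list:
    """Format only the pages actually used in the answer, deduplicated."""
    return [f"{pdf_id}_{page}" for page in sorted(set(pages_used))]
```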
Resources & Hosting
Self-hosted; no resource limits.
Only publicly accessible APIs or open-source components allowed.
Rule of thumb: if you can publish on GitHub without violating licenses, it’s acceptable.
Preprocessing
Documents delivered as ZIP of PDFs.
Preprocessing time is not evaluated.
Only a 48-hour window follows the private test set release, so preprocessing must be fully automated.
TTFT Scoring
Final score × coefficient: 0.85 (avg TTFT > 5s) to 1.05 (< 1s).
TTFT = first token of the final answer, not intermediate steps.
Parallel model calls OK; all pipeline time counts toward TTFT.
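Only the two endpoints of the coefficient were stated (0.85 above 5 s, 1.05 below 1 s); the behavior between them was not. The sketch below assumes a linear ramp between those endpoints, which is purely a guess.

```python
# TTFT coefficient applied to the final score. Endpoints are from the
# channel; the linear interpolation between 1 s and 5 s is an assumption.
def ttft_coefficient(avg_ttft_s: float) -> float:
    if avg_ttft_s < 1.0:
        return 1.05
    if avg_ttft_s > 5.0:
        return 0.85
    # Assumed linear ramp: 1.05 at 1 s down to 0.85 at 5 s.
    return 1.05 - 0.2 * (avg_ttft_s - 1.0) / 4.0
```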
Grounding Evaluation
Weighted F-score (β = 2.5): recall is weighted β² = 6.25× relative to precision.
Golden source set per question (specific pages in specific docs).
Missing evidence penalized much harder than extra sources.
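The standard Fβ formula makes the recall bias concrete. Treating grounding as set overlap between predicted and golden pages is an assumption about how sources are matched; the formula itself is the standard weighted F-score.

```python
# Standard weighted F-score: with beta = 2.5, recall carries
# beta^2 = 6.25x the weight of precision.
def f_beta(precision: float, recall: float, beta: float = 2.5) -> float:
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def grounding_score(predicted_pages, golden_pages, beta=2.5):
    """Assumed set-overlap scoring of predicted vs. golden page refs."""
    pred, gold = set(predicted_pages), set(golden_pages)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return f_beta(precision, recall, beta)
```

With β = 2.5, citing an extra page costs far less than omitting a golden one, which is exactly the "missing evidence penalized much harder" behavior described above.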
Answer Evaluation
Free-text: LLM-as-judge compares to reference answers.