AGENTIC
RAG LEGAL
CHALLENGE
Public Questions Analysis
Trial dataset — 100 questions across 87 documents
Questions
100
Documents
87
Answer Types
6
Duration
11–25 March
Boolean dominates
Boolean questions account for the largest share (35 of 100), meaning deterministic exact-match scoring applies to over a third of the dataset.
01
Rare types need attention
Date and names questions are extremely rare (1 and 3 questions respectively) but carry edge-case formatting risk: list versus comma-separated output, date format, and similar.
02
Free text requires LLM judge
Free-text answers (30 questions, 30%) are evaluated by an LLM judge for correctness, completeness, grounding, and clarity — the most nuanced scoring category.
03
Deterministic majority
The remaining 70 questions (70%) use deterministic fact-checking: boolean, number, name, names, and date types scored by exact match.
04

Competition Intel — #tech-discussion

Submission Format
  • Script generates a JSON file submitted via platform API.
  • Per question: answer, TTFT, time/output-token, total time, chunk pages, input/output tokens, model name.
  • Starter kit with baseline + submission script expected by Mar 9.
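Pending the starter kit, a per-question submission entry might look like the sketch below. The field names are assumptions derived from the bullet list above; the authoritative schema ships with the baseline script.

```python
import json

# NOTE: field names are hypothetical -- the exact JSON schema
# arrives with the starter kit (expected by Mar 9).
def make_record(question_id, answer, ttft_s, time_per_output_token_s,
                total_time_s, chunk_pages, input_tokens, output_tokens,
                model_name):
    """Assemble one per-question entry for the submission file."""
    return {
        "question_id": question_id,
        "answer": answer,
        "ttft": ttft_s,
        "time_per_output_token": time_per_output_token_s,
        "total_time": total_time_s,
        "chunk_pages": chunk_pages,   # page-level refs, e.g. ["doc42_3"]
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "model": model_name,
    }

record = make_record("q001", "Yes", 0.8, 0.03, 4.2,
                     ["doc42_3"], 1500, 120, "gpt-4o-mini")
print(json.dumps(record))
```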
Chunk ID Format
  • Page-level references: {pdf_id}_{page}.
  • Include only pages actually used, not all retrieved.
  • 0-based vs 1-based paging TBD (confirmation pending).
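A minimal helper for the `{pdf_id}_{page}` reference format, leaving the unresolved 0- vs 1-based question visible rather than silently picking a convention:

```python
def chunk_id(pdf_id: str, page: int) -> str:
    """Page-level reference as {pdf_id}_{page}.
    NOTE: 0- vs 1-based page numbering is still TBD, so `page`
    is passed through as-is; adjust once confirmed."""
    return f"{pdf_id}_{page}"

# Include only pages actually used in the answer, not everything retrieved:
used_pages = [("contract_07", 3), ("contract_07", 5)]
refs = [chunk_id(doc, p) for doc, p in used_pages]
print(refs)  # ['contract_07_3', 'contract_07_5']
```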
Resources & Hosting
  • Self-hosted; no resource limits.
  • Only publicly accessible APIs or open-source components allowed.
  • Rule of thumb: if you can publish on GitHub without violating licenses, it’s acceptable.
Preprocessing
  • Documents delivered as ZIP of PDFs.
  • Preprocessing time is not evaluated.
  • 48-hour window after private test set release — must be automated.
TTFT Scoring
  • Final score × coefficient: 0.85 (avg TTFT > 5s) to 1.05 (< 1s).
  • TTFT = first token of the final answer, not intermediate steps.
  • Parallel model calls OK; all pipeline time counts toward TTFT.
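Only the endpoints of the TTFT coefficient are stated (1.05 below 1 s, 0.85 above 5 s); the sketch below assumes linear interpolation in between, which is a guess, not a confirmed rule.

```python
def ttft_coefficient(avg_ttft_s: float) -> float:
    """Map average TTFT to the final-score multiplier.
    Endpoints per the rules: 1.05 for < 1s, 0.85 for > 5s.
    ASSUMPTION: linear interpolation between 1s and 5s."""
    if avg_ttft_s < 1.0:
        return 1.05
    if avg_ttft_s > 5.0:
        return 0.85
    return 1.05 - 0.20 * (avg_ttft_s - 1.0) / 4.0

print(ttft_coefficient(0.5))  # 1.05
print(ttft_coefficient(3.0))  # 0.95
print(ttft_coefficient(8.0))  # 0.85
```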
Grounding Evaluation
  • Weighted F-score (β = 2.5): recall weighted β² = 6.25× over precision.
  • Golden source set per question (specific pages in specific docs).
  • Missing evidence penalized much harder than extra sources.
Answer Evaluation
  • Free-text: LLM-as-judge compares to reference answers.
  • Deterministic types (boolean, number, name, date): exact match.
Unanswerable Questions
  • Dataset intentionally includes questions with no answer in corpus.
  • Deterministic types: null answer + empty refs.
  • Free-text: state info is unavailable (e.g. "There is no information on this question") + empty refs.
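The unanswerable-question rules above reduce to a small branch; the fallback sentence below is the example wording from the rules, and the type names mirror the dataset's answer types:

```python
def format_unanswerable(answer_type: str):
    """Return (answer, references) for a question whose answer
    is not present in the corpus."""
    if answer_type in {"boolean", "number", "name", "names", "date"}:
        return None, []  # deterministic types: null answer + empty refs
    # free text: explicitly state the information is unavailable
    return "There is no information on this question", []

print(format_unanswerable("boolean"))  # (None, [])
print(format_unanswerable("free_text"))
```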
Duplicate Documents
  • 87 files, only 67 unique — 20 duplicates (known issue).
  • Will be fixed before Mar 11 public dataset release.
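Until the fixed dataset lands, the duplicates can be collapsed locally. This sketch assumes the duplicates are byte-identical copies (hash on content); if they are re-saved variants, a text-level comparison would be needed instead.

```python
import hashlib
from pathlib import Path

def unique_documents(pdf_dir: str) -> dict:
    """Map SHA-256 content hash -> first file seen, collapsing
    byte-identical duplicate PDFs in the corpus directory."""
    seen = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        digest = hashlib.sha256(pdf.read_bytes()).hexdigest()
        seen.setdefault(digest, pdf)
    return seen
```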
Agents & Tools
  • No restrictions on multi-agent pipelines.
  • Copilot, Codex, Cursor, etc. allowed during development.

Grounding Score Formula

Source retrieval is evaluated using a weighted F-score with β = 2.5, which weights recall β² = 6.25 times as heavily as precision.
Precision & Recall
$$\text{Precision} = \frac{|\text{predicted} \cap \text{gold}|}{|\text{predicted}|}\qquad\text{Recall} = \frac{|\text{predicted} \cap \text{gold}|}{|\text{gold}|}$$
G-Score
$$G = F_{\beta=2.5} = \frac{(1 + \beta^{2}) \cdot \text{Precision} \cdot \text{Recall}}{\beta^{2} \cdot \text{Precision} + \text{Recall}}$$
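The two formulas combine into a few lines of Python. This is a sketch: the organizers' exact implementation may differ, e.g. in how empty prediction sets or all-miss predictions are scored.

```python
def g_score(predicted: set, gold: set, beta: float = 2.5) -> float:
    """Weighted F-score over page-level references.
    beta = 2.5 weights recall beta^2 = 6.25x over precision."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Missing a gold page hurts far more than citing an extra one:
gold = {"doc1_2", "doc1_3"}
print(round(g_score({"doc1_2", "doc1_3", "doc1_4"}, gold), 3))  # 0.935
print(round(g_score({"doc1_2"}, gold), 3))                      # 0.537
```

Note the asymmetry in the example: one extra page costs ~0.07, while one missing gold page costs ~0.46.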