AGENTIC
RAG LEGAL
CHALLENGE
Public Questions Analysis
Trial dataset — 100 questions across 87 documents
Questions
100
Documents
87
Answer Types
6
Duration
11–25 March
Boolean dominates
Boolean questions account for the largest share (35 of 100), meaning deterministic exact-match scoring applies to over a third of the dataset.
01
Rare types need attention
Date and names questions are extremely rare (1 and 3 questions respectively) but carry edge-case formatting risk: list versus comma-separated output, date format, and similar.
02
Free text requires LLM judge
Free-text answers (30 questions, 30%) are evaluated by an LLM judge for correctness, completeness, grounding, and clarity — the most nuanced scoring category.
03
Deterministic majority
The remaining 70 questions (70%) use deterministic fact-checking: boolean, number, name, names, and date types scored by exact match.
04

Competition Intel — #tech-discussion

Submission Format
  • Script generates a JSON file submitted via platform API.
  • Per question: answer, TTFT, time/output-token, total time, chunk pages, input/output tokens, model name.
  • Starter kit with baseline + submission script expected by Mar 9.
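Pending the starter kit, a per-question submission entry might look like the sketch below. The field names are assumptions derived from the bullet list above; the authoritative schema ships with the baseline script.

```python
import json

# NOTE: field names are hypothetical -- the exact JSON schema
# arrives with the starter kit (expected by Mar 9).
def make_record(question_id, answer, ttft_s, time_per_output_token_s,
                total_time_s, chunk_pages, input_tokens, output_tokens,
                model_name):
    """Assemble one per-question entry for the submission file."""
    return {
        "question_id": question_id,
        "answer": answer,
        "ttft": ttft_s,
        "time_per_output_token": time_per_output_token_s,
        "total_time": total_time_s,
        "chunk_pages": chunk_pages,   # page-level refs, e.g. ["doc42_3"]
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "model": model_name,
    }

record = make_record("q001", "Yes", 0.8, 0.03, 4.2,
                     ["doc42_3"], 1500, 120, "gpt-4o-mini")
print(json.dumps(record))
```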
Chunk ID Format
  • Page-level references: {pdf_id}_{page}.
  • Include only pages actually used, not all retrieved.
  • 0-based vs 1-based paging TBD (confirmation pending).
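A minimal helper for the `{pdf_id}_{page}` reference format, leaving the unresolved 0- vs 1-based question visible rather than silently picking a convention:

```python
def chunk_id(pdf_id: str, page: int) -> str:
    """Page-level reference as {pdf_id}_{page}.
    NOTE: 0- vs 1-based page numbering is still TBD, so `page`
    is passed through as-is; adjust once confirmed."""
    return f"{pdf_id}_{page}"

# Include only pages actually used in the answer, not everything retrieved:
used_pages = [("contract_07", 3), ("contract_07", 5)]
refs = [chunk_id(doc, p) for doc, p in used_pages]
print(refs)  # ['contract_07_3', 'contract_07_5']
```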
Resources & Hosting
  • Self-hosted; no resource limits.
  • Only publicly accessible APIs or open-source components allowed.
  • Rule of thumb: if you can publish on GitHub without violating licenses, it’s acceptable.
Preprocessing
  • Documents delivered as ZIP of PDFs.
  • Preprocessing time is not evaluated.
  • 48-hour window after private test set release — must be automated.
TTFT Scoring
  • Final score × coefficient: 0.85 (avg TTFT > 5s) to 1.05 (< 1s).
  • TTFT = first token of the final answer, not intermediate steps.
  • Parallel model calls OK; all pipeline time counts toward TTFT.
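Only the endpoints of the TTFT coefficient are stated (1.05 below 1 s, 0.85 above 5 s); the sketch below assumes linear interpolation in between, which is a guess, not a confirmed rule.

```python
def ttft_coefficient(avg_ttft_s: float) -> float:
    """Map average TTFT to the final-score multiplier.
    Endpoints per the rules: 1.05 for < 1s, 0.85 for > 5s.
    ASSUMPTION: linear interpolation between 1s and 5s."""
    if avg_ttft_s < 1.0:
        return 1.05
    if avg_ttft_s > 5.0:
        return 0.85
    return 1.05 - 0.20 * (avg_ttft_s - 1.0) / 4.0

print(ttft_coefficient(0.5))  # 1.05
print(ttft_coefficient(3.0))  # 0.95
print(ttft_coefficient(8.0))  # 0.85
```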
Grounding Evaluation
  • Weighted F-score (β = 2.5): recall weighted β² = 6.25× over precision.
  • Golden source set per question (specific pages in specific docs).
  • Missing evidence penalized much harder than extra sources.
Answer Evaluation
  • Free-text: LLM-as-judge compares to reference answers.
  • Deterministic types (boolean, number, name, date): exact match.
Unanswerable Questions
  • Dataset intentionally includes questions with no answer in corpus.
  • Deterministic types: null answer + empty refs.
  • Free-text: state info is unavailable (e.g. "There is no information on this question") + empty refs.
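The unanswerable-question rules above reduce to a small branch; the fallback sentence below is the example wording from the rules, and the type names mirror the dataset's answer types:

```python
def format_unanswerable(answer_type: str):
    """Return (answer, references) for a question whose answer
    is not present in the corpus."""
    if answer_type in {"boolean", "number", "name", "names", "date"}:
        return None, []  # deterministic types: null answer + empty refs
    # free text: explicitly state the information is unavailable
    return "There is no information on this question", []

print(format_unanswerable("boolean"))  # (None, [])
print(format_unanswerable("free_text"))
```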
Duplicate Documents
  • 87 files, only 67 unique — 20 duplicates (known issue).
  • Will be fixed before Mar 11 public dataset release.
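Until the fixed dataset lands, the duplicates can be collapsed locally. This sketch assumes the duplicates are byte-identical copies (hash on content); if they are re-saved variants, a text-level comparison would be needed instead.

```python
import hashlib
from pathlib import Path

def unique_documents(pdf_dir: str) -> dict:
    """Map SHA-256 content hash -> first file seen, collapsing
    byte-identical duplicate PDFs in the corpus directory."""
    seen = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        digest = hashlib.sha256(pdf.read_bytes()).hexdigest()
        seen.setdefault(digest, pdf)
    return seen
```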
Agents & Tools
  • No restrictions on multi-agent pipelines.
  • Copilot, Codex, Cursor, etc. allowed during development.

Grounding Score Formula

Source retrieval is evaluated using a weighted F-score with β = 2.5, which weights recall β² = 6.25 times as heavily as precision.
Precision & Recall
$$\text{Precision} = \frac{|\text{predicted} \cap \text{gold}|}{|\text{predicted}|}\qquad\text{Recall} = \frac{|\text{predicted} \cap \text{gold}|}{|\text{gold}|}$$
G-Score
$$G = F_{\beta=2.5} = \frac{(1 + \beta^{2}) \cdot \text{Precision} \cdot \text{Recall}}{\beta^{2} \cdot \text{Precision} + \text{Recall}}$$
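The two formulas combine into a few lines of Python. This is a sketch: the organizers' exact implementation may differ, e.g. in how empty prediction sets or all-miss predictions are scored.

```python
def g_score(predicted: set, gold: set, beta: float = 2.5) -> float:
    """Weighted F-score over page-level references.
    beta = 2.5 weights recall beta^2 = 6.25x over precision."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Missing a gold page hurts far more than citing an extra one:
gold = {"doc1_2", "doc1_3"}
print(round(g_score({"doc1_2", "doc1_3", "doc1_4"}, gold), 3))  # 0.935
print(round(g_score({"doc1_2"}, gold), 3))                      # 0.537
```

Note the asymmetry in the example: one extra page costs ~0.07, while one missing gold page costs ~0.46.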