Benchmark
FinModelBench
A structured evaluation framework for measuring AI financial modeling capability, currently in development. Designed by professionals who build models for a living.
The problem
The evaluation gap
Most benchmarks test whether an AI can answer questions about finance. They measure recall and surface-level reasoning. They do not measure whether an AI can construct a working financial model with linked schedules, circular references, and scenario toggles.
Current benchmarks
- Multiple choice finance questions
- Text-based reasoning tasks
- Single-formula calculations
- Static, one-shot evaluation
- No structural complexity
- Surface-level correctness only
FinModelBench
- Multi-tab Excel construction tasks
- Linked schedules and circular references
- Scenario toggles and sensitivity analysis
- Formula integrity verification
- Institutional-grade structural standards
- Expert-reviewed scoring rubrics
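The benchmark's actual verification tooling is not described here, so as a hedged illustration only: a formula-integrity check of the kind listed above might walk every formula in a model, extract cross-sheet references, and flag links that point at missing sheets or cells. The dict layout, sheet names, and cell contents below are hypothetical.

```python
import re

# Hypothetical simplified model representation: sheet -> cell -> formula or value.
MODEL = {
    "Inputs":  {"B2": 0.05},
    "Debt":    {"C3": 100.0, "C4": "=Inputs!B2*C3"},
    "Summary": {"D1": "=Debt!C4+Debt!C9"},  # Debt!C9 does not exist
}

# Matches cross-sheet references of the form Sheet!A1 (with optional $ anchors).
CROSS_REF = re.compile(r"([A-Za-z_]\w*)!(\$?[A-Z]+\$?\d+)")

def broken_links(model):
    """Return (formula cell, dangling reference) pairs for unresolved links."""
    broken = []
    for sheet, cells in model.items():
        for cell, contents in cells.items():
            if not (isinstance(contents, str) and contents.startswith("=")):
                continue  # a plain value, not a formula
            for ref_sheet, ref_cell in CROSS_REF.findall(contents):
                target_sheet = model.get(ref_sheet, {})
                if ref_cell.replace("$", "") not in target_sheet:
                    broken.append((f"{sheet}!{cell}", f"{ref_sheet}!{ref_cell}"))
    return broken
```

Running `broken_links(MODEL)` flags the dangling `Summary!D1 -> Debt!C9` link while leaving valid cross-tab references untouched.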
Methodology
How we evaluate
Every task is scored by professionals with transaction experience. Evaluation emphasizes technical integrity over surface-level plausibility. We measure what matters in a working model.
Technical criteria
- Formula correctness and internal consistency
- Schedule linking across tabs
- Circular reference handling and iteration logic
- Edge-case behavior under stress scenarios
- Output formatting and presentation standards
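To make the circularity criterion concrete: a classic modeling circular reference is interest expense computed on average debt, where the ending balance depends on cash left after interest. One standard resolution is fixed-point iteration. The single-period sketch below is a simplified illustration, not one of the benchmark's tasks.

```python
def solve_interest(begin_debt, cash_flow, rate, tol=1e-8, max_iter=100):
    """Iterate interest expense on average debt to convergence.

    Circularity: ending debt depends on cash available for paydown,
    which depends on interest, which depends on average (begin + end) debt.
    """
    interest = 0.0
    for _ in range(max_iter):
        cash_for_paydown = cash_flow - interest
        end_debt = max(begin_debt - cash_for_paydown, 0.0)
        new_interest = rate * (begin_debt + end_debt) / 2
        if abs(new_interest - interest) < tol:
            return new_interest, end_debt
        interest = new_interest
    raise RuntimeError("iteration did not converge")
```

For begin_debt = 100, cash_flow = 20, rate = 5%, the loop converges to the algebraic fixed point i = 4.5 / 0.975, mirroring Excel's iterative-calculation behavior on the same circularity.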
Methodological criteria
- Assumption realism and sourcing
- Structural soundness of model architecture
- Output reasonableness and sanity checks
- Error handling and defensive construction
- Prompt-to-model alignment verification
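Reasonableness and sanity checks of the kind listed above lend themselves to automation. A toy sketch, with field names and thresholds that are illustrative rather than the benchmark's actual schema:

```python
def sanity_checks(output: dict) -> list[str]:
    """Flag model outputs that fail basic reasonableness tests.

    Field names and thresholds are hypothetical examples.
    """
    issues = []
    # The balance sheet must balance to within rounding.
    gap = output["total_assets"] - (output["total_liabilities"] + output["equity"])
    if abs(gap) > 0.01:
        issues.append(f"balance sheet off by {gap:.2f}")
    # A margin outside [0, 1] almost always indicates a broken link.
    if not 0.0 <= output["gross_margin"] <= 1.0:
        issues.append("gross margin outside [0, 1]")
    # A non-positive share count is structurally impossible.
    if output["shares_outstanding"] <= 0:
        issues.append("non-positive share count")
    return issues
```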
Rubric design
- 30+ binary scoring items per task
- Negative criteria to catch systematic flaws
- Value tolerances with explicit thresholds
- Source citations required for key assumptions
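The rubric mechanics above can be sketched in a few lines: binary pass/fail items, negative criteria that deduct points when a flaw fires, and value checks with explicit tolerances. The item names, target values, and scoring formula below are hypothetical, not the published rubric.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    name: str
    check: Callable[[dict], bool]   # True = item satisfied (or flaw detected, if negative)
    negative: bool = False          # negative items deduct a point when they fire

def within_tol(value: float, target: float, tol: float) -> bool:
    """Binary value check with an explicit tolerance threshold."""
    return abs(value - target) <= tol

def score_task(output: dict, items: list[RubricItem]) -> float:
    """Fraction of positive items earned, less deductions, floored at zero."""
    positives = [i for i in items if not i.negative]
    earned = sum(1 for i in positives if i.check(output))
    deducted = sum(1 for i in items if i.negative and i.check(output))
    return max(earned - deducted, 0) / len(positives)

# Hypothetical items for an illustrative valuation task.
ITEMS = [
    RubricItem("EV/EBITDA within 0.25x of 8.0x",
               lambda o: within_tol(o["ev_ebitda"], 8.0, 0.25)),
    RubricItem("WACC assumption cites a source",
               lambda o: o.get("wacc_source") is not None),
    RubricItem("Hardcoded values inside formula ranges",
               lambda o: o["hardcoded_count"] > 0, negative=True),
]
```

A submission that hits the tolerance, cites its WACC source, and contains no hardcoded values scores 1.0; a submission that misses all positives and triggers the negative criterion is floored at 0.0.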
Task scope
- Valuation, transaction, and reporting categories
- Strategic finance and treasury/risk models
- Capital markets and project finance tasks
- Multi-variant prompts calibrated to different capability levels
Strategic value
Why this matters
Measure real capability
Better evaluation infrastructure enables better training. When you can reliably measure construction capability, you can reliably improve it.
Credible signal
Rigorous evaluation creates a verifiable signal in a category where capability claims are common and independent verification is rare.
Open infrastructure
FinModelBench is designed to be an open, shared resource. Transparent methodology. Reproducible results. Available to the research community.
Interested in early access or collaboration?
FinModelBench is currently in active development. We are looking for AI labs and research teams interested in early access or in collaborating on evaluation infrastructure for financial modeling.
business@model2.co