FLaME: Holistic Financial Language Model Evaluation

Georgia Institute of Technology

FLaME is the first comprehensive benchmarking suite for evaluating language models on financial NLP tasks.

FLaME Evaluation Framework
Figure 1: Functional Overview of FLaME. The evaluation framework includes a comprehensive taxonomy, carefully selected datasets, diverse models, and standardized metrics.

Task Taxonomy

FLaME Task Taxonomy

Task Taxonomy: Illustrative breakdown for each of the six core NLP task categories (Classification, Sentiment Analysis, Information Retrieval, Causal Analysis, Summarization, and Question Answering). Each category encompasses specialized variants depending on data format, user needs, and domain constraints.

Domain Taxonomy

FLaME Domain Taxonomy

Domain Taxonomy: Holistic taxonomy for FLaME. Unlike previous FinNLP benchmarks that were tied to specific tasks with single metrics, FLaME takes a comprehensive approach by mapping the full space of tasks, scenarios, and metrics across multiple dimensions for complete analysis.

Abstract

Language Models (LMs) have demonstrated impressive capabilities on core Natural Language Processing (NLP) tasks. Yet their effectiveness on highly specialized, knowledge-intensive tasks in finance remains difficult to assess because of major gaps in the methodologies of existing evaluation frameworks, which have led to the erroneous belief that LMs perform far worse on common Finance NLP (FinNLP) tasks than they actually do.

To address this issue, we present FLaME (Financial Language Model Evaluation), the first comprehensive benchmark suite for evaluating language models on financial NLP tasks. FLaME includes a standardized evaluation framework, carefully selected datasets across six core financial tasks, and extensive testing across both open and proprietary LMs. Our benchmark assesses models on text classification, sentiment analysis, information retrieval, causal analysis, summarization, and question answering within the financial domain.

Our results reveal significant variations in model performance across different financial tasks and domains, with no single model emerging as universally superior. We also find that performance does not always correlate with model size, and that open-weight models can offer strong cost-performance efficiency compared to their proprietary counterparts. FLaME provides a foundation for building, testing, and advancing specialized NLP models for finance.

  • 6 FinNLP task categories
  • 19 datasets
  • 24 LLMs evaluated

Key Findings

Our comprehensive evaluation reveals several important insights about language model performance on financial tasks:

No Universal Winner

No single language model performs best across all financial NLP tasks

Domain Variations

Performance varies significantly based on domain and task structure

Cost Efficiency

Open-weight and mid-scale models show strong cost/performance efficiency

Numeric Challenges

Numeric reasoning tasks (FNXL, ConvFinQA) remain challenging for all models

Size Doesn't Matter

Model size does not consistently correlate with performance on financial tasks

Top Performers

DeepSeek R1, OpenAI o1-mini, and Claude 3.5 Sonnet demonstrate strong capabilities

Methodology

Our methodological framework consists of four key components designed to provide comprehensive evaluation of language models on financial tasks.

  1. Taxonomy: define task categories
  2. Datasets: select and prepare data
  3. Models: benchmark LMs
  4. Evaluation: measure performance

Taxonomy

We developed a scenario-based taxonomy organizing financial NLP tasks along three dimensions (a brief code sketch follows this list):

  • Tasks: Six core tasks - text classification, sentiment analysis, information retrieval, causal analysis, text summarization, and question answering
  • Domains: Categorized by what (type of data), who (data source), where (origination), when (time period), how (generation method), and why (purpose)
  • Languages: Currently focusing on English with identified need for multilingual expansion
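
To make the taxonomy concrete, the sketch below encodes one possible representation of these three dimensions as plain Python data. The class, field names, and example entries are illustrative assumptions, not FLaME's actual schema.

```python
# Illustrative encoding of the scenario taxonomy (tasks x domains x languages).
# Class, field names, and example values are hypothetical, not FLaME's schema.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Scenario:
    task: str                     # one of the six core tasks
    dataset: str                  # dataset that instantiates the task
    domain: dict = field(default_factory=dict)  # what / who / where / when / how / why
    language: str = "en"          # currently English-only

scenarios = [
    Scenario(
        task="question_answering",
        dataset="FinQA",
        domain={"what": "earnings reports", "who": "financial analysts",
                "how": "expert-annotated", "why": "numerical reasoning"},
    ),
    Scenario(task="sentiment_analysis", dataset="Financial Phrase Bank"),
]

# Group scenarios by task to inspect coverage across the taxonomy.
by_task = defaultdict(list)
for s in scenarios:
    by_task[s.task].append(s.dataset)
print(dict(by_task))
```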

Datasets

We carefully selected datasets based on:

  • Domain relevance: Majority of content directly related to finance
  • Licensing: Fair usage licensing and proper attribution
  • Quality: Transparent sourcing with minimal risk of label corruption
  • Complexity: Exercises real financial knowledge, not trivial tasks

Models

We evaluated diverse language models across two categories:

Proprietary
  • GPT-4o & o1-mini
  • Gemini-1.5 Pro
  • Claude 3.5 Sonnet
  • Claude 3 Haiku
  • Cohere Command (R 7B, R+)
  • Jamba 1.5 Mini & Large
Open-weight
  • Llama-3 (8B, 70B)
  • DeepSeek V3 & R1
  • DBRX Instruct
  • Qwen-2 (72B) & QwQ-32B
  • Mistral (7B, 8x7B, 8x22B)
  • Gemma-2 (9B, 27B)
  • WizardLM-2 (8x22B)

Evaluation Process

Our two-stage evaluation approach includes:

  1. Generation: Language model generates responses to task-specific inputs
  2. Extraction: Separate process identifies relevant output using structured pattern matching

Pipeline stages (a code sketch of this flow follows the list):

  • Configuration of tasks, datasets, and parameters
  • Model interaction via local instantiation or API
  • Post-processing and structured output extraction
  • Task-specific metric computation and logging
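
A minimal sketch of this two-stage flow is shown below for a single classification-style task, assuming an OpenAI-compatible chat client; the prompt, label set, extraction regex, and function names are our own illustrations rather than FLaME's actual code.

```python
# Sketch of the two-stage generation + extraction flow described above.
# The client call, prompt, label set, and regex are illustrative assumptions.
import re
from sklearn.metrics import f1_score

LABELS = ("positive", "negative", "neutral")   # e.g., a sentiment task

def generate(client, model: str, text: str) -> str:
    """Stage 1: ask the LM for a free-form, zero-shot answer."""
    prompt = f"Classify the sentiment of this financial sentence as one of {LABELS}:\n{text}"
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def extract(raw: str) -> str:
    """Stage 2: pull a valid label out of the raw completion with pattern matching."""
    match = re.search(r"\b(positive|negative|neutral)\b", raw.lower())
    return match.group(1) if match else "neutral"   # fall back to a default label

def evaluate(client, model: str, examples: list[tuple[str, str]]) -> float:
    """Run both stages over (text, gold_label) pairs and report weighted F1."""
    preds = [extract(generate(client, model, text)) for text, _ in examples]
    golds = [gold for _, gold in examples]
    return f1_score(golds, preds, average="weighted")
```

Separating extraction from generation keeps metric computation deterministic even when models answer verbosely.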

Results

Our comprehensive evaluation reveals significant performance variations across different financial tasks and models.

The tables below report performance metrics for each task category.

Overall Performance Across All Tasks

Model | Information Retrieval: FiNER F1, FR F1, RD F1, FNXL F1, FE F1 | Sentiment Analysis: FiQA MSE, SQA F1, FPB F1 | Causal Analysis: CD F1, CC F1 | Text Classification: B77 F1, FB F1, FOMC F1, NC F1, HL Acc | Question Answering: CFQA Acc, FinQA Acc, TQA Acc | Summarization: ECTSum BERT-F1, EDTSum BERT-F1
(Abbreviations: FR = FinRED, RD = ReFiND, FE = FinEntity, SQA = SubjECTive-QA, FPB = Financial Phrase Bank, CD = Causal Detection, CC = Causal Classification, B77 = Banking77, FB = FinBench, NC = NumClaim, HL = Headlines, CFQA = ConvFinQA, TQA = TAT-QA.)
Llama 3 70B Instruct | .701 .332 .883 .020 .469 | .123 .535 .902 | .142 .192 | .645 .309 .652 .386 .811 | .709 .809 .772 | .754 .817
Llama 3 8B Instruct | .565 .289 .705 .003 .350 | .161 .600 .698 | .049 .234 | .512 .659 .497 .511 .763 | .268 .767 .706 | .757 .811
DBRX Instruct | .489 .304 .778 .009 .006 | .160 .436 .499 | .087 .231 | .574 .483 .193 .319 .746 | .252 .738 .633 | .729 .806
DeepSeek LLM (67B) | .745 .334 .879 .007 .416 | .118 .462 .811 | .025 .193 | .578 .492 .407 .151 .778 | .174 .742 .355 | .681 .807
Gemma 2 27B | .761 .356 .902 .006 .298 | .100 .515 .884 | .133 .242 | .621 .538 .620 .408 .808 | .268 .768 .734 | .723 .814
Gemma 2 9B | .651 .331 .892 .005 .367 | .189 .491 .940 | .105 .207 | .609 .541 .519 .365 .856 | .292 .779 .750 | .585 .817
Mistral (7B) Instruct v0.3 | .526 .276 .771 .004 .368 | .135 .522 .841 | .052 .227 | .528 .503 .542 .412 .779 | .199 .655 .553 | .750 .811
Mixtral-8x22B Instruct | .635 .367 .811 .009 .435 | .221 .510 .776 | .125 .308 | .602 .221 .465 .513 .835 | .285 .766 .666 | .758 .815
Mixtral-8x7B Instruct | .598 .282 .845 .009 .267 | .208 .498 .893 | .055 .229 | .547 .396 .603 .583 .805 | .315 .611 .501 | .747 .810
Qwen 2 Instruct (72B) | .748 .348 .854 .012 .483 | .205 .576 .901 | .190 .184 | .627 .495 .605 .639 .830 | .269 .819 .715 | .752 .811
WizardLM-2 8x22B | .744 .355 .852 .008 .226 | .129 .566 .779 | .114 .201 | .648 .500 .505 .272 .797 | .247 .796 .725 | .735 .808
DeepSeek-V3 | .790 .437 .934 .045 .549 | .150 .583 .814 | .198 .170 | .714 .487 .578 .675 .729 | .261 .840 .779 | .750 .815
DeepSeek R1 | .807 .393 .952 .057 .587 | .110 .499 .902 | .337 .202 | .763 .419 .670 .688 .769 | .853 .836 .858 | .759 .804
QwQ-32B-Preview | .685 .270 .656 .001 .005 | .141 .550 .815 | .131 .220 | .613 .784 .555 .020 .744 | .282 .793 .796 | .696 .817
Jamba 1.5 Mini | .552 .284 .844 .005 .132 | .119 .418 .765 | .043 .270 | .508 .898 .499 .151 .682 | .218 .666 .586 | .741 .816
Jamba 1.5 Large | .693 .341 .862 .005 .397 | .183 .582 .798 | .074 .176 | .628 .618 .550 .541 .782 | .225 .790 .660 | .734 .818
Claude 3.5 Sonnet | .799 .439 .891 .047 .655 | .101 .553 .944 | .196 .197 | .668 .634 .674 .692 .827 | .402 .844 .700 | .767 .813
Claude 3 Haiku | .711 .285 .883 .015 .494 | .167 .463 .908 | .081 .200 | .622 .022 .631 .558 .781 | .421 .803 .733 | .646 .808
Cohere Command R 7B | .748 .194 .845 .018 .441 | .164 .532 .840 | .057 .255 | .516 .762 .459 .068 .770 | .212 .709 .716 | .750 .815
Cohere Command R + | .756 .333 .922 .021 .452 | .106 .533 .699 | .080 .238 | .651 .684 .393 .118 .812 | .259 .776 .698 | .751 .810
Google Gemini 1.5 Pro | .712 .374 .944 .019 .393 | .144 .593 .885 | .196 .217 | .418 .336 .579 .525 .837 | .280 .829 .763 | .777 .817
OpenAI gpt-4o | .766 .399 .942 .037 .523 | .184 .541 .928 | .130 .222 | .710 .524 .664 .750 .824 | .749 .836 .754 | .773 .816
OpenAI o1-mini | .761 .403 .876 .010 .662 | .120 .542 .917 | .289 .209 | .670 .612 .635 .720 .769 | .840 .799 .698 | .763 .816


Causal Analysis Task Results

Model | Causal Detection (CD): Accuracy, Precision, Recall, F1 | Causal Classification (CC): Precision, Recall, F1, Accuracy
Llama 3 70B Instruct 0.148 0.429 0.148 0.142 0.241 0.329 0.192 0.198
Llama 3 8B Instruct 0.097 0.341 0.097 0.049 0.232 0.241 0.234 0.380
DBRX Instruct 0.078 0.521 0.078 0.087 0.276 0.313 0.231 0.235
DeepSeek LLM (67B) 0.026 0.214 0.026 0.025 0.141 0.328 0.193 0.221
Gemma 2 27B 0.115 0.510 0.115 0.133 0.309 0.310 0.242 0.262
Gemma 2 9B 0.115 0.394 0.115 0.105 0.275 0.294 0.207 0.258
Mistral (7B) Instruct v0.3 0.078 0.455 0.078 0.052 0.339 0.361 0.227 0.258
Mixtral-8x22B Instruct 0.131 0.486 0.131 0.125 0.344 0.310 0.308 0.318
Mixtral-8x7B Instruct 0.088 0.510 0.088 0.055 0.308 0.314 0.229 0.273
Qwen 2 Instruct (72B) 0.139 0.489 0.139 0.190 0.208 0.330 0.184 0.188
WizardLM-2 8x22B 0.076 0.453 0.076 0.114 0.263 0.347 0.201 0.237
DeepSeek-V3 0.164 0.528 0.164 0.198 0.194 0.327 0.170 0.248
DeepSeek R1 0.245 0.643 0.245 0.337 0.385 0.318 0.202 0.221
QwQ-32B-Preview 0.110 0.473 0.110 0.131 0.193 0.262 0.220 0.465
Jamba 1.5 Mini 0.050 0.280 0.050 0.043 0.323 0.283 0.270 0.295
Jamba 1.5 Large 0.076 0.517 0.076 0.074 0.268 0.248 0.176 0.200
Claude 3.5 Sonnet 0.154 0.564 0.154 0.196 0.259 0.336 0.197 0.235
Claude 3 Haiku 0.082 0.388 0.082 0.081 0.369 0.347 0.200 0.203
Cohere Command R 7B 0.089 0.363 0.089 0.057 0.379 0.356 0.255 0.275
Cohere Command R + 0.090 0.453 0.090 0.080 0.353 0.336 0.238 0.265
Google Gemini 1.5 Pro 0.165 0.514 0.165 0.196 0.265 0.357 0.217 0.258
OpenAI gpt-4o 0.082 0.576 0.082 0.130 0.254 0.327 0.222 0.235
OpenAI o1-mini 0.206 0.648 0.206 0.289 0.325 0.316 0.209 0.233


Information Retrieval Task Results

Model | FiNER-ORD: Precision, Recall, F1, Accuracy | FinRED: Accuracy, Precision, Recall, F1 | ReFiND: Accuracy, Precision, Recall, F1 | FNXL: Precision, Recall, F1, Accuracy | FinEntity: Precision, Recall, Accuracy, F1
Llama 3 70B Instruct 0.715 0.693 0.701 0.911 0.314 0.454 0.314 0.332 0.879 0.904 0.879 0.883 0.015 0.030 0.020 0.010 0.474 0.485 0.485 0.469
Llama 3 8B Instruct 0.581 0.558 0.565 0.854 0.296 0.357 0.296 0.289 0.723 0.755 0.723 0.705 0.003 0.004 0.003 0.002 0.301 0.478 0.478 0.350
DBRX Instruct 0.516 0.476 0.489 0.802 0.329 0.371 0.329 0.304 0.766 0.825 0.766 0.778 0.008 0.011 0.009 0.005 0.004 0.014 0.014 0.006
DeepSeek LLM (67B) 0.752 0.742 0.745 0.917 0.344 0.403 0.344 0.334 0.874 0.890 0.874 0.879 0.005 0.009 0.007 0.003 0.456 0.405 0.405 0.416
Gemma 2 27B 0.772 0.754 0.761 0.923 0.352 0.437 0.352 0.356 0.897 0.914 0.897 0.902 0.005 0.008 0.006 0.003 0.320 0.295 0.295 0.298
Gemma 2 9B 0.665 0.643 0.651 0.886 0.336 0.373 0.336 0.331 0.885 0.902 0.885 0.892 0.004 0.008 0.005 0.003 0.348 0.419 0.419 0.367
Mistral (7B) Instruct 0.540 0.522 0.526 0.806 0.278 0.383 0.278 0.276 0.767 0.817 0.767 0.771 0.004 0.006 0.004 0.002 0.337 0.477 0.477 0.368
Mixtral-8x22B Instruct 0.653 0.625 0.635 0.870 0.381 0.414 0.381 0.367 0.807 0.847 0.807 0.811 0.010 0.008 0.009 0.005 0.428 0.481 0.481 0.435
Mixtral-8x7B Instruct 0.613 0.591 0.598 0.875 0.291 0.376 0.291 0.282 0.840 0.863 0.840 0.845 0.007 0.012 0.009 0.005 0.251 0.324 0.324 0.267
Qwen 2 Instruct (72B) 0.766 0.742 0.748 0.899 0.365 0.407 0.365 0.348 0.850 0.881 0.850 0.854 0.010 0.016 0.012 0.006 0.468 0.530 0.530 0.483
WizardLM-2 8x22B 0.755 0.741 0.744 0.920 0.362 0.397 0.362 0.355 0.846 0.874 0.846 0.852 0.008 0.009 0.008 0.004 0.222 0.247 0.247 0.226
DeepSeek-V3 0.798 0.787 0.790 0.945 0.450 0.463 0.450 0.437 0.927 0.943 0.927 0.934 0.034 0.067 0.045 0.023 0.563 0.544 0.544 0.549
DeepSeek R1 0.813 0.805 0.807 0.944 0.412 0.424 0.412 0.393 0.946 0.960 0.946 0.952 0.044 0.082 0.057 0.029 0.600 0.586 0.586 0.587
QwQ-32B-Preview 0.695 0.681 0.685 0.907 0.278 0.396 0.278 0.270 0.680 0.770 0.680 0.656 0.001 0.001 0.001 0.000 0.005 0.005 0.005 0.005
Jamba 1.5 Mini 0.564 0.556 0.552 0.818 0.308 0.450 0.308 0.284 0.830 0.864 0.830 0.844 0.004 0.006 0.005 0.003 0.119 0.182 0.182 0.132
Jamba 1.5 Large 0.707 0.687 0.693 0.883 0.341 0.452 0.341 0.341 0.856 0.890 0.856 0.862 0.004 0.005 0.005 0.002 0.403 0.414 0.414 0.397
Claude 3.5 Sonnet 0.811 0.794 0.799 0.922 0.455 0.465 0.455 0.439 0.873 0.927 0.873 0.891 0.034 0.080 0.047 0.024 0.658 0.668 0.668 0.655
Claude 3 Haiku 0.732 0.700 0.711 0.895 0.294 0.330 0.294 0.285 0.879 0.917 0.879 0.883 0.011 0.022 0.015 0.008 0.498 0.517 0.517 0.494
Cohere Command R + 0.769 0.750 0.756 0.902 0.353 0.405 0.353 0.333 0.917 0.930 0.917 0.922 0.016 0.032 0.021 0.011 0.462 0.459 0.459 0.452
Google Gemini 1.5 Pro 0.728 0.705 0.712 0.891 0.373 0.436 0.373 0.374 0.934 0.955 0.934 0.944 0.014 0.028 0.019 0.010 0.399 0.400 0.400 0.393
OpenAI gpt-4o 0.778 0.760 0.766 0.911 0.402 0.445 0.402 0.399 0.931 0.955 0.931 0.942 0.027 0.056 0.037 0.019 0.537 0.517 0.517 0.523
OpenAI o1-mini 0.772 0.755 0.761 0.922 0.407 0.444 0.407 0.403 0.867 0.900 0.867 0.876 0.007 0.015 0.010 0.005 0.661 0.681 0.681 0.662


Question Answering Task Results

Model | FinQA (Accuracy) | ConvFinQA (Accuracy) | TAT-QA (Accuracy)
Llama 3 70B Instruct 0.809 0.709 0.772
Llama 3 8B Instruct 0.767 0.268 0.706
DBRX Instruct 0.738 0.252 0.633
DeepSeek LLM (67B) 0.742 0.174 0.355
Gemma 2 27B 0.768 0.268 0.734
Gemma 2 9B 0.779 0.292 0.750
Mistral (7B) Instruct v0.3 0.655 0.199 0.553
Mixtral-8x22B Instruct 0.766 0.285 0.666
Mixtral-8x7B Instruct 0.611 0.315 0.501
Qwen 2 Instruct (72B) 0.819 0.269 0.715
WizardLM-2 8x22B 0.796 0.247 0.725
DeepSeek-V3 0.840 0.261 0.779
DeepSeek R1 0.836 0.853 0.858
QwQ-32B-Preview 0.793 0.282 0.796
Jamba 1.5 Mini 0.666 0.218 0.586
Jamba 1.5 Large 0.790 0.225 0.660
Claude 3.5 Sonnet 0.844 0.402 0.700
Claude 3 Haiku 0.803 0.421 0.733
Cohere Command R 7B 0.709 0.212 0.716
Cohere Command R + 0.776 0.259 0.698
Google Gemini 1.5 Pro 0.829 0.280 0.763
OpenAI gpt-4o 0.836 0.749 0.754
OpenAI o1-mini 0.799 0.840 0.698


Sentiment Analysis Task Results

Model | FiQA Task 1: MSE, MAE, r² Score | FinEntity: Precision, Recall, Accuracy, F1 | SubjECTive-QA: Precision, Recall, F1, Accuracy | Financial Phrase Bank (FPB): Accuracy, Precision, Recall, F1
Llama 3 70B Instruct 0.123 0.290 0.272 0.474 0.485 0.485 0.469 0.652 0.573 0.535 0.573 0.901 0.904 0.901 0.902
Llama 3 8B Instruct 0.161 0.344 0.045 0.301 0.478 0.478 0.350 0.635 0.625 0.600 0.625 0.738 0.801 0.738 0.698
DBRX Instruct 0.160 0.321 0.052 0.004 0.014 0.014 0.006 0.654 0.541 0.436 0.541 0.524 0.727 0.524 0.499
DeepSeek LLM (67B) 0.118 0.278 0.302 0.456 0.405 0.405 0.416 0.676 0.544 0.462 0.544 0.815 0.867 0.815 0.811
Gemma 2 27B 0.100 0.266 0.406 0.320 0.295 0.295 0.298 0.562 0.524 0.515 0.524 0.890 0.896 0.890 0.884
Gemma 2 9B 0.189 0.352 -0.120 0.348 0.419 0.419 0.367 0.570 0.499 0.491 0.499 0.940 0.941 0.940 0.940
Mistral (7B) Instruct v0.3 0.135 0.278 0.200 0.337 0.477 0.477 0.368 0.607 0.542 0.522 0.542 0.847 0.854 0.847 0.841
Mixtral-8x22B Instruct 0.221 0.364 -0.310 0.428 0.481 0.481 0.435 0.614 0.538 0.510 0.538 0.768 0.845 0.768 0.776
Mixtral-8x7B Instruct 0.208 0.307 -0.229 0.251 0.324 0.324 0.267 0.611 0.518 0.498 0.518 0.896 0.898 0.896 0.893
Qwen 2 Instruct (72B) 0.205 0.409 -0.212 0.468 0.530 0.530 0.483 0.644 0.601 0.576 0.601 0.904 0.908 0.904 0.901
WizardLM-2 8x22B 0.129 0.283 0.239 0.222 0.247 0.247 0.226 0.611 0.570 0.566 0.570 0.765 0.853 0.765 0.779
DeepSeek-V3 0.150 0.311 0.111 0.563 0.544 0.544 0.549 0.640 0.572 0.583 0.572 0.828 0.851 0.828 0.814
DeepSeek R1 0.110 0.289 0.348 0.600 0.586 0.586 0.587 0.644 0.489 0.499 0.489 0.904 0.907 0.904 0.902
QwQ-32B-Preview 0.141 0.290 0.165 0.005 0.005 0.005 0.005 0.629 0.534 0.550 0.534 0.812 0.827 0.812 0.815
Jamba 1.5 Mini 0.119 0.282 0.293 0.119 0.182 0.182 0.132 0.380 0.525 0.418 0.525 0.784 0.814 0.784 0.765
Jamba 1.5 Large 0.183 0.363 -0.085 0.403 0.414 0.414 0.397 0.635 0.573 0.582 0.573 0.824 0.850 0.824 0.798
Claude 3.5 Sonnet 0.101 0.268 0.402 0.658 0.668 0.668 0.655 0.634 0.585 0.553 0.585 0.944 0.945 0.944 0.944
Claude 3 Haiku 0.167 0.349 0.008 0.498 0.517 0.517 0.494 0.619 0.538 0.463 0.538 0.907 0.913 0.907 0.908
Cohere Command R 7B 0.164 0.319 0.028 0.457 0.446 0.446 0.441 0.609 0.547 0.532 0.547 0.835 0.861 0.835 0.840
Cohere Command R + 0.106 0.274 0.373 0.462 0.459 0.459 0.452 0.608 0.547 0.533 0.547 0.741 0.806 0.741 0.699
Google Gemini 1.5 Pro 0.144 0.329 0.149 0.399 0.400 0.400 0.393 0.642 0.587 0.593 0.587 0.890 0.895 0.890 0.885
OpenAI gpt-4o 0.184 0.317 -0.089 0.537 0.517 0.517 0.523 0.639 0.515 0.541 0.515 0.929 0.931 0.929 0.928
OpenAI o1-mini 0.120 0.295 0.289 0.661 0.681 0.681 0.662 0.660 0.515 0.542 0.515 0.918 0.917 0.918 0.917


Text Classification Task Results

Model | Banking77: Accuracy, Precision, Recall, F1 | FinBench: Accuracy, Precision, Recall, F1 | FOMC: Accuracy, Precision, Recall, F1 | NumClaim: Precision, Recall, Accuracy, F1 | Headlines: Accuracy
Llama 3 70B Instruct 0.660 0.748 0.660 0.645 0.222 0.826 0.222 0.309 0.661 0.662 0.661 0.652 0.430 0.240 0.980 0.386 0.811
Llama 3 8B Instruct 0.534 0.672 0.534 0.512 0.543 0.857 0.543 0.659 0.565 0.618 0.565 0.497 0.801 0.463 0.571 0.511 0.763
DBRX Instruct 0.578 0.706 0.578 0.574 0.359 0.851 0.359 0.483 0.285 0.572 0.285 0.193 0.222 0.190 1.000 0.319 0.746
DeepSeek LLM (67B) 0.596 0.711 0.596 0.578 0.369 0.856 0.369 0.492 0.532 0.678 0.532 0.407 0.832 1.000 0.082 0.151 0.778
Gemma 2 27B 0.639 0.730 0.639 0.621 0.410 0.849 0.410 0.538 0.651 0.704 0.651 0.620 0.471 0.257 1.000 0.408 0.808
Gemma 2 9B 0.630 0.710 0.630 0.609 0.412 0.848 0.412 0.541 0.595 0.694 0.595 0.519 0.371 0.224 0.990 0.365 0.856
Mistral (7B) Instruct v0.3 0.547 0.677 0.547 0.528 0.375 0.839 0.375 0.503 0.587 0.598 0.587 0.542 0.521 0.266 0.918 0.412 0.779
Mixtral-8x22B Instruct 0.622 0.718 0.622 0.602 0.166 0.811 0.166 0.221 0.562 0.709 0.562 0.465 0.732 0.384 0.775 0.513 0.835
Mixtral-8x7B Instruct 0.567 0.693 0.567 0.547 0.285 0.838 0.285 0.396 0.623 0.636 0.623 0.603 0.765 0.431 0.898 0.583 0.805
Qwen 2 Instruct (72B) 0.644 0.730 0.644 0.627 0.370 0.848 0.370 0.495 0.623 0.639 0.623 0.605 0.821 0.506 0.867 0.639 0.830
WizardLM-2 8x22B 0.664 0.737 0.664 0.648 0.373 0.842 0.373 0.500 0.583 0.710 0.583 0.505 0.831 0.630 0.173 0.272 0.797
DeepSeek-V3 0.722 0.774 0.722 0.714 0.362 0.845 0.362 0.487 0.625 0.712 0.625 0.578 0.860 0.586 0.796 0.675 0.729
DeepSeek R1 0.772 0.789 0.772 0.763 0.306 0.846 0.306 0.419 0.679 0.682 0.679 0.670 0.851 0.557 0.898 0.688 0.769
QwQ-32B-Preview 0.577 0.747 0.577 0.613 0.716 0.871 0.716 0.784 0.591 0.630 0.591 0.555 0.819 1.000 0.010 0.020 0.744
Jamba 1.5 Mini 0.528 0.630 0.528 0.508 0.913 0.883 0.913 0.898 0.572 0.678 0.572 0.499 0.812 0.429 0.092 0.151 0.682
Jamba 1.5 Large 0.642 0.746 0.642 0.628 0.494 0.851 0.494 0.618 0.597 0.650 0.597 0.550 0.855 0.639 0.469 0.541 0.782
Claude 3.5 Sonnet 0.682 0.755 0.682 0.668 0.513 0.854 0.513 0.634 0.675 0.677 0.675 0.674 0.879 0.646 0.745 0.692 0.827
Claude 3 Haiku 0.639 0.735 0.639 0.622 0.067 0.674 0.067 0.022 0.633 0.634 0.633 0.631 0.838 0.556 0.561 0.558 0.781
Cohere Command R 7B 0.530 0.650 0.530 0.516 0.682 0.868 0.682 0.762 0.536 0.505 0.536 0.459 0.797 0.210 0.041 0.068 0.770
Cohere Command R + 0.660 0.747 0.660 0.651 0.575 0.859 0.575 0.684 0.526 0.655 0.526 0.393 0.804 0.333 0.071 0.118 0.812
Google Gemini 1.5 Pro 0.483 0.487 0.483 0.418 0.240 0.823 0.240 0.336 0.619 0.667 0.619 0.579 0.700 0.369 0.908 0.525 0.837
OpenAI gpt-4o 0.704 0.792 0.704 0.710 0.396 0.846 0.396 0.524 0.681 0.719 0.681 0.664 0.896 0.667 0.857 0.750 0.824
OpenAI o1-mini 0.681 0.760 0.681 0.670 0.487 0.851 0.487 0.612 0.651 0.670 0.651 0.635 0.888 0.664 0.786 0.720 0.769


Text Summarization Task Results

Model | ECTSum: BERTScore Precision, Recall, F1 | EDTSum: BERTScore Precision, Recall, F1
Llama 3 70B Instruct 0.715 0.801 0.754 0.793 0.844 0.817
Llama 3 8B Instruct 0.724 0.796 0.757 0.785 0.841 0.811
DBRX Instruct 0.680 0.786 0.729 0.774 0.843 0.806
DeepSeek LLM (67B) 0.692 0.678 0.681 0.779 0.840 0.807
Gemma 2 27B 0.680 0.777 0.723 0.801 0.829 0.814
Gemma 2 9B 0.651 0.531 0.585 0.803 0.833 0.817
Mistral (7B) Instruct v0.3 0.702 0.806 0.750 0.783 0.842 0.811
Mixtral-8x22B Instruct 0.713 0.812 0.758 0.790 0.843 0.815
Mixtral-8x7B Instruct 0.727 0.773 0.747 0.785 0.839 0.810
Qwen 2 Instruct (72B) 0.709 0.804 0.752 0.781 0.846 0.811
WizardLM-2 8x22B 0.677 0.806 0.735 0.774 0.847 0.808
DeepSeek-V3 0.703 0.806 0.750 0.791 0.842 0.815
DeepSeek R1 0.724 0.800 0.759 0.770 0.843 0.804
QwQ-32B-Preview 0.653 0.751 0.696 0.797 0.841 0.817
Jamba 1.5 Mini 0.692 0.798 0.741 0.798 0.838 0.816
Jamba 1.5 Large 0.679 0.800 0.734 0.799 0.841 0.818
Claude 3.5 Sonnet 0.737 0.802 0.767 0.786 0.843 0.813
Claude 3 Haiku 0.683 0.617 0.646 0.778 0.844 0.808
Cohere Command R 7B 0.724 0.781 0.750 0.790 0.844 0.815
Cohere Command R + 0.724 0.782 0.751 0.789 0.834 0.810
Google Gemini 1.5 Pro 0.757 0.800 0.777 0.800 0.836 0.817
OpenAI gpt-4o 0.755 0.793 0.773 0.795 0.840 0.816
OpenAI o1-mini 0.731 0.801 0.763 0.795 0.840 0.816


Model Cost Analysis

Model | Cost (USD) per dataset: FOMC, FPB, FinQA, FiQA-1, FiQA-2, HL, FB, FR, RD, EDTSum, B77, CD, CC, ECTSum, FE, FiNER, FNXL, NC, TQA, CFQA, SQA | Total
Llama 3 70B Instruct 0.10 0.11 1.14 0.06 0.72 1.00 0.40 0.38 1.34 1.94 1.64 0.07 0.05 1.56 0.12 0.33 0.25 0.09 1.11 2.96 1.17 16.54
Llama 3 8B Instruct 0.02 0.03 0.25 0.01 0.16 0.22 0.09 0.09 0.32 0.43 0.37 0.02 0.01 0.36 0.03 0.08 0.06 0.02 0.26 0.69 0.26 3.79
DBRX Instruct 0.14 0.17 1.50 0.06 0.95 1.29 0.56 0.57 2.05 2.93 2.14 0.11 0.10 2.45 0.17 0.47 0.34 0.13 1.47 4.19 1.55 23.35
DeepSeek LLM (67B) 0.10 0.12 1.25 0.05 0.76 0.87 0.42 0.37 1.45 1.85 2.03 0.08 0.05 0.83 0.13 0.34 0.24 0.09 1.20 3.17 1.17 16.57
Gemma 2 27B 0.08 0.09 1.05 0.05 0.66 0.91 0.30 0.34 1.37 1.75 1.77 0.07 0.04 1.46 0.11 0.30 0.21 0.08 1.00 2.84 1.04 15.50
Gemma 2 9B 0.03 0.03 0.40 0.02 0.24 0.33 0.12 0.14 0.51 0.66 0.66 0.03 0.02 0.00 0.04 0.11 0.08 0.03 0.37 1.08 0.39 5.29
Mistral (7B) Instruct v0.3 0.03 0.03 0.28 0.01 0.18 0.24 0.10 0.09 0.36 0.57 0.48 0.02 0.01 0.45 0.03 0.08 0.06 0.02 0.27 0.78 0.26 4.36
Mixtral-8x22B Instruct 0.14 0.17 1.80 0.07 1.05 1.44 0.58 0.56 2.04 3.42 2.89 0.11 0.07 2.66 0.18 0.48 0.35 0.14 1.73 4.90 1.55 26.35
Mixtral-8x7B Instruct 0.08 0.09 0.88 0.04 0.53 0.70 0.30 0.30 1.07 1.72 1.50 0.06 0.05 1.30 0.09 0.24 0.20 0.07 0.87 2.55 0.78 13.41
Qwen 2 Instruct (72B) 0.10 0.12 1.29 0.05 0.74 0.96 0.43 0.43 1.44 2.36 1.61 0.08 0.05 1.80 0.12 0.34 0.24 0.10 1.18 3.41 1.17 18.02
WizardLM-2 8x22B 0.16 0.19 1.94 0.08 1.07 1.47 0.61 0.61 2.24 3.47 3.00 0.11 0.10 2.85 0.18 0.49 0.34 0.14 1.94 5.31 1.55 27.87
DeepSeek-V3 0.13 0.15 1.57 0.07 0.98 1.36 0.52 0.54 2.10 2.99 2.55 0.11 0.06 2.33 0.16 0.55 0.28 0.12 1.56 4.28 1.62 24.03
DeepSeek R1 1.99 2.10 14.18 1.48 17.82 20.11 6.63 12.65 31.00 21.15 23.28 3.75 1.06 15.02 7.31 8.34 11.21 1.88 13.72 39.42 9.07 263.16
QwQ-32B-Preview 0.15 0.18 2.38 0.08 0.93 1.37 0.60 0.68 2.18 3.12 2.36 0.11 0.07 2.76 0.14 0.65 0.54 0.14 2.61 7.83 1.55 30.43
Jamba 1.5 Mini 0.02 0.03 0.30 0.02 0.23 0.22 0.10 0.08 0.44 0.55 0.51 0.02 0.01 0.49 0.05 0.10 0.07 0.02 0.25 0.72 0.26 4.47
Jamba 1.5 Large 0.31 0.36 4.42 0.30 3.47 4.81 1.78 0.94 4.97 5.80 5.51 0.35 0.13 7.07 0.56 1.67 0.77 0.30 2.87 7.45 2.59 56.42
Claude 3.5 Sonnet 0.62 0.72 6.98 0.55 6.50 8.81 3.44 3.21 12.32 9.50 11.11 0.61 0.22 7.09 0.90 3.01 1.79 0.57 9.18 16.86 3.89 107.87
Claude 3 Haiku 0.06 0.07 0.56 0.05 0.54 0.73 0.28 0.25 0.82 0.81 0.90 0.05 0.02 0.21 0.06 0.23 0.14 0.05 0.64 1.28 0.32 8.07
Cohere Command R 7B 0.01 0.01 0.08 0.00 0.07 0.09 0.04 0.03 0.11 0.11 0.10 0.01 0.00 0.08 0.01 0.03 0.01 0.01 0.08 0.19 0.05 1.09
Cohere Command R + 0.41 0.45 5.40 0.35 4.41 4.00 2.30 0.93 3.87 7.03 7.21 0.43 0.12 5.55 0.48 1.69 0.97 0.42 4.59 10.09 3.24 63.95
Google Gemini 1.5 Pro 0.23 0.21 2.26 0.18 2.20 2.78 1.02 0.49 2.27 3.45 2.70 0.21 0.07 2.65 0.25 0.87 0.58 0.21 2.13 5.78 1.62 32.16
OpenAI gpt-4o 0.35 0.41 4.99 0.32 4.45 5.33 1.55 1.21 5.77 6.57 5.00 0.35 0.14 4.85 0.44 1.94 0.96 0.34 4.95 10.36 3.24 63.52
OpenAI o1-mini 0.90 0.90 5.25 0.73 9.70 12.20 3.27 4.89 13.60 1.29 9.29 2.56 0.75 3.18 2.92 1.91 6.39 0.92 6.97 15.71 1.42 104.73
Cost tiers (by total benchmark cost): Low ($0-$10), Medium ($10-$35), High ($35-$70), Very High ($70+).

Note: All costs are in USD and represent the expense of running each model on each dataset. For cost-efficiency analysis, compare these costs with the corresponding performance metrics in the tables above.
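
One simple way to read this table against the performance tables is score per dollar. The snippet below computes it for FinQA, using accuracies and per-dataset costs taken from the tables above; the helper and the model subset are ours, purely for illustration.

```python
# Score-per-dollar on FinQA, using values from the results and cost tables above.
# The chosen model subset and the "accuracy per dollar" view are illustrative.
finqa = {
    # model: (FinQA accuracy, FinQA cost in USD)
    "Llama 3 8B Instruct": (0.767, 0.25),
    "DeepSeek-V3":         (0.840, 1.57),
    "DeepSeek R1":         (0.836, 14.18),
    "Claude 3.5 Sonnet":   (0.844, 6.98),
    "OpenAI gpt-4o":       (0.836, 4.99),
}

ranked = sorted(finqa.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for model, (acc, cost) in ranked:
    print(f"{model:22s} acc={acc:.3f} cost=${cost:5.2f} acc/$={acc / cost:6.2f}")
```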

Model Performance Highlights

πŸ† Top Performing Models

DeepSeek R1 Logo

DeepSeek R1

OpenAI Logo

OpenAI o1-mini

Claude 3.5 Sonnet Logo

Claude 3.5 Sonnet


Key Insights from Model Analysis

πŸ† No single dominant model: DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.

⚖️ Inconsistent scaling: Larger models don't always outperform smaller ones; DeepSeek R1 trails in summarization despite excelling in QA.

🛠️ Open-weight models: Models such as DeepSeek-V3 and Llama 3 70B offer competitive performance while being cost-effective.

💰 Cost-performance disparities: Running the full benchmark cost roughly $260 with DeepSeek R1, around $105 with Claude 3.5 Sonnet or o1-mini, and under $4 with Llama 3 8B.

📈 Numeric reasoning challenges: Even the best models struggle with financial numeric reasoning; on extreme numeric labeling (FNXL), F1 scores stay at or below 0.06.

🔒 Step-by-step deductions: Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.

Error Analysis & Key Findings

Common challenges and limitations identified in our evaluations:

Concerns regarding outdated models

Llama 2 13B Chat produces trivial or empty responses, possibly due to misalignment during fine-tuning.

Numeric Regression Issues

LMs struggle with precision and rounding in continuous-valued regressions (e.g., financial percentages). Post-hoc normalization is needed.
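
One way the needed post-hoc normalization can look is sketched below: parse the first number from the model's free text, handle percent signs and thousands separators, and round to the reference's precision before scoring. This is a generic remedy under our own assumptions, not FLaME's exact procedure.

```python
# Sketch of post-hoc numeric normalization for regression-style outputs.
# A generic remedy under illustrative assumptions, not FLaME's exact procedure.
import re

def normalize_number(raw: str, reference: float) -> float | None:
    """Extract the first number in a response, handle '%' and commas,
    and round to the reference's decimal precision before scoring."""
    match = re.search(r"-?\d[\d,]*\.?\d*", raw)
    if match is None:
        return None                          # nothing parsable in the response
    value = float(match.group(0).replace(",", ""))
    if "%" in raw and abs(reference) <= 1:
        value /= 100                         # "12.4%" vs. a 0.124-style reference
    decimals = len(str(reference).split(".")[1]) if "." in str(reference) else 0
    return round(value, decimals)

print(normalize_number("Revenue grew by 12.4% year over year.", 0.12))  # -> 0.12
```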

Data Contamination

Overlap between public financial datasets and pretraining corpora can inflate zero-shot performance, requiring time-split test sets.
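
A time-split test set needs nothing more than publication dates: keep only examples published after the model's training-data cutoff. The pandas sketch below assumes each example carries a date field; the column names and cutoff are illustrative.

```python
# Sketch of building a time-split test set to reduce contamination risk.
# Assumes each example has a publication date; column names are illustrative.
import pandas as pd

def time_split(df: pd.DataFrame, cutoff: str, date_col: str = "date") -> pd.DataFrame:
    """Keep only examples published after the model's training-data cutoff."""
    dates = pd.to_datetime(df[date_col])
    return df[dates > pd.Timestamp(cutoff)].reset_index(drop=True)

corpus = pd.DataFrame({
    "date": ["2021-03-01", "2024-06-30", "2024-11-15"],
    "text": ["older filing", "recent filing", "newest filing"],
})
held_out = time_split(corpus, cutoff="2024-01-01")
print(held_out)   # only the two post-cutoff rows remain
```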

Challenges in Causal Classification

Most models struggle with financial causal reasoning, requiring structured knowledge bases or explicit symbolic reasoning.

Language Drift

Qwen 2 72B exhibits unintended shifts to Chinese output in English summarization tasks, indicating strong pretraining priors.

Summarization Nuances

Models achieve high BERTScores (~80-82%) on extractive summarization but struggle with abstractive summaries, especially those involving finance-specific jargon.
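
The BERTScore numbers in the summarization table can be reproduced with the open-source bert-score package; a minimal sketch with placeholder candidate and reference texts follows (the texts are invented, the API call is the package's standard one).

```python
# Minimal BERTScore sketch using the open-source bert-score package
# (pip install bert-score). Candidate and reference texts are placeholders.
from bert_score import score

candidates = ["Q3 revenue rose 12% on strong trading volumes."]
references = ["The company reported a 12% year-over-year revenue increase in Q3."]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore  P={P.mean():.3f}  R={R.mean():.3f}  F1={F1.mean():.3f}")
```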

Prompt Design Limitations

Prompts tuned on Llama 3 8B may not generalize across models, leading to inconsistencies in label generation (e.g., minor syntactic variations).

Differences in QA Datasets

Models consistently score lower on ConvFinQA than on FinQA because of its multi-turn dialogue complexity.

Efficiency and Cost Considerations

Inference costs vary by up to 2× among similarly sized models, requiring a balance between performance and resource usage.

Contributions & Future Work

Our work introduces a standardized, large-scale, and holistic evaluation framework for financial language models.

Key Contributions

Standardized Evaluation Framework

We introduce an open-source, modular benchmarking suite for systematic LM evaluations on core financial NLP tasks.

Large-Scale Model Assessment

We benchmark 23 foundation LMs (open-weight and proprietary) across 20 financial tasks, revealing performance-cost trade-offs.

Holistic Dataset Taxonomy

We establish a structured dataset taxonomy, categorizing financial NLP tasks based on domain, data format, and linguistic complexity.

Living Benchmark & Open Collaboration

We introduce a continuously updated leaderboard, inviting researchers to contribute new datasets and evaluation results.

Error Analysis & Cost-Performance Insights

We analyze systematic model errors and quantify cost-performance trade-offs for informed deployment in real-world applications.

Open-Source Implementation

We release a fully open-source framework, enabling the research community to extend and refine financial LM evaluation methodologies.

Limitations

While our benchmark provides valuable insights, several limitations must be acknowledged:

❌ Data Contamination Risks

Benchmark testing data may overlap with model pretraining corpora, leading to artificially inflated performance. We actively work on novel datasets to mitigate these risks.

⚠️ Dataset Size & Diversity

Our dataset scope is limited, affecting model generalization across diverse financial domains and languages.

⚠️ Zero-Shot Focus

Due to budget constraints, our evaluations rely on zero-shot learning only, without fine-tuning or few-shot prompting.

⚠️ Limited Adaptation Strategies

We do not explore chain-of-thought reasoning or advanced prompting, though these techniques are known to improve model performance.

ℹ️ English Language Bias

The benchmark primarily focuses on English due to the availability of financial datasets, limiting insights into multilingual model performance.

ℹ️ Real-World Complex Tasks

Existing tasks do not fully capture the dynamic and evolving nature of financial markets, requiring ongoing dataset expansion.

Recognizing these limitations is essential for improving future financial NLP benchmarks. Our ongoing work aims to address these challenges through dataset refinement, broader task coverage, and multilingual support.

Future Work

To strengthen the robustness and adaptability of our framework, we advocate for open collaboration within the research community and propose the following future directions to expand its capabilities:

🌍 Multilingual Expansion

Extending benchmarks beyond English to include multilingual financial datasets and evaluations.

🧠 Few-Shot & Chain-of-Thought

Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).

⚙️ Domain-Adaptive Training

Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.

📊 Expanded Dataset Coverage

Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.

⚖️ Efficiency & Cost Benchmarking

Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.

📈 Advanced Evaluation Metrics

Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.

These improvements will enable more accurate and fair comparisons of financial language models, fostering greater transparency, reproducibility, and real-world applicability.

Framework & Resources

FLaME Framework

An open-source, modular benchmarking suite for evaluating financial language models.

Framework Features

🔧 Standardized Pipelines

Pre-built evaluation pipelines for key financial NLP tasks.

⚙️ Customizable Assessments

Supports easy model-to-model and dataset-to-dataset comparisons.

Reproducible Benchmarking

Ensures consistent evaluation metrics and transparent methodology.

🔗 Extensible Architecture

Easily integrates new tasks, datasets, and evaluation modules.

Getting Started

  1. Clone the repository:
     git clone https://github.com/gtfintechlab/FLaME.git

  2. Install dependencies:
     pip install -r requirements.txt

  3. Configure models: edit config.yaml with your API keys and model settings.

  4. Run evaluations:
     python -m flame.run --task all

For detailed documentation, visit our GitHub repository.

Datasets & Domains

📊 Numerical Reasoning & Question Answering

  • FinQA – Multi-step financial numerical reasoning.
  • ConvFinQA – Conversational numerical reasoning.
  • TAT-QA – Hybrid table-text question answering.

📝 Text Summarization

  • ECTSum – Earnings call transcript summarization.
  • EDTSum – Financial news summarization.

🔎 Information Retrieval

  • FiNER-ORD – Named entity recognition for financial documents.
  • FinEntity – Entity-based sentiment classification.
  • Financial Numeric Extreme Labeling (FNXL) – Automated numeral annotation in financial reports.
  • FinRED – Relation extraction in finance.
  • REFinD – Relation extraction from SEC filings.

😐 Sentiment Analysis

  • FiQA (Task 1) – Aspect-based sentiment analysis.
  • FiQA (Task 2) – Opinion-based financial QA.
  • Financial Phrase Bank (FPB) – Market sentiment classification.
  • SubjECTive-QA – Subjectivity detection in earnings call Q&A.

🏷️ Text Classification

  • Numerical Claim Detection – Fine-grained investor claim detection.
  • News Headline Classification – Market sentiment and trend detection.
  • FOMC Dataset – Hawkish-Dovish stance classification.
  • Banking77 – Fine-grained intent detection.
  • FinBench – Financial risk classification (default, fraud, churn).

🧠 Causal Analysis

  • FinCausal – Causal reasoning in financial news.
  • FinCausal-SC – Cause-effect span extraction.

Citation

If you use FLaME in your research, please cite our paper:

@article{flame2025,
  author    = {Glenn Matlin and Mika Okamoto and Huzaifa Pardawala and Yang Yang and Sudheer Chava},
  title     = {FLaME: Holistic Financial Language Model Evaluation},
  year      = {2025},
  month     = {February},
}