FLaME: Holistic Financial Language Model Evaluation

Georgia Institute of Technology

FLaME is the first comprehensive benchmarking suite for evaluating language models on financial NLP tasks.

FLaME Evaluation Framework
Figure 1: Functional Overview of FLaME. The evaluation framework includes a comprehensive taxonomy, carefully selected datasets, diverse models, and standardized metrics.

Task Taxonomy

FLaME Task Taxonomy

Task Taxonomy: Illustrative breakdown for each of the six core NLP task categories (Classification, Sentiment Analysis, Information Retrieval, Causal Analysis, Summarization, and Question Answering). Each category encompasses specialized variants depending on data format, user needs, and domain constraints.

Domain Taxonomy

FLaME Domain Taxonomy

Domain Taxonomy: Holistic taxonomy for FLaME. Unlike previous FinNLP benchmarks that were tied to specific tasks with single metrics, FLaME takes a comprehensive approach by mapping the full space of tasks, scenarios, and metrics across multiple dimensions for complete analysis.

Abstract

Language Models (LMs) have demonstrated impressive capabilities on core Natural Language Processing (NLP) tasks. Yet their effectiveness on highly specialized, knowledge-intensive tasks in finance remains difficult to assess because of major gaps in the methodologies of existing evaluation frameworks, which have led to the erroneous belief that LMs perform far worse on common Finance NLP (FinNLP) tasks than they actually do.

To address this issue, we present FLaME (Financial Language Model Evaluation), the first comprehensive benchmark suite for evaluating language models on financial NLP tasks. FLaME includes a standardized evaluation framework, carefully selected datasets across six core financial tasks, and extensive testing across both open and proprietary LMs. Our benchmark assesses models on text classification, sentiment analysis, information retrieval, causal analysis, summarization, and question answering within the financial domain.

Our results reveal significant variations in model performance across different financial tasks and domains, with no single model emerging as universally superior. We also find that performance does not always correlate with model size, and that open-weight models can offer strong cost-performance efficiency compared to their proprietary counterparts. FLaME provides a foundation for building, testing, and advancing specialized NLP models for finance.

  • 6 FinNLP task categories
  • 19 datasets
  • 24 LLMs evaluated

Key Findings

Our comprehensive evaluation reveals several important insights about language model performance on financial tasks:

No Universal Winner

No single language model performs best across all financial NLP tasks

Domain Variations

Performance varies significantly based on domain and task structure

Cost Efficiency

Open-weight and mid-scale models show strong cost/performance efficiency

Numeric Challenges

Numeric reasoning tasks (FNXL, ConvFinQA) remain challenging for all models

Size Doesn't Matter

Model size does not consistently correlate with performance on financial tasks

Top Performers

DeepSeek R1, OpenAI o1-mini, and Claude 3.5 Sonnet demonstrate strong capabilities

Methodology

Our methodological framework consists of four key components designed to provide comprehensive evaluation of language models on financial tasks.

  1. Taxonomy: define task categories
  2. Datasets: select and prepare data
  3. Models: benchmark LMs
  4. Evaluation: measure performance

Taxonomy

We developed a scenario-based taxonomy organizing financial NLP tasks along three dimensions (a brief code sketch follows this list):

  • Tasks: Six core tasks - text classification, sentiment analysis, information retrieval, causal analysis, text summarization, and question answering
  • Domains: Categorized by what (type of data), who (data source), where (origination), when (time period), how (generation method), and why (purpose)
  • Languages: Currently focusing on English with identified need for multilingual expansion
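
To make the taxonomy concrete, the sketch below encodes one possible representation of these three dimensions as plain Python data. The class, field names, and example entries are illustrative assumptions, not FLaME's actual schema.

```python
# Illustrative encoding of the scenario taxonomy (tasks x domains x languages).
# Class, field names, and example values are hypothetical, not FLaME's schema.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Scenario:
    task: str                     # one of the six core tasks
    dataset: str                  # dataset that instantiates the task
    domain: dict = field(default_factory=dict)  # what / who / where / when / how / why
    language: str = "en"          # currently English-only

scenarios = [
    Scenario(
        task="question_answering",
        dataset="FinQA",
        domain={"what": "earnings reports", "who": "financial analysts",
                "how": "expert-annotated", "why": "numerical reasoning"},
    ),
    Scenario(task="sentiment_analysis", dataset="Financial Phrase Bank"),
]

# Group scenarios by task to inspect coverage across the taxonomy.
by_task = defaultdict(list)
for s in scenarios:
    by_task[s.task].append(s.dataset)
print(dict(by_task))
```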

Datasets

We carefully selected datasets based on:

  • Domain relevance: Majority of content directly related to finance
  • Licensing: Fair usage licensing and proper attribution
  • Quality: Transparent sourcing with minimal risk of label corruption
  • Complexity: Exercises real financial knowledge, not trivial tasks

Models

We evaluated diverse language models across two categories:

Proprietary
  • GPT-4o & o1-mini
  • Gemini-1.5 Pro
  • Claude 3.5 Sonnet
  • Claude 3 Haiku
  • Cohere Command (R 7B, R+)
  • Jamba 1.5 Mini & Large
Open-weight
  • Llama-3 (8B, 70B)
  • DeepSeek V3 & R1
  • DBRX Instruct
  • Qwen-2 (72B) & QwQ-32B
  • Mistral (7B, 8x7B, 8x22B)
  • Gemma-2 (9B, 27B)
  • WizardLM-2 (8x22B)

Evaluation Process

Our two-stage evaluation approach includes:

  1. Generation: Language model generates responses to task-specific inputs
  2. Extraction: Separate process identifies relevant output using structured pattern matching

Pipeline stages (a code sketch of this flow follows the list):

  • Configuration of tasks, datasets, and parameters
  • Model interaction via local instantiation or API
  • Post-processing and structured output extraction
  • Task-specific metric computation and logging
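
A minimal sketch of this two-stage flow is shown below for a single classification-style task, assuming an OpenAI-compatible chat client; the prompt, label set, extraction regex, and function names are our own illustrations rather than FLaME's actual code.

```python
# Sketch of the two-stage generation + extraction flow described above.
# The client call, prompt, label set, and regex are illustrative assumptions.
import re
from sklearn.metrics import f1_score

LABELS = ("positive", "negative", "neutral")   # e.g., a sentiment task

def generate(client, model: str, text: str) -> str:
    """Stage 1: ask the LM for a free-form, zero-shot answer."""
    prompt = f"Classify the sentiment of this financial sentence as one of {LABELS}:\n{text}"
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def extract(raw: str) -> str:
    """Stage 2: pull a valid label out of the raw completion with pattern matching."""
    match = re.search(r"\b(positive|negative|neutral)\b", raw.lower())
    return match.group(1) if match else "neutral"   # fall back to a default label

def evaluate(client, model: str, examples: list[tuple[str, str]]) -> float:
    """Run both stages over (text, gold_label) pairs and report weighted F1."""
    preds = [extract(generate(client, model, text)) for text, _ in examples]
    golds = [gold for _, gold in examples]
    return f1_score(golds, preds, average="weighted")
```

Separating extraction from generation keeps metric computation deterministic even when models answer verbosely.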

Results

Our comprehensive evaluation reveals significant performance variations across different financial tasks and models.

The tables below report performance metrics for each task category.

Overall Performance Across All Tasks

Model | Information Retrieval: FiNER F1, FR F1, RD F1, FNXL F1, FE F1 | Sentiment Analysis: FiQA MSE, SQA F1, FPB F1 | Causal Analysis: CD F1, CC F1 | Text Classification: B77 F1, FB F1, FOMC F1, NC F1, HL Acc | Question Answering: CFQA Acc, FinQA Acc, TQA Acc | Summarization: ECTSum BERT-F1, EDTSum BERT-F1
(Abbreviations: FR = FinRED, RD = ReFiND, FE = FinEntity, SQA = SubjECTive-QA, FPB = Financial Phrase Bank, CD = Causal Detection, CC = Causal Classification, B77 = Banking77, FB = FinBench, NC = NumClaim, HL = Headlines, CFQA = ConvFinQA, TQA = TAT-QA.)
Llama 3 70B Instruct | .701 .332 .883 .020 .469 | .123 .535 .902 | .142 .192 | .645 .309 .652 .386 .811 | .709 .809 .772 | .754 .817
Llama 3 8B Instruct | .565 .289 .705 .003 .350 | .161 .600 .698 | .049 .234 | .512 .659 .497 .511 .763 | .268 .767 .706 | .757 .811
DBRX Instruct | .489 .304 .778 .009 .006 | .160 .436 .499 | .087 .231 | .574 .483 .193 .319 .746 | .252 .738 .633 | .729 .806
DeepSeek LLM (67B) | .745 .334 .879 .007 .416 | .118 .462 .811 | .025 .193 | .578 .492 .407 .151 .778 | .174 .742 .355 | .681 .807
Gemma 2 27B | .761 .356 .902 .006 .298 | .100 .515 .884 | .133 .242 | .621 .538 .620 .408 .808 | .268 .768 .734 | .723 .814
Gemma 2 9B | .651 .331 .892 .005 .367 | .189 .491 .940 | .105 .207 | .609 .541 .519 .365 .856 | .292 .779 .750 | .585 .817
Mistral (7B) Instruct v0.3 | .526 .276 .771 .004 .368 | .135 .522 .841 | .052 .227 | .528 .503 .542 .412 .779 | .199 .655 .553 | .750 .811
Mixtral-8x22B Instruct | .635 .367 .811 .009 .435 | .221 .510 .776 | .125 .308 | .602 .221 .465 .513 .835 | .285 .766 .666 | .758 .815
Mixtral-8x7B Instruct | .598 .282 .845 .009 .267 | .208 .498 .893 | .055 .229 | .547 .396 .603 .583 .805 | .315 .611 .501 | .747 .810
Qwen 2 Instruct (72B) | .748 .348 .854 .012 .483 | .205 .576 .901 | .190 .184 | .627 .495 .605 .639 .830 | .269 .819 .715 | .752 .811
WizardLM-2 8x22B | .744 .355 .852 .008 .226 | .129 .566 .779 | .114 .201 | .648 .500 .505 .272 .797 | .247 .796 .725 | .735 .808
DeepSeek-V3 | .790 .437 .934 .045 .549 | .150 .583 .814 | .198 .170 | .714 .487 .578 .675 .729 | .261 .840 .779 | .750 .815
DeepSeek R1 | .807 .393 .952 .057 .587 | .110 .499 .902 | .337 .202 | .763 .419 .670 .688 .769 | .853 .836 .858 | .759 .804
QwQ-32B-Preview | .685 .270 .656 .001 .005 | .141 .550 .815 | .131 .220 | .613 .784 .555 .020 .744 | .282 .793 .796 | .696 .817
Jamba 1.5 Mini | .552 .284 .844 .005 .132 | .119 .418 .765 | .043 .270 | .508 .898 .499 .151 .682 | .218 .666 .586 | .741 .816
Jamba 1.5 Large | .693 .341 .862 .005 .397 | .183 .582 .798 | .074 .176 | .628 .618 .550 .541 .782 | .225 .790 .660 | .734 .818
Claude 3.5 Sonnet | .799 .439 .891 .047 .655 | .101 .553 .944 | .196 .197 | .668 .634 .674 .692 .827 | .402 .844 .700 | .767 .813
Claude 3 Haiku | .711 .285 .883 .015 .494 | .167 .463 .908 | .081 .200 | .622 .022 .631 .558 .781 | .421 .803 .733 | .646 .808
Cohere Command R 7B | .748 .194 .845 .018 .441 | .164 .532 .840 | .057 .255 | .516 .762 .459 .068 .770 | .212 .709 .716 | .750 .815
Cohere Command R + | .756 .333 .922 .021 .452 | .106 .533 .699 | .080 .238 | .651 .684 .393 .118 .812 | .259 .776 .698 | .751 .810
Google Gemini 1.5 Pro | .712 .374 .944 .019 .393 | .144 .593 .885 | .196 .217 | .418 .336 .579 .525 .837 | .280 .829 .763 | .777 .817
OpenAI gpt-4o | .766 .399 .942 .037 .523 | .184 .541 .928 | .130 .222 | .710 .524 .664 .750 .824 | .749 .836 .754 | .773 .816
OpenAI o1-mini | .761 .403 .876 .010 .662 | .120 .542 .917 | .289 .209 | .670 .612 .635 .720 .769 | .840 .799 .698 | .763 .816


Causal Analysis Task Results

Model | Causal Detection (CD): Accuracy, Precision, Recall, F1 | Causal Classification (CC): Precision, Recall, F1, Accuracy
Llama 3 70B Instruct 0.148 0.429 0.148 0.142 0.241 0.329 0.192 0.198
Llama 3 8B Instruct 0.097 0.341 0.097 0.049 0.232 0.241 0.234 0.380
DBRX Instruct 0.078 0.521 0.078 0.087 0.276 0.313 0.231 0.235
DeepSeek LLM (67B) 0.026 0.214 0.026 0.025 0.141 0.328 0.193 0.221
Gemma 2 27B 0.115 0.510 0.115 0.133 0.309 0.310 0.242 0.262
Gemma 2 9B 0.115 0.394 0.115 0.105 0.275 0.294 0.207 0.258
Mistral (7B) Instruct v0.3 0.078 0.455 0.078 0.052 0.339 0.361 0.227 0.258
Mixtral-8x22B Instruct 0.131 0.486 0.131 0.125 0.344 0.310 0.308 0.318
Mixtral-8x7B Instruct 0.088 0.510 0.088 0.055 0.308 0.314 0.229 0.273
Qwen 2 Instruct (72B) 0.139 0.489 0.139 0.190 0.208 0.330 0.184 0.188
WizardLM-2 8x22B 0.076 0.453 0.076 0.114 0.263 0.347 0.201 0.237
DeepSeek-V3 0.164 0.528 0.164 0.198 0.194 0.327 0.170 0.248
DeepSeek R1 0.245 0.643 0.245 0.337 0.385 0.318 0.202 0.221
QwQ-32B-Preview 0.110 0.473 0.110 0.131 0.193 0.262 0.220 0.465
Jamba 1.5 Mini 0.050 0.280 0.050 0.043 0.323 0.283 0.270 0.295
Jamba 1.5 Large 0.076 0.517 0.076 0.074 0.268 0.248 0.176 0.200
Claude 3.5 Sonnet 0.154 0.564 0.154 0.196 0.259 0.336 0.197 0.235
Claude 3 Haiku 0.082 0.388 0.082 0.081 0.369 0.347 0.200 0.203
Cohere Command R 7B 0.089 0.363 0.089 0.057 0.379 0.356 0.255 0.275
Cohere Command R + 0.090 0.453 0.090 0.080 0.353 0.336 0.238 0.265
Google Gemini 1.5 Pro 0.165 0.514 0.165 0.196 0.265 0.357 0.217 0.258
OpenAI gpt-4o 0.082 0.576 0.082 0.130 0.254 0.327 0.222 0.235
OpenAI o1-mini 0.206 0.648 0.206 0.289 0.325 0.316 0.209 0.233


Information Retrieval Task Results

Model | FiNER-ORD: Precision, Recall, F1, Accuracy | FinRED: Accuracy, Precision, Recall, F1 | ReFiND: Accuracy, Precision, Recall, F1 | FNXL: Precision, Recall, F1, Accuracy | FinEntity: Precision, Recall, Accuracy, F1
Llama 3 70B Instruct 0.715 0.693 0.701 0.911 0.314 0.454 0.314 0.332 0.879 0.904 0.879 0.883 0.015 0.030 0.020 0.010 0.474 0.485 0.485 0.469
Llama 3 8B Instruct 0.581 0.558 0.565 0.854 0.296 0.357 0.296 0.289 0.723 0.755 0.723 0.705 0.003 0.004 0.003 0.002 0.301 0.478 0.478 0.350
DBRX Instruct 0.516 0.476 0.489 0.802 0.329 0.371 0.329 0.304 0.766 0.825 0.766 0.778 0.008 0.011 0.009 0.005 0.004 0.014 0.014 0.006
DeepSeek LLM (67B) 0.752 0.742 0.745 0.917 0.344 0.403 0.344 0.334 0.874 0.890 0.874 0.879 0.005 0.009 0.007 0.003 0.456 0.405 0.405 0.416
Gemma 2 27B 0.772 0.754 0.761 0.923 0.352 0.437 0.352 0.356 0.897 0.914 0.897 0.902 0.005 0.008 0.006 0.003 0.320 0.295 0.295 0.298
Gemma 2 9B 0.665 0.643 0.651 0.886 0.336 0.373 0.336 0.331 0.885 0.902 0.885 0.892 0.004 0.008 0.005 0.003 0.348 0.419 0.419 0.367
Mistral (7B) Instruct 0.540 0.522 0.526 0.806 0.278 0.383 0.278 0.276 0.767 0.817 0.767 0.771 0.004 0.006 0.004 0.002 0.337 0.477 0.477 0.368
Mixtral-8x22B Instruct 0.653 0.625 0.635 0.870 0.381 0.414 0.381 0.367 0.807 0.847 0.807 0.811 0.010 0.008 0.009 0.005 0.428 0.481 0.481 0.435
Mixtral-8x7B Instruct 0.613 0.591 0.598 0.875 0.291 0.376 0.291 0.282 0.840 0.863 0.840 0.845 0.007 0.012 0.009 0.005 0.251 0.324 0.324 0.267
Qwen 2 Instruct (72B) 0.766 0.742 0.748 0.899 0.365 0.407 0.365 0.348 0.850 0.881 0.850 0.854 0.010 0.016 0.012 0.006 0.468 0.530 0.530 0.483
WizardLM-2 8x22B 0.755 0.741 0.744 0.920 0.362 0.397 0.362 0.355 0.846 0.874 0.846 0.852 0.008 0.009 0.008 0.004 0.222 0.247 0.247 0.226
DeepSeek-V3 0.798 0.787 0.790 0.945 0.450 0.463 0.450 0.437 0.927 0.943 0.927 0.934 0.034 0.067 0.045 0.023 0.563 0.544 0.544 0.549
DeepSeek R1 0.813 0.805 0.807 0.944 0.412 0.424 0.412 0.393 0.946 0.960 0.946 0.952 0.044 0.082 0.057 0.029 0.600 0.586 0.586 0.587
QwQ-32B-Preview 0.695 0.681 0.685 0.907 0.278 0.396 0.278 0.270 0.680 0.770 0.680 0.656 0.001 0.001 0.001 0.000 0.005 0.005 0.005 0.005
Jamba 1.5 Mini 0.564 0.556 0.552 0.818 0.308 0.450 0.308 0.284 0.830 0.864 0.830 0.844 0.004 0.006 0.005 0.003 0.119 0.182 0.182 0.132
Jamba 1.5 Large 0.707 0.687 0.693 0.883 0.341 0.452 0.341 0.341 0.856 0.890 0.856 0.862 0.004 0.005 0.005 0.002 0.403 0.414 0.414 0.397
Claude 3.5 Sonnet 0.811 0.794 0.799 0.922 0.455 0.465 0.455 0.439 0.873 0.927 0.873 0.891 0.034 0.080 0.047 0.024 0.658 0.668 0.668 0.655
Claude 3 Haiku 0.732 0.700 0.711 0.895 0.294 0.330 0.294 0.285 0.879 0.917 0.879 0.883 0.011 0.022 0.015 0.008 0.498 0.517 0.517 0.494
Cohere Command R + 0.769 0.750 0.756 0.902 0.353 0.405 0.353 0.333 0.917 0.930 0.917 0.922 0.016 0.032 0.021 0.011 0.462 0.459 0.459 0.452
Google Gemini 1.5 Pro 0.728 0.705 0.712 0.891 0.373 0.436 0.373 0.374 0.934 0.955 0.934 0.944 0.014 0.028 0.019 0.010 0.399 0.400 0.400 0.393
OpenAI gpt-4o 0.778 0.760 0.766 0.911 0.402 0.445 0.402 0.399 0.931 0.955 0.931 0.942 0.027 0.056 0.037 0.019 0.537 0.517 0.517 0.523
OpenAI o1-mini 0.772 0.755 0.761 0.922 0.407 0.444 0.407 0.403 0.867 0.900 0.867 0.876 0.007 0.015 0.010 0.005 0.661 0.681 0.681 0.662


Question Answering Task Results

Model | FinQA (Accuracy) | ConvFinQA (Accuracy) | TAT-QA (Accuracy)
Llama 3 70B Instruct 0.809 0.709 0.772
Llama 3 8B Instruct 0.767 0.268 0.706
DBRX Instruct 0.738 0.252 0.633
DeepSeek LLM (67B) 0.742 0.174 0.355
Gemma 2 27B 0.768 0.268 0.734
Gemma 2 9B 0.779 0.292 0.750
Mistral (7B) Instruct v0.3 0.655 0.199 0.553
Mixtral-8x22B Instruct 0.766 0.285 0.666
Mixtral-8x7B Instruct 0.611 0.315 0.501
Qwen 2 Instruct (72B) 0.819 0.269 0.715
WizardLM-2 8x22B 0.796 0.247 0.725
DeepSeek-V3 0.840 0.261 0.779
DeepSeek R1 0.836 0.853 0.858
QwQ-32B-Preview 0.793 0.282 0.796
Jamba 1.5 Mini 0.666 0.218 0.586
Jamba 1.5 Large 0.790 0.225 0.660
Claude 3.5 Sonnet 0.844 0.402 0.700
Claude 3 Haiku 0.803 0.421 0.733
Cohere Command R 7B 0.709 0.212 0.716
Cohere Command R + 0.776 0.259 0.698
Google Gemini 1.5 Pro 0.829 0.280 0.763
OpenAI gpt-4o 0.836 0.749 0.754
OpenAI o1-mini 0.799 0.840 0.698


Sentiment Analysis Task Results

Model | FiQA Task 1: MSE, MAE, r² Score | FinEntity: Precision, Recall, Accuracy, F1 | SubjECTive-QA: Precision, Recall, F1, Accuracy | Financial Phrase Bank (FPB): Accuracy, Precision, Recall, F1
Llama 3 70B Instruct 0.123 0.290 0.272 0.474 0.485 0.485 0.469 0.652 0.573 0.535 0.573 0.901 0.904 0.901 0.902
Llama 3 8B Instruct 0.161 0.344 0.045 0.301 0.478 0.478 0.350 0.635 0.625 0.600 0.625 0.738 0.801 0.738 0.698
DBRX Instruct 0.160 0.321 0.052 0.004 0.014 0.014 0.006 0.654 0.541 0.436 0.541 0.524 0.727 0.524 0.499
DeepSeek LLM (67B) 0.118 0.278 0.302 0.456 0.405 0.405 0.416 0.676 0.544 0.462 0.544 0.815 0.867 0.815 0.811
Gemma 2 27B 0.100 0.266 0.406 0.320 0.295 0.295 0.298 0.562 0.524 0.515 0.524 0.890 0.896 0.890 0.884
Gemma 2 9B 0.189 0.352 -0.120 0.348 0.419 0.419 0.367 0.570 0.499 0.491 0.499 0.940 0.941 0.940 0.940
Mistral (7B) Instruct v0.3 0.135 0.278 0.200 0.337 0.477 0.477 0.368 0.607 0.542 0.522 0.542 0.847 0.854 0.847 0.841
Mixtral-8x22B Instruct 0.221 0.364 -0.310 0.428 0.481 0.481 0.435 0.614 0.538 0.510 0.538 0.768 0.845 0.768 0.776
Mixtral-8x7B Instruct 0.208 0.307 -0.229 0.251 0.324 0.324 0.267 0.611 0.518 0.498 0.518 0.896 0.898 0.896 0.893
Qwen 2 Instruct (72B) 0.205 0.409 -0.212 0.468 0.530 0.530 0.483 0.644 0.601 0.576 0.601 0.904 0.908 0.904 0.901
WizardLM-2 8x22B 0.129 0.283 0.239 0.222 0.247 0.247 0.226 0.611 0.570 0.566 0.570 0.765 0.853 0.765 0.779
DeepSeek-V3 0.150 0.311 0.111 0.563 0.544 0.544 0.549 0.640 0.572 0.583 0.572 0.828 0.851 0.828 0.814
DeepSeek R1 0.110 0.289 0.348 0.600 0.586 0.586 0.587 0.644 0.489 0.499 0.489 0.904 0.907 0.904 0.902
QwQ-32B-Preview 0.141 0.290 0.165 0.005 0.005 0.005 0.005 0.629 0.534 0.550 0.534 0.812 0.827 0.812 0.815
Jamba 1.5 Mini 0.119 0.282 0.293 0.119 0.182 0.182 0.132 0.380 0.525 0.418 0.525 0.784 0.814 0.784 0.765
Jamba 1.5 Large 0.183 0.363 -0.085 0.403 0.414 0.414 0.397 0.635 0.573 0.582 0.573 0.824 0.850 0.824 0.798
Claude 3.5 Sonnet 0.101 0.268 0.402 0.658 0.668 0.668 0.655 0.634 0.585 0.553 0.585 0.944 0.945 0.944 0.944
Claude 3 Haiku 0.167 0.349 0.008 0.498 0.517 0.517 0.494 0.619 0.538 0.463 0.538 0.907 0.913 0.907 0.908
Cohere Command R 7B 0.164 0.319 0.028 0.457 0.446 0.446 0.441 0.609 0.547 0.532 0.547 0.835 0.861 0.835 0.840
Cohere Command R + 0.106 0.274 0.373 0.462 0.459 0.459 0.452 0.608 0.547 0.533 0.547 0.741 0.806 0.741 0.699
Google Gemini 1.5 Pro 0.144 0.329 0.149 0.399 0.400 0.400 0.393 0.642 0.587 0.593 0.587 0.890 0.895 0.890 0.885
OpenAI gpt-4o 0.184 0.317 -0.089 0.537 0.517 0.517 0.523 0.639 0.515 0.541 0.515 0.929 0.931 0.929 0.928
OpenAI o1-mini 0.120 0.295 0.289 0.661 0.681 0.681 0.662 0.660 0.515 0.542 0.515 0.918 0.917 0.918 0.917


Text Classification Task Results

Model | Banking77: Accuracy, Precision, Recall, F1 | FinBench: Accuracy, Precision, Recall, F1 | FOMC: Accuracy, Precision, Recall, F1 | NumClaim: Precision, Recall, Accuracy, F1 | Headlines: Accuracy
Llama 3 70B Instruct 0.660 0.748 0.660 0.645 0.222 0.826 0.222 0.309 0.661 0.662 0.661 0.652 0.430 0.240 0.980 0.386 0.811
Llama 3 8B Instruct 0.534 0.672 0.534 0.512 0.543 0.857 0.543 0.659 0.565 0.618 0.565 0.497 0.801 0.463 0.571 0.511 0.763
DBRX Instruct 0.578 0.706 0.578 0.574 0.359 0.851 0.359 0.483 0.285 0.572 0.285 0.193 0.222 0.190 1.000 0.319 0.746
DeepSeek LLM (67B) 0.596 0.711 0.596 0.578 0.369 0.856 0.369 0.492 0.532 0.678 0.532 0.407 0.832 1.000 0.082 0.151 0.778
Gemma 2 27B 0.639 0.730 0.639 0.621 0.410 0.849 0.410 0.538 0.651 0.704 0.651 0.620 0.471 0.257 1.000 0.408 0.808
Gemma 2 9B 0.630 0.710 0.630 0.609 0.412 0.848 0.412 0.541 0.595 0.694 0.595 0.519 0.371 0.224 0.990 0.365 0.856
Mistral (7B) Instruct v0.3 0.547 0.677 0.547 0.528 0.375 0.839 0.375 0.503 0.587 0.598 0.587 0.542 0.521 0.266 0.918 0.412 0.779
Mixtral-8x22B Instruct 0.622 0.718 0.622 0.602 0.166 0.811 0.166 0.221 0.562 0.709 0.562 0.465 0.732 0.384 0.775 0.513 0.835
Mixtral-8x7B Instruct 0.567 0.693 0.567 0.547 0.285 0.838 0.285 0.396 0.623 0.636 0.623 0.603 0.765 0.431 0.898 0.583 0.805
Qwen 2 Instruct (72B) 0.644 0.730 0.644 0.627 0.370 0.848 0.370 0.495 0.623 0.639 0.623 0.605 0.821 0.506 0.867 0.639 0.830
WizardLM-2 8x22B 0.664 0.737 0.664 0.648 0.373 0.842 0.373 0.500 0.583 0.710 0.583 0.505 0.831 0.630 0.173 0.272 0.797
DeepSeek-V3 0.722 0.774 0.722 0.714 0.362 0.845 0.362 0.487 0.625 0.712 0.625 0.578 0.860 0.586 0.796 0.675 0.729
DeepSeek R1 0.772 0.789 0.772 0.763 0.306 0.846 0.306 0.419 0.679 0.682 0.679 0.670 0.851 0.557 0.898 0.688 0.769
QwQ-32B-Preview 0.577 0.747 0.577 0.613 0.716 0.871 0.716 0.784 0.591 0.630 0.591 0.555 0.819 1.000 0.010 0.020 0.744
Jamba 1.5 Mini 0.528 0.630 0.528 0.508 0.913 0.883 0.913 0.898 0.572 0.678 0.572 0.499 0.812 0.429 0.092 0.151 0.682
Jamba 1.5 Large 0.642 0.746 0.642 0.628 0.494 0.851 0.494 0.618 0.597 0.650 0.597 0.550 0.855 0.639 0.469 0.541 0.782
Claude 3.5 Sonnet 0.682 0.755 0.682 0.668 0.513 0.854 0.513 0.634 0.675 0.677 0.675 0.674 0.879 0.646 0.745 0.692 0.827
Claude 3 Haiku 0.639 0.735 0.639 0.622 0.067 0.674 0.067 0.022 0.633 0.634 0.633 0.631 0.838 0.556 0.561 0.558 0.781
Cohere Command R 7B 0.530 0.650 0.530 0.516 0.682 0.868 0.682 0.762 0.536 0.505 0.536 0.459 0.797 0.210 0.041 0.068 0.770
Cohere Command R + 0.660 0.747 0.660 0.651 0.575 0.859 0.575 0.684 0.526 0.655 0.526 0.393 0.804 0.333 0.071 0.118 0.812
Google Gemini 1.5 Pro 0.483 0.487 0.483 0.418 0.240 0.823 0.240 0.336 0.619 0.667 0.619 0.579 0.700 0.369 0.908 0.525 0.837
OpenAI gpt-4o 0.704 0.792 0.704 0.710 0.396 0.846 0.396 0.524 0.681 0.719 0.681 0.664 0.896 0.667 0.857 0.750 0.824
OpenAI o1-mini 0.681 0.760 0.681 0.670 0.487 0.851 0.487 0.612 0.651 0.670 0.651 0.635 0.888 0.664 0.786 0.720 0.769


Text Summarization Task Results

Model | ECTSum: BERTScore Precision, Recall, F1 | EDTSum: BERTScore Precision, Recall, F1
Llama 3 70B Instruct 0.715 0.801 0.754 0.793 0.844 0.817
Llama 3 8B Instruct 0.724 0.796 0.757 0.785 0.841 0.811
DBRX Instruct 0.680 0.786 0.729 0.774 0.843 0.806
DeepSeek LLM (67B) 0.692 0.678 0.681 0.779 0.840 0.807
Gemma 2 27B 0.680 0.777 0.723 0.801 0.829 0.814
Gemma 2 9B 0.651 0.531 0.585 0.803 0.833 0.817
Mistral (7B) Instruct v0.3 0.702 0.806 0.750 0.783 0.842 0.811
Mixtral-8x22B Instruct 0.713 0.812 0.758 0.790 0.843 0.815
Mixtral-8x7B Instruct 0.727 0.773 0.747 0.785 0.839 0.810
Qwen 2 Instruct (72B) 0.709 0.804 0.752 0.781 0.846 0.811
WizardLM-2 8x22B 0.677 0.806 0.735 0.774 0.847 0.808
DeepSeek-V3 0.703 0.806 0.750 0.791 0.842 0.815
DeepSeek R1 0.724 0.800 0.759 0.770 0.843 0.804
QwQ-32B-Preview 0.653 0.751 0.696 0.797 0.841 0.817
Jamba 1.5 Mini 0.692 0.798 0.741 0.798 0.838 0.816
Jamba 1.5 Large 0.679 0.800 0.734 0.799 0.841 0.818
Claude 3.5 Sonnet 0.737 0.802 0.767 0.786 0.843 0.813
Claude 3 Haiku 0.683 0.617 0.646 0.778 0.844 0.808
Cohere Command R 7B 0.724 0.781 0.750 0.790 0.844 0.815
Cohere Command R + 0.724 0.782 0.751 0.789 0.834 0.810
Google Gemini 1.5 Pro 0.757 0.800 0.777 0.800 0.836 0.817
OpenAI gpt-4o 0.755 0.793 0.773 0.795 0.840 0.816
OpenAI o1-mini 0.731 0.801 0.763 0.795 0.840 0.816


Model Cost Analysis

Model | Cost (USD) per dataset: FOMC, FPB, FinQA, FiQA-1, FiQA-2, HL, FB, FR, RD, EDTSum, B77, CD, CC, ECTSum, FE, FiNER, FNXL, NC, TQA, CFQA, SQA | Total
Llama 3 70B Instruct 0.10 0.11 1.14 0.06 0.72 1.00 0.40 0.38 1.34 1.94 1.64 0.07 0.05 1.56 0.12 0.33 0.25 0.09 1.11 2.96 1.17 16.54
Llama 3 8B Instruct 0.02 0.03 0.25 0.01 0.16 0.22 0.09 0.09 0.32 0.43 0.37 0.02 0.01 0.36 0.03 0.08 0.06 0.02 0.26 0.69 0.26 3.79
DBRX Instruct 0.14 0.17 1.50 0.06 0.95 1.29 0.56 0.57 2.05 2.93 2.14 0.11 0.10 2.45 0.17 0.47 0.34 0.13 1.47 4.19 1.55 23.35
DeepSeek LLM (67B) 0.10 0.12 1.25 0.05 0.76 0.87 0.42 0.37 1.45 1.85 2.03 0.08 0.05 0.83 0.13 0.34 0.24 0.09 1.20 3.17 1.17 16.57
Gemma 2 27B 0.08 0.09 1.05 0.05 0.66 0.91 0.30 0.34 1.37 1.75 1.77 0.07 0.04 1.46 0.11 0.30 0.21 0.08 1.00 2.84 1.04 15.50
Gemma 2 9B 0.03 0.03 0.40 0.02 0.24 0.33 0.12 0.14 0.51 0.66 0.66 0.03 0.02 0.00 0.04 0.11 0.08 0.03 0.37 1.08 0.39 5.29
Mistral (7B) Instruct v0.3 0.03 0.03 0.28 0.01 0.18 0.24 0.10 0.09 0.36 0.57 0.48 0.02 0.01 0.45 0.03 0.08 0.06 0.02 0.27 0.78 0.26 4.36
Mixtral-8x22B Instruct 0.14 0.17 1.80 0.07 1.05 1.44 0.58 0.56 2.04 3.42 2.89 0.11 0.07 2.66 0.18 0.48 0.35 0.14 1.73 4.90 1.55 26.35
Mixtral-8x7B Instruct 0.08 0.09 0.88 0.04 0.53 0.70 0.30 0.30 1.07 1.72 1.50 0.06 0.05 1.30 0.09 0.24 0.20 0.07 0.87 2.55 0.78 13.41
Qwen 2 Instruct (72B) 0.10 0.12 1.29 0.05 0.74 0.96 0.43 0.43 1.44 2.36 1.61 0.08 0.05 1.80 0.12 0.34 0.24 0.10 1.18 3.41 1.17 18.02
WizardLM-2 8x22B 0.16 0.19 1.94 0.08 1.07 1.47 0.61 0.61 2.24 3.47 3.00 0.11 0.10 2.85 0.18 0.49 0.34 0.14 1.94 5.31 1.55 27.87
DeepSeek-V3 0.13 0.15 1.57 0.07 0.98 1.36 0.52 0.54 2.10 2.99 2.55 0.11 0.06 2.33 0.16 0.55 0.28 0.12 1.56 4.28 1.62 24.03
DeepSeek R1 1.99 2.10 14.18 1.48 17.82 20.11 6.63 12.65 31.00 21.15 23.28 3.75 1.06 15.02 7.31 8.34 11.21 1.88 13.72 39.42 9.07 263.16
QwQ-32B-Preview 0.15 0.18 2.38 0.08 0.93 1.37 0.60 0.68 2.18 3.12 2.36 0.11 0.07 2.76 0.14 0.65 0.54 0.14 2.61 7.83 1.55 30.43
Jamba 1.5 Mini 0.02 0.03 0.30 0.02 0.23 0.22 0.10 0.08 0.44 0.55 0.51 0.02 0.01 0.49 0.05 0.10 0.07 0.02 0.25 0.72 0.26 4.47
Jamba 1.5 Large 0.31 0.36 4.42 0.30 3.47 4.81 1.78 0.94 4.97 5.80 5.51 0.35 0.13 7.07 0.56 1.67 0.77 0.30 2.87 7.45 2.59 56.42
Claude 3.5 Sonnet 0.62 0.72 6.98 0.55 6.50 8.81 3.44 3.21 12.32 9.50 11.11 0.61 0.22 7.09 0.90 3.01 1.79 0.57 9.18 16.86 3.89 107.87
Claude 3 Haiku 0.06 0.07 0.56 0.05 0.54 0.73 0.28 0.25 0.82 0.81 0.90 0.05 0.02 0.21 0.06 0.23 0.14 0.05 0.64 1.28 0.32 8.07
Cohere Command R 7B 0.01 0.01 0.08 0.00 0.07 0.09 0.04 0.03 0.11 0.11 0.10 0.01 0.00 0.08 0.01 0.03 0.01 0.01 0.08 0.19 0.05 1.09
Cohere Command R + 0.41 0.45 5.40 0.35 4.41 4.00 2.30 0.93 3.87 7.03 7.21 0.43 0.12 5.55 0.48 1.69 0.97 0.42 4.59 10.09 3.24 63.95
Google Gemini 1.5 Pro 0.23 0.21 2.26 0.18 2.20 2.78 1.02 0.49 2.27 3.45 2.70 0.21 0.07 2.65 0.25 0.87 0.58 0.21 2.13 5.78 1.62 32.16
OpenAI gpt-4o 0.35 0.41 4.99 0.32 4.45 5.33 1.55 1.21 5.77 6.57 5.00 0.35 0.14 4.85 0.44 1.94 0.96 0.34 4.95 10.36 3.24 63.52
OpenAI o1-mini 0.90 0.90 5.25 0.73 9.70 12.20 3.27 4.89 13.60 1.29 9.29 2.56 0.75 3.18 2.92 1.91 6.39 0.92 6.97 15.71 1.42 104.73
Cost tiers (by total benchmark cost): Low ($0-$10), Medium ($10-$35), High ($35-$70), Very High ($70+).

Note: All costs are in USD and represent the expense of running each model on each dataset. For cost-efficiency analysis, compare these costs with the corresponding performance metrics in the tables above.
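
One simple way to read this table against the performance tables is score per dollar. The snippet below computes it for FinQA, using accuracies and per-dataset costs taken from the tables above; the helper and the model subset are ours, purely for illustration.

```python
# Score-per-dollar on FinQA, using values from the results and cost tables above.
# The chosen model subset and the "accuracy per dollar" view are illustrative.
finqa = {
    # model: (FinQA accuracy, FinQA cost in USD)
    "Llama 3 8B Instruct": (0.767, 0.25),
    "DeepSeek-V3":         (0.840, 1.57),
    "DeepSeek R1":         (0.836, 14.18),
    "Claude 3.5 Sonnet":   (0.844, 6.98),
    "OpenAI gpt-4o":       (0.836, 4.99),
}

ranked = sorted(finqa.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for model, (acc, cost) in ranked:
    print(f"{model:22s} acc={acc:.3f} cost=${cost:5.2f} acc/$={acc / cost:6.2f}")
```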

Model Performance Highlights

πŸ† Top Performing Models

DeepSeek R1 Logo

DeepSeek R1

OpenAI Logo

OpenAI o1-mini

Claude 3.5 Sonnet Logo

Claude 3.5 Sonnet


Key Insights from Model Analysis

πŸ† No single dominant model: DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.

⚖️ Inconsistent scaling: Larger models don't always outperform smaller ones; DeepSeek R1 trails in summarization despite excelling in QA.

🛠️ Open-weight models: Models such as DeepSeek-V3 and Llama 3 70B offer competitive performance while being cost-effective.

💰 Cost-performance disparities: Running the full benchmark cost roughly $260 with DeepSeek R1, around $105 with Claude 3.5 Sonnet or o1-mini, and under $4 with Llama 3 8B.

📈 Numeric reasoning challenges: Even the best models struggle with financial numeric reasoning; on extreme numeric labeling (FNXL), F1 scores stay at or below 0.06.

🔒 Step-by-step deductions: Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.

Error Analysis & Key Findings

Common challenges and limitations identified in our evaluations:

Concerns regarding outdated models

Llama 2 13B Chat produces trivial or empty responses, possibly due to misalignment during fine-tuning.

Numeric Regression Issues

LMs struggle with precision and rounding in continuous-valued regressions (e.g., financial percentages). Post-hoc normalization is needed.
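
One way the needed post-hoc normalization can look is sketched below: parse the first number from the model's free text, handle percent signs and thousands separators, and round to the reference's precision before scoring. This is a generic remedy under our own assumptions, not FLaME's exact procedure.

```python
# Sketch of post-hoc numeric normalization for regression-style outputs.
# A generic remedy under illustrative assumptions, not FLaME's exact procedure.
import re

def normalize_number(raw: str, reference: float) -> float | None:
    """Extract the first number in a response, handle '%' and commas,
    and round to the reference's decimal precision before scoring."""
    match = re.search(r"-?\d[\d,]*\.?\d*", raw)
    if match is None:
        return None                          # nothing parsable in the response
    value = float(match.group(0).replace(",", ""))
    if "%" in raw and abs(reference) <= 1:
        value /= 100                         # "12.4%" vs. a 0.124-style reference
    decimals = len(str(reference).split(".")[1]) if "." in str(reference) else 0
    return round(value, decimals)

print(normalize_number("Revenue grew by 12.4% year over year.", 0.12))  # -> 0.12
```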

Data Contamination

Overlap between public financial datasets and pretraining corpora can inflate zero-shot performance, requiring time-split test sets.
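
A time-split test set needs nothing more than publication dates: keep only examples published after the model's training-data cutoff. The pandas sketch below assumes each example carries a date field; the column names and cutoff are illustrative.

```python
# Sketch of building a time-split test set to reduce contamination risk.
# Assumes each example has a publication date; column names are illustrative.
import pandas as pd

def time_split(df: pd.DataFrame, cutoff: str, date_col: str = "date") -> pd.DataFrame:
    """Keep only examples published after the model's training-data cutoff."""
    dates = pd.to_datetime(df[date_col])
    return df[dates > pd.Timestamp(cutoff)].reset_index(drop=True)

corpus = pd.DataFrame({
    "date": ["2021-03-01", "2024-06-30", "2024-11-15"],
    "text": ["older filing", "recent filing", "newest filing"],
})
held_out = time_split(corpus, cutoff="2024-01-01")
print(held_out)   # only the two post-cutoff rows remain
```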

Challenges in Causal Classification

Most models struggle with financial causal reasoning, requiring structured knowledge bases or explicit symbolic reasoning.

Language Drift

Qwen 2 72B exhibits unintended shifts to Chinese output in English summarization tasks, indicating strong pretraining priors.

Summarization Nuances

Models achieve high BERTScores (~80-82%) on extractive summarization but struggle with abstractive summaries, especially those involving finance-specific jargon.
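
The BERTScore numbers in the summarization table can be reproduced with the open-source bert-score package; a minimal sketch with placeholder candidate and reference texts follows (the texts are invented, the API call is the package's standard one).

```python
# Minimal BERTScore sketch using the open-source bert-score package
# (pip install bert-score). Candidate and reference texts are placeholders.
from bert_score import score

candidates = ["Q3 revenue rose 12% on strong trading volumes."]
references = ["The company reported a 12% year-over-year revenue increase in Q3."]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore  P={P.mean():.3f}  R={R.mean():.3f}  F1={F1.mean():.3f}")
```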

Prompt Design Limitations

Prompts tuned on Llama 3 8B may not generalize across models, leading to inconsistencies in label generation (e.g., minor syntactic variations).

Differences in QA Datasets

Models consistently score lower on ConvFinQA than on FinQA because of its multi-turn dialogue complexity.

Efficiency and Cost Considerations

Inference costs vary by up to 2× among similarly sized models, requiring a balance between performance and resource usage.

Contributions & Future Work

Our work introduces a standardized, large-scale, and holistic evaluation framework for financial language models.

Key Contributions

Standardized Evaluation Framework

We introduce an open-source, modular benchmarking suite for systematic LM evaluations on core financial NLP tasks.

Large-Scale Model Assessment

We benchmark 23 foundation LMs (open-weight and proprietary) across 20 financial tasks, revealing performance-cost trade-offs.

Holistic Dataset Taxonomy

We establish a structured dataset taxonomy, categorizing financial NLP tasks based on domain, data format, and linguistic complexity.

Living Benchmark & Open Collaboration

We introduce a continuously updated leaderboard, inviting researchers to contribute new datasets and evaluation results.

Error Analysis & Cost-Performance Insights

We analyze systematic model errors and quantify cost-performance trade-offs for informed deployment in real-world applications.

Open-Source Implementation

We release a fully open-source framework, enabling the research community to extend and refine financial LM evaluation methodologies.

Limitations

While our benchmark provides valuable insights, several limitations must be acknowledged:

❌ Data Contamination Risks

Benchmark testing data may overlap with model pretraining corpora, leading to artificially inflated performance. We actively work on novel datasets to mitigate these risks.

⚠️ Dataset Size & Diversity

Our dataset scope is limited, affecting model generalization across diverse financial domains and languages.

⚠️ Zero-Shot Focus

Due to budget constraints, our evaluations rely on zero-shot learning only, without fine-tuning or few-shot prompting.

⚠️ Limited Adaptation Strategies

We do not explore chain-of-thought reasoning or advanced prompting, though these techniques are known to improve model performance.

ℹ️ English Language Bias

The benchmark primarily focuses on English due to the availability of financial datasets, limiting insights into multilingual model performance.

ℹ️ Real-World Complex Tasks

Existing tasks do not fully capture the dynamic and evolving nature of financial markets, requiring ongoing dataset expansion.

Recognizing these limitations is essential for improving future financial NLP benchmarks. Our ongoing work aims to address these challenges through dataset refinement, broader task coverage, and multilingual support.

Future Work

To strengthen the robustness and adaptability of our framework, we advocate for open collaboration within the research community and propose the following future directions to expand its capabilities:

🌍 Multilingual Expansion

Extending benchmarks beyond English to include multilingual financial datasets and evaluations.

🧠 Few-Shot & Chain-of-Thought

Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).

⚙️ Domain-Adaptive Training

Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.

📊 Expanded Dataset Coverage

Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.

⚖️ Efficiency & Cost Benchmarking

Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.

📈 Advanced Evaluation Metrics

Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.

These improvements will enable more accurate and fair comparisons of financial language models, fostering greater transparency, reproducibility, and real-world applicability.

Framework & Resources

FLaME Framework

An open-source, modular benchmarking suite for evaluating financial language models.

Framework Features

🔧 Standardized Pipelines

Pre-built evaluation pipelines for key financial NLP tasks.

⚙️ Customizable Assessments

Supports easy model-to-model and dataset-to-dataset comparisons.

Reproducible Benchmarking

Ensures consistent evaluation metrics and transparent methodology.

🔗 Extensible Architecture

Easily integrates new tasks, datasets, and evaluation modules.

Getting Started

  1. Clone the repository:
     git clone https://github.com/gtfintechlab/FLaME.git

  2. Install dependencies:
     pip install -r requirements.txt

  3. Configure models: edit config.yaml with your API keys and model settings.

  4. Run evaluations:
     python -m flame.run --task all

For detailed documentation, visit our GitHub repository.

Datasets & Domains

📊 Numerical Reasoning & Question Answering

  • FinQA – Multi-step financial numerical reasoning.
  • ConvFinQA – Conversational numerical reasoning.
  • TAT-QA – Hybrid table-text question answering.

📝 Text Summarization

  • ECTSum – Earnings call transcript summarization.
  • EDTSum – Financial news summarization.

🔎 Information Retrieval

  • FiNER-ORD – Named entity recognition for financial documents.
  • FinEntity – Entity-based sentiment classification.
  • Financial Numeric Extreme Labeling (FNXL) – Automated numeral annotation in financial reports.
  • FinRED – Relation extraction in finance.
  • REFinD – Relation extraction from SEC filings.

😐 Sentiment Analysis

  • FiQA (Task 1) – Aspect-based sentiment analysis.
  • FiQA (Task 2) – Opinion-based financial QA.
  • Financial Phrase Bank (FPB) – Market sentiment classification.
  • SubjECTive-QA – Subjectivity detection in earnings call Q&A.

🏷️ Text Classification

  • Numerical Claim Detection – Fine-grained investor claim detection.
  • News Headline Classification – Market sentiment and trend detection.
  • FOMC Dataset – Hawkish-Dovish stance classification.
  • Banking77 – Fine-grained intent detection.
  • FinBench – Financial risk classification (default, fraud, churn).

🧠 Causal Analysis

  • FinCausal – Causal reasoning in financial news.
  • FinCausal-SC – Cause-effect span extraction.

Citation

If you use FLaME in your research, please cite our paper:

@article{flame2025,
  author    = {Glenn Matlin and Mika Okamoto and Huzaifa Pardawala and Yang Yang and Sudheer Chava},
  title     = {FLaME: Holistic Financial Language Model Evaluation},
  year      = {2025},
  month     = {February},
}