Finding the best LLMs for finance use cases
The growing complexity of financial language models (LLMs) necessitates evaluations that go beyond general NLP benchmarks. While traditional leaderboards focus on broader NLP tasks like translation or summarization, they often fall short of addressing the specific needs of the finance industry. Financial tasks, such as predicting stock movements, assessing credit risks, and extracting information from financial reports, present unique challenges that require models with specialized skills. This is why we decided to create the Open FinLLM Leaderboard.
The leaderboard provides a specialized evaluation framework tailored specifically to the financial sector. We hope it fills this critical gap by providing a transparent, one-stop framework that assesses model readiness for real-world use. The leaderboard is designed to highlight a model's financial skill by focusing on the tasks that matter most to finance professionals, such as information extraction from financial documents, market sentiment analysis, and forecasting financial trends.
- Comprehensive Financial Task Coverage: The leaderboard evaluates models only on tasks that are directly relevant to finance. These tasks include information extraction, sentiment analysis, credit risk scoring, and stock movement forecasting, which are crucial for real-world financial decision-making.
- Real-World Financial Relevance: The datasets used for the benchmarks represent real-world challenges faced by the finance industry. This ensures that models are assessed on their ability to handle complex financial data, making them suitable for industry applications.
- Focused Zero-Shot Evaluation: The leaderboard employs a zero-shot evaluation method, testing models on unseen financial tasks without any prior fine-tuning. This approach reveals a model's ability to generalize and perform well in financial contexts, such as predicting stock price movements or extracting entities from regulatory filings, without being explicitly trained on those tasks. A rough illustration of what such a zero-shot prompt can look like is sketched below.
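The exact prompt templates are defined by the leaderboard's task configurations; purely as an illustration, a zero-shot prompt for a financial sentiment task might look like the following (hypothetical wording, not the leaderboard's actual template):

```python
# Hypothetical zero-shot prompt for a financial sentiment task (illustration only;
# the leaderboard's actual templates live in its task configurations).
def build_sentiment_prompt(sentence: str) -> str:
    return (
        "Classify the sentiment of the following financial sentence as "
        "positive, negative, or neutral. Answer with a single word.\n\n"
        f"Sentence: {sentence}\nSentiment:"
    )

print(build_sentiment_prompt(
    "The company raised its full-year revenue guidance after a strong quarter."
))
```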
Key Features of the Open Financial LLM Leaderboard
- Diverse Task Categories: The leaderboard covers tasks across seven categories: Information Extraction (IE), Textual Analysis (TA), Question Answering (QA), Text Generation (TG), Risk Management (RM), Forecasting (FO), and Decision-Making (DM).
- Evaluation Metrics: Models are assessed using a variety of metrics, including Accuracy, F1 Score, ROUGE Score, and Matthews Correlation Coefficient (MCC). These metrics provide a multidimensional view of model performance, helping users identify the strengths and weaknesses of each model.
Supported Tasks and Metrics
The Open Financial LLM Leaderboard (OFLL) evaluates financial language models across a diverse set of categories that reflect the complex needs of the finance industry. Each category targets specific capabilities, ensuring a comprehensive assessment of model performance on tasks directly relevant to finance.
Categories
The set of task categories in OFLL is intentionally designed to capture the full range of capabilities required of financial models. This design is informed both by the diverse nature of financial applications and by the complexity of the tasks involved in financial language processing.
- Information Extraction (IE): The financial sector often requires structured insights from unstructured documents such as regulatory filings, contracts, and earnings reports. Information extraction tasks include Named Entity Recognition (NER), Relation Extraction, and Causal Classification. These tasks evaluate a model's ability to identify key financial entities, relationships, and events, which are crucial for downstream applications such as fraud detection or investment strategy.
- Textual Analysis (TA): Financial markets are driven by sentiment, opinions, and the interpretation of financial news and reports. Textual analysis tasks such as Sentiment Analysis, News Classification, and Hawkish-Dovish Classification help assess how well a model can interpret market sentiment and textual data, aiding in tasks like investor sentiment analysis and policy interpretation.
- Question Answering (QA): This category addresses the ability of models to interpret complex financial queries, particularly those that involve numerical reasoning or domain-specific knowledge. The QA tasks, such as those derived from datasets like FinQA and TATQA, evaluate a model's capability to answer detailed financial questions, which is critical in areas like risk analysis or financial advisory services.
- Text Generation (TG): Summarization of complex financial reports and documents is crucial for decision-making. Tasks like ECTSum and EDTSum test models on their ability to generate concise and coherent summaries from lengthy financial texts, which is valuable for producing reports or analyst briefings.
- Forecasting (FO): One of the most critical applications in finance is the ability to forecast market movements. Tasks in this category evaluate a model's ability to predict stock price movements or market trends based on historical data, news, and sentiment. These tasks are central to portfolio management and trading strategies.
- Risk Management (RM): This category focuses on tasks that evaluate a model's ability to predict and assess financial risks, such as Credit Scoring, Fraud Detection, and Financial Distress Identification. These tasks are fundamental for credit analysis, risk management, and compliance purposes.
- Decision-Making (DM): In finance, making informed decisions based on multiple inputs (e.g., market data, sentiment, and historical trends) is crucial. Decision-making tasks simulate complex financial decisions, such as Mergers & Acquisitions and Stock Trading, testing the model's ability to handle multimodal inputs and offer actionable insights.
Metrics
- Accuracy measures the proportion of correctly classified instances out of all instances, providing a straightforward assessment of overall performance.
- F1-score, the harmonic mean of precision and recall, offers a balanced evaluation that is especially important in cases of class imbalance within a dataset. Both metrics are standard in classification tasks and together give a comprehensive view of a model's ability to discern sentiment in financial language.
- RMSE provides a measure of the average deviation between predicted and actual sentiment scores, offering quantitative insight into the accuracy of a model's predictions.
- Entity F1 Score (EntityF1) assesses the balance between precision and recall specifically for recognized entities, giving a clear view of a model's effectiveness in identifying relevant financial entities. A high EntityF1 indicates that the model is proficient both at detecting entities and at minimizing false positives, making it an important measure for applications in financial data analysis and automation.
- Exact Match Accuracy (EmAcc) measures the proportion of questions for which the model's answer exactly matches the ground truth, providing a clear indication of a model's effectiveness in understanding and processing numerical information in financial contexts. A high EmAcc reflects a model's capability to deliver precise and reliable answers, crucial for applications that depend on accurate financial data interpretation.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of summaries by comparing them to reference summaries. It focuses on the overlap of n-grams between the generated and reference summaries, providing a measure of content coverage and fidelity.
- BERTScore uses contextual embeddings from the BERT model to evaluate the similarity between generated and reference summaries. By comparing the cosine similarity of the embeddings for each token, BERTScore captures semantic similarity, allowing for a more nuanced evaluation of summary quality.
- BARTScore is based on the BART (Bidirectional and Auto-Regressive Transformers) model, which combines the benefits of autoregressive and autoencoding approaches to text generation. It assesses how well a generated summary aligns with the reference summary in terms of coherence and fluency, providing insight into the overall quality of the extraction process.
- Matthews Correlation Coefficient (MCC) takes into account true and false positives and negatives, offering insight into a model's effectiveness in a binary classification context. Together, these metrics ensure a comprehensive assessment of a model's predictive capabilities in the challenging setting of stock movement forecasting.
- Sharpe Ratio (SR) measures a model's risk-adjusted return, providing insight into how well the model's trading strategies perform relative to the level of risk taken. A higher Sharpe Ratio indicates a more favorable return per unit of risk, making it a critical indicator of the effectiveness and efficiency of the trading strategies generated by the model. This metric lets users gauge a model's overall profitability and robustness in varying market conditions. A minimal sketch of how several of these metrics can be computed is shown after this list.
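As a minimal sketch (not the leaderboard's exact scoring code), the classification and regression metrics above can be computed with scikit-learn and NumPy; the labels and scores below are illustrative only:

```python
# Minimal sketch of the classification/regression metrics used on the leaderboard.
# Labels, predictions, and sentiment scores are made up for illustration.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef, mean_squared_error

y_true = ["positive", "neutral", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "neutral"]

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")  # balances precision/recall across classes
mcc = matthews_corrcoef(y_true, y_pred)            # uses all confusion-matrix cells

# RMSE for tasks that predict continuous sentiment scores (e.g., TSA).
scores_true = np.array([0.8, -0.2, 0.1])
scores_pred = np.array([0.6, -0.1, 0.0])
rmse = np.sqrt(mean_squared_error(scores_true, scores_pred))

print(f"accuracy={accuracy:.2f} f1={f1:.2f} mcc={mcc:.2f} rmse={rmse:.3f}")
```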
Individual Tasks
We use 40 tasks in this leaderboard, across the following categories:
- Information Extraction (IE): NER, FiNER-ORD, FinRED, SC, CD, FNXL, FSRL
- Textual Analysis (TA): FPB, FiQA-SA, TSA, Headlines, FOMC, FinArg-ACC, FinArg-ARC, MultiFin, MA, MLESG
- Question Answering (QA): FinQA, TATQA, Regulations, ConvFinQA
- Text Generation (TG): ECTSum, EDTSum
- Risk Management (RM): German, Australian, LendingClub, ccf, ccfraud, polish, taiwan, portoseguro, travelinsurance
- Forecasting (FO): BigData22, ACL18, CIKM18
- Decision-Making (DM): FinTrade
- Spanish: MultiFin-ES, EFP, EFPA, FinanceES, TSA-Spanish
Click here for a short explanation of each task
- **FPB (Financial PhraseBank Sentiment Classification)**
Description: Sentiment analysis of phrases in financial news and reports, classified into positive, negative, or neutral categories.
Metrics: Accuracy, F1-Score
- FiQA-SA (Sentiment Analysis in the Financial Domain)
Description: Sentiment analysis in financial media (news, social media). Classifies sentiments into positive, negative, and neutral, aiding in market sentiment interpretation.
Metrics: F1-Score
- TSA (Sentiment Analysis on Social Media)
Description: Sentiment classification for financial tweets, reflecting public opinion on market trends. Challenges include informal language and brevity.
Metrics: F1-Score, RMSE
- Headlines (News Headline Classification)
Description: Classification of financial news headlines into sentiment or event categories. Critical for understanding market-moving information.
Metrics: AvgF1
- FOMC (Hawkish-Dovish Classification)
Description: Classification of FOMC statements as hawkish (favoring higher interest rates) or dovish (favoring lower rates), key for monetary policy predictions.
Metrics: F1-Score, Accuracy
- FinArg-ACC (Argument Unit Classification)
Description: Identifies key argument units (claims, evidence) in financial texts, crucial for automated document analysis and transparency.
Metrics: F1-Score, Accuracy
- FinArg-ARC (Argument Relation Classification)
Description: Classification of relationships between argument units (support, opposition) in financial documents, helping analysts build coherent narratives.
Metrics: F1-Score, Accuracy
- MultiFin (Multi-Class Sentiment Analysis)
Description: Classification of diverse financial texts into sentiment categories (bullish, bearish, neutral), valuable for sentiment-driven trading.
Metrics: F1-Score, Accuracy
- MA (Deal Completeness Classification)
Description: Classifies mergers and acquisitions reports as completed, pending, or terminated. Critical for investment and strategy decisions.
Metrics: F1-Score, Accuracy
- MLESG (ESG Issue Identification)
Description: Identifies Environmental, Social, and Governance (ESG) issues in financial documents, important for responsible investing.
Metrics: F1-Score, Accuracy
- NER (Named Entity Recognition in Financial Texts)
Description: Identifies and categorizes entities (companies, instruments) in financial documents, essential for information extraction.
Metrics: Entity F1-Score
- FINER-ORD (Ordinal Classification in Financial NER)
Description: Extends NER by classifying entity relevance within financial documents, helping prioritize key information.
Metrics: Entity F1-Score
- FinRED (Financial Relation Extraction)
Description: Extracts relationships (ownership, acquisition) between entities in financial texts, supporting knowledge graph construction.
Metrics: F1-Score, Entity F1-Score
- SC (Causal Classification)
Description: Classifies causal relationships in financial texts (e.g., "X caused Y"), aiding in market risk assessments.
Metrics: F1-Score, Entity F1-Score
- CD (Causal Detection)
Description: Detects causal relationships in financial texts, helping in risk analysis and investment strategies.
Metrics: F1-Score, Entity F1-Score
- FinQA (Numerical Question Answering in Finance)
Description: Answers numerical questions from financial documents (e.g., balance sheets), crucial for automated reporting and analysis.
Metrics: Exact Match Accuracy (EmAcc)
- TATQA (Table-Based Question Answering)
Description: Extracts information from financial tables (balance sheets, income statements) to answer queries requiring numerical reasoning.
Metrics: F1-Score, EmAcc
- ConvFinQA (Multi-Turn QA in Finance)
Description: Handles multi-turn dialogues in financial question answering, maintaining context throughout the conversation.
Metrics: EmAcc
- FNXL (Numeric Labeling)
Description: Labels numeric values in financial documents (e.g., revenue, expenses), aiding in financial data extraction.
Metrics: F1-Score, EmAcc
- FSRL (Financial Statement Relation Linking)
Description: Links related information across financial statements (e.g., revenue in income statements and cash flow data).
Metrics: F1-Score, EmAcc
- EDTSUM (Extractive Document Summarization)
Description: Summarizes long financial documents, extracting key information for decision-making.
Metrics: ROUGE, BERTScore, BARTScore
- ECTSUM (Extractive Content Summarization)
Description: Summarizes financial content, extracting key sentences or phrases from large texts.
Metrics: ROUGE, BERTScore, BARTScore
- BigData22 (Stock Movement Prediction)
Description: Predicts stock price movements based on financial news, using textual data to forecast market trends.
Metrics: Accuracy, MCC
- ACL18 (Financial News-Based Stock Prediction)
Description: Predicts stock price movements from news articles, interpreting sentiment and events for short-term forecasts.
Metrics: Accuracy, MCC
- CIKM18 (Financial Market Prediction Using News)
Description: Predicts broader market movements (indices) from financial news, synthesizing information for market trend forecasts.
Metrics: Accuracy, MCC
- German (Credit Scoring in Germany)
Description: Predicts creditworthiness of loan applicants in Germany, important for responsible lending and risk management.
Metrics: F1-Score, MCC
- Australian (Credit Scoring in Australia)
Description: Predicts creditworthiness in the Australian market, considering local economic conditions.
Metrics: F1-Score, MCC
- LendingClub (Peer-to-Peer Lending Risk Prediction)
Description: Predicts loan default risk for peer-to-peer lending, helping lenders manage risk.
Metrics: F1-Score, MCC
- ccf (Credit Card Fraud Detection)
Description: Identifies fraudulent credit card transactions, ensuring financial security and fraud prevention.
Metrics: F1-Score, MCC
- ccfraud (Credit Card Transaction Fraud Detection)
Description: Detects anomalies in credit card transactions that indicate fraud, while handling imbalanced datasets.
Metrics: F1-Score, MCC
- Polish (Credit Risk Prediction in Poland)
Description: Predicts credit risk for loan applicants in Poland, assessing factors relevant to local economic conditions.
Metrics: F1-Score, MCC
- Taiwan (Credit Risk Prediction in Taiwan)
Description: Predicts credit risk in the Taiwanese market, helping lenders manage risk in local contexts.
Metrics: F1-Score, MCC
- Portoseguro (Claim Analysis in Brazil)
Description: Predicts the outcome of insurance claims in Brazil, focusing on auto insurance claims.
Metrics: F1-Score, MCC
- Travel Insurance (Claim Prediction)
Description: Predicts the likelihood of travel insurance claims, helping insurers manage pricing and risk.
Metrics: F1-Score, MCC
- MultiFin-ES (Sentiment Analysis in Spanish)
Description: Classifies sentiment in Spanish-language financial texts (bullish, bearish, neutral).
Metrics: F1-Score
- EFP (Financial Phrase Classification in Spanish)
Description: Classifies sentiment or intent in Spanish financial phrases (positive, negative, neutral).
Metrics: F1-Score
- EFPA (Argument Classification in Spanish)
Description: Identifies claims, evidence, and counterarguments in Spanish financial texts.
Metrics: F1-Score
- FinanceES (Sentiment Classification in Spanish)
Description: Classifies sentiment in Spanish financial documents, accounting for linguistic nuances.
Metrics: F1-Score
- TSA-Spanish (Sentiment Analysis in Spanish Tweets)
Description: Sentiment analysis of Spanish tweets, interpreting informal language in real-time market discussions.
Metrics: F1-Score
- FinTrade (Stock Trading Simulation)
Description: Evaluates models on stock trading simulations, analyzing historical stock prices and financial news to optimize trading outcomes.
Metrics: Sharpe Ratio (SR)
Click here for a detailed explanation of each task
This section documents each task within the categories in more detail, explaining the specific datasets, evaluation metrics, and financial relevance.
- FPB (Financial PhraseBank Sentiment Classification)
- Task Description. In this task, we evaluate a language model's ability to perform sentiment analysis on financial texts. We employ the Financial PhraseBank dataset, which consists of annotated phrases extracted from financial news articles and reports. Each phrase is labeled with one of three sentiment categories: positive, negative, or neutral. The dataset provides a nuanced understanding of sentiments expressed in financial contexts, making it essential for applications such as market sentiment analysis and automated trading strategies. The primary objective is to classify each financial phrase accurately based on its sentiment. Example inputs, outputs, and the prompt templates used in this task are detailed in Table 5 and Table 8 in the Appendix.
- Metrics. Accuracy, F1-score.
- FiQA-SA (Sentiment Analysis on the FiQA Financial Domain)
- Task Description. The FiQA-SA task evaluates a language model's capability to perform sentiment analysis within the financial domain, focusing on data derived from the FiQA dataset. This dataset features a diverse collection of financial texts sourced from various media, including news articles, financial reports, and social media posts. The primary objective of the task is to classify the sentiments expressed in these texts into distinct categories, such as positive, negative, and neutral. This classification is crucial for understanding market sentiment, as it can directly influence investment decisions and strategies. The FiQA-SA task is especially relevant in today's fast-paced financial environment, where the interpretation of sentiment can lead to timely and informed decision-making.
- Metrics. F1 Score.
- TSA (Sentiment Analysis on Social Media)
- Task Description. The TSA task evaluates a model's ability to perform sentiment analysis on tweets related to financial markets. Using a dataset of social media posts, this task seeks to classify sentiments as positive, negative, or neutral. The dynamic nature of social media makes it a rich source of real-time sentiment data, reflecting public opinion on market trends, company news, and economic events. The TSA dataset features a wide range of tweets with diverse expressions of sentiment on financial topics, ranging from stock performance to macroeconomic indicators. Given the brevity and informal nature of tweets, this task presents unique challenges in accurately interpreting sentiment, as context and subtleties can significantly affect meaning. Effective models must therefore demonstrate a robust understanding of informal language, slang, and the sentiment indicators commonly used on social media platforms.
- Metrics. F1 Score, RMSE. RMSE provides a measure of the average deviation between predicted and actual sentiment scores, offering quantitative insight into the accuracy of the model's predictions.
- Headlines (News Headline Classification)
- Task Description. The Headlines task involves classifying financial news headlines into categories reflecting distinct financial events or sentiment classes. The dataset consists of a rich collection of headlines sourced from reputable financial news outlets, covering a wide array of topics ranging from corporate earnings reports to market forecasts. The primary objective of this task is to evaluate a model's ability to accurately interpret and categorize brief, context-rich text segments that often drive market movements. Given the succinct nature of headlines, the classification task requires models to quickly grasp the underlying sentiment and relevance of each headline, which can significantly influence investor behavior and market sentiment.
- Metrics. Average F1 Score (AvgF1). This metric provides a balanced measure of precision and recall, allowing for a nuanced understanding of the model's performance in classifying headlines. A high AvgF1 indicates that the model is effectively identifying and categorizing the sentiment and events reflected in the headlines, making it a critical metric for assessing its applicability in real-world financial contexts.
- FOMC (Hawkish-Dovish Classification)
- Task Description. The FOMC task evaluates a model's ability to classify statements from transcripts of Federal Open Market Committee (FOMC) meetings as either hawkish or dovish. Hawkish statements typically indicate a preference for higher interest rates to curb inflation, while dovish statements suggest a focus on lower rates to stimulate economic growth. This classification is crucial for understanding monetary policy signals that can affect financial markets and investment strategies. The dataset includes a range of statements from FOMC meetings, providing insights into the Federal Reserve's stance on economic conditions, inflation, and employment. Accurately categorizing these statements allows analysts and investors to anticipate market reactions and adjust their strategies accordingly, making this task highly relevant to financial decision-making.
- Metrics. F1 Score, Accuracy.
- FinArg-ACC (Financial Argument Unit Classification)
- Task Description. The FinArg-ACC task focuses on classifying argument units within financial documents, aiming to identify key components such as main claims, supporting evidence, and counterarguments. The dataset comprises a diverse collection of financial texts, including research reports, investment analyses, and regulatory filings. The primary objective is to assess a model's ability to break complex financial narratives down into distinct argument units, which is crucial for automated financial document analysis. This task is especially relevant in the context of increasing regulatory scrutiny and the need for transparency in financial communications, where understanding the structure of arguments can aid compliance and risk management.
- Metrics. F1 Score, Accuracy.
- FinArg-ARC (Financial Argument Relation Classification)
- Task Description. The FinArg-ARC task focuses on classifying relationships between argument units within financial texts. This involves identifying how various claims, evidence, and counterarguments relate to one another, such as support, opposition, or neutrality. The dataset comprises annotated financial documents that highlight argument structures, enabling models to learn the nuances of financial discourse. Understanding these relationships is crucial for constructing coherent narratives and analyses from fragmented data, which can help financial analysts, investors, and researchers draw meaningful insights from complex information. Given the intricate nature of financial arguments, effective models must demonstrate proficiency in discerning subtle distinctions in meaning and context, which are essential for accurate classification.
- Metrics. F1 Score, Accuracy.
- MultiFin (Multi-Class Financial Sentiment Analysis)
- Task Description. The MultiFin task focuses on classifying the sentiments expressed in a diverse array of financial texts into multiple categories, such as bullish, bearish, or neutral. The dataset includes various financial documents, ranging from reports and articles to social media posts, providing a comprehensive view of sentiment across different sources and contexts. The primary objective of this task is to assess a model's ability to accurately discern and categorize sentiments that influence market behavior and investor decisions. Models must demonstrate a robust understanding of contextual cues and the varying tones inherent in financial discussions. The MultiFin task is especially valuable for applications in sentiment-driven trading strategies and market analysis, where precise sentiment classification can lead to more informed investment decisions.
- Metrics. F1 Score, Accuracy.
- MA (Deal Completeness Classification)
- Task Description: The MA task focuses on classifying mergers and acquisitions (M&A) reports to determine whether a deal has been completed. The dataset comprises a variety of M&A announcements sourced from financial news articles, press releases, and company filings. The primary objective is to accurately identify the status of each deal, categorized as completed, pending, or terminated, based on the information presented in the reports. This classification is crucial for investment analysts and financial institutions, as understanding the completion status of M&A deals can significantly influence investment strategies and market reactions. Models must demonstrate a robust understanding of the M&A landscape and the ability to accurately classify deal statuses based on often complex and evolving narratives.
- Metrics: F1 Score, Accuracy.
- MLESG (ESG Issue Identification)
- Task Description: The MLESG task focuses on identifying Environmental, Social, and Governance (ESG) issues within financial texts. The dataset is specifically designed to capture a variety of texts that discuss ESG topics, including corporate reports, news articles, and regulatory filings. The primary objective of the task is to evaluate a model's ability to accurately classify and categorize ESG-related content, which is becoming increasingly important in today's investment landscape. Models are tasked with detecting specific ESG issues, such as climate change impacts, social justice initiatives, or corporate governance practices. They must demonstrate a deep understanding of the language used in these contexts, as well as the ability to discern subtle variations in meaning and intent.
- Metrics: F1 Score, Accuracy.
- NER (Named Entity Recognition in Financial Texts)
- Task Description: The NER task focuses on identifying and classifying named entities within financial documents, such as companies, financial instruments, and individuals. The task uses a dataset that includes a diverse range of financial texts, encompassing regulatory filings, earnings reports, and news articles. The primary objective is to accurately recognize entities relevant to the financial domain and categorize them appropriately, which is crucial for information extraction and analysis. Effective named entity recognition enhances the automation of financial analysis processes, allowing stakeholders to quickly gather insights from large volumes of unstructured text.
- Metrics: Entity F1 Score (EntityF1).
- FINER-ORD (Ordinal Classification in Financial NER)
- Task Description: The FINER-ORD task extends standard Named Entity Recognition (NER) by requiring models to classify entities not only by type but also by their ordinal relevance within financial texts. The dataset comprises a range of financial documents, including reports, articles, and regulatory filings, where entities such as companies, financial instruments, and events are annotated with an additional layer of classification reflecting their importance or priority. The primary objective is to evaluate a model's ability to discern and categorize entities based on their significance in the context of the surrounding text. For instance, a model might identify a primary entity (e.g., a major corporation) as having higher relevance than secondary entities (e.g., a minor competitor) mentioned in the same document. This capability is crucial for prioritizing information and improving the efficiency of automated financial analyses, where distinguishing between levels of importance can significantly affect decision-making processes.
- Metrics: Entity F1 Score (EntityF1).
- FinRED (Financial Relation Extraction from Text)
- Task Description: The FinRED task focuses on extracting relationships between financial entities mentioned in textual data. The task uses a dataset of diverse financial documents, such as news articles, reports, and regulatory filings. The primary objective is to identify and classify relationships such as ownership, acquisition, and partnership among entities such as companies, financial instruments, and stakeholders. Accurately extracting these relationships is crucial for constructing comprehensive knowledge graphs and facilitating in-depth financial analysis. The challenge lies in interpreting context correctly, because the relationships often involve nuanced language and implicit connections that require a sophisticated understanding of financial terminology.
- Metrics: F1 Score, Entity F1 Score (EntityF1).
- SC (Causal Classification Task in the Financial Domain)
- Task Description: The SC task evaluates a language model's ability to classify causal relationships within financial texts. This involves identifying whether one event causes another, which is crucial for understanding dynamics in financial markets. The dataset for this task spans a variety of financial documents, including reports, articles, and regulatory filings, in which causal language is often embedded. By examining phrases that express causality, such as "due to," "resulting in," or "leads to," models must accurately determine the causal links between financial events, trends, or phenomena. This task is especially relevant for risk assessment, investment strategy formulation, and decision-making, as understanding causal relationships can significantly influence evaluations of market conditions and forecasts.
- Metrics: F1 Score, Entity F1 Score (EntityF1).
- CD (Causal Detection)
- Task Description: The CD task focuses on detecting causal relationships within a diverse range of financial texts, including reports, news articles, and social media posts. The task evaluates a model's ability to identify instances where one event influences or causes another, which is crucial for understanding dynamics in financial markets. The dataset comprises annotated examples that explicitly highlight causal links, allowing models to learn from varied contexts and expressions of causality. Detecting causality is crucial for risk assessment, because it helps analysts understand the potential impact of events on market behavior, investment strategies, and decision-making processes. Models must navigate nuances and subtleties in the text to accurately discern causal connections.
- Metrics: F1 Score, Entity F1 Score (EntityF1).
- FinQA (Numerical Question Answering in Finance)
- Task Description: The FinQA task evaluates a model's ability to answer numerical questions based on financial documents, such as balance sheets, income statements, and financial reports. The dataset features a diverse set of questions that require not only comprehension of the text but also the ability to extract and manipulate numerical data accurately. The primary goal is to assess how well a model can interpret complex financial information and perform the necessary calculations to derive answers. The FinQA task is especially relevant for applications in financial analysis, investment decision-making, and automated reporting, where precise numerical responses are essential for stakeholders. A minimal exact-match check in the style of this task is sketched below.
- Metrics: Exact Match Accuracy (EmAcc).
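As a rough illustration of what an exact-match check for this kind of numerical QA can look like (the example question, answers, and normalization rule are hypothetical, not the leaderboard's scoring code):

```python
# Hypothetical FinQA-style exact-match check; values and normalization are
# illustrative assumptions only.
def normalize(answer: str) -> str:
    return answer.replace("$", "").replace(",", "").replace("%", "").strip().lower()

question = "What was the change in net revenue between 2019 and 2020?"
gold_answer = "$1,250 million"
model_answer = "1250 million"

exact_match = normalize(model_answer) == normalize(gold_answer)
print(question, "->", exact_match)  # True under this simple normalization
```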
- TATQA (Table-Based Question Answering in Financial Documents)
- Task Description: The TATQA task evaluates a model's ability to answer questions that require interpreting and extracting information from tables in financial documents. The dataset is specifically designed to include a variety of financial tables, such as balance sheets, income statements, and cash flow statements, each containing structured data critical for financial analysis. The primary objective of this task is to assess how well models can navigate these tables to provide accurate and relevant answers to questions that often demand numerical reasoning or domain-specific knowledge. Models must demonstrate proficiency in not only locating the correct data but also understanding the relationships between different data points in the context of financial analysis.
- Metrics: F1 Score, Exact Match Accuracy (EmAcc).
- ConvFinQA (Multi-Turn Question Answering in Finance)
- Task Description: The ConvFinQA task evaluates a model's ability to handle multi-turn question answering in the financial domain. This task simulates real-world scenarios where financial analysts engage in dialogues, asking a series of related questions that build upon previous answers. The dataset includes conversations that reflect common inquiries about financial data, market trends, and economic indicators, requiring the model to maintain context and coherence throughout the dialogue. The primary objective is to assess the model's capability to interpret and respond accurately to multi-turn queries, ensuring that it can provide relevant and precise information as the conversation progresses. This task is especially relevant in financial advisory settings, where analysts must extract insights from complex datasets while engaging with clients or stakeholders.
- Metrics: Exact Match Accuracy (EmAcc).
- FNXL (Numeric Labeling in Financial Texts)
- Task Description: The FNXL task focuses on identifying and categorizing numeric values within financial documents. This involves labeling numbers based on their type (e.g., revenue, profit, expense) and their relevance in the context of the text. The dataset for this task includes a diverse range of financial reports, statements, and analyses, presenting various numeric expressions that are crucial for understanding financial performance. Accurate numeric labeling is crucial for automating financial analysis and ensuring that critical data points are readily accessible for decision-making. Models must demonstrate a robust ability to parse context and semantics to accurately classify numeric information, thereby improving the efficiency of financial data processing.
- Metrics: F1 Score, Exact Match Accuracy (EmAcc).
- FSRL (Financial Statement Relation Linking)
- Task Description: The FSRL task focuses on linking related information across different financial statements, such as matching revenue figures from income statements with corresponding cash flow data. This task is crucial for comprehensive financial analysis, enabling models to synthesize data from multiple sources into a coherent view of a company's financial health. The dataset includes a variety of financial statements from publicly traded companies, featuring intricate relationships between different financial metrics. Accurate linking of this information is essential for financial analysts and investors who rely on holistic views of financial performance. The task requires models to navigate the complexities of financial terminology and understand the relationships between various financial elements, ensuring they can effectively connect relevant data points.
- Metrics: F1 Score, Exact Match Accuracy (EmAcc).
- EDTSUM (Extractive Document Summarization in Finance)
- Task Description: The EDTSUM task focuses on summarizing lengthy financial documents by extracting the most relevant sentences to create concise and coherent summaries. This task is crucial in the financial sector, where professionals often deal with extensive reports, research papers, and regulatory filings. The ability to distill critical information from large volumes of text is essential for efficient decision-making and information dissemination. The EDTSUM dataset consists of varied financial documents, each paired with expert-generated summaries that highlight key insights and data points. Models are evaluated on their capability to identify and select sentences that accurately reflect the main themes and arguments presented in the original documents.
- Metrics: ROUGE, BERTScore, and BARTScore.
- ECTSUM (Extractive Content Summarization)
- Task Description: The ECTSUM task focuses on extractive content summarization within the financial domain, where the objective is to generate concise summaries from extensive financial documents. The task uses a dataset that includes a variety of financial texts, such as reports, articles, and regulatory filings, each containing critical information relevant to stakeholders. The goal is to evaluate a model's ability to identify and extract the most salient sentences or phrases that capture the key points of the original text. The ECTSUM task challenges models to demonstrate their understanding of context, relevance, and coherence, ensuring that the extracted summaries accurately represent the main ideas while maintaining readability and clarity. A sketch of how the summarization metrics can be computed appears below.
- Metrics: ROUGE, BERTScore, and BARTScore.
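As a rough sketch of how ROUGE and BERTScore can be computed for a single candidate/reference pair (the `rouge_score` and `bert_score` packages and the example texts are assumptions, not the leaderboard's evaluation harness; BARTScore is computed analogously from a BART model):

```python
# Sketch of summarization scoring with the rouge_score and bert_score packages,
# assuming they are installed; the texts are illustrative, not dataset examples.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Quarterly revenue rose 12 percent on strong loan growth."
candidate = "Revenue grew 12 percent this quarter, driven by loan growth."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)  # n-gram overlap with the reference

P, R, F1 = bert_score([candidate], [reference], lang="en")  # contextual-embedding similarity
print(rouge["rougeL"].fmeasure, F1.mean().item())
```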
- BigData22 (Stock Movement Prediction)
- Task Description: The BigData22 task focuses on predicting stock price movements based on financial news and reports. The dataset is designed to capture the intricate relationship between market sentiment and stock performance, using a comprehensive collection of news articles, social media posts, and market data. The primary goal of this task is to evaluate a model's ability to accurately forecast whether the price of a particular stock will increase or decrease within a defined timeframe. Models must effectively analyze textual data and discern patterns that correlate with market movements.
- Metrics: Accuracy, Matthews Correlation Coefficient (MCC).
- ACL18 (Financial News-Based Stock Prediction)
- Task Description: The ACL18 task focuses on predicting stock movements based on financial news articles and headlines. Using a dataset that includes a variety of news pieces, this task aims to evaluate a model's ability to analyze textual content and forecast whether stock prices will rise or fall in the near term. The dataset covers a range of financial news topics, from company announcements to economic indicators, reflecting the complex relationship between news sentiment and market reactions. Models must effectively interpret nuances in language and sentiment that can influence stock performance, ensuring that predictions align with actual market movements.
- Metrics: Accuracy, Matthews Correlation Coefficient (MCC).
- CIKM18 (Financial Market Prediction Using News)
- Task Description: The CIKM18 task focuses on predicting broader market movements, such as stock indices, based on financial news articles. Using a dataset of news stories related to market events, this task evaluates a model's ability to synthesize information from multiple sources and make informed predictions about future market trends. The dataset includes articles covering significant financial events, economic indicators, and company news, reflecting the complex interplay between news sentiment and market behavior. The objective of this task is to assess how well a model can analyze the content of financial news and use that analysis to forecast market movements.
- Metrics: Accuracy, Matthews Correlation Coefficient (MCC).
- German (Credit Scoring in the German Market)
- Task Description: The German task evaluates a model's ability to predict the creditworthiness of loan applicants in the German market. Using a dataset that covers various financial indicators, demographic information, and historical credit data, this task aims to classify applicants as either creditworthy or non-creditworthy. The dataset reflects the particular economic and regulatory conditions of Germany, providing a comprehensive view of the factors influencing credit decisions in this market. Given the importance of accurate credit scoring for financial institutions, this task is crucial for minimizing risk and ensuring responsible lending practices. Models must effectively analyze multiple variables to make informed predictions, thereby facilitating better decision-making in loan approvals and risk management.
- Metrics: F1 Score, Matthews Correlation Coefficient (MCC).
- Australian (Credit Scoring in the Australian Market)
- Task Description: The Australian task focuses on predicting the creditworthiness of loan applicants in the Australian financial context. The dataset includes a comprehensive array of features derived from various sources, such as financial histories, income levels, and demographic information. The primary objective of this task is to classify applicants as either creditworthy or non-creditworthy, enabling financial institutions to make informed lending decisions. Given the particular economic conditions and regulatory environment in Australia, this task is especially relevant for understanding the specific factors that influence credit scoring in this market.
- Metrics: F1 Score, Matthews Correlation Coefficient (MCC).
- LendingClub (Peer-to-Peer Lending Risk Prediction)
- Task Description: The LendingClub task focuses on predicting the risk of default for loans issued through the LendingClub platform, a major peer-to-peer lending service. The task uses a dataset with detailed information about loan applicants, such as credit scores, income levels, employment history, and other financial indicators. The primary objective is to assess the likelihood of loan default, enabling lenders to make informed decisions about loan approvals and risk management. Models evaluated on this task must effectively analyze a variety of features, capturing complex relationships within the data to produce reliable risk assessments.
- Metrics: F1 Score, Matthews Correlation Coefficient (MCC).
- ccf (Credit Card Fraud Detection)
- Task Description: The ccf task focuses on identifying fraudulent transactions within a large dataset of credit card operations. The dataset covers various transaction features, including transaction amount, time, location, and merchant information, providing a comprehensive view of spending behavior. The primary objective of the task is to classify transactions as either legitimate or fraudulent, enabling financial institutions to detect and prevent fraudulent activity effectively. Models must navigate the challenges posed by class imbalance, as fraudulent transactions typically represent a small fraction of the overall dataset.
- Metrics: F1 Score, Matthews Correlation Coefficient (MCC).
- ccfraud (Credit Card Transaction Fraud Detection)
- Task Description: The ccfraud task focuses on identifying fraudulent transactions within a dataset of credit card operations. The dataset comprises a large number of transaction records, each labeled as either legitimate or fraudulent. The primary objective is to evaluate a model's capability to accurately distinguish normal transactions from those that exhibit suspicious behavior indicative of fraud. The ccfraud task presents unique challenges, including the need to handle imbalanced data, as fraudulent transactions typically represent a small fraction of the total dataset. Models must demonstrate proficiency in detecting the subtle patterns and anomalies that signal fraudulent activity while minimizing false positives to avoid inconveniencing legitimate customers.
- Metrics: F1 Score, Matthews Correlation Coefficient (MCC).
- Polish (Credit Risk Prediction in the Polish Market)
- Task Description: The Polish task focuses on predicting credit risk for loan applicants in the Polish financial market. Using a comprehensive dataset that includes demographic and financial information about applicants, the task aims to assess the likelihood of default on loans. This prediction is crucial for financial institutions in making informed lending decisions and managing risk effectively. Models must account for local factors influencing creditworthiness, such as income levels, employment status, and credit history.
- Metrics: F1 Score, Matthews Correlation Coefficient (MCC).
- Taiwan (Credit Risk Prediction in the Taiwanese Market)
- Task Description: The Taiwan task focuses on predicting credit risk for loan applicants in the Taiwanese market. Using a dataset with detailed financial and personal information about borrowers, this task aims to assess the likelihood of default based on factors including credit history, income, and demographic details. The model's ability to analyze complex patterns within the data and produce reliable predictions is crucial in a rapidly evolving financial landscape. Given the particular economic conditions and regulatory environment in Taiwan, this task also emphasizes the importance of local context in risk assessment, requiring models to adapt effectively to specific market characteristics and trends.
- Metrics: F1 Score, Matthews Correlation Coefficient (MCC).
- Portoseguro (Claim Analysis in the Brazilian Market)
- Task Description: The Portoseguro task focuses on analyzing insurance claims in the Brazilian market, specifically for auto insurance. The task uses a dataset with detailed information about individual claims, such as the nature of the incident, policyholder details, and claim outcomes. The primary goal is to evaluate a model's ability to predict the likelihood of a claim being approved or denied based on these factors. By accurately classifying claims, models can help insurance companies streamline their decision-making processes, improve risk management strategies, and reduce fraudulent activity. Models must take into account regional nuances and the specific criteria used in evaluating claims, ensuring that predictions align with local regulations and market practices.
- Metrics: F1 Score, Matthews Correlation Coefficient (MCC).
- Travel Insurance (Travel Insurance Claim Prediction)
- Task Description: The Travel Insurance task focuses on predicting the likelihood that a travel insurance claim will be made, based on a variety of factors and data points. The dataset includes historical data on travel insurance policies, claims made, and associated variables such as the type of travel, duration, destination, and demographic information about the insured individuals. The primary objective of this task is to evaluate a model's ability to accurately assess the chance of a claim being filed, which is crucial for insurance companies in determining policy pricing and risk management strategies. By analyzing patterns and trends in the data, models can provide insights into which factors contribute to a higher likelihood of claims, enabling insurers to make informed decisions about underwriting and premium setting.
- Metrics: F1 Score, Matthews Correlation Coefficient (MCC).
- MultiFin-ES (Multi-Class Financial Sentiment Analysis in Spanish)
- Task Description: The MultiFin-ES task focuses on analyzing sentiment in Spanish-language financial texts, categorizing sentiments into multiple classes such as bullish, bearish, and neutral. The dataset includes a diverse array of financial documents, including news articles, reports, and social media posts, reflecting various aspects of the financial landscape. The primary objective is to evaluate a model's ability to accurately classify sentiments based on contextual cues, linguistic nuances, and cultural references prevalent in Spanish financial discourse. Models must demonstrate proficiency in processing the subtleties of the Spanish language, including idiomatic expressions and regional variations, to achieve accurate classifications.
- Metrics: F1 Score.
- EFP (Financial Phrase Classification in Spanish)
- Task Description: The EFP task focuses on classifying financial phrases in Spanish, using a dataset specifically designed for this purpose. The dataset consists of annotated phrases extracted from Spanish-language financial texts, including news articles, reports, and social media posts. The primary objective is to classify these phrases by sentiment or intent into relevant categories such as positive, negative, or neutral. Given the growing importance of the Spanish-speaking market in global finance, accurately interpreting and analyzing sentiment in Spanish financial communications is crucial for investors and analysts.
- Metrics: F1 Score.
- EFPA (Financial Argument Classification in Spanish)
- Task Description: The EFPA task focuses on classifying arguments within Spanish financial documents, aiming to identify key components such as claims, evidence, and counterarguments. The dataset covers a range of financial texts, including reports, analyses, and regulatory documents, providing a rich resource for understanding argumentative structures in the financial domain. The primary objective is to evaluate a model's ability to accurately categorize different argument units, which is crucial for automating the analysis of complex financial narratives. By classifying arguments effectively, stakeholders can gain insight into the reasoning behind financial decisions and the interplay of the various factors influencing the market. This task presents unique challenges that require models to demonstrate a deep understanding of both linguistic and domain-specific contexts.
- Metrics: F1 Score.
- FinanceES (Financial Sentiment Classification in Spanish)
- Task Description: The FinanceES task focuses on classifying sentiment within a diverse range of financial documents written in Spanish. The dataset includes news articles, reports, and social media posts, reflecting various financial topics and events. The primary objective is to evaluate a model's ability to accurately identify sentiments as positive, negative, or neutral, thus providing insight into market perceptions in Spanish-speaking regions. Given the cultural and linguistic nuances inherent in the Spanish language, effective sentiment classification requires models to handle idiomatic expressions, slang, and context-specific terminology. This task is especially relevant as financial sentiment analysis expands globally, requiring robust models that perform well across different languages and cultural contexts.
- Metrics: F1 Score.
- TSA-Spanish (Sentiment Analysis in Spanish)
- Task Description: The TSA-Spanish task evaluates a model's ability to perform sentiment analysis on tweets and short texts in Spanish related to financial markets. Using a dataset of diverse social media posts, this task aims to classify sentiments as positive, negative, or neutral. The dynamic nature of social media provides a rich source of real-time sentiment data, reflecting public opinion on financial topics including stock performance, company announcements, and economic developments. This task presents unique challenges in accurately interpreting sentiment, as context, slang, and regional expressions can significantly influence meaning. Models must demonstrate a robust understanding of the subtleties of the Spanish language, including colloquialisms and the varying sentiment indicators used across different Spanish-speaking communities.
- Metrics: F1 Score.
- FinTrade (Stock Trading Dataset)
- Task Description: The FinTrade task evaluates models on their ability to perform stock trading simulations using a purpose-built dataset that incorporates historical stock prices, financial news, and sentiment data over a period of one year. The dataset is designed to reflect real-world trading scenarios, providing a comprehensive view of how various factors influence stock performance. The primary objective of this task is to assess a model's capability to make informed trading decisions based on a combination of quantitative and qualitative data, such as market trends and sentiment analysis. By simulating trading activity, models are tasked with generating actionable insights and strategies that maximize profitability while managing risk. The diverse nature of the data, which includes price movements, news events, and sentiment fluctuations, requires models to effectively integrate and analyze multiple data streams to optimize trading outcomes. Trading performance is scored with the Sharpe Ratio, sketched below.
- Metrics: Sharpe Ratio (SR).
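The Sharpe Ratio is the mean excess return divided by the standard deviation of returns; a minimal sketch under assumed daily returns, risk-free rate, and a 252-day annualization convention (all illustrative, not the leaderboard's exact convention):

```python
# Minimal sketch of an annualized Sharpe ratio for a simulated trading strategy.
# Returns, risk-free rate, and the 252-day annualization are illustrative assumptions.
import numpy as np

daily_returns = np.array([0.004, -0.002, 0.001, 0.003, -0.001])  # strategy P&L per day
risk_free_daily = 0.0001

excess = daily_returns - risk_free_daily
sharpe = np.sqrt(252) * excess.mean() / excess.std(ddof=1)
print(f"annualized Sharpe ratio: {sharpe:.2f}")
```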
How to Use the Open Financial LLM Leaderboard
When you first visit the OFLL platform, you are greeted by the main page, which provides an overview of the leaderboard, including an introduction to the platform's purpose and a link to submit your model for evaluation.
At the top of the main page, you will see several tabs:
- LLM Benchmark: The core page where you evaluate models.
- Submit here: A place to submit your own model for automatic evaluation.
- About: More details about the benchmarks, evaluation process, and the datasets used.
Choosing Tasks to Display
To tailor the leaderboard to your specific needs, you can select the financial tasks you want to focus on under the "Select columns to show" section. These tasks are divided into several categories, such as:
- Information Extraction (IE)
- Textual Analysis (TA)
- Question Answering (QA)
- Text Generation (TG)
- Risk Management (RM)
- Forecasting (FO)
- Decision-Making (DM)
Simply check the box next to the tasks you are interested in. The selected tasks will appear as columns in the evaluation table. If you want to remove all selections, click the "Uncheck All" button to reset the task categories.
Choosing Models to Display
To further refine the models displayed in the leaderboard, you can use the "Model types" and "Precision" filters on the right-hand side of the interface to filter models based on their:
- Type: Pretrained, fine-tuned, instruction-tuned, or reinforcement-learning (RL)-tuned.
- Precision: float16, bfloat16, or float32.
- Model Size: Ranges from ~1.5 billion to 70+ billion parameters.
Viewing Results in the Task Table
Once you've chosen your tasks, the results will populate in the task table. This table provides detailed metrics for each model across the tasks you've selected. The performance of each model is displayed under columns labeled Average IE, Average TA, Average QA, and so on, corresponding to the categories you selected.
Submitting a Model for Evaluation
If you have a new model that you'd like to evaluate on the leaderboard, the submission section allows you to upload your model. You'll need to provide:
- Model name
- Revision commit
- Model type
- Precision
- Weight type
After uploading your model, the leaderboard will automatically start evaluating it across the selected tasks, providing real-time feedback on its performance.
Current Best Models and Surprising Results
Throughout the evaluation process on the Open FinLLM Leaderboard, several models have demonstrated exceptional capabilities across various financial tasks.
As of the most recent evaluation:
- Best models: GPT-4 and Llama 3.1 have consistently outperformed other models in many tasks, showing high accuracy and robustness in interpreting financial sentiment.
- Surprising Results: The Forecasting (FO) tasks, focused on stock movement prediction, showed that smaller models, such as Llama-3.1-8B and internlm-7b, often outperformed larger models, for example Llama-3.1-70B, in terms of accuracy and MCC. This suggests that model size does not necessarily correlate with better performance in financial forecasting, especially in tasks where real-time market data and nuanced sentiment analysis are critical. These results highlight the importance of evaluating models based on task-specific performance rather than relying solely on size or general-purpose benchmarks.
Acknowledgments
We would like to thank our sponsors, including The Linux Foundation, for their generous support in making the Open FinLLM Leaderboard possible. Their contributions have helped us build a platform that serves the financial AI community and advances the evaluation of financial language models.
We also invite the community to take part in this ongoing project by submitting models, datasets, or evaluation tasks. Your involvement is crucial in ensuring that the leaderboard remains a comprehensive and evolving tool for benchmarking financial LLMs. Together, we can drive innovation and help develop models better suited to real-world financial applications.
