Overview: This post presents a deep dive into Capital Fund Management’s (CFM) use of open-source large language models (LLMs) and the Hugging Face (HF) ecosystem to optimize Named Entity Recognition (NER) for financial data. By leveraging LLM-assisted labeling with HF Inference Endpoints and refining data with Argilla, the team improved accuracy by up to 6.4% and reduced operational costs, achieving solutions up to 80x cheaper than large LLMs alone.
In this post, you’ll learn:
- Methods to use LLMs for efficient data labeling
- Steps for fine-tuning compact models with LLM insights
- Deployment of models on Hugging Face Inference Endpoints for scalable NER applications
This structured approach combines accuracy and cost-effectiveness, making it ideal for real-world financial applications.
| Model | F1-Score (Zero-Shot) | F1-Score (Fine-Tuned) | Inference Cost (per hour) | Cost Efficiency |
|---|---|---|---|---|
| GLiNER | 87.0% | 93.4% | $0.50 (GPU) / $0.10 (CPU) | Up to 80x cheaper |
| SpanMarker | 47.0% | 90.1% | $0.50 (GPU) / $0.10 (CPU) | Up to 80x cheaper |
| Llama 3.1-8b | 88.0% | N/A | $4.00 | Moderate |
| Llama 3.1-70b | 95.0% | N/A | $8.00 | High Cost |
Capital Fund Management (CFM) is an alternative investment management firm headquartered in Paris, with teams in New York City and London, currently overseeing assets totaling 15.5 billion dollars.
Employing a scientific approach to finance, CFM leverages quantitative and systematic methods to develop superior investment strategies.
CFM has been working with Hugging Face’s Expert Support to stay up to date on the latest advancements in machine learning and harness the power of open-source technology for their wide range of financial applications. One of the collaboration’s primary goals has been to explore how CFM can benefit from efficiently using open-source Large Language Models (LLMs) to enhance their existing machine learning use cases. Quantitative hedge funds rely on massive amounts of data to inform decisions about whether to buy or sell specific financial products. In addition to standard data sources from financial markets (e.g., prices), hedge funds are increasingly extracting insights from alternative data, such as textual information from news articles. One major challenge in incorporating news into fully automated trading strategies is accurately identifying the products or entities (e.g., companies, stocks, currencies) mentioned in the articles. While CFM’s data providers supply these tags, they may be incomplete and require further validation.
CFM explored several approaches to enhance financial entity recognition, including zero-shot NER using LLMs and smaller models, LLM-assisted data labeling with Hugging Face Inference Endpoints and Argilla, and fine-tuning smaller models on curated datasets. These approaches not only leverage the flexibility of large models but also address the challenges of cost and scalability in real-world financial applications.
Among open-source models, the Llama 3.1 series by Meta stood out due to its strong performance across benchmarks, making it a top choice for generating synthetic annotations. These LLMs were pivotal in creating high-quality labeled datasets, combining automation and human expertise to streamline the labeling process and enhance model performance in financial NER tasks.
Table of Contents
- NER on the Financial News and Stock Price Integration Dataset
- LLM-Assisted data labeling with Llama
- Performance of zero-shot approaches for financial NER
- Improving performance of compact models with fine-tuning on LLM-assisted labeled dataset
- Weak Supervision vs. LLM-assisted labeling
NER on the Financial News and Stock Price Integration Dataset
Our focus in this use case was to extract company names from news headlines in the Financial News and Stock Price Integration Dataset (FNSPID). It consists of news headlines and articles related to corresponding stock symbols, coming from several sources such as Bloomberg, Reuters, Benzinga, and others. After analyzing the various news sources, we found that news from Benzinga had no missing stock symbol values. This subset of the dataset comprises ~900k samples. As a result, we decided to reduce our dataset to Benzinga headlines for a more consistent and reliable evaluation.
Dataset preview of FNSPID
{"example 1": "Black Diamond Stock Falling After Tweet"} -> Black Diamond
{"example 2": "Dun & Bradstreet Acquires Avention For $150M"} -> Dun & Bradstreet, Avention
{"example 3": "Fast Money Picks For April 27"} -> No company
Example samples and target predictions for the task
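As a rough illustration, the Benzinga subset can be obtained by streaming FNSPID and filtering on the news source. This is a minimal sketch: the "Publisher" column name is an assumption, while "Article_title" and the dataset ID are the ones used later in this post.
from datasets import load_dataset

# Stream FNSPID and keep only Benzinga headlines.
# NOTE: "Publisher" is an assumed column name for the news source.
dataset = load_dataset("Zihan1004/FNSPID", streaming=True)

benzinga_headlines = (
    sample["Article_title"]
    for sample in dataset["train"]
    if sample.get("Publisher", "") == "Benzinga"
)

# Peek at a few filtered headlines
for _, title in zip(range(3), benzinga_headlines):
    print(title)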
LLM-Assisted data labeling with Llama
To effectively compare different approaches, we first need to consolidate a reliable dataset that will serve as a foundation for evaluating the candidate methods. This dataset will be used both for testing the performance of models in zero-shot settings and as a base for fine-tuning.
We used Llama-assisted data labeling to streamline and enhance the annotation process by having Llama 3.1 generate labels for dataset samples. These auto-generated labels were then manually reviewed using the open-source data annotation platform Argilla. This approach allowed us to speed up the labeling process while ensuring the quality of the annotations.
Deploy Llama-3.1-70B-Instruct with Hugging Face Inference Endpoints
To securely and quickly get access to a Llama-3.1-70B-Instruct deployment, we opted for Hugging Face Inference Endpoints.
Hugging Face Inference Endpoints provide a simple and secure way to deploy machine learning models in production environments. They allow developers and data scientists to build AI applications without needing to manage infrastructure, simplifying deployment to just a few clicks.
To access Inference Endpoints, we logged in as a member of the CapitalFundManagement organization on the Hugging Face Hub, then accessed the service at https://ui.endpoints.huggingface.co. To start a new deployment, we click on New and then select meta-llama/Llama-3.1-70B-Instruct.
Endpoint creation on the Inference Endpoints UI
You can select the cloud provider where the hardware will be hosted, the region, and the type of instance. Inference Endpoints suggests an instance type based on the model size, which should be large enough to run the model. Here an instance with 4 Nvidia L40S GPUs is selected. When an LLM is selected, a container running Text Generation Inference is automatically configured for optimized inference.
When clicking on “Create Endpoint”, the deployment is created and the endpoint will be ready in a few minutes. To get more details about the Inference Endpoints setup, visit https://huggingface.co/docs/inference-endpoints.
Endpoint running on the Inference Endpoints UI
Once our endpoint is running, we’ll use the endpoint URL provided to send requests to it.
Prompting Llama for NER
Before sending requests, we wanted to design a prompt that would effectively guide the model to generate the desired output. After several rounds of testing, we structured the prompt into multiple sections to handle the task accurately:
- Role Definition: The model is positioned as a financial expert with strong English skills.
- Task Instructions: The model is instructed to extract company names linked to stocks mentioned in headlines, while excluding stock indices often present in titles.
- Expected Output: The model must return a dictionary with "result" (exact company names or stock symbols) and "normalized_result" (standardized company names corresponding to those in "result").
- Few-shot Examples: A series of input-output examples to demonstrate the expected behavior and ensure consistency across varied inputs. These examples help the model understand how to distinguish relevant entities and format its output appropriately. Each example illustrates a different headline structure to prepare the model for a range of real-world cases.
SYSTEM_PROMPT = """
###Instructions:###
You are a financial expert with excellent English skills.
Extract only the company names from the following headlines that are related to a stock discussed in the article linked to the headline.
Do not include stock indices such as "CAC40" or "Dow Jones".
##Expected Output:##
Return a dictionary with:
A key "result" containing an inventory of company names or stock symbols. Make certain to return them exactly as they're written within the text even when the unique text has grammatical errors. If no firms or stocks are mentioned, return an empty list.
A key "normalized_result" containing the normalized company names corresponding to the entries within the "result" list, in the identical order. This list must have the identical size because the "result" list.
##Formatting:##
Do not return companies not mentioned in the text.
##Example Outputs##
Input: "There's A Recent Trading Tool That Allows Traders To Trade Cannabis With Leverage"
Output: {"result": [], "normalized_result": []}
Input: "We explain AAPL, TSLA, and MSFT report earnings"
Output: {"result": ["AAPL", "TSLA", "MSFT"], "normalized_result": ["Apple", "Tesla", "Microsoft"]}
Input: "10 Biggest Price Goal Changes For Friday"
Output: {"result": [], "normalized_result": []}
Input: "'M' is For Microsoft, and Meh"
Output: {"result": ["Microsoft"], "normalized_result": ["Microsoft"]}
Input: "Black Diamond: The Recent North Face? (BDE, VFC, JAH, AGPDY.PK)"
Output: {"result": ['Black Diamond', 'North Face', 'BDE', 'VFC','JAH','AGPDY.PK'], "normalized_result": ['Black Diamond','The North Face', 'BPER Banca', 'VF Corporation','Jarden Corporation','AGP Diagnostics']}
"""
Get predictions from the endpoint
Now that we have our prompt and endpoint ready, the next step is to send requests using the titles from our dataset. To do this efficiently, we’ll use the AsyncInferenceClient from the huggingface_hub library. This is an asynchronous version of the InferenceClient, built on asyncio and aiohttp. It allows us to send multiple concurrent requests to the endpoint, making the processing of the dataset faster and more efficient.
from huggingface_hub import AsyncInferenceClient
client = AsyncInferenceClient(base_url="https://your-endpoint-url.huggingface.cloud")
To make sure the model returns structured output, we will use guidance with a specific Pydantic schema, Companies.
from pydantic import BaseModel
from typing import List, Dict, Any

class Companies(BaseModel):
    """
    Pydantic model representing the expected LLM output.

    Attributes:
        result (List[str]): A list of company names or stock symbols returned by the LLM.
        normalized_result (List[str]): A list of 'normalized' company names, i.e., processed/cleaned names.
    """
    result: List[str]
    normalized_result: List[str]

grammar: Dict[str, Any] = {
    "type": "json_object",
    "value": Companies.schema()
}
We also set generation parameters:
max_tokens: int = 512
temperature: float = 0.1
Now we define our functions to send requests to the endpoint and parse output:
async def llm_engine(messages: List[Dict[str, str]]) -> str:
    """
    Send a request to the LLM endpoint and get a response.

    Args:
        messages (List[Dict[str, str]]): A list of messages to pass to the LLM.

    Returns:
        str: The content of the response message, or 'failed' in case of an error.
    """
    try:
        response = await client.chat_completion(
            messages=messages,
            model="ENDPOINT",
            temperature=temperature,
            response_format=grammar,
            max_tokens=max_tokens
        )
        answer: str = response.choices[0].message.content
        return answer
    except Exception as e:
        print(f"Error in LLM engine: {e}")
        return "failed"
import json

def parse_llm_output(output_str: str) -> Dict[str, Any]:
    """
    Parse the JSON-like output string from the LLM into a dictionary.

    Args:
        output_str (str): The string output from the LLM, expected to be in JSON format.

    Returns:
        Dict[str, Any]: A dictionary parsed from the input JSON string, with a 'valid' flag.
    """
    try:
        result_dict: Dict[str, Any] = json.loads(output_str)
        result_dict["valid"] = True
        return result_dict
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
        return {
            "result": [output_str],
            "normalized_result": [output_str],
            "valid": False
        }
We test the endpoint with a single example:
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": "Some stocks i like buying are AAPL, GOOG, AMZN, META"}
]
response = await llm_engine(messages)
print(parse_llm_output(response))
{"normalized_result": ["Apple", "Alphabet", "Amazon", "Meta Platforms"], "result": ["AAPL", "GOOG", "AMZN", "META"]}
Now, we create a process_batch function to handle sending requests in manageable batches, preventing the API endpoint from becoming overwhelmed or hitting rate limits. This batching approach allows us to efficiently process multiple requests concurrently without saturating the server, reducing the risk of timeouts, rejected requests, or throttling. By controlling the flow of requests, we ensure stable performance, faster response times, and easier error handling while maximizing throughput.
import asyncio

async def process_batch(batch):
    """
    Get the model output for a batch of samples.

    This function processes a batch of samples by sending them to the LLM and
    gathering the results concurrently.

    Args:
        batch (List[Dict[str, str]]): A list of dictionaries where each dictionary
            contains the data for a single sample, including an "Article_title".

    Returns:
        List[str]: A list of responses from the LLM for each sample in the batch.
    """
    list_messages = []
    for sample in batch:
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": sample["Article_title"]}
        ]
        list_messages.append(messages)
    return await asyncio.gather(*[llm_engine(messages) for messages in list_messages])
We will run inference over the dataset:
from datasets import load_dataset, Dataset

dataset = load_dataset("Zihan1004/FNSPID", streaming=True)
iterable_dataset = iter(dataset["train"])

def get_sample_batch(iterable_dataset, batch_size):
    batch = []
    try:
        for _ in range(batch_size):
            batch.append(next(iterable_dataset))
    except StopIteration:
        pass
    return batch
And we create the main inference loop:
import os

batch_size = 128
i = 0
len_extracted = 0

while True:
    batch = get_sample_batch(iterable_dataset, batch_size)
    predictions = await process_batch(batch)
    parsed_predictions = [parse_llm_output(_pred) for _pred in predictions]
    try:
        parsed_dataset = [
            {"Article_title": sample["Article_title"],
             "predicted_companies": pred["result"],
             "normalized_companies": pred.get("normalized_result", ""),
             "valid": pred["valid"]} for sample, pred in zip(batch, parsed_predictions)
        ]
    except Exception as e:
        print(i, e)
        continue
    # CHECKPOINT_DATA is the directory where intermediate results are saved
    with open(os.path.join(CHECKPOINT_DATA, f"parsed_dataset_{i}.json"), 'w') as json_file:
        json.dump(parsed_dataset, json_file, indent=4)
    len_extracted += len(parsed_dataset)
    i += 1
    print(f"Extracted: {len_extracted} samples")
    if len(batch) < batch_size:
        break
While the inference is running, we can monitor the traffic directly from the UI.
Endpoint analytics
It took about 8 hours to process the entire dataset of 900k samples, which cost ~$70.
Review predictions with Argilla
With the labeled data generated by our LLM, the next step is to curate a high-quality subset to ensure reliable evaluation of the different methods, including zero-shot and fine-tuning approaches. This carefully reviewed dataset will also serve as a foundational base for fine-tuning smaller models. The complete LLM-labeled dataset can also be used for fine-tuning in a weak supervision framework.
Dataset sampling and splitting
To create a manageable sample size for review, we used the LLM labels and clustered company names using fuzzy matching with the rapidfuzz library. We applied the cdist function to compute Levenshtein distances between companies and clustered them with a threshold of 85. A representative company was chosen for each cluster, and Llama predictions were mapped accordingly. Finally, we sampled 5% of news headlines from each company-related cluster and 5% from headlines without any companies, resulting in a dataset of 2,714 samples.
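As an illustration of this step, here is a minimal clustering sketch built on rapidfuzz, using fuzz.ratio (a normalized similarity on a 0-100 scale) as the scorer and a simple greedy grouping; the authors' exact scorer and clustering routine are not shown in the post, and the company list is hypothetical.
from rapidfuzz import fuzz
from rapidfuzz.process import cdist

# Hypothetical list of unique company names predicted by Llama
companies = ["Apple Inc", "Apple Inc.", "Tesla", "Tesla Motors", "Microsoft"]

# Pairwise similarity matrix (0-100)
similarity = cdist(companies, companies, scorer=fuzz.ratio)

# Greedy clustering: group names whose similarity exceeds the threshold,
# e.g. near-duplicates like "Apple Inc" / "Apple Inc." fall into the same cluster
threshold = 85
clusters, assigned = [], set()
for i in range(len(companies)):
    if i in assigned:
        continue
    members = [j for j in range(len(companies))
               if j not in assigned and similarity[i, j] >= threshold]
    assigned.update(members)
    clusters.append([companies[j] for j in members])

# The first member of each cluster serves as its representative
representatives = {name: cluster[0] for cluster in clusters for name in cluster}
print(clusters)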
Then, using metadata, the sampled dataset is split into three parts (a minimal splitting sketch follows the list):
- Train: News from 2010 to 2016, used for training small ML models – 2,405 samples
- Validation: News from 2017 to 2018, used for hyperparameter tuning – 204 samples
- Test: News from 2019 to 2020, used to evaluate the model on unseen data – 105 samples
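A minimal sketch of this date-based split, assuming the sampled headlines are in a pandas DataFrame with "Date" and "Article_title" columns (the column names are assumptions; train_df, valid_df, and test_df are reused below when populating Argilla):
import pandas as pd

# Stand-in for the 2,714 sampled headlines; column names are assumed
sampled_df = pd.DataFrame({
    "Article_title": ["Headline A", "Headline B", "Headline C"],
    "Date": ["2012-05-01", "2017-08-15", "2019-11-30"],
})
sampled_df["year"] = pd.to_datetime(sampled_df["Date"]).dt.year

train_df = sampled_df[sampled_df["year"].between(2010, 2016)]  # training split
valid_df = sampled_df[sampled_df["year"].between(2017, 2018)]  # validation split
test_df = sampled_df[sampled_df["year"].between(2019, 2020)]   # test split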
Once these splits were created, we set up an annotation tool, Argilla, to ease the annotation process.
Argilla is an open-source tool, integrated into the Hugging Face ecosystem, that excels at gathering human feedback for a wide range of AI projects. Whether you are working on traditional NLP tasks like text classification and NER, fine-tuning large language models (LLMs) for retrieval-augmented generation (RAG) or preference tuning, or developing multimodal models like text-to-image systems, Argilla provides the tools needed to gather and refine feedback efficiently. This ensures your models continuously improve based on high-quality, human-validated data.
An Argilla interface can be set up directly through Hugging Face Spaces, which is what we opted for. Check out the documentation to start your own interface and go to https://huggingface.co/new-space.
Create Argilla Space on the Hub
Argilla homepage on Spaces
Argilla datasets view
Once the interface is created, we can connect to it programmatically using the Argilla Python SDK. To get ready to annotate, we followed these steps:
- We connect to our interface using the credentials provided in the settings.
import argilla as rg

client = rg.Argilla(
    api_url="https://capitalfundmanagement-argilla-ner.hf.space",
    api_key="xxx",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    verify=False)
- We create the guidelines for the annotation and generate the task-specific datasets. Here, we specify the SpanQuestion task. We then create train, validation, and test dataset objects with the defined settings.
import argilla as rg

labels = ["Company"]

settings = rg.Settings(
    guidelines="Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.",
    fields=[rg.TextField(name="text", title="Text", use_markdown=False)],
    questions=[rg.SpanQuestion(
        name="span_label",
        field="text",
        labels=labels,
        title="Classify the tokens according to the specified categories.",
        allow_overlapping=False,
    )],
)

train_dataset = rg.Dataset(name="train", settings=settings)
train_dataset.create()
valid_dataset = rg.Dataset(name="valid", settings=settings)
valid_dataset.create()
test_dataset = rg.Dataset(name="test", settings=settings)
test_dataset.create()
- We populate the datasets with news headlines from the different splits.
train_records = [rg.Record(fields={"text": title}) for title in train_df["Article_title"]]
valid_records = [rg.Record(fields={"text": title}) for title in valid_df["Article_title"]]
test_records = [rg.Record(fields={"text": title}) for title in test_df["Article_title"]]
train_records_list = [{"id": record.id, "text": record.fields["text"]} for record in train_records]
valid_records_list = [{"id": record.id, "text": record.fields["text"]} for record in valid_records]
test_records_list = [{"id": record.id, "text": record.fields["text"]} for record in test_records]
train_dataset.records.log(train_records)
valid_dataset.records.log(valid_records)
test_dataset.records.log(test_records)
- In this step, we incorporate predictions from various models. Specifically, we add the Llama 3.1 predictions into Argilla, where each entity is represented as a dictionary containing a start index, an end index, and a label (in this case, ‘Company’).
train_data = [{"span_label": entity,"id": id,} for id, entity in zip(train_ids, train_entities_final)]
valid_data = [{"span_label": entity,"id": id,} for id, entity in zip(valid_ids, valid_entities_final)]
test_data = [{"span_label": entity,"id": id,} for id, entity in zip(test_ids, test_entities_final)]
train_dataset.records.log(records=train_data, batch_size = 1024)
valid_dataset.records.log(records=valid_data, batch_size = 1024)
test_dataset.records.log(records=test_data, batch_size = 1024)
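The train_entities_final, valid_entities_final, and test_entities_final lists above hold the span dictionaries built from the Llama predictions. A minimal sketch of one way to construct them, using a hypothetical helper that locates each predicted company in its headline (the authors' exact conversion logic is not shown):
def to_span_entities(title: str, companies: list) -> list:
    """Turn predicted company strings into Argilla span dictionaries."""
    entities = []
    for company in companies:
        start = title.find(company)
        if start == -1:  # prediction not found verbatim in the headline
            continue
        entities.append({"start": start, "end": start + len(company), "label": "Company"})
    return entities

# Example for one headline
print(to_span_entities("Dun & Bradstreet Acquires Avention For $150M",
                       ["Dun & Bradstreet", "Avention"]))
# [{'start': 0, 'end': 16, 'label': 'Company'}, {'start': 26, 'end': 34, 'label': 'Company'}]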
Argilla annotation view
The annotation interface displays the text to be annotated, along with its status (either pending or submitted). Annotation guidelines are presented on the right side of the screen. In this case, we have one label, ‘Company’. To annotate, we first select the label, then highlight the relevant text in the sentence. Once all entities are selected, we click ‘submit’ to finalize the annotation.
Duration of annotation
Using pre-computed Llama labels significantly accelerates the annotation process, reducing the time per sample to only 5 to 10 seconds, compared to roughly 30 seconds for raw, unprocessed samples. This efficiency results in considerable time savings, allowing us to complete the annotation of 2,714 samples in around 8 hours. The gains are even more pronounced for tasks more complex than NER, where the time saved with pre-computed labels or generations becomes significantly greater.
Performance of zero-shot approaches for financial NER
With a high-quality, reviewed dataset in place, we can now experiment with different approaches for zero-shot NER. We tested 4 models:
Small Language Models:
- GLiNER
- SpanMarker

Large Language Models (LLMs):
- Llama-3.1-8b
- Llama-3.1-70b
- GLiNER
GLiNER is a compact, versatile NER model that leverages bidirectional transformers like BERT to identify a wide range of entity types, overcoming the limitations of traditional models that are restricted to predefined entities. Unlike large autoregressive models, GLiNER treats NER as a task of matching entity types with spans in text, using parallel processing for efficiency. It offers a practical and resource-efficient alternative to LLMs, delivering strong performance in zero-shot scenarios without the high computational costs associated with larger models.
GLiNER architecture from the original paper
GLiNER offers three model variants:
- GLiNER-Small-v2.1 (50M parameters)
- GLiNER-Medium-v2.1 (90M parameters)
- GLiNER-Large-v2.1 (0.3B parameters)
Compared to LLMs like Llama-3.1-70b, GLiNER models are more compact, cost-effective, and efficient for NER tasks, while LLMs generally offer broader flexibility but are much larger and more resource-intensive. GLiNER medium can be tried on the Hugging Face Space: https://huggingface.co/spaces/tomaarsen/gliner_medium-v2.1. For our experiments, we focused on a specific GLiNER variant, EmergentMethods/gliner_medium_news-v2.1, which has already been fine-tuned on EmergentMethods/AskNews-NER-v0 and aims at improving accuracy across a wide range of topics, particularly for long-context news entity recognition. To use GLiNER, you can install the gliner package, built on top of the Hugging Face transformers library.
!pip install gliner
Then using GLiNER for NER is as easy as:
from gliner import GLiNER

model = GLiNER.from_pretrained("EmergentMethods/gliner_large_news-v2.1")

text = """
EMCOR Group Company Awarded Contract for Installation of Process Control Systems for Orange County Water District's Water Purification Facilities
"""

labels = ["company"]

entities = model.predict_entities(text, labels, threshold=.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
Output
"EMCOR Group => company"
For a list of samples, the batch_predict_entities method can be used:
from gliner import GLiNER

model = GLiNER.from_pretrained("EmergentMethods/gliner_large_news-v2.1")

batch_text = [
    "EMCOR Group Company Awarded Contract for Installation of Process Control Systems for Orange County Water District's Water Purification Facilities",
    "Integra Eyes Orthopedics Company - Analyst Blog"
]

labels = ["company"]

batch_entities = model.batch_predict_entities(batch_text, labels, threshold=.5)

for entities in batch_entities:
    for entity in entities:
        print(entity["text"], "=>", entity["label"])
Output:
"EMCOR Group => company" #correct prediction
"Integra Eyes Orthopedics Company => company" #incorrect prediction, ground truth: Integra
The zero-shot result, in terms of F1-score on the annotated dataset of 2,714 samples curated earlier, is 87%.
The GLiNER model performs well in extracting company names from text but struggles with certain cases, such as when companies are mentioned as stock symbols. It also misclassifies general references to stock industries, like “Healthcare stocks” or “Industrial stocks,” as company names. While effective in many cases, these errors highlight areas where further refinement is needed to improve accuracy in distinguishing between companies and broader industry terms.
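The F1-scores reported in this section are entity-level. As an illustration only (not the authors' evaluation code), such a score can be computed from predicted and reviewed entity lists roughly as follows:
def entity_f1(predictions, references):
    """Illustrative entity-level F1: each sample is a list of entity strings."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # true positives
        fp += len(pred - ref)   # predicted but not annotated
        fn += len(ref - pred)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: two headlines, model predictions vs. human-reviewed entities
print(entity_f1([["EMCOR Group"], ["Integra Eyes Orthopedics Company"]],
                [["EMCOR Group"], ["Integra"]]))  # 0.5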
- SpanMarker
SpanMarker is a framework for training powerful NER models using familiar encoders such as BERT, RoBERTa, and DeBERTa. Tightly implemented on top of the 🤗 Transformers library, SpanMarker can take good advantage of it. As a result, SpanMarker will be intuitive to use for anyone familiar with Transformers. We selected the variant tomaarsen/span-marker-bert-base-orgs, trained on the FewNERD, CoNLL2003, and OntoNotes v5 datasets, which can be used for NER. This SpanMarker model uses bert-base-cased as the underlying encoder and is trained specifically to recognize organizations. It can be used for inference to predict an ORG (organization) label as follows:
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-orgs")

entities = model.predict("Enernoc Acquires Energy Response; Anticipates Deal Will Add to 2013 EPS")

for entity in entities:
    print(entity["span"], "=>", entity["label"])
Output:
"Enernoc => ORG" #correct prediction
"Energy Response => ORG" #incorrect prediction
The zero-shot result, in terms of F1-score on the annotated dataset of 2,714 samples curated earlier, is 47%. The SpanMarker model performs well in extracting accurate company names but tends to identify too many incorrect entities. This is due to its training on the broader “organization” category, which includes entities such as non-profit organizations, government bodies, educational institutions, and sports teams. As a result, it sometimes confuses these with company names, leading to over-extraction and less precise results in certain contexts.
- Llama3.1-8b and Llama3.1-70b
We tested two variants of the Llama 3.1 model, including the 70b variant that we used to curate the ground-truth examples. We used the prompt presented above. On our annotated subset we have the following results:
Performance Recap
In this experiment, we compared the performance of small models like GLiNER and SpanMarker against LLMs such as Llama 3.1-8b and Llama 3.1-70b. Small models like GLiNER (87% F1) provide a good balance between accuracy and computational efficiency, making them ideal for resource-constrained scenarios. In contrast, LLMs, while more resource-intensive, deliver higher accuracy, with Llama 3.1-70b achieving a 95% F1-score. This highlights the trade-off between performance and efficiency when choosing between small models and LLMs for NER tasks. Let’s now see how the performance differs when we fine-tune compact models.
Improving the performance of compact models with fine-tuning on LLM-assisted labeled dataset
Fine-tuning
Using our train/validation/test subsets created earlier, we fine-tuned GLiNER and SpanMarker on an AWS instance with a single Nvidia A10 GPU. A fine-tuning example for GLiNER is available here and for SpanMarker here. Fine-tuning these models is as simple as running the following, with train_dataset and valid_dataset created as Hugging Face datasets.
import numpy as np
from gliner import GLiNER
from gliner.training import Trainer, TrainingArguments
from gliner.data_processing.collator import DataCollator

batch_size = 4
learning_rate = 5e-6
num_epochs = 20

model = GLiNER.from_pretrained("EmergentMethods/gliner_medium_news-v2.1")
data_collator = DataCollator(model.config, data_processor=model.data_processor, prepare_labels=True)

# create_log_dir and custom_compute_metrics are helper functions defined elsewhere
log_dir = create_log_dir(model_path, model_name, learning_rate, batch_size, size, model_size, timestamp=False)

training_args = TrainingArguments(
    output_dir=log_dir,
    logging_dir=log_dir,
    overwrite_output_dir=True,
    learning_rate=learning_rate,
    weight_decay=0.01,
    others_lr=1e-5,
    others_weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    focal_loss_alpha=0.75,
    focal_loss_gamma=2,
    num_train_epochs=num_epochs,
    save_strategy="epoch",
    save_total_limit=2,
    metric_for_best_model="valid_f1",
    use_cpu=False,
    report_to="tensorboard",
    logging_steps=100,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=model.data_processor.transformer_tokenizer,
    data_collator=data_collator,
    compute_metrics=custom_compute_metrics,
)

trainer.train()
The training ran for 20 epochs, and we saved the checkpoint with the best F1 score on the validation set.
Performance comparison
In this updated comparison, we evaluated models based on their F1-scores after fine-tuning. The GLiNER-medium-news model improved from 87.0% to 93.4% after fine-tuning, showing significant gains in accuracy. Similarly, SpanMarker went from 47.0% to 90.1% with fine-tuning, making it much more competitive. Meanwhile, Llama 3.1-8b and Llama 3.1-70b performed well out of the box, scoring 80.0% and 92.7%, respectively, without fine-tuning. This comparison emphasizes that fine-tuning smaller models like GLiNER and SpanMarker can dramatically enhance performance, rivaling larger LLMs at a lower computational cost.
The Llama 3.1-70b model costs at least $8 per hour for inference, making it significantly more expensive than compact models, which can run on a GPU instance costing around $0.50 per hour (16 times cheaper). Moreover, compact models can even be deployed on CPU instances, costing as little as ~$0.10 per hour, which is 80 times cheaper. This highlights the substantial cost advantage of smaller models, especially in resource-constrained environments, without sacrificing competitive performance when fine-tuned for specific tasks.
Weak Supervision vs. LLM-assisted labeling
In this experiment, we explored two key approaches to data labeling for NER: weak supervision and LLM-assisted labeling. While weak supervision enables scalable training on synthetic data, our findings suggest that it cannot achieve the same level of accuracy as models trained on manually annotated data. For 1,000 samples, manual annotation took 3 hours and yielded an F1 score of 0.915, while Llama 3.1-70b inference took only 2 minutes but resulted in a slightly lower F1 score of 0.895. The trade-off between speed and accuracy depends on the task’s requirements.
Weak Supervision vs. LLM-assisted labeling
The graph compares the performance of GLiNER fine-tuned on human-annotated data versus synthetic data inferred by Llama-3.1 70b across various dataset sizes. The blue dots represent F1-scores of models trained on human-annotated data, while the red dots represent those trained on data inferred by Llama-3.1 70b. As dataset size increases, models fine-tuned on human annotations consistently outperform those using synthetic data, achieving higher F1-scores. This illustrates that while models fine-tuned on human annotations yield higher accuracy, LLM-assisted labeling using Llama-3.1 70b can still provide considerable value, especially when resources for manual annotation are limited. Although the F1-scores from LLM-inferred data are slightly lower, they remain competitive across various dataset sizes. LLMs can rapidly generate large volumes of annotations, offering a practical solution for scaling dataset creation efficiently, making them useful in scenarios where time and cost constraints are critical.
Conclusion
Our experiment demonstrated that while large models like Llama 3.1 provide superior performance out of the box, fine-tuning smaller models like GLiNER and SpanMarker with LLM-assisted labeling can significantly enhance accuracy, rivaling the LLMs at a fraction of the cost. This approach highlights how investing in fine-tuning small models using LLM insights provides a cost-effective, scalable solution for financial NER tasks, making it ideal for real-world applications where both accuracy and resource efficiency are crucial.