🇵🇭 FilBench – Can LLMs Understand and Generate Filipino?




As large language models (LLMs) become increasingly integrated into our lives, it becomes crucial to evaluate whether they reflect the nuances and capabilities of specific language communities.
For instance, Filipinos are among the most active ChatGPT users globally, ranking fourth in ChatGPT traffic (behind the USA, India, and Brazil [1] [2]). Yet despite this strong usage, we lack a clear understanding of how LLMs perform for their languages, such as Tagalog and Cebuano.
Most of the existing evidence is anecdotal, such as screenshots of ChatGPT responding in Filipino offered as proof that it is fluent.
What we need instead is a systematic evaluation of LLM capabilities in Philippine languages.

That’s why we developed FilBench: a comprehensive evaluation suite for assessing the capabilities of LLMs in Tagalog, Filipino (the standardized form of Tagalog), and Cebuano across fluency, linguistic and translation abilities, as well as specific cultural knowledge.

We used it to evaluate 20+ state-of-the-art LLMs, providing a comprehensive assessment of their performance in Philippine languages.



FilBench

The FilBench evaluation suite comprises four major categories (Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation), divided into 12 tasks.
For instance, the Classical NLP category includes tasks such as sentiment analysis, while the Generation category covers different aspects of translation.
To ensure that these categories reflect the priorities and trends in NLP research and usage, we curated them based on a historical survey of NLP research on Philippine languages from 2006 to early 2024.
(Most of these categories exclusively contain non-translated content to ensure faithfulness to the natural use of Philippine languages.)

  • Cultural Knowledge: This category tests a language model’s ability to recall factual and culturally specific information. For Cultural Knowledge, we curated a variety of examples that test an LLM’s regional and factual knowledge (Global-MMLU), Filipino-centric values (KALAHI), and ability to disambiguate word sense (StingrayBench).
  • Classical NLP: This category encompasses a variety of information extraction and linguistic tasks, such as named entity recognition, sentiment analysis, and text categorization, that were traditionally performed by specialized, trained models. In this category, we include instances from CebuaNER, TLUnified-NER, and Universal NER for named entity recognition, and subsets of SIB-200 and BalitaNLP for text categorization and sentiment analysis.
  • Reading Comprehension: This category evaluates a language model’s ability to understand and interpret Filipino text, focusing on tasks such as readability, comprehension, and natural language inference. For this category, we include instances from the Cebuano Readability Corpus, Belebele, and NewsPH NLI.
  • Generation: We dedicate a large portion of FilBench to testing an LLM’s capability to faithfully translate texts, either from English to Filipino or from Cebuano to English. We include a diverse set of test examples ranging from documents (NTREX-128) to realistic texts from volunteers (Tatoeba) and domain-specific text (TICO-19).

Each of these categories provides an aggregated metric.
To create a single representative score, we compute the weighted average based on the number of examples in each category, which we call the FilBench Score.
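
To make the aggregation concrete, here is a minimal sketch of the weighted average in Python; the category sizes and scores below are illustrative placeholders, not the actual FilBench statistics.

```python
# Minimal sketch: the FilBench Score as an example-weighted average of
# per-category scores. All numbers are illustrative placeholders.
category_scores = {
    "cultural_knowledge": 62.0,
    "classical_nlp": 55.0,
    "reading_comprehension": 70.0,
    "generation": 41.0,
}
category_sizes = {  # number of examples per category (hypothetical)
    "cultural_knowledge": 1000,
    "classical_nlp": 1500,
    "reading_comprehension": 800,
    "generation": 1200,
}

total_examples = sum(category_sizes.values())
filbench_score = sum(
    category_scores[cat] * category_sizes[cat] for cat in category_scores
) / total_examples
print(f"FilBench Score: {filbench_score:.2f}")
```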

To simplify usage and setup, we built FilBench on top of Lighteval, an all-in-one framework for LLM evaluation.
For language-specific evaluation, we first defined translation pairs from English to Tagalog (or Cebuano) for common terms used in evaluation, such as “yes” (oo), “no” (hindi), and “true” (totoo), among others.
Then, we used the provided templates to implement custom tasks for the capabilities we care about, as sketched below.
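
The sketch below shows roughly what such a custom task can look like. The dataset path, field names, and task name are hypothetical, and the `LightevalTaskConfig` arguments follow Lighteval’s community-task template, so exact parameter names (e.g., `metric` vs. `metrics`) may differ across Lighteval versions.

```python
# Minimal sketch of a Lighteval community task for a hypothetical
# Filipino yes/no dataset. Answer choices use the Tagalog translation
# pairs described above. Not FilBench's actual task definitions.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

# English evaluation terms mapped to their Tagalog equivalents.
TAGALOG_CHOICES = {"yes": "oo", "no": "hindi"}

def prompt_fn(line, task_name: str = None):
    # Build one evaluation document; the model picks among Tagalog choices.
    return Doc(
        task_name=task_name,
        query=line["question"],                # hypothetical field name
        choices=list(TAGALOG_CHOICES.values()),
        gold_index=line["label"],              # 0 -> "oo", 1 -> "hindi"
    )

TASK = LightevalTaskConfig(
    name="filbench_yesno_tl",                  # hypothetical task name
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="user/filipino-yesno",             # hypothetical dataset repo
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    metric=[Metrics.loglikelihood_acc],
)

# Community-task files expose their tasks through this table.
TASKS_TABLE = [TASK]
```

Once registered in a community-task file like this, the task can be selected through Lighteval’s `community` suite like any built-in benchmark.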

FilBench is now available as a set of community tasks in the official Lighteval repository!



What did we learn from FilBench?

By evaluating a range of LLMs on FilBench, we uncovered several insights into how they perform in Filipino.



Finding #1: Although region-specific LLMs still lag behind GPT-4, collecting data to train these models is still a promising direction

In the past few years, we have seen a rise in region-specific LLMs that focus on Southeast Asian languages (SEA-specific), such as SEA-LION and SeaLLM.
These are open-weight LLMs that you can freely download from Hugging Face.
We find that SEA-specific LLMs are often the most parameter-efficient for these languages, achieving the highest FilBench scores compared to other models of their size.
However, the best SEA-specific model is still outperformed by closed-source LLMs like GPT-4o.

Building region-specific LLMs still makes sense, as we observe performance gains of 2-3% when continually fine-tuning a base LLM on SEA-specific instruction-tuning data.
This suggests that efforts to curate Filipino/SEA-specific training data for fine-tuning remain relevant, as they can lead to better performance on FilBench.



Finding #2: Filipino translation remains a difficult task for LLMs

We also observe that across the four categories of FilBench, most models struggle with Generation capabilities.
Upon inspecting failure modes in Generation, we find that these include cases where the model fails to follow translation instructions, generates overly verbose text, or hallucinates another language instead of Tagalog or Cebuano.



Finding #3: Open LLMs Remain a Cost-Effective Alternative for Filipino Language Tasks

The Philippines tends to have limited internet infrastructure and lower average incomes [3], necessitating accessible LLMs that are cost- and compute-efficient.
Through FilBench, we were able to identify LLMs that sit on the Pareto frontier of efficiency.

In general, we find that open-weight LLMs are significantly cheaper than commercial models without sacrificing performance.
If you need an alternative to GPT-4o for your Filipino language tasks, try Llama 4 Maverick!
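
For readers who want to reproduce this kind of analysis, here is a minimal sketch of how a cost-versus-score Pareto frontier can be computed; the model names and (cost, score) pairs are placeholders, not actual FilBench results.

```python
# Minimal sketch: find models on the cost/performance Pareto frontier.
# All (cost, score) values are hypothetical placeholders.
models = {
    "model-a": (0.50, 61.0),   # (USD per eval run, FilBench Score)
    "model-b": (4.00, 72.5),
    "model-c": (1.20, 58.0),
    "model-d": (0.80, 68.0),
}

def pareto_frontier(entries):
    """Keep models for which no other model is both cheaper and better."""
    frontier = []
    for name, (cost, score) in entries.items():
        dominated = any(
            other_cost <= cost
            and other_score >= score
            and (other_cost, other_score) != (cost, score)
            for other_name, (other_cost, other_score) in entries.items()
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))  # ['model-a', 'model-b', 'model-d']
```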

We also make this information available in the FilBench leaderboard, hosted as a Hugging Face Space.



Does your LLM work on Philippine Languages? Try it on FilBench!

We hope that FilBench provides deeper insights into LLM capabilities for Philippine languages and serves as a catalyst for advancing Filipino NLP research and development.
The FilBench evaluation suite is built on top of Hugging Face’s lighteval, allowing LLM developers to easily evaluate their models on our benchmark.
For more information, please visit the links below:



Acknowledgements

The authors would like to thank Cohere Labs for providing credits through the Cohere Research Grant to run the Aya model series, and Together AI for additional computational credits for running several open models.
We also acknowledge the Hugging Face team, particularly the OpenEvals team (Clémentine Fourrier and Nathan Habib) and Daniel van Strien, for their support in publishing this blog post.



Citation

If you are evaluating on FilBench, please cite our work:

@article{filbench,
  title={Fil{B}ench: {C}an {LLM}s {U}nderstand and {G}enerate {F}ilipino?},
  author={Miranda, Lester James V and Aco, Elyanah and Manuel, Conner and Cruz, Jan Christian Blaise and Imperial, Joseph Marvin},
  journal={arXiv preprint arXiv:2508.03523},
  year={2025}
}


