from datetime import datetime

import pytz

# Number of few-shot examples shown to each model during evaluation.
NUM_FEWSHOT = 0

# Philippine Time (PHT), used to timestamp the last leaderboard restart.
manila_tz = pytz.timezone("Asia/Manila")
current_time = datetime.now(manila_tz).strftime("%H:%M %Z, %d %b %Y")

# Leaderboard general information
TOP_TEXT = f"""
# FilBench: An Open LLM Leaderboard for Filipino

[Code](https://github.com/filbench/filbench-eval) | [Paper](https://arxiv.org/abs/2508.03523) | Total Models: {{}} | Last restart (PHT): {current_time}<br/>
📥: Indicates model submissions from the community. If you wish to submit your model evaluations, please check our instructions on [GitHub](https://github.com/filbench/filbench-eval).
"""

# Leaderboard reproducibility
LLM_BENCHMARKS_TEXT = """
**FilBench** is a comprehensive evaluation benchmark for Filipino. We curate 12 sub-tasks across 4 major categories (Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation) and evaluate several models to assess their Filipino-centric capabilities.

## Overview

The overall FilBench score is the average of four core sections, weighted by the number of instances in each (a rough sketch of this weighting follows the list):

1. **Cultural Knowledge:** Includes instances that measure the cultural understanding capabilities of LLMs.
2. **Classical NLP:** Contains questions on standard NLP tasks such as text classification and named-entity recognition.
3. **Reading Comprehension:** Contains more focused natural language understanding (NLU) tasks and questions from readability benchmarks.
4. **Generation:** Contains instances for natural language generation (NLG), focused mostly on translation.
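
As a rough sketch of this instance-weighted averaging (the section scores and instance counts below are placeholders for illustration only, not actual FilBench results; the real aggregation lives in the evaluation code):

```python
# Placeholder (section score, instance count) pairs -- illustrative only.
sections = {
    "Cultural Knowledge": (70.0, 4_000),
    "Classical NLP": (65.0, 3_000),
    "Reading Comprehension": (60.0, 2_500),
    "Generation": (55.0, 1_500),
}

total_instances = sum(count for _, count in sections.values())
filbench_score = (
    sum(score * count for score, count in sections.values()) / total_instances
)
print(f"Instance-weighted FilBench score: {filbench_score:.2f}")
```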

## Evaluation Runner

We use our own fork of [lighteval](https://github.com/filbench/lighteval) to perform evaluations.
We highly recommend using the vLLM backend for faster inference.
Run sequentially, a full FilBench evaluation takes about 4.93 hours on 2 NVIDIA H100 GPUs.
However, the evaluation suite can be parallelized per benchmark, in which case the longest-running task takes approximately 1 hour and 28 minutes and the shortest only 5.86 minutes.
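
Because each benchmark runs independently, the per-benchmark jobs can be distributed with any scheduler. A minimal sketch, assuming a hypothetical `run_benchmark` wrapper around the lighteval invocation (the task names below are placeholders, not actual FilBench task identifiers):

```python
from concurrent.futures import ProcessPoolExecutor


def run_benchmark(task_name: str) -> float:
    # Hypothetical wrapper: in a real run this would shell out to the
    # lighteval fork for one FilBench task and parse its score.
    return 0.0  # placeholder score


if __name__ == "__main__":
    tasks = ["placeholder_task_1", "placeholder_task_2", "placeholder_task_3"]
    # Each task runs in its own process, so wall-clock time is bounded by the
    # longest-running benchmark rather than the sum of all of them.
    with ProcessPoolExecutor(max_workers=len(tasks)) as pool:
        scores = dict(zip(tasks, pool.map(run_benchmark, tasks)))
    print(scores)
```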

To evaluate your model on FilBench and have it appear on the leaderboard, please follow the steps in our [GitHub repository](https://github.com/filbench/filbench-eval).

## Contact

This work was done by Lj V. Miranda ([@ljvmiranda921](https://github.com/ljvmiranda921)), Elyanah Aco ([@elyanah-aco](https://github.com/elyanah-aco)), Conner Manuel ([@connermanuel](https://github.com/connermanuel)), Blaise Cruz ([@jcblaisecruz02](https://github.com/jcblaisecruz02)), and Joseph Imperial ([@imperialite](https://github.com/imperialite)).
For any questions, please reach out to us via filbench-eval@googlegroups.com or through our [GitHub Issues](https://github.com/filbench/filbench-eval/issues).

## Acknowledgements

We would like to thank [Cohere Labs](https://cohere.com/research) for providing credits through the [Cohere Research Grant](https://cohere.com/research/grants) to run the Aya model series, and [Together AI](https://together.ai) for providing additional compute credits to run several open models.
We also acknowledge the Hugging Face team, particularly the OpenEvals team (Clémentine Fourrier [@clefourrier](https://github.com/clefourrier) and Nathan Habib [@NathanHB](https://github.com/NathanHB)) and Daniel van Strien [@davanstrien](https://github.com/davanstrien), for their support in publishing the FilBench blog post.
"""

# Citation information
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@article{miranda2025filbench,
  title={Fil{B}ench: {C}an {LLM}s {U}nderstand and {G}enerate {F}ilipino?},
  author={Miranda, Lester James V and Aco, Elyanah and Manuel, Conner and Cruz, Jan Christian Blaise and Imperial, Joseph Marvin},
  journal={arXiv preprint arXiv:2508.03523},
  year={2025}
}
"""