from datetime import datetime

import pytz

NUM_FEWSHOT = 0  # number of few-shot examples used during evaluation

# Philippine time (PHT), used for the "Last restart" timestamp in the header.
manila_tz = pytz.timezone("Asia/Manila")
current_time = datetime.now(manila_tz).strftime("%H:%M %Z, %d %b %Y")
# Leaderboard general information
TOP_TEXT = f"""
# FilBench: An Open LLM Leaderboard for Filipino
[Code](https://github.com/filbench/filbench-eval) | [Paper](https://arxiv.org/abs/2508.03523) | Total Models: {{}} | Last restart (PHT): {current_time}<br/>
📥: Indicates a model submission from the community. To submit your model evaluations, please check our instructions on [GitHub](https://github.com/filbench/filbench-eval).
"""
# Leaderboard reproducibility
LLM_BENCHMARKS_TEXT = """
**FilBench** is a comprehensive evaluation benchmark for Filipino. We curate 12 sub-tasks across four major categories (Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation) and evaluate several models to understand their Filipino-centric capabilities.

## Overview

The overall score is the average of four core sections, weighted by the number of instances in each (a small sketch of this weighting follows the list):

1. **Cultural Knowledge:** Instances that measure the cultural-understanding capabilities of LLMs.
2. **Classical NLP:** Questions on standard NLP tasks such as text classification and named-entity recognition.
3. **Reading Comprehension:** More focused natural language understanding (NLU) tasks and questions from readability benchmarks.
4. **Generation:** Instances for natural language generation (NLG), with a focus on translation.
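
As a rough illustration of this weighting (a minimal sketch, not the leaderboard's actual scoring code; the per-category scores and instance counts below are placeholders), the overall FilBench score is the instance-weighted average of the per-category scores:

```python
# Instance-weighted average across the four categories.
# Scores and instance counts are placeholder values for illustration only.
category_scores = {
    "Cultural Knowledge": 0.62,
    "Classical NLP": 0.71,
    "Reading Comprehension": 0.68,
    "Generation": 0.55,
}
category_sizes = {
    "Cultural Knowledge": 1500,
    "Classical NLP": 2200,
    "Reading Comprehension": 1800,
    "Generation": 900,
}

total_instances = sum(category_sizes.values())
filbench_score = (
    sum(category_scores[name] * category_sizes[name] for name in category_scores)
    / total_instances
)
```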

## Evaluation Runner

We use our own fork of [lighteval](https://github.com/filbench/lighteval) to perform evaluations.
We highly recommend using the vLLM backend for faster inference.
Run sequentially, a full FilBench evaluation takes about 4.93 hours on 2 NVIDIA H100 GPUs.
However, the suite can be parallelized per benchmark: the longest-running task takes approximately 1 hour and 28 minutes, while the shortest takes only 5.86 minutes.
To evaluate your model on FilBench and have it appear on the leaderboard, please follow the steps in our [GitHub repository](https://github.com/filbench/filbench-eval).
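
Because the per-benchmark runs are independent, the parallel wall-clock time is bounded by the longest task rather than by the sum of all tasks. A minimal sketch of that arithmetic, using only the figures quoted above (individual task timings are not listed here):

```python
# Sequential vs. parallel wall-clock time, based on the reported figures.
sequential_hours = 4.93            # all 12 sub-tasks run back-to-back on 2x H100
longest_task_hours = 88 / 60       # ~1 hour and 28 minutes
shortest_task_hours = 5.86 / 60    # ~6 minutes

# With one worker per benchmark, wall-clock time is bounded by the longest task.
parallel_hours = longest_task_hours
best_case_speedup = sequential_hours / parallel_hours  # roughly 3.4x
```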

## Contact

This work was done by Lj V. Miranda ([@ljvmiranda921](https://github.com/ljvmiranda921)), Elyanah Aco ([@elyanah-aco](https://github.com/elyanah-aco)), Conner Manuel ([@connermanuel](https://github.com/connermanuel)), Blaise Cruz ([@jcblaisecruz02](https://github.com/jcblaisecruz02)), and Joseph Imperial ([@imperialite](https://github.com/imperialite)).
For any questions, please reach out to us via filbench-eval@googlegroups.com or through our [GitHub Issues](https://github.com/filbench/filbench-eval/issues).

## Acknowledgements

We would like to thank [Cohere Labs](https://cohere.com/research) for providing credits through the [Cohere Research Grant](https://cohere.com/research/grants) to run the Aya model series, and [Together AI](https://together.ai) for additional computational credits for running several open models.
We also acknowledge the Hugging Face team, particularly the OpenEvals team (Clémentine Fourrier [@clefourrier](https://github.com/clefourrier) and Nathan Habib [@NathanHB](https://github.com/NathanHB)) and Daniel van Strien ([@davanstrien](https://github.com/davanstrien)), for their support in publishing the FilBench blog post.
"""

# Citation information
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@article{miranda2025filbench,
  title={Fil{B}ench: {C}an {LLM}s {U}nderstand and {G}enerate {F}ilipino?},
  author={Miranda, Lester James V and Aco, Elyanah and Manuel, Conner and Cruz, Jan Christian Blaise and Imperial, Joseph Marvin},
  journal={arXiv preprint arXiv:2508.03523},
  year={2025}
}
"""