from datetime import datetime

import pytz

NUM_FEWSHOT = 0  # number of few-shot examples used during evaluation

# Philippine time (PHT), used for the "Last restart" timestamp in the header.
manila_tz = pytz.timezone("Asia/Manila")
current_time = datetime.now(manila_tz).strftime("%H:%M %Z, %d %b %Y")
# Leaderboard general information
TOP_TEXT = f"""
# FilBench: An Open LLM Leaderboard for Filipino
[Code](https://github.com/filbench/filbench-eval) | [Paper](https://arxiv.org/abs/2508.03523) | Total Models: {{}} | Last restart (PHT): {current_time}<br/>
📥: Indicates a model submission from the community. To submit your model evaluations, please check our instructions on [GitHub](https://github.com/filbench/filbench-eval).
"""
# Leaderboard reproducibility
LLM_BENCHMARKS_TEXT = """
**FilBench** is a comprehensive evaluation benchmark for Filipino. We curate 12 sub-tasks across four major categories (Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation) and evaluate several models to understand their Filipino-centric capabilities.

## Overview

The overall score is the average of four core sections, weighted by the number of instances in each (a small sketch of this weighting follows the list):

1. **Cultural Knowledge:** Instances that measure the cultural-understanding capabilities of LLMs.
2. **Classical NLP:** Questions on standard NLP tasks such as text classification and named-entity recognition.
3. **Reading Comprehension:** More focused natural language understanding (NLU) tasks and questions from readability benchmarks.
4. **Generation:** Instances for natural language generation (NLG), with a focus on translation.
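
As a rough illustration of this weighting (a minimal sketch, not the leaderboard's actual scoring code; the per-category scores and instance counts below are placeholders), the overall FilBench score is the instance-weighted average of the per-category scores:

```python
# Instance-weighted average across the four categories.
# Scores and instance counts are placeholder values for illustration only.
category_scores = {
    "Cultural Knowledge": 0.62,
    "Classical NLP": 0.71,
    "Reading Comprehension": 0.68,
    "Generation": 0.55,
}
category_sizes = {
    "Cultural Knowledge": 1500,
    "Classical NLP": 2200,
    "Reading Comprehension": 1800,
    "Generation": 900,
}

total_instances = sum(category_sizes.values())
filbench_score = (
    sum(category_scores[name] * category_sizes[name] for name in category_scores)
    / total_instances
)
```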

## Evaluation Runner

We use our own fork of [lighteval](https://github.com/filbench/lighteval) to perform evaluations.
We highly recommend using the vLLM backend for faster inference.
Run sequentially, a full FilBench evaluation takes about 4.93 hours on 2 NVIDIA H100 GPUs.
However, the suite can be parallelized per benchmark: the longest-running task takes approximately 1 hour and 28 minutes, while the shortest takes only 5.86 minutes.
To evaluate your model on FilBench and have it appear on the leaderboard, please follow the steps in our [GitHub repository](https://github.com/filbench/filbench-eval).
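
Because the per-benchmark runs are independent, the parallel wall-clock time is bounded by the longest task rather than by the sum of all tasks. A minimal sketch of that arithmetic, using only the figures quoted above (individual task timings are not listed here):

```python
# Sequential vs. parallel wall-clock time, based on the reported figures.
sequential_hours = 4.93            # all 12 sub-tasks run back-to-back on 2x H100
longest_task_hours = 88 / 60       # ~1 hour and 28 minutes
shortest_task_hours = 5.86 / 60    # ~6 minutes

# With one worker per benchmark, wall-clock time is bounded by the longest task.
parallel_hours = longest_task_hours
best_case_speedup = sequential_hours / parallel_hours  # roughly 3.4x
```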

## Contact

This work was done by Lj V. Miranda ([@ljvmiranda921](https://github.com/ljvmiranda921)), Elyanah Aco ([@elyanah-aco](https://github.com/elyanah-aco)), Conner Manuel ([@connermanuel](https://github.com/connermanuel)), Blaise Cruz ([@jcblaisecruz02](https://github.com/jcblaisecruz02)), and Joseph Imperial ([@imperialite](https://github.com/imperialite)).
For any questions, please reach out to us via filbench-eval@googlegroups.com or through our [GitHub Issues](https://github.com/filbench/filbench-eval/issues).

## Acknowledgements

We would like to thank [Cohere Labs](https://cohere.com/research) for providing credits through the [Cohere Research Grant](https://cohere.com/research/grants) to run the Aya model series, and [Together AI](https://together.ai) for additional computational credits for running several open models.
We also acknowledge the Hugging Face team, particularly the OpenEvals team (Clémentine Fourrier [@clefourrier](https://github.com/clefourrier) and Nathan Habib [@NathanHB](https://github.com/NathanHB)) and Daniel van Strien ([@davanstrien](https://github.com/davanstrien)), for their support in publishing the FilBench blog post.
"""

# Citation information
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@article{miranda2025filbench,
  title={Fil{B}ench: {C}an {LLM}s {U}nderstand and {G}enerate {F}ilipino?},
  author={Miranda, Lester James V and Aco, Elyanah and Manuel, Conner and Cruz, Jan Christian Blaise and Imperial, Joseph Marvin},
  journal={arXiv preprint arXiv:2508.03523},
  year={2025}
}
"""