Aurora-M/MDEL

community

https://aurora-lm.github.io/posts/about-us/

AI & ML interests

Formerly, MDEL, we have renamed ourselves after the model we deployed, Aurora-M. Visit us here: https://huggingface.co/aurora-m

Recent Activity

liangyuch authored a paper 11 days ago

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

liangyuch submitted a paper 11 days ago

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Taishi-N324 authored a paper 17 days ago

On the Optimal Reasoning Length for RL-Trained Language Models

View all activity

ajibawa-2023

posted an update 1 day ago

Post

1631

Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.

1 reply

ajibawa-2023

posted an update 6 days ago

Post

2505

PHP-Code-Large

Dataset: ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.

PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.

ajibawa-2023

posted an update 11 days ago

Post

3239

JavaScript-Code-Large
ajibawa-2023/JavaScript-Code-Large

JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.

By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.

JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments. .

liangyuch

authored a paper 11 days ago

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Paper • 2602.12279 • Published 17 days ago • 19

liangyuch

submitted a paper to Daily Papers 11 days ago

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Paper • 2602.12279 • Published 17 days ago • 19

ajibawa-2023

posted an update 12 days ago

Post

3123

Java-Code-Large ( ajibawa-2023/Java-Code-Large)

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million java codes. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.

By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.

Taishi-N324

authored a paper 17 days ago

On the Optimal Reasoning Length for RL-Trained Language Models

Paper • 2602.09591 • Published 19 days ago • 5

Ziyang

authored a paper about 2 months ago

DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

Paper • 2601.03559 • Published Jan 7 • 14

terryyz

authored a paper 3 months ago

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Paper • 2511.18538 • Published Nov 23, 2025 • 299

shannons

authored 3 papers 3 months ago

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

Paper • 2406.07835 • Published Jun 10, 2024 • 2

SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Paper • 2510.09541 • Published Oct 10, 2025 • 17

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Paper • 2511.19399 • Published Nov 24, 2025 • 62

adamm-hf

posted an update 4 months ago

Post

1118

The #1 trending AI/ML dataset today 🏆

Massive scale, diversity and end-to-end potential from nvidia !
nvidia/PhysicalAI-Autonomous-Vehicles

adamm-hf

posted an update 4 months ago

Post

744

The new King 👑has arrived!

Moonshot AI now the top model on Hugging Face 🔥
moonshotai/Kimi-K2-Thinking

adamm-hf

posted an update 4 months ago

Post

2824

💸🤑You don’t need 100 GPUs to train something amazing!

Our Smol Training Playbook teaches you a better path to world-class LLMs, for free!

Check out the #1 trending space on 🤗 :
HuggingFaceTB/smol-training-playbook

vumichien

authored a paper 5 months ago

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Paper • 2509.25531 • Published Sep 29, 2025 • 9

terryyz

authored a paper 5 months ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published Oct 9, 2025 • 39

vumichien

authored a paper 5 months ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published Oct 9, 2025 • 39

georgosgeorgos

authored 2 papers 5 months ago

From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design

Paper • 2311.12668 • Published Nov 21, 2023

Accelerating Material Design with the Generative Toolkit for Scientific Discovery

Paper • 2207.03928 • Published Jul 8, 2022

AI & ML interests

Recent Activity

Team members 86

Multi-Domain-Expert-Learning's activity