🦅 Falcon-H1 Chat | 🤗 Hugging Face | 📄 Paper | 📰 Blog | 📚 Documentation | 🖥️ Hugging Face Demo | 💬 Discord
- 10/02/2025 Falcon-H1 series is now integrated into SGLang!
- 09/22/2025 Falcon-H1 series is now integrated into MLX!
- 07/31/2025 The Technical Report of Falcon-H1 is now released!
- 07/09/2025 Falcon-H1 series is now integrated into llama.cpp!
- 06/30/2025 Falcon-H1 series is now integrated into several fine-tuning frameworks (axolotl, llama-factory, unsloth)!
- 05/21/2025 Falcon-H1 series is finally out!
We are excited to introduce Falcon-H1, the latest evolution in the Falcon family of large language models. Built upon an advanced hybrid architecture in which each block integrates both State Space Models (SSMs) and attention mechanisms, these models span a wide range of scales, from 500 million to 34 billion parameters, making them suitable for both lightweight inference on edge devices and large-scale deployments in data centers.
Falcon-H1 was initially trained with support for 18 core languages, with scalability to 100+ languages, achieving state-of-the-art performance in instruction following, maths, coding, reasoning, and multilingual tasks.
Built by the Technology Innovation Institute (TII) in Abu Dhabi, Falcon-H1 is the latest step in pushing the frontier of hybrid transformer design.
Attention and Mamba2 heads are combined in parallel within our hybrid mixer block. Importantly, the number of attention and Mamba heads can be adjusted independently, allowing for an optimal attention-to-SSM ratio. This hybrid design enables faster inference, lower memory usage, and strong generalization across tasks.
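To make the idea concrete, below is a minimal conceptual sketch of a parallel hybrid mixer block. It is not the actual Falcon-H1 implementation (the real block uses Mamba2 selective state-space scans and causal attention); it only illustrates how attention heads and SSM-style heads can run in parallel on the same input with independently chosen head counts:

```python
import torch
import torch.nn as nn

class HybridMixerSketch(nn.Module):
    """Illustrative parallel attention + SSM-style mixer (not Falcon-H1's real block)."""

    def __init__(self, d_model=512, n_attn_heads=4, n_ssm_heads=8, d_head=64):
        super().__init__()
        # Attention branch: its width depends only on the number of attention heads
        self.attn_in = nn.Linear(d_model, n_attn_heads * d_head)
        self.attn = nn.MultiheadAttention(n_attn_heads * d_head, n_attn_heads, batch_first=True)
        # SSM branch stand-in: a causal depthwise convolution instead of a Mamba2 scan
        self.ssm_in = nn.Linear(d_model, n_ssm_heads * d_head)
        self.ssm_mix = nn.Conv1d(n_ssm_heads * d_head, n_ssm_heads * d_head,
                                 kernel_size=4, padding=3, groups=n_ssm_heads * d_head)
        # Both branches are concatenated and projected back to the model width
        self.out = nn.Linear((n_attn_heads + n_ssm_heads) * d_head, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        a = self.attn_in(x)
        a, _ = self.attn(a, a, a, need_weights=False)            # no causal mask, for brevity
        s = self.ssm_in(x).transpose(1, 2)
        s = self.ssm_mix(s)[..., : x.shape[1]].transpose(1, 2)   # trim right padding to stay causal
        return self.out(torch.cat([a, s], dim=-1))

x = torch.randn(2, 16, 512)
print(HybridMixerSketch()(x).shape)  # torch.Size([2, 16, 512])
```

Changing `n_attn_heads` and `n_ssm_heads` independently is what lets the attention-to-SSM ratio be tuned per model.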
Models are available in multiple sizes and variants: 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B parameters, supporting diverse use cases and deployment scenarios, from edge devices to large-scale systems.
Native training in 18 languages, including Arabic (ar), Czech (cs), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Swedish (sv), Urdu (ur), and Chinese (zh), with scalability to 100+ languages thanks to our multilingual tokenizer trained on diverse language datasets, plus strong zero-shot translation and instruction-following abilities (see the tokenizer sketch below).
Falcon-H1-0.5B delivers performance on par with typical 7B models from 2024, while Falcon-H1-1.5B-Deep rivals many of the current leading 7B–10B models. Each Falcon-H1 model is designed to match or exceed the performance of models at least twice its size, making them ideal for low-resource and edge deployments without compromising on capability.
Falcon-H1 models support up to 256K context length, enabling applications in long-document processing, multi-turn dialogue, and long-range reasoning. With exceptional long-context performance and greater computational and memory efficiency, Falcon-H1 strikes a strong balance between performance and resource cost.
Falcon-H1 employs a redesigned training approach that maximizes the value of high-quality but limited data. Additionally, the training process scales smoothly across model sizes through a customized Maximal Update Parametrization (μP).
Falcon-H1 is compatible with most major training, inference and deployment frameworks, such as Llama-Factory, Unsloth, vLLM, SGLang, Hugging Face Transformers, llama.cpp and MLX, with more coming soon.
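To illustrate the multilingual tokenizer mentioned in the feature list above, here is a small sketch that tokenizes roughly the same sentence in a few of the supported languages (the checkpoint is one of the released models; the sentences are just examples):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon-H1-1.5B-Instruct")

samples = {
    "en": "Falcons are the fastest animals in the world.",
    "fr": "Les faucons sont les animaux les plus rapides du monde.",
    "ar": "الصقور هي أسرع الحيوانات في العالم.",
    "zh": "猎鹰是世界上速度最快的动物。",
}
for lang, text in samples.items():
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{lang}: {n_tokens} tokens")
```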
A detailed dynamic evaluation report is provided in our blogpost and technical report.
- We show that Falcon-H1 models achieve state-of-the-art performance on most benchmarks (reasoning, maths, coding, in-context learning, and more), outperforming leading open models twice their size.
- Falcon-H1-34B achieves up to a 4x improvement in input throughput and an 8x speedup in output throughput compared to similarly sized Transformer models (e.g., Qwen2.5-32B).
We provide the following documentation and resources to begin working with Falcon-H1:
- 💬 Quick Deploy: Try Falcon-H1 instantly using our hosted Chat Interface or the Live Demo from Hugging Face
- 🛠️ Inference Toolkits: Compatible out-of-the-box with vLLM, SGLang, Transformers, and llama.cpp. 📖 Deployment Instructions. Other runtimes are in progress.
- ⚙️ Fine-tuning: Compatible with most frameworks based on the Hugging Face Transformers library, out-of-the-box with OUMI, Llama-Factory, etc. 📖 Fine-Tuning Guidelines. More framework support coming soon!
- 💻 Local Setup: Full GGUF and HF formats available. Run it efficiently on both GPU and CPU.
- 🔬 Research: Learn more about our novel hybrid design in the Falcon-H1 technical report.
💡 Tip: For optimal performance, always use `torch.bfloat16` instead of `torch.float16`. The recommended model temperature is `0.1`; at higher values, the model's performance may drop significantly.
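As an illustration, here is a minimal sketch of those recommended settings applied to an Instruct checkpoint with 🤗 Transformers (the prompt is just an example; installation notes follow below):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # recommended over torch.float16
    device_map="auto",
)

# Build a chat prompt and sample with the recommended temperature of 0.1
messages = [{"role": "user", "content": "Give me a fun fact about falcons."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.1)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```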
Make sure to install the latest version of transformers or vLLM, and, if needed, install these packages from source:
pip install git+https://github.com/huggingface/transformers.git

Refer to the official vLLM documentation for more details on building vLLM from source.
Transformers is a library of pretrained models for natural language processing, covering both inference and training. Refer to the snippet below to run Falcon-H1 models using 🤗 Transformers:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the model
model_id = "tiiuae/Falcon-H1-1.5B-Base"
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Perform text generation with the recommended sampling settings
prompt = "Falcons are"  # example prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. To run Falcon-H1 models with the recommended precision and sampling defaults:
# pip install vllm
vllm serve tiiuae/Falcon-H1-1.5B-Instruct \
--tensor-parallel-size 2 \
--data-parallel-size 1 \
--dtype bfloat16 \
--port 8000

Then send requests with the target temperature:
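For instance, you can call the endpoint from Python with the `openai` client (a minimal sketch, assuming `pip install openai`; the API key is a placeholder since vLLM does not require one by default):

```python
from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tiiuae/Falcon-H1-1.5B-Instruct",
    temperature=0.1,  # recommended temperature
    messages=[
        {"role": "system", "content": "You are Falcon-H1."},
        {"role": "user", "content": "Give me a fun fact about falcons."},
    ],
)
print(response.choices[0].message.content)
```

Or equivalently with curl: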
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tiiuae/Falcon-H1-1.5B-Instruct",
"temperature": 0.1,
"messages": [
{"role": "system", "content": "You are Falcon-H1."},
{"role": "user", "content": "Give me a fun fact about falcons."}
]
}'

SGLang provides a high-performance serving runtime with native Falcon-H1 kernels. Follow the steps below to spin up a Falcon-H1 endpoint:
# 1. Install the runtime (requires an NVIDIA GPU with FlashInfer-compatible CUDA drivers)
pip install uv
uv pip install "sglang[all]>=0.5.3"
# 2. Launch Falcon-H1 with SGLang (replace <your_token> with a Hugging Face token)
HF_TOKEN=<your_token> python -m sglang.launch_server \
--model-path tiiuae/Falcon-H1-7B-Instruct \
--dtype bfloat16 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code

You can now call the OpenAI-compatible endpoint exposed on http://localhost:30000:
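The same interface also supports streaming; a minimal sketch with the `openai` Python client (assuming `pip install openai`):

```python
from openai import OpenAI

# Point the client at the local SGLang server started above
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="tiiuae/Falcon-H1-7B-Instruct",
    temperature=0.1,
    stream=True,  # yield tokens as they are generated
    messages=[
        {"role": "system", "content": "You are Falcon-H1."},
        {"role": "user", "content": "Give me a fun fact about falcons."},
    ],
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```

Or with a plain curl request: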
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tiiuae/Falcon-H1-7B-Instruct",
"temperature": 0.1,
"messages": [
{"role": "system", "content": "You are Falcon-H1."},
{"role": "user", "content": "Give me a fun fact about falcons."}
]
}'

Falcon-H1 is now natively supported in llama.cpp!
All official GGUF files can be found in our Hugging Face collection.
- CMake ≥ 3.16
- A C++17-compatible compiler (e.g., `gcc`, `clang`)
- make or ninja build tool
- (Optional) Docker, for OpenWebUI integration
# Clone the Falcon-H1 llama.cpp fork
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Create a build directory and compile
mkdir build && cd build
cmake .. # Configure the project
make -j$(nproc) # Build the binaries

Tip: For GPU acceleration, refer to the llama.cpp GPU guide.
Fetch the desired Falcon-H1 checkpoint from Hugging Face's collection:
# Example: download the 1.5B Instruct model
wget https://huggingface.co/tiiuae/Falcon-H1-1.5B-Instruct-GGUF/resolve/main/Falcon-H1-1.5B-Instruct-Q5_K.gguf \
-P models/

All available GGUF files: https://huggingface.co/collections/tiiuae/falcon-h1-6819f2795bc406da60fab8df
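If you prefer Python over wget, the same file can be fetched with the `huggingface_hub` library (a small sketch; the repo and file names match the command above):

```python
from huggingface_hub import hf_hub_download

# Download the GGUF checkpoint into the local models/ directory
path = hf_hub_download(
    repo_id="tiiuae/Falcon-H1-1.5B-Instruct-GGUF",
    filename="Falcon-H1-1.5B-Instruct-Q5_K.gguf",
    local_dir="models",
)
print(path)
```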
Start the HTTP server for inference:
./build/bin/llama-server \
-m models/Falcon-H1-1.5B-Instruct-Q5_K.gguf \
-c 4096 \
-ngl 512 \
--temp 0.1 \
--host 0.0.0.0 \
--port 11434

Use the popular OpenWebUI frontend to chat in your browser:
# Map host port 8888 to Open WebUI's default container port 8080
docker run -d \
--name openwebui-test \
-e OPENAI_API_BASE_URL="http://host.docker.internal:11434/v1" \
-p 8888:8080 \
ghcr.io/open-webui/open-webui:main

- Open your browser at http://localhost:8888
- Select Falcon-H1-1.5B-Instruct-Q5_K from the model list
- Start chatting!
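If you prefer to skip the UI, llama-server also exposes an OpenAI-compatible API directly, so you can query it from Python (a minimal sketch with the `openai` client, assuming `pip install openai`; llama-server serves whichever GGUF it was launched with, so the model name here is informational):

```python
from openai import OpenAI

# Talk to the llama-server endpoint started above (port 11434)
client = OpenAI(base_url="http://localhost:11434/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Falcon-H1-1.5B-Instruct-Q5_K",  # informational; the server uses its loaded GGUF
    temperature=0.1,
    messages=[{"role": "user", "content": "Give me a fun fact about falcons."}],
)
print(response.choices[0].message.content)
```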
For advanced tuning and custom flags, see the full llama.cpp documentation: https://github.com/ggerganov/llama.cpp
Demo hardware: MacBook (M4 Max chip). Model: Falcon-H1-1.5B-Q6_K
Falcon-H1-1B-Q6_K.mp4
Got feedback or want to build with Falcon-H1?
Join the conversation on Discord, follow us on Hugging Face, visit our official website, or check out our roadmap and open issues on GitHub.
Feel free to cite our work if you find it useful for your projects:
@article{falconh1,
title={Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance},
author={Jingwei Zuo and Maksim Velikanov and Ilyas Chahed and Younes Belkada and Dhia Eddine Rhayem and Guillaume Kunsch and Hakim Hacid and Hamza Yous and Brahim Farhat and Ibrahim Khadraoui and Mugariya Farooq and Giulia Campesan and Ruxandra Cojocaru and Yasser Djilali and Shi Hu and Iheb Chaabane and Puneesh Khanna and Mohamed El Amine Seddik and Ngoc Dung Huynh and Phuc Le Khac and Leen AlQadi and Billel Mokeddem and Mohamed Chami and Abdalgader Abubaker and Mikhail Lubinets and Kacper Piskorski and Slim Frikha},
journal = {arXiv preprint arXiv:2507.22448},
year={2025}
}


