Z-Image-Turbo on AXERA AX650N

This project provides a complete implementation for deploying the Z-Image-Turbo diffusion model on AXERA AX650N NPU hardware. Z-Image-Turbo is a high-performance text-to-image generation model that leverages advanced diffusion techniques to produce high-quality images with fast inference speed.

Table of Contents

  • Overview
  • Deployment Strategy
  • Requirements
  • Project Structure
  • Model Components
  • Complete Inference Pipeline
  • Known Limitations
  • Advanced Usage
  • Technical Support
  • License

Overview

The Z-Image-Turbo model consists of three main components:

  1. Text Encoder: Converts text prompts into embeddings
  2. Transformer: Core diffusion model that processes latent representations
  3. VAE (Variational Autoencoder): Encodes/decodes between pixel space and latent space
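
The data flow between these components can be summarized with the sketch below. It is illustrative pseudocode only: the function and parameter names (text_encoder, transformer_npu, vae_decoder_npu, scheduler, LATENT_CHANNELS) are placeholders rather than APIs from this repository, and the scheduler interface is a simplification of the actual diffusion loop in the launcher script.

import numpy as np

# Placeholder handles; the real objects live in examples/z_image_fun/launcher_axmodel.py.
# LATENT_CHANNELS and the scheduler interface are assumptions for illustration only.
LATENT_CHANNELS = 16

def generate_image(text_encoder, transformer_npu, vae_decoder_npu, scheduler,
                   prompt: str, num_steps: int = 9, height: int = 512, width: int = 512):
    # 1. Text encoder (PyTorch, Qwen3): prompt -> embeddings
    prompt_embeds = text_encoder(prompt)

    # 2. Transformer (NPU): iteratively denoise a random latent
    latent = np.random.randn(1, LATENT_CHANNELS, height // 8, width // 8).astype(np.float32)
    for t in scheduler.timesteps(num_steps):
        noise_pred = transformer_npu(latent, t, prompt_embeds)
        latent = scheduler.step(noise_pred, t, latent)

    # 3. VAE decoder (NPU): latent -> RGB image in pixel space
    return vae_decoder_npu(latent)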

Deployment Strategy

The deployment architecture is optimized for AXERA AX650N with the following design decisions:

  • Text Encoder: Currently runs in PyTorch for simplicity and faster development iteration. This component uses the Qwen3 model and can be converted to axmodel format in a future release to enable end-to-end NPU inference.
  • Transformer: Fully converted to axmodel format and run on the NPU through model partitioning and subgraph optimization, so that each partition fits within the NPU's constraints.
  • VAE: Both encoder and decoder are converted to axmodel format for complete NPU acceleration, enabling fast image encoding and decoding operations.

Requirements

This project requires the following Python environment and dependencies:

Python 3.9.20
torch 2.7.0
torchvision 0.22.0
transformers 4.53.1
diffusers 0.32.1

Additional Dependencies:

  • ONNX Runtime (for ONNX model inference and validation)
  • onnxslim (for ONNX model optimization)
  • numpy (for numerical operations and calibration data handling)
  • Pulsar2 toolchain (for AXERA AX650N model compilation)

Hardware Requirements:

  • AXERA AX650N development board for deployment
  • x86/ARM Linux system for model conversion and compilation
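
Before converting models, it can help to confirm that the installed package versions match the ones this project was tested with. A minimal check (nothing project-specific, just importlib.metadata):

import importlib.metadata as md

# Versions this project was tested with (see the list above).
expected = {
    "torch": "2.7.0",
    "torchvision": "0.22.0",
    "transformers": "4.53.1",
    "diffusers": "0.32.1",
}

for pkg, want in expected.items():
    have = md.version(pkg)
    status = "OK" if have == want else f"expected {want}"
    print(f"{pkg:<14} {have:<10} {status}")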

Project Structure

Z-Image-Turbo/
├── original_onnx/              # Exported ONNX models (original format)
│   ├── vae_decoder_simp_slim.onnx
│   ├── vae_encoder_simp_slim.onnx
│   └── z_image_transformer_body_only_simp_slim.onnx
├── text_encoder_axmodel/       # Text encoder models in axmodel format
│   ├── model.embed_tokens.weight.npy
│   ├── qwen3_p128_l0_together.axmodel
│   ├── qwen3_p128_l1_together.axmodel
│   └── ... (36 layer models for Qwen3)
├── transformer_axmodel/        # Transformer subgraph models in axmodel format
│   ├── auto_00_model_layers_29_Add_4_output_0_to_sample_auto.axmodel
│   ├── cfg_00_timestep_to_model_t_embedder_mlp_mlp_2_Gemm_output_0_config.axmodel
│   └── ... (compiled subgraph models)
├── transformer_onnx/           # Transformer models in ONNX format
├── vae_model/                  # VAE models (both ONNX and axmodel formats)
├── VideoX-Fun/                 # Main conversion and inference code
└── README.md                   # This documentation

Model Components

1. Transformer Module

The transformer module is the core component responsible for the diffusion process. It iteratively processes latent representations to generate high-quality images from noise. Due to the model's complexity and size, we employ a subgraph partitioning strategy to optimize deployment on the AX650N NPU.

Step 1: Export to ONNX Format

First, export the transformer model to ONNX format (without ControlNet support):

python scripts/z_image/export_transformer_body_onnx.py \
        --output onnx-models-512x512/z_image_transformer_body_only_512x512.onnx \
        --height 512 --width 512 --sequence-length 128 \
        --latent-downsample-factor 8 \
        --dtype fp32 \
        --skip-slim

Parameters:

  • --output: Output path for the ONNX model
  • --height, --width: Target image dimensions (512x512)
  • --sequence-length: Maximum sequence length for text embeddings (128 tokens)
  • --latent-downsample-factor: VAE downsample factor (8x)
  • --dtype: Data type (fp32 for highest accuracy)
  • --skip-slim: Skip ONNX simplification (optional)

Note: If you don't use --skip-slim, the model will be automatically simplified and the output will be named: z_image_transformer_body_only_512x512_simp_slim.onnx
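
Before moving on, it can be useful to smoke-test the exported model with ONNX Runtime. The sketch below discovers input names and shapes from the graph itself, so it does not assume any particular tensor naming; dynamic dimensions are filled with a placeholder value of 1, which you may need to adjust.

import numpy as np
import onnxruntime as ort

MODEL = "onnx-models-512x512/z_image_transformer_body_only_512x512_simp_slim.onnx"

sess = ort.InferenceSession(MODEL, providers=["CPUExecutionProvider"])

# Build random inputs matching the declared shapes; dynamic dims become 1.
# Assumes float inputs are fp32 (the export above uses --dtype fp32).
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    dtype = np.float32 if "float" in inp.type else np.int64
    feeds[inp.name] = np.random.randn(*shape).astype(dtype)

outputs = sess.run(None, feeds)
for out, arr in zip(sess.get_outputs(), outputs):
    print(out.name, arr.shape, arr.dtype)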

Step 2: Collect Calibration Data

Collect calibration dataset from the original model for quantization. This step generates representative input data that will be used during the quantization process:

python ./examples/z_image_fun/collect_onnx_inputs.py \
    --model_name models/Diffusion_Transformer/Z-Image-Turbo/ \
    --output_dir transformer_body_only_512x512_simp_slim/calibration \
    --height 512 --width 512 \
    --max_sequence_length 128

This command generates calibration data by running the model with various prompts and diffusion steps, capturing the actual input distributions that the model will encounter during inference.
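
The collected calibration files are plain NumPy archives. The sketch below shows one way to inspect a sample, assuming each .npy file stores a pickled dict of input name to array (which is how the split step below consumes them); adjust if your files are structured differently.

import numpy as np

# Inspect one calibration sample (assumes a pickled dict of input_name -> array).
sample = np.load(
    "transformer_body_only_512x512_simp_slim/calibration/transformer_inputs_prompt000_step00.npy",
    allow_pickle=True,
).item()

for name, arr in sample.items():
    print(f"{name:<40} {arr.shape} {arr.dtype}")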

Step 3: Split ONNX Model into Subgraphs

Split the monolithic ONNX model into multiple subgraphs for better memory management and compilation optimization:

python ./scripts/split_onnx_by_subconfig.py \
    --model ./onnx-models-512x512/z_image_transformer_body_only_512x512_simp_slim.onnx \
    --config ./pulsar2_configs/transformers_subgraph_512x512.json \
    --output-dir ./transformers_body_only_512_512_split_onnx \
    --verify \
    --input-data ./transformer_body_only_512x512_simp_slim/calibration/transformer_inputs_prompt000_step00.npy \
    --providers CPUExecutionProvider

The subgraph configuration file (transformers_subgraph_512x512.json) defines the splitting strategy, determining how the model is partitioned into smaller, manageable pieces that fit within the NPU's constraints.
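
The --verify option already checks the split against the original model; for reference, the same idea can be reproduced manually with ONNX Runtime. The sketch below assumes that consecutive subgraphs connect through identically named tensors (as the subgraph file names suggest) and that the calibration .npy file holds a dict of the original model's inputs; both are assumptions, not guarantees.

import glob
import numpy as np
import onnxruntime as ort

SPLIT_DIR = "./transformers_body_only_512_512_split_onnx"
CALIB = "./transformer_body_only_512x512_simp_slim/calibration/transformer_inputs_prompt000_step00.npy"

# Pool of named tensors, seeded with the original model's inputs.
tensors = dict(np.load(CALIB, allow_pickle=True).item())

for path in sorted(glob.glob(f"{SPLIT_DIR}/*.onnx")):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    feeds = {i.name: tensors[i.name] for i in sess.get_inputs()}
    outs = sess.run(None, feeds)
    # Make each subgraph's outputs available to downstream subgraphs by name.
    for meta, arr in zip(sess.get_outputs(), outs):
        tensors[meta.name] = arr
    print("ran", path)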

Step 4: Collect Subgraph Calibration Data

After splitting, collect calibration data for each individual subgraph:

python examples/z_image_fun/collect_subgraph_inputs.py \
  --onnx ./onnx-models-512x512/z_image_transformer_body_only_512x512_simp_slim.onnx \
  --subgraph-config ./pulsar2_configs/transformers_subgraph_512x512.json \
  --output-dir ./transformer_body_only_512x512_simp_slim/subgraph-calib \
  --tar-list-file ./transformer_body_only_512x512_simp_slim/subgraph-calib/paths.txt \
  --skip-existing

To collect additional calibration data at a different resolution (for example, 1728x992):

python examples/z_image_fun/collect_subgraph_inputs.py \
    --onnx ./onnx-models-1728x992/z_image_transformer_body_only_1728x992_simp_slim.onnx \
    --subgraph-config ./pulsar2_configs/transformers_subgraph_1728x992.json \
    --output-dir ./transformer_body_only_1728x992_simp_slim/subgraph-calib \
    --tar-list-file ./transformer_body_only_1728x992_simp_slim/subgraph-calib/paths.txt  \
    --sample-size 1728 992 \
    --max-seq-len 256
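
The paths.txt list produced above is what the configuration generator in the next step consumes. A trivial sanity check that every referenced calibration archive actually exists (assuming one path per line):

from pathlib import Path

paths_file = Path("./transformer_body_only_512x512_simp_slim/subgraph-calib/paths.txt")

missing = []
for line in paths_file.read_text().splitlines():
    entry = line.strip()
    if entry and not Path(entry).exists():
        missing.append(entry)

print(f"{paths_file}: {len(missing)} missing calibration file(s)")
for entry in missing:
    print("  missing:", entry)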

Step 5: Generate Compilation Configuration Files

Automatically generate individual compilation configuration files for each subgraph:

python ./scripts/generate_subgraph_configs.py \
    --tar-list-file ./transformer_body_only_512x512_simp_slim/subgraph-calib/paths.txt \
    --output-config-dir pulsar2_configs/subgraphs_512x512

This step creates tailored configuration files for each subgraph, specifying quantization settings, calibration data paths, and compilation options.

Important: After generating the sub-ONNX files, you need to apply ONNX simplification (onnxslim) to each subgraph for optimal performance.
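
One way to apply onnxslim to every split subgraph is through its command-line interface; a minimal sketch that shells out to the onnxslim CLI and writes files with a _slim suffix:

import glob
import subprocess
from pathlib import Path

SPLIT_DIR = "./transformers_body_only_512_512_split_onnx"

for onnx_path in sorted(glob.glob(f"{SPLIT_DIR}/*.onnx")):
    src = Path(onnx_path)
    if src.stem.endswith("_slim"):
        continue  # already simplified
    dst = src.with_name(src.stem + "_slim.onnx")
    # onnxslim <input> <output> simplifies the graph without changing its semantics.
    subprocess.run(["onnxslim", str(src), str(dst)], check=True)
    print("slimmed:", dst)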

Step 6: Compile All Subgraphs

Compile all subgraphs using the Pulsar2 toolchain:

./compile_all_subgraphs.sh \
    --onnx-dir ./transformers_body_only_512_512_split_onnx \
    --config-dir pulsar2_configs/subgraphs_512x512 \
    --output-base-dir ./compiled_transformers_body_only_512x512/out_all \
    --final-output-dir ./compiled_transformers_body_only_512x512/out_final

Output Directories:

  • out_all: Contains compilation logs and intermediate files for all subgraphs
  • out_final: Contains only the successfully compiled axmodel files, ready for deployment

The compilation process converts each ONNX subgraph into an optimized axmodel format that can run efficiently on the AX650N NPU.

2. VAE Decoder Module

The Variational Autoencoder (VAE) is responsible for converting between the latent space representation and pixel space. The decoder takes the denoised latent representation from the transformer and generates the final RGB image.

Step 1: Export VAE to ONNX Format

Export both the VAE encoder and decoder to ONNX format:

python scripts/z_image_fun/export_vae_onnx.py \
        --model-root models/Diffusion_Transformer/Z-Image-Turbo/ \
        --height 512 --width 512 \
        --encoder-output onnx-models-512x512/vae_encoder.onnx \
        --decoder-output onnx-models-512x512/vae_decoder.onnx \
        --dtype fp32 \
        --save-calib-inputs \
        --calib-dir onnx-calibration-512x512 \
        --skip-ort-check

Parameters:

  • --model-root: Path to the Z-Image-Turbo model
  • --encoder-output, --decoder-output: Output paths for the encoder and decoder ONNX models
  • --save-calib-inputs: Save calibration inputs for quantization
  • --calib-dir: Directory to store calibration data
  • --skip-ort-check: Skip ONNX Runtime validation (useful when ORT has compatibility issues)

Step 2: Create Compilation Configuration

Create a configuration file for the VAE decoder compilation. Example configuration file: pulsar2_configs/vae_decoder.json

This configuration should specify:

  • Input/output tensor names and shapes
  • Quantization strategy (e.g., int8, mixed precision)
  • Calibration data paths
  • Hardware target (AX650)
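
The exact schema is defined by the Pulsar2 toolchain documentation. The snippet below only sketches the kind of content such a file carries; the field names follow common Pulsar2 build configurations but should be checked against the documentation for your toolchain version, and the calibration dataset path is a hypothetical example.

import json

# Illustrative only: verify field names against the Pulsar2 documentation,
# and replace the calibration_dataset path with your own archive.
vae_decoder_config = {
    "model_type": "ONNX",
    "npu_mode": "NPU3",
    "quant": {
        "input_configs": [
            {
                "tensor_name": "DEFAULT",
                "calibration_dataset": "./onnx-calibration-512x512/vae_decoder_calib.tar",
                "calibration_format": "Numpy",
                "calibration_size": 32,
            }
        ],
        "calibration_method": "MinMax",
    },
}

with open("pulsar2_configs/vae_decoder.json", "w") as f:
    json.dump(vae_decoder_config, f, indent=2)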

Step 3: Compile VAE Decoder

Compile the ONNX model to axmodel format using Pulsar2:

pulsar2 build \
    --output_dir ./compiled_output_vae_decoder \
    --config pulsar2_configs/vae_decoder.json \
    --npu_mode NPU3 \
    --input onnx-models/vae_decoder_simp_slim.onnx \
    --target_hardware AX650

Parameters:

  • --output_dir: Output directory for compiled models
  • --config: Path to the compilation configuration file
  • --npu_mode: NPU mode (NPU3 for maximum performance on AX650N)
  • --target_hardware: Target hardware platform (AX650)

The compiled VAE decoder will be saved in the output directory and can be deployed to the AX650N board.
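
On the board, the compiled decoder can be smoke-tested with the axengine Python runtime (pyaxengine), which exposes an onnxruntime-style InferenceSession as used by the launcher script; this is a minimal sketch under that assumption, with a random latent shaped from the model's declared input.

import numpy as np
import axengine as axe  # pyaxengine: onnxruntime-style API for axmodel files

sess = axe.InferenceSession("../vae_model/vae_decoder.axmodel")

# Build a random latent matching the compiled model's (static) input shape.
inp = sess.get_inputs()[0]
latent = np.random.randn(*inp.shape).astype(np.float32)

outputs = sess.run(None, {inp.name: latent})
print("decoded image tensor:", outputs[0].shape, outputs[0].dtype)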

Complete Inference Pipeline

After compiling all components, you can run the complete text-to-image inference pipeline on the AXERA AX650N development board.

Running on the Development Board

  1. Transfer all compiled axmodel files to the development board
  2. Ensure all dependencies are installed
  3. Run the inference script:
python3 examples/z_image_fun/launcher_axmodel.py \
    --transformer-config pulsar2_configs/transformers_subgraph.json \
    --transformer-subgraph-dir ../transformer_axmodel \
    --vae-axmodel ../vae_model/vae_decoder.axmodel

Parameters:

  • --transformer-config: Configuration file that defines the subgraph structure
  • --transformer-subgraph-dir: Directory containing all compiled transformer subgraph axmodels
  • --vae-axmodel: Path to the compiled VAE decoder axmodel

The launcher script will:

  1. Load the text encoder (PyTorch)
  2. Process input prompts into embeddings
  3. Run the transformer subgraphs sequentially on NPU
  4. Decode the latent representation using VAE decoder on NPU
  5. Output the final generated image
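
As the example log below shows, transformer subgraph sessions are loaded lazily on first use rather than all at once. A minimal sketch of that caching pattern with the axengine runtime (the directory layout and file-name matching are assumptions based on the file names in transformer_axmodel/, not the launcher's actual implementation):

from pathlib import Path
import axengine as axe  # pyaxengine runtime on the AX650N board

SUBGRAPH_DIR = Path("../transformer_axmodel")
_sessions = {}  # cache: subgraph id -> InferenceSession

def get_session(subgraph_id: str):
    """Load a transformer subgraph on first use and cache it (e.g. 'cfg_00')."""
    if subgraph_id not in _sessions:
        matches = sorted(SUBGRAPH_DIR.glob(f"{subgraph_id}_*.axmodel"))
        if not matches:
            raise FileNotFoundError(f"no axmodel found for {subgraph_id}")
        _sessions[subgraph_id] = axe.InferenceSession(str(matches[0]))
    return _sessions[subgraph_id]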

Example Output

Here's an example of the inference process running on the AX650N development board:

root@ax650 Z-Image-Turbo/VideoX-Fun $ python3 examples/z_image_fun/launcher_axmodel.py \
    --transformer-config pulsar2_configs/transformers_subgraph.json \
    --transformer-subgraph-dir ../transformer_axmodel \
    --vae-axmodel ../vae_model/vae_decoder.axmodel

[INFO] Available providers:  ['AxEngineExecutionProvider']
/root/yongqiang/push_hugging_face/Z-Image-Turbo/VideoX-Fun/videox_fun/dist/wan_xfuser.py:22: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  @amp.autocast(enabled=False)
...
/root/yongqiang/push_hugging_face/Z-Image-Turbo/VideoX-Fun/videox_fun/models/wan_audio_injector.py:114: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  @amp.autocast(enabled=False)
/root/yongqiang/push_hugging_face/Z-Image-Turbo/VideoX-Fun/videox_fun/models/wan_transformer3d_s2v.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  @amp.autocast(enabled=False)
2026-01-15 15:55:55.577 | INFO     | __main__:main:425 - Prompt used: sunrise over alpine mountains, low clouds in valleys, god rays, ultra-detailed landscape
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.26it/s]
The module name  (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 5c5e711b-dirty
AX Denoising:   0%|                                                                             | 0/9 [00:00<?, ?it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 5c5e711b-dirty
2026-01-15 15:58:44.111 | INFO     | __main__:_get_session:301 - Loading subgraph session: cfg_00 from cfg_00_timestep_to_model_t_embedder_mlp_mlp_2_Gemm_output_0_config.axmodel
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 5c5e711b-dirty
2026-01-15 15:58:48.882 | INFO     | __main__:_get_session:301 - Loading subgraph session: cfg_01 from cfg_01_prompt_embeds_to_model_Slice_1_output_0_config.axmodel
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 5c5e711b-dirty
...
2026-01-15 16:00:08.612 | INFO     | __main__:_get_session:301 - Loading subgraph session: cfg_30 from cfg_30_model_layers_26_Add_4_output_0_to_model_layers_27_Add_4_output_0_config.axmodel
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 5c5e711b
2026-01-15 16:00:11.179 | INFO     | __main__:_get_session:301 - Loading subgraph session: cfg_31 from cfg_31_model_layers_27_Add_4_output_0_to_model_layers_28_Add_4_output_0_config.axmodel
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 5c5e711b
2026-01-15 16:00:13.868 | INFO     | __main__:_get_session:301 - Loading subgraph session: cfg_32 from cfg_32_model_layers_28_Add_4_output_0_to_model_layers_29_Add_4_output_0_config.axmodel
AX Denoising:  22%|███████████████▎                                                     | 2/9 [01:36<04:45, 40.84s/it]AX Denoising: 100%|█████████████████████████████████████████████████████████████████████| 9/9 [02:20<00:00, 15.60s/it]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 5c5e711b
2026-01-15 16:01:06.972 | INFO     | __main__:main:537 - AXModel inference complete, result saved to /root/yongqiang/push_hugging_face/Z-Image-Turbo/VideoX-Fun/samples/z-image-t2i-axmodel/z_image_axmodel_2.png

The inference process demonstrates the complete pipeline working on the hardware, including:

  • Model loading and initialization (~3 minutes for all 33 subgraphs)
  • Denoising iterations (9 steps, ~2 minutes 20 seconds total)
  • Final image generation and saving

Known Limitations

Quantization Accuracy: Due to quantization precision limitations, the axmodel inference results show some differences compared to the original ONNX model outputs (see the comparison sketch after the list below). This is a trade-off between inference speed and numerical precision when deploying on NPU hardware. Future work may include:

  • Fine-tuning quantization parameters to improve accuracy
  • Exploring mixed-precision quantization strategies
  • Implementing calibration with more diverse datasets
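
To quantify this gap for a given component, the ONNX and axmodel outputs can be compared on identical inputs. A minimal sketch using the VAE decoder as an example, run wherever both ONNX Runtime and the axengine runtime are available; the latent sample path is a hypothetical placeholder, and cosine similarity plus max absolute error are used as quick metrics.

import numpy as np
import onnxruntime as ort
import axengine as axe

latent = np.load("calibration/vae_decoder_latent_sample.npy")  # hypothetical sample path

# Reference output from the original ONNX model (CPU).
onnx_sess = ort.InferenceSession("vae_decoder_simp_slim.onnx",
                                 providers=["CPUExecutionProvider"])
ref = onnx_sess.run(None, {onnx_sess.get_inputs()[0].name: latent})[0]

# Quantized output from the compiled axmodel (NPU).
ax_sess = axe.InferenceSession("vae_decoder.axmodel")
quant = ax_sess.run(None, {ax_sess.get_inputs()[0].name: latent})[0]

a, b = ref.ravel().astype(np.float64), quant.ravel().astype(np.float64)
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cos:.5f}, max abs error: {np.abs(a - b).max():.5f}")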

Advanced Usage

Frontend-Only Export for Graph Analysis

For debugging and graph analysis, you can export only the frontend graph without compilation:

ENABLE_COMPILER=0 DUMP_FRONTEND_GRAPH=1 \
pulsar2 build \
    --output_dir ./compiled_output_trans_body_only_frontend \
    --config pulsar2_configs/config_controlnet.json \
    --npu_mode NPU3 \
    --input ../original_onnx/z_image_transformer_body_only_simp_slim.onnx \
    --target_hardware AX650

This is useful for:

  • Analyzing the graph structure before compilation
  • Debugging subgraph partitioning strategies
  • Verifying model transformations

Compile from Quantized ONNX

If you already have a quantized ONNX model, you can compile it directly:

pulsar2 build \
    --input compiled_output_trans_body_only_use_calibration/quant/quant_axmodel.onnx \
    --model_type QuantAxModel \
    --output_dir compiled_subgraph_from_quant_onnx \
    --output_name transformers.axmodel \
    --config pulsar2_configs/transformers_subgraph.json \
    --target_hardware AX650 \
    --npu_mode NPU3

Technical Support

If you encounter any issues or have questions about the implementation:

  • GitHub Issues: Create an issue for bug reports and feature requests
  • QQ Group: 139953715 (Chinese community support)

License

This project is licensed under the BSD-3-Clause License. See the LICENSE file for details.


Note: This implementation is optimized for AXERA AX650N hardware. Performance and compatibility may vary on other platforms.
