CoreML models for Apple devices. See https://github.com/FluidInference/FluidAudio for usage details.
Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.
| Variant | File | Latency | Use Case |
|---|---|---|---|
| Default | Sortformer.mlmodelc | ~1.04s | Low latency streaming |
| NVIDIA Low | SortformerNvidiaLow.mlmodelc | ~1.04s | Low latency streaming |
| NVIDIA High | SortformerNvidiaHigh.mlmodelc | ~30.4s | Best quality, offline |
| Parameter | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| chunk_len | 6 | 6 | 340 |
| chunk_right_context | 7 | 7 | 40 |
| chunk_left_context | 1 | 1 | 1 |
| fifo_len | 40 | 188 | 40 |
| spkcache_len | 188 | 188 | 188 |
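The latencies in the variant table follow directly from these window parameters, assuming an ~80 ms output frame (the 8x-subsampled 10 ms mel hop typical of this model family; this frame duration is an assumption, not stated in the tables). A minimal sketch:

```swift
// Sketch: derive each preset's streaming latency from its window parameters,
// assuming an 80 ms output frame (8x-subsampled 10 ms mel hop).
let frameMs = 80

func latencyMs(chunkLen: Int, rightContext: Int) -> Int {
    // A chunk's predictions can only be emitted once its right context has arrived.
    return (chunkLen + rightContext) * frameMs
}

let defaultLatency = latencyMs(chunkLen: 6, rightContext: 7)    // 1040 ms, i.e. ~1.04s
let highLatency = latencyMs(chunkLen: 340, rightContext: 40)    // 30400 ms, i.e. ~30.4s
print(defaultLatency, highLatency)
```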
General:
| Input | Shape | Description |
|---|---|---|
| chunk | [1, 8*(C+L+R), 128] | Mel spectrogram features |
| chunk_lengths | [1] | Actual chunk length |
| spkcache | [1, S, 512] | Speaker cache embeddings |
| spkcache_lengths | [1] | Actual cache length |
| fifo | [1, F, 512] | FIFO queue embeddings |
| fifo_lengths | [1] | Actual FIFO length |
| Output | Shape | Description |
|---|---|---|
| speaker_preds | [C+L+R+S+F, 4] | Speaker probabilities (4 speakers) |
| chunk_pre_encoder_embs | [C+L+R, 512] | Embeddings for state update |
| chunk_pre_encoder_lengths | [1] | Actual embedding count |
| nest_encoder_embs | [C+L+R+S+F, 192] | Embeddings for speaker discrimination |
| nest_encoder_lengths | [1] | Actual speaker embedding count |
Note: C = chunk_len, L = chunk_left_context, R = chunk_right_context, S = spkcache_len, F = fifo_len.
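Using these definitions, the configuration-specific dimensions below can be reproduced from the five window parameters. A small sketch (the struct and property names are illustrative, not part of the FluidAudio API):

```swift
// Sketch: reproduce the per-configuration tensor dimensions from the five
// window parameters (struct and names are illustrative, not FluidAudio API).
struct SortformerShapes {
    let c: Int  // chunk_len
    let l: Int  // chunk_left_context
    let r: Int  // chunk_right_context
    let s: Int  // spkcache_len
    let f: Int  // fifo_len

    var chunkFrames: Int { 8 * (c + l + r) }    // mel frames fed per step
    var windowFrames: Int { c + l + r }         // pre-encoder embedding count
    var totalFrames: Int { c + l + r + s + f }  // frames covered by speaker_preds
}

let defaultCfg = SortformerShapes(c: 6, l: 1, r: 7, s: 188, f: 40)
let nvidiaHigh = SortformerShapes(c: 340, l: 1, r: 40, s: 188, f: 40)

print(defaultCfg.chunkFrames, defaultCfg.totalFrames)  // 112 and 242
print(nvidiaHigh.chunkFrames, nvidiaHigh.totalFrames)  // 3048 and 609
```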
Configuration-Specific Shapes:
| Input | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| chunk | [1, 112, 128] | [1, 112, 128] | [1, 3048, 128] |
| chunk_lengths | [1] | [1] | [1] |
| spkcache | [1, 188, 512] | [1, 188, 512] | [1, 188, 512] |
| spkcache_lengths | [1] | [1] | [1] |
| fifo | [1, 40, 512] | [1, 188, 512] | [1, 40, 512] |
| fifo_lengths | [1] | [1] | [1] |
| Output | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| speaker_preds | [1, 242, 128] | [1, 390, 128] | [1, 609, 128] |
| chunk_pre_encoder_embs | [1, 14, 512] | [1, 14, 512] | [1, 381, 512] |
| chunk_pre_encoder_lengths | [1] | [1] | [1] |
| nest_encoder_embs | [1, 242, 192] | [1, 390, 192] | [1, 609, 192] |
| nest_encoder_lengths | [1] | [1] | [1] |
| Metric | Default | NVIDIA High |
|---|---|---|
| Latency | ~1.12s | ~30.4s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |
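RTFx is the usual real-time factor: seconds of audio processed per second of compute, so an RTFx of 5.7x means a minute of audio is diarized in roughly 10.5 seconds. A quick sketch of the relationship (the helper is illustrative, not part of FluidAudio):

```swift
// Sketch: real-time factor = audio duration / processing time.
// An RTFx of 5.7x means 5.7 seconds of audio are processed per second of compute.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    return audioSeconds / processingSeconds
}

// e.g. a 60 s file processed in ~10.5 s lands near the Default preset's ~5.7x
let example = rtfx(audioSeconds: 60.0, processingSeconds: 10.5)
print(example)
```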
Usage

```swift
import FluidAudio

// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..<result.frameCount {
            for speaker in 0..<4 {
                // Per-frame, per-speaker activity probability
                let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
            }
        }
    }
}

// Or batch processing
let timeline = try diarizer.processComplete(audioSamples)
for (speakerIndex, segments) in timeline.segments.enumerated() {
    for segment in segments {
        print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
    }
}
```
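The per-frame probabilities from the streaming loop can be collapsed into speaker segments by simple thresholding. A self-contained sketch (the 0.5 threshold and 80 ms frame duration are assumptions, and the helper is not part of the FluidAudio API):

```swift
// Sketch: collapse one speaker's per-frame probability track into
// (start, end) segments by thresholding. The 0.5 threshold and 80 ms
// frame duration are assumptions; this helper is not FluidAudio API.
func segments(probs: [Double],
              threshold: Double = 0.5,
              frameSeconds: Double = 0.08) -> [(start: Double, end: Double)] {
    var result: [(start: Double, end: Double)] = []
    var segStart: Int? = nil
    for (i, p) in probs.enumerated() {
        if p >= threshold, segStart == nil {
            segStart = i                      // speaker became active
        } else if p < threshold, let s = segStart {
            result.append((Double(s) * frameSeconds, Double(i) * frameSeconds))
            segStart = nil                    // speaker went inactive
        }
    }
    if let s = segStart {                     // close a segment still open at the end
        result.append((Double(s) * frameSeconds, Double(probs.count) * frameSeconds))
    }
    return result
}

let track = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9]
let segs = segments(probs: track)             // two segments: frames 1-2 and 4-5
```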
Performance
See https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md for detailed benchmarks.
Files
- Models
- Scripts
- Source
Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1
Credits & Acknowledgements
This project would not have been possible without the significant technical contributions of https://huggingface.co/GradientDescent2718, whose work was instrumental in the CoreML conversion. It builds upon the foundational work of the NVIDIA NeMo team.