NIO NIM4-ASR

NOMI Intelligence Model 4.0-ASR (NIM4-ASR)

Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

NIM4-ASR is a 2.3B-parameter production-oriented LLM-based ASR framework designed for low-latency streaming speech recognition, robustness against hallucinations, and million-scale hotword customization.

Yuan Xie*, Jiaqi Song*, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu

Advanced Intelligence System Group, NIO

Demo

Streaming Inference

This demo illustrates streaming inference with real-time punctuation insertion, followed by a full-context second-pass LLM decoding update once all streaming chunks have been received.

Sports commentary. The model can leverage semantic context to correct the streaming partial result from "那时" to "纳什".
Automotive safety terminology. We provide streaming punctuation insertion to improve the readability of real-time transcriptions.

Streaming Inference with Online Hotword Customization

This demo illustrates streaming inference with online hotword retrieval and biasing: hotwords are retrieved on-the-fly and injected into the decoding context in real time. Our hotword customization module does not require hotwords to be manually specified — they can instead be retrieved automatically from a large-scale database via phoneme-based RAG.

Location-aware hotwords. Real-time hotword biasing corrects "彗新" to "惠新".
Celebrity-aware hotwords.

Comparison with Leading Open-Source LLM-ASR Models

Chinese

Each clip lists the reference transcription, followed by each system's CER and hypothesis.

vitw_00426 (voice_in_the_wild_bench)
Ground truth: 这个快捷键不顺手换一个组合
Our NIM4-ASR (streaming), CER 0.00: 这个快捷键不顺手换一个组合
Fun-ASR Nano-2512, CER 23.08: 快捷键不换一个组合
Qwen3-ASR 1.7B, CER 15.38: 这个快捷键不换一个组合
FireRedASR2S LLM, CER 15.38: 这个快捷键不换一个组合
Step-Audio2 Mini, CER 7.69: 这个快捷键不手换一个组合
Qwen3-Omni 30B-A3B Instruct, CER 7.69: 这个快捷键不手换一个组合

A1109CO0001 (aishell_2021c)
Ground truth: 减弱音量到八十九
Our NIM4-ASR (streaming), CER 0.00: 减弱音量到八十九
Fun-ASR Nano-2512, CER 50.00: 紧罗一样到八十九
Qwen3-ASR 1.7B, CER 50.00: 锦罗一样到八十九
FireRedASR2S LLM, CER 50.00: 简罗一样到八十九
Step-Audio2 Mini, CER 37.50: 了一样到八十九
Qwen3-Omni 30B-A3B Instruct, CER 50.00: 捡一辆到八十九

A1123CO0001 (aishell_2021c)
Ground truth: 音量第七格
Our NIM4-ASR (streaming), CER 0.00: 音量第七格
Fun-ASR Nano-2512, CER 60.00: 音量贴贴狗
Qwen3-ASR 1.7B, CER 100.00: 听听歌
FireRedASR2S LLM, CER 100.00: 晶亮贴贴狗
Step-Audio2 Mini, CER >100: 你们二个天天搞
Qwen3-Omni 30B-A3B Instruct, CER 100.00: 你有踢踢过

9uELWl5vOJM_0093 (SPEECHIO_ASR_ZH00020)
Ground truth: 而纳什也在零六年拿到自己第二个MVP奖杯并且是连庄MVP
Our NIM4-ASR (streaming), CER 0.00: 而纳什也在零六年拿到自己第二个MVP奖杯并且是连庄MVP
Fun-ASR Nano-2512, CER 33.33: 那时也在零六年拿到自己第二个MVP奖杯并且是连MVP
Qwen3-ASR 1.7B, CER 29.63: 那时也在零六年拿到自己第二个MVP奖杯并且是连庄MVP
FireRedASR2S LLM, CER 29.63: 那时也在零六年拿到自己第二个MVP奖杯并且是连庄MVP
Step-Audio2 Mini, CER 33.33: 那时也在零六年拿到自己第二个MVP奖杯并且是连MVP
Qwen3-Omni 30B-A3B Instruct, CER 29.63: 那时也在零六年拿到自己第二个MVP奖杯并且是连庄MVP

ZFWalpHS0wg_0035 (SPEECHIO_ASR_ZH00021)
Ground truth: 主动刹车车道偏离预警车道保持系统包括它还有DSC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错
Our NIM4-ASR (streaming), CER 0.00: 主动刹车车道偏离预警车道保持系统包括它还有DSC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错
Fun-ASR Nano-2512, CER 3.70: 主动刹车车道偏离预警车道保持系统包括它还有DSA车身稳定控制系统这些全部都是全系标配的所以安心配置表现还不错
Qwen3-ASR 1.7B, CER 5.56: 主动刹车车道偏离预警车道保持系统包括它还有ESC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错
FireRedASR2S LLM, CER 5.56: 主动刹车车道偏离预警车道保持系统包括它还有ESC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错
Step-Audio2 Mini, CER 5.56: 主动刹车车道偏离预警车道保持系统包括它还有DSI车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错
Qwen3-Omni 30B-A3B Instruct, CER 1.85: 主动刹车车道偏离预警车道保持系统包括它还有ESC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错

English

Each clip lists the reference transcription, followed by each system's WER and hypothesis. A double underscore (__) marks a deleted word.

ZH-CN_U0061_S0_153 (cs_dialogue)
Ground truth: yeah I watched the movie for twice twice and my my best friends have watched from nine nine the nine times oh my god and Im so shock right now it is incredibly hes yeah hes the big hes the big fan of taylor swift
Our NIM4-ASR (streaming), WER 8.89: yeah I watched the movie for twice twice and my my best friends have watched it for nine nine the nine times oh my god and Im so shocked right now it is incredibly hes yeah hes the biggest hes the big fan of taylor swift
Fun-ASR Nano-2512, WER 22.22: yeah I watched the movie for twice twice and my my best friends have watched it for nine nine the nine times oh my god __ Im so shocked right now it it it incredibly hes a yeah hes the biggest hes the big fan of takers with
Qwen3-ASR 1.7B, WER 17.78: yeah I watched the movie for twice twice and my my best friends have watched it for nine nine __ nine times oh my god __ Im so shocked right now it it is incredibly hes a yeah hes the biggest hes the big fan of taylor swift
FireRedASR2S LLM, WER 15.56: yeah I watched the movie for twice twice and my my best friends have watched it from nine nine the nine times oh my god is Im so shocked right now it it it is incredibly hes a yeah hes the biggest hes the big fan of taylor swift
Step-Audio2 Mini, WER 17.78: yeah I watched the movie for twice twice and my my best friends have watched it for nine nine the nine times oh my god __ Im so shocked right now it it its incredibly hes a yeah hes the biggest hes the big fan of taylor swift
Qwen3-Omni 30B-A3B Instruct, WER 44.44: yes I watched the movie __ twice and my __ best friend has watched it nine times oh my god __ Im so shocked right now __ incredibly hes __ the biggest fan of taylor swift

5442-32873-00140000009853 (librispeech_test_other)
Ground truth: luke took care of mister larkins dogs and groomed mister wylders horse and cleaned up his dog cart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dog cart in the post office premises and so evaded the livery charges of the brandon arms
Our NIM4-ASR (streaming), WER 1.69: luke took care of mister larkins dogs and groomed mister wilders horse and cleaned up his dog cart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dog cart in the post office premises and so evaded the livery charges of the brandon arms
Fun-ASR Nano-2512, WER 15.25: luke took care of mister larkins dogs and groomed mister wylders horse and cleaned up his dogcart for mock wing clothes about money and finding that the thing was to be done more cheaply that way put up his horse and dogcart in the postoffice premises and so evaded the livery charges of the brandon arms
Qwen3-ASR 1.7B, WER 5.08: luke took care of mister larkins dogs and groomed mister wilders horse and cleaned up his dog cart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dog cart in the post office premises and so evaded the livery charges of the brand and arms
FireRedASR2S LLM, WER 8.47: luke took care of mister larkins dogs and groomed mister wilders horse and cleaned up his dogcart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dogcart in the post office premises and so evaded the livery charges of the brandon arms
Step-Audio2 Mini, WER 8.47: luc took care of mr larkinss dogs and groomed mr wilders horse and cleaned up his dog cart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dog cart in the post office premises and so evaded the livery charges of the brandon arms
Qwen3-Omni 30B-A3B Instruct, WER 10.17: luke took care of mister larkinss dogs and groomed mister wilders horse and cleaned up his dogcart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dogcart in the post office premises and so evaded the livery charges of the brandon arms

vitw_01998_random_synthetic_en_recording_4841 (voice_in_the_wild_bench)
Ground truth: I need to review the budget again before approving the purchase request
Our NIM4-ASR (streaming), WER 0.00: I need to review the budget again before approving the purchase request
Fun-ASR Nano-2512, WER 33.33: I need to review the budget again before approving them for the budget meeting
Qwen3-ASR 1.7B, WER 25.00: I need to review the budget again for approving the purchases
FireRedASR2S LLM, WER 25.00: I need to review the budget again before approving it
Step-Audio2 Mini, WER 8.33: I need to review the budget again before approving the purchase order
Qwen3-Omni 30B-A3B Instruct, WER 41.67: I need to review the budget again for upcoming delta purchases

Mandarin-English Code-switch

Each clip lists the reference transcription, followed by each system's CER and hypothesis.

validation-00000-of-00001_00570 (ascend)
Ground truth: 平时就是我们做pre的时候他们尽量就是
Our NIM4-ASR (streaming), CER 0.00: 平时就是我们做pre的时候他们尽量就是
Fun-ASR Nano-2512, CER 5.88: 平时就是我们做的时候他们尽量就是
Qwen3-ASR 1.7B, CER 5.88: 平时就是我们做的时候他们尽量就是
FireRedASR2S LLM, CER 5.88: 平时就是我们做play的时候他们尽量就是
Step-Audio2 Mini, CER 5.88: 平时就是我们做play的时候他们尽量就是
Qwen3-Omni 30B-A3B Instruct, CER 5.88: 平时就是我们做play的时候他们尽量就是

ZH-CN_U0018_S0_32 (cs_dialogue)
Ground truth: 好的of course
Our NIM4-ASR (streaming), CER 0.00: 好的of course
Fun-ASR Nano-2512, CER 50.00: had that of course
Qwen3-ASR 1.7B, CER 50.00: hada of course
FireRedASR2S LLM, CER 50.00: how that of course
Step-Audio2 Mini, CER >100: 哈那那那那那那那那那那那那那那那那那那那......
Qwen3-Omni 30B-A3B Instruct, CER 0.00: 好的of course

Hallucination-Prone Cases

Each clip lists the reference transcription, followed by each system's CER and hypothesis.

whisper_S0189_M-0189-2_014060-014592 (aishell_6c)
Ground truth: 恋人间的默契像是无声的语言相视一笑便懂
Our NIM4-ASR (streaming), CER 5.26: 恋人间的默契像是无声的语言相一笑便懂
Fun-ASR Nano-2512, CER 26.32: 恋人间的默契像是无声的语言像是一小片
Qwen3-ASR 1.7B, CER 15.79: 人间的默契像是无声的语言像是一笑便懂
FireRedASR2S LLM, CER >100: 阿弥陀佛阿弥陀佛阿弥陀佛阿弥陀佛阿弥陀佛阿弥陀佛阿弥陀佛......
Step-Audio2 Mini, CER 26.32: 恋人间的默契像是无声的语言像是一小片毒
Qwen3-Omni 30B-A3B Instruct, CER >100: 在美丽的湖边划一只小船享受着自然的亲密接触

whisper_S0189_M-0189-2_099060-099700 (aishell_6c)
Ground truth: 在工作与生活之间找到平衡是一门深奥的艺术
Our NIM4-ASR (streaming), CER 0.00: 在工作与生活之间找到平衡是一门深奥的艺术
Fun-ASR Nano-2512, CER 5.00: 在工作与生活之间找到平衡是一门深的艺术
Qwen3-ASR 1.7B, CER 5.00: 在工作与生活之间找到平衡是一门深的艺术
FireRedASR2S LLM, CER >100: 搿么侬讲到搿个叫啥个呃吃个物事对伐我觉着也蛮重要个为啥道理呢......
Step-Audio2 Mini, CER 5.00: 在工作与生活之间找到平衡是一门深的艺术
Qwen3-Omni 30B-A3B Instruct, CER >100: 突然的灵感如同泉涌创作的激情也随之而来热血沸腾

Method

Architecture and Training

NIM4-ASR architecture diagram

Modular ASR Architecture

The architecture comprises a streaming Conformer encoder, a two-layer speech adaptor, a phoneme-level CTC head with a RAG module, and a Qwen3-1.7B LLM decoder.

NIM4-ASR training pipeline diagram

Role-Aligned Training

Training consists of CR-CTC pretraining, alignment, CKA-triggered IA-SFT, late-stage joint SFT, context SFT, and ASR-specific GRPO-style reinforcement learning.

Our training design, from pretraining through SFT, is guided by an entropy-allocation analysis of the Encoder–Adaptor–LLM architecture. For details, see our paper "Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs" (listed under Citation).

Streaming

Real-Time Inference

NIM4-ASR uses a decoupled deployment architecture: the encoder runs on Triton, the adaptor and LLM run in a vLLM-based engine, and the phoneme head and RAG module run on CPU. Speech embeddings are appended incrementally via streaming chunked prefill, enabling real-time partial transcription followed by a stable second-pass final decoding after VAD detects the end of speech.
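
The control flow described above (append embeddings chunk by chunk, emit partial results, then run a second pass at end of speech) can be illustrated with a toy sketch. This is not the NIM4-ASR implementation: `stream_transcribe`, the `encode`, `first_pass`, and `second_pass` callables, and the string stand-ins for audio and embeddings are all hypothetical, and the end-of-speech flag is assumed to come from a VAD.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Embedding = str  # stand-in type; real speech embeddings would be tensors


def stream_transcribe(
    chunks: Iterable[Tuple[str, bool]],
    encode: Callable[[str], List[Embedding]],
    first_pass: Callable[[List[Embedding]], str],
    second_pass: Callable[[List[Embedding]], str],
) -> Iterator[Tuple[str, str]]:
    """Yield ("partial", text) per chunk and ("final", text) at end of speech."""
    context: List[Embedding] = []      # grows as chunks arrive
    for audio, end_of_speech in chunks:
        context.extend(encode(audio))  # incremental append; old chunks are not re-encoded
        if end_of_speech:              # flag assumed to be produced by a VAD
            yield ("final", second_pass(context))
            context = []               # reset for the next utterance
        else:
            yield ("partial", first_pass(context))


# Toy stubs: "audio" is text, "embeddings" are words (assumptions for illustration).
def encode(audio: str) -> List[Embedding]:
    return audio.split()


def first_pass(context: List[Embedding]) -> str:
    return " ".join(context)


def second_pass(context: List[Embedding]) -> str:
    return " ".join(context).capitalize() + "."


events = list(
    stream_transcribe(
        [("turn the", False), ("volume up", True)], encode, first_pass, second_pass
    )
)
print(events)
```

In this toy, the second pass differs from the first only by formatting; in a real system it would be a separate decoding run over the full context.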

Customization

Phoneme-Level RAG

Hotwords are encoded as phoneme-token sequences in a trie structure built upon an Aho–Corasick automaton for efficient retrieval. Exact phoneme matching and longest-match filtering support databases containing millions of hotwords while maintaining sub-millisecond retrieval latency and high precision.
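
A minimal sketch of this kind of retrieval, assuming a textbook Aho–Corasick automaton over phoneme tokens, is shown below. The class `PhonemeAutomaton`, the function `longest_match_filter`, the pinyin-style phoneme symbols, and the overlap policy (keep the longest span among overlapping hits) are illustrative assumptions rather than the production design.

```python
from collections import deque


class PhonemeAutomaton:
    """Aho-Corasick automaton whose alphabet is phoneme tokens."""

    def __init__(self):
        self.goto = [{}]   # per-state transitions: phoneme -> next state
        self.fail = [0]    # failure links
        self.out = [[]]    # ids of hotwords ending at each state
        self.words = []    # (hotword, number of phonemes)

    def add(self, hotword, phonemes):
        state = 0
        for p in phonemes:
            nxt = self.goto[state].get(p)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[state][p] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            state = nxt
        self.out[state].append(len(self.words))
        self.words.append((hotword, len(phonemes)))

    def build(self):
        """Compute failure links breadth-first; call once after all add() calls."""
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for p, nxt in self.goto[state].items():
                f = self.fail[state]
                while f and p not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(p, 0)
                # Inherit hotwords that end at the failure target (suffix matches).
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]
                queue.append(nxt)

    def search(self, phonemes):
        """Return (start, end, hotword) for every exact match; end is exclusive."""
        state, hits = 0, []
        for end, p in enumerate(phonemes, start=1):
            while state and p not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(p, 0)
            for wid in self.out[state]:
                word, n = self.words[wid]
                hits.append((end - n, end, word))
        return hits


def longest_match_filter(hits):
    """Keep non-overlapping hits, preferring longer spans (assumed policy)."""
    kept = []
    for start, end, word in sorted(hits, key=lambda h: (-(h[1] - h[0]), h[0])):
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, word))
    return sorted(kept)


# Hypothetical initial/final phoneme tokens with tone digits.
ac = PhonemeAutomaton()
ac.add("惠新", ["h", "ui4", "x", "in1"])
ac.add("惠新西街", ["h", "ui4", "x", "in1", "x", "i1", "j", "ie1"])
ac.add("西街", ["x", "i1", "j", "ie1"])
ac.build()

stream = ["q", "v4", "h", "ui4", "x", "in1", "x", "i1", "j", "ie1"]
hits = ac.search(stream)
print(longest_match_filter(hits))  # [(2, 10, '惠新西街')]
```

The search visits each phoneme once and follows failure links instead of restarting, so the cost of a lookup depends on the length of the phoneme stream rather than on the number of stored hotwords.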

Results

To reduce evaluation variance caused by surface-form differences, including numeric formatting and filler-word usage, we normalize all transcriptions with WeTextProcessing, a WFST-based toolkit. Although normalization may lower absolute error rates, applying the same pipeline to every system enables a fairer comparison of recognition performance. All baselines are reproduced according to their official guidelines.

TL;DR: We apply the same WeTextProcessing-based normalization to all systems, reducing formatting-related noise and enabling fairer comparisons, although the resulting error rates may be lower than those reported under standard evaluation protocols.
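
For reference, CER and WER are conventionally computed as the Levenshtein distance between reference and hypothesis tokens, divided by the reference length. The sketch below assumes both strings have already been normalized; it does not call WeTextProcessing, and the function names are illustrative. It also shows why an error rate can exceed 100%: insertions count as errors, while the denominator is the reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with unit cost for substitution, insertion, deletion."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent, relative to the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(ref, hyp):
    """Character error rate: tokens are single characters."""
    return error_rate(list(ref), list(hyp))


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


# Pairs taken from the comparison examples on this page.
print(cer("减弱音量到八十九", "紧罗一样到八十九"))   # 50.0 (4 substitutions / 8 characters)
print(cer("音量第七格", "你们二个天天搞"))           # 140.0 (5 substitutions + 2 insertions / 5)
print(round(wer(
    "I need to review the budget again before approving the purchase request",
    "I need to review the budget again before approving the purchase order",
), 2))                                               # 8.33 (1 substitution / 12 words)
```

Mixed Mandarin-English text needs a tokenization rule for English words inside a CER computation; that rule is not specified here, so the sketch does not cover it.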

Public Benchmarks (Metric: CER/WER)

Benchmark | Fun-ASR Nano-2512 | GLM-ASR Nano-2512 | Qwen3-ASR 1.7B | FireRedASR2S LLM | Step-Audio2 Mini | Qwen3-Omni 30B-A3B Instruct | NIM4-ASR (offline) | NIM4-ASR (streaming)
Model Size | 0.8B | 1.5B | 2.0B | 8B+ | 8B+ | 30B-A3B | 2.3B | 2.3B
Mandarin
AISHELL-1 (dev / test) | 1.59 / 1.81 | 2.40 / 2.41 | 1.40 / 1.51 | 0.60 / 0.64 | 0.76 / 0.81 | 0.86 / 0.92 | 0.43 / 0.57 | 0.43 / 0.60
AISHELL-2-ios (dev / test) | 2.62 / 2.73 | 3.21 / 3.45 | 2.41 / 2.60 | 2.07 / 2.08 | 2.24 / 2.29 | 2.11 / 2.31 | 2.28 / 2.43 | 2.33 / 2.49
AISHELL-2021-Eval (A / C / D) | 4.75 / 4.29 / 2.33 | 7.25 / 9.48 / 3.40 | 4.22 / 3.51 / 1.82 | 13.40 / 3.92 / 4.68 | 4.54 / 3.69 / 2.34 | 5.19 / 3.34 / 1.66 | 3.12 / 1.51 / 1.81 | 3.28 / 1.63 / 2.22
WeNetSpeech (meeting / net) | 4.68 / 5.22 | 6.87 / 5.72 | 4.00 / 4.13 | 3.36 / 3.52 | 4.23 / 4.63 | 3.92 / 3.85 | 4.91 / 4.72 | 5.71 / 5.00
SpeechIO | 2.78 | 3.17 | 2.55 | 2.20 | 3.41 | 2.33 | 2.61 | 2.84
Chinese Dialects
WeNetSpeech-Chuan (easy / hard) | 13.21 / 23.76 | 20.95 / 33.61 | 11.18 / 20.35 | 10.36 / 20.07 | 13.99 / 25.35 | 14.13 / 25.16 | 10.51 / 20.58 | 11.22 / 20.37
WeNetSpeech-Yue (short / long) | 7.31 / 10.02 | 16.78 / 13.97 | 5.79 / 8.00 | 5.05 / 10.45 | 7.78 / 8.44 | 6.97 / 8.60 | 5.12 / 8.58 | 5.39 / 9.62
KeSpeech | 7.18 | 9.59 | 4.98 | 3.05 | 3.98 | 6.00 | 4.40 | 5.08
English
LibriSpeech-dev (clean / other) | 1.63 / 4.06 | 1.82 / 3.93 | 1.54 / 3.14 | 1.27 / 2.63 | 1.06 / 2.48 | 1.08 / 2.10 | 1.13 / 2.45 | 1.18 / 2.86
LibriSpeech-test (clean / other) | 1.63 / 4.35 | 1.96 / 4.29 | 1.56 / 3.49 | 1.29 / 2.97 | 1.22 / 2.61 | 1.15 / 2.38 | 1.19 / 2.53 | 1.29 / 2.92
VoxPopuli (dev / test) | 7.86 / 7.70 | 8.78 / 8.52 | 7.58 / 7.42 | 9.38 / 9.24 | 8.86 / 8.37 | 6.86 / 6.75 | 6.18 / 6.08 | 6.26 / 6.22
MLS-English | 6.80 | 5.32 | 4.93 | 4.71 | 4.37 | 4.04 | 4.77 | 5.04
Mandarin–English Code-Switching
CS-Dialogue | 5.37 | 6.15 | 5.44 | 4.63 | 9.46 | 8.51 | 4.70 | 4.91
ASCEND | 11.91 | 12.29 | 10.87 | 10.22 | 13.50 | 18.68 | 11.46 | 11.85
Lyrics
M4Singer | 5.25 | 18.45 | 5.72 | N/A | 9.68 | 8.40 | 6.39 | 6.94

Internal Benchmarks (Metric: CER/WER)

Benchmark | Fun-ASR Nano-2512 | GLM-ASR Nano-2512 | Qwen3-ASR 1.7B | FireRedASR2S LLM | Step-Audio2 Mini | Qwen3-Omni 30B-A3B Instruct | NIM4-ASR (offline) | NIM4-ASR (streaming)
Model Size | 0.8B | 1.5B | 2.0B | 8B+ | 8B+ | 30B-A3B | 2.3B | 2.3B
Point of Interest (POI)
City A | 7.07 | 14.68 | 9.14 | 8.54 | 9.41 | 9.67 | 3.86 | 3.85
City B | 8.50 | 15.75 | 10.59 | 10.43 | 11.67 | 11.73 | 4.86 | 4.94
City C | 7.60 | 17.55 | 10.01 | 10.17 | 11.35 | 12.18 | 3.77 | 3.81
City D | 7.42 | 17.91 | 9.77 | 9.51 | 11.55 | 10.86 | 4.10 | 4.17
Media
Music | 12.60 | 24.25 | 12.67 | 12.13 | 14.94 | 15.89 | 5.75 | 5.78
Video | 8.27 | 20.35 | 9.69 | 9.38 | 12.30 | 15.33 | 2.99 | 3.03
Radio | 13.69 | 19.82 | 10.51 | 11.84 | 14.21 | 17.91 | 1.21 | 1.17
Device Control
Vehicle control | 4.74 | 8.78 | 5.31 | 4.52 | 4.97 | 4.18 | 1.88 | 1.78
Conversational
Vehicle-domain chat (easy / hard) | 3.75 / 5.92 | 5.63 / 10.12 | 3.31 / 5.96 | 2.93 / 5.61 | 2.35 / 7.63 | 5.98 / 6.60 | 2.70 / 4.88 | 2.76 / 4.83
Multi-domain chat | 1.65 | 1.89 | 1.33 | 1.27 | 1.49 | 5.34 | 1.55 | 1.75

Hallucination Rate (%)

Category | Fun-ASR Nano-2512 | GLM-ASR Nano-2512 | Qwen3-ASR 1.7B | FireRedASR2S LLM | Step-Audio2 Mini | Qwen3-Omni 30B-A3B Instruct | NIM4-ASR (offline w/o RL) | NIM4-ASR (offline w/ RL)
Mandarin (Avg.) | 0.018% | 0.030% | 0.018% | 0.165% | 0.020% | 0.013% | 0.003% | 0.002%
Dialect (Avg.) | 0.217% | 0.201% | 0.120% | 0.298% | 0.194% | 0.370% | 0.122% | 0.117%
English (Avg.) | 0.014% | 0.014% | 0.014% | 0.014% | 0.014% | 0.007% | 0.007% | 0.007%
Code-switch (Avg.) | 0.397% | 0.315% | 0.345% | 0.335% | 1.255% | 1.778% | 0.261% | 0.261%
Lyrics (Avg.) | 0.153% | 0.580% | 0.249% | 1.775% | 0.390% | 0.129% | 0.215% | 0.081%

Citation

Rethinking Entropy Allocation in LLM-based ASR
@article{xie2026rethinking,
  title={Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs},
  author={Xie, Yuan and Song, Jiaqi and Qiu, Guang and Wang, Xianliang and Lei, Ming and Gao, Jie and Wu, Jie},
  journal={arXiv preprint arXiv:2604.08003},
  year={2026}
}
NIM4-ASR
@article{xie2026nim4,
  title={NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR},
  author={Xie, Yuan and Song, Jiaqi and Qiu, Guang and Wang, Xianliang and Qiao, Kai and Yuan, Junfeng and Liu, Shengqing and Zhang, Yi and Chen, Bowen and Lei, Ming and others},
  journal={arXiv preprint arXiv:2604.18105},
  year={2026}
}