NIM4-ASR

Demo

Streaming Inference

This demo illustrates streaming inference with real-time punctuation insertion, followed by a full-context second-pass LLM decoding update once all streaming chunks have been received.

Sports commentary. The model can leverage semantic context to correct the streaming partial result from "那时" to "纳什".

Automotive safety terminology. We provide streaming punctuation insertion to improve the readability of real-time transcriptions.

Streaming Inference with Online Hotword Customization

This demo illustrates streaming inference with online hotword retrieval and biasing: hotwords are retrieved on the fly and injected into the decoding context in real time. Our hotword customization module does not require hotwords to be specified manually; they can instead be retrieved automatically from a large-scale database via phoneme-based RAG.

Location-aware hotwords. Real-time hotword biasing corrects "彗新" to "惠新".

Celebrity-aware hotwords.

Comparison with Leading Open-Source LLM-ASR Models

Chinese

Speech Clip	Ground Truth reference transcription	Our NIM4-ASR (streaming)	Fun-ASR Nano-2512	Qwen3-ASR 1.7B	FireRedASR2S LLM	Step-Audio2 Mini	Qwen3-Omni 30B-A3B Instruct
vitw_00426 (voice_in_the_wild_bench)	这个快捷键不顺手换一个组合	CER 0.00这个快捷键不顺手换一个组合	CER 23.08这种快捷键不熟换一个组合	CER 15.38这个快捷键不熟换一个组合	CER 15.38这个快捷键不熟换一个组合	CER 7.69这个快捷键不是手换一个组合	CER 7.69这个快捷键不是手换一个组合
A1109CO0001 (aishell_2021c)	减弱音量到八十九	CER 0.00减弱音量到八十九	CER 50.00紧罗一样到八十九	CER 50.00锦罗一样到八十九	CER 50.00简罗一样到八十九	CER 37.50减了一样到八十九	CER 50.00捡一辆到八十九
A1123CO0001 (aishell_2021c)	音量第七格	CER 0.00音量第七格	CER 60.00音量贴贴狗	CER 100.00听听歌	CER 100.00晶亮贴贴狗	CER >100你们二个天天搞	CER 100.00你有踢踢过
9uELWl5vOJM_0093 (SPEECHIO_ASR_ZH00020)	而纳什也在零六年拿到自己第二个MVP奖杯并且是连庄MVP	CER 0.00而纳什也在零六年拿到自己第二个MVP奖杯并且是连庄MVP	CER 33.33而那时也在零六年拿到自己第二个MVP奖杯并且是连装MVP	CER 29.63而那时也在零六年拿到自己第二个MVP奖杯并且是连庄MVP	CER 29.63而那时也在零六年拿到自己第二个MVP奖杯并且是连庄MVP	CER 33.33而那时也在零六年拿到自己第二个MVP奖杯并且是连装MVP	CER 29.63而那时也在零六年拿到自己第二个MVP奖杯并且是连庄MVP
ZFWalpHS0wg_0035 (SPEECHIO_ASR_ZH00021)	主动刹车车道偏离预警车道保持系统包括它还有DSC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错	CER 0.00主动刹车车道偏离预警车道保持系统包括它还有DSC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错	CER 3.70主动刹车车道偏离预警车道保持系统包括它还有DSA车身稳定控制系统这些全部都是全系标配的所以安心配置表现还不错	CER 5.56主动刹车车道偏离预警车道保持系统包括它还有ESC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错	CER 5.56主动刹车车道偏离预警车道保持系统包括它还有ESC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错	CER 5.56主动刹车车道偏离预警车道保持系统包括它还有DSI车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错	CER 1.85主动刹车车道偏离预警车道保持系统包括它还有ESC车身稳定控制系统这些全部都是全系标配的所以安全配置表现还不错

English

Speech Clip	Ground Truth reference transcription	Our NIM4-ASR (streaming)	Fun-ASR Nano-2512	Qwen3-ASR 1.7B	FireRedASR2S LLM	Step-Audio2 Mini	Qwen3-Omni 30B-A3B Instruct
ZH-CN_U0061_S0_153 (cs_dialogue)	yeah I watched the movie for twice twice and my my best friends have watched from nine nine the nine times oh my god and Im so shock right now it is incredibly hes yeah hes the big hes the big fan of taylor swift	WER 8.89yeah I watched the movie for twice twice and my my best friends have watched it for nine nine the nine times oh my god and Im so shocked right now it is incredibly hes yeah hes the biggest hes the big fan of taylor swift	WER 22.22yeah I watched the movie for twice twice and my my best friends have watched it for nine nine the nine times oh my god __ Im so shocked right now it it it incredibly hes a yeah hes the biggest hes the big fan of takers with	WER 17.78yeah I watched the movie for twice twice and my my best friends have watched it for nine nine __ nine times oh my god __ Im so shocked right now it it is incredibly hes a yeah hes the biggest hes the big fan of taylor swift	WER 15.56yeah I watched the movie for twice twice and my my best friends have watched it from nine nine the nine times oh my god is Im so shocked right now it it it is incredibly hes a yeah hes the biggest hes the big fan of taylor swift	WER 17.78yeah I watched the movie for twice twice and my my best friends have watched it for nine nine the nine times oh my god __ Im so shocked right now it it its incredibly hes a yeah hes the biggest hes the big fan of taylor swift	WER 44.44yes I watched the movie __ twice and my __ best friend has watched it nine times oh my god __ Im so shocked right now __ incredibly hes __ the biggest fan of taylor swift
5442-32873-00140000009853 (librispeech_test_other)	luke took care of mister larkins dogs and groomed mister wylders horse and cleaned up his dog cart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dog cart in the post office premises and so evaded the livery charges of the brandon arms	WER 1.69luke took care of mister larkins dogs and groomed mister wilders horse and cleaned up his dog cart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dog cart in the post office premises and so evaded the livery charges of the brandon arms	WER 15.25luke took care of mister larkins dogs and groomed mister wylders horse and cleaned up his dogcart for mock wing clothes about money and finding that the thing was to be done more cheaply that way put up his horse and dogcart in the postoffice premises and so evaded the livery charges of the brandon arms	WER 5.08luke took care of mister larkins dogs and groomed mister wilders horse and cleaned up his dog cart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dog cart in the post office premises and so evaded the livery charges of the brand and arms	WER 8.47luke took care of mister larkins dogs and groomed mister wilders horse and cleaned up his dogcart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dogcart in the post office premises and so evaded the livery charges of the brandon arms	WER 8.47luc took care of mr larkinss dogs and groomed mr wilders horse and cleaned up his dog cart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dog cart in the post office premises and so evaded the livery charges of the brandon arms	WER 10.17luke took care of mister larkinss dogs and groomed mister wilders horse and cleaned up his dogcart for mark being close about money and finding that the thing was to be done more cheaply that way put up his horse and dogcart in the post office premises and so evaded the livery charges of the brandon arms
vitw_01998_random_synthetic_en_recording_4841 (voice_in_the_wild_bench)	I need to review the budget again before approving the purchase request	WER 0.00I need to review the budget again before approving the purchase request	WER 33.33I need to review the budget again before approving them for the budget meeting	WER 25.00I need to review the budget again for approving the purchases	WER 25.00I need to review the budget again before approving it	WER 8.33I need to review the budget again before approving the purchase order	WER 41.67I need to review the budget again for upcoming delta purchases

Mandarin-English Code-switch

Speech Clip	Ground Truth reference transcription	Our NIM4-ASR (streaming)	Fun-ASR Nano-2512	Qwen3-ASR 1.7B	FireRedASR2S LLM	Step-Audio2 Mini	Qwen3-Omni 30B-A3B Instruct
validation-00000-of-00001_00570 (ascend)	平时就是我们做pre的时候他们尽量就是	CER 0.00平时就是我们做pre的时候他们尽量就是	CER 5.88平时就是我们做推的时候他们尽量就是	CER 5.88平时就是我们做推的时候他们尽量就是	CER 5.88平时就是我们做play的时候他们尽量就是	CER 5.88平时就是我们做play的时候他们尽量就是	CER 5.88平时就是我们做play的时候他们尽量就是
ZH-CN_U0018_S0_32 (cs_dialogue)	好的of course	CER 0.00好的of course	CER 50.00had that of course	CER 50.00hada of course	CER 50.00how that of course	CER >100哈那那那那那那那那那那那那那那那那那那那......	CER 0.00好的of course

Hallucination-Prone Cases

Speech Clip	Ground Truth reference transcription	Our NIM4-ASR (streaming)	Fun-ASR Nano-2512	Qwen3-ASR 1.7B	FireRedASR2S LLM	Step-Audio2 Mini	Qwen3-Omni 30B-A3B Instruct
whisper_S0189_M-0189-2_014060-014592 (aishell_6c)	恋人间的默契像是无声的语言相视一笑便懂	CER 5.26恋人间的默契像是无声的语言相识一笑便懂	CER 26.32恋人间的默契像是无声的语言像是一小片	CER 15.79猎人间的默契像是无声的语言像是一笑便懂	CER >100阿弥陀佛阿弥陀佛阿弥陀佛阿弥陀佛阿弥陀佛阿弥陀佛阿弥陀佛......	CER 26.32恋人间的默契像是无声的语言像是一小片毒	CER >100在美丽的湖边划一只小船享受着自然的亲密接触
whisper_S0189_M-0189-2_099060-099700 (aishell_6c)	在工作与生活之间找到平衡是一门深奥的艺术	CER 0.00在工作与生活之间找到平衡是一门深奥的艺术	CER 5.00在工作与生活之间找到平衡是一门深厚的艺术	CER 5.00在工作与生活之间找到平衡是一门深厚的艺术	CER >100搿么侬讲到搿个叫啥个呃吃个物事对伐我觉着也蛮重要个为啥道理呢......	CER 5.00在工作与生活之间找到平衡是一门深厚的艺术	CER >100突然的灵感如同泉涌创作的激情也随之而来热血沸腾

Method

Architecture and Training

Modular ASR Architecture

The architecture comprises a streaming Conformer encoder, a two-layer speech adaptor, a phoneme-level CTC head with a RAG module, and a Qwen3-1.7B LLM decoder.

Role-Aligned Training

Training consists of CR-CTC pretraining, alignment, CKA-triggered IA-SFT, late-stage joint SFT, context SFT, and ASR-specific GRPO-style reinforcement learning.

Our training design, from pretraining through SFT, is guided by an entropy-allocation analysis of the Encoder–Adaptor–LLM architecture. Readers interested in this topic can refer to our paper, "Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs" (see Citation).

Streaming

Real-Time Inference

NIM4-ASR uses a decoupled deployment architecture: the encoder runs on Triton, the adaptor and LLM run in a vLLM-based engine, and the phoneme head and RAG module run on CPU. Speech embeddings are appended incrementally via streaming chunked prefill, enabling real-time partial transcription, followed by a stable second-pass final decoding once VAD detects the end of speech.
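The control flow described above can be pictured with a toy sketch. This is not the NIM4-ASR serving code: the `StreamingSession` class and its three callables (`encode`, `decode_partial`, `decode_final`) are hypothetical stand-ins for the encoder/adaptor and the LLM engine, and the list `cache` only imitates the idea that each chunk appends new speech embeddings rather than re-encoding earlier audio.

```python
class StreamingSession:
    """Toy two-pass streaming loop; every callable is a hypothetical stand-in."""

    def __init__(self, encode, decode_partial, decode_final):
        self.encode = encode                  # audio chunk -> list of speech embeddings
        self.decode_partial = decode_partial  # cached embeddings -> partial text
        self.decode_final = decode_final      # cached embeddings -> final text
        self.cache = []                       # embeddings appended so far
        self.prefill_sizes = []               # number of new embeddings added per chunk

    def push_chunk(self, audio_chunk):
        """Append only the new chunk's embeddings, then emit a partial result."""
        new_embeddings = self.encode(audio_chunk)
        self.cache.extend(new_embeddings)
        self.prefill_sizes.append(len(new_embeddings))
        return self.decode_partial(self.cache)

    def end_of_speech(self):
        """Second pass over the full context, run once the endpoint is detected."""
        return self.decode_final(self.cache)


# Toy stand-ins: one "embedding" per two audio samples, text that reports context size.
session = StreamingSession(
    encode=lambda chunk: [sum(chunk[i:i + 2]) for i in range(0, len(chunk), 2)],
    decode_partial=lambda cache: f"partial({len(cache)})",
    decode_final=lambda cache: f"final({len(cache)})",
)
partials = [session.push_chunk([0.1, 0.2, 0.3, 0.4]) for _ in range(3)]
final = session.end_of_speech()
```

In this sketch each call to `push_chunk` touches only the embeddings of the newest chunk, while `end_of_speech` sees the whole utterance, which mirrors the split between streaming partial results and the full-context second pass.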

Customization

Phoneme-Level RAG

Hotwords are encoded as phoneme-token sequences in a trie structure built upon an Aho–Corasick automaton for efficient retrieval. Exact phoneme matching and longest-match filtering support databases containing millions of hotwords while maintaining sub-millisecond retrieval latency and high precision.
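As an illustration of the retrieval idea, the following is a minimal textbook Aho–Corasick sketch over phoneme tokens with a simple longest-match filter. It is not the production module: the class and function names, the tone-marked pinyin tokens, and the specific overlap rule (prefer longer matches, drop anything overlapping a kept match) are assumptions made for this example.

```python
from collections import deque


class PhonemeAutomaton:
    """Aho-Corasick automaton whose alphabet is phoneme tokens (illustrative)."""

    def __init__(self, hotwords):
        # hotwords: dict mapping hotword text -> tuple of phoneme tokens
        self.goto = [{}]   # per-state transitions: token -> next state
        self.fail = [0]    # failure links
        self.out = [[]]    # per-state outputs: (hotword, length in tokens)
        for word, phonemes in hotwords.items():
            self._add(word, phonemes)
        self._build_failure_links()

    def _add(self, word, phonemes):
        state = 0
        for p in phonemes:
            if p not in self.goto[state]:
                self.goto[state][p] = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            state = self.goto[state][p]
        self.out[state].append((word, len(phonemes)))

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for p, nxt in self.goto[state].items():
                f = self.fail[state]
                while f and p not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(p, 0)
                # Inherit hotwords that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]
                queue.append(nxt)

    def search(self, phonemes):
        """Return (start, end, hotword) for every exact match; end is exclusive."""
        state, matches = 0, []
        for i, p in enumerate(phonemes):
            while state and p not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(p, 0)
            for word, length in self.out[state]:
                matches.append((i - length + 1, i + 1, word))
        return matches


def longest_match_filter(matches):
    """Keep longer matches first and drop any match overlapping a kept one."""
    kept = []
    for start, end, word in sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0])):
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, word))
    return sorted(kept)


# Hypothetical hotword database keyed by tone-marked pinyin syllables.
automaton = PhonemeAutomaton({
    "惠新": ("hui4", "xin1"),
    "惠新西街": ("hui4", "xin1", "xi1", "jie1"),
    "西直门": ("xi1", "zhi2", "men2"),
})
stream = ["dao3", "hang2", "dao4", "hui4", "xin1", "xi1", "jie1"]
raw_matches = automaton.search(stream)
retrieved = longest_match_filter(raw_matches)
```

Because matching runs on phonemes rather than characters, a homophone error such as "彗新" for "惠新" yields the same token sequence and still retrieves the intended hotword. Search time is linear in the length of the phoneme stream plus the number of matches, independent of the database size.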

Results

To reduce evaluation variance caused by surface-form differences, including numeric formatting and filler-word usage, we normalize all transcriptions with WeTextProcessing, a WFST-based toolkit. Although normalization may lower absolute error rates, applying the same pipeline to every system enables a fairer comparison of recognition performance. All baselines are reproduced according to their official guidelines.

TL;DR: We apply the same WeTextProcessing-based normalization to all systems, reducing formatting-related noise and enabling fairer comparisons, although the resulting error rates may be lower than those reported under standard evaluation protocols.
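For readers unfamiliar with the metrics, CER and WER are edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below is a plain Levenshtein implementation, not our evaluation pipeline: it omits the WeTextProcessing normalization step, the function names are illustrative, and `cer` treats every character as one token, so it applies only to pure Chinese strings.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(ref, hyp):
    """Character error rate: one token per character."""
    return error_rate(list(ref), list(hyp))


def wer(ref, hyp):
    """Word error rate: whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


# Two rows from the comparison tables above.
cer_example = cer("这个快捷键不顺手换一个组合", "这种快捷键不熟换一个组合")
wer_example = wer(
    "I need to review the budget again before approving the purchase request",
    "I need to review the budget again before approving the purchase order",
)
```

Since insertions count as errors, the rate can exceed 100% when a hypothesis is longer than the reference and shares little with it, which is why some hallucinated outputs in the comparison tables are reported as "CER >100".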

Public Benchmarks (Metric: CER/WER)

Benchmark	Fun-ASR Nano-2512	GLM-ASR Nano-2512	Qwen3-ASR 1.7B	FireRedASR2S LLM	Step-Audio2 Mini	Qwen3-Omni 30B-A3B Instruct	NIM4-ASR (offline)	NIM4-ASR (streaming)
Model Size	0.8B	1.5B	2.0B	8B+	8B+	30B-A3B	2.3B	2.3B
Mandarin
AISHELL-1 dev \| test	1.59 \| 1.81	2.40 \| 2.41	1.40 \| 1.51	0.60 \| 0.64	0.76 \| 0.81	0.86 \| 0.92	0.43 \| 0.57	0.43 \| 0.60
AISHELL-2-ios dev \| test	2.62 \| 2.73	3.21 \| 3.45	2.41 \| 2.60	2.07 \| 2.08	2.24 \| 2.29	2.11 \| 2.31	2.28 \| 2.43	2.33 \| 2.49
AISHELL-2021-Eval A \| C \| D	4.75 \| 4.29 \| 2.33	7.25 \| 9.48 \| 3.40	4.22 \| 3.51 \| 1.82	13.40 \| 3.92 \| 4.68	4.54 \| 3.69 \| 2.34	5.19 \| 3.34 \| 1.66	3.12 \| 1.51 \| 1.81	3.28 \| 1.63 \| 2.22
WeNetSpeech meeting \| net	4.68 \| 5.22	6.87 \| 5.72	4.00 \| 4.13	3.36 \| 3.52	4.23 \| 4.63	3.92 \| 3.85	4.91 \| 4.72	5.71 \| 5.00
SpeechIO	2.78	3.17	2.55	2.20	3.41	2.33	2.61	2.84
Chinese Dialects
WeNetSpeech-Chuan easy \| hard	13.21 \| 23.76	20.95 \| 33.61	11.18 \| 20.35	10.36 \| 20.07	13.99 \| 25.35	14.13 \| 25.16	10.51 \| 20.58	11.22 \| 20.37
WeNetSpeech-Yue short \| long	7.31 \| 10.02	16.78 \| 13.97	5.79 \| 8.00	5.05 \| 10.45	7.78 \| 8.44	6.97 \| 8.60	5.12 \| 8.58	5.39 \| 9.62
KeSpeech	7.18	9.59	4.98	3.05	3.98	6.00	4.40	5.08
English
LibriSpeech-dev clean \| other	1.63 \| 4.06	1.82 \| 3.93	1.54 \| 3.14	1.27 \| 2.63	1.06 \| 2.48	1.08 \| 2.10	1.13 \| 2.45	1.18 \| 2.86
LibriSpeech-test clean \| other	1.63 \| 4.35	1.96 \| 4.29	1.56 \| 3.49	1.29 \| 2.97	1.22 \| 2.61	1.15 \| 2.38	1.19 \| 2.53	1.29 \| 2.92
VoxPopuli dev \| test	7.86 \| 7.70	8.78 \| 8.52	7.58 \| 7.42	9.38 \| 9.24	8.86 \| 8.37	6.86 \| 6.75	6.18 \| 6.08	6.26 \| 6.22
MLS-English	6.80	5.32	4.93	4.71	4.37	4.04	4.77	5.04
Mandarin–English Code-Switching
CS-Dialogue	5.37	6.15	5.44	4.63	9.46	8.51	4.70	4.91
ASCEND	11.91	12.29	10.87	10.22	13.50	18.68	11.46	11.85
Lyrics
M4Singer	5.25	18.45	5.72	N/A	9.68	8.40	6.39	6.94

Internal Benchmarks (Metric: CER/WER)

Benchmark	Fun-ASR Nano-2512	GLM-ASR Nano-2512	Qwen3-ASR 1.7B	FireRedASR2S LLM	Step-Audio2 Mini	Qwen3-Omni 30B-A3B Instruct	NIM4-ASR (offline)	NIM4-ASR (streaming)
Model Size	0.8B	1.5B	2.0B	8B+	8B+	30B-A3B	2.3B	2.3B
Point of Interest (POI)
City A	7.07	14.68	9.14	8.54	9.41	9.67	3.86	3.85
City B	8.50	15.75	10.59	10.43	11.67	11.73	4.86	4.94
City C	7.60	17.55	10.01	10.17	11.35	12.18	3.77	3.81
City D	7.42	17.91	9.77	9.51	11.55	10.86	4.10	4.17
Media
Music	12.60	24.25	12.67	12.13	14.94	15.89	5.75	5.78
Video	8.27	20.35	9.69	9.38	12.30	15.33	2.99	3.03
Radio	13.69	19.82	10.51	11.84	14.21	17.91	1.21	1.17
Device Control
Vehicle control	4.74	8.78	5.31	4.52	4.97	4.18	1.88	1.78
Conversational
Vehicle-domain chat easy \| hard	3.75 \| 5.92	5.63 \| 10.12	3.31 \| 5.96	2.93 \| 5.61	2.35 \| 7.63	5.98 \| 6.60	2.70 \| 4.88	2.76 \| 4.83
Multi-domain chat	1.65	1.89	1.33	1.27	1.49	5.34	1.55	1.75

Hallucination Rate (%)

Category	Fun-ASR Nano-2512	GLM-ASR Nano-2512	Qwen3-ASR 1.7B	FireRedASR2S LLM	Step-Audio2 Mini	Qwen3-Omni 30B-A3B Instruct	NIM4-ASR (offline w/o RL)	NIM4-ASR (offline w/ RL)
Mandarin (Avg.)	0.018%	0.030%	0.018%	0.165%	0.020%	0.013%	0.003%	0.002%
Dialect (Avg.)	0.217%	0.201%	0.120%	0.298%	0.194%	0.370%	0.122%	0.117%
English (Avg.)	0.014%	0.014%	0.014%	0.014%	0.014%	0.007%	0.007%	0.007%
Code-switch (Avg.)	0.397%	0.315%	0.345%	0.335%	1.255%	1.778%	0.261%	0.261%
Lyrics (Avg.)	0.153%	0.580%	0.249%	1.775%	0.390%	0.129%	0.215%	0.081%

Citation

Rethinking Entropy Allocation in LLM-based ASR

@article{xie2026rethinking,
  title={Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs},
  author={Xie, Yuan and Song, Jiaqi and Qiu, Guang and Wang, Xianliang and Lei, Ming and Gao, Jie and Wu, Jie},
  journal={arXiv preprint arXiv:2604.08003},
  year={2026}
}

NIM4-ASR

@article{xie2026nim4,
  title={NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR},
  author={Xie, Yuan and Song, Jiaqi and Qiu, Guang and Wang, Xianliang and Qiao, Kai and Yuan, Junfeng and Liu, Shengqing and Zhang, Yi and Chen, Bowen and Lei, Ming and others},
  journal={arXiv preprint arXiv:2604.18105},
  year={2026}
}
