[Feature Request/Bug] Support custom local paths for VAD and SPK models in serve_vllm.py #2964

Notice: In order to resolve issues efficiently, please follow the template and include reproducible details.

## 🐛 Bug

I am working with `serve_vllm.py` (in `/opt/voice_ai/funasr/funasr/examples/industrial_data_pretraining/fun_asr_nano`) and found that it does not support custom local paths for the VAD and SPK models: both model identifiers are hard-coded. I modified a copy of the script to accept them through two new arguments, `--vad-model` and `--spk-model`. With local paths, the VAD model loads, but the SPK model fails with a "not registered" error.

## To Reproduce

Steps to reproduce the behavior. Always include the exact command you ran.

1. Copy `serve_vllm.py` to `serve_vllm_test.py` and modify the copy to accept `--vad-model` and `--spk-model` as arguments.

2. Pass a local absolute path to each of these arguments (the exact command is the first line of the error log).

3. Observe that `AutoModel` loads the VAD model but fails to initialize the SPK model, because the local path is not in the list of registered model keys.

## Code sample


Update the `load_engine` function:
```python
def load_engine(args):
    global _engine, _vad_model, _spk_model, _args
    _args = args
    if _engine is None:
        logger.info(f"Loading vLLM engine: {args.model}")
        _engine = FunASRNanoVLLM.from_pretrained(
            model=args.model, hub=args.hub, device=args.device, dtype=args.dtype,
            max_model_len=args.max_model_len,
            gpu_memory_utilization=args.gpu_memory_utilization,
        )
        logger.info("Loading VAD: fsmn-vad")
        #_vad_model = AutoModel(model="fsmn-vad", device=args.device, disable_update=True)
        logger.info(f"Loading VAD: {args.vad_model}")
        _vad_model = AutoModel(model=args.vad_model, device=args.device, model_revision="v1.0.0", disable_update=True)


        #logger.info("Loading SPK: eres2netv2")
        #_spk_model = AutoModel(model="iic/speech_eres2netv2_sv_zh-cn_16k-common", device=args.device, disable_update=True)


        logger.info(f"Loading SPK via standard local path: {args.spk_model}")
        _spk_model = AutoModel(model=args.spk_model, device=args.device, disable_update=True)
        logger.info("All models ready!")
```
Then add two arguments in `main`:
```python
    parser.add_argument("--vad-model", type=str, default="fsmn-vad", help="Local path or ID for VAD model")
    parser.add_argument("--spk-model", type=str, default="iic/speech_eres2netv2_sv_zh-cn_16k-common", help="Local path or ID for SPK model")
```
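
For context, here is a minimal, standalone sketch of how the two options could be parsed and classified before they reach `AutoModel`. The helper `describe_model_arg` and the wrapper `build_parser` are hypothetical and not part of `serve_vllm.py`; the helper only distinguishes an existing local directory from a value that would be treated as a model ID.

```python
import argparse
import os


def describe_model_arg(value: str) -> str:
    """Hypothetical helper: report how a --vad-model/--spk-model value would be read."""
    if os.path.isdir(value):
        return "local directory"
    if os.path.isabs(value):
        # An absolute path that does not exist is most likely a typo.
        return "missing local path"
    return "model ID"


def build_parser() -> argparse.ArgumentParser:
    """Parser holding only the two new options from the code sample above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--vad-model", type=str, default="fsmn-vad",
                        help="Local path or ID for VAD model")
    parser.add_argument("--spk-model", type=str,
                        default="iic/speech_eres2netv2_sv_zh-cn_16k-common",
                        help="Local path or ID for SPK model")
    return parser


# Parse an explicit (empty) argument list so the defaults are used.
args = build_parser().parse_args([])
print("VAD:", args.vad_model, "->", describe_model_arg(args.vad_model))
print("SPK:", args.spk_model, "->", describe_model_arg(args.spk_model))
```

Logging this classification at startup would make it obvious whether a value is being treated as a path or as an ID before any model is loaded.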
## Expected behavior


Both models should load from the given local paths, and the FastAPI server should start.

## Error logs



```text
((venv) ) [user@llm_server fun_asr_nano]# CUDA_VISIBLE_DEVICES=0 python serve_vllm_test.py     --port 8899     --model /opt/voice_ai/funasr/models/Fun-ASR-Nano-2512     --vad-model /opt/voice_ai/funasr/models/speech_fsmn_vad_zh-cn-16k-common-pytorch     --spk-model /opt/voice_ai/funasr/models/speech_eres2netv2_sv_zh-cn_16k-common     --gpu-memory-utilization 0.5     --max-model-len 8192
2026-06-04 18:13:43,242 [INFO] Loading vLLM engine: /opt/voice_ai/funasr/models/Fun-ASR-Nano-2512
2026-06-04 18:13:43,242 [INFO] Model directory: /opt/voice_ai/funasr/models/Fun-ASR-Nano-2512
2026-06-04 18:13:47,163 [INFO] vLLM model already prepared at /opt/voice_ai/funasr/models/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm
2026-06-04 18:13:48,817 [INFO] Loading audio component weights from /opt/voice_ai/funasr/models/Fun-ASR-Nano-2512/model.pt
2026-06-04 18:14:04,106 [INFO]   Loaded audio_encoder: 914 params
2026-06-04 18:14:04,110 [INFO]   Loaded audio_adaptor: 36 params
2026-06-04 18:14:04,970 [INFO] Initializing vLLM with model: /opt/voice_ai/funasr/models/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm
2026-06-04 18:14:04,970 [INFO]   tensor_parallel_size=1
2026-06-04 18:14:04,971 [INFO]   gpu_memory_utilization=0.5
INFO 06-04 18:14:04 [utils.py:278] non-default args: {'enable_prompt_embeds': True, 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 8192, 'gpu_memory_utilization': 0.5, 'disable_log_stats': True, 'model': '/opt/voice_ai/funasr/models/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm'}
INFO 06-04 18:14:05 [model.py:617] Resolved architecture: Qwen3ForCausalLM
INFO 06-04 18:14:05 [model.py:1752] Using max model len 8192
INFO 06-04 18:14:05 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-04 18:14:05 [vllm.py:977] Asynchronous scheduling is enabled.
INFO 06-04 18:14:05 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
WARNING 06-04 18:14:06 [vllm.py:509] Model runner v2 does not yet support prompt embeds; using the v1 model runner instead.
WARNING 06-04 18:14:08 [system_utils.py:157] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore pid=4510) INFO 06-04 18:14:17 [core.py:112] Initializing a V1 LLM engine (v0.22.0) with config: model='/opt/voice_ai/funasr/models/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm', speculative_config=None, tokenizer='/opt/voice_ai/funasr/models/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/opt/voice_ai/funasr/models/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::qwen_gdn_attention_core', 
'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=4510) WARNING 06-04 18:14:18 [vllm.py:509] Model runner v2 does not yet support prompt embeds; using the v1 model runner instead.
(EngineCore pid=4510) INFO 06-04 18:14:18 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.34.1.177:60811 backend=nccl
(EngineCore pid=4510) INFO 06-04 18:14:20 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=4510) INFO 06-04 18:14:21 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=4510) INFO 06-04 18:14:21 [gpu_model_runner.py:5037] Starting to load model /opt/voice_ai/funasr/models/Fun-ASR-Nano-2512/Qwen3-0.6B-vllm...
(EngineCore pid=4510) INFO 06-04 18:14:22 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=4510) INFO 06-04 18:14:22 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=4510) INFO 06-04 18:14:22 [weight_utils.py:922] Filesystem type for checkpoints: XFS. Checkpoint size: 1.40 GiB. Available RAM: 11.03 GiB.
(EngineCore pid=4510) INFO 06-04 18:14:22 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (XFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:08<00:00,  8.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:08<00:00,  8.09s/it]
(EngineCore pid=4510) 
(EngineCore pid=4510) INFO 06-04 18:14:30 [default_loader.py:397] Loading weights took 8.10 seconds
(EngineCore pid=4510) INFO 06-04 18:14:31 [gpu_model_runner.py:5132] Model loading took 1.12 GiB memory and 8.635157 seconds
(EngineCore pid=4510) INFO 06-04 18:14:36 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/413c4a3fa0/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=4510) INFO 06-04 18:14:36 [backends.py:1148] Dynamo bytecode transform time: 4.52 s
(EngineCore pid=4510) INFO 06-04 18:14:37 [backends.py:292] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 1.108 s
(EngineCore pid=4510) INFO 06-04 18:14:37 [decorators.py:311] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/8866e5cc844ad5e8d1d1558501ad8e98ac7885947c4f1815a1af58918f36e9dd/rank_0_0/model
(EngineCore pid=4510) INFO 06-04 18:14:37 [monitor.py:53] torch.compile took 5.89 s in total
(EngineCore pid=4510) INFO 06-04 18:14:37 [monitor.py:81] Initial profiling/warmup run took 0.19 s
(EngineCore pid=4510) INFO 06-04 18:14:38 [gpu_model_runner.py:6279] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=4510) INFO 06-04 18:14:39 [gpu_model_runner.py:6365] Estimated CUDA graph memory: 0.47 GiB total
(EngineCore pid=4510) INFO 06-04 18:14:39 [gpu_worker.py:466] Available KV cache memory: 8.76 GiB
(EngineCore pid=4510) INFO 06-04 18:14:39 [gpu_worker.py:481] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.5000 is equivalent to --gpu-memory-utilization=0.4785 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.5215. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=4510) INFO 06-04 18:14:39 [kv_cache_utils.py:1733] GPU KV cache size: 82,032 tokens
(EngineCore pid=4510) INFO 06-04 18:14:39 [kv_cache_utils.py:1734] Maximum concurrency for 8,192 tokens per request: 10.01x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████| 51/51 [00:01<00:00, 26.99it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 28.22it/s]
(EngineCore pid=4510) INFO 06-04 18:14:43 [gpu_model_runner.py:6456] Graph capturing finished in 4 secs, took 0.41 GiB
(EngineCore pid=4510) INFO 06-04 18:14:43 [gpu_worker.py:619] CUDA graph pool memory: 0.41 GiB (actual), 0.47 GiB (estimated), difference: 0.07 GiB (16.8%).
(EngineCore pid=4510) INFO 06-04 18:14:43 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=4510) INFO 06-04 18:14:43 [core.py:302] init engine (profile, create kv cache, warmup model) took 12.43 s (compilation: 5.89 s)
(EngineCore pid=4510) INFO 06-04 18:14:44 [vllm.py:977] Asynchronous scheduling is enabled.
(EngineCore pid=4510) INFO 06-04 18:14:44 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
2026-06-04 18:14:45,949 [INFO] Loaded embedding layer: torch.Size([151936, 1024])
2026-06-04 18:14:46,092 [INFO] Loading VAD: fsmn-vad
2026-06-04 18:14:46,092 [INFO] Loading VAD: /opt/voice_ai/funasr/models/speech_fsmn_vad_zh-cn-16k-common-pytorch
funasr version: 1.3.9.
2026-06-04 18:14:46,093 [INFO] download models from model hub: ms
2026-06-04 18:14:46,104 [WARNING] trust_remote_code: False
2026-06-04 18:14:46,114 [INFO] Loading pretrained params from /opt/voice_ai/funasr/models/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt
2026-06-04 18:14:46,115 [INFO] ckpt: /opt/voice_ai/funasr/models/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt
2026-06-04 18:14:46,130 [INFO] scope_map: ['module.', 'None']
2026-06-04 18:14:46,130 [INFO] excludes: None
2026-06-04 18:14:46,131 [INFO] Loading ckpt: /opt/voice_ai/funasr/models/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt, status: <All keys matched successfully>
2026-06-04 18:14:46,133 [INFO] Loading SPK via standard local path: /opt/voice_ai/funasr/models/speech_eres2netv2_sv_zh-cn_16k-common
funasr version: 1.3.9.
2026-06-04 18:14:46,133 [INFO] download models from model hub: ms
2026-06-04 18:14:46,135 [WARNING] trust_remote_code: False
Traceback (most recent call last):
  File "/opt/voice_ai/funasr/funasr/examples/industrial_data_pretraining/fun_asr_nano/serve_vllm_test.py", line 421, in <module>
    load_engine(_args)
  File "/opt/voice_ai/funasr/funasr/examples/industrial_data_pretraining/fun_asr_nano/serve_vllm_test.py", line 91, in load_engine
    _spk_model = AutoModel(model=args.spk_model, device=args.device, disable_update=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/voice_ai/funasr/funasr/funasr/auto/auto_model.py", line 220, in __init__
    model, kwargs = self.build_model(**kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/voice_ai/funasr/funasr/funasr/auto/auto_model.py", line 391, in build_model
    raise RuntimeError(
RuntimeError: model '/opt/voice_ai/funasr/models/speech_eres2netv2_sv_zh-cn_16k-common' is not registered.
Registered model keys (48): BAT, BiCifParaformer, Branchformer, CAMPPlus, CTC, CTTransformer, CTTransformerStreaming, Conformer, ContextualParaformer, EBranchformer, EParaformer, ERes2NetV2, Emotion2vec, FsmnKWS, FsmnKWSConvert, FsmnKWSMT, FsmnKWSMTConvert, FsmnVADStreaming, FunASRNano, GLMASR, LCBNet, LLMASR, LLMASR2, LLMASR3, LLMASR4, LLMASRNAR, LLMASRNARPrompt, MonotonicAligner, OpenAIWhisperLIDModel, OpenAIWhisperModel, Paraformer, ParaformerStreaming, Paraformer_v2_community, Qwen/Qwen3-ASR-0.6B, Qwen/Qwen3-ASR-1.7B, Qwen3ASR, SANM, SCAMA, SanmKWS, SanmKWSStreaming, SeacoParaformer, SenseVoiceSmall, Transducer, Transformer, UniASR, ZhipuAI/GLM-ASR-Nano-2512, iic/speech_eres2netv2_sv_zh-cn_16k-common, zai-org/GLM-ASR-Nano-2512
Some modules may have failed to import during auto-registration. Set FUNASR_IMPORT_DEBUG=1 to print failures during import, or FUNASR_STRICT_IMPORT=1 to fail fast.
Recorded import failures:
  - funasr.bin.train: ImportError: cannot import name 'AutoModel' from partially initialized module 'funasr' (most likely due to a circular import) (/opt/voice_ai/funasr/funasr/funasr/__init__.py)
  - funasr.bin.train_ds: ImportError: cannot import name 'AutoModel' from partially initialized module 'funasr' (most likely due to a circular import) (/opt/voice_ai/funasr/funasr/funasr/__init__.py)
  - funasr.frontends.default: ModuleNotFoundError: No module named 'pytorch_wpe'
  - funasr.frontends.fused: ModuleNotFoundError: No module named 'pytorch_wpe'
  - funasr.frontends.s3prl: ModuleNotFoundError: No module named 'pytorch_wpe'
  - funasr.frontends.utils.dnn_wpe: ModuleNotFoundError: No module named 'pytorch_wpe'
  - funasr.frontends.utils.frontend: ModuleNotFoundError: No module named 'pytorch_wpe'
  - funasr.models.fun_asr_nano.tools.whisper_mix_normalize: ModuleNotFoundError: No module named 'cn_tn'
  - funasr.models.language_model.rnn.decoders: ModuleNotFoundError: No module named 'funasr.models.transformer.utils.scorers'
  - funasr.models.language_model.seq_rnn_lm: ModuleNotFoundError: No module named 'funasr.train'
  - funasr.models.language_model.transformer_lm: ModuleNotFoundError: No module named 'funasr.models.encoder'
  - funasr.models.mfcca.e2e_asr_mfcca: ImportError: cannot import name 'ErrorCalculator' from 'funasr.metrics' (/opt/voice_ai/funasr/funasr/funasr/metrics/__init__.py)
  - funasr.models.mfcca.mfcca_encoder: ModuleNotFoundError: No module named 'funasr.models.encoder'
  - funasr.models.mossformer.e2e_ss: ModuleNotFoundError: No module named 'funasr.models.base_model'
  - funasr.models.qwen_audio.model: ModuleNotFoundError: No module named 'whisper'
  - funasr.models.sa_asr.beam_search_sa_asr: ImportError: cannot import name 'end_detect' from 'funasr.metrics' (/opt/voice_ai/funasr/funasr/funasr/metrics/__init__.py)
  - funasr.models.sa_asr.e2e_sa_asr: ModuleNotFoundError: No module named 'funasr.layers'
  - funasr.models.sense_voice.whisper_lib.triton_ops: SyntaxError: invalid syntax (triton_ops.py, line 57)
  - funasr.models.sond.e2e_diar_sond: ModuleNotFoundError: No module named 'funasr.models.decoder'
  - funasr.models.sond.encoder.conv_encoder: ModuleNotFoundError: No module named 'funasr.models.encoder'
  - funasr.models.sond.encoder.fsmn_encoder: ModuleNotFoundError: No module named 'funasr.models.encoder'
  - funasr.models.sond.encoder.resnet34_encoder: ModuleNotFoundError: No module named 'funasr.models.encoder'
  - funasr.models.sond.encoder.self_attention_encoder: ImportError: cannot import name 'CTC' from 'funasr.models.ctc' (/opt/voice_ai/funasr/funasr/funasr/models/ctc/__init__.py)
  - funasr.models.sond.sv_decoder: ModuleNotFoundError: No module named 'funasr.models.decoder'
  - funasr.models.whisper.model: ModuleNotFoundError: No module named 'whisper'
  - funasr.models.whisper_lid.decoder: ModuleNotFoundError: No module named 'whisper'
  - funasr.models.whisper_lid.encoder: ModuleNotFoundError: No module named 'whisper'
  - funasr.models.whisper_lid.eres2net.simple_avg: ModuleNotFoundError: No module named 'funasr.models.encoder'
  - funasr.models.xvector.e2e_sv: ModuleNotFoundError: No module named 'funasr.layers'
  - funasr.utils.speaker_utils: ModuleNotFoundError: No module named 'funasr.utils.modelscope_file'
(EngineCore pid=4510) INFO 06-04 18:14:46 [core.py:1266] Shutdown initiated (timeout=0)
(EngineCore pid=4510) INFO 06-04 18:14:46 [core.py:1289] Shutdown complete
((venv) ) [user@llm_server fun_asr_nano]# 
```
## Findings & Proposed Solution

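Observed behavior: loading from a local path works for the VAD model but fails for the SPK model, which suggests that `AutoModel` handles local paths inconsistently depending on model metadata or the registration flow. Note that the registered keys listed in the error include both `ERes2NetV2` and `iic/speech_eres2netv2_sv_zh-cn_16k-common`, so the model class itself is registered; only the local path is not recognized.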
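
One way to narrow this down is to compare what each model directory actually contains. The sketch below is a stdlib-only diagnostic with hypothetical helper names. The file names it looks for are assumptions about what a local model directory typically holds: only `model.pt` is confirmed by the VAD log above, and it is not confirmed here which of the other files `AutoModel` reads.

```python
import os

# Assumed file names; only model.pt is confirmed by the log above.
CANDIDATE_FILES = ("config.yaml", "configuration.json", "model.pt")


def inspect_model_dir(path: str) -> dict:
    """Return which candidate files exist in a local model directory."""
    return {name: os.path.isfile(os.path.join(path, name)) for name in CANDIDATE_FILES}


def diff_model_dirs(working: str, failing: str) -> list:
    """List candidate files present in the working directory but missing in the failing one."""
    ok = inspect_model_dir(working)
    bad = inspect_model_dir(failing)
    return [name for name in CANDIDATE_FILES if ok[name] and not bad[name]]
```

Calling `diff_model_dirs` with the VAD directory as `working` and the SPK directory as `failing` would show whether the SPK directory lacks a file that the VAD directory has.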

`AutoModel` appears to rely on a strict registry-key lookup when initializing certain models. When the provided value is a local filesystem path that is not found in the registry, one of the following fallback mechanisms could help:

* **Metadata Inference:** Attempt to infer the model architecture directly from local configuration files (e.g., `configuration.json`) instead of relying solely on a pre-registered key.
* **Explicit Type Override:** Allow users to pass an optional `model_type` argument when using `AutoModel` for local paths.
* **Class-based Fallback:** If the registry lookup fails, attempt a fallback to the model-specific `from_pretrained` method (e.g., `ERes2NetV2.from_pretrained`) based on the inferred type.

This would improve consistency between different model categories and make offline/local deployments easier to support.
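
To make the first two options concrete, here is a minimal sketch of the lookup order I have in mind. It is not FunASR code: `resolve_model_key`, the `registry` argument, and the `model_type` field read from `configuration.json` are all hypothetical, chosen only to illustrate the fallback order (registry key, then explicit override, then local metadata).

```python
import json
import os
from typing import Iterable, Optional


def resolve_model_key(model: str, registry: Iterable[str],
                      model_type: Optional[str] = None) -> str:
    """Hypothetical resolver: map a model ID or local path to a registered key."""
    keys = set(registry)

    # 1. Current behavior: the value is already a registered key.
    if model in keys:
        return model

    # 2. Explicit type override supplied by the caller.
    if model_type is not None:
        if model_type in keys:
            return model_type
        raise RuntimeError(f"model_type '{model_type}' is not registered.")

    # 3. Metadata inference from a local directory.
    #    The "model_type" field name is an assumption, not a documented schema.
    config_path = os.path.join(model, "configuration.json")
    if os.path.isfile(config_path):
        with open(config_path, encoding="utf-8") as f:
            config = json.load(f)
        inferred = config.get("model_type") if isinstance(config, dict) else None
        if isinstance(inferred, str) and inferred in keys:
            return inferred

    raise RuntimeError(f"model '{model}' is not registered.")
```

With a lookup like this, a call such as `resolve_model_key(args.spk_model, registry, model_type="ERes2NetV2")` would succeed for a local path, while values that are already registered keys would behave exactly as they do today.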

## Environment

- OS: Amazon Linux 2023
- Python version: 3.12.13
- FunASR version: 1.3.9
- ModelScope version: 1.37.1
- PyTorch / torchaudio version: 2.11.0+cu130
- Install method (`pip`, source, Docker): venv/pip
- Device (`cuda`, `cpu`, `mps`): cuda
- GPU model: A10
- CUDA/cuDNN version: 13.0
- Docker image tag, if used: N/A

## Audio details

If the audio cannot be shared, please describe:

- Duration:
- Sample rate:
- Format:
- Language/dialect:
- Speaker count:
- Background noise/music:
