Testing Tencent's Hunyuan Hy-MT2 multilingual translation model with the Vulkan build of llama.cpp

First, download the model in GGUF format from hf-mirror: https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/tree/main. ModelScope does not offer this format yet: https://modelscope.cn/models/Tencent-Hunyuan/Hy-MT2-1.8B

Download the following file:

    C:\d>curl -LO https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/resolve/main/Hy-MT2-1.8B-Q4_K_M.gguf -C -
      % Total    % Received % Xferd  Average Speed   Time    Time    Time   Current
                                     Dload  Upload   Total   Spent   Left   Speed
    100  1365    0  1365    0     0   1408      0                               0
    100 1.05G  100 1.05G    0     0  7.64M      0   02:21   02:21           8.39M

Next, download the latest prebuilt llama.cpp executables from the project's GitHub repository. Choose the Vulkan build. The only difference from the CPU build is one extra 56 MB file, ggml-vulkan.dll, which detects the GPU type automatically. (I started downloading the CPU build first and cancelled it with Ctrl+C.)

    C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-cpu-x64.zip
      % Total    % Received % Xferd  Average Speed   Time    Time    Time   Current
                                     Dload  Upload   Total   Spent   Left   Speed
      0      0    0      0    0     0      0      0           00:03             0
     22 15.18M   22  3.35M    0     0  33555      0   07:54   01:44   06:10 30056^C

    C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-vulkan-x64.zip -C -
      % Total    % Received % Xferd  Average Speed   Time    Time    Time   Current
                                     Dload  Upload   Total   Spent   Left   Speed
      0      0    0      0    0     0      0      0                             0
    100 31.17M  100 31.17M    0     0  35429      0   15:22   15:22         38437

To make sense of the benchmark output, here is an excerpt explaining what the parameters mean.

What is Q4_0?

Q4_0 is a 4-bit quantization format. Its point is not a "stronger model" but a "smaller model that needs less VRAM and fits on more devices". Most of these leaderboards standardize on Llama 2 7B, Q4_0; the main goal is to reduce the number of variables so that results from different GPUs are easier to compare side by side.

What is pp512?

pp512 can generally be read as "prompt processing, 512 tokens": the throughput when processing 512 input tokens.

- pp = prompt processing
- 512 = an input length of 512 tokens
- t/s = tokens per second

It is closer to "how fast the model ingests the prompt". This phase usually parallelizes well, so the numbers tend to be high.

What is tg128?

tg128 can generally be read as "text generation, 128 tokens": the speed when generating 128 tokens in a row.

- tg = text generation
- 128 = 128 tokens generated consecutively
- t/s = tokens per second

It is closer to what we perceive as "how fast the model answers". Because generation proceeds token by token, each step depending on the previous one, it is usually much lower than pp512.

Benchmarks

    C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf -ngl 0
    load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
    load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
    load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
    | model                            |       size |     params | backend    | ngl |            test |                  t/s |
    | -------------------------------- | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
    | hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |   0 |           pp512 |       592.38 ± 13.29 |
    | hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |   0 |           tg128 |         45.02 ± 0.42 |

    build: 47c0eda9d (9279)

As the output shows, it detected my integrated GPU, an AMD Radeon 780M Graphics. I then renamed ggml-vulkan.dll and ran the benchmark again. This time the backend is the CPU: pp512 drops by about 43% (from 592.38 to 339.36 t/s), while tg128 is essentially unchanged (45.02 vs 45.39 t/s).

    C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf -ngl 0
    load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
    load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
    | model                            |       size |     params | backend    | threads |            test |                  t/s |
    | -------------------------------- | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
    | hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           pp512 |       339.36 ± 10.26 |
    | hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           tg128 |         45.39 ± 0.11 |

    build: 47c0eda9d (9279)

From the documentation at https://juejin.cn/post/7382216166486540339 I learned about this option:

-ngl N, --n-gpu-layers N: when llama.cpp is compiled with GPU support, this option offloads some layers to the GPU for computation. It generally improves performance.

In the runs above this parameter was 0. Now restore the DLL's file name and drop the -ngl 0 argument:

    C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf
    load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
    load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
    load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
    | model                            |       size |     params | backend    | ngl |            test |                  t/s |
    | -------------------------------- | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
    | hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |  99 |           pp512 |        844.50 ± 9.69 |
    | hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |  99 |           tg128 |         59.84 ± 0.31 |

    build: 47c0eda9d (9279)

This time both figures improve over Vulkan with -ngl 0: pp512 by about 43% (592.38 to 844.50 t/s) and tg128 by about 33% (45.02 to 59.84 t/s).

Running a completion example

    C:\d\llama260522>llama-completion --model ..\Hy-MT2-1.8B-Q4_K_M.gguf -p "Translate the following segment into Chinese, without additional explanationHello" --jinja -ngl 0 -n 64 -st
    0.00.078.290 I llama_completion: llama backend init
    0.00.078.296 I llama_completion: load the model and apply lora adapter, if any
    0.00.078.303 I common_init_result: fitting params to device memory ...
    0.00.078.304 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
    0.00.408.458 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
    0.09.586.475 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
    0.20.135.187 I llama_completion: llama threadpool init, n_threads 8
    0.20.136.500 I llama_completion: chat template is available, enabling conversation mode (disable it with -no-cnv)
    0.20.136.506 W *** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
    0.20.148.699 I llama_completion: chat template example: hy_begin▁of▁sentenceYou are a helpful assistanthy_place▁holder▁no▁3hy_UserHellohy_AssistantHi therehy_place▁holder▁no▁2hy_UserHow are you?hy_Assistant
    0.20.148.709 I
    0.20.149.456 I system_info: n_threads 8 (n_threads_batch 8) / 16 | CPU : SSE3 1 | SSSE3 1 | AVX 1 | AVX2 1 | F16C 1 | FMA 1 | BMI2 1 | AVX512 1 | AVX512_VBMI 1 | AVX512_VNNI 1 | AVX512_BF16 1 | LLAMAFILE 1 | OPENMP 1 | REPACK 1 |
    0.20.149.458 I
    0.20.161.695 I sampler seed: 3367966364
    0.20.161.906 I sampler params:
        repeat_last_n 64, repeat_penalty 1.000, frequency_penalty 0.000, presence_penalty 0.000
        dry_multiplier 0.000, dry_base 1.750, dry_allowed_length 2, dry_penalty_last_n -1
        top_k 20, top_p 0.800, min_p 0.050, xtc_probability 0.000, xtc_threshold 0.100, typical_p 1.000, top_n_sigma -1.000, temp 0.700
        mirostat 0, mirostat_lr 0.100, mirostat_ent 5.000, adaptive_target -1.000, adaptive_decay 0.900
    0.20.162.115 I sampler chain: logits - ?penalties - ?dry - ?top-n-sigma - top-k - ?typical - top-p - min-p - ?xtc - temp-ext - dist
    0.20.162.118 I generate: n_ctx 262144, n_batch 2048, n_predict 64, n_keep 0
    0.20.162.118 I
    Translate the following segment into Chinese, without additional explanationHello你好 [end of text]

    0.21.819.501 I common_perf_print: sampling time 0.64 ms
    0.21.819.505 I common_perf_print: samplers time 0.09 ms / 17 tokens
    0.21.819.506 I common_perf_print: load time 19767.05 ms
    0.21.819.511 I common_perf_print: prompt eval time 1611.80 ms / 15 tokens ( 107.45 ms per token, 9.31 tokens per second)
    0.21.819.513 I common_perf_print: eval time 36.48 ms / 1 runs ( 36.48 ms per token, 27.41 tokens per second)
    0.21.819.514 I common_perf_print: total time 1684.68 ms / 16 tokens
    0.21.819.515 I common_perf_print: unaccounted time 35.77 ms / 2.1 % (total - sampling - prompt eval - eval) / (total)
    0.21.819.516 I common_perf_print: graphs reused 0

    C:\d\llama260522>

Testing with the CLI

When I tested with llama-cli, it exited after translating a single sentence, and I did not work out why at the time. Loading the input from a file behaves the same way: one translation, then exit. A likely cause is the -st flag (short for --single-turn), which tells llama-cli to run the conversation for one turn only and then exit.

    C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf
    Loading model... /

    C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st
    Loading model...

    build      : b9279-47c0eda9d
    model      : Hy-MT2-1.8B-Q4_K_M.gguf
    modalities : text

    > 请将以下文本准确翻译为英文。

    Please translate the text accurately into English.

    [ Prompt: 9.9 t/s | Generation: 50.7 t/s ]

    Exiting...

    C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st
    Loading model...

    build      : b9279-47c0eda9d
    model      : Hy-MT2-1.8B-Q4_K_M.gguf
    modalities : text

    > /read ..\eng.txt
    Loaded text from ..\eng.txt

    > 译成中文

    --- 文件..\eng.txt ---
    简要总结Lance 是一种开放性的 Lakehouse 格式专为 AI 工作负载设计。LanceDB 与 DuckDB Labs 合作让您能够直接在 DuckDB SQL 中执行快速向量和混合搜索而无需中断您的分析工作流

    [ Prompt: 49.9 t/s | Generation: 43.1 t/s ]

    Exiting...
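
The relative differences between the three llama-bench runs are easy to check with a few lines of Python. This is only a minimal sketch: the throughput figures are copied by hand from the benchmark tables above, and the names `RESULTS`, `percent_change` and `compare` are illustrative choices, not part of llama.cpp.

```python
# Throughput in tokens per second, copied by hand from the three
# llama-bench tables in this post (mean values only, without the ± spread).
RESULTS = {
    "vulkan_ngl0": {"pp512": 592.38, "tg128": 45.02},
    "cpu": {"pp512": 339.36, "tg128": 45.39},
    "vulkan_ngl99": {"pp512": 844.50, "tg128": 59.84},
}


def percent_change(baseline: float, value: float) -> float:
    """Return the change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def compare(baseline_name: str, other_name: str) -> dict:
    """Compare every test of one run against the same test of a baseline run."""
    baseline = RESULTS[baseline_name]
    other = RESULTS[other_name]
    return {test: percent_change(baseline[test], other[test]) for test in baseline}


if __name__ == "__main__":
    for name in ("cpu", "vulkan_ngl99"):
        for test, change in compare("vulkan_ngl0", name).items():
            print(f"{name:>12} vs vulkan_ngl0, {test}: {change:+.1f}%")
```

Using the Vulkan run with -ngl 0 as the baseline keeps all comparisons on the same footing. Because each llama-bench figure carries a spread of a percent or two, small differences such as the tg128 gap between the CPU run and the baseline are best read as noise.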