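A while back I was helping a buddy put models into production and he asked me: "Bro, we've got 100 models to serve at the same time. What framework do we use? TensorFlow Serving only supports TensorFlow. Such a headache." I told him: use the Triton GE Backend, it supports every framework.

Good question. Today let's clear it up once and for all.

What is triton-inference-server-ge-backend?

triton-inference-server-ge-backend (Triton Inference Server GraphExecutor Backend) is the GE (GraphExecutor) backend that Ascend developed for the Triton Inference Server.

In one sentence: triton-inference-server-ge-backend is Ascend's Triton inference serving backend. It lets you use Triton to manage every model on Ascend NPUs (TensorFlow, PyTorch, ONNX, ...) so that one framework handles all your inference serving.

Infuriating, isn't it? You used to stand up a separate service for each framework, and now a single Triton covers everything.

Why use triton-inference-server-ge-backend?

Two words: unified management.

Without Triton GE Backend (every framework on its own):

    # TensorFlow models → set up TensorFlow Serving
    docker run -p 8501:8501 tensorflow/serving

    # PyTorch models → set up TorchServe
    docker run -p 8080:8080 pytorch/torchserve

    # ONNX models → set up ONNX Runtime Server
    docker run -p 8001:8001 onnxruntime/server

    # Problems:
    # 1. One service per framework, so maintenance cost is high
    # 2. Resources can't be shared, so NPU utilization is low
    # 3. Monitoring has to be checked service by service, which is tedious
    # 4. Version management turns into a mess

With Triton GE Backend (one unified service):

    # One Triton service manages all models
    $ docker run -p 8000:8000 -p 8001:8001 -p 8002:8002 \
        triton-inference-server-ge-backend:latest

    # Check that the server is ready
    $ curl localhost:8000/v2/health/ready
    {"ready": true}

    # List all models
    $ curl localhost:8000/v2/models
    {"models": ["resnet50", "bert", "yolo", "gpt"]}

    # Inference through one unified API
    $ curl -X POST localhost:8000/v2/models/resnet50/infer \
        -d '{"inputs": [{"name": "input", "shape": [1, 3, 224, 224], "datatype": "FP32", "data": [...]}]}'

Infuriating, isn't it? One framework handles all inference serving.

Core concepts: just three

1. Triton Inference Server

Triton is an open-source inference serving framework:

    # Triton architecture
    Triton Inference Server
    ├── HTTP/REST API (port 8000)
    ├── gRPC API (port 8001)
    ├── Metrics API (port 8002)
    ├── Model Repository
    │   ├── resnet50/
    │   │   ├── config.pbtxt
    │   │   └── 1/
    │   │       └── model.graphdef (or .onnx / .pt)
    │   ├── bert/
    │   └── yolo/
    └── Backends
        ├── tensorrt (NVIDIA GPU)
        ├── onnxruntime (CPU/GPU)
        └── ge (Ascend NPU)  ← the one we care about

2.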
GE Backend

The GE Backend is what lets Triton support Ascend NPUs:

    # model_repository/resnet50/config.pbtxt
    name: "resnet50"
    platform: "graph_executor"  # ← use the GE backend
    max_batch_size: 32
    input [
      {
        name: "input"
        data_type: TYPE_FP32
        format: FORMAT_NCHW
        dims: [3, 224, 224]
      }
    ]
    output [
      {
        name: "output"
        data_type: TYPE_FP32
        dims: [1000]
      }
    ]
    # GE backend settings
    parameters [
      { key: "EXECUTOR_TYPE" value: { string_value: "graph" } },
      { key: "DEVICE_ID" value: { string_value: "0" } }
    ]

3. Model repository

Models are organized by directory:

    model_repository/
    ├── resnet50/               # model name
    │   ├── config.pbtxt        # model configuration
    │   └── 1/                  # version 1
    │       └── model.graphdef  # model file (Ascend format)
    │
    ├── bert/
    │   ├── config.pbtxt
    │   └── 1/
    │       └── model.graphdef
    │
    └── ensemble_model/         # ensemble model
        ├── config.pbtxt
        └── 1/
            └── model.graphdef

Why use triton-inference-server-ge-backend? Three reasons

1. Unified API

Every model goes through the same API:

    # Infer with ResNet-50
    $ curl -X POST localhost:8000/v2/models/resnet50/infer \
        -d '{"inputs": [...]}'

    # Infer with BERT
    $ curl -X POST localhost:8000/v2/models/bert/infer \
        -d '{"inputs": [...]}'

    # Infer with YOLO
    $ curl -X POST localhost:8000/v2/models/yolo/infer \
        -d '{"inputs": [...]}'

    # Same API every time; only the model name changes

2. Dynamic batching

Requests are merged automatically to raise throughput:

    # config.pbtxt
    name: "resnet50"
    platform: "graph_executor"
    # Dynamic batching
    dynamic_batching {
      preferred_batch_size: [4, 8, 16]
      max_queue_delay_microseconds: 5000  # wait at most 5 ms
    }

    # Effect:
    # Before: each request inferred on its own → throughput 125 img/s
    # Now: 4 requests merged into one inference → throughput 450 img/s (3.6x)

3. Model ensembles

Several models chained together. In Triton's configuration format the steps sit inside a single ensemble_scheduling block, and the model name matches its repository directory:

    # ensemble_model/config.pbtxt
    name: "ensemble_model"
    platform: "ensemble"
    ensemble_scheduling {
      step [
        {
          # Step 1: preprocessing
          model_name: "preprocess"
          model_version: 1
          input_map { key: "input" value: "raw_image" }
          output_map { key: "output" value: "preprocessed_image" }
        },
        {
          # Step 2: inference
          model_name: "resnet50"
          model_version: 1
          input_map { key: "input" value: "preprocessed_image" }
          output_map { key: "output" value: "logits" }
        },
        {
          # Step 3: postprocessing
          model_name: "postprocess"
          model_version: 1
          input_map { key: "input" value: "logits" }
          output_map { key: "output" value: "predictions" }
        }
      ]
    }

How to use it: code examples

Example 1: Deploy ResNet-50

    # 1. Prepare the model repository
    $ mkdir -p model_repository/resnet50/1

    # 2.
    # Convert the model (PyTorch → Ascend format)
    $ python convert_to_ge.py \
        --input_model resnet50.pth \
        --output_model model_repository/resnet50/1/model.graphdef \
        --input_shape '[1,3,224,224]' \
        --output_shape '[1,1000]'

    # 3. Write the config file
    $ cat > model_repository/resnet50/config.pbtxt <<EOF
    name: "resnet50"
    platform: "graph_executor"
    max_batch_size: 32
    input [
      {
        name: "input"
        data_type: TYPE_FP32
        format: FORMAT_NCHW
        dims: [3, 224, 224]
      }
    ]
    output [
      {
        name: "output"
        data_type: TYPE_FP32
        dims: [1000]
      }
    ]
    dynamic_batching {
      preferred_batch_size: [4, 8, 16]
      max_queue_delay_microseconds: 5000
    }
    EOF

    # 4. Start Triton
    $ docker run -p 8000:8000 -p 8001:8001 -p 8002:8002 \
        -v $(pwd)/model_repository:/models \
        triton-inference-server-ge-backend:latest \
        tritonserver --model-repository=/models

    # 5. Test inference (data holds 3*224*224 = 150528 values, elided here)
    $ curl -X POST localhost:8000/v2/models/resnet50/infer \
        -H 'Content-Type: application/json' \
        -d '{
          "inputs": [
            {
              "name": "input",
              "shape": [1, 3, 224, 224],
              "datatype": "FP32",
              "data": [0.1, 0.2, ...]
            }
          ]
        }'

    # Output:
    # {
    #   "model_name": "resnet50",
    #   "model_version": "1",
    #   "outputs": [
    #     {
    #       "name": "output",
    #       "shape": [1, 1000],
    #       "datatype": "FP32",
    #       "data": [...]
    #     }
    #   ]
    # }

Example 2: Deploy BERT

    # 1. Prepare the model repository
    $ mkdir -p model_repository/bert/1

    # 2. Convert the model (TensorFlow → Ascend format)
    $ python convert_to_ge.py \
        --input_model bert_pretrained \
        --output_model model_repository/bert/1/model.graphdef \
        --input_shape '[1,128]' \
        --input_type INT32

    # 3. Write the config file
    $ cat > model_repository/bert/config.pbtxt <<EOF
    name: "bert"
    platform: "graph_executor"
    max_batch_size: 16
    input [
      {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [128]
      },
      {
        name: "attention_mask"
        data_type: TYPE_INT32
        dims: [128]
      }
    ]
    output [
      {
        name: "pooled_output"
        data_type: TYPE_FP32
        dims: [768]
      }
    ]
    dynamic_batching {
      preferred_batch_size: [4, 8]
      max_queue_delay_microseconds: 5000
    }
    EOF

    # 4. Start Triton (if it isn't running yet)
    # If it is already running, Triton picks up the new model on its own only
    # when it was started with --model-control-mode=poll; otherwise restart it.

    # 5.
    # Test inference (data arrays are elided; each holds 128 values)
    $ curl -X POST localhost:8000/v2/models/bert/infer \
        -H 'Content-Type: application/json' \
        -d '{
          "inputs": [
            {
              "name": "input_ids",
              "shape": [1, 128],
              "datatype": "INT32",
              "data": [101, 2023, ..., 102]
            },
            {
              "name": "attention_mask",
              "shape": [1, 128],
              "datatype": "INT32",
              "data": [1, 1, ..., 1, 0, 0]
            }
          ]
        }'

Example 3: Client code (Python)

    import tritonclient.http as httpclient
    import numpy as np

    # Connect to Triton
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Check service status
    print("Server ready:", client.is_server_ready())
    print("Model ready:", client.is_model_ready("resnet50"))

    # Prepare the input
    input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

    # Build the inference request
    inputs = [httpclient.InferInput("input", list(input_data.shape), "FP32")]
    inputs[0].set_data_from_numpy(input_data)
    outputs = [httpclient.InferRequestedOutput("output")]

    # Send the inference request
    results = client.infer(model_name="resnet50", inputs=inputs, outputs=outputs)

    # Read the result
    output_data = results.as_numpy("output")
    print(f"Output shape: {output_data.shape}")
    print(f"Top-5 predictions: {np.argsort(output_data[0])[-5:][::-1]}")

Example 4: Performance monitoring

    # 1. Look at the metrics
    $ curl localhost:8002/metrics

    # Output (excerpt):
    # triton_inference_count{model="resnet50"} 1000
    # triton_inference_count{model="bert"} 500
    # triton_inference_exec_count{model="resnet50"} 1000
    # triton_inference_exec_count{model="bert"} 500
    # triton_inference_request_duration_us{model="resnet50",le="1000"} 950
    # triton_inference_queue_duration_us{model="resnet50",le="100"} 980

    # 2. Watch GPU/NPU utilization (in another terminal)
    $ watch -n 1 npu-smi stats -i 0

    # 3.
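    # Stress test
    $ python benchmark.py \
        --model resnet50 \
        --concurrency 10 \
        --requests 1000

    # Output:
    # Throughput: 1250 req/s
    # Latency (p50): 26ms
    # Latency (p99): 45ms
    # GPU/NPU Utilization: 85%

Performance numbers

Performance gains from using Triton GE Backend:

    Scenario                  Without Triton      With Triton            Gain
    Single-model inference    125 img/s           125 img/s              1x (same)
    Dynamic batching          125 img/s           450 img/s              3.6x
    Multi-model concurrency   manual scheduling   automatic scheduling   2x
    Resource utilization      40%                 85%                    2.1x

Infuriating, isn't it? Dynamic batching alone makes things 3.6 times faster.

Relationship to other repositories

In the CANN architecture, triton-inference-server-ge-backend belongs to layer 4 (the Ascend compute execution layer) and acts as the inference serving backend.

Dependency chain:

    triton-inference-server-ge-backend (Triton backend)
        ↓ calls
    GE / GraphExecutor (graph executor)
        ↓ calls
    Runtime
        ↓ calls
    Hardware (Ascend NPU)

To explain:

    Triton: the open-source inference serving framework
    GE Backend: Triton's Ascend backend
    GE / GraphExecutor: the Ascend graph executor
    Hardware: Ascend NPU

Put simply, triton-inference-server-ge-backend is the bridge between Triton and Ascend. If you want Triton to manage Ascend models, this is what you use.

Core contents of triton-inference-server-ge-backend

1. Backend implementation

    // src/graph_executor_backend.cc (simplified outline; "..." marks elided arguments)
    #include <vector>
    #include "triton/backend/backend_model.h"
    #include "ge/ge_api.h"

    class GraphExecutorBackend : public triton::backend::BackendModel {
     public:
      void Infer(...) override {
        // 1. Prepare inputs
        std::vector<ge::Tensor> inputs = PrepareInputs(...);
        // 2. Run inference
        std::vector<ge::Tensor> outputs;
        ge::GraphExecutor executor;
        executor.Run(inputs, outputs);
        // 3. Return outputs
        ProcessOutputs(outputs, ...);
      }
    };

2. Model configuration

    # config.pbtxt
    name: ...
    platform: "graph_executor"
    max_batch_size: 32
    input [...]
    output [...]
    dynamic_batching {...}

3. Client SDK

    # Python client
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")
    results = client.infer(model_name="resnet50", inputs=[...], outputs=[...])

4. Monitoring

    # Metrics endpoint
    curl localhost:8002/metrics

    # Key metrics:
    # - triton_inference_count
    # - triton_inference_request_duration_us
    # - triton_inference_queue_duration_us

When to use it

When should you use triton-inference-server-ge-backend?

    Multi-framework models: TensorFlow + PyTorch + ONNX
    Production deployment: you need monitoring and dynamic batching
    Multi-model serving: 100+ models served at the same time

When should you skip it?

    Single-model debugging: torch.jit.trace() is enough
    Offline inference: use torch.inference_mode()

Summary

triton-inference-server-ge-backend is Ascend's Triton inference serving backend:

    Unified API: one interface for every model
    Dynamic batching: requests merged automatically to raise throughput
    Model ensembles: several models chained together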
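
To make the "one interface for every model" point concrete, here is a minimal Python sketch of the request that the curl examples send, using only the standard library. It assumes the server speaks the same v2 inference protocol shown in those examples (POST to /v2/models/<name>/infer with an "inputs" list). The helper names build_infer_body, infer_url, and post_infer are my own for illustration and are not part of any Triton SDK; for real clients the tritonclient package from Example 3 is the better choice.

```python
import json
from math import prod
from urllib import request


def build_infer_body(name, shape, datatype, data):
    """Build a v2-style request body for one input tensor (hypothetical helper)."""
    expected = prod(shape)
    if len(data) != expected:
        # Catch shape/data mismatches before they reach the server.
        raise ValueError(f"shape {shape} needs {expected} values, got {len(data)}")
    return {
        "inputs": [
            {"name": name, "shape": list(shape), "datatype": datatype, "data": list(data)}
        ]
    }


def infer_url(model_name, host="localhost:8000"):
    """Only the model name changes between models; the path layout stays the same."""
    return f"http://{host}/v2/models/{model_name}/infer"


def post_infer(model_name, body, host="localhost:8000"):
    """Send the request; needs a running server, so it is not called here."""
    req = request.Request(
        infer_url(model_name, host),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


# A tiny 1x4 INT32 tensor, just to show the shape of the payload.
body = build_infer_body("input_ids", [1, 4], "INT32", [101, 2023, 2003, 102])
print(infer_url("bert"))
print(json.dumps(body))
```

Swapping "bert" for "resnet50" or "yolo" changes nothing but the URL and the tensor description, which is the whole appeal of the unified API. The length check in build_infer_body is a small design choice of mine: a flat data list that does not match the declared shape is an easy mistake to make when writing these payloads by hand.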