Requirement: the input is video, and for every frame several neural network models need to run inference and produce predictions. To make the overall inference process more efficient, I tried a few different schemes.
Hardware: a single-GPU machine.
Since multiple models need to run inference but have no dependencies on one another, the natural idea is to improve efficiency through parallelism.
I compared the results of the following schemes: serial execution, multithreading, multiprocessing, and coroutines.
- 4 models of similar size are configured.
- To keep video reading and decoding time from skewing the results, the video is read and the inputs are prepared in advance.
- For each scheme, every model's cumulative inference time and the overall runtime are recorded.
```python
import asyncio
from time import time


def main():
    frames = load_video()
    weights = load_weights()

    print('Serial:')
    one_by_one(weights, frames)

    print('Multithreading:')
    multi_thread(weights, frames)

    print('Multiprocessing:')
    multi_process(weights, frames)

    print('Coroutines:')
    asyncio.run(coroutine(weights, frames))


if __name__ == '__main__':
    # The guard is required for the multiprocessing scheme to work
    # on platforms that spawn rather than fork new processes.
    main()
```
Serial: after the current frame is read, all models run on it one after another.
```python
def one_by_one(weights, frames):
    sessions = [init_session(weight) for weight in weights]
    costs = [[] for _ in range(len(weights))]

    since_infer = time()
    for frame in frames:
        # Run every model on the current frame, one after another.
        for idx, session in enumerate(sessions):
            since = time()
            _ = session.run(['output'], {'input': frame})
            costs[idx].append(time() - since)
    print([sum(cost) for cost in costs])
    print("infer:", time() - since_infer)
    return
```
Multithreading: each model gets its own thread.
```python
from threading import Thread


def multi_thread(weights, frames):
    sessions = [init_session(weight) for weight in weights]

    threads = []
    since_infer = time()
    for session in sessions:
        thread = Thread(target=run_session_thread, args=(session, frames))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    print("infer:", time() - since_infer)
    return


def run_session_thread(session, frames):
    costs = []
    for frame in frames:
        since = time()
        _ = session.run(['output'], {'input': frame})
        costs.append(time() - since)
    print(sum(costs))
    return
```
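As an aside, the same one-thread-per-model scheme can be written with `concurrent.futures.ThreadPoolExecutor`, which also surfaces exceptions raised inside worker threads. An untested equivalent sketch:

```python
from concurrent.futures import ThreadPoolExecutor


def multi_thread_pool(weights, frames):
    # One pool worker per model session.
    sessions = [init_session(weight) for weight in weights]
    since_infer = time()
    with ThreadPoolExecutor(max_workers=len(sessions)) as pool:
        futures = [pool.submit(run_session_thread, session, frames)
                   for session in sessions]
        for future in futures:
            future.result()  # re-raises any exception from the worker
    print("infer:", time() - since_infer)
```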
Multiprocessing: each model gets its own process.
Since a session cannot be passed between processes, each process has to initialize its own session. As long as there is enough data to process, this initialization cost is basically negligible.
```python
from multiprocessing import Manager, Process


def multi_process(weights, frames):
    # Share the frames with the worker processes through a Manager proxy.
    inputs = Manager().list(frames)

    processes = []
    since_infer = time()
    for weight in weights:
        process = Process(target=run_session_process, args=(weight, inputs))
        process.start()
        processes.append(process)
    for process in processes:
        process.join()
    print("infer:", time() - since_infer)
    return


def run_session_process(weight, frames):
    # The session must be created inside the worker process.
    session = init_session(weight)
    costs = []
    for frame in frames:
        since = time()
        _ = session.run(['output'], {'input': frame})
        costs.append(time() - since)
    print(sum(costs))
    return
```
Coroutines: each model gets its own coroutine.
```python
async def coroutine(weights, frames):
    sessions = [init_session(weight) for weight in weights]

    since_infer = time()
    tasks = [
        asyncio.create_task(run_session_coroutine(session, frames))
        for session in sessions
    ]
    for task in tasks:
        await task
    print("infer:", time() - since_infer)
    return


async def run_session_coroutine(session, frames):
    costs = []
    for frame in frames:
        since = time()
        _ = session.run(['output'], {'input': frame})
        costs.append(time() - since)
    print(sum(costs))
    return
```
```python
import cv2
import numpy as np
import onnxruntime as ort


def init_session(weight):
    provider = "CUDAExecutionProvider"
    session = ort.InferenceSession(weight, providers=[provider])
    return session


def load_video():
    # To cut down on video-reading time, duplicate each frame to form a batch.
    vcap = cv2.VideoCapture('path_to_video')
    count = 1000
    batch_size = 4
    frames = []
    for _ in range(count):
        _, frame = vcap.read()
        frame = cv2.resize(frame, (256, 256)).transpose((2, 0, 1))  # HWC -> CHW
        frame = np.stack([frame] * batch_size, axis=0)              # NCHW batch
        frames.append(frame.astype(np.float32))
    return frames


def load_weights():
    return ['path_to_weights_0', 'path_to_weights_1',
            'path_to_weights_2', 'path_to_weights_3']
```
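One caveat about the snippets above: they assume every model's tensors are literally named 'input' and 'output'. For arbitrary ONNX files, the real names can be queried from the session; for example (the helper name here is illustrative):

```python
def run_inference(session, frame):
    # Look up the actual tensor names instead of hard-coding them.
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    return session.run([output_name], {input_name: frame})
```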
With `batch_size=4` and 1000 frames of data, the inference results are as follows:
Scheme | Serial | Threads | Processes | Coroutines |
---|---|---|---|---|
Per-model cumulative time/s | 7.9/5.3/5.2/5.2 | 13.5/13.5/15.6/15.7 | 13.5/13.8/13.7/13.6 | 6.5/5.2/5.3/5.3 |
Total time/s | 23.7 | 15.8 | 30.1 | 22.5 |
GPU memory/MB | 1280 | 1416 | 3375 | 1280 |
Average GPU-Util | ~60% | ~85% | ~70% | ~55% |
Why does multithreading improve on serial execution? A `session.run()` call involves both CPU-side and GPU-side work, and the Python GIL is released while the native run call executes, so one model's CPU portion can overlap with another model's GPU portion. Consistent with this, the threaded total (15.8 s) is close to the slowest single model's cumulative time (15.7 s): the four models genuinely run concurrently.

Why does multiprocessing actually reduce efficiency? Likely because each process creates its own CUDA context on the single GPU (note the roughly fourfold GPU memory usage), and because every frame fetched through the `Manager().list` proxy must be pickled across the process boundary.
Why do coroutines appear to perform the same as serial execution? Observably, each model's cumulative time is `print`ed one after another, at intervals roughly equal to that model's cumulative time (whereas in the thread and process schemes, all four cumulative times appear almost simultaneously, showing the models finish together). A possible reason: `session.run()` is an ordinary blocking call and `run_session_coroutine` contains no `await` point, so each task runs to completion before the event loop ever switches to the next.
Is a coroutine scheme even necessary here? It can only help if the blocking call is given a real suspension point, as in the sketch below.
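One way to add such a suspension point is `asyncio.to_thread` (Python 3.9+), which offloads the blocking call to a worker thread so that awaiting it yields control back to the event loop. I have not benchmarked this variant; since under the hood it still uses threads, it likely buys little over the multithreading scheme for this workload.

```python
async def run_session_coroutine_v2(session, frames):
    # Awaiting to_thread suspends this coroutine, letting the other
    # models' coroutines make progress in the meantime.
    costs = []
    for frame in frames:
        since = time()
        _ = await asyncio.to_thread(session.run, ['output'], {'input': frame})
        costs.append(time() - since)
    print(sum(costs))
```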
I am new to coroutines, so the conclusions above may stem from using them incorrectly. Corrections are welcome.