Run Llama2-7b at 100+ TPS on an A10

#machinelearning #tutorial #ai #bash

Hi folks.

I've been working on scripts that set up and manage TensorRT-LLM and the Triton backend to build the Llama2-7b model in FP16, INT8, and INT4 formats.
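To give a feel for why the lower-precision builds matter on a single GPU, here is a rough back-of-the-envelope sketch of weight memory at each precision. It assumes a nominal 7 billion parameters and counts weights only; the KV cache, activations, and quantization scale factors are ignored, so real usage will be higher. The names `PARAMS`, `BYTES_PER_WEIGHT`, and `weight_gib` are made up for this illustration and are not part of the linked scripts.

```python
# Rough weight-only memory estimate for a "7B" model at different precisions.
# Assumption: 7e9 parameters; ignores KV cache, activations, and quantization scales.
PARAMS = 7_000_000_000

# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(precision: str, params: int = PARAMS) -> float:
    """Return the approximate size of the weights in GiB."""
    return params * BYTES_PER_WEIGHT[precision] / 2**30


if __name__ == "__main__":
    for name in BYTES_PER_WEIGHT:
        print(f"{name}: {weight_gib(name):.1f} GiB")
```

Halving the bytes per weight halves the memory the GPU has to read for each generated token, which is one plausible reason the INT4 build is the fastest of the three.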

I ran a benchmark with int4 and achieved an inference speed of approximately 100 tokens per second.
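For anyone who wants to reproduce a tokens-per-second number, here is a minimal sketch of how the figure can be computed: count the generated tokens and divide by wall-clock time. This is not the benchmark script from the repo. `generate` is a hypothetical callable standing in for whatever client you use to call the model, and `fake_generate` is a placeholder so the snippet runs without a GPU.

```python
import time
from typing import Callable, List


def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second for a given token count and elapsed time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


def benchmark(generate: Callable[[str], List[int]], prompt: str, runs: int = 3) -> float:
    """Call generate() several times and return average tokens per second."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        # Assumption: generate() returns only the newly generated token ids.
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start
    return throughput(total_tokens, elapsed)


def fake_generate(prompt: str) -> List[int]:
    """Placeholder model: pretends to produce 32 tokens after a short delay."""
    time.sleep(0.01)
    return list(range(32))


if __name__ == "__main__":
    print(f"{benchmark(fake_generate, 'Hello'):.1f} tokens/s")
```

Swap `fake_generate` for a real call to your deployed model. Doing a warm-up request before timing helps keep one-time startup costs out of the measurement.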

GitHub: Llama7b-TensorRT-LLM

Top comments (1)

Lionel♾️☁️ • Nov 30 '23

Hello @mattick27, great work. I love the hard work you put into this. Above all, thanks for the link; it helps to be able to see what you are referring to.
