Best Practices for Using and Optimizing NVIDIA's Full-Stack LLM Solution
Guofeng Zhou (Chandler), Technical R&D Manager, NVIDIA
GTC 2024 China AI Day, Mar. 19, 2024
Agenda
• NVIDIA Full-Stack Solution for LLM
• Best Practices of NVIDIA Megatron-Core for LLM Training
• Best Practices of NVIDIA TensorRT-LLM for LLM Inference
• Best Practices of NVIDIA Triton Inference Server for LLM Deployment
• Conclusion and Prospect
NVIDIA Full-Stack Solution for LLM
From Training and Inference to Deployment
NVIDIA Megatron-Core (M-Core) for LLM Training
• An open-source library of GPU-optimized techniques for LLM training, for customers to build custom LLM frameworks.
NVIDIA TensorRT-LLM for LLM Inference
• An open-source library that accelerates and optimizes inference performance of the latest large language models (LLMs).
NVIDIA Triton Inference Server for LLM Deployment
• An open-source library that standardizes AI model deployment and execution across every workload.
TensorRT-LLM + Triton Inference Server for Deployment
• The suggested way to deploy LLM-based services on the NVIDIA AI platform
• SOTA performance and rich functionalities
• TensorRT-LLM backend: the Triton backend for TensorRT-LLM, including in-flight batching, paged KV cache, and more.
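The paged KV cache mentioned above can be illustrated with a toy sketch: KV memory is carved into fixed-size pages, each sequence holds a list of page indices, and memory grows in page-sized steps instead of being pre-allocated for the maximum sequence length. This is a minimal conceptual model only; the class, page size, and method names are illustrative and are not TensorRT-LLM's actual implementation or API.

```python
# Toy sketch of the paged-KV-cache idea (illustrative, not TensorRT-LLM code).
PAGE_SIZE = 16  # tokens per page (example value)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # seq_id -> list of page indices
        self.seq_len = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one new token of a sequence."""
        pages = self.page_table.setdefault(seq_id, [])
        length = self.seq_len.get(seq_id, 0)
        if length % PAGE_SIZE == 0:          # current page full, or first token
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            pages.append(self.free_pages.pop())
        self.seq_len[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)

cache = PagedKVCache(num_pages=4)
for _ in range(20):                        # 20 tokens -> 2 pages of 16
    cache.append_token("req-0")
print(len(cache.page_table["req-0"]))      # 2
cache.release("req-0")
print(len(cache.free_pages))               # 4
```

Because pages are returned to a shared free pool as soon as a request finishes, many requests with different lengths can share one memory budget, which is what makes in-flight batching memory-efficient.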