Description
vLLM is a high-throughput LLM evaluator that runs HuggingFace models and shards them across GPUs (tensor parallelism) using the Ray backend.
Even in its basic form, vLLM provides a large speedup over AccelerateEvaluator, which is comparatively slow.
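For reference, here is a minimal sketch of how vLLM is typically invoked on a HuggingFace model with tensor-parallel sharding across the GPUs of a single node; the model name and sampling settings below are placeholders:

```python
from vllm import LLM, SamplingParams

# Load a HuggingFace model and shard it across 4 GPUs on one node.
# With tensor_parallel_size > 1, vLLM distributes the weights across
# workers (Ray can serve as the distributed backend).
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=4)

sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

# Batched generation; vLLM schedules the whole batch internally.
outputs = llm.generate(["What is the capital of France?"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```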
Basic requirements:
- Should be compatible with RayEvaluator (and GenerativeLM if needed); see the sketch after this list.
- Should support single-node models only; scaling up to larger models should be done with larger nodes rather than multi-node sharding (a design choice favoring execution speed).
- Should integrate with all HF transformers LLMs.
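Since the RayEvaluator and GenerativeLM interfaces are not spelled out here, the following is only a rough sketch of what a vLLM-backed evaluator could look like; the class name, constructor arguments, and `generate` signature are assumptions, not the actual framework API:

```python
from typing import List

from vllm import LLM, SamplingParams


class VLLMEvaluator:
    """Hypothetical vLLM-backed evaluator (illustrative only).

    The real implementation would conform to whatever interface
    RayEvaluator / GenerativeLM actually expect.
    """

    def __init__(self, model_name: str, num_gpus: int = 1):
        # Single-node only: num_gpus must not exceed the GPUs on one node.
        self._llm = LLM(model=model_name, tensor_parallel_size=num_gpus)

    def generate(self, prompts: List[str], max_tokens: int = 128) -> List[str]:
        params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
        outputs = self._llm.generate(prompts, params)
        return [out.outputs[0].text for out in outputs]
```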