This pull request adds StreamingLLM support to the llamacpp and llamacpp_HF loaders. With StreamingLLM enabled, the prompt no longer has to be re-evaluated every time the context window fills up, which allows chatting with the model indefinitely and noticeably speeds up chat generation at long context lengths.
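For context, the core idea behind StreamingLLM is to keep the first few tokens of the sequence (the "attention sinks") in the cache and discard the oldest tokens after them when the context overflows, so only genuinely new tokens ever need to be evaluated. The sketch below is a minimal, loader-agnostic illustration of that cache-trimming logic on plain token-ID lists; it is not the actual loader code from this PR, and the names `apply_streaming_llm` and `num_sink_tokens` are hypothetical.

```python
# Illustrative sketch of the StreamingLLM cache-trimming idea (not the PR's code).
# Tokens are represented as plain integer IDs.

def apply_streaming_llm(cached_tokens, new_tokens, max_context, num_sink_tokens=4):
    """Decide which cached tokens to keep and which new tokens still need
    evaluation, without ever re-evaluating the whole prompt."""
    # 1) Find how much of the new prompt is already covered by the cache.
    overlap = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        overlap += 1

    kept = cached_tokens[:overlap]       # reuse the matching prefix
    to_evaluate = new_tokens[overlap:]   # only the new suffix is evaluated

    # 2) If the result would overflow the context window, keep the attention
    #    sinks (the first few tokens) and drop the oldest tokens after them.
    total = len(kept) + len(to_evaluate)
    if total > max_context:
        excess = total - max_context
        sinks = kept[:num_sink_tokens]
        rest = kept[num_sink_tokens + excess:]  # discard the oldest middle tokens
        kept = sinks + rest

    return kept, to_evaluate


# Example: a long chat history that would otherwise force a full re-evaluation.
cache = list(range(100))                      # tokens already in the cache
prompt = list(range(100)) + [500, 501, 502]   # same history plus a new message
kept, new = apply_streaming_llm(cache, prompt, max_context=64)
print(len(kept), new)                         # only the 3 new tokens are evaluated
```

In practice, the loaders would also have to shift or rebuild the KV cache to match the kept tokens, but the trimming decision above is the part that makes indefinite chatting possible without prompt re-evaluation.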