InternLM has introduced its newest development in open large language models, InternLM2.5-7B-Chat, now available in GGUF format. The model is compatible with llama.cpp, an open-source framework for LLM inference, and can be run both locally and in the cloud on a range of hardware platforms. The GGUF release includes half-precision and low-bit quantized versions, including q5_0, q5_k_m, q6_k, and q8_0.
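As a quick illustration of local use, the sketch below loads one of the quantized GGUF files through llama-cpp-python, the Python bindings for llama.cpp; the file name and generation settings are assumptions and should be replaced with the file actually downloaded from the model repository.

```python
# Minimal sketch: running a quantized GGUF build of InternLM2.5-7B-Chat
# locally via llama-cpp-python (Python bindings for llama.cpp).
from llama_cpp import Llama

llm = Llama(
    model_path="./internlm2_5-7b-chat-q5_k_m.gguf",  # hypothetical local file name
    n_ctx=4096,         # context window for this session
    n_gpu_layers=-1,    # offload all layers to the GPU if one is available
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what GGUF quantization does."}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
```

The same GGUF file also works with the llama.cpp command-line tools, so the choice of bindings is purely a matter of convenience.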
The upgraded InternLM2.5 offers a 7-billion-parameter base model along with a chat model tuned for practical applications. It stands out for its advanced reasoning, especially mathematical reasoning, where it outperforms competitors such as Llama3 and Gemma2-9B. Another distinctive feature is its 1M-token context window, which enables near-perfect performance on tasks requiring long-context understanding, as assessed by LongBench.
InternLM2.5-7B-Chat is especially effective at extracting information from long documents, which makes it well suited to document-heavy workloads. This strength is amplified when the model is paired with LMDeploy, a toolkit for compressing, deploying, and serving LLMs. A dedicated variant, InternLM2.5-7B-Chat-1M, is designed for 1M-token context inference, although it requires substantial computational resources to run effectively, such as 4x A100-80G GPUs.
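A long-context deployment with LMDeploy's Python pipeline API might look like the following sketch; the engine settings (session length, batch size, tensor parallelism across four GPUs) are illustrative assumptions rather than required values, and the document path is hypothetical.

```python
# Sketch: long-context inference with the 1M variant via LMDeploy's pipeline.
# Assumes a multi-GPU setup (e.g. 4x A100-80G, hence tp=4); tune the values
# for your own hardware.
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    session_len=1048576,   # ~1M-token session window
    max_batch_size=1,      # long contexts leave little memory for batching
    tp=4,                  # tensor parallelism across 4 GPUs
)

pipe = pipeline("internlm/internlm2_5-7b-chat-1m", backend_config=engine_cfg)

with open("long_report.txt") as f:   # hypothetical long input document
    document = f.read()

response = pipe(
    [f"{document}\n\nExtract the key findings from the document above."],
    gen_config=GenerationConfig(max_new_tokens=512),
)
print(response[0].text)
```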
InternLM2.5-7B-Chat’s performance was evaluated with the OpenCompass tool across several dimensions: disciplinary, language, knowledge, inference, and comprehension competence. On benchmarks such as MMLU, CMMLU, BBH, MATH, GSM8K, and GPQA, the model outperforms comparable models. On MMLU, for instance, it scores 72.8, surpassing Llama-3-8B-Instruct and Gemma2-9B-IT.
The model’s ability to gather information effectively from more than 100 web pages demonstrates its strength in tool use. A forthcoming release of Lagent is expected to further enhance its instruction following, tool selection, and reflection capabilities.
For users who want to set up the model, there is an exhaustive installation guide, instructions for downloading the model, and examples of model inference and service deployment. Batched offline inference with the quantized model can be performed with LMDeploy, which supports INT4 weight-only quantization and deployment (W4A16). This setup delivers up to 2.4x faster inference than FP16 on compatible NVIDIA GPUs, including the 20, 30, and 40 series as well as the A10, A16, A30, and A100.
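A minimal sketch of such batched offline inference with a W4A16-quantized checkpoint is shown below; the 4-bit model identifier and the generation settings are assumptions based on LMDeploy's documented pipeline API.

```python
# Sketch: batched offline inference with an INT4 (W4A16) quantized model via
# LMDeploy. model_format="awq" tells the TurboMind engine to load 4-bit
# AWQ-quantized weights; the model id below is an assumption.
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

pipe = pipeline(
    "internlm/internlm2_5-7b-chat-4bit",  # hypothetical 4-bit checkpoint id
    backend_config=TurbomindEngineConfig(model_format="awq"),
)

prompts = [
    "Explain the difference between W4A16 and FP16 inference.",
    "Give three use cases for long-context language models.",
]

# A single call processes the whole batch of prompts.
responses = pipe(prompts, gen_config=GenerationConfig(max_new_tokens=256))
for r in responses:
    print(r.text)
```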
The InternLM2.5 model architecture retains the crucial features of the previous version while integrating new technical advances. Powered by a large corpus of synthetic data and an iterative training process, these improvements yield significantly stronger reasoning performance, a 20% increase over InternLM2. The model also retains its ability to handle 1M-token context windows with near-complete accuracy, making it a top-performing model for long-context tasks.
To conclude, the markedly advanced reasoning abilities, effective long-context handling, efficient tool use, and open availability of InternLM2.5 and its variants make InternLM2.5-7B-Chat a valuable asset for a wide range of research and practical applications.