llama.cpp (Cortex)

Overview

Jan includes Cortex, its default C++ inference server built on top of llama.cpp. The server provides an OpenAI-compatible API, request queues, scaling, and additional features on top of the wide capabilities of llama.cpp.
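
Because the server exposes an OpenAI-compatible API, a request to it should follow the usual chat completions shape. The sketch below only illustrates that shape: the base URL is an assumption (check the address shown in your own Jan settings), the helper names are hypothetical, and the model id is the one from the TinyLlama example later in this guide.

```python
import json
import urllib.request

# Assumption: address of the local server; verify it in your Jan settings.
BASE_URL = "http://localhost:1337/v1"


def build_chat_request(model_id, user_message, base_url=BASE_URL):
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model_id,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running and a model started, `send_chat_request(build_chat_request("tinyllama-1.1b", "Hello!"))` would return the reply as a string. Building and sending are kept separate so the request can be inspected before any network call is made.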

This guide shows you how to set up the llama.cpp engine, including downloading and installing the required dependencies, so that you can start chatting with a model.

Prerequisites

  • Mac Intel:
    • Make sure you're using an Intel-based Mac. For a complete list of supported Intel CPUs, please see here.
    • Smaller models are recommended on Mac Intel.
  • Mac Silicon:
    • Make sure you're using a Mac with Apple Silicon. For a complete list of supported Apple Silicon CPUs, please see here.
    • Choose a model size that suits your hardware.

On Mac Silicon, the engine uses the Apple GPU through Metal by default for acceleration. The Apple Neural Engine (ANE) is not supported yet.

  • Windows:
    • Ensure that you have Windows with x86_64 architecture.
  • Linux:
    • Ensure that you have Linux with x86_64 architecture.

GPU Acceleration Options

Enable the GPU acceleration option within the Jan application by following the Installation Setup guide.

Step-by-step Guide

Step 1: Open the model.json

  1. Open the Jan Data Folder.
  2. Select the models folder, open the folder of the model that you want to modify, and click model.json.
  3. Once open, the model.json file looks like the example below, which uses the model "TinyLlama Chat 1.1B Q4":

{
  "sources": [
    {
      "filename": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
      "url": "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
    }
  ],
  "id": "tinyllama-1.1b",
  "object": "model",
  "name": "TinyLlama Chat 1.1B Q4",
  "version": "1.0",
  "description": "TinyLlama is a tiny model with only 1.1B. It's a good model for less powerful computers.",
  "format": "gguf",
  "settings": {
    "ctx_len": 4096,
    "prompt_template": "<|system|>\n{system_message}<|user|>\n{prompt}<|assistant|>",
    "llama_model_path": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
  },
  "parameters": {
    "temperature": 0.7,
    "top_p": 0.95,
    "stream": true,
    "max_tokens": 2048,
    "stop": [],
    "frequency_penalty": 0,
    "presence_penalty": 0
  },
  "metadata": {
    "author": "TinyLlama",
    "tags": [
      "Tiny",
      "Foundation Model"
    ],
    "size": 669000000
  },
  "engine": "nitro"
}
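
Since model.json is plain JSON, it can be read with any JSON parser. As a rough illustration, the sketch below loads the file and shows how the `{system_message}` and `{prompt}` placeholders in `prompt_template` could be filled in. Both helper names are hypothetical, and the substitution shown is an assumption about how the template is meant to be used, not a description of the engine's own code.

```python
import json
from pathlib import Path


def load_model_config(path):
    """Read a model.json file and return it as a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def render_prompt(template, system_message, prompt):
    """Fill the {system_message} and {prompt} placeholders of a template."""
    # str.replace is used instead of str.format so that any other braces
    # in a template are left untouched.
    return (
        template.replace("{system_message}", system_message)
        .replace("{prompt}", prompt)
    )
```

For the TinyLlama example, `load_model_config("model.json")["settings"]["prompt_template"]` returns the template string, which can then be passed to `render_prompt` together with a system message and a user prompt.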

Step 2: Modify the model.json

  1. Modify the model's engine settings under the settings object. You can use the following parameters:

| Parameter | Type | Description |
|---|---|---|
| ctx_len | Integer | Sets the context length, providing ample context for model operations, similar to GPT-3.5. The default value is 2048 (maximum: 4096, minimum: 1). |
| prompt_template | String | Defines the template used to format prompts. |
| model_path | String | Specifies the path to the model's .GGUF file. |
| ngl | Integer | Determines how many model layers are offloaded to the GPU. The default value is 100. |
| cpu_threads | Integer | Determines the number of CPU inference threads, limited by hardware and OS (maximum determined by the system). |
| cont_batching | Integer | Controls continuous batching, which improves throughput for LLM inference. |
| embedding | Integer | Enables embeddings for tasks such as document-enhanced chat in RAG-based applications. |

  2. Save the model.json file.
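
The same edit can be scripted instead of made by hand. The sketch below is one possible approach, not part of Jan itself: `update_settings` is a hypothetical helper that merges new values into the settings object and checks `ctx_len` against the limits listed above before saving.

```python
import json
from pathlib import Path

# Limits for ctx_len as listed in the parameter descriptions above.
CTX_LEN_MIN, CTX_LEN_MAX = 1, 4096


def update_settings(path, **changes):
    """Merge `changes` into the "settings" object of a model.json file."""
    ctx_len = changes.get("ctx_len")
    if ctx_len is not None and not CTX_LEN_MIN <= ctx_len <= CTX_LEN_MAX:
        raise ValueError(
            f"ctx_len must be between {CTX_LEN_MIN} and {CTX_LEN_MAX}"
        )

    model_file = Path(path)
    config = json.loads(model_file.read_text(encoding="utf-8"))
    config.setdefault("settings", {}).update(changes)
    # Write the file back with indentation so it stays easy to read.
    model_file.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config["settings"]
```

For example, `update_settings("model.json", ctx_len=2048, ngl=100)` would rewrite the file with the two values changed and all other keys preserved. Validating before reading the file means an invalid value never touches the file on disk.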

These settings affect only the selected model. If you use a different model, you must configure it again.
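
Because each model keeps its own model.json, repeating a change for several models means editing several files. A minimal sketch of doing this in one pass is shown below. It assumes the layout described in Step 1, with one folder per model inside the models folder, each containing a model.json; the function name is hypothetical and the path to the models folder must be supplied by you.

```python
import json
from pathlib import Path


def apply_to_all_models(models_dir, **changes):
    """Apply the same settings to every <model folder>/model.json found."""
    updated = []
    for model_file in sorted(Path(models_dir).glob("*/model.json")):
        config = json.loads(model_file.read_text(encoding="utf-8"))
        config.setdefault("settings", {}).update(changes)
        model_file.write_text(
            json.dumps(config, indent=2) + "\n", encoding="utf-8"
        )
        # Report the model id, falling back to the folder name.
        updated.append(config.get("id", model_file.parent.name))
    return updated
```

The function returns the ids of the models it changed, which makes it easy to confirm that only the intended models were touched. Consider backing up the models folder before running a bulk edit like this.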

Step 3: Start the Model

  1. Restart the Jan application to apply your settings.
  2. Navigate to Threads.
  3. Chat with your model.

If you have questions, please join our Discord community for support, updates, and discussions.