Code Example
You can download the sample code from this repository: llm-laptop/llama-cpp-api at main · urbanspr1nter/llm-laptop
Introduction
In this post, I'd like to introduce a basic pattern in Python for repeatedly retrieving tokens generated by llama.cpp. This feature is very common and gives the same effect you see in popular LLM-powered chat applications like ChatGPT and Claude: after asking the LLM a question, you'll see a stream of words start to appear in the response as if it were speaking to you in real time.
We can do this ourselves with llama.cpp using streaming HTTP responses, and not just through the web, but also within the terminal!
Setup
You'll need to set up llama.cpp. You can follow the build guide to do just that: https://www.roger.lol/blog/llamacpp-build-guide-for-cpu-inferencing
Make sure to work within a virtual environment. You can always create one using python -m venv .venv, which places a virtual environment in .venv under the current working directory.
You'll only need a single dependency, requests, to get this code up and running. Everything else is just straight vanilla Python. :)
pip install requests
Inference Using the llama.cpp API
Performing inference with llama.cpp is simple: it's just a POST request to the endpoint /v1/chat/completions. Assuming you already have your server up and running, the llama.cpp server is compatible with OpenAI's REST API specification: API Reference - OpenAI API
We can start by building a simple payload:
import requests

base_url = "http://localhost:8000/v1"
model = "gemma3-4b-it"

payload = {
    "model": model,
    "messages": [
        { "role": "user", "content": "Hello, there!" }
    ],
    "stream": False
}

chat_response = requests.post(f"{base_url}/chat/completions", json=payload)
response_json = chat_response.json()

print(response_json)
This sends a simple message, Hello, there!, to the LLM within the messages array. Setting stream to False means that we'll get the entire message from the LLM in one large payload. As soon as we get that response, we can print out the response dictionary, which turns out to be:
{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': 'Hello to you too! 😊 \n\nHow can I help you today? Do you want to:\n\n* Chat about something?\n* Ask me a question?\n* Play a game?\n* Get some information?'}}], 'created': 1744481277, 'model': 'gemma3-4b-it', 'system_fingerprint': 'b4991-0bb29193', 'object': 'chat.completion', 'usage': {'completion_tokens': 50, 'prompt_tokens': 13, 'total_tokens': 63}, 'id': 'chatcmpl-w3n9tEOxkcmSwLGf93GEV4LL7O9esaCa', 'timings': {'prompt_n': 1, 'prompt_ms': 3332.753, 'prompt_per_token_ms': 3332.753, 'prompt_per_second': 0.3000522390948264, 'predicted_n': 50, 'predicted_ms': 164095.729, 'predicted_per_token_ms': 3281.9145799999997, 'predicted_per_second': 0.30470019119144776}}
Let's convert it to a formatted JSON string so we can better understand what we have to work with.
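One simple way to do that is with the standard library's json module. This little snippet just pretty-prints the response_json dictionary from the previous example:

import json

# Pretty-print the response dictionary as an indented JSON string.
# ensure_ascii=True (the default) escapes non-ASCII characters such as
# the emoji into \uXXXX sequences, which is what you'll see below.
print(json.dumps(response_json, indent=2, ensure_ascii=True))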
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello to you too! \ud83d\ude0a \n\nHow can I help you today? Do you want to:\n\n* Chat about something?\n* Ask me a question?\n* Play a game?\n* Get some information?"
      }
    }
  ],
  "created": 1744481277,
  "model": "gemma3-4b-it",
  "system_fingerprint": "b4991-0bb29193",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 50,
    "prompt_tokens": 13,
    "total_tokens": 63
  },
  "id": "chatcmpl-w3n9tEOxkcmSwLGf93GEV4LL7O9esaCa",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 3332.753,
    "prompt_per_token_ms": 3332.753,
    "prompt_per_second": 0.3000522390948264,
    "predicted_n": 50,
    "predicted_ms": 164095.729,
    "predicted_per_token_ms": 3281.9145799999997,
    "predicted_per_second": 0.30470019119144776
  }
}
There is a lot of stuff here, and it's mostly statistics. What we want is the content of the response, which lives in choices[0].message.content. That content is Hello to you too! 😊 \n\nHow can I help you today? Do you want to:\n\n* Chat about something?\n* Ask me a question?\n* Play a game?\n* Get some information? Notice that it is raw text containing \n characters. Once serialized to JSON, characters such as the emoji are escaped as Unicode sequences.
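For reference, pulling that content out of the response dictionary from the earlier snippet looks like this:

# Drill into the first choice and grab the assistant's message text
content = response_json["choices"][0]["message"]["content"]
print(content)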
Streaming Responses
So how do we stream responses? We can use Python generators to display tokens to the user as they are generated. Instead of getting one large JSON response, the server continuously streams smaller JSON chunks back to the client, and a final stop payload signals that no more tokens will be generated.
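If generators are new to you, here is a tiny self-contained example (not specific to llama.cpp) showing how yield hands values back to a caller one at a time:

def count_up_to(n):
    # A generator: each yield hands one value back to the caller,
    # then execution pauses here until the next value is requested.
    for i in range(1, n + 1):
        yield i

# The caller receives values lazily, one at a time
for value in count_up_to(3):
    print(value)  # prints 1, then 2, then 3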
When we receive these chunks, we can parse them and deliver the content back to the requestor. The requests library allows us to perform the POST request as a stream by setting the stream argument to True.
We'll now declare a Python function to encapsulate the logic. Here is a start:
base_url = "http://localhost:8000/v1"
model = "gemma3-4b-it"

def chat(message):
    payload = {
        "model": model,
        "messages": [
            { "role": "user", "content": message }
        ],
        "stream": True,
        "temperature": 1.0
    }
    ...
The function is flexible in that we can now send any message to our llama.cpp instance for inference.
Next, we tweak the way we make the POST request to llama.cpp with requests.post by setting stream=True. What we get back are multiple chunks of data within an iterator. We can continuously read that data and yield it back to the caller as a generator. We can do it like this:
def chat(message):
    payload = {
        "model": model,
        "messages": [
            { "role": "user", "content": message }
        ],
        "stream": True,
        "temperature": 1.0
    }

    with requests.post(f"{base_url}/chat/completions", json=payload, stream=True) as chat_response:
        for chunk in chat_response.iter_content(chunk_size=None):
            if chunk:
                raw_json = chunk.decode('utf-8')

                # Each chunk arrives as a server-sent event line; strip the "data: " prefix
                if raw_json.startswith("data: "):
                    raw_json = raw_json[6:]

                # llama.cpp sends a literal [DONE] to signal the end of the stream
                if raw_json.strip() == "[DONE]":
                    break

                # process chunk
                ...
chunk itself is represented as raw bytes, so we first need to decode the byte array into a UTF-8 string using decode. The decoded chunks are delivered as text that looks like this:
data: { "key1": "value1", ..., "keyN": "valueN" }
Notice the data: prefix... this is not valid JSON, so it should be stripped away. That is why we check whether raw_json starts with that specific substring and remove the leading data: if it exists, converting it to proper JSON. Once the prefix is stripped, we can check whether the chunk is the string [DONE], because that is how llama.cpp tells the client to stop processing chunks. If the chunk is indeed [DONE], we can just break out of the loop.
We can then call json.loads (from the standard library's json module, so make sure to import json) to convert the string to a Python dict object. From there we drill into the payload to get the content. The chunks have a specific schema, which you can easily inspect by printing the result of json.loads. Most of the new code is just boilerplate error checking.
Here is what the shape of a chunk can look like (shortened to just what we need):
{
  "choices": [
    {
      "finish_reason": null,
      "delta": {
        "role": "",
        "content": ""
      }
    }
  ]
}
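To make that concrete, here is a small standalone sketch of parsing one chunk. The sample line is made up for illustration, but it follows the shape shown above:

import json

# A hypothetical SSE line, shaped like the chunks llama.cpp streams back
raw_chunk = 'data: {"choices": [{"finish_reason": null, "delta": {"role": "assistant", "content": "Hel"}}]}'

# Strip the "data: " prefix, then parse the remaining JSON
raw_json = raw_chunk[6:] if raw_chunk.startswith("data: ") else raw_chunk
chunk = json.loads(raw_json)

# Drill down to the newly generated piece of text
print(chunk["choices"][0]["delta"]["content"])  # Hel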
What's most important to notice is the yield content line. This is the part where we signal back to the function caller that a new token was generated. The caller can then use this content to build its own payload, send it back to the original requestor, or do whatever else may be useful.
A caller knows to stop when the generator stops producing values, which happens when the stream ends. The end of the stream is signaled by the raw text [DONE], and our logic handles that by breaking out of the loop.
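For example, a caller that doesn't need to display tokens incrementally could simply drain the generator and join the pieces into one string (a minimal sketch, assuming the chat generator defined in this post):

# Collect every streamed token and assemble the full reply
full_reply = "".join(chat("Tell me a short joke."))
print(full_reply)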
Here is the full implementation:
import json

import requests

base_url = "http://localhost:8000/v1"
model = "gemma3-4b-it"

def chat(message):
    payload = {
        "model": model,
        "messages": [
            { "role": "user", "content": message }
        ],
        "stream": True,
        "temperature": 1.0
    }

    with requests.post(f"{base_url}/chat/completions", json=payload, stream=True) as chat_response:
        for chunk in chat_response.iter_content(chunk_size=None):
            if chunk:
                raw_json = chunk.decode('utf-8')

                # Strip the leading "data: " prefix from the server-sent event line
                if raw_json.startswith("data: "):
                    raw_json = raw_json[6:]

                # llama.cpp sends a literal [DONE] when the stream is finished
                if raw_json.strip() == "[DONE]":
                    break

                response_json = json.loads(raw_json)

                choices = response_json.get("choices", None)
                if choices is None or len(choices) == 0:
                    raise Exception("Invalid response from the server.")

                finish_reason = choices[0].get("finish_reason", None)
                if finish_reason == "stop":
                    break

                delta = choices[0].get("delta", None)
                if delta is None:
                    raise Exception("Message did not exist in the response.")

                content = delta.get("content", None)
                if content is None:
                    raise Exception("Content did not exist in the response.")

                yield content
This can now easily be used. Here's a pattern to continuously iterate over the stream, for reference:
for response in chat("hello, how are you?"):
    print(response, end="", flush=True)
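And since the introduction promised streaming in the terminal, here is one way you might wrap chat in a simple interactive loop. This is just a sketch built on the function above, not part of the original sample code:

# A minimal interactive chat loop in the terminal
while True:
    user_message = input("\nYou: ").strip()
    if user_message.lower() in ("exit", "quit"):
        break

    print("Assistant: ", end="", flush=True)
    for token in chat(user_message):
        # Print each token as it streams in, without a trailing newline
        print(token, end="", flush=True)
    print()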