Code Example
You can download the sample code from this repository: llm-laptop/llama-cpp-api at main · urbanspr1nter/llm-laptop
Introduction
In this post, I'd like to introduce a basic pattern in Python for repeatedly retrieving tokens generated by llama.cpp. This feature is very common and gives the same effect you see in popular LLM-powered chat applications like ChatGPT and Claude: after asking the LLM a question, you'll see a stream of words start to appear in the response as if it were speaking to you in real time.
We can do this ourselves with llama.cpp using streaming HTTP responses, and not just through the web, but also within the terminal!
Setup
You'll need to set up llama.cpp. You can follow the build guide to do just that: https://www.roger.lol/blog/llamacpp-build-guide-for-cpu-inferencing
Make sure to work within a virtual environment. You can always create one using python -m venv .venv, which places a virtual environment in .venv under the current working directory.
You'll only need a single dependency, requests, to get this code up and running. Everything else is just straight vanilla Python. :)
pip install requests
Inference Using the llama.cpp API
Performing inference with llama.cpp is simple: it's just a POST request to the endpoint /v1/chat/completions. Assuming you already have your server up and running, the llama.cpp server is compatible with OpenAI's REST API specification: API Reference - OpenAI API
We can start by building a simple payload:
import requests

base_url = "http://localhost:8000/v1"
model = "gemma3-4b-it"

payload = {
    "model": model,
    "messages": [
        { "role": "user", "content": "Hello, there!" }
    ],
    "stream": False
}

chat_response = requests.post(f"{base_url}/chat/completions", json=payload)
response_json = chat_response.json()

print(response_json)
This sends a simple message, Hello, there!, to the LLM within the messages array. Setting stream to False means that we'll get the entire message from the LLM in one large payload. As soon as we get that response, we can print out the response dictionary, which turns out to be:
{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': 'Hello to you too! 😊 \n\nHow can I help you today? Do you want to:\n\n* Chat about something?\n* Ask me a question?\n* Play a game?\n* Get some information?'}}], 'created': 1744481277, 'model': 'gemma3-4b-it', 'system_fingerprint': 'b4991-0bb29193', 'object': 'chat.completion', 'usage': {'completion_tokens': 50, 'prompt_tokens': 13, 'total_tokens': 63}, 'id': 'chatcmpl-w3n9tEOxkcmSwLGf93GEV4LL7O9esaCa', 'timings': {'prompt_n': 1, 'prompt_ms': 3332.753, 'prompt_per_token_ms': 3332.753, 'prompt_per_second': 0.3000522390948264, 'predicted_n': 50, 'predicted_ms': 164095.729, 'predicted_per_token_ms': 3281.9145799999997, 'predicted_per_second': 0.30470019119144776}}
Let's convert it to a formatted JSON string so we can better understand what we have to work with.
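One simple way to do that is with the standard library's json module. This little snippet just pretty-prints the response_json dictionary from the previous example:

import json

# Pretty-print the response dictionary as an indented JSON string.
# ensure_ascii=True (the default) escapes non-ASCII characters such as
# the emoji into \uXXXX sequences, which is what you'll see below.
print(json.dumps(response_json, indent=2, ensure_ascii=True))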
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello to you too! \ud83d\ude0a \n\nHow can I help you today? Do you want to:\n\n* Chat about something?\n* Ask me a question?\n* Play a game?\n* Get some information?"
      }
    }
  ],
  "created": 1744481277,
  "model": "gemma3-4b-it",
  "system_fingerprint": "b4991-0bb29193",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 50,
    "prompt_tokens": 13,
    "total_tokens": 63
  },
  "id": "chatcmpl-w3n9tEOxkcmSwLGf93GEV4LL7O9esaCa",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 3332.753,
    "prompt_per_token_ms": 3332.753,
    "prompt_per_second": 0.3000522390948264,
    "predicted_n": 50,
    "predicted_ms": 164095.729,
    "predicted_per_token_ms": 3281.9145799999997,
    "predicted_per_second": 0.30470019119144776
  }
}
There is a lot of stuff here, and it's mostly statistics. What we want is the content of the response, which lives in choices[0].message.content. That content is Hello to you too! 😊 \n\nHow can I help you today? Do you want to:\n\n* Chat about something?\n* Ask me a question?\n* Play a game?\n* Get some information? Notice that it is raw text containing \n characters. Once serialized to JSON, characters such as the emoji are escaped as Unicode sequences.
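For reference, pulling that content out of the response dictionary from the earlier snippet looks like this:

# Drill into the first choice and grab the assistant's message text
content = response_json["choices"][0]["message"]["content"]
print(content)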
Streaming Responses
So how do we stream responses? We can use Python generators to display tokens to the user as they are generated. Instead of getting one large JSON response, the server continuously streams smaller JSON chunks back to the client, and a final stop payload signals that no more tokens will be generated.
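If generators are new to you, here is a tiny self-contained example (not specific to llama.cpp) showing how yield hands values back to a caller one at a time:

def count_up_to(n):
    # A generator: each yield hands one value back to the caller,
    # then execution pauses here until the next value is requested.
    for i in range(1, n + 1):
        yield i

# The caller receives values lazily, one at a time
for value in count_up_to(3):
    print(value)  # prints 1, then 2, then 3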
When we receive these chunks, we can parse them and deliver the content back to the requestor. The requests library allows us to perform the POST request as a stream by setting the stream argument to True.
We'll now declare a Python function to encapsulate the logic. Here is a start:
base_url = "http://localhost:8000/v1"
model = "gemma3-4b-it"

def chat(message):
    payload = {
        "model": model,
        "messages": [
            { "role": "user", "content": message }
        ],
        "stream": True,
        "temperature": 1.0
    }
    ...
The function is flexible in that we can now send any message to our llama.cpp instance for inference.
Next, we tweak the way we make the POST request to llama.cpp with requests.post by setting stream=True. What we get back are multiple chunks of data within an iterator. We can continuously read that data and yield it back to the caller as a generator. We can do it like this:
def chat(message):
    payload = {
        "model": model,
        "messages": [
            { "role": "user", "content": message }
        ],
        "stream": True,
        "temperature": 1.0
    }

    with requests.post(f"{base_url}/chat/completions", json=payload, stream=True) as chat_response:
        for chunk in chat_response.iter_content(chunk_size=None):
            if chunk:
                raw_json = chunk.decode('utf-8')

                # Each chunk arrives as a server-sent event line; strip the "data: " prefix
                if raw_json.startswith("data: "):
                    raw_json = raw_json[6:]

                # llama.cpp sends a literal [DONE] to signal the end of the stream
                if raw_json.strip() == "[DONE]":
                    break

                # process chunk
                ...
chunk itself is represented as raw bytes, so we first need to decode the byte array into a UTF-8 string using decode. The decoded chunks are delivered as text that looks like this:
data: { "key1": "value1", ..., "keyN": "valueN" }
Notice the data: prefix... this is not valid JSON, so it should be stripped away. That is why we check whether raw_json starts with that specific substring and remove the leading data: if it exists, converting it to proper JSON. Once the prefix is stripped, we can check whether the chunk is the string [DONE], because that is how llama.cpp tells the client to stop processing chunks. If the chunk is indeed [DONE], we can just break out of the loop.
We can then call json.loads (from the standard library's json module, so make sure to import json) to convert the string to a Python dict object. From there we drill into the payload to get the content. The chunks have a specific schema, which you can easily inspect by printing the result of json.loads. Most of the new code is just boilerplate error checking.
Here is what the shape of a chunk can look like (shortened to just what we need):
{
  "choices": [
    {
      "finish_reason": null,
      "delta": {
        "role": "",
        "content": ""
      }
    }
  ]
}
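To make that concrete, here is a small standalone sketch of parsing one chunk. The sample line is made up for illustration, but it follows the shape shown above:

import json

# A hypothetical SSE line, shaped like the chunks llama.cpp streams back
raw_chunk = 'data: {"choices": [{"finish_reason": null, "delta": {"role": "assistant", "content": "Hel"}}]}'

# Strip the "data: " prefix, then parse the remaining JSON
raw_json = raw_chunk[6:] if raw_chunk.startswith("data: ") else raw_chunk
chunk = json.loads(raw_json)

# Drill down to the newly generated piece of text
print(chunk["choices"][0]["delta"]["content"])  # Hel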
What's most important to notice is the yield content line. This is the part where we signal back to the function caller that a new token was generated. The caller can then use this content to build its own payload, send it back to the original requestor, or do whatever else may be useful.
A caller knows to stop when the generator stops producing values, which happens when the stream ends. The end of the stream is signaled by the raw text [DONE], and our logic handles that by breaking out of the loop.
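For example, a caller that doesn't need to display tokens incrementally could simply drain the generator and join the pieces into one string (a minimal sketch, assuming the chat generator defined in this post):

# Collect every streamed token and assemble the full reply
full_reply = "".join(chat("Tell me a short joke."))
print(full_reply)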
Here is the full implementation:
import json

import requests

base_url = "http://localhost:8000/v1"
model = "gemma3-4b-it"

def chat(message):
    payload = {
        "model": model,
        "messages": [
            { "role": "user", "content": message }
        ],
        "stream": True,
        "temperature": 1.0
    }

    with requests.post(f"{base_url}/chat/completions", json=payload, stream=True) as chat_response:
        for chunk in chat_response.iter_content(chunk_size=None):
            if chunk:
                raw_json = chunk.decode('utf-8')

                # Strip the leading "data: " prefix from the server-sent event line
                if raw_json.startswith("data: "):
                    raw_json = raw_json[6:]

                # llama.cpp sends a literal [DONE] when the stream is finished
                if raw_json.strip() == "[DONE]":
                    break

                response_json = json.loads(raw_json)

                choices = response_json.get("choices", None)
                if choices is None or len(choices) == 0:
                    raise Exception("Invalid response from the server.")

                finish_reason = choices[0].get("finish_reason", None)
                if finish_reason == "stop":
                    break

                delta = choices[0].get("delta", None)
                if delta is None:
                    raise Exception("Message did not exist in the response.")

                content = delta.get("content", None)
                if content is None:
                    raise Exception("Content did not exist in the response.")

                yield content
This can now easily be used. Here's a pattern to continuously iterate over the stream, for reference:
for response in chat("hello, how are you?"):
    print(response, end="", flush=True)
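And since the introduction promised streaming in the terminal, here is one way you might wrap chat in a simple interactive loop. This is just a sketch built on the function above, not part of the original sample code:

# A minimal interactive chat loop in the terminal
while True:
    user_message = input("\nYou: ").strip()
    if user_message.lower() in ("exit", "quit"):
        break

    print("Assistant: ", end="", flush=True)
    for token in chat(user_message):
        # Print each token as it streams in, without a trailing newline
        print(token, end="", flush=True)
    print()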