@qnixsynapse
Collaborator

@qnixsynapse qnixsynapse commented Dec 16, 2025

Summary

This PR proposes adding an optional, explicitly gated HTTP endpoint (POST /exit) to llama-server that allows a client application to request a graceful server shutdown when traditional OS-level signals (e.g. SIGTERM, SIGINT) are unavailable or unreliable (for example, on Windows).

The endpoint is disabled by default and can only be enabled via an explicit command-line flag or environment variable.

Motivation

Many llama-server deployments are no longer run as simple foreground processes where POSIX signals are always available.

In these environments, the client cannot reliably send SIGTERM or SIGINT.

As a result, client applications currently resort to hard process termination.

An application-level shutdown mechanism provides a portable, explicit, and graceful alternative: the client asks the server to clean up and shut down.

Proposed Solution

Introduce a POST-only endpoint:

POST /exit

Behavior

  • When enabled, the endpoint:
    • Validates an explicit confirmation token in the request body
    • Returns a success response immediately
    • Initiates a graceful shutdown after the response is sent
  • When disabled (default):
    • Requests to /exit return an error indicating the endpoint is not supported

Example Request

POST /exit
Content-Type: application/json

{
  "confirm": "shutdown"
}

Example Response:

{
  "message": "Server shutdown initiated",
  "status": "terminating"
}
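
The behavior described above can be sketched as a small decision function. This is only an illustration, not the code in this PR: `exit_result` and `decide_exit` are hypothetical names, and the 501 and 400 status codes and the error bodies are assumptions. The caller is expected to send the response first and start the shutdown only afterwards.

```cpp
#include <string>

// Hypothetical result type for this sketch; not taken from the PR.
struct exit_result {
    int         http_status;     // status code to send back
    std::string body;            // JSON body to send back
    bool        start_shutdown;  // true if shutdown should begin after the response is sent
};

// Decide how to answer POST /exit.
// `endpoint_enabled` mirrors the --endpoint-exit setting; `confirm` is the
// value of the "confirm" field already extracted from the request body.
// The 501 and 400 status codes and the error bodies are assumptions.
inline exit_result decide_exit(bool endpoint_enabled, const std::string & confirm) {
    if (!endpoint_enabled) {
        return {501, R"({"error":"endpoint not supported"})", false};
    }
    if (confirm != "shutdown") {
        return {400, R"({"error":"invalid confirmation token"})", false};
    }
    return {200, R"({"message":"Server shutdown initiated","status":"terminating"})", true};
}
```

Keeping the decision separate from the HTTP layer means the gating and token check can be tested without starting a server.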

Configuration & Safety Guarantees

Disabled by Default

The endpoint is off by default.
It must be explicitly enabled using either:

  • A CLI flag:
    --endpoint-exit
  • Or an environment variable:
    LLAMA_ARG_ENDPOINT_EXIT=1
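
A minimal sketch of how the two switches could be combined, assuming the flag takes no value and that only the value "1" enables the endpoint through the environment. `exit_endpoint_enabled` is a hypothetical helper for illustration; the PR itself wires this through the existing argument parser and common_params.

```cpp
#include <cstring>

// Returns true if the exit endpoint should be enabled.
// `env_value` is the value of LLAMA_ARG_ENDPOINT_EXIT, or nullptr if unset;
// treating only "1" as "on" is an assumption of this sketch.
inline bool exit_endpoint_enabled(int argc, const char * const * argv, const char * env_value) {
    // Skip argv[0], which is the program name.
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--endpoint-exit") == 0) {
            return true;
        }
    }
    // No flag given: fall back to the environment variable.
    return env_value != nullptr && std::strcmp(env_value, "1") == 0;
}
```

From main(), this would be called as exit_endpoint_enabled(argc, argv, std::getenv("LLAMA_ARG_ENDPOINT_EXIT")), which needs <cstdlib>. With neither switch set, the result is false, matching the disabled-by-default guarantee.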

I will update the docs after getting input and feedback, before merging.
Please note: This is definitely not intended for public servers!

Current issues

  • shutdown_cb will invoke termination of all HTTP threads, causing a deadlock in some cases
  • A better API design might disable the endpoint completely in normal mode instead of returning an error that it is unavailable(?)

- Introduce --endpoint-exit flag and LLAMA_ARG_ENDPOINT_EXIT env var
- Add endpoint_exit to common_params (disabled by default)
- Implement POST /exit with explicit confirmation token to prevent misuse
- Support graceful shutdown via injected on_shutdown callback
- Handle both router and non-router server shutdown paths
@ngxson
Collaborator

ngxson commented Dec 16, 2025

Any reason why you cannot use the /models/unload endpoint of router mode?

The router mode is designed to run as a daemon. I recommend doing that instead of adding an /exit endpoint.

Collaborator

@ngxson ngxson left a comment

traditional OS-level signals (e.g. SIGTERM, SIGINT) are unavailable or unreliable (for example, on Windows)

first off, if you cannot send an OS-level signal to an application, the problem lies with the way you spawn and manage the process, not with the application itself.

for the same reason, Windows has Windows services and Linux usually has something like systemd to spawn and manage daemon processes. such a process never has to expose a shutdown mechanism to user space; instead, the user sends a request to the manager (systemctl for example), and the manager shuts down the process.

if there is a problem with this mechanism on Windows, we should fix it, not circumvent it by introducing yet another mechanism (e.g. an /exit endpoint)

therefore, I am against this proposal, as it seems like an anti-pattern / misuse in terms of system design

const json body = json::parse(req.body);
const std::string confirm = json_value(body, "confirm", std::string());

if (confirm != "shutdown") {
Collaborator

this is not a good way to design an API either. I don't get how a confirmation can prevent someone from accidentally exiting the server if the request is sent by a program and not a human

we are designing an Application Programming Interface here, not a Human Interface

Collaborator Author

I don't get how a confirmation can prevent someone from accidentally exiting the server if the request is sent by a program and not a human

This endpoint is disabled by default and needs a flag or an env variable to be enabled, so I don't think there is much room for accidents. But I agree that the API design could be better. This is just a working PoC RFC PR.

std::this_thread::sleep_for(std::chrono::milliseconds(100));
SRV_INF("%s: executing on_shutdown callback...\n", __func__);
try {
shutdown_cb();
Collaborator

@ngxson ngxson Dec 16, 2025

this will definitely cause deadlocks in some cases. an HTTP thread should never kill itself (shutdown_cb will invoke termination of all HTTP threads)

Collaborator Author

I tried to use ctx_server.queue_tasks.terminate() in the hope that control would return to main(), where shutdown and cleanup are called, but I think that only works in non-router mode, since it doesn't shut down the HTTP server thread. Please correct me if I'm wrong.
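
For reference, one common way to avoid an HTTP thread terminating itself is to have the handler only record the request and let the main thread perform the actual shutdown. The sketch below illustrates that general pattern under stated assumptions: `shutdown_latch` is a hypothetical helper, not something in this PR or in llama-server, and it does not address how router mode manages its child processes.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical helper: lets any thread ask for shutdown without performing it.
class shutdown_latch {
public:
    // Called from an HTTP handler thread: records the request and wakes the waiter.
    // It never stops or joins any thread itself.
    void request() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            requested_ = true;
        }
        cv_.notify_all();
    }

    // Called from the main thread: blocks until some thread has called request().
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return requested_; });
    }

    // Thread-safe check of whether shutdown has been requested.
    bool requested() {
        std::lock_guard<std::mutex> lock(mutex_);
        return requested_;
    }

private:
    std::mutex              mutex_;
    std::condition_variable cv_;
    bool                    requested_ = false;
};
```

With this shape, the /exit handler would call request() and return its response normally, while main() sits in wait() and, once woken, stops the HTTP server and joins its threads from outside them.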

@qnixsynapse
Collaborator Author

qnixsynapse commented Dec 16, 2025

Any reason why you cannot use the /models/unload endpoint of router mode?

The router mode is designed to run as a daemon. I recommend doing that instead of adding an /exit endpoint.

That only unloads the model; it does not terminate the server process, especially in router mode.

@ngxson
Collaborator

ngxson commented Dec 16, 2025

In router mode, unload model == terminate child process holding the model

@qnixsynapse
Collaborator Author

In router mode, unload model == terminate child process holding the model

This PR is about letting a client gracefully terminate the main router process, since the client manages it (both launching and termination).

@qnixsynapse qnixsynapse marked this pull request as draft December 16, 2025 14:32