Merged
#### `CODEOWNERS` (3 changes: 2 additions & 1 deletion)
```diff
@@ -87,7 +87,8 @@
 /tests/ @ggerganov
 /tests/test-chat-.* @pwilkin
 /tools/batched-bench/ @ggerganov
-/tools/main/ @ggerganov
+/tools/cli/ @ngxson
```
> **Collaborator:** @ggerganov should you be added as codeowner of CLI, or is it fine as-is here?

> **Collaborator:** I think I'll merge this PR as-is and decide this later if needed.

```diff
+/tools/completion/ @ggerganov
 /tools/mtmd/ @ngxson
 /tools/perplexity/ @ggerganov
 /tools/quantize/ @ggerganov
```
#### `README.md` (5 changes: 3 additions & 2 deletions)
```diff
@@ -313,7 +313,7 @@ The Hugging Face platform provides a variety of online tools for converting, qua

 To learn more about model quantization, [read this documentation](tools/quantize/README.md)

-## [`llama-cli`](tools/main)
+## [`llama-cli`](tools/cli)

 #### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.

```
```diff
@@ -525,7 +525,8 @@ To learn more about model quantization, [read this documentation](tools/quantize
 ## Other documentation
-- [main (cli)](tools/main/README.md)
+- [cli](tools/cli/README.md)
+- [completion](tools/completion/README.md)
 - [server](tools/server/README.md)
 - [GBNF grammars](grammars/README.md)
```
#### `docs/development/HOWTO-add-model.md` (3 changes: 2 additions & 1 deletion)
```diff
@@ -9,7 +9,8 @@ Adding a model requires few steps:
 After following these steps, you can open PR.

 Also, it is important to check that the examples and main ggml backends (CUDA, METAL, CPU) are working with the new architecture, especially:
-- [main](/tools/main/)
+- [cli](/tools/cli/)
+- [completion](/tools/completion/)
 - [imatrix](/tools/imatrix/)
 - [quantize](/tools/quantize/)
 - [server](/tools/server/)
```
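For context, such a check might look like the following minimal sketch (not part of this PR; the model file names and dataset path are hypothetical placeholders):

```bash
# Sketch: smoke-test a newly added architecture across the key tools.
./llama-cli -m models/new-arch.gguf -p "Hello" -no-cnv
./llama-quantize models/new-arch-f16.gguf models/new-arch-Q4_K_M.gguf Q4_K_M
./llama-perplexity -m models/new-arch.gguf -f wikitext-2-raw/wiki.test.raw
./llama-server -m models/new-arch.gguf
```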
#### `grammars/README.md` (6 changes: 3 additions & 3 deletions)
```diff
@@ -1,6 +1,6 @@
 # GBNF Guide

-GBNF (GGML BNF) is a format for defining [formal grammars](https://en.wikipedia.org/wiki/Formal_grammar) to constrain model outputs in `llama.cpp`. For example, you can use it to force the model to generate valid JSON, or speak only in emojis. GBNF grammars are supported in various ways in `tools/main` and `tools/server`.
+GBNF (GGML BNF) is a format for defining [formal grammars](https://en.wikipedia.org/wiki/Formal_grammar) to constrain model outputs in `llama.cpp`. For example, you can use it to force the model to generate valid JSON, or speak only in emojis. GBNF grammars are supported in various ways in `tools/cli`, `tools/completion` and `tools/server`.

 ## Background

```
```diff
@@ -135,7 +135,7 @@ While semantically correct, the syntax `x? x? x?.... x?` (with N repetitions) ma
 You can use GBNF grammars:

 - In [llama-server](../tools/server)'s completion endpoints, passed as the `grammar` body field
-- In [llama-cli](../tools/main), passed as the `--grammar` & `--grammar-file` flags
+- In [llama-cli](../tools/cli) and [llama-completion](../tools/completion), passed as the `--grammar` & `--grammar-file` flags
 - With [test-gbnf-validator](../tests/test-gbnf-validator.cpp), to test them against strings.

 ## JSON Schemas → GBNF
```
```diff
@@ -145,7 +145,7 @@ You can use GBNF grammars:
 - In [llama-server](../tools/server):
   - For any completion endpoints, passed as the `json_schema` body field
   - For the `/chat/completions` endpoint, passed inside the `response_format` body field (e.g. `{"type", "json_object", "schema": {"items": {}}}` or `{ type: "json_schema", json_schema: {"schema": ...} }`)
-- In [llama-cli](../tools/main), passed as the `--json` / `-j` flag
+- In [llama-cli](../tools/cli) and [llama-completion](../tools/completion), passed as the `--json` / `-j` flag
 - To convert to a grammar ahead of time:
   - in CLI, with [examples/json_schema_to_grammar.py](../examples/json_schema_to_grammar.py)
   - in JavaScript with [json-schema-to-grammar.mjs](../tools/server/public_legacy/json-schema-to-grammar.mjs) (this is used by the [server](../tools/server)'s Web UI)
```
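For readers of this diff, the constrained-output flow these docs describe can be sketched as follows. This is an illustration, not part of the PR: the model path and grammar file name are placeholders, and the server is assumed to be running on its default port 8080:

```bash
# Sketch: constrain generation to "yes" or "no" via a GBNF grammar.
cat > yesno.gbnf <<'EOF'
root ::= "yes" | "no"
EOF

# CLI: pass the grammar file with --grammar-file.
./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf --grammar-file yesno.gbnf \
  -p "Is the sky blue? Answer: " -no-cnv

# Server: the same grammar goes in the `grammar` body field of a completion request.
curl http://localhost:8080/completion \
  -d '{"prompt": "Is the sky blue? Answer: ", "grammar": "root ::= \"yes\" | \"no\"", "n_predict": 4}'
```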
#### `tools/cli/README.md` (1 change: 1 addition & 0 deletions)
```diff
@@ -0,0 +1 @@
+TODO
```
#### `tools/completion/README.md` (32 changes: 16 additions & 16 deletions)
```diff
@@ -1,4 +1,4 @@
-# llama.cpp/tools/main
+# llama.cpp/tools/completion

 This example program allows you to use various LLaMA language models easily and efficiently. It is specifically designed to work with the [llama.cpp](https://github.com/ggml-org/llama.cpp) project, which provides a plain C/C++ implementation with optional 4-bit quantization support for faster, lower memory inference, and is optimized for desktop CPUs. This program can be used to perform various inference tasks with LLaMA models, including generating text based on user-provided prompts and chat-like interactions with reverse prompts.

```
````diff
@@ -27,64 +27,64 @@ Once downloaded, place your model in the models folder in llama.cpp.
 ##### Input prompt (One-and-done)

 ```bash
-./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf -no-cnv --prompt "Once upon a time"
+./llama-completion -m models/gemma-1.1-7b-it.Q4_K_M.gguf -no-cnv --prompt "Once upon a time"
 ```
 ##### Conversation mode (Allow for continuous interaction with the model)

 ```bash
-./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf --chat-template gemma
+./llama-completion -m models/gemma-1.1-7b-it.Q4_K_M.gguf --chat-template gemma
 ```

 ##### Conversation mode using built-in jinja chat template

 ```bash
-./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf --jinja
+./llama-completion -m models/gemma-1.1-7b-it.Q4_K_M.gguf --jinja
 ```

 ##### One-and-done query using jinja with custom system prompt and a starting prompt

 ```bash
-./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf --jinja --single-turn -sys "You are a helpful assistant" -p "Hello"
+./llama-completion -m models/gemma-1.1-7b-it.Q4_K_M.gguf --jinja --single-turn -sys "You are a helpful assistant" -p "Hello"
 ```

 ##### Infinite text from a starting prompt (you can use `Ctrl-C` to stop it):
 ```bash
-./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
+./llama-completion -m models/gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
 ```

 ### Windows:

 ##### Input prompt (One-and-done)
 ```powershell
-./llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf -no-cnv --prompt "Once upon a time"
+./llama-completion.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf -no-cnv --prompt "Once upon a time"
 ```
 ##### Conversation mode (Allow for continuous interaction with the model)

 ```powershell
-./llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --chat-template gemma
+./llama-completion.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --chat-template gemma
 ```

 ##### Conversation mode using built-in jinja chat template

 ```powershell
-./llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --jinja
+./llama-completion.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --jinja
 ```

 ##### One-and-done query using jinja with custom system prompt and a starting prompt

 ```powershell
-./llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --jinja --single-turn -sys "You are a helpful assistant" -p "Hello"
+./llama-completion.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --jinja --single-turn -sys "You are a helpful assistant" -p "Hello"
 ```

 #### Infinite text from a starting prompt (you can use `Ctrl-C` to stop it):

 ```powershell
-llama-cli.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
+llama-completion.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
 ```

 ## Common Options

-In this section, we cover the most commonly used options for running the `llama-cli` program with the LLaMA models:
+In this section, we cover the most commonly used options for running the `llama-completion` program with the LLaMA models:

 - `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/gemma-1.1-7b-it.Q4_K_M.gguf`; inferred from `--model-url` if set).
 - `-mu MODEL_URL --model-url MODEL_URL`: Specify a remote http url to download the file (e.g [https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true](https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true)).
````
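As a quick illustration of the `--model-url` option above, a minimal sketch (not part of the diff; the URL is the one quoted in the option list):

```bash
# Sketch: fetch the model over HTTP on first run instead of pointing -m at a
# local file; later runs reuse the downloaded copy.
./llama-completion -mu https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf \
  -p "Once upon a time" -no-cnv
```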
```diff
@@ -97,7 +97,7 @@ In this section, we cover the most commonly used options for running the `llama-

 ## Input Prompts

-The `llama-cli` program provides several ways to interact with the LLaMA models using input prompts:
+The `llama-completion` program provides several ways to interact with the LLaMA models using input prompts:

 - `--prompt PROMPT`: Provide a prompt directly as a command-line option.
 - `--file FNAME`: Provide a file containing a prompt or multiple prompts.
```
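A minimal sketch of the `--file` route described above (the file name is a hypothetical placeholder, not part of the diff):

```bash
# Sketch: read the prompt from a file rather than the command line.
echo "Once upon a time" > prompt.txt
./llama-completion -m models/gemma-1.1-7b-it.Q4_K_M.gguf -f prompt.txt -no-cnv
```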
```diff
@@ -107,7 +107,7 @@ The `llama-cli` program provides several ways to interact with the LLaMA models

 ## Interaction

-The `llama-cli` program offers a seamless way to interact with LLaMA models, allowing users to engage in real-time conversations or provide instructions for specific tasks. The interactive mode can be triggered using various options, including `--interactive` and `--interactive-first`.
+The `llama-completion` program offers a seamless way to interact with LLaMA models, allowing users to engage in real-time conversations or provide instructions for specific tasks. The interactive mode can be triggered using various options, including `--interactive` and `--interactive-first`.

 In interactive mode, users can participate in text generation by injecting their input during the process. Users can press `Ctrl+C` at any time to interject and type their input, followed by pressing `Return` to submit it to the LLaMA model. To submit additional lines without finalizing input, users can end the current line with a backslash (`\`) and continue typing.

```
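A hedged sketch of triggering the interactive mode described above (the model path is a placeholder; not part of the diff):

```bash
# Sketch: hand the first turn to the user, then return control whenever the
# model emits the reverse prompt "User:".
./llama-completion -m models/gemma-1.1-7b-it.Q4_K_M.gguf --interactive-first -r "User:"
```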
````diff
@@ -136,15 +136,15 @@ To overcome this limitation, you can use the `--in-prefix` flag to add a space o
 The `--in-prefix` flag is used to add a prefix to your input, primarily, this is used to insert a space after the reverse prompt. Here's an example of how to use the `--in-prefix` flag in conjunction with the `--reverse-prompt` flag:

 ```sh
-./llama-cli -r "User:" --in-prefix " "
+./llama-completion -r "User:" --in-prefix " "
 ```

 ### In-Suffix

 The `--in-suffix` flag is used to add a suffix after your input. This is useful for adding an "Assistant:" prompt after the user's input. It's added after the new-line character (`\n`) that's automatically added to the end of the user's input. Here's an example of how to use the `--in-suffix` flag in conjunction with the `--reverse-prompt` flag:

 ```sh
-./llama-cli -r "User:" --in-prefix " " --in-suffix "Assistant:"
+./llama-completion -r "User:" --in-prefix " " --in-suffix "Assistant:"
 ```
 When --in-prefix or --in-suffix options are enabled the chat template ( --chat-template ) is disabled

````
#### `tools/llama-bench/README.md` (2 changes: 1 addition & 1 deletion)
```diff
@@ -80,7 +80,7 @@ Each test is repeated the number of times given by `-r`, and the results are ave

 Using the `-d <n>` option, each test can be run at a specified context depth, prefilling the KV cache with `<n>` tokens.

-For a description of the other options, see the [main example](../main/README.md).
+For a description of the other options, see the [completion example](../completion/README.md).

 > [!NOTE]
 > The measurements with `llama-bench` do not include the times for tokenization and for sampling.
```
Expand Down