185 changes: 124 additions & 61 deletions docs/toolhive/concepts/observability.mdx
@@ -31,56 +31,75 @@ Here's what a trace looks like when a client calls a tool in the GitHub MCP
server (some fields omitted for brevity):

```text
Span: mcp.tools/call (150ms)
├── service.name: toolhive-mcp-proxy
Span: tools/call create_issue (150ms)
├── service.name: thv-github
├── service.version: v0.1.9
├── http.duration_ms: 150.3
├── http.host: localhost:14972
├── http.method: POST
├── http.request_content_length: 256
├── http.response_content_length: 1024
├── http.status_code: 202
├── http.url: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
├── http.user_agent: claude-code/1.0.53
├── mcp.method: tools/call
├── mcp.request.id: 5
├── http.request.method: POST
├── http.request.body.size: 256
├── http.response.status_code: 202
├── http.response.body.size: 1024
├── url.full: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
├── url.path: /messages
├── url.scheme: https
├── server.address: localhost:14972
├── user_agent.original: claude-code/1.0.53
├── mcp.method.name: tools/call
├── mcp.server.name: github
├── mcp.tool.arguments: owner=stacklok, repo=toolhive, pullNumber=1131
├── mcp.tool.name: create_issue
└── mcp.transport: stdio
├── mcp.session.id: abc123
├── rpc.system.name: jsonrpc
├── jsonrpc.protocol.version: 2.0
├── jsonrpc.request.id: 5
├── gen_ai.tool.name: create_issue
├── gen_ai.operation.name: execute_tool
├── gen_ai.tool.call.arguments: owner=stacklok, repo=toolhive, pullNumber=1131
├── network.transport: tcp
└── network.protocol.name: http
```

### MCP-specific traces

ToolHive automatically captures traces for all MCP operations, including:

- **Tool calls** (`mcp.tools/call`) - When AI assistants use tools
- **Resource access** (`mcp.resources/read`) - When servers read files or data
- **Prompt operations** (`mcp.prompts/get`) - When servers retrieve prompts
- **Connection events** (`mcp.initialize`) - When clients connect to servers
- **Tool calls** (`tools/call`) - When AI assistants use tools
- **Resource access** (`resources/read`) - When servers read files or data
- **Prompt operations** (`prompts/get`) - When servers retrieve prompts
- **Connection events** (`initialize`) - When clients connect to servers

### Trace attributes

Each trace includes detailed context across three layers:
Each trace includes detailed context across several layers:

#### Service information

```text
service.name: toolhive-mcp-proxy
service.name: thv-github
service.version: v0.1.9
host.name: my-machine
```

#### HTTP layer information
#### HTTP layer

```text
http.duration_ms: 150.3
http.host: localhost:14972
http.method: POST
http.request_content_length: 256
http.response_content_length: 1024
http.status_code: 202
http.url: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
http.user_agent: claude-code/1.0.53
http.request.method: POST
http.request.body.size: 256
http.response.status_code: 202
http.response.body.size: 1024
url.full: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
url.path: /messages
url.scheme: https
url.query: session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
server.address: localhost:14972
user_agent.original: claude-code/1.0.53
```

#### Network layer

```text
network.transport: tcp
network.protocol.name: http
network.protocol.version: 1.1
client.address: 127.0.0.1
client.port: 52431
```

#### MCP protocol details
@@ -89,31 +108,41 @@ Details about the MCP operation being performed (some fields are specific to
each operation):

```text
mcp.client.name: claude-code
mcp.method: tools/call
mcp.request.id: 123
mcp.method.name: tools/call
mcp.server.name: github
mcp.tool.arguments: owner=stacklok, repo=toolhive, path=pkg/telemetry/middleware.go, start_index=130, max_length=1000
mcp.tool.name: get_file_contents
mcp.transport: stdio
rpc.service: mcp
rpc.system: jsonrpc
mcp.session.id: abc123
mcp.protocol.version: 2025-03-26
mcp.is_batch: false
rpc.system.name: jsonrpc
jsonrpc.protocol.version: 2.0
jsonrpc.request.id: 123
```

#### Method-specific attributes

- **`mcp.tools/call`** traces include:
- `mcp.tool.name` - The name of the tool being called
- `mcp.tool.arguments` - Sanitized tool arguments (sensitive values redacted)
- **`tools/call`** traces include:
- `gen_ai.tool.name` - The name of the tool being called
- `gen_ai.operation.name` - Set to `execute_tool`
- `gen_ai.tool.call.arguments` - Sanitized tool arguments (sensitive values
redacted)

- **`mcp.resources/read`** traces include:
- **`resources/read`** traces include:
- `mcp.resource.uri` - The URI of the resource being accessed

- **`mcp.prompts/get`** traces include:
- `mcp.prompt.name` - The name of the prompt being retrieved
- **`prompts/get`** traces include:
- `gen_ai.prompt.name` - The name of the prompt being retrieved

- **`initialize`** traces include:
- `mcp.protocol.version` - The MCP protocol version negotiated

- **`mcp.initialize`** traces include:
- `mcp.client.name` - The name of the connecting client
:::note[Legacy attribute names]

By default, ToolHive emits both the new OpenTelemetry semantic convention
attribute names shown above and legacy attribute names (e.g., `http.method`,
`mcp.method`, `mcp.tool.name`) for backward compatibility with existing
dashboards. You can control this with the `--otel-use-legacy-attributes` flag.

:::
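
The renames in this hunk can be summarized as a lookup table. The sketch below is only inferred from the before/after attribute pairs visible in this diff; it is not an authoritative list from ToolHive, and `LEGACY_TO_SEMCONV` and `to_semconv` are hypothetical names. It may help when migrating dashboard queries from legacy names to the new ones.

```python
# Legacy -> new attribute names, inferred from the before/after pairs in this
# diff. This is an illustrative mapping, not an official ToolHive list.
LEGACY_TO_SEMCONV = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.request_content_length": "http.request.body.size",
    "http.response_content_length": "http.response.body.size",
    "http.url": "url.full",
    "http.host": "server.address",
    "http.user_agent": "user_agent.original",
    "mcp.method": "mcp.method.name",
    "mcp.request.id": "jsonrpc.request.id",
    "mcp.tool.name": "gen_ai.tool.name",
    "mcp.tool.arguments": "gen_ai.tool.call.arguments",
    "mcp.prompt.name": "gen_ai.prompt.name",
    "rpc.system": "rpc.system.name",
}


def to_semconv(attributes):
    """Return a copy with legacy keys renamed; new-style keys win on conflict."""
    result = {}
    for key, value in attributes.items():
        new_key = LEGACY_TO_SEMCONV.get(key, key)
        # Skip a legacy key if its new-style equivalent was already copied.
        if new_key != key and new_key in result:
            continue
        result[new_key] = value
    return result
```

Keys that are not in the table pass through unchanged, so the helper is safe to run on spans that already use only the new names.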

## Metrics collection

@@ -171,6 +200,40 @@ toolhive_mcp_tool_calls_total{server="github",status="success",tool="list_pull_r
toolhive_mcp_tool_calls_total{server="github",status="success",tool="search_issues"} 2
```

### MCP semantic convention metrics

In addition to the ToolHive-prefixed metrics above, ToolHive emits metrics that
follow the
[OpenTelemetry MCP semantic conventions](https://github.com/open-telemetry/semantic-conventions):

| Metric | Type | Description |
| ------------------------------- | --------- | ---------------------------------------- |
| `mcp.server.operation.duration` | Histogram | Duration of MCP server operations |
| `mcp.client.operation.duration` | Histogram | Duration of MCP client operations (vMCP) |

These metrics use the same labels as the ToolHive-prefixed metrics and are
compatible with dashboards built for the OpenTelemetry MCP semantic conventions.

### vMCP metrics

When using Virtual MCP Server (vMCP), additional metrics are available for
monitoring backend operations, workflow executions, and optimizer performance.
For details, see the
[vMCP telemetry guide](../guides-vmcp/telemetry-and-metrics.mdx).

## Trace context propagation

ToolHive supports two methods of trace context propagation:

- **HTTP headers**: Standard W3C Trace Context (`traceparent` and `tracestate`
headers) and W3C Baggage propagation
- **MCP `_meta` field**: Trace context embedded in MCP request parameters via
the `params._meta` field, following the MCP specification

When both are present, the MCP `_meta` trace context takes priority. This
enables proper trace correlation across MCP server boundaries, even when MCP
clients inject trace context into the request payload rather than HTTP headers.
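
The priority rule described above can be sketched as follows. This is a minimal illustration, not ToolHive's implementation: it assumes the `_meta` field carries the context under a `traceparent` key, that header names are already lowercased, and that an invalid `_meta` value falls back to the HTTP header. `pick_traceparent` is a hypothetical helper name.

```python
import re

# W3C Trace Context "traceparent": version-traceid-parentid-flags, lowercase hex.
# This check is simplified; it does not reject all-zero trace or parent IDs.
_TRACEPARENT = re.compile(r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


def pick_traceparent(headers, request):
    """Choose the trace context for a request, preferring MCP `_meta`.

    `headers` is a dict of lowercased HTTP header names; `request` is the
    decoded JSON-RPC request body. Both shapes are assumptions of this sketch.
    """
    meta = (request.get("params") or {}).get("_meta") or {}
    # Order encodes the priority: params._meta first, then the HTTP header.
    for candidate in (meta.get("traceparent"), headers.get("traceparent")):
        if isinstance(candidate, str) and _TRACEPARENT.match(candidate):
            return candidate
    return None
```

Returning `None` leaves it to the caller to start a new root trace when neither source carries a usable context.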

## Export options

ToolHive supports multiple export formats to integrate with your existing
@@ -217,13 +280,13 @@ ToolHive automatically protects sensitive information in traces:
For example, a tool call with sensitive arguments:

```text
mcp.tool.arguments: password=secret123, api_key=abc456, title=Bug report
gen_ai.tool.call.arguments: password=secret123, api_key=abc456, title=Bug report
```

Is sanitized in the trace as:
ToolHive sanitizes this in the trace as:

```text
mcp.tool.arguments: password=[REDACTED], api_key=[REDACTED], title=Bug report
gen_ai.tool.call.arguments: password=[REDACTED], api_key=[REDACTED], title=Bug report
```
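
A rough sketch of this kind of key-based redaction is shown below. The marker list and the function name `sanitize_arguments` are hypothetical; the rules ToolHive actually uses to decide which arguments are sensitive are not shown in this diff.

```python
# Hypothetical substrings that mark an argument name as sensitive.
SENSITIVE_MARKERS = ("password", "secret", "token", "api_key")


def sanitize_arguments(arguments):
    """Render tool arguments as `key=value` pairs, redacting sensitive values."""
    parts = []
    for key, value in arguments.items():
        if any(marker in key.lower() for marker in SENSITIVE_MARKERS):
            value = "[REDACTED]"
        parts.append(f"{key}={value}")
    return ", ".join(parts)
```

Matching on the argument name rather than the value keeps the check cheap, at the cost of missing secrets passed under innocuous names.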

## Monitoring examples
@@ -254,15 +317,15 @@ When a client calls the `create_issue` tool:
**Generated trace**:

```text
Span: mcp.tools/call
├── mcp.method: tools/call
├── mcp.request.id: req_456
├── mcp.tool.name: create_issue
├── mcp.tool.arguments: title=Bug report, body=Found an issue with the API
Span: tools/call create_issue
├── mcp.method.name: tools/call
├── jsonrpc.request.id: req_456
├── gen_ai.tool.name: create_issue
├── gen_ai.tool.call.arguments: title=Bug report, body=Found an issue with...
├── mcp.server.name: github
├── mcp.transport: sse
├── http.method: POST
├── http.status_code: 200
├── network.transport: tcp
├── http.request.method: POST
├── http.response.status_code: 200
└── duration: 850ms
```

@@ -281,10 +344,10 @@ Failed requests generate error traces and metrics:
**Error trace**:

```text
Span: mcp.tools/call
├── mcp.method: tools/call
├── mcp.tool.name: invalid_tool
├── http.status_code: 400
Span: tools/call invalid_tool
├── mcp.method.name: tools/call
├── gen_ai.tool.name: invalid_tool
├── http.response.status_code: 400
├── span.status: ERROR
├── span.status_message: Tool not found
└── duration: 12ms
32 changes: 17 additions & 15 deletions docs/toolhive/guides-cli/telemetry-and-metrics.mdx
@@ -50,9 +50,9 @@ scheme or path (e.g., use `api.honeycomb.io` or `api.honeycomb.io:443`, not

:::

By default, the service name is set to `toolhive-mcp-proxy`, and the sampling
rate is `0.1` (10%). You can customize these settings with additional
[configuration options](#configuration-options).
By default, the service name is set to `thv-<server-name>` (e.g., `thv-fetch`),
and the sampling rate is `0.1` (10%). You can customize these settings with
additional [configuration options](#configuration-options).

:::tip[Recommendation]

@@ -218,21 +218,23 @@ thv run [--otel-endpoint <URL>] [--otel-service-name <NAME>] \
[--otel-sampling-rate <RATE>] [--otel-headers <KEY=VALUE>] \
[--otel-custom-attributes <KEY=VALUE>] [--otel-env-vars <VAR1,VAR2>] \
[--otel-insecure] [--otel-enable-prometheus-metrics-path] \
[--otel-use-legacy-attributes=<true|false>] \
<SERVER>
```

| Flag | Description | Default |
| --------------------------------------- | ------------------------------------------------------------- | -------------------- |
| `--otel-endpoint` | OTLP endpoint (e.g., `api.honeycomb.io`) | None |
| `--otel-metrics-enabled` | Enable OTLP metrics export (when OTLP endpoint is configured) | `true` |
| `--otel-tracing-enabled` | Enable distributed tracing (when OTLP endpoint is configured) | `true` |
| `--otel-service-name` | Service name for telemetry | `toolhive-mcp-proxy` |
| `--otel-sampling-rate` | Trace sampling rate (0.0-1.0) | `0.1` (10%) |
| `--otel-headers` | Authentication headers in `key=value` format | None |
| `--otel-custom-attributes` | Custom resource attributes in `key=value` format | None |
| `--otel-env-vars` | List of environment variables to include in telemetry spans | None |
| `--otel-insecure` | Connect using HTTP instead of HTTPS | `false` |
| `--otel-enable-prometheus-metrics-path` | Enable `/metrics` endpoint | `false` |
| Flag | Description | Default |
| --------------------------------------- | ------------------------------------------------------------------- | -------------- |
| `--otel-endpoint` | OTLP endpoint (e.g., `api.honeycomb.io`) | None |
| `--otel-metrics-enabled` | Enable OTLP metrics export (when OTLP endpoint is configured) | `true` |
| `--otel-tracing-enabled` | Enable distributed tracing (when OTLP endpoint is configured) | `true` |
| `--otel-service-name` | Service name for telemetry | `thv-<SERVER>` |
| `--otel-sampling-rate` | Trace sampling rate (0.0-1.0) | `0.1` (10%) |
| `--otel-headers` | Authentication headers in `key=value` format | None |
| `--otel-custom-attributes` | Custom resource attributes in `key=value` format | None |
| `--otel-env-vars` | List of environment variables to include in telemetry spans | None |
| `--otel-insecure` | Connect using HTTP instead of HTTPS | `false` |
| `--otel-enable-prometheus-metrics-path` | Enable `/metrics` endpoint | `false` |
| `--otel-use-legacy-attributes` | Emit legacy attribute names alongside new OTel semantic conventions | `true` |

### Global configuration

55 changes: 54 additions & 1 deletion docs/toolhive/guides-vmcp/telemetry-and-metrics.mdx
@@ -40,7 +40,7 @@ spec:
serviceName: 'my-vmcp'
insecure: true
tracingEnabled: true
samplingRate: '0.1'
samplingRate: '0.05'
metricsEnabled: true
enablePrometheusMetricsPath: true
# highlight-end
@@ -114,6 +114,59 @@ vMCP supports two methods for collecting metrics:
- **Pull via Prometheus**: Set `enablePrometheusMetricsPath: true` to expose a
`/metrics` endpoint on the vMCP service port (4483) for Prometheus to scrape

### Backend metrics

These metrics track requests to individual MCP server backends:

| Metric | Type | Description |
| ----------------------------------------- | --------- | -------------------------------------------------------- |
| `toolhive_vmcp_backends_discovered` | Gauge | Number of backends discovered |
| `toolhive_vmcp_backend_requests` | Counter | Total requests per backend |
| `toolhive_vmcp_backend_errors` | Counter | Total errors per backend |
| `toolhive_vmcp_backend_requests_duration` | Histogram | Duration of backend requests |
| `mcp.client.operation.duration` | Histogram | MCP client operation duration (OTel semantic convention) |

### Workflow metrics

These metrics track workflow execution across backends:

| Metric | Type | Description |
| ----------------------------------- | --------- | ------------------------------- |
| `toolhive_vmcp_workflow_executions` | Counter | Total workflow executions |
| `toolhive_vmcp_workflow_errors` | Counter | Total workflow execution errors |
| `toolhive_vmcp_workflow_duration` | Histogram | Duration of workflow executions |

### Optimizer metrics

When the vMCP optimizer is enabled, these metrics track tool-finding and
tool-calling performance:

| Metric | Type | Description |
| ----------------------------------------------- | --------- | --------------------------------------- |
| `toolhive_vmcp_optimizer_find_tool_requests` | Counter | Total FindTool calls |
| `toolhive_vmcp_optimizer_find_tool_errors` | Counter | Total FindTool errors |
| `toolhive_vmcp_optimizer_find_tool_duration` | Histogram | Duration of FindTool calls |
| `toolhive_vmcp_optimizer_find_tool_results` | Histogram | Number of tools returned per call |
| `toolhive_vmcp_optimizer_token_savings_percent` | Histogram | Token savings percentage per call |
| `toolhive_vmcp_optimizer_call_tool_requests` | Counter | Total CallTool calls |
| `toolhive_vmcp_optimizer_call_tool_errors` | Counter | Total CallTool errors |
| `toolhive_vmcp_optimizer_call_tool_not_found` | Counter | CallTool calls where tool was not found |
| `toolhive_vmcp_optimizer_call_tool_duration` | Histogram | Duration of CallTool calls |

## Distributed tracing

vMCP creates client-side spans for backend operations with the following span
names:

- `tools/call <tool_name>` - Tool calls to backends
- `resources/read` - Resource reads from backends
- `prompts/get <prompt_name>` - Prompt retrieval from backends
- `list_capabilities` - Backend capability discovery

Each span includes attributes for the target backend (`target.workload_id`,
`target.workload_name`, `target.base_url`) and the relevant MCP attributes
(`mcp.method.name`, `gen_ai.tool.name`, `mcp.resource.uri`).
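
The span naming pattern listed above can be expressed as a small helper. This is an illustrative sketch only; `span_name` is a hypothetical function, not part of vMCP, and it covers just the MCP-method cases from the list.

```python
def span_name(method, target=None):
    """Build a span name: `<method> <target>` for tool calls and prompt
    retrieval, and the bare method name otherwise."""
    if method in ("tools/call", "prompts/get") and target:
        return f"{method} {target}"
    return method
```

For example, `span_name("tools/call", "create_issue")` yields `tools/call create_issue`, while `span_name("resources/read")` stays `resources/read`, which keeps span-name cardinality bounded by the number of tools and prompts rather than by resource URIs.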

## Related information

- [Observability concepts](../concepts/observability.mdx) - Overview of