Open
56 commits
23f05c6
feat: ReconcileUtils for strongly consistent updates (#3106)
csviri Jan 15, 2026
26480b3
feat: observability with otel and default grafana dashboard
csviri Feb 4, 2026
ddf8965
wip
csviri Feb 4, 2026
c5db800
wip
csviri Feb 4, 2026
7ad0250
wip
csviri Feb 4, 2026
63cab18
wip
csviri Feb 8, 2026
978c204
wip
csviri Feb 8, 2026
f03cac8
wip
csviri Feb 8, 2026
c38f20b
wip
csviri Feb 9, 2026
372eaa7
wip
csviri Feb 9, 2026
73bfdeb
wip
csviri Feb 9, 2026
c04b0f1
wip
csviri Feb 9, 2026
ab66e99
wip
csviri Feb 9, 2026
2e3fb34
wip
csviri Feb 9, 2026
b31e851
wip
csviri Feb 9, 2026
47d1d9e
wip
csviri Feb 9, 2026
ad4c94b
wip
csviri Feb 9, 2026
4f39cd9
improve: micrometer metrics improvements
csviri Feb 9, 2026
c660ccc
wip
csviri Feb 10, 2026
a13b76a
wip
csviri Feb 10, 2026
d8df8b5
wip
csviri Feb 10, 2026
f8c401e
wip
csviri Feb 10, 2026
5c46c76
wip
csviri Feb 10, 2026
43390e7
wip
csviri Feb 10, 2026
0a55712
wip
csviri Feb 11, 2026
d7a3ccb
wip
csviri Feb 11, 2026
f92fb99
wip
csviri Feb 11, 2026
852c160
wip
csviri Feb 11, 2026
76b8c45
e2e test skeleton
csviri Feb 11, 2026
1db492b
wip
csviri Feb 12, 2026
51e37f4
wip
csviri Feb 17, 2026
8136796
wip
csviri Feb 21, 2026
725b851
wip
csviri Feb 27, 2026
0c0dbd9
wip
csviri Feb 27, 2026
040a009
wip
csviri Feb 27, 2026
5d65997
wip
csviri Feb 27, 2026
dcc196a
wip
csviri Feb 27, 2026
6b54b0d
wip
csviri Feb 27, 2026
02a1ea9
wip
csviri Feb 28, 2026
2127002
wip
csviri Mar 1, 2026
9eb94da
documentation update
csviri Mar 1, 2026
fcfd3bd
wip
csviri Mar 1, 2026
d2e2840
logging
csviri Mar 1, 2026
9a7d90f
Update sample-operators/metrics-processing/src/main/java/io/javaopera…
csviri Mar 1, 2026
8ab36d8
Update sample-operators/metrics-processing/pom.xml
csviri Mar 1, 2026
86cde7c
Update sample-operators/metrics-processing/pom.xml
csviri Mar 1, 2026
86b4e91
Update operator-framework-core/src/main/java/io/javaoperatorsdk/opera…
csviri Mar 1, 2026
c64f9be
Update observability/install-observability.sh
csviri Mar 1, 2026
d166f06
wip
csviri Mar 1, 2026
8376c0b
Update sample-operators/metrics-processing/src/main/resources/io/java…
csviri Mar 1, 2026
cb93e91
wip
csviri Mar 1, 2026
8458b1e
wip
csviri Mar 1, 2026
1fe5c39
wip
csviri Mar 1, 2026
7704651
wip
csviri Mar 1, 2026
7a90db4
wip
csviri Mar 1, 2026
d60193a
wip
csviri Mar 1, 2026
1 change: 1 addition & 0 deletions .github/workflows/e2e-test.yml
@@ -25,6 +25,7 @@ jobs:
- "sample-operators/tomcat-operator"
- "sample-operators/webpage"
- "sample-operators/leader-election"
- "sample-operators/metrics-processing"
runs-on: ubuntu-latest
steps:
- name: Checkout
1 change: 1 addition & 0 deletions .github/workflows/pr.yml
@@ -11,6 +11,7 @@ on:
paths-ignore:
- 'docs/**'
- 'adr/**'
- 'observability/**'
workflow_dispatch:
jobs:
check_format_and_unit_tests:
125 changes: 101 additions & 24 deletions docs/content/en/docs/documentation/observability.md
@@ -77,30 +77,108 @@ Metrics metrics; // initialize your metrics implementation
Operator operator = new Operator(client, o -> o.withMetrics(metrics));
```

### Micrometer implementation
### MicrometerMetricsV2 (Recommended, since 5.3.0)

The micrometer implementation is typically created using one of the provided factory methods which, depending on which
is used, will return either a ready to use instance or a builder allowing users to customize how the implementation
behaves, in particular when it comes to the granularity of collected metrics. It is, for example, possible to collect
metrics on a per-resource basis via tags that are associated with meters. This is the default, historical behavior but
this will change in a future version of JOSDK because this dramatically increases the cardinality of metrics, which
could lead to performance issues.
[`MicrometerMetricsV2`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetricsV2.java) is the recommended micrometer-based implementation. It is designed with low cardinality in mind:
all meters are scoped to the controller, not to individual resources. This avoids unbounded cardinality growth as
resources come and go.
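
To make the cardinality argument concrete, here is a minimal, self-contained sketch in plain Java. It does not use JOSDK or Micrometer; `CardinalitySketch`, `Resource`, and the sample controller and resource names are hypothetical and exist only for illustration. It counts how many distinct tag combinations (and therefore time series per meter) result from controller-scoped tagging versus resource-scoped tagging.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration; these names are not part of JOSDK or Micrometer. */
public class CardinalitySketch {

  /** Stand-in for a reconciled resource, identified by controller, namespace and name. */
  public static final class Resource {
    final String controller;
    final String namespace;
    final String name;

    public Resource(String controller, String namespace, String name) {
      this.controller = controller;
      this.namespace = namespace;
      this.name = name;
    }
  }

  /** Distinct series per meter when the only tag is the controller name. */
  public static int controllerScopedSeries(List<Resource> resources) {
    Set<String> series = new HashSet<>();
    for (Resource r : resources) {
      series.add(r.controller);
    }
    return series.size();
  }

  /** Distinct series per meter when namespace and name are also used as tags. */
  public static int resourceScopedSeries(List<Resource> resources) {
    Set<List<String>> series = new HashSet<>();
    for (Resource r : resources) {
      series.add(List.of(r.controller, r.namespace, r.name));
    }
    return series.size();
  }

  public static void main(String[] args) {
    List<Resource> resources =
        List.of(
            new Resource("webpage", "default", "page-1"),
            new Resource("webpage", "default", "page-2"),
            new Resource("webpage", "team-a", "page-1"),
            new Resource("tomcat", "default", "tc-1"));
    System.out.println(controllerScopedSeries(resources)); // prints 2
    System.out.println(resourceScopedSeries(resources)); // prints 4
  }
}
```

With controller-scoped tags the series count is bounded by the number of controllers, while resource-scoped tags grow with every resource that is ever created.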

To create a `MicrometerMetrics` implementation that behaves how it has historically behaved, you can just create an
instance via:
The simplest way to create an instance:

```java
MeterRegistry registry; // initialize your registry implementation
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry).build();
```

Optionally, include a `namespace` tag on per-reconciliation counters (disabled by default to avoid unexpected
cardinality increases in existing deployments):

```java
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry)
.withNamespaceAsTag()
.build();
```

You can also supply a custom timer configuration for `reconciliations.execution.duration`:

```java
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry)
.withExecutionTimerConfig(builder -> builder.publishPercentiles(0.5, 0.95, 0.99))
.build();
```

The class provides factory methods which either return a fully pre-configured instance or a builder object that will
allow you to configure more easily how the instance will behave. You can, for example, configure whether the
implementation should collect metrics on a per-resource basis, whether associated meters should be removed when a
resource is deleted and how the clean-up is performed. See the relevant classes documentation for more details.
#### MicrometerMetricsV2 metrics

All meters use `controller.name` as their primary tag. Counters optionally carry a `namespace` tag when
`withNamespaceAsTag()` is enabled.

| Meter name (Micrometer) | Type | Tags | Description |
|------------------------------------------|---------|-----------------------------------|----------------------------------------------------------------------|
| `reconciliations.executions` | gauge | `controller.name` | Number of reconciler executions currently in progress |
| `reconciliations.active` | gauge | `controller.name` | Number of resources currently queued for reconciliation |
| `custom_resources` | gauge | `controller.name` | Number of custom resources tracked by the controller |
| `reconciliations.execution.duration` | timer | `controller.name` | Reconciliation execution duration with explicit SLO bucket histogram |
| `reconciliations.started.total` | counter | `controller.name`, `namespace`* | Number of reconciliations started (including retries) |
| `reconciliations.success.total` | counter | `controller.name`, `namespace`* | Number of successfully finished reconciliations |
| `reconciliations.failure.total` | counter | `controller.name`, `namespace`* | Number of failed reconciliations |
| `reconciliations.retries.total` | counter | `controller.name`, `namespace`* | Number of reconciliation retries |
| `events.received` | counter | `controller.name`, `event`, `action`, `namespace`* | Number of Kubernetes events received by the controller |
| `events.delete` | counter | `controller.name`, `namespace`* | Number of resource deletion events processed |

\* `namespace` tag is only included when `withNamespaceAsTag()` is enabled.

The execution timer uses explicit SLO boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure
compatibility with `histogram_quantile()` queries in Prometheus. This is important when using the OTLP registry, where
`publishPercentileHistogram()` would otherwise produce Base2 Exponential Histograms that are incompatible with classic
`_bucket` queries.
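
As a rough illustration of what explicit bucket boundaries mean for classic `_bucket` series, the following plain-Java sketch computes cumulative bucket counts for the boundaries listed above. It is not the JOSDK or Micrometer implementation; `SloBuckets` and `cumulativeCounts` are hypothetical names, and the sample durations are made up. Each bucket counts samples less than or equal to its upper bound, with a final `+Inf` bucket holding all samples, which is the shape `histogram_quantile()` expects.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not part of JOSDK or Micrometer. */
public class SloBuckets {

  // Upper bounds from the documentation above, in milliseconds.
  static final long[] BOUNDS_MS = {10, 50, 100, 250, 500, 1_000, 2_000, 5_000, 10_000, 30_000};

  /**
   * Returns cumulative counts: element i is the number of samples that are
   * less than or equal to BOUNDS_MS[i]; the extra last element is the "+Inf"
   * bucket, which counts every sample.
   */
  public static long[] cumulativeCounts(List<Duration> samples) {
    long[] counts = new long[BOUNDS_MS.length + 1];
    for (Duration sample : samples) {
      long ms = sample.toMillis();
      for (int i = 0; i < BOUNDS_MS.length; i++) {
        if (ms <= BOUNDS_MS[i]) {
          counts[i]++;
        }
      }
      counts[BOUNDS_MS.length]++; // +Inf bucket
    }
    return counts;
  }

  public static void main(String[] args) {
    List<Duration> samples =
        List.of(
            Duration.ofMillis(7),
            Duration.ofMillis(120),
            Duration.ofSeconds(3),
            Duration.ofSeconds(45));
    // prints [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4]
    System.out.println(Arrays.toString(cumulativeCounts(samples)));
  }
}
```

Because the boundaries are fixed, the resulting series keep the same `le` labels across scrapes, which is what makes them usable in classic bucket-based quantile queries.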

> **Note on Prometheus metric names**: The exact Prometheus metric name suffix depends on the `MeterRegistry` in use.
> For `PrometheusMeterRegistry` the timer is exposed as `reconciliations_execution_duration_seconds_*`. For
> `OtlpMeterRegistry` (metrics exported via OpenTelemetry Collector), it is exposed as
> `reconciliations_execution_duration_milliseconds_*`.

#### Grafana Dashboard

A ready-to-use Grafana dashboard is available at
[`observability/josdk-operator-metrics-dashboard.json`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/observability/josdk-operator-metrics-dashboard.json).
It visualizes all of the metrics listed above, including reconciliation throughput, error rates, queue depth, active
executions, resource counts, and execution duration histograms and heatmaps.

The dashboard is designed to work with metrics exported via OpenTelemetry Collector to Prometheus, as set up by the
observability sample (see below).

#### Exploring metrics end-to-end

The
[`metrics-processing` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/metrics-processing)
includes a full end-to-end test,
[`MetricsHandlingE2E`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/metrics-processing/src/test/java/io/javaoperatorsdk/operator/sample/metrics/MetricsHandlingE2E.java),
that:

1. Installs a local observability stack (Prometheus, Grafana, OpenTelemetry Collector) via
`observability/install-observability.sh`. This also imports the Grafana dashboards.
2. Runs two reconcilers that produce both successful and failing reconciliations over a sustained period.
3. Verifies that the expected metrics appear in Prometheus.

This is a good starting point for experimenting with the metrics and the Grafana dashboard in a real cluster without
having to deploy your own operator.

### MicrometerMetrics (Deprecated)

> **Deprecated**: `MicrometerMetrics` (V1) is deprecated as of JOSDK 5.3.0. Use `MicrometerMetricsV2` instead.
> V1 attaches resource-specific metadata (name, namespace, etc.) as tags to every meter, which causes unbounded
> cardinality growth and can lead to performance issues in your metrics backend.

The legacy `MicrometerMetrics` implementation is still available. To create an instance that behaves as it historically
has:

```java
MeterRegistry registry; // initialize your registry implementation
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
```

For example, the following will create a `MicrometerMetrics` instance configured to collect metrics on a per-resource
basis, deleting the associated meters after 5 seconds when a resource is deleted, using up to 2 threads to do so.
To collect metrics on a per-resource basis, deleting the associated meters after 5 seconds when a resource is deleted,
using up to 2 threads:

```java
MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
@@ -109,9 +187,9 @@ MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
.build();
```

### Operator SDK metrics
#### Operator SDK metrics (V1)

The micrometer implementation records the following metrics:
The V1 micrometer implementation records the following metrics:

| Meter name | Type | Tag names | Description |
|-------------------------------------------------------------|----------------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
@@ -130,12 +208,11 @@ The micrometer implementation records the following metrics:
| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |

As you can see all the recorded metrics start with the `operator.sdk` prefix. `<resource metadata>`, in the table above,
refers to resource-specific metadata and depends on the considered metric and how the implementation is configured and
could be summed up as follows: `group?, version, kind, [name, namespace?], scope` where the tags in square
brackets (`[]`) won't be present when per-resource collection is disabled and tags followed by a question mark are
omitted if the associated value is empty. Of note, when in the context of controllers' execution metrics, these tag
names are prefixed with `resource.`. This prefix might be removed in a future version for greater consistency.
All V1 metrics start with the `operator.sdk` prefix. `<resource metadata>` refers to resource-specific metadata and
depends on the considered metric and how the implementation is configured: `group?, version, kind, [name, namespace?],
scope` where tags in square brackets (`[]`) won't be present when per-resource collection is disabled and tags followed
by a question mark are omitted if the value is empty. In the context of controllers' execution metrics, these tag names
are prefixed with `resource.`.

### Aggregated Metrics

26 changes: 24 additions & 2 deletions docs/content/en/docs/migration/v5-3-migration.md
@@ -4,7 +4,7 @@ description: Migrating from v5.2 to v5.3
---


## Renamed JUnit Module
## Rename of JUnit module

If you use the JUnit extension in your tests, rename it from:

@@ -26,4 +26,26 @@ to
<version>5.3.0</version>
<scope>test</scope>
</dependency>
```
```

## Metrics interface changes

The [Metrics](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/monitoring/Metrics.java)
interface changed in a non-backwards-compatible way to make the API cleaner.

The following table shows the relevant method renames:

| v5.2 method | v5.3 method |
|------------------------------------|------------------------------|
| `reconcileCustomResource` | `reconciliationSubmitted` |
| `reconciliationExecutionStarted` | `reconciliationStarted` |
| `reconciliationExecutionFinished`  | `reconciliationFinished`     |
| `failedReconciliation`             | `reconciliationFailed`       |
| `finishedReconciliation`           | `reconciliationSucceeded`    |
| `cleanupDoneFor` | `cleanupDone` |
| `receivedEvent` | `eventReceived` |


Other changes:
- `reconciliationFinished(..)` and `reconciliationFailed(..)` now take an additional `RetryInfo` parameter.
- `monitorSizeOf(..)` has been removed.
@@ -39,6 +39,7 @@

import static io.javaoperatorsdk.operator.api.reconciler.Constants.CONTROLLER_NAME;

@Deprecated
public class MicrometerMetrics implements Metrics {

private static final String PREFIX = "operator.sdk.";
@@ -182,7 +183,7 @@ public <T> T timeControllerExecution(ControllerExecution<T> execution) {
}

@Override
public void receivedEvent(Event event, Map<String, Object> metadata) {
public void eventReceived(Event event, Map<String, Object> metadata) {
if (event instanceof ResourceEvent) {
incrementCounter(
event.getRelatedCustomResourceID(),
@@ -201,14 +202,14 @@ public void receivedEvent(Event event, Map<String, Object> metadata) {
}

@Override
public void cleanupDoneFor(ResourceID resourceID, Map<String, Object> metadata) {
public void cleanupDone(ResourceID resourceID, Map<String, Object> metadata) {
incrementCounter(resourceID, EVENTS_DELETE, metadata);

cleaner.removeMetersFor(resourceID);
}

@Override
public void reconcileCustomResource(
public void reconciliationSubmitted(
HasMetadata resource, RetryInfo retryInfoNullable, Map<String, Object> metadata) {
Optional<RetryInfo> retryInfo = Optional.ofNullable(retryInfoNullable);
incrementCounter(
@@ -228,19 +229,20 @@ public void reconcileCustomResource(
}

@Override
public void finishedReconciliation(HasMetadata resource, Map<String, Object> metadata) {
public void reconciliationSucceeded(HasMetadata resource, Map<String, Object> metadata) {
incrementCounter(ResourceID.fromResource(resource), RECONCILIATIONS_SUCCESS, metadata);
}

@Override
public void reconciliationExecutionStarted(HasMetadata resource, Map<String, Object> metadata) {
public void reconciliationStarted(HasMetadata resource, Map<String, Object> metadata) {
var reconcilerExecutions =
gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
reconcilerExecutions.incrementAndGet();
}

@Override
public void reconciliationExecutionFinished(HasMetadata resource, Map<String, Object> metadata) {
public void reconciliationFinished(
HasMetadata resource, RetryInfo retryInfo, Map<String, Object> metadata) {
var reconcilerExecutions =
gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
reconcilerExecutions.decrementAndGet();
@@ -251,8 +253,8 @@ public void reconciliationExecutionFinished(HasMetadata resource, Map<String, Ob
}

@Override
public void failedReconciliation(
HasMetadata resource, Exception exception, Map<String, Object> metadata) {
public void reconciliationFailed(
HasMetadata resource, RetryInfo retry, Exception exception, Map<String, Object> metadata) {
var cause = exception.getCause();
if (cause == null) {
cause = exception;
@@ -266,11 +268,6 @@ public void failedReconciliation(
Tag.of(EXCEPTION, cause.getClass().getSimpleName()));
}

@Override
public <T extends Map<?, ?>> T monitorSizeOf(T map, String name) {
return registry.gaugeMapSize(PREFIX + name + SIZE_SUFFIX, Collections.emptyList(), map);
}

private void addMetadataTags(
ResourceID resourceID, Map<String, Object> metadata, List<Tag> tags, boolean prefixed) {
if (collectPerResourceMetrics) {