
[FLINK-27773][Web Dashboard] Top N Metrics Dashboard #27774

Open
featzhang wants to merge 5 commits into apache:master from featzhang:feature/FLINK-top-n-metrics-dashboard

@featzhang (Member)

Purpose

This PR fixes architectural issues in the Top N Metrics Dashboard implementation that were identified during CI analysis. The previous implementation extended the wrong handler base class, tried to access private MetricStore members, and referenced a non-existent HttpMethod type, so it did not work correctly.

Changes

  1. Fixed REST handler inheritance: the handler now extends AbstractRestHandler instead of an incorrect base class.
  2. Fixed the MessageHeaders implementation: it now implements RuntimeMessageHeaders with the correct method signatures (getRequestClass, getResponseClass, getResponseStatusCode, getHttpMethod).
  3. Fixed MetricStore access: the handler now uses public APIs instead of attempting to access private members (JobMetricStore, TaskMetricStore.subtasks):
    • metricStore.getRepresentativeAttempts() to get job tasks
    • taskMetricStore.getAllSubtaskMetricStores() to get subtasks
  4. Fixed HTTP method references: the code now uses HttpMethodWrapper instead of the non-existent HttpMethod.
  5. Added logging: added a Logger instance for better error tracking.
  6. Added handler registration: registered TopNMetricsHandler in WebMonitorEndpoint.
  7. Moved the response body to the correct package: from legacy.messages to the job.metrics package.
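
Item 3 describes walking from a job's tasks down to their subtasks. The sketch below mirrors that traversal with plain nested maps standing in for Flink's store classes. It does not call MetricStore; MetricWalkSketch, collect, and the "taskId#subtaskIndex" key format are hypothetical names chosen for illustration. It assumes metric values arrive as strings and skips values that are missing or not numeric.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: nested maps stand in for Flink's metric store classes. */
final class MetricWalkSketch {

    /**
     * Flattens taskId -> subtaskIndex -> (metricName -> value) into
     * "taskId#subtaskIndex" -> numeric value for a single metric name.
     * Missing or non-numeric values are skipped instead of failing the whole walk.
     */
    static Map<String, Double> collect(
            Map<String, Map<Integer, Map<String, String>>> subtaskMetricsByTask,
            String metricName) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Map<Integer, Map<String, String>>> task
                : subtaskMetricsByTask.entrySet()) {
            for (Map.Entry<Integer, Map<String, String>> subtask
                    : task.getValue().entrySet()) {
                String raw = subtask.getValue().get(metricName);
                if (raw == null) {
                    continue; // this subtask has not reported the metric
                }
                try {
                    out.put(task.getKey() + "#" + subtask.getKey(),
                            Double.parseDouble(raw));
                } catch (NumberFormatException ignored) {
                    // value is not a number; skip it
                }
            }
        }
        return out;
    }
}
```

The flattened map can then be ranked to produce a Top N list.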

Implementation Details

The implementation now follows Flink's standard REST API architecture pattern:

  • Extends AbstractRestHandler<RestfulGateway, EmptyRequestBody, TopNMetricsResponseBody, TopNMetricsMessageParameters>
  • Implements proper request handling with MetricFetcher integration
  • Uses public MetricStore APIs to safely access metrics data
  • Returns Top N metrics for:
    • CPU consumers (Top 5)
    • Backpressured operators (Top 5)
    • GC-intensive tasks (Top 5)
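
This description does not show how the Top 5 lists are computed. One common way to select the N largest values is a size-bounded min-heap; the sketch below illustrates that technique under assumed names (TopNSketch, topN, and the sample task names are hypothetical and not taken from the PR).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Illustration only: selects the n largest metric values with a bounded min-heap. */
final class TopNSketch {

    /**
     * Returns the n entries with the largest values, sorted in descending order.
     * The heap never holds more than n elements, so m inputs cost O(m log n).
     */
    static List<Map.Entry<String, Double>> topN(Map<String, Double> valuesByName, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        Comparator<Map.Entry<String, Double>> byValue = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(byValue);
        for (Map.Entry<String, Double> e : valuesByName.entrySet()) {
            // Copy the entry so the result does not depend on the input map.
            heap.offer(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest value
            }
        }
        List<Map.Entry<String, Double>> result = new ArrayList<>(heap);
        result.sort(byValue.reversed());
        return result;
    }
}
```

Calling topN with n = 5 on a map of per-task values would yield a list shaped like each of the three categories above.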

Verifying this change

  • Code compiles successfully (excluding unrelated upstream compilation issues)
  • Follows Flink REST API architecture patterns
  • Uses proper public APIs for MetricStore access
  • Code formatted with Spotless
  • Integration tests (to be added)

Documentation

This PR adds a new REST endpoint, GET /jobs/:jobid/metrics/top-n, which returns the Top N metrics for a job.
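
For illustration, a client could build a request to this endpoint as sketched below. The base URL (http://localhost:8081, Flink's default REST port) and the job ID are placeholder assumptions, and TopNRequestSketch is a hypothetical helper. The response schema is not specified here, so the sketch stops at building the request.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustration only: builds a GET request for the endpoint described above. */
final class TopNRequestSketch {

    /** baseUrl and jobId are supplied by the caller; neither is validated here. */
    static HttpRequest buildRequest(String baseUrl, String jobId) {
        URI uri = URI.create(baseUrl + "/jobs/" + jobId + "/metrics/top-n");
        return HttpRequest.newBuilder(uri)
                .GET()
                .header("Accept", "application/json")
                .build();
    }
}
```

Sending the request with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) would return the response body as a string.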

Notes

The previous PR #27771 was closed due to fundamental architectural issues. This implementation addresses all identified issues and follows Flink's standard patterns.

This commit introduces a Top N Metrics Dashboard to the Flink Web UI,
providing visibility into resource-intensive components:

- Top N CPU Consumers: Identify tasks with highest CPU usage
- Top N Backpressure Operators: Highlight operators experiencing backpressure
- Top N GC Intensive Tasks: Show tasks with highest GC overhead

The implementation includes:
- REST API endpoint: /jobs/:jobid/metrics/top-n
- Response body with three metric categories
- Angular components for displaying metrics
- Demo page showcasing the feature

This feature helps operators quickly identify performance bottlenecks
and optimize job execution.
… architecture

This commit fixes fundamental architectural issues in the Top N Metrics Dashboard implementation:

1. Fixed REST handler inheritance - Now extends AbstractRestHandler instead of an incorrect base class
2. Fixed MessageHeaders implementation - Now implements RuntimeMessageHeaders with correct method signatures
3. Fixed MetricStore access - Now uses public APIs (getRepresentativeAttempts, getAllSubtaskMetricStores) instead of attempting to access private members
4. Fixed HTTP method references - Now uses HttpMethodWrapper instead of the non-existent HttpMethod
5. Added logging - Added a Logger instance for better error tracking
6. Added handler registration - Registered TopNMetricsHandler in WebMonitorEndpoint
7. Moved response body to correct package - Moved from legacy.messages to the job.metrics package

The implementation now follows Flink's standard REST API architecture pattern and properly interacts with the MetricStore system through public APIs.
@featzhang changed the title from "[FLINK-27773][Web Dashboard] Fix Top N Metrics Dashboard implementation architecture" to "[FLINK-27773][Web Dashboard] Top N Metrics Dashboard implementation architecture" on Mar 16, 2026
@featzhang changed the title from "[FLINK-27773][Web Dashboard] Top N Metrics Dashboard implementation architecture" to "[FLINK-27773][Web Dashboard] Top N Metrics Dashboard" on Mar 16, 2026
@flinkbot (Collaborator)

flinkbot commented Mar 16, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build
