[0.49-topru] topru manager keyspace cluster by zeminzhou · Pull Request #82 · tidbcloud/vector-extensions

zeminzhou · 2026-03-22T14:22:39Z

No description provided.

pingcap-cla-assistant · 2026-03-22T14:22:52Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ zhoucai-pingcap
❌ zeminzhou
You have signed the CLA already but the status is still pending? Let us recheck it.

This reverts commit 6dbf0fa.

src/common/keyspace_cluster.rs

src/common/topology/fetch/tidb_manager.rs

src/common/keyspace_cluster.rs

src/common/topology/fetch/tidb_manager.rs

src/common/topology/fetch/mod.rs

zeminzhou · 2026-03-23T06:47:49Z

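Following up on this round of review comments: I have addressed them in 167a2de and pushed to this PR:

1. Unbounded cache in keyspace_cluster
Changed to a bounded LRU cache with a default capacity of 10000, so memory does not grow without limit as the set of active serverless clusters keeps changing.
2. No log when the set of active TiDB instances selected from the manager is empty
Added a log. When the manager's result is empty after filtering, a message is now emitted with manager_server_address / tidb_namespace / shard_config to make troubleshooting easier.
3. resolve_keyspace does not write to the cache on failure or miss; could the first call fail and the second succeed?
Not caching failures or misses is intentional. When PD metadata shows up late, the network has a transient glitch, or a lookup misses the first time and hits the second, later polls can still recover. Caching failed results would instead amplify transient errors. I added a code comment explaining this.
4. The manager response body could be too large and exhaust memory
Added a body size limit, currently 8 MiB. Exceeding the limit returns an error immediately instead of reading the whole body without bound.
5. Behavior when only manager_server_address is configured and tidb_namespace is not
Changed to fail-fast: if manager_server_address is set but tidb_namespace is empty or all whitespace, a ConfigurationError is returned directly, avoiding the previous behavior of silently failing to get the TiDB topology.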
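
The cache behavior described in points 1 and 3 can be sketched as follows. This is a minimal illustration, not the PR's code: `BoundedLru` and `resolve_cached` are hypothetical names, the values are plain strings, and the O(n) recency update is only acceptable for a sketch. The point is that the cache is capped and that a failed or missing lookup is returned without being stored.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical bounded LRU cache; the PR's actual cache type is not shown.
pub struct BoundedLru {
    capacity: usize,
    map: HashMap<String, String>,
    /// Front = least recently used, back = most recently used.
    order: VecDeque<String>,
}

impl BoundedLru {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Move `key` to the most-recently-used position (O(n), fine for a sketch).
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            let _ = self.order.remove(pos);
        }
        self.order.push_back(key.to_string());
    }

    pub fn get(&mut self, key: &str) -> Option<String> {
        let value = self.map.get(key).cloned()?;
        self.touch(key);
        Some(value)
    }

    pub fn put(&mut self, key: &str, value: String) {
        // Evict the least recently used entry before inserting a new key at capacity.
        if !self.map.contains_key(key) && self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(key.to_string(), value);
        self.touch(key);
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }
}

/// Resolve a keyspace through the cache. Only successful lookups are cached,
/// so a transient failure or a not-yet-visible keyspace is retried next time.
pub fn resolve_cached<F>(cache: &mut BoundedLru, keyspace: &str, lookup: F) -> Option<String>
where
    F: FnOnce(&str) -> Option<String>,
{
    if let Some(hit) = cache.get(keyspace) {
        return Some(hit);
    }
    let resolved = lookup(keyspace)?; // miss or failure: return without caching
    cache.put(keyspace, resolved.clone());
    Some(resolved)
}
```

With this shape, a lookup that fails once and succeeds on the next poll ends up cached after the second call, which matches the recovery behavior argued for in point 3.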

Targeted tests run locally:

cargo test keyspace_cluster --lib
cargo test tidb_manager --lib
cargo test normalize_tidb_namespace --lib

If you would prefer that point 5 fall back to etcd rather than fail fast, I can change that too.
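
For reference, the fail-fast option could look roughly like the sketch below. The function name `validate_manager_config`, the `MissingTidbNamespace` variant, and the comma-separated namespace format are assumptions made for illustration; only the rule itself (address set, namespace empty or whitespace, return a configuration error) comes from the discussion above.

```rust
/// Hypothetical error type; the PR only states that a ConfigurationError is returned.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfigurationError {
    MissingTidbNamespace,
}

/// Returns Ok(None) when manager discovery is not configured at all,
/// Ok(Some(namespaces)) when it is configured correctly, and an error
/// when the address is set but no usable namespace is given.
pub fn validate_manager_config(
    manager_server_address: Option<&str>,
    tidb_namespace: Option<&str>,
) -> Result<Option<Vec<String>>, ConfigurationError> {
    let address = manager_server_address.map(str::trim).filter(|a| !a.is_empty());
    if address.is_none() {
        return Ok(None);
    }

    // Assumption: several namespaces may be given as a comma-separated list.
    let namespaces: Vec<String> = tidb_namespace
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|ns| !ns.is_empty())
        .map(str::to_string)
        .collect();

    if namespaces.is_empty() {
        // Fail fast instead of silently returning an empty TiDB topology.
        return Err(ConfigurationError::MissingTidbNamespace);
    }
    Ok(Some(namespaces))
}
```

A fallback-to-etcd variant would replace the error branch with a return of `Ok(None)`, at the cost of hiding the misconfiguration.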

This reverts commit 9e8807f.

# Conflicts: # src/sinks/topsql_data_deltalake/processor.rs

zeminzhou

claude: Overall the code quality is good, test coverage is thorough, and the review comments were handled well (LRU cache, body limit, fail-fast config validation, etc.).
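
As an illustration of the body limit mentioned here, a capped read can be written with only the standard library. This is a sketch under assumptions: `read_body_limited` and `MAX_MANAGER_BODY_BYTES` are hypothetical names, and the real code presumably reads from an HTTP response rather than a generic `Read`.

```rust
use std::io::{self, Read};

/// 8 MiB, the limit stated in the review discussion.
pub const MAX_MANAGER_BODY_BYTES: u64 = 8 * 1024 * 1024;

/// Read a body, but fail once it is larger than `limit` bytes.
pub fn read_body_limited<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut body = Vec::new();
    // Read at most limit + 1 bytes: the extra byte is what reveals an oversized body
    // without ever buffering the whole thing.
    reader.take(limit.saturating_add(1)).read_to_end(&mut body)?;
    if body.len() as u64 > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("manager response body exceeds {} bytes", limit),
        ));
    }
    Ok(body)
}
```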

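One more suggestion, about the unsafe block in TopSQLDeltaLakeSink::new: extracting the inner value from an Arc through a raw pointer is fragile. If someone later changes the code and adds one more clone, it could cause UB. Suggested refactor: create the channel and the shared state (Arc<Mutex<HashMap<...>>>) first, pass those Arcs directly when spawning the task, then construct and return Self normally, with no unsafe needed at all.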
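
The suggested construction order can be sketched without any unsafe code. The example below is a stand-in, not the sink's real code: it uses a `std::thread` worker and a `std::sync::mpsc` channel in place of whatever async runtime the sink uses, and `Sink`, `SharedCounts`, `send`, and `finish` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

type SharedCounts = Arc<Mutex<HashMap<String, u64>>>;

pub struct Sink {
    tx: Sender<String>,
    state: SharedCounts,
}

impl Sink {
    /// Build the channel and shared state first, hand clones of the Arc to the
    /// worker, then construct Self normally. No raw pointers are involved.
    pub fn new() -> (Self, JoinHandle<()>) {
        let (tx, rx) = mpsc::channel::<String>();
        let state: SharedCounts = Arc::new(Mutex::new(HashMap::new()));

        let worker_state = Arc::clone(&state);
        let worker = thread::spawn(move || {
            // The loop ends once every Sender has been dropped.
            for table in rx {
                *worker_state.lock().unwrap().entry(table).or_insert(0) += 1;
            }
        });

        (Self { tx, state }, worker)
    }

    pub fn send(&self, table: &str) {
        self.tx.send(table.to_string()).expect("worker should be alive");
    }

    /// Drop the sender, wait for the worker, and return the final counts.
    pub fn finish(self, worker: JoinHandle<()>) -> HashMap<String, u64> {
        let Sink { tx, state } = self;
        drop(tx);
        worker.join().expect("worker panicked");
        let counts = state.lock().unwrap().clone();
        counts
    }
}
```

Because the worker only ever holds its own `Arc` clone, adding further clones later cannot invalidate anything, which is the property the raw-pointer version lacked.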

src/sinks/topsql_data_deltalake/processor.rs

src/common/keyspace_cluster.rs

src/common/topology/fetch/tidb_manager.rs

src/sources/topsql_v2/mod.rs

zeminzhou · 2026-03-24T09:42:57Z

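Following up on review 3997482947: I confirmed that every comment in this round needs a code fix, and none can be closed with a text reply alone. The changes have been pushed to this PR branch in 54c1fda.

What was addressed:

read_manager_shard_config_from_env is no longer called on every get_up_tidbs; it is now read once when LegacyTopologyFetcher is initialized, and cached.
hash_keyspace_name now has a comment explaining why FNV-1a was chosen and what is required of it as the implicit contract for cross-service keyspace sharding.
The PLACEHOLDER doc comments for manager_server_address / tidb_namespace in topsql_v2 have been replaced with real descriptions.
The unsafe in TopSQLDeltaLakeSink::new has been removed: the sink is now constructed normally and then cloned for the background task, instead of extracting the inner value from an Arc through a raw pointer.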
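
To make the FNV-1a point concrete, here is a minimal sketch of the hash and of a shard selection built on it. The 64-bit variant, the modulo-based shard choice, and the names `fnv1a_64` and `shard_for_keyspace` are assumptions; the PR only says that hash_keyspace_name uses FNV-1a and that other services must compute the same value.

```rust
/// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a over the UTF-8 bytes of the input: XOR the byte in, then multiply.
/// The result depends only on the bytes, so it is stable across processes,
/// unlike std's randomly seeded default hasher.
pub fn fnv1a_64(input: &str) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for byte in input.as_bytes() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical shard selection: every service that shards by keyspace name
/// must use the same hash and the same shard count to agree on ownership.
pub fn shard_for_keyspace(keyspace: &str, shard_count: u64) -> Option<u64> {
    if shard_count == 0 {
        return None;
    }
    Some(fnv1a_64(keyspace) % shard_count)
}
```

The stability requirement is what makes the hash an implicit contract: changing the algorithm, the width, or the byte encoding on one side would silently move keyspaces to different shards.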

Targeted local verification:

cargo test -q --package vector-extensions --lib manager_shard_config -- --nocapture
cargo test -q --package vector-extensions --lib test_send_when_batch_size_reached -- --nocapture
cargo test -q --package vector-extensions --lib normalize_tidb_namespace -- --nocapture
cargo test -q --package vector-extensions --lib hash_keyspace_name_is_stable -- --nocapture

zeminzhou · 2026-03-24T09:50:43Z

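codex: Regarding the unsafe suggestion for TopSQLDeltaLakeSink::new in the review body, this was already changed in 54c1fda: Self is now constructed normally and then cloned for the background task, instead of extracting the inner value from an Arc through a raw pointer, so later changes to the initialization path will not run into the UB risk.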

zeminzhou

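claude: LGTM 🎉

All review comments from the previous round have been fixed:

✅ process_events_loop now has a max retry count (5) + exponential backoff (5s→60s cap) + a log when events are dropped after exceeding the limit
✅ unsafe block removed in favor of #[derive(Clone)] + sink.clone() (both the data and meta processors were changed)
✅ is_not_found_body tightened to match "keyspace not found", supports both plain-text and JSON formats, with tests added
✅ normalize_namespaces extracted into a shared function, normalize_namespace_list
✅ shard config reads the environment variable once at initialization and caches it in the struct
✅ doc comment PLACEHOLDER filled in
✅ FNV-1a hash has a comment explaining the choice and the cross-service contract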
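
The retry policy in the first item can be sketched as below. The doubling factor and the 0-based attempt counter are assumptions; the source only states 5 retries and an exponential backoff from 5s capped at 60s. `retry_delay` and `next_retry` are hypothetical names, not the PR's route_resolution_retry_delay.

```rust
use std::time::Duration;

pub const MAX_RETRIES: u32 = 5;
const BASE_DELAY_SECS: u64 = 5;
const MAX_DELAY_SECS: u64 = 60;

/// Delay before retry number `attempt` (0-based): 5s, 10s, 20s, 40s, then capped at 60s.
pub fn retry_delay(attempt: u32) -> Duration {
    // checked_shl guards against shifting by 64 or more; saturating_mul guards overflow.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_secs(BASE_DELAY_SECS.saturating_mul(factor).min(MAX_DELAY_SECS))
}

/// Returns the delay to wait before the next attempt, or None once the retry
/// budget is used up and the batch should be dropped (and logged).
pub fn next_retry(attempts_so_far: u32) -> Option<Duration> {
    if attempts_so_far >= MAX_RETRIES {
        None
    } else {
        Some(retry_delay(attempts_so_far))
    }
}
```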

- Replace recursive JSON extraction with direct serde deserialization for ActiveTiDBAddress, removing ~100 lines of guessing logic
- Remove redundant normalize_namespaces wrapper method
- Change PdKeyspaceResolver::new to take Option<&TlsConfig> instead of owned Option<TlsConfig> to avoid unnecessary clones
- Add debug_assert on replace_keyspace_route_segments to catch org_id/cluster_id containing path separators
- Add TODO for resolve_keyspace concurrent cache miss dedup
- Add comment explaining org=/cluster= path segment convention

Empty array from manager is a legitimate response (no active TiDB instances). Previously it was treated as an error while filter-to-empty after keyspace sharding was treated as Ok — inconsistent behavior. Now both cases flow through get_up_tidbs which logs and returns Ok, making the behavior uniform. Also removes the now-unused InvalidManagerResponse error variant.

… on retry

process_events wrote dedup keys to LRU before flush_buffer persisted data to Delta Lake. If flush failed (e.g. PD keyspace lookup error), the retry path would find all keys already in LRU and silently skip them, causing permanent metadata loss.

Fix: stage dedup keys in a pending_dedup_keys buffer during process_events (read-only LRU check). Commit to LRU only after flush_buffer succeeds. On failure, process_events_loop clears pending state so the cloned retry snapshot is re-processed cleanly.
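
The stage-then-commit idea in this fix can be sketched as follows. This is an illustration under assumptions: a `HashSet` stands in for the LRU, the write is a caller-supplied closure, and the pending state is cleared inside `flush_buffer` for brevity, whereas the commit message places that step in process_events_loop. `MetaProcessor` and its fields other than pending_dedup_keys are hypothetical.

```rust
use std::collections::HashSet;

pub struct MetaProcessor {
    /// Keys already persisted; stands in for the LRU in this sketch.
    committed: HashSet<String>,
    /// Keys staged for the current batch, not yet visible to dedup of later batches.
    pending_dedup_keys: Vec<String>,
    buffer: Vec<String>,
}

impl MetaProcessor {
    pub fn new() -> Self {
        Self { committed: HashSet::new(), pending_dedup_keys: Vec::new(), buffer: Vec::new() }
    }

    /// Stage unseen events. The committed set is only read here, never written.
    pub fn process_events(&mut self, events: &[&str]) {
        for event in events {
            let key = (*event).to_string();
            if self.committed.contains(&key) || self.pending_dedup_keys.contains(&key) {
                continue;
            }
            self.pending_dedup_keys.push(key.clone());
            self.buffer.push(key);
        }
    }

    /// Write the buffer. Keys are committed only after the write succeeds;
    /// on failure the staged state is discarded so a retry re-processes everything.
    pub fn flush_buffer<F>(&mut self, write: F) -> Result<usize, String>
    where
        F: FnOnce(&[String]) -> Result<(), String>,
    {
        match write(&self.buffer) {
            Ok(()) => {
                let written = self.buffer.len();
                self.committed.extend(self.pending_dedup_keys.drain(..));
                self.buffer.clear();
                Ok(written)
            }
            Err(err) => {
                self.pending_dedup_keys.clear();
                self.buffer.clear();
                Err(err)
            }
        }
    }
}
```

In the buggy ordering, the first failed flush would already have marked both keys as seen, and the retry would have written nothing.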

# Conflicts: # src/sinks/topsql_data_deltalake/mod.rs # src/sinks/topsql_data_deltalake/processor.rs # src/sinks/topsql_meta_deltalake/mod.rs # src/sinks/topsql_meta_deltalake/processor.rs

…eyspace-cluster # Conflicts: # src/sources/topsql_v2/upstream/tidb/parser.rs

…ster module Deduplicate identical retry delay function and constants from both topsql_data_deltalake and topsql_meta_deltalake processors into common/keyspace_cluster.rs. Move the associated test as well.

zhoucai-pingcap and others added 7 commits March 9, 2026 18:41

fix topru proto

7ce3e3e

change type to component

387215d

fix native jeprof issue

6aaf6af

remove arm vector first

8667b91

native jeprof support heap_v2

ddbe3a1

topsql: add manager discovery and topru keyspace routing

71b0006

topsql: infer topru keyspace from user

6dbf0fa

ti-chi-bot bot added the size/XXL label Mar 22, 2026

zeminzhou added 2 commits March 22, 2026 22:30

Revert "topsql: infer topru keyspace from user"

b7cdb9c

This reverts commit 6dbf0fa.

feat: shard manager tidb by keyspace name

f56120c

zhoucai-pingcap reviewed Mar 23, 2026

fix: address manager topology review feedback

167a2de

zeminzhou added 3 commits March 23, 2026 21:29

feat: add tikv topsql collection switch

b8f997f

feat: update topru storage path layout

9e8807f

Revert "feat: update topru storage path layout"

6c2ebe5

This reverts commit 9e8807f.

Base automatically changed from 0.49-topru to 0.49 March 24, 2026 02:45

zeminzhou added 2 commits March 24, 2026 11:01

topsql: restore component-based data path layout

f6c4670

Merge remote-tracking branch 'origin/0.49' into HEAD

835aad0

# Conflicts: # src/sinks/topsql_data_deltalake/processor.rs

zeminzhou commented Mar 24, 2026

topru: address remaining review feedback

54c1fda

topru: finish remaining review fixes

17a2eb7

zeminzhou commented Mar 24, 2026

zeminzhou added 4 commits March 25, 2026 13:25

Add keyspace routing for TopSQL meta events

c212665

Validate keyspace routing path templates

665ee2e

zeminzhou added 5 commits March 26, 2026 11:39

fix: commit meta dedup keys only after successful writes

4d3e974

Merge remote-tracking branch 'tmp/0.49' into HEAD

1ab14e7

# Conflicts: # src/sinks/topsql_data_deltalake/mod.rs # src/sinks/topsql_data_deltalake/processor.rs # src/sinks/topsql_meta_deltalake/mod.rs # src/sinks/topsql_meta_deltalake/processor.rs

Merge remote-tracking branch 'origin/0.49' into codex/topru-manager-k…

a4e4233

…eyspace-cluster # Conflicts: # src/sources/topsql_v2/upstream/tidb/parser.rs

refactor: extract route_resolution_retry_delay to common keyspace_clu…

70c71e9

…ster module Deduplicate identical retry delay function and constants from both topsql_data_deltalake and topsql_meta_deltalake processors into common/keyspace_cluster.rs. Move the associated test as well.

zeminzhou wants to merge 26 commits into 0.49 from codex/topru-manager-keyspace-cluster

2 participants
