ATS Configuration Reload with observability/tracing - Token model #12892

Open
brbzull0 wants to merge 8 commits into apache:11-Dev from brbzull0:detached_config_reload
brbzull0 commented Feb 18, 2026

New Configuration Reload Framework


TL;DR

NOTE: Backward compatible: The existing traffic_ctl config reload command works exactly as before —
same syntax, same behavior from the operator's perspective. Internally, it now fires the new reload
logic, which means every reload is automatically tracked, timed, and queryable.

How Reload Works — The Token Model

When you run traffic_ctl config reload, the command sends a JSONRPC request to ATS and
returns immediately — it does not block until every config handler finishes. Instead, ATS:

  1. Assigns a token to the reload operation. The token is either auto-generated
    (e.g., rldtk-1739808000000, a timestamp-based ID) or user-supplied via -t
    (e.g., -t deploy-v2.1).
  2. Schedules the actual reload on background threads (ET_TASK). Each registered config
    handler runs, reports its status (in_progress → success or fail), and the results are
    aggregated into a task tree.
  3. Returns the token to the caller so you can track the reload afterwards.

The token is the handle for everything that follows:

| What you want | Command |
|---|---|
| Monitor progress live | traffic_ctl config reload -t <token> -m |
| Query final status | traffic_ctl config status -t <token> |
| Get detailed logs | traffic_ctl config status -t <token> -l |

If you don't supply -t, ATS generates a token automatically and prints it so you can copy-paste
it into follow-up commands. If you supply -t deploy-v2.1, that exact string becomes the token —
useful for CI pipelines, deploy scripts, or any workflow where you want a meaningful, predictable
identifier.

In short: the token is a reload's unique ID. You get it when the reload starts, and you use it
to ask "what happened?" at any point afterwards.
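
The token rule described above can be sketched in a few lines. This is only an illustration: make_reload_token is a hypothetical helper, not a function from this PR, and treating an empty -t value as "not supplied" is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper (not from the PR): choose the token for a reload request.
// A user-supplied token is used verbatim; otherwise a timestamp-based ID is
// generated in the "rldtk-<milliseconds>" form shown in the examples above.
std::string make_reload_token(const std::optional<std::string> &user_token, std::int64_t now_ms)
{
  if (user_token && !user_token->empty()) { // assumption: empty means "not supplied"
    return *user_token;
  }
  return "rldtk-" + std::to_string(now_ms);
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a real implementation would read the current time itself.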

New traffic_ctl commands:

# Basic reload — works exactly as before, but now returns a token for tracking
$ traffic_ctl config reload
✔ Reload scheduled [rldtk-1739808000000]

  Monitor : traffic_ctl config reload -t rldtk-1739808000000 -m
  Details : traffic_ctl config reload -t rldtk-1739808000000 -s -l
# Monitor mode with a custom token
$ traffic_ctl config reload -t deploy-v2.1 -m
✔ Reload scheduled [deploy-v2.1]
✔ [deploy-v2.1] ████████████████████ 11/11  success  (245ms)

Note: All subtasks — both file-based and record-triggered — are registered before the
first status poll. After rereadConfig() processes files, RecFlushConfigUpdateCbs() immediately
fires all pending record callbacks, and each on_record_change() calls reserve_subtask() to
pre-register a CREATED subtask on the main task. This means the total task count is stable from
the start (e.g., 3/11 in_progress → 11/11 success) rather than growing over time.

# Full status report
$ traffic_ctl config status -t deploy-v2.1
✔ Reload [success] — deploy-v2.1
  Started : 2025 Feb 17 12:00:00.123
  Finished: 2025 Feb 17 12:00:00.368
  Duration: 245ms

  ✔ 11 success  ◌ 0 in-progress  ✗ 0 failed  (11 total)

  Tasks:
   ✔ logging.yaml ··························· 120ms
   ✔ ip_allow.yaml ·························· 18ms
   ✔ remap.config ··························· 42ms
   ✔ ssl_client_coordinator ················· 35ms
   ├─ ✔ sni.yaml ··························· 20ms
   └─ ✔ ssl_multicert.config ··············· 15ms
   ...

Failed reload — monitor mode:

$ traffic_ctl config reload -t hotfix-ssl-cert -m
✔ Reload scheduled [hotfix-ssl-cert]
✗ [hotfix-ssl-cert] ██████████████░░░░░░ 9/11  fail  (310ms)

  Details : traffic_ctl config status -t hotfix-ssl-cert

Failed reload — status report:

$ traffic_ctl config status -t hotfix-ssl-cert
✗ Reload [fail] — hotfix-ssl-cert
  Started : 2025 Feb 17 14:30:10.500
  Finished: 2025 Feb 17 14:30:10.810
  Duration: 310ms

  ✔ 9 success  ◌ 0 in-progress  ✗ 2 failed  (11 total)

  Tasks:
   ✔ ip_allow.yaml ·························· 18ms
   ✔ remap.config ··························· 42ms
   ✗ logging.yaml ·························· 120ms  ✗ FAIL
   ✗ ssl_client_coordinator ················· 85ms  ✗ FAIL
   ├─ ✔ sni.yaml ··························· 20ms
   └─ ✗ ssl_multicert.config ··············· 65ms  ✗ FAIL
   ...

Inline YAML reload (runtime only, not persisted to disk):

Note: Inline YAML reload is currently disabled — no config handler supports ConfigSource::FileAndRpc
yet. The infrastructure is in place and will be enabled as handlers are migrated. See TODO below.

$ traffic_ctl config reload -d @ip_allow_new.yaml -t update-ip-rules -m
✔ Reload scheduled [update-ip-rules]
✔ [update-ip-rules] ████████████████████ 1/1  success  (18ms)

Note: Inline configuration is NOT persisted to disk.
      Server restart will revert to file-based configuration.

The -d flag accepts @filename to read from a file, or @- to read from stdin. The YAML file
uses registry keys as top-level keys — the key string passed as the first argument to
register_config() or register_record_config(). The content under each key is the actual YAML
that the config file normally contains — it is passed as-is to the handler via ctx.supplied_yaml().
A single file can target multiple handlers:

# reload_rules.yaml — multiple configs in one file
# Each top-level key is a registry key (as declared in register_config()).
# The value is the config content (inner data, not wrapped in the file's top-level key).
ip_allow:
  - apply: in
    ip_addrs: 0.0.0.0/0
    action: allow
    methods: ALL
sni:
  - fqdn: "*.example.com"
    verify_client: NONE
# From file — reloads both ip_allow and sni handlers
$ traffic_ctl config reload -d @reload_rules.yaml -t update-rules -m

# From stdin — pipe YAML directly into ATS
$ cat reload_rules.yaml | traffic_ctl config reload -d @- -m

New traffic_ctl Commands

| Command | Description |
|---|---|
| traffic_ctl config reload | Trigger a file-based reload. Shows token and next-step hints. |
| traffic_ctl config reload -m | Trigger and monitor with a live progress bar. |
| traffic_ctl config reload -s -l | Trigger and immediately show detailed report with logs. |
| traffic_ctl config reload -t <token> | Reload with a custom token. |
| traffic_ctl config reload -d @file.yaml | Inline reload from file (runtime only, not persisted). |
| traffic_ctl config reload -d @- | Inline reload from stdin. |
| traffic_ctl config reload --force | Force a new reload even if one is in progress. |
| traffic_ctl config reload -m -r 0.5 | Monitor with a 500ms polling interval (default: 0.5s). Accepts fractional values. 0 means no wait. |
| traffic_ctl config reload -m -w 2 | Monitor with a 2s initial wait before first poll (default: 2s). |
| traffic_ctl config reload -m -T 30s | Monitor with a 30s client-side timeout. Exits with EX_TEMPFAIL (75) if not done. Requires -m. |
| traffic_ctl config status | Show the last reload status. |
| traffic_ctl config status -t <token> | Show status of a specific reload. |
| traffic_ctl config status -c all | Show full reload history (rolling window, oldest evicted at 100). |
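
How the monitor options -w, -r and -T might interact can be sketched as a small polling loop. This is not the traffic_ctl implementation: monitor_reload, MonitorOptions and the injected poll/sleep callbacks are hypothetical, the exit code for a failed reload is an assumption (only EX_TEMPFAIL is stated above), and whether the timeout includes the initial wait is also assumed.

```cpp
#include <cassert>
#include <functional>
#include <optional>

enum class ReloadState { InProgress, Success, Fail };

constexpr int kExitOk       = 0;
constexpr int kExitFail     = 1;  // assumption: the exit code for a failed reload is not stated in the PR
constexpr int kExitTempFail = 75; // EX_TEMPFAIL, as described for -T

struct MonitorOptions {
  double initial_wait_s = 2.0;     // -w: wait before the first poll
  double interval_s     = 0.5;     // -r: wait between polls (0 means no wait)
  std::optional<double> timeout_s; // -T: client-side timeout, unset means none
};

// Poll until the reload is terminal or the client-side timeout elapses.
// Time is tracked as the sum of requested sleeps, which keeps the sketch
// deterministic; a real client would measure wall-clock time instead.
int monitor_reload(const std::function<ReloadState()> &poll,
                   const std::function<void(double)> &sleep_for,
                   const MonitorOptions &opts)
{
  double waited = 0.0;
  auto wait     = [&](double seconds) {
    if (seconds > 0.0) {
      sleep_for(seconds);
      waited += seconds;
    }
  };

  wait(opts.initial_wait_s);
  for (;;) {
    switch (poll()) {
    case ReloadState::Success:
      return kExitOk;
    case ReloadState::Fail:
      return kExitFail;
    case ReloadState::InProgress:
      break;
    }
    if (opts.timeout_s && waited >= *opts.timeout_s) {
      return kExitTempFail; // the reload itself keeps running on the server
    }
    wait(opts.interval_s);
  }
}
```

Injecting poll and sleep_for as callbacks means the loop can be exercised without a running server or real delays.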

All commands support --format json to output the raw JSONRPC response instead of human-readable
text. This is useful for automation, CI pipelines, or any tool that consumes structured output
directly:

traffic_ctl config status -t reload1 --format json
{
  "tasks": [
    {
      "config_token": "reload1",
      "status": "success",
      "description": "Main reload task - 2026 Feb 18 20:03:10",
      "filename": "",
      "meta": {
        "created_time_ms": "1771444990585",
        "last_updated_time_ms": "1771444991015",
        "main_task": "true"
      },
      "log": [],
      "sub_tasks": [
        {
          "config_token": "reload1",
          "status": "success",
          "description": "ip_allow",
          "filename": "/opt/ats/etc/trafficserver/ip_allow.yaml",
          "meta": {
            "created_time_ms": "1771444991013",
            "last_updated_time_ms": "1771444991015",
            "main_task": "false"
          },
          "log": [],
          "logs": [
            "Finished loading"
          ],
          "sub_tasks": []
        },
        {
          "config_token": "reload1",
          "status": "success",
          "description": "ssl_ticket_key",
          "filename": "",
          "meta": {
            "created_time_ms": "1771444991015",
            "last_updated_time_ms": "1771444991015",
            "main_task": "false"
          },
          "log": [],
          "logs": [
            "SSL ticket key reloaded"
          ],
          "sub_tasks": []
        }
      ]
    }
  ]
}
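
A consumer of this JSON can derive a task's elapsed time from the two meta timestamps, which arrive as strings. The sketch below assumes elapsed time is last_updated_time_ms minus created_time_ms (the PR does not state how the CLI computes its Duration line); TaskMeta and elapsed_ms are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Mirrors the two "meta" fields from the JSON output above.
// The values are kept as strings because that is how the response encodes them.
struct TaskMeta {
  std::string created_time_ms;
  std::string last_updated_time_ms;
};

// Assumption: elapsed time is the difference between the two timestamps.
// std::stoll throws std::invalid_argument on non-numeric input.
std::int64_t elapsed_ms(const TaskMeta &meta)
{
  return std::stoll(meta.last_updated_time_ms) - std::stoll(meta.created_time_ms);
}
```

With the main task's values from the example, the difference is 430 ms; for the ip_allow sub-task it is 2 ms.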

New JSONRPC APIs

| Method | Description |
|---|---|
| admin_config_reload | Unified reload: file-based (default) or inline when configs param is present. Params: token, force, configs. |
| get_reload_config_status | Query reload status by token or get the last N reloads via count. |

Inline reload RPC example:

jsonrpc: "2.0"
method: "admin_config_reload"
params:
  token: "update-ip-and-sni"
  configs:
    ip_allow:
      - apply: in
        ip_addrs: 0.0.0.0/0
        action: allow
        methods: ALL
    sni:
      - fqdn: "*.example.com"
        verify_client: NONE

Background: Issues with the Previous Reload Mechanism

The previous configuration reload relied on a loose collection of independent record callbacks
(RecRegisterConfigUpdateCb) wired through FileManager and AddConfigFilesHere.cc. Each config
module registered its file independently, and reloads were fire-and-forget:

  • No visibility — There was no way to know whether a reload succeeded or failed, which handlers
    ran, or how long each one took.
  • No coordination — Each handler ran independently with no shared context. There was no concept of
    a "reload session" grouping all config updates triggered by a single request.
  • No inline content — Configuration could only be reloaded from files on disk. There was no way
    to push YAML content at runtime through the RPC or CLI.
  • Scattered registration — File registrations were split between AddConfigFilesHere.cc (for
    FileManager) and individual modules (for record callbacks), making it hard to reason about which
    files were tracked and which records triggered reloads.
  • No token tracking — There was no identifier for a reload operation, so you couldn't query the
    status of a specific reload or distinguish between overlapping reloads.

What the New Design Solves

  1. Full reload traceability: Every reload gets a token. Each config handler reports its status
    (in_progress, success, fail) through a ConfigContext. Results are aggregated into a task
    tree with per-handler timings and logs.
  2. Centralized registration: ConfigRegistry is the single source of truth for all config files,
    their filename records, trigger records, and reload handlers.
  3. Inline YAML injection: Handlers that opt in (ConfigSource::FileAndRpc) can receive YAML
    content directly through the RPC, without writing to disk. This is runtime-only: the content
    lives in memory and is lost on restart.
  4. Coordinated reload sessions: ReloadCoordinator manages the lifecycle of each reload:
    token generation, concurrency control (--force to override), timeout detection, and history.
  5. CLI observability: traffic_ctl config reload -m shows a live progress bar.
    traffic_ctl config status provides a full post-mortem with task tree, durations, and failure
    details.

Basic Design

┌─────────────┐   JSONRPC    ┌────────────────┐
│ traffic_ctl  │────────────►│  RPC Handler   │
│ config reload│             │  reload_config  │
└─────────────┘             └───────┬────────┘
                                    │
          ┌─────────────────────────┼─────────────────────────┐
          │                         ▼                         │
          │              ┌──────────────────┐                 │
          │              │ ReloadCoordinator │                 │
          │              │  - prepare_reload │                 │
          │              │  - token tracking │                 │
          │              │  - history        │                 │
          │              └────────┬─────────┘                 │
          │                       │                           │
          │           ┌───────────┴──────────┐                │
          │           ▼                      ▼                │
          │   ┌──────────────┐      ┌──────────────────┐      │
          │   │ File-based   │      │ Inline mode      │      │
          │   │ FileManager  │      │ set_passed_config │      │
          │   │ rereadConfig │      │ schedule_reload   │      │
          │   └──────┬───────┘      └────────┬─────────┘      │
          │          │                       │                │
          │          └───────────┬───────────┘                │
          │                      ▼                            │
          │            ┌──────────────────┐                   │
          │            │  ConfigRegistry  │                   │
          │            │  execute_reload  │                   │
          │            └────────┬─────────┘                   │
          │                     ▼                             │
          │           ┌──────────────────┐                    │
          │           │  ConfigContext   │                    │
          │           │  - in_progress() │                    │
          │           │  - log()         │                    │
          │           │  - complete()    │                    │
          │           │  - fail()        │                    │
          │           │  - supplied_yaml()│                   │
          │           └────────┬─────────┘                    │
          │                    ▼                              │
          │           ┌──────────────────┐                    │
          │           │  Handler         │                    │
          │           │  (IpAllow, SNI,  │                    │
          │           │   remap, etc.)   │                    │
          │           └──────────────────┘                    │
          └───────────────────────────────────────────────────┘

Key components:

| Component | Role |
|---|---|
| ConfigRegistry | Singleton registry mapping config keys to handlers, filenames, trigger records. Self-registers with FileManager. |
| ReloadCoordinator | Manages reload sessions: token generation, concurrency, timeout detection, history. |
| ConfigReloadTask | Tracks a single reload operation as a tree of sub-tasks with status, timings, and logs. |
| ConfigContext | Lightweight context passed to handlers. Provides in_progress(), complete(), fail(), log(), supplied_yaml(), and add_dependent_ctx(). Safe no-op at startup (no active reload task). |
| ConfigReloadProgress | Periodic checker that detects stuck tasks and marks them as TIMEOUT. |

Thread model:

All reload work runs on ET_TASK threads — never on the RPC thread or event-loop threads:

  1. RPC thread: receives the JSONRPC request (admin_config_reload), creates the reload token
    and task via ReloadCoordinator::prepare_reload(), then schedules the actual work on ET_TASK
    and returns immediately. The RPC response (with the token) is sent back to traffic_ctl before
    any handler runs.
  2. ET_TASK, file-based reload: ReloadWorkContinuation fires on ET_TASK. It calls
    FileManager::rereadConfig(), which walks every registered file and invokes
    ConfigRegistry::execute_reload() for each changed config. Each handler runs synchronously
    on this thread.
  3. ET_TASK, inline (RPC) reload: ScheduledReloadContinuation fires on ET_TASK. It calls
    ConfigRegistry::execute_reload() directly for the targeted config key(s). Same synchronous
    execution.
  4. Deferred handlers: Some handlers (e.g., LogConfig) schedule work on other threads and
    return before completion. The ConfigContext they hold remains valid (weak pointer + ref-counted
    YAML node), and they must call ctx.complete() or ctx.fail() from whatever thread finishes
    the work. If they don't, the timeout checker marks the task as TIMEOUT.
  5. Timeout checker: ConfigReloadProgress is a per-reload continuation on ET_TASK that polls
    periodically and marks stuck tasks as TIMEOUT.

Handlers block ET_TASK while they run. A slow handler delays all subsequent handlers in the same
reload cycle. This is the same behavior as the previous mechanism — the difference is that now it's
visible through the task tree and timings.
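
The "return the token first, run handlers later, one after another" behavior can be illustrated without real threads. In this sketch TaskQueue is a hypothetical stand-in for ET_TASK and request_reload stands in for the RPC handler; none of these names come from the PR.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the ET_TASK work queue. No real threads are used,
// so the order of events is easy to observe.
class TaskQueue {
public:
  void schedule(std::function<void()> work) { pending_.push_back(std::move(work)); }

  // Runs queued work in FIFO order, one item at a time.
  void run_pending()
  {
    while (!pending_.empty()) {
      auto work = std::move(pending_.front());
      pending_.pop_front();
      work();
    }
  }

private:
  std::deque<std::function<void()>> pending_;
};

struct ReloadLog {
  std::vector<std::string> order; // handlers in the order they ran
  bool done = false;
};

// "RPC thread" side: queue the reload and hand the token back at once.
// Handlers run later, sequentially, so a slow one delays the ones after it.
std::string request_reload(TaskQueue &queue, ReloadLog &log, const std::string &token,
                           const std::vector<std::string> &handlers)
{
  queue.schedule([&log, handlers]() {
    for (const auto &name : handlers) {
      log.order.push_back(name); // placeholder for invoking the real handler
    }
    log.done = true;
  });
  return token;
}
```

The caller holds the token before any handler has run, which matches the ordering described in step 1 above.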

Stuck reload checker:
ConfigReloadProgress is a periodic continuation scheduled on ET_TASK. It monitors active reload
tasks and marks any that exceed the configured timeout as TIMEOUT. This acts as a safety net for
handlers that fail to call ctx.complete() or ctx.fail() — for example, if a handler crashes,
deadlocks, or its deferred thread never executes. The checker reads proxy.config.admin.reload.timeout
dynamically at each interval, so the timeout can be adjusted at runtime without a restart. This is
a simple record read (RecGetRecordString), not an expensive operation. Setting the
timeout to "0" disables it (tasks will run indefinitely until completion).
The checker is not a global poller — a new instance is created per-reload and self-terminates once
the task reaches a terminal state. No idle polling when no reload is in progress.
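
One tick of such a checker might look like the sketch below. It is an illustration only: ReloadTask and check_once are hypothetical, and measuring the timeout from task creation (rather than from the last update) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

enum class Status { Created, InProgress, Success, Fail, Timeout };

struct ReloadTask {
  Status status           = Status::InProgress;
  std::int64_t created_ms = 0;
};

inline bool is_terminal(Status s)
{
  return s == Status::Success || s == Status::Fail || s == Status::Timeout;
}

// One tick of a per-reload checker. Returns true if it should be rescheduled.
// timeout_ms is passed in on every tick, modeling a value that is re-read at
// each interval; 0 disables the timeout.
bool check_once(ReloadTask &task, std::int64_t now_ms, std::int64_t timeout_ms)
{
  if (is_terminal(task.status)) {
    return false; // nothing left to watch: the checker ends itself
  }
  if (timeout_ms > 0 && now_ms - task.created_ms >= timeout_ms) {
    task.status = Status::Timeout; // assumption: elapsed time counts from creation
    return false;
  }
  return true; // still running, poll again after the check interval
}
```

Because the function returns false once the task is terminal, no polling remains when no reload is in progress.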

How Handlers Work

Before — scattered registration (ip_allow example):

Registration was split across multiple files with no centralized tracking:

// 1. AddConfigFilesHere.cc — register with FileManager for mtime detection
registerFile("proxy.config.cache.ip_allow.filename", ts::filename::IP_ALLOW, NOT_REQUIRED);
registerFile("proxy.config.cache.ip_categories.filename", ts::filename::IP_CATEGORIES, NOT_REQUIRED);

// 2. IPAllow.cc — attach record callback (fire-and-forget, no status tracking)
ConfigUpdateHandler<IpAllow> *ipAllowUpdate = new ConfigUpdateHandler<IpAllow>("ip_allow");
ipAllowUpdate->attach("proxy.config.cache.ip_allow.filename");

// 3. IpAllow::reconfigure() — no context, no status, no tracing
void IpAllow::reconfigure() {
    // ... load config from disk, no way to report success/failure ...
}

Now — each module self-registers with full tracing:

Each module registers itself directly with ConfigRegistry. No more separate AddConfigFilesHere.cc
entry — the registry handles FileManager registration, record callbacks, and status tracking
automatically:

// IPAllow.cc — one call replaces all three steps above
config::ConfigRegistry::Get_Instance().register_config(
    "ip_allow",                                           // registry key
    ts::filename::IP_ALLOW,                               // default filename
    "proxy.config.cache.ip_allow.filename",               // record holding filename
    [](ConfigContext ctx) { IpAllow::reconfigure(ctx); }, // handler with context
    config::ConfigSource::FileOnly,                       // content source
    {"proxy.config.cache.ip_allow.filename"});            // trigger records

// Auxiliary file — attach ip_categories as a dependency (changes trigger ip_allow reload)
config::ConfigRegistry::Get_Instance().add_file_dependency(
    "ip_allow",                                  // config key to attach to
    "proxy.config.cache.ip_categories.filename", // record holding the filename
    ts::filename::IP_CATEGORIES,                 // default filename
    false);                                      // not required

Additional triggers can be attached from any module at any time:

// From another module — attach an extra record trigger
config::ConfigRegistry::Get_Instance().attach("ip_allow", "proxy.config.some.extra.record");
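
To make the registration model concrete, here is a much-reduced sketch of a key-to-handler registry with record triggers. MiniRegistry is hypothetical and deliberately simpler than ConfigRegistry: handlers take no ConfigContext, there is no FileManager integration, and rejecting attach() for an unknown key is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, reduced model of a config registry: one handler per key,
// plus a reverse index from trigger record to the keys it should reload.
class MiniRegistry {
public:
  using Handler = std::function<void()>;

  void register_config(const std::string &key, Handler handler,
                       const std::vector<std::string> &trigger_records)
  {
    entries_[key] = std::move(handler);
    for (const auto &record : trigger_records) {
      attach(key, record);
    }
  }

  // Wire an extra record to an existing key. Returns false for unknown keys
  // (assumption: the real registry's behavior here is not described).
  bool attach(const std::string &key, const std::string &record)
  {
    if (entries_.count(key) == 0) {
      return false;
    }
    triggers_[record].insert(key);
    return true;
  }

  // Run every handler wired to the changed record; returns how many ran.
  int on_record_change(const std::string &record) const
  {
    auto it = triggers_.find(record);
    if (it == triggers_.end()) {
      return 0;
    }
    int ran = 0;
    for (const auto &key : it->second) {
      entries_.at(key)();
      ++ran;
    }
    return ran;
  }

private:
  std::map<std::string, Handler> entries_;
  std::map<std::string, std::set<std::string>> triggers_;
};
```

Keeping the reverse index in one place is what makes it possible to answer "which records trigger which reloads" from a single structure.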

Composite configs can declare file dependencies and dependency keys. For example, SSLClientCoordinator
owns sni.yaml and ssl_multicert.config as children:

// Main registration (no file of its own — it's a pure coordinator)
// From SSLClientCoordinator::startup()
config::ConfigRegistry::Get_Instance().register_record_config(
    "ssl_client_coordinator",                                          // registry key
    [](ConfigContext ctx) { SSLClientCoordinator::reconfigure(ctx); }, // reload handler
    {"proxy.config.ssl.client.cert.path",                              // trigger records
     "proxy.config.ssl.client.cert.filename", "proxy.config.ssl.client.private_key.path",
     "proxy.config.ssl.client.private_key.filename", "proxy.config.ssl.keylog_file",
     "proxy.config.ssl.server.cert.path", "proxy.config.ssl.server.private_key.path",
     "proxy.config.ssl.server.cert_chain.filename",
     "proxy.config.ssl.server.session_ticket.enable"});

// Track sni.yaml — FileManager watches for mtime changes, record wired to trigger reload
config::ConfigRegistry::Get_Instance().add_file_and_node_dependency(
    "ssl_client_coordinator", "sni",
    "proxy.config.ssl.servername.filename", ts::filename::SNI, false);

// Track ssl_multicert.config — same pattern
config::ConfigRegistry::Get_Instance().add_file_and_node_dependency(
    "ssl_client_coordinator", "ssl_multicert",
    "proxy.config.ssl.server.multicert.filename", ts::filename::SSL_MULTICERT, false);

Handler interaction with ConfigContext:

Each config module implements a C++ reload handler — the callback passed to register_config().
The handler reports progress through the ConfigContext:

void IpAllow::reconfigure(ConfigContext ctx) {
    ctx.in_progress();

    // ... load config from disk ...

    ctx.complete("Loaded successfully");
    // or on error:
    // ctx.fail(errata, "Failed to load");
}

When a reload fires, the handler receives a ConfigContext:

  • File source: ctx.supplied_yaml() is undefined; the handler reads from its registered file on disk.
  • RPC source: ctx.supplied_yaml() contains the YAML node passed via --data / RPC.
    The content is runtime-only and is never written to disk.

Handlers report progress:

ctx.in_progress("Parsing ip_allow.yaml");
ctx.log("Loaded 42 rules");
ctx.complete("Finished loading");
// or on error:
ctx.fail(errata, "Failed to load ip_allow.yaml");

Supplied YAML — inline content via -d / RPC:

Note: The infrastructure for RPC-supplied YAML is fully implemented, but no handler currently
opts into ConfigSource::FileAndRpc. File-based handlers use ConfigSource::FileOnly, and
record-only handlers use ConfigSource::RecordOnly (implicitly via register_record_config()).

When a handler opts into ConfigSource::FileAndRpc, it can receive YAML content directly instead
of reading from disk. The handler checks ctx.supplied_yaml() to determine the source:

void IpAllow::reconfigure(ConfigContext ctx) {
    ctx.in_progress();

    YAML::Node root;
    if (auto yaml = ctx.supplied_yaml()) {
        // Inline mode: YAML supplied via -d flag or JSONRPC.
        // Not persisted to disk — runtime only.
        root = yaml;
    } else {
        // File mode: read from the registered config file on disk.
        root = YAML::LoadFile(config_filename);
    }

    // ... parse and apply config ...

    ctx.complete("Loaded successfully");
}

For composite configs (e.g., SSLClientCoordinator), handlers create child contexts to track
each sub-config independently. From SSLClientCoordinator::reconfigure():

SSLConfig::reconfigure(reconf_ctx.add_dependent_ctx("SSLConfig"));
SNIConfig::reconfigure(reconf_ctx.add_dependent_ctx("SNIConfig"));
SSLCertificateConfig::reconfigure(reconf_ctx.add_dependent_ctx("SSLCertificateConfig"));
reconf_ctx.complete("SSL configs reloaded");

The parent task automatically aggregates status from its children. In traffic_ctl config status,
this renders as a tree:

   ✔ ssl_client_coordinator ················· 35ms
   ├─ ✔ SSLConfig ·························· 10ms
   ├─ ✔ SNIConfig ·························· 12ms
   └─ ✔ SSLCertificateConfig ·············· 13ms

Design Challenges

1. Handlers must reach a terminal state — or the task hangs

The entire tracing model relies on handlers calling ctx.complete() or ctx.fail() before
returning. If a handler returns without reaching a terminal state, the task stays IN_PROGRESS
indefinitely until the timeout checker marks it as TIMEOUT.

After execute_reload() calls the handler, it checks ctx.is_terminal() and emits a warning
if the handler left the task in a non-terminal state:

entry_copy.handler(ctx);
if (!ctx.is_terminal()) {
    Warning("Config '%s' handler returned without reaching a terminal state. "
            "If the handler deferred work to another thread, ensure ctx.complete() or ctx.fail() "
            "is called when processing finishes; otherwise the task will remain in progress "
            "until the timeout checker marks it as TIMEOUT.",
            entry_copy.key.c_str());
}

The safety net: ConfigReloadProgress runs periodically on ET_TASK and marks stuck tasks as
TIMEOUT after the configured duration (proxy.config.admin.reload.timeout, default: 1h).

2. Parent status aggregation from sub-tasks

Parent tasks do not track their own status directly — they derive it from their children.
When a child calls complete() or fail(), it notifies its parent, which re-evaluates:

  • Any child failed or timed out → parent is FAIL
  • Any child still in progress → parent stays IN_PROGRESS
  • All children succeeded → parent is SUCCESS

This aggregation is recursive: a sub-task can have its own children (e.g.,
ssl_client_coordinator → sni + ssl_multicert), and status bubbles up through the tree.

One subtle issue: if a handler creates child contexts but forgets to call complete() or
fail() on one of them, that child stays CREATED and the parent never reaches SUCCESS.
It is the handler developer's responsibility to ensure every ConfigContext (and its children)
reaches a terminal state (complete() or fail()). The timeout checker is the ultimate safety
net for cases where this is not properly handled.
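
The aggregation rules above can be written as a single function. This is a sketch, not the PR's aggregate_status(): the aggregate name, the rule that failure takes precedence over in-progress children, and treating CREATED like IN_PROGRESS are assumptions drawn from the order of the bullets and the note about forgotten children.

```cpp
#include <cassert>
#include <vector>

enum class Status { Created, InProgress, Success, Fail, Timeout };

// Derive a parent's status from its children.
// Assumptions: failure or timeout wins over pending work, and a CREATED child
// counts as pending, so the parent cannot reach SUCCESS while one exists.
Status aggregate(const std::vector<Status> &children)
{
  bool pending = false;
  for (Status child : children) {
    if (child == Status::Fail || child == Status::Timeout) {
      return Status::Fail;
    }
    if (child == Status::Created || child == Status::InProgress) {
      pending = true;
    }
  }
  return pending ? Status::InProgress : Status::Success;
}
```

For a tree, the same function would be applied bottom-up, with each sub-task's derived status feeding its own parent.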

3. Startup vs. reload — same handler, different context

Handlers are called both at startup (initial config load) and during runtime reloads. At startup,
there is no active ReloadCoordinator task, so all ConfigContext operations (in_progress(),
complete(), fail(), log()) are safe no-ops — they check _task.lock() and return
immediately if the weak pointer is expired or empty.

This avoids having two separate code paths for startup vs. reload. The handler logic is identical
in both cases:

void IpAllow::reconfigure(ConfigContext ctx) {
    ctx.in_progress();  // no-op at startup, tracks progress during reload
    // ... load config ...
    ctx.complete();     // no-op at startup, marks task as SUCCESS during reload
}
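
The no-op behavior follows naturally from holding the task through a weak pointer. A minimal sketch, assuming a simplified task with only a status and a log; MiniContext and Task are hypothetical stand-ins for ConfigContext and ConfigReloadTask:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

enum class Status { Created, InProgress, Success, Fail };

struct Task {
  Status status = Status::Created;
  std::vector<std::string> log;
};

// Hypothetical, reduced context: every operation first tries to lock the weak
// pointer and silently does nothing when there is no live task behind it.
class MiniContext {
public:
  MiniContext() = default; // startup path: no task attached
  explicit MiniContext(const std::shared_ptr<Task> &task) : task_(task) {}

  void in_progress() { set(Status::InProgress, ""); }
  void complete(const std::string &msg = "") { set(Status::Success, msg); }
  void fail(const std::string &msg = "") { set(Status::Fail, msg); }

private:
  void set(Status status, const std::string &msg)
  {
    if (auto task = task_.lock()) { // empty or expired: fall through as a no-op
      task->status = status;
      if (!msg.empty()) {
        task->log.push_back(msg);
      }
    }
  }

  std::weak_ptr<Task> task_;
};

// The same handler body serves both startup and reload.
void reconfigure(MiniContext ctx)
{
  ctx.in_progress();
  ctx.complete("Loaded successfully");
}
```

Calling reconfigure(MiniContext{}) models startup, while reconfigure(MiniContext{task}) models a tracked reload and leaves the task in the success state.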

4. Duplicate handler execution for multi-record triggers (fixed)

ssl_client_coordinator registers multiple trigger records and file dependencies, each wiring an
independent on_record_change callback. When several of these fire during the same reload, the
handler executes more than once, producing duplicate entries in the reload status output.
This is a pre-existing issue on master (see #11724). It was not visible there because the old
reload path had no status tracking; the new framework's task tree makes the duplicate executions
observable. The issue is resolved in this PR.

Note: This fix exists only in this PR — it relies on the ConfigReloadTask subtask tree
and config_key tracking introduced by the new framework. The old reload path on master has
no equivalent structure to deduplicate against; each RecRegisterConfigUpdateCb callback fires
independently with no shared state. A fix on master would require adding an atomic<bool>
CAS flag per config key, but that approach risks lost updates when handlers reschedule work
by modifying their own trigger records during reconfiguration.

5. Plugin support

Plugins are not supported by ConfigRegistry in this PR. The legacy reload notification
mechanism (TSMgmtUpdateRegister) still works — plugins registered through it will continue
to be invoked via FileManager::invokeConfigPluginCallbacks() during every reload cycle.
A dedicated plugin API to let plugins register their own config handlers and participate in
the reload framework will be addressed in a separate PR.

6. Subtask reservation — ensuring all handlers are tracked from the start

During a file-based reload, handlers are discovered in two phases:

  • File-based handlers (e.g., ip_allow, remap) run synchronously inside
    FileManager::rereadConfig(). Their subtasks are created and completed immediately on the
    ET_TASK thread that drives the reload.
  • Record-triggered handlers (e.g., cache_control, ssl_ticket_key,
    ssl_client_coordinator) are activated by record callbacks. When rereadConfig() detects
    changed files, it calls RecSetSyncRequired() to flag the associated records as dirty.

To avoid a gap between file-based completions and record-triggered registrations,
RecFlushConfigUpdateCbs() is called immediately after rereadConfig(). This synchronously
fires all pending record callbacks instead of waiting for the next config_update_cont tick
(~3s). Each on_record_change() callback calls reserve_subtask() to pre-register a CREATED
subtask on the main task, then schedules the actual handler on ET_TASK.

The result is that all subtasks — both file-based and record-triggered — are registered before
the reload executor returns. The task count is stable from the first status poll:

  1. rereadConfig() finishes — file-based subtasks are created and completed.
  2. RecFlushConfigUpdateCbs() fires — on_record_change() callbacks run synchronously,
    each calling reserve_subtask() to register a CREATED subtask.
  3. Main task sees CREATED children alongside completed children — stays IN_PROGRESS.
  4. Record-triggered continuations run on ET_TASK — activate their reserved subtasks via
    create_config_context(), execute the handler, and complete the subtask.
  5. When the last handler completes — main task reaches SUCCESS.

As a safety net, add_sub_task() also calls aggregate_status() when the parent has already
reached SUCCESS, reverting it to IN_PROGRESS. This handles any edge case where a subtask is
registered after all known work has completed (e.g., from a plugin or deferred callback).
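
The reservation flow can be sketched with a main task that derives its status from its children on demand. MainTask is hypothetical; in particular, ignoring a second reservation for the same key is an assumption about how key tracking prevents duplicates, not a description of the PR's code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

enum class Status { Created, InProgress, Success, Fail };

// Hypothetical main task keyed by config key.
class MainTask {
public:
  // Pre-register a subtask in CREATED state. Returns false if the key is
  // already tracked (assumption: this is how duplicates are avoided).
  bool reserve_subtask(const std::string &key)
  {
    return children_.emplace(key, Status::Created).second;
  }

  void set_status(const std::string &key, Status status) { children_[key] = status; }

  std::size_t total() const { return children_.size(); }

  // Status is computed from the children each time, so a subtask added after
  // everything else succeeded moves the main task back to IN_PROGRESS.
  Status status() const
  {
    bool pending = false;
    for (const auto &entry : children_) {
      if (entry.second == Status::Fail) {
        return Status::Fail;
      }
      if (entry.second != Status::Success) {
        pending = true;
      }
    }
    return pending ? Status::InProgress : Status::Success;
  }

private:
  std::map<std::string, Status> children_;
};
```

In this model the total is fixed as soon as the reservations are made, which is why the first status poll already sees the final task count.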


Configs Migrated to ConfigRegistry

| Config | Key | File |
|---|---|---|
| IP Allow | ip_allow | ip_allow.yaml |
| IP Categories | (dependency of ip_allow) | ip_categories.yaml |
| Cache Control | cache_control | cache.config |
| Cache Hosting | cache_hosting | hosting.config |
| Parent Selection | parent_proxy | parent.config |
| Split DNS | split_dns | splitdns.config |
| Remap | remap | remap.config ¹ |
| Logging | logging | logging.yaml |
| SSL/TLS (coordinator) | ssl_client_coordinator | |
| SNI | (dependency of ssl_client_coordinator) | sni.yaml |
| SSL Multicert | (dependency of ssl_client_coordinator) | ssl_multicert.config |
| SSL Ticket Key | ssl_ticket_key | (record-only, no file) |
| Records | records | records.yaml |
| Storage | (static, no handler) | storage.config |
| Socks | (static, no handler) | socks.config |
| Volume | (static, no handler) | volume.config |
| Plugin | (static, no handler) | plugin.config |
| JSONRPC | (static, no handler) | jsonrpc.yaml |

¹ Remap migration will be refactored after #12813 (remap.yaml)
and #12669 (virtual hosts) land.

New Configuration Records

records:
  admin:
    reload:
      # Maximum time a reload task can run before being marked as TIMEOUT.
      # Supports duration strings: "30s", "5min", "1h". Set to "0" to disable.
      # Default: 1h. Updateable at runtime (RECU_DYNAMIC).
      timeout: 1h

      # How often the progress checker polls for stuck tasks (minimum: 1s).
      # Supports duration strings: "1s", "5s", "30s".
      # Default: 2s. Updateable at runtime (RECU_DYNAMIC).
      check_interval: 2s
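
For readers wondering how such duration strings map to values, here is a small parser sketch. It is not the ATS parser: parse_duration is hypothetical, it only handles the three suffixes named in the comments above ("s", "min", "h") plus a bare "0", and rejecting any other bare number is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <chrono>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical parser for strings such as "30s", "5min", "1h" and "0".
// Returns std::nullopt for anything it does not recognize.
std::optional<std::chrono::seconds> parse_duration(const std::string &text)
{
  std::size_t pos = 0;
  long long value = 0;
  while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos]))) {
    value = value * 10 + (text[pos] - '0');
    ++pos;
  }
  if (pos == 0) {
    return std::nullopt; // no leading number
  }

  const std::string unit = text.substr(pos);
  if (unit.empty()) {
    // Assumption: only a bare "0" (meaning "disabled") is accepted without a unit.
    return value == 0 ? std::optional<std::chrono::seconds>(std::chrono::seconds(0)) : std::nullopt;
  }
  if (unit == "s") {
    return std::chrono::seconds(value);
  }
  if (unit == "min") {
    return std::chrono::seconds(value * 60);
  }
  if (unit == "h") {
    return std::chrono::seconds(value * 3600);
  }
  return std::nullopt;
}
```

A caller would treat a parsed value of zero seconds as "timeout disabled" for the timeout record.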

TODO

  • Clean up - Run some clean up on this code.
  • Documentation — User-facing docs added: traffic_ctl.en.rst (commands & options), jsonrpc-api.en.rst (admin_config_reload / get_reload_config_status), and config-reload-framework.en.rst (developer guide).
  • Remove legacy reload infrastructure: ConfigUpdateHandler/ConfigUpdateContinuation removed from ConfigProcessor.h. Remaining registerFile() calls in AddConfigFilesHere.cc will be retired via inventory-only ConfigRegistry entries (see below).
  • Consolidate AddConfigFilesHere.cc into ConfigRegistry — Static files (storage.config, volume.config, plugin.config, etc.) registered as inventory-only entries. records.yaml registered with reload handler. AddConfigFilesHere.cc removed.
  • Additional tests — Expand autest coverage.

Future Work (separate PRs)

  • Improve error reporting — All config loaders are migrated to ConfigRegistry. Remaining work: fully log detailed errors via ctx.log(), ctx.fail(), etc.
  • Enable inline YAML for more handlers — Currently file-based handlers use ConfigSource::FileOnly and record-only handlers use ConfigSource::RecordOnly. Migrate file-based handlers to FileAndRpc so they can read YAML directly from the RPC (via ctx.supplied_yaml()).
  • Autest reload extension — Implement an autest extension that checks reload success/failure via the JSONRPC status API (traffic_ctl config status -t <token>) instead of grepping log files.
  • Trace record-triggered reloads — Record-based reloads (via trigger_records / RecRegisterConfigUpdateCb) are not currently tracked. Create a main task with a synthetic token so they appear in traffic_ctl config status.
  • Expose ConfigRegistry to plugins — Add a plugin API so plugins can register their own config handlers and participate in the reload framework.

Dependencies and Related Issues

Cherry-pick #12934 for an ArgParser fix.

Fixes #12324 — Improving traffic_ctl config reload.

This PR will likely land after:

There should be no major conflicts with those PRs, but discussion and coordination are needed before merging.

@brbzull0 brbzull0 self-assigned this Feb 18, 2026
@brbzull0 brbzull0 added the Configuration, JSONRPC (JSONRPC 2.0 related work), and traffic_ctl (traffic_ctl related work) labels Feb 18, 2026
@brbzull0 brbzull0 changed the title WIP - Traceable Configuration Reload Traceable Configuration Reload Feb 18, 2026
@brbzull0 brbzull0 force-pushed the detached_config_reload branch from a48a5dc to 8f9af3c Compare February 18, 2026 19:49
@brbzull0 brbzull0 changed the title Traceable Configuration Reload ATS Configuration Reload with observability/tracing Feb 20, 2026
@brbzull0 brbzull0 changed the title ATS Configuration Reload with observability/tracing ATS Configuration Reload with observability/tracing - Token model Feb 23, 2026
@brbzull0 brbzull0 force-pushed the detached_config_reload branch from 7d7ac21 to 1ffaeda Compare February 23, 2026 17:11
@brbzull0 brbzull0 marked this pull request as ready for review February 23, 2026 22:04
@cmcfarlen cmcfarlen self-requested a review February 23, 2026 22:48
@cmcfarlen cmcfarlen modified the milestone: 11.0.0 Feb 23, 2026
@brbzull0 brbzull0 force-pushed the detached_config_reload branch from 2e4c04f to c84487c Compare February 24, 2026 09:39
@brbzull0
Contributor Author

[approve ci autest 1]

@brbzull0 brbzull0 force-pushed the detached_config_reload branch from c84487c to 55ee35b Compare February 24, 2026 11:39
@brbzull0 brbzull0 changed the base branch from master to 11-Dev February 24, 2026 11:39
brbzull0 added 6 commits March 2, 2026 14:15
validate_dependencies() incorrectly triggers on options that have
default values but were not explicitly passed by the user.

The root cause is that append_option_data() populates default values
into the Arguments map before validate_dependencies() runs. When
validate_dependencies() calls ret.get(key) for an option with a
default, the lookup finds the entry and sets _is_called = true,
making the option appear "used" even though the user never specified
it on the command line.

Fix by extracting the default-value loop into apply_option_defaults()
and calling it after validate_dependencies() in parse().
Pre-register subtasks via reserve_subtask() when on_record_change()
fires, preventing the main task from reaching SUCCESS before all
record-triggered handlers have registered. Call RecFlushConfigUpdateCbs()
after rereadConfig() to process callbacks synchronously instead of
waiting for the 3s timer.
@brbzull0 brbzull0 force-pushed the detached_config_reload branch from ad90671 to d0fe21c Compare March 2, 2026 14:23
brbzull0 added 2 commits March 2, 2026 16:59
The test waited for 'sni.yaml finished loading' to appear 3 times
in the diags log before proceeding after a config reload. The 3rd
occurrence relied on duplicate handler execution from multiple
trigger records firing independently — a pre-existing bug now fixed
by the ConfigRegistry deduplication logic.

With dedup, ssl_client_coordinator fires exactly once per reload
cycle, so the expected count is 2 (startup + one reload).
@zwoop zwoop requested a review from serrislew March 9, 2026 19:43
Contributor

@cmcfarlen cmcfarlen left a comment

I tried this out and it seems to work! Pretty sweet.

Does this really need to wait on those two PRs mentioned in the description?

@bryancall bryancall requested a review from Copilot March 9, 2026 22:34
Contributor

Copilot AI left a comment

Pull request overview

This PR implements a comprehensive configuration reload framework for Apache Traffic Server that replaces the previous fire-and-forget mechanism with a token-based, observable reload system. The new system provides full traceability of reload operations, centralized config registration through ConfigRegistry, and CLI/RPC-based monitoring and status querying.

Changes:

  • Introduces ConfigRegistry, ReloadCoordinator, ConfigReloadTask, and ConfigContext as core framework components that centralize config file registration, track reload sessions with tokens, and provide status reporting through task trees
  • Migrates all existing config handlers (ip_allow, sni, ssl, remap, parent, cache, logging, splitdns, etc.) from the old ConfigUpdateHandler/ConfigUpdateContinuation pattern to the new ConfigRegistry pattern with ConfigContext callbacks
  • Adds new traffic_ctl commands (config reload with --monitor, --token, --data, --show-details options; config status with --token, --count) and JSONRPC APIs (admin_config_reload, get_reload_config_status) along with an ArgParser fix for default value dependency validation

Reviewed changes

Copilot reviewed 100 out of 101 changed files in this pull request and generated 19 comments.

| File | Description |
|---|---|
| include/mgmt/config/ConfigContext.h | New context class passed to reload handlers for status tracking and inline YAML support |
| include/mgmt/config/ReloadCoordinator.h | Singleton managing reload sessions, tokens, concurrency, and history |
| include/mgmt/config/ConfigReloadErrors.h | Error code enum shared between server and client |
| include/mgmt/config/ConfigReloadExecutor.h | Header for async reload scheduling on ET_TASK |
| include/mgmt/config/FileManager.h | Replaced ConfigUpdateCbTable* with std::function<void()> for plugin callbacks |
| include/records/RecCore.h | Added RecFlushConfigUpdateCbs() and updated RecConfigWarnIfUnregistered signature |
| include/records/YAMLConfigReloadTaskEncoder.h | YAML encoder for task info in JSONRPC responses |
| include/iocore/eventsystem/ConfigProcessor.h | Removed legacy ConfigUpdateHandler/ConfigUpdateContinuation templates |
| include/proxy/*.h, include/iocore/net/*.h, include/iocore/dns/*.h | Updated reconfigure() signatures to accept ConfigContext |
| include/shared/rpc/yaml_codecs.h | Added default parameter to try_extract helper |
| include/tscore/ArgParser.h | Added apply_option_defaults() method |
| src/mgmt/config/ConfigContext.cc | ConfigContext implementation (status tracking, dependent contexts, YAML support) |
| src/mgmt/config/ReloadCoordinator.cc | Reload lifecycle management, token generation, history, deduplication |
| src/mgmt/config/ConfigReloadExecutor.cc | Schedules reload work on ET_TASK with FileManager integration |
| src/mgmt/config/ConfigRegistry.cc | New centralized config registry (not shown in diffs but referenced) |
| src/mgmt/config/FileManager.cc | Delegated records.yaml reload to ConfigRegistry, simplified plugin callbacks |
| src/mgmt/config/AddConfigFilesHere.cc | Removed — replaced by ConfigRegistry registrations in individual modules |
| src/mgmt/config/CMakeLists.txt | Updated build to include new source files |
| src/mgmt/rpc/handlers/config/Configuration.cc | New JSONRPC handlers for reload and status with token/force/inline support |
| src/traffic_server/traffic_server.cc | Replaced initializeRegistry() with register_config_files() using ConfigRegistry |
| src/traffic_server/RpcAdminPubHandlers.cc | Registered new get_reload_config_status RPC handler |
| src/traffic_ctl/traffic_ctl.cc | Added reload/status subcommand options with monitor, token, data, force flags |
| src/traffic_ctl/jsonrpc/CtrlRPCRequests.h | New request/response models for reload and status |
| src/traffic_ctl/jsonrpc/ctrl_yaml_codecs.h | YAML codecs for reload request/response serialization |
| src/traffic_ctl/CtrlPrinters.cc | Reload report/progress bar rendering with tree display |
| src/traffic_ctl/CtrlPrinters.h | Added printer methods and as<>() template |
| src/traffic_ctl/CtrlCommands.h | Added reload helper method declarations |
| src/traffic_ctl/TrafficCtlStatus.h | Added CTRL_EX_TEMPFAIL exit code |
| src/proxy/IPAllow.cc | Migrated to ConfigRegistry with add_file_dependency for ip_categories |
| src/proxy/CacheControl.cc | Migrated to ConfigRegistry |
| src/proxy/ParentSelection.cc | Migrated to ConfigRegistry |
| src/proxy/ReverseProxy.cc | Migrated to ConfigRegistry, removed UR_UpdateContinuation |
| src/proxy/logging/LogConfig.cc | Migrated to ConfigRegistry with deferred reload context |
| src/proxy/http/PreWarmConfig.cc | Migrated to ConfigRegistry via register_record_config |
| src/iocore/net/SSLClientCoordinator.cc | Migrated to ConfigRegistry with add_file_and_node_dependency |
| src/iocore/net/SSLConfig.cc | Migrated ticket key to ConfigRegistry, updated reconfigure signatures |
| src/iocore/net/SSLSNIConfig.cc | Updated reconfigure to accept ConfigContext |
| src/iocore/net/QUICMultiCertConfigLoader.cc | Updated reconfigure to accept ConfigContext |
| src/iocore/net/quic/QUICConfig.cc | Updated reconfigure to accept ConfigContext |
| src/iocore/dns/SplitDNS.cc | Migrated to ConfigRegistry |
| src/iocore/cache/Cache.cc | Migrated hosting.config to ConfigRegistry (late registration) |
| src/iocore/cache/CacheHosting.cc | Removed old config_callback |
| src/iocore/cache/P_CacheHosting.h | Removed CacheHostTableConfig continuation |
| src/records/RecCore.cc | Updated RecConfigWarnIfUnregistered to log via ConfigContext |
| src/records/P_RecCore.cc | Added RecFlushConfigUpdateCbs() |
| src/records/RecordsConfig.cc | Added reload timeout/check_interval records |
| src/records/CMakeLists.txt | Added reload infrastructure and test sources |
| src/tscore/ArgParser.cc | Extracted apply_option_defaults(), called after dependency validation |
| src/tscore/unit_tests/test_ArgParser.cc | Tests for case-sensitive short options and default value validation |
| src/records/unit_tests/test_ConfigReloadTask.cc | Unit tests for state transitions and timeout behavior |
| src/records/unit_tests/test_ConfigRegistry.cc | Unit tests for registry resolution and dependency management |
| Various CMakeLists.txt files | Added configmanager dependency to tests and libraries |
| tests/gold_tests/traffic_ctl/traffic_ctl_config_reload.test.py | New test for traffic_ctl config reload commands |
| tests/gold_tests/traffic_ctl/traffic_ctl_test_utils.py | Added ConfigReload/ConfigStatus helper classes |
| tests/gold_tests/jsonrpc/config_reload_tracking.test.py | JSONRPC-level reload tracking tests |
| tests/gold_tests/jsonrpc/config_reload_reserve_subtask.test.py | Tests for subtask reservation during reload |
| tests/gold_tests/jsonrpc/config_reload_full_smoke.test.py | Full smoke test for all config handlers |
| tests/gold_tests/jsonrpc/config_reload_dedup.test.py | Deduplication test for multi-trigger configs |
| tests/gold_tests/ip_allow/ip_allow_reload_triggered.test.py | IP allow reload via file dependency test |
| tests/gold_tests/parent_config/parent_config_reload.test.py | Parent config reload test |
| tests/gold_tests/cache/cache_config_reload.test.py | Cache/hosting config reload test |
| tests/gold_tests/dns/splitdns_reload.test.py | SplitDNS reload test |

Comment on lines +172 to 176
url_rewrite_CB(const char * /* name ATS_UNUSED */, RecDataT /* data_type ATS_UNUSED */, RecData data,
void * /* cookie ATS_UNUSED */)
{
int my_token = static_cast<int>((long)cookie);

switch (my_token) {
case REVERSE_CHANGED:
rewrite_table->SetReverseFlag(data.rec_int);
break;

case TSNAME_CHANGED:
case FILE_CHANGED:
case HTTP_DEFAULT_REDIRECT_CHANGED:
eventProcessor.schedule_imm(new UR_UpdateContinuation(reconfig_mutex), ET_TASK);
break;

case URL_REMAP_MODE_CHANGED:
// You need to restart TS.
break;

default:
ink_assert(0);
break;
}

rewrite_table->SetReverseFlag(data.rec_int);
return 0;

Copilot AI Mar 9, 2026

The url_rewrite_CB function now only handles REVERSE_CHANGED (calling SetReverseFlag), since TSNAME_CHANGED, FILE_CHANGED, and HTTP_DEFAULT_REDIRECT_CHANGED are now handled by ConfigRegistry. However, the function still takes cookie parameter (previously used to determine which case to handle) but now ignores it. The callback registration at line 91 still passes (void *)REVERSE_CHANGED as cookie. This works, but the function signature with unused cookie and the remaining RecRegisterConfigUpdateCb for proxy.config.reverse_proxy.enabled should be documented or simplified to clarify that this is now only for reverse proxy enable/disable toggling.

Copilot uses AI. Check for mistakes.
p.Ready = When.FileContains(ts.Disk.diags_log.Name, "parent.config finished loading", 3)
p.Timeout = 20
tr.Processes.Default.StartBefore(p)
## TODO: we should have an extension like When.ReloadCompleted(token, success) to validate this inetasd of parsing

Copilot AI Mar 9, 2026

Typo in comment: "inetasd" should be "instead".

void setup_log_objects();

static int reconfigure(const char *name, RecDataT data_type, RecData data, void *cookie);
// static int reconfigure(const char *name, RecDataT data_type, RecData data, void *cookie);

Copilot AI Mar 9, 2026

The commented-out old declaration // static int reconfigure(const char *name, RecDataT data_type, RecData data, void *cookie); should be removed. The new declaration is directly below it.

Comment on lines +120 to +122
tr = Test.AddTestRun("remap_config reload, test")
tr.Processes.Default.Env = tm.Env
tr.Processes.Default.Command = 'sleep 2; traffic_ctl rpc invoke get_reload_config_status -f json | jq'

Copilot AI Mar 9, 2026

The remap_reload.test.py appends a test run that invokes get_reload_config_status via traffic_ctl rpc invoke and pipes to jq, but doesn't set a return code expectation or add any validation. This appears to be a debug/exploratory command left in the test. It should either validate something meaningful or be removed before merging.

if (auto p = _task.lock()) {
auto child = p->add_child(description);
// child task will get the full content of the parent task
// TODO: eventyually we can have a "key" passed so child module

Copilot AI Mar 9, 2026

Typo in comment: "eventyually" should be "eventually".

Comment on lines +134 to +135
swoc::bwprint(text, "{} failed to load", new_table->ip_categories_config_file);
Error("%s\n%s", text.c_str(), swoc::bwprint(text, "{}", errata).c_str());

Copilot AI Mar 9, 2026

The IpAllow::reconfigure reuses the same text string variable for both the bwprint formatting call and the Error/Fatal macro. On line 135, swoc::bwprint(text, ...) writes into text, then the Error macro passes text.c_str() as the first %s argument, and swoc::bwprint(text, "{}", errata).c_str() as the second — but this second call overwrites text in-place, so text.c_str() in the first argument now points to the errata content, not the original error message. The same issue exists on line 143/145. Each bwprint call should use a separate buffer for the errata string.

// The server will not let you start two reload at the same time. This option will force a new reload
// even if there is one in progress. Use with caution as this may have unexpected results.
// This is mostly for debugging and testing purposes. note: Should we keep it here?
.add_option("--force", "-F", "Force reload even if there are unsaved changes")

Copilot AI Mar 9, 2026

The --force option's description says "Force reload even if there are unsaved changes" but according to the PR description and the RPC handler, --force forces a new reload even if one is in progress. The help text is misleading — "unsaved changes" implies something different. The description should be something like "Force a new reload even if one is in progress".


/// Generate a warning if any configuration name/value is not registered.
void RecConfigWarnIfUnregistered();
// void RecConfigWarnIfUnregistered();

Copilot AI Mar 9, 2026

The commented-out old declaration // void RecConfigWarnIfUnregistered(); should be removed. It serves no purpose and could be confusing alongside the new declaration immediately below it.

Comment on lines +214 to +216
if(NOT APPLE)
target_link_options(test_NextHopStrategyFactory PRIVATE -Wl,--allow-multiple-definition)
endif()

Copilot AI Mar 9, 2026

Using --allow-multiple-definition linker flag is a workaround that suppresses legitimate duplicate symbol warnings. This can mask real issues where the same symbol is defined in multiple compilation units. Consider using proper library factoring or weak symbols instead. If this is truly needed as a short-term fix, at minimum add a comment explaining which symbols conflict and why.

//
// The server will not let you start two reload at the same time. This option will force a new reload
// even if there is one in progress. Use with caution as this may have unexpected results.
// This is mostly for debugging and testing purposes. note: Should we keep it here?

Copilot AI Mar 9, 2026

The comment says "note: Should we keep it here?" — this internal discussion question should be resolved before merging. Either remove the option or remove the comment.

Labels

11.0.x, Configuration, JSONRPC (JSONRPC 2.0 related work), traffic_ctl (traffic_ctl related work)

Development

Successfully merging this pull request may close these issues.

Improving traffic_ctl config reload

3 participants