fix laplace_marginal optimization#3297

Open
avehtari wants to merge 2 commits into develop from fix-laplace_marginal

Conversation


@avehtari avehtari commented Mar 19, 2026

Summary

laplace_marginal was failing with a normal-normal model, a case in which Newton should reach the mode in one step.
This fix was made with Claude's assistance; the sufficiently detailed prompt was based on a previous detailed analysis of the behavior of the Stan model included below.

All code changes are by Claude. Below I have also included Claude's summaries of the fixes. I have looked at the changes and summaries and they make sense. All existing Stan math tests for laplace_marginal pass, and my Stan model example now works well, too. The code should still be carefully checked.

Checklist

  • Copyright holder: I think the changes are so minor that there is no copyright claim. Just in case, I would be the copyright holder.

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

Previously problematic model and data

R code and data

library(loo)
library(brms)
library(cmdstanr)
library(dplyr)  # filter() and mutate() below come from dplyr

# Data
data("sleepstudy", package = "lme4")
conditions <- make_conditions(sleepstudy, "Subject", incl_vars = FALSE)
sleepstudy <- sleepstudy |>
  filter(Days >= 2) |>
  mutate(Days = Days - 2)
standata4 <- standata(
  Reaction ~ 1 + Days + (1 + Days | Subject), 
  data = sleepstudy,
  family = gaussian()
)

# Model and posterior
mod4_la <- cmdstan_model("sleep4.stan", force_recompile=TRUE)
fit_4_la <- mod4_la$sample(data = standata4, chains = 4, refresh = 0, seed  = 1)

# The following should be the same when Laplace works:

# LOGO-CV with Laplace integrated importance sampling
(logo_4_la <- fit_4_la$loo(variables = "log_lik_g"))

# LOGO-CV with analytically integrated importance sampling
(logo_4 <- fit_4_la$loo(variables = "log_lik_ga"))

Stan code

// generated with brms 2.23.1
functions {
 /* compute correlated group-level effects
  * Args:
  *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns:
  *   matrix of scaled group-level effects
  */
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }

  // Log-likelihood function for a single group
  // theta = [r_1, r_2] (group-level intercept and slope)
  real normal_group_ll(vector theta,
                          data int n_g, data vector Y_g,
                          data vector Z_1_1_g, data vector Z_1_2_g,
                          data vector mu_fixed_g, data real sigma) {
    real r_1 = theta[1];  // group-level intercept
    real r_2 = theta[2];  // group-level slope
    vector[n_g] mu = mu_fixed_g[1:n_g] + r_1 * Z_1_1_g[1:n_g] + r_2 * Z_1_2_g[1:n_g];
    return normal_lpdf(Y_g[1:n_g] | mu, sigma);
  }

  // Covariance function for bivariate group-level effects
  matrix cov_group(vector sd_1, matrix L_1) {
    return multiply_lower_tri_self_transpose(diag_pre_multiply(sd_1, L_1));
  }
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int<lower=1> Kc;  // number of population-level effects after centering
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
  // Prepare group-specific data arrays (fixed at data time)
  array[N_1] int n_g;  // number of observations per group
  array[N_1, N] real Y_g_arr;  // Y values per group (padded)
  array[N_1, N] real Z_1_1_g_arr;  // Z_1_1 values per group (padded)
  array[N_1, N] real Z_1_2_g_arr;  // Z_1_2 values per group (padded)
  array[N_1, N] int obs_idx_g;  // observation indices per group

  // initialize counts
  for (g in 1:N_1) {
    n_g[g] = 0;
  }
  // populate group-specific arrays
  for (n in 1:N) {
    int g = J_1[n];
    n_g[g] += 1;
    Y_g_arr[g, n_g[g]] = Y[n];
    Z_1_1_g_arr[g, n_g[g]] = Z_1_1[n];
    Z_1_2_g_arr[g, n_g[g]] = Z_1_2[n];
    obs_idx_g[g, n_g[g]] = n;
  }
  // control parameters for Laplace approximation
  real tolerance = 1e-6;
  int max_num_steps = 100, solver = 1, max_steps_line_search = 100;
}
parameters {
  vector[Kc] b;  // regression coefficients
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
}
transformed parameters {
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1;
  vector[N_1] r_1_2;
  // prior contributions to the log posterior
  real lprior = 0;
  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd_1, L_1);
  r_1_1 = r_1[, 1];
  r_1_2 = r_1[, 2];
  lprior += normal_lpdf(b | 0, 20);
  lprior += normal_lpdf(Intercept | 250, 100);
  lprior += exponential_lpdf(sigma | 0.04);
  lprior += exponential_lpdf(sd_1[1] | 0.04);
  lprior += exponential_lpdf(sd_1[2] | 1.0/15);
  lprior += lkj_corr_cholesky_lpdf(L_1 | 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept + Xc * b;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    }
    target += normal_lpdf(Y | mu, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
  vector[N] log_lik;
  vector[N] mu = rep_vector(0.0, N);
  mu += Intercept + Xc * b;
  for (n in 1:N) {
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    log_lik[n] = normal_lpdf(Y[n] | mu[n], sigma);
  }

  // group-level log-likelihoods using Laplace approximation
  vector[N_1] log_lik_g;
  {
    // fixed part of linear predictor for all observations
    vector[N] mu_fixed = Intercept + Xc * b;

    for (g in 1:N_1) {
      // Prepare group-specific vectors
      vector[N] Z_1_1_g_vec = to_vector(Z_1_1_g_arr[g]);
      vector[N] Z_1_2_g_vec = to_vector(Z_1_2_g_arr[g]);
      vector[N] Y_g_vec = to_vector(Y_g_arr[g]);
      vector[N] mu_fixed_g;
      for (i in 1:n_g[g]) {
        mu_fixed_g[i] = mu_fixed[obs_idx_g[g, i]];
      }

      // Use sampled group effects as starting point for Newton solver
      vector[2] theta_init = [r_1_1[g], r_1_2[g]]';
      // vector[2] theta_init = [0.0, 0.0]';

      log_lik_g[g] = laplace_marginal_tol(
        normal_group_ll,
        (n_g[g], Y_g_vec, Z_1_1_g_vec, Z_1_2_g_vec, mu_fixed_g, sigma),
        2,
        cov_group,
        (sd_1, L_1),
        (theta_init, tolerance, max_num_steps, solver, max_steps_line_search, 1));
    }
  }
  // group-level log-likelihoods using analytic integration
  vector[N_1] log_lik_ga;
  {
    // Group-level covariance matrix (M_1 x M_1)
    matrix[M_1, M_1] Sigma_group = multiply_lower_tri_self_transpose(diag_pre_multiply(sd_1, L_1));
    // Fixed part of linear predictor for all observations
    vector[N] mu_fixed = Intercept + Xc * b;
    for (g in 1:N_1) {
      int ng = n_g[g];

      // Build group-specific design matrix and vectors
      matrix[ng, M_1] Z_g;
      vector[ng] Y_g;
      vector[ng] mu_g;
      for (i in 1:ng) {
        Z_g[i, 1] = Z_1_1_g_arr[g, i];
        Z_g[i, 2] = Z_1_2_g_arr[g, i];
        Y_g[i] = Y_g_arr[g, i];
        mu_g[i] = mu_fixed[obs_idx_g[g, i]];
      }

      // Marginal covariance: sigma^2 * I_{ng} + Z_g * Sigma_group * Z_g'
      matrix[ng, ng] Cov_g = add_diag(Z_g * Sigma_group * Z_g', square(sigma));

      log_lik_ga[g] = multi_normal_lpdf(Y_g | mu_g, Cov_g);
    }
  }
}

Claude's summary of the fixes

Bug Report: laplace_marginal_tol() Erratically Fails to Find the Mode

Symptom

When laplace_marginal_tol() is called with a user-supplied non-zero
theta_init, the Newton optimizer sometimes:

  1. Stays near the initial values without actually optimizing,
  2. Converges to [0, 0], or
  3. Finds the correct mode (unpredictably).

This happens across all three solvers (1, 2, 3) and is not related to
the Laplace equations themselves — when the mode is found, the marginal
density matches the analytic result.

Root Cause

The invariant theta = Sigma * a is violated at initialization.

The Newton solver operates in the a-parameterization where
theta = Sigma * a and maximizes the objective:

obj(a, theta) = -0.5 * a' * theta + log_likelihood(theta)

In wolfe_line_search.hpp, the WolfeData constructor (line ~449)
unconditionally sets a = 0 regardless of the user-supplied theta_init:

WolfeData(obj_fun, n, theta0, theta_grad_f)
    : theta_(theta0),                       // theta = user-supplied init
      theta_grad_(theta_grad_f(theta_)),
      a_(Eigen::VectorXd::Zero(n)),          // a = 0  ← BUG
      eval_(1.0, obj_fun(a_, theta_), 0.0)

When theta_init != 0 and a = 0, the initial objective evaluates to:

obj_init = -0.5 * 0' * theta_init + LL(theta_init) = LL(theta_init)

The prior penalty term -0.5 * theta' * Sigma^{-1} * theta (which
equals -0.5 * a' * theta when a = Sigma^{-1} * theta) is entirely
missing. The initial objective is artificially inflated by
0.5 * theta_init' * Sigma^{-1} * theta_init.

Why This Causes Three Failure Modes

Mode A — "Stays near initial values": When theta_init is near the
true mode, the first Newton step proposes a consistent (a_new, theta_new)
with theta_new ≈ theta_init. But obj_new includes the prior penalty
while obj_init does not, so obj_new < obj_init. The Wolfe line search
rejects the step (accept_ = false), and search_failed = true fires
immediately. The optimizer returns after 0–1 iterations.

Mode B — "Converges to [0, 0]": When the rejected step causes
build_result to return state.prev() with a = [0, 0], downstream
code that recomputes theta = Sigma * a gets theta = [0, 0].

Mode C — "Sometimes finds the mode": When theta_init ≈ 0, the
inflation 0.5 * theta_init' * Sigma^{-1} * theta_init is negligible, so
the inconsistency doesn't prevent the solver from making progress.

The severity depends on the magnitude of theta_init relative to the
prior, which varies across MCMC draws and groups — explaining the erratic
pattern.

Secondary Issue: Corrupted First-Iteration Gradient

The first-iteration search direction gradient is computed as:

prev_g = -covariance * state.prev().a() + covariance * state.prev().theta_grad();

With a = 0, the -covariance * a term vanishes, so the gradient
direction is purely likelihood-driven with no prior pull. This can cause
the Newton step to overshoot or go in a suboptimal direction even if the
line search doesn't reject outright.

Fix

Approach

Compute the consistent a_init = Sigma^{-1} * theta_init before
constructing NewtonState, and pass it through the constructor chain
so that both a and theta are initialized consistently.

Changed Files

stan/math/mix/functor/wolfe_line_search.hpp

Added a new WolfeInfo constructor overload that accepts both a_init
and theta_init, forwarding to the existing WolfeData constructor
that takes an explicit a argument:

WolfeInfo(ObjFun&& obj_fun, const Eigen::VectorXd& a_init, Theta0&& theta0,
          ThetaGradF&& theta_grad_f, int /*tag*/)
    : curr_(std::forward<ObjFun>(obj_fun), a_init,
            std::forward<Theta0>(theta0),
            std::forward<ThetaGradF>(theta_grad_f)),
      prev_(curr_),
      scratch_(a_init.size()) { ... }

A trailing int /*tag*/ parameter disambiguates from the existing
(obj_fun, n, theta0, theta_grad_f) overload.

stan/math/mix/functor/laplace_marginal_density_estimator.hpp

  1. Added a new NewtonState constructor overload that accepts
    (theta_size, obj_fun, theta_grad_f, a_init, theta_init) and
    forwards to the new WolfeInfo constructor.

  2. In laplace_marginal_density_est, when InitTheta = true, compute:

    Eigen::VectorXd a_init = covariance.llt().solve(theta_init);

    and construct NewtonState with both a_init and theta_init.
    When InitTheta = false, the existing zero-initialization path is
    unchanged.

What the Fix Does NOT Change

  • The InitTheta = false path (zero initialization) is identical.
  • All solver policies (1, 2, 3) are unaffected — the fix is upstream
    in the shared initialization code.
  • The update_fun lambda already correctly enforces
    theta = Sigma * a for all subsequent proposals.
  • The Wolfe line search, Barzilai-Borwein step size, and convergence
    criteria are all unchanged.

Verification

All Laplace test suites compile and pass:

| Test suite | Tests | Result |
|---|---|---|
| laplace_marginal_lpdf_moto_test (non-zero theta_init) | 48 run, 24 skipped | All passed |
| laplace_types_test | 55 | All passed |
| wolfe_line_search_test | 18 | All passed |
| generate_laplace_options_test | 3 | All passed |
| aki_ex_test | 2 | All passed |
| laplace_marginal_bernoulli_logit_lpmf_test | — | compiles OK |
| laplace_marginal_poisson_log_lpmf_test | — | compiles OK |
| laplace_marginal_neg_binomial_log_lpmf_test | — | compiles OK |
| laplace_marginal_neg_binomial_log_summary_lpmf_test | — | compiles OK |

Other Observations (Not Fixed Here)

During the code examination, several additional issues were noted that
are not related to the primary bug but may warrant separate attention:

  1. llt_with_jitter accumulates jitter. Each retry adds increasing
    diagonal jitter (1e-10 to 1e-5) without removing the previous
    jitter. The resulting B matrix differs from the theoretical B,
    but the log-determinant is computed from the jittered matrix without
    correction.

  2. std::once_flag masks repeated fallbacks. The solver fallback
    warning uses static std::once_flag, so it fires at most once per
    process lifetime. Repeated fallbacks across calls are silently
    suppressed.

  3. LU log-determinant ignores permutation sign.
    LUSolver::compute_log_determinant() takes log() of the LU
    diagonal entries. If any entry is negative (possible when B is not
    positive definite — the reason solver 3 was needed), this produces
    NaN.

  4. Absolute-only convergence tolerance.
    |curr.obj - prev.obj| < tolerance has no relative component. For
    objectives with large magnitude the threshold may trigger too easily;
    for near-zero objectives it may be too loose.

  5. block_matrix_sqrt uses .isFinite().any() instead of .all().
    The check !local_block.array().isFinite().any() only catches blocks
    where every element is non-finite; a mix of finite and NaN values
    passes silently.

Fixes for laplace_marginal_tol() Newton Solver Convergence

Context

After applying the initial a_init fix (documented in
laplace_marginal_tol_bug.md), many log_lik_g values from
laplace_marginal_tol() still disagreed with the analytic closed-form
marginal likelihood. The a_init fix corrected the initialization
invariant (theta = Sigma * a) but three additional bugs in the Newton
solver prevented reliable convergence to the true mode.

All fixes are in
stan/math/mix/functor/laplace_marginal_density_estimator.hpp and
stan/math/mix/functor/wolfe_line_search.hpp.


Fix 1: Consistent a_init initialization (prior fix, included for completeness)

Files: laplace_marginal_density_estimator.hpp, wolfe_line_search.hpp

When InitTheta = true, WolfeData unconditionally set a = 0
regardless of the user-supplied theta_init, violating the invariant
theta = Sigma * a. This inflated the initial objective by omitting
the prior penalty -0.5 * a' * theta, causing the Wolfe line search to
reject the first Newton step.

Fix: Compute a_init = Sigma^{-1} * theta_init via
covariance.llt().solve(theta_init) and pass it through a new
constructor overload chain (NewtonState → WolfeInfo → WolfeData)
so both a and theta are consistent from the start.

Fix 2: Wolfe zoom-phase case [4] placement

File: wolfe_line_search.hpp

In the zoom phase of the Wolfe line search, case [4] (high = mid, for
Armijo-OK but f(mid) <= f(low)) was placed outside the Armijo-true
branch, causing it to execute unconditionally after cases [2] and [3].
This meant every zoom iteration that reached mid.obj() <= low.obj()
would collapse the bracket incorrectly.

Fix: Nest case [4] inside the else of the mid.obj() > low.obj()
check, so it only fires when Armijo holds but f(mid) <= f(low):

      } else {
        // [4]
        high = mid;
      }

Fix 3: Cumulative jitter in llt_with_jitter

File: laplace_marginal_density_estimator.hpp

Each retry in the jitter loop added jitter_try to the diagonal
without removing the previous jitter. After the full retry sequence,
the total perturbation was 1e-10 + 1e-9 + ... + 1e-5 ≈ 1.111e-5
rather than the intended 1e-5. Since B is passed by reference and
the jittered matrix is kept, the Cholesky factorization (and hence the
log-determinant used in the final marginal density) was computed from an
over-perturbed matrix.

Fix: Track prev_jitter and add only the incremental difference
each iteration, so the total diagonal perturbation is exactly
jitter_try:

double prev_jitter = 0.0;
double jitter_try = min_jitter;
for (; jitter_try < max_jitter; jitter_try *= 10) {
    B.diagonal().array() += (jitter_try - prev_jitter);
    prev_jitter = jitter_try;
    llt_B.compute(B);
    if (llt_B.info() == Eigen::Success) break;
}

Fix 4: search_failed used stale curr.obj(), terminating on every Wolfe rejection

File: laplace_marginal_density_estimator.hpp

The original convergence check was:

bool search_failed = (!state.wolfe_status.accept_
                      && state.curr().obj() <= state.prev().obj());

When the Wolfe line search rejects a step (accept_ = false), it
leaves curr_ unchanged — so curr.obj() retains the stale value
from the previous update_next_step swap, which is typically equal to
prev.obj(). The <= prev.obj() comparison is therefore almost
always true, making search_failed equivalent to !accept_. This
caused the Newton loop to terminate immediately on any Wolfe
rejection, even when the full Newton step would have improved the
objective.

Fix: Before the Wolfe line search runs, save the full Newton step's
objective (full_newton_obj = scratch.eval_.obj()). After a Wolfe
rejection, check whether the full Newton step improved the objective.
If so, re-evaluate scratch at alpha = 1, accept it as the current
iterate, and let the Newton loop continue:

const double full_newton_obj = scratch.eval_.obj();
// ... Wolfe line search runs, potentially overwriting scratch ...

bool search_failed;
if (!state.wolfe_status.accept_) {
    if (full_newton_obj > state.prev().obj()) {
        // Re-evaluate at the full Newton step and accept it
        scratch.eval_.alpha() = 1.0;
        update_fun(scratch, state.curr(), state.prev(),
                   scratch.eval_, state.wolfe_info.p_);
        state.curr().update(scratch);
        state.wolfe_status.accept_ = true;
        search_failed = false;
    } else {
        search_failed = true;
    }
} else {
    search_failed = false;
}

This is necessary because the Wolfe line search overwrites scratch
with intermediate trial points; the saved scalar full_newton_obj is
the only reliable record of whether the full Newton step was an
improvement.

Fix 5: Absolute-only convergence tolerance

File: laplace_marginal_density_estimator.hpp

The Newton loop convergence check used only an absolute tolerance:

bool objective_converged
    = std::abs(curr.obj() - prev.obj()) < options.tolerance;

For the Laplace objective (which can range from −50 to −200 or beyond),
an absolute threshold of 1e-6 triggers trivially. The optimizer could
declare convergence after a single iteration if the objective happened
to change by less than 1e-6 — even though the relative change was
still significant (e.g., a change of 5e-7 on an objective of −50 is
1e-8 relative, but on an objective of −0.001 would be 5e-4
relative).

Fix: Require both an absolute and a relative condition:

double obj_change = std::abs(state.curr().obj() - state.prev().obj());
bool objective_converged
    = obj_change < options.tolerance
      && obj_change < options.tolerance * std::abs(state.prev().obj());

This makes the check scale-aware: convergence is declared only when the
change is small both in absolute terms and relative to the objective's
magnitude.


How the Fixes Interact

| # | Bug | Effect on Laplace density | Which draws affected? |
|---|---|---|---|
| 1 | a = 0 with theta ≠ 0 | Inflated initial objective → first step rejected → 0–1 iterations | Draws where theta_init is far from 0 |
| 2 | Zoom case [4] misplaced | Bracket collapses incorrectly → suboptimal step size | All draws using zoom phase |
| 3 | Cumulative jitter | log det(B) computed from over-perturbed matrix | Draws where Cholesky needs jitter |
| 4 | Stale curr.obj() in search_failed | Newton loop exits after any Wolfe rejection | Draws where Wolfe line search struggles |
| 5 | Absolute-only tolerance | Premature convergence on large-magnitude objectives | All draws (probabilistic) |

Bugs 1 and 4 had the largest impact: together they meant the Newton
solver would often run only 0–2 iterations for groups where theta_init
was far from zero. Bug 1 prevented the first step from being accepted,
and bug 4 terminated the loop as soon as the line search failed to find
a point satisfying the Wolfe conditions — even when the objective was
clearly improving.


Verification

After all five fixes, sleep4.stan (Laplace approximation) and
sleep4g.stan (analytic marginal) produce matching log_lik_g values
across all 18 groups and 4000 posterior draws.

@avehtari avehtari requested a review from SteveBronder March 19, 2026 15:57
@avehtari

I had problems with other data and models, too. I just tested that this also removes the problems from two cases using Bernoulli and beta-binomial observation models.


@SteveBronder SteveBronder left a comment


I'm going to make a few changes, but this is good!

Comment on lines +987 to +1011
// When the Wolfe line search rejects, don't immediately terminate.
// Instead, let the Newton loop try at least one more iteration.
// The original code compared the stale curr.obj() (which equalled
// prev.obj() after the swap in update_next_step) and would always
// terminate on ANY Wolfe rejection — even on the very first Newton
// step. Now we only declare search_failed if the full Newton step
// itself didn't improve the objective.
bool search_failed;
if (!state.wolfe_status.accept_) {
  if (full_newton_obj > state.prev().obj()) {
    // The full Newton step (evaluated before Wolfe ran) improved
    // the objective. Re-evaluate scratch at the full Newton step
    // so we can accept it as the current iterate.
    scratch.eval_.alpha() = 1.0;
    update_fun(scratch, state.curr(), state.prev(), scratch.eval_,
               state.wolfe_info.p_);
    state.curr().update(scratch);
    state.wolfe_status.accept_ = true;
    search_failed = false;
  } else {
    search_failed = true;
  }
} else {
  search_failed = false;
}

I get what this is trying to do, though I think it caught the main issue — that curr can be stale — called it out, but then did not do anything about it. I was being memory greedy / too clever here and reusing the memory in curr in a few places I should not have.

Though I don't disagree with what it is doing here. If we are in a loop with a smallish step size and a problem with weird geometry, it can be worth yolo'ing a large step size to try to jump out of it. The code actually used to have something like this in it, but at the time I dismissed it. We also test alpha = 1 a few lines above and should reuse those results here instead of rerunning update_fun. update_fun checks whether the objective or gradient is NaN and reduces the step size from 1 until we get a finite objective and gradient.

I'm going to modify this, but will keep the backup Newton step check.

@stan-buildbot


| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
|---|---|---|---|---|
| gp_regr/gp_regr.stan | 0.1 | 0.09 | 1.05 | 4.8% faster |
| gp_regr/gen_gp_data.stan | 0.03 | 0.02 | 1.07 | 6.64% faster |
| arK/arK.stan | 1.81 | 1.71 | 1.06 | 5.96% faster |
| eight_schools/eight_schools.stan | 0.06 | 0.05 | 1.09 | 7.89% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.99 | 8.34 | 1.08 | 7.22% faster |
| pkpd/one_comp_mm_elim_abs.stan | 19.91 | 18.66 | 1.07 | 6.26% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.25 | 0.24 | 1.07 | 6.19% faster |
| sir/sir.stan | 70.58 | 66.42 | 1.06 | 5.9% faster |
| gp_pois_regr/gp_pois_regr.stan | 2.89 | 2.67 | 1.08 | 7.42% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.74 | 2.54 | 1.08 | 7.33% faster |
| irt_2pl/irt_2pl.stan | 4.16 | 3.89 | 1.07 | 6.58% faster |
| arma/arma.stan | 0.3 | 0.27 | 1.09 | 8.02% faster |
| garch/garch.stan | 0.43 | 0.4 | 1.06 | 5.97% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.01 | 0.01 | 1.12 | 10.42% faster |
| performance.compilation | 231.39 | 224.34 | 1.03 | 3.05% faster |

Mean result: 1.071462349031666

Jenkins Console Log
Blue Ocean
Commit hash: e0a54f2c08661b080c37acc67934833d4d8465c6


Machine information

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 2400.000
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 1.3 MiB
L1i cache: 1.3 MiB
L2 cache: 40 MiB
L3 cache: 55 MiB
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Vmscape: Mitigation; IBPB before exit to userspace
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities

G++:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Clang:
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
