tskit-dev · apragsdale · Mar 5, 2026 · Mar 6, 2026 · Mar 6, 2026 · Mar 6, 2026
diff --git a/docs/stats.md b/docs/stats.md
@@ -684,14 +684,310 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
   and {math}`v_j` is the covariance of the trait with the j-th covariate.
 
 
+(sec_stats_multi_site)=
+
 ## Multi site statistics
 
-:::{todo}
-Document statistics that use information about correlation between sites, such as
-LdCalculator (and perhaps reference {ref}`sec_identity`). Note that if we have a general
-framework which has the same calling conventions as the single site stats,
-we can rework the sections above.
-:::
+(sec_stats_two_locus)=
+
+### Two-locus statistics
+
+The {meth}`~TreeSequence.ld_matrix` method provides an interface to
+a collection of two-locus statistics with predefined summary functions (see
+{ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch`
+{ref}`modes <sec_stats_mode>`. The LD matrix method differs from other
+statistics methods in that it provides a unified API with an argument to
+specify different two-locus summaries of the data. It otherwise behaves
+similarly to most other functions with respect to `sample_sets` and `indexes`.
+
+Two-locus statistics can be computed using two modes, either `site` or
+`branch`, and these should be interpreted in the same way as these modes in the
+single-site statistics. Within this framework, statistics may be either
+polarised or unpolarised. For statistics that are polarised, we compute
+statistic values for pairs of derived alleles. Unpolarised statistics compute
+statistics over all possible alleles, derived and ancestral, and the result in
+averaged over statistics computed for all pairs of alleles (see weighting
+schemes below). The option for polarisation is not exposed to the user.
-statistic values for pairs of derived alleles. Unpolarised statistics compute
-statistics over all possible alleles, derived and ancestral, and the result in
-averaged over statistics computed for all pairs of alleles (see weighting
-schemes below). The option for polarisation is not exposed to the user.
+statistic values for pairs of derived alleles. Unpolarised statistics compute
+statistics over all pairs of alleles, derived and ancestral. In either case,
+the result is averaged over these values, using a weighting
+scheme descrbed below. The option for polarisation is not exposed to the user.
-statistic values for pairs of derived alleles. Unpolarised statistics compute
-statistics over all possible alleles, derived and ancestral, and the result in
-averaged over statistics computed for all pairs of alleles (see weighting
-schemes below). The option for polarisation is not exposed to the user.
+statistic values for pairs of derived alleles. Unpolarised statistics compute
+statistics over all pairs of alleles, derived and ancestral. In either case,
+the result is averaged over these values, using a weighting
+scheme descrbed below. The option for polarisation is not exposed to the user.
+Instead, we implement statistics that are polarised where appropriate.
+
+(sec_stats_two_locus_site)=
+
+#### Site
+
+The `site` mode computes two-locus statistics summarized over alleles between
+all pairs of specified sites. The default behavior, leaving `sites`
+unspecified, will compute a matrix for all pairs of sites, with
+one row and column for each site in the tree sequence (i.e., an n x n
+matrix where n is the number of sites in the tree sequence). We can also
+restrict the output to a subset of sites, either by specifying a single vector
+for both rows and columns or a pair of vectors for the row sites and column
+sites separately.
+
+The following computes a matrix of the {math}`r^2` measure of linkage
+disequilibrium (LD) computed pairwise between the first 4 sites in the tree
+sequence among all samples. In our computations, row sites are used as the row
+("left-hand") loci and column sites are used as the column ("right-hand")
+locus, and with a single list of sites specified, we obtain a symmetric square
+matrix. 
+
+```{code-cell} ipython3
+ld = ts.ld_matrix(sites=[[0, 1, 2, 3]])
+print(ld)
+```
+
+The following demonstrates how we can specify the row and column sites
+independently of each other. We're specifying 3 columns and 2 rows, which
+computes a subset of the matrix shown above.
+
+```{code-cell} ipython3
+ld = ts.ld_matrix(sites=[[0, 1], [1, 2, 3]])
+print(ld)
+```
+
+Because we allow for two-locus statistics to be computed for multi-allelic
+data, we need to be able to combine statistical results from each pair of
+alleles into one summary for a pair of sites. We use two implementations for
-alleles into one summary for a pair of sites. We use two implementations for
+alleles into one summary for a pair of sites. This does not affect biallelic
+data (and so this section can be skipped on first reading).
+We use two implementations for
-alleles into one summary for a pair of sites. We use two implementations for
+alleles into one summary for a pair of sites. This does not affect biallelic
+data (and so this section can be skipped on first reading).
+We use two implementations for
+combining results from multiple alleles: `hap_weighted` and `total_weighted`.
+These are statistic-specific and not chosen by the user.
+
+Briefly, consider a pair of sites with {math}`n` alleles at the first locus and
+{math}`m` alleles at the second. Write {math}`f_{i,j}` as the statistic
-{math}`m` alleles at the second. Write {math}`f_{i,j}` as the statistic
+{math}`m` alleles at the second. (Whether this includes the ancestral allele
+depends on whether the statistic is polarised.) Write {math}`f_{i,j}` as the statistic
-{math}`m` alleles at the second. Write {math}`f_{i,j}` as the statistic
+{math}`m` alleles at the second. (Whether this includes the ancestral allele
+depends on whether the statistic is polarised.) Write {math}`f_{i,j}` as the statistic
+computed for focal alleles {math}`A_i` and {math}`B_j`, with haplotype weights
+{math}`(A_i B_j, A_i b_j, a_i B_j)`, where {math}`a_i` and {math}`b_j` are the
+collection of alleles that are not the focal alleles {math}`A_i` or
+{math}`B_j`, respectively. Then the weighting schemes are defined as:
-computed for focal alleles {math}`A_i` and {math}`B_j`, with haplotype weights
-{math}`(A_i B_j, A_i b_j, a_i B_j)`, where {math}`a_i` and {math}`b_j` are the
-collection of alleles that are not the focal alleles {math}`A_i` or
-{math}`B_j`, respectively. Then the weighting schemes are defined as:
+computed for focal alleles {math}`A_i` and {math}`B_j`.
+Then the weighting schemes are defined as:
-computed for focal alleles {math}`A_i` and {math}`B_j`, with haplotype weights
-{math}`(A_i B_j, A_i b_j, a_i B_j)`, where {math}`a_i` and {math}`b_j` are the
-collection of alleles that are not the focal alleles {math}`A_i` or
-{math}`B_j`, respectively. Then the weighting schemes are defined as:
+computed for focal alleles {math}`A_i` and {math}`B_j`.
+Then the weighting schemes are defined as:
+
+- `hap_weighted`: {math}`\sum_{i=1}^{n}\sum_{j=1}^{m}p(A_{i}B_{j})f_{ij}`,
+  where {math}`p(A_{i}B_{j})` is the frequency of haplotype {math}`A_{i}B_{j}`.
+  This method was first introduced in [Karlin
+  (1981)](https://doi.org/10.1111/j.1469-1809.1981.tb00308.x) and reviewed in
+  [Zhao (2007)](https://doi.org/10.1017/S0016672307008634).
+
+- `total_weighted`: {math}`\frac{1}{n m}\sum_{i=1}^{n}\sum_{j=1}^{m}f_{ij}`.
+  This method assigns equal weight to each of the possible pairs of focal
+  alleles at the two sites, taking the arithmetic mean of statistics over
+  focal haplotypes.
+
+Out of all of the available summary functions, only {math}`r^2` uses
+`hap_weighted` normalisation, with the remainder using uniform weighting
+(`total_weighted`).
+
+(sec_stats_two_locus_branch)=
+
+#### Branch
+
+The `branch` mode computes expected two-locus statistics between pairs of
+trees, conditioned on the marginal topologies and branch lengths of those
+trees. The trees for which we compute statistics are specified by positions,
+and for a pair of positions we consider all possible haplotypes that could be
+generated by a single mutation occurring at the two trees.
-generated by a single mutation occurring at the two trees.
+generated by a single mutation occurring on each of the two trees.
-generated by a single mutation occurring at the two trees.
+generated by a single mutation occurring on each of the two trees.
+
+For two trees, one with {math}`n` branches and the other with {math}`m`
+branches, there are {math}`nm` possible pairs of branches that may carry the
+pair of mutations. For each pair, we compute the two-locus statistic, and then
+sum these values weighted by the product of the two branch lengths. Given the
+two mutations occur, this accounts for the relative probability that the two
+mutations fall on any pair of branches.
+
+The time complexity of this method is quadratic in the number of samples,
+due to the pairwise comparisons of branches from each pair of trees.
+By default, this method computes
+a symmetric matrix for all pairs of trees, with rows and columns representing
+each tree in the tree sequence. Similar to the site method, we can restrict the
+output to a subset of trees, either by specifying a vector of positions or
+a pair of vectors for row and column positions separately. To select a specific
+tree, the specified positions must land in the tree span (`[start, end)`).
+
+In the following, we compute a matrix of expected {math}`r^2` within and
+between the first 4 trees in the tree sequence. The tree breakpoints are
+a convenient way to specify those first four trees.
+
+```{code-cell} ipython3
+ld = ts.ld_matrix(mode="branch", positions=[ts.breakpoints(as_array=True)[0:4]])
+print(ld)
+```
+
+Again, we can specify the row and column trees separately.
+
+```{code-cell} ipython3
+breakpoints = ts.breakpoints(as_array=True)
+ld = ts.ld_matrix(mode="branch", positions=[breakpoints[[0]], breakpoints[0:4]])
+print(ld)
+```
+
+(sec_stats_two_locus_sample_sets)=
+
+#### Sample Sets
+
+The two-locus API provides a mechanism by which to subset the samples under
+consideration, providing the ability to compute a separate LD matrix for each
+sample set or an LD matrix between sample sets. Output dimensions are handled
-consideration, providing the ability to compute a separate LD matrix for each
-sample set or an LD matrix between sample sets. Output dimensions are handled
+consideration, providing the ability to compute a LD matrix between distinct sample sets.
+Output dimensions are handled
-consideration, providing the ability to compute a separate LD matrix for each
-sample set or an LD matrix between sample sets. Output dimensions are handled
+consideration, providing the ability to compute a LD matrix between distinct sample sets.
+Output dimensions are handled
+in the same manner as the rest of the stats API (see
+{ref}`sec_stats_output_dimensions`). The two-locus API differs from the rest of
+the stats API in that we handle one-way and two-way statistics in the same
+function call.
+
+To compute a two-way two-locus statistic, the `index` argument must be
-To compute a two-way two-locus statistic, the `index` argument must be
+To compute a two-way two-locus statistic, the `indexes` argument must be
-To compute a two-way two-locus statistic, the `index` argument must be
+To compute a two-way two-locus statistic, the `indexes` argument must be
+provided. The statistics are selected in the same way (with the `stat`
+argument), but we provide a restricted set of two-way statistics (see
+{ref}`sec_stats_two_locus_summary_functions_two_way`). The dimension-dropping
+rules for the result follow the rest of the tskit stats API in that a single list
+or tuple will produce a single two-dimensional matrix, while list of these
+will produce a three-dimensional array, with the outer dimension of length
-will produce a three-dimensional array, with the outer dimension of length
+will produce a three-dimensional array, with the first dimension of length
-will produce a three-dimensional array, with the outer dimension of length
+will produce a three-dimensional array, with the first dimension of length
+equal to the length of the list.
+
+For example, to compute the {math}`r^2` LD matrix over a subset of samples in
+the tree sequence (such as sample nodes 0 through 7), we would specify the
+samples as follows:
+
+```
+ts = msprime.sim_ancestry(
+    20,
+    population_size=10000,
+    sequence_length=1000,
+    recombination_rate=2e-8,
+    random_seed=12)
+ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)
+
+ld = ts.ld_matrix(mode=“site”, sample_sets=range(8))
+print(ld)
+```
+
+For concreteness, we would expect the following dimensions with the specified
+`sample_sets` and `indexes` arguments.
+
+```
+# one-way
+ts.ld_matrix(sample_sets=None) -> 2 dimensions
+ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
+ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
+# two-way
+ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
+ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
+```
+
+#### Why are there nan values in the LD matrix?
+
+For some statistics, it is possible to observe nan entries in the LD matrix,
-For some statistics, it is possible to observe nan entries in the LD matrix,
+For some statistics, it is possible to observe `nan` entries in the LD matrix,
-For some statistics, it is possible to observe nan entries in the LD matrix,
+For some statistics, it is possible to observe `nan` entries in the LD matrix,
+which can be surprising or numerically impact downstream analyses. A nan entry
-which can be surprising or numerically impact downstream analyses. A nan entry
+which can be surprising or numerically impact downstream analyses. A `nan` entry
-which can be surprising or numerically impact downstream analyses. A nan entry
+which can be surprising or numerically impact downstream analyses. A `nan` entry
+may occur when computing a ratio statistic (such as {math}`r` or {math}`r^2`)
+with a denominator of zero, indicating that one or both sites in the pair are
+monomorphic. This can happen for a number of reasons:
+
+- The mutation models allows for reversible mutations, so a back mutation at
+  a site has resulted in a single allele despite multiple mutations in the
+  history of the sample.
+- LD is computed for a subsample of individuals, and some sites are not
+  variable among the sample nodes in the subsample.
+- A mutation exists above the root of the local tree, so that all samples carry
+  the mutation, and one or more sites are not variable.
+
+(sec_stats_two_locus_sample_one_way_stats)=
+
+##### One-way Statistics
+
+One-way statistics are summaries of two loci in a single sample set, using a
+triple of haplotype counts {math}`\{w_{Ab}, w_{aB}, w_{AB}\}` and the size of
+the sample set {math}`n`, where the capitalized and lowercase letters in our notation
+represent alternate alleles.
+
+(sec_stats_two_locus_sample_two_way_stats)=
+
+##### Two-way Statistics
+
+Two-way statistics are summaries of haplotype counts between two sample sets,
+which operate on the three haplotype counts (as in one-way stats, above)
+computed from each sample set, indexed by `(i, j)`. These statistics take on
+a different meaning from their one-way counterparts. For instance {math}`D_i
+D_j` can be thought of as the covariance between two loci when computed for
+a haplotype in a single sample set.
+
+Only a subset of our summary functions are two-way statistics (see
+{ref}`sec_two_locus_summary_functions_two_way`). Note that the unbiased two-way
+statistics expect non-overlapping sample sets, and we do not make any
+assertions about the sample sets and assume that `i` and `j` represent disjoint
+sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
+
+(sec_stats_two_locus_summary_functions)=
+
+#### Summary Functions
+
+(sec_stats_two_locus_summary_functions_one_way)=
+
+##### One-way
+
+The two-locus summary functions all take haplotype counts and sample set size as
+input. Each of our summary functions has the signature
+{math}`f(w_{Ab}, w_{aB}, w_{AB}, n)`, converting to haplotype frequencies
+{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` where appropriate by dividing by {math}`n`.
+
+`D`
+: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b}`
+
+  This statistic is polarised, as the unpolarised result, which averages over
+  allele labelings, is zero. Uses the `total` weighting method.
+
+`D_prime`
+: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{D_{\max}}`,
+
+  where {math}
+            \min\{p_A (1-p_B), p_B (1-p_B)\} & \textrm{if }D>=0 \\
+            \min\{p_A p_B, (1-p_B) (1-p_B)\} & \textrm{otherwise}
+        \end{cases}```
+
+  and {math}`D` is defined above. Polarised, `total` weighted.
+
+`D2`
+: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = (p_{ab} - p_{a} p_{b})^2`
+
+  Unpolarised, `total` weighted.
+
+`Dz`
+:  {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D (1 - 2 p_{a})(1 - 2p_{b})`,
+
+  where {math}`D` is defined above. Unpolarised, `total` weighted.
+
+`pi2`
+: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{a}p_{b}(1-p_{a})(1-p_{b})`
+
+  Unpolarised, `total` weighted.
+
+`r`
+: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{\sqrt{p_{a}p_{b}(1-p_{a})(1-p_{b})}}`,
+
+  where {math}`D` is defined above. Polarised, `total` weighted.
+
+`r2`
+: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D^{2}}{p_{a}p_{b}(1-p_{a})(1-p_{b})}`,
+
+  where {math}`D` is defined above. Unpolarised, `haplotype` weighted.
+
+Unbiased two-locus statistics from the Hill-Robertson (1968) system are
+computed from haplotype counts. Definitions of these unbiased estimators can
+be found in [Ragsdale and Gravel
+(2020)](https://doi.org/10.1093/molbev/msz265). They require at least 4 samples
+to be valid and are specified as `stat="D2_unbiased"`, `"Dz_unbiased"`, or
+`"pi2_unbiased"`.
+
+(sec_two_locus_summary_functions_two_way)=
+
+(sec_stats_two_locus_summary_functions_two_way)=
+
+##### Two-way
+
+Two-way statistics are indexed by sample sets {math}`i, j` and compute values
+using haplotype counts within pairs of sample sets.
+
+`D2`
+: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D_i * D_j`,
+
+where {math}`D` is defined above.
+
+`r2`
+: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{(p_{AB_i} - (p_{A_i}  p_{B_i})) (p_{AB_j} - (p_{A_j}  p_{B_j}))}{\sqrt{p_{A_i} (1 - p_{A_i}) p_{B_i} (1 - p_{B_i})}\sqrt{p_{A_j} (1 - p_{A_j}) p_{B_j} (1 - p_{B_j})}}`
+
+And `D2_unbiased`, which can be found in [Ragsdale and Gravel
+(2020)](https://doi.org/10.1093/molbev/msz265).
 
 
 (sec_stats_notes)=

diff --git a/python/CHANGELOG.rst b/python/CHANGELOG.rst
@@ -7,6 +7,12 @@ In development
 - Add ``json+struct`` metadata codec that allows storing binary data using a struct
   schema alongside JSON metadata. (:user:`benjeffery`, :pr:`3306`)
 
+**Features**
+
+- Add ``TreeSequence.ld_matrix`` stats method and documentation, for computing
+  two-locus statistics in site and branch mode.
+  (:user:`lkirk`, :user:`apragsdale`, :pr:`3416`) 
+
 --------------------
 [1.0.2] - 2026-03-06
 --------------------