Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
308 changes: 302 additions & 6 deletions docs/stats.md
Original file line number Diff line number Diff line change
Expand Up @@ -684,14 +684,310 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
and {math}`v_j` is the covariance of the trait with the j-th covariate.


(sec_stats_multi_site)=

## Multi site statistics

:::{todo}
Document statistics that use information about correlation between sites, such as
LdCalculator (and perhaps reference {ref}`sec_identity`). Note that if we have a general
framework which has the same calling conventions as the single site stats,
we can rework the sections above.
:::
(sec_stats_two_locus)=

### Two-locus statistics

The {meth}`~TreeSequence.ld_matrix` method provides an interface to
a collection of two-locus statistics with predefined summary functions (see
{ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch`
{ref}`modes <sec_stats_mode>`. The LD matrix method differs from other
statistics methods in that it provides a unified API with an argument to
specify different two-locus summaries of the data. It otherwise behaves
similarly to most other functions with respect to `sample_sets` and `indexes`.

Two-locus statistics can be computed using two modes, either `site` or
`branch`, and these should be interpreted in the same way as these modes in the
single-site statistics. Within this framework, statistics may be either
polarised or unpolarised. For statistics that are polarised, we compute
statistic values for pairs of derived alleles. Unpolarised statistics compute
statistics over all possible alleles, derived and ancestral, and the result in
averaged over statistics computed for all pairs of alleles (see weighting
schemes below). The option for polarisation is not exposed to the user.
Comment on lines +707 to +710
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
statistic values for pairs of derived alleles. Unpolarised statistics compute
statistics over all possible alleles, derived and ancestral, and the result in
averaged over statistics computed for all pairs of alleles (see weighting
schemes below). The option for polarisation is not exposed to the user.
statistic values for pairs of derived alleles. Unpolarised statistics compute
statistics over all pairs of alleles, derived and ancestral. In either case,
the result is averaged over these values, using a weighting
scheme descrbed below. The option for polarisation is not exposed to the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this note about polarisation should be down below with the "weighting" schemes. Since isn't polarisation just chosen so that "people get what they expect"? So perhaps bundle those two bits into a "Computational details" section?

Instead, we implement statistics that are polarised where appropriate.

(sec_stats_two_locus_site)=

#### Site

The `site` mode computes two-locus statistics summarized over alleles between
all pairs of specified sites. The default behavior, leaving `sites`
unspecified, will compute a matrix for all pairs of sites, with
one row and column for each site in the tree sequence (i.e., an n x n
matrix where n is the number of sites in the tree sequence). We can also
restrict the output to a subset of sites, either by specifying a single vector
for both rows and columns or a pair of vectors for the row sites and column
sites separately.

The following computes a matrix of the {math}`r^2` measure of linkage
disequilibrium (LD) computed pairwise between the first 4 sites in the tree
sequence among all samples. In our computations, row sites are used as the row
("left-hand") loci and column sites are used as the column ("right-hand")
locus, and with a single list of sites specified, we obtain a symmetric square
matrix.

```{code-cell} ipython3
ld = ts.ld_matrix(sites=[[0, 1, 2, 3]])
print(ld)
```

The following demonstrates how we can specify the row and column sites
independently of each other. We're specifying 3 columns and 2 rows, which
computes a subset of the matrix shown above.

```{code-cell} ipython3
ld = ts.ld_matrix(sites=[[0, 1], [1, 2, 3]])
print(ld)
```

Because we allow for two-locus statistics to be computed for multi-allelic
data, we need to be able to combine statistical results from each pair of
alleles into one summary for a pair of sites. We use two implementations for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
alleles into one summary for a pair of sites. We use two implementations for
alleles into one summary for a pair of sites. This does not affect biallelic
data (and so this section can be skipped on first reading).
We use two implementations for

combining results from multiple alleles: `hap_weighted` and `total_weighted`.
These are statistic-specific and not chosen by the user.

Briefly, consider a pair of sites with {math}`n` alleles at the first locus and
{math}`m` alleles at the second. Write {math}`f_{i,j}` as the statistic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{math}`m` alleles at the second. Write {math}`f_{i,j}` as the statistic
{math}`m` alleles at the second. (Whether this includes the ancestral allele
depends on whether the statistic is polarised.) Write {math}`f_{i,j}` as the statistic

computed for focal alleles {math}`A_i` and {math}`B_j`, with haplotype weights
{math}`(A_i B_j, A_i b_j, a_i B_j)`, where {math}`a_i` and {math}`b_j` are the
collection of alleles that are not the focal alleles {math}`A_i` or
{math}`B_j`, respectively. Then the weighting schemes are defined as:
Comment on lines +755 to +758
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
computed for focal alleles {math}`A_i` and {math}`B_j`, with haplotype weights
{math}`(A_i B_j, A_i b_j, a_i B_j)`, where {math}`a_i` and {math}`b_j` are the
collection of alleles that are not the focal alleles {math}`A_i` or
{math}`B_j`, respectively. Then the weighting schemes are defined as:
computed for focal alleles {math}`A_i` and {math}`B_j`.
Then the weighting schemes are defined as:

This bit didn't make sense, and I think wasn't used anyhow.


- `hap_weighted`: {math}`\sum_{i=1}^{n}\sum_{j=1}^{m}p(A_{i}B_{j})f_{ij}`,
where {math}`p(A_{i}B_{j})` is the frequency of haplotype {math}`A_{i}B_{j}`.
This method was first introduced in [Karlin
(1981)](https://doi.org/10.1111/j.1469-1809.1981.tb00308.x) and reviewed in
[Zhao (2007)](https://doi.org/10.1017/S0016672307008634).

- `total_weighted`: {math}`\frac{1}{n m}\sum_{i=1}^{n}\sum_{j=1}^{m}f_{ij}`.
This method assigns equal weight to each of the possible pairs of focal
alleles at the two sites, taking the arithmetic mean of statistics over
focal haplotypes.

Out of all of the available summary functions, only {math}`r^2` uses
`hap_weighted` normalisation, with the remainder using uniform weighting
(`total_weighted`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we say "see Ragsdale (XXXX) for explanation of this choice" or something?


(sec_stats_two_locus_branch)=

#### Branch

The `branch` mode computes expected two-locus statistics between pairs of
trees, conditioned on the marginal topologies and branch lengths of those
trees. The trees for which we compute statistics are specified by positions,
and for a pair of positions we consider all possible haplotypes that could be
generated by a single mutation occurring at the two trees.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
generated by a single mutation occurring at the two trees.
generated by a single mutation occurring on each of the two trees.


For two trees, one with {math}`n` branches and the other with {math}`m`
branches, there are {math}`nm` possible pairs of branches that may carry the
pair of mutations. For each pair, we compute the two-locus statistic, and then
sum these values weighted by the product of the two branch lengths. Given the
two mutations occur, this accounts for the relative probability that the two
mutations fall on any pair of branches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about inserting one of the precise interpretations of what's computed? This is still very vague ("accounts for the relative probability").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another important thing to mention: which branches? For instance, if there are branches above the root, or not ancestral to any samples, are those included? What about branches ancestral to samples but not to any of the samples in sample_sets? I assumed it would be "only branches on which a mutation would produce a polymorphic site", but I'm noticing that I can get nans from branch stats, so maybe not?

mts.ld_matrix(sample_sets, stat='r2', indexes=(0,1), mode='branch')
array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])

The time complexity of this method is quadratic in the number of samples,
due to the pairwise comparisons of branches from each pair of trees.
By default, this method computes
a symmetric matrix for all pairs of trees, with rows and columns representing
each tree in the tree sequence. Similar to the site method, we can restrict the
output to a subset of trees, either by specifying a vector of positions or
a pair of vectors for row and column positions separately. To select a specific
tree, the specified positions must land in the tree span (`[start, end)`).

In the following, we compute a matrix of expected {math}`r^2` within and
between the first 4 trees in the tree sequence. The tree breakpoints are
a convenient way to specify those first four trees.

```{code-cell} ipython3
ld = ts.ld_matrix(mode="branch", positions=[ts.breakpoints(as_array=True)[0:4]])
print(ld)
```

Again, we can specify the row and column trees separately.

```{code-cell} ipython3
breakpoints = ts.breakpoints(as_array=True)
ld = ts.ld_matrix(mode="branch", positions=[breakpoints[[0]], breakpoints[0:4]])
print(ld)
```

(sec_stats_two_locus_sample_sets)=

#### Sample Sets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to first insert a paragraph explaining what happens without indexes: i.e., a list of samples gets you a matrix; and a list-of-lists of samples gets you a 3D array. Then say "okay and for a two-way thing you need a list-of-list of indexes, ...".

The two-locus API provides a mechanism by which to subset the samples under
consideration, providing the ability to compute a separate LD matrix for each
sample set or an LD matrix between sample sets. Output dimensions are handled
Comment on lines +823 to +824
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite right, since separate LD matrices are produced for each sample set by default even for non-two-way statistics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
consideration, providing the ability to compute a separate LD matrix for each
sample set or an LD matrix between sample sets. Output dimensions are handled
consideration, providing the ability to compute a LD matrix between distinct sample sets.
Output dimensions are handled

in the same manner as the rest of the stats API (see
{ref}`sec_stats_output_dimensions`). The two-locus API differs from the rest of
the stats API in that we handle one-way and two-way statistics in the same
function call.

To compute a two-way two-locus statistic, the `index` argument must be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To compute a two-way two-locus statistic, the `index` argument must be
To compute a two-way two-locus statistic, the `indexes` argument must be

provided. The statistics are selected in the same way (with the `stat`
argument), but we provide a restricted set of two-way statistics (see
{ref}`sec_stats_two_locus_summary_functions_two_way`). The dimension-dropping
rules for the result follow the rest of the tskit stats API in that a single list
or tuple will produce a single two-dimensional matrix, while list of these
will produce a three-dimensional array, with the outer dimension of length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
will produce a three-dimensional array, with the outer dimension of length
will produce a three-dimensional array, with the first dimension of length

equal to the length of the list.

For example, to compute the {math}`r^2` LD matrix over a subset of samples in
the tree sequence (such as sample nodes 0 through 7), we would specify the
samples as follows:

```
ts = msprime.sim_ancestry(
20,
population_size=10000,
sequence_length=1000,
recombination_rate=2e-8,
random_seed=12)
ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)

ld = ts.ld_matrix(mode=“site”, sample_sets=range(8))
print(ld)
```

For concreteness, we would expect the following dimensions with the specified
`sample_sets` and `indexes` arguments.

```
# one-way
ts.ld_matrix(sample_sets=None) -> 2 dimensions
ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
# two-way
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
```

#### Why are there nan values in the LD matrix?

For some statistics, it is possible to observe nan entries in the LD matrix,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For some statistics, it is possible to observe nan entries in the LD matrix,
For some statistics, it is possible to observe `nan` entries in the LD matrix,

which can be surprising or numerically impact downstream analyses. A nan entry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
which can be surprising or numerically impact downstream analyses. A nan entry
which can be surprising or numerically impact downstream analyses. A `nan` entry

may occur when computing a ratio statistic (such as {math}`r` or {math}`r^2`)
with a denominator of zero, indicating that one or both sites in the pair are
monomorphic. This can happen for a number of reasons:

- The mutation models allows for reversible mutations, so a back mutation at
a site has resulted in a single allele despite multiple mutations in the
history of the sample.
- LD is computed for a subsample of individuals, and some sites are not
variable among the sample nodes in the subsample.
- A mutation exists above the root of the local tree, so that all samples carry
the mutation, and one or more sites are not variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps address also nans in branch stats here

(sec_stats_two_locus_sample_one_way_stats)=

##### One-way Statistics

One-way statistics are summaries of two loci in a single sample set, using a
triple of haplotype counts {math}`\{w_{Ab}, w_{aB}, w_{AB}\}` and the size of
the sample set {math}`n`, where the capitalized and lowercase letters in our notation
represent alternate alleles.

(sec_stats_two_locus_sample_two_way_stats)=

##### Two-way Statistics

Two-way statistics are summaries of haplotype counts between two sample sets,
which operate on the three haplotype counts (as in one-way stats, above)
computed from each sample set, indexed by `(i, j)`. These statistics take on
a different meaning from their one-way counterparts. For instance {math}`D_i
D_j` can be thought of as the covariance between two loci when computed for
Comment on lines +901 to +902
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this called D^2 (also see here), not D_i D_j?

a haplotype in a single sample set.

Only a subset of our summary functions are two-way statistics (see
{ref}`sec_two_locus_summary_functions_two_way`). Note that the unbiased two-way
statistics expect non-overlapping sample sets, and we do not make any
assertions about the sample sets and assume that `i` and `j` represent disjoint
sets of samples (see also the note in {meth}`~TreeSequence.divergence`).

(sec_stats_two_locus_summary_functions)=

#### Summary Functions

(sec_stats_two_locus_summary_functions_one_way)=

##### One-way

The two-locus summary functions all take haplotype counts and sample set size as
input. Each of our summary functions has the signature
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these functions are in terms of p, not w, right? So why introduce w at all here?

{math}`f(w_{Ab}, w_{aB}, w_{AB}, n)`, converting to haplotype frequencies
{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` where appropriate by dividing by {math}`n`.

`D`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b}`

This statistic is polarised, as the unpolarised result, which averages over
allele labelings, is zero. Uses the `total` weighting method.

`D_prime`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{D_{\max}}`,

where {math}
\min\{p_A (1-p_B), p_B (1-p_B)\} & \textrm{if }D>=0 \\
\min\{p_A p_B, (1-p_B) (1-p_B)\} & \textrm{otherwise}
\end{cases}```

and {math}`D` is defined above. Polarised, `total` weighted.

`D2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = (p_{ab} - p_{a} p_{b})^2`

Unpolarised, `total` weighted.

`Dz`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D (1 - 2 p_{a})(1 - 2p_{b})`,

where {math}`D` is defined above. Unpolarised, `total` weighted.

`pi2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{a}p_{b}(1-p_{a})(1-p_{b})`

Unpolarised, `total` weighted.

`r`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{\sqrt{p_{a}p_{b}(1-p_{a})(1-p_{b})}}`,

where {math}`D` is defined above. Polarised, `total` weighted.

`r2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D^{2}}{p_{a}p_{b}(1-p_{a})(1-p_{b})}`,

where {math}`D` is defined above. Unpolarised, `haplotype` weighted.

Unbiased two-locus statistics from the Hill-Robertson (1968) system are
computed from haplotype counts. Definitions of these unbiased estimators can
be found in [Ragsdale and Gravel
(2020)](https://doi.org/10.1093/molbev/msz265). They require at least 4 samples
to be valid and are specified as `stat="D2_unbiased"`, `"Dz_unbiased"`, or
`"pi2_unbiased"`.

(sec_two_locus_summary_functions_two_way)=

(sec_stats_two_locus_summary_functions_two_way)=

##### Two-way

Two-way statistics are indexed by sample sets {math}`i, j` and compute values
using haplotype counts within pairs of sample sets.

`D2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D_i * D_j`,

where {math}`D` is defined above.

`r2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{(p_{AB_i} - (p_{A_i} p_{B_i})) (p_{AB_j} - (p_{A_j} p_{B_j}))}{\sqrt{p_{A_i} (1 - p_{A_i}) p_{B_i} (1 - p_{B_i})}\sqrt{p_{A_j} (1 - p_{A_j}) p_{B_j} (1 - p_{B_j})}}`

And `D2_unbiased`, which can be found in [Ragsdale and Gravel
(2020)](https://doi.org/10.1093/molbev/msz265).


(sec_stats_notes)=
Expand Down
6 changes: 6 additions & 0 deletions python/CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,12 @@ In development
- Add ``json+struct`` metadata codec that allows storing binary data using a struct
schema alongside JSON metadata. (:user:`benjeffery`, :pr:`3306`)

**Features**

- Add ``TreeSequence.ld_matrix`` stats method and documentation, for computing
two-locus statistics in site and branch mode.
(:user:`lkirk`, :user:`apragsdale`, :pr:`3416`)

--------------------
[1.0.2] - 2026-03-06
--------------------
Expand Down
Loading
Loading