.impute_downshift

proteopy.pp.impute_downshift(adata, zero_to_na=False, downshift=1.8, width=0.3, group_by=None, inplace=True, force=False, random_state=42, verbose=False)

Impute missing values via a downshifted Gaussian.

Replaces NaN (and optionally zero) entries by sampling from a Gaussian centered at median - downshift * sd with standard deviation width * sd, simulating expression signals below the detection limit, as popularized by the Perseus platform [1]. The median and standard deviation are estimated from the observed (finite) values of either the global distribution or the per-group distributions defined by the group_by parameter:

  • group_by=None — global distribution (all finite values in .X). Recommended when sample-level distributions are similar.

  • group_by=<obs column> — per-group distribution pooled across all samples sharing the same label in that column.
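
The sampling rule described above can be sketched in plain NumPy. This is a minimal illustration for the global case, not the library's implementation: the helper name downshift_impute_sketch is hypothetical, and the use of the sample standard deviation (ddof=1) and of np.random.default_rng are assumptions.

```python
import numpy as np

def downshift_impute_sketch(x, downshift=1.8, width=0.3, seed=42):
    """Hypothetical helper: fill NaNs from a downshifted Gaussian (global stats)."""
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)                    # positions that will be imputed
    observed = x[np.isfinite(x)]          # reference distribution: all finite values
    median = np.median(observed)
    sd = np.std(observed, ddof=1)         # assumption: sample SD; the library may differ
    rng = np.random.default_rng(seed)     # assumption: Generator-based seeding
    out = x.copy()
    # Draw one value per missing entry from N(median - downshift*sd, (width*sd)^2).
    out[mask] = rng.normal(median - downshift * sd, width * sd, size=int(mask.sum()))
    return out, mask

# Toy log2 intensity matrix (samples x proteins) with two missing values.
X = np.array([[20.1, 22.3, np.nan],
              [19.8, np.nan, 24.0],
              [21.0, 22.9, 23.5]])
X_imputed, imputed_mask = downshift_impute_sketch(X)
```

Measured entries are left untouched; only the positions flagged in imputed_mask receive sampled values.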

When group_by is set and a group contains fewer than three finite values, or its finite values are all constant (zero standard deviation), the global distribution (all finite values in .X) is used as a fallback for that group.
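
The fallback rule can be illustrated with a small NumPy sketch. The function names reference_stats and stats_per_group are hypothetical, and the sample standard deviation (ddof=1) is an assumption; the snippet only mirrors the conditions stated above (fewer than three finite values, or zero standard deviation).

```python
import numpy as np

def reference_stats(values, min_n=3):
    """Return (median, sd) of the finite values, or None if they are unusable."""
    finite = values[np.isfinite(values)]
    if finite.size < min_n:
        return None                        # too few finite values
    sd = float(np.std(finite, ddof=1))     # assumption: sample SD
    if sd == 0.0:
        return None                        # constant values
    return float(np.median(finite)), sd

def stats_per_group(X, labels):
    """Hypothetical helper: per-group (median, sd) with fallback to global stats."""
    global_stats = reference_stats(X)
    if global_stats is None:
        raise ValueError("global distribution is unusable")
    stats = {}
    for label in sorted(set(labels.tolist())):
        group_stats = reference_stats(X[labels == label])
        stats[label] = group_stats if group_stats is not None else global_stats
    return stats, global_stats

X = np.array([[20.0, 21.5, 23.0],
              [19.5, 22.0, np.nan],
              [np.nan, 18.0, 25.0],    # group "b": only two finite values
              [21.0, 21.0, 21.0]])     # group "c": constant values
labels = np.array(["a", "a", "b", "c"])
stats, global_stats = stats_per_group(X, labels)
```

In this toy input, groups "b" and "c" receive the global statistics, while group "a" keeps its own.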

The function records an imputation mask in .layers["imputation_mask_X"] (True where values were imputed) and stores run metadata in .uns["imputation"].
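
A boolean mask of this kind makes it easy to separate measured from imputed values downstream. The sketch below uses plain NumPy arrays as stand-ins for .X and .layers["imputation_mask_X"]; the toy numbers are illustrative only.

```python
import numpy as np

# Stand-ins for adata.X and adata.layers["imputation_mask_X"] after imputation.
X = np.array([[20.1, 22.3, 17.2],
              [19.8, 16.9, 24.0]])
mask = np.array([[False, False, True],
                 [False, True, False]])

measured = X[~mask]                           # values that were observed
imputed = X[mask]                             # values that were sampled
n_imputed_per_sample = mask.sum(axis=1)       # rows correspond to samples (obs)
X_measured_only = np.where(mask, np.nan, X)   # restore NaN at imputed positions
```

Restoring NaN at the masked positions is useful when a later step, such as computing summary statistics, should consider measured values only.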

Imputation should be run on log-transformed intensities; by default, data detected as non-log-transformed raise a ValueError (see force).

Parameters:
  • adata (ad.AnnData) – Proteodata-formatted AnnData.

  • zero_to_na (bool, optional) – If True, replace zeros in .X with NaN before imputation so they are treated as missing values.

  • downshift (float, optional) – Number of standard deviations to shift the distribution center leftward from the observed median.

  • width (float, optional) – Scaling factor applied to the observed standard deviation to set the width of the sampling distribution.

  • group_by (str | None, optional) – Column in adata.obs defining groups over which the reference distribution is pooled. When None, the global distribution across all samples is used.

  • inplace (bool, optional) – If True, modify adata in place and return None. If False, return an imputed copy without altering adata.

  • force (bool, optional) – If False, raise a ValueError when the data are detected as non-log-transformed. Set to True to bypass this check and impute regardless.

  • random_state (int | None, optional) – Seed for the NumPy random generator. Pass None for a non-deterministic run.

  • verbose (bool, optional) – If True, print summary statistics (measured / imputed counts) and, when group_by is set, up to the first five groups affected by each type of per-group fallback to global statistics.

Returns:

Imputed AnnData when inplace=False; None otherwise. The returned or modified object contains:

  • .X — imputed intensity matrix (sparse if input was sparse).

  • .layers["imputation_mask_X"] — boolean mask; True marks positions that were imputed.

  • .uns["imputation"] — dict with keys method, downshift, width, group_by, random_state, n_imputed, and pct_imputed.

Return type:

ad.AnnData | None

Raises:
  • TypeError – If any argument has an unexpected type.

  • ValueError – If width is not positive, fewer than three finite values exist globally, the global finite values are constant (zero standard deviation), or the data appear non-log-transformed and force=False.

  • KeyError – If group_by is not a column in adata.obs.

References

Examples

>>> import numpy as np
>>> import proteopy as pr
>>> adata = pr.datasets.karayel_2020()
>>> adata.layers["raw"] = adata.X.copy()  # copy so later in-place edits to .X do not alter the raw layer
>>> adata.X[adata.X == 0] = np.nan
>>> adata.X = np.log2(adata.X)

Simple imputation as popularized by Tyanova et al. (2016) (downshift=1.8, width=0.3):

>>> pr.pp.impute_downshift(adata)

Impute by drawing from sample-level Gaussian distributions instead of global:

>>> pr.pp.impute_downshift(adata, group_by="sample_id")