.impute_downshift
- proteopy.pp.impute_downshift(adata, zero_to_na=False, downshift=1.8, width=0.3, group_by=None, inplace=True, force=False, random_state=42, verbose=False)
Impute missing values via a downshifted Gaussian.
Replaces NaN (and optionally zero) entries by sampling from a Gaussian centered at median - downshift * sd with standard deviation width * sd, simulating expression signals below the detection limit, as popularized by the Perseus platform [1]. The median and standard deviation are estimated from the observed values of either the global distribution or the distributions defined by the group_by parameter:
group_by=None: global distribution (all finite values in .X). Recommended when sample-level distributions are similar.
group_by=<obs column>: per-group distribution pooled across all samples sharing the same label in that column.
When group_by is set and a group contains fewer than three finite values, or its finite values are constant (zero standard deviation), the global distribution (all finite values in .X) is used as a fallback for that group.
The function records an imputation mask in .layers["imputation_mask_X"] (True where values were imputed) and stores run metadata in .uns["imputation"].
Working on log-transformed intensities is recommended.
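The sampling step described above can be sketched in plain NumPy. This is an illustration of the technique for the global case only, not proteopy's implementation: the function name downshift_impute_sketch is hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np


def downshift_impute_sketch(x, downshift=1.8, width=0.3, seed=42):
    """Illustrative only: fill NaNs from a downshifted Gaussian (global case)."""
    x = np.asarray(x, dtype=float).copy()
    finite = np.isfinite(x)
    observed = x[finite]

    # Reference statistics from all observed (finite) values.
    median = np.median(observed)
    sd = np.std(observed, ddof=1)  # assumption: sample standard deviation

    # Sampling distribution: shifted left of the median and narrowed.
    center = median - downshift * sd
    scale = width * sd

    rng = np.random.default_rng(seed)
    mask = ~finite
    x[mask] = rng.normal(loc=center, scale=scale, size=mask.sum())
    return x, mask


# Toy log2 intensities with three missing entries.
toy = np.array([
    [24.1, 25.3, np.nan],
    [26.0, np.nan, 23.8],
    [np.nan, 25.1, 24.7],
])
imputed, mask = downshift_impute_sketch(toy)
```

With the default downshift and width, the drawn values cluster well below the observed median, which is the intended "below detection limit" behaviour.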
- Parameters:
adata (ad.AnnData) – Proteodata-formatted AnnData.
zero_to_na (bool, optional) – If True, replace zeros in .X with NaN before imputation so they are treated as missing values.
downshift (float, optional) – Number of standard deviations to shift the distribution center leftward from the observed median.
width (float, optional) – Scaling factor applied to the observed standard deviation to set the width of the sampling distribution.
group_by (str | None, optional) – Column in adata.obs defining groups over which the reference distribution is pooled. When None, the global distribution across all samples is used.
inplace (bool, optional) – If True, modify adata in place and return None. If False, return an imputed copy without altering adata.
force (bool, optional) – If False, raise a ValueError when the data are detected as non-log-transformed. Set to True to bypass this check and impute regardless.
random_state (int | None, optional) – Seed for the NumPy random generator. Pass None for a non-deterministic run.
verbose (bool, optional) – If True, print summary statistics (measured / imputed counts) and, when group_by is set, up to the first five groups that trigger each per-group fallback to global stats.
- Returns:
Imputed AnnData when inplace=False; None otherwise. The returned or modified object contains:
.X: imputed intensity matrix (sparse if input was sparse).
.layers["imputation_mask_X"]: boolean mask; True marks positions that were imputed.
.uns["imputation"]: dict with keys method, downshift, width, group_by, random_state, n_imputed, and pct_imputed.
- Return type:
ad.AnnData | None
- Raises:
TypeError – If any argument has an unexpected type.
ValueError – If width is not positive, fewer than three finite values exist globally, the global finite values are constant (zero standard deviation), or the data appear non-log-transformed and force=False.
KeyError – If group_by is not a column in adata.obs.
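The mask and metadata listed under Returns amount to simple bookkeeping, which can be sketched in plain NumPy. This is an illustration, not the library's code: the value stored under method and the exact definition of pct_imputed (here taken relative to all matrix entries) are assumptions.

```python
import numpy as np

# Toy matrix (samples x proteins) before imputation; NaN marks missing values.
before = np.array([
    [24.1, np.nan, 25.3],
    [np.nan, 26.0, 23.8],
])

# True where a value is missing and would therefore be imputed.
imputation_mask = ~np.isfinite(before)

n_imputed = int(imputation_mask.sum())
# Assumption: percentage relative to all entries of the matrix.
pct_imputed = 100.0 * n_imputed / imputation_mask.size

# Hypothetical metadata record mirroring the documented keys.
metadata = {
    "method": "downshift",  # assumption: the actual string may differ
    "downshift": 1.8,
    "width": 0.3,
    "group_by": None,
    "random_state": 42,
    "n_imputed": n_imputed,
    "pct_imputed": pct_imputed,
}
```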
References
Examples
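>>> import numpy as np
>>> import proteopy as pr
>>> adata = pr.datasets.karayel_2020()
>>> # Keep an untouched copy of the raw intensities.
>>> adata.layers["raw"] = adata.X.copy()
>>> # Treat zeros as missing, then move to log2 space.
>>> adata.X[adata.X == 0] = np.nan
>>> adata.X = np.log2(adata.X)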
Simple imputation as popularized by Tyanova et al. 2016 (downshift=1.8, width=0.3):
>>> pr.pp.impute_downshift(adata)
Impute by drawing from sample-level Gaussian distributions instead of global:
>>> pr.pp.impute_downshift(adata, group_by="sample_id")
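The grouped behaviour, including the documented fallback to global statistics, can be sketched as follows. This is a rough NumPy illustration under stated assumptions, not proteopy's implementation: reference_stats and impute_by_group_sketch are hypothetical names, and the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np


def reference_stats(values):
    """Return (median, sd) of the finite values, or None if they are unusable."""
    finite = values[np.isfinite(values)]
    if finite.size < 3:
        return None  # fewer than three finite values
    sd = np.std(finite, ddof=1)  # assumption: sample standard deviation
    if sd == 0:
        return None  # constant values
    return float(np.median(finite)), float(sd)


def impute_by_group_sketch(x, groups, downshift=1.8, width=0.3, seed=42):
    """Illustrative only: per-group downshift imputation with global fallback."""
    x = np.asarray(x, dtype=float).copy()
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)

    global_stats = reference_stats(x)
    if global_stats is None:
        raise ValueError("need at least three non-constant finite values")

    fallback_groups = []
    for label in np.unique(groups):
        rows = np.flatnonzero(groups == label)
        block = x[rows]  # copy of the rows belonging to this group
        stats = reference_stats(block)
        if stats is None:
            # Group is too small or constant: use the global distribution.
            stats = global_stats
            fallback_groups.append(str(label))
        median, sd = stats
        missing = ~np.isfinite(block)
        block[missing] = rng.normal(
            median - downshift * sd, width * sd, size=missing.sum()
        )
        x[rows] = block
    return x, fallback_groups


# Rows are samples; group "b" has a single finite value and falls back.
toy = np.array([
    [20.0, 21.0, np.nan],
    [22.0, np.nan, 23.0],
    [np.nan, np.nan, 24.0],
])
imputed, fallback = impute_by_group_sketch(toy, ["a", "a", "b"])
```

Here group "a" has four finite values and uses its own median and standard deviation, while group "b" is imputed from the statistics of all five finite values.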