3.5. FeatureTable

This module implements the FeatureTable object, which is mostly a wrapper around a pandas dataframe. This also includes methods to QAQC and batch correct the feature table.

class pcpfm.FeatureTable.FeatureTable(feature_table, experiment, moniker)[source]

Bases: object

A feature table is a data frame of features for an experiment.

MissingFeatureZScores(intensity_cutoff=0)[source]

Count the number of features below the specified intensity cutoff per sample and express each count as a Z-score based on the missing feature counts across all samples.

Parameters:

feature_vector_matrix (np.ndarray) – the selected feature matrix
acquisition_names (list[str]) – list of acquisition names
intensity_cutoff (int, optional) – values below this intensity are considered missing. Defaults to 0.
interactive_plot (bool, optional) – if True, interactive plots are made. Defaults to False.

Returns:

dictionary storing the result of this QCQA operation

Return type:

dict
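
The calculation described above can be sketched in plain Python. This is an illustration only, not the pcpfm implementation: the table layout (sample name mapped to a list of intensities) is hypothetical, and values at or below the cutoff are assumed to count as missing so that the default cutoff of 0 catches zeros.

```python
from statistics import mean, pstdev

def missing_feature_z_scores(table, intensity_cutoff=0):
    """table maps sample name -> list of feature intensities (hypothetical layout)."""
    # Count the features at or below the cutoff for every sample.
    counts = {
        name: sum(1 for value in values if value <= intensity_cutoff)
        for name, values in table.items()
    }
    mu = mean(list(counts.values()))
    sigma = pstdev(list(counts.values()))
    # If every sample has the same count, report a Z-score of 0 for all of them.
    return {name: (count - mu) / sigma if sigma else 0.0 for name, count in counts.items()}

table = {
    "s1": [10, 0, 5, 7],
    "s2": [12, 3, 6, 8],
    "s3": [0, 0, 0, 9],
}
z_scores = missing_feature_z_scores(table)
```

Here s3 has the most missing features and therefore receives the highest Z-score.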

QAQC(params)[source]

This is the wrapper for all the qcqa functions.

If these fields are present in the params, it will determine which methods are performed:

pca (bool, optional): Defaults to False.
tsne (bool, optional): Defaults to False.
pearson (bool, optional): Defaults to False.
spearman (bool, optional): Defaults to False.
kendall (bool, optional): Defaults to False.
missing_feature_percentiles (bool, optional): Defaults to False.
missing_feature_distribution (bool, optional): Defaults to False.
median_correlation_outlier_detection (bool, optional): Defaults to False.
missing_feature_outlier_detection (bool, optional): Defaults to False.
intensity_analysis (bool, optional): Defaults to False.
feature_distribution (bool, optional): Defaults to False.
feature_outlier_detection (bool, optional): Defaults to False.

Parameters:: params (dict) – the params from the main process.
Returns:: with all qcqa results for the performed QCQA steps
Return type:: list
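
As a rough illustration of how the flags in params select QAQC steps, the sketch below collects the enabled steps from a params dictionary. The helper name enabled_qaqc_steps and the extra "moniker" key are hypothetical; only the flag names come from the list above.

```python
# Flag names taken from the QAQC documentation above; each defaults to False.
QAQC_STEPS = (
    "pca", "tsne", "pearson", "spearman", "kendall",
    "missing_feature_percentiles", "missing_feature_distribution",
    "median_correlation_outlier_detection", "missing_feature_outlier_detection",
    "intensity_analysis", "feature_distribution", "feature_outlier_detection",
)

def enabled_qaqc_steps(params):
    """Return the QAQC steps switched on in params (hypothetical helper)."""
    return [step for step in QAQC_STEPS if params.get(step, False)]

# Unrelated keys such as "moniker" are ignored; absent flags count as False.
params = {"pca": True, "pearson": True, "tsne": False, "moniker": "full"}
steps = enabled_qaqc_steps(params)
```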

TIC_normalize(tic_normalization_percentile=0.9, by_batch=None, normalize_mode='median')[source]

This method normalizes the features of each acquisition based on the TICs of the samples. The TICs are calculated using only features that are present in tic_normalization_percentile or greater percent of the samples.

Normalize mode determines how the normalization factor will be calculated, using either the mean or the median.

If by_batch is given, the normalization is performed in batches first with the batches determined by the field specified by_batch. Then all batches are normalized to one another.

Parameters:

tic_normalization_percentile (float) – only features in more than this percent of samples are used for TIC calculation, defaults to 0.90
by_batch (str, optional) – the field on which to group samples into batches
normalize_mode (str, optional) – the method used to calculate the normalization factors, defaults to ‘median’
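
The normalization without batches can be sketched as follows. This is a minimal plain-Python illustration, not the pcpfm code: the table layout is hypothetical, a feature is assumed to be "present" when its intensity is above zero, and the normalization factor is assumed to be the median (or mean) TIC divided by the TIC of the sample.

```python
from statistics import mean, median

def tic_normalize(table, tic_normalization_percentile=0.9, normalize_mode="median"):
    """table maps sample name -> list of intensities, one entry per feature."""
    samples = list(table)
    n_features = len(table[samples[0]])
    # Keep only features detected in at least the requested fraction of samples.
    common = [
        i for i in range(n_features)
        if sum(table[s][i] > 0 for s in samples) / len(samples) >= tic_normalization_percentile
    ]
    # The TIC of each sample is the summed intensity of the common features.
    tics = {s: sum(table[s][i] for i in common) for s in samples}
    tic_values = list(tics.values())
    target = median(tic_values) if normalize_mode == "median" else mean(tic_values)
    # Scale every feature of a sample so that its TIC matches the target.
    return {s: [v * target / tics[s] for v in table[s]] for s in samples}

table = {"a": [100, 200, 0], "b": [200, 400, 50], "c": [50, 100, 0]}
normalized = tic_normalize(table)
```

The third feature is seen in only one of three samples, so it is excluded from the TICs (300, 600 and 150) but is still scaled along with the rest of its sample.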

batch_correct(by_batch)[source]

This method batch corrects the feature intensities. The batches are determined dynamically using the by_batch field.

Parameters:: by_batch (str) – the field on which to batch samples

blank_mask(blank_value='Blank', sample_value='Unknown', query_field='Sample Type', blank_intensity_ratio=3, by_batch=None, logic_mode='or')[source]

Given a feature table containing samples that we consider blanks, drop all features whose intensity in the non-blank samples is not blank_intensity_ratio times higher than the mean intensity in the blanks.

The blank samples are specified by the combination of blank_value and query_field. Non-blank samples are specified by sample_value and query_field in a similar manner.

If there are batches in the experiment, blank masking is done per batch. A feature is then dropped if the ratio condition fails in any batch (if logic_mode is “or”) or in all batches (if logic_mode is “and”). The batches are specified by a field in the metadata via by_batch.

Parameters:

by_batch (str, optional) – if provided, blank mask per batch, with batches defined by this field, defaults to None
blank_intensity_ratio (int, optional) – sample feature intensity / blank intensity must exceed this value to be kept, defaults to 3
logic_mode (str, optional) – determines if a feature is dropped if it fails the test in one batch or all batches, defaults to “or”
blank_value (str, optional) – the value of query_field that specifies the blanks, defaults to “Blank”
sample_value (str, optional) – the value of query_field that specifies the study samples, defaults to “Unknown”
query_field (str, optional) – the field to look for the sample type in, defaults to “Sample Type”
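
The single-batch case can be sketched in a few lines of plain Python. This is not the pcpfm implementation: the data layout is hypothetical, and the sketch assumes that the mean intensity of the study samples is what gets compared against the mean intensity of the blanks.

```python
from statistics import mean

def blank_mask(features, blank_names, sample_names, blank_intensity_ratio=3):
    """features maps feature id -> {acquisition name: intensity} (hypothetical layout)."""
    kept = {}
    for feature_id, intensities in features.items():
        blank_mean = mean([intensities[name] for name in blank_names])
        sample_mean = mean([intensities[name] for name in sample_names])
        # Keep the feature only if the samples exceed the blanks by the required ratio.
        if sample_mean > blank_intensity_ratio * blank_mean:
            kept[feature_id] = intensities
    return kept

features = {
    "F1": {"blank1": 100, "s1": 1000, "s2": 800},  # well above the blank, kept
    "F2": {"blank1": 100, "s1": 150, "s2": 250},   # close to the blank, dropped
}
masked = blank_mask(features, blank_names=["blank1"], sample_names=["s1", "s2"])
```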

clean_columns()[source]

Some helper scripts will convert the file path and append the directory name on the sample names using ‘___’ as a separator.

This will convert these back to the anticipated names.

correlation_heatmap(correlation_type, log_transform=False, full_results=False)[source]

Using a specified correlation function generate a correlation heatmap for the feature table. Optionally, log transform the feature table first.

The permitted correlation types are:

“pearson”, “spearman” or “kendall”

Only pearson will log_transform the feature table if enabled since the non-parametric correlations will not be affected by the log transform.

Parameters:

figure_params (dict) – dictionary with the figure params
correlation_type (str) – what correlation type to use
log_transform (bool, optional) – if true, log transform before linear correlation, defaults to False
full_results (bool, optional) – if true, yield the corr matrix as dictionary, else discard the matrix, defaults to False

Returns:

a dict with the correlation results and configuration used to generate result

Return type:

dict

drop_invariants(zeros_only=False)[source]

This method drops features that have all zero intensity or the same intensity across all samples.

This situation occurs as a result of filtering. For instance, if a contaminant is only seen in the blanks, when the blanks are dropped from the feature table, that feature is still in the table but will be zero (or an interpolated value) for the remaining samples. These features carry no information and can complicate downstream analysis.

Parameters:: zeros_only (bool, optional) – if true, only drop features that are all zero, defaults to False
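
A minimal sketch of this filter, assuming a hypothetical layout in which each feature id maps to its list of intensities across samples (this is not the pcpfm implementation):

```python
def drop_invariants(features, zeros_only=False):
    """features maps feature id -> list of intensities across samples."""
    kept = {}
    for feature_id, values in features.items():
        all_zero = all(value == 0 for value in values)
        constant = len(set(values)) == 1
        # All-zero features are always dropped; other constant features
        # are dropped only when zeros_only is False.
        if all_zero or (constant and not zeros_only):
            continue
        kept[feature_id] = values
    return kept

features = {"F1": [0, 0, 0], "F2": [5, 5, 5], "F3": [1, 2, 3]}
informative = drop_invariants(features)
nonzero_only = drop_invariants(features, zeros_only=True)
```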

drop_missing_features(by_batch=None, drop_percentile=0.8, logic_mode='or')[source]

This method will drop features that are uncommon in the feature table.

drop_percentile is the threshold for inclusion.

Parameters:

by_batch (str, optional) – if provided, perform the operation on each batch separately, with batches defined by this field, defaults to None
drop_percentile (float, optional) – features present in this percent or fewer of samples are dropped, defaults to 0.8
logic_mode (str, optional) – if by batch, drop any feature that fails the threshold in ‘any’ batch or ‘all’ batches, defaults to “or”
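
The threshold logic without batches might look like the sketch below. The data layout is hypothetical and a feature is assumed to be present in a sample when its intensity is above zero; the pcpfm implementation may differ.

```python
def drop_missing_features(features, drop_percentile=0.8):
    """features maps feature id -> list of intensities across samples."""
    kept = {}
    for feature_id, values in features.items():
        fraction_present = sum(value > 0 for value in values) / len(values)
        # Features present in drop_percentile or fewer of the samples are dropped.
        if fraction_present > drop_percentile:
            kept[feature_id] = values
    return kept

features = {
    "F1": [10, 12, 9, 11, 10],  # present in 5 of 5 samples, kept
    "F2": [10, 12, 0, 11, 10],  # present in 4 of 5 samples (0.8), dropped
    "F3": [0, 0, 0, 0, 7],      # present in 1 of 5 samples, dropped
}
common = drop_missing_features(features)
```

Note that a feature sitting exactly at the threshold is dropped, following the "this percent or fewer" wording above.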

drop_sample_by_name(drop_name, drop_others=False)[source]

This method drops a sample from a feature table by its name.

Optionally all other samples that do not match the name can be dropped as well.

Parameters:

drop_name (str) – the name of the sample to be dropped
drop_others (bool, optional) – drop other samples if true. Defaults to False.

drop_samples_by_field(value, field, drop_others=False)[source]

For a given field and a value for that field drop all samples that match or all samples that do not match.

Parameters:

value (str) – the value for the field to be dropped
field (str) – the field corresponding to the value that needs to be dropped
drop_others (bool, optional) – if true drop samples that do not match. Defaults to False.

drop_samples_by_filter(sample_filter, drop_others=False)[source]

Given a sample filter, a dictionary as described elsewhere, drop all samples that match the filter.

Parameters:

sample_filter (dict) – the dictionary specifying the filter
drop_others (bool, optional) – if true, reverse the logic of the drop. Defaults to False.

drop_samples_by_qaqc(qaqc_filter, drop_others=False, params=None)[source]

This drops samples based on a qaqc result. This requires an additional field in the filter called “conditions” which can accept keys “>” and “<” that control the logic of the comparison. Currently only numerical metrics can be used for dropping. The “Action” field is also needed and can accept the values “Keep” and “Drop”, which specify what should happen to the sample that matches the filter.

The permitted qaqc results for this filter are described in self.qaqc_result_to_key. If the metric has not been evaluated, it will be evaluated on demand in this method.

#todo - params seems unnecessary here

Parameters:

qaqc_filter (dict) – a dict detailing the qaqc filter
drop_others (bool, optional) – if true, reverse the logic of the drop. Defaults to False.
params (dict, optional) – the params from main, needed for figure_params. Defaults to None.

feature_distribution(intensity_cutoff=0)[source]

Count the number of features above the specified intensity cutoff per sample

Parameters:

feature_vector_matrix (np.ndarray) – the selected feature matrix
acquisition_names (list[str]) – list of acquisition names
intensity_cutoff (int, optional) – values with greater intensity are considered. Defaults to 0.
interactive_plot (bool, optional) – if True, interactive plots are made. Defaults to False.

Returns:

dictionary storing the result of this QCQA operation

Return type:

dict

feature_distribution_outlier_detection(intensity_cutoff=0)[source]

Count the number of features above the specified intensity cutoff per sample and express each count as a Z-score based on the feature counts across all samples.

Parameters:

feature_vector_matrix (np.ndarray) – the selected feature matrix
acquisition_names (list[str]) – list of acquisition names
intensity_cutoff (int, optional) – values above this intensity are considered. Defaults to 0.
interactive_plot (bool, optional) – if True, plots are interactive. Defaults to False.

Returns:

dictionary storing the result of this QCQA operation

Return type:

dict

gen_figure(figure_type, data, title='', x_label=None, y_label=None, fig_params=None, skip_annot=False, bins=100)[source]

A single method is used to generate the figures for the FeatureTable. This allows for consistent looking figures to be generated.

The permitted types of figures are:

“bar” - make a bar plot
“scatter” - make a scatter plot
“clustermap” - make a clustermap using seaborn
“heatmap” - make a heatmap

This will be refactored in the future, but this method is responsible for generating all figures related to FeatureTables. The figure parameters such as color, markers, etc. are stored as a data member in the FeatureTable object.

Parameters:

figure_type (str) – which figure type to make
data (can be dict or list (need to better document)) – the data to plot
title (str, optional) – the title for the figure, defaults to ‘’
x_label (str, optional) – string to apply to the x-axis, defaults to None
y_label (str, optional) – string to apply to the y-axis, defaults to None
fig_params (dict, optional) – if provided override the object’s fig_param, defaults to None
skip_annot (bool, optional) – if true do not apply cosmetics to the figure, defaults to False

generate_cosmetic(colorby=None, markerby=None, textby=None, seed=None)[source]

Plots need colors, markers, and text fields. The colors and markers need to be defined on the fly since they may not be known a priori. This method generates this mapping based on the fields in colorby, markerby and textby.

Parameters:

colorby (list, optional) – list of fields that need colors, defaults to None
markerby (list, optional) – list of fields that need markers, defaults to None
textby (list, optional) – list of fields to be used for text, defaults to None. Largely here for future expansion.
seed – if provided, this sets the seed for RNG purposes. Should allow reproducible maps. Defaults to None.

Returns:: map of field values to colors, markers and text
Return type:: dict

generate_figure_params(params)[source]

This method generates the parameters used for plotting.

Parameters:: params (dict) – the params passed on the CLI.

get_mz_tree(mz_tol)[source]

Construct an interval tree to search for features using a query mz and a specific mz tolerance in ppm.

Parameters:: mz_tol (float or int) – this is the mz tolerance in ppm
Returns:: interval tree for given mz_tol
Return type:: intervaltree

get_rt_tree(rt_tol)[source]

Construct an interval tree to search for features using a query rtime and a specific rtime tolerance in absolute units (sec).

Parameters:: rt_tol (float or int) – this is the rtime tolerance in sec
Returns:: interval tree for given rt_tol
Return type:: intervaltree

impute_missing_features(ratio=0.5, by_batch=None, method='min')[source]

Fill zero values with a small value to make downstream stats more robust. This value is a multiplier of the minimum value for that feature observed across all samples, excluding zeros.

Parameters:

ratio (float, optional) – multiply min value by this value, defaults to 0.5
by_batch (str, optional) – if provided, impute per batch, with batches defined by this field, defaults to None
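
The imputation rule without batches can be sketched as follows, assuming a hypothetical layout in which each feature id maps to its list of intensities. It is meant as an illustration of the rule, not as the pcpfm implementation.

```python
def impute_missing_features(features, ratio=0.5):
    """features maps feature id -> list of intensities across samples."""
    imputed = {}
    for feature_id, values in features.items():
        nonzero = [value for value in values if value > 0]
        # The fill value is a fraction of the smallest non-zero intensity
        # observed for this feature; all-zero features are left unchanged.
        fill = ratio * min(nonzero) if nonzero else 0
        imputed[feature_id] = [value if value > 0 else fill for value in values]
    return imputed

features = {"F1": [0, 200, 100], "F2": [40, 0, 0]}
filled = impute_missing_features(features)
```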

intensity_analysis()[source]

This will report the sum, mean, median of features as well as those values when the missing values are removed or when they are log2 transformed.

Returns:: QAQC_result dict
Return type:: dict

intensity_distribution(skip_zero=True)[source]

This method generates various summaries of the intensity distribution in the feature table. These include TICs, LogTICs, and median and mean intensity values, both including and excluding zeros, as well as the same values after log transforming the intensities.

Parameters:: skip_zero (bool, optional) – if true, don’t include zero values. Defaults to True.

static load(moniker, experiment)[source]

This method yields a FeatureTable object when given a feature table moniker. FeatureTables are registered with the experiment object using a moniker, a string that points to the file path for that feature table. This method queries the experiment object, gets the feature table path, and creates the object.

Parameters:

moniker (str) – the string with which the FeatureTable is registered
experiment (object) – the experiment object with the FeatureTable

Returns:

the feature table for the moniker

Return type:

FeatureTable

log_transform(new_moniker, log_mode='log2')[source]

Log transform the features in the table.

Parameters:

new_moniker (str) – the moniker for the log transformed feature table
log_mode (str, optional) – can be log10 or log2, which type of log to use, defaults to “log2”

property log_transformed

This property queries the experiment object to determine if the feature table has been log transformed already

Some operations log transform the feature table before analysis. Multiple log transforms would yield unwanted results, so if an operation is going to log transform a feature table, check this first to ensure that it has not already been log transformed.

Returns:: true if table is log_transformed
Return type:: bool

make_nonnegative(fill_value=1)[source]

This replaces all NaN and 0 values in the feature table with the specified fill_value

This is used primarily before log transforming the feature table to remove values that cannot be log transformed

Parameters:: fill_value (int, optional) – the value to replace NaN and 0 with, defaults to 1
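
The reason for filling before a log transform can be seen in a short sketch. The two functions below operate on a plain list of intensities and only mirror the behavior described for make_nonnegative and log_transform; they are not the pcpfm methods themselves.

```python
import math

def make_nonnegative(values, fill_value=1):
    """Replace NaN and 0 with fill_value so that every value can be log transformed."""
    return [fill_value if (v == 0 or math.isnan(v)) else v for v in values]

def log_transform(values, log_mode="log2"):
    """Apply log2 or log10 to every value."""
    log = math.log2 if log_mode == "log2" else math.log10
    return [log(v) for v in values]

raw = [0, float("nan"), 8]
filled = make_nonnegative(raw)   # zeros and NaN become 1
logged = log_transform(filled)   # log2(1) is 0, so filled entries map to 0
```

With the default fill_value of 1, previously missing entries become 0 after the transform.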

median_correlation_outlier_detection(correlation_type='pearson')[source]

The median correlation of a sample against all other samples can be expressed as a Z-score against the median of ALL correlations in the experiment. A high or low Z-score indicates that the sample was poorly correlated with other samples in the experiment.

Parameters:

self – a feature table object
correlation_type (str) – can be ‘pearson’, ‘spearman’, ‘kendall’

Returns:

QAQC_result dict

Return type:

dict
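
One way to read this calculation is sketched below with Pearson correlation in plain Python. The exact standardization used by pcpfm is not stated here, so the sketch assumes that each sample's median correlation is standardized against the mean and standard deviation of the per-sample medians; the table layout is also hypothetical.

```python
from statistics import mean, median, pstdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def median_correlation_z_scores(table):
    """table maps sample name -> list of feature intensities."""
    names = list(table)
    # Median correlation of each sample against every other sample.
    medians = {
        name: median([pearson(table[name], table[other]) for other in names if other != name])
        for name in names
    }
    mu = mean(list(medians.values()))
    sigma = pstdev(list(medians.values()))
    return {name: (m - mu) / sigma if sigma else 0.0 for name, m in medians.items()}

table = {
    "s1": [1, 2, 3, 4],
    "s2": [2, 4, 6, 9],
    "s3": [1, 3, 5, 7],
    "s4": [4, 3, 2, 1],  # anti-correlated with the other samples
}
z_scores = median_correlation_z_scores(table)
```

In this toy table s4 runs opposite to the other samples and ends up with the lowest Z-score.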

missing_feature_distribution(intensity_cutoff=0)[source]

Count the number of missing features or features below the specified intensity cutoff per sample

Parameters:

feature_vector_matrix (np.ndarray) – the selected feature matrix
acquisition_names (list[str]) – list of acquisition names
intensity_cutoff (int, optional) – values below this intensity are considered missing. Defaults to 0.
interactive_plot (bool, optional) – if True, interactive plots are made. Defaults to False.

Returns:

dictionary storing the result of this QCQA operation

Return type:

dict

missing_feature_percentiles()[source]

Calculate the distribution of missing features with respect to the percent of samples with the feature

Parameters:

feature_vector_matrix (np.ndarray) – the selected feature matrix
interactive_plot (bool, optional) – if True, interactive plots are made. Defaults to False.

Returns:

dictionary storing the result of this QCQA operation

Return type:

dict

property non_sample_columns

Return a list of the column names in the feature table that are not sample names.

This is used when filtering the feature tables but typically the list of sample columns is used instead.

Returns:: list of columns that are not samples
Return type:: list

property num_features

Returns the number of features in the feature table

Returns:: number of features in feature table
Return type:: int

property num_samples

Returns the number of samples in the feature table

Returns:: number of samples in feature table
Return type:: int

pca(log_transform=True)[source]

Perform PCA on provided feature table, optionally log transform it first.

Parameters:: log_transform (bool, optional) – if true log2 transform the table
Returns:: QAQC_result dict
Return type:: dict

properties_distribution()[source]: This method generates figures for the distribution (a histogram) of every parameter in the feature table that is not id_number, parent_masstrack_id or actual intensities in the samples. Useful for examining a feature table.

qaqc_result_to_key = {'cSelectivity_distribution': 'properties_distribution', 'feature_count_z_scores': 'feature_outlier_detection', 'intensity_distribution': 'intensity_distribution', 'intensity_distribution_log': 'intensity_distribution', 'kendall_correlation': 'kendall', 'kendall_logtransformed_correlation': 'log_kendall', 'log_missing_dropped_mean_intensity': 'intensity_analysis', 'log_missing_dropped_median_intensity': 'intensity_analysis', 'log_missing_dropped_sum_intensity': 'intensity_analysis', 'log_tics': 'intensity_analysis', 'mean_intensity': 'intensity_analysis', 'median_intensity': 'intensity_analysis', 'missing_dropped_mean_intensity': 'intensity_analysis', 'missing_dropped_median_intensity': 'intensity_analysis', 'missing_dropped_sum_intensity': 'intensity_analysis', 'missing_feature_z_scores': 'missing_feature_z_scores', 'pca': 'pca', 'pearson_correlation': 'pearson', 'pearson_logtransformed_correlation': 'log_pearson', 'snr_distribution': 'properties_distribution', 'spearman_correlation': 'spearman', 'spearman_logtransformed_correlation': 'log_spearman', 'sum_intensity': 'intensity_analysis', 'tics': 'intensity_analysis', 'tsne': 'tsne'}

property sample_columns

Return a list of the column names in the feature table that are sample names.

This is used when filtering the feature tables. When we search the experiment for a set of samples with a given filter, this returns samples in the experiment that may not be in the feature table. We can use this list to filter out the samples in the experiment that are not in the feature table.

Returns:: list of sample columns
Return type:: list

save(new_moniker=None, drop_invariants=True)[source]

Save the feature table as a pandas-created .tsv and register the new on-disk location with the experiment object using the specified new_moniker or reuse the existing moniker. By default this drops features that have no variance in the feature table. This can occur when a sample or samples are dropped and one or more features are zero or interpolated only in the remaining samples.

When an operation is performed that modifies a feature table, the resulting feature table can be saved to disk using this method. The moniker for the feature table can be reused or a new moniker provided. If a new moniker is provided it cannot be preferred or full since we do not want to overwrite the asari results.

Dropping invariants is recommended to reduce the size of the feature table and prevent uninformative features from reaching downstream steps. There is no good reason to turn it off, but the option exists.

Parameters:

new_moniker (string, optional) – a new moniker to register the saved table with the experiment object, defaults to None
drop_invariants (bool, optional) – if true, drop features that have no variance, defaults to True

save_fig_path(name)[source]

Given a desired name for a figure, this returns the path to which this figure should be saved.

This ensures that the resulting path for the figure is a reasonable path without special characters and is saved to the appropriate location in the experiment directory.

Parameters:: name (str) – desired name for the figure
Returns:: path to save figure
Return type:: str

search_for_feature(query_mz=None, query_rt=None, mz_tolerance=None, rt_tolerance=None)[source]

Given a query_mz and query_rt with corresponding tolerances in ppm and absolute units respectively find all features by id_number that have a matching mz and rtime.

All search fields are optional but if none are provided then all the features will be considered matching. The mz tolerance should be in ppm while the rtime tolerance should be provided in rtime units.

Parameters:

query_mz (float, optional) – the mz to search for, defaults to None
query_rt (float, optional) – the rtime to search for, defaults to None
mz_tolerance (float, optional) – the tolerance in ppm for the mz match, defaults to None
rt_tolerance (float, optional) – the tolerance in absolute units for the rt match, defaults to None

Returns:

list of matching feature IDs

Return type:

list
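
The matching rule can be illustrated with a linear scan. The documented get_mz_tree and get_rt_tree methods suggest that interval trees are used for the actual search; the sketch below only demonstrates the tolerance arithmetic, and the feature records and their keys are hypothetical.

```python
def search_for_feature(features, query_mz=None, query_rt=None,
                       mz_tolerance=None, rt_tolerance=None):
    """features is a list of dicts with 'id_number', 'mz' and 'rtime' keys."""
    matches = []
    for feature in features:
        if query_mz is not None and mz_tolerance is not None:
            # Convert the ppm tolerance into an absolute mz window.
            mz_window = query_mz * mz_tolerance / 1e6
            if abs(feature["mz"] - query_mz) > mz_window:
                continue
        if query_rt is not None and rt_tolerance is not None:
            # The rtime tolerance is already in absolute units.
            if abs(feature["rtime"] - query_rt) > rt_tolerance:
                continue
        matches.append(feature["id_number"])
    return matches

features = [
    {"id_number": "F1", "mz": 180.0634, "rtime": 35.0},
    {"id_number": "F2", "mz": 180.0650, "rtime": 36.0},
    {"id_number": "F3", "mz": 255.2324, "rtime": 120.0},
]
narrow = search_for_feature(features, query_mz=180.0634, mz_tolerance=5)
wide = search_for_feature(features, query_mz=180.0634, mz_tolerance=10)
```

At 5 ppm the window around 180.0634 is about 0.0009, so only F1 matches; widening to 10 ppm also admits F2.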

tsne(perplexity=30)[source]

Perform TSNE on provided feature table

Parameters:: perplexity (int) – perplexity value for TSNE
Returns:: QAQC result dict
Return type:: dict