3.3. EmpCpds
This module is concerned with the construction of EmpCpds and their annotation.
- class pcpfm.EmpCpds.EmpCpds(dict_empcpds, experiment, moniker)[source]
Bases: object
This object is largely a wrapper around the dict_empcpds returned from Khipu.
- static construct_from_feature_table(experiment, isotopes=None, adducts=None, ext_adducts=None, feature_table_moniker='full', moniker='default', add_singletons=False, rt_search_window=2, mz_tol=5, charges=None)[source]
For a given feature table, generate the empirical compounds for that table using a set of isotopes, adducts, and charges, and save them under either the table moniker or a new moniker.
- Parameters:
isotopes (list, optional) – isotopes for which to search
adducts (list, optional) – adducts to use; if None, use defaults based on ionization.
ext_adducts (list, optional) – extended adducts to use; if None, the default extended adducts are used.
feature_table_moniker (str, optional) – the feature table to use
moniker (str, optional) – the moniker to save the empcpds to
add_singletons (bool, optional) – if true, add singletons to the khipus
rt_search_window – the rt window to use for empcpd construction, default is 2.
mz_tol – the mz tolerance in ppm to use for empcpd construction, default is 5.
charges – the charges, in absolute units, to consider for empcpd construction.
- Returns:
empcpd object
- Return type:
empCpd
- create_annotation_table()[source]
This flattens the empcpd annotations into a dataframe summarizing the annotation on a per-feature level.
This is for the generation of outputs.
- Returns:
annotation table
- Return type:
dataframe
- property feature_id_to_khipu_id
This property provides a mapping from feature ids back to the khipu that contains them.
- Returns:
feature to kp id mapping dict
- Return type:
dict
- get_mz_tree(mz_tol, abs_error=False)[source]
This method will return an existing m/z based interval tree for these empcpds for a given mz_tol.
- Parameters:
mz_tol (float) – the mz_tol assumed to be in ppm
abs_error (bool) – if true, assume the mz tolerance provided is in daltons
- Returns:
interval tree for mz at the provided mz_tol
- Return type:
intervaltree
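The difference between a ppm tolerance and an absolute (dalton) tolerance can be illustrated with a minimal sketch. The helper `mz_window` below is hypothetical and is not part of pcpfm; it only shows how the search window around a single m/z value might be derived from `mz_tol` and `abs_error`, and it does not build an interval tree.

```python
def mz_window(mz, mz_tol, abs_error=False):
    """Return the (low, high) m/z bounds around `mz`.

    Hypothetical helper for illustration only, not the pcpfm implementation.
    """
    # An absolute tolerance is used as-is (daltons); a ppm tolerance
    # scales with the m/z value itself.
    delta = mz_tol if abs_error else mz * mz_tol / 1e6
    return mz - delta, mz + delta


# 5 ppm at m/z 200 corresponds to a half-width of 0.001 Da.
ppm_low, ppm_high = mz_window(200.0, 5)

# The same call with abs_error=True treats the tolerance as daltons.
abs_low, abs_high = mz_window(200.0, 0.01, abs_error=True)
```

Because a ppm window widens with increasing m/z, the same `mz_tol` value gives a different absolute width for each feature, which is why the tolerance must be fixed before such a tree is built.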
- get_precursor_mz_tree(mz_tol)[source]
This retrieves or generates the mz tree of all precursor ions for the empCpd MS2 spectra at a given ppm mass tolerance.
- Parameters:
mz_tol (float) – the mz tolerance in ppm
- Returns:
an interval tree for all precursor ion mzs at the given mz tolerance.
- Return type:
intervaltree
- get_precursor_rt_tree(rt_tolerance)[source]
This retrieves or generates the retention time tree of all precursor ions for the empCpd MS2 spectra at a given retention time tolerance.
- Parameters:
rt_tolerance (float) – the rtime tolerance in seconds
- Returns:
an interval tree for all precursor ion rtimes at the given rt tolerance.
- Return type:
intervaltree
- get_rt_tree(rt_tolerance)[source]
This method will return an existing rt based interval tree for these empcpds for a given rt_tolerance.
- Parameters:
rt_tolerance (float) – the rt tolerance in seconds
- Returns:
interval tree for rtime at the provided rt tolerance
- Return type:
intervaltree
- property khipu_id_to_feature_id
This property provides a mapping of khipu id to the feature ids in the khipu
- Returns:
kp_id to the feature ids
- Return type:
dict
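The two id-mapping properties are inverses of one another. As a minimal sketch, assuming each khipu id maps to a list of feature ids and that each feature belongs to a single khipu (both are assumptions, and the ids below are made up), one mapping could be derived from the other as follows. The function `invert_mapping` is hypothetical and not part of pcpfm.

```python
def invert_mapping(khipu_to_features):
    """Build a feature id -> khipu id dict from a khipu id -> feature ids dict."""
    feature_to_khipu = {}
    for khipu_id, feature_ids in khipu_to_features.items():
        for feature_id in feature_ids:
            # Assumes a feature appears in only one khipu; a later khipu
            # would otherwise overwrite an earlier one.
            feature_to_khipu[feature_id] = khipu_id
    return feature_to_khipu


# Made-up ids for illustration.
khipu_to_features = {"kp1": ["F1", "F2"], "kp2": ["F3"]}
feature_to_khipu = invert_mapping(khipu_to_features)
```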
- l1a_annotate(standards_csv, mz_tol=5, rt_tolerance=30, similarity_method='CosineHungarian', min_peaks=1, score_cutoff=0.5)[source]
Perform l1 annotation on the empcpds using the CD authentic standard library.
- Parameters:
standards_csv (str) – path to CD csv export
mz_tol (int, optional) – mz tolerance to match precursors. Defaults to 5.
rt_tolerance (int, optional) – rt tolerance to match precursors. Defaults to 30.
similarity_method (str, optional) – which matchms similarity method to use. Defaults to “CosineHungarian”.
min_peaks (int, optional) – minimum number of peaks that must be shared for annotation. Defaults to 1.
score_cutoff (float, optional) – scores above this value are considered matches. Defaults to 0.50.
- l1b_annotate(standards_csv, mz_tol=5, rt_tolerance=10)[source]
Level 1b annotations are based on mz and rtime tolerance against known standards.
This method takes the exported standard library from mz vault and compares a feature’s rtime and mz to the standard’s mz and retention time.
- Parameters:
standards_csv (str) – path to mzvault export
mz_tol (int, optional) – mz tolerance in ppm. Defaults to 5.
rt_tolerance (int, optional) – rt tolerance in seconds. Defaults to 10.
- l2_annotate(msp_files, mz_tol=5, similarity_method='CosineHungarian', min_peaks=1, score_cutoff=0.5)[source]
This method adds l2 annotations to empirical compounds. This requires that ms2 spectra first be mapped to the empcpd object.
Level 2 annotations are lower confidence than Level 1 annotations but are generated in a similar manner, by MS2 similarity; the reference spectra, however, come from a public reference database.
The similarity method can be any method that is provided by matchms. CosineHungarian is the default as it is a mathematically sound formulation of the cosine similarity and fast enough to be practical.
- Parameters:
msp_files (str) – path to directory with ms2 mzml files
mz_tol (int, optional) – mz tolerance in ppm for the precursor_mz_match. Defaults to 5.
similarity_method (str, optional) – name of the method for the similarity metric. Defaults to "CosineHungarian".
min_peaks (int, optional) – the minimum number of matching peaks between experimental and reference spectrum. Defaults to 1.
score_cutoff (float, optional) – the minimum score required for an annotation. Defaults to 0.50.
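How `min_peaks` and `score_cutoff` interact with a cosine-style score can be illustrated with a small, self-contained sketch. This is not the matchms CosineHungarian implementation: a simpler greedy peak assignment is swapped in here for brevity, and the names `greedy_cosine`, `is_match`, and the fragment `tolerance` parameter are hypothetical. Treating the cutoff as inclusive is also an assumption.

```python
import math


def greedy_cosine(query, reference, tolerance=0.01):
    """Return (score, n_matched) for two peak lists of (mz, intensity).

    Greedy simplification for illustration; matchms' CosineHungarian
    solves the peak assignment optimally instead.
    """
    # Collect every candidate peak pair within the fragment m/z tolerance.
    candidates = [
        (q_int * r_int, qi, ri)
        for qi, (q_mz, q_int) in enumerate(query)
        for ri, (r_mz, r_int) in enumerate(reference)
        if abs(q_mz - r_mz) <= tolerance
    ]
    # Take the strongest pairs first, using each peak at most once.
    candidates.sort(reverse=True)
    used_q, used_r, dot = set(), set(), 0.0
    for product, qi, ri in candidates:
        if qi in used_q or ri in used_r:
            continue
        used_q.add(qi)
        used_r.add(ri)
        dot += product
    norm = math.sqrt(sum(i * i for _, i in query)) * math.sqrt(
        sum(i * i for _, i in reference)
    )
    return (dot / norm if norm else 0.0), len(used_q)


def is_match(score, n_matched, min_peaks=1, score_cutoff=0.5):
    # Assumption: the cutoff is inclusive ("minimum score required").
    return n_matched >= min_peaks and score >= score_cutoff


# Made-up spectra: one shared peak near m/z 100, one unshared peak each.
query = [(100.0, 1.0), (150.0, 0.5)]
reference = [(100.005, 1.0), (200.0, 0.5)]
score, n_matched = greedy_cosine(query, reference)
```

With these made-up spectra only one peak pair is shared, so the pair passes with `min_peaks=1` but would be rejected if `min_peaks` were raised to 2, regardless of the score.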
- l4_annotate(annotation_sources, rt_tolerance=5)[source]
Given multiple annotation sources in the JSON format compliant with JMS, annotate based on neutral formula match to the annotation sources.
- Parameters:
annotation_sources – list of filepaths to annotation sources in JSON format
rt_tolerance – the rt_tolerance to be used by ExperimentalEcpdDatabase. Defaults to 5.
- static load(moniker, experiment)[source]
This method generates the empCpd object for the provided moniker.
- Parameters:
moniker – the empCpd moniker to load
experiment – the experiment from which the empCpd was generated
- Returns:
the empCpds object for the specified moniker
- map_ms2(mapping_mz_tol=5, mapping_rt_tolerance=30, ms2_files=None, scan_experiment=False)[source]
When MS2 data is acquired, each spectrum will have a retention time and precursor ion mz. These can be mapped to features in the empCpds before annotation thus limiting any subsequent searches to just the MS2 spectra that appear to represent features that we care about.
By default this method searches all acquisitions in the experiment for MS2 spectra.
Additional MS2 spectra can be provided as mzml files using the ms2_files param.
- Parameters:
mapping_mz_tol (float, optional) – mz tolerance for the feature, ion precursor mz match in ppm. Defaults to 5.
mapping_rt_tolerance (int, optional) – rt tolerance for the feature, ion precursor time match in seconds. Defaults to 30.
ms2_files (str, optional) – path to additional ms2 acquisitions. Defaults to None.
scan_experiment (bool, optional) – _description_. Defaults to False.
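The mapping step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the pcpfm implementation: the function `map_spectra_to_features`, the dictionary keys, and the ids are hypothetical, the ppm tolerance is assumed to be relative to the feature m/z, and a plain nested loop stands in for the interval trees.

```python
def map_spectra_to_features(features, spectra,
                            mapping_mz_tol=5, mapping_rt_tolerance=30):
    """Return a feature id -> list of MS2 spectrum ids mapping."""
    mapping = {}
    for spectrum in spectra:
        for feature in features:
            # ppm tolerance converted to daltons at the feature m/z (assumption).
            mz_delta = feature["mz"] * mapping_mz_tol / 1e6
            mz_ok = abs(spectrum["precursor_mz"] - feature["mz"]) <= mz_delta
            rt_ok = abs(spectrum["rtime"] - feature["rtime"]) <= mapping_rt_tolerance
            if mz_ok and rt_ok:
                mapping.setdefault(feature["id"], []).append(spectrum["id"])
    return mapping


# Made-up features and MS2 spectra.
features = [
    {"id": "F1", "mz": 180.0634, "rtime": 45.0},
    {"id": "F2", "mz": 255.2320, "rtime": 300.0},
]
spectra = [
    {"id": "S1", "precursor_mz": 180.0636, "rtime": 50.0},   # matches F1
    {"id": "S2", "precursor_mz": 180.0636, "rtime": 200.0},  # rt too far from F1
    {"id": "S3", "precursor_mz": 255.2321, "rtime": 310.0},  # matches F2
]
mapping = map_spectra_to_features(features, spectra)
```

Spectra that map to no feature, such as S2 here, are simply left out, which is what limits later annotation searches to spectra tied to features of interest.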
- property ms2_spectra
This is a lazily evaluated data store for MS2 spectra
- Returns:
ms2_id to ms2_spectra dictionary.
- Return type:
dict
- property num_features
This property returns the number of features contained within the empcpds.
- Returns:
number of features in empcpds
- Return type:
int
- property num_khipus
This property returns the number of khipus in the empCpds.
- Returns:
number of empcpds
- Return type:
int
- save(save_as_moniker=None)[source]
This method saves the empirical compound dictionary to the annotation_subdirectory. The path is determined by the moniker for the empCpds object, however, an alternative moniker can be provided which effectively saves a new empCpd object. This also updates the empCpds registry in the experiment with the path to the stored json.
- Parameters:
save_as_moniker – an alternative moniker to which to save the table. Defaults to None.
- search_for_feature(query_mz=None, query_rt=None, mz_tol=None, rt_tolerance=None)[source]
Given a query_mz and query_rt with corresponding tolerances in ppm and absolute units, respectively, find all features by id_number that have a matching mz and rtime.
All search fields are optional but if none are provided then all the features will be considered matching. The mz tolerance should be in ppm while the rtime tolerance should be provided in rtime units.
- Parameters:
query_mz (float, optional) – the mz to search for, defaults to None
query_rt (float, optional) – the rtime to search for, defaults to None
mz_tol (float, optional) – the tolerance in ppm for the mz match, defaults to None
rt_tolerance (float, optional) – the tolerance in absolute units for the rt match, defaults to None
- Returns:
list of matching feature IDs
- Return type:
list
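The optional-criteria behaviour described for this search can be sketched as follows. The function `search_features`, the feature dictionaries, and the ids are hypothetical and only illustrate the matching rule; the sketch scans a list rather than querying interval trees, and it assumes the ppm tolerance is taken relative to the query m/z.

```python
def search_features(features, query_mz=None, query_rt=None,
                    mz_tol=None, rt_tolerance=None):
    """Return ids of features matching the optional mz and rt criteria."""
    matches = []
    for feature in features:
        # The mz criterion applies only when both the query and tolerance are given.
        if query_mz is not None and mz_tol is not None:
            if abs(feature["mz"] - query_mz) > query_mz * mz_tol / 1e6:
                continue
        # The rt criterion is in absolute rtime units.
        if query_rt is not None and rt_tolerance is not None:
            if abs(feature["rtime"] - query_rt) > rt_tolerance:
                continue
        matches.append(feature["id"])
    return matches


# Made-up features for illustration.
features = [
    {"id": "F1", "mz": 180.0634, "rtime": 45.0},
    {"id": "F2", "mz": 180.0640, "rtime": 120.0},
    {"id": "F3", "mz": 255.2320, "rtime": 46.0},
]

by_mz = search_features(features, query_mz=180.0634, mz_tol=5)
by_both = search_features(features, query_mz=180.0634, query_rt=45.0,
                          mz_tol=5, rt_tolerance=2)
unfiltered = search_features(features)
```

Leaving every field as None applies no filter, so all features are returned, consistent with the note above that an empty query matches everything.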