calour.io.read_ms

calour.io.read_ms(data_file, sample_metadata_file=None, feature_metadata_file=None, gnps_file=None, data_file_type='mzmine2', sample_in_row=None, direct_ids=None, get_mz_rt_from_feature_id=None, use_gnps_id_from_AllFiles=True, cut_sample_id_sep=None, mz_rt_sep=None, mz_thresh=0.02, rt_thresh=15, description=None, sparse=False, *, normalize, **kwargs)

Read a mass-spec experiment.

Calour supports several mass-spec table formats: a set of preset formats (selected with the data_file_type='XXX' parameter) as well as user-specified formats.

With the installation of the gnps-calour database interface, Calour can integrate MS2 information from GNPS into the analysis:

If the data table and the gnps file share the same IDs (preferred), GNPS annotations use the uniqueID of the features. Otherwise, Calour matches the features to the gnps file using MZ and RT threshold windows (specified by the mz_thresh=XXX and rt_thresh=XXX parameters).
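The threshold-window matching described above can be pictured with a small, library-free sketch. This is not Calour's implementation: the function name match_feature, the dictionary layout of the GNPS entries, and the inclusive comparison at the window edges are all assumptions made for illustration. Only the default values of mz_thresh and rt_thresh are taken from the signature.

```python
def match_feature(mz, rt, gnps_entries, mz_thresh=0.02, rt_thresh=15):
    """Return the ids of GNPS entries whose MZ and RT both fall inside the window.

    Hypothetical helper: the entry layout ({"id", "mz", "rt"}) is an assumption.
    """
    return [
        entry["id"]
        for entry in gnps_entries
        if abs(entry["mz"] - mz) <= mz_thresh and abs(entry["rt"] - rt) <= rt_thresh
    ]


# Made-up GNPS entries, used only to exercise the helper.
gnps_entries = [
    {"id": "g1", "mz": 100.010, "rt": 60.0},  # inside both windows
    {"id": "g2", "mz": 100.050, "rt": 62.0},  # MZ difference too large
    {"id": "g3", "mz": 100.005, "rt": 90.0},  # RT difference too large
]

matches = match_feature(100.0, 65.0, gnps_entries)
print(matches)
```

A feature is linked only when both tolerances are satisfied, which is why g2 and g3 are rejected even though each passes one of the two checks.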

Supported formats for ms analysis (as specified by the data_file_type='XXX' parameter) include:

  • 'mzmine2': the csv output file of mzmine2. MZ and RT are obtained from the 2nd and 3rd columns of the file.
  • 'biom': a biom table for the metabolite table. The featureIDs in the table (first column) can be either MZ_RT (concatenated with a separator) or a unique ID matching the gnps_file ids.
  • 'openms': a csv data table with MZ_RT or a unique ID as the featureID (first column). Samples can be columns (default) or rows (using the sample_in_row=True parameter).
  • 'gnps-ms2': a tsv file exported from gnps, with gnps ids as featureIDs.
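For the formats whose featureID is MZ_RT, the two values have to be recovered from a single string. A minimal sketch of that step follows, assuming the separator is known in advance. The helper name split_mz_rt and the sample ID are hypothetical, and the separator autodetection mentioned for mz_rt_sep is not attempted here.

```python
def split_mz_rt(feature_id, sep="_"):
    """Split an 'MZ<sep>RT' feature id into two floats (hypothetical helper)."""
    # Split only once, so any later occurrence of sep stays in the RT part
    # and makes float() raise instead of being silently dropped.
    mz_text, rt_text = feature_id.split(sep, 1)
    return float(mz_text), float(rt_text)


mz, rt = split_mz_rt("181.0707_95.3")
print(mz, rt)
```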
Parameters:
  • data_file (str) – The name of the data table (mzmine2 output/bucket table/biom table) containing the per-metabolite abundances.
  • sample_metadata_file (str or None (optional)) –

    None (default) to not load per-sample metadata; str to specify the name of the sample mapping file (tsv).

    Note: sample names in the bucket table and the sample_metadata file must match. If the bucket table sample names contain additional information, you can split them at the separator character (usually '_'), keeping only the first part, using the cut_sample_id_sep='_' parameter (see below).

  • gnps_file (str or None (optional)) – Name of the gnps clusterinfosummarygroup_attributes_withIDs_arbitraryattributes/XXX.tsv file, for use with the 'gnps' database. This enables identification of metabolites with known MS2 (for the interactive heatmap, sorting, filtering, etc.), as well as linking to the gnps page for each metabolite (from the interactive heatmap, by double-clicking the metabolite database information). Note: requires the gnps-calour database interface module (see the Calour installation instructions for details).
  • feature_metadata_file (str or None (optional)) – Name of the table containing additional metadata about each feature. None (default) to not load.
  • data_file_type (str, optional) –

    The data file format. Options include:

    'mzmine2': load the mzmine2 output csv file.
    MZ and RT are obtained from this file. GNPS linking is direct via the unique id column. The table is a csv; columns are samples.
    'biom': load a biom table for the features.
    MZ and RT are obtained via the featureID (first column), which is assumed to be MZ_RT. GNPS linking is indirect via the mz and rt threshold windows. The table is a tsv/json/hdf5 biom table; columns are samples.
    'openms': load an openms output table.
    MZ and RT are obtained via the featureID (first column), which is assumed to be MZ_RT. GNPS linking is indirect via the mz and rt threshold windows. The table is a csv table; columns are samples.
    'gnps-ms2': load a gnps exported biom table.
    MZ and RT are obtained via the gnps_file if available; otherwise they are NA. GNPS linking is direct via the first column (featureID). The table is a tsv/json/hdf5 biom table; columns are samples.
  • sample_in_row (bool or None, optional) – False indicates rows in the data table file are features; True indicates rows are samples. None to use the default value for the data_file_type.
  • direct_ids (bool or None, optional) – True indicates the feature ids in the data table file are the same ids used in the gnps_file. False indicates the feature ids are not the same as in the gnps_file (such as when the ids are the MZ_RT). None to use the default value for the data_file_type.
  • get_mz_rt_from_feature_id (bool or None, optional) – True indicates the data table file feature ids contain the MZ/RT of the feature. False to not obtain MZ/RT from the feature id. None to use the default value for the data_file_type.
  • use_gnps_id_from_AllFiles (bool, optional) – True (default) to link the data table file gnps ids to the AllFiles column in the gnps_file. False to link the data table file gnps ids to the ‘cluster index’ column in the gnps_file.
  • cut_sample_id_sep (str or None, optional) – str (typically '_') at which to split the sampleID in the data table file, keeping only the first part. Useful when the sampleIDs in the data table contain additional information compared to the mapping file (using a '_' separator), and this needs to be removed in order to sync the sampleIDs between the table and the mapping file. None (default) to not change the data table file sampleID.
  • mz_rt_sep (str or None, optional) – The separator character between the MZ and RT parts of the featureID, if it contains them (usually '_'). If not supplied, the separator is autodetected. Note this is used only if get_mz_rt_from_feature_id=True.
  • mz_thresh (float, optional) – The tolerance for M/Z to match features to the gnps_file. Used only if parameter direct_ids=False.
  • rt_thresh (float, optional) – The tolerance for retention time to match features to the gnps_file. Used only if parameter direct_ids=False.
  • description (str or None (optional)) – Name of the experiment (for display purposes). None (default) to use the file name.
  • sparse (bool (optional)) – False (default) to store the data as a dense matrix (faster but uses more memory). True to store it as sparse (CSR).
  • normalize (int or None) – Normalize each sample to the specified number of reads. None to not normalize.
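To make the normalize parameter concrete, the sketch below rescales one sample so that its abundances sum to a target total. It assumes that "normalize each sample to the specified reads" means a proportional rescaling of this kind; the helper normalize_sample and its handling of all-zero samples are illustrative choices, not Calour's code.

```python
def normalize_sample(abundances, total):
    """Rescale one sample's abundances so they sum to `total` (hypothetical helper)."""
    current = sum(abundances)
    if current == 0:
        # Nothing to rescale; return the sample unchanged rather than divide by zero.
        return list(abundances)
    return [value * total / current for value in abundances]


scaled = normalize_sample([2.0, 6.0, 12.0], total=10000)
print(scaled)
```

The relative proportions between features are preserved; only the per-sample total changes, which keeps samples with different overall signal comparable.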
Keyword Arguments:
 
  • data_file (str) – file path to the biom table.
  • sample_metadata_file (None or str, optional) – None (default) to just use sample names (no additional metadata). if not None, file path to the sample metadata (aka mapping file in QIIME).
  • feature_metadata_file (None or str, optional) – file path to the feature metadata.
  • description (str) – description of the experiment
  • sparse (bool) – read the biom table into sparse or dense array
  • data_file_type (str, optional) –

    The data_file format. Options:

    'biom': a biom table (biom-format.org) (default)
    'tsv': a tab-separated table (samples in columns and features in rows)
    'openms': an OpenMS bucket table csv (rows are features, columns are samples)
    'openms_transpose': an OpenMS bucket table csv (columns are features, rows are samples)
    'gnps_ms': an OpenMS bucket table tsv with samples as columns (exported from GNPS)
    'qiime2': a qiime2 biom table artifact (requires qiime2 to be installed)
  • sample_metadata_kwargs, feature_metadata_kwargs (dict, optional) – Keyword arguments passed to pandas.read_table() when reading the sample metadata or feature metadata. For example, you can set sample_metadata_kwargs={'dtype': {'ph': int}, 'encoding': 'latin-8'} to read the ph column in the sample metadata as int and parse the file as latin-8 instead of utf-8. By default, the first column in the metadata files is assumed to contain the sample/feature IDs and is read in as the row index. To avoid this, provide {'index_col': False}.
  • cls (class, optional) – what class object to read the data into (Experiment by default)
  • table_sample_id_proc, table_feature_id_proc (None or callable, optional) – If not None, modify each sample/feature id in the table using the callable. The callable accepts a list of str and returns a list of str (the sample/feature ids after processing). Useful in metabolomics experiments, where the sampleIDs in the data table contain additional information compared to the mapping file (using a '_' separator), and this needs to be removed in order to sync the sampleIDs between the table and the mapping file.
  • sample_in_row (bool, optional) – False if the data table columns are samples, True if the rows are samples.
  • normalize (int or None) – normalize each sample to the specified read count. None to not normalize
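A callable with the list-in, list-out shape described for table_sample_id_proc and table_feature_id_proc might look like the sketch below. The function name keep_first_part and the sample IDs are made up for illustration; the only thing taken from the description above is the contract (a list of str in, a list of str out).

```python
def keep_first_part(ids, sep="_"):
    """Keep only the part of each id before the first separator (hypothetical helper)."""
    # IDs without the separator pass through unchanged.
    return [sample_id.split(sep)[0] for sample_id in ids]


cleaned = keep_first_part(["S1_pos_run1", "S2_pos_run1", "blank"])
print(cleaned)
```

Under the contract described above, such a function could be supplied as table_sample_id_proc=keep_first_part so that table sample IDs line up with the mapping file.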
Returns:

Return type:

MS1Experiment

See also

read()