Tutorial:

MetaboAnalystR package is synchronized with the MetaboAnalyst website and is designed for metabolomics researchers who are comfortable using R to perform raw spectra processing and batch analysis. An optimized global metabolomics analysis workflow starting from raw spectra has been established. The following tutorials are meant to complement our web-based functions by providing step-by-step instructions for several of the most common tasks using the R package.

1. Overview

1.1 Introduction

MetaboAnalystR 3.0 contains the R functions and libraries underlying the popular MetaboAnalyst web server, including metabolomic data analysis, visualization, and functional interpretation. The package is synchronized with the MetaboAnalyst web server. After installing and loading the package, users will be able to reproduce the same results from their local computers using the corresponding R command history downloaded from MetaboAnalyst web site, thereby achieving maximum flexibility and reproducibility.

The version 3.0 aims to improve the current global metabolomics workflow by implementing a fast parameter optimization algorithm for peak picking, and automated identification of the most suitable method for batch effect correction from 12 well-established approaches. In addition, more support for functional interpretation directly from m/z peaks via mummichog2 (PMID: 23861661), and a new pathway-based method to integrate multiomics data has been added. To demonstrate this new functionality, we perform end-to-end metabolomics data analysis on the clinical IBD samples in this tutorial. More case study have been provided in the vignette of this R package. Here we'd prefer to provide an brief starting tutorial from installation of MetaboAnalystR 3.0.

1.2 Installation

Step 1. Install package dependencies

To use MetaboAnalystR 3.0, first install all package dependencies. Ensure that you have necessary system environment configured.

For Linux (e.g. Ubuntu 18.04/20.04): libcairo2-dev, libnetcdf-dev, libxml2, libxt-dev and libssl-dev should be installed at frist;

For Windows (e.g. 7/8/8.1/10): Rtools should be installed.

For Mac OS: In order to compile R for Mac OS, you need Xcode and GNU Fortran compiler installed (https://mac.r-project.org/tools/). We suggest you follow these steps: https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/ to help with your installation.

R base with version > 3.6.1 is recommended. The compatibility of latest version (v4.0.0) is under evaluation. As for installation of package dependencies, there are two options:

Option 1

Enter the R function (metanr_packages) and then use the function. A printed message will appear informing you whether or not any R packages were installed.

Function to download packages:

metanr_packages <- function(){
metr_pkgs <- c("impute", "pcaMethods", "globaltest", "GlobalAncova", "Rgraphviz", "preprocessCore", "genefilter", "SSPA", "sva", "limma", "KEGGgraph", "siggenes","BiocParallel", "MSnbase", "multtest","RBGL","edgeR","fgsea","devtools","crmn")
list_installed <- installed.packages()
new_pkgs <- subset(metr_pkgs, !(metr_pkgs %in% list_installed[, "Package"]))
if(length(new_pkgs)!=0){if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
        BiocManager::install(new_pkgs)
        print(c(new_pkgs, " packages added..."))
    }

if((length(new_pkgs)<1)){
        print("No new packages added...")
    }
}

Usage of function:

metanr_packages()

Option 2

Use the pacman R package (for those with >R 3.5.1).

install.packages("pacman")

library(pacman)

pacman::p_load(c("impute", "pcaMethods", "globaltest", "GlobalAncova", "Rgraphviz", "preprocessCore", "genefilter", "SSPA", "sva", "limma", "KEGGgraph", "siggenes","BiocParallel", "MSnbase", "multtest","RBGL","edgeR","fgsea"))

Step 2. Install the package

MetaboAnalystR 3.0 is freely available from GitHub. The package documentation, including the vignettes for each module and user manual is available within the downloaded R package file. You can install the MetaboAnalylstR 3.0 via any of the three options: A) using the R package devtools, B) cloning the github, C) manually downloading the .tar.gz file. Note that the MetaboAnalystR 3.0 github will have the most up-to-date version of the package.

Option A) Install the package directly from github using the devtools package. Open R and enter:

Due to issues with Latex, some users may find that they are only able to install MetaboAnalystR 3.0 without any documentation (i.e. vignettes).

# Step 1: Install devtools
install.packages("devtools")
library(devtools)

# Step 2: Install MetaboAnalystR without documentation
devtools::install_github("xia-lab/MetaboAnalystR", build = TRUE, build_vignettes = FALSE)

# Step 2: Install MetaboAnalystR with documentation
devtools::install_github("xia-lab/MetaboAnalystR", build = TRUE, build_vignettes = TRUE, build_manual =T)

Option B) Clone Github and install locally

The * must be replaced by what is actually downloaded and built.

git clone https://github.com/xia-lab/MetaboAnalystR.git
R CMD build MetaboAnalystR
R CMD INSTALL MetaboAnalystR_3.0.0.tar.gz

Option C) Manual download of MetaboAnalystR_3.0.0.tar.gz and install locally

Manually download the .tar.gz file from here.

cd ~/Downloads
R CMD INSTALL MetaboAnalystR_3.0.0.tar.gz

2. Case Study IBD

Inflammatory bowel diseases, which include Crohn’s disease and ulcerative colitis, have affected several million individuals around the world. Jason L. et al. have performed a longtitude multiomics study on the role of microbiome in the pathogenesis of IBD. Metabolomics study on the facal samples is introduced here for example case study of this novel parameters optimization pipeline.

2.1 Raw Data Processing

Raw data processing is the first for all metabolomics studies. MetaboAnalystR 3.0 support “.mzXML”, “.mzML” and “.CDF” formats. The original formats (e.g. “.raw” and “.RAW”) generated by vendors will be supported soon. You can convert your original data with ProteoWizard (PMID: 23051804) as the supported formats. Centroided data is preferred.

2.1.1 Load MetaboAnalystR 3.0

If you have finished the installation and been ready to use the package. Use the library() function to load the package into R.

# Load the MetaboAnalystR package
library("MetaboAnalystR")

2.1.2 Download IBD Example QC Data

The example Quality Control (QC) samples will be used for the next steps. Download the MS data for parameters' optimization at this step. To reach a goal of quicking learning and avoid the long time running for the whole samples (over 600 samples). We only provide 5 samples for each CD and nonIBD group here. If you want to repeat and verify the results in our manuscript, please go IBD MultiOmics Database, download the full batch data and run them using this pipeline.

## Setting the data depositing folder
data_folder_Sample <- "~/Data_IBD"
data_folder_QC <- "~/QC_IBD"  
# Use Google API for data downloading here. 
# Please "install.packages('googledrive')" and "install.packages('httpuv')"first.
library(googledrive);
temp <- tempfile(fileext = ".zip")
# Please authorize your google account to access the data
dl <- drive_download(
  as_id("10DBpPEWy2cZyXvmlLOIfpwqQYwFplKYK"), path = temp, overwrite = TRUE)
# Setting your own date file folder
out <- unzip(temp, exdir = data_folder_QC)
# Date files for parameters optimization are deposited below
out
# Now, download the small example data for comparison between CD vs. nonIBD
temp <- tempfile(fileext = ".zip")
dl <- drive_download(
  as_id("1-wlFUkzEwWX1afRWLJlY_KEJs7BfsZim"), path = temp, overwrite = TRUE)
# Setting the date file folder
out <- unzip(temp, exdir = data_folder_Sample)
# Date files for normal processing example are deposited below
out

2.1.3 Data Inspectation

Before running the data analysis, the general data structure and information can be inspected with PerformDataInspect. If there are some extremly significant contaminats, it will be discovered directly.

# Inspect the MS data via a 3D image. "res" are used to specify the resolution for the MS data.
PerformDataInspect(data_folder_QC,res = 50)
## 0020a_XAV_iHMP2_FFA_PREFA02.mzXML
## [1] "RT range is: 18.0985 and 1139.81 seconds !"
## [1] "MZ range is: 69.999908447266 and 849.902038574219 Thomson !"

plot of chunk unnamed-chunk-7

2.1.4 Data Trimming

QC samples data trimming was performed with PerformDataTrimming now. There are 3 modes for data trimming. Default is Standard Simulation Method (“ssm_trim”). Under this mode, the 3D MS will trimmed from m/z demionsion firstly and then from RT deminson. rt.idx could be used to adjust the RT range for RT data reminning. The other 2 modes, which are named as “rt_specific” and “mz_specific” can be used to extract (with positive values) or remove (with negative values) some parts of the spectra. Trimmed MS file can be saved with write = T. The trimmed file will be save as “.mzML”. The chromatogram can be plotted with plot = T.

# QC samples are trimmed with "ssm_trim" strategy. 
# The percentage of RT dimension retained is set as 20%.
raw_data <- PerformDataTrimming(data_folder_QC,rt.idx = 0.2)
## Data Loaded !
## Data Trimming...
## MS data Parsing ...
## M/Z Data Trimming...
## M/Z Data Trimming Done !                                                        
## Trim data outside the RT bins...
## Data Trimmed !                                                                  
## Loading required package: RColorBrewer
## Chromatogram Plotting Begin...
## Data Trimming Finished !
## Time Spent In Total:2.8mins

plot of chunk unnamed-chunk-10

2.1.5 Initalize Parameters' Setting

Instrument specific initial parameters have been embedded. Currently, the most popular platforms that include UPLC-Q/E, UPLC-Q/TOF, UPLC-T/TOF, UPLC-Ion_trap, UPLC-Orbitrap, UPLC-G2S, HPLC-Q/TOF, HPLC-Ion_Trap, HPLC-Orbitrap and HPLC-S/Q. If your platform is not listed here, the “general” option can be applied for this setting. Both “centWave” and “matchedFIlter” modes have been configured for further processing. Besides, the specific parameters could also be set individually, if you have your own opinion about the parameters. This parameters optimization step can be skipped.

# Initial platform specific parameters
param_initial <- SetPeakParam(platform = "UPLC-Q/E") 

2.1.6 Parameters' Optimization

Parameters optimization with “DoE” strategy. Initial parameters are from the optimized parameters above or the internal default parameters from embedded SetPeakParam function.

# Select relative core according to your work platform
param_optimized <- PerformParamsOptimization(raw_data, param = param_initial, ncore = 8) 
# Following results was generated at the Ubuntu 18.04.4 
# with 64 GB RAM and Interl Core i7-9700 CPU
## [1] "DoE Optimization Starting Now!"
## [1] "Evaluating Noise level..."
## Data Spliting Finished !                                                   
## Peak Preparing Begin...
## Peak Preparing Done !
## [1] "Round:1"
## DoE Running Begin...
## Round 1 Finished !                                                         
## Model Parsing...
## [1] "Parameters for centWave have been successfully parsed!"
## [1] "Peak Profiling Finished !"
## [1] "Starting peak filling!"
## Model Parsing Done !
## [1] "Round:2"
## DoE Running Begin...
## Round 2 Finished !                                                         
## Model Parsing...
## [1] "Parameters for centWave have been successfully parsed!"
## [1] "Peak Profiling Finished !"
## [1] "Starting peak filling!"
## Model Parsing Done !
## No Increase Stopping !
## best parameter settings:
## min_peakwidth: 6.25
## max_peakwidth: 44.5
## mzdiff: 0.0048
## snthresh: 11.5
## bw: 2
## Peak_method: centWave
## RT_method: peakgroup
## ppm: 2.12
## noise: 7195.33
## prefilter: 2
## value_of_prefilter: 13151.02
## minFraction: 0.5
## minSamples: 1
## maxFeatures: 100
## fitgauss: FALSE
## verbose.columns: FALSE
## mzCenterFun: wMean
## integrate: 1
## extra: 1
## span: 0.4
## smooth: loess
## family: gaussian
## 
## Optimization Finished !
## Parameters Optimization Finished !
## Time Spent In Total:25.1mins

2.1.7 Peak Profiling

The ImportRawMSData function reads in raw MS data files and saves it as an OnDiskMSnExp object. Two plots can be output with SetPlotParam function set with “Plot = T” - the Total Ion Chromatogram (TIC) which provides an overview of the entire spectra, and the Base Peak Chromatogram (BPC) which is a cleaner profile of the spectra based on the most abundant signals. These plots are useful to inform the setting of parameters downstream. For users who wish to view a peak of interest, an Extracted Ion Chromatogram (EIC) can be generated using the PlotEIC function.

# Import raw MS data. The "SetPlotParam" parameters can be used to determine plot or not.
rawData <- ImportRawMSData(data_folder_Sample,plotSettings = SetPlotParam(Plot=F))
## [1] "The number of CPU cores to be used is set to 6."
## [1] "Successfully imported raw MS data!"

The PerformPeakProfiling function is an updated peak processing pipeline from XCMS R functions that performs peak detection, alignment, and grouping in an automatical step. The function also generates two diagnostic plots including statistics on the total intensity of peaks in different samples, a retention time adjustment map, and a PCA plot showing the overall sample clustering prior to data cleaning and statistical analysis.

# Peak Profiling with optimized parameters
mSet <- PerformPeakProfiling(rawData,param_optimized$best_parameters,
                             plotSettings = SetPlotParam(Plot = T))
## [1] "Parameters for centWave have been successfully parsed!"
## [1] "6 CPU Threads will be used for peak profiling !"
## [1] "Step 1/3: Started peak picking! This step will take some time..."
## ROI searching under 2.12 ppm ... Detecting chromatographic peaks in 36166 regions of interest ... OK: 9985 found.
## ROI searching under 2.12 ppm ... Detecting chromatographic peaks in 46117 regions of interest ... OK: 11491 found.
## ROI searching under 2.12 ppm ... Detecting chromatographic peaks in 45942 regions of interest ... OK: 11615 found.
## ROI searching under 2.12 ppm ... Detecting chromatographic peaks in 47568 regions of interest ... OK: 12113 found.
## ROI searching under 2.12 ppm ... Detecting chromatographic peaks in 47199 regions of interest ... OK: 12939 found.
## ROI searching under 2.12 ppm ... Detecting chromatographic peaks in 39553 regions of interest ... OK: 10480 found.
## ROI searching under 2.12 ppm ... Detecting chromatographic peaks in 42731 regions of interest ... OK: 13247 found.
## ROI searching under 2.12 ppm ... Detecting chromatographic peaks in 38268 regions of interest ... OK: 12464 found.
## [1] "Step 2/3: Started peak alignment! This step is running..."
## Total of 6238 slices detected for processing... Done !
## Performing retention time correction using 6252 peak groups.
## Applying retention time adjustment to the identified chromatographic peaks ... Done !
## Total of 6238 slices detected for processing... Done !
## [1] "Step 3/3: Started peak filling! This step may take some time..."
## [1] "Starting peak filling!"
## Defining peak areas for filling-in...... OK
## Start integrating peak areas from original files
## Requesting 8318 peaks from 0393_XAV_iHMP2_FFA_SM-AMR37.mzXML ... got 4741.
## Requesting 7937 peaks from 0407_XAV_iHMP2_FFA_SM-AUP8B.mzXML ... got 4553.
## Requesting 8181 peaks from 0069_XAV_iHMP2_FFA_SM-9OS5Y.mzXML ... got 4796.
## Requesting 9905 peaks from 0051_XAV_iHMP2_FFA_SM-9WOBP.mzXML ... got 3815.
## Requesting 7114 peaks from 0027_XAV_iHMP2_FFA_SM-77FXR.mzXML ... got 3918.
## Requesting 7392 peaks from 0030_XAV_iHMP2_FFA_SM-6KUCT.mzXML ... got 3933.
## Requesting 7268 peaks from 0433_XAV_iHMP2_FFA_SM-9SNJ4.mzXML ... got 4166.
## Requesting 8946 peaks from 0508_XAV_iHMP2_FFA_SM-9X47O.mzXML ... got 4749.
## [1] "Peak picking finished successfully !"

Intensity statistics of all samples.

plot of chunk unnamed-chunk-18

2.1.8 Peak Annotation

The PerformPeakAnnotation function annotates isotope and adduct peaks with the method from CAMERA (PMID: 22111785). It outputs the result as a CSV file (“annotated_peaklist.csv”) and saves the annotated peaks to mSet object. Finally, the peak list is formatted to the correct structure for MetaboAnalystR and filtered based upon user’s specifications using the FormatPeakList function. This function permits the filtering of adducts (i.e. removal of all adducts except for [M+H]+/[M-H]-) and filtering of isotopes (i.e. removal of all isotopes except for monoisotopic peaks). The goal of filtering peaks is to remove degenerative signals and reduce the file size.

# Setting the Annotation Parameters.
annParams <- SetAnnotationParam(polarity = "negative", mz_abs_add = 0.005)
# Perform peak annotation.
annotPeaks <- PerformPeakAnnotation(mSet, annParams)
## [1] "Start grouping after retention time."
## Created 414 pseudospectra.
## Generating peak matrix...
## Run isotope peak annotation
##                                                                      
## Found isotopes: 3142 
## [1] "Start grouping after correlation." Generating EIC's . ... Detecting 0027_XAV_iHMP2_FFA_SM-77FXR.mzXML  ... 2162 peaks found ! 
## Detecting 0030_XAV_iHMP2_FFA_SM-6KUCT.mzXML  ... 1605 peaks found ! 
## Detecting 0051_XAV_iHMP2_FFA_SM-9WOBP.mzXML  ... 2086 peaks found ! 
## Detecting 0069_XAV_iHMP2_FFA_SM-9OS5Y.mzXML  ... 3027 peaks found ! 
## Detecting 0393_XAV_iHMP2_FFA_SM-AMR37.mzXML  ... 2560 peaks found ! 
## Detecting 0407_XAV_iHMP2_FFA_SM-AUP8B.mzXML  ... 663 peaks found ! 
## Detecting 0433_XAV_iHMP2_FFA_SM-9SNJ4.mzXML  ... 2284 peaks found ! 
## Detecting 0508_XAV_iHMP2_FFA_SM-9X47O.mzXML  ... 1721 peaks found ! 
## [1] "Warning: Found NA peaks in selected sample."
## [1] "Calculating peak correlations in 414 Groups... "
## [1] "Calculating graph cross linking in 414 Groups..."                     
## [1] "New number of ps-groups:  9877"                                       
## mSet has now 9877 groups, instead of 414 !
## [1] "Generating peak matrix for peak annotation!"
## [1] "Polarity is set in annotaParam: negative"
## [1] "Ruleset could not read from object! Recalculate"
## [1] "Calculating possible adducts in 9877 Groups... "
## Successfully performed peak annotation!
# # Format and filter the peak list for MetaboAnalystR
maPeaks <- FormatPeakList(annotPeaks, annParams, filtIso =F, filtAdducts = FALSE,
                          missPercent = 1)

Part of the table (first 10 rows and 2 samples) are shown below.

Sample X0027_XAV_iHMP2_FFA_SM.77FXR X0030_XAV_iHMP2_FFA_SM.6KUCT
Label CD CD
73.02799 2742381.138 260846441.6
74.03136 22903.8796 8149290.214
87.04368 19485085 9517513.356
88.04703 652663.6739 287542.3986
89.02297 16564773.34 165532946.5
90.02622 175217.7393 3444361.337
91.05389 132982136.1 7449146.189
92.05724 10414646.83 478931.5643
101.05949 30578949.94 13556551.39

2.2 Data Processing and Statistical Analysis

This step can be easily finished with 'Statistical Analysis' module in MetaboAnalyst website. The running R code generated by website could easily be used to repeat the same results. This brief chapter is established to show case how to generate the most results. More statistical utilities can be achieved with the same way.

Remaining in the same working directory, we will read in the filtered peak table. Then, we will replace missing values with a small value (half of the minimum positive value within the original data). Next we will filter varibles out non-informative signals, then normalize the samples to their median and perform log transformation. Finally, we will perform t-test analysis to identify differentially enriched features, setting the FDR-adjusted p-value threshold to 0.25.

2.2.1 mSet initialized for further analysis

# First step is to create the mSet Object, specifying that the data to be uploaded
# is a peak table ("pktable") and that statistical analysis will be performed ("stat").
mSet<-InitDataObjects("pktable", "stat", FALSE)
## Starting Rserve:
##  /usr/lib/R/bin/R CMD /home/xialab/R/x86_64-pc-linux-gnu-library/3.6/Rserve/libs//Rserve --no-save 
## 
## 
## R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
## Copyright (C) 2020 The R Foundation for Statistical Computing
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## R is free software and comes with ABSOLUTELY NO WARRANTY.
## You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details.
## 
##   Natural language support but running in an English locale
## 
## R is a collaborative project with many contributors.
## Type 'contributors()' for more information and
## 'citation()' on how to cite R or R packages in publications.
## 
## Type 'demo()' for some demos, 'help()' for on-line help, or
## 'help.start()' for an HTML browser interface to help.
## Type 'q()' to quit R. ##> SOCK_ERROR: bind error #98(address already in use)
## Rserv started in daemon mode.
## [1] "MetaboAnalyst R objects initialized ..."
# Second step is to read in the filtered peak list, please set the path right first
mSet<-Read.TextData(mSet, "metaboanalyst_input.csv", "colu", "disc")

2.2.2 Perform Data Statistics

# The third step is to perform data processing using MetaboAnalystR (filtering/normalization)
# Perform data processing - Data checking
mSet<-SanityCheckData(mSet)
## [1] "Successfully passed sanity check!"                                             
##  [2] "Samples are not paired."                                                           
##  [3] "2 groups were detected in samples."                                                
##  [4] "Only English letters, numbers, underscore, hyphen and forward slash (/) are allowed."                 
##  [5] "<font color="orange">Other special characters or punctuations (if any) will be stripped off.</font>"
##  [6] "All data values are numeric."                                                      
##  [7] "A total of 30390 (23.6%) missing values were detected."                            
##  [8] "<u>By default, these values will be replaced by a small value.</u>"                
##  [9] "Click <b>Skip</b> button if you accept the default practice"                       
## [10] "Or click <b>Missing value imputation</b> to use other methods"
# Perform data processing - Minimum Value Replacing
mSet<-ReplaceMin(mSet);
# Perform data processing - Variable Filtering and Normalization
mSet<-FilterVariable(mSet, "iqr", "F", 25)
mSet<-PreparePrenormData(mSet)
mSet<-Normalization(mSet, "MedianNorm", "LogNorm", "NULL", ratio=FALSE, ratioNum=20)
## [1] " Further feature filtering based on Interquantile Range Reduced to 5000 features 
## based on Interquantile Range"
## [1] " Further feature filtering based on Interquantile Range Reduced to 5000 features 
## based on Interquantile Range"
mSet <- PlotNormSummary(mSet, "norm_0_", "png", 72, width=NA)
mSet <- PlotSampleNormSummary(mSet, "snorm_0_", "png", 72, width=NA)

plot of chunk unnamed-chunk-33

# The fourth step is to perform fold-change analysis
mSet <- FC.Anal.unpaired(mSet, 2.0, 0)
mSet <- PlotFC(mSet, "fc_0_", "png", 72, width=NA)

plot of chunk unnamed-chunk-35

2.2.3 Univariate Analysis - t test

# The fifth step is to perform t-test analysis
mSet <- Ttests.Anal(mSet, F, 0.05, FALSE, TRUE)
mSet <- PlotTT(mSet, "tt_0_", "png", 72, width=NA)

plot of chunk unnamed-chunk-37

## [1] "A total of 218 significant features were found."

2.2.4 Multivariate Analysis - PCA

# The sixth step is to perform PCA
mSet <- PCA.Anal(mSet)
mSet <- PlotPCAPairSummary(mSet, "pca_pair_0_", "png", 72, width=NA, 5)
mSet <- PlotPCAScree(mSet, "pca_scree_0_", "png", 72, width=NA, 5)
mSet <- PlotPCA2DScore(mSet, "pca_score2d_0_", "png", 72, width=NA, 1,2,0.95,1,0)
mSet <- PlotPCALoading(mSet, "pca_loading_0_", "png", 72, width=NA, 1,2);
mSet <- PlotPCABiplot(mSet, "pca_biplot_0_", "png", 72, width=NA, 1,2)
mSet <- PlotPCA3DScoreImg(mSet, "pca_score3d_0_", "png", 72, width=NA, 1,2,3, 40)

plot of chunk unnamed-chunk-40

## [1] "The Interactive 3D PCA plot has been created, please find it in mSet$imgSet$pca.3d."

2.2.5 Multivariate Analysis - PLS-DA

# The seventh step is to perform PLS-DA
mSet <- PLSR.Anal(mSet, reg=TRUE)
mSet <- PlotPLSPairSummary(mSet, "pls_pair_0_", "png", 72, width=NA, 5)
mSet <- PlotPLS2DScore(mSet, "pls_score2d_0_", "png", 72, width=NA, 1,2,0.95,1,0)
mSet <- PlotPLS3DScoreImg(mSet, "pls_score3d_0_", "png", 72, width=NA, 1,2,3, 40)
mSet <- PlotPLSLoading(mSet, "pls_loading_0_", "png", 72, width=NA, 1, 2);
mSet <- PLSDA.CV(mSet, "L",5, "Q2")
mSet <- PlotPLS.Classification(mSet, "pls_cv_0_", "png", 72, width=NA)
mSet <- PlotPLS.Imp(mSet, "pls_imp_0_", "png", 72, width=NA, "vip", "Comp. 1", 15,FALSE)

plot of chunk unnamed-chunk-44

2.3 From MS peaks to Pathways

Previous versions of MetaboAnalyst encompassed two modules for functional analysis, metabolic pathway analysis (MetPA) (Xia et al. 2010, https://academic.oup.com/bioinformatics/article/26/18/2342/208464) and metabolite set enrichment analysis (MSEA) (Xia et al. 2010, https://academic.oup.com/nar/article/38/suppl_2/W71/1101310). However, these modules require metabolite identifications prior to use, which remains an important challenge in untargeted metabolomics. In comparison, the mummichog algorithm (Li et al. 2013, http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003123) bypasses the bottleneck of metabolite identification prior to pathway analysis, leveraging a priori pathway and network knowledge to directly infer biological activity based on mass peaks. We have therefore implemented the mummichog algorithm in R in a new module named “MS Peaks to Pathways”. The knowledge-base for this module consists of five genome-scale metabolic models from the original Python implementation which have either been manually curated or downloaded from BioCyc, as well as an expanded library of 21 organisms derived from KEGG metabolic pathways.

To use this module, users must upload a table (.txt) using the Read.PeakListData function containing either:

  1. Three columns - m/z features, p-values, and t-scores or fold-change values.
  2. Two columns containing m/z features and either p-values or t-scores
  3. One column ranked by either p-values or t-scores.
  4. Four columns - m/z features, p-values, t-scores or fold-change values, and the mode (positive or negative).

All inputted files must be in .txt format. If the input is a three column table, both the mummichog and GSEA algorithms (and their combination) can be applied. If only p-values (or ranked by p-values) are provided, then only the mummichog algorithm will be applied. If only t-scores (or ranked by t-scores) are provided, then only the GSEA algorithm will be applied.

If p-values have not yet been calculated, users can use the “Statistical Analysis” module of MetaboAnalyst website or this R package to upload their raw peak tables, process the data, perform t-tests or fold-change analysis, and then upload these results into the module. With the table, users also need to specify the type of MS instrument, the ion mode (positive or negative) using the UpdateInstrumentParameters function. Currently, MetaboAnalystR only supports the handling of peaks obtained from high-resolution MS instruments such as Orbitrap, or Fourier Transform (FT)-MS instruments as recommended by the original mummichog implementation. Following data upload, users much select the organism’s library from which to perform untargeted pathway analysis using the SetMass.PathLib function. Users can then perform the mummichog algorithm on their data using PerformPSEA. First, users will set algorithm to be used to mummichog using SetPeakEnrichMethod. Second, users will set the p-value cutoff to delineate between significantly enriched and non-significantly enriched m/z features using the SetMummichogPval function. Finally, use the PerformPSEA to calcaulte pathway activity.

The output of this module first consists of a table of results identifying the top-pathways that are enriched in the user-uploaded data, which can be found in your working directory named “mummichog_pathway_enrichment.csv”. The table consists of the total number of hits, the raw p-value (Fisher’s or Hypergeometric), the EASE score, and the adjusted p-value (for permutations) per pathway. A second table can also be found in your working directory named “mummichog_matched_compound_all.csv”, that contains all matched metabolites from the user’s uploaded list of m/z features.

For this tutorial we will first directly use the an example peak list data obtained from untargeted metabolomics of IBD data above (mass accuracy 5.0, negative ion mode). This is an raw data table generated by the raw data processing pipeline.

2.3.1 Mummichog Version 1

Perform T test for further MS peaks to pathway analysis

# Perform t-test
library(MetaboAnalystR);
mSet<-InitDataObjects("pktable", "stat", FALSE);
mSet<-Read.TextData(mSet, "metaboanalyst_input.csv", "colu", "disc")
mSet<-SanityCheckData(mSet)

mSet<-ReplaceMin(mSet);
mSet<-FilterVariable(mSet, "iqr", "F", 25)
feature.nm.vec <- c("")# used to remove certain feature, i.e. c("157.0357")
smpl.nm.vec <- c("")# used to remove certain samples, i.e. c("samples 1")
grp.nm.vec <- c("") # used to remove certain groups, i.e. c("UC")
mSet <- UpdateData(mSet)
mSet<-PreparePrenormData(mSet)
mSet<-Normalization(mSet, "MedianNorm", "LogNorm", "NULL", ratio=FALSE, ratioNum=20)
mSet<-Ttests.Anal(mSet, F, 0.25, FALSE, TRUE)
# [1] "A total of 1329 significant features were found."

cmSet<-Convert2Mummichog(mSet, rt=TRUE)
mSet<-InitDataObjects("mass_all", "mummichog", FALSE)
SetPeakFormat("mprt")
mSet<-UpdateInstrumentParameters(mSet, 5, "negative");
mSet<-Read.PeakListData(mSet, "mummichog_input_your_date.txt");
mSet<-SanityCheckMummichogData(mSet)
mSet<-SetPeakEnrichMethod(mSet, "integ", version="v2")
# Please set the appropriate p value according to your data
mSet<-SetMummichogPval(mSet, 0.15)
mSet<-PerformPSEA(mSet, "hsa_mfn", "current", permNum =100)
mSet<-PlotIntegPaths(mSet, "peaks_to_paths_", "png"144)
## [1] "MetaboAnalyst R objects initialized ..."
## [1] "mprt"
## [1] "Retention time tolerance is 22.7888654"
## [1] "Only a small percentage (below 10%) peaks in your input peaks should be significant."                        
## [2] "The algorithm works best for <u>200~500 significant peaks in 3000~7000 total peaks</u>."                     
## [3] "A total of 1097 or 21.88 percent signficant mz features were found based on the selected p-value cutoff: 0.2"
## [1] "v1" 
## [1] "Got 5014 mass features."
## [1] "1227 matched compounds! cpd2mz"
## [1] "A total of 3636 of duplicates were merged."
## [1] "A total of 3348 of duplicates were merged."
## [1] "A total of 3636 of duplicates were merged."
## [1] "A total of 3636 of duplicates were merged."
## [1] "Resampling,  100 permutations to estimate background ..."

2.3.2 Mummichog Version 2

The SetPeakEnrichMethod now has a parameter to let users select whether to use Version 1 or Version 2 of the MS Peaks to Paths algorithms. With Version 2, users can now upload retention time information to perform pathway analysis. Note that Version 2 should only be used if user’s data contains a retention time (“rt” or “r.t”) column. Retention time is used to move pathway analysis from the “Compound” space to “Empirical Compound” space (details below in “How are Empirical Compounds calculated?”). The inclusion of retention time will increase the confidence and robustness of the potential compound matches. Another difference is that currency compounds are removed directly from the user’s selected pathway library, versus removed from potential compound hits during the permutations.

# Perform t-test
mSet<-Ttests.Anal(mSet, F, 0.25, FALSE, TRUE)
## [1] "A total of 1329 significant features were found."
mSet<-Convert2Mummichog(mSet, rt=TRUE)
mSet<-InitDataObjects("mass_all", "mummichog", FALSE)
SetPeakFormat("mprt")
mSet<-UpdateInstrumentParameters(mSet, 5, "negative");
mSet<-Read.PeakListData(mSet, "mummichog_input_your_date.txt");
mSet<-SanityCheckMummichogData(mSet)
mSet<-SetPeakEnrichMethod(mSet, "integ", version="v2")
# Please set the appropriate p value according to your data
mSet<-SetMummichogPval(mSet, 0.15)
mSet<-PerformPSEA(mSet, "hsa_mfn", "current", permNum = 100)
mSet<-PlotIntegPaths(mSet, "peaks_to_paths_", "png", 144)
## 
## [1] "MetaboAnalyst R objects initialized ..."
## [1] "mprt"
## [1] "Retention time tolerance is 22.7888654"
## [1] "Only a small percentage (below 10%) peaks in your input peaks should be significant."                       
## [2] "The algorithm works best for <u>200~500 significant peaks in 3000~7000 total peaks</u>."                    
## [3] "A total of 221 or 4.41 percent signficant mz features were found based on the selected p-value cutoff: 0.05"
## [1] "v2" trying URL 'https://www.metaboanalyst.ca/resources/libs/mummichog/hsa_mfn.rds' Content length 634759 bytes (619 KB)
## downloaded 619 KB
## 
## [1] "Got 5012 mass features."
## [1] "1226 matched compounds! cpd2mz"
## [1] "2857 inital ECs created!"
## [1] "1430 merged ECs identified!"
## 
## Attaching package: igraph
## 
## The following objects are masked from package:stats:
## 
##     decompose, spectrum
## 
## The following object is masked from package:base:
## 
##     union
## 
## [1] "352 empirical compounds identified in 1.93101406097412 seconds."
## [1] "A total of 1364 of duplicates were merged."
## [1] "A total of 1269 of duplicates were merged."
## [1] "A total of 1269 of duplicates were merged."
## [1] "A total of 1364 of duplicates were merged."
## [1] "A total of 1087 of duplicates were merged."
## [1] "A total of 1087 of duplicates were merged."
## [1] "A total of 1364 of duplicates were merged."
## [1] "A total of 1269 of duplicates were merged."
## [1] "Resampling,  100 permutations to estimate background ..."

2.3.3 Adduct & Currency Customization

Adduct Customization

Raw MS peaks contain a significant amount of adducts specific to their MS instrument and analytical mode. A comprehensive adduct list is shown below in the “Available” panel. Use this to customize the adduct list used by the Mummichog/GSEA algorithms to match m/z peaks to potential compounds hits. The list of available adducts to choose from can be found in here: positive; here: negative ; here:mixed.

For the negative ion mode, the adducts used are: M-H[-], M-2H[2-], M(C13)-H[-], M(S34)-H[-], M(Cl37)-H[-], M+Na-2H[-], M+K-2H[-], M-H2O-H[-], M+Cl[-], M+Cl37[-], M+Br[-], M+Br81[-], M+ACN-H[-], M+HCOO[-], M+CH3COO[-], and M-H+O[-].

For the positive ion mode, the adducts used are: M[1+], M+H[1+], M+2H[2+], M+3H[3+], M(C13)+H[1+], M(C13)+2H[2+], M(C13)+3H[3+], M(S34)+H[1+], M(Cl37)+H[1+], M+Na[1+], M+H+Na[2+], M+K[1+], M+H2O+H[1+], M-H2O+H[1+], M-H4O2+H[1+], M-NH3+H[1+], M-CO+H[1+], M-CO2+H[1+], M-HCOOH+H[1+], M+HCOONa[1+], M-HCOONa+H[1+], M+NaCl[1+], M-C3H4O2+H[1+], M+HCOOK[1+], and M-HCOOK+H[1+].

Currency Customization

Currency metabolites are abundant substances such as water and carbon dioxide known to occur in normal functioning cells and participate in a large number of metabolic reactions (Huss and Holme, 2007). Because of their ubiquitous nature, removing them can greatly improve pathway analysis and visualization. The list of available currency metabolites can be found here.

By default, the MS Peaks to Paths module considers these metabolites as currency: ‘C00001’, ‘C00080’, ‘C00007’, ‘C00006’, ‘C00005’, ‘C00003’, ‘C00004’, ‘C00002’, ‘C00013’, ‘C00008’, ‘C00009’, ‘C00011’, ‘G11113’, ‘H2O’, ‘H+’, ‘Oxygen’, ‘NADP+’, ‘NADPH’, ‘NAD+’, ‘NADH’, ‘ATP’, ‘Pyrophosphate’, ‘ADP’, ‘Orthophosphate’, and ‘CO2’.

Now we will use another example peak list data obtained from untargeted metabolomics of mice alveolar macrophages in lungs, using an Orbitrap LC-MS (C18 negative ion mode and HILIC positive ion mode). This is an example of the four-column table containing the m.z, mode, p.value, and t.score. We will perform both currency and adduct customization.

# Create objects for storing processed data from the MS peaks to pathways module
mSet<-InitDataObjects("mass_all", "mummichog", FALSE)

# Set peak formart - contains m/z features, p-values and t-scores
SetPeakFormat("mpt")
mSet<-UpdateInstrumentParameters(mSet, 5.0, "mixed");
mSet<-Read.PeakListData(mSet, "https://www.metaboanalyst.ca/MetaboAnalyst/resources/data/mummichog_mixed.txt");
mSet<-SanityCheckMummichogData(mSet)
# Customize currency
curr.vec <-c("Water (C00001)""Proton (C00080)","Oxygen (C00007)","NADPH (C00005)","NADP (C00006)","NADH (C00004)","NAD (C00003)","Adenosine triphosphate (C00002)","Pyrophosphate (C00013)","Phosphate (C00009)","Carbon dioxide (C00011)","Hydrogen (C00282)","Hydrogen peroxide (C00027)","Sodium (C01330)")
# Map selected currency to user's data
mSet<-Setup.MapData(mSet, curr.vec);
mSet<-PerformCurrencyMapping(mSet)
# Now customize adducts
add.vec <-c("M [1+]","M+H [1+]","M+2H [2+]","M+3H [3+]","M+Na [1+]","M+H+Na [2+]","M+K [1+]","M+H2O+H [1+]","M-H2O+H [1+]","M-H4O2+H [1+]","M(C13)+H [1+]","M(C13)+2H [2+]","M(C13)+3H [3+]","M(Cl37)+H [1+]","M-NH3+H [1+]","M-CO+H [1+]","M-CO2+H [1+]","M-HCOOH+H [1+]","M+HCOONa [1+]","M-HCOONa+H [1+]","M+NaCl [1+]","M-C3H4O2+H [1+]","M+HCOOK [1+]","M-HCOOK+H [1+]","M-H [1-]","M-2H [2-]","M-H2O-H [1-]","M-H+O [1-]","M+K-2H [1-]","M+Na-2H [1- ]","M+Cl [1-]","M+Cl37 [1-]","M+HCOO [1-]","M+CH3COO [1-]")
# Set up the selected adducts
mSet<-Setup.AdductData(mSet, add.vec);
mSet<-PerformAdductMapping(mSet, "mixed")

# Perform mummichog algorithm using selected currency and adducts, using Version1 of the mummichog algorithm
mSet<-SetPeakEnrichMethod(mSet, "mum", "v1")
mSet<-SetMummichogPval(mSet, 1.0E-5)
mSet<-PerformPSEA(mSet, "hsa_mfn", "current", 100)
mSet<-PlotPeaks2Paths(mSet, "peaks_to_paths_0_", "png", 72, width=NA) # Image is not shown below to avoid to be overwhelmed.

2.3.4 Customizing Empirical Compound Formation

By default, V2 of the MS Peaks to Pathways algorithms enforces primary ions to be present when creating Empirical Compounds (step 3 above). Users can disable this in the UpdateInstrumentParameters function by setting the “force_primary_ion” parameter to “no”. Additionally, by default the retention-time window used to split Empirical Compounds is calculated as the maximum retention time in the user’s data multiplied by the retention time fraction (default is 0.02). Users can either change the retention time fraction (rt_frac) or set the retention time-window (rt_tol - in seconds). Code details are shown below.

# Disable force primary ion
mSet <-UpdateInstrumentParameters(mSet, instrumentOpt, msModeOpt, force_primary_ion ="no", rt_frac =0.02,rt_tol =NA)
# Change retention time fraction when calculating the retention time window
mSet <-UpdateInstrumentParameters(mSet, instrumentOpt, msModeOpt,force_primary_ion ="yes", rt_frac =0.025, rt_tol =NA)
# Set the retention time window (in seconds)
mSet <-UpdateInstrumentParameters(mSet, instrumentOpt, msModeOpt, force_primary_ion ="yes", rt_frac =0.02, rt_tol =25)