Zhiqiang Pang, Jeff Xia


1.0 Overview of MetaboAnalystR

MetaboAnalystR is a R package, synchronized with the popular MetaboAnalyst website, designed for comprehensive metabolomic data analysis, visualization, and interpretation. This R package contains the numerous R functions and libraries underlying the web server necessary to perform raw spectral data pre-processing, data processing, normalization, statistical analysis, meta-data based co-variate adjustment, functional analysis for global metabolomic, metabolite set enrichment analysis and pathway analysis for targeted metabolomics, biomarker analysis and meta-analysis. This package provides support for several data types, including nuclear magnetic resonance (NMR) spectroscopy, gas chromatography mass spectrometry (GC-MS), and liquid chromatography-MS (LC-MS) data. Further, it is mainly designed to support quantitative metabolomics.

Following installation and loading of MetaboAnalystR, users will be able to reproduce web server results from their local computers using the corresponding R command history downloaded from MetaboAnalyst, thereby achieving maximum flexibility and reproducibility.

MetaboAnalystR serves as a platform to enable users to perform high-quality, comprehensive metabolomic data analysis, as well as produce computationally and statistically reproducible analytical workflows.

The aim of this vignette is to provide an overview of how to use MetaboAnalystR to perform comprehensive metabolomic data analysis and visualization. In detail, this vignette will go through the steps of data import, processing, normalization, and filtering.

1.1 Loading the package

After following the installation instructions on the MetaboAnalystR Github or MetaboAnalyst R tutorial page, you will be ready to use the package. Use the library() function to load the package into R.

# Load MetaboAnalystR
## MetaboAnalystR 4.0.0 initialized Successfully !

1.2 Tips for using the MetaboAnalystR package

  1. The first function that you will use in every module is the InitDataObjects function, which constructs the mSetObj object that stores user’s data for further processing and analysis. To use this function, users must specify to MetaboAnalystR the type of data and the type of analysis you will perform. On the MetaboAnalyst web-server, the suggested name of the mSetObj, which must be called consistently in your workflow, is mSet. It is not necessary to use this format, if users wish they can use whatever name they would prefer, as long as it is called exactly the same in each step.

  2. The MetaboAnalystR package directly creates plots/tables/analysis outputs in your current working directory. Users are suggested to create a specific working directory every time they perform any analysis, unless they want to overwrite existing results in current directory. It is not necessary to call any plotting functions onto the created mSetObj.

  3. Every command must be run in sequence, please do not skip any commands as this will result in errors downstream.

1.3 Data Format

Comma Separated Values (.csv) or Tab Delimited Text (.txt)

These two formats are used for concentration data, peak intensity tables, and MS/NMR spectral bins. Samples can be in either rows or columns. Please note:

  1. Both sample and feature names must be unique and consist of a combination of common English letters, underscores and numbers for naming purpose. Latin/Greek letters are not supported.

  2. Class labels must immediately follow sample names; for two-factor and time series data, there must be two class labels corresponding to the two factors.

  3. For time-series data, the time-point group must be named as Time. In addition, the samples collected from the same subjects at different time points should be consecutive (See the screenshots demo for “Two-factor / Time series”)

  4. Data values (concentrations, bins, peak intensities) should contain only numeric and positive values (using empty or NA for missing values).

Raw LC-MS/MS spectral data

For LC-MS/MS based global (untargeted) metabolomics raw data processing, MetaboAnalystR only accepts open-source formats (.mzML/.mzXML/.mzData/.cdf). All vendor formats must be converted and centroided into a open-source format (e.g. *.mzML, highly recommended) at first. Then, each centroided mzML files must be zipped individually before uploading for online processing. But for MetaboAnalystR, there is no need to upload to remote server. Therefore, it is unnecessary to zip raw spectral for MetaboAnalystR. In addition, different from MetaboAnalyst website, there is no limitation on file size and file number. User could process unlimited raw spectral data as long as the RAM of their local PC permits.

Zipped files (.zip) for Statistical Analysis

For NMR/MS peak list files and GC/LC-MS spectra data, users must upload a zipped folder containing data files from different groups under study (one file per spectrum and one sub-folder for each group). For paired comparison, users must upload a separate text file specifying the paired information.

GC/LC-MS spectra must be in either NetCDF, mzXML, or mzDATA format. The spectra should be stored in two separate folders according to their class labels, and then compressed into .zip files. Please note, the program is not compatible with the most recent WinZip (v12.0) with default option. Make sure to select the Legacy compression (Zip 2.0 compatible) for compressing files. No spaces are allowed in either the folder names or the spectra names. The size limit for each uploaded zip file is 50M.

The peak list data is composed of peak list files organized into separate folders named by their class labels. For example, if your data contains three groups, the peak list files should be organized into three folders accordingly. Compress these folders into a single zip file then upload them to MetaboAnalyst.

NMR peak list files should contain two comma separated columns with the 1st column for peak positions (ppm) and the 2nd column for peak intensities; MS peak list files can be in either two-column (mass and intensities) or three-column format (mass, retention time and intensities), but not a mixture of both. The first line of each peak list file is reserved for column labels. The file must be saved in comma separated values (.csv) format.

2.0 Example data processing basics for statistics analysis

2.1.0 Importing example data (Compound concentration, binned spectral, peak intensity table)

MetaboAnalystR reads in both comma separated values (CSV) and text (txt) files. The package also accepts zipped files (.zip), including NMR peak, MS peak, and LC/GC-MS spectra (NetCDF, mzDATA, mzXML) data.

We will first begin with an example dataset downloadable from the MetaboAnalyst web-server called “cachexia_concentrations.csv”. To begin the analysis, we first have to create an object for data processing, using the function InitDataObjects, and then we will read in the data file using the function Read.TextData. The InitDataObjects is the first function executed when uploading a dataset. It constructs the dataSet object, which is an R list with several variables assigned to it, including the type of data, the data format, and whether or not the data is paired. Is also creates the analSet object, and the imgSet object, which are also lists that will be filled in downstream analysis. Further, the function creates three libraries and databases which will be filled if necessary. Finally, it creates msg.vec, which is a character vector that will contain messages produced from the analysis. It it necessary for users to specify the dataType, (list, conc, specbin, pktable, nmrpeak, mspeak, msspec), and the type of analysis to be performed on the uploaded dataset, analType (stat, pathora, pathqea, msetora, msetssp, msetqea, ts, cmpdmap, smpmap). In this case, we are uploading a compound concentration dataset (“conc”), and will be performing statistical analysis on the dataset (“stat”).

Read.TextData is the second function to be executed to read in a specific file/s. You must specify the file path, the data format and the label type to provide the right parameters for MetaboAnalystR. In this case, the samples in the cachexia dataset are in rows (versus columns) and are unpaired (hence, “rowu”). Additionally, the data is discrete (“disc”) versus continuous.

Note that mSet\(dataSet\)read.msg can be used to view the messages saved from importing the dataset.

# Load MetaboAnalystR

# Create objects for storing processed data
mSet <- InitDataObjects("conc", "stat", FALSE)
## Starting Rserve:
##  /opt/R/4.2.2/lib/R/bin/R CMD /opt/R/4.2.2/lib/R/library/Rserve/libs//Rserve --no-save 
## [1] "MetaboAnalyst R objects initialized ..."
# Read in the data and fill in the dataSet list 

mSet <- Read.TextData(mSet, "", "rowu", "disc")

# To view messages from the data import and processing
## [1] "Samples are in rows and features in columns"                                
## [2] "The uploaded file is in comma separated values (.csv) format."              
## [3] "The uploaded data file contains 77 (samples) by 63 (compounds) data matrix."

The steps to import a peak intensity table are very similar to importing a compound concentration dataset. In InitDataObjects, you need to specify that the data type is a “pktable”, and that you will be performing statistical analysis on the imported data (“stat”).

mSet <- InitDataObjects("pktable", "stat", FALSE)

mSet <- Read.TextData(mSet, "", "rowu", "disc")

## [1] "Samples are in rows and features in columns"                                    
## [2] "The uploaded file is in comma separated values (.csv) format."                  
## [3] "The uploaded data file contains 410 (samples) by 11 (peaks(mz/rt)) data matrix."

To import binned spectral data, users will use the same Read.TextData functions in the above two examples. However, binned spectral data is typically of lower quality than compound concentration data and therefore requires a further step of data preprocessing to convert the data into a usable data matrix. For instance, binned spectral data often contains background noise and requires filtering of such unwanted variations. You can find an example here.

2.1.1 Importing example data (Zipped Peak Lists)

In this example, we will import a zipped file of peak lists (“”), which you will find in the “data” folder of the MetaboAnalystR GitHub. As above, we will first create a dataSet object, this time specifying “mspeak” as the dataType for LC-MS peak lists, and “stat” as the analysis type.

# Load MetaboAnalystR and clean environment
rm(list = ls())
unlink("upload", recursive = TRUE);

# Create an object for storing processed data
mSet <- InitDataObjects("mspeak", "stat", FALSE)
## Starting Rserve:
##  /opt/R/4.2.2/lib/R/bin/R CMD /opt/R/4.2.2/lib/R/library/Rserve/libs//Rserve --no-save 
## [1] "MetaboAnalyst R objects initialized ..."
# Fetch an example data
download.file("", destfile = "", method = "wget")

# Unzips the uploaded zip file/s, removes it and saves it as "upload"
UnzipUploadedFile("", "upload", F)
## [1] 1
# Read peak lists/intensity files
mSet <- Read.PeakList(mSet, "upload")

Following unzipping the files, we will use Read.PeakList to read in the peak lists and fill in the dataSet object. The Read.PeakList function creates a usable data matrix which is consistent with the XCMS matrix format, allowing for the use of XCMS functions downstream. Further, the peak lists must be formatted in 2 or 3 columns. Specifically, for NMR peak lists, the input should be formatted as 2 columns containing only numeric values (Column 1 - chemical shift, expressed in parts per million “ppm”; Column 2- intensity “int”). For MS peak lists, the lists can be formatted as two-columns (Column 1 - mass “mz”; Column 2 - intensity “int”), or as three-columns (Column 1 - mass “mz”; Column 2 - retention time “rt”; Column 3 - intensity “int”).

2.1.2 Preprocessing example data (NMR and MS Peak Lists)

Following import of NMR or MS peak lists, peaks that represent the same metabolite should be grouped together. The function GroupPeakList utilizes the XCMS grouping algorithm to combine peaks across samples with similar massess and retention times into groups. Further, SetPeakList.GroupValues cleans the peak groups, summing multiple identical peaks within groups if they originate from the same sample. Both functions work directly upon the mSet object.

# Load MetaboAnalystR and clean environment
rm(list = ls());
unlink("upload", recursive = TRUE);

# Initialize Objects
mSet <- InitDataObjects("nmrpeak", "stat", FALSE);
## Starting Rserve:
##  /opt/R/4.2.2/lib/R/bin/R CMD /opt/R/4.2.2/lib/R/library/Rserve/libs//Rserve --no-save 
## [1] "MetaboAnalyst R objects initialized ..."
# Fetch data and unzipping
download.file("", destfile = "", method = "wget");
UnzipUploadedFile("", "upload");
## [1] 1
# Read data in 
mSet <- Read.PeakList(mSet, "upload");
# Perform grouping of peaks
mSet <- GroupPeakList(mSet, 0.025, 30.0);
# Form peak groups
mSet <- SetPeakList.GroupValues(mSet)

2.1.3 Data integrity check

After reading in the metabolomics data, it is necessary for you to perform a data integrity check to ensure that the data is valid and suitable for subsequent analyses using SanityCheckData. This function must be used directly following data upload, if else, down-stream data processing functions such as ReplaceMin and Normalization will be unusable. SanityCheckData evaluates the accuracy of sample and class labels, data structure, deals with non-numeric values, removes columns that are constant across all samples (variance = 0), and by default replaces missing values with half of the original minimal positive value in your dataset. If the function is successful, you will see “Successfully passed sanity check!” printed on your screen, along with all the messages produced from the sanity check. Moreover, you can use mSet\(dataSet\)check.msg to view all the collected messages that came from running the sanity check.

# Run the sanity check, it will return a series of messages if the data is suitable for subsequent analyses. 
mSet <- SanityCheckData(mSet)
##  [1] "Successfully passed sanity check!"                                                                                
##  [2] "Samples are not paired."                                                                                          
##  [3] "2 groups were detected in samples."                                                                               
##  [4] "Only English letters, numbers, underscore, hyphen and forward slash (/) are allowed."                             
##  [5] "<font color=\"orange\">Other special characters or punctuations (if any) will be stripped off.</font>"            
##  [6] "All data values are numeric."                                                                                     
##  [7] "A total of 1072 (20.2%) missing values were detected."                                                            
##  [8] "<u>By default, missing values will be replaced by 1/5 of min positive values of their corresponding variables</u>"
##  [9] "Click the <b>Proceed</b> button if you accept the default practice;"                                              
## [10] "Or click the <b>Missing Values</b> button to use other methods."

2.2 Processing an example dataset (Compound concentration data)

The workflow for processing, normalizaing, and filtering the dataset are consistent across data types. While MetaboAnalystR was built with user’s flexibility in mind, data processing must make sense. We will now continue with the “cachexia_concentrations.csv” file for the rest of the examples. To begin with the data processing, MetaboAnalystR will first deal with missing values. Please note, missing value imputation can be performed before or after normalization in MetaboAnalystR. The suggested default method for missing value imputation used by MetaboAnalystR is ReplaceMin, which replaces all missing or zero values in the dataset with a very small value (half of the smallest positive value in the original data). The function works directly upon the dataSet object. Use mSet\(dataSet\)replace.msg to view what the value is that all replaced missing/zero values.

# Replace missing/zero values with a minimum positive value
mSet <- ReplaceMin(mSet)

# View messages collected during ReplaceMin()
## [1] "Zero or missing values were replaced by 1/5 of the min positive value for each variable."

Other options are available to deal with missing values, consisting of 2 steps. The first is an option to remove features with too many missing values, using RemoveMissingPercent, the threshold being a user defined cut-off. The function also works directly upon the dataSet object. In the example below, we are removing any features within the dataset with >50% missing values.

The second step is to remove or replace the remaining missing values, using ImputeMissingVar. There are several options to decide what to replace missing values with, such as replacement based on the minimum/mean/median value of each feature column, or several options to impute the missing values, using k-nearest neighbour (KNN), probabilistic PCA (PPCA), Bayesian PCA (BPCA), or Singular Value Decomposition (SVD). In the example below, we will exclude variables with missing values (“exclude”). An example of replacing missing values with KNN imputed values is also included (method = “knn”).

# STEP 1: Remove features containing a user-defined % cut-off of missing values
mSet <- RemoveMissingPercent(mSet, percent=0.5)

# STEP 2: Remove variables with missing values
mSet <- ImputeMissingVar(mSet, method="exclude")

######### Alternative Step 2: Replace missing values with KNN imputed values   
mSet <- ImputeMissingVar(mSet, method="knn_smp")

IsSmallSmplSize(mSet) is a quick check of the data, evaluating if there are too many groups that contain a very small number of replicates. It will return a 0 if the data passes the check, and will return a 1 if the sample size is too small.

# Check if the sample size is too small, returns a 0 if the data passes the check
## [1] 0

2.3 Normalizing example dataset (Compound concentration data)

MetaboAnalystR contains 14 methods for data normalization, categorized based upon whether these methods will be used on rows (row-wise) or on columns (column-wise).

Row-wise normalization

This category contains 7 methods for row-wise normalization, which focuses on reducing variation due to sampling. For instance, there is a method to normalize data to a reference feature (rowNorm = CompNorm), such as to an internal standard within your dataset, or to a physiological constant, such as creatine in urine. Besides, the normalization can be performed to a reference sample or a group of reference sample.

Column-wise normalization

This category contains 7 methods for centering, transforming, and scaling metabolomic data. Briefly, centering data involves substracting the mean from a variable, thereby moving the mean to 0. This is useful for adjusting for outliers and variation between high and low concentration metabolites. Scaling the data means that each variable is divided by a chosen factor, the scaling factor, adjusting for fold differences in metabolite concentrations. Finally, transformation converts the data to a different scale, for instance to the logarithmic scale. This non-linear data conversion aims to reduce heteroscedasticity and make unsymmetrical data more symmetric. In total, centering, transforming, and scaling are used to “clean” the data, reducing intra-group sample variation whilst focusing on biologically relevant information.

Normalization in MetaboAnalystR is a comprehensive function which allows users to perform data normalization (row-wise Normalization), transformation (transNorm), and centering/scaling (scaleNorm). This function contains several options for each of these categories, therefore please refer to the MetaboAnalystR manual for further details.

For this tutorial we will go through 3 examples of data normalization:

  1. Normalize the data according to a reference sample (Probabilistic Quotient Normalization). “SamplePQN” is the mode of normalization, and “P037” is the reference sample.

  2. Normalize the data according to a reference feature. “CompNorm” is the mode of normalization to select, and “1,6-Anhydro-beta-D-glucose” is the reference feature.

  3. Perform quantile normalization (“QuantileNorm”), log transformation (“LogNorm”), and mean-center scaling (“MeanCenter”).

MetaboAnalystR also has 2 functions to view the results of your selected normalization strategy. PlotNormSummary will give a feature-wise view of the data normalization, and PlotSampleNormSummary will give a sample-wise view of the normalization. Both functions will create a before and after-normalization view of the data, with a density plot and a box-plot.

### OPTION 1) Perform Probabilistic Quotient Normalization based upon a reference sample
mSet<-Normalization(mSet, "SamplePQN", "NULL", "NULL", "P037", ratio=FALSE, ratioNum=20)

### OPTION 2) Normalize by reference feature 
mSet<-Normalization(mSet, "CompNorm", "NULL", "NULL", "2.2882", ratio=FALSE, ratioNum=20)

### OPTION 3) Perform quantile normalization, log transformation, and mean-center scaling  
mSet<-Normalization(mSet, "QuantileNorm", "LogNorm", "MeanCenter", ref=NULL, ratio=FALSE, ratioNum=20)  

# View feature normalization
mSet<-PlotNormSummary(mSet, "feature_norm", format="png", dpi=72, width=NA)

# View sample normalization
mSet<-PlotSampleNormSummary(mSet, "sample_norm", format="pdf", width=NA)
Figure 1. Normalization results of features.

Figure 1. Normalization results of features.

For futher understanding, an excellent resource to understanding normalization of metabolomics data: van den Berg et al. 2006 (PMID:16762068).

2.4 Filtering an example dataset (Compound concentration data)

When working with a high-dimensional dataset, it is necessary to filter out uninformative information and focus on key features, thereby increasing the power to detect a true effect. Filtering data is an unsupervised method of feature selection, meaning that class/group labels are ignored, and that determining the importance of variables relies solely upon the data itself. A simple example of filtering is removing variables, in this case metabolites, with low variance across all samples. As these samples do not significantly change, they are uninformative to downstream analysis. Removing such samples will increase the power in further analyses. Further, it is highly recommended that users with untargeted metabolomics datasets perform filtering, as untargeted data is known to contain background noise which can be filtered out.

For metabolomics, non-informative data can be categorized into three groups:

  1. Small value variables - These are variables that are very small when compared to the rest of the variables, and are close to the baseline/detection limit. Using the mean/median filtering method can identify these variables.

  2. Near constant variables - These are variables with very low variance, as mentioned above. These are likely metabolites that have to do with housekeeping or homeostasis. Using the standard deviation or the more robust interquantile range will identify these variables.

  3. Variables with low repeatability - Repeatability refers to the variation in repeat measurements of the same sample at different times. Repeatability is measured using QC samples and calculated using the RSD, which is the standard deviation divided by the mean (RSD = SD/mean). Samples with high repeatability have a low RSD, and vice-versa. Features with a high RSD can be eliminated from the dataset by specifying the “rsd” argument in the FilterVariable function. For LC-MS data, the recommended cut-off is 20%, and for GC-MS data, the recommended cut-off is 30%.

To determine the number of variables to remove, small value and near constant variables will be subjected to these guidelines:

Less than 250 variables: 5% will be filtered Between 250 - 500 variables: 10% will be filtered Between 500 - 1000 variables: 25% will be filtered Over 1000 variables: 40% will be filtered

Dependent on your type of metabolomics data, select the filtering option/s as appropriate.

To perform filtering, utilize the FilterVariable function. There are three components the user must specify, the filtering option (filter), whether or not the filtering is based on quality-control (qcFilter) samples, and the relative standard deviation (rsd). There are two examples below, the first uses the median absolute deviation to filter variables and has no QC-samples. For the second example, the filtering method is set to “none”, and instead the QC-Filter is set to true, and uses a RSD of 25 to filter the samples. Read the document of FilterVariable to understand more options on filtering method. Here are two examples.

# Filter variables based on the median absolute deviation, percentage to filter out is 5 (%). No QS based filteration.
mSet <- FilterVariable(mSet, "mad", 5, "F", 25, T)

# Filter variables using Non-parametric relative standard deviation (MAD/median) and a QC-based RDS threshold is 25. 
mSet <- FilterVariable(mSet, "nrsd", 5, "T", 25, T)

2.5 Removing sample variables

Throughout the analysis of your metabolomic data, you may need to remove samples, variables, or even groups from your dataset. One reason to remove data could be that they are outliers, meaning that the value/s of that data is far from the majority of the other data, or a second reason could be that the data are accidental duplicates of other samples and their removal is necessary to have a usable data set.

Several options are available to visually identify potential outliers in your data. For instance, PlotNormSummary or PlotSampleNormSummary will give a box plot summary of samples and features, or PlotPCA2DScore will give you a 2D PCA score plot. In the case of a PCA score plot, outliers will be far away from the main cluster. Finally, PlotHCTree and PlotHeatMap are two functions that perform hierarchical clustering, providing you with a dendogram and a heatmap of the data, respectively. More details of these functions can be found in the Statistical Analysis Module vignette.

There are 3 functions to perform sample, feature, and group removal in MetaboAnalystR. For each function, the name of the variable you would like to remove must be written in quotation marks, as shown below. Sample/feature/group removal must be performed following data processing and filtering. If the data was normalized prior to removal, it is necessary to re-do the normalization.

# Remove a sample from the data set, in this case sample "PIF_178"
smpl.nm.vec <- c("PIF_178")# used to remove certain samples

# Remove a feature from the data set
feature.nm.vec <- c("2-Aminobutyrate")# used to remove certain feature, i.e. 2-Aminobutyrate

# Remove a group from the data set, in this case remove the "control" samples
grp.nm.vec <- c("control") # used to retain certain groups

mSet <- UpdateData(mSet)

3. Sweave Report Generation

Following analysis, a comprehensive report can be generated which contains a detailed description of each step performed in the R package, embedded with graphical and tabular outputs. To prepare the sweave report, please use the CreatePDFReport function. You must ensure that you have the nexessary LaTeX libraries to generate the report (i.e. texlive and texlive-fonts-extra). The object created must be named mSet, and specify the user name in quotation marks.

# Create Biomarker Sweave report 
PreparePDFReport(mSet, "User Name")

# To save all files created during your session