The purpose of MetaboAnalyst is to provide a free, user-friendly, and easily accessible tool for analyzing data
arising from high-throughput metabolomics data. It is designed to address two common types of problems:
1) to identify features that are significantly different between two conditions (biomarker discovery);
2) to use the metabolomic data to predict the conditions under study (classification). In addition, MetaboAnalyst
also provide tools for compound identification and pathway mapping for annotating significant features.
Note, MetaboAnalyst is designed for analyzing a large number of samples, very few samples (less than 10) will
cause some functions to work improperly.
MetaboAnalyst accepts data from either targeted profiling (concentration tables) or metabolic fingerprinting
approaches (spectral bins, peak lists) produced from either NMR, LC-MS, or GC-MS. In addition, GC/LC-MS spectra saved
as open data format (NetCDF, mzDATA, mzXML) can also be processed using the XCMS packages. For sample
files and format specification, please check "Data Formats" for more details.
Yes. The data files you upload for analysis as well as any analysis results, are not downloaded or examined in any way by
the administrators, unless required for system maintenance and troubleshooting. All files are deleted from the server after no
more than 72 hours, and no archives or backups are kept. You are advised to download your results as an zip immediately
after performing an analysis.
MetaboAnalyst supports GC/LC-MS spectra through the popular XCMS
package. However, limited by the web interface, only the most commonly used procedures and parameters are enabled.
The package also offers more advanced options for tuning parameters in peak picking and alignment, as well as other
spectra visualization options. These options are either computationally very intensive or requires a lot of user interactions.
Therefore, users are encouraged to first perform spectra processing either using the XCMS installed in their local machine
or using tools provided by the manufacturers, and then upload the processed peak list files (example) or peak intensity table
(example) for analysis.
For a detailed step-by-step instructions on how to use R and the XCMS package to process raw spectra, please
read the Raw LC-MS spectra processing using R and XCMS on our tutorial page. For web-based tool, we recommend
the XCMS Online service for LC-MS spectra processing;
and the MetabolomeExpress for GC-MS spectra processing.
After you obtain a peaklist table, you can then upload to MetaboAnalyst for statistical analysis.
MetaboAnalyst currently does not support LC-MS/MS spectra data analysis. For such data, users can try the recent
MetFamily tool. By integrating analysis of MS(1) abundances and MS/MS spectra,
the tool is able to discover regulated metabolite families in untargetted metabolomics studies. For details, please refer to
their orginal paper here
There are two possible reasons when this error happens:
The zip files cannot be decompressed by our program. This happens if you used the most recent WinZip (v12.0) with default option for
compression. Make sure to selecting the Legacy compression (Zip 2.0 compatible).
It can also be caused by the spectra itself. They must be in NetCDF (.CDF) or mzXML format. Also, please note,
the program does not handle tandem-MS files. In addition, make sure there are no other files (for example, log file or
other parameters files) included in the spectra folder.
Depending on the type of experiments, there may be significant amount of missing values present in the data set.
Missing values should be presented either as empty values or NA without quotes in order to be accepted by MetaboAnalyst.
Any other symbol will be treated as string character and will cause errors during data processing. MetaboAnalyst offers a variety of
methods to deal with missing values. By default, the missing values are treated as the result of low signal intensity. They will be
replaced by half of the minimum positive values detected in your data. Users can also specify other methods, such as replace by mean/median,
Probabilistic PCA (PPCA), Bayesian PCA (BPCA) method, or Singular Value Decomposition (SVD) method to impute the missing values
(Stacklies W. et al).
Potential outliers can be identified from PCA or PLS-DA plots. The scores plot can be used to identify sample outliers,
while the loadings plot can be used to identify feature outliers. The potential outlier will distinguish itself as
the one located far away from the major clusters formed by the remaining.
To deal with outliers, the first is to check if those samples / features are measured properly. In many cases, outliers
are the result of operational errors during analytical process. If those values cannot be corrected, they should be
removed from analysis. MetaboAnalyst provides DataEditor to enable easy removal of sample/feature outliers.
Please note, you may need to re-normalize the data after outlier removal.
For NMR spectra, there are several regions where no known compounds in biofluids have a resonance peak. The corresponding bins contain
only baseline noises. In addition, when signal approaches background, the relative errors increases and conclusions based on these data will be questionable.
Therefore, it is best to first remove these noises before further analysis. The current implementation supports the use of absolute
cut-off threshold for acceptable signal values. The default is a percentage based cut-off in which a fixed fraction (default 25%) of
the bins is discarded. Users are provided with a visual guidance to select an appropriate value. In future, the program will estimate the
baseline for each spectrum and its standard deviation. Signals less than two standard deviations above the baseline will be excluded.
This implies MetaboAnalyst failed to execute the command using the given parameters. Users should try to adjust parameter values.
We found in most cases, the problem is associated with sample size. In particular, if the sample size is very small (below 10),
some unpredictable error may happen. For instance, by default PCA and PLSDA will try to generate summary/classification/permutation
plot for the top 5 components, if the sample size is too small, it will fail to do so.
There are several unsupervised methods (PCA, hierarchical clustering, SOM, K-means) that can be used to detect
inherent patterns in unlabeled data. However you need to trick MetaboAnalyst to accept the data by providing
dummy two-group labels . In this case, results from feature selection or supervised classification methods
will be meaningless.
This depends on the biological questions under investigation. For example, if the purpose of technical replications is to see if
there is systematic variance introduced by sample handling or instrumentation, the clustering programs such as PCA or
hierarchical clustering can be used to investigate whether the same technical replicates tend to group together.
After checking the clustering pattern of these technical replicates, if users decide to merge all sample replications
(i.e. by averaging them). The processed data can be downloaded and processed with a different program
(i.e. spreadsheet). The data can then be re-uploaded to MetaboAnalyst for further data analysis and annotation.
Normalization by a reference sample, also known as probabilistic quotient normalization (Dieterle F et al ), is a robust method
to account for different dilution effects of biofluids. This method is based on the calculation of a most probable dilution factor
(median) by looking at the distribution of the quotients of the amplitudes of a test spectrum by those of a reference spectrum.
Generalized logarithm (glog) is a simple variation of ordinary log in order to deal with zero or negative values in the data set.
It has many desirable features (for details, see Durbin BP. et al
Its formula is shown below:
where a is a constant with a default value of 1.
The purposes of data editor and data filter are to help improve the quality of data for better separation, prediction or interpretation.
In particular, user can use data editor to remove outlier(s) which can be visually identified from PCA or PLS-DA scores plots);
user can use data filter to remove noisy or uninformative features (i.e. baseline noises, near-constant-features). These features
tend to dilute the signal and decrease the performance of most the statistical procedures. Be removing outliers and low-quality features,
the resulting data will be more consistent and reliable.
In a boxplot, the bottom and top of the box are always the 25th and 75th percentile (the lower and upper quartiles, or Q1 and Q3, respectively),
and the band near the middle of the box is always the 50th percentile (the median or Q2). The upper whisker is located at
the smaller of the maximum x value and Q3 + 1.5*IQR (Interquantile Range), whereas the lower whisker is located at the
larger of the smallest x value and Q1 - 1.5*IQR.
The purpose of fold change (FC) analysis is to compare absolute value change between two group averages. In paired fold change analysis,
users aim to find some features that are changed consistently (i.e. up-regulated or down-regulated, but not both) between two groups.
The consistency is measured as a percentage - (# of pairs with consistent change above a given FC threshold) / (total # of paired samples).
Users need to specify two thresholds - fold change threshold and significant counts (the percentage).
There are two variable importance measures in PLS-DA. The first, Variable Importance in Projection (VIP),
is a weighted sum of squares of the PLS loadings taking into account the amount of explained Y-variation,
in each dimension. Please note, since VIP scores are calculated for each components, when more than components
are used to calculate the feature importance, the average of the VIP scores are used. The other importance measure
is based on the weighted sum of PLS-regression. The weights are a function of the reduction of the sums of squares
across the number of PLS components. Please note, for multiple-group (more than two) analysis, the same number
of predictors will be built for each group. Therefore, the coefficient of each feature will be different depending
on which group you want to predict. The average of the feature coefficients are used as the overall coefficient-based importance.
It is well known that when there are too many variables and a small sample size, many supervised classification algorithms
tend to overfit the data. That is, even there is no actual difference between the groups, the program will still be able to
discriminate them by picking some features that are "different" between these two group by pure chance! Of course,
the classifier will be useless for new data since the pattern it detected is not real (not significant).
The purpose of a permutation test is to answer the question - "what is the performance if the groups are formed randomly".
The program uses the same data set with its class labels reassigned randomly. It then builds a new classifer,
its performance is then evaluated. The process will repeat many times to estimate the distribution of the performace
measure (not necessarily follows a normal distribution). By comparing the performance using the original label and
the performance based on the randomly labeled data, one can see if the former is significantly
different from the latter. For PLS-DA, the performance is measured using prediction accuracy or
group separation distance using the "B/W ratio" (as suggested by
Bijlsma et al.). The further away to the right of the distribution
formed by randomly permuted data, the more significant the discrimination. The p-value is calculated as the proportion of the times
that class separation based on randomly labeled sample is at least as good as the one based on the original data (one-sided p value).
Different methods use different criteria for ranking the features. The choice of method used can greatly affect the set of features
that are identified. This can be illustrated in the simple cases such as fold change v.s. t-tests, in which the former is interested in
absolute change in concentrations/intensities, while the latter focuses on changes relative to the underlying noises.
PLS-DA is based on linear regression; SAM and EBAM usually produce similar results since they are all based
on t statistics; both random forests and SVM are quite distinctive from all other methods by using an ensemble of classification trees
or projections to hyperplanes, respectively. Therefore, these methods tend to generate different results.
Significance Analysis of Microarray (SAM) is a robust method designed for identification of statistically significant genes. SAM use
moderated t-tests to computes a statistic d_{j} for each gene j, which measures the strength of the relationship between gene expression (X)
and a response variable (Y). This analysis uses non-parametric statistics by repeated permutations of the data to determine if the expression of any gene
is significant related to the response. The procedure accounts for correlations in genes and avoids normal assumptions about the distribution of individual genes.
More detailed description about about SAM history, features, and instructions can be found here.
Empirical Bayesian Analysis of Microarray (EBAM) is based on empirical Bayes method. Both
the prior and posterior distributions are estimated from the data. A gene is considered
to be differentially expressed if its calculated posterior is larger than or equal to
delta and no other genes with a more extreme test score that are not
called differentially expressed. The suggested fudge factor a0 is chosen that leads to the largest number of
differentially expressed genes. More information about empirical Bayes can be found
(here), and detailed description on
the EBAM algorithm be found from the original paper by
Efron, B., Tibshirani, R. et al.
RF is a powerful non-parametric classification method and can be used for both classification and important
variable selection (Breiman, 2001). RF uses an ensemble of classification trees, each of which is grown by
random feature selection using bootstrap sampling from the original sample set. Class prediction is based on
the majority vote of the ensemble. By default, 500 trees are used to build the RF classifier.
During tree construction, about one-third of the instances are left out of the bootstrap sampling process.
These “left-out” data are then used as test data to obtain an unbiased estimate of the classification error,
known as the ‘out-of-bag’ (OOB) error.
For a more detailed description about how this algorithm works, click here.
This is the workflow of RSVM algorithm described by Zhang X, et al..
Recursive SVM uses SVM for both classification and for selecting a subset of relevant genes according to their
relative contribution in the classification. This process is done recursively so that a series of data subsets and classification
models can be obtained in a recursive manner, at different levels of feature selection. The performance of the classification
can be evaluated either on an independent test data set or by cross validation on the same data set. R-SVM also includes
an option for permutation experiments to assess the significance of the performance. Please note, only linear kernel was used for
classification, since the information is usually far from sufficient for reliably estimating nonlinear relations for high-dimensional
data with a small sample size. First-order approximation will reduce the risk of overfitting in case of limited data.
CV2 refers to the cross validation embedded with the feature selection procedure as discussed above.
Cautions must be taken for the practice. If the features are selected based on the background biological knowledge,
it's fine to use the selected subset of data for classification. However, it is not proper to use the features
selected based on the whole dataset using some supervised methods (methods that utilize the class label information
such as t-tests, SAM, PLS-DA or other supervised classification methods). This procedure will introduce selection bias and result in a very optimistic
performance based on cross-validation due to "information leak". In order to obtain an objective performance evaluation,
one should include the feature selection procedure in the cross validation. Alternatively, one can evaluate the classifier using independent
dataset not used in the feature selections. These procedures were used by PLS-DA, random forest and R-SVM for feature
selection and classification. Currently, MetaboAnalyst does not support feature selection and classification using different
algorithms (i.e. SAM for feature selection and PLS-DA for classification). This is going to be our next step. Please refer to
paper by Ambroise C and McLachlan GJ for a more detailed discussion.
However, one can use non-specific filtering (i.e. not using the class labels) to select features. We recently introduced
the Data filter function under the Data Process category. Users can use this function to remove low
quanlity features using a variety of criteria. Additionally, users can also try to remove sample outliers if exists using the
Data editor. These are safe procedures that can potentially improve the classification performance.
MESA or Metabolite Set Enrichment Analysis is a novel way to identify biologically meaningful patterns in metabolite
concentration changes for quantitative metabolomic studies. In conventional approaches, metabolites are evaluated
individually for their significance under conditions of study. Those compounds that have passed certain sigificance level
are then combined to see if any meaningful patterns can be discerned. In contrast, MSEA directly investigates if a
group of functionally related metabolites are significantly enriched, eliminating the need to preselect compounds
based on some arbitrary cut-off threshold. It has the potential to identify "subtle but consistent" changes among a
group of related compounds, which may go undetected with conventional approaches.
Essentially, MSEA is a metabolomic version of the popular GSEA (Gene Set Enrichment Analysis) software, with its
own collection of metabolite set libraries as well as an implementation of user-friendly web-interfaces. GSEA is
widely used in genomics data analysis and has proven to be a powerful alternative to conventional approaches. For more
information, please refer to the original paper by Subramanian
A, and a nice review paper by Nam D, Kim SY.
Each metabolite set will have an associated source (Database links, PubMed ID, etc) to help users further track down
the original source of the information.
The SNP-associated metabolite sets were downloaded from the published journal websites, Only metabolites with
p values less than 1e-3 were retained to reduce false positives. For any given SNP ID, users can visit the NCBI SNP database to get detailed information about each
SNP.
A single column data contains a list of compound names for over-representation analysis (ORA).
For example, a list of important compounds identified by feature selection or clustering methods;
A two-column data separated by tab - the first column contains compound name and the second column is their measured
concentrations - for single sample profiling (SSP). This data is essentially a single profiled biological sample. The purpose is to
compare the measured values to their reference values to see if some compound concentrations are very high or low.
This input is limited to human biofluids (Blood, Urine, CSF) for which reference values exist;
A concentration table. The data contains quantitative metabolomic data of multiple samples
measured under different conditions. The data should be uploaded as a comma separated values (.csv) with the
each sample per row and each metabolite concentration per column. The first column is sample names and the second column
contains sample class labels (discrete or continuous). Users can now directly analyze named metabolite datasets
from the Metabolomics Workbench.
As the name suggests, ORA is to test if certain groups of metabolites are represented more often than expected by chance
within a given metabolite list. The most common approach to test this statistically is by using the hypergeometric test or
Fisher’s exact test to calculate the probability of seeing at least a particular number of metabolites containing
the biological term of interest in the given compound list. MSEA calculates hypergeometrics test score
based on cumulative binominal distribution.
One advantage of metabolomics is that metabolite concentrations in biofluids are tightly regulated through homeostasis
under normal physiological condition. For common human biofluids such as blood, urine, or cerebral spinal fluids (CSF),
normal concentration ranges are known for many metabolites. It is often desirable to know whether certain compound
concentrations from a particular sample are significantly higher or lower compared to their normal ranges. For this purpose,
we implemented SSP to compare the measured concentrations of compounds to their recorded normal reference range.
Compounds that are above or below the normal range beyond a user-specified threshold will then be further investigated
using ORA. Please note, SSP is only applicable to human blood, urine and CSF samples.
In ORA, only compound names are used for enrichment analysis. In this case, all compounds in the given list are treated
equally despite their obvious differences in importance measures (i.e. some are very significant, others only marginally
significant, based on p values). This is not optimal. In contrast, QEA takes into consideration not only the counts
(presence/absence), but also the importance measures (Q-statistic) calculated for each compounds for enrichment
analysis. Therefore, we called quantitative enrichment analysis. In MSEA, QEA was implemented using the
globaltest algorithm.
Yes, but you may have to provide your self-defined metabolite set libraries. The metabolite set libraries are species specific. For example, the pathway library is only valid for mammalian species. While the metabolite
sets based on diseases are only applicable for human. However, MSEA allows users to upload their own metabolite set libraries
for enrichment analysis. Therefore, if you want to perform enrichment analysis on data from species other than human being, you need to
provide two files, one contains metabolite concentration data, the other contains the metabolite set library.
The reference concentrations SSP were mainly collected from literature. In most cases, the measured concentration should be
comparable to the literature reported values as shown below. However, users should keep in mind that these concentrations were
measured based on heterogeneous analytical platforms. Therefore, it is more suitable to compare with concentrations measured
by the similar technologies. It is advisable to refer to the original literature (provided by SSP) if very extreme values are encountered.
The image below shows the
comparisons of the measured urinary Glycine concentration (indicated by a dotted yellow line) to literature reported concentration values.
The magenta squares indicate the mean concentration values, and the blue lines indicate the reported ranges. In some cases, 95%
confidence intervals are used (2 standard deviations from the mean) if no ranges were provided in the original report. Note the link
to the original reference of each study will be provided in a table under each plot during analysis.
The formula to calculate statistic Q can be obtained from the original publication by (Goeman
JJ, et al). Q can be intuitively interpreted as an aggregate of squared covariance between concentration
changes and the phenotypes - compounds with large variance have much more influence on the Q than compound with
small variance.
Metabolite set plot shows the detailed influence of individual compound to the total score of the metabolite set. The color of the bar indicates
positive or negative association with the phenotype (continuous) or positive association with a particular phenotype (discrete).
The reference line at bottom shows the expected influence if the compound was not associated with the phenotype. High bars
indicate influential metabolites. The marks on the bars show the standard deviation (SD) of the bar under null
hypothesis.
A typical metabolite set plot is shown below. In this case, the phenotype is discrete with four groups.
Among all possible comparisons between the data and null hypotheses, the nine amino acids are
most significantly associated with either group 3 and 4 as the color indicated. Please note, if you are interested in the comparison
between a particular two groups, you should edit and upload your data with samples from only these two groups
Currently, no single analytical technique can simultaneously measure all the metabolites involved in the metabolic
pathways. Different platforms - NMR, GC-MS, LC-MS, usually have different compound coverage. These differences
will likely introduce bias during the enrichment analysis, which is not related to the biological questions
under investigation. To correct for this bias, a reference metabolome should be used.
A reference metabolome refers to metabolites that can be measured by the analytical platform. Please note,
this correction is only necessary when only a list of significant compounds are provided (over-representation analysis).
When the whole concentration table is provided, all the compounds within the data are used as the reference
metabolome. It is not necessary to upload separately a compound list as reference metabolome.
In ORA and SSP: Fold-enrichment is calculated by dividing the observed number of hits by the expected
number of hits ("Hits" / "Expect" columns of the ORA and SSP results table).
In QEA: Fold-enrichment is calculated by dividing the observed Q statistic by the expected Q statistic ("Statistic" /
"Expected" columns of the QEA results table).
There may be times when not all uploaded compound names/IDs get matches in MetaboAnalyst. This is because our internal
compound database only contains all compounds found within our supported pathway/enrichment set libraries.
MetaboAnalyst now implements a "smart-matching" algorithm to help map user's lipid names to our internal compound database.
The algorithm considers common lipid synonyms (gathered from PubChem, HMDB and LIPID MAPS), lipid abbreviations (LIPID MAPS),
as well as different punctuation/symbols commonly used instead of slashes in lipid names. This step has significantly enhanced
the ability of our tool to support lipidomics data for enrichment analysis.
There are many commercial pathway analysis software tools, such as
Pathway Studio,
MetaCore,
Ingenuity Pathway Analysis, etc .
Compared to them, the pathway analysis in MetaboAnalystis a free, web-based tool targeting for metabolomics data analysis.
It uses the high-quality KEGG metabolic pathways as the backend knowledgebase.
It integrates many well-established (i.e. univariate analysis, over-representation analysis ) methods,
as well as novel algorithms / concepts (GlobalTest, GlobalAncova, pathway topology analysis)
into pathway analysis. In addition, MetPA implements a Google-Map style interactive visualization system to
help users understand their analysis results.
Over-representation analysis is to test if a particular group of compounds is represented more
than expected by chance within the user uploaded compound list. In the context of pathway analysis,
we are testing if compounds involved in a particular pathway is enriched compared by random hits.
The most common methods for such analysis is Fishers' exact test and hypergeometric test. Please note,
the over-representation analysis only consider the count (i.e. the total number of compounds that match
a particular pathway) and does not consider the magnitude of their concentration changes (not quantitative).
So compound that are changed more significant will be treated the same as compounds that are less significant.
Pathway enrichment analysis usually refers to quantitative enrichment analysis directly based on the compound
concentration values as compared to the compound lists used by over representation analysis.
It is usually more sensitive than over-representation analysis and has the potential to
discover "subtle but consistent" changes among compounds within the same biological pathway.
Many algorithms have been developed in the last decade for this type of enrichment analysis, the most famous
being the Gene Set Enrichment Analysis. Many new
and improved methods have since been implemented. The program uses
GlobalTest and
GlobalAncova for pathway
enrichment analysis when users upload concentration tables. Some important features about these two methods
include that they support binary, multi-group, as well as continuous phenotypes, and p values can be approximated
efficiently based on the asymptotic distribution without using permutations, which is critical for developing
web applications.
The structure of biological pathways represents our knowledge about the complex relationships between
molecules (activation, inhibition, reaction, etc.). However, neither over-representation analysis or pathway
enrichment analysis take the pathway structure into consideration when determining which pathways are
more likely to be involved in the conditions under study. It is obvious that changes in the key positions
of a network will trigger more severe impact on the pathway than changes on marginal or relatively isolated
positions. The program uses two well-established node centrality measures to estimate node importance -
betweenness centrality and degree centrality.
The former focus on node relative to overall pathway structure, while the latter focus on immediate local
connectivities.
Please note, for comparison among different pathways, the node importance values calculated from centrality
measures are further normalized by the sum of the importance of the pathway. Therefore, the total/maximum
importance of each pathway is 1; the importance measure of each metabolite node is actually the percentage w.r.t the
total pathway importance, and the pathway impact is the cumulative percentage from the matched metabolite nodes.
In a graph network, the degree of a node is the number of connections it has to other nodes ,
Degree centrality is defined as the number of links occurred upon a node. Nodes with higher node
degree act as hubs in a network. For directed graph, there are two types of degree: in-degree for links
come from other nodes, and out-degree for links initiated from the current node. Metabolic network
is directed graph. Here we only consider the out-degree for node importance measure. It is assumed
that nodes in upstream will have regulatory roles for the downstream nodes, not vice versa. In the
illustration on the right, the nodes colored in red have high degree centrality.
In a graph network, the betweenness centrality measures number of shortest paths going through the node.
Therefore, it take into consideration of global network structure, not only immediate neighbour of the current node.
For example, nodes that occur between two dense clusters will have a high betweenness centrality, even its degree
centrality is not high. Since metabolic network is directed, we use relative betweenness centrality for metabolite
importance measure. In the illustration on the right, the nodes colored in blue have high betweenness
centrality.
Currently, no single analytical technique can simultaneously measure all the metabolites involved in the
metabolic pathways. Different platforms - NMR, GC-MS, LC-MS, usually have different compound coverage.
These difference will likely introduce bias during the enrichment analysis, which is not related to the
biological questions under investigation. To correct for this bias, a reference metabolome should be used.
A reference metabolome refers to metabolites that can be measured by the analytical platform.
Please note, this correction is only necessary when only a list of significant compounds are provided
(over-representation analysis). When the whole concentration table is provided, the compounds within
the data are used as the reference metabolome. It is not necessary to upload separetely a compound list
as reference metabolome.
An example of "metabolome view" is shown below. It contains all the matched pathways (the metabolome)
arranged by p values (from pathway enrichment analysis) on Y-axis, and pathway impact values (from pathway topology
analysis) on X-axis. The node color is based on its p value and the node radius is determined based on their
pathway impact values. The pathway name is shown as mouse-over tooltip. Click the corresponding node when the
tool-tip shows will launch the corresponding pathway view
An example of "pathway view" is shown below. It shows the current metabolic pathway after user clicks the corresponding
node on the "metabolome view". The pathway is essentially a simplified KEGG pathway map shown only chemical compounds.
The default node color is light blue. When a reference metabolome is provided, some compounds will be colored
in light grey if they are not within the reference metabolome. The matched nodes will show varied heat map colors
based on their p values. The common compound names can be obtained via mouse over tool-tip. Click on any node will reveal
more detailed information as well as database links. For matched compounds, it will also include images summarizing the
concentration distribution of the corresponding compound (see compound view).
Two examples of "compound view" are shown below. The box plot image on the left shows the result with discrete
(binary or multi-group) phenotype labels; the scatter plot image on the right shows the result with continuous
phenotype labels. The "compound view" is generated from univariate analysis (t-tests, ANOVA or linear regression).
They show the concentration distributions of the corresponding compounds with regard to the phenotype labels.
Using pathway analysis, you may find that the graphics no longer appear in your web-browser,
but that the results table is still visible. In this case, you may see "Failed to process the request!"
appear on the page. To overcome this, delete your web-browser history and cookies, then retry the
pathway analysis.
There may be times when not all uploaded compound names/IDs get matches in MetaboAnalyst. This is because our internal
compound database only contains all compounds found within our supported pathway/enrichment set libraries.
For some KEGG pathways, there may be a discrepancy in the number of compounds listed from KEGG directly compared to MetaboAnalyst. This is because
a KEGG pathway can contain compounds that have not been verified for that species. Therefore, our pathway libraries only contain compounds that have been verified so
will have fewer compounds than KEGG itself.
Joint Pathway Analysis is a sort of integration of multi-omics (usually transcriptomics and metabolomics) datasets
against the backdrop of the existing biological knowledge (e.g. Library of pathways).
The main impetus for data integration is to improve the understanding of the underlying biological sense, as well as to be better able to
gain further insight into mechanistic details of the system.
Gene List input frame is required. You should input the Genes' ID (Entrez ID, Official Gene Name or Uniprot Protein ID) as the
1st column and the log fold change as the 2nd column (optional). Compound List input frame is also required. You should input the Compounds' ID (compound name, HMDB ID or KEGG ID) as the
1st column and the log fold change as the 2nd column (optional).
NOTE: the log fold change is optional for gene list and compound list.
When we perform the joint-analysis, if a pathway contains many more genes than metabolites, the significance of the genes will overwhelm
the significance of the metabolites. In addition, the transcriptomics usually report far more differential expression genes than metabolomics,
in this case, the integration result is always dominated by transcriptomics datasets.
The weight strategy for different ‘universes’ (“transcriptomics data universe” and “metabolomics data universe”) are performed with a weighted z-test.
The weighted z-test is proposed for combination analysis by Dmitri V. Zaykin. et. al and leveraged here for the weighted integration of
different datasets with significantly different sizes. A figure below is provided to illustrate the mechanism of this weighted integration of different
Omics-data in MetaboAnalyst.
Specifically, we assign different weights based on the proportion of genes and metabolites in the specific ‘omics universes’
to balance the influence from the different sizes of the ‘omics inputs upon the integrated pathway results.
The adjusted P value is estimated with a weighted Z-test below.
In the equation, w_{i} is the weights of the P values of genes or compounds within individual omics "universe" or "pathway space",
respectively; Z_{i} is the Z score of the corresponding P values of single omics data, usually, Z_{i} = Φ−1(1 −P_{i}); P_{i} is the P values
from the enrichment analysis above; Φ denote the standard normal cumulative distribution function.
There are four different types of pathways prepared for users to integrate and interpret their data.
The first is the Integrated Metabolic Pathways, which is composed of metabolites and their directly linked genes.
The second is the entire Integrated Pathways, containing all metabolites and genes in certain pathways (directly linked to metabolites
or not). The third and fourth pathway types are those consisting of metabolites or genes only. The metabolites and genes in the corresponding Pathway Databases
is providing the different "Universes" for analysis. You may need to select 1st database if you only focus on the metabolite-linked genes and their associated pathways.
The interactive PCA visualization summarizes all the data into the the first three principal components (PCs).
Each data point in the Scores Plot represents a sample. Samples that are close together are more similar to each other.
The colors of these data points are based on the factor labels. Users can change the colors according to any of the
two factor labels.
Each data point in the Loadings Plot represents a feature. When Scores plot and Loadings plot are
viewed from the identical perspective, the direction of separation on the scores plot can be explained by the
corresponding features on the same directions - i.e. features on the two ends of the direction contribute more
to the pattern of separation.
The two-way heatmap displays the data in the form of matrix with cells colored according to their values. It provides
an intuitive overview of the overall data points. To facilitate pattern discovery, MetaboAnalyst also performs clustering on the variables.
The samples are sorted by their factor labels.
Users can change the clustering algorithms, the color schema, as well as the order of the samples to suit their
preferences. When there are too many samples or variables, the high-resolution image can be used in order to get
a clear view of the results.
ASCA is a multivariate extension of univariate ANOVA approach. This implementation supports ASCA
model for two factors with one interaction effect. The algorithm first partitions the overall data variance
(X) into individual variances induced by each factor (A and B), as
well as by the interactions (AB). The formula is shown below with (E) indicates the residuals.
X = A + B + AB + E
The SCA part applies PCA to A, B, AB to summarize major variations in each
partition. Users need to specify the number of components used for each model.
The maximum allowed number of components of each factor must be less than the corresponding levels of the factor.
For example, if the phenotype factor contains two levels (control and disease). Only the top component
will be extracted. In most cases, users should focus on top one or two components.
It is critical to test the presence of interactions between the two experimental factors in order to
correctly interpret the data.
For individual variable, the interaction can be assessed by univariate two-way ANOVA.
Users can view a two-way box plot summary of each variable by clicking the corresponding name
in the detailed table view of the two-way ANOVA results.
The overall interaction effect can be assessed by ASCA permutation tests (under the "Model Validation"
tab on the ASCA page). It performs unrestricted permutation of observations (Manly's approach) and recalculates
the total sum of squares (TSS) for each experimental factor and their interactions. If the TSS calculated based
on the original data is significantly different from those calculated from the permuted data, then the effect
is significant. For more detailed discussions on this subject, please refer to the paper by
Vis D. J. at. al
When there are no significant interaction effect being detected, it is advisable to
analyze the data with regard to each experimental factor separately for simpler and easier interpretation.
When the presence of interactions between the two experimental factors are confirmed, it is critical to
interpret the data in the context of both experimental conditions . Here I will give two examples to
illustrate this - one for univariate two-way ANOVA, one for the multivariate ASCA, using the
example dataset provided by MetaboAnalyst. The study is to investigate the temporal profiles (factor 1: Time) during
a wounding process using two different plant lines (factor 2: Phenotype).
The figure on the left shows the peak (1.4087/417) with very different temporal profiles between the two plant lines.
As shown in figure, the baseline (time point 0) of MT is much higher than that of WT group. During a wounding process,
WT first shows a significant increase, then starts to decrease; while the MT group shows first decrease, followed by
slight increase at the third time point, then decrease again. The different response in the wounding time course
may suggest the different underlying biological mechanisms involved.
The figure on the right shows the major trends associated with MT and WT during the wounding time course.
The WT first increase then decrease, while the MT first decrease then increase, indicating very different biological processes involved.
Please note, the interpretations above are based on the objective of the experiment - to compare temporal profiles of
different plant lines. It is also possible to compare the difference between two plant lines at each time point. To summarize,
when the interaction is present, researchers must explain the result of factor A in the context of factor B
(or vice versa), whichever is more convenient or customary.
ASCA identifies the major trends associated with each experimental factor and their interactions.
The next step is to identify variables that are closely follow the detected trends
as well as those that clearly diverge from the trends. The first ones would represent those variables
that most co-coordinately respond in the experimental context; The second ones would be potential outliers.
The variable selection strategy is based on the approach described by
Nueda, M.J. et al.
An example figure is shown on the left. Important variables are put into two categories: 'well-modeled' refers
to those that change co-coordinately to the major profile(s) depicted in the trajectory plots of the factor,
and 'outliers' refer to those with relevant changes but different from the major profile(s).
These variables are selected based on the Leverage (a measure of the importance of a variable's
contribution to the fitted ASCA-model - higher Leverage means more importance) and SPE
(or squared prediction error, an evaluation of the goodness of fit of the model to a particular variable -
higher SPE means less fit ). The Leverage value (between 0 and 1) is used to select 'well-modeled'
variables; while SPE is used to select 'outliers'. SPE is controlled by Alpha (between 0 and 1). Smaller Alpha will lead to
higher SPE, which will select variables that diverge more significantly from the general trend.
The MEBA approach is designed to compare the time-course profiles with regard to different experimental conditions.
It is based on the timecourse method described by YC Tai. et al.
The result is a list of variables that are ranked by their difference in temporal profiles across different biological conditions.
The Hotelling-T2 is used to rank the variables with different temporal profiles between two biological conditions
under study; And the MB-statistics is used for more than two biological conditions. Higher statistical value indicates
the time-course profiles are more different across the biological conditions under study.
Although many approaches can be extended to support data with more than two
factors. However, as the number of factors increase, the analysis procedures becomes more complicated and data interpretation more
difficult. How to present these complicated results in an easy-to-understand form becomes a real challenge for web-based programs
(and programmers!). Currently, the task is better left to an expert statistician. We would strongly recommend
researchers to further stratify their subjects into more homogenous groups to reduce the number of factors before performing the analysis.
In most cases, researchers should focus on one or two factors for better tool support and easier interpretation.
Receiver Operating Characteristic (ROC) curve analysis is the method of choice for biomarker identification and performance evaluation.
However, to perform these two tasks typically requires a good understanding
of statistics and machine learning concepts, as well as a good knowledge of programming
language such as Matlab or R. MetaboAnalyst offers, through a user-friendly web interface,
classical ROC curve analysis, and integration with several well-established algorithms (currently support: PLS-DA,
Random Forests and SVM) to assist researchers in performing common ROC curve analysis for biomarker discovery
and performance evaluation in metabolomics studies.
A minimum of 30-40 samples for balanced groups (i.e. at least 15~20 each group) are recommended in order
to calculate decent ROC curves and AUC evaluations. Note, for unbalanced data, the algorithm uses
Monte Carlo random sampling to produce balanced sub-samples for training data.
The ROC curve is a fundamental tool for diagnostic test evaluation. It
provides a complete and easily visualized sensitivity/specificity report visualization. In a ROC curve the true
positive rate (Sensitivity) is plotted in function of the false positive rate (100-Specificity) for
different cut-off points of a given parameter. Each point on the ROC curve represents a sensitivity/specificity pair corresponding
to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes
through the upper left corner (100% sensitivity, 100% specificity). Therefore the closer the ROC curve is to the
upper left corner, the higher the overall accuracy of the test. The closer the ROC curve to the diagonal line, the poorer
the diagnostic power of the test. The area under the ROC curve is a measure of how well a parameter can distinguish
between two diagnostic groups (diseased/normal).
There are two different approaches to determine the optimal threshold based on ROC curves. In one
approach (Youden), the optimal cut-off is the threshold that maximizes the distance to the diagonal line.
The optimality criterion is max(sensitivity + specificity). The other approach is to find the point
closest to the top-left part of the ROC plot with perfect sensitivity or specificity. The optimality criterion is
min((1-sensitivity)^2 + (1-specificity)^2)
Please note, a) there might be more than one threshold identified using the above criteria; b) the optimal cutoff
is only calculated for the ROC curve using the actual data points, not the smoothed one.
The algorithm tries to identify important features through repeated random sub-sampling cross validation (CV).
In each CV, two thirds (2/3) of the samples are used to evaluate the importance of each feature based on VIP scores
(PLSDA), decreases in accuracy (Random Forest) or weighted coefficients (Linear SVM). The top 2, 3, 5, 10 ...100 (max)
important features are used to build classification/regression models which is validated on the 1/3 the samples that were
left out.
The ROC curves created (using ROC Explorer or Tester) are based on cross-validation performance of
multivariate algorithms (SVM, PLS-DA or Random Forests). In contrast, the classical univariate ROC curves are created
based on the performance measured by testing all possible cutoffs within ALL data points. Therefore, the
AUROC from cross validated ROC curve is more realistic for prediction purpose, while the AUROC calculated from ROC
curve created by univariate approach is often too optimistic. In other words, univariate ROC can be considered as an
indicator of the discriminating "potential" of the feature, not its actual performance.
The most important features are selected from the training data at each cross-validation run. The orders and identities
of the list could be (slightly) different each time. There are two forms in reporting the most important features. The
Frequency of being selected shows the stability of the rank of the importance for a given biomarker,
and the Average importance measure provides a quantitative measure of the importance for
a given biomarker. The first measure is more robust and not sensitive to outliers. In most cases, these two approaches
should produce exactly the same list.
The prediction overview shows the predicted class probabilities (x-axis) of each sample (y-axis). The probability scores are the
average from the 50 iterations, ranging from 0 ~ 1. For instance, less than 0.5 will belong to group A, more than 0.5 belong to
group B. In theory, a sample could be located on the 0.5 line, which means the sample has never been selected for testing
during the iterations.
Users can also use the Pred. Overview to identify potential outliers. For instance, if a sample is always predicted to have
a high probability in the wrong group, this may indicate that the sample could be labeled incorrectly. Users can check
"Label samples classified to the wrong groups" to identify these potential outliers.
The algorithm uses repeated random sub-sampling cross validation (CV) to evaluate the feature importance as well as
to test the performance of different models. Every time when user click the "Submit" button to re-do the analysis, a new
random sampling will be generated. Therefore, the results would be only slightly different. This is the intended
behavior, users should looking for the "stable core" for more robust results. The procedure is computationally intensive and
the following rules are used to control the time and resources based on the sample size:
< 100 samples: 50 repeats
100 - 200 samples: 30 repeats
200 - 500 samples: 20 repeats
> 500 samples: 10 repeats
Two other factors can also affect the degree of variation - sample size and presence of outliers. If the sample size is small,
the traing and testing tend to have high variation. The outliers can also affect the results. Due to the nature of repeated
sampling in each CV, some samples may be used multiple times and some samples may never be used. If the outlier samples
are used imbalanced (i.e. never been used in the first analysis but used multiple times in the subsequent analysis), the
result could change more than "slightly".
The algorithm uses repeated random sub-sampling cross validation (CV) to test the performance of models created with
different number of features. At each CV, 2/3 samples are used for feature selection and model training and the
remaining 1/3 of samples are used for testing. The procedures are repeated 50 times in order to produce a more
stable estimation. Users can further perform permutation tests to calculate the significance of the model in ROC Tester.
Benchmark testing using different simulated random datasets shows that approach does not overfit. The image
below shows some results for classification. The performance is evaluated by measuring the area under ROC curves,
error rates, and permutation tests.
The program uses non-parametric re-sampling based approach for calculating the confidence intervals. In either bootstrap (classical)
or MCCV (multivariate) approach, the data are re-sampled many times, with the AUC/pAUC calculated each time. Confidence
intervals are then estimated by simply sorting the data and taking the percentiles to our desired confidence interval bounds.
For instance, in 1000 resampling, 95% CIs will be the 25th and 975th values of the sorted AUC values. For details, please
refer to the paper by T.A. Lasko et al..
The permutation tests are performed only for the best model (assuming it containing x features)
using the following procedures:
Re-assign the phenotype labels randomly to each sample;
Perform random sub-sampling cross validation. Within each CV, perform feature ranking and select the
top x features to build a classifier using the 2/3 training data, which is then tested on the 1/3 hold-out data.
Note, the procedure is repeated only 3 times to save computational time. The performance
is then recorded
Compare the performance using the original phenotype label and the permuted labels
The performance measures of the permutated data will usually form a normal distribution.If the performance score
of the original data lies outside the distribution, then the result is significant. An empirical p value is also usually
given. For instance, in 1000 permutation tests, if none of the results are better than the original one, the p value
will be reported as p < 0.001.
Note, we have noticed that, when the data does not contain good signals (i.e. AUROC is < 0.65),
or when the sample size is small with outliers, the permutation results could be unstable (change from
significant to not significant) in different runs. This is unavoidable due to the random subsampling nature of
the procedure. Users should always check for performance measures and outliers when this happens.
, where α is the intercept term, β is the regression coefficient estimated from the sample dataset,
X_{i}; is the set of covariate (concentration) values,
and the P = Pr(y=1|x) is the probability of the outcome of interest (e.g. disease case) as the below.
P = Pr(Y=outcome of interest | X=x) = exp ( α + βx ) / (1 + exp ( α + βx ) )
The cutoff value of the model will be compared with this predicted probability value if P > cutoff-value,
then the sample can be classified as the interest outcome (usually assigned to 1). The best cutoff value will
be selected with ROC analysis.
The results of logistic regression including plots and tables are generated using MCCV (100-fold cross-validation) as other methods.
In addition, it produces the addition result with 10-fold cross-validation in the "LR model (10-fold CV)" tab.
In order to get the best parsimony model, it may need to try with several combination of compounds/variables,
because sometimes it does not work in automatically. However, the best model can be created
using the statistics (especially, Lasso Frequency) of variable selection (Builder).
The following is the general procedure for building a logistic regression model:
Select variables which are most significant using the statistics table (feature ranking) in Builder step.
Perform logistic regression analysis with selected variables and get the modeling results.
Compare the performance using the accuracy/performance plots and tables (auc, specificity, and sensitivity).
Repeat these procedures until you find a satisfied model.
Regarding the class/label of a input data, it is recommended to use the number (0 or 1) instead of string labels.
Usually, 1 is used for the case and 0 is for the control.
Power is the probability of detecting an effect, given that the effect is really there.
For instance, if a study comparing the two groups (control vs. disease) has power of 0.8
and assuming we can conduct the study many times, then 80% of the time, we would get a statistically
significant difference between the two groups, while 20% of time we run this experiment,
we will not obtain a statistically significant effect, even though there really is an effect in reality.
In practice, researchers are most interested in knowing the sample size (number of subjects) required in
order to obtain sufficient power. Power analysis often refers to calculating the sample size required
to detect an effect of a given size with a given degree of confidence.
There are three major factors: (a) effect size which is usually defined as the difference of two group means
divided by the pooled standard deviation. When all others are equal, a larger the effect size will lead to more power;
(b) degree of confidence which is usually the p value cutoff (alpha) for statistical significance. When all others
are equal, there will be reduced power if we require a very high degree of confidence; (c) the sample size - more
samples will in general increase power. In many cases, the sample size is our interest for a given power (i.e. 0.8)
The power calculation is based on two assumptions: 1) the effect is indeed present in the data, and 2) the test statistic
follow a normal or near normal (Students' t) distribution. The diagnostic plot is to help assess whether these two assumptions
are reasonably met. In particular, the panels on the left shows the distribution of the test statistics (t-stat) as a histogram
(top panel) and as QQ plot (bottom panel); the panels on the right shows the distribution of the raw p values as histogram (top panel)
and the bottom panel shows the sorted p values against their ranks. We expect a large proportion of p values will be close to
zero (indicating strong effect) - i.e. the histogram should be left-shifted and the bottom graph should shift to top left corner.
It is not appropriate to directly apply the traditional univariate approach on power analysis to omics datasets which are
characterized by high dimensions with 100s to 1000s of features and a relative small number of samples.
Special care needs to be taken in the estimation of effect size as well as the selection of the confidence level.
The power analysis in MetaboAnalyst is based on the R package
SSPA that was originally developed for power calculation for gene expression datasets. For a given pilot data,
the method estimates the average power for a given false discovery rate. For more details, please refer to the original
publication by M van Iterson et al
With Version 2 of the MS Peaks to Paths module, users can now upload retention time information to perform pathway analysis.
Version 2 will only be available if user's data contains a retention time ("rt" or "r.t") column. Retention time is used to
move pathway analysis from the "Compound" space to "Empirical Compounds" space (details below in "How are Empirical Compounds calculated?").
The inclusion of retention time will increase the confidence and robustness of the potential compound matches. Another difference is that currency compounds
are removed directly from the user's selected pathway library, versus removed from potential compound hits during the permutations.
Empirical Compounds are intermediaries between m/z features and compounds. The steps for how they are formed are as follows:
As in version 1, all m/z features are matched to potential compounds considering different adducts. Then, per compound, all matching m/z features
are split into Empirical Compounds based on whether they match within an expected retention time window. The retention time window (in seconds)
is calculated as the maximum retention time * 0.02. This results in the initial Empirical Compounds list.
Next, Empirical Compounds are merged if they have the same m/z, matched form/ion, and retention time. This results
in the merged Empirical Compounds list.
Finally, if primary ions are enforced, only Empirical Compounds containing at least 1 primary ion are kept.
Primary ions considered are 'M+H[1+]', 'M+Na[1+]', 'M-H2O+H[1+]', 'M-H[-]', 'M-2H[2-]', 'M-H2O-H[-]',
'M+H [1+]', 'M+Na [1+]', 'M-H2O+H [1+]', 'M-H [1-]', 'M-2H [2-]', and 'M-H2O-H [1-]'. This results in the final Empirical Compounds list.
Next, pathway libraries are converted from "Compound" space to "Empirical Compound" space. This is done by converting
all compounds in each pathway to all Empirical Compound matches. Then the mummichog/GSEA algorithms work as before to calculate
pathway enrichment.
Users can upload either peak lists (details below) or a peak intensity table.
Version 1 Peak Lists: The MS Peaks to Pathways module accepts either a three column table containing the m/z features, p-values, and statistical scores,
a two-column table containing m/z features and either p-values or t-scores, or a one-column table ranked by either
p-values or t-scores. All inputted files must be in .txt format. If the input is a three column table, both the mummichog
and GSEA algorithms (and their combination) can be applied. If only p-values (or ranked by p-values) are provided, then only the mummichog algorithm
will be applied. If only t-scores (or ranked by t-scores) are provided, then only the GSEA algorithm will be applied.
Version 2 Peak Lists: With Version 2 of the MS Peaks to Pathways module, retention time can be included as a new column with the "rt" or "r.t" heading.
The maximum number of columns that can be uploaded is now 5: "m.z", "r.t", "p.value", "t.score" and "mode".
The peak intensity table is a new addition to the MS Peaks to Paths module that allows for a different way for
users to upload their datasets. With the peak intensity table, it is processed (like in the Statistical Analysis module) to create the
peak list, and then the mummichog algorithm is performed on that. The only difference is in the visualization of the results. There is now an
interactive heatmap that shows the expression of each m/z feature per sample. By default, the m/z features are clustered by p-value.
As the visualization is interactive, use your mouse or touchpad to select a part of the heatmap
by dragging over a section of the heatmap on the left-side panel (Overview). For instance, this could be a portion that shows an interesting pattern that you wish to further investigate.
The selected part will appear in the "Focus View", showing a zoomed-in heatmap. To zoom in or out, change the "Resolution" on the top menu (High to zoom-in, Low to zoom-out).
The colors of the heatmap can also be customized using the "Colors" drop-down menu on the top menu. Finally, using the right-hand panel (Enrichment Analysis),
perform the mummichog algorithm either on the whole dataset (select entire heatmap), or on a subset of m/z features. Note that if MetaboAnalyst contains
multiple libraries for the select organism, the options will appear in the "Database" drop-down menu for users to explore different pathway libraries.
This module is based on the mummichog algorithm (Li et al.) to perform pathway analysis of untargeted metabolomics data.
Briefly, users provide a list of m/z features and a p-value for each feature, highlighting significantly different features between two groups.
Three lists are then drawn from this initial list, (1) Lsig, which is the list containing only all significant m/z features (determined by the user selected p-value cutoff), (2) Lref, which is
the list of all m/z features, and (3) Lperm, which is a list of randomly drawn m/z features from Lref, but the same length as Lsig.
The next steps are as follows:
A list of randomly drawn m/z features are drawn from Lref to create Lperm. The m/z features are then mapped to potential metabolites,
considering different adducts, protons, etc.
The list of potential compounds are then mapped to the user’s selected library of pathways, and a p-value is calculated per pathway.
Steps 1 and 2 are repeated many times to compute the null distribution of p-values (modeled as a gamma distribution).
The Lsig is mapped to potential metabolites for pathway enrichment analysis, and the resulting p-values (Fisher’s or Hypergeometric, and EASE scores)
per pathway for the Lsig compounds are then adjusted for the null-distribution.
For the negative ion mode, the adducts used are: M-H[-], M-2H[2-], M(C13)-H[-], M(S34)-H[-], M(Cl37)-H[-],
M+Na-2H[-], M+K-2H[-], M-H2O-H[-], M+Cl[-], M+Cl37[-], M+Br[-], M+Br81[-], M+ACN-H[-], M+HCOO[-], M+CH3COO[-], and M-H+O[-].
For the positive ion mode, the adducts used are: M[1+], M+H[1+], M+2H[2+], M+3H[3+], M(C13)+H[1+], M(C13)+2H[2+], M(C13)+3H[3+], M(S34)+H[1+],
M(Cl37)+H[1+], M+Na[1+], M+H+Na[2+], M+K[1+], M+H2O+H[1+], M-H2O+H[1+], M-H4O2+H[1+], M-NH3+H[1+],
M-CO+H[1+], M-CO2+H[1+], M-HCOOH+H[1+], M+HCOONa[1+], M-HCOONa+H[1+], M+NaCl[1+], M-C3H4O2+H[1+], M+HCOOK[1+], and M-HCOOK+H[1+].
By default, the MS Peaks to Paths module considers these metabolites as currency: 'C00001', 'C00080', 'C00007', 'C00006', 'C00005', 'C00003',
'C00004', 'C00002', 'C00013', 'C00008', 'C00009', 'C00011', 'G11113', 'H2O', 'H+', 'Oxygen', 'NADP+', 'NADPH', 'NAD+', 'NADH', 'ATP',
'Pyrophosphate', 'ADP', 'Orthophosphate', and 'CO2'.
The original mummichog algorithm implements an over-representation analysis (ORA) method to evaluate pathway-level enrichment
based on significant peaks. Users need to specify a pre-defined cutoff based on either t-statistics or fold changes.
An alternative approach is the Gene Set Enrichment Analysis (GSEA)
method, which is widely used to test for enriched functions from ranked gene lists. Unlike ORA, GSEA considers the overall ranks of features without using
a significance cutoff. It can therefore detect subtle and consistent changes which could be missed from using ORA methods.
The MS Peaks to Paths module uses the (Fisher's method)
for combining the mummichog and GSEA p-values. It takes the raw p-values per pathway to perform p-value combination.
This module contains both genome-scale metabolic models from the original mummichog python package, KEGG pathway
libraries for 21 different organisms, and soon SMPDB pathway libraries for 8 organisms. The genome-scale metabolic model of Danio rerio was
manually curated using the KEGG zebrafish model, as well as the human BiGG and Edinburgh Models, and is designated with a [MFN] at the end of its name (Li et al. 2010).
The human genome-scale metabolic model was also manually curated and originates from a number of sources (KEGG, BiGG, and Edinburgh Model) and has [MTF] at the end of its name.
The remaining four genome-scale metabolic models were directly derived from BioCyc, denoted with [BioCyc].
The KEGG pathway libraries are designated with [KEGG], and the SMPDB pathway libraries will be designated with [SMPDB].
Please select the library closest to your organism.
This module contains both genome-scale metabolic models and KEGG pathway libraries, thereby giving users the power to decide which
library to perform their pathway analysis. For instance, for humans, users can either use the manually curated genome-scale metabolic model [MTF], the BioCyc metabolic model [BioCyc],
or the KEGG pathway library [KEGG].
Users have the freedom to select any p-value cut-off to differentiate between significantly and non-significantly enriched m/z features. In
general, a cut-off of 0.05 is considered fair, and cut-offs less than 0.05 are considered more stringent. Users should note that their selection of p-value cutoff
could affect the results of this module. Changing the p-value cutoff may alter which compounds are included in the Lsig and thereby modify the output of the mummichog algorithm.
We therefore recommend that users explore different p-value cutoff values during their analysis, though the algorithm was shown to be robust in the original manuscript
(Li et al. 2013).
Users may find that their gamma adjusted p-values are NaN. This is because the permutation p-values could not be Gamma-modeled, and therefore could not be adjusted.
This arises because the number of permutations with significance is too close to zero.
The results of this module consists of the total number of hits, the raw p-value (Fisher’s), the Gamma-adjusted p-value (for permutations), and the Empirical p-value per pathway.
The Gamma-adjusted p-values are the Fisher's p-values calculated using Lsig that are adjusted for the null distribution of permutation p-values. Meanwhile, the
Empirical p-values consist of the number of times the p-values using the permuted data were better than the p-values using the significant data, divided
by the number of permutations.
The algorithms used by the MS Peaks to Paths module perform based on the mass accuracy of the instrument (i.e. is not platform specific).
However, we recommend that users use high-resolution MS (such as Orbitrap or other Fourier transform mass spectrometry family), which will significantly reduce false positives and improve the
pathway activity predictions.
The output of this module first of a plot summarizing the pathway analysis results. Below the plot consists of a table identifying the top-pathways that are enriched in your m/z data.
The table is ordered from the most significant to the least significant pathways and contains the number of hits and calculated statistics.
A second table is available for download as a link that contains a list of matched metabolites from the user’s uploaded list of m/z features.
A third output is the metabolic network visualization, which is based on the KEGG global metabolic network (ko01100).
The metabolic network visualization is based on the KEGG global metabolic network (KEGGscape) and has been manually curated. The
metabolites are represented as nodes (circles), and their size is based on their associated p-value. More significantly enriched metabolites
are bigger then less-significantly enriched metabolites. The aim of this visualization is to provide a global metabolic context for the
significantly enriched mapped metabolites, as well as provide an in-house option for visual exploration of the results from the untargeted
pathway analysis. Users have the option to colour each specific pathway to their own preference, change the background colour, switch
the view style, and download the image as a PNG or SVG file.
Currently, MetaboAnalyst supports mummichog version 1.0. In the coming time, we will work to
upgrade to mummichog version 2.0, which will use retention-time in the analysis.
The networks are generated by first mapping the metabolites, genes or both to the
underlying gene-metabolite-disease interactions database. A search algorithm is then performed to identify first-order neighbours
(e.g. metabolites that directly interact with a given metabolite) for each of these mapped metabolites ("seeds").
The resulting nodes and their interaction partners are returned to build the subnetworks.
The above figure shows an example of a generated metabolites-genes interaction network.
MetaboAnalyst uses a comprehensive gene-metabolite-disease interaction data based on MetPriCNet.
The data contains interaction data mainly extracted from published literature. The database currently contains 11,502 genes, 1,900 metabolites,
132 diseases making a total of 139570, 519, and 78200 interactions for gene-metabolite, metabolite-disease and metabolite-metabolite networks respectively in human.
The gene-chemical and chemical-chemical associations for the gene-metabolite and metabolite-metabolite networks, respectively, were extracted from STITCH, such
that only highly confident interactions are used. Most of associations in STITCH are based on co-mentions highlighted in PubMed abstracts including reactions from similar chemical structures and similar molecular activities.
The associations for the metabolite-disease network were obtained from HMDB. Disease association have been added to HMDB via the Human Metabolome Project’s literature curation team.
The Metabolite-Gene-Disease network is an integration of gene-metabolite, metabolite-disease and gene-disease interaction networks.
DSPC stands for debiased sparse partial correlation. It is an algorithm designed to build
partial correlation networks based on a graphical LASSO model using a data-driven approach
without mapping to any known metabolites interaction databases
(Jankova, 2015).
DSPC can distinguish between direct and indirect associations and provide insights into the structure of dependencies
between metabolites. The main assumption of this DSPC method is that the amount of real
metabolites interactions is far smaller than the sample size (i.e., the real partial correlation
network among the metabolites is sparse). Under this assumption, DSPC reconstructs a network
and calculates partial correlation coefficients and p-values for each pair of metabolic features.
Thus, DSPC makes it possible to uncover connectivity patterns among a large number of
metabolic features by using fewer samples
(Basu et al., 2017).
The inferred results can be visualized as weighted
networks, where nodes represent the metabolic features, and the edges depict the correlations
among them.
The above figure shows an example of a DSPC network, where the red edges represent positive correlations, while the blue edges
represent negative correlations
The visualization is actually limited by the performance of users' computers and screen resolutions.
Too many nodes will make the network too dense to visualize and the computer slow to respond.
We recommend limiting the total number of nodes to between 200 ~ 2000 for the best experience.
For very large networks, please make sure you have a decent computer equipped with a modern browser
(we recommend the latest Google Chrome).
Important nodes can be identified based on their position within the network.
The assumption is that changes in the key positions of a network will have more
impact on the network than changes on marginal or relatively isolated positions.
MetaboAnalyst provides two well-established node centrality measures to estimate node
importance - degree centrality and betweenness centrality.
In a graph network, the degree of a node is the number of connections it has to other nodes.
Nodes with higher node degree act as hubs in a network. The betweenness centrality measures
the number of shortest paths going through the node. It takes into consideration
the global network structure. For example, nodes that occur between two dense
clusters will have a high betweenness centrality even if their degree centrality values are not high.
Note, you can sort the node table based on either degree or betweenness values by
double clicking the corresponding column header.
Yes. For query metabolites, you can test for enriched KEGG pathways. For query genes only,
you can test enriched gene ontologies or pathways (KEGG/Reactome).
To do so, first select and highlight query genes using the Highlight Color toolbar on the top
left (you may have to highlight twice for upregulated and downregulated genes respectively); or you can use the
Hub Explorer and select queries from the node table. After that, select a functional catergory
from the Function Explorer section, and click the Submit button.
The enrichment analysis is to test whether any functional metabolites/genes from the user selected library
are significantly enriched among the currently highlighted nodes within the network. MetaboAnalyst uses
hypergeometric tests to compute the enrichment p values.
In the default network generated by MetaboAnalyst, the size of the nodes are based on their degree values,
with a big size for large degree values. The color of nodes are proportional to their betweenness centrality values.
When user switches to Expression View, the color will be based on their expression values (if available).
Yes. You can delete nodes (with their associated edges) from the current network. First
you need to select the nodes from the Node Table in the left pane. Then click the Delete
button at the top of the node table. A confirmation dialog will appear asking if you really want to
delete these nodes. Note, this action will trigger network re-arrangement, especially if
hub nodes are removed. In addition, "orphan" nodes may be produced due to removal.
These nodes will also be excluded during re-arrangement.
Yes, after you have performed functional enrichment analysis, the over-represented themes will be displayed in
the table below. By double clicking on a pathway name, all metabolite/gene members
of the pathway will be displayed as highlighted nodes within the current
network.
The networks are generated by first mapping the genes/metabolites to the underlying gene-metabolite-disease interaction database. A search algorithm is then performed to identify first-order neighbours (proteins that directly interact with a given protein) for each of these mapped proteins ("seeds"). The resulting nodes and their interaction partners are returned to build the subnetworks.
The above approach will typically return one giant subnetwork ("continent") with multiple smaller ones ("islands"). Most subsequent analyses are performed on the continent. Note, networks with less than 3 nodes will be excluded.
Meta-analysis is a type of statistical technique used to integrate multiple independent datasets that have been
collected to study same or similar experimental conditions, in order to obtain more robust biomarkers.
By combining multiple data sets, the approach can increase statistical power (more samples) and reduce potential
bias.
A key concept in meta-analysis is that it is generally unadvisable to directly combine different independent datasets
(i.e. merge them into a single large table) and analyze them as a single unit. This is due to potential batch effects
associated with each datasets, which can completely overwhelm the biological effects. This issue has been well-studies in
microarray experiments generated from different platforms. It is expected the issue could be more severe due to the lack
of standardization in metabolomics.
Instead, meta-analysis is usually computed based on summary statistics (p values, effect sizes, etc.) to identify
robust biomarkers. The meta-analysis module in MetaboAnalyst was developed to support these approaches. The results can
be visualized using heatmap to explore the patters across different studies.
Here are some basic rules for data collection for meta-analysis in MetaboAnalyst:
These data sets were collected under comparable experimental conditions,
and/or the underlying experiments share the same hypothesis or are held to
have the same mechanistic underpinnings;
Only two-group comparisons are supported at the moment (i.e., control vs disease);
These datasets must share the same type of IDs so that the majority of these feature will match;
It is best to keep all data on the same scale or range (i.e. both raw or normalized in the same way).
It is generally preferred to compare at log scale. MetaboAnalyst offers boxplots and log normalization
to help facilitate this process.
Calculating and combining P-values from multiple studies has long been used in the meta-analysis of microarray data, and
we are now using these approaches for metabolomics data. These are simple to calculate and flexible to use. Fisher’s method and Stouffer’s method are two popular approaches.
They have similar levels of performance and can be easily interpreted whereby larger scores reflect greater differential abundance.
The main difference is that weights (i.e. based on sample sizes) are incorporated into the calculation in Stouffer’s method,
whereas Fisher’s method is known as a weight-free method. Note that larger sample size does not warrant larger weights, as the quality
of each study can be variable. Users should choose to apply Stouffer’s method only when all studies are of similar quality.
Vote counting is the most primitive but simplest and most intuitive method of meta-analysis. In this approach, significant features are first selected
based on some criteria (e.g. adjusted P < 0.05 and same direction of fold changes) for each data set.
The vote for each features can then be calculated by counting the total
number of times it occurs as significant across all data sets. This method is statistically inefficient and should be considered as a last resort in situations when other meta-analysis methods cannot be applied.
In this approach, different data sets are merged into a mega-data set and then analyzed as if all data sets were derived from a single experiment.
This approach ignores the inherent bias and heterogeneity of data sets from different sources. Many other factors (experiment protocols,
technical platforms, raw data processing procedures and so forth) can potentially contribute to the observed differences.
This approach should only be used when data sets are very similar (i.e. from the same lab and same platform without batch effects).
Metabolomics is becoming an increasingly popular tool for biological research, however while individual studies may
identify certain results, such results may not be reproducible in other independent studies of the same biological
questions due to low sample size, sample heterogeneity, the type of LC-MS platform used, or metrics used for interpreting results.
Meta-analysis, which is the combination of findings from independent studies - can be used to overcome such limitations and
ultimately increase the power, precision, and generalizability of a study.
With the increased use of metabolomics and global efforts towards science transparency, the amount of publicly available metabolomics
data deposited in dedicated metabolomics repositories such as
Metabolomics Workbench, MetaboLights and
OmicsDI has grown tremendously.
The potential for researchers in the metabolomics community to enhance their findings with publicly available data is immense,
but little effort has been applied for the meta-analysis of untargeted metabolomics data. Such a method would also reduce the bias
individual studies may carry towards specific sample processing protocols or LC-MS instruments.
High-resolution LC-MS has become the dominant platform for global metabolomics. The typical LC-MS workflow starts with data acquisition,
followed by data pre-processing, feature selection, compound identification and then pathway analysis. However, peak identification
requires significant efforts and is a major challenge for downstream functional interpretation. One strategy is to completely bypass
bottleneck of metabolite identification and directly infer functional activity from MS peaks by leveraging the collective knowledge of
metabolic pathways and networks. This is the idea of the mummichog algorithm. Essentially what happens is that a list of significant
peaks, for instance 169.0814 m/z is matched to the monoistopic mass of potential compounds considering different adducts/isotopes (e.g.
Cysteic acid (169.0045) or Norepinephrine (169.0739)).
These putatively matched compounds are then mapped onto known biological pathways. If a list of truly significant features reflect biological
activity, then the representation of these true metabolites would be enriched on functional structures (Figure 1), while false matches
would be distributed at random (Figure 2). We have extended this idea to support the meta-analysis of untargeted metabolomics data.
Figure 1. List of significant peaks mapped onto a metabolic network.
Figure 2. List of randomly selected peaks mapped onto a metabolic network.
Overall, the workflow supports meta-analysis of MS peaks at either a) the pathway level or b) the compound/empirical compound level.
Each level of integration has their own merits, for instance at the pathway level, compounds do not have to be matched across all
studies, whereas the compound-level integration is better at bringing out weaker signals. The workflow also considers data
heterogeneity, meaning that the accuracy of the LC-MS instrument and the MS ionization mode are all taken into account when
performing putative metabolite annotation. Ultimately, the goal of this workflow is to enable the reuse/repurposong of multiple
datasets to identify a robust meta-signature for the phenotype in question.
One of the bottlenecks for meta-analysis of untargeted metabolomics studies is the heterogeneity in the sample processing, instrumentation used,
and even samples themselves that make it difficult to directly compare studies. Another significant hurdle is that unlike targeted metabolomics
where a single metabolite can be quantified, in untargeted metabolomics data each metabolite can be linked to multiple peaks, each with their
own expression levels. The approach we have adopted is to therefore combine p-values, effect sizes, or both from all peaks across all studies
linked to that metabolite to obtain an overall significance level for that metabolite. The workflow supports the use of the original mummichog algorithm,
the GSEA algorithm, or both ("integ") to perform the meta-analysis. The meta-analysis can be performed by combining the statistical significance of
proposed metabolites before pathway analysis (compound/empirical compound level), or after individual pathway analysis.
Using the compound or empirical compound meta-analysis, it will first perform the putative peak annotation on each individual study.
Next, the peak annotations are united by default, meaning that only the compounds/empirical compounds found across all studies are kept.
Following this step, the list of significant features are updated (see Figure 3 below).
Figure 3. Workflow for compound or empirical compound-level meta-analysis.
For the mummichog algorithm and "integ" options, a list of significant features are required. Therefore the next step is to combine m/z level
statistics across all studies for each compound/empirical compound to obtain the unified list of significant features.
Users can either integrate p-values (as if from a T-test), effect sizes, or both to select which m/z features should be used as input (Figure 4).
This list is then used as input for pathway activity prediction.
Figure 4. Workflow for how the unified list of significant features are created.
For the pathway-level meta-analysis, each study undergoes the typical steps of calculating m/z level statistics (if not already done),
putative metabolite annotation, followed by pathway activity prediction. Once this is complete for all studies, all pathway results
are combined to create a unified matrix of pathway-level results (keeping only pathways found across all-studies). Pathway activity
scores are then combined using one of several p-value integration methods.
Figure 5. Workflow for pathway-level meta-analysis.
Pathway-level integration should be used when the studies are independent of eachother - meaning they were performed using different
LC-MS instruments, using different extraction methods etc. This is because while lists of features from multiple independent studies
looking at the same disease often have little overlap, the use of pathway analysis improves the biological consistency across studies
(PMID: 20410053). Towards this, the Pathway-level integration does not require that compounds be matched across all studies. In comparison,
the Compound or Empirical-Compound level integration should be used when the data are more homogeneous, for instance data comes
from the same lab but different batches or different columns on the same samples and the same instrument. By default, the algorithm
will keep only compounds that match across all studies - for instance if the datasets were all the same column, same phenotype in
question, same instrument, we would only want to focus on compounds that are merried by all datasets. However, if we had different
columns but the same samples, then we would want to keep all putative compounds/empirical compounds for predicting pathway activity.
The meta-analysis of MS peaks framework supports various methods for p-value combination, including Fisher's, Edgington's, Stouffer's, Vote Count,
Minimum P-value or Maximum P-value. The choice of statistical method depends on the goal of the meta-analysis as well as the
data structure itself. Briefly, Fisher's statistic, the Minimum P-value or Maximum P-value are traditionally used methods for
combining p-values.
The Fisher's method is known to be more sensitive than other methods to very small (or very large) p-values which can result in a
high false positive rate. For instance, a single very small p-value can lead to a significant combined p-value. "Fisher’s method employs
the log product of individual P-values and thus, a single P-value of zero in one individual case will result in a combined P-value of zero
regardless of the other P-values." - (PMID: 26471455). This method should be followed when the data follows a Chi-squared distribution (positive values).
It is also not recommended for use for meta-analysis with( >5 datasets).
The Minimum P-value should be used to answer the question which metabolites are changed across at least one study? In this case,
the minimum p-value among all studies is taken as the combined p-value.
The Maximum P-value is the most restrictive method and should be used to answer the question of which genes are consistently changed across all studies?
In this case, the maximum p-value among all studies is taken as the combined p-value.
Stouffer's method attributes different weights to the p-values when combining them in a meta-analysis. This method should be applied when the data follows a Gaussian curve.
This method is not as sensitive as Fisher's to very small or very large p-values.
Edgington's method uses the sum of the p-values and unlike Fisher's method, is not sensitive to small p-values. It best fits circular data and it has been noted that using this method,
a single large p-value can overwhelm small p-values (PMID: 11788962).
The vote counting method is limited to answering the question is there any evidence of an effect? One issue with this method is that it compares the number of
"yes" studies to the number of "no" studies. The "yes" and "no" if often an arbitrary statistical cutoff that can bias the outcome. Secondly, it does not
apply any weights to the studies, therefore the effect of a study with 1000 samples has the same weight as a study with 10 samples. While this method is
simple to implement and interpret, this method should only be used when standard meta-analysis methods cannot be used. Meanwhile, at least 5 datasets are required for vote counting to reach the statistical confidence.
To read more on vote counting: http://assets.press.princeton.edu/chapters/s10045.pdf
The effect size is a way to quantify the strength of the difference of phenotype in question between two groups. One of the most recommended methods
for calculating the effect size is the Hedge's g. This method controls for bias in small studies to overestimate the effect. Cohen's d.
To combine the effect sizes across the studies, there are two possible methods: 1) the fixed effects model or 2) the random effects model.
The fixed effects model is a linear model that assumes that the different studies share an underlying common true effect and that observed
differences are due to sampling error. This model should therefore be used when the datasets are rather homogeneous (e.g. samples come from the same population).
The random effects model assumes that the true effect size follows a distribution, and hence is different between different studies.
The random effects model is more commonly used as data are more likely to be heterogeneous (e.g. different LC-MS platform used or batch effects).
A fixed effects model will select genes with the strongest effects amongst the studies, while random effects models will select genes with the
strongest average effect across the studies.