GCMS-ID: a webserver for identifying compounds from gas chromatography mass spectrometry experiments (2024)


Julia Wakoli

Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada

Afia Anjum

Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada

Tanvir Sajed

Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada

Eponine Oler

Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada

Fei Wang

Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada

Vasuk Gautam

Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada

Marcia LeVatte

Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada

David S Wishart

Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada

Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada

Department of Laboratory Medicine and Pathology, University of Alberta, Edmonton, AB T6G 2B7, Canada

Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, AB T6G 2H7, Canada

To whom correspondence should be addressed. Tel: +1 780 492 8574; Email: dwishart@ualberta.ca

https://orcid.org/0000-0002-3207-2434

Nucleic Acids Research, gkae425, https://doi.org/10.1093/nar/gkae425

Received: 01 April 2024
Revision received: 28 April 2024
Accepted: 07 May 2024
Published: 23 May 2024

Julia Wakoli, Afia Anjum, Tanvir Sajed, Eponine Oler, Fei Wang, Vasuk Gautam, Marcia LeVatte, David S Wishart, GCMS-ID: a webserver for identifying compounds from gas chromatography mass spectrometry experiments, Nucleic Acids Research, 2024, gkae425, https://doi.org/10.1093/nar/gkae425


Abstract

GCMS-ID (Gas Chromatography Mass Spectrometry compound IDentifier) is a webserver designed to enable the identification of compounds from GC–MS experiments. GC–MS instruments produce both electron impact mass spectra (EI-MS) and retention index (RI) data for as few as one to as many as hundreds of different compounds. Matching the measured EI-MS, RI or EI-MS+RI data to experimentally collected EI-MS and/or RI reference libraries allows facile compound identification. However, the number of available experimental RI and EI-MS reference spectra, especially for metabolomics or exposomics-related studies, is disappointingly small. Using machine learning to accurately predict the EI-MS spectra and/or RIs for millions of metabolomics and/or exposomics-relevant compounds could (partially) solve this spectral matching problem. This computational approach to compound identification is called in silico metabolomics. GCMS-ID brings this concept of in silico metabolomics closer to reality by intelligently integrating two of our previously published webservers: CFM-EI and RIpred. CFM-EI is an EI-MS spectral prediction webserver, and RIpred is a Kovats RI prediction webserver. We have found that GCMS-ID can accurately identify compounds from experimental RI, EI-MS or RI+EI-MS data through matching to its own large library of >1 million predicted RI/EI-MS values generated for metabolomics/exposomics-relevant compounds. GCMS-ID can also predict the RI or EI-MS spectrum from a user-submitted structure or annotate a user-submitted EI-MS spectrum. GCMS-ID is freely available at https://gcms-id.ca/.


Introduction

Gas chromatography–mass spectrometry (GC–MS) is a commonly used analytical method to separate, identify and quantify small molecules in various mixtures and matrices (1). In GC–MS, the compounds of interest are often chemically derivatized, separated by a GC column, ionized and fragmented, and the resulting ion fragments are then measured by MS to determine their mass-to-charge ratios (m/z). Several ionization methods can be used in GC–MS, with electron impact ionization (EI) being by far the most common method. EI at 70 eV is particularly useful for maximizing energy transfer, yielding the strongest possible ionization and most informative molecular fragmentation. Ion detection is normally done via a single quadrupole mass spectrometer with unit mass (1 amu) resolution, although higher resolution (time-of-flight or TOF) instruments are available. Compound identification by EI-MS is most commonly done by comparing the measured EI-MS spectrum of an ‘unknown’ compound to an experimentally collected EI-MS spectral library of thousands of pure compounds. Including information about the normalized GC retention time (also called the Kovats retention index or RI) of the unknown and comparing it to known RI values can provide additional confidence about the compound's identity.
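To illustrate what a 'normalized GC retention time' means in practice, the following minimal sketch computes a retention index by linear interpolation between the retention times of two bracketing n-alkanes, which is the form generally used for temperature-programmed runs (the original Kovats definition for isothermal runs uses logarithms of adjusted retention times instead). The function name, the dictionary layout and the retention times are hypothetical and are not taken from GCMS-ID.

```python
from bisect import bisect_right


def linear_retention_index(t_x, alkane_times):
    """Return the linear retention index of an analyte eluting at time t_x.

    alkane_times maps the carbon number of each n-alkane standard to its
    retention time (same units as t_x); times must increase with carbon number.
    """
    carbons = sorted(alkane_times)
    times = [alkane_times[c] for c in carbons]
    # Index of the last alkane eluting at or before the analyte.
    i = bisect_right(times, t_x) - 1
    if i < 0 or i >= len(carbons) - 1:
        raise ValueError("analyte is not bracketed by two alkane standards")
    n, n_next = carbons[i], carbons[i + 1]
    t_n, t_next = times[i], times[i + 1]
    # Each carbon of the alkane ladder is worth 100 index units.
    return 100.0 * (n + (n_next - n) * (t_x - t_n) / (t_next - t_n))


# Hypothetical retention times (minutes) for decane, undecane and dodecane.
standards = {10: 8.0, 11: 10.0, 12: 12.5}
print(linear_retention_index(9.0, standards))
```

Because the index is anchored to the alkane ladder rather than to absolute time, values measured on different instruments with the same stationary phase become comparable, which is what makes RI matching against a library possible.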

A number of databases containing experimentally measured EI-MS and/or RI data have been compiled and are commercially available. These include the NIST23 Mass Spectral Library (https://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:ridatabase), which contains EI-MS spectra for 153472 compounds and Kovats RI data for 180618 compounds, the 2023 Wiley Registry of Mass Spectral Data (https://www.wiley.com/en-us/Wiley+Registry+of+Mass+Spectral+Data+2023-p-9781119736325), which contains EI-MS spectra for 741000 compounds and Kovats RI data for 738400 compounds, as well as several specialized GC–MS libraries covering EI-MS and RI data for fragrances, drugs and pesticides that are produced or sold by Shimadzu and Wiley. Several GC–MS databases are also freely available, including the Golm GC–MS library (http://gmd.mpimp-golm.mpg.de/download/), which contains EI-MS data on more than 11000 compounds and RI data on more than 9000, the Massbank EU library (2), which contains EI-MS data on nearly 12000 compounds, BinBase (3), which contains GC–MS data on several thousand compounds and mixtures, the Scientific Working Group for the Analysis of Seized Drugs or SWGDRUG MS (https://www.swgdrug.org/) library, which has EI-MS data on 3500 drug or drug-related compounds, and the Cayman Spectral Library (https://www.caymanchem.com/forensics/publications/csl), which has EI-MS data on 2000 organic compounds.

However, given that >150 million compounds are known (https://www.cas.org/about/cas-content), and numerous chemical (silyl, acyl, alkyl) derivatives are possible for each of these compounds, the number of experimentally measured EI-MS spectra (with or without Kovats RI data) is only a tiny fraction of what is needed to perform library-based compound identification. Furthermore, in the fields of metabolomics, exposomics and natural products research, the percentage of known compounds with experimentally measured EI-MS spectra and/or Kovats RI data is typically <2%. One approach to overcome the paucity of experimental EI-MS or Kovats RI data is to attempt to predict EI-MS spectra and RI data for the millions of known compound structures. If the EI-MS spectra and/or RI data could be predicted accurately enough, then these predicted spectral libraries could serve as substitutes for the experimental libraries. This would allow compounds to be identified by simply matching observed spectral or RI data to predicted spectral or RI data. This approach is commonly called in silico or reference-free metabolomics (4,5).

Over the past decade, several EI-MS fragmentation predictors or EI-MS spectrum predictors have been developed. Some, such as QCEIMS (6), use a combination of quantum mechanics (QM) and molecular dynamics to calculate EI-MS spectra. While QM methods are quite accurate, they are very slow and computationally expensive. Several other, much faster approaches have been developed that use combinatorial or machine learning approaches, such as MetFrag (7) and CFM-EI (8). These predictors use the structure of an input molecule to predict the compound fragmentation probabilities, the corresponding EI-MS spectra and the structures/formulae of the fragments. More recently developed tools, such as the Neural Electron–Ionization Mass Spectrometry (NEIMS) predictor (9), make use of graph neural networks (GNNs) to predict EI-MS spectra. The NEIMS predictor achieves better accuracy than CFM-EI (as measured by cosine similarity) (10; see the GCMS-ID website under ‘About’, and ‘NEIMS Performance’ in the dropdown menu), but it cannot annotate the generated EI-MS peaks with formulae or structures, as done by CFM-EI. In addition to developments in EI-MS spectral prediction, there have also been important advances in Kovats RI prediction. These include the development of a very high performing RI predictor based on GNNs, developed by NIST (which is commercially available) (11) and another freely available RI predictor developed by our group, called RIpred (12). Both predictors achieve correlation indices between predicted and observed RI values of >0.99 and mean absolute percentage errors (MAPEs) of <3%.

Given the current state of the art, it stands to reason that if one could combine the most accurate version of the NEIMS EI-MS predictor with the CFM-EI MS annotation capabilities and blend it with RIpred to predict Kovats RI values, then it would be possible to create a tool that could predict a high-quality, fully annotated EI-MS spectrum and an accurate Kovats RI value for any given compound. Running such a tool on all known metabolites in the Human Metabolome Database or HMDB (13), all known exposome compounds in the Norman Suspect List Exchange or NSLE (14) and all known natural products in the NP-MRD (15), along with their TMS (trimethylsilyl) and/or TBDMS (tert-butyldimethylsilyl) derivatives, would enable the creation of an enormous library of millions of (predicted) reference EI-MS spectra and reference RI values for exposome-relevant, metabolome-relevant or natural product-relevant compounds. Since GC–MS experiments produce both EI-MS and RI data, matching measured EI-MS, RI or EI-MS+RI data to these predicted reference RI, EI-MS or RI+EI-MS data could enable facile compound identification in GC–MS-based metabolomic, exposomic or natural product studies.

This concept has served as the motivation behind our development of the GCMS-ID (Gas Chromatography Mass Spectrometry compound IDentifier) webserver. Specifically, GCMS-ID is a webserver designed to help researchers identify compounds from GC–MS experiments. It does so by matching experimentally measured EI-MS data and/or RI data to its reference library of more than 1 million predicted RI/EI-MS values and producing a ranked list of top candidates. In addition to supporting GC–MS compound identification, GCMS-ID also offers the capability of taking a SMILES string or a drawn structure (via the MarvinJS chemical drawing applet) and predicting the RI and annotated EI-MS spectrum for the input structure. The design of the GCMS-ID server, the functions offered by the server, the graphical display tools, and the overall performance of the webserver in compound identification, EI-MS prediction/annotation and RI prediction are detailed in the following sections.

General design and operation

The GCMS-ID webserver supports four common GC–MS functions that are useful for analyzing GC–MS experiments, especially as they relate to metabolomics, exposomics and natural product chemistry. These include: (i) predicting Kovats retention indices (RIs); (ii) predicting EI-MS spectra; (iii) annotating EI-MS spectral peaks and (iv) identifying a chemical compound from experimentally acquired EI-MS and/or RI inputs. As shown in Figure 1, these four functions are presented on the GCMS-ID homepage as RI prediction, EI-MS prediction, peak assignment and compound identification. These functions can be accessed by either pressing on the sliding ‘banners’ that appear on the homepage or by selecting ‘Predict’ on the menu tabs located at the top of the homepage. Selecting the ‘Predict’ tab produces a pull-down menu with the same functions listed on the sliding banners.

Figure 1.

Summary flowchart of the four major functions of the GCMS-ID web server.

GCMS-ID’s RI Prediction function enables the computational prediction of Kovats RIs for submitted compounds in three stationary GC column phases including Standard Non-Polar (SNP), Semi-Standard Non-Polar (SSNP) and Standard Polar (SP) using a chemical structure submitted as either a SMILES string (16) or a drawn structure submitted via ChemAxon's Marvin JS applet (17). GCMS-ID’s EI-MS Prediction function supports the in silico generation of unit-mass resolution EI-MS mass spectra for compounds based on their chemical structure and common GC–MS derivatives including TMS, TBDMS or a combination of both TMS and TBDMS. GCMS-ID’s Peak Annotation function takes an observed EI-MS spectrum for a given compound (indicated by the SMILES string) and uses the CFM-EI algorithm (10,18) and a combinatorial chemical formula generator called PeakAnnotator (10), to perform comprehensive EI-MS peak annotation. GCMS-ID’s Compound Identification function takes an experimentally measured EI-MS spectrum and/or corresponding RI of an unknown or unidentified compound and searches against a user-selected database of predicted EI-MS spectra and RI values to determine the probable identity of the unknown compound. The GCMS-ID databases include predicted EI-MS spectra and predicted RI data from compounds in the NIST20 database (representing synthetic or industrial compounds) (https://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:ridatabase), the HMDB (representing mammalian metabolites) (13), a plant-specific natural product compound database from the NP-MRD (representing plant natural products) (15), NP-Atlas (representing microbially derived natural products) (15) and the Norman Suspect List Exchange or NSLE (representing common exposome compounds) (14). This collection of databases encompasses predicted EI-MS data and Kovats RI values (as well as compound names and chemical structures) for over 900000 different compounds and their derivatives.

In addition to these prediction functions, GCMS-ID also supports a variety of other actions. In particular, GCMS-ID’s top menu bar displays tabs for ‘Instructions’, ‘About’, ‘Download’ and ‘Contact Us’. Clicking on the ‘Instructions’ tab generates a detailed, richly annotated and fully illustrated tutorial on how GCMS-ID can be run, what kind of input is required and what kind of output is generated. Details on what each graph or graph annotation means and what each row and column in each generated table contains are also provided. Clicking on the ‘About’ tab generates a brief overview of the GCMS-ID server, along with information about the webserver's operating system and web browser compatibility, details on the size and composition of the GCMS-ID spectral and RI libraries, more detailed explanations for some of GCMS-ID’s algorithms, and summary statistics on the performance of GCMS-ID with regard to its EI-MS prediction accuracy (cosine similarity scores), its RI prediction accuracy (correlation index and mean absolute percentage error or MAPE) and its compound identification accuracy for several NIST database test sets. Clicking on the ‘Download’ tab generates a web page listing five separate downloadable files (in CSV format) that contain the predicted RI and EI-MS data for each of the compounds and their corresponding TMS and TBDMS derivatives for the NIST20, HMDB, NP-MRD (Plants), NP-Atlas and NSLE compounds. Each downloadable file contains information on the file size (in Mbytes), the name of the compound, the SMILES string, the predicted RI value in three different GC stationary phases and the EI-MS peak list (m/z and relative intensity). Clicking on the ‘Contact Us’ tab generates information about how to contact the GCMS-ID programming team to report bugs, offer suggestions or request programmatic API access.

Kovats RI prediction

GCMS-ID’s Kovats RI Predictor was adapted from RIpred (12), which can be independently accessed at https://ripred.ca/. RIpred was developed in our lab to facilitate accurate, freely accessible RI prediction for multiple GC column stationary phases. It was inspired by the NIST RI prediction model initially outlined by Qu et al. (11). RIpred uses a GNN architecture in which the submitted SMILES string is converted into a molecular graph, with nodes and edges representing atoms and their connections, respectively. Using RDKit, atom-level features are extracted from the molecular graph, including atom types, formal charges, valences, and path features. The atom-level features, along with computed path features, are then fed into the GNN model, which is made up of five hidden layers, each with 160 hidden units. RIpred's implementation was adapted and refined from Python source code repositories that are publicly available. RIpred was trained on >120 000 RI data values obtained from various sources for three different GC stationary column phases. It has a correlation index (R²) between observed and predicted RIs of >0.98 with a MAPE of <3%.
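The two accuracy figures quoted above can be made concrete with a short sketch. The MAPE function follows the standard definition of mean absolute percentage error; the R² function is written as the squared Pearson correlation between observed and predicted values, which is an assumption, since the text does not state how the correlation index was computed. The function names and the RI values in the usage line are hypothetical.

```python
def mape(observed, predicted):
    """Mean absolute percentage error between observed and predicted values."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(o - p) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(errors) / len(errors)


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed meaning of 'correlation index')."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Hypothetical observed and predicted retention indices.
observed_ri = [1000.0, 1500.0, 2000.0]
predicted_ri = [1010.0, 1490.0, 2020.0]
print(mape(observed_ri, predicted_ri), r_squared(observed_ri, predicted_ri))
```

A MAPE below 3% corresponds to an average error of fewer than 30 index units for a compound with an RI of 1000.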

RIpred also uses an in-house program called AUTOSILATOR to automatically add TMS and/or TBDMS moieties to appropriate sites on any given parent molecule, automatically limiting both the size and extent of silylation to match known steric limitations and GC–MS instrument limitations. Additional details on the AUTOSILATOR algorithm are available in the RIpred paper (12) and on the GCMS-ID website (under ‘About’). GCMS-ID is capable of performing accurate RI predictions for both underivatized and TMS and/or TBDMS derivatized compounds across three distinct GC column phases (SNP, SSNP, SP) (12). To predict a Kovats RI value, GCMS-ID has four required inputs (Figure 2A): (i) a SMILES string which corresponds to the compound structure (alternatively, users can interactively draw the structure using the MarvinJS applet from ChemAxon (17)); (ii) the compound name; (iii) the GC stationary phase and (iv) the type of derivatization. For the stationary phase, users can choose between SNP, SSNP and SP from the dropdown menu. For the type of derivatization, users can choose TMS, TBDMS, combined TMS and TBDMS, or no derivatization from the dropdown menu. After pressing the ‘Submit’ button, the predicted RI is generated in a separate window, together with the requested type of derivatization, in a tabular format (Figure 2B). Users may click on the tabs (TMS-Derivatization, TBDMS-Derivatization, No-Derivatization) located above each table to navigate between different derivatization states. Each derivatization table consists of six columns: the Compound Name, the Chemical Structure, the GC Column/Stationary Phase, the Derivatization Type, the Predicted Kovats Retention Index and a Time/Date Stamp. Additionally, the generated SDF files, together with the table in CSV format, can be downloaded by clicking on the ‘Download All’ button (or the individual Download TMS/TBDMS/No Derivatization buttons) near the top of the page. To assist users in running the RI Prediction module, two example compounds (Example 1 and Example 2) are provided. Clicking on the corresponding ‘Load Example’ buttons will autofill the required fields, after which users can press the ‘Submit’ button to obtain the RI prediction.

Figure 2.

GCMS-ID’s prediction functions: (A) Input submission form for the RI Prediction function with (B) the compounds with matching results displayed in a table, as well as (C) input submission form for the EI-MS Prediction function with (D) the predicted spectra for the input and its derivative.

EI-MS spectra prediction

The EI-MS Predictor for GCMS-ID uses a modification of the NEIMS EI-MS predictor. The algorithm and training process used for NEIMS have been explained previously (9). The modifications to NEIMS include the use of an in-house program called MIIP (molecular ion intensity predictor), which more accurately predicts the intensity and m/z value of the molecular ion, and a peak annotation tool called PeakAnnotator (10). PeakAnnotator is a rule-based, combinatorial formula generator developed for GCMS-ID that generates fragment ion subformulae for each m/z value based on the molecule's mass and elemental composition. This process uses well known rules such as the nitrogen rule, the Senior rule and the degree of unsaturation, along with a hand-built knowledgebase of known EI-MS peak patterns, to refine peak formula generation. Additional rules are applied to further improve formula generation, including the handling of high abundance isotopic elements (such as Cl and Br) and incorporating subformulae and substructures generated from CFM-EI’s (8) fragmentation module. Comparisons between PeakAnnotator's proposed molecular subformulae and those generated by experts from published, annotated EI-MS peak subformulae show nearly perfect agreement (more information about this program is available under the ‘About’ menu).
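Two of the rules named above, the nitrogen rule and the degree of unsaturation, can be sketched in a few lines. This is only an illustration of the textbook rules for a small set of elements; how PeakAnnotator actually applies them is not described here, and the function names, the element table and the choice of elements are assumptions. The Senior rule and isotope handling are not covered.

```python
import re

# Nominal (integer) masses of the most abundant isotope of each element.
NOMINAL_MASS = {"C": 12, "H": 1, "N": 14, "O": 16, "S": 32, "Si": 28,
                "F": 19, "Cl": 35, "Br": 79, "I": 127}


def parse_formula(formula):
    """Turn a formula such as 'C3H9Si' into a dict of element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def nominal_mass(counts):
    """Nominal mass from element counts (KeyError for unsupported elements)."""
    return sum(NOMINAL_MASS[element] * n for element, n in counts.items())


def degree_of_unsaturation(counts):
    """Rings plus double bonds; divalent O and S do not contribute."""
    tetravalent = counts.get("C", 0) + counts.get("Si", 0)
    monovalent = sum(counts.get(e, 0) for e in ("H", "F", "Cl", "Br", "I"))
    trivalent = counts.get("N", 0)
    return tetravalent - monovalent / 2 + trivalent / 2 + 1


def obeys_nitrogen_rule(counts):
    """True if nominal-mass parity matches nitrogen-count parity,
    as expected for a neutral molecule or its molecular ion."""
    return nominal_mass(counts) % 2 == counts.get("N", 0) % 2


pyridine = parse_formula("C5H5N")
print(nominal_mass(pyridine), degree_of_unsaturation(pyridine),
      obeys_nitrogen_rule(pyridine))
```

For an even-electron fragment such as the trimethylsilyl cation (C3H9Si, m/z 73), the same functions give a half-integer degree of unsaturation (0.5) and the opposite mass parity, which is one way a formula generator can distinguish fragment ions from molecular ions.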

To predict an EI-MS spectrum, GCMS-ID has three required fields (Figure 2C): (i) a SMILES string, which can be directly pasted into the MarvinJS applet (or users can draw the structure in the applet) (17); (ii) a compound name and (iii) the type of derivatization. For the type of derivatization, users can choose TMS, TBDMS, combined TMS and TBDMS, or no derivatization from the dropdown menu. After pressing the ‘Submit’ button, the predicted EI-MS spectrum is generated in a separate window that consists of three parts (Figure 2D): the Compound Details, the Spectrum View and the Documentation. The Compound Details for the query are presented in a table that lists the Compound Name, Molecular Formula, Molecular Weight, Exact Mass, SMILES string, InChI and InChI Key. This section can contain up to two different tables (the parent compound table and the derivative compound table), depending on whether derivatization was selected or not. The Spectrum View uses the ApexCharts open-source library (https://apexcharts.com/) to allow for interactive visualization of the predicted EI-MS spectrum. The EI-MS peaks are displayed as relative intensity versus m/z (mass-to-charge) bar plots. Users can hover over the peaks with their mouse or trackpad to display the peak's m/z value, its intensity and its formula. Moreover, the predicted peaks are color-coded, with red indicating peaks for which corresponding formulae were found and blue indicating peaks for which no reasonable formula could be generated. Icons located at the top of each EI-MS spectral diagram allow users to use their mouse or trackpad to expand, contract, zoom or drag the spectrum. Each EI-MS spectrum can be downloaded as an SVG, PNG or CSV file by clicking on the download icon (three horizontal bars) located beside the image control functions and selecting the preferred file format from the dropdown menu. Lastly, in the Documentation section, the generated SDF files can be downloaded by clicking on the corresponding hyperlinks. To assist users in running the EI-MS Prediction module, two example compounds (Example 1 and Example 2) are provided. Clicking on the corresponding ‘Load Example’ buttons will autofill the required fields, after which users can press the ‘Submit’ button to obtain the EI-MS prediction.

Peak assignment

The Peak Assignment function for GCMS-ID was added to aid users with interpreting and annotating EI-MS spectral peaks collected from experimental GC–MS data. The same PeakAnnotator script is used to annotate the peak list of the submitted spectral data, and a set of subformulae is determined for each m/z value based on the structure (or presumptive structure) of the molecule under consideration. To annotate EI-MS spectra with GCMS-ID, three fields are required (Figure 3A): (i) the EI-MS spectrum, which needs to be submitted as a two-column list with each row consisting of an m/z value and its corresponding normalized intensity (ranging from 1 to 1000); (ii) the SMILES string for the presumptive compound and (iii) the name of the compound. After pressing the ‘Submit’ button, the annotated EI-MS spectrum is generated and displayed in a separate window (Figure 3B). As with the EI-MS Predictor, the Peak Assignment function also uses ApexCharts to allow interactive visualization of the EI-MS spectral peaks. The hovering functionality allows users to use their mouse or trackpad to view information about each annotated peak, which includes the m/z value, the peak intensity and the proposed chemical formula. Additionally, all EI-MS peaks are color-coded, with red indicating peaks for which molecular formulae can be assigned and blue indicating unassigned peaks. Icons located at the top of each EI-MS spectral diagram allow users to use their mouse or trackpad to expand, contract, zoom or drag the spectrum. Each assigned EI-MS spectrum can be downloaded as an SVG, PNG or CSV file by clicking on the download icon (three horizontal bars) located beside the image control functions and selecting the preferred file format from the dropdown menu. To assist users in running the Peak Assignment module, two example EI-MS spectra (Example 1 and Example 2) are provided. Clicking on the corresponding ‘Load Example’ buttons will autofill the required fields, after which users can press the ‘Submit’ button to obtain the peak assignments.
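The two-column input format described above can be prepared from raw instrument counts with a small helper. The sketch below parses a text peak list and rescales intensities so that the base peak equals 1000; the function name, the accepted delimiters (whitespace or comma) and the base-peak scaling convention are assumptions, since the exact parsing rules of the server are not stated here.

```python
def parse_peak_list(text, max_intensity=1000.0):
    """Parse 'm/z intensity' rows and rescale so the base peak is max_intensity."""
    peaks = []
    for line in text.strip().splitlines():
        fields = line.replace(",", " ").split()
        if not fields:
            continue  # skip blank rows
        if len(fields) < 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        peaks.append((float(fields[0]), float(fields[1])))
    if not peaks:
        raise ValueError("no peaks found")
    base_peak = max(intensity for _, intensity in peaks)
    if base_peak <= 0:
        raise ValueError("all intensities are zero or negative")
    return [(mz, max_intensity * intensity / base_peak) for mz, intensity in peaks]


# Hypothetical raw peak list (m/z, detector counts).
raw = """
73 50
147 100
"""
for mz, intensity in parse_peak_list(raw):
    print(f"{mz:.0f}\t{intensity:.0f}")
```

Scaling to the base peak keeps relative intensities unchanged, so the rescaled list can be compared with library spectra regardless of the absolute signal recorded by the instrument.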

Figure 3.

GCMS-ID’s annotation and identification functions: (A) input form for the Peak Assignment function with (B) the annotated spectrum, as well as (C) input form for Compound Identification with (D) the mirrored results showing the input and matched spectra.

Compound identification

Compound identification in GCMS-ID involves a spectral library search of an experimentally measured EI-MS spectrum and/or an experimentally measured RI value against GCMS-ID’s large collection of predicted EI-MS spectra and/or predicted Kovats RI values. Key to the utility of this Compound Identification function are GCMS-ID’s unique spectral database collection and the accuracy of its predicted EI-MS spectra and predicted RI values. Five different databases are available, each providing EI-MS spectra and RI values (for SNP, SSNP and SP columns): (i) a database of 265920 compounds (and derivatives) for synthetic or industrial chemicals (from the NIST20 library); (ii) a database of 144327 compounds (and derivatives) for common metabolites (from the HMDB); (iii) a database of 252105 compounds (and derivatives) for plant compounds (from NP-MRD); (iv) a database of 54594 compounds (and derivatives) for microbial compounds (from NP-Atlas) and (v) a database of 183427 compounds (and derivatives) for exposome compounds or chemical contaminants (from NSLE). Each of these databases is intended to appeal to specialist applications in synthetic/industrial chemistry, metabolomics, natural product chemistry, and exposomics. Taken together, the GCMS-ID databases represent the largest collection of GC–MS data in the world. More specifically, in terms of the number of compounds, they are 21.5% larger than the Wiley GC–MS collection and 586% larger than the NIST GC–MS collection. What's more, they are all freely available (via the GCMS-ID ‘Download’ page) and freely searchable.

To identify a compound with GCMS-ID, users can provide three types of input data (Figure 3C): (i) an EI-MS spectrum and an RI value, (ii) an EI-MS spectrum only or (iii) an RI value only. These search options can be selected from the pulldown menu at the top of the page. The default is the EI-MS+RI option as this yields the most accurate results. Once the search option is selected, users must provide the appropriate EI-MS spectral data (in a two-column format, with m/z values in one column and normalized intensities in the other) and/or measured RI data (including the choice of the GC column and the use of any derivatization). Once these fields are filled in, the user must select the desired database to search via the pulldown menu. After pressing the ‘Submit’ button, the selected database is searched and the highest scoring matches are presented in a table on a new webpage. The search algorithm used for the Compound Identification function is relatively simple. When EI-MS+RI data or EI-MS-only data are provided, an estimate of the parent ion mass is made to limit the search and reduce the search time. The similarity between the query EI-MS spectrum and each database EI-MS spectrum is then calculated using a standard cosine similarity score (19). A cosine similarity score of 0.0 indicates no match to any peaks in terms of m/z values or intensity, while 1.0 indicates a perfect match in both m/z values and intensity. After the database search and comparison are completed, all high scoring hits are sorted in order of cosine similarity score and only the top 10 compounds are selected. This ‘short list’ of compounds may be further filtered by removing those with cosine similarity scores below 0.25 (unless there is only one compound). If an RI value is provided (either alone or in conjunction with the EI-MS spectral data), an RI similarity score is calculated for each compound remaining in the short list. The RI similarity (RI_sim) is calculated using the following formula:

$$\mathrm{RI}_{\mathrm{sim}} = \frac{0.15 - \dfrac{\mathrm{abs}\left(\mathrm{RI}_{\mathrm{exp}} - \mathrm{RI}_{\mathrm{pred}}\right)}{\max\left(\mathrm{RI}_{\mathrm{exp}}, \mathrm{RI}_{\mathrm{pred}}\right)}}{0.15} \tag{1}$$

where RI_exp is the experimental RI value, RI_pred is the predicted RI value, abs(X − Y) is the absolute value of the difference of two values, max(X, Y) is the maximum of the two values and the RI tolerance is set to ±15%. A perfect match results in an RI score of 1.0 and a total mismatch (outside the allowed RI tolerance) results in an RI score of 0. All compounds receiving RI scores of 0 are removed from the short list unless there is only one compound. Finally, a combined score, incorporating both cosine similarity and RI scores, is determined with varying weightings based on the RI score. If the RI score is greater than or equal to 0.8, a weighting of 0.4 for the RI score and 0.6 for the cosine similarity score is used. Otherwise, a weighting ratio of 0.9 to 0.1 is used, with 0.9 being used for the cosine similarity score and 0.1 for the RI score. If only EI-MS data are submitted, only the cosine similarity score is calculated and used in ranking hits. If only RI data are submitted, only the RI score is calculated and used in ranking hits.
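The scoring scheme described above can be summarized in a minimal sketch. It is not the GCMS-ID source code: the function names and the spectrum representation (a dict mapping unit-mass m/z to intensity) are hypothetical, the cosine score is the plain unweighted form, the RI score is clamped at 0 to reflect the statement that mismatches outside the tolerance score 0, and the combined score is assumed to be a weighted sum of the two scores.

```python
import math

RI_TOLERANCE = 0.15  # the ±15% tolerance from Equation 1


def cosine_similarity(query, reference):
    """Unweighted cosine similarity between two {m/z: intensity} spectra."""
    shared = set(query) & set(reference)
    dot = sum(query[mz] * reference[mz] for mz in shared)
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    norm_r = math.sqrt(sum(v * v for v in reference.values()))
    if norm_q == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_q * norm_r)


def ri_similarity(ri_exp, ri_pred, tolerance=RI_TOLERANCE):
    """Equation 1, clamped at 0 for values outside the tolerance (assumption)."""
    relative_error = abs(ri_exp - ri_pred) / max(ri_exp, ri_pred)
    return max(0.0, (tolerance - relative_error) / tolerance)


def combined_score(cos_sim, ri_sim):
    """Weighted sum (assumed form) with the RI-dependent weights from the text."""
    if ri_sim >= 0.8:
        return 0.6 * cos_sim + 0.4 * ri_sim
    return 0.9 * cos_sim + 0.1 * ri_sim


# Hypothetical query and library entry.
query = {73: 1000.0, 147: 400.0, 205: 150.0}
library_entry = {73: 1000.0, 147: 350.0, 217: 90.0}
cos_sim = cosine_similarity(query, library_entry)
ri_sim = ri_similarity(ri_exp=1500.0, ri_pred=1530.0)
print(round(cos_sim, 3), round(ri_sim, 3), round(combined_score(cos_sim, ri_sim), 3))
```

With these weights, an RI prediction that agrees closely with the measurement contributes 40% of the final score, while a poorer RI agreement is down-weighted to 10% so that the spectral match dominates the ranking.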

Results from the EI-MS and/or RI searches are rendered in a table listing details for each matching hit (Figure 3D; Compound Name, Molecular Formula, Exact Mass, InChI Key) and computed scores (EI-MS Cosine Similarity, RI Score, Combined Score), along with a spectral comparison panel (if the EI-MS+RI or EI-MS option is chosen) featuring an interactive EI-MS spectral display built with the ApexCharts library. The EI-MS spectral visualization tool supports peak hovering and displays a mirrored plot with the query EI-MS spectrum on top (in blue) and the matching (database) EI-MS spectrum below (in red) for facile comparison. The spectral comparison panel allows users to toggle between matched compounds on the mirror plot by clicking on the ‘View’ button, with the top scoring match displayed by default. The ‘View’ button for the selected compound is colored red while the unselected ‘View’ buttons remain green. As with the Peak Assignment function, this result can also be downloaded as an SVG, PNG or CSV file by clicking on the download icon (three horizontal bars) located beside the image control functions and selecting the preferred file format from the dropdown menu. To assist users in running the Compound Identification module, two example queries (Example 1 and Example 2) are provided. Clicking on the corresponding ‘Load Example’ button will autofill the required fields, after which users can press the ‘Submit’ button to obtain the ranked list of top scoring hits.

Compound identification accuracy

To evaluate GCMS-ID’s accuracy in identifying compounds using EI-MS spectra and RI data, a dataset comprising 100 experimentally collected EI-MS spectra from the NIST20 GC–MS database spanning a wide range of chemical classes (including alkanes, alkenes, aromatics, and heterocycles) was used. Each EI-MS spectrum was linked to its experimentally measured RI value. These RI values ranged from 360 to 2700 Kovats RI units. To assess the compound identification performance of GCMS-ID, each EI-MS spectrum-RI pair was submitted to the GCMS-ID webserver (after selecting the NIST20 database as the library to search) and the list of the top 10 scoring hits was then analyzed (see the GCMS-ID website under ‘About’, and ‘About GCMS-ID’ in the dropdown menu). Assessment of compound identification accuracy was based on the percentage of correctly identified compounds within the top ten, within the top two and as the top hit. Our results showed that 80% of the compounds were identified as the top hit, while 87% were within the top two, and 100% were included in the top ten. If only EI-MS data were used to perform the search, 68% of the compounds were identified as the top hit, while 81% were within the top two, and 98% were included in the top ten. If only RI data were used to perform the search, 4% of the compounds were identified as the top hit, while 5% were within the top two, and 8% were included in the top ten. For the 100 compounds used in this test set, the average cosine similarity score was 0.755±0.099. Likewise, the average RI score was 0.837±0.155, with an average MAPE of 1.91%. These results are consistent with the previously reported accuracy of these predictive methods. A second set of 100 experimentally collected EI-MS spectra and RI values that were unique to the NIST23 GC–MS dataset was also used to validate the RIpred, NEIMS and compound identification performance. These compounds were not used in the training of either RIpred or NEIMS. The predictive performance for this second set of data showed very similar results, with an average cosine similarity score over all 100 compounds of 0.768±0.102, and an average RI score of 0.811±0.156 with an average MAPE of 1.92%. Using this NIST23 data set, our results showed that 81% of the compounds were identified as the top hit, while 86% were within the top two, and 100% were included in the top ten.
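The accuracy figures above are top-k hit rates and a mean absolute percentage error (MAPE). The short Python sketch below shows how such metrics can be computed. The function names and the input layout (one ranked list of compound identifiers per query) are hypothetical, and taking the experimental RI as the MAPE denominator is an assumption, since the text does not state the exact definition used.

```python
def top_k_accuracy(ranked_lists, true_ids, k):
    """Percentage of queries whose correct compound appears in the first k hits."""
    found = sum(
        1 for ranked, truth in zip(ranked_lists, true_ids) if truth in ranked[:k]
    )
    return 100.0 * found / len(true_ids)


def mape(experimental, predicted):
    """Mean absolute percentage error, relative to the experimental values."""
    errors = [abs(e - p) / abs(e) for e, p in zip(experimental, predicted)]
    return 100.0 * sum(errors) / len(errors)
```

With the hit lists returned for a test set, `top_k_accuracy(ranked_lists, true_ids, 1)`, `top_k_accuracy(ranked_lists, true_ids, 2)` and `top_k_accuracy(ranked_lists, true_ids, 10)` correspond to the top hit, top two and top ten percentages reported above.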

Additionally, the computational efficiency of our web server was assessed by measuring the time taken for compound identification and ranking. When EI-MS+RI data were used in the query, the average processing time was 25 s. When EI-MS-only data were used in the query, the average processing time was 48 s. When RI-only data were used in the query, the average processing time was 7 s. A more detailed summary of the testing results for all 200 compounds, including scores for each input method and for each test compound, is available by clicking on the ‘About’ tab.

Programmatic implementation

The GCMS-ID server uses a variety of standardized web frameworks and data caching systems developed in our lab to make the website more user friendly and responsive. More specifically, the frontend of GCMS-ID was implemented as a RESTful web service using the Ruby on Rails framework. Ruby on Rails is a well-regarded web development system that employs a Model-View-Controller (MVC) architecture. In the MVC framework, models (the M in MVC) respond to and interact with the data, views (the V in MVC) create the interface to show and interact with the data, and controllers (the C in MVC) connect the user to the views. The MVC framework allowed our lab to rapidly develop, prototype and test all of GCMS-ID’s web modules and page views. Many of the utilities within GCMS-ID were borrowed from a large collection of Ruby gems previously developed for the HMDB (13). This framework is particularly robust, and code can be reused in different functions or changed easily to accommodate future feature expansion or abrupt changes in design. This allowed us to liberally borrow code and functions from other webservers developed in our lab (12,13,18–21).

Conclusion

The GCMS-ID webserver represents an important advance in the development of robust, reliable and freely available tools for analyzing GC–MS data. By making EI-MS and RI data for >1 million compounds (including their derivatives) freely available and providing tools to easily predict RI and EI-MS data within a single, easily used webserver, we believe that GC–MS data analysis will become significantly easier, cheaper and faster. We also believe that the development of GCMS-ID highlights the tremendous advances that machine learning has brought to the field of analytical chemistry, especially in MS spectral prediction and RI prediction. Indeed, these advances are now enabling the practical implementation of so-called in silico or reference-free methods for compound identification (22,23). Given the rather small size or limited scope of existing, experimentally derived MS spectral libraries and/or RI libraries and given the high cost of acquiring them, alternative, low-cost approaches need to be developed. In silico methods, such as those offered via GCMS-ID, appear to offer a compellingly cheap and practical alternative. Furthermore, given the challenge of preparing, synthesizing or isolating thousands of chemicals (especially for metabolomics, exposomics and natural product chemistry) and obtaining high quality reference EI-MS or RI data for these chemicals, a cheaper, more practical approach to compound identification needs to be developed.

Data availability

GCMS-ID is freely available at https://gcms-id.ca/. All data used for operating the webserver are freely available through the webserver’s Download section.

Funding

University of Alberta (a graduate student research assistant fellowship for J.W.); Natural Sciences and Engineering Research Council of Canada (NSERC); Canada Foundation for Innovation (CFI); Genome Canada. Funding for open access charge: Genome Canada.

Conflict of interest statement. None declared.

References

1. Sparkman O.D., Penton Z.E., Kitson F.G. Section 2: GC conditions, derivatization, and mass spectral interpretation specific compound types. Gas Chromatography and Mass Spectrometry: A Practical Guide. 2011; Burlington, MA: Elsevier, 219.

2. MassBank consortium and its contributors. 2023; MassBank/MassBank-data: Release version 2023.11 (2023.11) [Data set] 10.5281/zenodo.10213786.

3. Fiehn, Wohlgemuth, Scholz, Kind, Lee D.Y., Moon, Nikolau. Quality control for plant metabolomics: reporting MSI-compliant studies. Plant J. 2008; 691–704.

4. Nielson F.F., Kay, Young S.J., Colby S.M., Renslow R.S., Metz T.O. Similarity downselection: finding the n most dissimilar molecular conformers for reference-free metabolomics. Metabolites. 2023; 105.

5. Djoumbou-Feunang, Pon, Karu, Zheng, Arndt, Gautam, Allen, Wishart D.S. CFM-ID 3.0: significantly improved ESI-MS/MS prediction and compound identification. Metabolites. 2019.

6. Grimme. Towards first principles calculation of electron impact mass spectra of molecules. Angew. Chem. Int. Ed. 2013; 6306–6312.

7. Ruttkies, Schymanski E.L., Wolf, Hollender, Neumann. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 2016.

8. Allen, Pon, Greiner, Wishart. Computational prediction of electron ionization mass spectra to assist in GC/MS compound identification. Anal. Chem. 2016; 7689–7697.

9. Wei J.N., Belanger, Adams R.P., Sculley. Rapid prediction of electron-ionization mass spectrometry using neural networks. ACS Cent. Sci. 2019; 700–708.

10. Anjum. Application of Machine Learning towards Compound Identification through Gas Chromatography Retention Index (RI) and Electron Ionization Mass Spectrometry (EI-MS) Predictions. 2023; Edmonton, Canada: MSc Thesis, Dept. of Computing Science, University of Alberta.

11. Schneider B.I., Kearsley A.J., Keyrouz, Allison T.C. Predicting Kováts retention indices using graph neural networks. J. Chromatogr. A. 2021; 1646:462100.

12. Anjum, Liigand, Milford, Gautam, Wishart D.S. Accurate prediction of isothermal gas chromatographic Kováts retention indices. J. Chromatogr. A. 2023; 1705:464176.

13. Wishart D.S., Guo, Oler, Wang, Anjum, Peters, Dizon, Sayeeda, Tian, Lee B.L. et al. HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Res. 2022; D622–D631.

14. MohammedTaha, Aalizadeh, Alygizakis, Antignac J.P., Arp H.P.H., Bade, Baker, Belova, Bijlsma, Bolton E.E. et al. The NORMAN suspect List Exchange (NORMAN-SLE): facilitating European and worldwide collaboration on suspect screening in high resolution mass spectrometry. Environ. Sci. Eur. 2022; 104.

15. Wishart D.S., Sayeeda, Budinski, Guo A.C., Lee B.L., Berjanskii, Rout, Peters, Dizon, Mah et al. NP-MRD: the Natural Products Magnetic Resonance Database. Nucleic Acids Res. 2022; D665–D677.

16. Weininger. SMILES, a chemical language and information system: 1: introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988.

17. Csizmadia. JChem: java applets and modules supporting chemical database handling from web browsers. J. Chem. Inf. Comput. Sci. 2000; 323–324.

18. Allen, Pon, Wilson, Greiner, Wishart. CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra. Nucleic Acids Res. 2014; W94–W99.

19. Wang, Liigand, Tian, Arndt, Greiner, Wishart D.S. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. 2021; 11692–11700.

20. Xia, Wishart. Using MetaboAnalyst 3.0 for comprehensive metabolomics data analysis. Curr. Protoc. Bioinformatics. 2016; 14.10.1–14.10.91.

21. Wishart D.S., Tian, Allen, Oler, Peters, Lui V.W., Gautam, Djoumbou-Feunang, Greiner, Metz T.O. BioTransformer 3.0—A web server for accurately predicting metabolic transformation products. Nucleic Acids Res. 2022; W115–W123.

22. Sabater, Olano, Corzo, Montilla. GC–MS characterisation of novel artichoke (Cynara scolymus) pectic-oligosaccharides mixtures by the application of machine learning algorithms and competitive fragmentation modelling. Carbohydr. Polym. 2019; 205:513–523.

23. McEachran A.D., Balabin, Cathey, Transue T.R., Al-Ghoul, Grulke, Sobus J.R., Williams A.J. Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns. Sci. Data. 2019; 141.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
