Compiled: Fri Mar 3 14:51:39 2023
This document describes how the Biocrates p180-based targeted metabolomics data from the CHRIS study (Pattaro et al. 2015; Verri Hernandes et al. 2022) can be loaded and analyzed in R.
The value provided for each metabolite is an absolute concentration in natural scale.
Data in TDF format is best imported into R with the tidyfr package, which can be installed using the code below.
remotes::install_github("EuracBiomedicalResearch/tidyfr")
To load the CHRIS p180 metabolomics data, access to the folder with the data in
TDF format is required. The name of the data folder (and hence the name of the
CHRIS module) is metabolomics_p180. Below we list all versions of the data
set available for this module. In this example we assume the
metabolomics_p180 folder to be present in the folder in which R is
run. The path parameter specifies the folder in which CHRIS data modules
are stored (we use "." to indicate the current working directory).
library(tidyfr)
##
## Attaching package: 'tidyfr'
## The following object is masked from 'package:utils':
##
## data
list_data_modules(path = ".")
## name version
## 1 metabolomics_p180 1.0.0.1
## 2 metabolomics_p180 1.0.1.1
## description
## 1 Targeted metabolomics data based on the Biocrates p180 kit comprising measured concentrations of 175 metabolites and lipids in serum samples from CHRIS participants.
## 2 Targeted metabolomics data based on the Biocrates p180 kit comprising measured concentrations of 175 metabolites and lipids in serum samples of CHRIS participants.
In this particular case, 2 versions of the module are available. We can next
load the module with the data_module function specifying the name of the
module, the version we want to load and also the path where the module can be
found.
metabo <- data_module("metabolomics_p180", version = "1.0.1.1", path = ".")
The metabo variable is now a reference to this module which provides some
general information:
metabo
## Object of class DataModule
## o name: metabolomics_p180
## o version: 1.0.1.1
## o description: Targeted metabolomics data based on the Biocrates p180 kit comprising measured concentrations of 175 metabolites and lipids in serum samples of CHRIS participants.
## o date: 2020-03
Note that we could also use the functions moduleName, moduleVersion,
moduleDescription and moduleDate to extract these metadata.
The actual data can be loaded with the data function. This will read the full
data which includes general information for each measurement along with the
metabolite concentrations and a quality information flag for each individual
measurement.
metabo_data <- data(metabo)
dim(metabo_data)
## [1] 7251 350
In this data, columns are variables (such as metabolite concentrations or quality
information) and rows are participants. For the present data set concentrations for
175 metabolites are available along with quality information on each of these
measurements (hence there are in total 350 columns). The AIDs (i.e., identifiers
for each sample/participant) are provided as the row names of the
data.frame. Description and information on the individual columns can be
loaded with the labels function:
metabo_ann <- labels(metabo)
This data.frame contains metadata and additional information on each variable
in metabo_data:
head(metabo_ann)
## label unit type min max missing description
## x0pt001 x0pt001 NA float 2.745243e-03 257.15694 -89 ADMA
## x0pt002 x0pt002 NA float 1.605280e+02 686.16781 -89 Ala
## x0pt003 x0pt003 NA float 1.455732e-03 27.97999 -89 alpha-AAA
## x0pt004 x0pt004 NA float 4.195551e+01 197.87741 -89 Arg
## x0pt005 x0pt005 NA float 2.329051e+01 96.62628 -89 Asn
## x0pt006 x0pt006 NA float 1.925211e+00 2291.64673 -89 Asp
## analyte_name analyte_class analyte_quant biochemical_name
## x0pt001 ADMA biogenic amines semi Asymmetric dimethylarginine
## x0pt002 Ala aminoacids ok Alanine
## x0pt003 alpha-AAA biogenic amines semi alpha-Aminoadipic acid
## x0pt004 Arg aminoacids ok Arginine
## x0pt005 Asn aminoacids ok Asparagine
## x0pt006 Asp aminoacids semi Aspartate
## aliases formula lipid_maps analyte_flag analyte_note
## x0pt001 ADMA C8H18N4O2 0
## x0pt002 Alanine C3H7NO2 0
## x0pt003 alpha-Aminoadipic acid C6H11NO4 1 outlier plates
## x0pt004 Arginine C6H14N4O2 0
## x0pt005 Asparagine C4H8N2O3 0
## x0pt006 Aspartate C4H7NO4 0
## hmdb_id ms_type cv_qc_chris long_description
## x0pt001 HMDB0001539 LC-MS 0.06627664 Asymmetric dimethylarginine
## x0pt002 HMDB0000161 LC-MS 0.04789639 Alanine
## x0pt003 HMDB0000510 LC-MS 0.21486936 alpha-Aminoadipic acid
## x0pt004 HMDB0000517 LC-MS 0.03824621 Arginine
## x0pt005 HMDB0000168 LC-MS 0.03138246 Asparagine
## x0pt006 HMDB0000191 LC-MS 0.05464532 Aspartate
Each row provides the annotation for one column in the metabo_data data.frame
(rows and columns are in the same order).
stopifnot(all(rownames(metabo_ann) == colnames(metabo_data)))
The column cv_qc_chris contains the coefficient of variation (CV) calculated on the QC samples (the QC CHRIS Pool samples). These values thus represent the technical variability of each metabolite in the present data set.
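The CV is the standard deviation divided by the mean. The minimal sketch below illustrates the calculation on made-up QC values; the `qc` data frame and the `cv` helper are hypothetical and not part of tidyfr or the CHRIS data, and details of how cv_qc_chris was derived (for example the handling of missing values) are not covered here.

```r
## Hypothetical concentrations of two metabolites measured in five
## QC pool injections (made-up numbers, for illustration only).
qc <- data.frame(
    met_a = c(10.0, 10.5, 9.5, 10.2, 9.8),
    met_b = c(2.0, 3.0, 1.0, 2.5, 1.5)
)

## Coefficient of variation: standard deviation relative to the mean.
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

## One CV per metabolite (column).
qc_cv <- vapply(qc, cv, numeric(1))
round(qc_cv, 3)
```

In this toy example met_a varies much less around its mean than met_b and therefore has the lower CV, i.e. the lower technical variability.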
While these functions have now loaded the data, it is suggested to further process
and reformat the data to simplify its analysis. First we replace the internal
identifiers of the CHRIS labels (i.e. those starting with x0pt) with more meaningful
column names.
colnames(metabo_data) <- metabo_ann$description
rownames(metabo_ann) <- metabo_ann$description
With that it is much easier to access individual values.
quantile(metabo_data$Gly, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 91.36936 202.04114 235.15132 277.15061 693.95894
It is also helpful to discriminate between columns in metabo_data that contain
the actual metabolite concentrations and those that contain the quality
information. Below we identify the columns with quality information. For these,
the keyword "flags" is appended to the metabolite name.
flag_cols <- grep("flags$", colnames(metabo_data))
We next list all available quality information (which is encoded as a factor).
levels(unlist(metabo_data[, flag_cols]))
## [1] "OK" "Removed because of technical reason"
## [3] "Below lower level of quantification" "Above upper level of quantification"
## [5] "Below level of detection"
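These flags can be used to restrict an analysis to reliably quantified values. The sketch below works on a small made-up data frame (`toy`) that mimics the layout of metabo_data after renaming. It assumes that each flag column is named after its metabolite followed by " flags" (inferred from the grep pattern above), which should be verified against the actual column names. Whether flagged values should be excluded, kept or imputed is an analysis decision; setting them to NA is only one option.

```r
## Toy data mimicking the layout of metabo_data (made-up values).
lvls <- c("OK", "Removed because of technical reason",
          "Below lower level of quantification",
          "Above upper level of quantification",
          "Below level of detection")
toy <- data.frame(
    Gly = c(235.2, 202.0, 277.2, 91.4),
    Ala = c(350.1, 160.5, 686.2, 420.0),
    `Gly flags` = factor(c("OK", "OK", "Below level of detection", "OK"),
                         levels = lvls),
    `Ala flags` = factor(c("OK", "Removed because of technical reason",
                           "OK", "OK"), levels = lvls),
    check.names = FALSE
)

flag_cols <- grep("flags$", colnames(toy))
conc_cols <- colnames(toy)[-flag_cols]

## Number of measurements per metabolite that are not flagged "OK".
n_not_ok <- vapply(toy[, flag_cols, drop = FALSE],
                   function(z) sum(z != "OK"), integer(1))

## Copy of the concentrations with all non-"OK" values set to NA.
## Assumption: the flag column is named "<metabolite> flags".
masked <- toy[, conc_cols, drop = FALSE]
for (met in conc_cols) {
    flag <- toy[[paste(met, "flags")]]
    masked[[met]][which(flag != "OK")] <- NA
}
masked
```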
At some point we will also need additional information on the metabolites.
For that we can use the metabo_ann data frame, which contains the
annotation for the labels of the data module. Below we select 5 random
metabolites and extract their annotation from this data frame.
ids <- sample(colnames(metabo_data)[-flag_cols], 5)
metabo_ann[ids, ]
## label unit type min max missing description
## Ile x0pt015 NA float 26.78392210 243.3909540 -89 Ile
## PC aa C26:0 x0pt087 NA float 0.30337354 1.1886800 -89 PC aa C26:0
## Spermidine x0pt029 NA float 0.02855096 0.7281069 -89 Spermidine
## C10:2 x0pt040 NA float 0.04312934 0.2741845 -89 C10:2
## SDMA x0pt026 NA float 0.14667452 232.3364149 -89 SDMA
## analyte_name analyte_class analyte_quant
## Ile Ile aminoacids ok
## PC aa C26:0 PC aa C26:0 glycerophospholipids semi
## Spermidine Spermidine biogenic amines semi
## C10:2 C10:2 acylcarnitines semi
## SDMA SDMA biogenic amines semi
## biochemical_name aliases formula
## Ile Isoleucine Isoleucine C6H13NO2
## PC aa C26:0 Phosphatidylcholine diacyl C26:0 PC 26:0 C34H68NO8P
## Spermidine Spermidine Spermidine C7H19N3
## C10:2 Decadienylcarnitine Decadienylcarnitine C17H29NO4
## SDMA Symmetric dimethylarginine SDMA C8H18N4O2
## lipid_maps
## Ile
## PC aa C26:0 LMGP01010388;LMGP01010432;LMGP01010475;LMGP01010456;LMGP01010725;LMGP01011243
## Spermidine
## C10:2
## SDMA
## analyte_flag analyte_note
## Ile 0
## PC aa C26:0 0
## Spermidine 0
## C10:2 1 small dynamic range in QC samples; outlier plates
## SDMA 0
## hmdb_id ms_type cv_qc_chris long_description
## Ile HMDB0000172 LC-MS 0.03691048 Isoleucine
## PC aa C26:0 FIA 0.07579529 Phosphatidylcholine diacyl C26:0
## Spermidine HMDB0001257 LC-MS 0.05213707 Spermidine
## C10:2 HMDB0013325 FIA 0.09161977 Decadienylcarnitine
## SDMA HMDB0003334 LC-MS 0.08145940 Symmetric dimethylarginine
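The annotation can also be used to select metabolites by their properties rather than by name. The sketch below rebuilds a small annotation table (`ann`, a hypothetical name) from a few columns of the six rows printed by head(metabo_ann) above and filters on them. The CV cutoff of 0.1 is an arbitrary value chosen for illustration, and the interpretation of analyte_flag (0 presumably meaning no reported issue) and analyte_quant ("ok" versus "semi") is an assumption that should be checked against the module documentation.

```r
## Subset of the annotation shown by head(metabo_ann), typed in by hand.
ann <- data.frame(
    analyte_class = c("biogenic amines", "aminoacids", "biogenic amines",
                      "aminoacids", "aminoacids", "aminoacids"),
    analyte_quant = c("semi", "ok", "semi", "ok", "ok", "semi"),
    analyte_flag = c(0, 0, 1, 0, 0, 0),
    cv_qc_chris = c(0.06627664, 0.04789639, 0.21486936,
                    0.03824621, 0.03138246, 0.05464532),
    row.names = c("ADMA", "Ala", "alpha-AAA", "Arg", "Asn", "Asp")
)

## Metabolites without analyte flag and with a QC CV below 0.1
## (illustrative cutoff, not a recommendation).
keep <- ann$analyte_flag == 0 & ann$cv_qc_chris < 0.1
selected <- rownames(ann)[keep]
selected

## Amino acids for which analyte_quant is "ok".
aa_ok <- rownames(ann)[ann$analyte_class == "aminoacids" &
                       ann$analyte_quant == "ok"]
aa_ok
```

With the real data the same logical indexing can be applied to metabo_ann, and the resulting names can be used to subset the columns of metabo_data.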
We can also calculate the mean concentration of the 5 randomly selected metabolites across all available CHRIS participants using the code below:
vapply(metabo_data[, ids], mean, numeric(1), na.rm = TRUE)
## Ile PC aa C26:0 Spermidine C10:2 SDMA
## 66.24374575 0.49278034 0.19819053 0.08016315 0.50839056
As the values above show, the concentrations are in natural scale. For data
analysis it might thus be better to transform them using log2 or log10 (which
typically also brings their distribution closer to a Gaussian). See also
(Verri Hernandes et al. 2022) for more information on data distribution and
quality.
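The effect of such a transformation can be sketched on simulated data. The example below does not use CHRIS data: it draws right-skewed (log-normal) values for two hypothetical metabolites, applies log2 and compares a rough asymmetry indicator (the `asym` helper, defined here for illustration only) before and after the transformation.

```r
set.seed(123)

## Simulated right-skewed concentrations (log-normal), not CHRIS data.
sim <- data.frame(
    met_a = rlnorm(5000, meanlog = 4, sdlog = 0.5),
    met_b = rlnorm(5000, meanlog = 0, sdlog = 0.8)
)

## log2 is applied element-wise to all (numeric) columns.
## Note that a concentration of 0 would become -Inf.
sim_log2 <- log2(sim)

## Rough asymmetry indicator: distance between mean and median in
## units of the standard deviation (close to 0 for symmetric data).
asym <- function(x) (mean(x) - median(x)) / sd(x)

asym_nat <- vapply(sim, asym, numeric(1))
asym_log <- vapply(sim_log2, asym, numeric(1))
rbind(natural = asym_nat, log2 = asym_log)
```

For real data the same call could be applied to the concentration columns only, e.g. log2(metabo_data[, -flag_cols]), since the flag columns are factors and cannot be log-transformed.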
sessionInfo()
## R Under development (unstable) (2023-02-22 r83892)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyfr_0.99.11 BiocStyle_2.27.1 rmarkdown_2.20
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.0 knitr_1.42
## [3] rlang_1.0.6 xfun_0.37
## [5] DelayedArray_0.25.0 jsonlite_1.8.4
## [7] SummarizedExperiment_1.29.1 S4Vectors_0.37.3
## [9] RCurl_1.98-1.10 htmltools_0.5.4
## [11] sass_0.4.5 stats4_4.3.0
## [13] MatrixGenerics_1.11.0 Biobase_2.59.0
## [15] grid_4.3.0 evaluate_0.20
## [17] jquerylib_0.1.4 bitops_1.0-7
## [19] fastmap_1.1.1 yaml_2.3.7
## [21] IRanges_2.33.0 GenomeInfoDb_1.35.15
## [23] bookdown_0.32 BiocManager_1.30.20
## [25] compiler_4.3.0 XVector_0.39.0
## [27] lattice_0.20-45 digest_0.6.31
## [29] R6_2.5.1 GenomeInfoDbData_1.2.9
## [31] GenomicRanges_1.51.4 Matrix_1.5-3
## [33] bslib_0.4.2 tools_4.3.0
## [35] matrixStats_0.63.0 zlibbioc_1.45.0
## [37] BiocGenerics_0.45.0 cachem_1.0.7