A comparison of strategies for generating artificial replicates in RNA-seq experiments

Publication: Contribution to journal, Article, Research, Peer-reviewed

Authors

  • Babak Saremi
  • Frederic Gusmag
  • Ottmar Distl
  • Frank Schaarschmidt
  • Julia Metzger
  • Stefanie Becker
  • Klaus Jung

Organisational units

External organisations

  • Stiftung Tierärztliche Hochschule Hannover

Details

Original language: English
Article number: 7170
Journal: Scientific reports
Volume: 12
Issue number: 1
Early online date: 3 May 2022
Publication status: Published - Dec 2022

Abstract

Due to the overall high costs, technical replicates are usually omitted in RNA-seq experiments, but several methods exist to generate them artificially. Bootstrapping reads from FASTQ files has recently been used in the context of other NGS analyses and can be used to generate artificial technical replicates. Bootstrapping samples from the columns of the expression matrix has already been used for DNA microarray data and generates a new artificial replicate of the whole experiment. Mixing data of individual samples has been used for data augmentation in machine learning. The aim of this comparison is to evaluate which of these strategies is best suited to study the reproducibility of differential expression and gene-set enrichment analysis in an RNA-seq experiment. To study the approaches under controlled conditions, we performed a new RNA-seq experiment on gene expression changes upon virus infection compared to untreated control samples. In order to compare the approaches for artificial replicates, each of the samples was sequenced twice, i.e. as true technical replicates, and differential expression analysis and GO term enrichment analysis were conducted separately for the two resulting data sets. Although we observed a high correlation between the results from the two replicates, there are still many genes and GO terms that would be selected from one replicate but not from the other. Cluster analyses showed that artificial replicates generated by bootstrapping reads produce p values and fold changes that are close to those obtained from the true data sets. Results generated from artificial replicates with the approaches of column bootstrap or mixing observations were less similar to the results from the true replicates. Furthermore, the overlap of results among replicates generated by column bootstrap or mixing observations was much stronger than among the true replicates.
Artificial technical replicates generated by bootstrapping sequencing reads from FASTQ files are better suited to study the reproducibility of results from differential expression and GO term enrichment analysis in RNA-seq experiments than column bootstrap or mixing observations. However, FASTQ-bootstrapping is computationally more expensive than the other two approaches. FASTQ-bootstrapping may also be applicable to other applications of high-throughput sequencing.
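
The three strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy FASTQ text, the toy count vectors, the choice to resample columns separately within each experimental group, and the fixed mixing weight are all assumptions made here for illustration. Real FASTQ files are far larger and would normally be streamed rather than held in memory.

```python
import random


def parse_fastq(text):
    """Split FASTQ text into records of four lines (header, sequence, '+', qualities)."""
    lines = text.strip().splitlines()
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def bootstrap_reads(records, rng):
    """Draw len(records) reads with replacement: one artificial technical replicate."""
    return [rng.choice(records) for _ in range(len(records))]


def column_bootstrap(counts, groups, rng):
    """Resample sample columns with replacement, separately within each group (assumed design)."""
    replicate = {}
    for group, samples in groups.items():
        drawn = [rng.choice(samples) for _ in samples]
        replicate[group] = [(name, counts[name]) for name in drawn]
    return replicate


def mix_samples(col_a, col_b, weight):
    """Weighted average of two samples' expression vectors (0 <= weight <= 1, assumed scheme)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(col_a, col_b)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical three-read FASTQ file for one sample.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n@r3\nTTAG\n+\nIIII\n"
reads = parse_fastq(fastq)
boot_reads = bootstrap_reads(reads, rng)

# Hypothetical expression matrix: one count vector (three genes) per sample.
counts = {"ctrl1": [10, 0, 5], "ctrl2": [12, 1, 4], "inf1": [30, 2, 5], "inf2": [28, 3, 6]}
groups = {"control": ["ctrl1", "ctrl2"], "infected": ["inf1", "inf2"]}
boot_matrix = column_bootstrap(counts, groups, rng)

# Equal-weight mixture of the two control samples.
mixed = mix_samples(counts["ctrl1"], counts["ctrl2"], 0.5)
```

The sketch also hints at the cost difference the abstract mentions: a read-level bootstrap has to be followed by the full alignment and quantification pipeline for every artificial replicate, whereas the column bootstrap and the mixing approach operate directly on an already computed expression matrix.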

ASJC Scopus subject areas

Cite this

A comparison of strategies for generating artificial replicates in RNA-seq experiments. / Saremi, Babak; Gusmag, Frederic; Distl, Ottmar et al.
In: Scientific reports, Vol. 12, No. 1, 7170, 12.2022.

Saremi B, Gusmag F, Distl O, Schaarschmidt F, Metzger J, Becker S et al. A comparison of strategies for generating artificial replicates in RNA-seq experiments. Scientific reports. 2022 Dec;12(1):7170. Epub 2022 May 3. doi: 10.1038/s41598-022-11302-9
Saremi, Babak ; Gusmag, Frederic ; Distl, Ottmar et al. / A comparison of strategies for generating artificial replicates in RNA-seq experiments. In: Scientific reports. 2022 ; Vol. 12, No. 1.
@article{267d083f28324bb1b95e3912c8ab8edf,
title = "A comparison of strategies for generating artificial replicates in RNA-seq experiments",
author = "Babak Saremi and Frederic Gusmag and Ottmar Distl and Frank Schaarschmidt and Julia Metzger and Stefanie Becker and Klaus Jung",
note = "Funding Information: Open Access funding enabled and organized by Projekt DEAL. This project received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) [398066876/GRK 2485/1]. Funding Information: We thank Heike Klippert-Hasberg (Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover) for technical assistance in the sequencing experiment.",
year = "2022",
month = dec,
doi = "10.1038/s41598-022-11302-9",
language = "English",
volume = "12",
journal = "Scientific reports",
issn = "2045-2322",
publisher = "Nature Publishing Group",
number = "1",

}

TY - JOUR

T1 - A comparison of strategies for generating artificial replicates in RNA-seq experiments

AU - Saremi, Babak

AU - Gusmag, Frederic

AU - Distl, Ottmar

AU - Schaarschmidt, Frank

AU - Metzger, Julia

AU - Becker, Stefanie

AU - Jung, Klaus

N1 - Funding Information: Open Access funding enabled and organized by Projekt DEAL. This project received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) [398066876/GRK 2485/1]. Funding Information: We thank Heike Klippert-Hasberg (Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover) for technical assistance in the sequencing experiment.

PY - 2022/12

Y1 - 2022/12

UR - http://www.scopus.com/inward/record.url?scp=85129336434&partnerID=8YFLogxK

U2 - 10.1038/s41598-022-11302-9

DO - 10.1038/s41598-022-11302-9

M3 - Article

C2 - 35505053

AN - SCOPUS:85129336434

VL - 12

JO - Scientific reports

JF - Scientific reports

SN - 2045-2322

IS - 1

M1 - 7170

ER -
