Details
Originalsprache | Englisch |
---|---|
Seiten (von - bis) | 1650-1658 |
Seitenumfang | 9 |
Fachzeitschrift | BIOINFORMATICS |
Jahrgang | 34 |
Ausgabenummer | 10 |
Publikationsstatus | Veröffentlicht - 23 Nov. 2017 |
Abstract
Motivation Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results We analyze the performance of several lossy compressors for quality values in terms of trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as good as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation CALQ is written in C ++ and can be downloaded from https://github.com/voges/calq.
ASJC Scopus Sachgebiete
- Mathematik (insg.)
- Statistik und Wahrscheinlichkeit
- Biochemie, Genetik und Molekularbiologie (insg.)
- Biochemie
- Biochemie, Genetik und Molekularbiologie (insg.)
- Molekularbiologie
- Informatik (insg.)
- Angewandte Informatik
- Informatik (insg.)
- Theoretische Informatik und Mathematik
- Mathematik (insg.)
- Computational Mathematics
Zitieren
- Standard
- Harvard
- Apa
- Vancouver
- BibTex
- RIS
in: BIOINFORMATICS, Jahrgang 34, Nr. 10, 23.11.2017, S. 1650-1658.
Publikation: Beitrag in Fachzeitschrift › Artikel › Forschung › Peer-Review
}
TY - JOUR
T1 - CALQ
T2 - Compression of quality values of aligned sequencing data
AU - Voges, Jan
AU - Ostermann, Jörn
AU - Hernaez, Mikel
N1 - Funding information: This work has been partially supported by the Leibniz Universität Hannover eNIFE grant, the Stanford Data Science Initiative (SDSI), the National Science Foundations grant NSF 1184146-3-PCEIC and the National Institute of Health grant with number NIH 1 U01 CA198943-01.
PY - 2017/11/23
Y1 - 2017/11/23
N2 - Motivation Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results We analyze the performance of several lossy compressors for quality values in terms of trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as good as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation CALQ is written in C ++ and can be downloaded from https://github.com/voges/calq.
AB - Motivation Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results We analyze the performance of several lossy compressors for quality values in terms of trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as good as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation CALQ is written in C ++ and can be downloaded from https://github.com/voges/calq.
UR - http://www.scopus.com/inward/record.url?scp=85047088167&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btx737
DO - 10.1093/bioinformatics/btx737
M3 - Article
C2 - 29186284
AN - SCOPUS:85047088167
VL - 34
SP - 1650
EP - 1658
JO - BIOINFORMATICS
JF - BIOINFORMATICS
SN - 1367-4803
IS - 10
ER -