Details
Original language | English |
---|---|
Pages (from-to) | 1650-1658 |
Number of pages | 9 |
Journal | BIOINFORMATICS |
Volume | 34 |
Issue number | 10 |
Publication status | Published - 23 Nov 2017 |
Abstract
Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses.
Results: We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets.
Availability and implementation: CALQ is written in C++ and can be downloaded from https://github.com/voges/calq.
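As a rough illustration of what "controlling the coarseness of quality value quantization" can mean, the following Python sketch quantizes Phred-style quality values with a uniform quantizer whose number of levels depends on a confidence score. This is not CALQ's actual algorithm or API: the function names, the quality range 0 to 40, the linear mapping from confidence to level count, and the confidence score itself are all assumptions made for illustration, and the statistical genotyping model described in the abstract is not reproduced here.

```python
def levels_for_confidence(confidence, min_levels=2, max_levels=8):
    """Pick a quantizer size from a hypothetical confidence score in [0, 1].

    Assumption for illustration: high confidence in the genotype at a
    position allows coarser quantization (fewer levels).
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return round(max_levels - confidence * (max_levels - min_levels))


def quantize(q, levels, q_min=0, q_max=40):
    """Map quality value q to the (rounded) midpoint of one of `levels` uniform bins."""
    if levels < 2:
        raise ValueError("need at least two quantization levels")
    q = min(max(q, q_min), q_max)            # clamp to the assumed quality range
    width = (q_max - q_min + 1) / levels     # bin width over the integer range
    index = min(int((q - q_min) / width), levels - 1)
    return q_min + round((index + 0.5) * width - 0.5)


# Hypothetical read qualities paired with hypothetical per-position confidences.
qualities = [2, 17, 30, 38]
confidences = [1.0, 1.0, 0.0, 0.0]
reconstructed = [
    quantize(q, levels_for_confidence(c)) for q, c in zip(qualities, confidences)
]
```

With fewer levels, more quality values collapse onto the same reconstruction value, which lowers the number of bits needed per quality value at the cost of more distortion.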
ASJC Scopus subject areas
- Mathematics(all)
  - Statistics and Probability
  - Computational Mathematics
- Biochemistry, Genetics and Molecular Biology(all)
  - Biochemistry
  - Molecular Biology
- Computer Science(all)
  - Computer Science Applications
  - Computational Theory and Mathematics
Cite this
In: BIOINFORMATICS, Vol. 34, No. 10, 23.11.2017, p. 1650-1658.
Research output: Contribution to journal › Article › Research › peer review
TY - JOUR
T1 - CALQ
T2 - Compression of quality values of aligned sequencing data
AU - Voges, Jan
AU - Ostermann, Jörn
AU - Hernaez, Mikel
N1 - Funding information: This work has been partially supported by the Leibniz Universität Hannover eNIFE grant, the Stanford Data Science Initiative (SDSI), the National Science Foundation grant NSF 1184146-3-PCEIC and the National Institutes of Health grant NIH 1 U01 CA198943-01.
PY - 2017/11/23
Y1 - 2017/11/23
UR - http://www.scopus.com/inward/record.url?scp=85047088167&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btx737
DO - 10.1093/bioinformatics/btx737
M3 - Article
C2 - 29186284
AN - SCOPUS:85047088167
VL - 34
SP - 1650
EP - 1658
JO - BIOINFORMATICS
JF - BIOINFORMATICS
SN - 1367-4803
IS - 10
ER -