CALQ: Compression of quality values of aligned sequencing data

Research output: Contribution to journal › Article › Research › peer review

Authors

  • Jan Voges
  • Jörn Ostermann
  • Mikel Hernaez

External Research Organisations

  • University of Illinois at Urbana-Champaign

Details

Original language: English
Pages (from-to): 1650-1658
Number of pages: 9
Journal: BIOINFORMATICS
Volume: 34
Issue number: 10
Publication status: Published - 23 Nov 2017

Abstract

Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results: We analyze the performance of several lossy compressors for quality values in terms of trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as good as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation: CALQ is written in C++ and can be downloaded from https://github.com/voges/calq.

Cite this

CALQ: Compression of quality values of aligned sequencing data. / Voges, Jan; Ostermann, Jörn; Hernaez, Mikel.
In: BIOINFORMATICS, Vol. 34, No. 10, 23.11.2017, p. 1650-1658.

Voges J, Ostermann J, Hernaez M. CALQ: Compression of quality values of aligned sequencing data. BIOINFORMATICS. 2017 Nov 23;34(10):1650-1658. doi: 10.1093/bioinformatics/btx737
Voges, Jan; Ostermann, Jörn; Hernaez, Mikel. / CALQ: Compression of quality values of aligned sequencing data. In: BIOINFORMATICS. 2017; Vol. 34, No. 10. pp. 1650-1658.
@article{d589ff10858046a6bfc9d6592ae86cc4,
title = "CALQ: Compression of quality values of aligned sequencing data",
abstract = "Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results: We analyze the performance of several lossy compressors for quality values in terms of trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as good as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation: CALQ is written in C++ and can be downloaded from https://github.com/voges/calq.",
author = "Jan Voges and J{\"o}rn Ostermann and Mikel Hernaez",
note = "Funding information: This work has been partially supported by the Leibniz Universit{\"a}t Hannover eNIFE grant, the Stanford Data Science Initiative (SDSI), the National Science Foundation grant NSF 1184146-3-PCEIC and the National Institutes of Health grant number NIH 1 U01 CA198943-01.",
year = "2017",
month = nov,
day = "23",
doi = "10.1093/bioinformatics/btx737",
language = "English",
volume = "34",
pages = "1650--1658",
journal = "BIOINFORMATICS",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "10",

}

TY - JOUR

T1 - CALQ

T2 - Compression of quality values of aligned sequencing data

AU - Voges, Jan

AU - Ostermann, Jörn

AU - Hernaez, Mikel

N1 - Funding information: This work has been partially supported by the Leibniz Universität Hannover eNIFE grant, the Stanford Data Science Initiative (SDSI), the National Science Foundation grant NSF 1184146-3-PCEIC and the National Institutes of Health grant number NIH 1 U01 CA198943-01.

PY - 2017/11/23

Y1 - 2017/11/23

AB - Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results: We analyze the performance of several lossy compressors for quality values in terms of trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as good as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation: CALQ is written in C++ and can be downloaded from https://github.com/voges/calq.

UR - http://www.scopus.com/inward/record.url?scp=85047088167&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btx737

DO - 10.1093/bioinformatics/btx737

M3 - Article

C2 - 29186284

AN - SCOPUS:85047088167

VL - 34

SP - 1650

EP - 1658

JO - BIOINFORMATICS

JF - BIOINFORMATICS

SN - 1367-4803

IS - 10

ER -