A Two-Level Scheme for Quality Score Compression

Publication: Contribution to journal › Article › Research › Peer-reviewed

Authors

  • Jan Voges
  • Ali Fotouhi
  • Jörn Ostermann
  • Muhammed Oǧuzhan Külekci

External organisations

  • Istanbul Technical University

Details

Original language: English
Pages (from-to): 1141-1151
Number of pages: 11
Journal: Journal of Computational Biology
Volume: 25
Issue number: 10
Publication status: Published - 4 Oct 2018

Abstract

Previous studies on quality score compression can be classified into two main lines: lossy schemes and lossless schemes. Lossy schemes enable a better management of computational resources. Thus, in practice, and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for a deeper analysis of the data. Hence, it might be necessary to keep them; in addition to lossy compression this requires lossless compression as well. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows the users to work with lossy quality scores in routine analysis, without sacrificing the capability of reaching the original quality scores when further investigations are required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are separately compressed such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows the users to extract the original quality scores. Experiments on real data reveal that the downstream analysis with the lossy part - spending only 0.49 bits per quality score on average - shows a competitive performance, and that the total space usage with the inclusion of the compressed second dimension is comparable to the performance of competing lossless schemes.
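
The abstract does not spell out the decomposition itself, so the following is only a minimal sketch of the two-level idea under an assumed uniform-binning split: each score q is stored as the tuple (q // BIN_WIDTH, q % BIN_WIDTH). The first component alone gives a lossy score (the bin midpoint), and adding the second component restores the original exactly. The bin width, the function names, and the midpoint reconstruction are illustrative assumptions, not QScomp's actual method.

```python
import math
from collections import Counter

BIN_WIDTH = 8  # assumed bin size, chosen only for illustration


def decompose(scores, width=BIN_WIDTH):
    """Split each score into (coarse bin index, offset within the bin)."""
    first = [q // width for q in scores]
    second = [q % width for q in scores]
    return first, second


def lossy_scores(first, width=BIN_WIDTH):
    """Approximate the scores from the first level only (bin midpoints)."""
    return [b * width + width // 2 for b in first]


def lossless_scores(first, second, width=BIN_WIDTH):
    """Recover the original scores from both levels."""
    return [b * width + r for b, r in zip(first, second)]


def entropy_bits_per_symbol(symbols):
    """Zero-order empirical entropy of a non-empty symbol stream."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    phred = [40, 40, 38, 37, 40, 12, 2, 40, 39, 33]  # made-up Phred scores
    first, second = decompose(phred)
    # Both levels together must reproduce the input exactly.
    assert lossless_scores(first, second) == phred
    print("lossy view:", lossy_scores(first))
    print("first-level entropy (bits/score):", entropy_bits_per_symbol(first))
```

In a real pipeline each of the two streams would be handed to its own entropy coder. The zero-order entropy printed here is only a rough estimate of the cost per score on toy data and is not comparable to the 0.49 bits per quality score reported above.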

Cite this

A Two-Level Scheme for Quality Score Compression / Voges, Jan; Fotouhi, Ali; Ostermann, Jörn et al.
In: Journal of Computational Biology, Vol. 25, No. 10, 04.10.2018, pp. 1141-1151.

Voges J, Fotouhi A, Ostermann J, Külekci MO. A Two-Level Scheme for Quality Score Compression. Journal of Computational Biology. 2018 Oct 4;25(10):1141-1151. doi: 10.1089/cmb.2018.0065
Voges, Jan ; Fotouhi, Ali ; Ostermann, Jörn et al. / A Two-Level Scheme for Quality Score Compression. In: Journal of Computational Biology. 2018 ; Vol. 25, No. 10. pp. 1141-1151.
@article{ad26e5b0ebbd45f8acec0b0a943876bc,
title = "A Two-Level Scheme for Quality Score Compression",
abstract = "Previous studies on quality score compression can be classified into two main lines: lossy schemes and lossless schemes. Lossy schemes enable a better management of computational resources. Thus, in practice, and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for a deeper analysis of the data. Hence, it might be necessary to keep them; in addition to lossy compression this requires lossless compression as well. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows the users to work with lossy quality scores in routine analysis, without sacrificing the capability of reaching the original quality scores when further investigations are required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are separately compressed such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows the users to extract the original quality scores. Experiments on real data reveal that the downstream analysis with the lossy part - spending only 0.49 bits per quality score on average - shows a competitive performance, and that the total space usage with the inclusion of the compressed second dimension is comparable to the performance of competing lossless schemes.",
keywords = "genomic data management, high-throughput sequencing, lossless data compression, lossy data compression, quality score compression, variant calling",
author = "Jan Voges and Ali Fotouhi and J{\"o}rn Ostermann and K{\"u}lekci, {Muhammed Oǧuzhan}",
year = "2018",
month = oct,
day = "4",
doi = "10.1089/cmb.2018.0065",
language = "English",
volume = "25",
pages = "1141--1151",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "10",

}

TY - JOUR

T1 - A Two-Level Scheme for Quality Score Compression

AU - Voges, Jan

AU - Fotouhi, Ali

AU - Ostermann, Jörn

AU - Külekci, Muhammed Oǧuzhan

PY - 2018/10/4

Y1 - 2018/10/4

AB - Previous studies on quality score compression can be classified into two main lines: lossy schemes and lossless schemes. Lossy schemes enable a better management of computational resources. Thus, in practice, and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for a deeper analysis of the data. Hence, it might be necessary to keep them; in addition to lossy compression this requires lossless compression as well. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows the users to work with lossy quality scores in routine analysis, without sacrificing the capability of reaching the original quality scores when further investigations are required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are separately compressed such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows the users to extract the original quality scores. Experiments on real data reveal that the downstream analysis with the lossy part - spending only 0.49 bits per quality score on average - shows a competitive performance, and that the total space usage with the inclusion of the compressed second dimension is comparable to the performance of competing lossless schemes.

KW - genomic data management

KW - high-throughput sequencing

KW - lossless data compression

KW - lossy data compression

KW - quality score compression

KW - variant calling

UR - http://www.scopus.com/inward/record.url?scp=85054439697&partnerID=8YFLogxK

U2 - 10.1089/cmb.2018.0065

DO - 10.1089/cmb.2018.0065

M3 - Article

C2 - 30059248

AN - SCOPUS:85054439697

VL - 25

SP - 1141

EP - 1151

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 10

ER -
