Details
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 1141-1151 |
| Number of pages | 11 |
| Journal | Journal of Computational Biology |
| Volume | 25 |
| Issue number | 10 |
| Publication status | Published - 4 Oct 2018 |
Abstract
Previous studies on quality score compression can be classified into two main lines: lossy schemes and lossless schemes. Lossy schemes enable better management of computational resources. Thus, in practice, and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for a deeper analysis of the data. Hence, it might be necessary to keep them, which requires lossless compression in addition to lossy compression. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows users to work with lossy quality scores in routine analysis without sacrificing the ability to recover the original quality scores when further investigations are required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are compressed separately, such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows users to extract the original quality scores. Experiments on real data reveal that downstream analysis with the lossy part - spending only 0.49 bits per quality score on average - shows competitive performance, and that the total space usage, including the compressed second dimension, is comparable to that of competing lossless schemes.
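The two-level idea in the abstract can be illustrated with a small sketch. The abstract does not specify the decomposition used by QScomp, so the code below swaps in a plain uniform binning as a stand-in: the first tuple dimension is a coarse bin index (usable alone as a lossy score), and the second is the offset inside the bin (needed only for exact recovery). The bin width, function names, and sample scores are all hypothetical and chosen purely for illustration.

```python
from collections import Counter
from math import log2

BIN_WIDTH = 8  # hypothetical bin width, not taken from the paper


def decompose(q: int, width: int = BIN_WIDTH) -> tuple[int, int]:
    """Split a quality score into (coarse bin index, offset inside the bin)."""
    return divmod(q, width)


def lossy_value(coarse: int, width: int = BIN_WIDTH) -> int:
    """Approximate score from the first dimension only (bin midpoint)."""
    return coarse * width + width // 2


def reconstruct(coarse: int, offset: int, width: int = BIN_WIDTH) -> int:
    """Exact score from both dimensions."""
    return coarse * width + offset


def entropy_bits(symbols) -> float:
    """Zeroth-order empirical entropy, in bits per symbol."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())


# Made-up Phred-like scores, for illustration only.
scores = [38, 39, 40, 40, 37, 12, 2, 40, 39, 38]

pairs = [decompose(q) for q in scores]
first = [c for c, _ in pairs]    # first dimension: enough for lossy use
second = [o for _, o in pairs]   # second dimension: stored separately

lossy = [lossy_value(c) for c in first]
exact = [reconstruct(c, o) for c, o in pairs]

print("lossy :", lossy)
print("exact :", exact)
print("bits/score, first dimension :", round(entropy_bits(first), 3))
print("bits/score, original scores :", round(entropy_bits(scores), 3))
```

In this stand-in, the first dimension has a smaller alphabet than the original scores, so an entropy coder can store it more cheaply, while keeping the second dimension alongside it makes the original scores recoverable. The figures printed here say nothing about the 0.49 bits per quality score reported for QScomp, which depends on the authors' actual decomposition and coder.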
Keywords
- genomic data management, high-throughput sequencing, lossless data compression, lossy data compression, quality score compression, variant calling
ASJC Scopus subject areas
- Mathematics(all)
  - Modelling and Simulation
  - Computational Mathematics
- Biochemistry, Genetics and Molecular Biology(all)
  - Molecular Biology
  - Genetics
- Computer Science(all)
  - Computational Theory and Mathematics
Cite this
In: Journal of Computational Biology, Vol. 25, No. 10, 04.10.2018, p. 1141-1151.
Research output: Contribution to journal › Article › Research › peer review
TY - JOUR
T1 - A Two-Level Scheme for Quality Score Compression
AU - Voges, Jan
AU - Fotouhi, Ali
AU - Ostermann, Jörn
AU - Külekci, Muhammed Oğuzhan
PY - 2018/10/4
Y1 - 2018/10/4
KW - genomic data management
KW - high-throughput sequencing
KW - lossless data compression
KW - lossy data compression
KW - quality score compression
KW - variant calling
UR - http://www.scopus.com/inward/record.url?scp=85054439697&partnerID=8YFLogxK
U2 - 10.1089/cmb.2018.0065
DO - 10.1089/cmb.2018.0065
M3 - Article
C2 - 30059248
AN - SCOPUS:85054439697
VL - 25
SP - 1141
EP - 1151
JO - Journal of Computational Biology
JF - Journal of Computational Biology
SN - 1066-5277
IS - 10
ER -