Details
Original language | English |
---|---|
Title of host publication | 2019 Data Compression Conference (DCC) |
Editors | James A. Storer, Ali Bilgin, Joan Serra-Sagrista, Michael W. Marcellin |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 93-102 |
Number of pages | 10 |
ISBN (electronic) | 978-1-7281-0657-1 |
ISBN (print) | 978-1-7281-0658-8 |
Publication status | Published - May 2019 |
Event | 2019 Data Compression Conference, DCC 2019 - Snowbird, United States. Duration: 26 March 2019 → 29 March 2019 |
Publication series
Name | Data Compression Conference Proceedings |
---|---|
Volume | 2019-March |
ISSN (print) | 1068-0314 |
ISSN (electronic) | 2375-0359 |
Abstract
As genome sequencing becomes more affordable, more raw and aligned genomic files are expected to be generated in the coming years. In addition, as the throughput of sequencing machines increases, the size of these files is growing significantly. Aligned files (e.g., SAM/BAM) in particular are used for further processing of the data, so an efficient representation of these files is a pressing need. In this work we present AliCo, a new compression method tailored to aligned data represented in the SAM format. We demonstrate through simulations on existing datasets that AliCo outperforms, on average, the state-of-the-art compressors for SAM files in compression ratio, achieving more than 85% reduction in size when operating in its lossless mode. AliCo also supports a variety of modes for lossy compression of the quality scores, including, for the first time, the recently proposed lossy compressor CALQ, which uses information from the aligned reads to adjust the level of quantization for each location of the genome (achieving more than 10× compression gains in high-coverage datasets). AliCo also supports optional compression of the reference sequence used for compression, hence guaranteeing exact reconstruction of the compressed data. Finally, AliCo can stream the data as it is being compressed and decompress the data as it is being received, potentially providing significant time savings.
ASJC Scopus subject areas
- General Computer Science
- Computer Networks and Communications
Cite this
2019 Data Compression Conference (DCC). ed. / James A. Storer; Ali Bilgin; Joan Serra-Sagrista; Michael W. Marcellin. Institute of Electrical and Electronics Engineers Inc., 2019. p. 93-102 8712770 (Data Compression Conference Proceedings; Vol. 2019-March).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer-reviewed
TY - GEN
T1 - AliCo
T2 - 2019 Data Compression Conference, DCC 2019
AU - Ochoa, Idoia
AU - Li, Hongyi
AU - Baumgarte, Florian
AU - Hergenrother, Charles
AU - Voges, Jan
AU - Hernaez, Mikel
N1 - Funding Information: This work was partially funded by grant numbers 2018-182798 and 2018-182799 from the Chan Zuckerberg Initiative DAF, and an SRI grant from UIUC.
PY - 2019/5
Y1 - 2019/5
KW - Aligned data
KW - Compression
KW - Genomic data
KW - SAM file
UR - http://www.scopus.com/inward/record.url?scp=85066315861&partnerID=8YFLogxK
U2 - 10.1109/DCC.2019.00017
DO - 10.1109/DCC.2019.00017
M3 - Conference contribution
AN - SCOPUS:85066315861
SN - 978-1-7281-0658-8
T3 - Data Compression Conference Proceedings
SP - 93
EP - 102
BT - 2019 Data Compression Conference (DCC)
A2 - Storer, James A.
A2 - Bilgin, Ali
A2 - Serra-Sagrista, Joan
A2 - Marcellin, Michael W.
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 26 March 2019 through 29 March 2019
ER -