AliCo: A New Efficient Representation for SAM Files

Publication: Contribution to book/report/anthology/conference proceedings, conference paper, research, peer-reviewed

Authors

  • Idoia Ochoa
  • Hongyi Li
  • Florian Baumgarte
  • Charles Hergenrother
  • Jan Voges
  • Mikel Hernaez

External organizations

  • University of Illinois Urbana-Champaign (UIUC)
  • University of Notre Dame
  • Carl R. Woese Institute for Genomic Biology (IGB)

Details

Original language: English
Title of host publication: 2019 Data Compression Conference (DCC)
Editors: James A. Storer, Ali Bilgin, Joan Serra-Sagrista, Michael W. Marcellin
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 93-102
Number of pages: 10
ISBN (electronic): 978-1-7281-0657-1
ISBN (print): 978-1-7281-0658-8
Publication status: Published - May 2019
Event: 2019 Data Compression Conference, DCC 2019 - Snowbird, United States
Duration: 26 March 2019 to 29 March 2019

Publication series

Name: Data Compression Conference Proceedings
Volume: 2019-March
ISSN (print): 1068-0314
ISSN (electronic): 2375-0359

Abstract

As genome sequencing becomes more cost-effective, more raw and aligned genomic files are expected to be generated in the coming years. In addition, as the throughput of sequencing machines increases, the size of these files is growing significantly. Aligned files (e.g., SAM/BAM) in particular are used for further processing of the data, so an efficient representation of these files is a pressing need. In this work we present AliCo, a new compression method tailored to aligned data in the SAM format. We demonstrate through simulations on existing datasets that AliCo, on average, outperforms state-of-the-art SAM compressors in compression ratio, achieving more than 85% reduction in size in its lossless mode. AliCo also supports a variety of modes for lossy compression of the quality scores, including, for the first time, the recently proposed lossy compressor CALQ, which uses information from the aligned reads to adjust the level of quantization for each location of the genome (achieving more than 10× compression gains in high-coverage datasets). AliCo also supports optional compression of the reference sequence used for compression, thereby guaranteeing exact reconstruction of the compressed data. Finally, AliCo allows the data to be streamed while it is being compressed and decompressed while it is being received, potentially providing significant time savings.
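
To put the lossless figure in perspective, a size reduction can be restated as a compression ratio. The snippet below is a minimal sketch and not part of AliCo: the function names are hypothetical, and it assumes "reduction in size" means 1 - (compressed size / original size). The abstract does not state the baseline for the 10× gain of the CALQ mode, so that figure is not converted here.

```python
def reduction_to_ratio(reduction: float) -> float:
    """Convert a fractional size reduction into a compression ratio.

    Assumes reduction = 1 - compressed_size / original_size,
    so ratio = original_size / compressed_size = 1 / (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


def ratio_to_reduction(ratio: float) -> float:
    """Inverse conversion: compression ratio to fractional size reduction."""
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1")
    return 1.0 - 1.0 / ratio


if __name__ == "__main__":
    # An 85% reduction corresponds to a ratio of roughly 6.67 to 1.
    print(f"{reduction_to_ratio(0.85):.2f}")
```

Under that assumption, "more than 85% reduction" means the compressed file is less than about one seventh of the original SAM file.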

Cite this

AliCo: A New Efficient Representation for SAM Files. / Ochoa, Idoia; Li, Hongyi; Baumgarte, Florian et al.
2019 Data Compression Conference (DCC). Ed. / James A. Storer; Ali Bilgin; Joan Serra-Sagrista; Michael W. Marcellin. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 93-102, Article 8712770 (Data Compression Conference Proceedings; Vol. 2019-March).

Ochoa, I, Li, H, Baumgarte, F, Hergenrother, C, Voges, J & Hernaez, M 2019, AliCo: A New Efficient Representation for SAM Files. in JA Storer, A Bilgin, J Serra-Sagrista & MW Marcellin (eds), 2019 Data Compression Conference (DCC), 8712770, Data Compression Conference Proceedings, vol. 2019-March, Institute of Electrical and Electronics Engineers Inc., pp. 93-102, 2019 Data Compression Conference, DCC 2019, Snowbird, United States, 26 March 2019. https://doi.org/10.1109/DCC.2019.00017
Ochoa, I., Li, H., Baumgarte, F., Hergenrother, C., Voges, J., & Hernaez, M. (2019). AliCo: A New Efficient Representation for SAM Files. In J. A. Storer, A. Bilgin, J. Serra-Sagrista, & M. W. Marcellin (Eds.), 2019 Data Compression Conference (DCC) (pp. 93-102). Article 8712770 (Data Compression Conference Proceedings; Vol. 2019-March). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/DCC.2019.00017
Ochoa I, Li H, Baumgarte F, Hergenrother C, Voges J, Hernaez M. AliCo: A New Efficient Representation for SAM Files. In Storer JA, Bilgin A, Serra-Sagrista J, Marcellin MW, editors, 2019 Data Compression Conference (DCC). Institute of Electrical and Electronics Engineers Inc. 2019. p. 93-102. 8712770. (Data Compression Conference Proceedings). doi: 10.1109/DCC.2019.00017
Ochoa, Idoia ; Li, Hongyi ; Baumgarte, Florian et al. / AliCo : A New Efficient Representation for SAM Files. 2019 Data Compression Conference (DCC). Ed. / James A. Storer ; Ali Bilgin ; Joan Serra-Sagrista ; Michael W. Marcellin. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 93-102 (Data Compression Conference Proceedings).
BibTeX
@inproceedings{b6af8f238a114d2d9a2717310a580d04,
title = "AliCo: A New Efficient Representation for SAM Files",
abstract = "As genome sequencing continues to become more cost-effective and affordable, more raw and aligned genomic files are expected to be generated in future years. In addition, due to the increase in the throughput of sequencing machines, the size of these files is significantly growing. In particular, aligned files (e.g., SAM/BAM) are used for further processing of the data, and hence efficient representation of these files is a pressing need. In this work we present AliCo, a new compression method tailored to the aligned data represented in the SAM format. We demonstrate through simulations on existing datasets that AliCo outperforms in compression ratio, on average, the state-of-the-art compressors for SAM files, achieving more than 85% reduction in size when operating in its lossless mode. AliCo also supports a variety of modes for lossy compression of the quality scores, including for the first time the recently proposed lossy compressor CALQ, which uses information from the aligned reads to adjust the level of quantization for each location of the genome (achieving more than 10× compression gains in high-coverage datasets). AliCo also supports optional compression of the reference sequence used for compression, hence guaranteeing exact reconstruction of the compressed data. Finally, AliCo allows to stream the data as it is being compressed, as well as to decompress the data as it is being received, potentially providing significant time savings.",
keywords = "Aligned data, Compression, Genomic data, SAM file",
author = "Idoia Ochoa and Hongyi Li and Florian Baumgarte and Charles Hergenrother and Jan Voges and Mikel Hernaez",
note = "Funding Information: This work was partially funded by grant numbers 2018-182798 and 2018-182799 from the Chan Zuckerberg Initiative DAF, and an SRI grant from UIUC.; 2019 Data Compression Conference, DCC 2019 ; Conference date: 26-03-2019 Through 29-03-2019",
year = "2019",
month = may,
doi = "10.1109/DCC.2019.00017",
language = "English",
isbn = "978-1-7281-0658-8",
series = "Data Compression Conference Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "93--102",
editor = "Storer, {James A.} and Ali Bilgin and Joan Serra-Sagrista and Marcellin, {Michael W.}",
booktitle = "2019 Data Compression Conference (DCC)",
address = "United States",

}

RIS

TY  - GEN
T1  - AliCo
T2  - 2019 Data Compression Conference, DCC 2019
AU  - Ochoa, Idoia
AU  - Li, Hongyi
AU  - Baumgarte, Florian
AU  - Hergenrother, Charles
AU  - Voges, Jan
AU  - Hernaez, Mikel
N1  - Funding Information: This work was partially funded by grant numbers 2018-182798 and 2018-182799 from the Chan Zuckerberg Initiative DAF, and an SRI grant from UIUC.
PY  - 2019/5
Y1  - 2019/5
AB  - As genome sequencing continues to become more cost-effective and affordable, more raw and aligned genomic files are expected to be generated in future years. In addition, due to the increase in the throughput of sequencing machines, the size of these files is significantly growing. In particular, aligned files (e.g., SAM/BAM) are used for further processing of the data, and hence efficient representation of these files is a pressing need. In this work we present AliCo, a new compression method tailored to the aligned data represented in the SAM format. We demonstrate through simulations on existing datasets that AliCo outperforms in compression ratio, on average, the state-of-the-art compressors for SAM files, achieving more than 85% reduction in size when operating in its lossless mode. AliCo also supports a variety of modes for lossy compression of the quality scores, including for the first time the recently proposed lossy compressor CALQ, which uses information from the aligned reads to adjust the level of quantization for each location of the genome (achieving more than 10× compression gains in high-coverage datasets). AliCo also supports optional compression of the reference sequence used for compression, hence guaranteeing exact reconstruction of the compressed data. Finally, AliCo allows to stream the data as it is being compressed, as well as to decompress the data as it is being received, potentially providing significant time savings.
KW  - Aligned data
KW  - Compression
KW  - Genomic data
KW  - SAM file
UR  - http://www.scopus.com/inward/record.url?scp=85066315861&partnerID=8YFLogxK
U2  - 10.1109/DCC.2019.00017
DO  - 10.1109/DCC.2019.00017
M3  - Conference contribution
AN  - SCOPUS:85066315861
SN  - 978-1-7281-0658-8
T3  - Data Compression Conference Proceedings
SP  - 93
EP  - 102
BT  - 2019 Data Compression Conference (DCC)
A2  - Storer, James A.
A2  - Bilgin, Ali
A2  - Serra-Sagrista, Joan
A2  - Marcellin, Michael W.
PB  - Institute of Electrical and Electronics Engineers Inc.
Y2  - 26 March 2019 through 29 March 2019
ER  -