on the positional BWT

Author

Mattia S.

bf-pbwt results

BLIM timing(s)	chr10(18gb)	chr21(4.7gb)	chr22(4.8)
2bfpbwt-bm on master	690.53	110.47	116.99
2bfpbwt on (71a045a)	294.98	74.48	75.54
2bfpbwt-bm on (hotfix)	753.71	108.56	113.40

BLIM ram (GiB)	chr10(18gb)	chr21(4.7gb)	chr22(4.8)
2bfpbwt-bm on master	250	82	84
2bfpbwt on (71a045a)	158	43	44
2bfpbwt-bm on (hotfix)	244	82	84

REVIEW

Durbin’s Original PBWT [1] git

Input:

a binary matrix \(X\), of \(m\) rows and \(n\) columns. \((\Sigma = \{0, 1\})\), where \(x_i\) are haplotype sequences; \(0\leq i < m\) and \(x_i \in \Sigma^n\)

Output:

for each column \(k\) in \(0\leq k < n\) we represent:

a prefix array \(A_k\) that stores the permutation of the rows in \(X\) according to the co-lex ordering of the subsequences of \(X\) from \(0\) to \(k\);
\(Y_k\), the entire matrix \(X\) with \(x_i\) permuted according to \(A_k\) (not computed)
an array \(y^k = Y_k[k+1]\) containg the permutation of the input \(X\)’s next column, according to the row permutation in \(A_k\). This is computed on-the-fly (and not stored!) reading \(X[k+1]\) by the permutation stored in \(A_k\).
a divergence array \(D_k\) that stores the length of the \(LCP^{-1}\) between each pair \((y_i[0,k], y_{i-1}[0,k])\)

The catch

\(A_k\) and \(D_k\) can be computed by \(A_{k-1}\) and \(D_{k-1}\) linearly in space/time \(O(m)\)[1].

The overall computation time of the \(\texttt{PBWT}\) (composed of arrays \(A_k\) and \(D_k\) for each \(k\) … is \(O(n\cdot m)\).
Moreover, if we keep only \(\texttt{PBWT}_{k}\) to compute the next \(\texttt{PBWT}_{k+1}\), we can do so in \(O(m)\) space, performing some other computation on each \(\texttt{PBWT}\) before it is discarded.

PBWT Use:

Indexing + Querying of Set Maximal Exact Matches and/or L-long matches, computing so-called matching statistics
- SMEMs: match a pattern \(P\) against \(X\), retrive the longest matches of \(P[0,k]\) subsequences in co-lex ordering.
Computing internal all-vs-all matches longer than \(L\) and/or Maximal Perfect Haplotype Blocks (MPHB)

Tools (long matches and SMEMs)

Durbin’s pbwt [1] git
- First Definition of PBWT
- Reads VCF/BCF or .macs simulated data
- On-line computation
- Computes internal L-long matches, internal SMEMs, external (query) SMEMs.
mu-pbwt [2] git
- Run-Length encoding of the PBWT
- reads VCF/BCF
- indexes and queries of a pattern against the VCF
- computes external SMEMs and L-long matches
d-pbwt [3] git
- Dynamic pbwt to dynamically insert/delete reads
- reads VCF
- indexes and queries of patterns against Panel,
- computes external L-long matches
- panel is stored as a std::vector<std::vector<bool>>
dynamic-mu-pbwt [4] git
- dynamic + mu PBWT
- reads VCF
- indexes and queries of patterns against Panel,
- computes external SMEMs and L-long matches
PBWT-Query [5] git
- Code is unaccesible, may read VCF
- Uses “Leap Arrays” to jump from different \(D\) values in \(y^k\)
- Notes: several version of the algorithms are proposed
  - PBWT-Query memory extensive, panel in memory
  - L-PBWT-Query memory mapped, panel lazy-loaded; uses Leap arrays
  - L-PBWT-Query memory extensive, panel in memory; uses Leap arrays
- computes external L-long matches
syllable-PBWT [6] git
- syllable-encoding of the PBWT, 64 or 128 bits
  - code: reading VCF encoded in int syllables SyllableQuery.cpp
- reads VCF
- indexes and queries of patterns against Panel, parametrized by \(B\) syllable length (\(B=\{64, 128\}\))
- computes external L-long matches with \(L \geq 2B\)
- notes: in the paper section (2.1) and Table 1, they write about “bit-PBWT” referring to the Durbin’s original PBWT used in PBWT-Query [5] and d-PBWT [3]
mc-PBWT: multi column (parallel?) scanning of the PBWT (double and triple) [7] used for “Matches of exact Lenght \(L\)
- seems NOT implemented …
parallel PBWT (VCF/BCF )[8] git
- first parallel version (implemented)
  - vertically parallel + reconcile each \(p\) thread
- computes Long Matches and SMEMs (compares against Durbin’s pbwt)
- uses htslib (reads VCF/BCF)
- panel is read as std::vector<std:vector<bool>>
- main code at code
HB-parallel PBWT (VCF?): last version, horizontally parallel [9] Long Matches and SMEMs
- no code perovided
- benchmarks are against [8] and Durbin’s
- speed up only for very large Haplotype count (M size).

Tools (Haplotype Blocks, internal Matches)

Trie-based Block Matching [10]
- implemnted in “Haploblocks” below as trie algorithm by Stoye
- Blocks are found in \(O(MN^2)\)
- Rows are encoded as arrays of chars
Haploblocks (PBWT-based block Matching) [11] git
- Prefix and Divergence are stored as longlongint
- Panel is streamed by column from a text file of 0-1s representing the binary matrix
  - the implementation may accepts a custom ordered alphabet with minor adjustments (0, 1, 2, 3, etc.)
Bi-directional PBWT [12] git
- Reads VCFs (no htslib)
- Block lengths defaults to 500k in WIDTH!!!, check on the paper the assumptions of such large blocks.
- Block haplotype count defaults to 100, check as above
- Computes blocks with GAPS (defaults is 1 column gap). Increasing this value may greatly affect performances
  - To test normal haplotype blocks, set the gap to 0
- Requires intermediate data to be writte on disk, about 4 times the VCF size (that’s a lot!), encoding Prefix and Divergece with 4bytes(64bit ints) per value …
WildHap (Trie-based * block matching) [13] git
- Reads text file of 0-1s representing the binary matrix, plus * gaps.
- Computes MPHBw; can be used to compute regular blocks given as input a binary matrix
- Trie base, written in Java.
Wild-pBWT (PBWT-based * block matching) [14] git
- Reads text file of 0-1s representing the binary matrix, plus * gaps.
- Computes MHBw; can be used to compute regular blocks given as input a binary matrix.
- Code is ready for multi-allelic PBWT (unlike [11])
psmoother [15] git
- Reads VCF (no htslib)
- Extension of bi-PBWT. Finds blocks with gaps and the tries to smoothe the Panel, i.e. correcting the gapped columns to extend block matchs.
- Code is highly tangled with bi-PBWT, and uses in fact a bi-pBWT.

Other papers on PBWT

templated PBWT [16] git
- Uses templates, i.e. bit-masks to cover the input and compute Long Matches and Blocks with different Gaps and Masks.
- The computation of different templates is parallelized and intermediate data is saved to binary files (uint8 for each allele).
- Written in Cython
- Compared against PBWT, RaPid, hap-IBD
tree consistent pbwt (tc-pbwt)
fast recomb
Data Structures for SMEM-Finding in the PBWT
Compressed Data Structures for Population-Scale Positional Burrows–Wheeler Transforms
multiallelic pbwt (zing, zang, sanahulla)
FiMap: sparse linear algebra and random matrix algorithms to speed up IBD analysis
hap-IBD (zhou): parallel PBWT for IBD segments, in JAVA. Reads VCF
RaPID (naseri): reads vcf, uses PBWT and reconcile IBD segments; python
Compression
- Durbin
- CoMSA (gPBWT)
- GTShark

References

Durbin R (2014) Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics 30(9):1266–1272. https://doi.org/10.1093/bioinformatics/btu014

Cozzi D, Rossi M, Rubinacci S, et al (2023) \(\mu\)- PBWT: A lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Bioinformatics 39(9):btad552. https://doi.org/10.1093/bioinformatics/btad552

Sanaullah A, Zhi D, Zhang S (2021) D-PBWT: Dynamic positional Burrows–Wheeler transform. Bioinformatics 37(16):2390–2397. https://doi.org/10.1093/bioinformatics/btab117

Shakya P, Sanaullah A, Zhi D, Zhang S (2025) Dynamic \(\mu\)-PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data. bioRxiv 2025.02.04.636479. https://doi.org/10.1101/2025.02.04.636479

Naseri A, Holzhauser E, Zhi D, Zhang S (2019) Efficient haplotype matching between a query and a panel for genealogical search. Bioinformatics 35(14):i233–i241. https://doi.org/10.1093/bioinformatics/btz347

Wang V, Naseri A, Zhang S, Zhi D (2023) Syllable-PBWT for space-efficient haplotype long-match query. Bioinformatics 39(1):btac734. https://doi.org/10.1093/bioinformatics/btac734

Shakya P, Naseri A, Zhi D, Zhang S (2022) mcPBWT: Space-Efficient Multi-column PBWT Scanning Algorithm for Composite Haplotype Matching. In: Bansal MS, Măndoiu I, Moussa M, et al (eds) Computational Advances in Bio and Medical Sciences. Springer International Publishing, Cham, pp 115–130

Wertenbroek R, Xenarios I, Thoma Y, Delaneau O (2023) Exploiting parallelization in positional Burrows–Wheeler transform (PBWT) algorithms for efficient haplotype matching and compression. Bioinformatics Advances 3(1):vbad021. https://doi.org/10.1093/bioadv/vbad021

Tang K, Sanaullah A, Zhi D, Zhang S (2025) Haplotype-based Parallel PBWT for Biobank Scale Data. 2025.02.04.636317

10.

Cunha L, Diekmann Y, Kowada L, Stoye J (2018) Identifying Maximal Perfect Haplotype Blocks. In: Alves R (ed) Advances in Bioinformatics and Computational Biology. Springer International Publishing, Cham, pp 26–37

11.

Alanko J, Bannai H, Cazaux B, Peterlongo P, Stoye J (2020) Finding all maximal perfect haplotype blocks in linear time. Algorithms for Molecular Biology 15(1):2. https://doi.org/10.1186/s13015-020-0163-6

12.

Naseri A, Yue W, Zhang S, Zhi D (2021) Efficient Haplotype Block Matching in Bi-Directional PBWT. In: Carbone A, El-Kebir M (eds) 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, pp 19:1–19:13

13.

Williams L, Mumey B (2020) Maximal Perfect Haplotype Blocks with Wildcards

14.

Bonizzoni P, Della Vedova G, Pirola Y, Rizzi R, Sgrò M (2023) Multiallelic Maximal Perfect Haplotype Blocks with Wildcards via PBWT. In: Rojas I, Valenzuela O, Rojas Ruiz F, Herrera LJ, Ortuño F (eds) Bioinformatics and Biomedical Engineering. Springer Nature Switzerland, Cham, pp 62–76

15.

Yue W, Naseri A, Wang V, Shakya P, Zhang S, Zhi D (2022) P-smoother: Efficient PBWT smoothing of large haplotype panels. Bioinformatics Advances 2(1):vbac045. https://doi.org/10.1093/bioadv/vbac045

16.

Freyman WA, McManus KF, Shringarpure SS, et al (2021) Fast and Robust Identity-by-Descent Inference with the Templated Positional Burrows–Wheeler Transform. Molecular Biology and Evolution 38(5):2131–2151. https://doi.org/10.1093/molbev/msaa328