on the positional BWT

Author

Mattia S.

bf-pbwt results

BLIM timing(s) chr10(18gb) chr21(4.7gb) chr22(4.8)
2bfpbwt-bm on master 690.53 110.47 116.99
2bfpbwt on (71a045a) 294.98 74.48 75.54
2bfpbwt-bm on (hotfix) 753.71 108.56 113.40
BLIM ram (GiB) chr10(18gb) chr21(4.7gb) chr22(4.8)
2bfpbwt-bm on master 250 82 84
2bfpbwt on (71a045a) 158 43 44
2bfpbwt-bm on (hotfix) 244 82 84

REVIEW

Durbin’s Original PBWT [1] git

Input:

a binary matrix \(X\), of \(m\) rows and \(n\) columns. \((\Sigma = \{0, 1\})\), where \(x_i\) are haplotype sequences; \(0\leq i < m\) and \(x_i \in \Sigma^n\)

Output:

for each column \(k\) in \(0\leq k < n\) we represent:

  • a prefix array \(A_k\) that stores the permutation of the rows in \(X\) according to the co-lex ordering of the subsequences of \(X\) from \(0\) to \(k\);

  • \(Y_k\), the entire matrix \(X\) with \(x_i\) permuted according to \(A_k\) (not computed)

  • an array \(y^k = Y_k[k+1]\) containg the permutation of the input \(X\)’s next column, according to the row permutation in \(A_k\). This is computed on-the-fly (and not stored!) reading \(X[k+1]\) by the permutation stored in \(A_k\).

  • a divergence array \(D_k\) that stores the length of the \(LCP^{-1}\) between each pair \((y_i[0,k], y_{i-1}[0,k])\)

The catch

\(A_k\) and \(D_k\) can be computed by \(A_{k-1}\) and \(D_{k-1}\) linearly in space/time \(O(m)\)[1].

  • The overall computation time of the \(\texttt{PBWT}\) (composed of arrays \(A_k\) and \(D_k\) for each \(k\) … is \(O(n\cdot m)\).

  • Moreover, if we keep only \(\texttt{PBWT}_{k}\) to compute the next \(\texttt{PBWT}_{k+1}\), we can do so in \(O(m)\) space, performing some other computation on each \(\texttt{PBWT}\) before it is discarded.

PBWT Use:

  1. Indexing + Querying of Set Maximal Exact Matches and/or L-long matches, computing so-called matching statistics
    • SMEMs: match a pattern \(P\) against \(X\), retrive the longest matches of \(P[0,k]\) subsequences in co-lex ordering.
  2. Computing internal all-vs-all matches longer than \(L\) and/or Maximal Perfect Haplotype Blocks (MPHB)

Tools (long matches and SMEMs)

  • Durbin’s pbwt [1] git
    • First Definition of PBWT
    • Reads VCF/BCF or .macs simulated data
    • On-line computation
    • Computes internal L-long matches, internal SMEMs, external (query) SMEMs.
  • mu-pbwt [2] git
    • Run-Length encoding of the PBWT
    • reads VCF/BCF
    • indexes and queries of a pattern against the VCF
    • computes external SMEMs and L-long matches
  • d-pbwt [3] git
    • Dynamic pbwt to dynamically insert/delete reads
    • reads VCF
    • indexes and queries of patterns against Panel,
    • computes external L-long matches
    • panel is stored as a std::vector<std::vector<bool>>
  • dynamic-mu-pbwt [4] git
    • dynamic + mu PBWT
    • reads VCF
    • indexes and queries of patterns against Panel,
    • computes external SMEMs and L-long matches
  • PBWT-Query [5] git
    • Code is unaccesible, may read VCF
    • Uses “Leap Arrays” to jump from different \(D\) values in \(y^k\)
    • Notes: several version of the algorithms are proposed
      • PBWT-Query memory extensive, panel in memory
      • L-PBWT-Query memory mapped, panel lazy-loaded; uses Leap arrays
      • L-PBWT-Query memory extensive, panel in memory; uses Leap arrays
    • computes external L-long matches
  • syllable-PBWT [6] git
    • syllable-encoding of the PBWT, 64 or 128 bits
    • reads VCF
    • indexes and queries of patterns against Panel, parametrized by \(B\) syllable length (\(B=\{64, 128\}\))
    • computes external L-long matches with \(L \geq 2B\)
    • notes: in the paper section (2.1) and Table 1, they write about “bit-PBWT” referring to the Durbin’s original PBWT used in PBWT-Query [5] and d-PBWT [3]
  • mc-PBWT: multi column (parallel?) scanning of the PBWT (double and triple) [7] used for “Matches of exact Lenght \(L\)
    • seems NOT implemented …
  • parallel PBWT (VCF/BCF )[8] git
    • first parallel version (implemented)
      • vertically parallel + reconcile each \(p\) thread
    • computes Long Matches and SMEMs (compares against Durbin’s pbwt)
    • uses htslib (reads VCF/BCF)
    • panel is read as std::vector<std:vector<bool>>
    • main code at code
  • HB-parallel PBWT (VCF?): last version, horizontally parallel [9] Long Matches and SMEMs
    • no code perovided
    • benchmarks are against [8] and Durbin’s
    • speed up only for very large Haplotype count (M size).

Tools (Haplotype Blocks, internal Matches)

  • Trie-based Block Matching [10]
    • implemnted in “Haploblocks” below as trie algorithm by Stoye
    • Blocks are found in \(O(MN^2)\)
    • Rows are encoded as arrays of chars
  • Haploblocks (PBWT-based block Matching) [11] git
    • Prefix and Divergence are stored as longlongint
    • Panel is streamed by column from a text file of 0-1s representing the binary matrix
      • the implementation may accepts a custom ordered alphabet with minor adjustments (0, 1, 2, 3, etc.)
  • Bi-directional PBWT [12] git
    • Reads VCFs (no htslib)
    • Block lengths defaults to 500k in WIDTH!!!, check on the paper the assumptions of such large blocks.
    • Block haplotype count defaults to 100, check as above
    • Computes blocks with GAPS (defaults is 1 column gap). Increasing this value may greatly affect performances
      • To test normal haplotype blocks, set the gap to 0
    • Requires intermediate data to be writte on disk, about 4 times the VCF size (that’s a lot!), encoding Prefix and Divergece with 4bytes(64bit ints) per value …
  • WildHap (Trie-based * block matching) [13] git
    • Reads text file of 0-1s representing the binary matrix, plus * gaps.
    • Computes MPHBw; can be used to compute regular blocks given as input a binary matrix
    • Trie base, written in Java.
  • Wild-pBWT (PBWT-based * block matching) [14] git
    • Reads text file of 0-1s representing the binary matrix, plus * gaps.
    • Computes MHBw; can be used to compute regular blocks given as input a binary matrix.
    • Code is ready for multi-allelic PBWT (unlike [11])
  • psmoother [15] git
    • Reads VCF (no htslib)
    • Extension of bi-PBWT. Finds blocks with gaps and the tries to smoothe the Panel, i.e. correcting the gapped columns to extend block matchs.
    • Code is highly tangled with bi-PBWT, and uses in fact a bi-pBWT.

Other papers on PBWT

  • templated PBWT [16] git
    • Uses templates, i.e. bit-masks to cover the input and compute Long Matches and Blocks with different Gaps and Masks.
    • The computation of different templates is parallelized and intermediate data is saved to binary files (uint8 for each allele).
    • Written in Cython
    • Compared against PBWT, RaPid, hap-IBD
  • tree consistent pbwt (tc-pbwt)
  • fast recomb
  • Data Structures for SMEM-Finding in the PBWT
  • Compressed Data Structures for Population-Scale Positional Burrows–Wheeler Transforms
  • multiallelic pbwt (zing, zang, sanahulla)
  • FiMap: sparse linear algebra and random matrix algorithms to speed up IBD analysis
  • hap-IBD (zhou): parallel PBWT for IBD segments, in JAVA. Reads VCF
  • RaPID (naseri): reads vcf, uses PBWT and reconcile IBD segments; python
  • Compression
    • Durbin
    • CoMSA (gPBWT)
    • GTShark

References

1.
Durbin R (2014) Efficient haplotype matching and storage using the positional BurrowsWheeler transform (PBWT). Bioinformatics 30(9):1266–1272. https://doi.org/10.1093/bioinformatics/btu014
2.
Cozzi D, Rossi M, Rubinacci S, et al (2023) \(\mu\)- PBWT: A lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Bioinformatics 39(9):btad552. https://doi.org/10.1093/bioinformatics/btad552
3.
Sanaullah A, Zhi D, Zhang S (2021) D-PBWT: Dynamic positional BurrowsWheeler transform. Bioinformatics 37(16):2390–2397. https://doi.org/10.1093/bioinformatics/btab117
4.
Shakya P, Sanaullah A, Zhi D, Zhang S (2025) Dynamic \(\mu\)-PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data. bioRxiv 2025.02.04.636479. https://doi.org/10.1101/2025.02.04.636479
5.
Naseri A, Holzhauser E, Zhi D, Zhang S (2019) Efficient haplotype matching between a query and a panel for genealogical search. Bioinformatics 35(14):i233–i241. https://doi.org/10.1093/bioinformatics/btz347
6.
Wang V, Naseri A, Zhang S, Zhi D (2023) Syllable-PBWT for space-efficient haplotype long-match query. Bioinformatics 39(1):btac734. https://doi.org/10.1093/bioinformatics/btac734
7.
Shakya P, Naseri A, Zhi D, Zhang S (2022) mcPBWT: Space-Efficient Multi-column PBWT Scanning Algorithm for Composite Haplotype Matching. In: Bansal MS, Măndoiu I, Moussa M, et al (eds) Computational Advances in Bio and Medical Sciences. Springer International Publishing, Cham, pp 115–130
8.
Wertenbroek R, Xenarios I, Thoma Y, Delaneau O (2023) Exploiting parallelization in positional BurrowsWheeler transform (PBWT) algorithms for efficient haplotype matching and compression. Bioinformatics Advances 3(1):vbad021. https://doi.org/10.1093/bioadv/vbad021
9.
Tang K, Sanaullah A, Zhi D, Zhang S (2025) Haplotype-based Parallel PBWT for Biobank Scale Data. 2025.02.04.636317
10.
Cunha L, Diekmann Y, Kowada L, Stoye J (2018) Identifying Maximal Perfect Haplotype Blocks. In: Alves R (ed) Advances in Bioinformatics and Computational Biology. Springer International Publishing, Cham, pp 26–37
11.
Alanko J, Bannai H, Cazaux B, Peterlongo P, Stoye J (2020) Finding all maximal perfect haplotype blocks in linear time. Algorithms for Molecular Biology 15(1):2. https://doi.org/10.1186/s13015-020-0163-6
12.
Naseri A, Yue W, Zhang S, Zhi D (2021) Efficient Haplotype Block Matching in Bi-Directional PBWT. In: Carbone A, El-Kebir M (eds) 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, pp 19:1–19:13
13.
Williams L, Mumey B (2020) Maximal Perfect Haplotype Blocks with Wildcards
14.
Bonizzoni P, Della Vedova G, Pirola Y, Rizzi R, Sgrò M (2023) Multiallelic Maximal Perfect Haplotype Blocks with Wildcards via PBWT. In: Rojas I, Valenzuela O, Rojas Ruiz F, Herrera LJ, Ortuño F (eds) Bioinformatics and Biomedical Engineering. Springer Nature Switzerland, Cham, pp 62–76
15.
Yue W, Naseri A, Wang V, Shakya P, Zhang S, Zhi D (2022) P-smoother: Efficient PBWT smoothing of large haplotype panels. Bioinformatics Advances 2(1):vbac045. https://doi.org/10.1093/bioadv/vbac045
16.
Freyman WA, McManus KF, Shringarpure SS, et al (2021) Fast and Robust Identity-by-Descent Inference with the Templated Positional BurrowsWheeler Transform. Molecular Biology and Evolution 38(5):2131–2151. https://doi.org/10.1093/molbev/msaa328