on the positional BWT
bf-pbwt results
| BLIM timing(s) | chr10(18gb) | chr21(4.7gb) | chr22(4.8) | |
|---|---|---|---|---|
| 2bfpbwt-bm on master | 690.53 | 110.47 | 116.99 | |
| 2bfpbwt on (71a045a) | 294.98 | 74.48 | 75.54 | |
| 2bfpbwt-bm on (hotfix) | 753.71 | 108.56 | 113.40 |
| BLIM ram (GiB) | chr10(18gb) | chr21(4.7gb) | chr22(4.8) | |
|---|---|---|---|---|
| 2bfpbwt-bm on master | 250 | 82 | 84 | |
| 2bfpbwt on (71a045a) | 158 | 43 | 44 | |
| 2bfpbwt-bm on (hotfix) | 244 | 82 | 84 |
REVIEW
Durbin’s Original PBWT [1] git
Input:
a binary matrix \(X\), of \(m\) rows and \(n\) columns. \((\Sigma = \{0, 1\})\), where \(x_i\) are haplotype sequences; \(0\leq i < m\) and \(x_i \in \Sigma^n\)
Output:
for each column \(k\) in \(0\leq k < n\) we represent:
a prefix array \(A_k\) that stores the permutation of the rows in \(X\) according to the co-lex ordering of the subsequences of \(X\) from \(0\) to \(k\);
\(Y_k\), the entire matrix \(X\) with \(x_i\) permuted according to \(A_k\) (not computed)
an array \(y^k = Y_k[k+1]\) containg the permutation of the input \(X\)’s next column, according to the row permutation in \(A_k\). This is computed on-the-fly (and not stored!) reading \(X[k+1]\) by the permutation stored in \(A_k\).
a divergence array \(D_k\) that stores the length of the \(LCP^{-1}\) between each pair \((y_i[0,k], y_{i-1}[0,k])\)
The catch
\(A_k\) and \(D_k\) can be computed by \(A_{k-1}\) and \(D_{k-1}\) linearly in space/time \(O(m)\)[1].
The overall computation time of the \(\texttt{PBWT}\) (composed of arrays \(A_k\) and \(D_k\) for each \(k\) … is \(O(n\cdot m)\).
Moreover, if we keep only \(\texttt{PBWT}_{k}\) to compute the next \(\texttt{PBWT}_{k+1}\), we can do so in \(O(m)\) space, performing some other computation on each \(\texttt{PBWT}\) before it is discarded.
PBWT Use:
- Indexing + Querying of Set Maximal Exact Matches and/or L-long matches, computing so-called matching statistics
- SMEMs: match a pattern \(P\) against \(X\), retrive the longest matches of \(P[0,k]\) subsequences in co-lex ordering.
- Computing internal all-vs-all matches longer than \(L\) and/or Maximal Perfect Haplotype Blocks (MPHB)
Tools (long matches and SMEMs)
- Durbin’s pbwt [1] git
- First Definition of PBWT
- Reads VCF/BCF or .macs simulated data
- On-line computation
- Computes internal L-long matches, internal SMEMs, external (query) SMEMs.
- mu-pbwt [2] git
- Run-Length encoding of the PBWT
- reads VCF/BCF
- indexes and queries of a pattern against the VCF
- computes external SMEMs and L-long matches
- d-pbwt [3] git
- Dynamic pbwt to dynamically insert/delete reads
- reads VCF
- indexes and queries of patterns against Panel,
- computes external L-long matches
- panel is stored as a
std::vector<std::vector<bool>>
- dynamic-mu-pbwt [4] git
- dynamic + mu PBWT
- reads VCF
- indexes and queries of patterns against Panel,
- computes external SMEMs and L-long matches
- PBWT-Query [5] git
- Code is unaccesible, may read VCF
- Uses “Leap Arrays” to jump from different \(D\) values in \(y^k\)
- Notes: several version of the algorithms are proposed
- PBWT-Query memory extensive, panel in memory
- L-PBWT-Query memory mapped, panel lazy-loaded; uses Leap arrays
- L-PBWT-Query memory extensive, panel in memory; uses Leap arrays
- computes external L-long matches
- syllable-PBWT [6] git
- syllable-encoding of the PBWT, 64 or 128 bits
- code: reading VCF encoded in
intsyllables SyllableQuery.cpp
- code: reading VCF encoded in
- reads VCF
- indexes and queries of patterns against Panel, parametrized by \(B\) syllable length (\(B=\{64, 128\}\))
- computes external L-long matches with \(L \geq 2B\)
- notes: in the paper section (2.1) and Table 1, they write about “bit-PBWT” referring to the Durbin’s original PBWT used in PBWT-Query [5] and d-PBWT [3]
- syllable-encoding of the PBWT, 64 or 128 bits
- mc-PBWT: multi column (parallel?) scanning of the PBWT (double and triple) [7] used for “Matches of exact Lenght \(L\)
- seems NOT implemented …
- parallel PBWT (VCF/BCF )[8] git
- first parallel version (implemented)
- vertically parallel + reconcile each \(p\) thread
- computes Long Matches and SMEMs (compares against Durbin’s pbwt)
- uses htslib (reads VCF/BCF)
- panel is read as
std::vector<std:vector<bool>> - main code at code
- first parallel version (implemented)
- HB-parallel PBWT (VCF?): last version, horizontally parallel [9] Long Matches and SMEMs
- no code perovided
- benchmarks are against [8] and Durbin’s
- speed up only for very large Haplotype count (M size).
Tools (Haplotype Blocks, internal Matches)
- Trie-based Block Matching [10]
- implemnted in “Haploblocks” below as trie algorithm by Stoye
- Blocks are found in \(O(MN^2)\)
- Rows are encoded as arrays of
chars
- Haploblocks (PBWT-based block Matching) [11] git
- Prefix and Divergence are stored as
longlongint - Panel is streamed by column from a text file of 0-1s representing the binary matrix
- the implementation may accepts a custom ordered alphabet with minor adjustments (0, 1, 2, 3, etc.)
- Prefix and Divergence are stored as
- Bi-directional PBWT [12] git
- Reads VCFs (no htslib)
- Block lengths defaults to 500k in WIDTH!!!, check on the paper the assumptions of such large blocks.
- Block haplotype count defaults to 100, check as above
- Computes blocks with GAPS (defaults is 1 column gap). Increasing this value may greatly affect performances
- To test normal haplotype blocks, set the gap to 0
- Requires intermediate data to be writte on disk, about 4 times the VCF size (that’s a lot!), encoding Prefix and Divergece with 4bytes(64bit ints) per value …
- WildHap (Trie-based * block matching) [13] git
- Reads text file of 0-1s representing the binary matrix, plus * gaps.
- Computes MPHBw; can be used to compute regular blocks given as input a binary matrix
- Trie base, written in Java.
- Wild-pBWT (PBWT-based * block matching) [14] git
- Reads text file of 0-1s representing the binary matrix, plus * gaps.
- Computes MHBw; can be used to compute regular blocks given as input a binary matrix.
- Code is ready for multi-allelic PBWT (unlike [11])
- psmoother [15] git
- Reads VCF (no htslib)
- Extension of bi-PBWT. Finds blocks with gaps and the tries to smoothe the Panel, i.e. correcting the gapped columns to extend block matchs.
- Code is highly tangled with bi-PBWT, and uses in fact a bi-pBWT.
Other papers on PBWT
- templated PBWT [16] git
- Uses templates, i.e. bit-masks to cover the input and compute Long Matches and Blocks with different Gaps and Masks.
- The computation of different templates is parallelized and intermediate data is saved to binary files (
uint8for each allele). - Written in Cython
- Compared against PBWT, RaPid, hap-IBD
- tree consistent pbwt (tc-pbwt)
- fast recomb
- Data Structures for SMEM-Finding in the PBWT
- Compressed Data Structures for Population-Scale Positional Burrows–Wheeler Transforms
- multiallelic pbwt (zing, zang, sanahulla)
- FiMap: sparse linear algebra and random matrix algorithms to speed up IBD analysis
- hap-IBD (zhou): parallel PBWT for IBD segments, in JAVA. Reads VCF
- RaPID (naseri): reads vcf, uses PBWT and reconcile IBD segments; python
- Compression
- Durbin
- CoMSA (gPBWT)
- GTShark