malva.filter_minimizers module
A MinHash-based short-word filtering approach for clustering k-mers. It compresses billions of k-mers into a much smaller number of clusters by grouping k-mers whose short-word (w-mer) composition is similar.
- class malva.filter_minimizers.KmerFilter
Bases:
object
Encapsulates the filter approach so that k-mers can be processed on the fly (e.g., from a sorted stream) without dynamic reallocation or global state.
- k
Length of the k-mers (in nucleotides).
- Type:
int
- w
Length of the short word (w-mer) used for signature computation.
- Type:
int
- num_buckets
Predefined number of buckets.
- Type:
int
- filter_stream(kmers)
Filter a stream of 64-bit encoded k-mers, assigning each to a bucket.
- Parameters:
kmers (np.ndarray[np.uint64_t]) – 1D numpy array of encoded k-mers.
- Returns:
1D array (int32) of bucket assignments.
- Return type:
np.ndarray
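The input type above is a 1D array of 64-bit encoded k-mers, but this page does not say how nucleotides are packed into each integer. As a minimal sketch, assuming a conventional 2-bit-per-base packing (A=0, C=1, G=2, T=3, first base in the highest bits), an input array could be prepared as follows. The names `CODE`, `encode_kmer`, and `encode_kmers` are hypothetical helpers for illustration and are not part of malva.

```python
import numpy as np

# Assumed 2-bit code; malva's actual encoding may differ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(seq: str) -> int:
    """Pack a nucleotide string into an integer, 2 bits per base (hypothetical helper)."""
    if len(seq) > 32:
        raise ValueError("a 64-bit word holds at most 32 bases at 2 bits each")
    value = 0
    for base in seq:
        # Shift the bases seen so far and append the new base in the low 2 bits.
        value = (value << 2) | CODE[base]
    return value


def encode_kmers(seqs) -> np.ndarray:
    """Encode an iterable of equal-length k-mer strings into a uint64 array."""
    return np.array([encode_kmer(s) for s in seqs], dtype=np.uint64)


kmers = encode_kmers(["ACGT", "TTTT", "AAAA"])
```

Under this assumed packing, a 64-bit word limits k to at most 32. An array built this way has the 1D unsigned 64-bit shape and dtype that `filter_stream` is documented to accept.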
- malva.filter_minimizers.filter_kmers(kmers, k, w, num_buckets)
Stream and filter 64-bit encoded k-mers into a fixed set of buckets. The bucket for each k-mer is computed as:
bucket = compute_signature(kmer, k, w) % num_buckets
- Parameters:
kmers (np.ndarray[np.uint64_t]) – 1D numpy array of 64-bit encoded k-mers.
k (int) – Length of the k-mer (in nucleotides).
w (int) – Length of the short word (w-mer) for signature computation.
num_buckets (int) – Predefined number of buckets.
- Returns:
1D array (of length kmers.shape[0]) containing the bucket index for each k-mer.
- Return type:
np.ndarray[np.int32]
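The bucket rule `compute_signature(kmer, k, w) % num_buckets` can be illustrated with a small pure-NumPy sketch. The internals of `compute_signature` are not documented here, so the version below is an assumption: it treats the k-mer as 2-bit packed, hashes every w-mer with a simple multiplicative mix, and takes the minimum hash as a MinHash-style signature. The names `compute_signature_sketch` and `filter_kmers_sketch` are hypothetical, and the result is not expected to match malva's actual bucket indices.

```python
import numpy as np

MASK64 = 0xFFFFFFFFFFFFFFFF


def compute_signature_sketch(kmer: int, k: int, w: int) -> int:
    """Hypothetical stand-in: minimum hash over all w-mers of a 2-bit packed k-mer."""
    wmer_mask = (1 << (2 * w)) - 1
    best = None
    for i in range(k - w + 1):
        # Extract the w-mer starting i bases from the low end of the k-mer.
        wmer = (kmer >> (2 * i)) & wmer_mask
        # Simple multiplicative mix; an assumption, not malva's hash function.
        h = (wmer * 0x9E3779B97F4A7C15) & MASK64
        h ^= h >> 32
        if best is None or h < best:
            best = h
    return best


def filter_kmers_sketch(kmers: np.ndarray, k: int, w: int, num_buckets: int) -> np.ndarray:
    """Assign each encoded k-mer to a bucket: signature modulo num_buckets."""
    buckets = np.empty(kmers.shape[0], dtype=np.int32)
    for idx, kmer in enumerate(kmers):
        buckets[idx] = compute_signature_sketch(int(kmer), k, w) % num_buckets
    return buckets


# Three 8-mers (16 bits each under the assumed packing); the first two are identical.
kmers = np.array(
    [0b0001101100011011, 0b0001101100011011, 0b1111111111111111],
    dtype=np.uint64,
)
buckets = filter_kmers_sketch(kmers, k=8, w=3, num_buckets=16)
```

Because the bucket depends only on the k-mer and the fixed parameters, identical k-mers always land in the same bucket, and the output has one int32 entry per input k-mer, as described for the return value above.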