malva.filter_minimizers module

A MinHash-based short-word filtering approach for clustering k-mers. It compresses billions of k-mers into a much smaller number of clusters by grouping k-mers with a similar composition of short words (w-mers).

class malva.filter_minimizers.KmerFilter

Bases: object

Encapsulates the filtering approach so that k-mers can be processed on the fly (e.g., from a sorted stream) without dynamic reallocation or global state.

k

Length of the k-mers (in nucleotides).

Type:

int

w

Length of the short word (w-mer) used for signature computation.

Type:

int

num_buckets

Predefined number of buckets.

Type:

int

filter_stream(kmers)

Filter a stream of 64-bit encoded k-mers, assigning each to a bucket.

Parameters:

kmers (np.ndarray[np.uint64_t]) – 1D numpy array of encoded k-mers.

Returns:

1D array (int32) of bucket assignments.

Return type:

np.ndarray[np.int32]
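The input to filter_stream is a 1D array of 64-bit encoded k-mers, but the encoding itself is not described here. The following is a minimal sketch of how such an array might be built, assuming the common 2-bit packing (A=0, C=1, G=2, T=3) with the first base in the highest bits. The helper names encode_kmer and encode_kmers are hypothetical and not part of malva; the encoding malva actually expects may differ.

```python
import numpy as np

# Assumed 2-bit alphabet; the encoding malva actually uses is not stated here.
_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(seq: str) -> int:
    """Pack a DNA string of at most 32 bases into an integer, 2 bits per base."""
    if len(seq) > 32:
        raise ValueError("a 64-bit word holds at most 32 bases at 2 bits each")
    value = 0
    for base in seq:
        # Shift the bases seen so far left and append the new base's 2-bit code.
        value = (value << 2) | _CODES[base]
    return value


def encode_kmers(seqs) -> np.ndarray:
    """Return a 1D uint64 array, the input shape that filter_stream documents."""
    return np.array([encode_kmer(s) for s in seqs], dtype=np.uint64)


kmers = encode_kmers(["ACGT", "TTTT", "AAAA"])
```

Under this assumed packing, k can be at most 32, since each base takes 2 of the 64 available bits.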

malva.filter_minimizers.filter_kmers(kmers, k, w, num_buckets)

Stream and filter 64-bit encoded k-mers into a fixed set of buckets.

Each k-mer is assigned to a bucket as:

bucket = compute_signature(kmer, k, w) % num_buckets

Parameters:
  • kmers (np.ndarray[np.uint64_t]) – 1D numpy array of 64-bit encoded k-mers.

  • k (int) – Length of the k-mer (in nucleotides).

  • w (int) – Length of the short word (w-mer) for signature computation.

  • num_buckets (int) – Predefined number of buckets.

Returns:

1D array (of length kmers.shape[0]) containing the bucket index for each k-mer.

Return type:

np.ndarray[np.int32]