malva.compressed_index module

class malva.compressed_index.LazyChunkTable(filename, chunk_table_pos, num_chunks, buffer_size=1)[source]

Bases: object

__init__(filename, chunk_table_pos, num_chunks, buffer_size=1)[source]

class malva.compressed_index.CompressedArrayStorage(chunk_size=512, compression_level=9, compression_lib='zstd', dtype=<class 'numpy.uint64'>, fastpfor_codec=None, sort_chunks=False, use_delta=False, max_cache_size=2000000)[source]

Bases: object

Memory-efficient storage for large sorted numeric arrays with fast random access. Optimized for k-mer data, using 512-element chunks by default. Supports incremental saving to a file, so a large array does not have to be held in memory all at once.

__init__(chunk_size=512, compression_level=9, compression_lib='zstd', dtype=<class 'numpy.uint64'>, fastpfor_codec=None, sort_chunks=False, use_delta=False, max_cache_size=2000000)[source]

Initialize the compressed array storage.

Parameters:
  • chunk_size (int) – Number of elements per chunk (default: 512)

  • compression_level (int) – Compression level (1-9, where 9 is highest)

  • compression_lib (str) – Compression library to use (‘zstd’, ‘lz4’, ‘zlib’, etc.)

  • dtype (dtype) – Data type of array elements
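The idea behind chunk_size and compression_level can be illustrated with a minimal sketch. It is not malva's implementation: it uses zlib and the array module from the standard library instead of zstd and NumPy, and the helper names compress_chunks and read_chunk are hypothetical. It only shows why compressing fixed-size chunks independently allows one chunk to be read without decompressing the rest.

```python
import zlib
from array import array

CHUNK_SIZE = 512  # mirrors the documented default chunk_size


def compress_chunks(values, chunk_size=CHUNK_SIZE, level=9):
    """Split values into fixed-size chunks and compress each one independently."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        # "Q" stores unsigned 64-bit integers, comparable to numpy.uint64
        raw = array("Q", values[start:start + chunk_size]).tobytes()
        chunks.append(zlib.compress(raw, level))
    return chunks


def read_chunk(chunks, chunk_id):
    """Decompress a single chunk back into a list of integers."""
    out = array("Q")
    out.frombytes(zlib.decompress(chunks[chunk_id]))
    return out.tolist()


values = list(range(0, 3000, 3))   # 1000 sorted values
chunks = compress_chunks(values)   # two chunks: 512 and 488 elements
second = read_chunk(chunks, 1)     # only this chunk is decompressed
```

Smaller chunks make a single lookup cheaper to decompress but usually compress less well; the right trade-off depends on the data and access pattern.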

load_all_chunks_to_memory()[source]

Load all compressed chunks into memory for faster access. Chunks will still be decompressed on demand, but accessing them will be faster since the data is already in memory.

Return type:

None

unload_chunks_from_memory()[source]

Unload chunks from memory to free up space. The array will remain accessible but will read from disk.

Return type:

None

from_array(array, filename=None)[source]

Initialize storage from a numpy array. If filename is provided, save incrementally to reduce memory usage.

Parameters:
  • array (ndarray) – Numpy array to store

  • filename (str) – Optional file path to save to incrementally

Return type:

None

from_generator(generator, total_size, filename)[source]

Initialize storage from a generator that yields sorted chunks. This allows processing very large arrays without loading everything into memory.

Parameters:
  • generator – Generator that yields sorted chunks of data

  • total_size (int) – Total number of elements expected

  • filename (str) – File path to save to

Return type:

None
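The following sketch shows one way incremental saving from a generator could work. It is an assumption for illustration, not malva's file format: the layout (compressed chunks written back to back, with an in-memory table of offsets) and the names write_from_generator, read_chunk, and sorted_chunks are hypothetical, and zlib stands in for the configured compression library.

```python
import os
import tempfile
import zlib
from array import array


def write_from_generator(chunk_iter, path, level=9):
    """Compress each yielded chunk, append it to path, and return the chunk table."""
    table = []  # one (byte offset, compressed length) entry per chunk
    with open(path, "wb") as fh:
        for chunk in chunk_iter:
            blob = zlib.compress(array("Q", chunk).tobytes(), level)
            table.append((fh.tell(), len(blob)))
            fh.write(blob)  # only one chunk is held in memory at a time
    return table


def read_chunk(path, table, chunk_id):
    """Seek to one chunk on disk and decompress it."""
    offset, length = table[chunk_id]
    with open(path, "rb") as fh:
        fh.seek(offset)
        out = array("Q")
        out.frombytes(zlib.decompress(fh.read(length)))
    return out.tolist()


def sorted_chunks(total, chunk_size):
    """Hypothetical generator that yields sorted chunks of consecutive integers."""
    for start in range(0, total, chunk_size):
        yield list(range(start, min(start + chunk_size, total)))


path = os.path.join(tempfile.mkdtemp(), "demo.bin")
table = write_from_generator(sorted_chunks(1200, 512), path)
last_chunk = read_chunk(path, table, 2)
```

Because each chunk is written as soon as it is produced, peak memory stays proportional to one chunk rather than to the whole array.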

save(filename)[source]

Save the compressed array to a file.

Parameters:

filename (str) – File path to save to

Return type:

None

load(filename, load_in_memory=False, max_chunks_in_memory=100000000)[source]

Return type:

None

get_chunk(chunk_id)[source]

Get a specific chunk by ID.

Parameters:

chunk_id (int) – ID of the chunk to retrieve

Return type:

ndarray

Returns:

Numpy array with the chunk data

get_chunk_raw(chunk_id)[source]

Get a specific chunk in its raw compressed format. This is useful for advanced use cases where you want to manage the decompression yourself.

Parameters:

chunk_id (int) – ID of the chunk to retrieve

Return type:

bytes

Returns:

Raw compressed bytes for the chunk

get(index)[source]

Get the value at a specific index.

Parameters:

index (int) – Index of the element to retrieve

Return type:

number

Returns:

Value at the specified index
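With fixed-size chunks, a flat index maps to a chunk ID and an offset inside that chunk. The sketch below shows that arithmetic only; locate is a hypothetical helper, and whether malva computes it exactly this way is an assumption.

```python
def locate(index, chunk_size=512):
    """Map a flat array index to (chunk_id, offset within that chunk)."""
    if index < 0:
        raise IndexError("negative indices are not handled in this sketch")
    return divmod(index, chunk_size)


chunk_id, offset = locate(1000)  # element 1000 is in chunk 1 at offset 488
```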

get_batch(indices)[source]

Get values at multiple indices efficiently.

Parameters:

indices (List[int]) – List of indices to retrieve

Return type:

ndarray

Returns:

Numpy array with values at the specified indices
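A batch lookup can avoid repeated decompression by grouping the requested indices by chunk first. The sketch below illustrates that approach under stated assumptions: batch_lookup and load_chunk are hypothetical names, load_chunk stands in for whatever decompresses a chunk, and the source does not confirm that get_batch is implemented this way.

```python
from collections import defaultdict


def batch_lookup(indices, load_chunk, chunk_size=512):
    """Return values for indices, loading each touched chunk only once."""
    groups = defaultdict(list)  # chunk_id -> [(position in result, offset)]
    for pos, index in enumerate(indices):
        chunk_id, offset = divmod(index, chunk_size)
        groups[chunk_id].append((pos, offset))

    result = [None] * len(indices)
    for chunk_id, hits in groups.items():
        chunk = load_chunk(chunk_id)  # one load per distinct chunk
        for pos, offset in hits:
            result[pos] = chunk[offset]
    return result


loads = []  # records which chunks were loaded


def load_chunk(chunk_id):
    """Hypothetical loader: chunk i holds the integers i*512 .. i*512+511."""
    loads.append(chunk_id)
    return list(range(chunk_id * 512, (chunk_id + 1) * 512))


values = batch_lookup([5, 1030, 7, 1025], load_chunk)
```

Four indices are requested, but only two chunks are loaded, and the results come back in the order the indices were given.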

prefetch_chunks(chunk_ids)[source]

Prefetch and cache decompressed chunks.

Parameters:

chunk_ids – List of chunk IDs to prefetch

get_cache_stats()[source]

Get cache performance statistics.

Return type:

dict

set_cache_size(new_size)[source]

Dynamically adjust the cache size.
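As an illustration of how a resizable chunk cache with statistics might behave, here is a minimal sketch. Everything in it is an assumption: the ChunkCache class, the least-recently-used eviction policy, the choice to count the size in chunks, and the keys of the statistics dictionary are all hypothetical and are not taken from malva.

```python
from collections import OrderedDict


class ChunkCache:
    """Hypothetical LRU cache of decompressed chunks with hit/miss counters."""

    def __init__(self, max_size):
        self.max_size = max_size      # assumed unit: number of cached chunks
        self._items = OrderedDict()   # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def get(self, chunk_id, loader):
        if chunk_id in self._items:
            self.hits += 1
            self._items.move_to_end(chunk_id)  # mark as most recently used
            return self._items[chunk_id]
        self.misses += 1
        value = loader(chunk_id)
        self._items[chunk_id] = value
        self._evict()
        return value

    def set_size(self, new_size):
        """Change the capacity and evict immediately if the cache is now too large."""
        self.max_size = new_size
        self._evict()

    def stats(self):
        return {"hits": self.hits, "misses": self.misses, "size": len(self._items)}

    def _evict(self):
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the least recently used chunk


cache = ChunkCache(max_size=2)
for cid in [0, 1, 0, 2, 1]:
    cache.get(cid, lambda c: [c] * 4)
before = cache.stats()
cache.set_size(1)
after = cache.stats()
```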

get_multiple_slices(slice_requests)[source]

Optimized implementation for multiple slice requests.

Parameters:

slice_requests – List of (start_idx, end_idx, kmer) tuples

get_slice(start_idx, end_idx)[source]

Get a slice of the array.

Parameters:
  • start_idx (int) – Start index (inclusive)

  • end_idx (int) – End index (exclusive)

Return type:

ndarray

Returns:

Numpy array with the slice data
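A slice with an inclusive start and exclusive end can span several chunks. The sketch below shows one way to work out which chunks are touched and which part of each is needed; slice_across_chunks and load_chunk are hypothetical names, and this is not claimed to be how get_slice is implemented.

```python
def slice_across_chunks(start_idx, end_idx, load_chunk, chunk_size=512):
    """Collect elements in [start_idx, end_idx) from fixed-size chunks."""
    if start_idx >= end_idx:
        return []
    first = start_idx // chunk_size
    last = (end_idx - 1) // chunk_size  # chunk holding the final element
    out = []
    for chunk_id in range(first, last + 1):
        chunk = load_chunk(chunk_id)
        base = chunk_id * chunk_size            # flat index of chunk[0]
        lo = max(start_idx - base, 0)           # trim the first chunk
        hi = min(end_idx - base, chunk_size)    # trim the last chunk
        out.extend(chunk[lo:hi])
    return out


data = list(range(2000))


def load_chunk(chunk_id):
    """Hypothetical loader backed by a plain list."""
    return data[chunk_id * 512:(chunk_id + 1) * 512]


piece = slice_across_chunks(500, 1030, load_chunk)  # touches chunks 0, 1 and 2
```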

property shape: tuple

Return the shape of the array.

Returns:

Tuple representing the array dimensions (only 1D arrays supported currently)

get_chunk_for_cython(chunk_id)[source]

Get a chunk in a format suitable for Cython processing. This ensures the returned array is:
  1. Exactly chunk_size elements
  2. C-contiguous in memory
  3. Of the correct dtype

Parameters:

chunk_id (int) – ID of the chunk to retrieve

Return type:

ndarray

Returns:

Numpy array with the chunk data, guaranteed to be chunk_size elements