malva.compressed_index module
- class malva.compressed_index.LazyChunkTable(filename, chunk_table_pos, num_chunks, buffer_size=1)
Bases:
object
- class malva.compressed_index.CompressedArrayStorage(chunk_size=512, compression_level=9, compression_lib='zstd', dtype=<class 'numpy.uint64'>, fastpfor_codec=None, sort_chunks=False, use_delta=False, max_cache_size=2000000)
Bases:
object
Memory-efficient storage for large sorted numeric arrays with fast random access. Optimized for k-mer data with 512-element chunking. Supports incremental saving to a file, so large arrays do not have to be held in memory all at once.
- __init__(chunk_size=512, compression_level=9, compression_lib='zstd', dtype=<class 'numpy.uint64'>, fastpfor_codec=None, sort_chunks=False, use_delta=False, max_cache_size=2000000)
Initialize the compressed array storage.
- Parameters:
chunk_size (int) – Number of elements per chunk (default: 512)
compression_level (int) – Compression level (1-9, where 9 is highest)
compression_lib (str) – Compression library to use ('zstd', 'lz4', 'zlib', etc.)
dtype (dtype) – Data type of array elements
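The chunking idea described for this class can be illustrated with a minimal standard-library sketch. It is not malva's implementation: `compress_chunks` and `get_value` are hypothetical helper names, `zlib` stands in for the configurable compression library, and `array("Q")` stands in for a `numpy.uint64` array. It only shows why fixed-size chunks allow random access without decompressing the whole array.

```python
import zlib
from array import array

CHUNK_SIZE = 512  # mirrors the documented default for chunk_size


def compress_chunks(values, chunk_size=CHUNK_SIZE, level=9):
    """Split values into fixed-size chunks and compress each one independently."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        # "Q" is an unsigned 64-bit integer, a stand-in for numpy.uint64
        raw = array("Q", values[start:start + chunk_size]).tobytes()
        chunks.append(zlib.compress(raw, level))
    return chunks


def get_value(chunks, index, chunk_size=CHUNK_SIZE):
    """Random access: decompress only the chunk that holds `index`."""
    chunk_id, offset = divmod(index, chunk_size)
    decoded = array("Q")
    decoded.frombytes(zlib.decompress(chunks[chunk_id]))
    return decoded[offset]


values = [3 * i for i in range(2000)]  # sorted stand-in for k-mer codes
chunks = compress_chunks(values)
print(len(chunks), get_value(chunks, 1500))
```

Because every chunk except possibly the last holds exactly `chunk_size` elements, the chunk that owns an index follows from a single `divmod`, and only that chunk has to be decompressed.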
- load_all_chunks_to_memory()
Load all compressed chunks into memory for faster access. Chunks will still be decompressed on demand, but accessing them will be faster since the data is already in memory.
- Return type:
None
- unload_chunks_from_memory()
Unload chunks from memory to free up space. The array will remain accessible but will read from disk.
- Return type:
None
- from_array(array, filename=None)
Initialize storage from a numpy array. If filename is provided, save incrementally to reduce memory usage.
- Parameters:
array (ndarray) – Numpy array to store
filename (str) – Optional file path to save to incrementally
- Return type:
None
- from_generator(generator, total_size, filename)
Initialize storage from a generator that yields sorted chunks. This allows processing very large arrays without loading everything into memory.
- Parameters:
generator – Generator that yields sorted chunks of data
total_size (int) – Total number of elements expected
filename (str) – File path to save to
- Return type:
None
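Incremental saving from a generator might look roughly like the following sketch. The on-disk layout used by malva is not documented here, so everything below is an assumption for illustration: the helpers `write_chunks_incrementally`, `read_chunk`, and `sorted_chunks` are hypothetical, the chunk table is kept in memory as `(offset, length)` pairs, and `zlib` replaces the configurable compressor.

```python
import os
import tempfile
import zlib
from array import array


def write_chunks_incrementally(chunk_iter, path, level=9):
    """Compress and append each chunk as it arrives; return the chunk table.

    The table holds one (byte_offset, compressed_length) pair per chunk,
    so a reader can later seek straight to any chunk.
    """
    table = []
    with open(path, "wb") as fh:
        for chunk in chunk_iter:
            blob = zlib.compress(array("Q", chunk).tobytes(), level)
            table.append((fh.tell(), len(blob)))
            fh.write(blob)  # only one chunk is held in memory at a time
    return table


def read_chunk(path, table, chunk_id):
    """Seek to one chunk on disk and decompress it."""
    offset, length = table[chunk_id]
    with open(path, "rb") as fh:
        fh.seek(offset)
        decoded = array("Q")
        decoded.frombytes(zlib.decompress(fh.read(length)))
    return decoded.tolist()


def sorted_chunks(total_size, chunk_size=512):
    """Hypothetical generator yielding consecutive sorted chunks."""
    for start in range(0, total_size, chunk_size):
        yield list(range(start, min(start + chunk_size, total_size)))


path = os.path.join(tempfile.mkdtemp(), "demo.bin")
table = write_chunks_incrementally(sorted_chunks(1300), path)
print(len(table), read_chunk(path, table, 2)[:3])
```

Peak memory in this sketch is bounded by one chunk plus the chunk table, independent of `total_size`, which is the property the generator-based constructor is described as providing.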
- save(filename)
Save the compressed array to a file.
- Parameters:
filename (str) – File path to save to
- Return type:
None
- get_chunk(chunk_id)
Get a specific chunk by ID.
- Parameters:
chunk_id (int) – ID of the chunk to retrieve
- Return type:
ndarray
- Returns:
Numpy array with the chunk data
- get_chunk_raw(chunk_id)
Get a specific chunk in its raw compressed format. This is useful for advanced use cases where you want to manage the decompression yourself.
- Parameters:
chunk_id (int) – ID of the chunk to retrieve
- Return type:
bytes
- Returns:
Raw compressed bytes for the chunk
- get(index)
Get the value at a specific index.
- Parameters:
index (int) – Index of the element to retrieve
- Return type:
number
- Returns:
Value at the specified index
- get_batch(indices)
Get values at multiple indices efficiently.
- Parameters:
indices (List[int]) – List of indices to retrieve
- Return type:
ndarray
- Returns:
Numpy array with values at the specified indices
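One plausible way to make a batch lookup efficient is to group the requested indices by chunk so that each chunk is decompressed at most once. The sketch below assumes that strategy; it is not taken from malva's source, and `compress_chunks` and `batch_lookup` are hypothetical names built on `zlib` and the standard `array` module.

```python
import zlib
from array import array
from collections import defaultdict


def compress_chunks(values, chunk_size=512):
    """Build independently compressed fixed-size chunks (demo helper)."""
    return [
        zlib.compress(array("Q", values[i:i + chunk_size]).tobytes())
        for i in range(0, len(values), chunk_size)
    ]


def batch_lookup(chunks, indices, chunk_size=512):
    """Return values in request order, decompressing each chunk only once."""
    by_chunk = defaultdict(list)
    for pos, index in enumerate(indices):
        chunk_id, offset = divmod(index, chunk_size)
        by_chunk[chunk_id].append((pos, offset))

    out = [0] * len(indices)
    for chunk_id, wanted in by_chunk.items():
        decoded = array("Q")
        decoded.frombytes(zlib.decompress(chunks[chunk_id]))
        for pos, offset in wanted:
            out[pos] = decoded[offset]  # write back at the original position
    return out, len(by_chunk)  # second item: number of chunks decompressed


values = list(range(0, 4000, 2))  # values[i] == 2 * i
chunks = compress_chunks(values)
result, n_decompressed = batch_lookup(chunks, [1500, 3, 1501, 600, 4])
print(result, n_decompressed)
```

Five lookups here touch only three chunks, so three decompressions suffice, compared with five if each index were fetched separately.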
- prefetch_chunks(chunk_ids)
Prefetch and cache decompressed chunks.
- Parameters:
chunk_ids – List of chunk IDs to prefetch
- get_multiple_slices(slice_requests)
Optimized implementation for multiple slice requests.
- Parameters:
slice_requests – List of (start_idx, end_idx, kmer) tuples
- get_slice(start_idx, end_idx)
Get a slice of the array.
- Parameters:
start_idx (int) – Start index (inclusive)
end_idx (int) – End index (exclusive)
- Return type:
ndarray
- Returns:
Numpy array with the slice data
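A half-open slice `[start_idx, end_idx)` over chunked storage generally spans a run of consecutive chunks, with the first and last chunk trimmed. The following sketch shows that arithmetic under the assumption of fixed-size chunks; `compress_chunks` and `slice_lookup` are hypothetical helpers, not part of malva, and `zlib` stands in for the configured compressor.

```python
import zlib
from array import array


def compress_chunks(values, chunk_size=512):
    """Build independently compressed fixed-size chunks (demo helper)."""
    return [
        zlib.compress(array("Q", values[i:i + chunk_size]).tobytes())
        for i in range(0, len(values), chunk_size)
    ]


def slice_lookup(chunks, start_idx, end_idx, chunk_size=512):
    """Return elements in [start_idx, end_idx), touching only the needed chunks."""
    if end_idx <= start_idx:
        return []
    first = start_idx // chunk_size
    last = (end_idx - 1) // chunk_size  # end_idx is exclusive
    out = []
    for chunk_id in range(first, last + 1):
        decoded = array("Q")
        decoded.frombytes(zlib.decompress(chunks[chunk_id]))
        base = chunk_id * chunk_size          # global index of this chunk's first element
        lo = max(start_idx - base, 0)         # trim the front of the first chunk
        hi = min(end_idx - base, len(decoded))  # trim the back of the last chunk
        out.extend(decoded[lo:hi])
    return out


values = list(range(2000))
chunks = compress_chunks(values)
part = slice_lookup(chunks, 500, 1030)
print(len(part), part[0], part[-1])
```

The example slice crosses two chunk boundaries (at 512 and 1024), so three chunks are decompressed and the pieces are concatenated in order.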
- property shape: tuple
Return the shape of the array.
- Returns:
Tuple representing the array dimensions (only 1D arrays supported currently)
- get_chunk_for_cython(chunk_id)
Get a chunk in a format suitable for Cython processing. This ensures the returned array is:
1. Exactly chunk_size elements
2. C-contiguous in memory
3. Of the correct dtype
- Parameters:
chunk_id (int) – ID of the chunk to retrieve
- Return type:
ndarray
- Returns:
Numpy array with the chunk data, guaranteed to be chunk_size elements