Indexing & Query Framework¶

Note

This page documents the algorithmic design of the Malva index construction and query framework in full detail, enabling re-implementation and reproducibility. It corresponds to Supplementary Note 4 of the Malva manuscript.

Malva implements an inverted index for the single-cell colored \(k\)-mer indexing problem. Given a collection of sequencing libraries \(\mathcal{R} = \{R_1, \ldots, R_S\}\) where each library \(R_i\) contains reads from \(n_i\) cells, the goal is to support queries that return, for any \(k\)-mer \(x\), the set of cells

\[\text{Cells}(x) = \{(i,j) \mid x \text{ appears in cell } c_{ij}\}.\]

This problem differs from bulk-sample \(k\)-mer indexing in that the label space grows from \(O(S)\) samples to \(O(\sum_i n_i)\) cells — typically \(10^6\) to \(10^{10}\) unique labels — rendering traditional color-aggregative approaches impractical.

Definitions¶

The following entities are used and tracked throughout the indexing process:

Chunk: A self-contained index unit comprising multiple samples. Each chunk is stored as an independent file and can be queried in parallel.
Sample: A sequencing dataset (e.g., one 10x Chromium run) containing reads from multiple cells. Each sample has associated sample-level metadata (e.g., tissue type, donor ID, experimental condition).
Cell: An individual cell within a sample, identified by its barcode. Each cell has associated cell-level metadata (e.g., cell type annotation, cluster assignment).

These entities are linked to an external metadata database via composite identifiers, enabling queries to return not only matching cells but also their biological annotations.

Data Structure¶

Each chunk contains an inverted index stored in compressed sparse row (CSR) format comprising three arrays:

Array	Name	Description
\(\mathbf{I}\)	`indices`	\(N\) unique \(k\)-mers in lexicographically sorted order
\(\mathbf{P}\)	`indptr`	For each \(k\)-mer at position \(i\), \(\mathbf{P}[i]\) stores the start offset into the location array; \(\mathbf{P}[i+1] - \mathbf{P}[i]\) gives the number of cells containing \(k\)-mer \(i\)
\(\mathbf{D}\)	`data`	\(Z\) total cell identifiers (as 32-bit composites), grouped by \(k\)-mer

This structure supports queries in \(O(\log N + L_x)\) time where \(L_x = |\text{Cells}(x)|\).

Sequence array  I:  [ k₀  | k₁  | k₂  | k₃  | ···  | k_{N-1} ]
                          ↓       ↓       ↓
Pointer array   P:  [  0  |  2  |  5  |  6  |  9  | ···  |  Z  ]
                      ↓              ↓              ↓
Location array  D:  [ c₀ | c₁ | c₂ | c₃ | c₄ | c₅ | c₆ | c₇ | c₈ | ··· ]

32-bit cell ID layout:
┌─────────────────────┬──────────────────────────┐
│   sample (s bits)   │   cell (32 − s bits)     │
└─────────────────────┴──────────────────────────┘

Cell Identifier Encoding¶

Cell identifiers stored in \(\mathbf{D}\) are 32-bit integers partitioned into two components:

Upper \(s\) bits: sample ID within the chunk
Lower \(32 - s\) bits: cell ID within the sample

For example, allocating 8 bits for sample ID and 24 bits for cell ID (default) permits up to 256 samples per chunk with approximately 16 million cells each. Alternative allocations (e.g., 9 bits for sample, 23 bits for cell) allow 512 samples per chunk with approximately 8 million cells each. The choice depends on the expected cell counts per dataset.

Chunks are kept relatively large (many samples) to improve query efficiency by reducing the number of separate index files that must be accessed. The chunk ID itself is stored as external metadata alongside the index file, not within the 32-bit cell identifier.

Index Construction¶

Construction proceeds in three phases:

Streaming extraction of \(k\)-mers from FASTQ files into temporary chunks
Merging chunks within each sample
Merging samples into multi-sample chunks

K-mer Encoding and Sampling¶

Each \(k\)-mer is encoded as a 64-bit unsigned integer using the standard 2-bit nucleotide representation (A→00, C→01, G→10, T→11), supporting \(k \leq 32\).

From each read of length \(\ell\), \(k\)-mers are extracted at non-overlapping positions \(\{0, k, 2k, \ldots\}\) plus a final \(k\)-mer at position \(\ell - k\) to ensure complete coverage. The stride between consecutive \(k\)-mers equals \(k\) (non-overlapping). This sampling reduces storage by a factor of \(\sim k\) while guaranteeing that for any query window of length \(\geq 2k\), at least one indexed \(k\)-mer will be found.

\(K\)-mers containing ambiguous nucleotides (N) or consisting entirely of a single nucleotide (homopolymers) are excluded. Chunk-level counts of \(k\)-mers can be computed from pointer array offsets. Indexing is associative and does not preserve counts per label.

Barcode Handling¶

Cell barcodes are extracted from technology-specific read positions and validated against the manufacturer’s whitelist. Barcodes are encoded as 64-bit integers using the same 2-bit nucleotide representation as \(k\)-mers, allowing efficient storage and comparison.

The barcode-to-cell mapping is maintained in an unsorted map associating each encoded barcode with its integer cell ID (or -1 if absent) within the sample. This structure supports \(O(1)\) lookup during read processing. During the final merge phase, local cell IDs are combined with the sample ID to form the 32-bit composite identifier described above.

Streaming Chunk Construction¶

FASTQ files are processed in a streaming fashion. For each read, the barcode is validated and \(k\)-mers are extracted. Pairs of \((k\text{-mer},\,\text{cell-id})\) are accumulated in an in-memory buffer until reaching threshold \(T\) (default \(10^8\) pairs), at which point the buffer is flushed to disk as a temporary chunk.

Algorithm 1: Chunk Flush — Buffer to CSR Arrays

Input: Buffer \(B\) of \((k\text{-mer},\ \text{cell-id})\) pairs

Output: Arrays \(\mathbf{I}\), \(\mathbf{P}\), \(\mathbf{D}\) written to disk

Sort \(B\) lexicographically by \((k\text{-mer},\ \text{cell-id})\)
Initialize \(\mathbf{I} \gets []\), \(\mathbf{P} \gets [0]\), \(\mathbf{D} \gets []\)
Set \((\texttt{prev\_k},\ \texttt{prev\_c}) \gets (\text{null},\ \text{null})\)
For each \((x, c)\) in \(B\):
1. If \(x \neq \texttt{prev\_k}\): (new \(k\)-mer)
  - Append \(x\) to \(\mathbf{I}\)
  - Append \(|\mathbf{D}|\) to \(\mathbf{P}\) (record current position)
  - Set \(\texttt{prev\_c} \gets \text{null}\)
2. If \(c \neq \texttt{prev\_c}\): (deduplicate: presence/absence encoding)
  - Append \(c\) to \(\mathbf{D}\)
  - Set \(\texttt{prev\_c} \gets c\)
3. Set \(\texttt{prev\_k} \gets x\)
Append \(|\mathbf{D}|\) to \(\mathbf{P}\) (final pointer)
Compress \(\mathbf{I}\), \(\mathbf{P}\), \(\mathbf{D}\) blockwise and write to disk

Arrays are compressed blockwise to enable random access. Each array is partitioned into blocks of \(B = 512\) elements. For sorted arrays (\(\mathbf{I}\), \(\mathbf{P}\)), delta encoding may optionally be applied before compression. Compression is performed using the Blosc meta-compressor framework, which supports multiple underlying codecs including zstd (default), lz4, and zlib; the choice of codec can be configured based on the desired trade-off between compression ratio and speed. A chunk table records block offsets for random access.

Chunk Merging¶

After all reads are processed, temporary chunks must be merged. The merge algorithm processes \(k\)-mers in lexicographic order using proportional windows from each chunk to bound memory usage. The window size is set to 5% of each chunk’s size (capped at \(10^7\) entries).

Algorithm 2: Windowed Multi-Chunk Merge

Input: Chunks \(C_1, \ldots, C_m\), each with arrays \((\mathbf{I}_i, \mathbf{P}_i, \mathbf{D}_i)\)

Output: Merged index \((\mathbf{I}, \mathbf{P}, \mathbf{D})\) written to output

Initialize \(\texttt{ptr}[i] \gets 0\) for all \(i\)
Set \(\texttt{win}[i] \gets \min(0.05 \cdot |\mathbf{I}_i|,\ 10^7)\) for all \(i\)
While any chunk has unprocessed entries:
1. Find the maximum \(k\)-mer safely processable this iteration:
  - Set \(k_{\max} \gets \infty\)
  - For each active chunk \(C_i\): set \(\texttt{end} \gets \min(\texttt{ptr}[i] + \texttt{win}[i],\ |\mathbf{I}_i| - 1)\), then \(k_{\max} \gets \min(k_{\max},\ \mathbf{I}_i[\texttt{end}])\)
2. Collect all entries \(\leq k_{\max}\) from all chunks into set \(E\):
  - For each chunk \(C_i\), while \(\mathbf{I}_i[\texttt{ptr}[i]] \leq k_{\max}\): add \(\bigl(x,\ \mathbf{D}_i[\mathbf{P}_i[\texttt{ptr}[i]]:\mathbf{P}_i[\texttt{ptr}[i]+1]]\bigr)\) to \(E\) and advance \(\texttt{ptr}[i]\)
3. Merge entries sharing the same \(k\)-mer and write to output:
  - For each unique \(x\) in \(E\): write \(\bigl(x,\ \text{sorted}(\bigcup\{\texttt{cells} : (x, \texttt{cells}) \in E\})\bigr)\) to \(\mathbf{I}, \mathbf{P}, \mathbf{D}\)

Key insight: By taking the minimum of the window-end \(k\)-mers across all chunks, the algorithm guarantees that all occurrences of any \(k\)-mer \(\leq k_{\max}\) have been seen, enabling correct merging without loading entire chunks into memory.

When merging samples into multi-sample chunks, cell IDs are remapped to incorporate the sample offset within the chunk. The composite identifier then allows downstream queries to recover the chunk ID, sample ID, and cell ID, which serve as keys into the external metadata database.

Hierarchical Page Index¶

For indices where the sequence array exceeds available memory, a hierarchical page structure is constructed. The sequence array is partitioned into pages of \(P = 1024\) \(k\)-mers. A second-level array samples every \(P\)-th element; a third level samples every \(P\)-th element from the second; and so on until the root level contains fewer than \(P\) elements. This structure is stored alongside the main index arrays.

Level 2 (root):        [         k_max         ]
                        /          |           \
Level 1:         [k₁₀₂₄]      [k₂₀₄₈]       [···]
                 /      \      /      \
Base pages:   [P₀]    [P₁]  [P₂]    [P₃]    [P₄]  [···]

Each base page:    P = 1024 k-mers
Each level-1 node: max k-mer of P base pages

During lookup, navigation proceeds from root to leaf: at each level, binary search identifies which child page contains the target \(k\)-mer, narrowing the search space by factor \(P\) per level. For a sequence array of \(N\) \(k\)-mers, lookup requires \(O(\log_P N)\) page accesses, typically 3–4 for billion-scale indices. The root and intermediate levels are kept in memory; base pages are loaded on demand.

Querying¶

At query time, all chunks are accessed in parallel. For each chunk, the query sequence is decomposed into \(k\)-mers and matched against the chunk’s index. Results from all chunks are aggregated, and the composite cell identifiers are used to retrieve metadata from the external database.

Window-Based Matching¶

The query algorithm evaluates sequence similarity using a windowed voting scheme. For a query sequence \(Q\), a window of size \(w\) (default 64 bp) slides with stride 1 (overlapping windows). Within each window, non-overlapping \(k\)-mers at positions \(\{0, k, 2k, \ldots\}\) are extracted and looked up. A cell matches the window if at least fraction \(\tau\) (default 0.65) of the window’s \(k\)-mers are found in that cell. The final score for cell \(c\) is the count of windows where \(c\) achieved a match.

The use of overlapping windows with stride 1 ensures sensitivity to matches at any position within the query, while the non-overlapping \(k\)-mer extraction within each window controls computational cost.

Algorithm 3: Window-Based Query Scoring

Input: Query sequence \(Q\), index arrays \((\mathbf{I}, \mathbf{P}, \mathbf{D})\), window size \(w\), threshold \(\tau\)
Input: Abundance filters \(f_{\min}\) (count_at_least), \(f_{\max}\) (count_at_most)
Output: Map from cell ID to pseudocount score

Initialize \(\texttt{scores} \gets \{\}\) (cell → count of matching windows)
For \(p \gets 0\) to \(|Q| - w\) step 1: (overlapping windows, stride 1)
1. \(W \gets Q[p : p + w]\)
2. Extract non-overlapping \(k\)-mers: \(K \gets \{W[i \cdot k : (i+1) \cdot k] : i = 0, 1, \ldots, \lfloor w/k \rfloor - 1\}\)
3. Filter: \(K \gets \{x \in K : x \text{ contains no N}\}\)
4. Initialize \(\texttt{votes} \gets \{\}\) (cell → count of \(k\)-mers found)
5. For each \(k\)-mer \(x \in K\):
  1. \(i \gets \textsc{Lookup}(\mathbf{I}, x)\) (binary search or hierarchical lookup)
  2. If \(\mathbf{I}[i] \neq x\): skip (\(k\)-mer not in index)
  3. \(\texttt{cells} \gets \mathbf{D}[\mathbf{P}[i] : \mathbf{P}[i+1]]\)
  4. If \(|\texttt{cells}| < f_{\min}\) or \(|\texttt{cells}| > f_{\max}\): skip (filter rare/ubiquitous \(k\)-mers)
  5. For each \(c \in \texttt{cells}\): increment \(\texttt{votes}[c]\)
6. Set \(t \gets \lceil \tau \cdot |K| \rceil\)
7. For each \((c, v) \in \texttt{votes}\): if \(v \geq t\), increment \(\texttt{scores}[c]\)
Return \(\texttt{scores}\)

Example: \(k = 4\), \(w = 8\), \(\tau = 0.65\)

The window yields two \(k\)-mers (ACGT, ACGA). Each is located in \(\mathbf{I}\) via binary search; pointers in \(\mathbf{P}\) give the cell range in \(\mathbf{D}\).

ACGT → cells \(\{c_1, c_2\}\)
ACGA → cells \(\{c_1, c_3\}\)

With \(\tau = 0.65\), need \(t = \lceil 0.65 \times 2 \rceil = 2\) matching \(k\)-mers per window.

Cell \(c_1\): 2/2 \(\geq t\) → match (receives one window vote)
Cell \(c_2\): 1/2 \(< t\) → no vote
Cell \(c_3\): 1/2 \(< t\) → no vote

Users may specify query-time abundance filters:

count_at_least (default 0): excludes \(k\)-mers appearing in fewer than this many cells, filtering likely sequencing errors
count_at_most (default 100,000): excludes \(k\)-mers appearing in more than this many cells per chunk, filtering repetitive or ubiquitously expressed sequences that might dominate runtime and be uninformative

Filters are applied after lookup; no abundance filtering is applied during index construction.

Query Batching and Index Access Pattern¶

For efficiency, all \(k\)-mers across all query windows are collected, deduplicated, and sorted lexicographically before lookup. Because the sequence array is sorted, this ensures sequential access through the index, maximizing cache locality and minimizing disk seeks.

The access pattern proceeds as follows:

Collect all unique \(k\)-mers from all query windows
Sort \(k\)-mers lexicographically
For each \(k\)-mer in sorted order:
1. Binary search (or hierarchical lookup) in sequence array to find position \(i\)
2. Read \(\texttt{indptr}[i]\) and \(\texttt{indptr}[i+1]\) to determine cell range
3. Read \(\texttt{data}[\texttt{indptr}[i] : \texttt{indptr}[i+1]]\) to retrieve cells
Redistribute results to originating windows for scoring

For indices exceeding available memory, compressed array blocks are accessed via memory-mapped file I/O. The operating system’s virtual memory subsystem handles paging of index blocks on demand. Decompressed blocks are retained in an LRU cache (capacity \(2 \times 10^6\) blocks, approximately 4 GB) to avoid repeated decompression. With sequential access patterns, empirical cache hit rates exceed 90%.

Metadata Integration¶

Query results return composite cell identifiers, which are decomposed into (chunk ID, sample ID, cell ID) tuples serving as keys into a pre-indexed external database. This database stores:

Sample-level metadata: tissue type, donor ID, experimental condition, sequencing platform, publication source
Cell-level metadata: cell type annotation, cluster assignment, quality metrics

The separation of sequence index from metadata database allows efficient annotation updates without rebuilding the \(k\)-mer index.

Parameters Summary¶

Default parameter values for index construction and querying¶
Parameter	Symbol	Default	Description
Indexing
K-mer length	\(k\)	24	Nucleotides per \(k\)-mer
Buffer threshold	\(T\)	\(10^8\)	Pairs before chunk flush
Block size	\(B\)	512	Elements per compressed block
Page size	\(P\)	1024	Elements per hierarchical page
Merge window	—	5%	Fraction of chunk per iteration
Window cap	—	\(10^7\)	Maximum entries per merge window
Sample ID bits	\(s\)	8	Bits for sample within chunk
Querying
Query window	\(w\)	64	Nucleotides per scoring window
Window stride	—	1	Overlapping windows
Match threshold	\(\tau\)	0.65	Fraction of \(k\)-mers required per window
Min abundance	\(f_{\min}\)	0	`count_at_least` filter
Max abundance	\(f_{\max}\)	100,000	`count_at_most` filter
Cache capacity	—	\(2 \times 10^6\)	Blocks in LRU decompression cache