malva.autoannotate module

malva.autoannotate.load_marker_genes(json_path, exclude_technical=True)

Load marker genes from a JSON file, optionally excluding technical genes.

Parameters:

json_path (str) – Path to the marker gene JSON file
exclude_technical (bool, default=True) – If True, exclude any gene categories whose names start with 'technical_'

Returns:

Dictionary of filtered marker genes

Return type:

dict
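
The filtering described for exclude_technical can be illustrated with a minimal sketch. The function name load_marker_genes_sketch and the file layout (category name mapped to a list of gene symbols) are assumptions made for illustration; this is not the library's implementation.

```python
import json
import os
import tempfile


def load_marker_genes_sketch(json_path, exclude_technical=True):
    """Hypothetical stand-in for load_marker_genes; not the library's code."""
    with open(json_path, encoding="utf-8") as handle:
        markers = json.load(handle)
    if not exclude_technical:
        return markers
    # Drop every category whose name starts with the 'technical_' prefix.
    return {
        category: genes
        for category, genes in markers.items()
        if not category.startswith("technical_")
    }


# Assumed file layout: {"category name": ["GENE1", "GENE2", ...]}
example = {
    "T_cell": ["CD3D", "CD3E"],
    "technical_mitochondrial": ["MT-CO1", "MT-ND1"],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "markers.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(example, handle)
    filtered = load_marker_genes_sketch(path)
    unfiltered = load_marker_genes_sketch(path, exclude_technical=False)
```

Here filtered keeps only the T_cell category, while unfiltered still contains technical_mitochondrial.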

malva.autoannotate.run_clustering(adata, savefig=None, resolution=1)

Automated analysis pipeline for filtering and clustering

Parameters:

adata (AnnData) – AnnData object containing raw counts (not filtered nor normalized)
savefig (str, default=None) – Folder in which to save the plots. By default, plots are saved in the current path
resolution (float, default=1) – Resolution used for leiden clustering (higher values = more clusters)

Returns:

adata_filtered – AnnData object containing only the called cells

Return type:

AnnData

malva.autoannotate.preprocess_adata(adata, umi_cutoff=500, cell_cutoff=2)

Preprocesses the data by filtering cells (by UMI counts) and genes (by number of cells), then applies normalization. A copy of the raw counts is kept.

Parameters:

adata (AnnData) – AnnData object containing raw counts (not filtered or normalized)
umi_cutoff (int, default=500) – UMI count threshold used for cell filtering
cell_cutoff (int, default=2) – Cell count threshold used for gene filtering

Returns:

adata_filtered – AnnData object containing only the called cells

Return type:

AnnData
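
The two filtering steps can be sketched on a small dense matrix. This is an illustration only: the name filter_counts_sketch, the inclusive (>=) thresholds, and the choice to count gene detection over the already filtered cells are all assumptions, and normalization is left out.

```python
def filter_counts_sketch(counts, umi_cutoff=500, cell_cutoff=2):
    """Filter a dense cells x genes count matrix given as a list of lists.

    Assumption: thresholds are inclusive (>=); the library may differ.
    """
    # Keep cells whose total UMI count reaches umi_cutoff.
    kept_cells = [i for i, row in enumerate(counts) if sum(row) >= umi_cutoff]
    n_genes = len(counts[0]) if counts else 0
    # Keep genes detected (count > 0) in at least cell_cutoff of the kept cells.
    kept_genes = [
        j for j in range(n_genes)
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= cell_cutoff
    ]
    filtered = [[counts[i][j] for j in kept_genes] for i in kept_cells]
    return filtered, kept_cells, kept_genes


counts = [
    [300, 250, 0],   # 550 UMIs: kept
    [400, 200, 5],   # 605 UMIs: kept
    [10, 0, 0],      # 10 UMIs: dropped
]
filtered, kept_cells, kept_genes = filter_counts_sketch(counts)
```

With the default cutoffs, the third cell is removed for having too few UMIs, and the third gene is removed because only one remaining cell expresses it.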

malva.autoannotate.score_cells_by_cell_type(adata, cell_markers)

Score cells for each cell type based on marker gene expression.

Parameters:

adata (AnnData) – Annotated data matrix with cells as rows and genes as columns
cell_markers (dict) – Dictionary mapping cell types to lists of marker genes

Returns:

adata – Input object with scores added to obs

Return type:

AnnData

malva.autoannotate.get_top_cell_types(adata, n_types=3, score_prefix='score_')

For each cell, get the top-scoring cell types.

Parameters:

adata (AnnData) – Annotated data with cell type scores
n_types (int, default=3) – Number of top cell types to retrieve
score_prefix (str, default='score_') – Prefix for score columns

Returns:

top_cell_types – DataFrame with top cell types and scores for each cell

Return type:

pd.DataFrame
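
As a rough illustration of how n_types and score_prefix interact, the sketch below ranks score columns per cell. Plain dictionaries stand in for adata.obs, and top_cell_types_sketch is a hypothetical name; the actual function returns a DataFrame whose layout is not described here.

```python
def top_cell_types_sketch(obs_rows, n_types=3, score_prefix="score_"):
    """obs_rows maps cell barcode -> {column name: value} (stand-in for adata.obs)."""
    result = {}
    for barcode, row in obs_rows.items():
        # Select only the score columns and strip the prefix to get the cell type.
        scores = {
            column[len(score_prefix):]: value
            for column, value in row.items()
            if column.startswith(score_prefix)
        }
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        result[barcode] = ranked[:n_types]
    return result


obs_rows = {
    "AAAC": {"score_T_cell": 0.9, "score_B_cell": 0.1, "score_NK": 0.4, "leiden": "0"},
    "AAAG": {"score_T_cell": 0.2, "score_B_cell": 0.8, "score_NK": 0.3, "leiden": "1"},
}
top = top_cell_types_sketch(obs_rows, n_types=2)
```

Columns that do not start with the prefix (such as leiden) are ignored, so only cell type scores are ranked.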

malva.autoannotate.annotate_clusters(adata, cell_markers, cluster_key='leiden', threshold=0.4, min_markers=3, savefig=None)

Score and annotate cell types based on cluster-level expression signatures.

Parameters:

adata (AnnData) – Annotated data matrix with clustering results
cell_markers (dict) – Dictionary mapping cell types to lists of marker genes
cluster_key (str, default='leiden') – Key in adata.obs for cluster assignments
threshold (float, default=0.4) – Minimum differential expression score for a marker to be considered
min_markers (int, default=3) – Minimum number of markers needed for a cell type to be assigned
savefig (str, default=None) – Path where the plots will be saved. If None, no plots are saved

Returns:

adata (AnnData) – Input object with cell type annotations added
annotations (pd.DataFrame) – Detailed annotation information for each cluster
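
One plausible reading of how threshold and min_markers combine is sketched below. Everything here is an assumption for illustration: the name annotate_clusters_sketch, the per-cluster score dictionary, the "Unknown" fallback label, and the rule of picking the cell type with the most passing markers.

```python
def annotate_clusters_sketch(cluster_scores, cell_markers, threshold=0.4, min_markers=3):
    """cluster_scores maps cluster -> {gene: differential expression score}."""
    annotations = {}
    for cluster, gene_scores in cluster_scores.items():
        best_type, best_hits = "Unknown", 0  # hypothetical fallback label
        for cell_type, markers in cell_markers.items():
            # Count the markers whose score reaches the threshold in this cluster.
            hits = sum(1 for gene in markers if gene_scores.get(gene, 0.0) >= threshold)
            # A cell type is eligible only with at least min_markers passing markers.
            if hits >= min_markers and hits > best_hits:
                best_type, best_hits = cell_type, hits
        annotations[cluster] = best_type
    return annotations


cell_markers = {
    "T_cell": ["CD3D", "CD3E", "CD2", "IL7R"],
    "B_cell": ["CD79A", "MS4A1", "CD19"],
}
cluster_scores = {
    "0": {"CD3D": 0.9, "CD3E": 0.7, "CD2": 0.5, "IL7R": 0.1, "CD79A": 0.0},
    "1": {"CD79A": 0.8, "MS4A1": 0.6, "CD19": 0.2},
}
annotations = annotate_clusters_sketch(cluster_scores, cell_markers)
```

Cluster "0" has three T cell markers above the threshold and is labeled T_cell; cluster "1" has only two B cell markers above it, so it stays unassigned under min_markers=3.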

malva.autoannotate.get_detailed_cluster_annotations(adata, cluster_key='leiden')

Get detailed cluster annotations with cell type distribution.

malva.autoannotate.analyze_technical_genes(adata, housekeeping_genes, savefig=None)

Analyze technical and housekeeping genes to assess data quality.

Parameters:

adata (AnnData) – Annotated data matrix
housekeeping_genes (dict) – Dictionary of gene categories and their corresponding genes
savefig (str, default=None) – Path where the plots will be saved. If None, no plots are saved

malva.autoannotate.umi_threshold_cell_calling(adata, expected_cells=None, min_cells=2, max_cells=None, percentile=99, ordmag_divisor=10, plot=True)

Implementation of a UMI threshold method for cell calling, similar to Cell Ranger’s approach but independent of Cell Ranger and working directly with AnnData objects.

Parameters:

adata (AnnData) – AnnData object containing raw counts (not filtered or normalized)
expected_cells (int, optional) – Expected number of cells in the dataset. If None, it will be estimated.
min_cells (int, default=2) – Minimum number of cells to consider in the grid search
max_cells (int, optional) – Maximum number of cells to consider in the grid search. If None, it will be set to int(n_barcodes/2) or 45,000, whichever is smaller.
percentile (int, default=99) – Percentile used for the UMI threshold calculation (Cell Ranger uses 99th percentile)
ordmag_divisor (int, default=10) – Divisor used in the Order of Magnitude algorithm (Cell Ranger uses 10)
plot (bool, default=True) – Whether to plot the UMI distribution and threshold

Returns:

adata_filtered (AnnData) – AnnData object containing only the called cells
threshold (float) – UMI count threshold used for cell calling
cell_barcodes (list) – List of barcodes that were called as cells
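
The role of percentile and ordmag_divisor can be seen in a small sketch of the order-of-magnitude rule as Cell Ranger describes it: take the given percentile of the top expected_cells barcodes by UMI count, divide it by the divisor, and call every barcode above that value. The name ordmag_threshold_sketch and the nearest-rank percentile convention are assumptions, and the grid search used when expected_cells is None is not shown.

```python
import math


def ordmag_threshold_sketch(umi_counts, expected_cells, percentile=99, ordmag_divisor=10):
    """Return (threshold, indices of called barcodes) for per-barcode UMI totals."""
    ranked = sorted(umi_counts, reverse=True)
    top = sorted(ranked[:expected_cells])  # top barcodes, in ascending order
    # Nearest-rank percentile over the top barcodes (an assumed convention).
    rank = max(1, math.ceil(percentile / 100 * len(top)))
    robust_max = top[rank - 1]
    threshold = robust_max / ordmag_divisor
    # Barcodes whose UMI total exceeds the threshold are called as cells.
    called = [i for i, count in enumerate(umi_counts) if count > threshold]
    return threshold, called


umi_counts = [10000, 9000, 8000, 7000, 1200, 600, 50, 20, 5, 1]
threshold, called = ordmag_threshold_sketch(umi_counts, expected_cells=4)
```

In this toy input the robust maximum is 10000, so the threshold is 1000 and the barcode with 1200 UMIs is called even though it lies outside the top four.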

malva.autoannotate.simple_good_turing_smoothing(counts)

Implements the Simple Good-Turing smoothing algorithm to estimate probabilities for unseen events, ensuring non-zero proportions for genes with zero counts.

Parameters:

counts (array-like) – Count data for each gene

Returns:

probabilities – Smoothed probabilities for each gene

Return type:

ndarray
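
To show why smoothing gives zero-count genes a non-zero proportion, the sketch below uses only the basic Good-Turing estimate of unseen mass, N1 / N (number of genes seen exactly once divided by the total count). It is deliberately not the full Simple Good-Turing procedure, which also replaces observed counts with adjusted counts from a log-linear fit of count frequencies; here the raw counts are simply rescaled. The function name is hypothetical.

```python
def good_turing_unseen_mass_sketch(counts):
    """Assign probabilities using only the Good-Turing unseen-mass estimate."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one non-zero entry")
    n_zero = sum(1 for c in counts if c == 0)
    n_singletons = sum(1 for c in counts if c == 1)
    # Good-Turing estimate of the total probability of unseen events: N1 / N.
    # If no gene has a zero count, there is nothing to reserve mass for.
    unseen_mass = n_singletons / total if n_zero > 0 else 0.0
    probabilities = []
    for c in counts:
        if c == 0:
            # Share the unseen mass evenly among zero-count genes.
            probabilities.append(unseen_mass / n_zero)
        else:
            # Rescale observed proportions so that everything sums to one.
            probabilities.append((1.0 - unseen_mass) * c / total)
    return probabilities


counts = [5, 3, 1, 1, 0, 0]
probabilities = good_turing_unseen_mass_sketch(counts)
```

Note that this simplified version still assigns zero probability to unseen genes when no gene has a count of exactly one, a case the full algorithm handles through its smoothed frequency estimates.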

malva.autoannotate.emptydrops_refinement(adata, adata_filtered=None, threshold=None, ambient_min_umi=1, ambient_max_umi=100, min_total_umi=500, fdr_threshold=0.01, plot=True)

Implementation of the EmptyDrops algorithm for refining cell calling by identifying low RNA content cells that are distinguishable from empty droplets.

Parameters:

adata (AnnData) – Original AnnData object containing all barcodes (filtered and unfiltered)
adata_filtered (AnnData, optional) – AnnData object containing barcodes called as cells by OrdMag or another method. If None, it assumes all barcodes in adata are potential cells.
threshold (float, optional) – UMI threshold used in initial cell calling. If None, it will be estimated.
ambient_min_umi (int, default=1) – Minimum total UMI count to consider a barcode for ambient RNA profile estimation
ambient_max_umi (int, default=100) – Maximum total UMI count to consider a barcode for ambient RNA profile estimation
min_total_umi (int, default=500) – Minimum total UMI count for a barcode to be considered as a candidate cell
fdr_threshold (float, default=0.01) – False discovery rate threshold for cell calling
plot (bool, default=True) – Whether to plot diagnostic visualizations

Returns:

adata_refined (AnnData) – AnnData object containing all called cells (OrdMag + EmptyDrops)
ambient_profile (ndarray) – Estimated ambient RNA profile
cell_barcodes (list) – List of barcodes that were called as cells
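
The ambient_min_umi and ambient_max_umi parameters select the barcodes used to estimate the ambient RNA profile. A minimal sketch of that first step, assuming the profile is the pooled per-gene proportion over barcodes in that UMI range (inclusive bounds and the name ambient_profile_sketch are assumptions), could look like this. The later steps, such as smoothing the profile and testing candidate barcodes against it with an FDR cutoff, are not shown.

```python
def ambient_profile_sketch(counts, ambient_min_umi=1, ambient_max_umi=100):
    """Pool barcodes in the ambient UMI range and return per-gene proportions."""
    n_genes = len(counts[0])
    pooled = [0] * n_genes
    for row in counts:
        total = sum(row)
        # Only low-count barcodes are treated as representative of empty droplets.
        if ambient_min_umi <= total <= ambient_max_umi:
            for j, value in enumerate(row):
                pooled[j] += value
    pooled_total = sum(pooled)
    if pooled_total == 0:
        raise ValueError("no barcodes fall in the ambient UMI range")
    return [value / pooled_total for value in pooled]


counts = [
    [400, 300, 100],  # 800 UMIs: above the ambient range
    [6, 3, 1],        # 10 UMIs: ambient
    [2, 1, 7],        # 10 UMIs: ambient
    [0, 0, 0],        # 0 UMIs: below the ambient range
]
profile = ambient_profile_sketch(counts)
```

Only the two 10-UMI barcodes contribute, giving pooled counts of 8, 4 and 8 across the three genes.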

malva.autoannotate.load_markers(marker_source)

Load marker genes and prepare technical/non-technical gene lists.

Parameters:

marker_source (str) – Either 'human_markers', 'human_markers_hallmarks', 'mouse_markers', or a path to a custom JSON file

Returns:

(cell_markers_nontechnical, cell_markers, nontechnical_genes)

Return type:

tuple

malva.autoannotate.score_annotate(adata, cell_markers, savefig=None)

Score cells for each cell type and annotate with top cell types.

Parameters:

adata (AnnData) – Clustered AnnData object
cell_markers (dict) – Dictionary mapping cell types to marker genes
savefig (str, default=None) – Path where the plots will be saved. If None, no plots are saved

Returns:

AnnData with cell type scores and annotations

Return type:

AnnData