malva.autoannotate module

malva.autoannotate.load_marker_genes(json_path, exclude_technical=True)

Load marker genes from a JSON file, optionally excluding technical genes

Parameters:

json_path : str

Path to the marker gene JSON file

exclude_technical : bool

If True, exclude any gene categories whose names start with ‘technical_’

Returns:

dict

Dictionary of filtered marker genes
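
The filtering rule described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the function name load_marker_genes_sketch, the example categories, and the assumed JSON layout (a flat object mapping category names to gene lists) are all hypothetical.

```python
import json
import os
import tempfile


def load_marker_genes_sketch(json_path, exclude_technical=True):
    """Hypothetical stand-in: assumes a flat {category: [genes]} JSON object."""
    with open(json_path, encoding="utf-8") as handle:
        markers = json.load(handle)
    if exclude_technical:
        # Drop every category whose name starts with "technical_".
        markers = {
            name: genes
            for name, genes in markers.items()
            if not name.startswith("technical_")
        }
    return markers


# Illustrative marker file with one biological and one technical category.
example = {"T_cell": ["CD3D", "CD3E"], "technical_mito": ["MT-CO1"]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "markers.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(example, handle)
    filtered = load_marker_genes_sketch(path)
    unfiltered = load_marker_genes_sketch(path, exclude_technical=False)

print(sorted(filtered))
print(sorted(unfiltered))
```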

malva.autoannotate.run_clustering(adata, savefig=None, resolution=1)

Automated analysis pipeline for filtering and clustering

Parameters:
  • adata (AnnData) – AnnData object containing raw counts (neither filtered nor normalized)

  • savefig (str, default=None) – Folder in which to save the plots. By default, plots are saved in the current path

  • resolution (float, default=1) – Resolution used for leiden clustering (higher values = more clusters)

Returns:

adata_filtered – AnnData object containing only the called cells

Return type:

AnnData

malva.autoannotate.preprocess_adata(adata, umi_cutoff=500, cell_cutoff=2)

Preprocesses the data by filtering cells (by UMI counts) and genes (by number of cells), then applies normalization. The raw counts are copied.

Parameters:
  • adata (AnnData) – AnnData object containing raw counts (neither filtered nor normalized)

  • umi_cutoff (int) – UMI count threshold used for cell filtering

  • cell_cutoff (int) – Cell count threshold used for gene filtering

Returns:

adata_filtered – AnnData object containing only the called cells

Return type:

AnnData
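
The order of operations (filter cells, filter genes, then normalize while keeping the raw counts) can be sketched on a plain nested list. This is a hedged illustration only: preprocess_sketch is a hypothetical name, the inclusive comparisons (>=) are an assumption, and the normalization shown (scaling each cell to 10,000 counts followed by log1p) is a common convention that the documentation above does not confirm.

```python
import math


def preprocess_sketch(counts, umi_cutoff=500, cell_cutoff=2, target_sum=1e4):
    """counts: list of rows (cells), each a list of per-gene raw counts."""
    # 1. Keep cells whose total UMI count reaches the cutoff (assumed inclusive).
    kept_cells = [i for i, row in enumerate(counts) if sum(row) >= umi_cutoff]
    rows = [counts[i] for i in kept_cells]

    # 2. Keep genes detected in at least cell_cutoff of the remaining cells.
    n_genes = len(counts[0]) if counts else 0
    kept_genes = [
        j for j in range(n_genes)
        if sum(1 for row in rows if row[j] > 0) >= cell_cutoff
    ]

    # 3. Copy the filtered raw counts, then normalize a separate matrix.
    raw = [[row[j] for j in kept_genes] for row in rows]
    normalized = []
    for row in raw:
        total = sum(row)
        normalized.append(
            [math.log1p(v / total * target_sum) if total else 0.0 for v in row]
        )
    return kept_cells, kept_genes, raw, normalized


counts = [
    [300, 250, 0],  # total 550: kept
    [400, 200, 1],  # total 601: kept
    [10, 5, 0],     # total 15: removed
]
kept_cells, kept_genes, raw, normalized = preprocess_sketch(counts)
print(kept_cells, kept_genes)
```

In this toy matrix the third gene is detected in only one retained cell, so it is dropped by the gene filter.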

malva.autoannotate.score_cells_by_cell_type(adata, cell_markers)

Score cells for each cell type based on marker gene expression

Parameters:

adata : AnnData

Annotated data matrix with cells as rows and genes as columns

cell_markers : dict

Dictionary mapping cell types to lists of marker genes

Returns:

adata : AnnData

Input object with scores added to obs
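
One way to picture marker-based scoring is the mean expression of each cell type's markers, stored under a per-type column. The sketch below assumes exactly that; the actual scoring method is not documented here, and the function name, the plain-dict data layout, and the 'score_' column prefix (borrowed from the default of get_top_cell_types) are illustrative assumptions.

```python
def score_cells_sketch(expression, cell_markers, score_prefix="score_"):
    """expression: {cell_id: {gene: value}}; returns {cell_id: {column: score}}."""
    scores = {}
    for cell_id, profile in expression.items():
        row = {}
        for cell_type, genes in cell_markers.items():
            # Only markers actually present in the data contribute to the mean.
            present = [profile[g] for g in genes if g in profile]
            row[score_prefix + cell_type] = (
                sum(present) / len(present) if present else 0.0
            )
        scores[cell_id] = row
    return scores


expression = {
    "cell1": {"CD3D": 2.0, "CD3E": 4.0, "MS4A1": 0.0},
    "cell2": {"CD3D": 0.0, "CD3E": 0.0, "MS4A1": 3.0},
}
cell_markers = {"T_cell": ["CD3D", "CD3E"], "B_cell": ["MS4A1", "CD79A"]}
scores = score_cells_sketch(expression, cell_markers)
print(scores["cell1"])
```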

malva.autoannotate.get_top_cell_types(adata, n_types=3, score_prefix='score_')

For each cell, get the top scoring cell types

Parameters:

adata : AnnData

Annotated data with cell type scores

n_types : int

Number of top cell types to retrieve

score_prefix : str

Prefix for score columns

Returns:

top_cell_types : pd.DataFrame

DataFrame with top cell types and scores for each cell
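
The selection step can be illustrated without pandas: pick the columns that carry the score prefix, strip the prefix, and keep the n highest per cell. The function name and the dict-of-dicts stand-in for adata.obs are hypothetical, and the real function returns a DataFrame rather than a dict.

```python
def top_cell_types_sketch(obs_rows, n_types=3, score_prefix="score_"):
    """obs_rows: {cell_id: {column: value}}, a plain stand-in for adata.obs."""
    result = {}
    for cell_id, row in obs_rows.items():
        # Keep only score columns and strip the prefix to recover the type name.
        scored = [
            (name[len(score_prefix):], value)
            for name, value in row.items()
            if name.startswith(score_prefix)
        ]
        # Highest score first, then truncate to the requested number of types.
        scored.sort(key=lambda item: item[1], reverse=True)
        result[cell_id] = scored[:n_types]
    return result


obs_rows = {
    "cell1": {
        "score_T_cell": 0.9,
        "score_B_cell": 0.1,
        "score_NK": 0.5,
        "n_counts": 1200,  # non-score columns are ignored
    },
}
top = top_cell_types_sketch(obs_rows, n_types=2)
print(top["cell1"])
```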

malva.autoannotate.annotate_clusters(adata, cell_markers, cluster_key='leiden', threshold=0.4, min_markers=3, savefig=None)

Score and annotate cell types based on cluster-level expression signatures

Parameters:

adata : AnnData

Annotated data matrix with clustering results

cell_markers : dict

Dictionary mapping cell types to lists of marker genes

cluster_key : str

Key in adata.obs for cluster assignments

threshold : float

Minimum differential expression score threshold for a marker to be considered

min_markers : int

Minimum number of markers needed for a cell type to be assigned

savefig : str

Path where the plots will be saved. If None, then no plots are saved

Returns:

adata : AnnData

Input object with cell type annotations added

annotations : pd.DataFrame

Detailed annotation information for each cluster
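
How threshold and min_markers interact can be sketched as a counting rule: a marker counts as a hit when its cluster-level score reaches the threshold, and a cell type is assigned only when it collects at least min_markers hits. The rule for choosing among competing types (most hits wins), the 'Unknown' label, and the input layout below are assumptions made for illustration, not documented behavior.

```python
def annotate_clusters_sketch(cluster_scores, cell_markers,
                             threshold=0.4, min_markers=3):
    """cluster_scores: {cluster: {gene: differential expression score}}."""
    annotations = {}
    for cluster, gene_scores in cluster_scores.items():
        best_type, best_hits = "Unknown", 0
        for cell_type, genes in cell_markers.items():
            # A marker is a hit when its score reaches the threshold.
            hits = sum(1 for g in genes if gene_scores.get(g, 0.0) >= threshold)
            # Assign only types with enough hits; prefer the type with most hits.
            if hits >= min_markers and hits > best_hits:
                best_type, best_hits = cell_type, hits
        annotations[cluster] = (best_type, best_hits)
    return annotations


cell_markers = {
    "T_cell": ["CD3D", "CD3E", "CD2", "IL7R"],
    "B_cell": ["MS4A1", "CD79A", "CD79B"],
}
cluster_scores = {
    "0": {"CD3D": 0.9, "CD3E": 0.8, "CD2": 0.5, "IL7R": 0.1, "MS4A1": 0.0},
    "1": {"MS4A1": 0.9, "CD79A": 0.7, "CD79B": 0.2},
}
annotations = annotate_clusters_sketch(cluster_scores, cell_markers)
print(annotations)
```

Cluster "1" has only two B cell markers above the threshold, so with min_markers=3 it stays unassigned in this sketch.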

malva.autoannotate.get_detailed_cluster_annotations(adata, cluster_key='leiden')

Get detailed cluster annotations with cell type distribution

malva.autoannotate.analyze_technical_genes(adata, housekeeping_genes, savefig=None)

Analyze technical and housekeeping genes to assess data quality

Parameters:

adata : AnnData

Annotated data matrix

housekeeping_genes : dict

Dictionary of gene categories and their corresponding genes

savefig : str

Path where the plots will be saved. If None, then no plots are saved

malva.autoannotate.umi_threshold_cell_calling(adata, expected_cells=None, min_cells=2, max_cells=None, percentile=99, ordmag_divisor=10, plot=True)

Implementation of a UMI threshold method for cell calling, similar to Cell Ranger’s approach but independent of Cell Ranger and working directly with AnnData objects.

Parameters:
  • adata (AnnData) – AnnData object containing raw counts (not filtered or normalized)

  • expected_cells (int, optional) – Expected number of cells in the dataset. If None, it will be estimated.

  • min_cells (int, default=2) – Minimum number of cells to consider in the grid search

  • max_cells (int, optional) – Maximum number of cells to consider in the grid search. If None, it will be set to int(n_barcodes/2) or 45,000, whichever is smaller.

  • percentile (int, default=99) – Percentile used for the UMI threshold calculation (Cell Ranger uses 99th percentile)

  • ordmag_divisor (int, default=10) – Divisor used in the Order of Magnitude algorithm (Cell Ranger uses 10)

  • plot (bool, default=True) – Whether to plot the UMI distribution and threshold

Returns:

  • adata_filtered (AnnData) – AnnData object containing only the called cells

  • threshold (float) – UMI count threshold used for cell calling

  • cell_barcodes (list) – List of barcodes that were called as cells
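
The role of percentile and ordmag_divisor can be sketched as follows: take the top expected_cells barcodes by UMI count, read off the chosen percentile of their counts, and divide by the divisor to get the calling threshold. This is a simplified illustration under stated assumptions: expected_cells must be supplied (the grid search used when it is None is omitted), the percentile uses the nearest-rank definition, the comparison is strict, and ordmag_threshold_sketch is a hypothetical name.

```python
import math


def ordmag_threshold_sketch(umi_totals, expected_cells,
                            percentile=99, ordmag_divisor=10):
    """umi_totals: {barcode: total UMI count}. Returns (threshold, called)."""
    top = sorted(umi_totals.values(), reverse=True)[:expected_cells]
    if not top:
        raise ValueError("need at least one barcode and expected_cells >= 1")
    # Nearest-rank percentile over the top barcodes (assumed definition).
    ascending = sorted(top)
    rank = max(1, math.ceil(percentile / 100 * len(ascending)))
    reference = ascending[rank - 1]
    # Barcodes within one "order of magnitude" of the reference are called.
    threshold = reference / ordmag_divisor
    called = [bc for bc, total in umi_totals.items() if total > threshold]
    return threshold, called


umi_totals = {"AAAC": 12000, "AAAG": 9000, "AACT": 1500, "AAGT": 800, "ACGT": 40}
threshold, called = ordmag_threshold_sketch(umi_totals, expected_cells=3)
print(threshold, called)
```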

malva.autoannotate.simple_good_turing_smoothing(counts)

Implements the Simple Good-Turing smoothing algorithm to estimate probabilities for unseen events, ensuring non-zero proportions for genes with zero counts.

Parameters:

counts (array-like) – Count data for each gene

Returns:

probabilities – Smoothed probabilities for each gene

Return type:

ndarray
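
The core Good-Turing idea is that the probability mass reserved for unseen events is estimated from the number of events seen exactly once. The sketch below applies only that basic estimate and spreads the reserved mass evenly over zero-count genes. It is deliberately not the full Simple Good-Turing algorithm, which also smooths the frequency-of-frequencies curve; the function name is hypothetical.

```python
def good_turing_sketch(counts):
    """Basic Good-Turing estimate: unseen mass = (# singletons) / (total count)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    n_zero = sum(1 for c in counts if c == 0)
    n_singletons = sum(1 for c in counts if c == 1)
    # Reserve mass only when there are unseen genes to receive it.
    # Note: with no singletons the reserved mass is 0 in this simplified version.
    unseen_mass = n_singletons / total if n_zero else 0.0
    probabilities = []
    for c in counts:
        if c == 0:
            # Unseen genes share the reserved mass equally.
            probabilities.append(unseen_mass / n_zero)
        else:
            # Seen genes keep their proportion, scaled to the remaining mass.
            probabilities.append((c / total) * (1.0 - unseen_mass))
    return probabilities


probabilities = good_turing_sketch([5, 3, 1, 1, 0, 0])
print(probabilities)
```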

malva.autoannotate.emptydrops_refinement(adata, adata_filtered=None, threshold=None, ambient_min_umi=1, ambient_max_umi=100, min_total_umi=500, fdr_threshold=0.01, plot=True)

Implementation of the EmptyDrops algorithm for refining cell calling by identifying low RNA content cells that are distinguishable from empty droplets.

Parameters:
  • adata (AnnData) – Original AnnData object containing all barcodes (filtered and unfiltered)

  • adata_filtered (AnnData, optional) – AnnData object containing barcodes called as cells by OrdMag or another method. If None, it assumes all barcodes in adata are potential cells.

  • threshold (float, optional) – UMI threshold used in initial cell calling. If None, it will be estimated.

  • ambient_min_umi (int, default=1) – Minimum total UMI count to consider a barcode for ambient RNA profile estimation

  • ambient_max_umi (int, default=100) – Maximum total UMI count to consider a barcode for ambient RNA profile estimation

  • min_total_umi (int, default=500) – Minimum total UMI count for a barcode to be considered as a candidate cell

  • fdr_threshold (float, default=0.01) – False discovery rate threshold for cell calling

  • plot (bool, default=True) – Whether to plot diagnostic visualizations

Returns:

  • adata_refined (AnnData) – AnnData object containing all called cells (OrdMag + EmptyDrops)

  • ambient_profile (ndarray) – Estimated ambient RNA profile

  • cell_barcodes (list) – List of barcodes that were called as cells
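
The fdr_threshold step can be illustrated with the Benjamini-Hochberg adjustment, which the original EmptyDrops method applies to per-barcode p-values. Whether this function uses exactly the same correction is an assumption; the sketch covers only the multiple-testing step, not the ambient profile estimation or the p-value computation, and the p-values shown are made up for illustration.

```python
def benjamini_hochberg_sketch(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Illustrative p-values for five candidate barcodes (not real data).
p_values = [0.001, 0.008, 0.039, 0.041, 0.9]
adjusted = benjamini_hochberg_sketch(p_values)
fdr_threshold = 0.01
called = [i for i, q in enumerate(adjusted) if q <= fdr_threshold]
print(adjusted, called)
```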

malva.autoannotate.load_markers(marker_source)

Load marker genes and prepare technical/non-technical gene lists

Parameters:

marker_source : str

Either ‘human_markers’, ‘human_markers_hallmarks’, ‘mouse_markers’, or a path to a custom JSON file

Returns:

tuple

(cell_markers_nontechnical, cell_markers, nontechnical_genes)

malva.autoannotate.score_annotate(adata, cell_markers, savefig=None)

Score cells for each cell type and annotate with top cell types

Parameters:

adata : AnnData

Clustered AnnData object

cell_markers : dict

Dictionary mapping cell types to marker genes

savefig : str

Path where the plots will be saved. If None, then no plots are saved

Returns:

AnnData

AnnData with cell type scores and annotations