malva.autoannotate module
- malva.autoannotate.load_marker_genes(json_path, exclude_technical=True)
Load marker genes from a JSON file, optionally excluding technical genes
Parameters:
- json_path (str)
Path to the marker gene JSON file
- exclude_technical (bool)
If True, exclude any gene categories whose names start with ‘technical_’
Returns:
- dict
Dictionary of filtered marker genes
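The filtering rule described above can be sketched in a few lines. This is a minimal stand-in, not the malva implementation: the function name `load_marker_genes_sketch` is hypothetical, and the JSON file is assumed to be a flat object mapping category names to lists of gene symbols.

```python
import json


def load_marker_genes_sketch(json_path, exclude_technical=True):
    """Hypothetical stand-in for illustration; not the malva source."""
    # Assumption: the file holds {"category": ["GENE1", "GENE2", ...], ...}.
    with open(json_path, encoding="utf-8") as handle:
        markers = json.load(handle)
    if not exclude_technical:
        return markers
    # Drop every category whose name starts with "technical_".
    return {
        category: genes
        for category, genes in markers.items()
        if not category.startswith("technical_")
    }
```

Under these assumptions, a file containing the categories `T_cell` and `technical_ribosomal` would yield a dictionary with only `T_cell` when `exclude_technical` is left at its default.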
- malva.autoannotate.run_clustering(adata, savefig=None, resolution=1)
Automated analysis pipeline for filtering and clustering
- Parameters:
adata (AnnData) – AnnData object containing raw counts (neither filtered nor normalized)
savefig (str, default=None) – Folder in which to save the plots. By default, plots are saved in the current path
resolution (float, default=1) – Resolution used for Leiden clustering (higher values = more clusters)
- Returns:
adata_filtered – AnnData object containing only the called cells
- Return type:
AnnData
- malva.autoannotate.preprocess_adata(adata, umi_cutoff=500, cell_cutoff=2)
Preprocesses the data by filtering cells (by UMI count) and genes (by number of cells), then applies normalization. Copies raw counts.
- Parameters:
adata (AnnData) – AnnData object containing raw counts (neither filtered nor normalized)
umi_cutoff (int) – UMI count threshold used for cell filtering
cell_cutoff (int) – Cell count threshold used for gene filtering
- Returns:
adata_filtered – AnnData object containing only the called cells
- Return type:
AnnData
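The two filtering steps can be illustrated on a plain count matrix. This is a sketch under stated assumptions, not the malva code: the helper `filter_counts` is hypothetical, the matrix is a list of rows (one per cell, one column per gene), both cutoffs are treated as inclusive, a gene counts as detected in a cell when its count is above zero, and the normalization step is left out.

```python
def filter_counts(counts, umi_cutoff=500, cell_cutoff=2):
    """Hypothetical helper: rows are cells, columns are genes."""
    # Step 1: keep cells whose total UMI count reaches the cutoff
    # (using >= is an assumption).
    kept_cells = [i for i, row in enumerate(counts) if sum(row) >= umi_cutoff]
    n_genes = len(counts[0]) if counts else 0
    # Step 2: among the kept cells, keep genes detected (count > 0)
    # in at least cell_cutoff cells.
    kept_genes = [
        j
        for j in range(n_genes)
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= cell_cutoff
    ]
    filtered = [[counts[i][j] for j in kept_genes] for i in kept_cells]
    return filtered, kept_cells, kept_genes
```

On an AnnData object, scanpy offers equivalent filters as `scanpy.pp.filter_cells(adata, min_counts=...)` and `scanpy.pp.filter_genes(adata, min_cells=...)`; whether malva relies on them is not stated here.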
- malva.autoannotate.score_cells_by_cell_type(adata, cell_markers)
Score cells for each cell type based on marker gene expression
Parameters:
- adata (AnnData)
Annotated data matrix with cells as rows and genes as columns
- cell_markers (dict)
Dictionary mapping cell types to lists of marker genes
Returns:
- adata (AnnData)
Input object with scores added to obs
- malva.autoannotate.get_top_cell_types(adata, n_types=3, score_prefix='score_')
For each cell, get the top scoring cell types
Parameters:
- adata (AnnData)
Annotated data with cell type scores
- n_types (int)
Number of top cell types to retrieve
- score_prefix (str)
Prefix for score columns
Returns:
- top_cell_types (pd.DataFrame)
DataFrame with top cell types and scores for each cell
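The selection logic for a single cell can be sketched as follows. This is an illustration only, not the malva implementation: the helper `top_cell_types` is hypothetical, and the cell is represented as a plain dictionary of column name to value rather than a row of `adata.obs`.

```python
def top_cell_types(cell_scores, n_types=3, score_prefix="score_"):
    """Hypothetical helper: cell_scores maps column names to values."""
    # Keep only the score columns and strip the prefix to recover the type name.
    scored = [
        (name[len(score_prefix):], value)
        for name, value in cell_scores.items()
        if name.startswith(score_prefix)
    ]
    # Highest score first; ties keep their original column order
    # because Python's sort is stable.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n_types]
```

Columns that do not carry the prefix (for example a count column) are ignored, which is presumably why the prefix is exposed as a parameter.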
- malva.autoannotate.annotate_clusters(adata, cell_markers, cluster_key='leiden', threshold=0.4, min_markers=3, savefig=None)
Score and annotate cell types based on cluster-level expression signatures
Parameters:
- adata (AnnData)
Annotated data matrix with clustering results
- cell_markers (dict)
Dictionary mapping cell types to lists of marker genes
- cluster_key (str)
Key in adata.obs for cluster assignments
- threshold (float)
Minimum differential expression score threshold for a marker to be considered
- min_markers (int)
Minimum number of markers needed for a cell type to be assigned
- savefig (str)
Path where the plots will be saved. If None, then no plots are saved
Returns:
- adata (AnnData)
Input object with cell type annotations added
- annotations (pd.DataFrame)
Detailed annotation information for each cluster
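How `threshold` and `min_markers` might interact for one cluster is sketched below. The decision rule is an assumption made for illustration, not taken from the malva source: the helper `annotate_cluster` is hypothetical, the cluster is described by a dictionary of gene to differential expression score, the type with the most supporting markers wins, and the fallback label "Unknown" is invented here.

```python
def annotate_cluster(marker_scores, cell_markers, threshold=0.4, min_markers=3):
    """Hypothetical rule: marker_scores maps gene -> score for one cluster."""
    best_type, best_hits = "Unknown", 0  # "Unknown" is an assumed fallback label
    for cell_type, genes in cell_markers.items():
        # A marker is considered only if its score reaches the threshold.
        hits = sum(1 for g in genes if marker_scores.get(g, 0.0) >= threshold)
        # A type is eligible only with at least min_markers supporting markers;
        # among eligible types, the one with the most markers is kept.
        if hits >= min_markers and hits > best_hits:
            best_type, best_hits = cell_type, hits
    return best_type, best_hits
```

Raising `min_markers` makes the rule more conservative: a cluster supported by three T cell markers is labelled with the default of 3 but falls back to the assumed "Unknown" label at 4.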
- malva.autoannotate.get_detailed_cluster_annotations(adata, cluster_key='leiden')
Get detailed cluster annotations with cell type distribution
- malva.autoannotate.analyze_technical_genes(adata, housekeeping_genes, savefig=None)
Analyze technical and housekeeping genes to assess data quality
Parameters:
- adata (AnnData)
Annotated data matrix
- housekeeping_genes (dict)
Dictionary of gene categories and their corresponding genes
- savefig (str)
Path where the plots will be saved. If None, then no plots are saved
- malva.autoannotate.umi_threshold_cell_calling(adata, expected_cells=None, min_cells=2, max_cells=None, percentile=99, ordmag_divisor=10, plot=True)
Implementation of a UMI threshold method for cell calling, similar to Cell Ranger’s approach but independent of Cell Ranger and working directly with AnnData objects.
- Parameters:
adata (AnnData) – AnnData object containing raw counts (not filtered or normalized)
expected_cells (int, optional) – Expected number of cells in the dataset. If None, it will be estimated.
min_cells (int, default=2) – Minimum number of cells to consider in the grid search
max_cells (int, optional) – Maximum number of cells to consider in the grid search. If None, it will be set to int(n_barcodes/2) or 45,000, whichever is smaller.
percentile (int, default=99) – Percentile used for the UMI threshold calculation (Cell Ranger uses 99th percentile)
ordmag_divisor (int, default=10) – Divisor used in the Order of Magnitude algorithm (Cell Ranger uses 10)
plot (bool, default=True) – Whether to plot the UMI distribution and threshold
- Returns:
adata_filtered (AnnData) – AnnData object containing only the called cells
threshold (float) – UMI count threshold used for cell calling
cell_barcodes (list) – List of barcodes that were called as cells
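The role of `percentile` and `ordmag_divisor` can be seen in a simplified threshold calculation for a known `expected_cells`. This sketch is not the malva implementation: the helper names are hypothetical, the grid search used to estimate `expected_cells` is omitted, and the nearest-rank percentile and the inclusive comparison are assumptions.

```python
import math


def ordmag_threshold(umi_counts, expected_cells, percentile=99, ordmag_divisor=10):
    """Hypothetical helper: umi_counts holds one total UMI count per barcode."""
    # Rank barcodes by total UMI count, highest first, and keep the top N.
    top = sorted(umi_counts, reverse=True)[:expected_cells]
    if not top:
        raise ValueError("need at least one barcode and expected_cells >= 1")
    # Nearest-rank percentile of the top-N counts (interpolation rule assumed).
    ascending = top[::-1]
    rank = max(1, math.ceil(percentile * len(ascending) / 100))
    robust_max = ascending[rank - 1]
    # The threshold sits one "order of magnitude" below that robust maximum.
    return robust_max / ordmag_divisor


def call_cells(umi_counts, expected_cells, **kwargs):
    threshold = ordmag_threshold(umi_counts, expected_cells, **kwargs)
    # Barcodes at or above the threshold are called as cells (>= is assumed).
    called = [i for i, count in enumerate(umi_counts) if count >= threshold]
    return threshold, called
```

Using a high percentile of the top barcodes rather than the single largest count keeps one outlier barcode from inflating the threshold.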
- malva.autoannotate.simple_good_turing_smoothing(counts)
Implements the Simple Good-Turing smoothing algorithm to estimate probabilities for unseen events, ensuring non-zero proportions for genes with zero counts.
- Parameters:
counts (array-like) – Count data for each gene
- Returns:
probabilities – Smoothed probabilities for each gene
- Return type:
ndarray
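The core idea, reserving probability mass for genes with zero counts, can be shown with the basic Good-Turing missing-mass estimate. This is deliberately simpler than Simple Good-Turing, which additionally smooths the frequency-of-frequency counts with a log-linear fit; the helper `good_turing_proportions` is hypothetical and splits the reserved mass equally among unseen genes, which is an assumption.

```python
def good_turing_proportions(counts):
    """Hypothetical helper: basic Good-Turing missing mass, not full SGT."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    n_unseen = sum(1 for c in counts if c == 0)
    n_singletons = sum(1 for c in counts if c == 1)
    # Good-Turing estimate of the total probability of unseen genes: N1 / N.
    # If no gene is unseen, nothing needs to be reserved.
    p_unseen = n_singletons / total if n_unseen else 0.0
    # Seen genes share the remaining mass in proportion to their counts.
    # Caveat: if every observed count is 1, all mass goes to unseen genes.
    seen_scale = 1.0 - p_unseen
    return [
        p_unseen / n_unseen if c == 0 else seen_scale * c / total
        for c in counts
    ]
```

For counts `[5, 3, 1, 1, 0, 0]` the two singletons out of ten observations reserve a mass of 0.2, so each zero-count gene receives 0.1 instead of 0.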
- malva.autoannotate.emptydrops_refinement(adata, adata_filtered=None, threshold=None, ambient_min_umi=1, ambient_max_umi=100, min_total_umi=500, fdr_threshold=0.01, plot=True)
Implementation of the EmptyDrops algorithm for refining cell calling by identifying low RNA content cells that are distinguishable from empty droplets.
- Parameters:
adata (AnnData) – Original AnnData object containing all barcodes (filtered and unfiltered)
adata_filtered (AnnData, optional) – AnnData object containing barcodes called as cells by OrdMag or another method. If None, it assumes all barcodes in adata are potential cells.
threshold (float, optional) – UMI threshold used in initial cell calling. If None, it will be estimated.
ambient_min_umi (int, default=1) – Minimum total UMI count to consider a barcode for ambient RNA profile estimation
ambient_max_umi (int, default=100) – Maximum total UMI count to consider a barcode for ambient RNA profile estimation
min_total_umi (int, default=500) – Minimum total UMI count for a barcode to be considered as a candidate cell
fdr_threshold (float, default=0.01) – False discovery rate threshold for cell calling
plot (bool, default=True) – Whether to plot diagnostic visualizations
- Returns:
adata_refined (AnnData) – AnnData object containing all called cells (OrdMag + EmptyDrops)
ambient_profile (ndarray) – Estimated ambient RNA profile
cell_barcodes (list) – List of barcodes that were called as cells
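The way `ambient_min_umi` and `ambient_max_umi` select barcodes for the ambient profile can be sketched on a plain count matrix. This covers only the pooling step, not the statistical test or the FDR control, and it is not the malva code: the helper `ambient_profile` is hypothetical, rows are barcodes and columns are genes, and both bounds are treated as inclusive.

```python
def ambient_profile(counts, ambient_min_umi=1, ambient_max_umi=100):
    """Hypothetical helper: rows are barcodes, columns are genes."""
    n_genes = len(counts[0]) if counts else 0
    pooled = [0] * n_genes
    for row in counts:
        # Only low-count barcodes are assumed to be empty droplets
        # (inclusive bounds are an assumption).
        if ambient_min_umi <= sum(row) <= ambient_max_umi:
            for j, value in enumerate(row):
                pooled[j] += value
    total = sum(pooled)
    if total == 0:
        raise ValueError("no barcode falls inside the ambient UMI range")
    # Convert pooled counts into per-gene proportions.
    return [value / total for value in pooled]
```

Genes absent from every ambient barcode get a proportion of exactly zero here, which is the situation `simple_good_turing_smoothing` is documented to address.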
- malva.autoannotate.load_markers(marker_source)
Load marker genes and prepare technical/non-technical gene lists
Parameters:
- marker_source (str)
Either ‘human_markers’, ‘human_markers_hallmarks’, ‘mouse_markers’, or a path to a custom JSON file
Returns:
- tuple
(cell_markers_nontechnical, cell_markers, nontechnical_genes)
- malva.autoannotate.score_annotate(adata, cell_markers, savefig=None)
Score cells for each cell type and annotate with top cell types
Parameters:
- adata (AnnData)
Clustered AnnData object
- cell_markers (dict)
Dictionary mapping cell types to marker genes
- savefig (str)
Path where the plots will be saved. If None, then no plots are saved
Returns:
- AnnData
AnnData with cell type scores and annotations