malva.dbutils module

class malva.dbutils.EnsemblLocalDB(data_dir=PosixPath('/home/dleonpe/.malva/ensembl_data'), species='mus_musculus')[source]

Bases: object

A class to manage local storage and retrieval of Ensembl cDNA sequences.

This class provides functionality to download, index, and query cDNA sequences from Ensembl for specified species.

data_dir

Directory to store downloaded and indexed data.

Type:

str

index

Dictionary mapping Ensembl IDs to sequences.

Type:

dict

gene_to_ensembl

Dictionary mapping gene names to Ensembl IDs.

Type:

dict

__init__(data_dir=PosixPath('/home/dleonpe/.malva/ensembl_data'), species='mus_musculus')[source]

database_exists()[source]

Check if the local database exists, is properly set up, and contains data.

Returns:

True if the database exists, is set up, and contains data; False otherwise.

Return type:

bool
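
The three conditions in this check (exists, is set up, contains data) can be sketched as below. This is a minimal illustration, not malva's implementation: it assumes the database is a single SQLite file with a table named sequences, and both that table name and the helper database_exists_sketch are hypothetical.

```python
import sqlite3
from pathlib import Path


def database_exists_sketch(db_path, table="sequences"):
    """Return True only if the file exists, the table exists, and it has rows.

    The table name "sequences" is an assumption made for this sketch.
    """
    db_path = Path(db_path)
    if not db_path.is_file():
        return False
    conn = sqlite3.connect(str(db_path))
    try:
        # Table presence is looked up in SQLite's schema catalogue.
        found = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if found is None:
            return False
        # The table name is a trusted constant here, so quoting it is enough.
        row = conn.execute(f'SELECT 1 FROM "{table}" LIMIT 1').fetchone()
        return row is not None
    except sqlite3.DatabaseError:
        # A corrupt or non-SQLite file counts as "not set up".
        return False
    finally:
        conn.close()
```

Checking the file first avoids creating an empty database as a side effect of connecting to a path that does not exist yet.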

download_cdna_fasta()[source]

Download the cDNA FASTA file for the configured species from Ensembl.

Returns:

The local path to the downloaded file.

Return type:

str

index_fasta(fasta_file)[source]

Parse the FASTA file and store the data in the SQLite database efficiently.

Parameters:

fasta_file (str) – Path to the FASTA file to be indexed.
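
Indexing a FASTA file into SQLite can be sketched as follows. This is an illustration under stated assumptions rather than malva's code: it assumes Ensembl-style headers in which the first token is the transcript ID and later tokens carry gene: and gene_symbol: fields, and the table schema and the names iter_fasta_records, parse_header, and index_fasta_sketch are hypothetical. The sketch takes an iterable of lines, so an open file handle can be passed directly.

```python
import sqlite3


def iter_fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_header(header):
    """Extract transcript ID, gene ID, and gene symbol from an Ensembl-style header."""
    fields = header.split()
    transcript_id = fields[0]
    gene_id = gene_symbol = None
    for field in fields[1:]:
        if field.startswith("gene:"):
            gene_id = field[len("gene:"):]
        elif field.startswith("gene_symbol:"):
            gene_symbol = field[len("gene_symbol:"):]
    return transcript_id, gene_id, gene_symbol


def index_fasta_sketch(lines, conn):
    """Insert every record in a single transaction (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        "transcript_id TEXT PRIMARY KEY, gene_id TEXT, "
        "gene_symbol TEXT, sequence TEXT)"
    )
    rows = (parse_header(h) + (seq,) for h, seq in iter_fasta_records(lines))
    with conn:  # commits once at the end, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO sequences VALUES (?, ?, ?, ?)", rows
        )
```

Streaming records through a generator and committing once keeps memory use flat and avoids a commit per row, which is one common way to make bulk inserts into SQLite efficient.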

save_index()[source]

Save the indexed data to disk using pickle.

load_index()[source]

Load the indexed data from disk.

Returns:

True if the index was successfully loaded, False otherwise.

Return type:

bool
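
The save and load pair can be illustrated with a small pickle round trip. This is a sketch, not the class's actual code: the function names and the file path are hypothetical, and unlike load_index() the loader here returns the loaded dictionary alongside the success flag so that the example stays self-contained.

```python
import pickle
from pathlib import Path


def save_index_sketch(index, path):
    """Write the index dictionary to disk with pickle."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_index_sketch(path):
    """Return (loaded, index); loaded is False if the file is missing or unreadable."""
    try:
        with Path(path).open("rb") as handle:
            return True, pickle.load(handle)
    except (OSError, pickle.UnpicklingError, EOFError):
        return False, {}
```

Because unpickling can execute arbitrary code, an index file should only be loaded if it was written by a trusted process.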

get_from_gene(gene_id, seq_type='cdna')[source]

Retrieve cDNA sequences for a given gene ID or Ensembl ID.

This method first checks if the database exists. If not, it sets up the database by downloading and indexing the necessary files. The search is case-insensitive.

Parameters:
  • gene_id (str) – The gene ID or Ensembl ID to query.

  • seq_type (str, optional) – The type of sequence to retrieve. Currently only ‘cdna’ is supported. Defaults to ‘cdna’.

Returns:

A list of cDNA sequences associated with the given gene ID.

Return type:

list

Raises:

ValueError – If seq_type is not ‘cdna’.
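
The lookup described above can be sketched with two plain dictionaries that mirror the index and gene_to_ensembl attributes. The sketch assumes that both dictionaries use lowercased keys and that gene_to_ensembl maps a gene name to a list of Ensembl IDs; those details, and the name get_from_gene_sketch, are assumptions for illustration. The automatic database setup step is left out.

```python
def get_from_gene_sketch(gene_id, index, gene_to_ensembl, seq_type="cdna"):
    """Case-insensitive lookup by Ensembl ID or gene name (illustrative only)."""
    if seq_type != "cdna":
        raise ValueError(
            f"Unsupported seq_type {seq_type!r}; only 'cdna' is supported."
        )
    key = gene_id.strip().lower()
    # A direct Ensembl ID hit returns that single sequence.
    if key in index:
        return [index[key]]
    # Otherwise treat the query as a gene name and collect its transcripts.
    return [index[eid] for eid in gene_to_ensembl.get(key, []) if eid in index]
```

Returning an empty list for an unknown identifier keeps the return type uniform; whether the real method does this or raises is not stated in this reference.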

malva.dbutils.process_dna_string(sequence)[source]

Validates and parses a DNA sequence given either as a raw string or in FASTA-like format.

Parameters:

sequence (str) – The input DNA sequence or FASTA-like format sequence.

Returns:

A single-line DNA sequence containing only the characters A, T, C, and G. U is replaced by T, and any other nucleotide code is replaced by A.

Return type:

str
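
The normalization rules stated above can be sketched as follows. This is an illustration, not the library's implementation: the name normalize_dna_sketch is hypothetical, and the validation rule (accept IUPAC nucleotide codes, reject anything else with ValueError) is an assumption, since this reference does not say how invalid input is handled.

```python
# IUPAC nucleotide codes, including U for RNA input (assumed validation set).
IUPAC_CODES = set("ACGTURYSWKMBDHVN")


def normalize_dna_sketch(sequence):
    """Collapse raw or FASTA-like input to one line of A, C, G, and T."""
    lines = [line.strip() for line in sequence.strip().splitlines()]
    # Drop FASTA header lines, which start with '>'.
    body = "".join(line for line in lines if not line.startswith(">"))
    # Remove any remaining whitespace and normalize case.
    body = "".join(body.split()).upper()
    unknown = set(body) - IUPAC_CODES
    if unknown:
        raise ValueError(f"Not a nucleotide sequence: unexpected {sorted(unknown)}")
    # U becomes T; every other non-ACGT code becomes A.
    body = body.replace("U", "T")
    return "".join(base if base in "ACGT" else "A" for base in body)
```

For example, the input ">seq1", "acgu", "NNRY" on three lines yields ACGTAAAA under these rules.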

malva.dbutils.parse_multifasta(fasta_string)[source]

Validates and parses a FASTA-formatted string.

Parameters:

fasta_string (str) – FASTA-formatted sequence.

Returns:

A list of single-line DNA sequences, one per FASTA record, each containing only the characters A, T, C, and G. U is replaced by T, and any other nucleotide code is replaced by A.

Return type:

list
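
Splitting a multi-record FASTA string into cleaned sequences can be sketched as below. The names parse_multifasta_sketch and _clean are hypothetical, and raising ValueError when sequence data appears before any header line is an assumed validation rule, not documented behavior.

```python
def _clean(sequence):
    """Uppercase, map U to T, and map every other non-ACGT character to A."""
    sequence = "".join(sequence.split()).upper().replace("U", "T")
    return "".join(base if base in "ACGT" else "A" for base in sequence)


def parse_multifasta_sketch(fasta_string):
    """Return one cleaned, single-line sequence per '>' record."""
    sequences, chunks, in_record = [], [], False
    for line in fasta_string.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if in_record:
                sequences.append(_clean("".join(chunks)))
            chunks, in_record = [], True
        elif line:
            if not in_record:
                raise ValueError("FASTA input must start with a '>' header line")
            chunks.append(line)
    if in_record:
        sequences.append(_clean("".join(chunks)))
    return sequences
```

With the two-record input ">a", "ACGU", "nn", ">b", "GGCC", the sketch returns ['ACGTAA', 'GGCC'].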

malva.dbutils.handle_sequence(input_string, recursion=True)[source]

Checks the input string for specific conditions and routes it accordingly.

Parameters:

input_string (str) – The input string to check and handle.

Returns:

The parsed DNA sequence for the feature named by input_string.

Return type:

str
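
One plausible reading of the routing is a dispatch on the ‘gene:’ and ‘ensembl:’ prefixes that the other functions in this module expect, with everything else treated as a DNA sequence. The sketch below illustrates that reading only: the function route_query_sketch and the handlers mapping are hypothetical, the actual conditions checked by handle_sequence are not listed here, and the recursion parameter is not modeled.

```python
def route_query_sketch(input_string, handlers):
    """Dispatch on a 'gene:' or 'ensembl:' prefix; anything else is treated as DNA.

    handlers maps 'gene:', 'ensembl:', and 'dna' to callables taking the text.
    """
    text = input_string.strip()
    lowered = text.lower()
    for prefix in ("gene:", "ensembl:"):
        if lowered.startswith(prefix):
            return handlers[prefix](text)
    return handlers["dna"](text)


# Stand-in handlers that only report which route was taken.
example_handlers = {
    "gene:": lambda text: ("gene", text),
    "ensembl:": lambda text: ("ensembl", text),
    "dna": lambda text: ("dna", text),
}
```

Passing the handlers in as a mapping keeps the dispatch logic testable without network access or a local database.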

malva.dbutils.process_gene_string(gene_string)[source]

Processes a string that starts with ‘gene:’ and extracts the gene ID, species, and split parameter.

Parameters:

gene_string (str) – The input string starting with ‘gene:’.

Returns:

A dictionary with keys ‘gene_id’, ‘species’, and ‘split’.

Return type:

dict

malva.dbutils.process_ensembl_string(ensembl_string)[source]

Processes a string that starts with ‘ensembl:’ and extracts the Ensembl ID.

Parameters:

ensembl_string (str) – The input string starting with ‘ensembl:’.

Returns:

A dictionary with the key ‘ensembl_id’.

Return type:

dict
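
Extracting the ID from a prefixed string can be sketched as follows. The name parse_ensembl_string_sketch is hypothetical, and the case-insensitive prefix match and the ValueError on malformed input are assumptions; only the prefix and the returned key come from the description above.

```python
def parse_ensembl_string_sketch(ensembl_string):
    """Strip the 'ensembl:' prefix and return {'ensembl_id': ...}."""
    prefix = "ensembl:"
    text = ensembl_string.strip()
    if not text.lower().startswith(prefix):
        raise ValueError("expected a string starting with 'ensembl:'")
    ensembl_id = text[len(prefix):].strip()
    if not ensembl_id:
        raise ValueError("missing Ensembl ID after 'ensembl:'")
    return {"ensembl_id": ensembl_id}
```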

malva.dbutils.get_from_gene(gene_id, species='homo_sapiens', seqtype='genomic')[source]

malva.dbutils.get_from_ensembl(ensembl_id, seqtype='genomic')[source]

malva.dbutils.validate_and_infer_query(input_string)[source]

Validate the input and infer whether it contains gene IDs or DNA sequences.

Parameters:

input_string (str) – The user input string.

Returns:

The corrected query string. An exception is raised if validation fails.

Return type:

str