malva.dbutils module
- class malva.dbutils.EnsemblLocalDB(data_dir=PosixPath('/home/dleonpe/.malva/ensembl_data'), species='mus_musculus')
Bases:
object
A class to manage local storage and retrieval of Ensembl cDNA sequences.
This class downloads, indexes, and queries Ensembl cDNA sequences for a specified species.
- data_dir
Directory to store downloaded and indexed data.
- Type:
str
- index
Dictionary mapping Ensembl IDs to sequences.
- Type:
dict
- gene_to_ensembl
Dictionary mapping gene names to Ensembl IDs.
- Type:
dict
- database_exists()
Check if the local database exists, is properly set up, and contains data.
- Returns:
True if the database exists, is set up, and contains data; False otherwise.
- Return type:
bool
- download_cdna_fasta()
Download the cDNA FASTA file for the configured species from Ensembl.
- Returns:
The local path to the downloaded file.
- Return type:
str
- index_fasta(fasta_file)
Parse the FASTA file and store the data in the SQLite database efficiently.
- Parameters:
fasta_file (str) – Path to the FASTA file to be indexed.
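As a rough illustration of the kind of work described above, the following sketch parses FASTA lines and writes them to SQLite in a single transaction. It is not the malva implementation: the `sequences` table, its columns, the helper names `iter_fasta` and `index_records`, and the toy IDs are all assumptions made for this example.

```python
import sqlite3

def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def index_records(conn, lines):
    """Insert all FASTA records in one transaction and return the row count."""
    # Hypothetical schema: the real table layout is not documented here.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(ensembl_id TEXT PRIMARY KEY, sequence TEXT)"
    )
    # Use the first whitespace-separated token of the header as the ID;
    # records with an empty header are skipped.
    rows = (
        (header.split()[0], seq)
        for header, seq in iter_fasta(lines)
        if header.split()
    )
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT OR REPLACE INTO sequences VALUES (?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM sequences").fetchone()[0]

# Toy data with made-up identifiers.
conn = sqlite3.connect(":memory:")
fasta = [">TX1.1 cdna gene_symbol:Abc1", "ACGT", "TTGA", ">TX2.1 cdna", "GGCC"]
count = index_records(conn, fasta)
```

Passing a generator to `executemany` inside one transaction avoids holding every record in memory and avoids a commit per row, which is one common way to keep bulk inserts fast.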
- load_index()
Load the indexed data from disk.
- Returns:
True if the index was successfully loaded, False otherwise.
- Return type:
bool
- get_from_gene(gene_id, seq_type='cdna')
Retrieve cDNA sequences for a given gene ID or Ensembl ID.
This method first checks if the database exists. If not, it sets up the database by downloading and indexing the necessary files. The search is case-insensitive.
- Parameters:
gene_id (str) – The gene ID or Ensembl ID to query.
seq_type (str, optional) – The type of sequence to retrieve. Currently only ‘cdna’ is supported. Defaults to ‘cdna’.
- Returns:
A list of cDNA sequences associated with the given gene ID.
- Return type:
list
- Raises:
ValueError – If seq_type is not ‘cdna’.
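The lookup behaviour documented for get_from_gene (case-insensitive search by gene name or Ensembl ID, and a ValueError for any seq_type other than 'cdna') can be sketched with two plain dictionaries shaped like the index and gene_to_ensembl attributes. The `ToyLookup` class below is a hypothetical stand-in, not EnsemblLocalDB itself; it assumes that a gene name maps to a list of IDs, and it uses made-up identifiers.

```python
class ToyLookup:
    """Illustrative stand-in; not the EnsemblLocalDB implementation."""

    def __init__(self, index, gene_to_ensembl):
        # Keys are upper-cased once so that lookups can ignore case.
        self.index = {k.upper(): v for k, v in index.items()}
        self.gene_to_ensembl = {k.upper(): v for k, v in gene_to_ensembl.items()}

    def get_from_gene(self, gene_id, seq_type="cdna"):
        if seq_type != "cdna":
            raise ValueError(f"Unsupported seq_type: {seq_type!r}")
        key = gene_id.upper()
        if key in self.index:
            # The query was itself an ID present in the index.
            return [self.index[key]]
        # Otherwise treat the query as a gene name (assumed: name -> list of IDs).
        ids = self.gene_to_ensembl.get(key, [])
        return [self.index[i.upper()] for i in ids if i.upper() in self.index]

db = ToyLookup(
    index={"TX1": "ATGC", "TX2": "GGTA"},
    gene_to_ensembl={"Abc1": ["TX1", "TX2"]},
)
sequences = db.get_from_gene("abc1")
```

Unlike the documented method, this sketch never downloads or builds anything; it only shows the query path once the data is available.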
- malva.dbutils.process_dna_string(sequence)
Validates and parses a DNA sequence or a FASTA-like format sequence.
- Parameters:
sequence (str) – The input DNA sequence or FASTA-like format sequence.
- Returns:
A single-line DNA sequence containing only A, T, C, and G characters. U is replaced by T, and any other nucleotide is replaced by A.
- Return type:
str
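The normalization described in the return value can be sketched as follows. This is an independent illustration, not the body of process_dna_string: the name `normalize_dna` is hypothetical, the sketch performs no validation, and it assumes that header lines start with '>' and that every remaining non-ATCG character should become A.

```python
import re

def normalize_dna(text):
    """Return one upper-case line containing only A, T, C and G."""
    # Drop FASTA-style header lines (those starting with '>') and join the rest.
    body = "".join(
        line.strip()
        for line in text.splitlines()
        if not line.lstrip().startswith(">")
    )
    # Remove internal whitespace, upper-case, and convert RNA U to DNA T.
    body = re.sub(r"\s+", "", body).upper().replace("U", "T")
    # Assumption: every remaining non-ATCG character (N, R, Y, ...) becomes A.
    return re.sub(r"[^ATCG]", "A", body)

cleaned = normalize_dna(">seq1\nacgu\nNNgt")
```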
- malva.dbutils.parse_multifasta(fasta_string)
Validates and parses a FASTA-formatted string.
- Parameters:
fasta_string (str) – FASTA-formatted sequence.
- Returns:
A list of single-line DNA sequences, each containing only A, T, C, and G characters. U is replaced by T, and any other nucleotide is replaced by A.
- Return type:
list
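A minimal sketch of splitting a multi-record FASTA string into one cleaned sequence per record is shown below. The helpers `split_multifasta` and `clean` are hypothetical names for this example, and the code makes no claim about how parse_multifasta validates its input.

```python
def split_multifasta(fasta_string):
    """Return one joined sequence string per '>' record."""
    sequences, chunks, in_record = [], [], False
    for line in fasta_string.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if in_record:
                sequences.append("".join(chunks))
            chunks, in_record = [], True
        elif line and in_record:
            chunks.append(line)
    if in_record:
        sequences.append("".join(chunks))
    return sequences

def clean(seq):
    """Upper-case, map U to T, and replace any other non-ATCG character with A."""
    seq = seq.upper().replace("U", "T")
    return "".join(base if base in "ATCG" else "A" for base in seq)

records = [clean(s) for s in split_multifasta(">a\nACGU\nNN\n>b\nggcc\n")]
```

Text that appears before the first header is ignored here; whether the library does the same is not stated in this reference.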
- malva.dbutils.handle_sequence(input_string, recursion=True)
Checks the input string for specific conditions and routes it accordingly.
- Parameters:
input_string (str) – The input string to check and handle.
- Returns:
The parsed DNA sequence for the given input_string.
- Return type:
str
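The specific conditions checked by handle_sequence are not listed in this reference. Purely as an assumption based on the other functions in this module, a router of this kind might distinguish a 'gene:' prefix, a FASTA header, and a raw sequence. The `classify_input` function below is hypothetical and only returns a label for each case.

```python
def classify_input(input_string):
    """Return 'gene', 'fasta' or 'sequence' for an input string."""
    text = input_string.strip()
    # Assumed rule: a 'gene:' prefix marks a gene query.
    if text.startswith("gene:"):
        return "gene"
    # Assumed rule: a leading '>' marks FASTA-formatted input.
    if text.startswith(">"):
        return "fasta"
    # Anything else is treated as a raw sequence.
    return "sequence"

labels = [classify_input(s) for s in ("gene:Abc1", ">id\nACGT", "ACGT")]
```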
- malva.dbutils.process_gene_string(gene_string)
Processes a string that starts with ‘gene:’ and extracts the gene ID, species, and split parameter.
- Parameters:
gene_string (str) – The input string starting with ‘gene:’.
- Returns:
A dictionary with keys ‘gene_id’, ‘species’, and ‘split’.
- Return type:
dict