Parallel processing across alignments
Building workflows using piqtree apps
WARNING This page is under construction!
We can combine piqtree apps with other cogent3 apps to develop a pipeline. Several concepts are involved here, including data stores, composed apps, parallel execution and log files. See the cogent3 app documentation for more details.
To develop a pipeline efficiently we only need a subset of the sequences in each alignment. We will use the diverse-seq plugin for that purpose. It selects a specified number of sequences that capture the diversity in an alignment.
But first, we need the data.
from piqtree import download_dataset
alns_path = download_dataset("mammal-orths.zip", dest_dir="data", inflate_zip=False)
We open the downloaded archive as a cogent3 data store.
from cogent3 import open_data_store
dstore = open_data_store(alns_path, suffix="fa")
dstore.describe
| record type | number |
|---|---|
| completed | 200 |
| not_completed | 0 |
| logs | 0 |
3 rows x 2 columns
We need to create several apps: a loader, a divergent sequence selector, one that drops alignment columns containing non-canonical nucleotides (gaps and N's), and one that selects alignments with a minimum number of aligned columns. We also need a "data store" to write results to and a writer. These will then be combined into a single composed app which will be applied to all the alignments in the data store.
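The idea behind a composed app can be illustrated with a small pure-Python sketch. This is not how cogent3 implements it: `Step`, `Failed` and the helper functions are hypothetical names made up for this illustration. It shows two behaviours the pipeline relies on. First, `+` chains steps so the output of one becomes the input of the next. Second, a failure is a falsy value that skips the remaining steps, so it can be filtered out with `if result`.

```python
class Failed:
    """Hypothetical stand-in for a 'not completed' record: always falsy."""

    def __init__(self, message):
        self.message = message

    def __bool__(self):
        return False


class Step:
    """Hypothetical wrapper that lets plain functions be chained with +."""

    def __init__(self, *funcs):
        self.funcs = funcs

    def __add__(self, other):
        # Composition concatenates the function lists, preserving order.
        return Step(*self.funcs, *other.funcs)

    def __call__(self, value):
        for func in self.funcs:
            if isinstance(value, Failed):
                # A failure short-circuits the remaining steps.
                return value
            value = func(value)
        return value


def strip_gaps(seq):
    """Remove gap characters from a sequence string."""
    return seq.replace("-", "")


def require_length(minimum):
    """Build a filter that fails sequences shorter than `minimum`."""

    def check(seq):
        return seq if len(seq) >= minimum else Failed(f"shorter than {minimum}")

    return check


def summarise(seq):
    """Return (length, GC fraction); a non-empty tuple is always truthy."""
    return len(seq), sum(base in "GC" for base in seq) / len(seq)


pipeline = Step(strip_gaps) + Step(require_length(4)) + Step(summarise)
results = [pipeline(s) for s in ["AC-GT", "A--C", "GGCC"]]

# Keep only the records that made it through every step.
kept = [r for r in results if r]
```

The final step returns a tuple instead of a bare number so that a legitimate value such as `0.0` is never mistaken for a failure by the truthiness test.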
import pathlib
from collections import Counter
from cogent3 import get_app
outpath = pathlib.Path("data/delme.sqlitedb")
outpath.unlink(missing_ok=True)
loader = get_app("load_aligned", format_name="fasta", moltype="dna")
divergent = get_app("dvs_nmost", n=10, k=6)
just_nucs = get_app("omit_degenerates") # has to go after the divergent selector
min_length = get_app("min_length", length=600)
best_model = get_app("piq_model_finder")
app = loader + divergent + just_nucs + min_length + best_model
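# Tally the best-fitting model (by AIC) for each alignment. Records that
# failed or were filtered out evaluate as False and are skipped.
model_counts = Counter(
    str(result.best_aic)
    for result in app.as_completed(dstore, show_progress=True, parallel=True)
    if result
)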
model_counts
Counter({'TIM3+F+G4': 5,
'TIM2+F+R2': 5,
'TIM3+F+R2': 5,
'TPM3u+F+G4': 4,
'GTR+F+R2': 4,
'TN+F+I+R2': 3,
'GTR+F+I+G4': 3,
'GTR+F+G4': 3,
'TN+F+R2': 3,
'TVM+F+R2': 3,
'GTR+F+I+R2': 3,
'TIM2+F+G4': 3,
'TIM2+F+I+R2': 2,
'TIM+F+I+R2': 2,
'TIM3+F+I+R2': 2,
'TIM+F+G4': 2,
'HKY+F+G4': 2,
'TPM3u+F+R2': 2,
'TVM+I+G4': 1,
'TN+F+R3': 1,
'TVM+F+I+G4': 1,
'TN+F+G4': 1,
'HKY+F+R2': 1,
'TPM3u+F+I+G4': 1,
'TIM3+F+I': 1,
'HKY+F+R3': 1,
'TIM+F+I+G4': 1,
'TN+F+I+G4': 1,
'TIM3+R2': 1,
'TIM3e+R2': 1,
'TPM3u+R2': 1,
'TPM2u+F+G4': 1,
'GTR+I+R2': 1,
'TVM+F+G4': 1,
'TPM3+G4': 1,
'TIM2+F+I+G4': 1,
'TVM+G4': 1,
'TIM2+F': 1,
'TVM+F': 1,
'GTR+F': 1,
'HKY+R2': 1})
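With this many distinct models, it can help to group the counts by base substitution model, which is the part of the name before the first `+`. The sketch below does this with the standard library on a small excerpt of the counts shown above. The excerpt and the name `base_counts` are chosen for illustration only, so the totals apply to the excerpt and not to the full result.

```python
from collections import Counter

# Excerpt of the model counts shown above (not the full result).
model_counts = Counter(
    {
        "TIM3+F+G4": 5,
        "TIM2+F+R2": 5,
        "TIM3+F+R2": 5,
        "TPM3u+F+G4": 4,
        "GTR+F+R2": 4,
        "GTR+F+I+G4": 3,
        "GTR+F+G4": 3,
    }
)

# Sum the counts of all models sharing the same base substitution model.
base_counts = Counter()
for model, count in model_counts.items():
    base = model.split("+", 1)[0]
    base_counts[base] += count

# Print the base models from most to least frequent.
for base, count in base_counts.most_common():
    print(base, count)
```

The same loop can be applied directly to the `model_counts` object produced by the pipeline.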