Oncodrive3D ships with a plotting pipeline that turns the clustering results into analysis-ready visualizations and enriched tables. Its goal is threefold:
- help you interpret why a gene or residue achieved a significant 3D-clustering signal by overlaying structural/functional context,
- highlight diagnostic cues that reveal whether a signal looks biologically plausible or could be an artifact,
- provide downstream-analysis assets such as summary panels, per-gene tracks, volcano/log-odds association charts, and annotated CSVs to guide follow-up experiments or reporting.
To keep the runtime reasonable, structural and functional annotations are generated once per dataset via oncodrive3d build-annotations, then reused by oncodrive3d plot for any cohort.
This document describes the prerequisites, the intermediate files that are produced, and how to customize the plots.
- Datasets – `oncodrive3d build-datasets` (or `build-datasets --mane_only`) must have been run already; the plotting stage reads `datasets/pdb_structures`, `seq_for_mut_prob.tsv`, and `confidence.tsv`.
- Python environment – use the same environment (e.g., `uv` virtualenv) that you rely on for the CLI.
- PDB_Tool binary – `oncodrive3d build-annotations` invokes the PDB_Tool executable named `PDB_Tool` on `$PATH` to compute per-residue solvent accessibility and secondary structure. See Installing PDB_Tool below for a recipe.
- Internet access – required to download Pfam annotations, UniProt features, and (if `--ddg_dir` is not set) RaSP ΔΔG predictions. You can point `--ddg_dir` to precomputed RaSP files to skip the download step or to provide in-house predicted scores.
- Disk space – annotation folders contain many files; keep several GB free.
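The PDB_Tool prerequisite can be checked before launching a long build. The sketch below is only an illustration: `find_pdb_tool` is a hypothetical helper, not part of Oncodrive3D, and it assumes nothing beyond the requirement stated above that an executable named `PDB_Tool` resolves on `$PATH`.

```python
import shutil


def find_pdb_tool(name="PDB_Tool"):
    """Return the absolute path of the executable if it is on PATH, else None.

    Hypothetical helper for illustration; not part of the Oncodrive3D API.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_pdb_tool()
    if path is None:
        print("PDB_Tool not found on PATH; install it before running build-annotations")
    else:
        print(f"PDB_Tool resolved to {path}")
```

Running this inside the same environment you use for the CLI tells you whether the executable is visible there, which is the case that matters.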
PDB_Tool compiles from source. If your conda env has no C/C++ toolchain, install one first:
conda install -c conda-forge gxx make
Then build and put it on $PATH:
git clone https://github.com/realbigws/PDB_Tool.git
cd PDB_Tool
make -C source_code
# The Makefile drops the binary at the repo root. Symlink into the active conda env so it resolves via `which PDB_Tool`:
ln -s "$(pwd)/PDB_Tool" "$CONDA_PREFIX/bin/PDB_Tool"
Run once per dataset (or whenever you update AlphaFold structures):
# Default (Homo sapiens, public RaSP ΔΔG):
oncodrive3d build-annotations -d <build_folder> -o <annot_folder>
# With a MANE-built dataset:
oncodrive3d build-annotations -d <mane_build_folder> -o <annot_folder>
# Mouse with custom ΔΔG predictions (omit --ddg_dir to skip the ΔΔG step):
oncodrive3d build-annotations -d <build_folder> -o <annot_folder> -s mouse --ddg_dir <ddg_path>
See oncodrive3d build-annotations --help for all options.
Worth knowing:
- ΔΔG predictions default to the public RaSP bundle (computed against the canonical AlphaFold v4 human proteome, not the MANE bundle). For datasets built with a different AF version, residue-level mismatches are filtered during validation (see `--ddg_mismatch_threshold` below).
- `--ddg_dir` overrides the default download with a folder of RaSP-style CSVs (columns `variant` and `score_ml`; UniProt accession auto-detected anywhere in the filename, any separator). For non-human organisms the public bundle doesn't apply, so without `--ddg_dir` the ΔΔG step is skipped with a warning; other annotations build normally. To generate predictions yourself on CPUs, see bbglab/rasp_cpu.
- `--ddg_mismatch_threshold` (default `0.1`) drops a protein if its wild-type residues disagree with the canonical UniProt sequence above this fraction. Set to `1.0` to disable the WT-mismatch check (positions outside the canonical sequence still drop the protein).
- If `--output_dir` exists and isn't empty, you're prompted before its contents are cleaned (excluding `log/`); pass `--yes` to auto-confirm.
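The effect of `--ddg_mismatch_threshold` can be illustrated with a small sketch. This is not the pipeline's code: the function names are hypothetical, and the variant string format (`"M1A"` meaning wild-type residue, 1-based position, alternate residue) is an assumption made for the example. Treating a protein with no variants as dropped is also an assumption.

```python
def wt_mismatch_fraction(variants, canonical_seq):
    """Fraction of distinct positions whose wild-type residue disagrees.

    `variants` are strings such as "M1A" (assumed format: WT, position, ALT).
    Returns None if there are no variants or if any position lies outside
    the canonical sequence.
    """
    wt_by_pos = {}
    for variant in variants:
        wt, pos = variant[0], int(variant[1:-1])
        if pos < 1 or pos > len(canonical_seq):
            return None
        wt_by_pos[pos] = wt
    if not wt_by_pos:
        return None
    mismatches = sum(
        1 for pos, wt in wt_by_pos.items() if canonical_seq[pos - 1] != wt
    )
    return mismatches / len(wt_by_pos)


def keep_protein(variants, canonical_seq, threshold=0.1):
    """Keep the protein unless the mismatch fraction exceeds the threshold."""
    fraction = wt_mismatch_fraction(variants, canonical_seq)
    # Out-of-range positions drop the protein regardless of the threshold.
    return fraction is not None and fraction <= threshold


if __name__ == "__main__":
    seq = "MKTAYIAKQR"
    print(keep_protein(["M1A", "K2R"], seq))        # all WT residues agree
    print(keep_protein(["A1G", "A2G"], seq))        # every position disagrees
    print(keep_protein(["A1G", "A2G"], seq, 1.0))   # WT check disabled
    print(keep_protein(["M11A"], seq, 1.0))         # position outside sequence
```

With a threshold of `1.0` the fraction can never exceed the limit, which matches the description of that value disabling the wild-type check while out-of-range positions still drop the protein.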
What happens internally (scripts/plotting/build_annotations.py and helpers):
- Cleanup – if the target directory exists and is non-empty, you're prompted before cleaning (preserving `log/`); `--yes` auto-confirms.
- Stability change (ΔΔG) – RaSP predictions are downloaded (human) or read from `--ddg_dir`. Each protein is parsed into `{position: {ALT: ddg}}` (averaging across fragments) and validated against the canonical sequence from `seq_for_mut_prob.tsv`; proteins failing validation are dropped with a warning. Within a kept protein, positions with no prediction surface as `NaN` (not `0.0`) so plots show gaps, the annotated CSV distinguishes "no data" from "neutral mutation", and the logistic regression restricts itself to real measurements.
- PDB features – AlphaFold structures are decompressed and sent through `PDB_Tool`, producing `.feature` files that are then parsed into `pdb_tool_df.tsv` with residue-level secondary structure (SSE) and relative accessibility (pACC).
- Pfam domains – Pfam coordinates are pulled from the Ensembl BioMart archive plus the Pfam ID database, merged with Oncodrive3D’s sequence metadata, and written to `pfam.tsv`.
- UniProt features – the EMBL-EBI Proteins API supplies DOMAIN/PTM/SITE/MOTIF/MEMBRANE annotations. They are normalized, merged with Pfam entries, and stored in `uniprot_feat.tsv`.
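The ΔΔG step above (a `{position: {ALT: ddg}}` mapping, averaging across fragments, `NaN` for positions without predictions) can be sketched as follows. This is a minimal illustration, not the code in `build_annotations.py`: `build_ddg_map` is a hypothetical name, the input shape (one list of `(position, alt, ddg)` tuples per fragment) is an assumption, and where exactly the pipeline materializes the `NaN` values is not specified in this document.

```python
import math
from collections import defaultdict


def build_ddg_map(fragments, seq_len):
    """Merge per-fragment records into {position: {ALT: mean ddG}}.

    `fragments` is a list of lists of (position, alt, ddg) tuples, one inner
    list per structure fragment (assumed input shape). Positions in
    1..seq_len with no prediction map to NaN rather than 0.0.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for records in fragments:
        for pos, alt, ddg in records:
            collected[pos][alt].append(ddg)

    ddg_map = {}
    for pos in range(1, seq_len + 1):
        if pos in collected:
            # Average values predicted for the same substitution in
            # overlapping fragments.
            ddg_map[pos] = {
                alt: sum(values) / len(values)
                for alt, values in collected[pos].items()
            }
        else:
            ddg_map[pos] = math.nan
    return ddg_map


if __name__ == "__main__":
    fragments = [[(1, "A", 1.0), (2, "G", 0.5)], [(2, "G", 1.5)]]
    print(build_ddg_map(fragments, seq_len=3))
```

Keeping missing positions as `NaN` instead of `0.0` is what lets downstream consumers tell "no prediction" apart from a predicted neutral substitution.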
After a successful run the annotation folder contains:
annotations/
├── pdb_tool_df.tsv
├── pfam.tsv
├── uniprot_feat.tsv
├── stability_change/ # optional; absent for mouse builds without --ddg_dir
│ └── <UNIPROT>_ddg.json
└── log/
Keep this directory around: oncodrive3d plot reads the tables above and merges in ΔΔG values when stability_change/ is present. If that folder is absent, the ΔΔG track is omitted from per-gene plots and association analyses.
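Before plotting, it can be useful to confirm that the annotation folder looks as expected. The sketch below is a hypothetical helper (`inspect_annotations` is not an Oncodrive3D function); the file names come from the listing above, and the `*_ddg.json` glob assumes the `<UNIPROT>_ddg.json` naming shown there.

```python
from pathlib import Path

# Table names taken from the annotation folder listing.
REQUIRED_TABLES = ("pdb_tool_df.tsv", "pfam.tsv", "uniprot_feat.tsv")


def inspect_annotations(annot_dir):
    """Report missing tables and whether ΔΔG files are available.

    Hypothetical helper for illustration; not part of the Oncodrive3D API.
    """
    root = Path(annot_dir)
    missing = [name for name in REQUIRED_TABLES if not (root / name).is_file()]
    ddg_dir = root / "stability_change"
    has_ddg = ddg_dir.is_dir() and any(ddg_dir.glob("*_ddg.json"))
    return {"missing_tables": missing, "has_ddg": has_ddg}


if __name__ == "__main__":
    report = inspect_annotations("annotations")
    print(report)
```

An empty `missing_tables` list with `has_ddg` set to `False` corresponds to the case where plots are produced without the ΔΔG track.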
Once you have:
- Gene-level results (`<cohort>.3d_clustering_genes.csv`),
- Residue-level results (`<cohort>.3d_clustering_pos.csv`),
- Processed mutations (`<cohort>.mutations.processed.tsv`),
- Missense probability dictionary (`<cohort>.miss_prob.processed.json`),
- `seq_for_mut_prob.tsv`,
- Built datasets (`datasets/`) and annotations (`annotations/`),
call:
oncodrive3d plot \
--gene_result_path output/COHORT/COHORT.3d_clustering_genes.csv \
--pos_result_path output/COHORT/COHORT.3d_clustering_pos.csv \
--maf_path output/COHORT/COHORT.mutations.processed.tsv \
--miss_prob_path output/COHORT/COHORT.miss_prob.processed.json \
--seq_df_path output/COHORT/COHORT.seq_df.processed.tsv \
--datasets_dir /path/to/datasets \
--annotations_dir /path/to/annotations \
--output_dir plots/COHORT \
--cohort COHORT \
--maf_for_nonmiss_path original_input.maf \
--lst_gene_tracks miss_count,miss_prob,score,clusters,ddg,disorder,pacc,ptm,site,sse,pfam
See oncodrive3d plot --help for all options.
Worth knowing:
- `--maf_path` is the processed missense-only TSV (`<cohort>.mutations.processed.tsv`) from `oncodrive3d run`.
- `--maf_for_nonmiss_path` is optional and takes the original MAF (before processing); supply it to enable the non-missense track.
- All other input files (gene/pos results, `--miss_prob_path`, `--seq_df_path`, `--datasets_dir`, `--annotations_dir`) must come from that same `oncodrive3d run` invocation; mismatch yields empty plots or missing-track errors.
- `--lst_summary_tracks` / `--lst_gene_tracks` accept comma-separated track names; pair them with `--lst_*_hratios` to redistribute vertical space.
During execution (scripts/plotting/plot.py):
- Results are filtered to genes requested by the user, and the corresponding entries are sliced out of the sequence/annotation tables.
- A summary plot shows per-gene mutation counts, cluster residues, and score distributions.
- Per-gene plots combine multiple tracks: observed vs expected mutation counts, missense probabilities, clustering scores, PAE/pLDDT, ΔΔG, Pfam/UniProt annotations, PTMs, membrane regions, motifs, etc. Tracks that are not available for a gene are automatically removed.
- Annotated tables are built by merging the positional results with disorder (pLDDT), PDB features, transcript metadata, and UniProt domains. They are saved as `<cohort>.3d_clustering_pos.annotated.csv` plus `<cohort>.uniprot_feat.tsv`.
- Association plots (optional) – see the “Association Analyses” section below for details on how the logistic-regression statistics are generated and visualized (volcano, per-gene volcano, and log-odds panels).
The optional association module quantifies how strongly specific annotations track with significant clusters:
- Input preparation – residues with non-zero missense probability inherit standardized predictors: structural metrics (pLDDT, ΔΔG, surface exposure), categorical features (Pfam/UniProt/PTM/motif dummies), and the expected missense probability itself.
- Univariate logistic regressions – for each gene and each predictor, the pipeline fits `logit(C ~ feature)` where `C` is the binary cluster label. This yields log-odds, standard errors, and raw p-values that are stored in `<cohort>.logreg_result.tsv`.
- Visualization – the statistics above feed three plot types under `<cohort>.associations_plots/`:
  - A cohort-wide volcano plot highlighting annotations with the most extreme log-odds and p-values.
  - Per-gene mini volcano plots to inspect feature enrichments gene by gene.
  - Log-odds strip charts with 95% confidence intervals to visualize effect sizes.
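To make the univariate fit concrete, here is a minimal pure-Python sketch of a single-predictor logistic regression that returns the three quantities named above (log-odds, standard error, raw p-value). It is an illustration under stated assumptions, not the pipeline's implementation: the function names are hypothetical, the fit uses plain Newton-Raphson, and the p-value is a two-sided Wald test. It does not handle perfect separation or a constant predictor.

```python
import math


def _score_and_info(b0, b1, x, y):
    """Gradient and information-matrix entries of the log-likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_univariate_logit(x, y, n_iter=50, tol=1e-10):
    """Fit logit(P(y=1)) = b0 + b1*x; return (log_odds, std_err, p_value) for b1."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0, g1, h00, h01, h11 = _score_and_info(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    _, _, h00, h01, h11 = _score_and_info(b0, b1, x, y)
    std_err = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    z = b1 / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald test
    return b1, std_err, p_value


if __name__ == "__main__":
    feature = [0, 0, 0, 0, 1, 1, 1, 1]   # e.g. residue inside a domain or not
    in_cluster = [0, 0, 0, 1, 0, 1, 1, 1]
    log_odds, se, p = fit_univariate_logit(feature, in_cluster)
    low, high = log_odds - 1.96 * se, log_odds + 1.96 * se
    print(f"log-odds={log_odds:.3f} se={se:.3f} p={p:.3f} CI=({low:.3f}, {high:.3f})")
```

The 95% interval computed in the usage example (`log_odds ± 1.96 * se`) is the usual Wald interval; whether the strip charts use exactly this construction is not stated in this document.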
Only raw p-values are provided; apply your preferred multiple-testing correction (e.g., BH-FDR) before drawing conclusions about specific features.
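As one way to apply the correction mentioned above, the following sketch implements the Benjamini-Hochberg procedure in plain Python. The function name is hypothetical, and how you load the p-values from `<cohort>.logreg_result.tsv` is left out because the column names of that file are not given in this document.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw={p:.3f} adjusted={q:.3f}")
```

Whether to correct within each gene or across the whole cohort is an analysis decision; the function only adjusts the list it is given.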
For interactive-ready 3D views, Oncodrive3D exposes a separate oncodrive3d chimerax-plot command. It takes the same gene/position-level CSVs produced by oncodrive3d run, plus the datasets directory (for AlphaFold structures) and the processed sequence dataframe, and renders PNG snapshots together with .defattr files under <output_dir>/<cohort>.chimerax/. Each snapshot shows the AlphaFold model with significant clusters highlighted, optional extended clusters, pLDDT coloring, and sample counts. Provide the path to your ChimeraX installation via --chimerax_bin or rely on the default /usr/bin/chimerax. The Nextflow pipeline exposes the same functionality through the chimerax_plot flag.
Example:
oncodrive3d chimerax-plot \
--gene_result_path output/COHORT/COHORT.3d_clustering_genes.csv \
--pos_result_path output/COHORT/COHORT.3d_clustering_pos.csv \
--datasets_dir /path/to/datasets \
--seq_df_path output/COHORT/COHORT.seq_df.processed.tsv \
--output_dir plots/COHORT \
--cohort COHORT \
--chimerax_bin /opt/ChimeraX/bin/ChimeraX \
--max_n_genes 20 \
--pixel_size 0.1 \
--cluster_ext
See oncodrive3d chimerax-plot --help for all options.
Worth knowing:
- `--chimerax_bin` defaults to `/usr/bin/chimerax`; override it if ChimeraX is installed elsewhere or running in a container.
- `--pixel_size` controls resolution: smaller values produce larger images (default `0.08`).
- `--cluster_ext` displays extended clusters (mutations that contribute to but don't directly form significant clusters).
- `--af_version` defaults to `6` (matching `build-datasets`). Pass it only if your datasets were built with a different version (e.g., `--af_version 4` for MANE builds).
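The inverse relation between `--pixel_size` and image size can be shown with a small calculation. This sketch assumes the value is interpreted as the size of one pixel in scene units (Å), as with the `pixelSize` option of ChimeraX's `save` command; that mapping, the helper name, and the 100 Å scene width are assumptions for illustration, not details confirmed by this document.

```python
def image_width_px(scene_width_angstrom, pixel_size):
    """Approximate image width in pixels for a given scene width.

    Assumes pixel_size is expressed in Å per pixel (an assumption).
    """
    if pixel_size <= 0:
        raise ValueError("pixel_size must be positive")
    return round(scene_width_angstrom / pixel_size)


if __name__ == "__main__":
    for size in (0.1, 0.08, 0.05):
        print(f"pixel_size={size}: about {image_width_px(100.0, size)} px wide")
```

Under that assumption, halving the pixel size doubles each image dimension, so file size and rendering time grow accordingly.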
- PDB_Tool missing – install the binary or adjust `$PATH`. The build step logs the exact command being executed, making it easier to diagnose permission issues.
- Annotation mismatches – plots rely on UniProt IDs matching between the run outputs and the annotation tables. Make sure you pass the same datasets directory used during `build-datasets`.
- Association plots without data – if a gene does not have both clustered and non-clustered residues after filtering, it is excluded from the logistic regression. The log file will note “There aren’t any relationship to plot”.