bigbio/quantmsdiann: Output¶
Introduction¶
This document describes the output produced by the pipeline. Most plots are taken from the pmultiqc report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview¶
The pipeline is built using Nextflow and processes DIA data using the following steps:
- RAW files are converted to mzML using ThermoRawFileParser; .d and .dia files are handled natively, without conversion
- DIA-NN identifies and quantifies peptides and proteins
- The DIA-NN report is converted to an MSstats-compatible format
- pmultiqc generates the QC reports
Output structure¶
Output will be saved to the folder defined by the parameter --outdir.
Default Output Structure¶
results/
├── pipeline_info/                      # Nextflow pipeline information
├── sdrf/                               # SDRF files and configs
├── quant_tables/                       # Quantification tables and results
│   ├── diann_report.{tsv,parquet}      # Main DIA-NN report
│   ├── diann_report.pg_matrix.tsv      # Protein group matrix
│   ├── diann_report.pr_matrix.tsv      # Precursor matrix
│   ├── diann_report.gg_matrix.tsv      # Gene group matrix
│   └── out_msstats_in.csv              # MSstats-compatible output
└── pmultiqc/                           # pmultiqc reports
    ├── multiqc_plots/
    │   ├── png/
    │   ├── svg/
    │   └── pdf/
    └── multiqc_data/
Verbose Output Structure¶
For more detailed output with all intermediate files, use the verbose output configuration by providing -profile verbose_modules. This is useful for debugging or detailed analysis:
results/
├── pipeline_info/
├── sdrf/
├── spectra/
│   ├── thermorawfileparser/            # Converted raw files
│   └── mzml_statistics/                # mzML file statistics
├── database_generation/
│   ├── insilico_library_generation/    # In silico library
│   └── assemble_empirical_library/     # Empirical library
├── diann_preprocessing/
│   ├── preliminary_analysis/           # Preliminary analysis results
│   └── individual_analysis/            # Individual analysis results
├── quant_tables/
└── pmultiqc/
Key Output Files¶
- DIA-NN quantification results:
  - quant_tables/diann_report.{tsv,parquet} - Main DIA-NN report with peptide and protein quantification
  - quant_tables/diann_report.pr_matrix.tsv - Precursor quantification matrix
  - quant_tables/diann_report.pg_matrix.tsv - Protein group quantification matrix
  - quant_tables/diann_report.gg_matrix.tsv - Gene group quantification matrix
  - quant_tables/diann_report.unique_genes_matrix.tsv - Unique gene quantification matrix
  - quant_tables/out_msstats_in.csv - MSstats-compatible quantification table
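As a quick sanity check after a run, a short script can confirm which of the files listed above were actually published. The sketch below is illustrative only: the function name `find_key_outputs` is hypothetical and not part of the pipeline, and it assumes the results directory follows the default layout described in this document.

```python
from pathlib import Path

# File names taken from the list above; the main report may be TSV or Parquet.
MAIN_REPORT_CANDIDATES = ("diann_report.parquet", "diann_report.tsv")
OTHER_KEY_FILES = (
    "diann_report.pr_matrix.tsv",
    "diann_report.pg_matrix.tsv",
    "diann_report.gg_matrix.tsv",
    "diann_report.unique_genes_matrix.tsv",
    "out_msstats_in.csv",
)


def find_key_outputs(results_dir):
    """Return (found, missing) lists of key files under <results_dir>/quant_tables."""
    quant_tables = Path(results_dir) / "quant_tables"
    found, missing = [], []

    # Either main report format counts as present; report it missing only if neither exists.
    main_reports = [n for n in MAIN_REPORT_CANDIDATES if (quant_tables / n).is_file()]
    if main_reports:
        found.extend(main_reports)
    else:
        missing.append("diann_report.{tsv,parquet}")

    for name in OTHER_KEY_FILES:
        (found if (quant_tables / name).is_file() else missing).append(name)
    return found, missing
```

Calling `find_key_outputs("results")` returns two lists, so a wrapper script can print the missing names or exit with an error when the second list is non-empty.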
Parquet vs TSV Output¶
Starting with DIA-NN 2.0, the main report is produced in Apache Parquet format (diann_report.parquet) instead of the legacy TSV (diann_report.tsv). Parquet files are columnar, compressed, and significantly faster to load in downstream tools such as Python (pandas/pyarrow) or R (arrow).
| DIA-NN Version | Main report format | Matrix format |
|---|---|---|
| 1.8.1 | diann_report.tsv | .tsv |
| 2.1.0+ | diann_report.parquet | .tsv |
The pipeline detects the DIA-NN version and handles the output format automatically. Downstream steps (MSstats conversion, pmultiqc) accept both formats.
To read Parquet files:
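A minimal Python sketch for loading the main report in either format. It assumes pandas is installed with pyarrow available as the Parquet engine; the helper names `report_format` and `load_diann_report` are illustrative and not part of the pipeline. In R, `arrow::read_parquet()` plays the same role.

```python
from pathlib import Path


def report_format(path):
    """Classify a DIA-NN main report as 'parquet' or 'tsv' from its file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".parquet":
        return "parquet"
    if suffix == ".tsv":
        return "tsv"
    raise ValueError(f"Unrecognised report extension: {suffix!r}")


def load_diann_report(path):
    """Load the main report into a pandas DataFrame."""
    # Imported here so report_format() stays usable without pandas installed.
    import pandas as pd

    if report_format(path) == "parquet":
        # pandas.read_parquet needs a Parquet engine such as pyarrow.
        return pd.read_parquet(path)
    return pd.read_csv(path, sep="\t")
```

For example, `df = load_diann_report("results/quant_tables/diann_report.parquet")` returns a DataFrame whose columns are whatever the DIA-NN version in use wrote to the report.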
MSstats-Compatible Output¶
The pipeline produces quant_tables/out_msstats_in.csv, an MSstats-compatible quantification table generated by quantms-utils. This file contains long-format precursor-level intensities with the columns required by the MSstats R package for downstream statistical analysis (e.g. differential expression, sample-size estimation).
Key columns include: ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity.
The condition and biological replicate assignments are derived from the SDRF factor columns.
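Before passing the table to MSstats, it can help to check that the conditions and replicates derived from the SDRF look as expected. The sketch below uses only the Python standard library and assumes the file is comma-separated with a header row containing the Condition, BioReplicate and Run columns listed above; the function names are hypothetical.

```python
import csv
from collections import defaultdict


def summarise_design(rows):
    """Map each Condition to the sorted list of distinct (BioReplicate, Run) pairs."""
    design = defaultdict(set)
    for row in rows:
        design[row["Condition"]].add((row["BioReplicate"], row["Run"]))
    return {cond: sorted(pairs) for cond, pairs in design.items()}


def read_msstats_design(path):
    """Read an MSstats input CSV and summarise its experimental design."""
    with open(path, newline="") as handle:
        return summarise_design(csv.DictReader(handle))
```

Running `read_msstats_design("results/quant_tables/out_msstats_in.csv")` yields one entry per condition, which makes a missing or mislabelled run easy to spot.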
Optional Output Files¶
These files are not published by default. Enable them with save_* parameters or ext.* config properties (see Usage: Optional outputs).
- library_generation/*.tsv - TSV spectral library from in-silico library generation (--save_speclib_tsv)
Nextflow pipeline info¶
Nextflow writes several reports about each pipeline run, covering resource usage, task timing, and the software versions used.
pipeline_info/:
- execution_report.html - Resource usage report
- execution_timeline.html - Timeline visualization
- execution_trace.txt - Detailed execution trace
- pipeline_dag.html - DAG visualization
- software_versions.yml - Software versions used
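The execution trace is plain text and can be summarised with a few lines of Python. This sketch assumes the trace is tab-separated with a header row that includes a `status` column, as in Nextflow's default trace fields; the function names are hypothetical, and the actual file name in pipeline_info/ may carry a timestamp.

```python
import csv
from collections import Counter


def count_task_status(lines):
    """Count tasks per value of the `status` column in a Nextflow trace file."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["status"] for row in reader)


def summarise_trace(path):
    """Open a trace file and return a Counter of task statuses."""
    with open(path, newline="") as handle:
        return count_task_status(handle)
```

A non-zero count for any status other than the one marking completed tasks is a prompt to open execution_report.html and inspect the affected tasks.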
pmultiqc¶
All QC results are generated by pmultiqc, a proteomics plugin for MultiQC. The interactive HTML report provides:
- Identification and quantification metrics
- Sample-level quality statistics
- Pipeline software versions