bigbio/quantmsdiann: Output¶

Introduction¶

This document describes the output produced by the pipeline. Most plots are taken from the pmultiqc report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview¶

The pipeline is built using Nextflow and processes DIA data using the following steps:

RAW data is converted to mzML using ThermoRawFileParser (.d and .dia files are handled natively, without conversion)
DIA-NN is used for identification and quantification of peptides and proteins
DIA-NN report is converted to MSstats-compatible format
QC reports are generated using pmultiqc

Output structure¶

Output is saved to the directory set by the --outdir parameter.

Default Output Structure¶

results/
├── pipeline_info/             # Nextflow pipeline information
├── sdrf/                      # SDRF files and configs
├── quant_tables/              # Quantification tables and results
│   ├── diann_report.{tsv,parquet}  # Main DIA-NN report
│   ├── diann_report.pg_matrix.tsv  # Protein group matrix
│   ├── diann_report.pr_matrix.tsv  # Precursor matrix
│   ├── diann_report.gg_matrix.tsv  # Gene group matrix
│   └── out_msstats_in.csv     # MSstats-compatible output
└── pmultiqc/                  # pmultiqc reports
    ├── multiqc_plots/
    │   ├── png/
    │   ├── svg/
    │   └── pdf/
    └── multiqc_data/
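As a quick sanity check after a run, the default layout above can be verified with a few lines of Python. This is a minimal sketch rather than part of the pipeline: the function name `missing_outputs` is hypothetical, and the path list is simply copied from the tree above.

```python
from pathlib import Path

# Paths copied from the default output tree, relative to --outdir.
EXPECTED = [
    "pipeline_info",
    "sdrf",
    "quant_tables/diann_report.pg_matrix.tsv",
    "quant_tables/diann_report.pr_matrix.tsv",
    "quant_tables/diann_report.gg_matrix.tsv",
    "quant_tables/out_msstats_in.csv",
    "pmultiqc",
]


def missing_outputs(outdir):
    """Return the expected paths that are absent under outdir (hypothetical helper)."""
    root = Path(outdir)
    missing = [rel for rel in EXPECTED if not (root / rel).exists()]
    # The main report may be TSV or Parquet, so either file satisfies the check.
    reports = [root / "quant_tables" / f"diann_report.{ext}" for ext in ("tsv", "parquet")]
    if not any(path.exists() for path in reports):
        missing.append("quant_tables/diann_report.{tsv,parquet}")
    return missing
```

Calling `missing_outputs("results")` returns an empty list when every listed directory and file is present.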

Verbose Output Structure¶

To keep all intermediate files, run the pipeline with -profile verbose_modules. This is useful for debugging or for inspecting individual steps in detail:

results/
├── pipeline_info/
├── sdrf/
├── spectra/
│   ├── thermorawfileparser/         # Converted raw files
│   └── mzml_statistics/             # mzML file statistics
├── database_generation/
│   ├── insilico_library_generation/ # In silico library
│   └── assemble_empirical_library/  # Empirical library
├── diann_preprocessing/
│   ├── preliminary_analysis/        # Preliminary analysis results
│   └── individual_analysis/         # Individual analysis results
├── quant_tables/
└── pmultiqc/

Key Output Files¶

DIA-NN quantification results:
quant_tables/diann_report.{tsv,parquet} - Main DIA-NN report with peptide and protein quantification
quant_tables/diann_report.pr_matrix.tsv - Precursor quantification matrix
quant_tables/diann_report.pg_matrix.tsv - Protein group quantification matrix
quant_tables/diann_report.gg_matrix.tsv - Gene group quantification matrix
quant_tables/diann_report.unique_genes_matrix.tsv - Unique gene quantification matrix
quant_tables/out_msstats_in.csv - MSstats-compatible quantification table
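The matrix files are tab-separated and can be inspected with the standard library alone. The sketch below counts non-empty intensity cells per run column. It assumes every column that is not an annotation column holds one run's intensities. The caller must supply the annotation column names after checking the file header, because they can differ between DIA-NN versions; `quantified_per_run` is a hypothetical helper name.

```python
import csv


def quantified_per_run(matrix_path, annotation_columns):
    """Count non-empty intensity cells per run column in a matrix TSV.

    annotation_columns: names of the non-intensity columns, supplied by the
    caller (an assumption to check against the actual file header).
    """
    with open(matrix_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        run_columns = [col for col in header if col not in annotation_columns]
        counts = dict.fromkeys(run_columns, 0)
        for row in reader:
            for run in run_columns:
                value = (row[run] or "").strip()
                # Treat empty cells and NA/NaN markers as "not quantified".
                if value and value.upper() not in {"NA", "NAN"}:
                    counts[run] += 1
    return counts
```

For example, `quantified_per_run("quant_tables/diann_report.pg_matrix.tsv", {"Protein.Group", "Genes"})` would work only if those two names are the sole annotation columns in the file, which should be confirmed first.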

Parquet vs TSV Output¶

Starting with DIA-NN 2.0, the main report is produced in Apache Parquet format (diann_report.parquet) instead of the legacy TSV (diann_report.tsv). Parquet files are columnar and compressed, so they are typically smaller and faster to load than TSV in downstream tools such as Python (pandas/pyarrow) or R (arrow).

DIA-NN Version	Main report format	Matrix format
1.8.1	`diann_report.tsv`	`.tsv`
2.1.0+	`diann_report.parquet`	`.tsv`

The pipeline detects the DIA-NN version and handles the output format automatically. Downstream steps (MSstats conversion, pmultiqc) accept both formats.

To read Parquet files:

# Python (requires pandas plus a Parquet engine such as pyarrow)
import pandas as pd
df = pd.read_parquet("diann_report.parquet")

# R (requires the arrow package)
library(arrow)
df <- read_parquet("diann_report.parquet")
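Because the report format depends on the DIA-NN version, downstream scripts may want to accept whichever file is present. The following is a minimal sketch under that assumption. The helper names `find_main_report` and `load_main_report` are hypothetical and are not part of the pipeline or of quantms-utils.

```python
from pathlib import Path


def find_main_report(quant_dir):
    """Return the main DIA-NN report path, preferring Parquet over TSV."""
    quant_dir = Path(quant_dir)
    for name in ("diann_report.parquet", "diann_report.tsv"):
        candidate = quant_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"no diann_report.parquet or diann_report.tsv in {quant_dir}"
    )


def load_main_report(quant_dir):
    """Load the report into a pandas DataFrame, whichever format is present."""
    import pandas as pd  # imported lazily so the path lookup works without pandas

    path = find_main_report(quant_dir)
    if path.suffix == ".parquet":
        return pd.read_parquet(path)  # needs pyarrow or fastparquet installed
    return pd.read_csv(path, sep="\t")
```

With this, `load_main_report("results/quant_tables")` gives a DataFrame for either report format.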

MSstats-Compatible Output¶

The pipeline produces quant_tables/out_msstats_in.csv, an MSstats-compatible quantification table generated by quantms-utils. This file contains long-format precursor-level intensities with the columns required by the MSstats R package for downstream statistical analysis (e.g. differential expression, sample-size estimation).

Key columns include: ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity.

The condition and biological replicate assignments are derived from the SDRF factor columns.
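Before handing the table to MSstats, it can help to confirm that the key columns are present and that runs were assigned to the intended conditions. This is a sketch that assumes the file is comma-separated with a single header row; the column list is the one given above, and both function names are hypothetical.

```python
import csv
from collections import defaultdict

# Key columns as listed in this section.
REQUIRED = [
    "ProteinName", "PeptideSequence", "PrecursorCharge", "FragmentIon",
    "ProductCharge", "IsotopeLabelType", "Condition", "BioReplicate",
    "Run", "Intensity",
]


def missing_msstats_columns(csv_path):
    """Return the key columns that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [col for col in REQUIRED if col not in header]


def runs_per_condition(csv_path):
    """Map each Condition value to the sorted list of Run values seen with it."""
    runs = defaultdict(set)
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            runs[row["Condition"]].add(row["Run"])
    return {condition: sorted(names) for condition, names in runs.items()}
```

Comparing the result of `runs_per_condition` with the SDRF is a simple way to spot a mislabelled factor value before running the statistical analysis.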

Optional Output Files¶

These files are not published by default. Enable them with save_* parameters or ext.* config properties (see Usage: Optional outputs).

library_generation/*.tsv - TSV spectral library from in-silico library generation (--save_speclib_tsv)

Nextflow pipeline info¶

Nextflow can generate several reports about the execution of the pipeline, such as resource usage and task timing.

pipeline_info/:

execution_report.html - Resource usage report
execution_timeline.html - Timeline visualization
execution_trace.txt - Detailed execution trace
pipeline_dag.html - DAG visualization
software_versions.yml - Software versions used
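The execution trace can be summarised without opening the HTML reports. The sketch below assumes the trace is tab-separated with a header row that includes `name` and `status` columns, as in Nextflow's default trace layout. If the trace fields were customised, adjust the column names accordingly. The function names are hypothetical.

```python
import csv
from collections import Counter


def task_status_counts(trace_path):
    """Count tasks per status value in a Nextflow trace file."""
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["status"] for row in reader)


def failed_tasks(trace_path):
    """Return the names of tasks whose status is FAILED."""
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["name"] for row in reader if row["status"] == "FAILED"]
```

For example, `task_status_counts("results/pipeline_info/execution_trace.txt")` gives a per-status tally that is easy to log or assert on in automated runs.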

pmultiqc¶

All QC results are generated by pmultiqc, a proteomics plugin for MultiQC. The interactive HTML report provides:

Identification and quantification metrics
Sample-level quality statistics
Pipeline software versions