bigbio/quantmsdiann: Output

Introduction

This document describes the output produced by the pipeline. Most plots are taken from the pmultiqc report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes DIA data using the following steps:

  1. RAW data is converted to mzML using ThermoRawFileParser (or .d/.dia files are handled natively)
  2. DIA-NN is used for identification and quantification of peptides and proteins
  3. DIA-NN report is converted to MSstats-compatible format
  4. QC reports are generated using pmultiqc

Output structure

Output will be saved to the folder defined by the parameter --outdir.

Default Output Structure

results/
├── pipeline_info/             # Nextflow pipeline information
├── sdrf/                      # SDRF files and configs
├── quant_tables/              # Quantification tables and results
│   ├── diann_report.{tsv,parquet}  # Main DIA-NN report
│   ├── diann_report.pg_matrix.tsv  # Protein group matrix
│   ├── diann_report.pr_matrix.tsv  # Precursor matrix
│   ├── diann_report.gg_matrix.tsv  # Gene group matrix
│   └── out_msstats_in.csv     # MSstats-compatible output
└── pmultiqc/                  # pmultiqc reports
    ├── multiqc_plots/
    │   ├── png/
    │   ├── svg/
    │   └── pdf/
    └── multiqc_data/

Verbose Output Structure

For more detailed output with all intermediate files, use the verbose output configuration by providing -profile verbose_modules. This is useful for debugging or detailed analysis:

results/
├── pipeline_info/
├── sdrf/
├── spectra/
│   ├── thermorawfileparser/         # Converted raw files
│   └── mzml_statistics/             # mzML file statistics
├── database_generation/
│   ├── insilico_library_generation/ # In silico library
│   └── assemble_empirical_library/  # Empirical library
├── diann_preprocessing/
│   ├── preliminary_analysis/        # Preliminary analysis results
│   └── individual_analysis/         # Individual analysis results
├── quant_tables/
└── pmultiqc/

Key Output Files

  • DIA-NN quantification results:
      ◦ quant_tables/diann_report.{tsv,parquet} - Main DIA-NN report with peptide and protein quantification
      ◦ quant_tables/diann_report.pr_matrix.tsv - Precursor quantification matrix
      ◦ quant_tables/diann_report.pg_matrix.tsv - Protein group quantification matrix
      ◦ quant_tables/diann_report.gg_matrix.tsv - Gene group quantification matrix
      ◦ quant_tables/diann_report.unique_genes_matrix.tsv - Unique gene quantification matrix
      ◦ quant_tables/out_msstats_in.csv - MSstats-compatible quantification table
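As a quick sanity check after a run, the key files listed above can be verified programmatically. The following is a minimal sketch, not part of the pipeline: the function name `missing_key_outputs` is hypothetical, and it assumes the default layout in which these files sit directly under `quant_tables/`.

```python
from pathlib import Path

# Each entry lists acceptable alternatives; one match per entry is enough.
# The main report is TSV or Parquet depending on the DIA-NN version.
KEY_OUTPUTS = [
    ("diann_report.tsv", "diann_report.parquet"),
    ("diann_report.pr_matrix.tsv",),
    ("diann_report.pg_matrix.tsv",),
    ("diann_report.gg_matrix.tsv",),
    ("diann_report.unique_genes_matrix.tsv",),
    ("out_msstats_in.csv",),
]


def missing_key_outputs(outdir):
    """Return the key outputs that are absent from <outdir>/quant_tables."""
    quant_dir = Path(outdir) / "quant_tables"
    missing = []
    for alternatives in KEY_OUTPUTS:
        # An entry is satisfied if any one of its alternatives exists.
        if not any((quant_dir / name).is_file() for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing
```

Calling `missing_key_outputs("results")` returns an empty list when every key file is present, and otherwise the names of the files that were not found.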

Parquet vs TSV Output

Starting with DIA-NN 2.0, the main report is produced in Apache Parquet format (diann_report.parquet) instead of the legacy TSV (diann_report.tsv). Parquet files are columnar, compressed, and significantly faster to load in downstream tools such as Python (pandas/pyarrow) or R (arrow).

| DIA-NN Version | Main report format   | Matrix format |
|----------------|----------------------|---------------|
| 1.8.1          | diann_report.tsv     | .tsv          |
| 2.1.0+         | diann_report.parquet | .tsv          |

The pipeline detects the DIA-NN version and handles the output format automatically. Downstream steps (MSstats conversion, pmultiqc) accept both formats.

To read Parquet files:

# Python
import pandas as pd
df = pd.read_parquet("diann_report.parquet")
# R
library(arrow)
df <- read_parquet("diann_report.parquet")
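Because the report format depends on the DIA-NN version, downstream scripts may want to accept either file. The sketch below shows one way to do that with pandas; the helper name `load_diann_report` is hypothetical, and it assumes that the Parquet file should be preferred when both exist and that a Parquet engine such as pyarrow is installed.

```python
from pathlib import Path

import pandas as pd


def load_diann_report(quant_dir):
    """Load the main DIA-NN report, preferring Parquet over the legacy TSV."""
    quant_dir = Path(quant_dir)
    parquet_path = quant_dir / "diann_report.parquet"
    tsv_path = quant_dir / "diann_report.tsv"
    if parquet_path.is_file():
        # Reading Parquet requires an engine such as pyarrow.
        return pd.read_parquet(parquet_path)
    if tsv_path.is_file():
        # The legacy report is a tab-separated text file.
        return pd.read_csv(tsv_path, sep="\t")
    raise FileNotFoundError(
        f"No diann_report.parquet or diann_report.tsv in {quant_dir}"
    )
```

For example, `load_diann_report("results/quant_tables")` returns a DataFrame regardless of which DIA-NN version produced the report.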

MSstats-Compatible Output

The pipeline produces quant_tables/out_msstats_in.csv, an MSstats-compatible quantification table generated by quantms-utils. This file contains long-format precursor-level intensities with the columns required by the MSstats R package for downstream statistical analysis (e.g. differential expression, sample-size estimation).

Key columns include: ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity.

The condition and biological replicate assignments are derived from the SDRF factor columns.
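Before passing the table to MSstats, it can be useful to confirm that the header contains the columns listed above. The following is a minimal sketch using only the Python standard library; the function name `missing_msstats_columns` is hypothetical, and the check covers only the columns named in this section, not everything MSstats may validate.

```python
import csv

# Columns named in this section as required by MSstats.
MSSTATS_COLUMNS = [
    "ProteinName", "PeptideSequence", "PrecursorCharge", "FragmentIon",
    "ProductCharge", "IsotopeLabelType", "Condition", "BioReplicate",
    "Run", "Intensity",
]


def missing_msstats_columns(csv_file):
    """Return the listed MSstats columns absent from an open CSV file's header."""
    # Only the first row (the header) is read; an empty file yields no columns.
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in MSSTATS_COLUMNS if name not in present]
```

The function takes an open text file handle, for example one obtained by opening `results/quant_tables/out_msstats_in.csv` with `newline=""`, and returns an empty list when all listed columns are present.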

Optional Output Files

These files are not published by default. Enable them with save_* parameters or ext.* config properties (see Usage: Optional outputs).

  • library_generation/*.tsv - TSV spectral library from in-silico library generation (--save_speclib_tsv)

Nextflow pipeline info

Nextflow generates several reports describing how the pipeline ran, including resource usage, timing, and per-task execution details.

pipeline_info/:

  • execution_report.html - Resource usage report
  • execution_timeline.html - Timeline visualization
  • execution_trace.txt - Detailed execution trace
  • pipeline_dag.html - DAG visualization
  • software_versions.yml - Software versions used
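The execution trace can also be inspected programmatically, for instance to spot failed tasks. The sketch below assumes the trace is a tab-separated file with a header row that includes a `status` column, as in Nextflow's default trace fields; the function name `count_task_statuses` is hypothetical.

```python
import csv
from collections import Counter


def count_task_statuses(trace_file):
    """Count tasks per status in an open, tab-separated Nextflow trace file."""
    # Assumes a header row containing a "status" column (a Nextflow default field).
    reader = csv.DictReader(trace_file, delimiter="\t")
    return Counter(row["status"] for row in reader)
```

Passing an open handle to `pipeline_info/execution_trace.txt` returns a mapping from each status value to the number of tasks that ended in that state.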

pmultiqc

All QC results are generated by pmultiqc, a proteomics plugin for MultiQC. The interactive HTML report provides:

  • Identification and quantification metrics
  • Sample-level quality statistics
  • Pipeline software versions