nextflow run replikation/MPOA --fastq '*.fastq.gz' --fasta '*.fasta' -profile local,docker
To run MPOA, the workflow manager Nextflow and its dependency, the Java runtime (default-jre), need to be installed.
Install the Java runtime via:
sudo apt install -y default-jre
Install Nextflow (this creates a nextflow file at your current location):
curl -s https://get.nextflow.io | bash
Optional: Move the "nextflow" executable to a location on your $PATH (so you can execute it from anywhere):
sudo mv nextflow /bin && sudo chmod 770 /bin/nextflow
Full Docker installation instructions for Ubuntu can be found here:
If you have never worked with Docker, we recommend installing it via apt:
sudo apt install -y docker
sudo usermod -a -G docker $USER
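Before the first run it can help to confirm that the installed tools are reachable from your shell. The following is a minimal sketch, not part of MPOA: the helper name check_tools is hypothetical, and it only tests whether each executable is on the PATH. Note that the group change made by usermod only takes effect after you log out and back in.

```shell
# Hypothetical helper: report whether each named executable is on the PATH.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found at $(command -v "$tool")"
    else
      echo "$tool: NOT found on PATH"
    fi
  done
}

# Check the three tools mentioned in the installation steps.
check_tools java nextflow docker
```

If you use Singularity instead of Docker, pass its executable name to the helper instead.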
As an alternative to Docker you can also install and use Singularity, e.g. on an HPC.
To use Singularity follow the installation steps here.
Note that with Singularity the following environment variables are automatically passed to the container to ensure execution on HPCs: HTTPS_PROXY, HTTP_PROXY, http_proxy, https_proxy, FTP_PROXY and ftp_proxy.
nextflow run replikation/MPOA --help
Make sure that you have one read file (.fastq or .fastq.gz) per sample.
You can combine multiple read files into one via:
cat *.fastq > sample1_all.fastq
or
cat *.fastq.gz > sample1_all.fastq.gz
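If the reads of each sample sit in their own directory, the concatenation can be wrapped in a small function. This is only a sketch under assumptions: the helper name combine_reads and the one-directory-per-sample layout are hypothetical and not something MPOA requires. Concatenated gzip files remain valid gzip, so no decompression is needed.

```shell
# Hypothetical helper: concatenate all .fastq.gz files of one directory.
# $1 = directory with the read files of a single sample
# $2 = output file (write it outside $1 so it is never picked up as input)
combine_reads() (
  in_dir=$1
  out_file=$2
  set -- "$in_dir"/*.fastq.gz
  # If the glob matched nothing, "$1" is the unexpanded pattern and does not exist.
  [ -e "$1" ] || { echo "no .fastq.gz files in $in_dir" >&2; exit 1; }
  cat "$@" > "$out_file"
)

# Example call (paths are placeholders):
# combine_reads reads/sample1 sample1_all.fastq.gz
```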
Read and genome files are matched by the first word of their filenames, i.e. everything before the first dot.
YES: Sample1.clean.fasta & Sample1.fastq.gz
NO: clean.Sample1.fasta & Sample1.fastq.gz
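To see which sample name a given filename would yield under this rule, you can strip everything from the first dot onward. The helper below is a sketch for checking your own filenames before a run; the name sample_id is hypothetical and this is not MPOA's internal code.

```shell
# Hypothetical helper: print the part of a filename before its first dot.
sample_id() {
  base=$(basename "$1")
  printf '%s\n' "${base%%.*}"
}

sample_id Sample1.clean.fasta   # prints Sample1
sample_id Sample1.fastq.gz      # prints Sample1 (matches the line above)
sample_id clean.Sample1.fasta   # prints clean   (would not match Sample1)
```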
Open your terminal where you have access to the fastq and fasta files.
nextflow run replikation/MPOA --fasta '*.fasta' --fastq '*.fastq.gz' -profile local,docker -r '1.4.1'
The default output is a CSV file summarizing the masked positions per sample. The file masked_bases_summary.csv lists how many positions were masked by each IUPAC base (N, W, S, M, K, R, Y, B, D, H, V). Depth masking (N) is applied when a position is covered by fewer than ten bases. For each sample, masked positions are reported both for the entire genome (all contigs) and for the chromosome alone.
For example:
In sample2, 3108 positions are masked by N across all contigs (genome), which means these positions have a sequencing depth below 10. Looking only at the chromosome contig, the number of masked positions drops sharply, especially for depth-masked positions (19 instead of 3108), so we used only the chromosomal contig for our analysis.
name,type,N(ATCG),W(AT),S(CG),M(AC),K(TG),R(AG),Y(TC),B(TCG),D(ATG),H(ATC),V(ACG)
sample1,genome,1,0,0,1,0,14,12,0,0,0,0,
sample1,chromosome,1,0,0,1,0,12,9,0,0,0,0,
sample2,genome,3108,100,88,109,110,396,405,0,0,0,0,
sample2,chromosome,19,0,2,0,4,200,181,0,0,0,0,
sample3,genome,9534,6,14,20,10,310,293,0,0,0,0,
sample3,chromosome,17,0,4,4,0,266,254,0,0,0,0,
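The genome versus chromosome comparison can also be pulled out of the CSV on the command line. The following sketch assumes the column layout shown above (sample name, type, then the N count in the third column); the function name compare_depth_masking is hypothetical.

```shell
# Hypothetical helper: print the N-masked counts for genome and chromosome side by side.
# $1 = path to masked_bases_summary.csv
compare_depth_masking() {
  awk -F',' '
    NR == 1 { next }                         # skip the header row
    {
      if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }   # keep first-seen sample order
      n_masked[$1, $2] = $3                  # column 3 holds the N count
    }
    END {
      print "sample\tgenome_N\tchromosome_N"
      for (i = 1; i <= n; i++) {
        s = order[i]
        print s "\t" n_masked[s, "genome"] "\t" n_masked[s, "chromosome"]
      }
    }
  ' "$1"
}

# Example call:
# compare_depth_masking masked_bases_summary.csv
```

For the table above, this would report 3108 versus 19 for sample2 and 9534 versus 17 for sample3.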
The sequence logo is a graphical presentation of all ambiguous positions and their surrounding bases for each analyzed sample. The overall height of each stack indicates the sequence conservation at a position (measured in bits), whereas the height of the symbols within the stack reflects the relative frequency of the corresponding base at that position.
In default mode, the sequence logo shows five bases upstream and downstream of the ambiguous position masked by an IUPAC code. The length of the logo can be modified using the --motif flag.
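To make the default window concrete, the sketch below cuts five bases on either side of a masked position out of a sequence string. It only illustrates what such a window looks like and is not how MPOA builds its logos; the function name motif_window, the flank argument and the example sequence are all hypothetical, and the flank argument is not claimed to match how --motif is interpreted.

```shell
# Hypothetical helper: print the bases around a 1-based position.
# $1 = sequence, $2 = position of the ambiguous base, $3 = flank size (default 5)
motif_window() {
  awk -v seq="$1" -v pos="$2" -v flank="${3:-5}" 'BEGIN {
    start = pos - flank; if (start < 1) start = 1                  # clamp at sequence start
    end = pos + flank;   if (end > length(seq)) end = length(seq)  # clamp at sequence end
    print substr(seq, start, end - start + 1)
  }'
}

# Made-up sequence with an R at position 11:
motif_window AAAAATTTTTRACGCCCCCGGGGG 11   # prints TTTTTRACGCC
```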
For example:
For sample BK16641_k14 we have 233 positions masked by R and 225 positions masked by Y. Across all regions masked by R and Y we see the roughly conserved motifs RACG and CGTY.
A violin chart is created when the --frequency flag is used. The plot is built from all analyzed samples of the run. For each ambiguous position found by aligning reads to their reference, the percentage occurrence of each base within the reads is calculated. The orientation of the strand (forward or reverse) is determined, and the values are displayed as data points in two plots per position.
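The percentage calculation described here can be sketched as follows. This is an illustration under assumptions, not MPOA's implementation: the function name base_percentages and the tab-separated input format (position, strand, base, read count) are hypothetical, and any numbers you feed it are your own.

```shell
# Hypothetical helper: convert per-base read counts into percentages per position and strand.
# $1 = tab-separated file with the columns: position, strand, base, count
base_percentages() {
  awk -F'\t' '
    NR == FNR { total[$1, $2] += $4; next }    # first pass: sum counts per position and strand
    total[$1, $2] > 0 {                        # second pass: print each base as a percentage
      printf "%s\t%s\t%s\t%.1f\n", $1, $2, $3, 100 * $4 / total[$1, $2]
    }
  ' "$1" "$1"
}

# Example call (the file name is a placeholder):
# base_percentages base_counts.tsv
```

The file is read twice, once to collect the totals and once to print, so the input has to be a regular file rather than a pipe.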
For example:
This plot comprises 6,556 ambiguous positions from 33 samples. Each dot represents the occurrence of a base within the respective base combination at the ambiguous position. Most errors arise when the basecaller has to decide between the purine bases (A and G), masked by IUPAC code R, or between the pyrimidine bases (T and C), masked by IUPAC code Y. 3,311 positions were masked by R. The plot differentiates between the forward and reverse strand. On the reverse strand, guanine is called far more frequently (95-100%) than adenine (0-5%), which leads to the conclusion that guanine is the correct base at this position. The forward strand is less clear: the basecaller mainly calls adenine at this position and guanine less frequently. This base ratio in the reads can result in erroneous assemblies. Since it is a strand-specific error, it points to base modification (doi.org/10.1101/2023.09.15.556300).
If you are interested in the work of our group: click here
--cores 8 --max_cores 8
This executes one process after another and assigns only 8 threads to each, so fewer processes run simultaneously. MPOA was tested on a laptop with 16 GB RAM and 8 threads and worked fine.
To run a specific release, add the -r flag and the version you want to execute:
nextflow run replikation/MPOA --fasta '*.fasta' --fastq '*.fastq.gz' -profile local,docker -r '1.4.1'