small RNA MetaVir — Documentation

About

small RNA MetaVir is a bioinformatics pipeline designed to detect viral sequences in small RNA sequencing (sRNA-seq) data from arthropods. Unlike conventional approaches that rely exclusively on sequence similarity to known viruses, this tool exploits a fundamental biological mechanism — RNA interference (RNAi) — to identify viral sequences even when they share no detectable similarity with any sequence in public databases.

The Biological Principle: RNAi as an Antiviral Defense

In arthropods such as mosquitoes, flies, and ticks, the RNAi pathway serves as a primary antiviral defense. When a virus infects an insect cell, the host enzyme Dicer-2 recognizes viral double-stranded RNA (dsRNA) replication intermediates and cleaves them into small interfering RNAs (siRNAs) of a characteristic length — predominantly 21 nucleotides. These virus-derived siRNAs (vsRNAs) then guide the RISC (RNA-Induced Silencing Complex) to degrade the viral genome. This leaves a detectable molecular signature: an enrichment of 21 nt reads mapping to the viral sequence.

By contrast, endogenous viral elements (EVEs) — ancient viral sequences integrated into the host genome — produce a different class of small RNAs called piRNAs (PIWI-interacting RNAs), which are typically 24–30 nt in length and show a characteristic ping-pong amplification signature.

🦠 Active Virus — siRNA Signature

Strong peak at 21 nt. Reads on both sense and antisense strands. Produced by Dicer-2 cleavage of viral dsRNA replication intermediates.

21 nt peak Dicer-2 RISC

🧬 EVE — piRNA Signature

Broad distribution at 24–30 nt. Ping-pong amplification pattern (A at pos 10). Produced from genomically integrated viral sequences.

24-30 nt PIWI Ping-pong

Key Features

Similarity-Independent

Finds novel “dark matter” viruses with no database matches through their RNAi footprint.

Virus vs. EVE Classifier

Random Forest trained on 2,315 curated sequences (~92.5% accuracy). 48 features from sRNA size profiles.

Multi-Assembler Strategy

Four parallel assemblies (Velvet + SPAdes) merged by CAP3 to maximize contig recovery from short reads.

Hardware Requirements

Resource	Minimum	Recommended
CPU Cores	8	20+
RAM	32 GB	64 GB
Disk Space	500 GB	1 TB+

Pipeline Architecture

The pipeline orchestrates 12+ stages from raw reads to final virus vs. EVE classification. The diagram below shows the complete processing flow, color-coded by stage type.

%%{init:{'theme':'dark','themeVariables':{'fontSize':'13px','primaryColor':'#0d1f35','primaryBorderColor':'#1a3050','lineColor':'#5de4c7'}}}%%
flowchart TD
    classDef input fill:#3a4a5a,stroke:#5a6a7a,color:#eaf0f7
    classDef filter fill:#1a4a6a,stroke:#2a6a9a,color:#c8e4f7
    classDef assembly fill:#1a6a4a,stroke:#2a9a6a,color:#c8f7e4
    classDef similarity fill:#8a6a2a,stroke:#baa040,color:#f7eac8
    classDef ml fill:#4a3a6a,stroke:#7a5aaa,color:#e0d0f7
    classDef output fill:#3a4a5a,stroke:#5de4c7,color:#5de4c7

    IN["📄 INPUT\nFASTA / FASTQ reads"]:::input

    S0["Step 0: Quality Control\nTrim Galore + FastQC\n(only with -fastq input)"]:::filter
    S1["Step 1: Host Filtering\nBowtie vs host genome\n(-v 1, -k 1)"]:::filter
    S2["Step 2: Bacterial Filtering\nBowtie vs bacterial genomes\n(--large-index)"]:::filter
    S3["Step 3: Size Selection\nfilter_fasta_by_size.py"]:::filter

    A1["Velvet Optimiser\nhash auto 13-19"]:::assembly
    A2["Velvet Fixed\nhash = 15"]:::assembly
    A3["SPAdes\nk = 13,15,17,19"]:::assembly
    A4["20-23 nt Assembly\nsiRNA-focused"]:::assembly
    CAP["Step 5: CAP3 Meta-Assembly\nMerge + filter ≥200 nt"]:::assembly

    BN["Step 6: BLASTn vs nt\nevalue 1e-5\nClassify: viral / non-viral / no-hit"]:::similarity
    DM["Step 7: Diamond BLASTx vs nr\nevalue 0.001, very-sensitive\n(no-hit contigs only)"]:::similarity

    MAP["Step 8-9: Read Mapping & sRNA Profiling\nBowtie map reads to contigs\nSize profiles 18-35 nt per contig"]:::ml
    ZS["Step 10: Z-Score Features\nNormalize profiles\n48 features per contig"]:::ml
    RF["Step 11: Random Forest Classifier\n50 trees, 2315 training sequences\nviral vs eve prediction"]:::ml

    OUT["🎯 OUTPUT\nviral-eve.csv\nsummary.tsv\nFASTA per class\nsRNA profiles"]:::output

    IN --> S0 --> S1 --> S2 --> S3
    S3 --> A1 & A2 & A3 & A4
    A1 & A2 & A3 & A4 --> CAP
    CAP --> BN --> DM
    DM --> MAP --> ZS --> RF --> OUT

Stage Color Legend

Blue Filtering & QC (host, bacteria, size) Green Assembly (Velvet, SPAdes, CAP3) Orange Similarity search (BLAST, Diamond) Purple Machine learning (profiling, Z-scores, RF)

Technology Stack

Category	Tools & Versions
Languages	Perl 5.36 (pipeline core), Python 3.9 (ML, filtering), R 4.0 (plotting, feature matrix)
Quality Control	Trim Galore 0.6.10, FastQC 0.12.1, cutadapt 3.2, FASTX-Toolkit
Read Mapping	Bowtie 1.3.0 (short-read aligner), samtools (SAM/BAM processing)
De Novo Assembly	Velvet / VelvetOptimiser 2.2.6, SPAdes 3.13.1, CAP3 (meta-assembler)
Similarity Search	BLAST+ 2.14.0 (BLASTn), legacy BLAST 2.2.26 (formatdb/blastall), Diamond 2.1.6 (BLASTx)
Machine Learning	scikit-learn 1.1.3 (Random Forest, 50 estimators), joblib 1.3.1, pandas 2.0.3
Perl Modules	Bio::SeqIO, Bio::SearchIO, Statistics::Basic, Statistics::RankCorrelation
R Packages	ggplot2, reshape2, Rtsne, umap, ComplexHeatmap
Containerization	Docker (multi-stage Dockerfile) / Podman

Installation

Docker (Recommended)

# 1. Clone the repository
git clone https://github.com/v-rogana/small-rna-metavir.git
cd small-rna-metavir/docker

# 2. Build the Docker image (30-60 min first time)
docker build --target small_rna_metavir -t small_rna_metavir:latest .

# 3. Start the container with data volumes
docker run -it --rm \
    -v /path/to/databases:/small-rna-metavir/asset \
    -v /path/to/your/reads:/data \
    -v /path/to/output:/small-rna-metavir/src/pipeline/runs \
    small_rna_metavir:latest bash

# 4. Inside the container
cd /small-rna-metavir/src/pipeline
perl main.pl -h   # verify installation

Podman (Staged Builds)

# Build with caching for faster rebuilds
podman build . --target=stage_perl         -t srna/perl:v01
podman build . --target=stage_python       -t srna/python:v01  --cache-from=localhost/srna/perl:v01
podman build . --target=stage_r            -t srna/r:v01       --cache-from=localhost/srna/python:v01
podman build . --target=stage_dependencies -t srna/deps:v01    --cache-from=localhost/srna/r:v01
podman build . --target=small_rna_metavir  -t srna/metavir:v01 --cache-from=localhost/srna/deps:v01

# Run with SELinux-compatible mount
podman run -it --rm -v /data:/small-rna-metavir/asset:Z srna/metavir:v01 bash

Required Databases

Database	Container Path	Size	Setup Command
Host genome	`-hostgenome` param	1-3 GB	`bowtie-build --threads 20 host.fa host.fa`
Bacterial genomes	`/small-rna-metavir/asset/refs/bacterial_genomes/`	~5 GB	`bowtie-build --large-index --threads 20 all_bacters.fa all_bacters.fa`
NCBI nt (core_nt)	`/small-rna-metavir/asset/blastdb/nt/`	~150 GB	`update_blastdb.pl --decompress nt`
NCBI nr (Diamond)	`/small-rna-metavir/asset/diamond/`	~60-100 GB	`diamond makedb --in nr.gz --db nrcluster_tax --taxonmap ...`
RF Classifier	`/small-rna-metavir/asset/classifier/`	<1 MB	Included in the repository (`asset/`)

Parameters Reference

Required Parameters

Parameter	Type	Description	Example
`-fasta <path>`	Path	Input reads in FASTA format (.fasta, .fa, .gz). Skips Trim Galore/FastQC.	`-fasta /data/lib_trimmed.fasta`
`-fastq <path>`	Path	Input reads in FASTQ format. Runs Trim Galore + FastQC automatically.	`-fastq /data/raw.fq.gz`
`-hostgenome <path>`	Path	Host genome FASTA with pre-built Bowtie index. Not needed with `-nohostfilter`.	`-hostgenome /refs/Aae.fasta`
`-prefix <str>`	String	Output file prefix for all generated files.	`-prefix SRR1652436`
`-size <int>`	Integer	Expected target virus genome size in bp.	`-size 20000`
`-si <int>`	Integer	Minimum read length to use (recommended: 15).	`-si 15`
`-se <int>`	Integer	Maximum read length to use (recommended: 35).	`-se 35`
`-process <int>`	Integer	Number of CPU threads.	`-process 20`
`-exec-id <str>`	String	Unique run identifier. Defines output directory under `./runs/`.	`-exec-id val-SRR1652436`

Optional Parameters

Parameter	Default	Description
`-hash <int>`	15	Velvet hash (k-mer) length for fixed-hash assembly. Try 13 or 17 if default yields poor results.
`-nohostfilter`	off	Skip host genome filtering. Reads go directly to bacterial filtering. Dispenses `-hostgenome`.
`-nononviralprofiles`	off	Skip sRNA profile generation for non-viral contigs. Saves significant time.
`-clean`	off	Remove large intermediate files (SAMs, assembly dirs). Saves 50-80% disk space.
`-largeindex`	off	Use Bowtie `--large-index` for host genomes exceeding 4 GB.
`-degradation`	off	Enable additional 24-30 nt assembly (piRNA-focused). Partially disabled in current code.
`-h`	—	Display help message and exit.

Key Internal Constants (Hardcoded)

Tool	Parameter	Value
Bowtie (mapping)	`-v` / `-k`	1 mismatch / 1 alignment
VelvetOptimiser	hash range	13–19 (auto)
SPAdes (optimized)	`-k`	13,15,17,19
BLASTn	evalue / alignments	1e-5 / 5
Diamond	evalue / sensitivity	0.001 / very-sensitive
Trim Galore	min length	18 nt
sRNA Profiling	size range	18–35 nt
Merge contigs	pid / plen	60% / 60%

Usage Examples

Aedes aegypti Virome Analysis

perl main.pl \
    -fasta /small-rna-metavir/asset/libs/RNPM162_trimmed.fq.fasta \
    -hostgenome /small-rna-metavir/asset/refs/ref_hosts/refs_Aae_Aalb_Agam_culex.fasta \
    -process 20 -si 15 -se 35 -size 20000 \
    -exec-id RNPM162 -prefix RNPM162

Runtime: ~2-6 hours • Disk: 5-20 GB per library

Raw FASTQ with Automatic Trimming

perl main.pl \
    -fastq /data/raw/library_R1.fastq.gz \
    -hostgenome /small-rna-metavir/asset/refs/ref_hosts/refs_Aae_Aalb_Agam_culex.fasta \
    -process 20 -si 15 -se 35 -size 20000 \
    -exec-id raw-lib01 -prefix RAWLIB01

Trim Galore + FastQC run automatically before the pipeline.

No Host Reference (Non-Model Organism)

perl main.pl \
    -fasta /data/nonmodel_trimmed.fasta \
    -nohostfilter \
    -process 20 -si 15 -se 35 -size 20000 \
    -exec-id nonmodel-01 -prefix NM01

Large Library with Disk Optimization

perl main.pl \
    -fasta /data/large_library.fasta \
    -hostgenome /refs/host.fasta \
    -process 40 -si 15 -se 35 -size 20000 \
    -exec-id prod-01 -prefix LARGE01 \
    -clean -nononviralprofiles

-clean saves 50-80% disk; -nononviralprofiles reduces runtime.

SRA Data Re-analysis

# Download from SRA
fasterq-dump SRR1652436 -O /data/sra/ -e 8

# Run the pipeline
perl main.pl \
    -fasta /data/sra/SRR1652436_trimmed.fq.fasta \
    -hostgenome /small-rna-metavir/asset/refs/ref_hosts/refs_Aae_Aalb_Agam_culex.fasta \
    -process 40 -si 15 -se 35 -size 20000 \
    -exec-id val-SRR1652436 -prefix SRR1652436

Interpreting Classification Results

Similarity	ML Class	Interpretation
viral	viral	Confirmed active virus. Known virus with siRNA-like profile.
viral	eve	Possible EVE. Known virus but piRNA-like profile — may be integrated.
nohit	viral	Novel viral candidate! No database match but viral sRNA signature.
nohit	eve	Unknown sequence with EVE-like profile.
nonviral	viral	Non-viral hit but viral profile. Inspect manually.
nonviral	eve	Non-viral / host-derived with EVE-like profile.

Glossary

Molecular Biology & RNA Interference

Dicer / Dicer-2

RNase III enzyme that cleaves viral dsRNA into 21-23 nt siRNA duplexes. The 21 nt peak is a hallmark of Dicer activity against active viruses. Dicer-2 is the specific isoform involved in antiviral RNAi in insects.

dsRNA (Double-Stranded RNA)

RNA with two complementary strands. Viral replication intermediates form dsRNA, triggering the RNAi pathway.

EVE (Endogenous Viral Element)

A viral sequence integrated into the host genome. EVEs are inherited vertically and produce piRNAs (24-30 nt) instead of siRNAs. Distinguishing EVEs from active infections is a key goal of this pipeline.

Ping-Pong Cycle

An amplification loop in the piRNA pathway. PIWI proteins cleave target RNA at position 10, generating a new piRNA with a 10 nt overlap. Computationally detectable as a bias for adenine at position 10.

piRNA (PIWI-Interacting RNA)

Small non-coding RNA, 24-30 nt, that interacts with PIWI-family proteins. Involved in transposon silencing. A dominant piRNA peak suggests the sequence is an EVE.

RISC (RNA-Induced Silencing Complex)

Multiprotein complex that uses siRNA guides to find and degrade complementary viral RNA. The catalytic component is Argonaute-2.

RNAi (RNA Interference)

Conserved antiviral mechanism: viral dsRNA → Dicer-2 → siRNAs → RISC → viral RNA degradation. The biological foundation of this pipeline.

siRNA (Small Interfering RNA)

20-23 nt RNA produced by Dicer. The 21 nt peak is the molecular signature of active viral replication in arthropods.

vsRNA (Virus-Derived Small RNA)

Small RNAs from viral sequences. Can be siRNAs (active virus, Dicer) or piRNAs (EVE, ping-pong). The size profile reveals the origin.

Genomics & Bioinformatics

Assembly (De Novo)

Reconstructing longer sequences (contigs) from short reads without a reference genome. This pipeline uses Velvet, SPAdes, and CAP3.

BLAST (Basic Local Alignment Search Tool)

Algorithm for comparing sequences against databases. BLASTn (nucleotide) is used with evalue 1e-5 against the NCBI nt database.

Bowtie

Fast short-read aligner for mapping reads to references. Used with -v 1 (1 mismatch), -k 1 (1 alignment), -f (FASTA input).

CAP3

Meta-assembler that merges overlapping contigs from different assemblies into a non-redundant set.

Contig

Contiguous sequence assembled from overlapping reads. Contigs ≥200 nt are analyzed by the pipeline.

Diamond

Fast protein aligner (BLASTx alternative). Used with evalue 0.001, very-sensitive mode, for contigs with no nucleotide-level hits.

E-value (Expect Value)

Statistical measure of alignment significance. Lower = better. BLASTn uses 1e-5; Diamond uses 0.001.

Metagenomics

Study of genetic material from environmental samples without culturing. sRNA metagenomics targets the small RNA fraction for viral discovery.

Scaffold

Sequence from SPAdes that may contain gaps bridged by assembly graph information. Merged with Velvet contigs.

Virome

The complete collection of viruses in a sample or organism.

Machine Learning & Statistics

Feature Matrix

Tabular representation: rows = contigs, columns = 48 features (21 sense Z-scores + 21 antisense Z-scores + 6 density/ratio features). Input to the classifier.

Random Forest

Ensemble ML algorithm using 50 decision trees by majority vote. Trained on 2,315 sequences (1,321 viral + 994 EVE). ~92.5% test accuracy.

Similarity Label

Initial contig classification by BLASTn/Diamond: viral, nonviral, or nohit. Distinct from the ML-based class (viral/eve).

Z-Score

Normalized value: (x - mean) / stddev. Makes sRNA profiles comparable across contigs regardless of read depth. Computed for sizes 15-35 nt on both strands.