Automated viral sequence identification through small RNA profiling. Discover novel viruses invisible to BLAST — via the molecular signatures of RNA interference.
small RNA MetaVir is a bioinformatics pipeline designed to detect viral sequences in small RNA sequencing (sRNA-seq) data from arthropods. Unlike conventional approaches that rely exclusively on sequence similarity to known viruses, this tool exploits a fundamental biological mechanism — RNA interference (RNAi) — to identify viral sequences even when they share no detectable similarity with any sequence in public databases.
In arthropods such as mosquitoes, flies, and ticks, the RNAi pathway serves as a primary antiviral defense. When a virus infects an insect cell, the host enzyme Dicer-2 recognizes viral double-stranded RNA (dsRNA) replication intermediates and cleaves them into small interfering RNAs (siRNAs) of a characteristic length — predominantly 21 nucleotides. These virus-derived siRNAs (vsRNAs) then guide the RISC (RNA-Induced Silencing Complex) to degrade the viral genome. This leaves a detectable molecular signature: an enrichment of 21 nt reads mapping to the viral sequence.
By contrast, endogenous viral elements (EVEs) — ancient viral sequences integrated into the host genome — produce a different class of small RNAs called piRNAs (PIWI-interacting RNAs), which are typically 24–30 nt in length and show a characteristic ping-pong amplification signature.
Strong peak at 21 nt. Reads on both sense and antisense strands. Produced by Dicer-2 cleavage of viral dsRNA replication intermediates.
21 nt peak Dicer-2 RISCBroad distribution at 24–30 nt. Ping-pong amplification pattern (A at pos 10). Produced from genomically integrated viral sequences.
24-30 nt PIWI Ping-pongFinds novel “dark matter” viruses with no database matches through their RNAi footprint.
Random Forest trained on 2,315 curated sequences (~92.5% accuracy). 48 features from sRNA size profiles.
Four parallel assemblies (Velvet + SPAdes) merged by CAP3 to maximize contig recovery from short reads.
| Resource | Minimum | Recommended |
|---|---|---|
| CPU Cores | 8 | 20+ |
| RAM | 32 GB | 64 GB |
| Disk Space | 500 GB | 1 TB+ |
The pipeline orchestrates 12+ stages from raw reads to final virus vs. EVE classification. The diagram below shows the complete processing flow, color-coded by stage type.
%%{init:{'theme':'dark','themeVariables':{'fontSize':'13px','primaryColor':'#0d1f35','primaryBorderColor':'#1a3050','lineColor':'#5de4c7'}}}%%
flowchart TD
classDef input fill:#3a4a5a,stroke:#5a6a7a,color:#eaf0f7
classDef filter fill:#1a4a6a,stroke:#2a6a9a,color:#c8e4f7
classDef assembly fill:#1a6a4a,stroke:#2a9a6a,color:#c8f7e4
classDef similarity fill:#8a6a2a,stroke:#baa040,color:#f7eac8
classDef ml fill:#4a3a6a,stroke:#7a5aaa,color:#e0d0f7
classDef output fill:#3a4a5a,stroke:#5de4c7,color:#5de4c7
IN["📄 INPUT\nFASTA / FASTQ reads"]:::input
S0["Step 0: Quality Control\nTrim Galore + FastQC\n(only with -fastq input)"]:::filter
S1["Step 1: Host Filtering\nBowtie vs host genome\n(-v 1, -k 1)"]:::filter
S2["Step 2: Bacterial Filtering\nBowtie vs bacterial genomes\n(--large-index)"]:::filter
S3["Step 3: Size Selection\nfilter_fasta_by_size.py"]:::filter
A1["Velvet Optimiser\nhash auto 13-19"]:::assembly
A2["Velvet Fixed\nhash = 15"]:::assembly
A3["SPAdes\nk = 13,15,17,19"]:::assembly
A4["20-23 nt Assembly\nsiRNA-focused"]:::assembly
CAP["Step 5: CAP3 Meta-Assembly\nMerge + filter ≥200 nt"]:::assembly
BN["Step 6: BLASTn vs nt\nevalue 1e-5\nClassify: viral / non-viral / no-hit"]:::similarity
DM["Step 7: Diamond BLASTx vs nr\nevalue 0.001, very-sensitive\n(no-hit contigs only)"]:::similarity
MAP["Step 8-9: Read Mapping & sRNA Profiling\nBowtie map reads to contigs\nSize profiles 18-35 nt per contig"]:::ml
ZS["Step 10: Z-Score Features\nNormalize profiles\n48 features per contig"]:::ml
RF["Step 11: Random Forest Classifier\n50 trees, 2315 training sequences\nviral vs eve prediction"]:::ml
OUT["🎯 OUTPUT\nviral-eve.csv\nsummary.tsv\nFASTA per class\nsRNA profiles"]:::output
IN --> S0 --> S1 --> S2 --> S3
S3 --> A1 & A2 & A3 & A4
A1 & A2 & A3 & A4 --> CAP
CAP --> BN --> DM
DM --> MAP --> ZS --> RF --> OUT
Blue Filtering & QC (host, bacteria, size) Green Assembly (Velvet, SPAdes, CAP3) Orange Similarity search (BLAST, Diamond) Purple Machine learning (profiling, Z-scores, RF)
| Category | Tools & Versions |
|---|---|
| Languages | Perl 5.36 (pipeline core), Python 3.9 (ML, filtering), R 4.0 (plotting, feature matrix) |
| Quality Control | Trim Galore 0.6.10, FastQC 0.12.1, cutadapt 3.2, FASTX-Toolkit |
| Read Mapping | Bowtie 1.3.0 (short-read aligner), samtools (SAM/BAM processing) |
| De Novo Assembly | Velvet / VelvetOptimiser 2.2.6, SPAdes 3.13.1, CAP3 (meta-assembler) |
| Similarity Search | BLAST+ 2.14.0 (BLASTn), legacy BLAST 2.2.26 (formatdb/blastall), Diamond 2.1.6 (BLASTx) |
| Machine Learning | scikit-learn 1.1.3 (Random Forest, 50 estimators), joblib 1.3.1, pandas 2.0.3 |
| Perl Modules | Bio::SeqIO, Bio::SearchIO, Statistics::Basic, Statistics::RankCorrelation |
| R Packages | ggplot2, reshape2, Rtsne, umap, ComplexHeatmap |
| Containerization | Docker (multi-stage Dockerfile) / Podman |
# 1. Clone the repository
git clone https://github.com/v-rogana/small-rna-metavir.git
cd small-rna-metavir/docker
# 2. Build the Docker image (30-60 min first time)
docker build --target small_rna_metavir -t small_rna_metavir:latest .
# 3. Start the container with data volumes
docker run -it --rm \
-v /path/to/databases:/small-rna-metavir/asset \
-v /path/to/your/reads:/data \
-v /path/to/output:/small-rna-metavir/src/pipeline/runs \
small_rna_metavir:latest bash
# 4. Inside the container
cd /small-rna-metavir/src/pipeline
perl main.pl -h # verify installation
# Build with caching for faster rebuilds
podman build . --target=stage_perl -t srna/perl:v01
podman build . --target=stage_python -t srna/python:v01 --cache-from=localhost/srna/perl:v01
podman build . --target=stage_r -t srna/r:v01 --cache-from=localhost/srna/python:v01
podman build . --target=stage_dependencies -t srna/deps:v01 --cache-from=localhost/srna/r:v01
podman build . --target=small_rna_metavir -t srna/metavir:v01 --cache-from=localhost/srna/deps:v01
# Run with SELinux-compatible mount
podman run -it --rm -v /data:/small-rna-metavir/asset:Z srna/metavir:v01 bash
| Database | Container Path | Size | Setup Command |
|---|---|---|---|
| Host genome | -hostgenome param | 1-3 GB | bowtie-build --threads 20 host.fa host.fa |
| Bacterial genomes | /small-rna-metavir/asset/refs/bacterial_genomes/ | ~5 GB | bowtie-build --large-index --threads 20 all_bacters.fa all_bacters.fa |
| NCBI nt (core_nt) | /small-rna-metavir/asset/blastdb/nt/ | ~150 GB | update_blastdb.pl --decompress nt |
| NCBI nr (Diamond) | /small-rna-metavir/asset/diamond/ | ~60-100 GB | diamond makedb --in nr.gz --db nrcluster_tax --taxonmap ... |
| RF Classifier | /small-rna-metavir/asset/classifier/ | <1 MB | Included in the repository (asset/) |
| Parameter | Type | Description | Example |
|---|---|---|---|
-fasta <path> | Path | Input reads in FASTA format (.fasta, .fa, .gz). Skips Trim Galore/FastQC. | -fasta /data/lib_trimmed.fasta |
-fastq <path> | Path | Input reads in FASTQ format. Runs Trim Galore + FastQC automatically. | -fastq /data/raw.fq.gz |
-hostgenome <path> | Path | Host genome FASTA with pre-built Bowtie index. Not needed with -nohostfilter. | -hostgenome /refs/Aae.fasta |
-prefix <str> | String | Output file prefix for all generated files. | -prefix SRR1652436 |
-size <int> | Integer | Expected target virus genome size in bp. | -size 20000 |
-si <int> | Integer | Minimum read length to use (recommended: 15). | -si 15 |
-se <int> | Integer | Maximum read length to use (recommended: 35). | -se 35 |
-process <int> | Integer | Number of CPU threads. | -process 20 |
-exec-id <str> | String | Unique run identifier. Defines output directory under ./runs/. | -exec-id val-SRR1652436 |
| Parameter | Default | Description |
|---|---|---|
-hash <int> | 15 | Velvet hash (k-mer) length for fixed-hash assembly. Try 13 or 17 if default yields poor results. |
-nohostfilter | off | Skip host genome filtering. Reads go directly to bacterial filtering. Dispenses -hostgenome. |
-nononviralprofiles | off | Skip sRNA profile generation for non-viral contigs. Saves significant time. |
-clean | off | Remove large intermediate files (SAMs, assembly dirs). Saves 50-80% disk space. |
-largeindex | off | Use Bowtie --large-index for host genomes exceeding 4 GB. |
-degradation | off | Enable additional 24-30 nt assembly (piRNA-focused). Partially disabled in current code. |
-h | — | Display help message and exit. |
| Tool | Parameter | Value |
|---|---|---|
| Bowtie (mapping) | -v / -k | 1 mismatch / 1 alignment |
| VelvetOptimiser | hash range | 13–19 (auto) |
| SPAdes (optimized) | -k | 13,15,17,19 |
| BLASTn | evalue / alignments | 1e-5 / 5 |
| Diamond | evalue / sensitivity | 0.001 / very-sensitive |
| Trim Galore | min length | 18 nt |
| sRNA Profiling | size range | 18–35 nt |
| Merge contigs | pid / plen | 60% / 60% |
perl main.pl \
-fasta /small-rna-metavir/asset/libs/RNPM162_trimmed.fq.fasta \
-hostgenome /small-rna-metavir/asset/refs/ref_hosts/refs_Aae_Aalb_Agam_culex.fasta \
-process 20 -si 15 -se 35 -size 20000 \
-exec-id RNPM162 -prefix RNPM162
Runtime: ~2-6 hours • Disk: 5-20 GB per library
perl main.pl \
-fastq /data/raw/library_R1.fastq.gz \
-hostgenome /small-rna-metavir/asset/refs/ref_hosts/refs_Aae_Aalb_Agam_culex.fasta \
-process 20 -si 15 -se 35 -size 20000 \
-exec-id raw-lib01 -prefix RAWLIB01
Trim Galore + FastQC run automatically before the pipeline.
perl main.pl \
-fasta /data/nonmodel_trimmed.fasta \
-nohostfilter \
-process 20 -si 15 -se 35 -size 20000 \
-exec-id nonmodel-01 -prefix NM01
perl main.pl \
-fasta /data/large_library.fasta \
-hostgenome /refs/host.fasta \
-process 40 -si 15 -se 35 -size 20000 \
-exec-id prod-01 -prefix LARGE01 \
-clean -nononviralprofiles
-clean saves 50-80% disk; -nononviralprofiles reduces runtime.
# Download from SRA
fasterq-dump SRR1652436 -O /data/sra/ -e 8
# Run the pipeline
perl main.pl \
-fasta /data/sra/SRR1652436_trimmed.fq.fasta \
-hostgenome /small-rna-metavir/asset/refs/ref_hosts/refs_Aae_Aalb_Agam_culex.fasta \
-process 40 -si 15 -se 35 -size 20000 \
-exec-id val-SRR1652436 -prefix SRR1652436
| Similarity | ML Class | Interpretation |
|---|---|---|
| viral | viral | Confirmed active virus. Known virus with siRNA-like profile. |
| viral | eve | Possible EVE. Known virus but piRNA-like profile — may be integrated. |
| nohit | viral | Novel viral candidate! No database match but viral sRNA signature. |
| nohit | eve | Unknown sequence with EVE-like profile. |
| nonviral | viral | Non-viral hit but viral profile. Inspect manually. |
| nonviral | eve | Non-viral / host-derived with EVE-like profile. |
RNase III enzyme that cleaves viral dsRNA into 21-23 nt siRNA duplexes. The 21 nt peak is a hallmark of Dicer activity against active viruses. Dicer-2 is the specific isoform involved in antiviral RNAi in insects.
RNA with two complementary strands. Viral replication intermediates form dsRNA, triggering the RNAi pathway.
A viral sequence integrated into the host genome. EVEs are inherited vertically and produce piRNAs (24-30 nt) instead of siRNAs. Distinguishing EVEs from active infections is a key goal of this pipeline.
An amplification loop in the piRNA pathway. PIWI proteins cleave target RNA at position 10, generating a new piRNA with a 10 nt overlap. Computationally detectable as a bias for adenine at position 10.
Small non-coding RNA, 24-30 nt, that interacts with PIWI-family proteins. Involved in transposon silencing. A dominant piRNA peak suggests the sequence is an EVE.
Multiprotein complex that uses siRNA guides to find and degrade complementary viral RNA. The catalytic component is Argonaute-2.
Conserved antiviral mechanism: viral dsRNA → Dicer-2 → siRNAs → RISC → viral RNA degradation. The biological foundation of this pipeline.
20-23 nt RNA produced by Dicer. The 21 nt peak is the molecular signature of active viral replication in arthropods.
Small RNAs from viral sequences. Can be siRNAs (active virus, Dicer) or piRNAs (EVE, ping-pong). The size profile reveals the origin.
Reconstructing longer sequences (contigs) from short reads without a reference genome. This pipeline uses Velvet, SPAdes, and CAP3.
Algorithm for comparing sequences against databases. BLASTn (nucleotide) is used with evalue 1e-5 against the NCBI nt database.
Fast short-read aligner for mapping reads to references. Used with -v 1 (1 mismatch), -k 1 (1 alignment), -f (FASTA input).
Meta-assembler that merges overlapping contigs from different assemblies into a non-redundant set.
Contiguous sequence assembled from overlapping reads. Contigs ≥200 nt are analyzed by the pipeline.
Fast protein aligner (BLASTx alternative). Used with evalue 0.001, very-sensitive mode, for contigs with no nucleotide-level hits.
Statistical measure of alignment significance. Lower = better. BLASTn uses 1e-5; Diamond uses 0.001.
Study of genetic material from environmental samples without culturing. sRNA metagenomics targets the small RNA fraction for viral discovery.
Sequence from SPAdes that may contain gaps bridged by assembly graph information. Merged with Velvet contigs.
The complete collection of viruses in a sample or organism.
Tabular representation: rows = contigs, columns = 48 features (21 sense Z-scores + 21 antisense Z-scores + 6 density/ratio features). Input to the classifier.
Ensemble ML algorithm using 50 decision trees by majority vote. Trained on 2,315 sequences (1,321 viral + 994 EVE). ~92.5% test accuracy.
Initial contig classification by BLASTn/Diamond: viral, nonviral, or nohit. Distinct from the ML-based class (viral/eve).
Normalized value: (x - mean) / stddev. Makes sRNA profiles comparable across contigs regardless of read depth. Computed for sizes 15-35 nt on both strands.