🏭 Open-Source Bioinformatics

small RNA MetaVir

Automated viral sequence identification through small RNA profiling. Discover novel viruses that BLAST cannot detect by reading the molecular signatures of RNA interference.

About

small RNA MetaVir is a bioinformatics pipeline designed to detect viral sequences in small RNA sequencing (sRNA-seq) data from arthropods. Unlike conventional approaches that rely exclusively on sequence similarity to known viruses, this tool exploits a fundamental biological mechanism — RNA interference (RNAi) — to identify viral sequences even when they share no detectable similarity with any sequence in public databases.

The Biological Principle: RNAi as an Antiviral Defense

In arthropods such as mosquitoes, flies, and ticks, the RNAi pathway serves as a primary antiviral defense. When a virus infects an insect cell, the host enzyme Dicer-2 recognizes viral double-stranded RNA (dsRNA) replication intermediates and cleaves them into small interfering RNAs (siRNAs) of a characteristic length — predominantly 21 nucleotides. These virus-derived siRNAs (vsRNAs) then guide the RISC (RNA-Induced Silencing Complex) to degrade the viral genome. This leaves a detectable molecular signature: an enrichment of 21 nt reads mapping to the viral sequence.

By contrast, endogenous viral elements (EVEs) — ancient viral sequences integrated into the host genome — produce a different class of small RNAs called piRNAs (PIWI-interacting RNAs), which are typically 24–30 nt in length and show a characteristic ping-pong amplification signature.
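The size contrast described above can be checked with a few lines of Python. This is a minimal sketch, not the pipeline's own profiling code: the function names `length_profile` and `fraction_in_range` and the toy reads are assumptions made for illustration.

```python
from collections import Counter

def length_profile(reads):
    """Count reads per length (in nt)."""
    return Counter(len(r) for r in reads)

def fraction_in_range(profile, lo, hi):
    """Fraction of all reads whose length lies in [lo, hi] (inclusive)."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    in_range = sum(n for length, n in profile.items() if lo <= length <= hi)
    return in_range / total

# Toy reads (hypothetical): six 21-mers, two 27-mers, two 19-mers
reads = ["A" * 21] * 6 + ["C" * 27] * 2 + ["G" * 19] * 2
profile = length_profile(reads)
sirna_fraction = fraction_in_range(profile, 21, 21)  # share of 21 nt reads
pirna_fraction = fraction_in_range(profile, 24, 30)  # share of 24-30 nt reads
```

A contig dominated by 21 nt reads points toward the siRNA signature, while a broad 24-30 nt share points toward the piRNA signature. The pipeline itself bases this decision on a trained classifier rather than on a fixed cutoff.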

🦠 Active Virus — siRNA Signature

Strong peak at 21 nt. Reads on both sense and antisense strands. Produced by Dicer-2 cleavage of viral dsRNA replication intermediates.

Tags: 21 nt peak, Dicer-2, RISC

🧬 EVE — piRNA Signature

Broad distribution at 24–30 nt. Ping-pong amplification pattern (A at pos 10). Produced from genomically integrated viral sequences.

Tags: 24-30 nt, PIWI, Ping-pong

Key Features

Similarity-Independent

Finds novel “dark matter” viruses with no database matches through their RNAi footprint.

Virus vs. EVE Classifier

Random Forest (50 trees) trained on 2,315 curated sequences (1,321 viral, 994 EVE) with ~92.5% test accuracy. Uses 48 features derived from sRNA size profiles.

Multi-Assembler Strategy

Four parallel assemblies (Velvet + SPAdes) merged by CAP3 to maximize contig recovery from short reads.

Hardware Requirements

| Resource | Minimum | Recommended |
| --- | --- | --- |
| CPU Cores | 8 | 20+ |
| RAM | 32 GB | 64 GB |
| Disk Space | 500 GB | 1 TB+ |

Pipeline Architecture

The pipeline orchestrates 12+ stages from raw reads to final virus vs. EVE classification. The diagram below shows the complete processing flow, color-coded by stage type.

%%{init:{'theme':'dark','themeVariables':{'fontSize':'13px','primaryColor':'#0d1f35','primaryBorderColor':'#1a3050','lineColor':'#5de4c7'}}}%%
flowchart TD
    classDef input fill:#3a4a5a,stroke:#5a6a7a,color:#eaf0f7
    classDef filter fill:#1a4a6a,stroke:#2a6a9a,color:#c8e4f7
    classDef assembly fill:#1a6a4a,stroke:#2a9a6a,color:#c8f7e4
    classDef similarity fill:#8a6a2a,stroke:#baa040,color:#f7eac8
    classDef ml fill:#4a3a6a,stroke:#7a5aaa,color:#e0d0f7
    classDef output fill:#3a4a5a,stroke:#5de4c7,color:#5de4c7

    IN["📄 INPUT\nFASTA / FASTQ reads"]:::input

    S0["Step 0: Quality Control\nTrim Galore + FastQC\n(only with -fastq input)"]:::filter
    S1["Step 1: Host Filtering\nBowtie vs host genome\n(-v 1, -k 1)"]:::filter
    S2["Step 2: Bacterial Filtering\nBowtie vs bacterial genomes\n(--large-index)"]:::filter
    S3["Step 3: Size Selection\nfilter_fasta_by_size.py"]:::filter

    A1["Velvet Optimiser\nhash auto 13-19"]:::assembly
    A2["Velvet Fixed\nhash = 15"]:::assembly
    A3["SPAdes\nk = 13,15,17,19"]:::assembly
    A4["20-23 nt Assembly\nsiRNA-focused"]:::assembly
    CAP["Step 5: CAP3 Meta-Assembly\nMerge + filter ≥200 nt"]:::assembly

    BN["Step 6: BLASTn vs nt\nevalue 1e-5\nClassify: viral / non-viral / no-hit"]:::similarity
    DM["Step 7: Diamond BLASTx vs nr\nevalue 0.001, very-sensitive\n(no-hit contigs only)"]:::similarity

    MAP["Step 8-9: Read Mapping & sRNA Profiling\nBowtie map reads to contigs\nSize profiles 18-35 nt per contig"]:::ml
    ZS["Step 10: Z-Score Features\nNormalize profiles\n48 features per contig"]:::ml
    RF["Step 11: Random Forest Classifier\n50 trees, 2315 training sequences\nviral vs eve prediction"]:::ml

    OUT["🎯 OUTPUT\nviral-eve.csv\nsummary.tsv\nFASTA per class\nsRNA profiles"]:::output

    IN --> S0 --> S1 --> S2 --> S3
    S3 --> A1 & A2 & A3 & A4
    A1 & A2 & A3 & A4 --> CAP
    CAP --> BN --> DM
    DM --> MAP --> ZS --> RF --> OUT

Stage Color Legend

Blue: Filtering & QC (host, bacteria, size)
Green: Assembly (Velvet, SPAdes, CAP3)
Orange: Similarity search (BLAST, Diamond)
Purple: Machine learning (profiling, Z-scores, RF)

Technology Stack

| Category | Tools & Versions |
| --- | --- |
| Languages | Perl 5.36 (pipeline core), Python 3.9 (ML, filtering), R 4.0 (plotting, feature matrix) |
| Quality Control | Trim Galore 0.6.10, FastQC 0.12.1, cutadapt 3.2, FASTX-Toolkit |
| Read Mapping | Bowtie 1.3.0 (short-read aligner), samtools (SAM/BAM processing) |
| De Novo Assembly | Velvet / VelvetOptimiser 2.2.6, SPAdes 3.13.1, CAP3 (meta-assembler) |
| Similarity Search | BLAST+ 2.14.0 (BLASTn), legacy BLAST 2.2.26 (formatdb/blastall), Diamond 2.1.6 (BLASTx) |
| Machine Learning | scikit-learn 1.1.3 (Random Forest, 50 estimators), joblib 1.3.1, pandas 2.0.3 |
| Perl Modules | Bio::SeqIO, Bio::SearchIO, Statistics::Basic, Statistics::RankCorrelation |
| R Packages | ggplot2, reshape2, Rtsne, umap, ComplexHeatmap |
| Containerization | Docker (multi-stage Dockerfile) / Podman |

Installation

Docker (Recommended)

# 1. Clone the repository
git clone https://github.com/v-rogana/small-rna-metavir.git
cd small-rna-metavir/docker

# 2. Build the Docker image (30-60 min first time)
docker build --target small_rna_metavir -t small_rna_metavir:latest .

# 3. Start the container with data volumes
docker run -it --rm \
    -v /path/to/databases:/small-rna-metavir/asset \
    -v /path/to/your/reads:/data \
    -v /path/to/output:/small-rna-metavir/src/pipeline/runs \
    small_rna_metavir:latest bash

# 4. Inside the container
cd /small-rna-metavir/src/pipeline
perl main.pl -h   # verify installation

Podman (Staged Builds)

# Build with caching for faster rebuilds
podman build . --target=stage_perl         -t srna/perl:v01
podman build . --target=stage_python       -t srna/python:v01  --cache-from=localhost/srna/perl:v01
podman build . --target=stage_r            -t srna/r:v01       --cache-from=localhost/srna/python:v01
podman build . --target=stage_dependencies -t srna/deps:v01    --cache-from=localhost/srna/r:v01
podman build . --target=small_rna_metavir  -t srna/metavir:v01 --cache-from=localhost/srna/deps:v01

# Run with SELinux-compatible mount
podman run -it --rm -v /data:/small-rna-metavir/asset:Z srna/metavir:v01 bash

Required Databases

| Database | Container Path | Size | Setup Command |
| --- | --- | --- | --- |
| Host genome | -hostgenome parameter | 1-3 GB | bowtie-build --threads 20 host.fa host.fa |
| Bacterial genomes | /small-rna-metavir/asset/refs/bacterial_genomes/ | ~5 GB | bowtie-build --large-index --threads 20 all_bacters.fa all_bacters.fa |
| NCBI nt (core_nt) | /small-rna-metavir/asset/blastdb/nt/ | ~150 GB | update_blastdb.pl --decompress nt |
| NCBI nr (Diamond) | /small-rna-metavir/asset/diamond/ | ~60-100 GB | diamond makedb --in nr.gz --db nrcluster_tax --taxonmap ... |
| RF Classifier | /small-rna-metavir/asset/classifier/ | <1 MB | Included in the repository (asset/) |
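Before launching a long run, it can help to confirm that the Bowtie indexes built with the commands above are complete. The sketch below assumes Bowtie 1's usual index naming (six files ending in `.ebwt`, or `.ebwtl` for large indexes); the helper name `missing_index_files` is hypothetical and not part of the pipeline.

```python
from pathlib import Path

# Suffixes Bowtie 1 appends to the index basename
INDEX_SUFFIXES = [".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
                  ".rev.1.ebwt", ".rev.2.ebwt"]

def missing_index_files(basename, large=False):
    """Return the expected Bowtie 1 index files that do not exist.

    basename is the index name given to bowtie-build (e.g. 'host.fa').
    With large=True, the '.ebwtl' extension of --large-index builds is expected.
    """
    extra = "l" if large else ""
    expected = [Path(f"{basename}{suffix}{extra}") for suffix in INDEX_SUFFIXES]
    return [p for p in expected if not p.exists()]
```

For example, `missing_index_files("/refs/host.fa")` returns an empty list when the host index is complete, and `missing_index_files("all_bacters.fa", large=True)` checks the bacterial index built with `--large-index`.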

Parameters Reference

Required Parameters

| Parameter | Type | Description | Example |
| --- | --- | --- | --- |
| -fasta <path> | Path | Input reads in FASTA format (.fasta, .fa, .gz). Skips Trim Galore/FastQC. | -fasta /data/lib_trimmed.fasta |
| -fastq <path> | Path | Input reads in FASTQ format. Runs Trim Galore + FastQC automatically. | -fastq /data/raw.fq.gz |
| -hostgenome <path> | Path | Host genome FASTA with pre-built Bowtie index. Not needed with -nohostfilter. | -hostgenome /refs/Aae.fasta |
| -prefix <str> | String | Output file prefix for all generated files. | -prefix SRR1652436 |
| -size <int> | Integer | Expected target virus genome size in bp. | -size 20000 |
| -si <int> | Integer | Minimum read length to use (recommended: 15). | -si 15 |
| -se <int> | Integer | Maximum read length to use (recommended: 35). | -se 35 |
| -process <int> | Integer | Number of CPU threads. | -process 20 |
| -exec-id <str> | String | Unique run identifier. Defines output directory under ./runs/. | -exec-id val-SRR1652436 |
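The -si and -se bounds amount to a length filter on the reads. The following is a minimal sketch of that idea, not the repository's `filter_fasta_by_size.py`: the function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def select_by_size(records, min_len=15, max_len=35):
    """Keep records whose sequence length lies in [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]

# Toy input (hypothetical reads)
fasta = [">r1", "ACGTACGTACGTACGTACGTA",  # 21 nt, kept
         ">r2", "ACGTACGTAC",             # 10 nt, dropped
         ">r3", "A" * 40]                 # 40 nt, dropped
kept = select_by_size(read_fasta(fasta))
```

With the recommended bounds of 15 and 35, only `r1` survives in this toy input.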

Optional Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| -hash <int> | 15 | Velvet hash (k-mer) length for fixed-hash assembly. Try 13 or 17 if the default yields poor results. |
| -nohostfilter | off | Skip host genome filtering. Reads go directly to bacterial filtering. Makes -hostgenome unnecessary. |
| -nononviralprofiles | off | Skip sRNA profile generation for non-viral contigs. Saves significant time. |
| -clean | off | Remove large intermediate files (SAMs, assembly dirs). Saves 50-80% disk space. |
| -largeindex | off | Use Bowtie --large-index for host genomes exceeding 4 GB. |
| -degradation | off | Enable additional 24-30 nt assembly (piRNA-focused). Partially disabled in current code. |
| -h | | Display help message and exit. |
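When many libraries are processed, the command line can be assembled programmatically. The sketch below is an illustration only: `build_command` is a hypothetical helper, and its checks encode just the constraints stated in the parameter tables (one input file, a host genome unless -nohostfilter is given).

```python
def build_command(reads, prefix, exec_id, hostgenome=None, fastq=False,
                  size=20000, si=15, se=35, process=20, flags=()):
    """Assemble the argument list for `perl main.pl` from the documented options."""
    if hostgenome is None and "-nohostfilter" not in flags:
        raise ValueError("give a host genome or pass the -nohostfilter flag")
    if si > se:
        raise ValueError("-si must not exceed -se")
    cmd = ["perl", "main.pl", "-fastq" if fastq else "-fasta", reads]
    if hostgenome is not None:
        cmd += ["-hostgenome", hostgenome]
    cmd += ["-prefix", prefix, "-size", str(size), "-si", str(si),
            "-se", str(se), "-process", str(process), "-exec-id", exec_id]
    cmd += list(flags)  # boolean switches such as -clean go last
    return cmd

# Non-model organism run without a host reference
cmd = build_command("/data/nonmodel_trimmed.fasta", "NM01", "nonmodel-01",
                    flags=("-nohostfilter",))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from `/small-rna-metavir/src/pipeline` inside the container.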

Key Internal Constants (Hardcoded)

| Tool | Parameter | Value |
| --- | --- | --- |
| Bowtie (mapping) | -v / -k | 1 mismatch / 1 alignment |
| VelvetOptimiser | hash range | 13–19 (auto) |
| SPAdes (optimized) | -k | 13,15,17,19 |
| BLASTn | evalue / alignments | 1e-5 / 5 |
| Diamond | evalue / sensitivity | 0.001 / very-sensitive |
| Trim Galore | min length | 18 nt |
| sRNA Profiling | size range | 18–35 nt |
| Merge contigs | pid / plen | 60% / 60% |

Usage Examples

Aedes aegypti Virome Analysis

perl main.pl \
    -fasta /small-rna-metavir/asset/libs/RNPM162_trimmed.fq.fasta \
    -hostgenome /small-rna-metavir/asset/refs/ref_hosts/refs_Aae_Aalb_Agam_culex.fasta \
    -process 20 -si 15 -se 35 -size 20000 \
    -exec-id RNPM162 -prefix RNPM162

Runtime: ~2-6 hours • Disk: 5-20 GB per library

Raw FASTQ with Automatic Trimming

perl main.pl \
    -fastq /data/raw/library_R1.fastq.gz \
    -hostgenome /small-rna-metavir/asset/refs/ref_hosts/refs_Aae_Aalb_Agam_culex.fasta \
    -process 20 -si 15 -se 35 -size 20000 \
    -exec-id raw-lib01 -prefix RAWLIB01

Trim Galore + FastQC run automatically before the pipeline.

No Host Reference (Non-Model Organism)

perl main.pl \
    -fasta /data/nonmodel_trimmed.fasta \
    -nohostfilter \
    -process 20 -si 15 -se 35 -size 20000 \
    -exec-id nonmodel-01 -prefix NM01

Large Library with Disk Optimization

perl main.pl \
    -fasta /data/large_library.fasta \
    -hostgenome /refs/host.fasta \
    -process 40 -si 15 -se 35 -size 20000 \
    -exec-id prod-01 -prefix LARGE01 \
    -clean -nononviralprofiles

-clean saves 50-80% disk; -nononviralprofiles reduces runtime.

SRA Data Re-analysis

# Download from SRA
fasterq-dump SRR1652436 -O /data/sra/ -e 8

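# Run the pipeline on the adapter-trimmed FASTA derived from the download
# (alternatively, pass the raw FASTQ with -fastq to let Trim Galore trim it)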
perl main.pl \
    -fasta /data/sra/SRR1652436_trimmed.fq.fasta \
    -hostgenome /small-rna-metavir/asset/refs/ref_hosts/refs_Aae_Aalb_Agam_culex.fasta \
    -process 40 -si 15 -se 35 -size 20000 \
    -exec-id val-SRR1652436 -prefix SRR1652436
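fasterq-dump writes FASTQ, whereas the command above passes a FASTA file. If the reads are already adapter-trimmed, a conversion like the one below bridges the two. This is a hedged sketch with a hypothetical helper name; it does not trim adapters, so for untrimmed reads the -fastq option is the safer route.

```python
def fastq_to_fasta(fastq_lines):
    """Convert four-line FASTQ records to FASTA lines, dropping quality strings."""
    lines = [line.rstrip("\n") for line in fastq_lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    fasta = []
    for i in range(0, len(lines), 4):
        header, seq, plus = lines[i], lines[i + 1], lines[i + 2]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at non-blank line {i + 1}")
        fasta.append(">" + header[1:])  # '@id' becomes '>id'
        fasta.append(seq)
    return fasta

# Toy records (hypothetical read names)
fastq = ["@read1", "ACGTACGTACGTACGTACGTA", "+", "I" * 21,
         "@read2", "ACGTA", "+", "IIIII"]
fasta = fastq_to_fasta(fastq)
```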

Interpreting Classification Results

| Similarity | ML Class | Interpretation |
| --- | --- | --- |
| viral | viral | Confirmed active virus. Known virus with siRNA-like profile. |
| viral | eve | Possible EVE. Known virus but piRNA-like profile; may be integrated. |
| nohit | viral | Novel viral candidate! No database match but viral sRNA signature. |
| nohit | eve | Unknown sequence with EVE-like profile. |
| nonviral | viral | Non-viral hit but viral profile. Inspect manually. |
| nonviral | eve | Non-viral / host-derived with EVE-like profile. |
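The table above can be turned into a small lookup for triaging results. This is an illustrative sketch: the row keys (`contig`, `similarity`, `ml_class`) are hypothetical, and the actual column names in viral-eve.csv may differ.

```python
# (similarity label, ML class) -> short reading, transcribed from the table above
INTERPRETATION = {
    ("viral", "viral"): "Confirmed active virus",
    ("viral", "eve"): "Possible EVE",
    ("nohit", "viral"): "Novel viral candidate",
    ("nohit", "eve"): "Unknown sequence with EVE-like profile",
    ("nonviral", "viral"): "Non-viral hit with viral profile; inspect manually",
    ("nonviral", "eve"): "Non-viral or host-derived with EVE-like profile",
}

def interpret(similarity, ml_class):
    """Look up the reading for one contig; labels are matched case-insensitively."""
    key = (similarity.strip().lower(), ml_class.strip().lower())
    if key not in INTERPRETATION:
        raise ValueError(f"unexpected label pair: {key}")
    return INTERPRETATION[key]

# Hypothetical rows standing in for parsed pipeline output
rows = [
    {"contig": "c1", "similarity": "viral", "ml_class": "viral"},
    {"contig": "c2", "similarity": "nohit", "ml_class": "viral"},
    {"contig": "c3", "similarity": "nonviral", "ml_class": "eve"},
]
novel = [r["contig"] for r in rows
         if interpret(r["similarity"], r["ml_class"]) == "Novel viral candidate"]
```

Filtering on the nohit/viral pair is the quickest way to list the candidates that similarity searches alone would have missed.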

Glossary

Molecular Biology & RNA Interference

Dicer / Dicer-2

RNase III enzyme that cleaves viral dsRNA into 21-23 nt siRNA duplexes. The 21 nt peak is a hallmark of Dicer activity against active viruses. Dicer-2 is the Dicer paralog responsible for antiviral RNAi in insects (Dicer-1 mainly processes microRNAs).

dsRNA (Double-Stranded RNA)

RNA with two complementary strands. Viral replication intermediates form dsRNA, triggering the RNAi pathway.

EVE (Endogenous Viral Element)

A viral sequence integrated into the host genome. EVEs are inherited vertically and produce piRNAs (24-30 nt) instead of siRNAs. Distinguishing EVEs from active infections is a key goal of this pipeline.

Ping-Pong Cycle

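An amplification loop in the piRNA pathway. A PIWI protein loaded with a guide piRNA cleaves the complementary target RNA opposite guide positions 10 and 11, so the 5' end of the new piRNA overlaps the 5' end of the guide by 10 nt. Computationally detectable as an excess of opposite-strand read pairs with a 10 nt 5' overlap, together with a bias for adenine at position 10.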
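Both ping-pong indicators can be computed from mapped read coordinates. The sketch below is one possible formulation, not the pipeline's implementation: it assumes 1-based inclusive coordinates on a single contig, ignores read lengths when pairing reads, and uses hypothetical function names.

```python
from collections import Counter

def five_prime_overlaps(sense_starts, antisense_ends, max_overlap=20):
    """Histogram of 5'-to-5' overlap lengths between opposite-strand reads.

    sense_starts: leftmost coordinate (5' end) of each sense read.
    antisense_ends: rightmost coordinate (5' end) of each antisense read.
    """
    antisense = Counter(antisense_ends)
    histogram = Counter()
    for start, n_sense in Counter(sense_starts).items():
        for overlap in range(1, max_overlap + 1):
            partner_end = start + overlap - 1
            if partner_end in antisense:
                histogram[overlap] += n_sense * antisense[partner_end]
    return histogram

def has_a_at_10(read):
    """True if the read (written 5' to 3') carries adenine at position 10."""
    return len(read) >= 10 and read[9] in "Aa"

# Toy coordinates (hypothetical): two sense reads at 100 pair with two
# antisense reads ending at 109, giving four pairs with a 10 nt overlap
hist = five_prime_overlaps([100, 100, 200], [109, 109, 205])
```

A pronounced peak at overlap 10 relative to neighbouring overlap lengths is the pattern expected from ping-pong amplification.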

piRNA (PIWI-Interacting RNA)

Small non-coding RNA, 24-30 nt, that interacts with PIWI-family proteins. Involved in transposon silencing. A dominant piRNA peak suggests the sequence is an EVE.

RISC (RNA-Induced Silencing Complex)

Multiprotein complex that uses siRNA guides to find and degrade complementary viral RNA. The catalytic component is Argonaute-2.

RNAi (RNA Interference)

Conserved antiviral mechanism: viral dsRNA → Dicer-2 → siRNAs → RISC → viral RNA degradation. The biological foundation of this pipeline.

siRNA (Small Interfering RNA)

20-23 nt RNA produced by Dicer. The 21 nt peak is the molecular signature of active viral replication in arthropods.

vsRNA (Virus-Derived Small RNA)

Small RNAs from viral sequences. Can be siRNAs (active virus, Dicer) or piRNAs (EVE, ping-pong). The size profile reveals the origin.

Genomics & Bioinformatics

Assembly (De Novo)

Reconstructing longer sequences (contigs) from short reads without a reference genome. This pipeline uses Velvet, SPAdes, and CAP3.

BLAST (Basic Local Alignment Search Tool)

Algorithm for comparing sequences against databases. BLASTn (nucleotide) is used with evalue 1e-5 against the NCBI nt database.

Bowtie

Fast short-read aligner for mapping reads to references. Used with -v 1 (1 mismatch), -k 1 (1 alignment), -f (FASTA input).

CAP3

Meta-assembler that merges overlapping contigs from different assemblies into a non-redundant set.

Contig

Contiguous sequence assembled from overlapping reads. Contigs ≥200 nt are analyzed by the pipeline.

Diamond

Fast protein aligner (BLASTx alternative). Used with evalue 0.001, very-sensitive mode, for contigs with no nucleotide-level hits.

E-value (Expect Value)

Statistical measure of alignment significance. Lower = better. BLASTn uses 1e-5; Diamond uses 0.001.
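The two thresholds can be applied to tabular search output as follows. This sketch assumes the standard 12-column tabular layout (BLAST+ `-outfmt 6`, also Diamond's default tabular output), where the e-value is the 11th column; the pipeline itself may parse a different format, and the hit lines below are invented for illustration.

```python
def filter_hits(lines, max_evalue):
    """Keep tab-separated hits whose e-value (column 11) is <= max_evalue."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            kept.append(fields)
    return kept

# Hypothetical hits: qseqid, sseqid, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore
hits = [
    "contig1\tsubjectA\t98.5\t400\t6\t0\t1\t400\t120\t519\t1e-150\t700",
    "contig2\tsubjectB\t85.0\t40\t6\t0\t1\t40\t10\t49\t5e-4\t35.1",
]
strict = filter_hits(hits, 1e-5)    # BLASTn threshold
relaxed = filter_hits(hits, 0.001)  # Diamond threshold
```

Here `contig2` passes only the more permissive 0.001 cutoff, which shows how the looser protein-level threshold admits weaker matches.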

Metagenomics

Study of genetic material from environmental samples without culturing. sRNA metagenomics targets the small RNA fraction for viral discovery.

Scaffold

Sequence from SPAdes that may contain gaps bridged by assembly graph information. Merged with Velvet contigs.

Virome

The complete collection of viruses in a sample or organism.

Machine Learning & Statistics

Feature Matrix

Tabular representation: rows = contigs, columns = 48 features (21 sense Z-scores + 21 antisense Z-scores + 6 density/ratio features). Input to the classifier.

Random Forest

Ensemble ML algorithm using 50 decision trees by majority vote. Trained on 2,315 sequences (1,321 viral + 994 EVE). ~92.5% test accuracy.

Similarity Label

Initial contig classification by BLASTn/Diamond: viral, nonviral, or nohit. Distinct from the ML-based class (viral/eve).

Z-Score

Normalized value: (x - mean) / stddev. Makes sRNA profiles comparable across contigs regardless of read depth. Computed for sizes 15-35 nt on both strands.
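A worked sketch of the normalization, under stated assumptions: the population standard deviation is used (the pipeline may use the sample form), sizes absent from a profile count as zero, and the function name and toy counts are hypothetical.

```python
from statistics import mean, pstdev

SIZES = range(15, 36)  # 15-35 nt inclusive, i.e. 21 size classes

def zscore_profile(counts):
    """Z-score a {size: read_count} profile over SIZES."""
    values = [counts.get(size, 0) for size in SIZES]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # flat profile: no size class stands out
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]

# Toy per-strand profiles for one contig (hypothetical counts)
sense = {20: 60, 21: 900, 22: 40}
antisense = {20: 80, 21: 850, 22: 70}
features = zscore_profile(sense) + zscore_profile(antisense)  # 42 values
```

Multiplying every count by the same factor leaves the Z-scores unchanged, which is why profiles from deeply and shallowly covered contigs become comparable. The 42 values here correspond to the sense and antisense Z-score features; the remaining density/ratio features are not reproduced.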
