# deeparg

**Repository Path**: nickkid/deeparg

## Basic Information

- **Project Name**: deeparg
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-07
- **Last Updated**: 2026-05-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# DeepARG

DeepARG predicts Antibiotic Resistance Genes (ARGs) from metagenomic sequences. This repository now includes a modern PyTorch/Transformers runtime for the original deepARG-SS and deepARG-LS models.

The default runtime uses the Hugging Face model bundle:

    gaarangoa/deeparg

That bundle contains:

* deepARG-SS weights and config
* deepARG-LS weights and config
* DeepARG v2 DIAMOND databases
* DIAMOND 2.1.24 binaries for Linux x86_64 and Linux aarch64
* Greengenes gg13 Bowtie2 index for optional 16S normalization
* source files for the modern pipeline

## Quick Start

Install `uv` if needed:

    curl -LsSf https://astral.sh/uv/install.sh | sh

Create and sync the environment from this repository:

    uv sync
    source .venv/bin/activate

Run DeepARG with minimal parameters:

    deeparg predict \
        --model LS \
        -i ./test/ORFs.fa \
        -o ./test/X

For short reads use `SS`:

    deeparg predict \
        --model SS \
        -i ./test/ORFs.fa \
        -o ./test/X

Run the modern short reads pipeline on an already-clean FASTA:

    deeparg short_reads_pipeline \
        --input-fasta ./test/ORFs.fa \
        --output-file ./test/reads

The first run downloads the model, database, and matching DIAMOND binary from Hugging Face into the local Hugging Face cache.
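To see where that first-run download is likely to land, the sketch below computes the conventional Hugging Face hub cache directory for the bundle. It assumes DeepARG relies on the default `huggingface_hub` cache behavior (the `HF_HUB_CACHE`, `HF_HOME`, and `XDG_CACHE_HOME` environment variables and the `models--ORG--NAME` folder layout); the helper names `hf_hub_cache` and `repo_cache_dir` are hypothetical and not part of the DeepARG API.

```python
import os
from pathlib import Path


def hf_hub_cache(env=None):
    """Return the hub cache directory implied by the environment.

    Assumption: mirrors the default lookup order used by huggingface_hub
    (HF_HUB_CACHE, then HF_HOME/hub, then XDG_CACHE_HOME or ~/.cache).
    """
    env = os.environ if env is None else env
    explicit = env.get("HF_HUB_CACHE")
    if explicit:
        return Path(explicit).expanduser()
    hf_home = env.get("HF_HOME")
    if hf_home:
        return Path(hf_home).expanduser() / "hub"
    xdg = env.get("XDG_CACHE_HOME")
    base = Path(xdg).expanduser() if xdg else Path.home() / ".cache"
    return base / "huggingface" / "hub"


def repo_cache_dir(repo_id="gaarangoa/deeparg", env=None):
    """Return the expected cache folder for a model repo (hypothetical helper)."""
    return hf_hub_cache(env) / ("models--" + repo_id.replace("/", "--"))


if __name__ == "__main__":
    # Prints the directory to inspect (or delete) to force a fresh download.
    print(repo_cache_dir())
```

If the printed directory does not exist after a successful run, the runtime is probably using a non-default cache location, and this sketch does not apply.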
## Python API

    import deeparg

    outputs = deeparg.predict(
        input_file="./test/ORFs.fa",
        output_file="./test/X",
        model="LS",
    )

    print(outputs["arg"])
    print(outputs["potential_arg"])

The returned paths are:

* `.mapping.ARG`
* `.mapping.potential.ARG`
* `.align.daa.tsv`

Short reads pipeline API:

    result = deeparg.short_reads_pipeline(
        input_fasta="./test/ORFs.fa",
        output_file="./test/reads",
    )

    print(result.arg_quant_file)

## What Gets Downloaded

By default, the modern pipeline loads everything from Hugging Face:

* model repo: `gaarangoa/deeparg`
* model subfolder: `LS` or `SS`, selected from `--model`
* database: `/database/features.dmnd`
* gene lengths: `/database/features.gene.length` when needed
* 16S normalization index: `gg13/dataset.*`
* DIAMOND binary:
  * `bin/linux-x86_64/diamond`
  * `bin/linux-aarch64/diamond`

DIAMOND resolution order:

1. `--diamond-path`
2. `diamond` on `PATH`
3. `/bin/diamond`
4. bundled DIAMOND from `gaarangoa/deeparg`

If you already have a native DIAMOND installed, DeepARG will use it. Otherwise, on Linux x86_64 or Linux aarch64, DeepARG downloads and uses the bundled binary.
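The resolution order above can be sketched as a small lookup function. This is an illustration of the documented order only, not the code DeepARG actually runs: `resolve_diamond` and `bundled_diamond_relpath` are hypothetical names, the third entry is taken literally as written in the list, and the bundled result is returned as a bundle-relative path rather than a downloaded file.

```python
import os
import platform
import shutil

# Bundle-relative DIAMOND paths listed in this README.
BUNDLED_DIAMOND = {
    ("Linux", "x86_64"): "bin/linux-x86_64/diamond",
    ("Linux", "aarch64"): "bin/linux-aarch64/diamond",
}


def bundled_diamond_relpath(system=None, machine=None):
    """Return the bundle-relative binary path for a platform, or None."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return BUNDLED_DIAMOND.get((system, machine))


def resolve_diamond(diamond_path=None, local_path="/bin/diamond",
                    system=None, machine=None, which=shutil.which):
    """Return (source, path) following the documented resolution order.

    Hypothetical helper: `which` is injectable so the PATH lookup can be
    replaced in tests.
    """
    # 1. An explicit --diamond-path always wins.
    if diamond_path:
        return ("--diamond-path", diamond_path)
    # 2. A native `diamond` found on PATH.
    on_path = which("diamond")
    if on_path:
        return ("PATH", on_path)
    # 3. The fixed local path from the list, if it is an executable file.
    if os.path.isfile(local_path) and os.access(local_path, os.X_OK):
        return ("local", local_path)
    # 4. The bundled binary, available only on the two Linux platforms.
    relpath = bundled_diamond_relpath(system, machine)
    if relpath is None:
        raise RuntimeError("no DIAMOND found and no bundled binary for this platform")
    return ("bundle", relpath)


if __name__ == "__main__":
    try:
        print(resolve_diamond())
    except RuntimeError as err:
        print(err)
```

On platforms other than Linux x86_64 and Linux aarch64 the last step has nothing to fall back on, which is why installing DIAMOND natively or passing `--diamond-path` is needed there.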
## Common Commands

Use nucleotide input:

    deeparg predict \
        --model LS \
        --type nucl \
        -i input.fasta \
        -o output/sample

Use protein input:

    deeparg predict \
        --model LS \
        --type prot \
        -i proteins.faa \
        -o output/sample

Use a specific DIAMOND binary:

    deeparg predict \
        --model LS \
        -i input.fasta \
        -o output/sample \
        --diamond-path /path/to/diamond

Use a local DeepARG data directory instead of the HF database:

    deeparg predict \
        --model LS \
        -i input.fasta \
        -o output/sample \
        --data-path /path/to/deeparg-data

Use another Hugging Face repo or local `save_pretrained` folder:

    deeparg predict \
        --model LS \
        -i input.fasta \
        -o output/sample \
        --hf-model-path USER_OR_ORG/deeparg

Force CPU:

    deeparg predict \
        --model LS \
        -i input.fasta \
        -o output/sample \
        --hf-device cpu

## Short Reads Pipeline

The modern short reads pipeline keeps the legacy workflow but runs DeepARG through the PyTorch/Transformers API:

1. optional paired-end trimming with Trimmomatic
2. optional paired-end merge with vsearch
3. DeepARG-SS prediction
4. ARG interval merge and quantification in Python
5. optional 16S normalization with Bowtie2 and samtools

Minimal run from an already-clean FASTA:

    deeparg short_reads_pipeline \
        --input-fasta input.clean.fasta \
        --output-file output/sample

Run from paired FASTQ files:

    deeparg short_reads_pipeline \
        --forward-pe-file sample_R1.fastq.gz \
        --reverse-pe-file sample_R2.fastq.gz \
        --output-file output/sample

Enable 16S normalization. The `gg13` Bowtie2 index is downloaded from `gaarangoa/deeparg` automatically:

    deeparg short_reads_pipeline \
        --input-fasta input.clean.fasta \
        --output-file output/sample \
        --normalize-16s

Raw paired FASTQ mode requires `trimmomatic` and `vsearch` on `PATH`. `--normalize-16s` additionally requires `bowtie2` and `samtools` on `PATH`. The model, ARG database, DIAMOND binaries, and `gg13` index are all in the HF bundle.
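Before launching a long run, it can help to confirm that the external tools named above are actually on `PATH`. The following is a minimal pre-flight sketch using only the standard library; `required_tools` and `missing_tools` are hypothetical helpers, not part of the DeepARG package, and the executable names are taken as written in this README.

```python
import shutil

# Executables this README says each optional mode needs on PATH.
PAIRED_FASTQ_TOOLS = ("trimmomatic", "vsearch")
NORMALIZE_16S_TOOLS = ("bowtie2", "samtools")


def required_tools(paired_fastq, normalize_16s):
    """List the external tools needed for the chosen pipeline options."""
    tools = []
    if paired_fastq:
        tools.extend(PAIRED_FASTQ_TOOLS)
    if normalize_16s:
        tools.extend(NORMALIZE_16S_TOOLS)
    return tools


def missing_tools(paired_fastq, normalize_16s, which=shutil.which):
    """Return the required tools that cannot be found on PATH."""
    return [tool for tool in required_tools(paired_fastq, normalize_16s)
            if which(tool) is None]


if __name__ == "__main__":
    # Example: raw paired FASTQ input with 16S normalization enabled.
    missing = missing_tools(paired_fastq=True, normalize_16s=True)
    if missing:
        print("Missing on PATH: " + ", ".join(missing))
    else:
        print("All external tools found.")
```

A run from an already-clean FASTA without `--normalize-16s` needs none of these tools, so the check returns an empty list in that case.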
If Python HTTP clients hang while `curl` can reach Hugging Face, force IPv4:

    HF_FORCE_IPV4=1 deeparg predict \
        --model LS \
        -i input.fasta \
        -o output/sample

## Output

DeepARG writes two main files:

* `*.ARG`
* `*.potential.ARG`

`*.ARG` contains predictions with probability greater than or equal to `--min-prob`, which defaults to `0.8`. `*.potential.ARG` contains lower probability ARG-like predictions.

Columns:

* ARG_NAME
* QUERY_START
* QUERY_END
* QUERY_ID
* PREDICTED_ARG_CLASS
* BEST_HIT_FROM_DATABASE
* PREDICTION_PROBABILITY
* ALIGNMENT_BESTHIT_IDENTITY (%)
* ALIGNMENT_BESTHIT_LENGTH
* ALIGNMENT_BESTHIT_BITSCORE
* ALIGNMENT_BESTHIT_EVALUE
* COUNTS

## Development With uv

Create or refresh the environment:

    uv sync

Update dependency resolution:

    uv lock --upgrade
    uv sync

Run the CLI from the environment:

    .venv/bin/deeparg predict --help

Run a quick local test:

    HF_FORCE_IPV4=1 .venv/bin/deeparg predict \
        --model LS \
        -i ./test/ORFs.fa \
        -o ./data/test_hf_real \
        --hf-device cpu

The project is configured by:

* `pyproject.toml`
* `uv.lock`

## Training and Conversion

Train a new Transformers model from a DIAMOND/BLAST TSV:

    python -m deeparg.modern.train_transformers \
        --train-alignments /path/to/train_reads.tsv \
        --output-dir /path/to/deeparg-ss-transformers \
        --pipeline reads \
        --identity 30 \
        --evalue 1 \
        --coverage 25

Convert existing DeepARG Lasagne weights:

    python -m deeparg.modern.export_legacy \
        --metadata-pkl /path/to/model/v2/metadata_SS.pkl \
        --model-pkl /path/to/model/v2/model_SS.pkl \
        --database-dir /path/to/database/v2 \
        --feature-lengths /path/to/database/v2/features.gene.length \
        --output-dir /path/to/deeparg-ss-transformers

Upload a bundle to Hugging Face:

    HF_FORCE_IPV4=1 python -m deeparg.modern.upload_hf \
        --repo-id USER_OR_ORG/deeparg \
        --folder /path/to/deeparg-transformers

## Legacy Runtime

The original DeepARG runtime was based on Python 2.7, Theano, Lasagne, and nolearn.
It is deprecated in this repository; the active codebase now contains only the PyTorch/Transformers runtime described above.

Legacy installation and usage notes were moved to:

    README_LEGACY.md

## Citation

If you use DeepARG in published research, please cite:

Arango-Argoty GA, Garner E, Pruden A, Heath LS, Vikesland P, Zhang L. DeepARG: A deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome 2018 6:23. https://doi.org/10.1186/s40168-018-0401-z

## License

DeepARG is under the MIT license. Please also review the commercial restrictions of the databases used during the mining process, including CARD, ARDB, and UniProt.

## Contact

gustavo1@vt.edu