Papers
arxiv:2606.04552

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Published on Jun 3
Authors:
,

Abstract

LDARNet is a 120M-parameter genomic foundation model that uses dynamic chunking and adaptive token boundaries through BiMamba-2 state-space layers and local attention to improve performance on genomics tasks.

Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as k-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to induce adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, LDARNet achieves 11/18 wins among compact models (<300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20times larger. A FLOPs-matched controlled experiment isolates learned routing as the source of these gains: learned boundaries beat fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute. Nucleotide-resolution analysis further shows that the learned boundaries align with canonical promoter motifs and splice junctions without supervision, providing a biological interpretation for adaptive tokenization in genomic foundation models.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2606.04552
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.04552 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.04552 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.04552 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.