Base LLM 400M

Self-mined multi-modal pretraining corpus targeting 200B tokens of high-quality text, video, audio, and image data.

Structure

data/shards/shard_{idx:08d}.bin — raw uint32 token IDs, ~128MB per shard
data/shards/shard_{idx:08d}.meta.json — 18-field sidecar with quality stats
state/ — runtime checkpoints (.state extension)
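As a minimal sketch of how the layout above could be consumed without extra dependencies, the helpers below build the two paths for a shard index and read a window of token IDs from the `.bin` file. The function names `shard_paths` and `read_tokens` are hypothetical. The card does not state the byte order or whether the file has a header, so the code assumes little-endian values and a headerless file.

```python
import struct
from pathlib import Path

TOKEN_BYTES = 4  # each token ID is a uint32


def shard_paths(root, idx):
    """Return (.bin path, .meta.json path) for a shard index."""
    stem = f"shard_{idx:08d}"
    shard_dir = Path(root) / "data" / "shards"
    return shard_dir / f"{stem}.bin", shard_dir / f"{stem}.meta.json"


def read_tokens(bin_path, start=0, count=1024):
    """Read up to `count` token IDs beginning at token offset `start`."""
    with open(bin_path, "rb") as f:
        f.seek(start * TOKEN_BYTES)
        raw = f.read(count * TOKEN_BYTES)
    n = len(raw) // TOKEN_BYTES  # ignore a trailing partial token, if any
    # "<" means little-endian: an assumption, not stated in the card.
    return list(struct.unpack(f"<{n}I", raw[: n * TOKEN_BYTES]))
```

Reading a window rather than the whole file keeps memory use small for a ~128MB shard; for bulk access, a memory-mapped array would probably be a better fit.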

18-field sidecar schema

Field	Type	Description
`shard_idx`	int	Shard number
`filename`	str	Filename
`num_tokens`	int	Token count
`dtype`	str	`uint32`
`size_bytes`	int	File size
`created_at`	str	ISO 8601 timestamp
`tokens`	int	Token count (duplicates `num_tokens` for compatibility)
`avg_score`	float	Mean quality score
`min_score`	float	Min quality score
`max_score`	float	Max quality score
`std_score`	float	Std dev of quality scores
`n_above_3`	int	Chunks with score > 3.0
`n_above_5`	int	Chunks with score > 5.0
`score_hist`	dict	21-bin histogram (0.0–10.0 step 0.5)
`modality_comp`	dict	Modality composition counts
`pillar_comp`	dict	Pillar composition counts
`ts`	float	Unix timestamp
`modality`	str	Primary modality
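The schema implies a few relationships that a sidecar can be checked against. The sketch below is one way to do that; `check_sidecar` is a hypothetical helper, not part of this repository. Two of the checks rest on assumptions the card does not confirm: that `size_bytes` is exactly 4 bytes per token (no file header), and that `score_hist` always carries all 21 bins, including empty ones.

```python
EXPECTED_FIELDS = {
    "shard_idx", "filename", "num_tokens", "dtype", "size_bytes",
    "created_at", "tokens", "avg_score", "min_score", "max_score",
    "std_score", "n_above_3", "n_above_5", "score_hist",
    "modality_comp", "pillar_comp", "ts", "modality",
}


def check_sidecar(meta):
    """Return a list of problems found in a sidecar dict; empty if none."""
    missing = EXPECTED_FIELDS - meta.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    problems = []
    if meta["dtype"] != "uint32":
        problems.append("dtype is not uint32")
    if meta["tokens"] != meta["num_tokens"]:
        problems.append("tokens and num_tokens disagree")
    # Assumption: headerless file, so size is 4 bytes per uint32 token.
    if meta["size_bytes"] != 4 * meta["num_tokens"]:
        problems.append("size_bytes is not 4 * num_tokens")
    if not meta["min_score"] <= meta["avg_score"] <= meta["max_score"]:
        problems.append("avg_score lies outside [min_score, max_score]")
    # Any chunk scoring above 5.0 also scores above 3.0.
    if meta["n_above_5"] > meta["n_above_3"]:
        problems.append("n_above_5 exceeds n_above_3")
    # Assumption: every bin is present, even when its count is zero.
    if len(meta["score_hist"]) != 21:
        problems.append("score_hist does not have 21 bins")
    return problems
```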

Loading (Python)

from datasets import load_dataset

# Stream examples instead of downloading every shard up front.
ds = load_dataset("morningstarxcdcode/base-llm-400m", split="train", streaming=True)
for example in ds:
    print(example["tokens"])
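When working from a local copy of the raw shards instead of the streaming loader, the sidecars make it possible to pick shards before touching any token data. The following is a sketch under that assumption; `select_shards` is a hypothetical helper, and the values stored in the `modality` field are not listed in this card, so any string passed for it is a guess.

```python
import json
from pathlib import Path


def select_shards(shard_dir, min_avg_score=0.0, modality=None):
    """Yield (bin_path, meta) for shards whose sidecar passes the filters."""
    for meta_path in sorted(Path(shard_dir).glob("shard_*.meta.json")):
        meta = json.loads(meta_path.read_text())
        if meta["avg_score"] < min_avg_score:
            continue
        if modality is not None and meta["modality"] != modality:
            continue
        # Swap the ".meta.json" suffix for ".bin" to locate the token file.
        bin_name = meta_path.name.replace(".meta.json", ".bin")
        yield meta_path.with_name(bin_name), meta
```

Filtering on `avg_score` is coarse, since it is a per-shard mean; `n_above_3`, `n_above_5`, and `score_hist` give a finer picture if a threshold on the score distribution is needed.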
