Base LLM 400M
Self-mined multi-modal pretraining corpus targeting 200B tokens of high-quality text, video, audio, and image data.
Structure
- `data/shards/shard_{idx:08d}.bin`: raw `uint32` token IDs, ~128 MB per shard
- `data/shards/shard_{idx:08d}.meta.json`: 18-field sidecar with quality stats
- `state/`: runtime checkpoints (`.state` extension)
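The layout above suggests that a shard can be read directly as a flat array of token IDs. The following is a minimal sketch, not the project's own loader: it assumes the `.bin` files have no header and store little-endian `uint32` values, and the helper names `shard_path` and `open_shard` are hypothetical.

```python
from pathlib import Path

import numpy as np


def shard_path(root, idx):
    """Build the path of shard number `idx` under `root` (hypothetical helper)."""
    return Path(root) / "data" / "shards" / f"shard_{idx:08d}.bin"


def open_shard(path):
    """Memory-map a shard as a read-only array of token IDs.

    Assumption: the file is headerless and little-endian uint32 ("<u4").
    """
    return np.memmap(path, dtype="<u4", mode="r")
```

Memory-mapping avoids loading a whole shard into RAM. Under the headerless assumption, a ~128 MB shard would hold roughly 32 to 34 million tokens at 4 bytes each, depending on whether MB means 10^6 or 2^20 bytes.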
18-field sidecar schema
| Field | Type | Description |
|---|---|---|
| shard_idx | int | Shard number |
| filename | str | Filename |
| num_tokens | int | Token count |
| dtype | str | uint32 |
| size_bytes | int | File size |
| created_at | str | ISO 8601 timestamp |
| tokens | int | Token count (duplicate for compatibility) |
| avg_score | float | Mean quality score |
| min_score | float | Min quality score |
| max_score | float | Max quality score |
| std_score | float | Std dev of quality scores |
| n_above_3 | int | Chunks with score > 3.0 |
| n_above_5 | int | Chunks with score > 5.0 |
| score_hist | dict | 21-bin histogram (0.0 to 10.0, step 0.5) |
| modality_comp | dict | Modality composition counts |
| pillar_comp | dict | Pillar composition counts |
| ts | float | Unix timestamp |
| modality | str | Primary modality |
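A sidecar can be checked against this schema before a shard is used. The sketch below is illustrative only: `EXPECTED_FIELDS` and `validate_sidecar` are hypothetical names, and the `size_bytes == num_tokens * 4` check assumes headerless `uint32` shards, which the schema does not state explicitly.

```python
import json

# Field names and types taken from the 18-field schema table.
EXPECTED_FIELDS = {
    "shard_idx": int, "filename": str, "num_tokens": int, "dtype": str,
    "size_bytes": int, "created_at": str, "tokens": int,
    "avg_score": float, "min_score": float, "max_score": float,
    "std_score": float, "n_above_3": int, "n_above_5": int,
    "score_hist": dict, "modality_comp": dict, "pillar_comp": dict,
    "ts": float, "modality": str,
}


def validate_sidecar(meta):
    """Return a list of problems found in a parsed .meta.json dict."""
    problems = []
    for name, typ in EXPECTED_FIELDS.items():
        if name not in meta:
            problems.append(f"missing field: {name}")
            continue
        # JSON may encode a float such as 3.0 as the integer 3.
        accepted = (int, float) if typ is float else typ
        if not isinstance(meta[name], accepted):
            problems.append(f"wrong type for {name}")
    if problems:
        return problems
    # `tokens` is documented as a duplicate of `num_tokens`.
    if meta["tokens"] != meta["num_tokens"]:
        problems.append("tokens != num_tokens")
    # Assumption: 4 bytes per token and no file header.
    if meta["size_bytes"] != meta["num_tokens"] * 4:
        problems.append("size_bytes != num_tokens * 4")
    return problems


def load_sidecar(path):
    """Read a sidecar file and raise if it does not match the schema."""
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    problems = validate_sidecar(meta)
    if problems:
        raise ValueError("; ".join(problems))
    return meta
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a shard at once.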
Loading (Python)
from datasets import load_dataset
ds = load_dataset("morningstarxcdcode/base-llm-400m", split="train", streaming=True)
for example in ds:
    print(example["tokens"])