---
license: apache-2.0
---

<center> <div style="text-align: center;"> <img src="https://raw.githubusercontent.com/ZHZisZZ/dllm/main/assets/logo.gif" width="400" />
 </div> </center>

# Qwen3-0.6B-diffusion-bd3lm-v0.1

Qwen3-0.6B-diffusion-bd3lm-v0.1 is a diffusion-based language model adapted from [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using [BD3LM](https://arxiv.org/abs/2503.09573) (block diffusion), trained with the [dLLM](https://github.com/ZHZisZZ/dllm) framework.

## Model Overview

Qwen3-0.6B-diffusion-bd3lm-v0.1 has the following features:

<!-- - **Architecture**: Transformer encoder with 8192-token context -->
- **Method**: [Block Discrete Denoising Diffusion Language Modeling (BD3LM)](https://arxiv.org/pdf/2503.09573)
- **Framework**: [dLLM](https://github.com/ZHZisZZ/dllm)
- **Base Model**: [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
- **Datasets**: [tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture), [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk), [opc-sft-stage1](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage1) and [opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2) 

For training details, see the [W&B report](https://wandb.ai/asap-zzhou/dllm/reports/dLLM-Tiny-A2D--VmlldzoxNTI2NTEzOA).

## Installation

```shell
pip install torch transformers accelerate
```
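
The model ships custom modeling code on the Hub, so pass `trust_remote_code=True` when loading it (as the quick start below does).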

## Quick Start

> [!NOTE]
> We recommend setting `enable_thinking=False` when using the model to ensure stable behavior and reproducible results.
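
The script below is a self-contained block-diffusion sampling loop: it builds a block-causal ("staircase") attention mask, caches the prompt prefix once per block, and iteratively denoises each masked block, committing the most confident candidate tokens at each step.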

```python
import math
import copy

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM


def add_gumbel_noise(logits, temperature):
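    # Gumbel-max trick: argmax(exp(logits) / (-log u)^temperature) has the same
    # distribution as argmax(logits + temperature * Gumbel noise);
    # temperature == 0 falls through to plain greedy argmax.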
    if temperature == 0:
        return logits
    logits = logits.to(torch.float64)
    noise = torch.rand_like(logits, dtype=torch.float64)
    g = (-torch.log(noise)) ** temperature
    return logits.exp() / g


def get_num_transfer_tokens(mask_index, steps):
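    # Evenly split the number of masked tokens to commit across the denoising
    # steps, assigning any remainder to the earliest steps.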
    mask_num = mask_index.sum(dim=1, keepdim=True)
    base = mask_num // steps
    rem = mask_num % steps
    out = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.long) + base
    for i in range(mask_num.size(0)):
        out[i, : rem[i]] += 1
    return out


def build_staircase_attention_mask(x, block_size, pad_id):
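    # Block-causal ("staircase") mask: each token attends to its own block and
    # all earlier blocks; pad positions are excluded from attention, and
    # position ids are re-indexed over non-pad tokens only.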
    B, T = x.shape
    device = x.device

    valid = x != pad_id
    pos_raw = torch.cumsum(valid.long(), dim=-1)
    position_ids = torch.where(valid, pos_raw - 1, torch.zeros_like(pos_raw)).long()

    col = torch.arange(T, device=device)
    block_ids = (col // block_size).view(1, T).expand(B, T)
    block_ids = torch.where(valid, block_ids, torch.full_like(block_ids, -1))

    q = block_ids.view(B, 1, T, 1)
    k = block_ids.view(B, 1, 1, T)
    attn = (k <= q) & (q >= 0) & (k >= 0)

    return attn, position_ids


def diffusion_step_block(logits, x_block, mask_block, num_transfer, temperature, remasking):
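    # One denoising step: draw a candidate token for every masked position,
    # then commit only the `num_transfer` most confident candidates per row.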
    B, L, _ = logits.shape
    if not mask_block.any():
        return x_block

    noisy = add_gumbel_noise(logits, temperature)
    x0 = noisy.argmax(dim=-1)

    if remasking == "low_confidence":
        p = F.softmax(logits, dim=-1)
        conf = p.gather(-1, x0.unsqueeze(-1)).squeeze(-1)
    elif remasking == "random":
        conf = torch.rand((B, L), device=logits.device)
    else:
        raise ValueError(remasking)

    x0 = torch.where(mask_block, x0, x_block)
    neg_inf = torch.full_like(conf, -float("inf"))
    conf = torch.where(mask_block, conf, neg_inf)

    commit = torch.zeros_like(x_block, dtype=torch.bool)
    for i in range(B):
        k = int(num_transfer[i].item())
        if k > 0:
            valid = (conf[i] > -float("inf")).sum().item()
            k = min(k, valid)
            _, idx = torch.topk(conf[i], k)
            commit[i, idx] = True

    out = x_block.clone()
    out[commit] = x0[commit]
    return out


@torch.no_grad()
def generate(
    model,
    tokenizer,
    prompt,
    steps=128,
    max_new_tokens=128,
    block_size=32,
    temperature=0.0,
    cfg_scale=0.0,
    remasking="low_confidence",
):
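    # Block-wise sampling loop: cache the prompt prefix once per block, append
    # a fully masked block, and iteratively denoise it. cfg_scale > 0 adds
    # classifier-free guidance against a fully masked copy of the prompt.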
    device = model.device
    mask_id = tokenizer.mask_token_id
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        pad_id = tokenizer.eos_token_id if tokenizer.eos_token_id is not None else tokenizer.mask_token_id

    if isinstance(prompt, torch.Tensor):
        x = prompt.to(device).long()
    else:
        if isinstance(prompt[0], (list, tuple)):
            max_len = max(len(p) for p in prompt)
            x = torch.full((len(prompt), max_len), pad_id, device=device, dtype=torch.long)
            for i, p in enumerate(prompt):
                x[i, : len(p)] = torch.tensor(p, device=device)
        else:
            x = torch.tensor(prompt, device=device).long()
    if x.dim() == 1:
        x = x.unsqueeze(0)

    B = x.size(0)
    finished = torch.zeros(B, dtype=torch.bool, device=device)

    num_blocks = math.ceil(max_new_tokens / block_size)
    steps_per_block = math.ceil(steps / num_blocks)
    generated = 0

    while generated < max_new_tokens:
        if finished.all():
            break
        T_prefix = x.size(1)
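        # Keep generation aligned to the global block grid: if the prefix ends
        # mid-block, only fill the remainder of that block this round.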
        offset = T_prefix % block_size
        room = block_size if offset == 0 else block_size - offset
        cur_len = min(room, max_new_tokens - generated)
        if cur_len <= 0:
            break

        attn_pfx, pos_pfx = build_staircase_attention_mask(x, block_size, pad_id)

        out = model(x, attention_mask=attn_pfx, position_ids=pos_pfx, use_cache=True)
        cond_past = out.past_key_values

        if cfg_scale > 0:
            un_x = x.clone()
            un_x[:] = mask_id
            out_un = model(un_x, attention_mask=attn_pfx, position_ids=pos_pfx, use_cache=True)
            uncond_past = out_un.past_key_values
        else:
            uncond_past = None

        block = torch.full((B, cur_len), mask_id, device=device, dtype=torch.long)
        block[finished] = pad_id
        x = torch.cat([x, block], dim=1)
        T_total = x.size(1)

        block_mask = x[:, -cur_len:] == mask_id
        num_transfer = get_num_transfer_tokens(block_mask, steps_per_block)
        eff_steps = num_transfer.size(1)

        full_attn, full_pos = build_staircase_attention_mask(x, block_size, pad_id)
        attn_blk = full_attn[:, :, T_prefix:T_total, :]
        pos_blk = full_pos[:, T_prefix:T_total]

        for t in range(eff_steps):
            x_blk = x[:, T_prefix:T_total]
            m_blk = x_blk == mask_id

            cond_logits = model(
                x_blk, attention_mask=attn_blk, position_ids=pos_blk,
                past_key_values=copy.deepcopy(cond_past), use_cache=False
            ).logits

            logits = cond_logits
            if cfg_scale > 0:
                un_logits = model(
                    x_blk, attention_mask=attn_blk, position_ids=pos_blk,
                    past_key_values=copy.deepcopy(uncond_past), use_cache=False
                ).logits
                logits = un_logits + (cfg_scale + 1.0) * (cond_logits - un_logits)

            x_blk_new = diffusion_step_block(
                logits, x_blk, m_blk, num_transfer[:, t], temperature, remasking
            )
            x[:, T_prefix:T_total] = x_blk_new
            if tokenizer.eos_token_id is not None:
                finished |= (x_blk_new == tokenizer.eos_token_id).any(dim=1)
            if finished.all():
                break

        generated += cur_len
        if finished.all():
            break

    return x


device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained("dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1", dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained("dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1", trust_remote_code=True)

prompts = [
    [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Implement a DFS traversal in Python with clear inline comments."},
    ],
    [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 10 hours?"},
    ],
]

encoded = [tokenizer.apply_chat_template(m, add_generation_prompt=True, tokenize=True, enable_thinking=False) for m in prompts]
prompt_lens = [len(e) for e in encoded]
max_len = max(prompt_lens)
pad_id = tokenizer.pad_token_id
if pad_id is None:
    pad_id = tokenizer.eos_token_id if tokenizer.eos_token_id is not None else tokenizer.mask_token_id
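# Right-pad prompts to a shared length; build_staircase_attention_mask
# later excludes these pad positions from attention.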
input_ids = torch.full((len(encoded), max_len), pad_id, dtype=torch.long)
for i, ids in enumerate(encoded):
    input_ids[i, : len(ids)] = torch.tensor(ids, dtype=torch.long)
input_ids = input_ids.to(device)

max_new_tokens = 256
output_ids = generate(
    model,
    tokenizer,
    input_ids,
    steps=256,
    max_new_tokens=max_new_tokens,
    block_size=32,
    temperature=0.0,
    cfg_scale=0.0,
    remasking="low_confidence",
)

new_tokens = [output_ids[i, prompt_lens[i] : prompt_lens[i] + max_new_tokens].tolist() for i in range(len(prompt_lens))]
for idx, decoded in enumerate(tokenizer.batch_decode(new_tokens, skip_special_tokens=False)):
    print(f"\n[Sample {idx}]")
    print(decoded)
```

## Generation Parameters

The defaults below are those of the `generate` function above; the quick-start example overrides `steps` and `max_new_tokens` to 256.

| Parameter        | Description                                                                                                                         | Default          |
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ---------------- |
| `max_new_tokens` | Number of tokens to generate                                                                                                        | 128              |
| `steps`          | Number of diffusion denoising iterations                                                                                            | 128              |
| `temperature`    | Sampling temperature; set to `0.0` for deterministic (greedy) generation                                                            | 0.0              |
| `block_size`     | Token block size used during iterative denoising                                                                                    | 32               |
| `cfg_scale`      | Classifier-free guidance scale; higher values push predictions toward the prompt-conditioned distribution (`0.0` disables guidance) | 0.0              |
| `remasking`      | Strategy for choosing which masked tokens to commit at each denoising step (`random` or `low_confidence`)                           | `low_confidence` |
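
For example, to trade greedy decoding for more diverse samples, you can raise `temperature` and switch to `random` remasking. This sketch reuses `generate`, `model`, `tokenizer`, and `input_ids` from the quick start; the specific values are illustrative, not tuned recommendations:

```python
# Illustrative settings only; reuses objects defined in the quick start.
sampled_ids = generate(
    model,
    tokenizer,
    input_ids,
    steps=256,
    max_new_tokens=256,
    block_size=32,
    temperature=0.7,     # > 0 enables Gumbel sampling in add_gumbel_noise
    cfg_scale=0.0,       # > 0 would enable classifier-free guidance
    remasking="random",  # commit a random subset instead of top-confidence
)
print(tokenizer.batch_decode(sampled_ids[:, input_ids.shape[1]:], skip_special_tokens=True))
```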

## Command-Line Interface

Use the demo script [examples/a2d/bd3lm/chat.py](https://github.com/ZHZisZZ/dllm/blob/main/examples/a2d/bd3lm/chat.py) from the dLLM GitHub repo for visualized generation:

```shell
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1 \
    --chat_template True --block_size 32 --remasking low_confidence --steps 256 --max_new_tokens 256
```

## Evaluation

<table style="border-collapse: collapse; width: 100%; text-align: center;">
  <thead>
    <tr style="border-bottom: 3px solid #333;">
      <th style="padding: 8px;">Model                     </th>
      <th style="padding: 8px;">GSM8K</th>
      <th style="padding: 8px;">MATH</th>
      <th style="padding: 8px;">BBH</th>
      <th style="padding: 8px;">MMLU&#8209;Pro</th>
      <th style="padding: 8px;">Hellaswag</th>
      <th style="padding: 8px;">MMLU</th>
      <th style="padding: 8px;">HumanEval</th>
      <th style="padding: 8px;">MBPP</th>
    </tr>
  </thead>

  <tr style="background-color: #e8f2ff">
    <td style="padding: 8px;"><a href="https://huggingface.co/dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1"><code>Qwen3-0.6B-diffusion-bd3lm-v0.1</code></a> (evaluated)</td>
    <td>46.6</td><td>13.9</td><td>27.0</td><td>14.1</td><td>40.0</td><td>38.8</td><td>47.6</td><td>32.0</td>
  </tr>

  <tr style="background-color: #e8f2ff">
    <td style="padding: 8px;"><a href="https://huggingface.co/dllm-collection/Qwen3-0.6B-diffusion-mdlm-v0.1"><code>Qwen3-0.6B-diffusion-mdlm-v0.1</code></a> (evaluated)</td>
    <td>29.8</td><td>8.8</td><td>27.0</td><td>17.6</td><td>42.1</td><td>40.0</td><td>30.5</td><td>29.2</td>
  </tr>
  <tr>
    <td colspan="9" style="padding: 0; border-top: 3px double #666;"></td>
  </tr>

  <tr>
    <td style="padding: 8px;"><a href="https://huggingface.co/Qwen/Qwen3-0.6B-Base"><code>Qwen3-0.6B-Base</code></a> (reported)</td>
    <td>59.6</td><td>32.4</td><td>41.5</td><td>24.7</td><td>47.4</td><td>52.8</td><td>32.3</td><td>36.6</td>
  </tr>

  <tr>
    <td style="padding: 8px;"><a href="https://huggingface.co/Qwen/Qwen2.5-0.5B"><code>Qwen2.5-0.5B</code></a> (reported)</td>
    <td>41.6</td><td>19.5</td><td>20.3</td><td>15.7</td><td>52.1</td><td>47.5</td><td>30.5</td><td>39.3</td>
  </tr>

</table>

To evaluate Qwen3-0.6B-diffusion-bd3lm-v0.1 on all benchmarks automatically, run the following from the root of the dLLM repository:
```shell
bash examples/a2d/bd3lm/eval.sh \
  --model_name_or_path dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1
```


## Citation

If you use Qwen3-0.6B-diffusion-bd3lm-v0.1 or dLLM, please cite:

```bibtex
@misc{dllm,
  author = {Zhanhui Zhou and Lingjie Chen and Hanghang Tong and Dawn Song},
  title = {dLLM: Simple Diffusion Language Modeling},
  year = {2025},
  howpublished = {\url{https://github.com/ZHZisZZ/dllm}},
}
```