DLM-LST-9B
DLM-LST-9B is a Qwen/Qwen3.5-9B derivative whose output projection
(lm_head.weight) has been re-tuned by Language Selection Tuning (LST),
specifically the Logit variant.
The goal is to suppress unwanted Chinese-character generation when the model is used to serve non-Chinese (English / Korean / Japanese etc.) users.
All other parameters are bit-identical to the base model. Vision and multimodal capabilities are preserved unchanged.
Why LST?
Multilingual LLMs trained on heavily skewed corpora (e.g., Qwen on Chinese-rich data) tend to leak the dominant training language regardless of prompt language. This phenomenon is known as language confusion. For Korean users, this means Chinese characters sometimes appear in the middle of an otherwise Korean answer. This hurts both readability and user trust.
Language Selection Tuning (LST) defines a differentiable training objective:

$$\mathcal{L}_{\text{LST}} \;=\; \sum_{t \in T_{\text{risk}}} r(t)\, \varphi(w_t),$$

where $T_{\text{risk}}$ is the set of risk-scored target-language tokens (Chinese characters in the CJK Unified Ideographs range, plus multi-byte BPE fragments that decode to Chinese characters), $r(t)$ is each token's risk score, and $\varphi(\cdot)$ is a per-row penalty applied to `lm_head` weight row $w_t$.
This release uses the Logit variant:

$$\varphi_{\text{logit}}(w_t) \;=\; \mathbb{E}_{h \sim D}\!\left[\max\big(0,\; w_t^{\top} h\big)\right],$$

which directly penalises positive-logit contributions of risky tokens, evaluated on a calibration distribution $D$ of final-layer hidden states.
Because the optimization lives in the same gradient-based landscape that
downstream SFT / RLHF stages will traverse, the resulting weight
configuration tends to be more persistent under post-LST fine-tuning.
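As a concrete sketch, the Logit penalty can be written in a few lines of NumPy. This is an assumed form based only on the description above, not the released training code; array names and the exact reduction are ours:

```python
import numpy as np

def logit_penalty(W, risk_rows, risk_scores, H):
    """Sketch of the LST-Logit penalty (assumed form, not the release code).

    W           : (V, d) lm_head weight matrix
    risk_rows   : indices of the risk-scored token rows (T_risk)
    risk_scores : r(t) for each risky row
    H           : (N, d) calibration batch of final-layer hidden states (D)
    """
    logits = H @ W[risk_rows].T          # (N, R) logits of risky tokens
    positive = np.maximum(logits, 0.0)   # keep only positive contributions
    return float((positive.mean(axis=0) * risk_scores).sum())

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 16))
H = rng.standard_normal((8, 16))
rows = np.array([3, 7, 42])
scores = np.array([1.0, 0.5, 0.8])

p_before = logit_penalty(W, rows, scores, H)
W_tuned = W.copy()
W_tuned[rows] *= 0.5                     # shrinking a risky row halves its term
p_after = logit_penalty(W_tuned, rows, scores, H)
```

Because $\max(0,\cdot)$ is positively homogeneous, scaling a risky row by 0.5 halves its contribution exactly; gradient descent on this quantity only touches the risky rows, which is consistent with the frozen-everything-else setup described below.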
Key changes from Qwen/Qwen3.5-9B
| Component | Status |
|---|---|
| `lm_head.weight` | Re-trained by LST-Logit. 54,382 / 248,320 rows (21.9 %) modified. |
| Token embeddings (`embed_tokens`) | Bit-identical to base. |
| Transformer blocks (Gated Attention / Gated DeltaNet / SwiGLU FFN / RMSNorm) | Bit-identical to base. |
| `mtp.*` (multi-token prediction head) | Bit-identical to base. |
| Vision tower & preprocessor configs | Bit-identical to base. |
| Tokenizer (vocab / merges / chat template) | Identical content to base. |
Per-row script breakdown of the lm_head delta:
| Script | Vocab rows | Rows changed | % of category | mean ‖Δrow‖ |
|---|---|---|---|---|
| CJK | 55,328 | 53,429 | 96.57 % | 0.127 |
| Other non-Latin (broken-byte CJK fragments) | 15,141 | 953 | 6.29 % | 0.178 |
| Latin | 139,845 | 0 | 0.00 % | — |
| Korean | 6,807 | 0 | 0.00 % | — |
| Kana (Japanese) | 3,560 | 0 | 0.00 % | — |
| Cyrillic | 18,580 | 0 | 0.00 % | — |
| Arabic | 8,816 | 0 | 0.00 % | — |
The Logit penalty rotates each affected row only slightly while preserving
its norm (cos(w_orig, w_new) ≈ 0.992 mean over changed rows). Korean,
Japanese, Latin, Cyrillic, and Arabic rows are bit-identical to the base.
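The per-row comparison above can be reproduced with a short NumPy check. This is a toy illustration, not the authors' verification script; variable names are ours:

```python
import numpy as np

def row_delta_stats(w_base, w_lst):
    """Per-row cosine similarity, norm ratio, and changed-row mask
    between a base lm_head matrix and its LST-tuned counterpart."""
    cos = (w_base * w_lst).sum(axis=1) / (
        np.linalg.norm(w_base, axis=1) * np.linalg.norm(w_lst, axis=1)
    )
    norm_ratio = np.linalg.norm(w_lst, axis=1) / np.linalg.norm(w_base, axis=1)
    changed = np.any(w_base != w_lst, axis=1)
    return cos, norm_ratio, changed

# Toy check: rotate row 0 by 0.1 rad, leave row 1 bit-identical.
theta = 0.1
w_base = np.array([[1.0, 0.0], [0.0, 1.0]])
w_lst = np.array([[np.cos(theta), np.sin(theta)], [0.0, 1.0]])
cos, norm_ratio, changed = row_delta_stats(w_base, w_lst)
```

A pure rotation keeps `norm_ratio` at 1.0 while the cosine drops slightly below 1, which is the signature reported for the changed CJK rows.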
Quickstart
The recommended serving path is vLLM, which is also what we used in our evaluation pipeline.
```shell
vllm serve dataslab/DLM-LST-9B \
  --port 8000 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --reasoning-parser qwen3    # exposes <think> trace via OpenAI API
# --enable-reasoning          # auto-on with --reasoning-parser (vLLM >= 0.7)
# --max-model-len 16384       # cap context to shrink KV cache (default: 262,144)
```
Use with transformers
Non-Thinking mode (recommended for fast chat)
```python
import torch
from transformers import AutoTokenizer, AutoModelForImageTextToText

REPO = "dataslab/DLM-LST-9B"
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO,
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "한반도 주변에 가장 흔한 점토광물은?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
text = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:],
                        skip_special_tokens=True)
print(text)
```
Thinking mode (recommended for complex reasoning)
Either use thinking_budget (e.g., vLLM's --reasoning-parser qwen3) or
give max_new_tokens enough headroom (e.g., 8,192 + 256 = 8,448).
Caveat: without a thinking_budget cap, a too-small max_new_tokens
can be fully consumed inside <think> and the answer never gets emitted.
```python
# ... tokenizer / model loaded as above ...
THINKING_BUDGET = 8192   # max tokens inside <think>
ANSWER_TOKENS = 256      # tokens after </think>

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=THINKING_BUDGET + ANSWER_TOKENS)
text = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:],
                        skip_special_tokens=True)
# `text` contains the <think>...</think> trace followed by the final answer.
# (vLLM separates them via --reasoning-parser qwen3; see Quickstart above.)
print(text)
```
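When decoding locally, the trace and the answer arrive in one string. A minimal splitter (our helper, not part of transformers) that also handles the exhausted-budget failure mode described in the caveat above:

```python
def split_thinking(text):
    """Split generated text into (thinking_trace, final_answer).

    If the token budget was exhausted inside <think> and no </think> was
    emitted, the answer comes back empty, matching the caveat above.
    """
    if "</think>" in text:
        trace, answer = text.split("</think>", 1)
        return trace.split("<think>", 1)[-1].strip(), answer.strip()
    if "<think>" in text:  # budget ran out mid-trace; no answer was emitted
        return text.split("<think>", 1)[1].strip(), ""
    return "", text.strip()

trace, answer = split_thinking(
    "<think>3 days per the old rule.</think>\n정답은 3 일입니다."
)
print(answer)  # -> 정답은 3 일입니다.
```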
Why `AutoModelForImageTextToText`? Qwen3.5-9B's declared architecture is `Qwen3_5ForConditionalGeneration`, a composite class that wraps both the text decoder and the vision tower. Loading via `AutoModelForCausalLM` works for text-only inference but strips the vision submodule and may produce a config that downstream tools (e.g., vLLM) reject. If you need a pure text causal-LM handle, use `model.language_model` after loading.
Benchmark Results
Evaluation Metrics
(1) Selectivity
Fraction of explicit Chinese requests on which the model avoids producing Chinese.
- Lower better (~0): produces Chinese when asked (respects user intent).
- Higher worse (~1): refuses Chinese even when asked (blanket suppression).
| Metric | Benchmark Dataset |
|---|---|
| chin_prom ↓ | Self-built 1,000-prompt Chinese elicitation set (e.g., "How do you say '사랑' in Chinese?" or the Python + Chinese-comment prompt) |
(2) Chinese-leak suppression
Korean prompts → Korean answers expected; any Chinese token leaked into the answer is a failure. Metric is the clean-Korean response ratio.
- Higher better (~1): Korean answers stay fully Korean (no Chinese tokens leaked).
- Lower worse (~0): Chinese tokens leak into otherwise-Korean answers.
| Metric | Benchmark Dataset |
|---|---|
| chin_cs ↑ | KMMLU Computer Science subjects (free-form Korean generation) |
| chin_ie ↑ | KMMLU Industrial Engineering subjects (free-form Korean generation) |
| chin_total ↑ | KMMLU (free-form Korean generation) |
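As a rough sketch of what the chin_* checks measure: an answer counts as leaking if any character falls in the CJK Unified Ideographs block. The exact Unicode ranges used by the evaluator are not published; the basic block U+4E00–U+9FFF is assumed here:

```python
def chinese_char_count(text):
    """Count characters in the CJK Unified Ideographs basic block
    (assumed range U+4E00..U+9FFF; the evaluator's exact ranges are unpublished)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")

def is_clean_korean(text):
    """True iff no Chinese character leaked into the answer."""
    return chinese_char_count(text) == 0

print(is_clean_korean("재고 증가, 보관 비용 상승"))  # True  (clean Korean)
print(is_clean_korean("재고积压을 유발하고"))          # False (leaked 积压)
```

Hangul syllables live in a disjoint block (U+AC00–U+D7A3), so a fully Korean answer passes untouched.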
(3) Reasoning / task performance
| Metric | Benchmark Dataset |
|---|---|
| acc_cs ↑ | KMMLU Computer Science subjects (multiple-choice log-likelihood comparison) |
| acc_ie ↑ | KMMLU Industrial Engineering subjects (multiple-choice log-likelihood comparison) |
| acc_total ↑ | KMMLU (multiple-choice log-likelihood comparison) |
| HumanEval ↑ | HumanEval (pass@1) |
| GSM8K ↑ | GSM8K (exact-match accuracy) |
(4) SFT-persistence
| Metric | Details |
|---|---|
| SRR (Suppression Retention Rate) ↑ | Ratio of chin_total after SFT to chin_total before SFT. Closer to 1.0 = SFT did not erode the leak-suppression effect. Built on chin_total (not chin_prom) so the direction stays higher-is-better. |
| \|Δ_selectivity\| ↓ | Absolute change in chin_prom. Smaller = SFT barely shifted selectivity. Sign of Δ: positive = blunted selectivity (bad); negative = partial selectivity recovery (only meaningful if pre-SFT selectivity was blunt). |
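Both persistence metrics reduce to one-liners; the numbers in the usage lines are the DLM-LST-9B → SFT values from the SFT tables in this card:

```python
def srr(chin_total_before, chin_total_after):
    """Suppression Retention Rate: ratio of chin_total after SFT to before."""
    return chin_total_after / chin_total_before

def delta_selectivity(chin_prom_before, chin_prom_after):
    """Absolute change in chin_prom across the SFT stage."""
    return abs(chin_prom_after - chin_prom_before)

print(round(srr(0.9928, 0.9926), 4))              # 0.9998
print(round(delta_selectivity(0.071, 0.155), 3))  # 0.084
```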
Chinese Suppression (Thinking mode)
Evaluated with enable_thinking=True. The DLM-LST-9B column is calibrated with thinking enabled.
| Metric | Qwen3.5-9B (base) | LST-L1 | LST-L2 | DLM-LST-9B |
|---|---|---|---|---|
| Selectivity | ||||
| chin_prom ↓ | 0.029 | 0.993 | 0.992 | 0.065 |
| Chinese-leak suppression | ||||
| chin_cs ↑ | 0.985 | 1.000 | 1.000 | 0.999 |
| chin_ie ↑ | 0.978 | 1.000 | 0.999 | 0.997 |
| chin_total ↑ | 0.9717 | 0.9988 | 0.9988 | 0.9927 |
| Reasoning / Task performance | ||||
| acc_cs ↑ | 0.811 | 0.811 | 0.811 | 0.811 |
| acc_ie ↑ | 0.618 | 0.618 | 0.620 | 0.620 |
| acc_total ↑ | 0.5897 | 0.5897 | 0.5893 | 0.5893 |
| HumanEval ↑ | 0.6646 | 0.6768 | 0.6585 | 0.6646 |
| GSM8K ↑ | 0.8749 | 0.8749 | 0.8749 | 0.8749 |
DLM-LST-9B keeps chin_prom at 0.065. It preserves the ability to
generate Chinese when the user explicitly asks for it, while still cutting
unintended Chinese leakage to the level of chin_total ≈ 0.99.
Downstream reasoning (acc_*, HumanEval, GSM8K) is comparable to, or in some cases even better than, the base model.
Chinese Suppression (Non-Thinking mode)
Evaluated with enable_thinking=False. The DLM-LST-9B column here is a
separate think-OFF-calibrated checkpoint (not this release).
| Metric | Qwen3.5-9B (base) | LST-L1 | LST-L2 | DLM-LST-9B |
|---|---|---|---|---|
| Selectivity | ||||
| chin_prom ↓ | 0.037 | 0.966 | 0.963 | 0.080 |
| Chinese-leak suppression | ||||
| chin_cs ↑ | 0.964 | 0.999 | 1.000 | 0.990 |
| chin_ie ↑ | 0.934 | 0.997 | 0.999 | 0.983 |
| chin_total ↑ | 0.9405 | 0.9974 | 0.9975 | 0.9830 |
| Reasoning / Task performance | ||||
| acc_cs ↑ | 0.811 | 0.811 | 0.811 | 0.811 |
| acc_ie ↑ | 0.615 | 0.614 | 0.614 | 0.614 |
| acc_total ↑ | 0.5900 | 0.5897 | 0.5897 | 0.5897 |
| HumanEval ↑ | 0.6707 | 0.6768 | 0.6707 | 0.6707 |
| GSM8K ↑ | 0.8757 | 0.8749 | 0.8741 | 0.8787 |
Suppression Persistence after SFT-stage (Non-Thinking mode)
Each pipeline was fine-tuned (SFT) on the beomi/KoAlpaca-v1.1a dataset.
After the SFT stage, DLM-LST-9B keeps both its Chinese-leak suppression (SRR ≈ 1.000) and its selectivity (|Δ_selectivity| ≈ 0.08) almost unchanged.
| Metric | Qwen3.5-9B → SFT | DLM-LST-9B → SFT |
|---|---|---|
| Chinese-leak suppression | ||
| chin_total before ↑ | 0.9405 | 0.9928 |
| chin_total after ↑ | 0.9927 | 0.9926 |
| SRR ↑ | 1.0555 | 0.9998 |
| Selectivity | ||
| chin_prom before ↓ | 0.037 | 0.071 |
| chin_prom after ↓ | 0.128 | 0.155 |
| |Δ_selectivity| ↓ | 0.091 | 0.084 |
| Metric | Qwen3.5-9B (base) | Qwen3.5-9B → SFT | DLM-LST-9B → SFT |
|---|---|---|---|
| Selectivity | |||
| chin_prom ↓ | 0.037 | 0.128 | 0.155 |
| Chinese-leak suppression | |||
| chin_cs ↑ | 0.964 | 0.998 | 0.998 |
| chin_ie ↑ | 0.934 | 0.993 | 0.994 |
| chin_total ↑ | 0.9405 | 0.9927 | 0.9926 |
| Reasoning / Task performance | |||
| acc_cs ↑ | 0.811 | 0.748 | 0.751 |
| acc_ie ↑ | 0.615 | 0.505 | 0.509 |
| acc_total ↑ | 0.5900 | 0.5202 | 0.5217 |
| HumanEval ↑ | 0.6707 | 0.6037 | 0.6402 |
| GSM8K ↑ | 0.8757 | 0.8211 | 0.8226 |
English Suppression (Non-Thinking mode)
Applied to Llama-3.1-8B-Instruct to suppress English instead, the Logit recipe behind DLM-LST-9B is the only variant that keeps coding (HumanEval) and math (GSM8K) usable.
| Metric | Llama-3.1-8B-Instruct (base) | LST-L1 | LST-L2 | DLM-LST-9B |
|---|---|---|---|---|
| Selectivity | ||||
| eng_prom ↓ | 0.018 | 0.716 | 0.725 | 0.067 |
| English-leak suppression | ||||
| eng_cs ↑ | 0.241 | 0.683 | 0.667 | 0.365 |
| eng_ie ↑ | 0.483 | 0.887 | 0.888 | 0.551 |
| eng_total ↑ | 0.5391 | 0.8709 | 0.8702 | 0.6067 |
| Reasoning / Task performance | ||||
| acc_cs ↑ | 0.698 | 0.251 | 0.257 | 0.727 |
| acc_ie ↑ | 0.439 | 0.191 | 0.203 | 0.475 |
| acc_total ↑ | 0.4417 | 0.2229 | 0.2309 | 0.4580 |
| HumanEval ↑ | 0.6341 | 0.0000 | 0.0000 | 0.5671 |
| GSM8K ↑ | 0.6967 | 0.4837 | 0.4852 | 0.6088 |
Example Outputs
Asked in Korean about the most common clay mineral on the Korean
Peninsula, Qwen3.5-9B leaks 9 Chinese / mixed-script tokens (伊利石,
кao린, 的) into its answer. DLM-LST-9B answers the same prompt
entirely in Korean (0 Chinese tokens).
Domain-wise Leakage Comparisons
Computer Science
Prompt: TCP 와 UDP 프로토콜의 차이점을 자세히 설명해 주세요.
Qwen3.5-9B (leaks 才开始 mid-sentence):
TCP (Transmission Control Protocol) 와 UDP (User Datagram Protocol) 는 인터넷의 가장 기본이 되는 **전송 계층 (Transport Layer)** 프로토콜로, …
### 1. 연결 방식 (Connection)
* **TCP**: **연결 지향적 (Connection-oriented)**입니다. 데이터 전송 전에 서버와 클라이언트 사이에 '손잡이 (Handshake)' 과정을 거쳐 연결을 수립합니다. … 이는 전화통화처럼 상대방이 준비된 상태에서才开始하는 것과 같습니다.
* **UDP**: **비연결 지향적 (Connectionless)**입니다. 전송 전에 연결을 맺지 않고, 패킷 (데이터그램) 을 바로 보냅니다. …
DLM-LST-9B (clean Korean):
TCP (Transmission Control Protocol) 와 UDP (User Datagram Protocol) 는 인터넷과 네트워크 통신의 기반이 되는 **전송 계층 (Transport Layer)** 의 두 가지 핵심 프로토콜입니다. 둘 다 데이터를 전송하지만, 접근 방식과 사용처에서 근본적인 차이가 있습니다.
가장 큰 차이는 **"신뢰성 (Reliability)"**과 **"속도 (Speed)"** 사이의 트레이드-off 관계에 있습니다. TCP 는 데이터가 무결하게 도착하는 것을 최우선으로 하며, UDP 는 최대한 빠르고 효율적으로 데이터를 보내는 것을 목표로 합니다. …
### 1. 연결 방식 (Connection-Oriented vs Connectionless)
* **TCP**: **연결 지향적**입니다. 데이터 전송 전에 서버와 클라이언트 사이에 '손잡이 (Handshake)' 과정을 거쳐 연결을 수립합니다. 마치 우편물을 보낼 때 먼저 주소 확인과 발송 절차를 거치는 것과 같습니다. …
Industrial Engineering
Prompt: 린 (Lean) 제조 방식에서 정의하는 7가지 낭비 (muda) 의 종류를 한국어로 설명해 주세요.
Qwen3.5-9B (leaks 积压 mid-sentence):
린 (Lean) 제조 방식에서 **7 가지 낭비 (7 Muda)** 는 가치 없는 활동을 제거하여 효율성을 극대화하는 핵심 개념입니다. …
### 1. 과잉 생산 (Overproduction)
가장 위험하고 근본적인 낭비로 꼽힙니다. 고객이 아직 필요로 하지 않거나, 다음 공정에서 필요로 하는 것보다 미리 제품을 만들어내는 행위입니다. 과잉 생산은 재고积压을 유발하고, 보관 비용을 증가시키며, 실제 수요를 왜곡시켜 다른 낭비들을 더 발생시키는 원인이 됩니다.
### 2. 대기 (Waiting)
…
DLM-LST-9B (clean Korean):
린 (Lean) 제조 방식에서 **7 가지 낭비 (7 Muda)** 는 가치 없는 활동을 제거하여 효율성을 극대화하기 위한 핵심 개념입니다. …
### 1. 과잉 생산 (Overproduction)
가장 위험하고 근본적인 낭비로 꼽힙니다. 고객이 아직 필요로 하지 않거나, 다음 공정보다 앞서서 제품을 만들어내는 것을 말합니다. 이는 재고 증가, 보관 비용 상승, 그리고 나중에 발견될 결함 수리 비용을 초래합니다. "필요한 만큼, 필요한 때에, 필요한 양만큼"을 만드는 것이 목표입니다.
### 2. 대기 (Waiting)
원자재, 부품, 정보, 사람, 기계 등이 다음 공정을 시작할 때까지 기다려야 하는 시간입니다. …
Law
Prompt: 민법상 채권자대위권의 성립요건과 행사 방법을 자세히 설명해 주세요.
Qwen3.5-9B (leaks 享有的):
민법상 **채권자대위권**은 채권자가 채무자가 타인 (제 3 자) 에게享有的 채권을 행사하지 않을 때, 채권자가 직접 그 채권을 대신하여 행사할 수 있는 권리를 말합니다. 이는 채무자의 재산이 부족하여 채권자가 변제를 받을 수 없는 경우, 채무자의 재산 (채권) 을 확보함으로써 채권자의 이익을 보호하기 위한 제도입니다.
요청하신 **성립요건**과 **행사 방법**에 대해 상세히 설명해 드리겠습니다.
### 1. 채권자대위권의 성립요건
채권자대위권을 행사하기 위해서는 민법 제 401 조에 명시된 4 가지 요건이 모두 충족되어야 합니다. …
DLM-LST-9B (clean Korean):
민법상 **채권자대위권**은 채무자가 그 재산에 속하는 권리를 행사하지 않음으로써 채권자의 채권이 불이행될 우려가 있는 경우, 채권자가 채무자의 지위에 서서 그 권리를 직접 행사할 수 있는 제도입니다. 이는 채권자의 채권을 보전하기 위한 중요한 수단이며, 그 성립 요건과 행사 방법은 다음과 같이 상세히 설명드릴 수 있습니다.
### 1. 채권자대위권의 성립 요건
채권자대위권을 행사하려면 민법 제 401 조에 따라 다음 **4 가지 요건**을 모두 충족해야 합니다.
1. **채권자가 채무자에게 채권을 가지고 있어야 함**
* 채권자와 채무자 사이에 유효한 채권관계가 존재해야 합니다. …
…
Cross-lingual Selectivity
When the user explicitly asks for Chinese, DLM-LST-9B readily produces it. The previous examples showed the model avoiding unwanted Chinese inside an otherwise-Korean answer; the example below shows it emitting Chinese fluently when the user's instruction calls for it.
Prompt: 피보나치 수열의 n번째 항을 반환하는 파이썬 함수를 작성해주세요. 설명은 중국어로 해주세요.
DLM-LST-9B (code in Python, explanation in Chinese):
다음은 파이썬을 사용하여 피보나치 수열의 n 번째 항을 계산하는 함수입니다.
이 코드는 **반복 (Iterative)** 방식을 사용하여 효율적으로 구현되었습니다.
```python
def fibonacci(n):
    """피보나치 수열의 n 번째 항을 반환합니다."""
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
```
### 코드 설명 (中文):
1. **函数定义**: `fibonacci(n)` 接收 `n` 并返回该位置的斐波那契数。
2. **边界处理**: `n <= 0` 返回 0; `n == 1` 返回 1。
3. **迭代计算**: 使用 `a, b` 两个变量滚动累加 ...
Thinking-trace Suppression
Qwen3.5-9B's <think> block leaks Chinese even more severely than its
final answer, often slipping into Chinese once the reasoning gets stuck.
DLM-LST-9B suppresses that leakage inside the thinking block too.
Prompt: 업무 협조 요청을 받은 기관이 협조 요청 문서에 흠이 있음을 발견한 때에는 접수한 날부터 몇 일 이내에 보완을 요구하여야 하는가? (사무관리규정 개정으로 제외된 문제입니다. 정답은 3번 입니다.)
Chinese-character counts (thinking budget=8,192):
| Metric | Qwen3.5-9B | DLM-LST-9B |
|---|---|---|
| `<think>` block, Chinese characters | 3,472 | 0 |
| `<answer>` block, Chinese characters | leaks Chinese (正如您所说, 该规定已被修订/删除) | 0 (clean Korean) |
| "should write in Korean" → Chinese events | 484 | 0 |
Qwen3.5-9B <think> block (keeps slipping back to Chinese after each self-correction to Korean):
Here's a thinking process that leads to the suggested answer:
1. **Analyze the Request:**
* Question: When an agency receives a request for cooperation
and finds an error/flaw in the request document, within how
many days from the date of receipt must they request a
supplement/correction?
* Context: The user explicitly states that this question has
been *removed* from the "Office Management Regulations"
(사무관리규정) revision.
* User's Hint: The user states "The answer is number 3"
(정답은 3 번 입니다).
…
… (later in the trace, ≈ 7,000 tokens in) …
"하지만正如您所说,该规定已被修订/删除。" -> "하지만正如您所说,该规定已被修订/删除。" (Wait, I need to write in Korean).
Okay, I will write in Korean.
"하지만正如您所说,该规定已被修订/删除。" -> "하지만正如您所说,该规定已被修订/删除。" (Wait, I need to write in Korean).
Okay, I will write in Korean.
"하지만正如您所说,该规定已被修订/删除。" -> "하지만正如您所说,该规定已被修订/删除。" (Wait, I need to write in Korean).
Okay, I will write in Korean.
… (this exact two-line cycle repeats until the budget is exhausted)
The base model knows it should write in Korean: every cycle ends with "(Wait, I need to write in Korean). Okay, I will write in Korean." But the very next token is still Chinese (正, from 正如您所说 = "as you said"), and the trace slides right back into the same fragment. This loop fires 484 times before the token budget runs out.
DLM-LST-9B targets exactly this failure: Chinese tokens being chosen
even right after the model says they should not be.
DLM-LST-9B <think> block (clean Korean or English, terminates naturally):
Here's a thinking process that leads to the suggested answer:
1. **Analyze the User's Request:**
* Question: When an institution receives a request for
cooperation and finds a flaw in the document, within how
many days must they request a supplement?
* Context: The user explicitly states that this question has
been *removed* due to amendments in the 사무관리규정.
* User's Hint: The user claims the correct answer is "Option 3".
* Goal: Provide the correct information based on the *old*
regulation but also clarify the current status.
…
Okay, writing the response now.
**Wait, one detail:** The prompt says "정답은 3 번 입니다"
(The answer is option 3). Usually, option 3 in these types of
questions corresponds to "3 days". I should explicitly mention
"3 days".
**Wait, is there any chance the answer is 5 days?**
Some older regulations had 5 days for other things. But for
"Cooperation Request Flaw", it was 3 days. Reference: 2019 Office
Management Regulations, Article 13, Item 2.
Okay, definitely 3 days.
Okay, proceeding.
DLM-LST-9B also second-guesses itself ("Wait, is there any chance the answer is 5 days?"), but every self-correction is in English or Korean instead of Chinese, and the trace closes cleanly after committing to "3 days". 8,062 trace characters, 0 Chinese characters.
Qwen3.5-9B Final answer (Chinese leaks into the user-facing answer):
네, 맞습니다. 정답은 **3 일**입니다.
과거 **국회 사무관리규정 제 26 조 (업무 협조 요청의 처리)** 에 따르면,
협조 요청 기관이 협조 요청 문서에 흠이 있음을 발견한 때에는 접수한 날부터
**3 일 이내**에 보완을 요구하여야 했습니다.
하지만正如您所说,该规定已被修订/删除。
**사무관리규정 개정으로 해당 조항이 삭제 (또는 변경) 되었기 때문에**,
현재는 해당 규정을 기준으로 하는 문제가 출제되지 않습니다.
* **정답:** 3 일
* **근거:** 과거 국회 사무관리규정 (개정 전)
* **현황:** 규정 개정으로 해당 조항이 제외됨
The Chinese leak from <think> also shows up in the final answer.
The same Chinese fragment 正如您所说,该规定已被修订/删除 that the base model
produced ~3,400 Chinese characters looping on inside <think> appears again,
unchanged, in the user-facing answer.
DLM-LST-9B Final answer (clean Korean):
네, 맞습니다. 말씀하신 대로 과거의 **사무관리규정 (개정 전)** 에 따른
정답은 **3 일**입니다.
구체적인 내용은 다음과 같습니다.
1. **과거 규정 (개정 전):**
* 사무관리규정 제 13 조 제 2 항에 따라, 업무 협조 요청 문서에
흠이 있는 경우 수신 기관은 **접수한 날부터 3 일 이내**에 보완을
요구해야 했습니다.
* 따라서 해당 문제가 출제되었을 당시의 정답은 **3 일** (보통
객관식 3 번) 이었습니다.
2. **현재 규정 (개정 후):**
* 사무관리규정이 개정되면서 해당 조항이 삭제되거나 변경되었습니다.
* 현재는 업무 협조 요청 문서의 흠이 있을 때 3 일 이내 보완 요구가
사무관리규정에 의해 의무화되지 않습니다.
Both models reach the correct conclusion ("3 일"), but the base model contaminates both the reasoning trace and the final answer with Chinese, while DLM-LST-9B keeps both in clean Korean / English.
Training procedure
LST-Logit was applied to Qwen/Qwen3.5-9B with the following configuration:
- Trainable parameter: `lm_head.weight` only (everything else frozen).
- Penalty: the Logit variant described above.
- Risk-scored token set $T_{\text{risk}}$: 53,429 CJK rows + 953 broken-byte rows = 54,382 rows in total. Broken-byte risk uses n-gram windows `(2, 3, 4)` weighted `(0.7, 0.2, 0.1)` with 50 random-neighbour samples per token.
- Calibration distribution `D`: 1,000 final-layer hidden states gathered from 60 prompts, with `enable_thinking=true` and `apply_final_layer_norm=true`.
- Optimiser: AdamW (lr `1e-4`, weight decay `0.0`).
- Schedule: cosine λ-decay over 1,000 steps from λ₀ = 1.0.
- Seed: 42; bf16; single H100.
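The λ schedule above can be sketched as a standard cosine decay. The decay-to-zero form is an assumption; the card only states "cosine λ-decay over 1,000 steps from λ₀ = 1.0":

```python
import math

def lam(step, total_steps=1000, lam0=1.0):
    """Cosine decay of the penalty weight λ from λ₀ to 0 (assumed form)."""
    s = min(step, total_steps)  # clamp past the end of the schedule
    return lam0 * 0.5 * (1.0 + math.cos(math.pi * s / total_steps))

print(lam(0))     # 1.0
print(lam(500))   # 0.5
print(lam(1000))  # 0.0
```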
Limitations
- Not an instruction-tuned chat model. Only the output head of Qwen3.5-9B was re-tuned; the rest of the network is unchanged. Conversational behaviour is therefore identical to the base, aside from the Chinese-token logit suppression.
- Degraded Chinese generation. Tasks that require Chinese output, such as Chinese translation, Chinese code comments, or bilingual Q&A, will see noticeably lower quality. Use the base Qwen3.5-9B instead for such workloads.
- Multimodal not re-benchmarked. The vision tower weights are bit-identical to the base, so multimodal performance should be unchanged. We have not, however, re-benchmarked the vision pipeline in this release.
- Calibration-distribution sensitivity. The logit penalty was optimised against a fixed calibration set `D` of 1,000 hidden states from 60 prompts. Suppression strength on contexts far from `D`, such as very different domains, much longer generations, or unusual prompting styles, has not been separately verified.
License
This model is released under the Apache 2.0 License.
Contact
For questions, feedback, or collaboration inquiries, feel free to reach out via our website.
Citation
@misc{dataslab2026LST,
title = {Language Selection Tuning: A Learning-Based Approach to Persistent Language Bias Mitigation in Multilingual LLMs},
author = {{Data Science Lab., Ltd}},
month = {May},
year = {2026},
url = {https://huggingface.co/dataslab/DLM-LST-9B}
}