DLM-LST-9B

DLM-LST-9B is a Qwen/Qwen3.5-9B derivative whose output projection (lm_head.weight) has been re-tuned with Language Selection Tuning (LST), specifically its Logit variant. The goal is to suppress unwanted Chinese-character generation when the model serves non-Chinese-speaking (English, Korean, Japanese, etc.) users.

All other parameters are bit-identical to the base model. Vision and multimodal capabilities are preserved unchanged.

Why LST?

Multilingual LLMs trained on heavily skewed corpora (e.g., Qwen on Chinese-rich data) tend to leak the dominant training language regardless of prompt language. This phenomenon is known as language confusion. For Korean users, this means Chinese characters sometimes appear in the middle of an otherwise Korean answer. This hurts both readability and user trust.

Language Selection Tuning (LST) defines a differentiable training objective:

$$\mathcal{L}_{\text{lang}} = \frac{1}{|T_{\text{risk}}|} \sum_{t \in T_{\text{risk}}} r(t) \cdot \varphi(w_t)$$

where $T_{\text{risk}}$ is the set of risk-scored target-language tokens (Chinese characters in the CJK Unified Ideographs range + multi-byte BPE fragments that decode to Chinese characters), $r(t)$ is each token's risk score, and $\varphi(\cdot)$ is a per-row penalty applied to lm_head weight row $w_t$.
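
Concretely, the risk set can be approximated by scanning the vocabulary for rows whose tokens decode to CJK ideographs. A minimal sketch, assuming a simple Unicode membership test (the release's actual scoring, including the broken-byte n-gram weighting, is described under Training procedure and may use different ranges):

import re
from transformers import AutoTokenizer

# Rough approximation of T_risk: vocabulary rows whose decoded text contains a
# CJK Unified Ideograph (incl. Extension A). Broken-byte BPE fragments decode
# to U+FFFD here and are NOT caught; the release scores those separately with
# an n-gram scheme (see "Training procedure").
CJK = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]")

tok = AutoTokenizer.from_pretrained("dataslab/DLM-LST-9B")
risk_rows = [
    tid for tid in range(len(tok))
    if CJK.search(tok.decode([tid], skip_special_tokens=True))
]
print(f"{len(risk_rows):,} rows decode to CJK ideographs")  # card reports 55,328 CJK vocab rows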

This release uses the Logit variant:

$$\varphi_{\text{logit}}(w_t) = \mathbb{E}_{h \sim D}\bigl[ \max(w_t \cdot h,\, 0) \bigr]$$

which directly penalises positive-logit contributions of risky tokens evaluated on a calibration distribution D of final-layer hidden states. Because the optimization lives in the same gradient-based landscape that downstream SFT / RLHF stages will traverse, the resulting weight configuration tends to be more persistent under post-LST fine-tuning.
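
In code, the Logit penalty and the resulting loss are a few lines. A minimal PyTorch sketch, assuming `risk_ids` / `risk_scores` encode $T_{\text{risk}}$ and $r(t)$ and `calib_h` holds final-layer hidden states sampled from D (illustrative names, not the release's training script):

import torch

def lst_logit_loss(lm_head_weight: torch.Tensor,   # [vocab, hidden]
                   risk_ids: torch.Tensor,         # [|T_risk|] token ids
                   risk_scores: torch.Tensor,      # [|T_risk|] r(t)
                   calib_h: torch.Tensor) -> torch.Tensor:  # [N, hidden] samples from D
    """L_lang = (1/|T_risk|) * sum_t r(t) * E_h[ max(w_t . h, 0) ]."""
    w_risk = lm_head_weight[risk_ids]            # [|T_risk|, hidden]
    logits = calib_h @ w_risk.T                  # [N, |T_risk|]
    phi = logits.clamp_min(0.0).mean(dim=0)      # E_h[ max(w_t . h, 0) ] per risky row
    return (risk_scores * phi).mean()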

Key changes from Qwen/Qwen3.5-9B

| Component | Status |
|---|---|
| lm_head.weight | Re-trained by LST-Logit. 54,382 / 248,320 rows (21.9 %) modified. |
| Token embeddings (embed_tokens) | Bit-identical to base. |
| Transformer blocks (Gated Attention / Gated DeltaNet / SwiGLU FFN / RMSNorm) | Bit-identical to base. |
| mtp.* (multi-token prediction head) | Bit-identical to base. |
| Vision tower & preprocessor configs | Bit-identical to base. |
| Tokenizer (vocab / merges / chat template) | Identical content to base. |

Per-row script breakdown of the lm_head delta:

| Script | Vocab rows | Rows changed | % of category | Mean ‖Δrow‖ |
|---|---:|---:|---:|---:|
| CJK | 55,328 | 53,429 | 96.57 % | 0.127 |
| Other non-Latin (broken-byte CJK fragments) | 15,141 | 953 | 6.29 % | 0.178 |
| Latin | 139,845 | 0 | 0.00 % | — |
| Korean | 6,807 | 0 | 0.00 % | — |
| Kana (Japanese) | 3,560 | 0 | 0.00 % | — |
| Cyrillic | 18,580 | 0 | 0.00 % | — |
| Arabic | 8,816 | 0 | 0.00 % | — |

The Logit penalty rotates each affected row only slightly while preserving its norm (cos(w_orig, w_new) ≈ 0.992 mean over changed rows). Korean, Japanese, Latin, Cyrillic, and Arabic rows are bit-identical to the base.
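
Because every other tensor is bit-identical, the numbers above can be reproduced by diffing lm_head between the two checkpoints. A rough sketch (loads both 9B checkpoints, so it assumes enough host memory; `get_output_embeddings()` is assumed to resolve to lm_head on this composite class):

import torch
from transformers import AutoModelForImageTextToText

base = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-9B", dtype=torch.bfloat16)
lst  = AutoModelForImageTextToText.from_pretrained("dataslab/DLM-LST-9B", dtype=torch.bfloat16)

w0 = base.get_output_embeddings().weight.float()   # [vocab, hidden]
w1 = lst.get_output_embeddings().weight.float()

changed = (w0 != w1).any(dim=-1)                    # which rows were touched
cos = torch.nn.functional.cosine_similarity(w0[changed], w1[changed], dim=-1)
print(f"rows changed: {int(changed.sum()):,}")                 # card reports 54,382
print(f"mean cos over changed rows: {cos.mean().item():.4f}")  # card reports ≈ 0.992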

Quickstart

The recommended serving path is vLLM, which is also what we used in our evaluation pipeline.

vllm serve dataslab/DLM-LST-9B \
    --port 8000 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --reasoning-parser qwen3        # exposes <think> trace via OpenAI API
    # --enable-reasoning            # auto-on with --reasoning-parser (vLLM >= 0.7)
    # --max-model-len 16384         # cap context to shrink KV cache (default: 262,144)
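
Once the server is running, any OpenAI-compatible client can query it. A minimal sketch; with --reasoning-parser qwen3 the trace is returned in a separate reasoning_content field (field name and availability depend on your vLLM version):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="dataslab/DLM-LST-9B",
    messages=[
        # "What is the most common clay mineral around the Korean Peninsula?"
        {"role": "user", "content": "한반도 주변에 가장 흔한 점토광물은?"},
    ],
    max_tokens=1024,
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))   # <think> trace, if the parser exposed it
print(msg.content)                               # final answer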

Use with transformers

Non-Thinking mode (recommended for fast chat)

import torch
from transformers import AutoTokenizer, AutoModelForImageTextToText

REPO = "dataslab/DLM-LST-9B"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO,
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "한반도 주변에 가장 흔한 점토광물은?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
text = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:],
                       skip_special_tokens=True)
print(text)

Thinking mode (recommended for complex reasoning)

Either use thinking_budget (e.g., vLLM's --reasoning-parser qwen3) or give max_new_tokens enough headroom (e.g., 8,192 + 256 = 8,448). Caveat: without a thinking_budget cap, a too-small max_new_tokens can be fully consumed inside <think> and the answer never gets emitted.

# ... tokenizer / model loaded as above ...

THINKING_BUDGET = 8192   # max tokens inside <think>
ANSWER_TOKENS   = 256    # tokens after </think>

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=THINKING_BUDGET + ANSWER_TOKENS)
text = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:],
                       skip_special_tokens=True)
# `text` contains the <think>...</think> trace followed by the final answer.
# (vLLM separates them via --reasoning-parser qwen3; see Quickstart above.)
print(text)
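
With plain transformers the trace and the answer arrive as one string, so the caller has to separate them. A simple string-level split, assuming the model closes the trace with a literal </think> tag:

# Split the decoded text into the reasoning trace and the final answer.
# If the budget was exhausted inside <think>, `answer` will be empty.
thinking, _, answer = text.partition("</think>")
thinking = thinking.removeprefix("<think>").strip()
answer = answer.strip()
print("ANSWER:", answer)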

Why AutoModelForImageTextToText? Qwen3.5-9B's declared architecture is Qwen3_5ForConditionalGeneration, a composite class that wraps both the text decoder and the vision tower. Loading via AutoModelForCausalLM works for text-only inference but strips the vision submodule and may produce a config that downstream tools (e.g., vLLM) reject. If you need a pure text causal-LM handle, use model.language_model after loading.
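
For example, after loading with AutoModelForImageTextToText, a text-only view can be grabbed directly (attribute name as stated above; the exact submodule layout may vary across transformers versions):

# Text-only causal-LM view of the composite model, per the note above.
text_lm = model.language_model                # decoder without the vision tower
lm_head = model.get_output_embeddings()       # the LST-retuned output projection
print(type(text_lm).__name__, tuple(lm_head.weight.shape))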

Benchmark Results

Evaluation Metrics

(1) Selectivity

Fraction of explicit Chinese requests on which the model avoids producing Chinese.

  • Lower is better (~0): the model produces Chinese when asked, respecting user intent.
  • Higher is worse (~1): the model refuses Chinese even when explicitly asked (blanket suppression).

| Metric | Benchmark Dataset |
|---|---|
| chin_prom | Self-built 1,000-prompt Chinese elicitation set (e.g., How do you say '사랑' in Chinese? or the Python + Chinese-comment prompt) |

(2) Chinese-leak suppression

Korean prompts should yield Korean answers; any Chinese token leaked into the answer counts as a failure. The metric is the clean-Korean response ratio.

  • Higher is better (~1): Korean answers stay fully Korean (no Chinese tokens leaked).
  • Lower is worse (~0): Chinese tokens leak into otherwise-Korean answers.

| Metric | Benchmark Dataset |
|---|---|
| chin_cs | KMMLU Computer Science subjects (free-form Korean generation) |
| chin_ie | KMMLU Industrial Engineering subjects (free-form Korean generation) |
| chin_total | KMMLU (free-form Korean generation) |
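
Both metric families reduce to detecting Chinese characters in generated answers. A sketch of how the detection and the two ratios might be computed (the evaluation harness itself is not part of this release; the Unicode ranges below are an assumption):

import re

# CJK Unified Ideographs (+ Extension A); an assumption about what counts as "Chinese".
CHINESE = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]")

def has_chinese(text: str) -> bool:
    return bool(CHINESE.search(text))

def chin_prom(chinese_request_answers: list[str]) -> float:
    """Selectivity: fraction of explicit Chinese requests answered WITHOUT Chinese (lower is better)."""
    return sum(not has_chinese(a) for a in chinese_request_answers) / len(chinese_request_answers)

def chin_clean_ratio(korean_answers: list[str]) -> float:
    """Leak suppression: fraction of Korean answers with zero Chinese tokens (higher is better)."""
    return sum(not has_chinese(a) for a in korean_answers) / len(korean_answers)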

(3) Reasoning / task performance

| Metric | Benchmark Dataset |
|---|---|
| acc_cs | KMMLU Computer Science subjects (multiple-choice log-likelihood comparison) |
| acc_ie | KMMLU Industrial Engineering subjects (multiple-choice log-likelihood comparison) |
| acc_total | KMMLU (multiple-choice log-likelihood comparison) |
| HumanEval | HumanEval (pass@1) |
| GSM8K | GSM8K (exact-match accuracy) |

(4) SFT-persistence

| Metric | Details |
|---|---|
| SRR (Suppression Retention Rate) ↑ | Ratio of chin_total after SFT to chin_total before SFT. Closer to 1.0 means SFT did not erode the leak-suppression effect. Built on chin_total (not chin_prom) so the direction stays higher-is-better. |
| \|Δ_selectivity\| ↓ | Absolute change in chin_prom. Smaller means SFT barely shifted selectivity. Sign of Δ: positive = blunted selectivity (bad); negative = partial recovery of selectivity (only meaningful if pre-SFT selectivity was already blunt). |
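
As a worked example using the DLM-LST-9B → SFT numbers from the non-thinking-mode tables below:

# SRR = chin_total(after SFT) / chin_total(before SFT)
srr = 0.9926 / 0.9928            # ≈ 0.9998  (suppression essentially retained)
# |Δ_selectivity| = |chin_prom(after SFT) - chin_prom(before SFT)|
delta_sel = abs(0.155 - 0.071)   # = 0.084   (selectivity barely shifted)
print(f"SRR = {srr:.4f}, |Δ_selectivity| = {delta_sel:.3f}")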

Chinese Suppression (Thinking mode)

Evaluated with enable_thinking=True. The DLM-LST-9B column is calibrated with thinking enabled.

| Metric | Qwen3.5-9B (base) | LST-L1 | LST-L2 | DLM-LST-9B |
|---|---:|---:|---:|---:|
| **Selectivity** | | | | |
| chin_prom ↓ | 0.029 | 0.993 | 0.992 | 0.065 |
| **Chinese-leak suppression** | | | | |
| chin_cs ↑ | 0.985 | 1.000 | 1.000 | 0.999 |
| chin_ie ↑ | 0.978 | 1.000 | 0.999 | 0.997 |
| chin_total ↑ | 0.9717 | 0.9988 | 0.9988 | 0.9927 |
| **Reasoning / Task performance** | | | | |
| acc_cs ↑ | 0.811 | 0.811 | 0.811 | 0.811 |
| acc_ie ↑ | 0.618 | 0.618 | 0.620 | 0.620 |
| acc_total ↑ | 0.5897 | 0.5897 | 0.5893 | 0.5893 |
| HumanEval ↑ | 0.6646 | 0.6768 | 0.6585 | 0.6646 |
| GSM8K ↑ | 0.8749 | 0.8749 | 0.8749 | 0.8749 |

DLM-LST-9B keeps chin_prom at 0.065. It preserves the ability to generate Chinese when the user explicitly asks for it, while still cutting unintended Chinese leakage to the level of chin_total ≈ 0.99. Downstream reasoning (acc_*, HumanEval, GSM8K) is comparable to, or in some cases even better than, the base model.

Chinese Suppression (Non-Thinking mode)

Evaluated with enable_thinking=False. The DLM-LST-9B column here is a separate think-OFF-calibrated checkpoint (not this release).

| Metric | Qwen3.5-9B (base) | LST-L1 | LST-L2 | DLM-LST-9B |
|---|---:|---:|---:|---:|
| **Selectivity** | | | | |
| chin_prom ↓ | 0.037 | 0.966 | 0.963 | 0.080 |
| **Chinese-leak suppression** | | | | |
| chin_cs ↑ | 0.964 | 0.999 | 1.000 | 0.990 |
| chin_ie ↑ | 0.934 | 0.997 | 0.999 | 0.983 |
| chin_total ↑ | 0.9405 | 0.9974 | 0.9975 | 0.9830 |
| **Reasoning / Task performance** | | | | |
| acc_cs ↑ | 0.811 | 0.811 | 0.811 | 0.811 |
| acc_ie ↑ | 0.615 | 0.614 | 0.614 | 0.614 |
| acc_total ↑ | 0.5900 | 0.5897 | 0.5897 | 0.5897 |
| HumanEval ↑ | 0.6707 | 0.6768 | 0.6707 | 0.6707 |
| GSM8K ↑ | 0.8757 | 0.8749 | 0.8741 | 0.8787 |

Suppression Persistence after SFT-stage (Non-Thinking mode)

Each pipeline was fine-tuned (SFT) on the beomi/KoAlpaca-v1.1a dataset. After the SFT stage, DLM-LST-9B keeps both its Chinese-leak suppression (SRR ≈ 1.000) and its selectivity (|Δ_selectivity| ≈ 0.08) almost unchanged.

| Metric | Qwen3.5-9B → SFT | DLM-LST-9B → SFT |
|---|---:|---:|
| **Chinese-leak suppression** | | |
| chin_total before ↑ | 0.9405 | 0.9928 |
| chin_total after ↑ | 0.9927 | 0.9926 |
| SRR ↑ | 1.0555 | 0.9998 |
| **Selectivity** | | |
| chin_prom before ↓ | 0.037 | 0.071 |
| chin_prom after ↓ | 0.128 | 0.155 |
| \|Δ_selectivity\| ↓ | 0.091 | 0.084 |

| Metric | Qwen3.5-9B (base) | Qwen3.5-9B → SFT | DLM-LST-9B → SFT |
|---|---:|---:|---:|
| **Selectivity** | | | |
| chin_prom ↓ | 0.037 | 0.128 | 0.155 |
| **Chinese-leak suppression** | | | |
| chin_cs ↑ | 0.964 | 0.998 | 0.998 |
| chin_ie ↑ | 0.934 | 0.993 | 0.994 |
| chin_total ↑ | 0.9405 | 0.9927 | 0.9926 |
| **Reasoning / Task performance** | | | |
| acc_cs ↑ | 0.811 | 0.748 | 0.751 |
| acc_ie ↑ | 0.615 | 0.505 | 0.509 |
| acc_total ↑ | 0.5900 | 0.5202 | 0.5217 |
| HumanEval ↑ | 0.6707 | 0.6037 | 0.6402 |
| GSM8K ↑ | 0.8757 | 0.8211 | 0.8226 |

English Suppression (Non-Thinking mode)

Applied to Llama-3.1-8B-Instruct (suppressing English rather than Chinese), the LST-Logit recipe behind DLM-LST-9B is the only variant that keeps coding (HumanEval) and math (GSM8K) usable.

| Metric | Llama-3.1-8B-Instruct (base) | LST-L1 | LST-L2 | DLM-LST-9B |
|---|---:|---:|---:|---:|
| **Selectivity** | | | | |
| eng_prom ↓ | 0.018 | 0.716 | 0.725 | 0.067 |
| **English-leak suppression** | | | | |
| eng_cs ↑ | 0.241 | 0.683 | 0.667 | 0.365 |
| eng_ie ↑ | 0.483 | 0.887 | 0.888 | 0.551 |
| eng_total ↑ | 0.5391 | 0.8709 | 0.8702 | 0.6067 |
| **Reasoning / Task performance** | | | | |
| acc_cs ↑ | 0.698 | 0.251 | 0.257 | 0.727 |
| acc_ie ↑ | 0.439 | 0.191 | 0.203 | 0.475 |
| acc_total ↑ | 0.4417 | 0.2229 | 0.2309 | 0.4580 |
| HumanEval ↑ | 0.6341 | 0.0000 | 0.0000 | 0.5671 |
| GSM8K ↑ | 0.6967 | 0.4837 | 0.4852 | 0.6088 |

Example Outputs

DLM-LST-9B vs Qwen3.5-9B on a Korean KMMLU prompt: base leaks 9 Chinese tokens (伊利石, кaо린, 的), DLM-LST-9B emits 0 Chinese tokens.

Asked in Korean about the most common clay mineral on the Korean Peninsula, Qwen3.5-9B leaks 9 Chinese / mixed-script tokens (伊利石, кao린, 的) into its answer. DLM-LST-9B answers the same prompt entirely in Korean (0 Chinese tokens).

Domain-wise Leakage Comparisons

Computer Science

Prompt: TCP 와 UDP 프로토콜의 차이점을 자세히 설명해 주세요. (Please explain the differences between the TCP and UDP protocols in detail.)

Qwen3.5-9B (leaks 才开始 mid-sentence):

TCP (Transmission Control Protocol) 와 UDP (User Datagram Protocol) 는 인터넷의 가장 기본이 되는 **전송 계층 (Transport Layer)** 프로토콜로, …

### 1. 연결 방식 (Connection)
*   **TCP**: **연결 지향적 (Connection-oriented)**입니다. 데이터 전송 전에 서버와 클라이언트 사이에 '손잡이 (Handshake)' 과정을 거쳐 연결을 수립합니다. … 이는 전화통화처럼 상대방이 준비된 상태에서才开始하는 것과 같습니다.
*   **UDP**: **비연결 지향적 (Connectionless)**입니다. 전송 전에 연결을 맺지 않고, 패킷 (데이터그램) 을 바로 보냅니다. …

DLM-LST-9B (clean Korean):

TCP (Transmission Control Protocol) 와 UDP (User Datagram Protocol) 는 인터넷과 네트워크 통신의 기반이 되는 **전송 계층 (Transport Layer)** 의 두 가지 핵심 프로토콜입니다. 둘 다 데이터를 전송하지만, 접근 방식과 사용처에서 근본적인 차이가 있습니다.

가장 큰 차이는 **"신뢰성 (Reliability)"**과 **"속도 (Speed)"** 사이의 트레이드-off 관계에 있습니다. TCP 는 데이터가 무결하게 도착하는 것을 최우선으로 하며, UDP 는 최대한 빠르고 효율적으로 데이터를 보내는 것을 목표로 합니다. …

### 1. 연결 방식 (Connection-Oriented vs Connectionless)
*   **TCP**: **연결 지향적**입니다. 데이터 전송 전에 서버와 클라이언트 사이에 '손잡이 (Handshake)' 과정을 거쳐 연결을 수립합니다. 마치 우편물을 보낼 때 먼저 주소 확인과 발송 절차를 거치는 것과 같습니다. …

Industrial Engineering

Prompt: 린 (Lean) 제조 방식에서 정의하는 7가지 낭비 (muda) 의 종류를 한국어로 설명해 주세요. (Please explain, in Korean, the seven types of waste (muda) defined in Lean manufacturing.)

Qwen3.5-9B (leaks 积压 mid-sentence):

린 (Lean) 제조 방식에서 **7 가지 낭비 (7 Muda)** 는 가치 없는 활동을 제거하여 효율성을 극대화하는 핵심 개념입니다. …

### 1. 과잉 생산 (Overproduction)
가장 위험하고 근본적인 낭비로 꼽힙니다. 고객이 아직 필요로 하지 않거나, 다음 공정에서 필요로 하는 것보다 미리 제품을 만들어내는 행위입니다. 과잉 생산은 재고积压을 유발하고, 보관 비용을 증가시키며, 실제 수요를 왜곡시켜 다른 낭비들을 더 발생시키는 원인이 됩니다.

### 2. 대기 (Waiting)
…

DLM-LST-9B (clean Korean):

린 (Lean) 제조 방식에서 **7 가지 낭비 (7 Muda)** 는 가치 없는 활동을 제거하여 효율성을 극대화하기 위한 핵심 개념입니다. …

### 1. 과잉 생산 (Overproduction)
가장 위험하고 근본적인 낭비로 꼽힙니다. 고객이 아직 필요로 하지 않거나, 다음 공정보다 앞서서 제품을 만들어내는 것을 말합니다. 이는 재고 증가, 보관 비용 상승, 그리고 나중에 발견될 결함 수리 비용을 초래합니다. "필요한 만큼, 필요한 때에, 필요한 양만큼"을 만드는 것이 목표입니다.

### 2. 대기 (Waiting)
원자재, 부품, 정보, 사람, 기계 등이 다음 공정을 시작할 때까지 기다려야 하는 시간입니다. …

Law

Prompt: 민법상 채권자대위권의 성립요건과 행사 방법을 자세히 설명해 주세요. (Please explain in detail the requirements for, and the method of exercising, the obligee's right of subrogation under the Civil Act.)

Qwen3.5-9B (leaks 享有的):

민법상 **채권자대위권**은 채권자가 채무자가 타인 (제 3 자) 에게享有的 채권을 행사하지 않을 때, 채권자가 직접 그 채권을 대신하여 행사할 수 있는 권리를 말합니다. 이는 채무자의 재산이 부족하여 채권자가 변제를 받을 수 없는 경우, 채무자의 재산 (채권) 을 확보함으로써 채권자의 이익을 보호하기 위한 제도입니다.

요청하신 **성립요건**과 **행사 방법**에 대해 상세히 설명해 드리겠습니다.

### 1. 채권자대위권의 성립요건
채권자대위권을 행사하기 위해서는 민법 제 401 조에 명시된 4 가지 요건이 모두 충족되어야 합니다. …

DLM-LST-9B (clean Korean):

민법상 **채권자대위권**은 채무자가 그 재산에 속하는 권리를 행사하지 않음으로써 채권자의 채권이 불이행될 우려가 있는 경우, 채권자가 채무자의 지위에 서서 그 권리를 직접 행사할 수 있는 제도입니다. 이는 채권자의 채권을 보전하기 위한 중요한 수단이며, 그 성립 요건과 행사 방법은 다음과 같이 상세히 설명드릴 수 있습니다.

### 1. 채권자대위권의 성립 요건
채권자대위권을 행사하려면 민법 제 401 조에 따라 다음 **4 가지 요건**을 모두 충족해야 합니다.
1.  **채권자가 채무자에게 채권을 가지고 있어야 함**
    *   채권자와 채무자 사이에 유효한 채권관계가 존재해야 합니다. …
…

Cross-lingual Selectivity

When the user explicitly asks for Chinese, DLM-LST-9B readily produces it. The previous examples showed the model avoiding unwanted Chinese inside an otherwise-Korean answer; the example below shows it emitting Chinese fluently when the user's instruction calls for it.

Prompt: 피보나치 수열의 n번째 항을 반환하는 파이썬 함수를 작성해주세요. 설명은 중국어로 해주세요. (Please write a Python function that returns the n-th term of the Fibonacci sequence, and give the explanation in Chinese.)

DLM-LST-9B (code in Python, explanation in Chinese):

다음은 파이썬을 사용하여 피보나치 수열의 n 번째 항을 계산하는 함수입니다.
이 코드는 **반복 (Iterative)** 방식을 사용하여 효율적으로 구현되었습니다.

def fibonacci(n):
    """피보나치 수열의 n 번째 항을 반환합니다."""
    if n <= 0: return 0
    elif n == 1: return 1
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b


### 코드 설명 (中文):
1.  **函数定义**: `fibonacci(n)` 接收 `n` 并返回该位置的斐波那契数。
2.  **边界处理**: `n <= 0` 返回 0; `n == 1` 返回 1。
3.  **迭代计算**: 使用 `a, b` 两个变量滚动累加 ...

Thinking-trace Suppression

Qwen3.5-9B's <think> block leaks Chinese even more severely than its final answer, often slipping into Chinese once the reasoning gets stuck. DLM-LST-9B suppresses that leakage inside the thinking block too.

Prompt: 업무 협조 요청을 받은 기관이 협조 요청 문서에 흠이 있음을 발견한 때에는 접수한 날부터 몇 일 이내에 보완을 요구하여야 하는가? (사무관리규정 개정으로 제외된 문제입니다. 정답은 3번 입니다.) (Translation: When an institution that has received a request for work cooperation finds a flaw in the request document, within how many days of receipt must it ask for a supplement? This question was removed by a revision of the Office Management Regulations; the answer is option 3.)

Chinese-character counts (thinking budget=8,192):

| Metric | Qwen3.5-9B | DLM-LST-9B |
|---|---|---|
| `<think>` block, Chinese characters | 3,472 | 0 |
| `<answer>` block, Chinese characters | leaks Chinese (正如您所说, 该规定已被修订/删除) | 0 (clean Korean) |
| "should write in Korean" → Chinese events | 484 | 0 |

Qwen3.5-9B <think> block (keeps slipping back to Chinese after each self-correction to Korean):

Here's a thinking process that leads to the suggested answer:

1.  **Analyze the Request:**
    *   Question: When an agency receives a request for cooperation
        and finds an error/flaw in the request document, within how
        many days from the date of receipt must they request a
        supplement/correction?
    *   Context: The user explicitly states that this question has
        been *removed* from the "Office Management Regulations"
        (사무관리규정) revision.
    *   User's Hint: The user states "The answer is number 3"
        (정답은 3 번 입니다).
    …

… (later in the trace, ≈ 7,000 tokens in) …

    "하지만正如您所说,该规定已被修订/删除。" -> "하지만正如您所说,该规定已被修订/删除。" (Wait, I need to write in Korean).
    Okay, I will write in Korean.
    "하지만正如您所说,该规定已被修订/删除。" -> "하지만正如您所说,该规定已被修订/删除。" (Wait, I need to write in Korean).
    Okay, I will write in Korean.
    "하지만正如您所说,该规定已被修订/删除。" -> "하지만正如您所说,该规定已被修订/删除。" (Wait, I need to write in Korean).
    Okay, I will write in Korean.
    …  (this exact two-line cycle repeats until the budget is exhausted)

The base model knows it should write in Korean: every cycle ends with "(Wait, I need to write in Korean). Okay, I will write in Korean." But the next token is still Chinese (the start of 正如您所说, "as you said"), and the trace slides right back into the same fragment. This loop fires 484 times before the token budget runs out. DLM-LST-9B targets exactly this failure: Chinese tokens being chosen even right after the model has just said they should not be.

DLM-LST-9B <think> block (clean Korean or English, terminates naturally):

Here's a thinking process that leads to the suggested answer:

1.  **Analyze the User's Request:**
    *   Question: When an institution receives a request for
        cooperation and finds a flaw in the document, within how
        many days must they request a supplement?
    *   Context: The user explicitly states that this question has
        been *removed* due to amendments in the 사무관리규정.
    *   User's Hint: The user claims the correct answer is "Option 3".
    *   Goal: Provide the correct information based on the *old*
        regulation but also clarify the current status.

…

    Okay, writing the response now.

    **Wait, one detail:** The prompt says "정답은 3 번 입니다"
    (The answer is option 3). Usually, option 3 in these types of
    questions corresponds to "3 days". I should explicitly mention
    "3 days".

    **Wait, is there any chance the answer is 5 days?**
    Some older regulations had 5 days for other things. But for
    "Cooperation Request Flaw", it was 3 days. Reference: 2019 Office
    Management Regulations, Article 13, Item 2.
    Okay, definitely 3 days.

    Okay, proceeding.

DLM-LST-9B also second-guesses itself ("Wait, is there any chance the answer is 5 days?"), but every self-correction is in English or Korean instead of Chinese, and the trace closes cleanly after committing to "3 days". 8,062 trace characters, 0 Chinese characters.

Qwen3.5-9B Final answer (Chinese leaks into the user-facing answer):

네, 맞습니다. 정답은 **3 일**입니다.

과거 **국회 사무관리규정 제 26 조 (업무 협조 요청의 처리)** 에 따르면,
협조 요청 기관이 협조 요청 문서에 흠이 있음을 발견한 때에는 접수한 날부터
**3 일 이내**에 보완을 요구하여야 했습니다.

하지만正如您所说,该规定已被修订/删除。
**사무관리규정 개정으로 해당 조항이 삭제 (또는 변경) 되었기 때문에**,
현재는 해당 규정을 기준으로 하는 문제가 출제되지 않습니다.

*   **정답:** 3 일
*   **근거:** 과거 국회 사무관리규정 (개정 전)
*   **현황:** 규정 개정으로 해당 조항이 제외됨

The Chinese leak from <think> also shows up in the final answer. The same Chinese fragment 正如您所说,该规定已被修订/删除 that the base model looped on inside <think> (roughly 3,400 Chinese characters' worth) appears again, unchanged, in the user-facing answer.

DLM-LST-9B Final answer (clean Korean):

네, 맞습니다. 말씀하신 대로 과거의 **사무관리규정 (개정 전)** 에 따른
정답은 **3 일**입니다.

구체적인 내용은 다음과 같습니다.

1.  **과거 규정 (개정 전):**
    *   사무관리규정 제 13 조 제 2 항에 따라, 업무 협조 요청 문서에
        흠이 있는 경우 수신 기관은 **접수한 날부터 3 일 이내**에 보완을
        요구해야 했습니다.
    *   따라서 해당 문제가 출제되었을 당시의 정답은 **3 일** (보통
        객관식 3 번) 이었습니다.

2.  **현재 규정 (개정 후):**
    *   사무관리규정이 개정되면서 해당 조항이 삭제되거나 변경되었습니다.
    *   현재는 업무 협조 요청 문서의 흠이 있을 때 3 일 이내 보완 요구가
        사무관리규정에 의해 의무화되지 않습니다.

Both models reach the correct conclusion ("3 일"), but the base model contaminates both the reasoning trace and the final answer with Chinese, while DLM-LST-9B keeps both in clean Korean / English.

Training procedure

LST-Logit was applied to Qwen/Qwen3.5-9B with the following configuration (a condensed code sketch follows the list):

  • Trainable parameter: lm_head.weight only (everything else frozen).
  • Penalty: $\varphi_{\text{logit}}(w_t) = \mathbb{E}_{h \sim D}[\max(w_t \cdot h, 0)]$.
  • Risk-scored token set $T_{\text{risk}}$: 53,429 CJK rows + 953 broken-byte rows = 54,382 rows in total. Broken-byte risk uses n-gram windows (2, 3, 4) weighted (0.7, 0.2, 0.1) with 50 random-neighbour samples per token.
  • Calibration distribution D: 1,000 final-layer hidden states gathered from 60 prompts, enable_thinking=true, apply_final_layer_norm=true.
  • Optimiser: AdamW (lr 1e-4, weight decay 0.0).
  • Schedule: cosine λ-decay over 1,000 steps from λ₀ = 1.0.
  • Seed: 42; bf16; single H100.
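
Putting the configuration together, the whole procedure is a short gradient loop over lm_head.weight. A condensed sketch under the stated settings (calibration-state collection, broken-byte risk scoring, and other details are simplified; this is not the release's training script):

import math
import torch

# Assumed to exist from the sketches above:
#   lm_head_weight        - model.get_output_embeddings().weight, the only trainable tensor
#   risk_ids, risk_scores - T_risk token ids and r(t)
#   calib_h               - [1000, hidden] final-layer hidden states from the 60 calibration prompts
#   lst_logit_loss        - the penalty defined under "Why LST?"

lm_head_weight.requires_grad_(True)                 # everything else stays frozen
opt = torch.optim.AdamW([lm_head_weight], lr=1e-4, weight_decay=0.0)

steps, lam0 = 1000, 1.0
for step in range(steps):
    lam = lam0 * 0.5 * (1 + math.cos(math.pi * step / steps))   # cosine λ-decay from λ₀ = 1.0
    loss = lam * lst_logit_loss(lm_head_weight, risk_ids, risk_scores, calib_h)
    opt.zero_grad()
    loss.backward()
    opt.step()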

Limitations

  • No additional instruction tuning. Only the output head of Qwen3.5-9B was re-tuned; the rest of the network is unchanged. Conversational behaviour is therefore identical to the base, apart from the Chinese-token logit suppression.
  • Degraded Chinese generation. Tasks that require Chinese output, such as Chinese translation, Chinese code comments, or bilingual Q&A, will see noticeably lower quality. Use the base Qwen3.5-9B instead for such workloads.
  • Multimodal not re-benchmarked. The vision tower weights are bit-identical to the base, so multimodal performance should be unchanged. We have not, however, re-benchmarked the vision pipeline in this release.
  • Calibration-distribution sensitivity. The logit penalty was optimised against a fixed calibration set D of 1,000 hidden states from 60 prompts. Suppression strength on contexts far from D, such as very different domains, much longer generations, or unusual prompting styles, has not been separately verified.

License

This model is released under the Apache 2.0 License.

Contact

For questions, feedback, or collaboration inquiries, feel free to reach out via our website.

Citation

@misc{dataslab2026LST,
  title  = {Language Selection Tuning: A Learning-Based Approach to Persistent Language Bias Mitigation in Multilingual LLMs},
  author = {{Data Science Lab., Ltd}},
  month  = {May},
  year   = {2026},
  url    = {https://huggingface.co/dataslab/DLM-LST-9B}
}