# VectorSynth-GiT10M
VectorSynth-GiT10M is a ControlNet-based pipeline that generates satellite imagery from OpenStreetMap (OSM) vector data, fine-tuned on the GiT10M dataset of paired OSM + satellite tiles. Like VectorSynth-COSA, it conditions Stable Diffusion 2.1 Base on OSM semantics rendered as per-pixel embeddings in the COSA (Contrastive OSM-Satellite Alignment) embedding space.
## Model Description
VectorSynth-GiT10M uses a two-stage pipeline:
- RenderEncoder: Projects 768-dim COSA embeddings to 3-channel control images.
- ControlNet + UNet: Both fine-tuned on the GiT10M dataset to condition Stable Diffusion 2.1 on the rendered control images.
Unlike VectorSynth-COSA, which ships only a fine-tuned ControlNet on top of the stock SD 2.1 UNet, this model additionally fine-tunes the UNet on GiT10M, so users should load the full pipeline from this repo rather than from `stable-diffusion-2-1-base`.
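For intuition about the first stage, here is a minimal, shape-only stand-in for the RenderEncoder (the real class ships in `render.py`; the per-pixel 1×1-conv projection here is an assumption, chosen only to mirror the documented input/output shapes):

```python
import torch
import torch.nn as nn


class RenderEncoderSketch(nn.Module):
    """Hypothetical stand-in for render.py's RenderEncoder: a learned
    per-pixel projection from 768-dim COSA embeddings to a 3-channel,
    image-range control image."""

    def __init__(self, in_dim: int = 768, out_channels: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(in_dim, out_channels, kernel_size=1)

    def forward(self, hint: torch.Tensor) -> torch.Tensor:
        # hint: (B, 768, H, W) -> control image: (B, 3, H, W) in [0, 1]
        return torch.sigmoid(self.proj(hint))


print(RenderEncoderSketch()(torch.randn(1, 768, 64, 64)).shape)
# torch.Size([1, 3, 64, 64])
```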
## Usage

```python
import sys
import torch
from diffusers import StableDiffusionControlNetPipeline, DDIMScheduler
from huggingface_hub import snapshot_download
device = "cuda"

# Load pipeline (GiT10M-fine-tuned UNet + ControlNet, plus base SD 2.1 VAE/text encoder)
local_dir = snapshot_download("MVRL/VectorSynth-GiT10M")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    local_dir,
    torch_dtype=torch.float16,
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to(device)

# Load the RenderEncoder
sys.path.insert(0, local_dir)  # render.py ships in the repo snapshot
from render import RenderEncoder
checkpoint = torch.load(
    f"{local_dir}/render_encoder/cosa-render_encoder.pth",
    map_location=device,
    weights_only=False,  # checkpoint stores a config dict alongside the weights
)
render_encoder = RenderEncoder(**checkpoint['config']).to(device).eval()
render_encoder.load_state_dict(checkpoint['state_dict'])

# Your hint tensor should be (H, W, 768): per-pixel COSA embeddings
# hint = torch.load("your_hint.pt").to(device)
# hint = hint.unsqueeze(0).permute(0, 3, 1, 2)  # (1, 768, H, W)
# with torch.no_grad():
#     control_image = render_encoder(hint)

# Generate
# output = pipe(
#     prompt="An aerial image of a residential neighborhood",
#     image=control_image,
#     num_inference_steps=40,
#     guidance_scale=7.5,
# ).images[0]
```
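Before producing real hints, you can smoke-test the full pipeline with a random hint tensor; this exercises the shapes end to end, though the output will not be a meaningful satellite image. The 512×512 size below is an assumption matching SD 2.1 Base's native resolution:

```python
# Smoke test with a random stand-in hint (not real COSA embeddings).
H = W = 512
hint = torch.randn(H, W, 768, device=device)
hint = hint.unsqueeze(0).permute(0, 3, 1, 2)  # (1, 768, H, W)
with torch.no_grad():
    control_image = render_encoder(hint)  # (1, 3, H, W)

image = pipe(
    prompt="An aerial image of a residential neighborhood",
    image=control_image,
    num_inference_steps=40,
    guidance_scale=7.5,
).images[0]
image.save("smoke_test.png")
```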
## Files
- `unet/` – GiT10M-fine-tuned UNet (`diffusion_pytorch_model.safetensors`)
- `controlnet/` – GiT10M-fine-tuned ControlNet
- `render_encoder/cosa-render_encoder.pth` – RenderEncoder weights (COSA 768→3)
- `render.py` – RenderEncoder class definition
- `vae/`, `text_encoder/`, `tokenizer/`, `scheduler/`, `feature_extractor/` – copied from SD 2.1 Base (unmodified)
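Because the UNet is fine-tuned alongside the ControlNet, individual components can also be pulled from this layout with diffusers' standard `subfolder` convention, reusing `local_dir` from the Usage snippet (shown only to make the layout concrete; the full-pipeline load above is the recommended path):

```python
from diffusers import ControlNetModel, UNet2DConditionModel

# Load the fine-tuned components directly from their subfolders.
controlnet = ControlNetModel.from_pretrained(local_dir, subfolder="controlnet")
unet = UNet2DConditionModel.from_pretrained(local_dir, subfolder="unet")
```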
## Training Data
Fine-tuned on GiT10M, a curated collection of paired OpenStreetMap vector data and Google satellite tiles (zoom 17, ~1 m/pixel). The dataset is split into a training set and two held-out test splits (random and spatial) for evaluation. See *GeoDiT: Point Conditioned Diffusion Transformer for Satellite Image Synthesis* for more details on the data.
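As a toy illustration of the two evaluation regimes (assumed semantics, not the dataset's actual split code): a random split scatters held-out tiles among training tiles, while a spatial split holds out whole regions to test generalization to unseen geography.

```python
import random

# Tiles keyed by slippy-map (x, y) coordinates at a fixed zoom (hypothetical grid).
tiles = [(x, y) for x in range(100) for y in range(100)]

# Random split: held-out tiles can border tiles the model trained on.
rng = random.Random(0)
shuffled = tiles[:]
rng.shuffle(shuffled)
random_test = set(shuffled[:1000])

# Spatial split: an entire contiguous region is held out.
spatial_test = {(x, y) for (x, y) in tiles if x >= 90}
```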
## Citation

```bibtex
@misc{cher2025vectorsynth,
  title={VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics},
  author={Cher, Daniel and Wei, Brian and Sastry, Srikumar and Jacobs, Nathan},
  year={2025},
  eprint={2511.07744},
  archivePrefix={arXiv}
}
```
## Related Models
- VectorSynth-COSA – trained on a smaller cities dataset
- VectorSynth – standard CLIP embedding variant
- GeoSynth – text-to-satellite image generation