# GlmImageTransformer2DModel

A Diffusion Transformer model for 2D data.

## GlmImageTransformer2DModel[[diffusers.GlmImageTransformer2DModel]]

#### diffusers.GlmImageTransformer2DModel[[diffusers.GlmImageTransformer2DModel]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/models/transformers/transformer_glm_image.py#L503)

**Parameters:**

patch_size (`int`, defaults to `2`) : The size of the patches to use in the patch embedding layer.

in_channels (`int`, defaults to `16`) : The number of channels in the input.

num_layers (`int`, defaults to `30`) : The number of layers of Transformer blocks to use.

attention_head_dim (`int`, defaults to `40`) : The number of channels in each head.

num_attention_heads (`int`, defaults to `64`) : The number of heads to use for multi-head attention.

out_channels (`int`, defaults to `16`) : The number of channels in the output.

text_embed_dim (`int`, defaults to `1472`) : Input dimension of text embeddings from the text encoder.

time_embed_dim (`int`, defaults to `512`) : Output dimension of timestep embeddings.

condition_dim (`int`, defaults to `256`) : The embedding dimension of the input SDXL-style resolution conditions (`original_size`, `target_size`, `crop_coords`).

pos_embed_max_size (`int`, defaults to `128`) : The maximum resolution of the positional embeddings, from which slices of shape `H x W` are taken and added to input patched latents, where `H` and `W` are the latent height and width respectively. A value of 128 means that the maximum supported height and width for image generation is `128 * vae_scale_factor * patch_size => 128 * 8 * 2 => 2048`.

sample_size (`int`, defaults to `128`) : The base resolution of input latents. If height or width is not provided during generation, this value determines the resolution as `sample_size * vae_scale_factor => 128 * 8 => 1024`.
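The resolution arithmetic in the `pos_embed_max_size` and `sample_size` descriptions can be checked with a short sketch. It is plain Python and does not call diffusers. The function names `max_image_side` and `default_image_side` are hypothetical helpers introduced here for illustration, and the `vae_scale_factor` of 8 is taken from the worked figures above rather than read from a model config.

```python
# Defaults copied from the parameter list above.
PATCH_SIZE = 2
POS_EMBED_MAX_SIZE = 128
SAMPLE_SIZE = 128

# Assumption: the VAE downsamples by 8x, matching the worked figures above.
VAE_SCALE_FACTOR = 8


def max_image_side(
    pos_embed_max_size: int = POS_EMBED_MAX_SIZE,
    vae_scale_factor: int = VAE_SCALE_FACTOR,
    patch_size: int = PATCH_SIZE,
) -> int:
    """Largest height or width (in pixels) covered by the positional embeddings."""
    return pos_embed_max_size * vae_scale_factor * patch_size


def default_image_side(
    sample_size: int = SAMPLE_SIZE,
    vae_scale_factor: int = VAE_SCALE_FACTOR,
) -> int:
    """Height or width (in pixels) used when none is passed at generation time."""
    return sample_size * vae_scale_factor


if __name__ == "__main__":
    print("max side:", max_image_side())
    print("default side:", default_image_side())
```

With the documented defaults, the two helpers reproduce the 2048 and 1024 pixel figures quoted in the parameter descriptions. A checkpoint with a different VAE or patch size would change both results.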

