Transformers documentation
Add vision processing components
Adding a vision model requires two image processor classes on top of the standard modular approach.
For the modeling and config steps, follow the modular guide first.
- torchvision backend is the default and supports GPU acceleration.
- PIL backend is a CPU-only fallback used when torchvision isn't installed.
Both classes share the same preprocessing logic but have different backends. Their constructor signatures and default values must be identical. AutoImageProcessor.from_pretrained() selects the backend at load time and falls back to PIL when torchvision isn’t available. Mismatched signatures cause the same saved config to behave differently across environments.
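Conceptually, the load-time fallback described above works like the following minimal sketch (this is illustrative only, not the actual transformers selection code):

```python
# Illustrative sketch of backend selection at load time: prefer the
# torchvision-backed class, fall back to the PIL-backed one when
# torchvision is not importable.
def select_image_processor_backend() -> str:
    try:
        import torchvision  # noqa: F401

        return "torchvision"
    except ImportError:
        return "pil"
```

Because the two classes share identical signatures and defaults, which branch is taken should not change the preprocessing a user observes.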
torchvision
Create image_processing_<model_name>.py with a class that inherits from TorchvisionBackend. Define a kwargs class first if your processor needs custom parameters beyond the standard ImagesKwargs.
```python
from ...image_processing_backends import TorchvisionBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import ImagesKwargs, Unpack
from ...utils import auto_docstring


class MyModelImageProcessorKwargs(ImagesKwargs, total=False):
    tile_size: int  # any model-specific kwargs


@auto_docstring
class MyModelImageProcessor(TorchvisionBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)
```

PIL
Create image_processing_pil_<model_name>.py with a class that inherits from PilBackend. Duplicate the kwargs class here instead of importing it from the torchvision file, because that import can fail when torchvision isn't installed. Add an # Adapted from comment so the two copies stay in sync. For processors with no custom parameters, use ImagesKwargs directly.
```python
from ...image_processing_backends import PilBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import ImagesKwargs, Unpack
from ...utils import auto_docstring


# Adapted from transformers.models.my_model.image_processing_my_model.MyModelImageProcessorKwargs
class MyModelImageProcessorKwargs(ImagesKwargs, total=False):
    tile_size: int  # any model-specific kwargs


@auto_docstring
class MyModelImageProcessorPil(PilBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)
```

See CLIPImageProcessor/CLIPImageProcessorPil and LlavaOnevisionImageProcessor/LlavaOnevisionImageProcessorPil for reference.
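Because the two backend classes must keep identical constructor signatures and defaults, one way to guard against drift is to compare their signatures directly, for example in a test. The classes below are simplified stand-ins for illustration, not the real processor classes:

```python
import inspect


# Hypothetical stand-ins for the torchvision- and PIL-backed processors.
# Their __init__ signatures are intentionally identical.
class MyProcessorTorchvision:
    def __init__(self, tile_size: int = 16, do_resize: bool = True, **kwargs):
        self.tile_size = tile_size
        self.do_resize = do_resize


class MyProcessorPil:
    def __init__(self, tile_size: int = 16, do_resize: bool = True, **kwargs):
        self.tile_size = tile_size
        self.do_resize = do_resize


# If this holds, a saved config behaves the same under either backend.
signatures_match = inspect.signature(MyProcessorTorchvision.__init__) == inspect.signature(
    MyProcessorPil.__init__
)
```

A check like this catches the mismatched-signature failure mode described earlier, where the same saved config behaves differently across environments.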
Next steps
- Read the Auto-generating docstrings guide to auto-generate consistent docstrings with @auto_docstring.
- Read the Writing model tests guide to write integration tests for your model.