Transformers

This page documents the RankSEG integration path for standard Hugging Face transformers semantic-segmentation outputs.

Use this path when you already run inference through a standard processor -> model -> outputs workflow and want RankSEG to replace the final argmax-style prediction step.

Where RankSEG fits

Hugging Face segmentation models do not all expose the same raw tensor field. Some return outputs.logits; query-based models return class-query and mask logits; some processors also own model-specific resizing logic. The RankSEG Transformers helper keeps the official inference flow intact and handles the last post-processing step:

processor(images) -> model(**inputs) -> restore semantic probabilities
-> RankSEG -> prediction masks

The helper is imported from the rankseg.integration namespace:

from rankseg.integration import transformers

The main helper is:

transformers.postprocess(
    outputs,
    *,
    model=None,
    target_sizes=None,
    rankseg_kwargs=None,
)

Its role is intentionally narrow:

  • restore probabilities from supported Hugging Face output families;

  • resize them to the original image size when needed;

  • apply RankSEG as the final post-processing step.
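The first two steps can be pictured for the simplest family, plain outputs.logits. The sketch below assumes logits of shape (B, C, h, w), upsamples each image to its (height, width) target and then applies a softmax over classes. This is the same upsampling order Hugging Face's SegFormer post-processing uses before its argmax. It returns a per-image list of (C, H, W) maps. restore_probs_from_logits is a hypothetical name, and this is not RankSEG's actual implementation.

```python
import torch
import torch.nn.functional as F


def restore_probs_from_logits(logits, target_sizes):
    """Hypothetical sketch: (B, C, h, w) logits -> list of (C, H, W) probability maps.

    Assumes upsampling happens before the softmax; RankSEG's internal order
    is not specified by this page.
    """
    probs = []
    for i, (height, width) in enumerate(target_sizes):
        # Upsample one image's logits to its original (height, width).
        upsampled = F.interpolate(
            logits[i : i + 1],
            size=(int(height), int(width)),
            mode="bilinear",
            align_corners=False,
        )
        # Softmax over the class channel turns scores into per-pixel probabilities.
        probs.append(upsampled.softmax(dim=1)[0])
    return probs


# Synthetic stand-in for outputs.logits: 2 images, 5 classes, 32x32 grid.
logits = torch.randn(2, 5, 32, 32)
probs = restore_probs_from_logits(logits, target_sizes=[(64, 48), (100, 80)])
print([tuple(p.shape) for p in probs])  # [(5, 64, 48), (5, 100, 80)]
```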

Helper arguments

| Argument or return | Shape or type | Meaning |
|---|---|---|
| outputs | Structured Transformers output | The object returned by model(**inputs). Tuple-style return_dict=False outputs are intentionally unsupported. |
| model | Optional Transformers model | Required for output families whose semantic reconstruction depends on the model configuration. |
| target_sizes | List or tensor of (height, width) pairs | One original output size per image. For a PIL image, use [image.size[::-1]], because PIL reports size as (width, height). |
| rankseg_kwargs | dict | Keyword arguments forwarded to RankSEG, for example {"metric": "dice", "solver": "RMA"}. |
| Return value | list[torch.Tensor] | One predicted mask per input image. |
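A quick way to build target_sizes for a batch, assuming you keep the original PIL images around. Image.size is (width, height), so each entry is reversed. The stacked (B, 2) tensor is the tensor form that the table also allows. The blank images here are placeholders.

```python
import torch
from PIL import Image

# Placeholder images; in practice these are the originals passed to the processor.
images = [Image.new("RGB", (640, 480)), Image.new("RGB", (300, 200))]

# PIL reports (width, height); target_sizes expects (height, width).
target_sizes = [image.size[::-1] for image in images]
print(target_sizes)  # [(480, 640), (200, 300)]

# Equivalent tensor form: one (height, width) row per image.
target_sizes_tensor = torch.tensor(target_sizes)
print(target_sizes_tensor.shape)  # torch.Size([2, 2])
```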

Minimal integration

The standard Hugging Face inference structure stays the same. The only integration change happens after outputs = model(**inputs).

from transformers import SegformerImageProcessor, AutoModelForSemanticSegmentation
from PIL import Image
import requests
import torch

# Imported after the Hugging Face classes because this name shadows the transformers package.
from rankseg.integration import transformers

processor = SegformerImageProcessor.from_pretrained("mattmdjaga/segformer_b2_clothes")
model = AutoModelForSemanticSegmentation.from_pretrained("mattmdjaga/segformer_b2_clothes")

image = Image.open(requests.get("https://plus.unsplash.com/premium_photo-1673210886161-bfcc40f54d1f?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8cGVyc29uJTIwc3RhbmRpbmd8ZW58MHx8MHx8&w=1000&q=80", stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Replaces the usual upsample + argmax step; target_sizes holds (height, width) per image.
preds = transformers.postprocess(
    outputs,
    target_sizes=[image.size[::-1]],
    rankseg_kwargs={"metric": "dice"},
)

For supported output families, transformers.postprocess(...) leaves the surrounding Hugging Face inference code untouched and replaces only the final prediction step. SAM-family outputs are deliberately excluded from this helper. Handle them with sam.Sam1, sam.Sam2, or sam.Sam3 after from rankseg.integration import sam.

Compare with the usual argmax step

The usual SegFormer-style baseline is:

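import torch.nn.functional as F

# Upsample the low-resolution logits to the original (height, width) of the image.
upsampled_logits = F.interpolate(
    outputs.logits,
    size=image.size[::-1],
    mode="bilinear",
    align_corners=False,
)
# Per-pixel argmax over the class axis; [0] selects the first (only) image.
baseline_pred = upsampled_logits.argmax(dim=1)[0]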
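Softmax preserves the ordering of class scores at each pixel. For that reason, the argmax baseline gives the same labels whether it is taken on logits or on restored probabilities. The sketch below checks this on synthetic logits, with random values standing in for outputs.logits. It only illustrates the baseline and does not call RankSEG. RankSEG's role is to replace this per-pixel argmax with a decision driven by the configured metric.

```python
import torch
import torch.nn.functional as F

# Synthetic stand-in for outputs.logits: 1 image, 4 classes, 16x16 grid.
torch.manual_seed(0)
logits = torch.randn(1, 4, 16, 16)
upsampled = F.interpolate(logits, size=(32, 24), mode="bilinear", align_corners=False)

pred_from_logits = upsampled.argmax(dim=1)[0]
pred_from_probs = upsampled.softmax(dim=1).argmax(dim=1)[0]

# Softmax keeps the per-pixel ranking of classes, so both label maps should match.
print(torch.equal(pred_from_logits, pred_from_probs))
```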

The RankSEG version asks the helper to restore probabilities and then produce the final prediction:

rankseg_pred = transformers.postprocess(
    outputs,
    target_sizes=[image.size[::-1]],
    rankseg_kwargs={"metric": "dice", "solver": "RMA"},
)[0]

This is the intended replacement point: the processor, model, checkpoint, and input preparation remain unchanged.
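To compare baseline_pred and rankseg_pred against annotations, a per-class Dice score, 2|P ∩ T| / (|P| + |T|), matches the "dice" metric requested above. This sketch assumes both predictions and the ground truth are (H, W) integer label maps. This page does not specify the exact mask format returned by postprocess. dice_per_class is a hypothetical helper, and counting a class absent from both maps as a perfect match is one convention among several.

```python
import torch


def dice_per_class(pred, target, num_classes):
    """Per-class Dice, 2|P ∩ T| / (|P| + |T|), for (H, W) integer label maps."""
    scores = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        denom = p.sum().item() + t.sum().item()
        if denom == 0:
            # Class absent from both maps: count it as a perfect match.
            scores.append(1.0)
        else:
            scores.append(2.0 * (p & t).sum().item() / denom)
    return scores


pred = torch.tensor([[0, 0, 1], [1, 2, 2]])
target = torch.tensor([[0, 1, 1], [1, 2, 0]])
scores = dice_per_class(pred, target, num_classes=3)
print(scores)  # [0.5, 0.8, 0.6666666666666666]
```

In practice you would call dice_per_class(baseline_pred, gt, C) and dice_per_class(rankseg_pred, gt, C) on the same ground-truth map gt and compare the two lists.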

Advanced probability helper

The namespace also exposes:

from rankseg.integration import transformers

transformers.restore_semantic_probs(...) returns restored semantic probability maps directly as a per-image list of (C, H, W) tensors. Use it when you need probability tensors instead of final RankSEG predictions.

Pass target_sizes as one (height, width) entry per batch item, for example target_sizes=[image.size[::-1]] for a single PIL image. transformers.postprocess(...) follows the same per-image list convention for prediction outputs.
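Images in a batch can have different original sizes, so the probabilities come back as a list rather than one stacked tensor. The following is a minimal sketch of consuming such a list. It assumes each entry is a (C, H, W) tensor as described above. Synthetic softmaxed tensors stand in for the real restore_semantic_probs(...) output, and summarize_probs is a hypothetical name.

```python
import torch


def summarize_probs(probs_list):
    """For each (C, H, W) probability map: argmax labels and max-probability confidence."""
    summaries = []
    for probs in probs_list:
        # max over the class axis returns (values, indices), both (H, W).
        confidence, labels = probs.max(dim=0)
        summaries.append({"labels": labels, "confidence": confidence})
    return summaries


# Synthetic stand-in for a per-image list of (C, H, W) probability maps.
probs_list = [torch.randn(3, 4, 5).softmax(dim=0), torch.randn(3, 8, 6).softmax(dim=0)]
summaries = summarize_probs(probs_list)
print([tuple(s["labels"].shape) for s in summaries])  # [(4, 5), (8, 6)]
```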

Explicit helper imports are also supported when you prefer shorter local names:

from rankseg.integration.transformers import postprocess, restore_semantic_probs

Supported output families

The standard Transformers helper supports the main semantic-segmentation output families produced by Hugging Face transformers models:

  • outputs.logits

  • outputs.class_queries_logits + outputs.masks_queries_logits

  • outputs.logits + outputs.pred_masks

  • outputs.semantic_seg
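One way to tell these families apart is by which attributes are present on the output object. The dispatcher below is a hypothetical sketch; detect_output_family is not part of rankseg. It checks the more specific logits + pred_masks pattern before plain logits so the two cannot be confused. SimpleNamespace objects stand in for real model outputs.

```python
from types import SimpleNamespace


def detect_output_family(outputs):
    """Hypothetical dispatcher over the supported output families."""

    def has(name):
        return getattr(outputs, name, None) is not None

    if has("class_queries_logits") and has("masks_queries_logits"):
        return "query"
    # Check the more specific pattern before plain logits.
    if has("logits") and has("pred_masks"):
        return "logits+pred_masks"
    if has("logits"):
        return "logits"
    if has("semantic_seg"):
        return "semantic_seg"
    raise ValueError("Unsupported Transformers output family")


print(detect_output_family(SimpleNamespace(logits=0, pred_masks=0)))  # logits+pred_masks
```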

When a branch requires model-specific handling, pass model=... so the helper can follow the corresponding official post-processing behavior.
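For the query-based family, Hugging Face's MaskFormer semantic post-processing combines the two tensors in three steps:

1. Apply a softmax over classes and drop the trailing "no object" class.
2. Apply a sigmoid over the mask logits.
3. Take a class-weighted sum over queries.

The sketch reproduces that recipe on synthetic tensors. It is an illustration, not RankSEG's code, and the resulting per-pixel scores are not guaranteed to sum to 1. Mask2Former's processor additionally upsamples the mask logits before combining them. queries_to_semantic is a hypothetical name.

```python
import torch
import torch.nn.functional as F


def queries_to_semantic(class_queries_logits, masks_queries_logits, target_sizes):
    """Combine query outputs into per-image (C, H, W) class scores.

    class_queries_logits: (B, Q, C + 1); the last class is "no object".
    masks_queries_logits: (B, Q, h, w).
    """
    # Softmax over classes, then drop the trailing "no object" column.
    class_probs = class_queries_logits.softmax(dim=-1)[..., :-1]
    mask_probs = masks_queries_logits.sigmoid()
    # Weight each query's mask by its class probabilities and sum over queries.
    semantic = torch.einsum("bqc,bqhw->bchw", class_probs, mask_probs)
    results = []
    for i, (height, width) in enumerate(target_sizes):
        resized = F.interpolate(
            semantic[i : i + 1],
            size=(int(height), int(width)),
            mode="bilinear",
            align_corners=False,
        )
        results.append(resized[0])
    return results


# Synthetic outputs: 1 image, 10 queries, 4 classes (+1 "no object"), 8x8 masks.
class_logits = torch.randn(1, 10, 5)
mask_logits = torch.randn(1, 10, 8, 8)
scores = queries_to_semantic(class_logits, mask_logits, target_sizes=[(16, 12)])
print(scores[0].shape)  # torch.Size([4, 16, 12])
```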

Output defaults

If rankseg_kwargs omits output_mode, the helper chooses:

  • "multiclass" when the restored semantic probability map has more than one class channel;

  • "multilabel" when the restored probability map has one channel.
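The rule above fits in a few lines. The following hypothetical sketch honors an explicit output_mode and otherwise inspects the channel count of a (C, H, W) probability map. default_output_mode is an illustrative name, not a rankseg function.

```python
import torch


def default_output_mode(probs, rankseg_kwargs=None):
    """An explicit output_mode wins; otherwise infer it from the channel count."""
    rankseg_kwargs = rankseg_kwargs or {}
    if "output_mode" in rankseg_kwargs:
        return rankseg_kwargs["output_mode"]
    num_channels = probs.shape[0]  # probs is a (C, H, W) map
    return "multiclass" if num_channels > 1 else "multilabel"


print(default_output_mode(torch.rand(5, 4, 4)))  # multiclass
print(default_output_mode(torch.rand(1, 4, 4)))  # multilabel
```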

You can override this explicitly:

preds = transformers.postprocess(
    outputs,
    target_sizes=[image.size[::-1]],
    rankseg_kwargs={
        "metric": "dice",
        "solver": "RMA",
        "output_mode": "multiclass",
    },
)

Current exclusions

The simplified API does not currently support:

  • SAM-family outputs, which use the explicit adapters in rankseg.integration.sam;

  • outputs with patch_offsets that require official patch-merge logic;

  • tuple-style outputs such as return_dict=False returns;

  • custom unstructured outputs from trust_remote_code=True models;

  • SegGPT-style pred_masks semantic reconstruction.

In these cases the helper is meant to fail explicitly rather than silently fall back to an incorrect semantic restoration path.
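The following hypothetical guard illustrates this fail-fast behavior for two of the listed cases: tuple outputs and patch_offsets. check_supported is an illustrative name, and the real helper may detect these cases differently.

```python
from types import SimpleNamespace


def check_supported(outputs):
    """Raise early instead of guessing a semantic restoration path."""
    if isinstance(outputs, tuple):
        raise TypeError("Tuple outputs (return_dict=False) are not supported.")
    if getattr(outputs, "patch_offsets", None) is not None:
        raise NotImplementedError("Outputs with patch_offsets need the official patch-merge logic.")
    return outputs


# A structured-looking output passes; a tuple is rejected with a clear error.
check_supported(SimpleNamespace(logits=None))
try:
    check_supported((None,))
except TypeError as err:
    print(err)  # Tuple outputs (return_dict=False) are not supported.
```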

Executable tutorial

The notebook below is written as a user-facing tutorial: it first runs the official Hugging Face baseline, then repeats the same inference flow with only the final post-processing step replaced by RankSEG.

The maintained script version is: