arxiv:2602.21627

Tokenizing Semantic Segmentation with RLE

Published on Feb 25

Abstract

A unified approach for semantic and panoptic segmentation in images and videos using language modeling to generate segmentation masks as discrete token sequences with novel compression techniques.

AI-generated summary

This paper presents a new unified approach to semantic segmentation in both images and videos that uses language modeling to output the masks as sequences of discrete tokens. We use run-length encoding (RLE) to discretize the segmentation masks and then train a modified version of Pix2Seq to output these RLE tokens autoregressively. We propose novel tokenization strategies that compress the token sequence enough to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our proposed models on two datasets to show that they are competitive with the state of the art in some scenarios despite being bottlenecked by our limited computational resources. We make our code and models publicly available to facilitate further work in this domain.
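To make the idea of discretizing a mask with RLE concrete, here is a minimal sketch in plain Python. It is not the paper's tokenization scheme: the function names `rle_encode` and `rle_decode`, the row-major flattening, the `(start, length, class_id)` run layout, and the choice to skip background pixels are all assumptions made for illustration.

```python
def rle_encode(mask, background=0):
    """Encode a 2D class mask as (start, length, class_id) runs.

    Assumption: the mask is flattened row by row and background runs are
    dropped. The paper's actual token layout may differ.
    """
    flat = [value for row in mask for value in row]
    runs = []
    i = 0
    while i < len(flat):
        # Extend j to the end of the run of identical values starting at i.
        j = i
        while j < len(flat) and flat[j] == flat[i]:
            j += 1
        if flat[i] != background:
            runs.append((i, j - i, flat[i]))
        i = j
    return runs


def rle_decode(runs, height, width, background=0):
    """Rebuild the 2D mask from (start, length, class_id) runs."""
    flat = [background] * (height * width)
    for start, length, class_id in runs:
        flat[start:start + length] = [class_id] * length
    return [flat[r * width:(r + 1) * width] for r in range(height)]


mask = [
    [0, 1, 1, 0],
    [2, 2, 0, 0],
    [0, 0, 0, 1],
]
runs = rle_encode(mask)
# Hypothetical flat token sequence: each run contributes three integers.
tokens = [value for run in runs for value in run]
restored = rle_decode(runs, height=3, width=4)
```

For this 3x4 mask the 12 pixels reduce to three runs, so a sequence model would have nine integers to predict instead of one label per pixel. The saving grows with the size of uniform regions, which is why sequence length becomes the main concern when moving from images to videos.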
