MixCount | Corentin Dumery

MixCount is a large-scale synthetic dataset for mixed-object, open-vocabulary counting, a setting that dominates industrial inspection and sorting but breaks current counting models. Our automatic generation pipeline produces, at scale, pixel-perfect labels, text prompts at several levels of detail, and visual exemplars.

58K counting scenes

1,522 object classes

4M+ counting instances

-18.3% MAE on PairTally

-20.14% MAE on FSC-147

Bridging the data gap

Visual counting models often struggle in mixed-object scenes. They commonly fail at:

(a) Distinguishing visually similar objects, e.g. big marbles in PairTally.

(b) Recognizing self-similar components as a single entity, e.g. counting pairs of sunglasses rather than lenses.

(c) Ignoring repetitive background patterns and focusing on the queried object class.

MixCount combines the scale of synthetic datasets with the photorealism of real-world 3D captures, while targeting these failure modes. As a result, training on MixCount lowers error on recent open-vocabulary counting benchmarks by about 20% (MAE: -18.3% on PairTally, -20.14% on FSC-147).

Bridging the data gap: MixCount training improves CountGD++ on PairTally, FSC-147, and MixCount
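The improvements above are reported as changes in mean absolute error (MAE). As a minimal sketch, assuming the percentages are relative changes with respect to a baseline model's MAE, the computation could look as follows. The function names and the counts are illustrative only and are not taken from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and ground-truth counts."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_change_percent(baseline_mae, new_mae):
    """Signed percentage change in MAE relative to the baseline (negative is better)."""
    return 100.0 * (new_mae - baseline_mae) / baseline_mae


# Made-up counts for illustration only; these are NOT results from the paper.
ground_truth = [12, 40, 7, 25]
baseline_pred = [15, 30, 7, 32]   # hypothetical model before MixCount training
finetuned_pred = [14, 32, 7, 31]  # hypothetical model after MixCount training

baseline_mae = mean_absolute_error(baseline_pred, ground_truth)
finetuned_mae = mean_absolute_error(finetuned_pred, ground_truth)
change = relative_change_percent(baseline_mae, finetuned_mae)
print(f"MAE {baseline_mae:.2f} -> {finetuned_mae:.2f} ({change:+.1f}%)")
```

A negative value means the error went down, which matches the sign convention of the headline numbers.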

Dataset

	FSC-147	PairTally	MCAC	MixCount
Multiple object types per image		✓	✓	✓
Fine-grained text prompts		✓		✓
External visual exemplars				✓
Segmentation & bounding boxes			✓	✓
# images	6,135	681	20K	58,000
# object classes	147	98	343	1,522

Dataset features: Each sample provides multiple visual exemplars per object (external crops and in-scene crops at different scales) together with short, concise, and detailed text descriptions, enabling flexible open-vocabulary counting prompts.

MixCount exemplars and tiered text descriptions
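To make the combination of exemplars and tiered descriptions concrete, here is a hedged sketch of how one queried class in a sample might be represented and turned into a counting prompt. The record layout, field names, tier keys, file path, and example texts are all assumptions made for illustration; the released dataset may be organized differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ObjectQuery:
    """Hypothetical per-class record; not the official MixCount schema."""
    class_name: str
    count: int
    prompts: Dict[str, str]  # tier name -> text description
    in_scene_exemplars: List[Box] = field(default_factory=list)
    external_exemplars: List[str] = field(default_factory=list)  # image paths


def build_prompt(query, tier="short", max_exemplars=3):
    """Select one text tier and at most max_exemplars in-scene boxes."""
    if tier not in query.prompts:
        raise KeyError(f"unknown tier {tier!r}; available: {sorted(query.prompts)}")
    return {
        "text": query.prompts[tier],
        "exemplar_boxes": query.in_scene_exemplars[:max_exemplars],
    }


# Illustrative values only.
query = ObjectQuery(
    class_name="sunglasses",
    count=9,
    prompts={
        "short": "sunglasses",
        "concise": "black sunglasses",
        "detailed": "black plastic sunglasses with dark round lenses",
    },
    in_scene_exemplars=[(10, 20, 60, 45), (120, 80, 170, 104)],
    external_exemplars=["exemplars/sunglasses_0.png"],
)

prompt = build_prompt(query, tier="detailed", max_exemplars=1)
print(prompt)
```

Keeping the tiers in a dictionary lets the same sample be used with text-only, exemplar-only, or combined prompts.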

Annotations: Every image comes with pixel-perfect counting supervision plus dense labels including instance and class segmentations, bounding boxes, depth, and normal maps.
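Given an instance segmentation, per-class counts and bounding boxes follow directly from the labels. The sketch below illustrates this on a toy instance-ID map; the encoding (integer IDs with 0 as background, and a separate ID-to-class lookup) is an assumption for illustration, not the dataset's documented file format.

```python
from collections import Counter


def counts_per_class(instance_map, instance_to_class, background_id=0):
    """Count distinct visible instances per class from an instance-ID map."""
    visible_ids = {pid for row in instance_map for pid in row if pid != background_id}
    return Counter(instance_to_class[pid] for pid in visible_ids)


def bounding_boxes(instance_map, background_id=0):
    """Return {instance_id: (x_min, y_min, x_max, y_max)} with inclusive pixel bounds."""
    boxes = {}
    for y, row in enumerate(instance_map):
        for x, pid in enumerate(row):
            if pid == background_id:
                continue
            if pid not in boxes:
                boxes[pid] = [x, y, x, y]
            else:
                box = boxes[pid]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return {pid: tuple(box) for pid, box in boxes.items()}


# Toy 3x5 instance map: two marbles (IDs 1 and 2) and one coin (ID 3).
instance_map = [
    [0, 1, 1, 0, 2],
    [0, 1, 1, 0, 2],
    [3, 3, 0, 0, 0],
]
instance_to_class = {1: "marble", 2: "marble", 3: "coin"}

counts = counts_per_class(instance_map, instance_to_class)
boxes = bounding_boxes(instance_map)
print(dict(counts), boxes)
```

Note that this counts only instances with at least one visible pixel, so a fully occluded object would not contribute to the count.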

Data generator: Our generator samples objects, distractors, environment, and camera placement to procedurally create photorealistic training samples. All assets are built from high-quality captures of real-world objects, materials, and lighting.

MixCount automatic data generation pipeline
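The sampling step of such a generator can be sketched as follows. This is only an illustration of procedural scene sampling under assumed parameters: the class names, value ranges, and the `SceneSpec` structure are hypothetical, and the actual pipeline (including rendering) is not reproduced here.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class SceneSpec:
    """Hypothetical description of one scene to render."""
    target_counts: Dict[str, int]      # classes to be counted
    distractor_counts: Dict[str, int]  # classes present but not queried
    environment: str                   # e.g. a lighting/background identifier
    camera_position: Tuple[float, float, float]


def sample_scene(rng, object_classes, environments,
                 num_targets=2, num_distractors=1, max_per_class=30):
    """Sample object classes, counts, an environment, and a camera placement."""
    chosen = rng.sample(object_classes, num_targets + num_distractors)
    targets, distractors = chosen[:num_targets], chosen[num_targets:]
    target_counts = {name: rng.randint(1, max_per_class) for name in targets}
    distractor_counts = {name: rng.randint(1, max_per_class) for name in distractors}

    # Camera on a partial sphere looking at the origin (assumed ranges).
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    elevation = math.radians(rng.uniform(30.0, 80.0))
    distance = rng.uniform(1.0, 3.0)
    camera = (
        distance * math.cos(elevation) * math.cos(azimuth),
        distance * math.cos(elevation) * math.sin(azimuth),
        distance * math.sin(elevation),
    )
    return SceneSpec(target_counts, distractor_counts, rng.choice(environments), camera)


# Placeholder asset names, for illustration only.
classes = ["marble", "coin", "screw", "button", "bead"]
environments = ["indoor_a", "indoor_b", "studio"]

scene = sample_scene(random.Random(0), classes, environments)
print(scene)
```

Passing an explicit seeded `random.Random` keeps every scene reproducible, which is convenient when a sample needs to be regenerated with different render settings.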

Citation

@article{dumery2026mixcount,
   title = {{The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting}},
   author = {Dumery, Corentin and Amini-Naieni, Niki and Naini, Shervin and Fua, Pascal},
   journal = {arXiv preprint arXiv:2605.18063},
   year = {2026}
}

Acknowledgements

We thank the following datasets and resources: DTC, VasTextures, LavalIndoor, and PolyHaven, as well as the Blender Foundation. We also thank Andrew Zisserman for insightful discussions. This work is partially funded by the Swiss National Science Foundation, an AWS Studentship, the Reuben Foundation, a Qualcomm Innovation Fellowship (mentors: Dr Farhad Zanjani and Dr Davide Abati), and the AIMS CDT program at the University of Oxford.

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

Corentin Dumery*¹, Niki Amini-Naieni*², Shervin Naini³, Pascal Fua¹

¹ ² ³ (* denotes equal contribution)
