08:30 - 10:00
Wed—HZ_7—Talks7—68
Room: HZ_7
Chairs: Bianca R. Baltaretu, Ben de Haas
Predicting Perception of Scene Consistencies Using Graph Representations of Scene Grammar.
Wed—HZ_7—Talks7—6802
Presented by: Aylin Kallmayer
Aylin Kallmayer 1*, Ronja Schnellen 1, Melissa Võ 1,2
1 Goethe University Frankfurt, Department of Psychology, Germany, 2 Neuro-Cognitive Psychology, Department of Psychology, LMU Munich
This study explores computational modeling of high-level scene understanding using graph-based representations to capture object identity (semantics: “what”) and their spatial relationships (syntax: “where”) in real-world scenes. Scene grammar – implicit rules that structure natural scenes – guides human understanding of both semantic (object-context) and syntactic (object-object) relationships in scenes. To model these relational structures, we employ graph autoencoders (GAEs) to learn scene embeddings, testing four experimental graph configurations: (1) a What graph (object identity), (2) a What and Where graph (object identity and spatial relations), (3) a Where graph (spatial relations only), and (4) a Control graph (object numerosity). Using these configurations, we tested whether simple assumptions about objects and spatial relations predict complex, human-like scene interpretations. We validated these embeddings using the SCEGRAM database, which systematically violates scene semantics and syntax. The first task assessed scene categorization, while the second predicted human consistency ratings. By comparing configurations against the numerosity baseline, we isolated the contributions of semantics and syntax. Results show that basic “what” and “where” features enhance model accuracy compared to the numerosity baseline, demonstrating that minimal representations can capture human-like scene consistency effects. Our findings suggest GAEs can model human scene understanding in an interpretable way. This work demonstrates the potential of graph-based approaches to study scene grammar representations and serve as a foundation for integrating graph-based representations with neural data to explore when, where, and how scene grammar aids efficient object and scene processing in the brain.
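As an illustrative sketch only (not the authors' implementation), the four graph configurations described above can be thought of as keeping different subsets of a scene's object and relation information. The objects and spatial relations below are hypothetical examples:

```python
# Sketch of the four graph configurations (hypothetical scene; not the
# authors' code). A scene is a set of objects plus pairwise spatial
# relations; each configuration keeps a different subset as features.

scene = {
    "objects": ["sink", "toothbrush", "mirror"],        # semantics: "what"
    "relations": [("toothbrush", "on", "sink"),         # syntax: "where"
                  ("mirror", "above", "sink")],
}

def build_graph(scene, use_what, use_where):
    """Return (node_features, edges) for one configuration.

    - What graph:            use_what=True,  use_where=False
    - What and Where graph:  use_what=True,  use_where=True
    - Where graph:           use_what=False, use_where=True
    - Control (numerosity):  use_what=False, use_where=False
    """
    # Node features: identity labels if "what", else anonymous tokens.
    nodes = [o if use_what else "obj" for o in scene["objects"]]
    # Edges: labelled spatial relations if "where", else none.
    edges = list(scene["relations"]) if use_where else []
    return nodes, edges

control = build_graph(scene, use_what=False, use_where=False)
# The control graph preserves only object numerosity (node count).
assert len(control[0]) == len(scene["objects"]) and control[1] == []
```

A graph autoencoder trained on such graphs would then embed each configuration into a latent space, and comparing downstream performance against the numerosity control isolates what the "what" and "where" information each contribute.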
Keywords: computational models, scene grammar, scene perception