The idea below is rough and at an early stage; it originated entirely from the insights in paper [1]. Developing it further requires a thorough literature review and discussions with experts.
Background:
Paper [1] provides an exciting categorization of belief inference and goal inference in the proposed question types, which defines the atomic inferences in complex actions. In reality, human intentions are often complex and layered, involving hierarchical goals (e.g. “find the water glass” as part of “boil the milk”). This complexity is challenging because humans do not always approach tasks in a strictly subgoal-by-subgoal manner. Instead, they frequently adapt, reprioritize, and make adjustments based on context. A model therefore has to perceive contextual shifts and adapt to them.
Key Challenges:
- Hierarchical Goals: Goals often contain nested subgoals (like “find the water glass” as part of “boil the milk”), and a realistic depiction of human goal-seeking has to recognize this nesting.
- Real-time Goal Shifts: Intentions and goals are often flexible and context-driven rather than rigidly following a linear sequence of subgoals. Adaptive systems therefore require a model that accounts for contextual shifts, reprioritization, and situational awareness.
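To make the two challenges concrete, here is a minimal sketch of a goal hierarchy whose subgoals can be reprioritized when the context changes. Everything in it is an assumption for illustration: the `Goal` class, the `priority` field, the `next_action` and `reprioritize` methods, and the "turn on the stove" subgoal are hypothetical and do not come from paper [1] or its benchmark.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Goal:
    """A node in a goal hierarchy (hypothetical structure, not from [1])."""
    name: str
    priority: float = 1.0
    done: bool = False
    subgoals: List["Goal"] = field(default_factory=list)

    def next_action(self) -> Optional["Goal"]:
        """Return the highest-priority unfinished goal at the bottom of the tree."""
        if self.done:
            return None
        pending = [g for g in self.subgoals if not g.done]
        if not pending:
            # A leaf, or a goal whose subgoals are all finished.
            return self
        best = max(pending, key=lambda g: g.priority)
        return best.next_action()

    def reprioritize(self, name: str, priority: float) -> bool:
        """Set a new priority on the first goal called `name`; True if found."""
        if self.name == name:
            self.priority = priority
            return True
        return any(g.reprioritize(name, priority) for g in self.subgoals)


# Illustrative hierarchy: the second subgoal is an invented example.
boil_milk = Goal(
    "boil the milk",
    subgoals=[
        Goal("find the water glass", priority=2.0),
        Goal("turn on the stove", priority=1.0),
    ],
)

print(boil_milk.next_action().name)  # find the water glass

# A contextual shift raises the priority of another subgoal.
boil_milk.reprioritize("turn on the stove", 3.0)
print(boil_milk.next_action().name)  # turn on the stove
```

The point of the sketch is that the next action is recomputed from the current priorities rather than read off a fixed subgoal order, which is the behavior a goal-inference model would have to track in an observed agent.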
Method:
Build (Modify) the benchmark first...
References
[1] MMToM-QA: Multimodal Theory of Mind Question Answering, 202401, JHU
It is highly relevant to the multi-agent social simulation work I'm interested in. I'll continue it in my spare time.
--Edited on 2024/10/30