Interactive Episodic Memory with User Feedback

Nikesh Subedi; Loris Bazzani; Ziad Al-Halah


EM-QnF task overview. Given an initial query and an incorrect model prediction (top), the user refines the query through natural language feedback (middle). The model shifts its focus toward the correct moment in the video (bottom).

Abstract

In episodic memory with natural language queries (EM-NLQ), a user asks a question (e.g., "Where did I place the mug?") that requires searching a long egocentric video to find the moment that answers it. However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods address EM-NLQ in a one-shot setup that gives the user no way to correct such errors, which limits their applicability in real-world scenarios.

In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model's initial prediction or add more information (e.g., "Before this. I'm looking for the big blue mug not the white one"), helping the model refine its predictions interactively. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively. Our approach significantly improves over the state of the art on three challenging benchmarks and is competitive with commercial large vision-language models (LVLMs) while remaining efficient.

Task

Episodic Memory with Questions & Feedback (EM-QnF)

Episodic Memory with Natural Language Query (EM-NLQ) retrieves specific moments from a person's past visual experiences (such as wearable-camera video) using free-form text questions. For example, a user might ask, "What did I put in the frying pan?", and the model must identify the exact temporal window in the video that answers the question. Current methods localize the answer in a single pass and therefore cannot recover from query ambiguity or model errors.

We introduce EM-QnF, which extends EM-NLQ by allowing users to provide natural language feedback after seeing an incorrect prediction. Given a video V, query Q, and an initial (possibly wrong) prediction, the user provides feedback F (containing additional, contrastive, or temporal information) to guide the model toward the correct answer. This interactive process continues over multiple rounds, bringing episodic memory search closer to how humans naturally seek information about their past experiences.
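The interaction described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `Session` container, the `(start_sec, end_sec)` span representation, the `run_session` helper, and the `max_rounds` cap are all hypothetical names chosen for this example, and the model and user are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical span representation: (start_sec, end_sec) in the video.
Span = Tuple[float, float]


@dataclass
class Session:
    """State of one EM-QnF interaction: video V, query Q, and feedback F so far."""
    video_id: str
    query: str
    feedback: List[str] = field(default_factory=list)
    predictions: List[Span] = field(default_factory=list)


# Assumed interfaces: the model maps (video, query, feedback so far) to a span;
# the user returns a feedback string, or None once the prediction is accepted.
Model = Callable[[str, str, List[str]], Span]
User = Callable[[Span], Optional[str]]


def run_session(session: Session, model: Model, user: User, max_rounds: int = 3) -> Span:
    """Predict once from the query alone, then refine for up to max_rounds of feedback."""
    span = model(session.video_id, session.query, [])
    session.predictions.append(span)
    for _ in range(max_rounds):
        fb = user(span)
        if fb is None:  # user accepts the current prediction
            break
        session.feedback.append(fb)
        span = model(session.video_id, session.query, list(session.feedback))
        session.predictions.append(span)
    return span
```

Keeping the full feedback history in the session, rather than only the latest message, lets a model condition on every round when it produces the next prediction.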

To support this new task, we construct the EM-QnF datasets from three existing egocentric benchmarks (Ego4D-NLQ, GoalStep, HD-EPIC) using a scalable synthetic feedback generation recipe, removing the need for costly manual annotation.

Our Approach

Feedback Generation Recipe

Since no dataset currently supports EM-QnF, we propose a synthetic feedback generation recipe that turns existing EM-NLQ datasets into EM-QnF datasets. The four-step pipeline samples a reference span, generates captions for each span using an LVLM, generates an explanation for the ground-truth response, and then uses an LLM to produce diverse, realistic feedback. We further extract contains, not-contains, and temporal clauses from each feedback to supervise FALM training.
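The four steps can be outlined in code as follows. This is a rough sketch under stated assumptions, not the released recipe: the LVLM and LLM calls are injected as placeholder callables (`caption_fn`, `explain_fn`, `feedback_fn`), the rule for sampling the reference span (same duration as the ground truth, no overlap with it) is an assumption because the text above does not specify one, and the `FeedbackClauses` example is hand-written for illustration rather than taken from the dataset.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Span = Tuple[float, float]  # hypothetical (start_sec, end_sec) representation


@dataclass
class FeedbackClauses:
    """Clauses extracted from one feedback message to supervise FALM."""
    contains: List[str]
    not_contains: List[str]
    temporal: List[str]


# Hand-written illustration for the feedback
# "Before this. I'm looking for the big blue mug not the white one".
EXAMPLE_CLAUSES = FeedbackClauses(
    contains=["big blue mug"], not_contains=["white mug"], temporal=["before"]
)


def sample_reference_span(gt: Span, video_len: float, rng: random.Random) -> Span:
    """Step 1 (assumed rule): draw a span as long as gt that does not overlap gt."""
    dur = gt[1] - gt[0]
    for _ in range(1000):
        start = rng.uniform(0.0, video_len - dur)
        ref = (start, start + dur)
        if ref[1] <= gt[0] or ref[0] >= gt[1]:
            return ref
    raise ValueError("could not sample a non-overlapping reference span")


def generate_feedback(
    query: str,
    gt: Span,
    video_len: float,
    caption_fn: Callable[[Span], str],        # placeholder for an LVLM captioner
    explain_fn: Callable[[str, str], str],    # placeholder: why gt answers the query
    feedback_fn: Callable[[str, Dict[str, str], str, Span, Span], str],  # placeholder LLM
    rng: random.Random,
) -> Dict[str, object]:
    ref = sample_reference_span(gt, video_len, rng)                    # step 1
    captions = {"reference": caption_fn(ref),                          # step 2
                "ground_truth": caption_fn(gt)}
    explanation = explain_fn(query, captions["ground_truth"])          # step 3
    feedback = feedback_fn(query, captions, explanation, ref, gt)      # step 4
    return {"reference_span": ref, "captions": captions,
            "explanation": explanation, "feedback": feedback}
```

Passing the model calls in as callables keeps the outline independent of any particular LVLM or LLM API.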

ReFocus & FALM

ReFocus integrates our novel plug-and-play Feedback ALignment Module (FALM) with existing EM-NLQ models, enabling them to process and respond to user feedback efficiently. Unlike prior approaches that treat queries as fixed, one-shot inputs, ReFocus allows models to refine their predictions based on user corrections or clarifications.

FALM architecture and ReFocus integration diagram

FALM encodes the video, query, and feedback using pretrained encoders, then uses Transformer encoder-decoder blocks to predict a per-clip alignment score P indicating how well each clip matches the user's feedback. These scores reweight the video features in an existing EM-NLQ model via a lightweight EM Adapter (two learned scalars), shifting the model's attention without retraining the base model from scratch. Multi-turn feedback is handled by late fusion of per-feedback alignment scores.
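To make the reweighting step concrete, here is a small sketch of how per-clip alignment scores might modulate video features. The text above only states that the EM Adapter has two learned scalars and that multi-turn scores are fused late; the affine form `alpha * p + beta`, the averaging rule for fusion, and the function names `late_fuse` and `em_adapter` are assumptions made for illustration.

```python
from typing import List, Sequence


def late_fuse(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-feedback alignment scores into one score per clip.

    Assumed fusion rule: a plain average over feedback turns.
    """
    n = len(score_lists)
    return [sum(col) / n for col in zip(*score_lists)]


def em_adapter(
    features: Sequence[Sequence[float]],  # one feature vector per clip
    p: Sequence[float],                   # one alignment score per clip
    alpha: float,                         # learned scalar (assumed: gain on the score)
    beta: float,                          # learned scalar (assumed: constant offset)
) -> List[List[float]]:
    """Scale each clip's features by (alpha * p_t + beta), an assumed affine form."""
    return [[(alpha * pt + beta) * x for x in clip] for clip, pt in zip(features, p)]


# Two feedback turns over two clips; only the first turn favours clip 0.
fused = late_fuse([[1.0, 0.0], [0.0, 0.0]])
reweighted = em_adapter([[1.0, 2.0], [3.0, 4.0]], fused, alpha=2.0, beta=1.0)
```

Under this assumed form, setting `alpha = 0` and `beta = 1` leaves the features unchanged, so the base EM-NLQ model's behaviour is recovered when feedback carries no signal.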

Results

Video Examples

The following videos showcase examples from the EM-QnF dataset alongside predictions from ReFocus (GroundNLQ) and GroundNLQ. Each example first shows the reference span, followed by the prediction from GroundNLQ, then ReFocus (GroundNLQ).

Success Cases

Example 1

Example 2

Example 3

Example 4

Failure Cases

Failure 1

Failure 2

Citation

BibTeX

If you find this work useful in your research, please consider citing:


@inproceedings{subedi2026refocus,
  title     = {Interactive Episodic Memory with User Feedback},
  author    = {Subedi, Nikesh and Bazzani, Loris and Al-Halah, Ziad},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}