MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations (1)

#ai #machinelearning #llm #datascience

First, a quick explanation of positional encoding.

Positional encoding is a module that lets attention keep track of the order of the words in a sequence.
Note that the method discussed here uses positional embedding.
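
For reference, here is a minimal sketch of the classic fixed sinusoidal positional encoding from the original Transformer. It is only meant to show how position information can be turned into vectors that get added to token embeddings. It is not the positional embedding scheme used in MASH-VLM, and the function name and parameters are my own choices for illustration.

```python
import math
from typing import List


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> List[List[float]]:
    """Return a seq_len x d_model table of fixed sinusoidal position vectors."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each (even, odd) pair of dimensions shares one frequency.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table


# Hypothetical sizes, chosen only to keep the example small.
pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
```

Each row of `pe` is the vector for one position. Because attention itself has no notion of order, adding these vectors to the token embeddings is what lets the model tell position 0 apart from position 3.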

UNSCENE-dataset
This dataset evaluates how robust LMMs are to action-scene hallucination, using the following two kinds of video:
Unusual-context videos
Scene-only videos

Steps

Manually collect and label the videos
Generate hallucination labels with an LLM (GPT)
Generate QA pairs for the videos
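
To make the last step more concrete, here is a rough sketch of how ground-truth labels and hallucination labels could be turned into yes/no QA pairs. Everything in it is an assumption on my part: the `VideoSample` fields, the question templates, and the yes/no format are hypothetical and are not taken from the paper. The hallucination labels are passed in as plain strings, so no LLM call is made here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VideoSample:
    # Hypothetical record for one manually collected and labeled video.
    video_id: str
    category: str                      # "unusual_context" or "scene_only"
    scene: str                         # ground-truth scene label
    action: Optional[str] = None       # None for scene-only videos
    hallucinated_actions: List[str] = field(default_factory=list)
    hallucinated_scenes: List[str] = field(default_factory=list)


def make_qa_pairs(sample: VideoSample) -> List[Tuple[str, str]]:
    """Build (question, answer) pairs: true labels -> yes, hallucination labels -> no."""
    pairs = [(f"Does this video take place in: {sample.scene}?", "yes")]
    if sample.action is not None:
        pairs.append((f"Is someone performing: {sample.action}?", "yes"))
    for action in sample.hallucinated_actions:
        pairs.append((f"Is someone performing: {action}?", "no"))
    for scene in sample.hallucinated_scenes:
        pairs.append((f"Does this video take place in: {scene}?", "no"))
    return pairs


# A made-up scene-only example: an empty gym where a model might wrongly see weightlifting.
sample = VideoSample("v1", "scene_only", "gym", hallucinated_actions=["weightlifting"])
qa_pairs = make_qa_pairs(sample)
```

The idea is that a model which infers an action purely from the scene would answer "yes" to the weightlifting question, which is the kind of failure such a dataset is meant to expose.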

I'll write about the method later, since it looks like a heavy topic.
