I wouldn’t trust an AI to explain how it works. Also, there’s no way it could respond in a reasonable amount of time if it were analyzing every frame of a video in real time.
Unless you gave it something that isn’t a YouTube video and it still worked, there’s no way it isn’t just using the transcript. It’s not “watching” the video.