Abstract: The goal of dense video captioning is to generate multiple text descriptions related to different timestamps from a video. Traditional methods adopt a two-stage "localize-then-describe" ...