The research data and the concerns raised made it clear that the best solution was a fully isolated page, where everything serves the purpose of editing the caption. The page had three main parts:
Vertical timeline
Half of the page is a vertical timeline where each entry is a "row" in the caption. Each row is a clickable element that opens that row for editing on the right side. Rows can be deleted, and new rows can be added.
Editor
A simple text area accompanied by two input fields for the start and end timestamps, precise to the millisecond. These fields save automatically.
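To make the millisecond-level fields more concrete, here is a minimal sketch of how a caption row and its two timestamp inputs could be modeled. Everything in it is an assumption for illustration: the `CaptionRow` shape, the function names, and the `HH:MM:SS.mmm` display format (similar to the one WebVTT uses) are not taken from the actual implementation.

```typescript
// Hypothetical shape of one caption row; the field names are assumptions.
interface CaptionRow {
  id: number;
  startMs: number; // start time in milliseconds
  endMs: number;   // end time in milliseconds
  text: string;
}

const MS_PER_SECOND = 1000;
const MS_PER_MINUTE = 60 * MS_PER_SECOND;
const MS_PER_HOUR = 60 * MS_PER_MINUTE;

// Left-pad a non-negative integer with zeros to the given width.
function pad(value: number, width: number): string {
  let out = String(value);
  while (out.length < width) {
    out = "0" + out;
  }
  return out;
}

// Format a millisecond offset as "HH:MM:SS.mmm" for display in an input field.
function formatTimestamp(ms: number): string {
  const hours = Math.floor(ms / MS_PER_HOUR);
  const minutes = Math.floor((ms % MS_PER_HOUR) / MS_PER_MINUTE);
  const seconds = Math.floor((ms % MS_PER_MINUTE) / MS_PER_SECOND);
  const millis = ms % MS_PER_SECOND;
  return pad(hours, 2) + ":" + pad(minutes, 2) + ":" + pad(seconds, 2) + "." + pad(millis, 3);
}

// Parse "HH:MM:SS.mmm" back into milliseconds; returns null for malformed input
// so an auto-save step can skip values the user has not finished typing.
function parseTimestamp(value: string): number | null {
  const match = /^(\d{2,}):([0-5]\d):([0-5]\d)\.(\d{3})$/.exec(value.trim());
  if (match === null) {
    return null;
  }
  return (
    Number(match[1]) * MS_PER_HOUR +
    Number(match[2]) * MS_PER_MINUTE +
    Number(match[3]) * MS_PER_SECOND +
    Number(match[4])
  );
}

// A row is only worth saving if it starts at or after zero and ends after it starts.
function isValidRow(row: CaptionRow): boolean {
  return row.startMs >= 0 && row.endMs > row.startMs;
}
```

Storing times as integer milliseconds and converting to text only at the input fields keeps comparisons and sorting simple, and lets an auto-save step ignore any value that does not parse yet.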
Preview
The video itself is always visible, so changes made in the editor can be previewed in real time.