Step 1: Feature Extraction - Feature extraction comes first because the databases the authors used are integral to the first encoder. They needed, for each image, a list of features with an associated sentiment, so a lot of work went into using MIRQI and a label extractor to get the databases up and running.
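To make the idea of "features with sentiment" concrete, here is a hypothetical, rule-based sketch of a label extractor. The finding vocabulary, the negation patterns, and the function name are all my own illustration, not the actual MIRQI pipeline, which is far more sophisticated; the point is only the shape of the output: (finding, sentiment) pairs per report.

```python
# Hypothetical sketch of a rule-based label extractor. The real MIRQI
# toolchain is far more involved; this only illustrates the idea of
# turning free-text report sentences into (finding, sentiment) pairs.
import re

# Toy finding vocabulary and negation cues (illustrative, not from the paper)
FINDINGS = ["cardiomegaly", "effusion", "pneumothorax", "opacity"]
NEGATIONS = re.compile(r"\b(no|without|free of)\b")

def extract_labels(report: str):
    """Return (finding, sentiment) pairs mentioned in a report."""
    labels = []
    for sentence in report.lower().split("."):
        # A negation cue anywhere in the sentence flips the sentiment
        sentiment = "negative" if NEGATIONS.search(sentence) else "positive"
        for finding in FINDINGS:
            if finding in sentence:
                labels.append((finding, sentiment))
    return labels

report = "Mild cardiomegaly. No pleural effusion or pneumothorax."
print(extract_labels(report))
# → [('cardiomegaly', 'positive'), ('effusion', 'negative'), ('pneumothorax', 'negative')]
```

A table of such pairs, one row per image, is roughly the kind of database the rest of the pipeline builds on.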
Step 2: Attention Mechanism - Before building the models that use the datasets from Step 1, you need to understand what makes the Transformer different: why it parallelizes so well, and how it produces contextual representations and why that matters.
Step 3: Visual Language Model - Once the dataset is constructed and you understand the mechanics behind the model, you can start studying the architecture the authors built in the paper.
Generally, the resources for each topic start with a simple explanation on Towards Data Science or Medium, then move to a peer-reviewed work once you have a general understanding.