Transformers have taken the world by storm over the past five years. The models that utilize them have become larger and more powerful (and impressive!), but it is difficult to build intuitions about the basic building blocks (e.g. attention) when introspecting a model with a large number of high-dimensional layers.
To help build intuitions about the internals of the Transformer, I have put together and written up a set of reproducible experiments that train and evaluate simple sequence classifiers with just one layer. The experiment is called Transformer Sequence Classification and is hosted under the Workshop repository. The intent is for that repository to host future experiments too, but for now it has just the one.
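For concreteness, here is a minimal sketch of what a one-layer Transformer sequence classifier can look like. This is purely illustrative, in PyTorch; the class name, dimensions, and pooling choice are my assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class TinySequenceClassifier(nn.Module):
    """Illustrative one-layer Transformer sequence classifier (a sketch, not the repo's code)."""

    def __init__(self, vocab_size: int, d_model: int = 8, n_heads: int = 1, n_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)                 # token id -> embedding
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_classes)                      # pooled embedding -> logits

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)                                      # (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)                               # the single self-attention layer
        x = self.norm(x + attn_out)                                    # residual connection + layer norm
        pooled = x.mean(dim=1)                                         # average over sequence positions
        return self.head(pooled)                                       # (batch, n_classes) logits

# Hypothetical character vocabulary for the example strings "aac" and "baac".
vocab = {"a": 0, "b": 1, "c": 2}
model = TinySequenceClassifier(vocab_size=len(vocab))
tokens = torch.tensor([[vocab[ch] for ch in "aac"]])                   # shape (1, 3)
logits = model(tokens)                                                 # shape (1, 2)
```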
The experiment report is split across multiple documents that are best read consecutively; they are listed here for convenience:
- High-Level Walkthrough details what happens when two example strings, "aac" and "baac", are processed by the sequence classifier: how the input strings are tokenized, how the tokens are mapped to embeddings, how the embeddings are transformed by the Transformer Block, and how the transformed embeddings are processed to produce output logits that indicate whether or not the sequence contains "a" and "b".
- self_attention_block Walkthrough details how the embeddings are transformed within the instance of the SelfAttentionBlock. Specific attention (pun intended) is given to how the attention keys, queries, and values are constructed for the example strings; a minimal sketch of that construction appears after this list.
- Generalization Errors examines example errors—both false positives and false negatives—and details why they occur within the Transformer Block.
- Improving Generalization examines strategies for improving the model's generalization. This document focuses primarily on dataset construction, initialization strategies, and model dimensionality.
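Since the self_attention_block walkthrough centers on how keys, queries, and values are built, here is a minimal sketch of the standard scaled dot-product construction. This is again an assumption-laden illustration: the repository's SelfAttentionBlock presumably differs in its details.

```python
import math
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Minimal single-head self-attention; a sketch, not the repository's implementation."""

    def __init__(self, d_model: int = 8):
        super().__init__()
        # Learned projections that construct queries, keys, and values from the embeddings.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)            # each (batch, seq, d_model)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # pairwise query-key affinities
        weights = scores.softmax(dim=-1)                           # attention distribution per position
        return weights @ v                                         # weighted sum of value vectors

# Illustrative use with a (1, 4, 8) batch of embeddings for a four-token string like "baac".
x = torch.randn(1, 4, 8)
out = SelfAttentionBlock(d_model=8)(x)                             # same shape as x: (1, 4, 8)
```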