I thought I’d dive back into history and read the original paper that started it all. It’s fairly technical, covering encoder/decoder layouts and matrix multiplications, and none of the individual components are super exciting for somebody who’s been looking at neural networks for the past decade.
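Assuming the paper in question is the Transformer paper (“Attention Is All You Need”), those matrix multiplications boil down to scaled dot-product attention: softmax(QKᵀ/√d_k)V. Here’s a minimal NumPy sketch of that one equation — the variable names and toy shapes are mine, not from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled so the
    # softmax doesn't saturate when d_k is large.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V

# Toy example: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

That’s essentially it: two matrix multiplications and a softmax, repeated across heads and layers.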
What’s exciting is that such a simple architecture produces such dramatically better results, and how the authors came up with it. Unfortunately, the paper doesn’t say how they arrived at it.
The paper itself is a bit too abstract, so I’m going to look for some of those YouTube videos that explain what’s actually going on here and why it’s such a big deal. I’ll update this later.