Why are autoencoders so bad at the simple task of text reconstruction? Is it due to a deficiency in the word embedding process, or is there an architectural explanation for why they perform so much worse on NLP tasks than on CV tasks?
I'm trying to construct a model that explains a knowledge graph in English, and an autoencoder seems perfect for the final step, but every autoencoder result I've come across for text is extremely poor. I'm looking into using something like Perceiver instead to bridge the gap between the graph and text modalities, but I'd like to better understand why something like attention is necessary and why feedforward networks appear insufficient for even the simplest version of the job.
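For concreteness, this is roughly what I mean by the "simplest version" of the job: a plain feedforward autoencoder over a fixed window of word embeddings, trained to reconstruct the original tokens. All names, dimensions, and the toy vocabulary below are placeholders I made up, not part of any real pipeline:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # toy vocabulary size (placeholder)
SEQ_LEN = 16          # fixed window of tokens
EMB_DIM = 64
LATENT_DIM = 32

class FeedforwardTextAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        # Encoder: flatten the window of embeddings into one vector, compress it.
        self.encoder = nn.Sequential(
            nn.Linear(SEQ_LEN * EMB_DIM, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )
        # Decoder: expand the latent back to per-position vocabulary logits.
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, SEQ_LEN * VOCAB_SIZE),
        )

    def forward(self, token_ids):                       # (batch, SEQ_LEN)
        x = self.embed(token_ids).flatten(start_dim=1)  # (batch, SEQ_LEN * EMB_DIM)
        z = self.encoder(x)
        logits = self.decoder(z).view(-1, SEQ_LEN, VOCAB_SIZE)
        return logits

# Reconstruction objective: cross-entropy against the original token ids.
model = FeedforwardTextAutoencoder()
tokens = torch.randint(0, VOCAB_SIZE, (8, SEQ_LEN))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens)
loss.backward()
```

In my experiments even this kind of setup reconstructs text far worse than an equivalent image autoencoder reconstructs pixels, which is the gap I'm trying to understand.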