/sci/ - Science & Math » Thread #13379250

40KiB, 1574x896, 1_44eDEuZBEsmG_TCAKRI3Kw@2x.png

View Same Google iqdb SauceNAO

Anonymous Sun 11 Jul 00:54:11 2021 No.13379250 View Reply Original Report

Quoted By: >>13379310 >>13379336 >>13379478 >>13379590

Why are autoencoders so bad at the simple task of text reconstruction? Is it due to a deficiency in the word embedding process, or is there an architectural explanation for why they perform so much worse on NLP tasks than on CV tasks?

I'm trying to construct a model that explains a knowledge graph in English, and an autoencoder seems perfect for the final step, but all of the results I look at for autoencoders are extremely poor. I'm looking into using something like Perceiver instead to bridge the gap between graph and text modalities, but I'd like to better understand why something like attention is necessary and feedforward networks appear insufficient for even the simplest version of the job.

Anonymous

Anonymous Sun 11 Jul 2021 01:27:26 No.13379310 Report

Quoted By: >>13379336 >>13379367 >>13379404

>>13379250
Text has very low spatial autocorrelation.

Anonymous

Anonymous Sun 11 Jul 2021 01:40:54 No.13379336 Report

Quoted By: >>13379404

>>13379250
I think textual features like this use other methods.

>>13379310
And low volume of data.

Anonymous

Anonymous Sun 11 Jul 2021 01:53:31 No.13379367 Report

Quoted By:

>>13379310
Good answer

Anonymous

Anonymous Sun 11 Jul 2021 02:06:23 No.13379404 Report

Quoted By: >>13379481

>>13379336
I know that attention is used in general, but crossing the gap between graph and text with attention is kind of hard. Perceiver is a new architecture and is the only thing really trying to solve it, as far as I can tell.

>>13379310
Do you think that using something like a mixer to approximate local attention would be enough to fix that? I haven't found much relevant research in this area, and if I want to run my own experiments it costs money.

Anonymous

Anonymous Sun 11 Jul 2021 02:36:59 No.13379478 Report

Quoted By:

>>13379250
I dont know but I tried to do an AE for a text dataset with LSTMs once and it sucked ass. I hate NLP.

Anonymous

Anonymous Sun 11 Jul 2021 02:38:11 No.13379481 Report

Quoted By: >>13379484 >>13379493 >>13379501

>>13379404
It seems you may have a model that is too large (deep) to be trained with text, if you borrowed the idea direclty from CV.

And the reconstruction in images look like the original because there is too much redundant and correlated data, image grabbing a whole text, convering into a small feature vector, trying to recreate it and expecting it to be similar to the original according to usual metrics.
The degree of ambiguity (possible texts you can generate given a specific feature vector and vice-versa) is too high.

Anonymous

Anonymous Sun 11 Jul 2021 02:39:12 No.13379484 Report

Quoted By:

>>13379481
*imagine grabbing a whole

Anonymous

Anonymous Sun 11 Jul 2021 02:41:44 No.13379485 Report

Quoted By: >>13379493 >>13379497 >>13379501

Autoencoders work well with global structure and spatial autocorrelation. Images are the best example of this.
By global structure, think of something like a face, which autoencoders perform extremely well on. A face is a coherent object and all faces share the same basic shape. And there is a hierarchy of features to decompose that global structure down.
Autocorrelation in the case of an image simply means you can predict neightboring pixels of any given pixel. Self explanatory but just explaining for the sake of it.
Text lacks autocorrelation. First of all words don't repeat, but ok we are probably using some kind of embedding that helps here a little bit, but even then language is contextual. Verbs usually follow nouns etc.
More importantly sentences require non-local context. To really "understand" a particular word, you might need to look backwards a few word, or even forwards a few words. And there isn't any kind of universal rule that covers this. A sentence in essence is a kind of graph that is being displayed in a serialized format, and we just pick up those extra semantic links contextually. But this is hard for something like an autoencoder to do.
I'm being a little handwavey here but hopefully that makes sense.

Anonymous

Anonymous Sun 11 Jul 2021 02:49:15 No.13379493 Report

Quoted By: >>13379503

>>13379481
I don't think you know enough about this subject to understand what I'm saying.

>>13379485
Sounds to me like we both think it's just due to a lack of information in embeddings. I guess I'll need to do my experiment with a global mixer if I want to see if it works.

Anonymous

Anonymous Sun 11 Jul 2021 02:51:29 No.13379497 Report

Quoted By: >>13379595

>>13379485
The features they use are much simpler, for example:

https://en.wikipedia.org/wiki/Latent_semantic_analysis

http://cohmetrix.memphis.edu/cohmetrixhome/documentation.html

Also it makes you wonder how do you feed the data to a network (per sentence, per paragraph, whole document, ...) and what kind of input layer you need given how the data behaves (convolutional, full layer, ...).

Anonymous

Anonymous Sun 11 Jul 2021 02:52:39 No.13379501 Report

Quoted By:

>>13379481
I tried to explain this:
>>13379485

Anonymous

Anonymous Sun 11 Jul 2021 02:52:58 No.13379503 Report

Quoted By:

>>13379493
Yea mixers, Perceivers and Transformers seem like the right tool for the job. Not sure exactly what you are working with but it sounds like you are on the right track.

Also another think to take note of. If you are usually a fully connected autoencoder, they are permutation invariant, meaning you could shuffle the data around and it would still work exactly the same, assuming you apply the same shuffle to all of the data consistently. This is a clear clue that you're going to have problems with something like text, where ordering is critical.

Anonymous

Anonymous Sun 11 Jul 2021 03:29:40 No.13379590 Report

Quoted By:

>>13379250
>auto encoders
>NLP
get your head out of your ass we need people developing new spiking neural network architectures

Anonymous

Anonymous Sun 11 Jul 2021 03:31:26 No.13379595 Report

Quoted By:

>>13379497
The intuition here is that features an image-based network come from deep networks creating complex descriptors (a convolution of a convolution of a convultion... and finally some fully connected layers).
In texts, you don't have that much data and that much auto-correlation to be able to do the same. The process of extracting a feature builtin into the network must be very shallow.

And given how much information will be lost during encoding, you will have to tweak how you compare the input and the output (reconstruction) to some absurd logic where semantically similar texts are considered similar.

Capcode	All Only User Posts Only Moderator Posts Only Admin Posts Only Developer Posts
Show Posts	All Only With Images Only Without Images
Deleted Posts	All Only Deleted Posts Only Non-Deleted Posts
Ghost Posts	All Only Ghost Posts Only Non-Ghost Posts
Post Type	All Only Sticky Threads Only Opening Posts Only Reply Posts
Results	All Grouped By Threads
Order	Latest Posts First Oldest Posts First

Your latest searches