‘Theory of Vibe’ Open Questions: Going Beyond Toy Models

Peli Grietzer
4 min read · Jan 19, 2021

The content of my doctoral dissertation was something like “If you grant me the false premise that a work of art is a set of samples from a bottleneck autoencoder trained on a distribution D to minimize Euclidean reconstruction distance, I can retrodict many of the deepest ideas in aesthetics and poetics.” I see the doctorate as introducing a toy model, and toy models are fine — but to work with a toy model honestly you have to prove that the model can evolve without falling apart. I spent most of the year thinking about different directions in which the toy model can/should evolve:

  1. Building a model where the work of art embodies a representation space the artist (or culture, or whatever) acquired through supervised or reinforcement learning processes, but allows the reader to learn this representation space through pure autoencoding of the work of art. Doing this requires saying something about whether, when, and how one can translate a supervised/reinforcement learning task into an unsupervised learning task. An interesting, recently discovered fact that may be relevant: one can use adversarially robust supervised nets to produce pretty decent input reconstructions via a ‘deep dream’ process that starts with noise and modifies the noise until its representation in the network’s upper feature layer matches the representation of the input, and the process converges to roughly the same reconstruction whatever the initial noise (see the inversion sketch after this list). So it seems that even without a decoder we can talk about something like the ‘canonical’ objects associated with the representations of a trained neural network. I’m considering an experiment that trains an autoencoder on ‘deep dreamed’ input reconstructions from adversarially robust supervised networks, but it’s going to be resource-intensive, so I’m still in the theoretical-motivation stage.
  2. Building a model where, instead of talking about autoencoders, I’m talking about some broad family of generative networks. (Some autoencoders are not, strictly speaking, generative networks, but we can mostly treat autoencoders as a subcategory of generative networks.) Doing this requires proving or demonstrating the following principle for some family of generative networks: if a network N training on distribution D converges to some generative distribution M, then training on samples from M instead of on samples from D will converge to M with fewer samples. (I sketch a more precise statement of this principle after the list.)
  3. Building a model in which I do not limit myself to autoencoders that minimize Euclidean reconstruction distance. Most real successes with autoencoders in contemporary machine learning involve sequentially structured input objects such as graphs or waveforms; that is, each input is, say, an entire graph. To make autoencoding work on inputs like this, instead of minimizing Euclidean reconstruction distance we minimize the negative log-likelihood that a conditional RNN decoder, trained jointly with the encoder, assigns to the input given the input’s encoding (see the sequence-autoencoder sketch after this list). (A good way to think about the codes you get when training an encoder together with a conditional stochastic RNN decoder is as describing the ‘internal grammar’ of an object, whereupon the decoder generates a random object in accordance with the grammar given by the code.) What sucks about this, from my point of view, is that we can no longer talk about an input-space submanifold that expresses the encoding function geometrically, because even the input-space point assigned the highest probability conditional on a given code doesn’t effectively represent the entire complicated probability distribution associated with that code. But there is also something very cool about thinking of the code space as a space of ‘internal grammars’ of objects, so that the vibe the code space maps is the vibe of a family of internal grammars of objects rather than the vibe of a family of objects.
  4. Building a model where I refine the idea that autoencoding occurs within, or in relation to, a human or cultural representation space (shared by artist and reader) that already involves sophisticated abstraction. There are interesting conceptual and technical differences between two ways of thinking about this. One way would be to build a model in which an autoencoder takes abstract representations of inputs and compresses them to even more abstract representations. (Recent research shows that in supervised neural nets the upper-feature-layer activations caused by inputs from the training distribution are restricted to a lower-dimensional submanifold of the upper feature layer, so if the input comes from something like a supervised net’s upper feature layer it should be highly amenable to autoencoder compression.) A second way would be to build a model where an autoencoder receives concrete (that is, low-level) input representations, but measures the reconstruction distance by sending both the input and the reconstruction to some abstract representation space and only then taking the distance between them; this is called a feature-matching or ‘perceptual distance’ autoencoder (see the sketch after this list). Very speculatively, I’m drawn to the idea that variants of both of these models are at play in aesthetics, and that the difference between them is related to the difference between understanding an artistic style and mastering an artistic style.
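
A minimal sketch of the ‘deep dream’ reconstruction process from (1), in PyTorch. Nothing here is the actual experimental setup: `feature_fn` stands in for a hypothetical function returning the upper-feature-layer activations of an already-trained, adversarially robust network, and the hyperparameters are arbitrary.

```python
import torch

def invert_representation(feature_fn, x_target, steps=500, lr=0.05):
    """Start from noise and nudge it until its upper-layer representation
    matches the target input's representation."""
    with torch.no_grad():
        target_feats = feature_fn(x_target)              # representation to match
    x = torch.randn_like(x_target).requires_grad_(True)  # begin from pure noise
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(feature_fn(x), target_feats)
        loss.backward()
        opt.step()
    return x.detach()  # a candidate 'canonical' object for this representation
```

The point of the robustness requirement is that, with a robust network, runs from different noise seeds tend to land on much the same reconstruction, which is what licenses talking about a representation’s ‘canonical’ object even though there is no decoder.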
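
One way to state the principle in (2) slightly more formally; the notation is mine, not a standard formulation, and ‘fewer samples’ is cashed out as sample complexity at a fixed accuracy:

```latex
% A_n(P): the generative distribution the training procedure learns
% from n i.i.d. samples of a distribution P.
% Assumption (the convergence premise): for some divergence d,
\lim_{n \to \infty} d\bigl(A_n(D),\, M\bigr) = 0
% Principle to prove or demonstrate: for every accuracy \varepsilon > 0,
\min\{\, n : d(A_n(M), M) \le \varepsilon \,\}
  \;\le\;
\min\{\, n : d(A_n(D), M) \le \varepsilon \,\}
% i.e. retraining on samples drawn from M reaches M with fewer samples
% than training on D does.
```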
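
And a toy PyTorch sketch of the setup in (3): a sequence autoencoder whose training loss is the negative log-likelihood a conditional RNN decoder assigns to the input given the input’s code. This is an illustrative stand-in, not any particular published model; the architecture, names, and dimensions are all invented.

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    """Encode a token sequence to a code; score the sequence's NLL under a
    decoder RNN conditioned on that code (no Euclidean reconstruction term)."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128, code_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.to_code = nn.Linear(hidden_dim, code_dim)    # the bottleneck
        self.from_code = nn.Linear(code_dim, hidden_dim)  # code initializes the decoder
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len) of token ids
        emb = self.embed(tokens)
        _, h = self.encoder(emb)
        code = self.to_code(h[-1])              # the input's 'internal grammar'
        h0 = torch.tanh(self.from_code(code)).unsqueeze(0)
        out, _ = self.decoder(emb[:, :-1], h0)  # teacher forcing: predict token t from tokens < t
        logits = self.readout(out)
        nll = nn.functional.cross_entropy(      # per-token negative log-likelihood of the input
            logits.reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1))
        return code, nll
```

Sampling from the decoder conditioned on a fixed `code` then generates random objects obeying that code’s ‘internal grammar’, which is the reading of the code space gestured at in (3).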
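
Finally, the second variant in (4), the feature-matching / ‘perceptual distance’ loss, is just a change in where the reconstruction distance is measured. A minimal sketch, assuming `autoencoder` is any low-level encoder/decoder module and `feature_fn` is a frozen network computing the abstract representation:

```python
import torch

def perceptual_reconstruction_loss(autoencoder, feature_fn, x):
    """Send both the input and its reconstruction to an abstract representation
    space and take the distance there, rather than in the raw input space."""
    x_hat = autoencoder(x)              # concrete, low-level reconstruction
    with torch.no_grad():
        feats_x = feature_fn(x)         # abstract representation of the input
    feats_hat = feature_fn(x_hat)       # gradients flow through feature_fn into the autoencoder
    return torch.nn.functional.mse_loss(feats_hat, feats_x)
```

The first variant in (4) would instead feed `feature_fn(x)` itself to the autoencoder as the input to be compressed and reconstructed.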

None of these research directions is particularly easy to work through, formally, experimentally, or philosophically, so figuring out the right resource allocation is a serious question.

Peli Grietzer

Harvard Comp. Lit; visitor at the Einstein Institute of Mathematics