Peli Grietzer
Sep 19, 2019

Technical Open Questions in My ‘Theory of Vibe’ Work

Question 1: Distilling Core Commitments

A friend asked what the core commitments of my theory are, because he thinks the choice of autoencoders as my ML-theoretic stand-in for vibes might have some undesirable implications. So below I’m gonna try to describe my minimal desiderata for an ML-theoretic stand-in for grasping a ‘vibe.’

I think grasping a ‘vibe’ corresponds to a representation space such that:

1) Given an input x and the space's representation of x, there's a natural way to measure how well the representation "captures" x.

2) There is strong pressure on the representation space to well-capture all and only inputs from the same distribution as the training set, so when the trained representation space fails to well-capture some input x, the failure acts as an anomaly detector.

3) We can talk about something like the set of inputs X' such that the trained representation space captures every x' in X' optimally, and show that a small sample from this set of "idealized" inputs (as opposed to a much larger sample from the actual training set) is sufficient to efficiently reverse-engineer the trained representation space.

I think it’s plausible that not only autoencoder representation spaces, but most representation spaces produced by unsupervised methods, and some representation spaces produced by self-supervised methods and by adversarially robust supervised methods, will have some version of properties 1–3. For many of these representation spaces, the best path to establishing properties 1–3 may be to apply a measure of “capturing” that relies on the lossiness of reconstructions produced via a decoder, but there are some interesting alternatives specific to (respectively) representation spaces produced by self-supervised methods and representation spaces produced by adversarially robust supervised methods.*

[*With representation spaces produced by self-supervised methods like contrastive predictive coding, the best measure of "capturing" an input x might be the degree of success of the self-supervision task on x. With representation spaces produced by adversarially robust supervised methods, the best measure of "capturing" an input x might be the lossiness of reconstructions produced by stochastically optimizing noise until its representation in the network's upper feature layer matches the representation of x. If I'm right that both of these decoder-free measures of "capturing" would support properties 1–3 as effectively as decoder-based measures would, my project can potentially gain a lot of flexibility.]
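To have the decoder-based baseline on the table, here is a minimal sketch of properties 1 and 2 in the autoencoder case, assuming PyTorch; the architecture and the synthetic toy distribution are my own placeholders rather than anything the theory specifies.

```python
# A minimal sketch of the decoder-based measure of "capturing" (properties 1 and 2),
# assuming PyTorch; the model and the toy data are illustrative placeholders.
import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    def __init__(self, input_dim=64, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 32), nn.ReLU(),
                                     nn.Linear(32, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 32), nn.ReLU(),
                                     nn.Linear(32, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def capture_error(model, x):
    # Property 1: lossiness of the reconstruction, here Euclidean distance.
    with torch.no_grad():
        return ((model(x) - x) ** 2).sum(dim=-1).sqrt()

# Toy training distribution D: inputs confined to an 8-dimensional subspace of R^64.
torch.manual_seed(0)
basis = torch.randn(8, 64)
train_x = torch.randn(2048, 8) @ basis
model = BottleneckAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    ((model(train_x) - train_x) ** 2).mean().backward()
    opt.step()

# Property 2: failure to well-capture an input acts as an anomaly detector;
# the threshold here is the 99th percentile of training-set capture error.
threshold = capture_error(model, train_x).quantile(0.99)
novel_x = torch.randn(16, 64)  # inputs not drawn from D
flagged = capture_error(model, novel_x) > threshold
print(flagged.float().mean())  # fraction of off-distribution inputs flagged
```

Property 3 would then be the further claim that a handful of inputs this model captures with (near-)zero loss suffices to recover the trained space itself, which the sketch does not attempt to show.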

The core idea of my project is that if we identify grasping a vibe with having a representation space that has properties 1–3, I can derive the following two independently attractive aesthetic-philosophical principles:

A) Sensing a vibe is sensing the applicability of a worldview — or of a 'mood,' in the Heideggerian sense — to a set of objects. (That is, sensing that a single worldview is sufficient to well-capture all these objects.)

B) The optimal training set for learning a worldview v is an artefactual world such that v captures everything in that world *perfectly* — which by ‘A)’ means that it is an artefactual world with an ‘ideal’ or ‘pure’ form of the vibe possessed by natural datasets associated with v.

Question 2: Going Beyond Toy Models

The content of my doctoral dissertation was something like “If you grant me the false premise that a work of art is a set of samples from a bottleneck autoencoder trained on a distribution D to minimize Euclidean reconstruction distance, I can retrodict many of the deepest ideas in aesthetics and poetics.” I see the doctorate as introducing a toy model, and toy models are fine — but to work with a toy model honestly you have to prove that the model can evolve without falling apart. I spent most of the year thinking about different directions in which the toy model can/should evolve:

1) Building a model where the work of art embodies a representation space the artist (or culture or whatever) acquired through supervised or reinforcement learning processes, but allows the reader to learn this representation space through pure autoencoding of the work of art. Doing this requires saying something about whether, when, and how one can translate a supervised/reinforcement learning task into an unsupervised learning task. An interesting recently discovered fact that may be relevant is that one can use adversarially robust supervised nets to do pretty decent input reconstructions by using a 'deep dream' process that starts with noise and modifies the noise until its representation in the network's upper feature layer is identical to the representation of the input, and the process arrives at pretty much the same reconstruction whatever the initial noise. So it seems that even without a decoder we can talk about something like the 'canonical' objects associated with the representations of a trained neural network. I'm considering doing an experiment that trains an autoencoder on 'deep dreamed' input reconstructions from adversarially robust supervised networks, but it's gonna be resource intensive so I'm still in the theoretical-motivation stage.
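Here is a hedged sketch of that feature-inversion process, assuming PyTorch; robust_net_features is a placeholder for the upper feature layer of an adversarially robust classifier, which the sketch takes as given rather than trains.

```python
# Reconstructing an input from an adversarially robust net without a decoder:
# optimize noise until its upper-feature-layer representation matches the input's.
# `robust_net_features` is a stand-in for the robust network's feature extractor.
import torch

def invert_features(robust_net_features, x, steps=500, lr=0.1):
    with torch.no_grad():
        target = robust_net_features(x)               # representation to be matched
    recon = torch.randn_like(x, requires_grad=True)   # start from pure noise
    opt = torch.optim.Adam([recon], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((robust_net_features(recon) - target) ** 2).mean().backward()
        opt.step()
    return recon.detach()
```

If the robustness claim holds, different initial noise should land on roughly the same reconstruction, which is what licenses talk of a 'canonical' object for a given representation, and those reconstructions are what the proposed autoencoder experiment would train on.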

2) Building a model where, instead of talking about autoencoders, I'm talking about generative networks that do density estimation. (There is a lot of overlap between autoencoders and such networks, but it's not quite the same category. When I say "autoencoder," I mean specifically a bottlenecked artificial neural network that was trained on some distribution D to minimize some reconstruction cost, perhaps with some additional regularization.) Doing this requires proving or demonstrating the following principle for some family of generative networks: if a network N trained on distribution D converges to some generative distribution M, then training a network of the same kind on samples from M instead of on samples from D will converge to M with fewer samples.
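As a gesture at what demonstrating that principle might look like, the toy sketch below swaps the neural generative network for a Gaussian mixture (purely my simplification, for tractability): fit a model on D to obtain M, then check whether fresh models fitted on small samples from M score closer to M than fresh models fitted on equally small samples from D.

```python
# Toy version of the sample-efficiency conjecture, with a Gaussian mixture standing
# in for the generative network; D is deliberately not exactly a two-Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
D = np.concatenate([rng.uniform(-4, -2, (3000, 2)),     # a non-Gaussian lump
                    rng.normal(3, 0.5, (3000, 2))])     # plus a Gaussian lump

M = GaussianMixture(n_components=2, random_state=0).fit(D)  # converged model: distribution M
heldout_from_M, _ = M.sample(5000)                           # yardstick for closeness to M

for n in [50, 200, 1000]:
    fit_on_D = GaussianMixture(n_components=2, random_state=1).fit(D[rng.choice(len(D), n)])
    fit_on_M = GaussianMixture(n_components=2, random_state=1).fit(M.sample(n)[0])
    # Higher score = higher average log likelihood on samples from M, i.e. closer to M.
    print(n, fit_on_D.score(heldout_from_M), fit_on_M.score(heldout_from_M))
```

The conjecture predicts that fit_on_M should win at small n; whether an analogous statement can be proven or demonstrated for actual neural density estimators is exactly the open question.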

3) Building a model in which I do not limit myself to autoencoders that minimize Euclidean reconstruction distance. Most real successes with autoencoders in contemporary machine learning involve sequentially structured input objects such as graphs or waveforms. To make autoencoding work on inputs like this, instead of minimizing Euclidean reconstruction distance we minimize the negative log likelihood that a conditional RNN decoder trained with the encoder assigns to the input given the input's encoding. (A good way to think about the codes you get when training an encoder together with a conditional stochastic RNN decoder is as describing the 'internal grammar' of an object, whereupon the decoder generates a random object in accordance with the grammar given by the code.) What sucks about this, from my point of view, is that we can no longer talk about an input-space submanifold that expresses the encoding function geometrically, because even the input-space point assigned the highest probability conditional on a given code doesn't effectively represent the entire complicated probability distribution associated with that code. But there is also something very cool about thinking of the code space as a space of 'internal grammars' of objects, so that the vibe that the code space maps is the vibe of a family of internal grammars of objects rather than the vibe of a family of objects.
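For concreteness, here is a minimal sketch of that setup, assuming PyTorch and a toy token vocabulary; every architectural detail is a generic stand-in of my own rather than a specification from the dissertation. The reconstruction cost is the decoder's negative log likelihood of the input sequence given the input's code.

```python
# A sequence autoencoder with a conditional RNN decoder: the encoder maps a token
# sequence to a code, and training minimizes the negative log likelihood the
# decoder assigns to the sequence given that code (teacher forcing, no sampling).
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, vocab=32, emb=16, hidden=64, code=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.to_code = nn.Linear(hidden, code)
        self.from_code = nn.Linear(code, hidden)   # the code conditions the decoder's state
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens):
        emb = self.embed(tokens)
        _, h = self.encoder(emb)
        code = self.to_code(h[-1])                 # the sequence's 'internal grammar'
        h0 = torch.tanh(self.from_code(code)).unsqueeze(0)
        # Predict token t from tokens < t, conditioned on the code (teacher forcing).
        shifted = torch.cat([torch.zeros_like(emb[:, :1]), emb[:, :-1]], dim=1)
        dec_out, _ = self.decoder(shifted, h0)
        return self.out(dec_out), code

tokens = torch.randint(0, 32, (8, 20))             # a toy batch of token sequences
model = SeqAutoencoder()
logits, code = model(tokens)
nll = nn.functional.cross_entropy(logits.reshape(-1, 32), tokens.reshape(-1))
nll.backward()                                      # reconstruction cost = NLL of input given code
```

Sampling from the trained decoder conditioned on a fixed code (not shown) would then generate random objects in accordance with that code's 'internal grammar.'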

4) Building a model where I refine the idea that autoencoding occurs within or in relation to a human or cultural representation space (shared by artist and reader) that already involves sophisticated abstraction. There are interesting conceptual and technical differences between two ways of thinking about this: One way would be to build a model in which there is an autoencoder that takes abstract representations of inputs and compresses them to even more abstract representations. (Recent research shows that in supervised neural nets the upper feature layer activations caused by inputs from the training distribution are restricted to a lower dimensional submanifold of the upper feature layer, so if the input comes from something like a supervised net’s upper feature layer it should be highly amenable to autoencoder compression.) A second way would be to build a model where there is an autoencoder that receives concrete — that is, low-level — input representations, but measures the reconstruction distance by sending both the input and the reconstruction to some abstract representation space and only then taking the distance between them. (This is called a feature-matching or in some contexts ‘perceptual distance’ autoencoder.) Very speculatively, I’m drawn to the idea that variants of both of these models are at play in aesthetics, and that the difference between them is related to the difference between understanding an artistic style and mastering an artistic style.
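Here is a minimal sketch of the second variant, assuming PyTorch; phi is a placeholder for whatever fixed abstract representation space artist and reader are presumed to share, and nothing here claims to identify that space.

```python
# A 'feature-matching' / 'perceptual distance' autoencoder: the input and its
# reconstruction are compared in an abstract feature space rather than in input space.
import torch
import torch.nn as nn

# Fixed abstract representation space (a frozen stand-in; not learned here).
phi = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
for p in phi.parameters():
    p.requires_grad_(False)

# Bottleneck autoencoder over the concrete, low-level inputs.
autoencoder = nn.Sequential(nn.Linear(64, 8), nn.Linear(8, 64))

def perceptual_loss(x):
    recon = autoencoder(x)
    # Distance is taken between abstract representations, not raw inputs.
    return ((phi(recon) - phi(x)) ** 2).mean()

x = torch.randn(16, 64)                     # toy batch of concrete inputs
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
opt.zero_grad()
perceptual_loss(x).backward()
opt.step()
```

The first variant would instead run the autoencoder directly on phi(x), compressing the abstract representation itself into an even more abstract code.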

None of these research directions is particularly easy to work through formally, experimentally, or philosophically, so figuring out the right resource allocation is a serious question.

Peli Grietzer

Harvard Comp. Lit; visitor, Einstein Institute of Mathematics