‘Theory of Vibe’ Open Questions: Distilling Core Commitments

Peli Grietzer
Jan 19, 2021

A friend asked what the core commitments of my theory are, because he thinks the choice of autoencoders as my ML-theoretic stand-in for vibes might have some undesirable implications. So below I’m gonna try to describe my minimal desiderata for an ML-theoretic stand-in for grasping a ‘vibe.’

I think grasping a ‘vibe’ corresponds to a representation space such that:

1) Given an input x and the space’s representation of x, there’s a natural way to measure how well the representation “captures” x.

2) There is strong pressure on the representation space to well-capture all and only inputs from the same distribution as the training set, so when the trained representation space fails to well-capture some input x, that failure acts as an anomaly detector.

3) We can talk about something like the set of inputs X’ such that the trained representation space captures every x’ in X’ optimally, and show that a small sample from this set of “idealized” inputs (as opposed to a much larger sample from the actual training set) is sufficient to efficiently reverse-engineer the trained representation space. (A toy sketch of properties 1–3 follows this list.)

I think it’s plausible that not only autoencoder representation spaces, but most representation spaces produced by unsupervised methods, and some produced by self-supervised methods and by adversarially robust supervised methods, will have some version of properties 1–3. For many of these representation spaces, the best path to establishing properties 1–3 may be a measure of “capturing” that relies on the lossiness of reconstructions produced via a decoder, but there are interesting decoder-free alternatives specific to self-supervised methods and to adversarially robust supervised methods, respectively.*

[*With representation spaces produced by self-supervised methods like contrastive predictive coding, the best measure of “capturing” an input x might be the degree of success of the self-supervision task on x. With representation spaces produced by adversarially robust supervised methods, the best measure of “capturing” an input x might be the lossiness of reconstructions produced by stochastically optimizing noise until its representation in the network’s upper feature layer matches the representation of x. If I’m right that both of these decoder-free measures of “capturing” would support properties 1–3 as effectively as decoder-based measures do, my project potentially gains a lot of flexibility.]

The core idea of my project is that if we identify grasping a vibe with having a representation space that has properties 1–3, I can derive the following two independently attractive aesthetic-philosophical principles:

A) Sensing a vibe is sensing the applicability of a worldview — or of a ‘mood,’ in the Heideggerian sense — to a set of objects. (That is, sensing that a single worldview is sufficient to well-capture all these objects.)

B) The optimal training set for learning a worldview v is an artefactual world such that v captures everything in that world *perfectly* — which, by A), means that it is an artefactual world with an ‘ideal’ or ‘pure’ form of the vibe possessed by the natural datasets associated with v.

--

Peli Grietzer

Harvard Comp. Lit; visitor, Einstein Institute of Mathematics