From ‘A Literary Theorist’s Guide to Autoencoding’

Peli Grietzer
Jun 14, 2017


(Excerpt from ’17 Harvard Department of Comparative Literature dissertation ‘Ambient Meaning: Mood, Vibe, System’ Ch. 1)

I. The Mechanics of Autoencoders

An autoencoder algorithm is a program tasked with learning, through a kind of trial and error, to make facsimiles of worldly objects. Why make a program learn how to create facsimiles? One way to think about this is as follows. When human artists make facsimiles or imitations of the world — think of sculpture and painting, fiction, acting, costuming, dollmaking — we treat the artist’s facsimile as an interpretation of the objects or phenomena the artist imitates, or even an interpretation of the world in a more general sense. The basic AI-theoretic motivation for autoencoders is, speaking informally, that something of this sort should hold for algorithms too: to imitate is to interpret, so if we can push an algorithm to make reasonable imitations then we have gotten it to interpret the world meaningfully.

Let’s call a hypothetical, exemplary autoencoder ‘Hal.’ Hal is an algorithm with one input channel and two output channels. Hal’s input channel takes sensory data — images, recordings, videos, texts — and in return Hal’s output channel #1 gives short summaries of these data, and Hal’s output channel #2 attempts to reconstruct the data from the information in the summaries. For every object Hal receives as input, Hal’s short summary will consist of a short list of short numbers that records various ‘measurements’ of the input, and Hal’s reconstruction will consist of an object in the same material medium — image, audio, video, text, and so on — as the input. In addition to an input channel and two output channels, Hal is also equipped with a mechanism that we’ll call Hal’s ‘optimizer.’ Hal’s optimizer, in very informal terms, is a mechanism that measures the accuracy of Hal’s reconstruction of an input at a mechanical level — it measures how “close” Hal’s reconstruction is to matching the original detail by detail — and then applies a formula that slightly revises the specifics of Hal’s method of summary and reconstruction to slightly improve Hal’s future accuracy on this input. Hal’s optimizer mechanism can be turned on and off at the AI researcher’s discretion, and we refer to Hal as training if its optimizer mechanism is turned on, and as trained if its optimizer mechanism is turned off. Lastly, we call the set of all inputs that Hal interacts with when Hal’s optimizer is turned on Hal’s training set, and we call any input that Hal only interacts with when Hal’s optimizer is turned off Hal’s test data. An autoencoder’s training set is typically a large set of media files whose contents share a structurally meaningful domain in common — for example, a large set of audio files of English words pronounced out loud (the domain being ‘spoken English words’), or a large set of scanned handwritten numerals (the domain being ‘handwritten numerals’), or a large set of images of human faces (the domain being ‘human faces’). The training set can also be much more diverse than in the above examples, and instead draw on very general domains such as ‘photographs’ or ‘English sentences’: it’s standard practice, for example, to use a training set consisting of every page on Wikipedia or on Google News, or of several million photographs randomly sampled from all the photos posted on the internet. Finally, although test data can mean literally any input to a trained autoencoder that was not included in the training set, one typically tests a newly trained autoencoder with a set of media files that fall within the same domain as the training set but were withheld from the autoencoder during training.
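
To make the pieces of this description concrete, here is a minimal sketch of an autoencoder like Hal in PyTorch. Everything in it is illustrative rather than standard: the class name, the input size, and the number of feature values are assumptions, and a real autoencoder for images or audio would typically use a much deeper network.

```python
# A minimal sketch of an autoencoder like 'Hal'. All names and sizes here
# are illustrative assumptions, not a standard architecture.
import torch.nn as nn

class Hal(nn.Module):
    def __init__(self, input_size=784, n_features=32):
        super().__init__()
        # 'Feature function': compresses an input into a short list of short numbers.
        self.feature = nn.Sequential(nn.Linear(input_size, n_features), nn.Sigmoid())
        # 'Decoder function': rebuilds an input-shaped object from the feature values.
        self.decoder = nn.Linear(n_features, input_size)

    def forward(self, x):
        features = self.feature(x)               # output channel #1: the short summary
        reconstruction = self.decoder(features)  # output channel #2: the facsimile
        return features, reconstruction
```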

While the final two concepts introduced above — an autoencoder’s training set and its test data — remain marginal in our exposition for the moment, they will become central to our discussion of autoencoding once we’re free to move beyond a survey of the bare mechanics of autoencoder algorithms. Recall that the ‘trial and error’ method by which untrained autoencoder algorithms learn is meant to act as something of a neutral conduit through which structural forces inherent to the task of developing a method of mimesis assert themselves. Similarly, we will soon start thinking about a trained autoencoder algorithm as a kind of conduit for generative structures and schemas of perception that are latent in the autoencoder’s training sets, and thinking of the performance of a trained autoencoder on test data as the expression of these generative structures and schemas of perception. Indeed, we are now just about to venture into the mechanical details of how autoencoders train — what actually happens when you give a training set to an untrained autoencoder — but only to the extent absolutely necessary to provide a platform we can use for thinking about the structural relationships between a trained autoencoder, its training set, and its test data.

To illustrate the way autoencoders train, let us suppose that Hal is an untrained autoencoder paired to a large album of photographs randomly sampled from the internet. Hal’s training starts with Hal going through the entire album, summarizing and reconstructing every photograph. Hal’s optimizer then reviews Hal’s inputs, outputs, and procedure, and employs a simple formula to calculate a very small adjustment to Hal’s summary-and-reconstruction procedure that is guaranteed to make Hal’s next summary-and-reconstruction run on the same album produce slightly more accurate reconstructions. After Hal’s optimizer makes this small adjustment to Hal’s summary-and-reconstruction procedure, Hal once again goes through the same entire album, summarizing and reconstructing every photograph, this time using the adjusted summary-and-reconstruction procedure. Hal’s optimizer then reviews the results and calculates another very small adjustment guaranteed to make Hal’s next summary-and-reconstruction run on the same album produce slightly more accurate reconstructions, and the entire process repeats itself. The process typically loops for up to several million rounds, concluding when Hal arrives at a procedure that can’t be improved by any very small adjustments. This process of iterated small adjustments, technically known as gradient descent, is of limited interest to us, since the logic of the gradient descent method itself does not say much about the nature of what an autoencoder ends up learning by employing gradient descent. One caveat worth mentioning, however, is that any concept, structure, or skill learnable through gradient descent must be ‘soft’ — that is, difficult to describe using explicit rules or formulas, but amenable to intuitions, heuristics and approximations. (E.g. tennis is soft and arithmetic is not, cooking is soft and baking is not, ‘go’ is soft and chess is not, and linguists disagree on whether syntax is soft.) Over the course of our discussion, we will revisit this matter of the ‘softness’ of the structures that autoencoders learn from a more qualitative viewpoint, basing our discussion not on the mechanical details of the gradient descent method of learning but rather on the nature of the learning tasks at hand, and ultimately even on a strong formal analogy between the structures an autoencoder must learn and some canonically hyper-soft structures of literary-theoretic fame such as Martin Heidegger’s ‘moods’ (Stimmung), Sianne Ngai’s ‘tones’, and Raymond Williams’ ‘structures of feeling.’ Nevertheless, it may be worth remembering that the humble mechanistic origins of an autoencoder’s knowledge shouldn’t lead us to expect that what it learns would itself be rule-driven, simple, inflexible, reductive, or even rigorous in character — in fact, what they should lead us to expect is just the opposite.
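
As a rough sketch of the training loop just described, again with every particular (the album of photographs, the number of rounds, the learning rate) standing in for whatever a real experiment would use:

```python
# A rough sketch of the training loop described above, assuming the Hal
# class from the previous sketch. The 'photo_album' tensor is a stand-in
# for a real training set of flattened images.
import torch

hal = Hal(input_size=784, n_features=32)
optimizer = torch.optim.SGD(hal.parameters(), lr=0.01)  # Hal's 'optimizer' mechanism
loss_fn = torch.nn.MSELoss()                            # measures reconstruction error

photo_album = torch.rand(10000, 784)                    # stand-in for the real album

for round_number in range(1000):                        # real training loops for far more rounds
    _, reconstructions = hal(photo_album)
    mean_error = loss_fn(reconstructions, photo_album)  # how far off are the facsimiles?
    optimizer.zero_grad()
    mean_error.backward()          # gradient descent: compute the very small adjustment...
    optimizer.step()               # ...and apply it to the summary-and-reconstruction procedure
```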

Let us recap, this time allowing ourselves to introduce some basic technical vocabulary when appropriate: Let ‘Hal’ be an autoencoder algorithm. Hal’s input channel is a receptor for some predetermined type of digital media file. The type or types of file Hal can receive as input will depend on the design decision of the AI researcher who built Hal, but the most typical choices are receptors for digital images, receptors for digitized audio, and receptors for word processor documents. Whenever Hal receives an input media file x, Hal’s output channel #1 outputs a short list of short numbers that we call Hal’s feature values for x, and Hal’s output channel #2 outputs a media file we call Hal’s projection of x. We call the computation that determines output #1 Hal’s feature function, and the computation that determines output #2 Hal’s projection function. Throughout this dissertation, we will often think of Hal’s feature function as Hal’s ‘worldview’ or ‘conceptual scheme,’ and of Hal’s projection function as Hal’s ‘imagination’ or ‘mimesis.’ The technical relationship between Hal’s feature function and projection function is as follows: Hal’s projection function is a composition of Hal’s feature function and of a ‘decoder’ function that, for every input x to Hal, receives the output of Hal’s feature function (aka Hal’s feature values for x) as its input and then outputs Hal’s projection of x. In other words, Hal’s Projection Function (x) = Hal’s Decoder Function (Hal’s Feature Function (x)). Finally, Hal has an optimizer mechanism, which can be turned on or turned off. We call the set of all the inputs Hal receives while its optimizer is turned on Hal’s training set, and call any input Hal receives while its optimizer is turned off test data. When Hal is training — that is, when Hal’s optimizer mechanism is turned on — Hal’s optimizer mechanism does the following: For every input x included in Hal’s training set, Hal’s optimizer mechanism compares x to Hal’s projection of x and computes a quantity called Hal’s reconstruction error on x. Using a formula called gradient descent, Hal’s optimizer mechanism then makes a small change to Hal’s projection function that slightly reduces Hal’s mean error on the training set — that is, slightly reduces the average size of Hal’s reconstruction errors. When the optimizer alters Hal’s projection function, it necessarily also (by logical entailment) alters the two functions that compose Hal’s projection function: Hal’s feature function and Hal’s decoder function.
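
For readers who like their recaps in symbols, the definitions above amount to something like the following, writing f for Hal’s feature function, d for its decoder function, and p for its projection function, and taking squared distance as one standard (though not the only possible) choice of reconstruction error:

```latex
p(x) = d\big(f(x)\big), \qquad
\mathrm{error}(x) = \lVert x - p(x) \rVert^{2}, \qquad
\text{mean error} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - p(x_i) \rVert^{2}
```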

For reasons that directly follow from the artificial neural network architecture of autoencoders, and which our abstract model will replace with stipulation, Hal’s optimizer always alters Hal’s feature function and decoder function in roughly symmetrical ways. A trained autoencoder’s decoder function is therefore roughly, and often exactly, just its feature function running in reverse: Hal’s decoder function translates short lists of short numbers into media files by mirroring the steps Hal’s feature function uses to translate media files into short lists of short numbers. Hal’s projection function, therefore, is a matter of using Hal’s feature function to translate a media file into a short list of short numbers, and then running Hal’s feature function in reverse to get a media file again. Of course, since the variety of possible media files is much wider than the variety of possible short lists of short numbers, something must necessarily get lost in the translation from media file to feature values and back. Many media files translate into the same short list of short numbers, and yet each short list of short numbers can only translate back into one media file. This means, in an important sense, that Hal’s projection function always replaces input media file x with ‘stand-in’ media file y: unless x happens to be the exact media file that Hal’s decoder function assigns to the feature values that Hal’s feature function assigns to x, Hal’s projection of x will not be x itself but some media file y that acts as the stand-in for all media files that share its feature values. The technical name for the set of all the media files that Hal uses as stand-ins — that is, all the possible outputs of Hal’s projection function — is the image of Hal’s projection function. In this dissertation, we will often think about the image of Hal’s projection function as Hal’s canon — as the set of privileged objects that Hal uses as the measure of all other objects. Importantly, because of the symmetry between the optimization of the feature function and the optimization of the decoder, the logic by which Hal determines which of the media files that share the same feature values gets to be the ‘stand-in’ for the group is utterly inseparable from the logic by which Hal determines the assignments of feature values in the first place — or, in conceptual terms, the logic that determines Hal’s canon is inseparable from the logic that determines Hal’s worldview.
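
The ‘mirroring’ relationship between the feature function and the decoder can be made concrete with a tied-weight linear autoencoder, one common design in which the decoder literally reuses the feature function’s weights in reverse. The sketch below assumes that design purely for illustration; many autoencoders only approximate this symmetry.

```python
# A sketch of a 'tied-weight' linear autoencoder, in which the decoder is
# literally the feature function run in reverse (the same weight matrix,
# transposed). One common design choice, assumed here for illustration;
# biases and nonlinearities are omitted for simplicity.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 784))        # one weight matrix, shared by both directions

def feature_function(x):
    return W @ x                      # media file (as a vector) -> short list of numbers

def decoder_function(z):
    return W.T @ z                    # short list of numbers -> media file, mirroring the step above

def projection_function(x):
    return decoder_function(feature_function(x))

x = rng.normal(size=784)              # a stand-in input
stand_in = projection_function(x)     # the 'stand-in' that represents every input sharing x's feature values
```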

II. The Meaning(s) of Autoencoders

In the preceding section, we devoted all of our attention to describing the mechanical procedures that Hal follows, saying nearly nothing of the abstract structures Hal’s procedures are designed to channel. The present section will attempt to step beyond those limits, building up towards a theoretical interpretation of autoencoding sometimes called ‘the manifold perspective.’ The manifold perspective on autoencoding, closely identified with AI luminaries Bengio and LeCun, proposes that what a trained autoencoder truly learns is a space, called the autoencoder’s manifold, and that all relevant facts about a given trained autoencoder follow from the form of this space. This concept of a trained autoencoder’s manifold will help us make sense of the mathematical relationship between a trained autoencoder’s feature function, its projection function, and the image of its projection function in a way that complements our interpretation of the feature function as a kind of worldview, the projection function as a kind of method of mimesis, and the image of the projection function as a kind of canon.

Because the manifold perspective on autoencoding promises to do so much for our purposes, we will not introduce the manifold perspective ‘from above’ but rather try to climb up to it by reflecting on the mechanically defined autoencoder algorithm we already introduced. To this effect, we will start by describing an autoencoder as per our previous definition of Hal, only this time Hal’s algorithm won’t be carried out by a machine but rather by a pair of humans playing an unusual collaborative game: Imagine that two expert art restorers specializing in Classical Greek pottery learn that an upcoming exhibition will unveil a previously unseen collection of Classical Greek pots, and they decide to put their understanding of Classical Greek pottery to an extreme test. The self-imposed rules of their game dictate that they must try to recreate the pots from the new exhibition, but only Expert #1 may see the exhibition, and only Expert #2 may sculpt the recreations. Expert #2 will thus have to rely entirely on Expert #1’s descriptions of the pots, and Expert #1 will have to rely entirely on Expert #2’s ability to divine the original pots from her descriptions. Not yet content with the rules of their test, the two Classical pottery experts add one last constraint: instead of being free to describe the pots in as much detail as she pleases, Expert #1 will be limited to written messages of 100 characters per pot. In the division of labor fixed by the rules of their game, then, Expert #1 will play the role of an autoencoder’s feature function, and Expert #2 will play the role of an autoencoder’s decoder function (and, together with Expert #1, of its projection function), with each pot in the exhibition acting as an input. (Quick reminder: ‘feature function’ is the formal term for what we have informally called an autoencoder’s method for summarizing inputs, and ‘projection function’ is the formal term for what we have informally called an autoencoder’s method of reconstructing inputs.) While this might seem a strange or pointless game for experts in Classical Greek pottery to play, I would propose that it has special merit as a test of our experts’ grasp of Classical Greek pottery as a full cultural-aesthetic system: Most crucially, because 100 written characters cannot suffice for a naive detailed description of a Classical Greek pot, our experts must invent a shorthand that relies on their grasp of the grammar of Classical Greek pottery — their (largely implicit) grasp of the constraints and logic of the variation between one Classical Greek pot and another, from the correlations and dependencies between various ceramic techniques, thematic motifs, ornamental patterns, and laws of composition, to the interactions between these tangible variables and a pot’s painterly and sculptural gestalt.

For the remainder of this section we will rather scrupulously delve into the meaning and mechanics of this hypothetical test of our hypothetical experts’ gestalt understanding of the cultural-aesthetic system of Classical Greek pottery, with the intention of beginning to conceptualize the epistemic objects of autoencoders — in other words, establishing just what it is you know when you know how to summarize-and-reconstruct. Relatedly, we will be bracketing the issue of autoencoders’ optimizer mechanisms (AKA their training process) for the moment, maintaining our focus in this section on just what autoencoders learn rather than how they learn it. We will start off, then, by positioning our Classical Greek pottery experts to be the analogue of an already trained autoencoder — an autoencoder whose optimizer has been turned off after an epoch of training, locking in place the present version of the algorithm’s summarize-and-reconstruct procedure. We will imagine, therefore, that our Classical Greek pottery experts have completed all their preparations for the test, having devised, coordinated, and extensively practiced their Classical Greek pottery shorthand, and they are now committed to employing the resulting method in the test. Importantly, we mustn’t assume that because the state of our experts after they devised their method of summary-and-reconstruction is analogous to a trained autoencoder, the state of our experts before they devised their summary-and-reconstruction method is in any way analogous to that of an untrained autoencoder. For one thing, among many other disanalogies, the work that our experts must perform in order to devise their method of summary-and-reconstruction is minimal compared to the learning process of an autoencoder, since our experts already possess a gestalt understanding of Classical Greek pottery.

The aforementioned limit on the scope of our analogy is, in my view, all for the better. Instead of speaking about experts who already gained a gestalt grasp of Classical Greek pottery by whatever means, we could have described students of Classical Greek pottery who train their gestalt grasp of Classical Greek pottery as an autoencoder would, by repeatedly testing and revising a summary-and-reconstruction method for Classical Greek pots, but tying our experts’ knowledge to autoencoding from the get-go in this manner would in fact defeat the purpose. Our goal is to establish that autoencoding demonstrates structural forces whose applicability goes far beyond explaining algorithms that explicitly use the autoencoder learning procedure, and which bear on the general phenomenon of grasping a domain’s systemic grammar. We therefore want to start with agents that are not by definition anything like an autoencoder, and who paradigmatically exemplify having a gestalt grasp of a domain’s systemic grammar, and argue that one rigorous way to discuss the structure of this ‘gestalt grasp’ is to construct the formalism of an epistemically equivalent trained autoencoder. In other words, we want to temporarily dissociate the concept of a trained autoencoder from actual autoencoder machines, and instead pitch it as a general schema for the mathematical representation of an agent’s gestalt grasps of a domain’s systemic grammars. Presently, this means defining our agents simply as experts that possess a gestalt grasp of a domain’s systemic grammar, and then arguing that a summary-and-reconstruction task that requires our experts to act as their epistemically equivalent trained autoencoder would call upon the entirety of our experts’ gestalt grasp of their domain as no other activity would.

If the proposed relationship between summary-and-reconstruction and gestalt understanding holds as promised, then thinking about the structural forces at play in summary-and-reconstruction might be a source of insight into the elusive but frequently crucial idea of gestalt understanding of the systemic grammar of a domain of objects — an idea that comes into play whenever we as literary and cultural scholars want to speak about, say, the cultural logic of late capitalism, the style of German Baroque music, or the Geist of the buildings and streets of 19th century Paris. Furthermore, if we can make good sense of the idea that the activity of summary-and-reconstruction is a uniquely thorough demonstration of one’s gestalt understanding of the systemic grammar of a domain, we cannot be far off from arguing that some form of ‘summary-and-reconstruction’ is at play when humans produce facsimiles (reconstruction) of the world that seem to powerfully embody their subjective take on the systemic grammar of the world (summary). Guilty as charged, we will devote Chapter 2 of this dissertation to discussing works of literature as something like “reverse autoencoders.”

Putting these promises aside for now, let us return to our Classical Greek pottery experts and their autoencoder vivant. Given that our Classical Greek pottery experts are required to perform the same work as a trained autoencoder, what summary-and-reconstruction method could our experts use to do so? While there may be multiple ways to operationalize a gestalt grasp of Classical Greek pottery, one strategy we know to be effective is for our experts to devise a feature function: Let our Classical Greek pottery experts devise a list of 100 descriptive statements whose truth varies strongly from one Classical Greek pot to another (e.g. ‘this pot depicts war,’ ‘this pot’s ceramic technique is uncommon,’ ‘the pot’s composition is symmetrical,’ ‘this pot’s affect is mournful,’ ‘this pot depicts worship’), and agree that on the day of the exhibit Expert #1 will fill out her message about each pot she examines with 100 numerical grades ranging from 0 to 9, marking ‘strongly disagree’ to ‘strongly agree’ for each statement. If our experts can devise a list that optimally complements their gestalt grasp of Classical Greek pottery then the resulting summary-and-reconstruction process will be logically equivalent to a trained autoencoder of the same abilities.
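
As a toy illustration of what Expert #1’s message might look like as data, here is a sketch in which both the statements and the grades are invented for the example:

```python
# A toy sketch of Expert #1's message for a single pot: 100 statements,
# each graded 0 ('strongly disagree') to 9 ('strongly agree'). The
# statements and grades here are invented for illustration only.
statements = [
    "this pot depicts war",
    "this pot's ceramic technique is uncommon",
    "the pot's composition is symmetrical",
    "this pot's affect is mournful",
    "this pot depicts worship",
    # ...ninety-five further statements would complete the full list of 100
]

message_for_one_pot = [7, 2, 9, 1, 0]   # one grade per statement
assert all(0 <= grade <= 9 for grade in message_for_one_pot)
```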

What we can get from our Classical Greek pottery experts’ hypothetical, then, is a ‘top-down’ perspective on a trained autoencoder: rather than looking at our trained autoencoder as the outcome of a holistic formal learning process applied to a domain, we’re looking at our trained autoencoder as the formal expression of an agent’s gestalt grasp of a domain. By understanding the relationship that our experts’ list of 100 descriptive statements — the feature function of their trained autoencoder — must bear to their gestalt of Classical Greek pottery in order to serve as an effective summary-and-reconstruction method, we have an opportunity to understand something of what a trained autoencoder’s feature function ‘means’ in general. Now, the idea of a list of 100 statements that implements a gestalt grasp of the systemic grammar of Classical Greek pottery could, and perhaps should, strike us as suspicious. Isn’t a gestalt grasp of the grammar of Classical Greek pottery, after all, exactly what no list should ever manage to spell out? Crucially, it turns out that a list that implements our experts’ grasp of the grammar of Classical Greek pottery does not need to spell out a grammar of Classical Greek pottery, but rather to implicitly exploit the inferential powers latent in our experts’ grasp of the grammar of Classical Greek pottery.

Speaking informally, we might think of our experts’ list as an optimal set of questions for a game of ‘20 Questions’ about Classical Greek pots in which the questions have to be submitted in advance. Indeed, our reader can even acquire some first-hand experience with the idea of a list of questions complementing a systemic grammar by devising strategies for an imaginary game of ‘20 Questions Submitted in Advance’ on some personal favorite domain such as ‘20th century novels’ or ‘R&B songs.’ What makes a list of questions for a game of ‘20 Questions Submitted in Advance’ effective? In an ordinary game of ‘20 Questions,’ we are always looking for the most relevant question we can ask in light of all the answers we received so far. In ‘20 Questions Submitted in Advance,’ we are looking for a list of questions that each remain relevant no matter what the answers to the other questions. Quite simply, we don’t want our individually good questions to “step on each other’s toes” and end up giving us redundant answers. Formally speaking, this means that we need our list to consist of questions whose answers are both individually unpredictable and statistically independent from each other — and that, returning to our Classical Greek pottery experts, the list of 100 descriptive statements Expert #1 will grade from 0 (‘strongly disagree’) to 9 (‘strongly agree’) to record her impressions of a pot must consist of statements whose validity varies widely and independently from Classical Greek pot to Classical Greek pot.
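
One crude way to check whether a list of questions ‘steps on its own toes’ is to see how well the answers predict one another across the domain. The sketch below assumes a hypothetical grades array with one row per pot and one column per statement, and uses correlation, which only captures linear redundancy and is therefore a weaker condition than full statistical independence:

```python
# A crude check of whether our questions give redundant answers: strongly
# correlated columns mean questions that 'step on each other's toes'.
# `grades` is a stand-in array: 500 pots, 100 graded statements.
import numpy as np

rng = np.random.default_rng(1)
grades = rng.integers(0, 10, size=(500, 100)).astype(float)

correlation = np.corrcoef(grades, rowvar=False)        # 100 x 100 matrix of pairwise correlations
off_diagonal = correlation[~np.eye(100, dtype=bool)]
print("mean absolute correlation between answers:", np.abs(off_diagonal).mean())
# For a good list this number should sit close to zero: each question's
# answer should be hard to predict from the answers to the others.
```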

Admittedly, all of this talk about a list of questions blessed with statistical independence may seem quite remote from the idea of a list of questions complementing some ‘gestalt grasp of systemic grammar of Classical Greek pottery.’ In fact, it can appear downright inimical: Suppose that we applied our Classical Greek pottery experts’ supposedly optimal list of statements to every Classical Greek pot in the world, recording each pot as a 100-number-long list of grades from 0 to 9. If our list indeed consists of statements whose validity varies widely and independently between different Classical Greek pots, it would mean that the set of all the individual accounts of Classical Greek pots we have produced using our grading system — a set which, since it contains an account of every Classical Greek pot in the world, is in some sense our account of Classical Greek pottery in total — is itself bereft of any structure, pattern, grammar or gestalt. (In reality, no list of questions, and no feature function, can reach the degree of optimality where all pattern in the domain in fact vanishes, but the degree to which a list or feature function will approach this optimal condition is a measure of the list’s or feature function’s strength.) The math, for what it’s worth, checks out, but does it make conceptual sense that the list of questions that expresses our grasp of a domain’s systemic grammar has to also be the list of questions that ‘dissolves’ the domain’s systemic grammar? Here, as elsewhere, some habits of thought familiar to us from literary theory might help us understand what’s going on. Recall the common critical-theory warning that systems of thought always ‘naturalize’ the worlds to which they’re best attuned, such that a great deal of literary, philosophical, and political thought misrecognizes worlds that are highly particular and structured — the world of male experience, the world of white middle-class families — as universal, average, neutral, unconstrained or typical. While one might be reluctant to call our experts’ list of 100 descriptive statements to grade from 0 to 9 a ‘system of thought,’ it certainly naturalizes the domain of Classical Greek pottery. Indeed, I would suggest that our experts’ descriptive method literally internalizes the domain of Classical Greek pottery, to the extent that it’s successful.

Recall the set we have formally called the image of an autoencoder’s projection function, and informally called an autoencoder’s canon: the set of all potential outputs of the autoencoder’s projection function, or equivalently, the set comprising the autoencoder’s ‘representative’ object for each potential output of the feature function. In our Classical Greek pottery case, where we treat our two experts and their list of 100 statements as a trained autoencoder, the image of the autoencoder’s projection function is the set we get by taking every possible list of 100 numbers between 0 and 9 and then replacing each numerical list with the object Expert #2 would make if Expert #1 were to send her that numerical list in a message. Seen in relation to the universe of all the objects Expert #2 could construct at the behest of messages, each numerical message from Expert #1 is like a treasure map instructing Expert #2 how many steps to take in each of this universe’s 100 directions in order to arrive at the right spot. Indeed, for Expert #2 the meaning of a given message of 100 numbers between 0 and 9 is of the form: “go zero steps in the direction of depicting worship, then five steps in the direction of symmetrical patterns, then nine steps in the direction of unusual ceramic technique, then three steps in the direction of depicting warfare…”
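
For a linear decoder, the ‘treasure map’ reading is almost literal: the reconstruction is a base object plus so many steps along each of the 100 directions. The sketch below uses random stand-in directions, and real decoders are usually nonlinear, so this is an illustration of the metaphor rather than a description of Expert #2:

```python
# The 'treasure map' reading made literal for a linear decoder: start from
# a base object and take `grade` steps along the direction associated with
# each statement. Directions here are random stand-ins.
import numpy as np

rng = np.random.default_rng(2)
base_pot = rng.normal(size=784)            # a generic starting point in the space of possible objects
directions = rng.normal(size=(100, 784))   # one direction per descriptive statement

def decode(message):
    # "go `grade` steps in the direction of each statement"
    return base_pot + np.asarray(message, dtype=float) @ directions

reconstruction = decode(rng.integers(0, 10, size=100))
```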

On this line of interpretation, then, our experts’ list of 100 descriptive statements is a coordinate system for the space of possible Classical Greek pots, or at the very least a coordinate system for our experts’ mental model of the space of possible Classical Greek pots. (Indeed, in mathematical parlance our space of Classical Greek pots is a 100-dimensional space, and each statement-to-grade is a dimension of this space.) If this interpretation holds — which, granted, we have yet to fully demonstrate — it follows, with a little effort, that our experts’ optimal descriptive method internalizes the domain of Classical Greek pottery, in at least the following sense: to the extent that our experts’ gestalt grasp of the systemic grammar of Classical Greek pots is accurate, our experts’ optimal descriptive method describes objects only in terms of their relative position in (or orientation relative to) a space of all possible Classical Greek pots arranged in compliance with Classical Greek pottery’s systemic grammar. For the remainder of this chapter, we will try to talk ourselves through this line of reasoning.

III. Internalizing Spaces

In his excellent exposition of Heidegger’s concept of mood (‘Stimmung’), Jonathan Flatley writes that ‘any orientation toward anything specific requires a presumed view of the total picture, a presumption that is usually invisible to us — that is just the way the world is.’ Recall that in our Classical Greek pottery experts’ descriptive system, Expert #1 — the ‘feature function’ of the trained autoencoder — translates whatever object she’s describing into coordinates for a point in Expert #2’s space of possible Classical Greek pots. Thus in relation to Expert #1’s descriptive system, Expert #2's space of possible Classical Greek pots is literally the ‘precondition’ and ‘medium’ (Heidegger) that ‘makes it possible in the first place to orient oneself toward individual things’ (Heidegger): when Expert #1 describes an object, her description simply is the coordinates of a location in Expert #2’s space of Classical Greek pots. And, indeed, Expert #2’s space of Classical Greek pots is very much ‘a presumed view of the total picture’ (Flatley), in the specifically Heideggerian sense in which ‘total’ must refer not merely to everything that is, but to everything that is possible and the relation of all possibilities to one another. Expert #2’s space of Classical Greek pots is a model — a ‘presumed view’ — of the totality of Classical Greek pottery: a model of what Classical Greek pots are possible and how the possibilities of Classical Greek pots relate to one another. And, indeed, the ‘presumption’ is, as Flatley says, ‘invisible’: Suppose that Expert #2’s model of the totality of Classical Greek pottery is an imperfect model, simplified or biased compared to the real systemic grammar of Classical Greek pottery, and therefore that the reconstructions of Classical Greek pots that our experts’ system produces simplify or distort the original Classical Greek pot. Clearly, we could never employ Expert #1’s descriptive system to describe the difference between an original and reconstruction, since Expert #1’s descriptive system assigns the original and reconstruction the same feature values. Much as we would expect, the descriptive system that presumes Expert #2’s model of Classical Greek pottery can’t register the structural impact of this model on our experts’ reconstructions.

It should be fair to say, then, that in intuitive terms Expert #1’s descriptive system is a worldview that apprehends individual things only in relation to the ‘total picture’ given by Expert #2’s space of potential Classical Greek pots. What we are now hoping to show is that by giving an exact meaning to the idea that the ‘total picture’ that grounds a trained autoencoder is a space, we can also express exactly what it means to apprehend an individual thing in relation to a total picture. To begin, let us revisit the interpretation of our experts’ list of 100 descriptive statements as a coordinate system for (a model of) the space of possible Classical Greek pots: Consider the fact that for any two pots x, y in Expert #2’s universe of potential Classical Greek pots there is some series of edits that transforms a message specifying pot x into a message specifying pot y. Because the contents of a message are 100 numerical grades, this series of edits can be expressed as a series of addition and subtraction operations on the 100 grades specifying pot x. This means, in turn, that the series of edits that transforms a message specifying pot x into a message specifying pot y can be represented by a 100-number-long list of its own, this time with numbers ranging from -9 to 9: the list that we get by subtracting the grades specifying x from the corresponding grades specifying y. In formal terms, we’re treating the messages specifying x and y as vectors, and then we subtract the vectors to receive their ‘vector difference.’ (While our discussion in this dissertation generally won’t assume any familiarity with vectors, we will nevertheless take to saying ‘x and y’s vector difference’ instead of ‘the list that we get by subtracting the grades specifying x from the corresponding grades specifying y’ from now on, for obvious reasons.) Now, if our experts really did devise the best possible list of 100 descriptive statements to use for their Classical Greek pottery summary-and-reconstruction — the list that best complements their gestalt grasp of the systemic grammar of Classical Greek pottery — then the vector difference between potential pots x and y provides a basis for expressing the holistic relationship between pots x and y in Expert #2’s gestalt of Classical Greek pottery as a spatial relationship.

We can consider two respects in which the vector difference between x and y acts as a spatial expression of the relationship between pot x and pot y in the systemic gestalt of Classical Greek pottery. Firstly, the vector difference between x and y lets us derive a quantity that, from a purely mathematical perspective, qualifies as the distance between the two points respectively given by x’s and y’s messages. If our experts’ list of statements to grade is indeed effective, the distance between the two points respectively given by x’s and y’s messages will necessarily have a strong correlation with Expert #2’s intuitive judgment of the overall (dis)similarity of x and y. Secondly, the vector difference between x and y lets us derive a rule for editing a message specifying pot x that mathematically qualifies as moving on a straight line drawn between the point given by x’s message and the point given by y’s message. In other words, the vector difference indicates the direction of the point given by y’s message relative to the point given by x’s message. If our experts’ list of statements to grade is indeed effective, then moving along the straight line between x’s message and y’s message corresponds to gradually evolving pot x into pot y, each movement giving us a pot that is proportionally further evolved towards y-ness. Generally speaking, it is no great matter that we can coherently define Expert #1’s messages as coordinates for points in a space comprising all possible outputs. The vector difference between two points, however, isn’t a quantity that we define to track the similarity between two objects or the transformation path between two objects: our only act of definition was the choice to treat the list of numbers composing each potential pot’s message as Euclidean coordinates. The fact that elementary geometrical relationships between points in the space resulting from this act of definition seem to track conceptual relationships between pots in our experts’ model of Classical Greek pottery, then, is an example of why one might want to say the messages “really are” Euclidean coordinates.
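
Both operations are easy to state in code. The sketch below applies them to two hypothetical messages whose grades are invented for the example:

```python
# A sketch of the two spatial readings described above, applied to two
# hypothetical messages (grades invented for the example).
import numpy as np

x = np.array([0., 5., 9., 3.] + [4.] * 96)   # message specifying pot x
y = np.array([2., 5., 1., 8.] + [4.] * 96)   # message specifying pot y

difference = y - x                            # the 'vector difference': entries range from -9 to 9
distance = np.linalg.norm(difference)         # a mathematical 'distance' between the two messages

# Moving along the straight line from x's message toward y's message:
# each intermediate message should specify a pot further evolved toward y-ness.
intermediate_messages = [x + t * difference for t in np.linspace(0.0, 1.0, 6)]
```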

While the above should go some way towards explaining why it makes good sense to treat our experts’ list as a coordinate system for (our experts’ mental model of) the space of Classical Greek pots, so far it can only offer us a very partial formal interpretation of the idea that this space determines our experts’ descriptions of individual inputs. We know that our experts’ descriptive system turns every input into coordinates for a point in the space of Classical Greek pots, and that it therefore describes Classical Greek pots in terms of their relative position in the space of Classical Greek pots, but what of inputs outside the domain of Classical Greek pots? Recall that if Expert #1 were to secretly apply her descriptive method to some Roman pot or to some Classical Greek statue, or even to secretly apply it to some Coca-Cola bottle, house, or car, when Expert #2 receives the message of 100 grades she would have no grounds to suspect anything is amiss, and would simply produce a Classical Greek pot that matches the parameters reported in the message. Indeed, this facsimile Classical Greek pot will be ‘identical’ to the Roman pot, Classical Greek statue, Coca-Cola bottle or car in question as far as the properties that our experts’ system measures go, though this ‘identity’ will now have very different implications depending on the case. How does the spatial interpretation of our experts’ system apply to the transformation of a Roman pot, a Coca-Cola bottle, a potted plant, a Classical Greek statue, a house, or a car into coordinates for a point in the space of Classical Greek pots? Perhaps surprisingly, it applies just as naive spatial metaphor might lead us to suppose. Intuitively, we expect that as a worldview that internalizes the domain of Classical Greek pots, our experts’ descriptive method should give meaningful results when applied to objects that are ‘close enough’ to being Classical Greek pots, and give increasingly nonsensical results as we go further outside of its comfort zone: a Roman pot or Classical Greek statue should have quite a lot in common with their assigned Classical Greek pot, while a potted plant or Coca-Cola bottle will have a much cruder relationship to their assigned Classical Greek pot, and a house or car no meaningful relationship at all. Drawing on the same pool of intuitions, we might even imagine that what our experts’ system does given an input that is not a Classical Greek pot is to choose the Classical Greek pot that’s ‘closest’ to the input. In order for these intuitions to apply, the space that constitutes our trained autoencoder’s total picture — the space comprising all the objects in the autoencoder’s canon arranged in accordance with the coordinates assigned by the autoencoder’s feature function — has to be, from an alternative perspective, a section of a larger space.

According to the theoretical approach known as the manifold perspective on autoencoders, the space that an autoencoder learns is just this kind of mathematical object: a manifold in the space of all possible inputs. The space of all possible inputs, formally called input-space, is the set of all possible inputs to an autoencoder, spatially arranged in accordance with their superficial properties as inputs rather than an underlying systemic grammar. The exact details of an autoencoder’s input-space depend on the exact way we define that autoencoder’s input channel, but what matters for our purposes is that conceptually speaking input-space is the space we get by taking all possible inputs and arranging them in accordance with the most naive possible axes of similarity and difference. A good paradigm case of an input-space is the space of all possible digital images, commonly called ‘pixel space:’ in pixel space, every pixel is a dimension, and images are arranged according to the color values of their pixels. The intuitive fact most worth noticing about ‘pixel space’, and which makes pixel space a good paradigmatic example of a naive similarity space, is that we can reliably expect that any two photographs (or two paintings, or two cartoons) that are very close in pixel space will also be close on a mature measure of similarity for photographs (or a mature measure of similarity for paintings or a mature measure of similarity for cartoons), and that the converse is true as well — any two photographs that are very similar in terms of objects, lighting conditions, angle, and so forth will be close in pixel space.
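
To make pixel space concrete: a 64-by-64 grayscale image is a point in a 4096-dimensional space, one dimension per pixel, and the naive distance between two images is the Euclidean distance between those points. The images in the sketch below are random stand-ins:

```python
# 'Pixel space' made concrete: every pixel is a dimension, so a 64x64
# grayscale image is a point in a 4096-dimensional space, and the naive
# distance between two images is the distance between those points.
import numpy as np

rng = np.random.default_rng(3)
image_a = rng.random((64, 64))      # stand-in grayscale images
image_b = rng.random((64, 64))

point_a = image_a.reshape(-1)       # the image as a point in 4096-dimensional pixel space
point_b = image_b.reshape(-1)
pixel_space_distance = np.linalg.norm(point_a - point_b)
```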

A manifold, in general, refers to any set of points that can be mapped with a coordinate system. (Some manifolds can only be accurately mapped using a series of coordinate systems, rather than one global coordinate system. Nevertheless, even these manifolds can often be approximately mapped by a single coordinate system.) In the context of autoencoders, one traditionally uses ‘manifold’ in a more narrow sense, to mean a lower dimensional submanifold: a shape such that we can determine relative directions on said shape, quite apart from directions relative to the larger space that contains it. The following illustration, for example, depicts a “two dimensional” manifold within a three dimensional space:

[Figure #1: a two-dimensional, ribbon-like manifold within a three-dimensional space, with crisscrossing coordinate lines drawn on its surface]

From the ‘internal’ point of view — the point of view relative to the manifold — the manifold in the illustration is a two dimensional space, and every point on the manifold can be specified using two coordinates. From the external point of view — the point of view relative to the three dimensional space — the manifold in the illustration is a ribbon-like three dimensional shape, and every point on the manifold can only be specified using three coordinates. Because the space of Classical Greek pots is a manifold within the space of all possible inputs, a point in the space of Classical Greek pots is also a point in the space of all possible inputs. Our Classical Greek pottery experts’ descriptive method is the internal coordinate system of the space of Classical Greek pots, describing each point in the manifold relative only to the manifold. When Expert #2 constructs a Classical Greek pot based on the specifications given in a message from Expert #1, however, she allows us to interpret the point specified by Expert #1’s message from the external point of view, as a point in the space of all possible inputs. Indeed, by taking the set of all the reconstructions our experts can produce — what we have called the image of the trained autoencoder’s projection function, or its ‘canon’ — and marking the location of each reconstruction in the space of all possible inputs, we get our experts’ space of Classical Greek pottery from the external point of view, where this space is a (high dimensional) ribbon-like shape in the space of all possible inputs.
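
A toy stand-in for Figure #1 can be written down directly: a two-dimensional, ribbon-like surface sitting inside three-dimensional space, where each point has two ‘internal’ coordinates and three ‘external’ ones. The particular surface chosen here is arbitrary:

```python
# A toy stand-in for the manifold in Figure #1: a two-dimensional surface
# inside three-dimensional space. The particular surface is arbitrary.
import numpy as np

def external_coordinates(u, v):
    # internal point (u, v) on the ribbon -> point (x, y, z) in the surrounding space
    return np.array([u * np.cos(u), v, u * np.sin(u)])

point = external_coordinates(u=2.0, v=0.5)   # three numbers, from the external point of view
# From the internal point of view, the same point is just the pair (2.0, 0.5).
```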

It’s probably worthwhile, at this point, to insist on clarifying some potentially confusing issues about the relationship between the internal viewpoint of a manifold like ‘Classical Greek pottery space’ and the manifold as we see it from higher-dimensional, external points of view. Most important, the 100-dimensional space we’re calling ‘Classical Greek pottery space’ isn’t a space that we get by first parametrizing the full input space (‘all-logically-possible-ceramics-objects space’) with, let’s say, 1000 meaningful questions and then picking from among them the 100 questions that are most pertinent to Classical Greek pots. Instead, the 100 questions that act as the coordinates for Classical Greek pottery space may well only have meaning from the point of view of Classical Greek pottery space: the Classical Greek pottery expert doesn’t necessarily employ a general concept of symmetry when judging ‘is this pot symmetrical,’ or use a general concept of worship when judging ‘does this pot depict worship,’ but rather makes Classical-Greek-pottery-specific versions of these judgments, which may well fail to track the relevant concept when used outside the domain of Classical Greek pottery. While this insistence might initially sound like an artifact forced on us by our analogy between our human experts and autoencoders, the view that judgments, and particularly expert judgments, are essentially embodied/embedded in a context rather than the application of fully abstract universal judgments is an old favorite in Phenomenology (Heidegger, Merleau-Ponty) and social theory (Bourdieu, Williams). An expert in Greek pottery might well not know how to judge symmetry in a Chinese pot — if she cannot identify certain Chinese decorative patterns as patterns, for example, then she won’t be able to tell whether the same patterns appear on both sides of the pot — or how to judge whether a Chinese pot depicts worship.

With the above in mind, we might want to consider two kinds of possible ‘outside’ points of view on the dimensions of a space like ‘Classical Greek pottery’ space. First, there is the point of view of input-space itself — in this case, a space where all logically possible ceramics objects are arranged according to some ‘naive’ organizing principle akin to pixel-space. Like pixel-space, ‘all-logically-possible-ceramics-objects space’ does not comprise only real, typical, or sensible ceramic objects, but rather contains every logically possible ceramic blob or splatter. Thus, when we look at our experts’ Classical Greek pottery space from the input-space’s point of view, we’re looking at how movement in Classical Greek pottery space translates to movement in the ‘naive’ space of all possible ceramic blobs and splatters. Second, we can also conceive of the point of view of a third expert, an ‘all-cultures-pottery’ expert, and of a corresponding manifold within ‘all-logically-possible-ceramics-objects space’ (input-space) that has more dimensions than Classical Greek pottery space but fewer dimensions than the input-space. From the ‘all-cultures-pottery’ expert’s point of view, the Classical Greek pottery experts’ space is a ‘limited’ perspective on Classical Greek pots, because it cannot express, for example, all of the respects in which a given Classical Greek pot is similar to and different from a given Chinese pot.

It could be tempting to think, at this point, that because Classical Greek pottery space is a lower-dimensional slice of the already low dimensional (relative to input-space) ‘all-cultures-pottery space,’ the latter space is simply Classical Greek pottery space plus a few more dimensions, and therefore we can specify Classical Greek pottery space relative to ‘all-cultures-pottery space’ by choosing a certain fixed set of coordinates in the ‘new’ dimensions that restricts us to the Classical Greek pottery slice of ‘all-cultures-pottery-space’ and leaving the ‘original’ dimensions free. In reality, this will practically never be the case: although the space of Classical Greek pottery is a subset of the space of all-cultures pottery, the questions that are best for navigating the space of Classical Greek pottery are not a subset of the questions that are best for navigating the space of all-cultures pottery. To see why, imagine that we play a game of ‘20 Questions Submitted in Advance’ where it is given that the object is a marine mammal. Later we play a game of ‘40 Questions Submitted in Advance’ where it is given that the object is an animal. Although one category is a subset of the other, the questions that you submit in the second game will not be the same 20 questions you submitted in the first game plus 20 new questions, but rather a new set of questions. Each space is a totality (to rudely borrow the term from Western Marxism) unto itself, and every aspect of analysis — that is, every dimension — has structure only relative to this totality.

Still, we can reliably expect some interesting relationship between movement relative to ‘all-cultures pottery space’ and movement relative to Classical Greek pottery space, at least where the two manifolds overlap: it’s probable enough that some of the dimensions of ‘all-cultures-pottery space’ will roughly correspond to some of the dimensions of Classical Greek pottery space around points where the two manifolds overlap. It would be plausible enough, for instance, to imagine that the question ‘is the pot symmetrical’ is a key question to ask both when describing a given Classical Greek pot relative to Classical Greek pots and when describing a given cultural pot of whatever kind relative to all cultural pottery. In this case, even though the two questions don’t have strictly the same meaning (because different kinds of expertise give the question different embodied/embedded meanings), we would expect that if we take the coordinates of a given Classical Greek pot x in both spaces, then whether we make a small increase in the grade of the ‘pot symmetry’ coordinate of our current location in ‘all-cultures-pottery space’ or make a small increase in the ‘pot symmetry’ coordinate of our current location in Classical Greek pottery space, we would arrive at roughly the same slightly-more-symmetrical-than-x Classical Greek pot.

IV. Reading a Manifold

While the above is well and good, the set of all the reconstructions that a given trained autoencoder can produce (the set of input-space points that we’ve called a trained autoencoder’s canon, or the image of a trained autoencoder’s projection function) in fact gives us much more than just the shape of a ribbon-like manifold in input-space. It also gives a metric of ‘internal distance,’ distance between objects from the point of view of the autoencoder’s own internal space, through an implicit quantity we’ll call the input-space manifold’s density. How does a trained autoencoder’s canon give a metric of ‘internal distance’? The reasoning at play here is a tad technically involved, but can be summarized as follows. Internal distance on a manifold is, in principle, defined independently of the manifold’s appearance as a shape in input-space. In Figure #1, for example, the crisscrossing lines on the manifold depict a coordinate system that ‘interprets’ input-space distances between points on the manifold very differently in different regions of the manifold. (Every line-segment between intersection points in Figure #1 marks a distance of length 1 in the coordinate system assigned to the manifold, regardless of the input-space length of the line-segment).

In other words, a given coordinate system can have a dynamically varying degree of sensitivity to input-space distances on the manifold: sometimes a small input-space distance between points on the manifold might translate to a large difference between their coordinates, and sometimes a large input-space distance between points on the manifold might translate to a small difference between their coordinates. Because an autoencoder’s feature function is ultimately discrete — that is, there is a finite number of settings for each feature — the dynamically varying sensitivity of a trained autoencoder’s coordinate system to movements on the manifold is ultimately expressed as a dynamically varying threshold for registering a movement on the manifold as a change in feature values. (Both human brains and artificial neural networks are ‘pseudo-continuous,’ but can’t keep making smaller and smaller distinctions literally ad infinitum.) This property of the autoencoder’s feature function is, in turn, expressed by what we might call the varying ‘density’ of the autoencoder’s manifold — visually, the varying density of the ribbon-shaped cloud of input-space points formed by the autoencoder’s canon. While the input-space distance between neighboring points in this ribbon-shaped cloud of input-space points will vary, we know that from the point of view of the autoencoder’s coordinate system the distance between each pair of neighboring points in this ribbon-shaped cloud is always ‘a single step.’ This information about the autoencoder’s coordinate system’s dynamically varying sensitivity to input-space distances on the manifold provides us with what a mathematician would call the autoencoder’s Riemannian metric on the manifold — its system for deciding the ‘internal distance’ between points — and effectively determines the autoencoder’s coordinate system. For ease of use, we will henceforth refer to the ribbon-shaped cloud of input-space points formed by a trained autoencoder’s canon, which gives us both the shape and ‘density’ of the autoencoder’s manifold, as the input-space form of the autoencoder’s manifold.
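
One way to make this ‘varying sensitivity’ concrete is to ask how far the decoded object moves in input-space when we take one small step along each internal coordinate. The sketch below assumes some decoder function, such as the linear one sketched earlier, and is a rough numerical gesture at the idea rather than a full computation of the Riemannian metric:

```python
# A rough numerical gesture at the 'varying sensitivity' described above:
# how far does the decoded object move in input-space per small step along
# each internal coordinate? Where these step-lengths are small, canon
# points crowd densely; where they are large, the canon thins out.
# Assumes some decoder function `decode` mapping feature values to
# input-space points (for instance, the linear one sketched earlier).
import numpy as np

def step_sensitivities(decode, z, eps=1e-3):
    base = decode(z)
    sensitivities = []
    for i in range(len(z)):
        stepped = np.array(z, dtype=float)
        stepped[i] += eps                    # one small step along internal direction i
        sensitivities.append(np.linalg.norm(decode(stepped) - base) / eps)
    return np.array(sensitivities)           # larger values: coarser 'density' along that direction
```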

As we can see (or take on faith), the ‘internal’ point of view on points within the manifold associated with an autoencoder is mathematically determined by the input-space form of the autoencoder’s manifold. Given the canon that composes our Classical Greek pottery experts’ space of potential Classical Greek pots, for example, we can mathematically deduce our experts’ coordinate system for their space of potential Classical Greek pots. The real benefit of understanding an autoencoder as a manifold, however, lies in the fact that the input-space form of an autoencoder’s manifold also allows us to make spatial sense of the trained autoencoder’s treatment of inputs outside of its canon — that is, make spatial sense of the autoencoder’s treatment of those inputs its projection function actually transforms in some way. The term ‘projection’ is, in fact, a shorthand for ‘orthogonal projection to the manifold,’ which means treating an input x as a point in input-space and taking the nearest input-space point covered by the manifold: if input x is itself on the manifold, the projection function outputs x itself, and if input x is not on the manifold the projection function outputs the ‘canonical’ point that is nearest to x in input-space. In our Classical Greek pottery experts’ hypothetical, we made the fantastical assumption that our experts’ space of potential Classical Greek pots is a perfect match to the domain of Classical Greek pottery, and so we relegated the question of projection to the autoencoder’s treatment of objects outside the domain of Classical Greek pottery — Roman pots, Classical Greek statues, houses, cars. In reality, however, mental models are always simplifications or at least abstractions of the domain that they model, and therefore nearly all of a trained autoencoder’s reconstructions of inputs from its own domain will involve a substantial projection. Indeed, it is the ubiquity of substantive projection that makes it worthwhile to conceptualize a trained autoencoder as a system of mimesis in the literary-theoretic sense: a system that represents objects using imitations whose relationship to the originals betrays a worldview. While it is obvious that the input-space form of the autoencoder’s manifold determines which point on the manifold is closest (in input-space coordinates) to any given point x, it may not be so clear what relationship this operation bears to the ‘internal’ structure of the manifold, and by extension to the trained autoencoder’s model of the system-grammatically structured similarities and differences between the objects that compose its canon.

A very helpful way to think about the meaning of the projection operation is as follows: Given an input-space point x that isn’t covered by the manifold, let’s take point y, the furthest input-space point from x that is covered by the manifold. Let us ‘travel’ on the manifold, starting from point y, until we are as close as we can be (in input-space distance) to point x — that is, until we are at the projection point of x. From the internal point of view, every step we take along the path from y to the projection point of x corresponds to a system-grammatically meaningful change to y. Once we reach the projection point of x, however, no system-grammatically meaningful change to y will get us any closer (in input-space coordinates) to x. The projection point of x is thus the one point on the manifold whose input-space difference from x is completely inexpressible from the internal point of view. We cannot traverse any of the input-space distance between x’s projection point and x by moving on the manifold — or, speaking from the internal point of view, we can’t traverse any of the distance between x’s projection point and x by making system-grammatically meaningful changes. Consequently, by determining an autoencoder’s projection function, the input-space form of its manifold also determines its feature function’s output on inputs outside the manifold: an autoencoder’s feature values for x are simply the feature values (that is, the internal coordinates) of x’s projection point. As an important corollary, because distance in input-space is an extremely naive measure of similarity that is only sensible across short distances, a trained autoencoder’s “area of competence” is in fact the literal input-space area around its manifold. In other words, in order for an autoencoder’s feature function to record system-grammatically meaningful properties of the input, and for its projection function to simplify or interpret (rather than simply distort) the input, the input has to be close enough to the manifold to render input-space distance meaningful.
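
A brute-force sketch of projection as ‘nearest point on the manifold’, approximating the manifold by a finite sample of canon points; real autoencoders compute their projections implicitly, by running the feature function and decoder, so this is only meant to make the geometry visible:

```python
# A brute-force sketch of projection as 'nearest point on the manifold',
# approximating the manifold by a finite sample of canon points (random
# stand-ins here). Real autoencoders compute the projection implicitly.
import numpy as np

def project_to_canon(x, canon_points):
    distances = np.linalg.norm(canon_points - x, axis=1)  # naive input-space distances
    return canon_points[np.argmin(distances)]             # the nearest 'stand-in' object

rng = np.random.default_rng(4)
canon_points = rng.normal(size=(5000, 784))   # stand-in sample of a trained autoencoder's canon
x = rng.normal(size=784)                      # an input that is not itself on the manifold
projection_of_x = project_to_canon(x, canon_points)
```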

--

Peli Grietzer

Harvard Comp. Lit; visitor, Einstein Institute of Mathematics