Topological Spaces
From Points to Waves: Why the Next Generation of AI Representations Should Think in Signals
Blog 1 of a series on Topology and the Future of Machine Learning Representation
“A vector tells you where something is. A signal tells you what something does.”
A Personal Starting Point
Over the past year, I have been working on problems that sit at the intersection of signals and cognition — including sleep-stage classification, cognitive performance modelling, and participation in the EEG 2025 NeurIPS challenge. These experiences exposed me to biological signals not just as data, but as structured, dynamic processes: oscillations, rhythms, synchrony, and noise interacting over time.
This naturally led me to think about representation from a more neuroscientific perspective. In the brain, information is not encoded as static points — it is carried through patterns of activity, often oscillatory, where timing and phase relationships play a crucial role. That raised a simple but persistent question:
Why do we represent meaning in machine learning as a static point in space?
Language is not static. Images are not static. The world that machine learning models are trying to understand is fundamentally dynamic — things unfold, interact, interfere with each other, and change meaning depending on context and timing. And yet the dominant paradigm in representation learning is to compress all of that into a fixed vector: a point in \(\mathbb{R}^d\) that just sits there.
I started thinking about what it would look like to represent multimodal data — text, audio, vision — not as vectors but as signals: functions over time or frequency, things that have phase as well as amplitude, things that can constructively amplify each other or destructively cancel. Then I discovered that a whole family of models — State Space Models, including S4 [1] and Mamba [2] — had already begun moving in this direction, using signal-processing machinery from control theory as the foundation of sequence modelling.
This blog series is an attempt to explore an alternative viewpoint: what if representations were not points, but signals? I am writing this series to document my learning from first principles and to make that journey useful to others exploring similar questions. I have also used LLMs for ideation, restructuring, and rephrasing in places, but the core ideas, technical direction, and learning are my own. The mathematical formulations of each space were also generated with LLM assistance.
Let us begin with the spaces themselves.
Part I: The Hierarchy of Mathematical Spaces
Before we can argue that one kind of space is better for representation, we need to understand what a “space” actually is in mathematics, and what properties different spaces add on top of each other. Think of this as a ladder — each rung adds structure.
1. Topological Space — The Most General Setting
A topological space is the most general notion of a geometric space [3]. You start with a set \(X\) and a collection of subsets called open sets, satisfying three axioms: the empty set and \(X\) itself are open; arbitrary unions of open sets are open; finite intersections of open sets are open.
That is all. There is no notion of distance, no notion of angle, no notion of size. But you gain the concept of continuity — a map between two topological spaces is continuous if the preimage of every open set is open.
This matters for machine learning because continuity is the prerequisite for anything meaningful to happen. If your embedding function is not continuous, nearby inputs can map to wildly different representations, and generalisation becomes impossible. Topological spaces are the setting where we ask: what structure must be preserved when we learn a mapping?
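The three axioms are concrete enough to check mechanically. A minimal sketch that verifies a candidate collection of open sets on a tiny finite set (the sets chosen are illustrative; for a finite space, closure under pairwise unions already implies closure under arbitrary unions):

```python
def is_topology(X, opens):
    """Check the three open-set axioms for a candidate topology on a finite set X."""
    opens = {frozenset(s) for s in opens}
    X = frozenset(X)
    # Axiom 1: the empty set and X itself are open.
    if frozenset() not in opens or X not in opens:
        return False
    # Axioms 2 & 3: closure under unions and intersections.
    # (On a finite set, pairwise closure suffices for arbitrary unions.)
    for a in opens:
        for b in opens:
            if a | b not in opens or a & b not in opens:
                return False
    return True

X = {1, 2, 3}
good = [set(), {1}, {1, 2}, {1, 2, 3}]   # a valid topology (nested opens)
bad  = [set(), {1}, {2}, {1, 2, 3}]      # invalid: missing the union {1, 2}
print(is_topology(X, good))  # True
print(is_topology(X, bad))   # False
```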
2. Metric Space — Adding Distance
A metric space \((X, d)\) is a topological space equipped with a distance function \(d : X \times X \to \mathbb{R}_{\geq 0}\) satisfying non-negativity, identity of indiscernibles (\(d(x,y) = 0 \iff x = y\)), symmetry, and the triangle inequality \(d(x,z) \leq d(x,y) + d(y,z)\) [4].
Metric spaces are where most practising ML researchers implicitly live. Euclidean distance, cosine distance, edit distance — these are all metrics. The key concept metric spaces give us is convergence: a sequence \(\{x_n\}\) converges to \(x\) if \(d(x_n, x) \to 0\).
But metric spaces say nothing about directions, addition, or scaling. Two points have a distance between them, but you cannot add them together and get something meaningful.
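The metric axioms can likewise be verified numerically. A small sketch with the Euclidean metric (the sample points are arbitrary):

```python
import math
import random

def euclidean(x, y):
    """Euclidean distance between two points given as coordinate lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(0)
pts = [[random.gauss(0, 1) for _ in range(3)] for _ in range(15)]

for x in pts:
    for y in pts:
        d = euclidean(x, y)
        assert d >= 0                                  # non-negativity
        assert abs(d - euclidean(y, x)) < 1e-12        # symmetry
        for z in pts:                                  # triangle inequality
            assert euclidean(x, z) <= d + euclidean(y, z) + 1e-12

print("all metric axioms hold on this sample")
```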
3. Normed Vector Space — Adding Size and Algebra
A normed vector space \((V, \|\cdot\|)\) is a vector space over \(\mathbb{R}\) (or \(\mathbb{C}\)) equipped with a norm \(\|\cdot\| : V \to \mathbb{R}_{\geq 0}\) satisfying positive definiteness, absolute homogeneity (\(\|\alpha v\| = |\alpha| \|v\|\)), and the triangle inequality [4].
Every normed vector space is a metric space under \(d(x,y) = \|x - y\|\). But now we can also add vectors and scale them, which means we can talk about linear combinations, subspaces, and transformations.
This is where the \(L^p\) spaces live. \(L^2(\Omega)\) — the space of square-integrable functions on a domain \(\Omega\) — is a normed vector space with \(\|f\|_2 = \left(\int_\Omega |f(x)|^2 \, dx\right)^{1/2}\). This is deeply relevant: signals with finite energy (finite \(L^2\) norm) form a normed vector space, and this is the foundation of all of Fourier analysis.
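As a quick numerical illustration, the \(L^2\) norm of a sampled signal can be approximated by a Riemann sum; for \(f(x) = \sin(2\pi x)\) on \([0,1]\) the exact value is \(1/\sqrt{2}\):

```python
import numpy as np

def l2_norm(f, dx):
    """Discrete approximation of (∫ |f|² dx)^(1/2) for a sampled function."""
    return np.sqrt(np.sum(np.abs(f) ** 2) * dx)

x = np.linspace(0, 1, 10_000, endpoint=False)
dx = x[1] - x[0]
f = np.sin(2 * np.pi * x)

# ∫₀¹ sin²(2πx) dx = 1/2, so ‖f‖₂ = 1/√2 ≈ 0.7071
print(l2_norm(f, dx))

# Absolute homogeneity of the norm: ‖αf‖ = |α|·‖f‖
assert np.isclose(l2_norm(-3 * f, dx), 3 * l2_norm(f, dx))
```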
4. Inner Product Space — Adding Geometry and Angles
An inner product space adds a bilinear (or sesquilinear, over \(\mathbb{C}\)) form \(\langle \cdot, \cdot \rangle : V \times V \to \mathbb{F}\) satisfying conjugate symmetry, linearity in the first argument, and positive definiteness [3].
The norm is recovered as \(\|v\| = \sqrt{\langle v, v \rangle}\). But crucially, the inner product adds angles:
\[\cos\theta = \frac{\langle u, v \rangle}{\|u\| \|v\|}\]
This is the space that current LLMs inhabit. Dot-product attention computes \(\text{Attention}(Q, K) \propto QK^\top\), which is exactly a matrix of inner products [5]. Cosine similarity, used ubiquitously for semantic search, measures the angle between embedding vectors. The geometry of meaning, in today’s models, is entirely encoded in the angles and magnitudes of vectors in an inner product space.
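A small sketch of this geometry, using random vectors (the dimension 64 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=64), rng.normal(size=64)

# Cauchy-Schwarz guarantees |⟨u, v⟩| ≤ ‖u‖‖v‖, so the ratio is a valid cosine.
cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
theta = np.degrees(np.arccos(cos_theta))
print(f"cos θ = {cos_theta:.3f}, θ = {theta:.1f} degrees")
# Random high-dimensional vectors are nearly orthogonal in expectation (θ near 90°).
```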
5. Banach Space — Completeness
A Banach space is a normed vector space that is complete: every Cauchy sequence converges to a limit that is still inside the space [4]. This is a technical but important property. Incompleteness means you can construct sequences of valid representations that “converge” to something outside your representation space — a pathological situation for any learning algorithm.
All finite-dimensional normed vector spaces are automatically Banach spaces. For infinite-dimensional function spaces (which appear in functional analysis and in the theory of neural networks as models of function classes), completeness must be verified explicitly.
6. Hilbert Space — The Meeting Point of Algebra and Analysis
A Hilbert space \(\mathcal{H}\) is a complete inner product space [3,4]. It combines the geometric richness of an inner product space with the analytical solidity of a Banach space. In finite dimensions, this is nothing new — \(\mathbb{R}^d\) with the dot product is already a Hilbert space.
The profound content of Hilbert spaces appears in infinite dimensions. The canonical example is \(L^2(\mathbb{R})\): the space of square-integrable functions on the real line. This space has:
- An inner product: \(\langle f, g \rangle = \int_{-\infty}^\infty f(x)\overline{g(x)} \, dx\)
- A notion of orthogonality: \(\langle f, g \rangle = 0\)
- Orthonormal bases: the complex exponentials \(\{e^{2\pi i n x}\}_{n \in \mathbb{Z}}\) form an orthonormal basis of the closely related space \(L^2([0,1])\) — this is the Fourier basis (on the full line, the continuous Fourier transform plays the analogous role)
- A Plancherel theorem: the Fourier transform is an isometric isomorphism on \(L^2(\mathbb{R})\), preserving inner products exactly
This last point is crucial: the Fourier transform is a rotation in Hilbert space. Moving from the time domain to the frequency domain is not a loss of information — it is a change of basis in an infinite-dimensional inner product space.
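This is easy to verify numerically: with unitary normalisation, the discrete Fourier transform preserves norms and inner products to machine precision (the signals here are random):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=1024)
g = rng.normal(size=1024)

# Unitary ("ortho") normalisation makes the DFT an isometry.
F = np.fft.fft(f, norm="ortho")
G = np.fft.fft(g, norm="ortho")

# Plancherel: norms are preserved ...
assert np.isclose(np.linalg.norm(f), np.linalg.norm(F))
# ... and so are inner products (⟨f, g⟩ is real for real signals).
assert np.isclose(np.vdot(G, F).real, f @ g)
print("Fourier transform is an isometry on this sample")
```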
7. Other Spaces
Sobolev spaces extend function spaces by incorporating derivatives [6]. They are the natural setting for partial differential equations and for understanding the regularity of functions — not just whether they are square-integrable, but whether they are smooth.
In machine learning, Sobolev spaces appear in regularisation theory (penalising functions with large higher-order derivatives), in physics-informed neural networks, and in the theoretical analysis of approximation by neural networks with smooth activation functions [7].
Minkowski space introduces an indefinite inner product [8].
This is the geometric setting of special relativity. In machine learning, it appears in hyperbolic embedding methods [9]: because hyperbolic space has constant negative curvature, it can represent hierarchical structures (trees, taxonomies, knowledge graphs) exponentially more efficiently than Euclidean space. The Poincaré ball model of hyperbolic space can be viewed as a projection of Minkowski space.
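A minimal sketch of the Poincaré-ball distance formula used in this line of work [9] (the sample points are illustrative). Note how hyperbolic distance grows without bound as a point approaches the boundary, which is what makes room for exponentially growing hierarchies:

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance in the Poincaré ball model (points with norm < 1)."""
    duv = np.dot(u - v, u - v)
    denom = (1 - np.dot(u, u)) * (1 - np.dot(v, v))
    return np.arccosh(1 + 2 * duv / denom)

origin = np.zeros(2)
near_boundary = np.array([0.99, 0.0])

# Euclidean distance is only 0.99, but the hyperbolic distance is ≈ 5.29:
print(poincare_distance(origin, near_boundary))
```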
Part II: What LLMs Are Actually Doing — and What They Are Missing
Embeddings as Static Vectors
In every major large language model — GPT-4 [10], Llama [11], Mistral — words and subword tokens are mapped to vectors in \(\mathbb{R}^d\) for some large \(d\) (commonly 1024 to 16384). These embeddings are static: the token “bank” always starts as the same point in \(\mathbb{R}^d\), regardless of context. Context-sensitivity comes later, through the attention mechanism, which produces contextualised representations by mixing embeddings.
The geometry of these spaces is straightforward:
- Semantic similarity is measured by cosine similarity (angle in the inner product space)
- Attention scores are inner products, \(q^\top k / \sqrt{d_k}\), softmax-normalised [5]
- Residual connections perform vector addition in the same space
This is a rich geometric structure, and it has proven extraordinarily effective. But it has a fundamental character: it is amplitude-only. Every representation is a real-valued vector — a collection of magnitudes. There is no phase, no timing information, no notion of whether two features are in sync or in opposition.
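The attention computation described above can be sketched in a few lines (shapes and inputs are illustrative); note that every quantity involved is a real-valued magnitude:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention [5]: softmax(QKᵀ/√d_k)·V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # a matrix of inner products
    # Numerically stable row-wise softmax:
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
out = attention(X, X, X)   # self-attention: queries, keys, values share inputs
print(out.shape)           # (4, 8)
```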
The Magnitude Heuristic and Its Failure Modes
In real-valued networks, the common heuristic is: a feature is important if its activation is large. Magnitude proxies for salience. This creates several problems:
- Frequency vs importance ambiguity: A feature that fires frequently in training data accumulates large weights, making it hard to distinguish structural relevance from statistical prevalence.
- Superposition: Polysemantic neurons — single neurons that respond to multiple unrelated concepts — have been extensively documented [12]. When you only have magnitude, many concepts must be packed into the same activation by sharing magnitude ranges.
- Limited interaction mechanisms: In real-valued spaces, you cannot have two representations that destructively interfere. The only way to suppress a feature is to add a neuron with the opposite sign — which consumes capacity without providing any structured cancellation mechanism.
Hardy Spaces and the Frequency Domain View
The Laplace transform maps a continuous-time signal \(f(t)\) to a function of a complex variable \(s = \sigma + i\omega\):
\[\mathcal{L}\{f\}(s) = \int_0^\infty f(t) e^{-st} \, dt\]
The Z-transform does the same for discrete sequences:
\[\mathcal{Z}\{x\}(z) = \sum_{n=0}^\infty x[n] z^{-n}\]
The spaces of functions that these transforms naturally live in are Hardy spaces \(H^2\) — a class of Hilbert spaces consisting of holomorphic (complex-analytic) functions on a half-plane or disk, with \(L^2\) boundary behaviour [13]. Hardy spaces inherit the inner product structure of \(L^2\), but the analyticity constraint (the real and imaginary parts satisfy the Cauchy-Riemann equations) means they encode much more structure than generic \(L^2\) functions.
In these spaces, a “signal” is not a point but a function: a complex-valued object where:
- The real part encodes amplitude information
- The imaginary part encodes phase information
- The inner product measures overlap, synchrony, and coherence between signals
This is the geometric framework that has been standard in electrical engineering and quantum mechanics for decades, but is only beginning to enter machine learning in earnest.
Part III: Signals as Richer Representations
Complex Activations and Phase
In a complex-valued neural network (CVNN) [14], activations are elements of \(\mathbb{C}\) rather than \(\mathbb{R}\). A single complex activation \(z = re^{i\theta}\) encodes two quantities:
- \(r = |z|\): the magnitude (analogous to standard activation strength)
- \(\theta = \arg(z)\): the phase (a second, independent channel of information)
Phase encodes relational information — not “how strong is this feature?” but “how does this feature relate to the system’s current state?” This is the representation language of physical systems: waves in quantum mechanics, oscillators in neuroscience, carrier signals in communications.
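A sketch of what a single complex activation carries (the values are illustrative):

```python
import numpy as np

z = 0.8 * np.exp(1j * np.pi / 4)   # one complex activation: r = 0.8, θ = 45°
r, theta = np.abs(z), np.angle(z)
print(r, theta)

# Two activations with equal magnitude but opposite phase carry different
# relational information, yet look identical to a magnitude-only readout:
z1 = 1.0 * np.exp(1j * 0.0)
z2 = 1.0 * np.exp(1j * np.pi)
assert np.isclose(np.abs(z1), np.abs(z2))            # same "strength"
assert not np.isclose(np.angle(z1), np.angle(z2))    # different phase
```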
Interference: The Mechanism That Real-Valued Networks Lack
The defining operation of wave-based representations is interference. When two complex-valued signals are summed:
\[z_1 + z_2 = r_1 e^{i\theta_1} + r_2 e^{i\theta_2}\]
the result depends critically on the phase relationship \(\theta_1 - \theta_2\):
- If \(\theta_1 = \theta_2\) (in phase): constructive interference — amplitudes add
- If \(\theta_1 = \theta_2 + \pi\) (out of phase): destructive interference — amplitudes subtract, cancelling completely when \(r_1 = r_2\)
- Intermediate phases: partial interference of intermediate strength
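These three cases are one line of arithmetic each:

```python
import numpy as np

r1, r2 = 1.0, 1.0
theta1 = 0.3   # arbitrary reference phase

in_phase   = r1 * np.exp(1j * theta1) + r2 * np.exp(1j * theta1)
out_phase  = r1 * np.exp(1j * theta1) + r2 * np.exp(1j * (theta1 + np.pi))
quadrature = r1 * np.exp(1j * theta1) + r2 * np.exp(1j * (theta1 + np.pi / 2))

print(abs(in_phase))    # 2.0     — constructive: amplitudes add
print(abs(out_phase))   # ≈ 0     — destructive: amplitudes cancel
print(abs(quadrature))  # ≈ 1.414 — partial interference (√2)
```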
This gives a principled mechanism for:
Noise suppression: Stochastic noise has a uniform phase distribution. When many noisy signals are summed, their phases cancel in expectation — this is exactly how phased-array radar and optical coherence tomography work [15]. Real-valued networks have no analogous mechanism.
Feature synchronisation: Semantically related features can be “phase-locked” — assigned similar phases so they constructively amplify each other. This is reminiscent of the binding hypothesis in neuroscience [16]: objects are perceived as unified wholes because their neural representations oscillate in phase.
Geometric stability: Phase-aware models encode relationships as rotations in the complex plane, not just as signed magnitudes. Rotations are isometries — they preserve distances — which tends to improve generalisation and reduce sensitivity to input perturbations.
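The noise-suppression claim above is easy to check with synthetic data (a sketch; the phase values and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# N unit-magnitude signals with uniformly random phases (pure "noise"):
noise = np.exp(1j * rng.uniform(0, 2 * np.pi, size=N))
# N signals phase-locked to a common phase (a coherent "feature"):
signal = np.exp(1j * 0.7) * np.ones(N)

# Random phases cancel: the mean has magnitude on the order of 1/√N.
print(abs(noise.sum()) / N)
# Phase-locked signals add coherently: the mean has magnitude 1.
print(abs(signal.sum()) / N)
```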
Why This Matters for Multimodal Representations
My original motivation was multimodal representation: representing text, image, and audio in a common space. The standard approach is to train modality-specific encoders and project them into the same \(\mathbb{R}^d\) — forcing fundamentally different signal types into the same geometric box.
But audio is intrinsically a signal: it is a pressure wave with frequency, amplitude, and phase. Representing audio as a static point in \(\mathbb{R}^d\) discards its temporal structure before the representation even begins. A Hilbert space representation would allow audio, image (via 2D Fourier structure), and text (via sequence dynamics) to be represented as functions — elements of an \(L^2\) space — and their cross-modal relationships to be encoded as inner products and phase relationships in that space.
Part IV: State Space Models — The Existing Bridge
When I was developing these ideas, I came across a body of work that had already built something close to what I was imagining, coming from a different direction: State Space Models (SSMs).
The S4 model [1] parameterises sequence processing using a continuous-time state space:
\[\dot{x}(t) = Ax(t) + Bu(t), \quad y(t) = Cx(t) + Du(t)\]
where \(A\) is a structured matrix (specifically, a HiPPO matrix [17] designed to optimally memorise history), and the model is discretised for practical training. The key insight: in the frequency domain (via Laplace transform), this system is a rational function of \(s\) — exactly the kind of object that lives naturally in a Hardy space.
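A minimal scalar sketch of this recurrence, using zero-order-hold discretisation (the real S4 uses a structured HiPPO matrix [17] and learned parameters; the scalar values here are illustrative):

```python
import numpy as np

# Continuous system  ẋ = Ax + Bu,  y = Cx + Du,  with a scalar state.
A, B, C, D = -1.0, 1.0, 1.0, 0.0
dt = 0.1

# Zero-order-hold discretisation:  Ā = exp(AΔ),  B̄ = (Ā − 1)/A · B
A_bar = np.exp(A * dt)
B_bar = (A_bar - 1.0) / A * B

def ssm_scan(u):
    """Run the discretised recurrence  x_{k+1} = Ā x_k + B̄ u_k,  y_k = C x_k + D u_k."""
    x, ys = 0.0, []
    for u_k in u:
        ys.append(C * x + D * u_k)
        x = A_bar * x + B_bar * u_k
    return np.array(ys)

u = np.ones(50)       # step input
y = ssm_scan(u)
print(y[-1])          # approaches the steady state −B/A · C = 1.0
```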
Mamba [2] extends this with selective state spaces — input-dependent dynamics that allow the model to choose, at each step, what to remember and what to forget. This is closer to a filter in signal processing than to attention: rather than computing pairwise similarities across the entire sequence, it propagates information through a dynamical system with state.
The computational advantage is stark: standard Transformers scale as \(O(n^2)\) in sequence length due to the attention matrix, while SSMs scale as \(O(n)\) — because they process sequences as signals through a filter, not as sets of points comparing each other.
This is not a coincidence. When you represent sequences as signals rather than as sets, linear-time processing becomes natural. The Fourier transform of a signal is computed in \(O(n \log n)\) rather than \(O(n^2)\); convolution in the time domain is multiplication in the frequency domain. Signal representations come with efficient algorithms as a structural gift.
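The convolution theorem invoked here can be verified directly: circular convolution computed the slow way agrees with pointwise multiplication in the frequency domain (random test signals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
f, g = rng.normal(size=n), rng.normal(size=n)

# Direct circular convolution: O(n²)
direct = np.array(
    [sum(f[m] * g[(k - m) % n] for m in range(n)) for k in range(n)]
)

# Via the FFT: O(n log n) — convolution in time is multiplication in frequency.
via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

print(np.allclose(direct, via_fft))  # True
```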
Part V: A Taxonomy of Spaces for Representation
To ground the above discussion, here is a summary of which spaces are relevant to which ML contexts:
| Space | Key Structure | Natural ML Application |
|---|---|---|
| Metric space | Distance only | k-NN, clustering, contrastive learning |
| Normed vector space | Distance + linear algebra | \(L^p\) regularisation, weight decay |
| Inner product space | Angles + projections | Dot-product attention, cosine similarity |
| Hilbert space (finite-dim) | Complete inner product | Standard neural network layers |
| Hilbert space (infinite-dim, \(L^2\)) | Function-valued representations | SSMs, functional neural processes |
| Hardy space \(H^2\) | Holomorphic + \(L^2\) boundary | Laplace/Z-transform signal representations |
| Sobolev space \(W^{k,p}\) | Function + derivative regularity | Physics-informed NNs, smoothness regularisation |
| Hyperbolic space / Minkowski | Negative curvature | Hierarchical embedding, knowledge graphs |
| Riemannian manifold | Local Euclidean + curvature | Geometric deep learning, manifold learning |
The progression from top to bottom is a progression from less structure to more structure. Current LLMs sit solidly in the finite-dimensional inner product space row. State space models begin to occupy the \(L^2\) and Hardy space rows. The full realisation of signal-based multimodal representation would require working fluently across several of these spaces simultaneously.
Conclusion: Rethinking Representation from the Ground Up
The dominant paradigm treats representation as placement: a token is a point, a meaning is a location, similarity is proximity. This is a powerful and productive view, and it has driven remarkable progress. But it is, ultimately, a static view.
Signals offer a dynamic alternative. A representation that carries phase as well as amplitude can encode how a feature relates to the system’s current state, not just that it is present. Representations as functions in a Hilbert space can encode temporal and spectral structure natively, without discarding it at the tokenisation stage. Interference gives a principled mechanism for noise suppression and feature binding that has no real-valued analogue.
State space models are an existence proof that this direction is practically viable. But I think the full potential — particularly for multimodal systems where audio, vision, and language need to be represented in a common framework that respects the intrinsic nature of each modality — remains largely unexplored.
The next blog in this series will move from spaces to geometry: we will look at the Manifold Hypothesis and ask what it means for data to live on a low-dimensional manifold inside a high-dimensional space, explore whether high-dimensional spaces are really “more linear” in any meaningful sense, and introduce Vector Symbolic Architectures as a way to perform structured reasoning inside geometric spaces.
References
[1] Gu, A., Goel, K., & Ré, C. (2022). Efficiently modeling long sequences with structured state spaces. ICLR 2022. arXiv
[2] Gu, A., & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752. arXiv
[3] Reed, M., & Simon, B. (1980). Methods of Modern Mathematical Physics, Vol. 1: Functional Analysis. Academic Press.
[4] Kreyszig, E. (1978). Introductory Functional Analysis with Applications. Wiley.
[5] Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 2017. arXiv
[6] Evans, L. C. (2010). Partial Differential Equations (2nd ed.). American Mathematical Society.
[7] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.
[8] Minkowski, H. (1908). Raum und Zeit. Physikalische Zeitschrift, 10, 75–88.
[9] Nickel, M., & Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. NeurIPS 2017. arXiv
[10] OpenAI. (2023). GPT-4 Technical Report. arXiv:2303.08774. arXiv
[11] Touvron, H., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971. arXiv
[12] Elhage, N., et al. (2022). Toy models of superposition. Transformer Circuits Thread. Article
[13] Garnett, J. B. (2007). Bounded Analytic Functions. Springer.
[14] Trabelsi, C., et al. (2018). Deep complex networks. ICLR 2018. arXiv
[15] Van Trees, H. L. (2002). Optimum Array Processing. Wiley-Interscience.
[16] Singer, W. (1999). Neuronal synchrony: a versatile code for the definition of relations? Neuron, 24(1), 49–65.
[17] Gu, A., et al. (2020). HiPPO: Recurrent memory with optimal polynomial projections. NeurIPS 2020. arXiv
This is Blog 1 of a planned series. Blog 2 will cover the Manifold Hypothesis, the question of linearity in high-dimensional spaces, and Vector Symbolic Architectures as a framework for compositional reasoning inside geometric representations.
Tags: representation-learning hilbert-spaces state-space-models signal-processing machine-learning mathematical-foundations embeddings

