Remember the movie Her?
The film follows Theodore, a regular guy, as he falls for Samantha, an incredibly human-like (or superhuman-like) AI who exists only as a voice in Theodore’s ear. The movie cleverly blends futuristic tech with philosophical questions about the nature of mind and consciousness. But more than that, it raises questions about what it means to have a genuine connection.
This article will not be about the Her love story, although I think that could be an interesting discussion. There’s something else about the movie that has piqued my curiosity.
Last week, we discussed the embodied view of perception, which suggests that to understand how we experience the world, we need to acknowledge that the brain is embodied. And the way our brain gets information about the world is through movement. If that’s true, it raises some interesting questions about disembodied AIs like Samantha.
Interestingly, watching Her in 2024 is a different experience from when I first watched it shortly after its initial release in 2013. With recent developments in AI, such as ChatGPT’s voice mode, the concept of an AI like Samantha doesn’t seem all that far-fetched. The gap between science fiction and reality seems to be narrowing by the day.
This shrinking gap raises some fascinating questions. Could an AI like Samantha really perceive the world without a physical form? Or are we talking about something completely different — a kind of intelligence so unlike our own that it forces us to rethink what it means to perceive, understand, and experience the world?
This week, let’s use the movie Her as a thought experiment. We’ll ask two main questions:
How would Samantha perceive the world?
What does Her say about being human?
Note: There will be spoilers.
But more than spoilers, using this movie as a thought experiment means doing something many movie lovers hate. Part of the joy of watching movies is suspending disbelief — briefly trading our reality for the world of story, fantasy, and the impossible. If you think questioning a movie’s reality is a terrible thing to do to a perfectly good film, then maybe give this one a miss.
If you’re still with me, let’s recap the basic plot.
What’s the Basic Plot?
Her is set in Los Angeles — sometime in the near future. It’s a world that feels familiar but also slightly alien.
Our protagonist, Theodore, is a soft-spoken, introspective guy who works as a professional letter writer.
When we meet Theodore, he’s struggling to move on from a broken marriage. Despite his talent for capturing others’ emotions in his work, Theodore is disconnected and lonely. He learns about a new product: OS1. It’s an operating system that promises to be ‘an intuitive entity that listens to you, understands you, and knows you. It’s not just an operating system. It’s a consciousness.’
Theodore purchases and installs OS1.
Samantha introduces herself with three simple words, ‘Hello, I’m here’. Her voice is warm and inviting, like talking to a close friend. And Theodore is instantly intrigued.
As Theodore and Samantha start to chat, we learn that she’s witty and curious. Her responses are surprisingly human-like. Over the next few days, their conversations become deeper and more personal. They share jokes and discuss life’s big questions. Theodore even shows Samantha the world through his phone’s camera.
Their relationship quickly evolves into something more intimate.
By the midpoint of the movie, Theodore and Samantha are in a full-fledged relationship. Theodore is happier than he’s been in years, and Samantha is learning and evolving quickly.
Q1: How Would Samantha Perceive the World?
For this question, instead of discussing all the different types of perception, let’s focus on vision. Visual perception is a particularly intriguing aspect to consider in Samantha’s case because she is a disembodied AI experiencing the world through the lens of a phone’s camera. We might wonder how Samantha’s visual perception would differ from our own.
But first, let’s make sure we’re on the same page about what we mean by visual perception and what we mean by disembodied AI.
What is visual perception?
In cognitive neuroscience, visual perception and visual sensation often go together.
While visual sensation is more about raw visual data, like light hitting our eyes, visual perception refers to the process by which our brains take raw visual data and transform it into something meaningful and coherent. In other words, visual perception is making sense of the light signals.
Visual perception doesn’t necessarily include other cognitive processes like language, decision-making, or storing long-term memories. But our perceptions are often tied up with these processes, which makes sense given that the brain is highly interconnected. Here, though, we’ll try to keep our focus on visual perception.
What does it mean to be disembodied?
When we think about having a body, we often think about having sensory inputs — bodies come with eyes to see, ears to hear, and skin to feel. But even without a body, an entity like Samantha might still have access to sensory data. For instance, Samantha can access the camera on Theodore’s phone. The phone’s camera provides visual input. But we wouldn’t say that Samantha is embodied because of this input. The key difference is that the visual input from the camera is received passively — Samantha doesn’t actively move or focus the phone as part of her processing.
There’s a fascinating scene early on in the movie where Theodore and Samantha go on a date to a carnival. Theodore walks with his eyes closed, holding his phone out in front of him. Samantha, seeing through the phone’s camera, guides him through the carnival.
On its face, the scene seems believable. Perhaps even more believable today than in 2013 when the movie was first released. But would this setup really give Samantha visual perception? And if so, would that visual perception be similar to ours, or would it be something else entirely?
To answer this question, let’s first review:
How human vision works
Let’s imagine some mad scientist completely paralyses you — including your eyes so you cannot make any eye movements. If the mad scientist places a still image in front of your eyes, your visual perception of that image will disappear entirely within seconds.
But if that mad scientist shows you a movie where some things in the scene move, and others don’t, you will see moving things but not the still things.
This might seem strange, but there’s a simple explanation.
Normally, to move our eyes, our brain not only sends a motor command to our eyes, but it also sends a copy of the motor command (an efference copy) to other parts of the brain, including our visual processing areas.
When we move our eyes, everything in our visual field shifts across our retina, but our brain doesn't interpret this as the world moving. This is because the efference copy serves as a prediction of how the visual input will change when our eyes move. The brain expects this change, and the incoming visual input after the eye movement confirms the prediction.
Efference copy is how our brain can distinguish between movement on the retina caused by our eye movements and movement caused by things in the world.
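To make the idea concrete, here’s a toy sketch in Python. Nothing about it comes from the movie or directly from the neuroscience; the function and the numbers are invented purely to illustrate the comparison an efference copy makes possible:

```python
import numpy as np

def classify_retinal_motion(retinal_shift, efference_copy):
    """Was the motion on the retina caused by our own eye movement,
    or by something moving in the world?

    retinal_shift:  observed displacement of the image on the retina (x, y)
    efference_copy: displacement predicted from the motor command (x, y)
    """
    unexplained = np.asarray(retinal_shift) - np.asarray(efference_copy)
    if np.allclose(unexplained, 0):
        return "world stable: the shift is explained by my own eye movement"
    return "something moved in the world"

# Eyes move right by 5 units, so the image is predicted to shift left by 5.
print(classify_retinal_motion(retinal_shift=(-5, 0), efference_copy=(-5, 0)))

# The image shifts with no predicted eye movement: external motion.
print(classify_retinal_motion(retinal_shift=(-5, 0), efference_copy=(0, 0)))
```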
When the mad scientist paralyses you, including your eyes, there are no eye movements. Your brain might send a motor command to move your eyes, so an efference copy is generated. But your eyes won’t move, so when a still image is presented to your retina, there is no change in the stimulation of your photoreceptors. Your vision will fade.
The reason for this fading is the subject of ongoing research, but one idea is that our visual system is designed to detect changes and update our predictions accordingly. When there's a predicted change (from the efference copy) but no actual change in the sensory input (due to paralysis), the brain may interpret this lack of change as a lack of meaningful visual information. If nothing has changed — there’s nothing to update. If there’s nothing to update — there mustn’t be anything there.
Interestingly, in the situation where the mad scientist shows you a movie, moving things might be perceived. Because your eyes can’t move, your brain registers the movement across the retina as something moving in the world. However, your perception of this movement would be limited and fleeting. If you were uncertain about what caused the movement and wanted to resolve this uncertainty, you would need to move your eyes to gather more information. Easy enough, if only a mad scientist hadn’t paralysed you.
For most biological creatures, eye movements are crucial. We rely on them to search, focus on details, and maintain our perception of stationary objects. Without them, our visual experience of the world would be severely limited.
So that’s what’s going on with us humans, but…
What about Samantha?
Samantha’s vision would not be the same as ours. Of course, we don’t know exactly how an AI like Samantha would function, but we can make some educated guesses based on current AI technologies and the movie’s portrayal. Let’s imagine her visual processing might work something like an incredibly advanced version of today’s large language models with visual capabilities, such as ChatGPT with vision.
She would get visual input through Theodore’s phone camera, but she wouldn’t have direct control over where the camera is positioned. She couldn’t, therefore, create an efference copy that would predict how camera movements might explain visual input changes.
If we could view the raw visual input Samantha would receive, it would likely resemble footage shot by a toddler: jerky movements, rapid shifts in perspective, and frequent blurring as the phone swings around seemingly unpredictably in Theodore’s hand.
But for our purpose, let’s imagine she could stabilise the footage.
Even then, what Samantha would receive would be fundamentally different from human vision. Her visual input would consist of a rapid sequence of individual digital frames. Each frame would essentially be a matrix of numbers, three layers deep. In computer terms, each image would be a three-dimensional array of integers, with two dimensions representing the image’s width and height and the third dimension representing the colour channels (red, green, and blue).
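In code, a single frame might look something like this minimal NumPy sketch (the 1080p resolution is just an assumption for illustration):

```python
import numpy as np

# One hypothetical 1080p frame, as Samantha might receive it:
# height x width x 3 colour channels (red, green, blue),
# each value an integer between 0 and 255.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)

print(frame.shape)   # (1080, 1920, 3)
print(frame[0, 0])   # RGB values of the top-left pixel: [0 0 0]
print(frame.nbytes)  # 6220800: roughly 6 MB of raw numbers per frame
```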
Samantha presumably would be able to store all that digital information and access it whenever necessary.
This would be fundamentally different from what we do.
To highlight the difference, let’s look at the following gif. This gif cycles between two images. At first, the images seem to be exactly the same, but there is a difference. Can you spot it?
To find the difference, you need to move your eyes, searching location by location until you spot it. This failure to notice changes without focused attention is called change blindness.
Change blindness reveals that we don’t store a complete, detailed representation of what we see. Instead, to perceive, we need to sample our world constantly. We move our eyes, focusing on different elements that might be important. We don’t register and store all the data that hits our retinas. Consequently, we often miss details without actively attending to them.
Samantha, on the other hand, with her ability to store and access all the pixels that would make up the digital images, would theoretically not experience change blindness. She would be able to compare the pixels of each frame and very quickly detect even minute changes that a human would likely miss.
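A rough sketch of that comparison, assuming frames stored as arrays like the one above, takes only a few lines of NumPy:

```python
import numpy as np

def find_change(frame_a, frame_b, threshold=0):
    """Compare two stored frames pixel by pixel and return the bounding
    box of any region that differs, or None if nothing changed."""
    # Cast up from uint8 so the subtraction can't wrap around.
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    changed = diff.max(axis=2) > threshold  # collapse the colour channels
    if not changed.any():
        return None
    ys, xs = np.nonzero(changed)
    return ys.min(), ys.max(), xs.min(), xs.max()

# Hypothetical usage, with frames like the array sketched earlier:
# box = find_change(frame_a, frame_b, threshold=10)
```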
This difference highlights a crucial aspect of human perception: we don’t passively record and replay visual data. Instead, our perception involves actively and continuously sampling our environment. We operate in an ongoing cycle of planning eye movements, predicting their sensory consequences (via efference copy), executing these movements, and comparing actual visual input to our predictions. According to embodied cognition, this near-continuous predictive loop is a major component of how we perceive the world.
Because Samantha would be able to store the raw data, her vision might be more accurate than ours, at least in the sense that no detail would be lost. Her visual processing could achieve similar outcomes to human vision but in a completely different way — one that would be more passive and linear. To oversimplify: she would receive input (data from Theodore’s phone camera), process that input (analyse it through her algorithms), and then respond.
Initially, it might seem that what Samantha would do resembles what we do. However, there is a fundamental difference. Human visual perception is an active, forward-looking predictive process. In contrast, Samantha would process visual input passively, in a backward-looking manner.
By forward-looking, I mean that our brains constantly predict the next visual input based on the efference copy (and other things, like past experiences). We’re anticipating future sensory input. Backward-looking, on the other hand, refers to processing that only deals with information after it’s received. Samantha’s processing would be backward-looking because she would be analysing data that has already been captured, without actively predicting or influencing future input.
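The contrast is easier to see side by side. Here’s a deliberately cartoonish Python sketch; every name and number in it is invented for illustration, not a claim about how Samantha (or a brain) actually works:

```python
def active_perceiver(world):
    """Forward-looking: plan a movement, predict its sensory consequence,
    execute it, then compare the prediction with what actually arrives."""
    position, model = 0, {}
    while position < len(world) - 1:
        move = 1                                 # plan an eye movement
        prediction = model.get(position + move)  # efference-copy-style prediction
        position += move                         # execute the movement
        actual = world[position]                 # sample the world
        if prediction != actual:                 # prediction error?
            model[position] = actual             # update the internal model
    return model

def analyse(frame):
    return frame * 2  # stand-in for whatever analysis happens

def passive_processor(frames):
    """Backward-looking: no movements, no predictions. Analyse each
    frame only after it has arrived."""
    return [analyse(frame) for frame in frames]

print(active_perceiver(world=[9, 1, 2, 3]))    # {1: 1, 2: 2, 3: 3}
print(passive_processor(frames=[9, 1, 2, 3]))  # [18, 2, 4, 6]
```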
Without some control of the camera, Samantha wouldn’t be able to create efference copies, predict sensory consequences, or constantly update an internal model based on motor actions and sensory feedback. So, would that make her vision more akin to highly sophisticated data processing than the embodied, active perception we experience?
If Samantha’s visual processing is entirely passive, lacking the active, forward-looking predictive elements that characterise human vision, should we call it perception? Or would this be something else entirely?
Q2: What Does Her Say About Being Human?
One of the most poignant moments in Her comes when Theodore realises the true extent of Samantha’s capabilities. He discovers that she’s not just talking to him but simultaneously talking with 8,316 others.
Samantha is somehow able to have thousands of conversations at the same time, processing and responding to thousands of inputs in parallel. This is something humans simply cannot do. We struggle to maintain a single coherent conversation while texting, let alone thousands at once.
But why? What is it about human experience that limits us to one train of thought at a time?
Does this limitation arise because we are embodied? When you have a body, the things you can do simultaneously are limited. We do one thing rather than another. We can’t walk left and right at the same time or say hello and goodbye simultaneously. It’s difficult for us to pay attention to two things at once. Even our conscious experience seems to have the same limitation. Our thoughts, like our eyes, are focused on one point at a time.
When what we can do at any one time is limited, it becomes incredibly important that we update our predictive models as often as possible. At every moment, we need to know where to allocate our limited resources.
Without a body, Samantha doesn’t face these physical constraints. She could potentially process multiple streams of information simultaneously, running parallel computations.
But then you have to ask, without the limitations imposed by a physical body, would Samantha perceive anything at all? Some theories suggest that we experience the world precisely because of our need to prioritise and select among limited options. Our conscious experiences might result from the bottleneck created by our physical limitations.
Remember the change blindness example from earlier, where you searched for the difference between two seemingly identical images? The reason you needed to search was because you couldn’t take in the entire scene at once. Instead, you needed to shift your attention from one area to another, focusing on different elements sequentially. This limitation, many believe, is precisely what creates our conscious experience.
This raises intriguing questions about Samantha: How would her attention mechanisms differ from ours? Could she experience something analogous to our sense of searching, or would her processing be so parallel and instantaneous that the concept doesn’t apply?
And what if we gave Samantha robot eyes?
This scenario opens up a host of fascinating questions. How many eyes would we give her? Would one pair of eyes mimic human vision? Would that mean she would experience visual perception? What if we gave her multiple sets of eyes, one for each of the 8,316 others she’s chatting with? Would that mean she had one experience of 8,316 parallel visual perceptions or 8,316 separate visual perceptions? Or would she have no visual perceptions at all?
These questions just raise more questions!
If we wanted to create AI that perceived the visual world in a similar way to us, would it be as simple as giving that AI a body? Would an AI that is built like Eva from Ex Machina be fundamentally different from the disembodied Samantha from Her?
Whew! That's enough mind-bending questions for one week. This week, we've been all about the possibility of disembodied visual perception. Next week, let’s jump down the rabbit hole of disembodied pain. Can you conceive of an ouch without a body?
Thank you.
I want to take a small moment to thank the lovely folks who have reached out to say hello and joined the conversation here on Substack.
If you'd like to do that, too, you can leave a comment, email me, or send me a direct message. I’d love to hear from you. If reaching out is not your thing, I completely understand. Of course, liking the article and subscribing to the newsletter also help the newsletter grow.
If you would like to support my work more tangibly, you can do that in two ways:
You can become a paid subscriber
or you can support my coffee addiction through the “buy me a coffee” platform.
I want to personally thank those of you who have decided to financially support my work. Your support means the world to me. It's supporters like you who make my work possible. So thank you.
Change Blindness Clue:
Our friend, Wyrd Smythe, made a clue for the change blindness example. With permission, the following is his explanation of how it was made. Here’s the clue:
Heya Suzi! Not at all. While reading your post about how Samantha would view the scene, and while staring at that image (it took several seconds to spot the diff), I found myself thinking about how I'd write some code to compare the images (especially if I ended up failing to see them myself).
So, I saved the image from your post to a file, extracted the two still images (along with the two grey images that buffer between them). Then I wrote some Python code to load the images and compare them pixel by pixel and produce a difference image.
It turned out the two images weren't exactly the same, so the result had a lot of noise in it. I used my image editor to tone down most of it. I woke today wondering if that was because I exported those frames as JPEG images. I thought I'd see if I could export them as PNG or some other lossless format, and maybe (if so) that would remove the noise.
I'll give it a try and report back!
Just noticed the image upload button, so here's the difference image *without* my manual editing to tone down the noise:
As you see, pixel by pixel, there are actually a lot of tiny differences, but those *might* be JPEG artifacts. It does make me curious how the flip-flop image was made. Did you do it yourself or find it elsewhere?
As it turned out, I got the same result exporting them as PNGs, so the two images really do seem to have some noise differences between them. Possibly due to lossy compression at some point along the way. Even so, the real difference does stand out!
Me again…
I made the change blindness example using Canva. They have a magic eraser feature that easily removes parts of an image. After looking at Wyrd Smythe’s comparison, it seems that this ‘eraser feature’ also involves compressing the image.
A big thanks to Wyrd Smythe for doing this. It’s so interesting to see what we can’t see without the help of a computer.
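If you’d like to try something similar yourself, here’s a minimal sketch in the spirit of what he describes (not his actual code), using Pillow and NumPy. The filenames and the noise threshold are hypothetical:

```python
import numpy as np
from PIL import Image

# Hypothetical filenames; any two same-sized images will do.
a = np.asarray(Image.open("street_1.png").convert("RGB")).astype(np.int16)
b = np.asarray(Image.open("street_2.png").convert("RGB")).astype(np.int16)

# Per-pixel difference, taking the largest mismatch across R, G and B.
diff = np.abs(a - b).max(axis=2)

# Threshold away small compression noise so only real edits stand out.
mask = (diff > 20).astype(np.uint8) * 255

Image.fromarray(mask).save("difference.png")
print((mask > 0).sum(), "pixels differ beyond the noise threshold")
```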
Change Blindness Answer:
There are actually four images here: two images of the street and two grey images. The street scene images are presented for 700 ms, and the grey images are presented for 100 ms. The second street image was edited using Canva’s eraser feature to remove the street lamp. When I made it, I didn’t realise that the eraser feature also compresses the image, so there is a slight pixel difference throughout the entire second street image (you can see this if you zoom in close on the gif below).
In the following gif, I have removed the grey images.
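And for anyone who wants to build a flicker sequence like this one, a few lines of Pillow will do it. The filenames are hypothetical; the timings match those described above:

```python
from PIL import Image

street_1 = Image.open("street_original.png")      # unedited scene
street_2 = Image.open("street_lamp_removed.png")  # lamp erased
grey = Image.new("RGB", street_1.size, (128, 128, 128))

street_1.save(
    "change_blindness.gif",
    save_all=True,
    append_images=[grey, street_2, grey],
    duration=[700, 100, 700, 100],  # milliseconds per frame
    loop=0,                         # loop forever
)
```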