What is Information? The Ins and the Outs [Part 1]
Information Theory versus Integrated Information Theory
Information is one of those words we feel we understand, but when we stop to ponder its meaning, it turns out to be a tricky concept to pin down. We say we are surrounded by it, inundated with it, and desperately seeking more of it. But what is information, really?
These days, the word information is so commonly used that it easily slips into conversation without much thought about its meaning. The common-sense notion of information is closely tied to knowledge or meaning. So, we might talk about information gaps, inside information, or the exchange of information. But information is also commonly used interchangeably with the word data; we talk about information storage, information management, and information retrieval.
Information is not limited to common-sense interpretations. Various disciplines have developed their own distinct notions of what information means. Mathematics provides a quantitative measure of information. Biological sciences examine the role of genetic information in living organisms, while neuroscience explores the neural mechanisms underlying information processing. Computer science focuses on digital information representation, storage, and algorithms. Even quantum mechanics has its own interpretation of information related to principles like quantum entanglement. And philosophy challenges all these ideas by questioning whether information might be reality itself. Are all these disciplines referring to the same thing, or is information simply a word used to describe many things?
Answering the question of what information ‘is’ is a monumental task that won’t be fully addressed in a single article. However, given its relevance to the topics we discuss here — artificial intelligence, neuroscience, and consciousness — developing a good understanding of what information means is crucial.
So far, I’ve let the word slip into the conversation without addressing its complexities. But given that I want to start discussing theories like Giulio Tononi’s Integrated Information Theory — a popular yet controversial theory of consciousness — the time has come to start chipping away at the question: what is information?
In this article, I want to start a two-part series comparing information according to two theories: Information Theory, which is heavily employed in the sciences and artificial intelligence, and Integrated Information Theory (IIT). Both theories use similar words when describing what they mean by information, but their conceptions of information are radically different.
This week, let’s explore information according to Information Theory, and we’ll leave information according to IIT for next week.
Information Theory
In fields like mathematics, engineering and science, information is often formally defined through the lens of Information Theory. Claude Shannon — the father of Information Theory — formalised the idea in the late 1940s while working at Bell Labs.
Originally, Information Theory was developed to solve problems in digital communications. The core question was: How can we accurately and efficiently send information, ensuring the message we send is the same as the message received, despite noise or other interferences in the system? But nowadays, Information Theory is heavily used in many fields, especially those that employ Bayesian statistics, including neuroscience and artificial intelligence.
To understand Information Theory, we need to understand information entropy.
Information entropy is similar to entropy in physics, but it’s not exactly the same.
In physics, entropy is often thought of as the disorder or randomness in a system. The higher the disorder, the higher the entropy. (More accurately, in thermodynamics, entropy represents the amount of a system's thermal energy that is unavailable for conversion into mechanical work. However, it is more commonly described in terms of disorder.)
In Information Theory, information entropy measures uncertainty. The higher the uncertainty, the higher the entropy.
For example, let’s imagine you have a new, unopened deck of Bicycle Playing Cards. The order of the unopened deck is: two Jokers, followed by Ace through King of Spades, Ace through King of Diamonds, King through Ace of Clubs, and King through Ace of Hearts (looking at the faces of the cards). (It turns out the order of playing cards can vary depending on the brand, country, and whether you are playing in a casino, but let’s just imagine we know the order of our deck is as above).
When we first open the deck, we are certain of the order of the cards. If we were to place the cards face down and flip over the top card, there would be no uncertainty about which card would appear first — it would be the Ace of Hearts (assuming we have removed any extra ad cards). The order of the cards has very low information entropy.
Now imagine we shuffle the cards so the order is random; we know almost nothing about which card is where. Each card turned over provides new information. If we do our job well and shuffle the cards so we have completely removed the original order, the order of the cards would have high information entropy — we would be very uncertain about the identity of the top card.
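If it helps to see this contrast as code, here is a minimal Python sketch (the entropy_bits helper is my own illustration, not a standard library function). It computes the Shannon entropy, in bits, of the distribution over which card sits on top:

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy H = sum(-p * log2(p)), measured in bits."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Fresh, unopened deck: we are certain the top card is the Ace of Hearts.
print(entropy_bits([1.0]))          # 0.0 bits -- no uncertainty

# Well-shuffled deck: all 52 cards are equally likely to be on top.
print(entropy_bits([1 / 52] * 52))  # ~5.7 bits -- high uncertainty
```

A distribution with no uncertainty scores zero; the uniform distribution over 52 cards scores the maximum possible for this event.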
The core idea in Information Theory is quantifying how much an observation reduces the uncertainty about an event.
In our card example, we can think of the event as turning over one card in the deck. Because there is no uncertainty about the card order of an unshuffled deck, there is no reduction in uncertainty when we turn over a new card. We have learned nothing. We gained no information from turning over a new card.
But for a well-shuffled deck, we are uncertain about the order of the deck, so we do learn something when we turn over a new card. Our uncertainty has decreased — we have gained some information about the order of the deck.
According to Information Theory, this decrease in uncertainty is the definition of information.
Information: the reduction in uncertainty when we learn the outcome of an event.
The remarkable thing about Information Theory is that it defines information in a way that makes it quantifiable. We can calculate the amount of information we gain from turning over a card in a deck of cards (or any other system). A bit is the standard unit for measuring information.
To calculate the bits of information, we simply need to know the probabilities and then plug those probabilities into the information formula below. Given there are 52 cards in a deck (excluding the ad cards and the two Jokers), the probability of turning over any particular card is 1/52.
The section below is for those who like to see the math (if that’s not you, skip the mathy-looking part between the wavy lines):
~ ~ ~

To calculate the bits of information we gain in turning over a card in a well-shuffled deck, we plug 1/52 into the following formula:

$$I = -\log_2 p$$

Applying the formula to a deck of 52 cards:

$$I = -\log_2\left(\frac{1}{52}\right)$$

This simplifies to:

$$I = -\left(\log_2 1 - \log_2 52\right)$$

Using the property of logarithms ($\log_2 1 = 0$):

$$I = \log_2 52 \approx 5.7 \text{ bits}$$

The information content of an event, such as turning over a card, is quantified as the negative logarithm (base 2) of the probability of that event occurring.

~ ~ ~
This means if we are uncertain about which card will be revealed first from a well-shuffled deck, turning over the top card yields approximately 5.7 bits of information. Given our prior uncertainty, this number represents the information gained from the specific action of revealing the top card.
For an unshuffled deck, we know the order of the cards. The first card will be the Ace of Hearts. In other words, the probability of the first card being the Ace of Hearts is 1. Turning over the top card gives 0 bits of information.
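For a quick sanity check of those two numbers, here is the same calculation in Python:

```python
import math

# Information (in bits) gained from an outcome with probability p: -log2(p)
print(-math.log2(1 / 52))  # ~5.7 bits: top card of a well-shuffled deck
print(-math.log2(1 / 1))   # -0.0, i.e. zero bits: certain outcome, nothing learned
```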
The bits of information actually indicate the number of binary (yes/no) questions required, on average, to reduce the uncertainty from all possible outcomes to just one.
For example, for a well-shuffled deck, we need to ask, on average, about 5.7 binary questions. Since we can’t ask a fraction of a question, identifying any single card takes 5 or 6 of them (see the code sketch just after this list). Imagine the top card is the Seven of Hearts. We could ask the following 6 binary questions to get the answer:
Is the card red? — Yes
Is the card a heart? — Yes
Is the value of the card 7 or lower? — Yes
Is the value of the card 4 or lower? — No
Is the value of the card 6 or higher? — Yes
Is the card the Seven of Hearts? — Yes
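Here is a minimal Python sketch of that guessing game, assuming we simply label the cards 0 to 51 and always ask a question that splits the remaining candidates roughly in half:

```python
# Identify a hidden card (labelled 0..51) with yes/no questions,
# always splitting the remaining candidates roughly in half.
def questions_needed(secret_card, deck_size=52):
    low, high = 0, deck_size - 1
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1                 # ask: "is the card <= mid?"
        if secret_card <= mid:
            high = mid
        else:
            low = mid + 1
    return questions

counts = [questions_needed(card) for card in range(52)]
print(max(counts))       # 6 questions in the worst case
print(sum(counts) / 52)  # ~5.77 on average, close to the 5.7 bits of entropy
```

The worst case takes 6 questions because 2^5 = 32 < 52 ≤ 2^6 = 64, and the average lands slightly above 5.7 because no sequence of yes/no questions can split 52 cards into perfectly even halves all the way down.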
The Sum Up
With that math in hand, we can note three things.
1. Events with lower probabilities will result in more bits of information than events with higher probabilities.
2. But those probabilities will depend on what we are uncertain about.

For example, consider the same shuffled deck mentioned earlier. If instead of wondering about the specific card on top, we’re only uncertain whether the top card is red, the scenario changes. With half the cards being red and the other half black, the probability that the top card is red is 1/2. Turning over the top card in this context provides exactly 1 bit of information (not 5.7 bits).

Applied to our example, then, Information Theory makes information observer-dependent. The number of bits depends on what the observer already knows and what the observer is uncertain about (the sketch just after this list makes the contrast concrete).
3. The last thing to note is that information is entirely about the statistical properties of the event (flipping the card) and the outcome distribution (which card it is), rather than any meaning or significance the card’s identity might have.

For example, knowing whether the flipped card is the Ace of Hearts or the Two of Spades doesn’t change the calculation; it’s purely about how the outcome reduces uncertainty. The observer, in this case, might be the person calculating the entropy, the player in a game, or even a theoretical observer analysing the scenario. They assess the situation from the outside, focusing on the information flow (from unknown card to known card) rather than on what the card means contextually or intrinsically (as they would need to in a poker game, for example).
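To make point 2 concrete in code, here is a tiny sketch: the same flip of the same shuffled deck yields different amounts of information depending on the question the observer is asking (the suit question is my own added example):

```python
import math

# Same shuffled deck, same flip of the top card -- different uncertainties:
print(-math.log2(1 / 52))  # "Which exact card is it?" -> ~5.7 bits
print(-math.log2(1 / 4))   # "Which suit is it?"       ->  2.0 bits
print(-math.log2(1 / 2))   # "Is the card red?"        ->  1.0 bit
```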
We’ll revisit these points next week when we explore information according to IIT.
Thanks so much for reading this article.
I want to take a small moment to thank the lovely folks who have reached out to say hello and joined the conversation here on Substack.
If you'd like to do that, too, you can leave a comment, email me, or send me a direct message. I’d love to hear from you. If reaching out is not your thing, I completely understand. Of course, liking the article and subscribing to the newsletter also help the newsletter grow.
If you would like to support my work in more tangible ways, you can do that in two ways:
You can become a paid subscriber
or you can support my coffee addiction through the “buy me a coffee” platform.
I want to personally thank those of you who have decided to financially support my work. Your support means the world to me. It's supporters like you who make my work possible. So thank you.