Abstract: In this paper, we present a new multiscale information content calculation method based on Shannon information (and Shannon entropy). The original method described by Claude E. Shannon and based on the logarithm of the probability of elements gives an upper limit to the information content of discrete patterns, but in many cases (for example, in the case of repeating patterns) it is inaccurate and does not approximate the true information content of the pattern well enough. The new mathematical method presented here provides a more accurate estimate of the (internal) information content of any discrete pattern based on Shannon's original function. The method is tested on different data sets and the results are compared with the results of other methods like compression algorithms.
1 Introduction
Traditionally, Shannon's information theory [13] has been used to measure the information content of samples. Shannon information, as defined by Claude E. Shannon, is the degree of uncertainty or surprise associated with a given outcome in a set of possible outcomes. Shannon entropy, the expected value of Shannon information, quantifies the average information content of a discrete sample or message. It is a basic concept of information theory and is widely used in communication systems and data compression.
In some situations, such as repeated patterns, Shannon's original information measure does not give sufficiently accurate results because it does not take the structure of the patterns into account; it only looks at certain statistical characteristics of them. To solve this problem, this paper presents a new multiscale information content calculation method based on Shannon's original principles. By refining the computational approach, our method offers a more accurate estimate of the internal information content of discrete samples, regardless of their nature.
There are several other methods for measuring the information content of patterns, such as Kolmogorov complexity [8], randomness [9], and compression complexity. The common property of these methods is that they are all suitable for determining and understanding the information content of patterns with some accuracy, and they therefore provide a suitable basis of comparison for checking newer methods.
To verify the effectiveness of our new method, it is applied to various data sets and compared with compression algorithms. The results show that our proposed method based on Shannon information closely approximates the results measured by other methods while taking a completely different approach.
2 Patterns
In this study, we deal with the calculation of the internal quantitative information content of discrete patterns. From the point of view of calculating the information content, the nature of the measured object is irrelevant. The information content of events, signals, system states, or data sequences can all be calculated, since their models (with finite precision) can all be represented as discrete patterns. By moving along a spatial pattern we obtain a temporal pattern and vice versa; therefore, we do not distinguish between spatial and temporal patterns. The basic notation is as follows.
Denote by $M\left(R\right)$ the set of finite sequences that can be generated from the set $R$:
$$\begin{array}{cc}M\left(R\right)=\{X:{N}^{+}\to R\}& (1)\end{array}$$
Let us call the finite sequence $X\in M\left(R\right)$ a pattern:
$$\begin{array}{cc}X=\left[{x}_{1},\mathrm{...},{x}_{N}\right]& (2)\end{array}$$
Denote the length of the series $X$ by:
$$\begin{array}{cc}n\left(X\right)=N& (3)\end{array}$$
Denote the set of possible values of the series $X$ by:
$$\begin{array}{cc}R={R}_{X}=\{{r}_{1},{r}_{2},\mathrm{...},{r}_{K}\}& (4)\end{array}$$
Let $f\left(x\right)$ denote the number of occurrences of $x\in {R}_{X}$ in the series $X$:
$$\begin{array}{cc}f\left(x\right)={\sum}_{i=1}^{N}\left[{x}_{i}=x\right]& (5)\end{array}$$
Let the relative frequency of any element $x\in R$ of the pattern $X$ be:
$$\begin{array}{cc}p\left(x\right)=f\left(x\right)/N& (6)\end{array}$$
Denote the concatenation of the patterns ${X}_{1}{X}_{2}\mathrm{...}{X}_{K}$ as:
$$\begin{array}{cc}{X}_{1}{X}_{2}\mathrm{...}{X}_{K}=\underset{i=1}{\overset{K}{\parallel}}{X}_{i}& (7)\end{array}$$
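For concreteness, the notation above can be sketched in Python (an illustrative sketch of ours, not part of the method; the variable names are our own):

```python
# Illustrative encoding of the notation of Section 2 (names are ours, not the paper's).
X = list("ABAB")                    # a pattern X in M(R), Eq. (2)
N = len(X)                          # n(X), the length of the series, Eq. (3)
R_X = set(X)                        # the value set R_X, Eq. (4)
f = {r: X.count(r) for r in R_X}    # number of occurrences f(x), Eq. (5)
p = {r: f[r] / N for r in R_X}      # relative frequency p(x), Eq. (6)
XY = X + list("CDCD")               # concatenation of patterns, Eq. (7)
```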
3 Information content
The information content can be interpreted intuitively when only the interpretable information content is examined [1]. In this study we examine the amount of the entire internal information content without interpreting it or considering the context.
The information content of a pattern can be characterized by the improbability of individual elements of the pattern (Shannon information [13]), by the length of the most concise description of the pattern (Kolmogorov complexity [8]), or by the degree of randomness of the pattern [9].
A fundamental difference between Shannon's and Kolmogorov's viewpoints is that Shannon considered only the probabilistic characteristics of the random source that created the pattern, ignoring the pattern itself, whereas Kolmogorov focused only on the pattern itself [5]. In their definition, Kolmogorov and Chaitin called the pattern with the maximum information content (somewhat inaccurately) random [10].
Information, complexity and randomness have such similar properties that we can reasonably assume they are essentially approaches to the same thing by different methods. It is sufficient to consider that the Shannon information, the Kolmogorov complexity and the randomness of a pattern consisting of identical elements are all minimal, while for a true random pattern all three values are maximal, and all three assign the highest information value to data sets with maximum entropy [1].
The concepts of entropy and information are often confused [3], so it is important to mention that entropy can also be understood as the average information content per element.
Approached intuitively, the amount of information is a function for which the following conditions are met:
- The information content of a pattern with zero length or consisting of identical elements is zero.
- The information content of a pattern consisting of repeating sections is (almost) identical to the information content of the repeating section.
- A pattern and its reflection have the same information content.
- The sum of the information content of patterns with disjoint value sets is smaller than the information content of the concatenated pattern.
- The information content of true random patterns is almost maximal.
Let the information content be a function $I$ that assigns a nonnegative real number to any pattern $X\in M\left(R\right)$:
$$\begin{array}{cc}I:M\left(R\right)\to {R}^{+}& (8)\end{array}$$
In addition, the following conditions are met:
- $I\left(X\right)=0\leftrightarrow \left|{R}_{X}\right|<2$
- $I\left(\underset{i=1}{\overset{K}{\parallel}}X\right)=I\left(X\right)$
- $I\left(\underset{i=1}{\overset{K}{\parallel}}{X}_{i}\right)=I\left(\underset{i=K}{\overset{1}{\parallel}}{X}_{i}\right)$
- $\underset{i=1}{\overset{K}{\cap}}{R}_{{X}_{i}}=\varnothing \Rightarrow I\left(\underset{i=1}{\overset{K}{\parallel}}{X}_{i}\right)>\sum _{i=1}^{K}I\left({X}_{i}\right)$
- $I\left(X\right)\le I\left({X}_{TR}\right),\forall X\in M\left(R\right),n\left(X\right)=n\left({X}_{TR}\right)$, where ${X}_{TR}\in M\left(R\right)$ is a true random pattern.
Since any pattern can be described in nondecomposable binary form, the unit of information content should be the bit.
It can be seen that for any pattern $X\in M\left(R\right)$, if $N=n\left(X\right)$ and $K=\left|R\right|$, then the maximum information content of $X$ is:
$$\begin{array}{cc}{I}_{MAX}\left(X\right)=N\cdot {\log}_{2}\left(K\right)& (9)\end{array}$$
That is, $I\left(X\right)\le {I}_{MAX}\left(X\right)$ for any pattern $X\in M\left(R\right)$. In the case of a binary pattern, ${I}_{MAX}\left(X\right)=N$, the length of the pattern, which means that at most $N$ bits of information (decisions) are required to describe the pattern.
If the maximum information content is known, the relative information content can be calculated:
$$\begin{array}{cc}{I}^{\left(rel\right)}\left(X\right)=I\left(X\right)/{I}_{MAX}\left(X\right)& (10)\end{array}$$
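Eqs. (9)-(10) can be sketched as follows (our illustrative code, with names of our own choosing):

```python
import math

def i_max(X, K=None):
    """Eq. (9): I_MAX(X) = N * log2(K), the maximum information content in bits."""
    if K is None:
        K = len(set(X))              # size of the value set R_X
    return len(X) * math.log2(K)

def i_rel(i_value, X, K=None):
    """Eq. (10): relative information content I(X) / I_MAX(X)."""
    return i_value / i_max(X, K)

# A binary pattern of length 48 can carry at most 48 bits.
```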
4 Shannon information
In theory, Kolmogorov complexity would provide a better approximation of the information content of patterns, but it has been proven to be uncomputable [5], in contrast to Shannon information [13], which can be calculated efficiently but approximates the actual information content less well. Shannon information calculates the information content of a pattern from the expected probability of occurrence (relative frequency) of the elements of the pattern.
The Shannon information of an arbitrary pattern $X\in M\left(R\right)$:
$$\begin{array}{cc}{I}_{S}\left(X\right)={\sum}_{i=1}^{N}{\log}_{2}\left(\frac{1}{p\left({x}_{i}\right)}\right)& (11)\end{array}$$
Since the relative frequency (expected occurrence) of the elements is only one statistical characteristic of the pattern and does not take the order of the elements into account, Shannon information often gives a very inaccurate estimate of the information content.
The value of the Shannon information is the same for all patterns of the same length whose elements have the same relative frequencies. If $X\in M\left(R\right)$, $Y\in M\left(Q\right)$ and $\left|R\right|=\left|Q\right|=K$, then it holds that:
$$\begin{array}{cc}{I}_{S}\left(X\right)={I}_{S}\left(Y\right),\phantom{\rule{1ex}{0ex}}if\phantom{\rule{1ex}{0ex}}\{p\left({r}_{1}\right),p\left({r}_{2}\right),\mathrm{...},p\left({r}_{K}\right)\}=\{p\left({q}_{1}\right),p\left({q}_{2}\right),\mathrm{...},p\left({q}_{K}\right)\}& (12)\end{array}$$
Shannon information ignores the structure of the patterns at different scales, the laws encoded in them, and therefore overestimates the information content of patterns consisting of repeating sections.
The problem can be illustrated with a simple example. Let us calculate the Shannon information of the following three patterns:
 ${X}_{A}:\phantom{\rule{0ex}{0ex}}001101101010111001110010001001000100001000010000$
 ${X}_{B}:\phantom{\rule{0ex}{0ex}}101010101010101010101010101010101010101010101010$
 ${X}_{C}:\phantom{\rule{0ex}{0ex}}111111110000000011111111000000001111111100000000$
In all three cases, the value set is $R=\{0,1\}$ and the relative frequencies of the two elements are (at least approximately) $p\left(0\right)=p\left(1\right)=0.5$, so the Shannon information is ${I}_{S}\left(X\right)=\sum _{i=1}^{N}{\log}_{2}\left(\frac{1}{p\left({x}_{i}\right)}\right)\approx 48\phantom{\rule{1ex}{0ex}}bit$, although it is obvious that the information content of the three patterns differs significantly. Due to its randomness, the information content of pattern ${X}_{A}$ is indeed almost 48 bits, while the information content of the other two patterns is much smaller, as they contain repeated sections. In pattern ${X}_{B}$, for example, the 2-bit section $\left[10\right]$ is repeated, which means that its information content is closer to 2 bits.
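The example can be checked numerically. The following Python sketch (ours, not the authors' reference code) implements Eq. (11) directly; for the 48-element patterns ${X}_{B}$ and ${X}_{C}$, whose symbol frequencies are exactly balanced, it yields 48 bits each:

```python
import math
from collections import Counter

def shannon_info(X):
    """Eq. (11): I_S(X) = sum over elements of log2(1 / p(x_i))."""
    N = len(X)
    p = {s: c / N for s, c in Counter(X).items()}
    return sum(math.log2(1.0 / p[x]) for x in X)

X_B = "10" * 24                 # the 2-bit section [10] repeated 24 times
X_C = ("1" * 8 + "0" * 8) * 3   # the section [1111111100000000] repeated 3 times
# Both have p(0) = p(1) = 0.5, so their Shannon information is identical,
# even though X_B is essentially a single 2-bit section repeated.
```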
The problem is that in the example above we examine the patterns at the elementary level, and the Shannon information does not take into account the larger-scale structure of the pattern, such as the presence of repeating sections longer than one element. It is therefore natural to develop methods that are based on the Shannon entropy but analyze the data series over the entire range of resolutions, i.e. over the entire frequency range, and thus provide a more accurate approximation of the information content. Many such solutions have already been published; see, for example, the articles [2] and [6]. This article presents an additional method.
5 SSM information
5.1 Shannon information spectrum
Let the pattern $X$ be partitioned into sections of length $r$, where $m=\left[N/r\right]$:
$$\begin{array}{cc}{X}^{\left(r\right)}=\left[{x}_{1}\mathrm{...}{x}_{r},{x}_{r+1}\mathrm{...}{x}_{2r},\phantom{\rule{0ex}{0ex}}\mathrm{...},\phantom{\rule{0ex}{0ex}}{x}_{\left(m-1\right)\cdot r+1}\mathrm{...}{x}_{m\cdot r}\right]& (13)\end{array}$$
Let the following series be denoted as the Shannon information spectrum (SP) of the pattern $X$:
$$\begin{array}{cc}{I}_{SP}^{\left(r\right)}\left(X\right)={I}_{S}\left({X}^{\left(r\right)}\right),\phantom{\rule{0ex}{0ex}}r=1,\mathrm{...},\left[N/2\right]& (14)\end{array}$$
From the sequences ${X}^{\left(r\right)}$ we omit truncated partitions, i.e. those shorter than $r$. For $r>\left[N/2\right]$, ${I}_{SP}\left({X}^{\left(r\right)}\right)=0$, so these values are also omitted from the spectrum.
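A minimal sketch of Eqs. (13)-(14) in Python (our illustrative code; helper names are our own):

```python
import math
from collections import Counter

def shannon_info(X):
    """Eq. (11), applied to any sequence of hashable elements."""
    N = len(X)
    p = {s: c / N for s, c in Counter(X).items()}
    return sum(math.log2(1.0 / p[x]) for x in X)

def sections(X, r):
    """Eq. (13): X^(r), consecutive sections of length r; the truncated tail is omitted."""
    m = len(X) // r
    return [X[i * r:(i + 1) * r] for i in range(m)]

def info_spectrum(X):
    """Eq. (14): the Shannon information spectrum I_SP^(r), r = 1..floor(N/2)."""
    return [shannon_info(sections(X, r)) for r in range(1, len(X) // 2 + 1)]
```

For the repeating pattern "10"*24 the spectrum drops to zero at resolution r = 2, where every section is identical.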
Figure 1. Diagram $A$ shows the Shannon information spectrum of the random pattern ${X}_{A}$, and diagram $B$ that of the repeating pattern ${X}_{C}$ (Appendix I). It can be seen that in case $B$, lower values appear at certain frequencies.
5.2 Maximal Shannon information spectrum
The Shannon information spectrum is maximal in the case of random data sets. Let the following formula be denoted as the maximum Shannon information spectrum (SMS):
$$\begin{array}{cc}{I}_{SMS}^{\left(r\right)}\left(X\right)=m\cdot {\log}_{2}\left(min\left({K}^{r},m\right)\right),\phantom{\rule{1ex}{0ex}}r=1,\mathrm{...},\left[N/2\right]& (15)\end{array}$$
${I}_{SMS}^{\left(r\right)}\left(X\right)$ is a supremum for all information spectra of patterns with the same value set and pattern length. If ${K}^{r}<m$, then in the case of random patterns the partitioning most likely contains all possible partitions, so the information content is approximately $m\cdot {\log}_{2}\left({K}^{r}\right)$. If ${K}^{r}>m$, then the partitioning cannot contain all possible partitions, and each partition will most likely be unique, so the information content will be $m\cdot {\log}_{2}\left(m\right)$. If $r$ is small enough, then the series ${X}^{\left(r\right)}$ most likely contains all possible partitions; therefore, for random data sets the measured amount of information approximately equals the maximum possible information content of the pattern, i.e. if $r$ is small, then ${I}_{SP}^{\left(r\right)}\left(X\right)\approx {I}_{MAX}\left(X\right)=N\cdot {\log}_{2}\left(K\right)$.
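Eq. (15) can be sketched directly (our illustrative code); note that it depends only on the pattern length $N$ and the value-set size $K$:

```python
import math

def max_spectrum(N, K):
    """Eq. (15): I_SMS^(r) = m * log2(min(K**r, m)) with m = floor(N/r),
    for r = 1..floor(N/2); a supremum of the Shannon information spectrum."""
    out = []
    for r in range(1, N // 2 + 1):
        m = N // r
        out.append(m * math.log2(min(K ** r, m)))
    return out
```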
Figure 2. Comparison of the maximum Shannon information spectrum (${I}_{SMS}$) and the Shannon information spectrum (${I}_{SP}$) of the repeating pattern ${X}_{C}$.
5.3 Shannon normalized information spectrum
If we are interested in how large the information content appears relative to the maximum value at each resolution, we can normalize the spectrum with the maximum spectrum to the range $\left[0,N\cdot {\log}_{2}\left(K\right)\right]$. Let the following sequence be denoted as the Shannon normalized information spectrum (SNS):
$$\begin{array}{cc}{I}_{SNS}^{\left(r\right)}\left(X\right)=\{\begin{array}{cc}\frac{{I}_{SP}^{\left(r\right)}\left(X\right)}{{I}_{SMS}^{\left(r\right)}\left(X\right)}\cdot {I}_{MAX}\left(X\right),& if\phantom{\rule{1ex}{0ex}}\left|{R}_{{X}^{\left(r\right)}}\right|>1\\ r\cdot \frac{{I}_{SP}^{\left(1\right)}\left(X\right)}{N},& if\phantom{\rule{1ex}{0ex}}\left|{R}_{{X}^{\left(r\right)}}\right|=1\end{array}\phantom{\rule{20px}{0ex}}where\phantom{\rule{1ex}{0ex}}r=1,\mathrm{...},\left[N/2\right]& (16)\end{array}$$
If the value set of the partitioning has only one element, i.e. $\left|{R}_{{X}^{\left(r\right)}}\right|=1$, the normalized value would be $0$. In this case the information content should be the information content of the repeating partition, so the average Shannon information per element at the elementary resolution is multiplied by the length of the partition: $r\cdot \frac{{I}_{SP}^{\left(1\right)}\left(X\right)}{N}$.
Figure 3. Comparison of the Shannon normalized information spectra (SNS) of patterns from very different sources. The vertical axis represents the amount of information in bits measured at the given resolution. It can be seen how different the spectra of the different patterns are, but in most cases there is a resolution where the information content shows a definite minimum. The minima are marked with arrows. A: random binary pattern, B: binary pattern with repeating sections, C: DNA section, D: English text, E: ECG signal, F: audio recording containing speech, G: evolution of the number of sunspots between 1700-2021, H: seismogram, I: Lena's photo.
The figures show that different types of patterns have very different and characteristic spectra. This suggests that the type or source of the pattern may be inferred from the nature of the spectrum, but we do not deal with this in this study.
5.4 SSM information
We know that the Shannon information gives an upper estimate in all cases, so we obtain the most accurate approximation of the information content from the normalized spectrum by taking its minimum. Let the information content calculated from the normalized spectrum be denoted as the Shannon spectrum minimum information (SSM information):
$$\begin{array}{cc}{I}_{SSM}\left(X\right)=\underset{r=1,\mathrm{...},\left[N/2\right]}{min}\left({I}_{SNS}^{\left(r\right)}\left(X\right)\right)& (17)\end{array}$$
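Putting Eqs. (13)-(17) together, a compact end-to-end sketch (our illustrative implementation, not the authors' reference code) shows how the repetition in ${X}_{B}$ is detected:

```python
import math
from collections import Counter

def shannon_info(X):
    """Eq. (11): Shannon information of a sequence of hashable elements."""
    N = len(X)
    p = {s: c / N for s, c in Counter(X).items()}
    return sum(math.log2(1.0 / p[x]) for x in X)

def ssm_info(X, K=None):
    """Eq. (17): the minimum of the Shannon normalized information spectrum."""
    N = len(X)
    K = len(set(X)) if K is None else K
    if K < 2 or N == 0:
        return 0.0                        # zero length or identical elements
    i_max = N * math.log2(K)              # Eq. (9)
    i_sp1 = shannon_info(X)               # I_SP^(1), used by the |R| = 1 branch
    sns = []
    for r in range(1, N // 2 + 1):
        m = N // r
        secs = [X[i * r:(i + 1) * r] for i in range(m)]    # Eq. (13)
        if len(set(secs)) == 1:           # single repeating partition, Eq. (16)
            sns.append(r * i_sp1 / N)
        else:
            i_sp = shannon_info(secs)                      # Eq. (14)
            i_sms = m * math.log2(min(K ** r, m))          # Eq. (15)
            sns.append(i_sp / i_sms * i_max)               # Eq. (16)
    return min(sns)                       # Eq. (17)

X_B = "10" * 24
# The repetition of the 2-bit section [10] is found at r = 2, so
# I_SSM(X_B) = 2 bit, while the plain Shannon information is 48 bit.
```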
Shannon information, SSM information and compression complexity of different patterns (Appendix I) in bits:
[Table values were lost in extraction. Surviving column headers: ${I}_{MAX}\left(X\right)$, ${I}_{SSM}\left(X\right)$, ${I}_{ZIP}\left(X\right)$, ${I}_{ZPAQ}\left(X\right)$. Surviving row labels: repeating binary patterns, duplicate text with one character error, DNA segment of COVID virus, random string (0-9, a-z, A-Z), English text (James Herriot's Cat Stories), solar activity between 1700-2021 (A-Z).]
Table 1. Comparison of SSM information and compression complexity of different patterns.
Relative Shannon information, SSM information, and compression complexity of different patterns (Appendix I) compared to maximum information:
[Table values were lost in extraction. Surviving column headers: ${I}_{S}^{\left(rel\right)}\left(X\right)$ %, ${I}_{SSM}^{\left(rel\right)}\left(X\right)$ %, ${I}_{ZIP}^{\left(rel\right)}\left(X\right)$ %, ${I}_{7Z}^{\left(rel\right)}\left(X\right)$ %, ${I}_{ZPAQ}^{\left(rel\right)}\left(X\right)$ %.]
Table 2. Comparison of relative SSM information and relative compression complexity of different patterns.
It can be seen from the table that SSM information gives results similar to those of the compression algorithms. In general, the more computationally demanding a compression or information measurement procedure is, the closer its result is to the Kolmogorov complexity. In the examined examples, the results of SSM information are usually located between the results of ZIP and 7Z, which suggests that the effective complexity of SSM information is similar to that of ZIP and 7Z.
Figure 4. Comparison of the results of different information measurement methods.
Figure 5. Comparison of the average results of different information measurement methods.
5.5 Comparison with computational complexity
If we do not know the signal set of the signal sequence, the first step is to determine the number of distinct signals occurring in the sequence, which has an asymptotic complexity of $O(N\cdot logN)$.
Determining the Shannon information consists of two steps. In the first step, we determine the frequency of the signals, which has a complexity of $O(N)$; in the second step, we sum the entropy of each signal, so the total complexity of the Shannon information is $O(N\cdot logN)+O(N)=O(N\cdot logN)$.
For the ZIP, 7Z and ZPAQ algorithms used to calculate the compression complexity, the complexity is usually between $O(N)$ and $O(N\cdot logN)$, but for ZPAQ it may be greater [7][12][11].
In the case of SSM information, the first step is also to determine the frequency of the signals, which has a complexity of $O(N)$. In the second step, the Shannon information spectrum is calculated with complexity $O(N)+O(N/2)+O(N/3)+\mathrm{...}+O(2)=O(N\cdot logN)$; finally, the minimum of the spectrum is determined with complexity $O(N)$. The worst-case complexity of calculating the SSM information is therefore $O(N\cdot logN)+O(N)+O(N\cdot logN)+O(N)=O(N\cdot logN)$, which is identical to that of the compression algorithms.
5.6 Known issues
All methods of calculating the amount of information have inaccuracies. One of the problems with SSM information is that if the repetition in a repeating pattern is not perfect, the value of the SSM information is larger than expected, as shown in the example below.
[Table values were lost in extraction. The column ${I}_{SSM}\left(X\right)$ [bit] compared the patterns "123456789 123456789 123456789" and "223456789 123456789 123456789".]
Table 3. One element change can cause a notable difference in SSM information.
6 Conclusion
It can be shown that SSM information can determine the information content of patterns with an accuracy comparable to that of compression algorithms, while remaining simple. In addition, the information spectrum presented here provides a useful visual tool for studying the information structure of patterns in the frequency domain.
References
- Scoville, John, "Fast Autocorrelated Context Models for Data Compression" (2013).
- Laszlo Lovasz, Complexity of Algorithms (Boston University, 2020).
- Ben-Naim, Arieh, "Entropy and Information Theory: Uses and Misuses", Entropy (2019).
- Pieter Adriaans, "Facticity as the amount of self-descriptive information in a data set" (2012).
- Juha Karkkainen, "Fast BWT in small space by blockwise suffix sorting", Theoretical Computer Science (2007).
- A. N. Kolmogorov, "On tables of random numbers", Mathematical Reviews (1963).
- Laszlo Lovasz, "Information and Complexity (How To Measure Them?)", The Emergence of Complexity in Mathematics, Physics, Chemistry and Biology, Pontifical Academy of Sciences (1996).
- Anne Humeau-Heurtier, "The Multiscale Entropy Algorithm and Its Variants: A Review", Entropy (2015).
- Allen, Benjamin and Stacey, Blake and Bar-Yam, Yaneer, "Multiscale Information Theory and the Marginal Utility of Information", Entropy (2017).
- Goldberger, A. and Amaral, L. and Glass, L. and Hausdorff, J. and Ivanov, P. C. and Mark, R. and Stanley, H. E., "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals", Circulation (2000).
- Markus Mauer, Timo Beller, Enno Ohlebush, "A Lempel-Ziv-style Compression Method for Repetitive Texts" (2017).
- Grunwald, Peter and Vitanyi, Paul, "Shannon Information and Kolmogorov Complexity", CoRR (2004).
- Claude E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal (1948).
- Ervin Laszlo, Introduction to Systems Philosophy (Routledge, 1972).
- Olimpia Lombardi and Federico Holik and Leonardo Vanni, "What is Shannon information?", Synthese (2015).

Appendix I
A pattern or a detail of each pattern used in the measurements, with its description where it survived extraction:

Random binary pattern:
001101101010111001110010001001000100001000010000

Repeating binary pattern:
101010101010101010101010101010101010101010101010

Repeating binary pattern:
111111110000000011111111000000001111111100000000

The sky is blue. The sky is blue. The sky is blue.

Duplicate text with one character error:
The sky is blue. The sky is glue. The sky is blue.

DNA segment of COVID virus:
cagtttctagctatattagcgggcacgactccactgcgcctatgcggaag
cttgatcaaattttgaccagatcttaggtaacctgaacaagtcagttcgt
aggcgtcgattggccgacgggtgcgaagaaaaaagtgatcgttgtccaac
atctctagtacccaccgttgtgatgtacgttatacggacacgagcatatt
cggcagtgaggacaatcagacaactactattcaaacaattgttgaggttc
aacctcaattagagatggaacttacaccagttgttcagactattgaagtg
aatagttttagtggttatttaaaacttactgacaatgtatacattaaaaa
tgcagacattgtggaagaagctaaaaaggtaaaaccaacagtggttgtta

Random string (0-9, a-z, A-Z):
EK8Pi5sv2npTfzoaMNp87QtT5kbIUQkTJzHwICCstSmg4aksHT
MwztgHFg3j8AoIobN3FycCLidGeyROiNyG5itB9kxyez1LZjFF
HIBjipE7hidZyiJmilXM0mwnxzlzWSfQ0xP1OuFpWosMwS1cjY
t4nyv4ONx1FceWkAf8SdvDGZVzeVzq2EmOqRF6Im2iudcYRswj

English text (James Herriot's Cat Stories):
I think it was the beginning of Mrs. Bond's
unquestioning faith in me when she saw me
quickly enveloping the cat till all you could
see of him was a small black and white head
protruding from an immovable cocoon of cloth.

Solar activity between 1700-2021 (A-Z):
ABCDFIEDBBAAAABEHJJGEEDBDGMSPLHFBACFKMRPLGDCA[...]

My name is Joe. That is what my colleague,
Milton Davidson, calls me. He is a programmer and
I am a computer program. [...]

1011000100110011101110111011001100110011[...]

110000101000000011000010100000001100001010000[...]

0101001001001001010001100100011011100100[...]

Lena (256x256 pixel, grayscale):
1010001010100001101000001010001010100011[...]

Written by:
Zsolt Pocze
Volgyerdo Nonprofit Kft., Nagybakonak, HUN, 2023