Having a habit of compulsively wondering approximately every 34.765th day about how zip compression (bzip2 in this case) might be used to measure information contained in data – this time the question popped up in my head of whether or not and if then how permutation of a text’s words would affect its compression ratio. The answer is – it does – and it does so in a very weird and fascinating way.

Lo and behold James Joyce’s “A Portrait of the Artist as a Young Man” and its peculiar distribution of compression ratios …

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
library(stringr) library(ggplot2) compress <- function(str) { length(memCompress(str, type="bzip2")) / nchar(str) } calc_ratios <- function(title, N, sep=" ") { f <- paste("/media/Volume/temp/struc/",title,".txt",sep="") t <- scan(f, what=character()) y <- unlist(sapply(t, function(l) str_split(l,"[^a-zA-Z]"))) names(y) <- NULL y <- y[nchar(y)>1] ratios <- rep(NA, N) for(i in 1:N) { ratios[i] <- compress(paste(y[sample(length(y))],collapse=sep)) } results <- list("perms" = ratios, "original" = compress(paste(y,collapse=sep))) return(results) } res <- calc_ratios("portraitartistyoungman", N=10000) ggplot(data = data.frame(x = res[["perms"]])) + geom_histogram(aes(x = x),binwidth=.00001,stat="bin") + geom_vline(xintercept=res[["original"]], color="red") + labs(title="compression ratio of not permuted text in red", x="", y="") + theme(plot.title = element_text(vjust=1, size=12)) hist(res[["perms"]], breaks=100, main = "A Portrait of the Artist as a Young Man", xlab="distribution of compression ratios for word permutations with median in red", ylab="") abline(v=median(res[["perms"]]), col = "red") |

So obviously the compression ratio is indeed affected by word order and even comes along with a very nice looking distribution. But the best part is that the compression ratio of the original word order is effectively statistically mega-super-über-hyper-significant! The same is true for all the book’s texts I checked out and, I guess, it is true for every natural language text in the world. Intuitively I thought that maybe there is a distribution which is typical for text but from what I see, they are pretty different from content to content. It even seems to me that there is a resemblance of shape per author (“authorship attribution”). Though, Ulysses – the literary troublemaker of last century – is of course not playing along nicely. Nonetheless – Twain features a plateau, the other two Joyces a double hill and Dickens a bell-shape (see six-pack below).

# Compression Ratio as a Measure for Structure

The difference between a normal text and a permutation of it is of course that we kill its structure by doing so. This makes me wonder if it would be possible to measure a texts degree of structure by relating its original compression ratio to its distribution of compression ratios. F.x.:

A *low* DOS would indicate a *high* degree of structure because of the relatively high difference to the lower boundary of the distribution of compression ratios for randomly permuted word order. Looking at the DOS column of the table below this interpretation seems reasonable when we compare the degree f.x. for Dickens vs. Joyce. Assuming that Dickens narrates rather straightforward as opposed to Joyce who is notorious for his demanding literary style.

1 2 3 4 5 6 7 8 9 10 11 |
wc = word count original = compression ratio of original text (word order) dec_nth / median = nth decile / median of c.r. dist. for permuted word orderings title wc original dec_1st median dec_9th DOS Dubliners 65609 0.28288 0.31397 0.31446 0.31511 0.901 Ulysses 253927 0.32091 0.34607 0.34632 0.34663 0.927 Great Expectations 176096 0.26178 0.29518 0.29534 0.29556 0.887 David Copperfield 337291 0.25696 0.29184 0.292 0.29217 0.88 Huckleberry Finn 105441 0.26252 0.2999 0.30017 0.30056 0.875 Life on Mississippi 138867 0.27769 0.30453 0.30474 0.30534 0.912 |

# Technicalities

All texts are taken from Project Gutenberg – relieved of sections not belonging to the book itself – reduced to letters a to z and A to Z by replacing anything else with a space – the result is split at white spaces and all “words” of length 1 are discarded. This vector of words or string atoms is then permuted using sample().

The compression applied is of type bzip2 and performed using memCompress() with calculation of compression ratio being implemented as follows:

1 2 3 |
compress <- function(str) { length(memCompress(str, type="bzip2")) / nchar(str) } |

(original article published on www.joyofdata.de)

Twist? Maybe you meant Twain?

Must have twisted that … ;)

Interesting results, but as expected.

I’ve actually done something very similar (comparing random sorts of lines of my website) in http://www.gwern.net/Archiving%20URLs#sort—key-compression-trick but I didn’t use Bzip2, I used xz because that seems to be a smarter algorithm on text and I think that yields more meaningful results. What happens if you try xz? Or better yet, zpaq on maximal settings?

(I’ve also tried considering the length/difficulty of different fictional works by comparing not their wordcounts but their compressed sizes: http://www.gwern.net/Book%20reviews#textual-lengthcomplexity But I’m not sure that’s relevant to your points here.)

Hi Gwern,

thanks for your input!

Would you elaborate on why the original word ordering is

thatsignificant (consistently at level 10?¹? – to give a rule of thumb estimate) and what the shape of the distribution might depend on? Or is “expected” referring just to the result that word ordering affects compression?What makes xz and zpaq “smarter”? (not in detail, just an intuitive take on it)

As you suggest, the compression ratio itself is also for the given specimen related to the content’s difficulty. Because as you can take from the table the

originalcompression ratio is higher (less compression efficiency) for Dubliners and Ulysses than for Huckleberry Finn and David Copperfield. This observationIfound quite expectable which is why I didn’t mention it.BTW your site seems quite … interesting … almost mesmerizing … reminds me somehow of isomorphisms. Will definitely have a closer look!

Also interesting that you run mirrors for Silk Road … what’s on sale right now? ;)

> Would you elaborate on why the original word ordering is that significant (consistently at level 10?¹? – to give a rule of thumb estimate)

This doesn’t surprise me because from my own experiment I know the loss from randomizing order is substantial, and so running on a whole book of text could be expected to yield high p-values (how many lines is a book, 10k?).

> and what the shape of the distribution might depend on?

Now that I have no idea about.

> Or is “expected” referring just to the result that word ordering affects compression?

Yes.

> What makes xz and zpaq “smarter”? (not in detail, just an intuitive take on it)

It’s been a long time since I read Mahoney’s data compression book (_ Data Compression Explained_ http://mattmahoney.net/dc/dce.html) so offhand I don’t have any rigorous definition of how xz (LZMA https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) improves on bzip2, and how zpaq improves on xz. I think the short answer for the latter is ‘zpaq is crazy and allows for things like neural networks as part of the compression process’.

> Because as you can take from the table the original compression ratio is higher (less compression efficiency) for Dubliners and Ulysses than for Huckleberry Finn and David Copperfield. This observation I found quite expectable which is why I didn’t mention it.

I actually didn’t notice that, but that is inline with my expectations from comparing _Umineko_ to _Wheel of Time_ and Shakespeare – it certainly would be surprising if _Ulysses_ compressed better than the others! I’m surprised _Dubliners_ was the highest of the normal books since I don’t remember the language being complex.

> Also interesting that you run mirrors for Silk Road … what’s on sale right now? ;)

There’s a 2-for-1 sale on early-finalization scams which has been going gangbusters for a month or two now…

Well, that merely offers the mathematical potential but is not at all implying that you going to see that significance in the wild and consistently at that. When I look at those distributions and how far off the original CR is – I am totally fascinated by it, because it means there is a pattern in language so simple that bzip2 can pick up on it. That is actually also a reason why I think bzip2 is a good choice. Of course a compression algorithm applying NLP technology will show this performance – it would not be surprising at all. But bzip2 is – as you say – comparatively simple. That offers the potential to think about what is going on. An ANN will place a black box between you and insight.

Additionally the shapes of the distribution cue me into believing that there is going on something quite interesting underneath.