how far can data compression go

  • dynamo
    New Member
    • Apr 2007
    • 51

    how far can data compression go

    I'm just wondering, what factor(s) make recursive compression algorithms stop. I mean, if you can compress compressed data, why can't you keep compressing until the data reaches its smallest possible size?
  • JosAH
    Recognized Expert MVP
    • Mar 2007
    • 11453

    #2
    Originally posted by dynamo
    I'm just wondering, what factor(s) make recursive compression algorithms stop. I mean, if you can compress compressed data, why can't you keep compressing until the data reaches its smallest possible size?
    You can't keep on compressing. A perfect compression algorithm leaves no redundancy in the data it has to compress, so it ends up at its minimal size after one compression round. The minimal size of some data d (its 'information content') is defined as -log2(P(d)) bits, where P(d) is the probability that the event d occurs.

    A sequence of random bits cannot be compressed (e.g. white noise). Another problem is what we consider an 'information unit'. Normally we use eight bits as the unit, but for a certain bit stream the natural unit could be nine bits long, or a prime number of bits long.

    Most compression algorithms are near optimal, so they can't be applied ad infinitum. For a nice introduction read this and this.
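
    As a concrete illustration of the -log2(P(d)) idea, here is a minimal Python sketch (added for illustration, not part of the original reply): it estimates the empirical entropy of a byte string, a lower bound on the average bits per byte for any lossless code that treats the bytes independently. Repetitive data scores far below 8 bits per byte; random bytes score close to 8.

    from collections import Counter
    from math import log2
    import os

    def empirical_entropy(data: bytes) -> float:
        """Average information content, -log2(P(byte)), in bits per byte."""
        counts = Counter(data)
        n = len(data)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    print(empirical_entropy(b"aaaaaaabbbc" * 100))   # about 1.2 bits per byte: very compressible
    print(empirical_entropy(os.urandom(4096)))       # close to 8 bits per byte: barely compressible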

    kind regards,

    Jos

    • dynamo
      New Member
      • Apr 2007
      • 51

      #3
      reply

      Thanks for replying. When you say one can't compress random data, what exactly do you mean by random data, and does this mean you can't compress data if you're looking at it at the binary level?

      • JosAH
        Recognized Expert MVP
        • Mar 2007
        • 11453

        #4
        Originally posted by dynamo
        Thanks for replying. When you say one can't compress random data, what exactly do you mean by random data, and does this mean you can't compress data if you're looking at it at the binary level?
        For white-noise random data every bit has probability P(0) == P(1) == 0.5. If all those bits are truly random you can't compress that bit stream. In jargon: the bit stream has maximal entropy.

        Compressing data increases the entropy per bit of the data. If the entropy is already maximal, you can't compress that data any further.
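
        One quick way to see this in practice (a Python sketch added for illustration, using the standard zlib module): a redundant stream shrinks dramatically, while an already high-entropy random stream does not shrink at all.

        import os, zlib

        redundant = b"abc" * 10_000        # very low entropy per byte
        noise = os.urandom(30_000)         # (pseudo)random, near-maximal entropy

        print(len(zlib.compress(redundant)))   # a tiny fraction of 30000
        print(len(zlib.compress(noise)))       # about 30000, typically a few bytes more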

        kind regards,

        Jos

        • bilibytes
          New Member
          • Jun 2008
          • 128

          #5
          Originally posted by JosAH
          For white-noise random data every bit has probability P(0) == P(1) == 0.5. If all those bits are truly random you can't compress that bit stream. In jargon: the bit stream has maximal entropy.

          Compressing data increases the entropy per bit of the data. If the entropy is already maximal, you can't compress that data any further.

          kind regards,

          Jos
          Hey you look very smart!
          Did you get your knowledge from a University or by googling?

          Congrats on your math knowledge!

          • JosAH
            Recognized Expert MVP
            • Mar 2007
            • 11453

            #6
            Originally posted by bilibytes
            Did you get your knowledge from a University or by googling?
            University; when I studied there (somewhere in the dark ages) there was no Google yet, nor was there any internet available to the public ...

            kind regards,

            Jos

            • RedSon
              Recognized Expert Expert
              • Jan 2007
              • 4980

              #7
              Originally posted by bilibytes
              Hey you look very smart!
              Did you get your knowledge from a University or by googling?

              Congrats on your math knowledge!
              Actually, if you saw him he would look very ugly! No, just kidding; Jos actually looks like a fluffy dog. But that is not to take away from the fact that he is very smart, or at least very good at thumbing through textbooks.

              • Predictor
                New Member
                • Mar 2009
                • 6

                #8
                Originally posted by dynamo
                I'm just wondering, what factor(s) make recursive compression algorithms stop. I mean, if you can compress compressed data, why can't you keep compressing until the data reaches its smallest possible size?
                All data compression schemes involve detecting patterns of some kind and reducing their stored size. Since real data typically contains substantial numbers of patterns, programs like ZIP can usually squeeze them at least a little.

                However, there is a limit to this ability. Think about it this way: 8 bits of data can store any of 256 possible configurations. No compression scheme could transform all of these 256 configurations to, say, 4 bits, since there are only 16 possible 4-bit configurations. In practice, compression works well in the usual case where there is some sort of pattern to detect. High-entropy random data has no such patterns and cannot, generally, be reduced in size.
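
                The counting (pigeonhole) argument can be checked directly; a tiny Python sketch, added here for illustration:

                # 256 possible 8-bit inputs, but only 255 bit strings shorter than
                # 8 bits (lengths 0 through 7), so no lossless scheme can shrink
                # every possible 8-bit input.
                inputs = 2 ** 8
                shorter_outputs = sum(2 ** k for k in range(8))
                print(inputs, shorter_outputs)   # 256 vs 255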


                -Will Dwinnell
                Data Mining in MATLAB

                • Banfa
                  Recognized Expert Expert
                  • Feb 2006
                  • 9067

                  #9
                  It should be pointed out that so far everyone has been talking about lossless data compression, that is, compression that, when reversed, produces exactly what was originally compressed.

                  With lossy compression it gets a bit more abstract because, as the name implies, you are actually discarding some of the data, either purposefully or through the algorithm. In this way you get the situation you have with JPEGs, where you get to choose your compression level, but the more you compress the worse the picture looks. Lossy compression is used in a fair number of places: JPEG, MP3, MPEG-2 and MPEG-4.

                  It is my belief that the most efficient lossy compression is fractal compression, which basically replaces a picture with a set of rules about how the different parts of the picture relate to each other. However, this is not in very common use because, I believe, it takes a very long time to perform the compression, although decompression is very quick.

                  MPEG standards (like MPEG-2 and MP3) discard data based on what most human beings can perceive; that is, humans do not perceive blue-green (I think) colour differences very well, so MPEG-2 uses only a small amount of data to encode that information, since small changes are imperceptible to the human eye. Similarly with audio, it tends to drop the very high and very low frequencies that a large number of humans can't hear.
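
                  As a rough illustration of why discarding imperceptible detail helps (a Python sketch added here; coarse quantisation stands in for what JPEG/MP3-style codecs do far more cleverly), throwing away the low-order detail makes the remaining data much more compressible:

                  import math, zlib

                  # a smooth 'signal' with a little low-level detail on top
                  samples = bytes(int(127 + 100 * math.sin(i / 10)) + (i * 37 % 7)
                                  for i in range(10_000))
                  quantized = bytes((s // 16) * 16 for s in samples)   # lossy: keep only 16 coarse levels

                  print(len(zlib.compress(samples)))     # detail kept: compresses less
                  print(len(zlib.compress(quantized)))   # detail discarded: compresses much better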

                  • JosAH
                    Recognized Expert MVP
                    • Mar 2007
                    • 11453

                    #10
                    I once found the 'ultimate lossless compression method': suppose I want to compress a sequence of characters (e.g. a book). Define the vertices v_1, v_2, ... v_m, where each vertex represents a character present in the book. Next draw a directed arc from v_s to v_n, where v_s is the first character of the book and v_n is the next character; do this for all following characters in the book.

                    This way I have created a multigraph; let v_s be the first character in the book and v_e the last character in the book. The sequence of characters in the book is represented by an Euler path in the multigraph (all edges used exactly once, starting at v_s and ending at v_e). All Euler paths can be lexicographically sorted and assigned a number, so all I need is a matrix with m rows and columns where each cell holds a number v_ij representing the number of edges starting at v_i and ending at v_j; the number of the Euler path plus this matrix is enough to represent the entire book, no matter the size of the book.

                    Here's the catch: that 'Euler number' takes a number of bits proportional to the size of the original book; the other parts are just the offspring of my hyper-intelligent brains ;-)
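
                    To put a rough scale on that catch (a back-of-the-envelope Python sketch added for illustration, not a proof): picking one ordering out of up to n! possible orderings of the n edges can take up to about log2(n!) bits, which for any book-sized n already works out to more bits per character than the plain 8-bit text.

                    from math import lgamma, log

                    def log2_factorial(n: int) -> float:
                        """log2(n!) via the log-gamma function: ln(n!) = lgamma(n + 1)."""
                        return lgamma(n + 1) / log(2)

                    for n in (1_000, 100_000, 1_000_000):          # n = characters (edges) in the 'book'
                        bits = log2_factorial(n)
                        print(n, round(bits), round(bits / n, 1))  # total bits and bits per character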

                    kind regards,

                    Jos
