Huffman Encoding / Decoding

**JosAH** · Aug 5 '07, 06:29 AM

Originally posted by KWSW

The other thread got too long so decided to start a new one.
Got my huffman tree working and am now moving onto encoding and decoding my given text file.

My idea now is to print out all the leaf nodes into a string and split them up using tokens to get my ascii value and the huffman code and store into a collection for reference when encoding.

Just got some questions on this:

1. Is there anyway to look through the huffman tree for the ascii value since they are all now stuck in one node?

2. I was thinking of putting the codes into a map with the huffman code as the key and the ascii value as the map value since it will be sorted by the huffman code which is also determined by the frequency of the ascii value appearing. Thus less time is need to traverse the whole map. Would this be efficient or is there a better way?

Thanks :)

I don't know why you're talking about ASCII values. The leaf nodes represent a
character (*any* Unicode character) which is associated with a Huffman code.

A Huffman code for a character simply is a sequence of bits. Instead of printing
the character to the output, you print that bit sequence. That's the essence of
Huffman encoding; it's all about data compression.

You need a mapping from those Unicode characters to the bit sequences for
the compression or encoding part.

When you decompress such a file you should read bits from a bit sequence and
check whether or not you still have that bit sequence associated with a Unicode
character.

For a zero bit read, you take the route of a left child; go right when a 1 bit is
read. The moment you reach a leaf node you can emit the associated character.

kind regards,

Jos

**KWSW** · Aug 5 '07, 07:27 AM

About the ascii part, the assignment just requires the encoding of a text file thats why I use ascii value. Also the requirement is to print the huffman code as string instead of bits(not too sure why also).

Still not too sure about what you mean by

For a zero bit read, you take the route of a left child; go right when a 1 bit is
read. The moment you reach a leaf node you can emit the associated character.

For now I just created a huffmancode object with the ascii value and its huffman code representation and put them all into an array sorrting with the shortest code first. Thus for compression I just search the ascii value and get the huffman code for it.

Was thinking of doing it the same way but I guess its gonna be very inefficient...

**JosAH** · Aug 5 '07, 08:10 AM

Originally posted by KWSW

About the ascii part, the assignment just requires the encoding of a text file thats why I use ascii value. Also the requirement is to print the huffman code as string instead of bits(not too sure why also).

Still not too sure about what you mean by

Originally posted by JosAH

For a zero bit read, you take the route of a left child; go right when a 1 bit is
read. The moment you reach a leaf node you can emit the associated character.
For now I just created a huffmancode object with the ascii value and its huffman code representation and put them all into an array sorrting with the shortest code first. Thus for compression I just search the ascii value and get the huffman code for it.

Was thinking of doing it the same way but I guess its gonna be very inefficient...

Suppose you have the following little Huffman tree:

Code:

      .
     / \
    /\  C
   A  B

So the encoding is:

A: 00
B: 01
C: 1

Suppose you're reading the folling bits 010011, start at the root of the tree:

0: go left; 1: go right; you're at a leaf, a B was read. Start at the root again;
0: go left; 0: go right; you're at a leaf, an A was read. Start at the root again;
1: go right; you're at a leaf, a C was read. Start at the root again;
1: go right; you're at a lead, a C was read. Start at the root again;
the end of the bit stream was read so you have read BACC

You have to keep that tree somewhere if you want to decode efficiently.
The encoding part can be done with a Map<Character, String> in your case.

kind regards,

Jos

**KWSW** · Aug 5 '07, 10:04 AM

Hi josAH,

I happen to find ur set of huffman and would like to know it means by your add bit function.

[CODE=Java]protected void add(int bit)
{
code&= (1<<len)-1;
code<<= 1;
code|= bit;
len++;
}[/CODE]

I understand len++ is to add one to the code length but do not understand the rest. It seems my method of just adding "0" to the left and "1" to the right doesn't give me prefix codes....

edit:

ok i found out these out

&= "bitwise AND and assign"
<<= "shift bits left and assign"
|= "bitwise OR and assign"
<< "shift bits left"

**JosAH** · Aug 5 '07, 10:18 AM

Originally posted by KWSW

Hi josAH,

I happen to find ur set of huffman and would like to know it means by your add bit function.

[CODE=Java]protected void add(int bit)
{
code&= (1<<len)-1;
code<<= 1;
code|= bit;
len++;
}[/CODE]

Where did you find that code? Not to worry though, I don't claim any copyright.

kind regards,

Jos

**KWSW** · Aug 5 '07, 10:20 AM

Originally posted by JosAH

Where did you find that code? Not to worry though, I don't claim any copyright.

kind regards,

Jos

Found it over at the java forums and the user name was the same as urs...

anyway I decided to use the method that prints the bit into string for me to compare and this is what i got

Code:

c 5 00000 0
b 6 010000 10000
w 6 110000 110000
r 4 1000 1000
e 3 100 100
u 5 00010 10
l 5 10010 10010
i 4 1010 1010
s 4 0110 110
S 9 000001110 1110
T 10 0100001110 100001110
I 10 1100001110 1100001110

 8 10001110 10001110
k 9 001001110 1001110
j 9 101001110 101001110
G 10 0011001110 11001110
W 10 1011001110 1011001110
: 10 0111001110 111001110
F 10 1111001110 1111001110
, 6 101110 101110
d 5 11110 11110
a 4 0001 1
n 4 1001 1001
o 4 0101 101
g 6 001101 1101
p 6 101101 101101
m 6 011101 11101
M 11 00000111101 111101
z 11 10000111101 10000111101
V 12 001000111101 1000111101
O 12 101000111101 101000111101
( 13 0011000111101 11000111101
) 13 1011000111101 1011000111101
' 13 0111000111101 111000111101
- 13 1111000111101 1111000111101
N 10 0100111101 100111101
; 10 1100111101 1100111101
x 10 0010111101 10111101
K 13 0001010111101 1010111101
[ 13 1001010111101 1001010111101
] 13 0101010111101 101010111101
E 13 1101010111101 1101010111101
D 11 11010111101 11010111101
H 9 110111101 110111101
v 7 1111101 1111101
t 4 0011 11
. 8 00001011 1011
A 11 00010001011 10001011
P 11 10010001011 10010001011
q 11 01010001011 1010001011
B 11 11010001011 11010001011
C 10 0110001011 110001011
U 12 001110001011 1110001011
R 12 101110001011 101110001011
J 12 011110001011 11110001011
L 13 0111110001011 111110001011
Y 13 1111110001011 1111110001011
y 7 1001011 1001011
f 6 101011 101011
h 5 11011 11011
  3 111 111

where the first is the character, the 2nd is my huffmancode in string and the 3rd part is the bits after using integer.toBinar yString(int).

I could just modify my code by storing the huffman code in bits and using the integer.toBinar yString(int) to write to file for my assignment requirements but than I will not be learning anything...

**JosAH** · Aug 5 '07, 10:32 AM

Originally posted by KWSW

I could just modify my code by storing the huffman code in bits and using the integer.toBinar yString(int) to write to file for my assignment requirements but than I will not be learning anything...

It seems to look alright to me: the most frequently used letters are associated
with the shortest Huffman codes. As you can see, if the leftmost bit(s) of the
Huffman codes are 0, you can't tell the length of the code when you'd store it
as a simple int. You have to keep track of the bitlength of the code as I did in
my version.

Reading and writing bits from/to a stream is an exercise by itself.

kind regards,

Jos

**KWSW** · Aug 5 '07, 10:41 AM

Originally posted by JosAH

It seems to look alright to me: the most frequently used letters are associated
with the shortest Huffman codes. As you can see, if the leftmost bit(s) of the
Huffman codes are 0, you can't tell the length of the code when you'd store it
as a simple int. You have to keep track of the bitlength of the code as I did in
my version.

Reading and writing bits from/to a stream is an exercise by itself.

kind regards,

Jos

So this part has something to do with writing bits to the stream?

[CODE=java]protected void add(int bit)
{
code &= (1<<len)-1;
code <<= 1;
code |= bit;
len++;
}[/CODE]

**JosAH** · Aug 5 '07, 10:47 AM

Originally posted by KWSW

Just still clueless on this part:

[CODE=java]protected void add(int bit)
{
code &= (1<<len)-1;
code <<= 1;
code |= bit;
len++;
}[/CODE]

It's just bit fiddling (I'm good at that ;-) 'code' is the bit sequence so far and 'len'
is the length of the code so far. The parameter 'bit' is either one or zero. This
little method shifts in that bit on the right side of the code and the length is
incremented. The first line 'code&= (1<<len)-1' isn't actually needed but it makes
sure that only 'len' bits in the code are used before the new bit is shifted in.
Check your Java manuals for those bit manipulating operators.

I could've written 'code= code*2+bit' but I think my mind was in a bit fiddling mood
at that time ;-)

kind regards,

Jos

**KWSW** · Aug 5 '07, 11:38 AM

Originally posted by JosAH

It's just bit fiddling (I'm good at that ;-) 'code' is the bit sequence so far and 'len'
is the length of the code so far. The parameter 'bit' is either one or zero. This
little method shifts in that bit on the right side of the code and the length is
incremented. The first line 'code&= (1<<len)-1' isn't actually needed but it makes
sure that only 'len' bits in the code are used before the new bit is shifted in.
Check your Java manuals for those bit manipulating operators.

I could've written 'code= code*2+bit' but I think my mind was in a bit fiddling mood
at that time ;-)

kind regards,

Jos

thanks for explaining it more... makes sense now... managed to find the website from java.sun.com for bitwise operation. Will go through it once I get this assignment done... and I still have not started on hamming yet...

But still kinda stuck with the decoding... understood what you mean by "0" to the left and "1" to the right, but how would I do that since the end result of the treeset is the root node if I understand it correctly...

the logic i was thinking of:

if its "0" go to the left node else go to the right node.
At new node check if its the end of the branch.
If it is then get the symbol.
Else get the next bit and check for left or right node.

**JosAH** · Aug 5 '07, 11:58 AM

Originally posted by KWSW

But still kinda stuck with the decoding... understood what you mean by "0" to the left and "1" to the right, but how would I do that since the end result of the treeset is the root node if I understand it correctly...

the logic i was thinking of:

if its "0" go to the left node else go to the right node.
At new node check if its the end of the branch.
If it is then get the symbol.
Else get the next bit and check for left or right node.

Yep, that's exactly it. After you are ready with the construction phase, you have
that Huffman tree and the decoder must use that tree in the way your described
above.

Huffman compression is an 'off line' compression technique, i.e. first you have
to read the entire tex and build the tree before you can perform any compression
on the text.

Another disadvantage is that not only the compressor needs that tree, the de-
compressor needs it as well. How is the decompressor supposed to find that
tree? The basic answer is that the compressor must include the tree in the
compressed data; as the first data to be read actually because otherwise the
decompressor can't decompress anything.

If you're interested in 'on line' compression techniques read all about the LZW
algorithm.

kind regards,

Jos

Huffman Encoding / Decoding

Huffman Encoding / Decoding

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment