Huffman Encoding / Decoding

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • KWSW
    New Member
    • May 2007
    • 72

    Huffman Encoding / Decoding

    The other thread got too long so decided to start a new one.
    Got my huffman tree working and am now moving onto encoding and decoding my given text file.

    My idea now is to print out all the leaf nodes into a string and split them up using tokens to get my ascii value and the huffman code and store into a collection for reference when encoding.

    Just got some questions on this:

    1. Is there anyway to look through the huffman tree for the ascii value since they are all now stuck in one node?

    2. I was thinking of putting the codes into a map with the huffman code as the key and the ascii value as the map value since it will be sorted by the huffman code which is also determined by the frequency of the ascii value appearing. Thus less time is need to traverse the whole map. Would this be efficient or is there a better way?

    Thanks :)
  • JosAH
    Recognized Expert MVP
    • Mar 2007
    • 11453

    #2
    Originally posted by KWSW
    The other thread got too long so decided to start a new one.
    Got my huffman tree working and am now moving onto encoding and decoding my given text file.

    My idea now is to print out all the leaf nodes into a string and split them up using tokens to get my ascii value and the huffman code and store into a collection for reference when encoding.

    Just got some questions on this:

    1. Is there anyway to look through the huffman tree for the ascii value since they are all now stuck in one node?

    2. I was thinking of putting the codes into a map with the huffman code as the key and the ascii value as the map value since it will be sorted by the huffman code which is also determined by the frequency of the ascii value appearing. Thus less time is need to traverse the whole map. Would this be efficient or is there a better way?

    Thanks :)
    I don't know why you're talking about ASCII values. The leaf nodes represent a
    character (*any* Unicode character) which is associated with a Huffman code.

    A Huffman code for a character simply is a sequence of bits. Instead of printing
    the character to the output, you print that bit sequence. That's the essence of
    Huffman encoding; it's all about data compression.

    You need a mapping from those Unicode characters to the bit sequences for
    the compression or encoding part.

    When you decompress such a file you should read bits from a bit sequence and
    check whether or not you still have that bit sequence associated with a Unicode
    character.

    For a zero bit read, you take the route of a left child; go right when a 1 bit is
    read. The moment you reach a leaf node you can emit the associated character.

    kind regards,

    Jos

    Comment

    • KWSW
      New Member
      • May 2007
      • 72

      #3
      About the ascii part, the assignment just requires the encoding of a text file thats why I use ascii value. Also the requirement is to print the huffman code as string instead of bits(not too sure why also).

      Still not too sure about what you mean by
      For a zero bit read, you take the route of a left child; go right when a 1 bit is
      read. The moment you reach a leaf node you can emit the associated character.
      For now I just created a huffmancode object with the ascii value and its huffman code representation and put them all into an array sorrting with the shortest code first. Thus for compression I just search the ascii value and get the huffman code for it.

      Was thinking of doing it the same way but I guess its gonna be very inefficient...

      Comment

      • JosAH
        Recognized Expert MVP
        • Mar 2007
        • 11453

        #4
        Originally posted by KWSW
        About the ascii part, the assignment just requires the encoding of a text file thats why I use ascii value. Also the requirement is to print the huffman code as string instead of bits(not too sure why also).

        Still not too sure about what you mean by

        Originally posted by JosAH
        For a zero bit read, you take the route of a left child; go right when a 1 bit is
        read. The moment you reach a leaf node you can emit the associated character.
        For now I just created a huffmancode object with the ascii value and its huffman code representation and put them all into an array sorrting with the shortest code first. Thus for compression I just search the ascii value and get the huffman code for it.
        Was thinking of doing it the same way but I guess its gonna be very inefficient...
        Suppose you have the following little Huffman tree:
        Code:
              .
             / \
            /\  C
           A  B
        So the encoding is:

        A: 00
        B: 01
        C: 1

        Suppose you're reading the folling bits 010011, start at the root of the tree:

        0: go left; 1: go right; you're at a leaf, a B was read. Start at the root again;
        0: go left; 0: go right; you're at a leaf, an A was read. Start at the root again;
        1: go right; you're at a leaf, a C was read. Start at the root again;
        1: go right; you're at a lead, a C was read. Start at the root again;
        the end of the bit stream was read so you have read BACC

        You have to keep that tree somewhere if you want to decode efficiently.
        The encoding part can be done with a Map<Character, String> in your case.

        kind regards,

        Jos

        Comment

        • KWSW
          New Member
          • May 2007
          • 72

          #5
          Hi josAH,

          I happen to find ur set of huffman and would like to know it means by your add bit function.

          [CODE=Java]protected void add(int bit)
          {
          code&= (1<<len)-1;
          code<<= 1;
          code|= bit;
          len++;
          }[/CODE]

          I understand len++ is to add one to the code length but do not understand the rest. It seems my method of just adding "0" to the left and "1" to the right doesn't give me prefix codes....

          edit:

          ok i found out these out

          &= "bitwise AND and assign"
          <<= "shift bits left and assign"
          |= "bitwise OR and assign"
          << "shift bits left"

          Comment

          • JosAH
            Recognized Expert MVP
            • Mar 2007
            • 11453

            #6
            Originally posted by KWSW
            Hi josAH,

            I happen to find ur set of huffman and would like to know it means by your add bit function.

            [CODE=Java]protected void add(int bit)
            {
            code&= (1<<len)-1;
            code<<= 1;
            code|= bit;
            len++;
            }[/CODE]
            Where did you find that code? Not to worry though, I don't claim any copyright.

            kind regards,

            Jos

            Comment

            • KWSW
              New Member
              • May 2007
              • 72

              #7
              Originally posted by JosAH
              Where did you find that code? Not to worry though, I don't claim any copyright.

              kind regards,

              Jos
              Found it over at the java forums and the user name was the same as urs...

              anyway I decided to use the method that prints the bit into string for me to compare and this is what i got


              Code:
              c 5 00000 0
              b 6 010000 10000
              w 6 110000 110000
              r 4 1000 1000
              e 3 100 100
              u 5 00010 10
              l 5 10010 10010
              i 4 1010 1010
              s 4 0110 110
              S 9 000001110 1110
              T 10 0100001110 100001110
              I 10 1100001110 1100001110
              
               8 10001110 10001110
              k 9 001001110 1001110
              j 9 101001110 101001110
              G 10 0011001110 11001110
              W 10 1011001110 1011001110
              : 10 0111001110 111001110
              F 10 1111001110 1111001110
              , 6 101110 101110
              d 5 11110 11110
              a 4 0001 1
              n 4 1001 1001
              o 4 0101 101
              g 6 001101 1101
              p 6 101101 101101
              m 6 011101 11101
              M 11 00000111101 111101
              z 11 10000111101 10000111101
              V 12 001000111101 1000111101
              O 12 101000111101 101000111101
              ( 13 0011000111101 11000111101
              ) 13 1011000111101 1011000111101
              ' 13 0111000111101 111000111101
              - 13 1111000111101 1111000111101
              N 10 0100111101 100111101
              ; 10 1100111101 1100111101
              x 10 0010111101 10111101
              K 13 0001010111101 1010111101
              [ 13 1001010111101 1001010111101
              ] 13 0101010111101 101010111101
              E 13 1101010111101 1101010111101
              D 11 11010111101 11010111101
              H 9 110111101 110111101
              v 7 1111101 1111101
              t 4 0011 11
              . 8 00001011 1011
              A 11 00010001011 10001011
              P 11 10010001011 10010001011
              q 11 01010001011 1010001011
              B 11 11010001011 11010001011
              C 10 0110001011 110001011
              U 12 001110001011 1110001011
              R 12 101110001011 101110001011
              J 12 011110001011 11110001011
              L 13 0111110001011 111110001011
              Y 13 1111110001011 1111110001011
              y 7 1001011 1001011
              f 6 101011 101011
              h 5 11011 11011
                3 111 111
              where the first is the character, the 2nd is my huffmancode in string and the 3rd part is the bits after using integer.toBinar yString(int).

              I could just modify my code by storing the huffman code in bits and using the integer.toBinar yString(int) to write to file for my assignment requirements but than I will not be learning anything...

              Comment

              • JosAH
                Recognized Expert MVP
                • Mar 2007
                • 11453

                #8
                Originally posted by KWSW
                I could just modify my code by storing the huffman code in bits and using the integer.toBinar yString(int) to write to file for my assignment requirements but than I will not be learning anything...
                It seems to look alright to me: the most frequently used letters are associated
                with the shortest Huffman codes. As you can see, if the leftmost bit(s) of the
                Huffman codes are 0, you can't tell the length of the code when you'd store it
                as a simple int. You have to keep track of the bitlength of the code as I did in
                my version.

                Reading and writing bits from/to a stream is an exercise by itself.

                kind regards,

                Jos

                Comment

                • KWSW
                  New Member
                  • May 2007
                  • 72

                  #9
                  Originally posted by JosAH
                  It seems to look alright to me: the most frequently used letters are associated
                  with the shortest Huffman codes. As you can see, if the leftmost bit(s) of the
                  Huffman codes are 0, you can't tell the length of the code when you'd store it
                  as a simple int. You have to keep track of the bitlength of the code as I did in
                  my version.

                  Reading and writing bits from/to a stream is an exercise by itself.

                  kind regards,

                  Jos

                  So this part has something to do with writing bits to the stream?

                  [CODE=java]protected void add(int bit)
                  {
                  code &= (1<<len)-1;
                  code <<= 1;
                  code |= bit;
                  len++;
                  }[/CODE]

                  Comment

                  • JosAH
                    Recognized Expert MVP
                    • Mar 2007
                    • 11453

                    #10
                    Originally posted by KWSW
                    Just still clueless on this part:

                    [CODE=java]protected void add(int bit)
                    {
                    code &= (1<<len)-1;
                    code <<= 1;
                    code |= bit;
                    len++;
                    }[/CODE]
                    It's just bit fiddling (I'm good at that ;-) 'code' is the bit sequence so far and 'len'
                    is the length of the code so far. The parameter 'bit' is either one or zero. This
                    little method shifts in that bit on the right side of the code and the length is
                    incremented. The first line 'code&= (1<<len)-1' isn't actually needed but it makes
                    sure that only 'len' bits in the code are used before the new bit is shifted in.
                    Check your Java manuals for those bit manipulating operators.

                    I could've written 'code= code*2+bit' but I think my mind was in a bit fiddling mood
                    at that time ;-)

                    kind regards,

                    Jos

                    Comment

                    • KWSW
                      New Member
                      • May 2007
                      • 72

                      #11
                      Originally posted by JosAH
                      It's just bit fiddling (I'm good at that ;-) 'code' is the bit sequence so far and 'len'
                      is the length of the code so far. The parameter 'bit' is either one or zero. This
                      little method shifts in that bit on the right side of the code and the length is
                      incremented. The first line 'code&= (1<<len)-1' isn't actually needed but it makes
                      sure that only 'len' bits in the code are used before the new bit is shifted in.
                      Check your Java manuals for those bit manipulating operators.

                      I could've written 'code= code*2+bit' but I think my mind was in a bit fiddling mood
                      at that time ;-)

                      kind regards,

                      Jos
                      thanks for explaining it more... makes sense now... managed to find the website from java.sun.com for bitwise operation. Will go through it once I get this assignment done... and I still have not started on hamming yet...

                      But still kinda stuck with the decoding... understood what you mean by "0" to the left and "1" to the right, but how would I do that since the end result of the treeset is the root node if I understand it correctly...

                      the logic i was thinking of:

                      if its "0" go to the left node else go to the right node.
                      At new node check if its the end of the branch.
                      If it is then get the symbol.
                      Else get the next bit and check for left or right node.

                      Comment

                      • JosAH
                        Recognized Expert MVP
                        • Mar 2007
                        • 11453

                        #12
                        Originally posted by KWSW
                        But still kinda stuck with the decoding... understood what you mean by "0" to the left and "1" to the right, but how would I do that since the end result of the treeset is the root node if I understand it correctly...

                        the logic i was thinking of:

                        if its "0" go to the left node else go to the right node.
                        At new node check if its the end of the branch.
                        If it is then get the symbol.
                        Else get the next bit and check for left or right node.
                        Yep, that's exactly it. After you are ready with the construction phase, you have
                        that Huffman tree and the decoder must use that tree in the way your described
                        above.

                        Huffman compression is an 'off line' compression technique, i.e. first you have
                        to read the entire tex and build the tree before you can perform any compression
                        on the text.

                        Another disadvantage is that not only the compressor needs that tree, the de-
                        compressor needs it as well. How is the decompressor supposed to find that
                        tree? The basic answer is that the compressor must include the tree in the
                        compressed data; as the first data to be read actually because otherwise the
                        decompressor can't decompress anything.

                        If you're interested in 'on line' compression techniques read all about the LZW
                        algorithm.

                        kind regards,

                        Jos

                        Comment

                        Working...