simple string compression?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Chris LaJoie

    #16
    Re: simple string compression?

    > Well, using my "space saver" technique got the 519,942 byte input file[color=blue]
    > down to 217,875 bytes - again, 42%. I suspect that getting it down
    > significantly from there would take a fair amount more work -
    > basically, writing a Huffman encoder, which is reasonably
    > straightforward (I've done it before) but not trivial, and not
    > something I fancy doing at half four in the morning :)[/color]

    Sounds good Jon. I'm anxious to see what you can do with that. If you
    don't mind, I plan on submitting this code (with proper credit of course) to
    various sites so compression code isn't so hard to come by.

    Chris


    Comment

    • Jon Skeet [C# MVP]

      #17
      Re: simple string compression?

      Chris LaJoie <chris@etriptra der.com> wrote:[color=blue][color=green]
      > > Well, using my "space saver" technique got the 519,942 byte input file
      > > down to 217,875 bytes - again, 42%. I suspect that getting it down
      > > significantly from there would take a fair amount more work -
      > > basically, writing a Huffman encoder, which is reasonably
      > > straightforward (I've done it before) but not trivial, and not
      > > something I fancy doing at half four in the morning :)[/color]
      >
      > Sounds good Jon. I'm anxious to see what you can do with that. If you
      > don't mind, I plan on submitting this code (with proper credit of course) to
      > various sites so compression code isn't so hard to come by.[/color]

      Sure. This is really only "early morning for fun" code though - no
      comments, no error checking, hideously inefficient etc - I just wanted
      to see how it would do.

      Also, due to the fact that it *can* compress so well, the
      GetMaxCharCount may well mean that some other classes will seriously
      overallocate space.

      So, with all the caveats over with, here's the code.

      using System;
      using System.Text;
      using System.IO;

      public class SpaceSaver : Encoding
      {
      public override int GetByteCount (char[] text, int start,
      int length)
      {
      return ToBinary(text, start, length).Length;
      }

      public override int GetBytes (char[] text, int textStart,
      int length, byte[] buffer,
      int bufferStart)
      {
      byte[] binary = ToBinary(text, textStart, length);
      Array.Copy (binary, 0, buffer, bufferStart, binary.Length);
      return binary.Length;
      }

      public override int GetCharCount (byte[] binary, int start,
      int length)
      {
      return ToString(binary , start, length).Length;
      }

      public override int GetChars (byte[] binary, int binaryStart,
      int length, char[] buffer,
      int bufferStart)
      {
      char[] chars = ToString (binary, binaryStart,
      length).ToCharA rray();
      Array.Copy (chars, 0, buffer, bufferStart, chars.Length);
      return chars.Length;
      }

      public override int GetMaxByteCount (int length)
      {
      return length;
      }

      public override int GetMaxCharCount (int length)
      {
      return length*129;
      }

      String ToString (byte[] data, int start, int length)
      {
      StringBuilder builder = new StringBuilder() ;
      foreach (byte b in data)
      {
      if (b < 128)
      builder.Append ((char)b);
      else
      builder.Append (new string (' ', b-126));
      }
      return builder.ToStrin g();
      }

      byte[] ToBinary (char[] text, int start, int length)
      {
      using (MemoryStream ms = new MemoryStream())
      {
      int spaces=0;
      foreach (char t in text)
      {
      if (t==' ')
      {
      spaces++;
      if (spaces==130)
      {
      ms.WriteByte(25 5);
      spaces=1;
      }
      }
      else
      {
      switch (spaces)
      {
      case 0:
      break;
      case 1:
      ms.WriteByte(32 );
      break;
      default:
      ms.WriteByte((b yte)(spaces+126 ));
      break;
      }
      spaces=0;
      ms.WriteByte ((byte)t);
      }
      }
      switch (spaces)
      {
      case 0:
      break;
      case 1:
      ms.WriteByte(32 );
      break;
      default:
      ms.WriteByte((b yte)(spaces+126 ));
      break;
      }
      ms.Flush();
      return ms.ToArray();
      }
      }

      static void Main(string[] args)
      {
      string inputFile = args[0];

      Encoding encoding = new SpaceSaver();

      try
      {
      // Create the reader and writer with appropriate encodings.
      using (StreamReader inputReader =
      new StreamReader (inputFile, Encoding.ASCII) )
      {
      string allText = inputReader.Rea dToEnd();

      byte[] data = encoding.GetByt es(allText);

      Console.WriteLi ne ("data length="+data.L ength);

      string recovered = encoding.GetStr ing(data);
      if (recovered != allText)
      {
      Console.WriteLi ne ("Oops!");
      Console.WriteLi ne (recovered.Subs tring (0, 100));
      }
      }
      }
      catch (IOException e)
      {
      Console.WriteLi ne ("Exception during processing: {0}",
      e.Message);
      }
      }
      }


      --
      Jon Skeet - <skeet@pobox.co m>
      Pobox has been discontinued as a separate service, and all existing customers moved to the Fastmail platform.

      If replying to the group, please do not mail me too

      Comment

      • Chris LaJoie

        #18
        Re: simple string compression?

        > Also, due to the fact that it *can* compress so well, the[color=blue]
        > GetMaxCharCount may well mean that some other classes will seriously
        > overallocate space.[/color]

        Thanks a lot for the code. A quick test proved it works, and gets a
        compression ratio of about 41% of the original file size. Good job :)
        I'll go into the code more tommorow, but I may have to mail you if I can't
        figure out what's going on.

        Chris


        Comment

        • Fergus Cooney

          #19
          Re: simple string compression?

          Hi Jon,

          Very simple, very effective - nice routine. :-)

          I don't know if you looked at the reference that Chris gave - the HACC
          op-code method. It's pleasing that yours achieves a comparable result without
          the complexity (not that it's that much more). It does, of course, depend on
          what the data is.

          I had a play with it and, noticing repeated sequences of '---', turned it
          from a space-saver to a true RLE encoder. There weren't enough dashes to
          overcome the loss of using the high-bit, however, and it added an extra 10K. I
          think that some more could be squeezed out by allowing it to be a two-char
          saver and using 0x40 as the switch. Reduced run lengths, but it would allow
          for a second favourite character, ie the '-'. Not worth the effort though.

          Like you say, Jon, the next (decent) level of compression is going to come
          with a lot more complexity.


          GetMaxCharCount (). Oh boy. Lol. 'Nuff said !! :-))
          So too GetBytes and GetChars - oh, a copying we will go....
          But, no worries, I know this was just an experiment of implementing an
          Encoding.


          Chris, when you submit Jon's code, can I recommend that you make some
          small changes. A key point in this method is the use of the top bit, so I
          think this should be emphasised.

          Instead of
          builder.Append (new string (' ', b-126));
          have
          builder.Append (new string (' ', (b+2)-128));

          Instead of
          ms.WriteByte((b yte)(spaces+126 ));
          have
          ms.WriteByte((b yte)(128 + (spaces-2)));

          And to show that this is in the same scheme:
          if (t==' ')
          {
          spaces++;
          if (spaces==130)
          {
          ms.WriteByte(25 5);
          spaces=1;
          }
          Have
          if (t==' ')
          {
          if (spaces==129)
          {
          ms.WriteByte(12 8 + (spaces -2));
          spaces=0;
          spaces++;
          }

          Thus there is a consistency in the use of +/-2 and +/-128. Maybe you'd
          even go as far as having a constant for the compressed-spaces indicator, ie
          the top bit.

          Regards,
          Fergus

          ps. And now I know how an Encoding looks from the inside. :-)


          Comment

          • Jon Skeet [C# MVP]

            #20
            Re: simple string compression?

            Fergus Cooney <filter-1@tesco.net> wrote:[color=blue]
            > Very simple, very effective - nice routine. :-)[/color]

            Cheers :)
            [color=blue]
            > I don't know if you looked at the reference that Chris gave - the HACC
            > op-code method. It's pleasing that yours achieves a comparable result without
            > the complexity (not that it's that much more). It does, of course, depend on
            > what the data is.[/color]

            I looked at it briefly, but the description wasn't terribly well
            written unfortunately.. .
            [color=blue]
            > I had a play with it and, noticing repeated sequences of '---', turned it
            > from a space-saver to a true RLE encoder. There weren't enough dashes to
            > overcome the loss of using the high-bit, however, and it added an extra 10K. I
            > think that some more could be squeezed out by allowing it to be a two-char
            > saver and using 0x40 as the switch. Reduced run lengths, but it would allow
            > for a second favourite character, ie the '-'. Not worth the effort though.[/color]

            Right. Possibly not - depending on just how many of them there were.
            [color=blue]
            > GetMaxCharCount (). Oh boy. Lol. 'Nuff said !! :-))[/color]

            Indeed. It's quite possible that by overriding more of the Encoding
            methods, that wouldn't be a problem - I suspect it's used by the
            default implementations though.
            [color=blue]
            > So too GetBytes and GetChars - oh, a copying we will go....
            > But, no worries, I know this was just an experiment of implementing an
            > Encoding.[/color]

            :)
            [color=blue]
            > Chris, when you submit Jon's code, can I recommend that you make some
            > small changes. A key point in this method is the use of the top bit, so I
            > think this should be emphasised.
            >
            > Instead of
            > builder.Append (new string (' ', b-126));
            > have
            > builder.Append (new string (' ', (b+2)-128));
            >
            > Instead of
            > ms.WriteByte((b yte)(spaces+126 ));
            > have
            > ms.WriteByte((b yte)(128 + (spaces-2)));[/color]

            Those make sense, yes.
            [color=blue]
            > And to show that this is in the same scheme:
            > if (t==' ')
            > {
            > spaces++;
            > if (spaces==130)
            > {
            > ms.WriteByte(25 5);
            > spaces=1;
            > }
            > Have
            > if (t==' ')
            > {
            > if (spaces==129)
            > {
            > ms.WriteByte(12 8 + (spaces -2));
            > spaces=0;
            > spaces++;
            > }[/color]

            I presume you meant the spaces++; to come after the closing brace - and
            note that you then need a cast to byte. One alternative would be:

            if (t==' ')
            {
            spaces++;
            if (spaces==129) // The most we can write in one byte
            {
            ms.WriteByte ((byte) (128+(spaces-2)));
            spaces=0;
            }
            }

            In other words, writing the spaces a bit more eagerly, and starting
            with a clean slate, as it were.
            [color=blue]
            > Thus there is a consistency in the use of +/-2 and +/-128. Maybe you'd
            > even go as far as having a constant for the compressed-spaces indicator, ie
            > the top bit.[/color]

            Possibly - although that would imply that it could be changed without
            too much problem, whereas it can't - if it's anything *other* than the
            top bit, you end up with it conflicting with ASCII. I'd prefer to just
            document it carefully.
            [color=blue]
            > ps. And now I know how an Encoding looks from the inside. :-)[/color]

            If you want a slightly more polished example, have a look at
            Pobox has been discontinued as a separate service, and all existing customers moved to the Fastmail platform.


            --
            Jon Skeet - <skeet@pobox.co m>
            Pobox has been discontinued as a separate service, and all existing customers moved to the Fastmail platform.

            If replying to the group, please do not mail me too

            Comment

            • Fergus Cooney

              #21
              Re: simple string compression?

              Hi Jon,

              Effort for saving dashes? Using your code for '-' instead of ' ' reduced
              it by c28700.

              The corrections to the [if (t==' ') etc]. Yes, rather than copy from the
              project (I'd only just got up and VS was still asleep) I copied the code from
              the email and hacked away (too hurriedly, lol, I had an appointment to get
              to). What you presumed was what I was actually using in my version of the live
              code. But keeping the ++ early and writing as-soon-as ('eagerly') - yes, that
              is cleaner.

              With the constant I said 'maybe' because it's not hugely important. Anyone
              working with bytes should recognise a 128 for what it is. But if I did have
              the constant, I'd have a comment - top bit only because.. - or some such.

              I've looked at the page for your other encoder but I'll leave the
              examination for when the time comes. Shame it's for ebcdic - I tend to run
              away from anything I can't pronounce, lol.! Lord of the Rings and Dune escaped
              being devoured because of that!

              Lader, Dude,

              Best regards,
              Fergus


              Comment

              Working...