Unicode values

  • billsahiker@yahoo.com

    Unicode values

    Where do I find the Unicode values for math operators like the equals,
    minus and plus signs, and how do I check if a value in a byte array is
    one of these operators? I populate the byte array from a FileStream
    object using the Read method. So far I have been working with UTF-8
    files and I just use

    if (buffer[i] == 61) // 0x3D works also

    and it evaluates to true if the byte is the equals sign. But how do I
    do this if I work with a Unicode/UTF-16 encoded file? I figure I need
    to compare two bytes for Unicode, right? But where do I get those
    values? I googled for Unicode code charts and the like, but after a
    couple of hours I still cannot find this.

    BTW, the files I am reading are all text. My test files are created
    with StreamWriter using the desired Encoding object.

    Bill

  • Peter Duniho

    #2
    Re: Unicode values

    On Mon, 12 May 2008 19:55:40 -0700, <billsahiker@yahoo.com> wrote:
    > Where do I find the Unicode values for math operators like the equals,
    > minus and plus signs, and how do I check if a value in a byte array is
    > one of these operators?

    There are a variety of sources. Windows has the Character Map utility
    that allows you to browse characters on a per-font basis, and will tell
    you the Unicode value for a character.
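
    The values can also be read off in code rather than from a chart. A
    minimal sketch, assuming .NET's System.Text.Encoding classes; the names
    OperatorCodes, CodePoint and HexBytes are made up for illustration:

    ```csharp
    using System;
    using System.Text;

    public static class OperatorCodes
    {
        // Unicode code point of a char (valid for non-surrogate chars).
        public static int CodePoint(char c)
        {
            return (int)c;
        }

        // Bytes of a single character in the given encoding, as hex text.
        public static string HexBytes(char c, Encoding encoding)
        {
            byte[] bytes = encoding.GetBytes(new[] { c });
            return BitConverter.ToString(bytes).Replace("-", " ");
        }
    }
    ```

    With this sketch, CodePoint('=') is 61, HexBytes('=', Encoding.UTF8) is
    "3D", HexBytes('=', Encoding.Unicode) is "3D 00" (UTF-16 little-endian),
    and HexBytes('=', Encoding.BigEndianUnicode) is "00 3D".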

    However, you may be going about whatever you're trying to do the wrong
    way. You should read your text in using an Encoding class appropriate to
    the format, converting to the char type in C#. Then you can just use the
    literal '=' (for example) to compare for the equals character, without
    ever needing to know the actual Unicode value.
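
    As a rough sketch of that approach (OperatorCounter and CountOperators
    are hypothetical names, not from the original code), a StreamReader
    built with the file's encoding does the decoding, and the comparison
    uses plain char literals. It reads into a char buffer rather than
    building strings:

    ```csharp
    using System.IO;
    using System.Text;

    public static class OperatorCounter
    {
        // Counts '=', '+' and '-' in a text stream. StreamReader decodes the
        // bytes, so the same loop works for UTF-8 and UTF-16 input.
        // Note: disposing the reader also closes the underlying stream.
        public static int CountOperators(Stream input, Encoding encoding)
        {
            int count = 0;
            using (var reader = new StreamReader(input, encoding))
            {
                char[] buffer = new char[4096];
                int read;
                while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    for (int i = 0; i < read; i++)
                    {
                        char c = buffer[i];
                        if (c == '=' || c == '+' || c == '-')
                        {
                            count++;
                        }
                    }
                }
            }
            return count;
        }
    }
    ```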

    Pete

    • billsahiker@yahoo.com

      #3
      Re: Unicode values

      I am looking for maximum performance. I originally read the file with
      StreamReader and did the parsing with strings, but it was way too
      slow. I am thinking there should be two byte values for a specific
      character in a given language. Do the bytes vary by font as well?

      • Peter Duniho

        #4
        Re: Unicode values

        On Mon, 12 May 2008 20:30:10 -0700, <billsahiker@yahoo.com> wrote:
        > I am looking for maximum performance. I originally read the file with
        > StreamReader and did the parsing with strings, but it was way too
        > slow. I am thinking there should be two byte values for a specific
        > character in a given language. Do the bytes vary by font as well?

        No, but fonts vary in which Unicode characters they actually include.
        Character Map isn't really a Unicode character browser; as a side
        effect of how it operates, it just happens to be a convenient way to
        look up character codes, at least if you're running on Windows. (The
        Mac has a similar utility, and my guess is there's something like it
        on Unix, Linux, etc. too, but I don't know for a fact.)

        Anyway, I'm skeptical that your performance issue is directly caused
        by converting from UTF-8 to UTF-16. However, if you insist that it is,
        your strategy should still not be to hard-code byte-value character
        constants. Simply use the built-in character encoding support to encode
        C# char values specified as literals to UTF-8 and use that. Obviously,
        you would do that encoding once, either at the start of your program's
        execution, or even using a standalone tool to create an appropriate
        input data file with UTF-8 constants.
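
        A minimal sketch of the "encode once, then search bytes" idea,
        assuming UTF-8 input. BytePatternSearch, PatternFor and IndexOf are
        illustrative names only. A plain byte-by-byte search is safe here
        because in UTF-8 the bytes of an ASCII character never appear inside
        a multi-byte sequence; UTF-16 data would also need the search to stay
        aligned on two-byte boundaries.

        ```csharp
        using System.Text;

        public static class BytePatternSearch
        {
            // Encodes a character literal once, so no byte constants are
            // hard-coded. Call this at startup and keep the result.
            public static byte[] PatternFor(char c, Encoding encoding)
            {
                return encoding.GetBytes(new[] { c });
            }

            // Index of the first occurrence of pattern in data[0..length),
            // or -1 if it does not occur.
            public static int IndexOf(byte[] data, int length, byte[] pattern)
            {
                for (int i = 0; i + pattern.Length <= length; i++)
                {
                    int j = 0;
                    while (j < pattern.Length && data[i + j] == pattern[j])
                    {
                        j++;
                    }
                    if (j == pattern.Length)
                    {
                        return i;
                    }
                }
                return -1;
            }
        }
        ```

        For example, PatternFor('=', Encoding.UTF8) yields the single byte
        0x3D, and searching the UTF-8 bytes of "x=1" for it returns index 1.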

        There's really no good reason to hard-code character constants in your
        code, or for you to even know or care what those character constants are.

        Pete

        • billsahiker@yahoo.com

          #5
          Re: Unicode values

          Pete,

          The performance issue was parsing strings vs. a byte array. Since I
          already have a working routine that searches a byte array for UTF-8
          files, I wanted to modify it for Unicode. It turns out I can still
          search for the same byte values, e.g., 0x3D for the equals sign,
          because the first byte is the same in Unicode and UTF-8 for the math
          symbols I need. Once I discovered that, all I needed to do was
          increment the pointer variable in the buffer by two instead of one.
          With that minor adjustment the routine now works for both UTF-8 and
          Unicode.
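
          The stepping-by-two scan might look like the sketch below. It
          assumes little-endian UTF-16 (what Encoding.Unicode produces) with
          the data aligned at offset 0; Utf16Scanner and IndexOfUtf16Le are
          hypothetical names, not the original routine. Unlike a first-byte
          check alone, it also compares the second byte, since other
          characters share the same low byte (U+013D, for instance, is
          encoded as 3D 01).

          ```csharp
          public static class Utf16Scanner
          {
              // Byte offset of the first UTF-16LE code unit equal to target
              // in buffer[0..length), or -1 if there is none.
              public static int IndexOfUtf16Le(byte[] buffer, int length, char target)
              {
                  byte low = (byte)(target & 0xFF);   // first byte on disk
                  byte high = (byte)(target >> 8);    // second byte, 0x00 for ASCII
                  for (int i = 0; i + 1 < length; i += 2)
                  {
                      if (buffer[i] == low && buffer[i + 1] == high)
                      {
                          return i;
                      }
                  }
                  return -1;
              }
          }
          ```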

          Thanks for your help.

          Bill


