Extracting Unicode characters from RTF

  • geegeegeegee
    New Member
    • Mar 2008
    • 3

    Extracting Unicode characters from RTF

    Hi All,
    I have come across a difficult problem with extracting Unicode characters from RTF strings.
    A detailed description of my problem is below. If anyone could help, it would be much appreciated. I've tried to make the problem as clear as possible, but if any clarification is needed please let me know.

    Task
    -Convert RTF2-formatted text containing foreign (Unicode) characters to plain text.

    Background

    -We are using Stephen Lebans' RTF2 control to display and edit text.
    -RTF2 fields cannot be displayed appropriately on reports, so unformatted text must be stored in the database.
    -The RTF2 parser cannot handle Unicode (our overseas clients, specifically Romania, use Unicode characters), so often the rtf2.PlainText method returns strings containing ???
    -I have built a simple parser to convert Hex values in rtf2.RTFText to characters
    -Given a character table, I can add functionality to generate characters appropriately depending on RTF Character Set defined in .RTFText.

    Question
    -Where can I find a character table for the Character Sets specified in .RTFText (specifically fcharset238)?
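
One way to obtain such a table without a published chart is to generate it. The sketch below is in Python rather than VB6, and it assumes that fcharset238 corresponds to Windows code page 1250 (Python's built-in "cp1250" codec). The helper name character_table is hypothetical.

```python
import unicodedata

def character_table(codepage):
    """Return {byte value: character} for the upper half (0x80-0xFF) of a code page."""
    table = {}
    for value in range(0x80, 0x100):
        try:
            table[value] = bytes([value]).decode(codepage)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the code page
    return table

# Assumption: \fcharset238 text is encoded in Windows code page 1250.
cp1250 = character_table("cp1250")
for value, char in cp1250.items():
    # Print the Unicode code point and name rather than the glyph itself,
    # so the listing is readable on any console.
    print(f"{value:02X} -> U+{ord(char):04X} {unicodedata.name(char, '?')}")
```

The resulting dictionary could be exported once and hard-coded as a lookup table in the VB6 parser.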

    Technical/Testing info:
    Fonts
    These are the 2 relevant fonts:
    F1: {\f1\fnil\fcharset0 MS Sans Serif;}
    F2: {\f2\fswiss\fcharset238{\*\fname Arial;}Arial CE;}

    *Testing in MS Word showed that the actual font (Sans Serif, Arial, etc.) made no difference to the character presented, so fcharset is most likely the issue.

    Keys
    -Pressing ";" usually generates "ş" (hereafter referred to as "s")
    -However, when in VB6 code window it generates "º" (this probably isn't important).
    -Copy/pasting from/into VB6 code window alternates between the characters.

    RTF
    -In RTF format, abnormal characters are partly referenced by "\'XX", with XX being their hex values. E.g. the RTF string "xxx\'BAxxx" corresponds to "xxxşxxx".
    -In RTF format, abnormal characters are partly referenced by the specified font.

    -So, the actual character displayed is dependent on the hex value, as well as the font (character set) specified in RTF.

    Characters
    Below is a table indicating my observations for a character. Hex Value and Font are the inputs.

    Hex Value || Font || Character Displayed || Unicode for Character Displayed
    BA || F1 || ş || 00BA
    BA || F2 || º || 015F
  • geegeegeegee
    New Member
    • Mar 2008
    • 3

    #2
    I should have mentioned, testing was carried out with Input Language set to Romanian.
    Greg


    • NeoPa
      Recognized Expert Moderator MVP
      • Oct 2006
      • 32661

      #3
      Greg,

      I commend you on the care taken to specify the question as well as the trouble you've obviously already gone to to find a solution yourself.

      I'm afraid I can't help you directly with this issue, but I will flag it for some of the other Access experts to come and have a look-see in case any of them can help. It is more a problem encountered while using Access than an Access problem per se though, so if we find no joy in here it may be worth posting a link to this thread in the Windows forum too.

      Let's see what flagging to the other Access experts can do for us first though.


      • Scott Price
        Recognized Expert Top Contributor
        • Jul 2007
        • 1384

        #4
        The ChrW() function returns the character for a given Unicode code point, which you can supply as a hex value.

        The syntax is ChrW(&H15F), which correctly displays the s with cedilla in a simple text box that I set up in my test database. Using ChrW(&HBA) displays the º symbol that you mention. You mention them being the other way 'round, which makes me wonder if that isn't a typo?

        I'm not personally familiar with Lebans' RTF2, but after doing a little research into the character sets and code pages involved, it looks to me as though you actually have a code page problem, not an fcharset problem. For example, the code page for Latin 2 is 1250 (see here), and it maps the byte BA to the Unicode character 015F. However, code page 1252 (see here) takes the same byte BA and maps it to the Unicode character 00BA, the masculine ordinal indicator (it looks much like the degree sign, although the degree sign itself is 00B0).

        My suggestion is that you are receiving the text encoded with code page 1250 and interpreting it based on the 1252 encoding.
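
The mismatch can be reproduced in a few lines. This is only a sketch in Python (not VB6), using Python's built-in "cp1250" and "cp1252" codecs; the sample string is the one from the original post.

```python
raw = bytes([0xBA])  # the byte written in the RTF as \'BA

as_1250 = raw.decode("cp1250")  # Latin 2 interpretation
as_1252 = raw.decode("cp1252")  # Western European interpretation

print(f"cp1250: U+{ord(as_1250):04X}")  # U+015F, s with cedilla
print(f"cp1252: U+{ord(as_1252):04X}")  # U+00BA, masculine ordinal indicator

# Text that was wrongly decoded as cp1252 can be repaired by turning it
# back into bytes and decoding those bytes as cp1250 instead.
wrong = "xxx\u00baxxx"
repaired = wrong.encode("cp1252").decode("cp1250")
print(ascii(repaired))
```

The round trip only works while every character in the wrongly decoded text exists in code page 1252; a "?" that has already replaced the original character cannot be recovered this way.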

        Again, I'm not familiar with Lebans' RTF2, but somehow you will need to find the coding to change this encoding/decoding discrepancy. Sorry to not be able to give you any specific help on doing that :-(
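
As a rough illustration of what such a fix might look like outside the control, here is a minimal Python sketch (not VB6, and not part of Lebans' RTF2). It reads the fcharset of each font from the font table, then decodes each \'XX escape with the code page of the font in effect. The names CHARSET_TO_CODEPAGE, font_codepages and decode_body are hypothetical, the charset list is a partial assumption, and the sketch ignores RTF groups, \u escapes and multi-byte character sets.

```python
import re

# Assumed subset of RTF \fcharsetN values and the Windows code pages
# they are usually paired with; extend as needed.
CHARSET_TO_CODEPAGE = {
    0: "cp1252",    # ANSI (Western European)
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    186: "cp1257",  # Baltic
    204: "cp1251",  # Russian (Cyrillic)
    238: "cp1250",  # Eastern European
}

# A font table entry such as {\f2\fswiss\fcharset238...
FONT_ENTRY = re.compile(r"\{\\f(\d+)[^{}]*?\\fcharset(\d+)")
# Either a font switch (\fN plus optional delimiter space) or a hex escape (\'XX).
TOKEN = re.compile(r"\\f(\d+) ?|\\'([0-9a-fA-F]{2})")

def font_codepages(font_table):
    """Map font number -> code page name from the text of an RTF font table."""
    return {
        int(font): CHARSET_TO_CODEPAGE.get(int(charset), "cp1252")
        for font, charset in FONT_ENTRY.findall(font_table)
    }

def decode_body(body, codepages, current="cp1252"):
    """Decode \\'XX escapes using the code page of the font currently in effect."""
    out, pos = [], 0
    for m in TOKEN.finditer(body):
        out.append(body[pos:m.start()])  # plain text before this token
        if m.group(1) is not None:
            # \fN: switch to the code page of font N (keep the old one if unknown)
            current = codepages.get(int(m.group(1)), current)
        else:
            out.append(bytes([int(m.group(2), 16)]).decode(current))
        pos = m.end()
    out.append(body[pos:])
    return "".join(out)

fonts = font_codepages(
    r"{\fonttbl{\f1\fnil\fcharset0 MS Sans Serif;}"
    r"{\f2\fswiss\fcharset238{\*\fname Arial;}Arial CE;}}"
)
decoded = decode_body(r"\f1 xxx\'BAxxx \f2 xxx\'BAxxx", fonts)
print(ascii(decoded))
```

Here the same byte BA decodes differently depending on whether font F1 (charset 0) or font F2 (charset 238) is active, which is the behaviour a replacement for the PlainText method would need.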

        Regards,
        Scott


        • Scott Price
          Recognized Expert Top Contributor
          • Jul 2007
          • 1384

          #5
          A few links that contain helpful and not so helpful information that I came across in my research:

          MS developer discussion

          Character sets and Code pages

          Wikipedia Character Encoding

          Wikipedia Code Pages

          Wikipedia Romanian Alphabet

          Kind regards,
          Scott


          • geegeegeegee
            New Member
            • Mar 2008
            • 3

            #6
            Thanks for your suggestion Scott. I think the 1252 code page will point us in the right direction. Will let you know how we go.


            • Scott Price
              Recognized Expert Top Contributor
              • Jul 2007
              • 1384

              #7
              Let me know how it goes! Good luck.

              Regards,
              Scott
