character to HTML ampersand escape sequence converter

  • SwordAngel

    character to HTML ampersand escape sequence converter

    Hello,
    I'm looking for a program that converts characters of different
    encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
    escape sequences. Anybody knows where I can find one?

    thx.

  • David Dorward

    #2
    Re: character to HTML ampersand escape sequence converter

    SwordAngel wrote:
    > I'm looking for a program that converts characters of different
    > encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
    > escape sequences. Anybody knows where I can find one?

    IIRC Tidy will do that.



    --
    David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
    Home is where the ~/.bashrc is


    • Bjoern Hoehrmann

      #3
      Re: character to HTML ampersand escape sequence converter

      * David Dorward wrote in comp.infosystems.www.authoring.html:
      >SwordAngel wrote:
      >> I'm looking for a program that converts characters of different
      >> encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
      >> escape sequences. Anybody knows where I can find one?
      >
      >IIRC Tidy will do that.

      Well, yes, but only for character encodings it supports (and it does not
      support any of the encodings SwordAngel listed to that extent). Windows
      users can compile Tidy with an experimental feature that enables support
      for all character encodings Windows / Internet Explorer support via the
      TIDY_WIN32_MLANG_SUPPORT #define, but it is generally better to use
      external tools such as iconv, piconv, uconv, recode, ... to convert the
      document to UTF-8 and let Tidy process the document accordingly.
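
A minimal Python sketch of the "transcode to UTF-8 first" step described above, for readers without iconv or recode at hand. It uses only Python's built-in codecs; the function name `transcode_to_utf8` is made up for this illustration, and the sketch does not invoke Tidy or update any charset declaration inside the document.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    # Strict decoding raises UnicodeDecodeError on malformed input
    # instead of silently substituting characters.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# The encodings named in the thread are available as Python codecs.
for codec, sample in (("euc_jp", "日本語"), ("big5", "中文"), ("gb18030", "汉字")):
    legacy_bytes = sample.encode(codec)
    utf8_bytes = transcode_to_utf8(legacy_bytes, codec)
    print(codec, utf8_bytes.decode("utf-8"))
```

The UTF-8 result can then be handed to whichever HTML tool is used next.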
      --
      Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
      Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
      68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/


      • Alan J. Flavell

        #4
        Re: character to HTML ampersand escape sequence converter

        On Fri, 17 Dec 2004, SwordAngel wrote:

        > I'm looking for a program that converts characters of different
        > encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
        > escape sequences. Anybody knows where I can find one?

        "free recode" ? http://recode.progiciels-bpi.ca/

        Call it with something like:
        recode -d euc-jp..h4 < input.html > output.html

        That won't do anything to tidy up the HTML, though, unlike Tidy ;-)

        And don't forget that when you've translated language-specific
        encodings into Han-unified Unicode characters, you should mark-up
        the source with the correct language attribute in order to get
        the right rendering of the unified characters. At least that's my
        understanding (I can't actually read them myself).
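
For comparison with the recode invocation above, here is a rough Python sketch of the same kind of conversion: decode the legacy bytes, then write ASCII with decimal numeric character references. It is not a reimplementation of recode and makes no claim to match its output byte for byte; `to_numeric_references` is a hypothetical helper name, and `xmlcharrefreplace` is Python's built-in encoding error handler.

```python
def to_numeric_references(data: bytes, source_encoding: str) -> bytes:
    """Return ASCII bytes with every non-ASCII character written as &#N;."""
    text = data.decode(source_encoding)
    # Characters that ASCII cannot represent are replaced by decimal
    # numeric character references; ASCII characters pass through as-is.
    return text.encode("ascii", errors="xmlcharrefreplace")


sample = "<p>日本</p>".encode("euc_jp")
print(to_numeric_references(sample, "euc_jp"))  # b'<p>&#26085;&#26412;</p>'
```

Like recode, this does nothing to tidy up the markup itself.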


        • Nick Kew

          #5
          Re: character to HTML ampersand escape sequence converter

          In article <41c728c0.392471656@news.bjoern.hoehrmann.de>,
          Bjoern Hoehrmann <derhoermi@gmx.net> writes:

          >>IIRC Tidy will do that.

          Indeed. I was on the point of suggesting AN XML processor until I saw
          that (libxml2 accepts HTML as well as XML input).

          > Well, yes, but only for character encodings it supports (and it does not
          > support any of the encodings SwordAngel listed to that extent).

          Indeed, libxml2 (last time I checked) supports some but not all of
          those encodings, so the same limitation applies.

          Have you considered tying in iconv to Tidy to improve i18n support?

          > but it is generally better to use
          > external tools such as iconv, piconv, uconv, recode, ... to convert the
          > document to UTF-8 and let Tidy process the document accordingly.

          I believe OpenSP supports all the encodings named, though I'm
          not entirely sure OTTOMH. So there may still be a one-stop
          program for the conversion. But as Björn says, a transcoder
          such as iconv is a more general solution.


          --
          Nick Kew

          Nick's manifesto: http://www.htmlhelp.com/~nick/


          • Bjoern Hoehrmann

            #6
            Re: character to HTML ampersand escape sequence converter

            * Nick Kew wrote in comp.infosystems.www.authoring.html:
            >Have you considered tying in iconv to Tidy to improve i18n support?

            I wrote an experimental iconv wrapper which is included in the source
            distribution, but it is not plugged into the code, i.e., you need to
            change a few things in order to use it. Development of these features
            was put on hold until a better interface for pluggable transcoders for
            Tidy has been developed (which has not happened yet).
            --
            Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
            Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
            68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/


            • Henri Sivonen

              #7
              Re: character to HTML ampersand escape sequence converter

              In article <fg7c92-i11.ln1@hugin.webthing.com>,
              nick@hugin.webthing.com (Nick Kew) wrote:

              > Indeed. I was on the point of suggesting AN XML processor until I saw
              > that (libxml2 accepts HTML as well as XML input).

              A quick glance at the API docs suggested that the HTML API is similar
              but separate from the XML API. Is it so? Is there an equivalent of SAX
              filter or somesuch that would make HTML appear to the app as XHTML?

              TagSoup on the Java side appears to the app as an XML parser parsing
              XHTML.

              Has anyone compared the tag slurping features of TagSoup and libxml2? I
              wonder which one is a better idea when writing in Python: using libxml2
              with CPython or using TagSoup with Jython?

              --
              Henri Sivonen
              hsivonen@iki.fi

              Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html


              • Alan J. Flavell

                #8
                Re: character to HTML ampersand escape sequence converter

                On Sat, 18 Dec 2004, Henri Sivonen wrote:

                > In article <fg7c92-i11.ln1@hugin.webthing.com>,
                > nick@hugin.webthing.com (Nick Kew) wrote:
                >
                > > Indeed. I was on the point of suggesting AN XML processor until I
                > > saw that (libxml2 accepts HTML as well as XML input).
                >
                > A quick glance at the API docs suggested that the HTML API is similar
                > but separate from the XML API. Is it so?

                But does this matter, in the context of the original question?

                Surely, given any WWW-compatible HTML or XHTML data stream, one can
                choose to convert any non-ascii coded character (or any selection of
                non-ascii characters) to a unicode code point and thence into
                &#bignumber; notation, purely at the character stream layer, without
                parsing the rest of the material at all?
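
The character-stream approach described above, including the option of converting only a selection of non-ASCII characters, could be sketched in Python roughly as follows. The function name `escape_non_ascii` and its `keep` parameter are invented for this illustration; the code looks only at characters and knows nothing about tags, comments, scripts or style sheets.

```python
import re

# Matches any single character outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def escape_non_ascii(text: str, keep: str = "") -> str:
    """Replace each non-ASCII character not listed in `keep` with &#N;."""
    def replace(match):
        char = match.group(0)
        if char in keep:
            return char  # caller asked for this character to stay literal
        return f"&#{ord(char)};"

    return NON_ASCII.sub(replace, text)


print(escape_non_ascii("<p>café 日</p>"))            # <p>caf&#233; &#26085;</p>
print(escape_non_ascii("<p>café 日</p>", keep="é"))  # <p>café &#26085;</p>
```

Because the markup is never parsed, the same substitution is applied everywhere in the stream, whatever construct the character sits in.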


                • Bjoern Hoehrmann

                  #9
                  Re: character to HTML ampersand escape sequence converter

                  * Alan J. Flavell wrote in comp.infosystems.www.authoring.html:
                  >Surely, given any WWW-compatible HTML or XHTML data stream, one can
                  >choose to convert any non-ascii coded character (or any selection of
                  >non-ascii characters) to a unicode code point and thence into
                  >&#bignumber; notation, purely at the character stream layer, without
                  >parsing the rest of the material at all?

                  That does not work very well for comments, CDATA elements, processing
                  instructions, etc.
                  --
                  Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
                  Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
                  68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/


                  • Alan J. Flavell

                    #10
                    Re: character to HTML ampersand escape sequence converter

                    On Sat, 18 Dec 2004, Bjoern Hoehrmann wrote:

                    > * Alan J. Flavell wrote in comp.infosystems.www.authoring.html:
                    > >Surely, given any WWW-compatible HTML or XHTML data stream, one can
                    > >choose to convert any non-ascii coded character (or any selection of
                    > >non-ascii characters) to a unicode code point and thence into
                    > >&#bignumber; notation, purely at the character stream layer, without
                    > >parsing the rest of the material at all?
                    >
                    > That does not work very well for comments,

                    Fortunately, HTML rendering agents don't need to interpret the content
                    of comments...

                    > CDATA elements, processing instructions, etc.

                    Theoretically, of course, you are right; which is why I slipped in
                    that qualification re. documents that are compatible with the WWW as
                    it exists.

                    I don't dispute that in theory you can produce counter-examples where
                    the simple method described above gives the wrong result, for the
                    reasons you gave; but I'm interested if a real-life example can be
                    produced where this would matter.

                    all the best


                    • Henri Sivonen

                      #11
                      Re: character to HTML ampersand escape sequence converter

                      In article <Pine.LNX.4.61.0412181103510.20967@ppepc56.ph.gla.ac.uk>,
                      "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

                      > On Sat, 18 Dec 2004, Henri Sivonen wrote:
                      >
                      > > In article <fg7c92-i11.ln1@hugin.webthing.com>,
                      > > nick@hugin.webthing.com (Nick Kew) wrote:
                      > >
                      > > > Indeed. I was on the point of suggesting AN XML processor until I
                      > > > saw that (libxml2 accepts HTML as well as XML input).
                      > >
                      > > A quick glance at the API docs suggested that the HTML API is similar
                      > > but separate from the XML API. Is it so?
                      >
                      > But does this matter, in the context of the original question?

                      Perhaps not. It was a new question in the spirit of "discussion
                      forum--not help desk". :-)

                      > Surely, given any WWW-compatible HTML or XHTML data stream, one can
                      > choose to convert any non-ascii coded character (or any selection of
                      > non-ascii characters) to a unicode code point and thence into
                      > &#bignumber; notation, purely at the character stream layer, without
                      > parsing the rest of the material at all?

                      Yes, except comments change if they exist and contain non-ASCII.

                      --
                      Henri Sivonen
                      hsivonen@iki.fi

                      Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html


                      • Bjoern Hoehrmann

                        #12
                        Re: character to HTML ampersand escape sequence converter

                        * Alan J. Flavell wrote in comp.infosystems.www.authoring.html:
                        >I don't dispute that in theory you can produce counter-examples where
                        >the simple method described above gives the wrong result, for the
                        >reasons you gave; but I'm interested if a real-life example can be
                        >produced where this would matter.

                        Consider a HTML document with

                        <style type="text/css">
                        q:lang(no) { quotes: "«" "»" '"' '"' }
                        </style>

                        or consider HTML documents with scripts such as those in

                        http://www.rfs.jp/sitebuilder/javascript/01/08.html
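
To make the style-sheet counter-example concrete, the following Python sketch applies a markup-unaware conversion to a small document containing that rule. The helper `naive_escape` and the added `<p>` line are illustrative assumptions, not part of the original example.

```python
def naive_escape(html: str) -> str:
    """Escape every non-ASCII character, with no knowledge of markup."""
    return html.encode("ascii", "xmlcharrefreplace").decode("ascii")


document = (
    '<style type="text/css">\n'
    'q:lang(no) { quotes: "«" "»" \'"\' \'"\' }\n'
    "</style>\n"
    "<p>«Hei»</p>"
)

converted = naive_escape(document)
print(converted)
# The paragraph text is fine: <p>&#171;Hei&#187;</p>
# The style sheet now reads:  quotes: "&#171;" "&#187;" '"' '"'
```

In HTML 4 the content of a style element is CDATA, so character references there are not expanded. The style sheet would then see the literal seven-character string `&#171;` rather than the guillemet, which changes what the rule does.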

                        --
                        Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
                        Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
                        68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/


                        • Alan J. Flavell

                          #13
                          Re: character to HTML ampersand escape sequence converter

                          On Sat, 18 Dec 2004, Bjoern Hoehrmann wrote:

                          > Consider a HTML document with
                          >
                          > <style type="text/css">
                          > q:lang(no) { quotes: "«" "»" '"' '"' }
                          > </style>
                          >
                          > or consider HTML documents with scripts such as those in
                          >
                          > http://www.rfs.jp/sitebuilder/javascript/01/08.html

                          OK, I concede.

                          Of course, if the target encoding was meant to be us-ascii with
                          &#bignumber; representations of non-ascii characters (which might have
                          been what the questioner had in mind, since I understood the request to
                          be for &#bignumber; representation rather than actual utf-8-encoded
                          characters in the HTML part), then you'd need CSS-aware and
                          Javascript-aware converters to know how to represent those non-ascii
                          characters in their respective languages.
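
As a hedged illustration of what "CSS-aware" and "Javascript-aware" could mean here, the Python sketch below writes non-ASCII characters using each language's own escape syntax: a backslash plus hexadecimal code point plus a terminating space for CSS, and `\uXXXX` UTF-16 code units for JavaScript. The function names are hypothetical, and the JavaScript form is only meaningful inside string literals; the sketch does not locate those literals for you.

```python
def css_escape(text: str) -> str:
    """Write non-ASCII characters as CSS escapes (backslash, hex, space)."""
    # The trailing space terminates the escape and is consumed by the CSS
    # parser, so whatever follows in the source is left unchanged.
    return "".join(ch if ord(ch) < 128 else f"\\{ord(ch):x} " for ch in text)


def js_escape(text: str) -> str:
    """Write non-ASCII characters as \\uXXXX escapes for JS string literals."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Characters beyond the BMP become a surrogate pair, one \uXXXX
        # escape per 16-bit UTF-16 code unit.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u{:04x}".format(int.from_bytes(units[i:i + 2], "big")))
    return "".join(out)


print(css_escape('quotes: "«" "»"'))  # quotes: "\ab " "\bb "
print(js_escape('alert("«日»")'))      # alert("\u00ab\u65e5\u00bb")
```

Both outputs are pure ASCII and keep their meaning inside style and script elements, where `&#N;` references would not be interpreted.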

                          Indeed the W3C were wise in their XHTML documentation to recommend
                          moving those enclosures out into separate files rather than trying to
                          in-line them as CDATA ;-)


                          • Nick Kew

                            #14
                            Re: character to HTML ampersand escape sequence converter

                            In article <hsivonen-5BCFB2.12592918122004@news.dnainternet.net>,
                            Henri Sivonen <hsivonen@iki.fi> writes:

                            >> Indeed. I was on the point of suggesting AN XML processor until I saw
                            >> that (libxml2 accepts HTML as well as XML input).
                            >
                            > A quick glance at the API docs suggested that the HTML API is similar
                            > but separate from the XML API. Is it so?

                            Yes, that's a reasonably fair summary. The HTML parser is the XML
                            parser with tolerance of non-XML and knowledge of HTML4.

                            > Is there an equivalent of SAX
                            > filter or somesuch that would make HTML appear to the app as XHTML?

                            The HTML parser gives you either SAX or DOM, and will process either
                            HTML or XHTML input without distinction. HTML mode is also tolerant
                            of tag-soup, though not quite as forgiving as a typical browser.
                            There are a few bugs wrt the spec: most obviously, it only recognises
                            XML comment syntax (but then, so do the browsers).

                            As a corollary, you can use it to apply XML processing to HTML.
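
To show the general shape of a parser-driven conversion, here is a small Python sketch. It does not use libxml2: it swaps in the standard-library `html.parser` module as a stand-in event parser, and the class and function names are invented for this example. Text and start tags are escaped, while script and style content is passed through untouched because character references are not interpreted there. CDATA sections and other marked sections are not handled, and end tags are re-emitted in lower case.

```python
from html.parser import HTMLParser


def escape(text: str) -> str:
    """Write non-ASCII characters as decimal numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


class TextOnlyEscaper(HTMLParser):
    """Re-serialise a document, escaping only where references are valid."""

    RAW_TEXT_ELEMENTS = {"script", "style"}

    def __init__(self):
        # Keep existing references such as &amp; as they were written.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_raw_text = False

    def handle_starttag(self, tag, attrs):
        # Attribute values may hold references, so the whole tag is escaped.
        self.parts.append(escape(self.get_starttag_text()))
        if tag in self.RAW_TEXT_ELEMENTS:
            self.in_raw_text = True

    def handle_startendtag(self, tag, attrs):
        self.parts.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")
        if tag in self.RAW_TEXT_ELEMENTS:
            self.in_raw_text = False

    def handle_data(self, data):
        # Script and style content is copied verbatim.
        self.parts.append(data if self.in_raw_text else escape(data))

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")  # comments are left unchanged

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.parts.append(f"<?{data}>")


def convert(html: str) -> str:
    parser = TextOnlyEscaper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(convert('<p title="é">«hi»</p><style>q { quotes: "«" "»" }</style>'))
# <p title="&#233;">&#171;hi&#187;</p><style>q { quotes: "«" "»" }</style>
```

The output is not pure ASCII when scripts or style sheets contain non-ASCII characters; those parts would still need language-specific escapes or a move to external files.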

                            > TagSoup on the Java side appears to the app as an XML parser parsing
                            > XHTML.

                            I'm not familiar with that, but it's not uncommon.

                            > Has anyone compared the tag slurping features of TagSoup and libxml2? I
                            > wonder which one is a better idea when writing in Python: using libxml2
                            > with CPython or using TagSoup with Jython?

                            Couldn't tell you. But I'd venture a strong guess that libxml2 will be
                            not only a great deal faster than anything-java, but also no harder
                            and possibly easier to work with.


                            --
                            Nick Kew


                            • Henri Sivonen

                              #15
                              Re: character to HTML ampersand escape sequence converter

                              In article <fu3e92-a01.ln1@hugin.webthing.com>,
                              nick@hugin.webthing.com (Nick Kew) wrote:

                              > In article <hsivonen-5BCFB2.12592918122004@news.dnainternet.net>,
                              > Henri Sivonen <hsivonen@iki.fi> writes:
                              >
                              > >> Indeed. I was on the point of suggesting AN XML processor until I saw
                              > >> that (libxml2 accepts HTML as well as XML input).

                              > The HTML parser gives you either SAX or DOM, and will process either
                              > HTML or XHTML input without distinction.

                              Are the elements in the XHTML namespace or in no namespace? The good
                              thing about TagSoup is that it allows the app internals to be written
                              for XHTML, so the same app internals work for HTML, XHTML *and*
                              XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is
                              left on the parsing level and not carried over to higher levels as in
                              browsers.

                              > But I'd venture a strong guess that libxml2 will be
                              > not only a great deal faster than anything-java, but also no harder
                              > and possibly easier to work with.

                              I think I read somewhere that the libxml2 wrapper gives the Python side
                              UTF-8 byte strings instead of Python Unicode strings.

                              --
                              Henri Sivonen
                              hsivonen@iki.fi

                              Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
