Convert HTML to XML or Parse HTML

  • Q.Z.

    Convert HTML to XML or Parse HTML

    Hello,

    Does anybody know whether there is a .NET or COM-based library
    that can parse HTML, or convert HTML to XML so that I can query
    it with XPath?

    Thanks
    Qin Zhou
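
To make the "convert HTML to XML, then query it with XPath" idea concrete, here is a minimal sketch. It is not one of the .NET or COM libraries discussed in this thread: it assumes Python's standard `html.parser` and `xml.etree.ElementTree` modules as stand-ins, and the names `HtmlToXml`, `html_to_xml`, and `VOID` are hypothetical. It also assumes the page has a single root element, and ElementTree supports only a subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical helper set: HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Illustrative converter: builds a well-formed tree from loose HTML."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements that are still open

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "checked") arrive with a value of None.
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any children left open, then the requested tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text that falls outside the root element
            self.builder.data(data)

    def close(self):
        super().close()
        while self.open_tags:  # close whatever the page never closed
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_xml(text):
    """Parse an HTML string and return the root Element."""
    parser = HtmlToXml()
    parser.feed(text)
    return parser.close()


sample = "<html><body><p>One<br>two<p>Three <a href=x.htm>link</a></body></html>"
root = html_to_xml(sample)
# XPath-style query over the repaired tree.
for link in root.findall(".//a[@href]"):
    print(link.get("href"))
```

This sketch only closes unclosed tags; it does not apply HTML's implicit-closing rules (the second `<p>` ends up nested inside the first), which is the kind of repair a dedicated tool such as Tidy handles.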
  • Ken Cox [Microsoft MVP]

    #2
    Re: Convert HTML to XML or Parse HTML

    It looks like you can use the COM wrapper around Tidy to get there...

    http://perso.wanadoo.fr/ablavier/TidyCOM/

    http://www.15seconds.com/Issue/010601.htm

    • Steven Cheng[MSFT]

      #3
      RE: Convert HTML to XML or Parse HTML

      Hi Q.Z,


      Thank you for using Microsoft Newsgroup Service. Based on your description,
      you are looking for a COM or .NET component that can convert an HTML
      document into an XML (XHTML) document. Is my understanding correct?

      If so, I think Ken Cox has provided some good links on this topic; they
      describe two COM components. You may want to try them to see whether they
      help.

      Steven Cheng
      Microsoft Online Support

      Get Secure! www.microsoft.com/security
      (This posting is provided "AS IS", with no warranties, and confers no
      rights.)


      • Joerg Jooss

        #4
        Re: Convert HTML to XML or Parse HTML

        "Ken Cox [Microsoft MVP]" wrote:
        > It looks like you can use the COM wrapper around Tidy to get there...
        >
        > http://perso.wanadoo.fr/ablavier/TidyCOM/
        >
        > http://www.15seconds.com/Issue/010601.htm

        Uh -- why COM when there's Chris Lovett's SgmlReader at
        www.gotdotnet.com?

        Also note that TidyLib can be easily used through P/Invoke.

        Cheers,
        --
        Joerg Jooss
        joerg.jooss@gmx.net


        • Q.Z

          #5
          Re: Convert HTML to XML or Parse HTML

          Ken and Steven,

          Thanks a lot! Looks like it will do the trick.

          Qin Zhou

          • David Elliott

            #6
            Re: Convert HTML to XML or Parse HTML

            I have tried the SgmlReader but am having difficulty with some sites, such as www.msn.com.

            If I could find a way to parse HTML using C/C++/C#, I would be happy. All I really
            need is a way to get an array of <tag> and <data>. Finer granularity is not necessary, just
            the raw information. I do need the entire page, though, from the opening <html> to the closing </html>.

            I would prefer an HTML to XML conversion, but as time is limited, any solution would be
            appreciated.

            Thanks,
            Dave
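
As a rough illustration of the flat "array of <tag> and <data>" described above, here is a minimal sketch. It assumes Python's standard `html.parser` module rather than C/C++/C#, and the names `TagDataCollector` and `tags_and_data` are hypothetical. It records every tag and text run in document order, from the opening <html> to the closing </html>, without building a tree.

```python
from html.parser import HTMLParser


class TagDataCollector(HTMLParser):
    """Illustrative collector: records tags and text runs in document order."""

    def __init__(self):
        super().__init__()
        self.items = []  # list of (kind, value) pairs

    def handle_starttag(self, tag, attrs):
        self.items.append(("tag", tag))

    def handle_endtag(self, tag):
        self.items.append(("tag", "/" + tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.items.append(("data", text))


def tags_and_data(text):
    """Return the flat list of ("tag", name) and ("data", text) pairs."""
    collector = TagDataCollector()
    collector.feed(text)
    collector.close()
    return collector.items


page = "<html><body><h1>Hi</h1><p>there</body></html>"
for kind, value in tags_and_data(page):
    print(kind, value)
```

Because nothing is nested or repaired, unclosed tags such as the `<p>` in the sample page simply appear in the list without a matching closing entry.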



            On Fri, 09 Jan 2004 03:23:29 GMT, v-schang@online.microsoft.com (Steven Cheng[MSFT]) wrote:

            >Hi Q.Z,
            >
            >
            >Thank you for using Microsoft Newsgroup Service. Based on your description,
            >you are looking for some COM or dotnet components which can convert the
            >html document into XML (XHTML) style document. Is my understanding correct?
            >
            >If so, I think Ken Cox've provided some good sites on this topic, they
            >shows two components of COM. You may have a try on them to see whether they
            >help.
            >
            >Steven Cheng
            >Microsoft Online Support
            >
            >Get Secure! www.microsoft.com/security
            >(This posting is provided "AS IS", with no warranties, and confers no
            >rights.)


            • Maxim Kazitov

              #7
              Re: Convert HTML to XML or Parse HTML

              If you load your page into a WebBrowser control, you can parse it through the DOM.
              This is slow, but it works.

              • George Ter-Saakov

                #8
                Re: Convert HTML to XML or Parse HTML

                Take a look


                George.