Extract plain text out of HTML page

  • Guogang

    Extract plain text out of HTML page

    Hi,

    I need to extract the plain text from an HTML page (i.e. without the
    images, HTML formatting, ...).

    Is there a C# class or function that can help me with this?

    Thanks,
    Guogang


  • Dmitriy Lapshin [C# / .NET MVP]

    #2
    Re: Extract plain text out of HTML page

    Hi,

    The simplest, if crude, solution would probably be to use a regular
    expression to cut every construct of the form "<...>" out of the HTML
    file. You could use the following regular expression:

    <[^>]+>

    to replace all tags with empty strings.

    --
    Dmitriy Lapshin [C# / .NET MVP]
    X-Unity Test Studio

    Bring the power of unit testing to VS .NET IDE

    • Chris R. Timmons

      #3
      Re: Extract plain text out of HTML page

      Guogang,

      A regular expression can be used to strip out all HTML tags:

      using System.Text.RegularExpressions;
      ...
      string plainText = Regex.Replace(htmlText, "<[^>]+?>", "");

      Hope this helps.

      Chris.
      -------------
      C.R. Timmons Consulting, Inc.
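
A note on the regular expression suggested in the two replies above: on its own it leaves the contents of script and style elements in the output and does not decode entities such as &amp;. The following is a minimal sketch of one way to extend it, not a complete HTML parser. The class and method names are hypothetical, and it assumes a framework version that provides System.Net.WebUtility. Tags are replaced with a space rather than an empty string so that words in adjacent elements do not run together.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper, not part of the .NET class library.
    public static string ToPlainText(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop comments, which may themselves contain '>' characters.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Replace every remaining tag with a space (assumption: a space is
        // preferred over an empty string so neighbouring words stay apart).
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Turn entities such as &amp; and &lt; back into characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Called as HtmlText.ToPlainText(htmlText), this returns the text on a single line. Malformed markup, or a '>' inside an attribute value, can still confuse it.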

      • Morten Wennevik

        #4
        Re: Extract plain text out of HTML page

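        I'm not familiar with the Regex class, which may be more suitable, but
        you could strip HTML tags with something like this:

        // htmlPage is assumed to be a string field holding the page source
        int i;
        while ((i = htmlPage.IndexOf("<")) != -1)
        {
            Strip(i);
        }

        private void Strip(int i)
        {
            int a = htmlPage.IndexOf("<", i + 1); // next "<" after the one at i
            int b = htmlPage.IndexOf(">", i);     // first ">" after i
            while (a != -1 && a < b) // nested tag, so strip it first (recursively)
            {
                Strip(a);
                a = htmlPage.IndexOf("<", i + 1); // the string changed: search again
                b = htmlPage.IndexOf(">", i);
            }
            // strip everything from i to b; with no closing ">", drop the rest
            htmlPage = (b == -1) ? htmlPage.Substring(0, i) : htmlPage.Remove(i, b - i + 1);
        }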

        --
        Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
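
The index-based idea in the reply above can also be written as a single pass over the characters, which avoids repeatedly searching and rebuilding the string. This is a sketch of a different technique from the recursive one, with hypothetical names. It treats the first ">" after a "<" as the end of the tag, so it does not handle a "<" nested inside another tag, and it knows nothing about comments, scripts, or entities.

```csharp
using System.Text;

public static class TagStripper
{
    // Hypothetical helper: copies every character that lies outside <...>.
    public static string Strip(string html)
    {
        var result = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;          // a tag starts here
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;         // the tag ends here
            }
            else if (!insideTag)
            {
                result.Append(c);          // ordinary text, keep it
            }
        }

        return result.ToString();
    }
}
```

A ">" that appears outside any tag is kept as text, and an unclosed "<" discards everything after it.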

        • Girish Bharadwaj

          #5
          Re: Extract plain text out of HTML page

          You can try writing a simple XSLT stylesheet that transforms the HTML
          to text. Of course, for this to work, you need to make sure that the
          HTML is well-formed. Otherwise, use the other suggestions.

          --
          Girish Bharadwaj
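
The XSLT suggestion above could look roughly like the following. This is a sketch under several assumptions: the input is well-formed XML (XHTML), it has no DOCTYPE declaration (XmlReader rejects DTDs by default), and it uses no HTML-only entities such as &nbsp;. The class name, method name, and stylesheet are illustrative. The stylesheet sets the output method to text and suppresses script and style elements, and the built-in XSLT template rules copy all other text nodes to the output.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XhtmlToText
{
    // Illustrative stylesheet: text output, script/style contents skipped.
    private const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match=""*[local-name()='script' or local-name()='style']""/>
</xsl:stylesheet>";

    // Hypothetical helper: throws XmlException if the input is not well-formed.
    public static string ToText(string wellFormedHtml)
    {
        var transform = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(styleReader);
        }

        using (var input = XmlReader.Create(new StringReader(wellFormedHtml)))
        using (var output = new StringWriter())
        {
            // No stylesheet parameters are needed, so the argument list is null.
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Because most pages found in the wild are not well-formed XML, this route is mainly useful when you control the markup.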
