Extract plain text out of HTML page

**Dmitriy Lapshin [C# / .NET MVP]** · Nov 15 '05, 01:59 PM

Re: Extract plain text out of HTML page

Hi,

The dumbest solution would probably be to employ a regular expression to cut
out any construct of the form "<...>" from the HTML file. You could use the
following RegExp:

<[^>]+>

to replace all tags with empty strings.

--
Dmitriy Lapshin [C# / .NET MVP]
X-Unity Test Studio

http://x-unity.miik.com.ua/teststudio.aspx

Bring the power of unit testing to VS .NET IDE

"Guogang" <nospam@no_such_domain.com> wrote in message
news:#SnWKYdmDHA.684@TK2MSFTNGP09.phx.gbl...
> Hi,
>
> I need to extract plain text from HTML page (i.e. do not show images, html
> formatting, ...)
>
> Is there some C# class/function that can help me on this?
>
> Thanks,
> Guogang

**Chris R. Timmons** · Nov 15 '05, 01:59 PM

Re: Extract plain text out of HTML page

"Guogang" <nospam@no_such_domain.com> wrote in
news:#SnWKYdmDHA.684@TK2MSFTNGP09.phx.gbl:

> Hi,
>
> I need to extract plain text from HTML page (i.e. do not show
> images, html formatting, ...)
>
> Is there some C# class/function that can help me on this?

Guogang,

A regular expression can be used to strip out all HTML tags:

using System.Text.RegularExpressions;
...
string plainText = Regex.Replace(htmlText, "<[^>]+?>", "");
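
To make the one-liner above easier to try, here is a minimal, self-contained sketch built around the same pattern. The class name `HtmlText`, the method name `StripTags`, the extra pass that drops `<script>`/`<style>` contents, and the entity decoding step are illustrative assumptions added here, not part of the original suggestion. It will still be fooled by things like a `>` inside an attribute value or HTML comments, so treat it as a rough filter rather than a parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class HtmlText
{
    public static string StripTags(string htmlText)
    {
        // Assumption: script and style bodies are not wanted in the plain text,
        // so remove those elements together with their contents first.
        string text = Regex.Replace(
            htmlText,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // The pattern from the post: replace every remaining tag with an empty string.
        text = Regex.Replace(text, "<[^>]+?>", "");

        // Turn entities such as &amp; back into the characters they stand for.
        text = WebUtility.HtmlDecode(text);

        return text.Trim();
    }
}
```

Calling `HtmlText.StripTags(htmlText)` on a page's source returns the text with the tags removed. Decoding the entities after stripping, rather than before, keeps an escaped `&lt;b&gt;` in the page from being mistaken for a real tag.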

Hope this helps.

Chris.
-------------
C.R. Timmons Consulting, Inc.

http://www.crtimmonsinc.com/

**Morten Wennevik** · Nov 15 '05, 01:59 PM

Re: Extract plain text out of HTML page

I'm not familiar with the Regex class, which may be more suitable, but you
could strip HTML tags with something like

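// Assumes htmlPage is a string field holding the page source.
// Keep stripping while any "<" remains.
int i;
while ((i = htmlPage.IndexOf("<")) != -1)
{
    strip(i);
}

// Removes the tag starting at index i (where htmlPage[i] is '<').
private void strip(int i)
{
    int a = htmlPage.IndexOf("<", i + 1); // next "<" after this one, if any
    int b = htmlPage.IndexOf(">", i);     // the ">" that closes this tag
    if (b == -1)                          // unterminated tag: strip to the end
    {
        htmlPage = htmlPage.Remove(i);
        return;
    }
    if (a != -1 && a < b)                 // nested tags, so recurse on the inner one first
    {
        strip(a);
        return;                           // the loop above then retries the outer tag
    }
    htmlPage = htmlPage.Remove(i, b - i + 1); // strip away everything from i to b
}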

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

**Girish Bharadwaj** · Nov 15 '05, 02:00 PM

Re: Extract plain text out of HTML page

Guogang wrote:

> Hi,
>
> I need to extract plain text from HTML page (i.e. do not show images, html
> formatting, ...)
>
> Is there some C# class/function that can help me on this?
>
> Thanks,
> Guogang

You can try writing a simple XSL stylesheet that transforms the HTML to text.
Of course, for this to work, you need to make sure that the HTML is
well-formed. Otherwise, use the other suggestions.
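
As a rough illustration of the XSL idea, the sketch below runs a tiny stylesheet over the markup using `XslCompiledTransform` from `System.Xml.Xsl`. The stylesheet, the class name `XsltTextExtractor`, and the method name `ToPlainText` are assumptions made up for this example. It also assumes the input parses as XML, has no DOCTYPE (the default reader settings reject one), uses no undeclared entities such as `&nbsp;`, and declares no default namespace (otherwise the `script|style` match would need a namespace prefix).

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper class; the names are illustrative only.
public static class XsltTextExtractor
{
    // method='text' emits text nodes only. The built-in template rules copy
    // text content through, and the empty template discards script and style.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='script|style'/>
</xsl:stylesheet>";

    public static string ToPlainText(string wellFormedHtml)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xsltReader);
        }

        // Throws XmlException if the markup is not well-formed XML.
        using (var input = XmlReader.Create(new StringReader(wellFormedHtml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Because the reader throws on malformed markup, a caller could catch `XmlException` and fall back to one of the other suggestions in this thread.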

--
Girish Bharadwaj
