Extract Content from HTML ?

**Toby Inkster** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

mark4 wrote:
> Are there any utilities to help me extract Content from HTML ?
> I'd like to store this data in a database.

Looks to me like you'd have to write your own customised program to
extract the data.

To do that, I recommend using Perl. Perl has a module called HTML::Parser
which is apparently pretty good at extracting information from malformed
HTML files. What's more, it is generally very good at text handling and has
decent database modules too.
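
As a rough illustration of the event-driven parsing that a module like HTML::Parser offers, here is a minimal sketch using Python's standard `html.parser` module instead (a stand-in, not the Perl module itself). The `class` values `author` and `body` and the names `PostExtractor` and `extract_fields` are hypothetical; the real markers depend on the forum's markup.

```python
from html.parser import HTMLParser

# Hypothetical class names; the real markers depend on the forum's markup.
FIELDS = {"author", "body"}


class PostExtractor(HTMLParser):
    """Collect text from <td>/<span> elements whose class names a field."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.records = []    # (field, text) pairs in document order
        self._field = None   # field currently being collected, if any
        self._tag = None     # tag name that opened the current field
        self._chunks = []    # text pieces seen so far for that field

    def _flush(self):
        # Store the field collected so far, with whitespace normalised.
        if self._field is not None:
            text = " ".join("".join(self._chunks).split())
            self.records.append((self._field, text))
        self._field, self._tag, self._chunks = None, None, []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased, so <TD CLASS=...> works.
        cls = dict(attrs).get("class")
        if tag in ("td", "span") and cls in FIELDS:
            self._flush()    # tolerate a missing closing tag on the last field
            self._field, self._tag = cls, tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._flush()

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)

    def close(self):
        super().close()
        self._flush()        # a field still open at end of input


def extract_fields(html):
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.records
```

The parser never checks that tags are balanced, so a missing `<TR>` or `</TD>` does not stop it. A nested element of the same tag name would end a field early, so treat this as a starting point only.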
> Nor can I contact the person who 'owns' it. If I did contact them, they
> would be unlikely to release the data.
>
> Despite this, there are no copyright issues here. Every single post made
> to the forum was made using an alias and no forum poster wants to be
> identified, nor do any posters wish to claim "ownership" of their
> contributions.

Sounds to me like there are *major* copyright issues!

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

**mark4** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster
<usenet200502@tobyinkster.co.uk> wrote:

>mark4 wrote:
>
>> Are there any utilities to help me extract Content from HTML ?
>> I'd like to store this data in a database.
>
>Looks to me like you'd have to write your own customised program to
>extract the data.

I expected as much.
>To do that, I recommend using Perl. Perl has a module called HTML::Parser
>which is apparently pretty good at extracting information from malformed
>HTML files. What's more, it is generally very good at text handling and has
>decent database modules too.

Thanks. Being a microserf, I don't normally code in Perl but I
may look into this. It's either that or WSH JavaScript with
its regular expressions. Fortunately I already have a top
level design and it looks pretty simple. I may look into this
Perl module but it will probably be easier to use microserf
technology, with which I'm intimate. I shall probably store
it in MSSQL.
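
For what it's worth, the storage side can be sketched in a few lines. This is only an illustration using Python's built-in `sqlite3` module as a stand-in for MSSQL; the `posts` table, its columns, and the function names are all assumptions, not part of any existing design.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the posts table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               thread_file TEXT NOT NULL,     -- source file name of the thread
               position    INTEGER NOT NULL,  -- order of the post in the thread
               author      TEXT,
               body        TEXT,              -- may contain kept HTML
               PRIMARY KEY (thread_file, position)
           )"""
    )
    return conn


def add_posts(conn, thread_file, posts):
    """Store one thread; posts is a list of (author, body) pairs."""
    rows = [(thread_file, i, author, body)
            for i, (author, body) in enumerate(posts)]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)", rows
        )


def search(conn, term):
    """Return (thread_file, position, author) for posts containing term."""
    cur = conn.execute(
        "SELECT thread_file, position, author FROM posts "
        "WHERE body LIKE ? ORDER BY thread_file, position",
        ("%" + term + "%",),
    )
    return cur.fetchall()
```

Parameterised queries keep quotes and stray markup in post bodies from breaking the SQL. Note that `%` and `_` in the search term act as wildcards in this sketch.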
>> Nor can I contact the person who 'owns' it. If I did contact them, they
>> would be unlikely to release the data.
>>
>> Despite this, there are no copyright issues here. Every single post made
>> to the forum was made using an alias and no forum poster wants to be
>> identified, nor do any posters wish to claim "ownership" of their
>> contributions.
>
>Sounds to me like there are *major* copyright issues!

I can't see what those issues are. Who owns the data? Not the
original forum provider. The data posted to a forum is the copyright
of the original author, no matter what ToS may be specified in
the forum. All those original authors have an alias and don't
actually want to be identified. What I'm doing is no more a
violation of copyright than someone keeping newspaper clippings.

So long as I don't republish it.

**Sherm Pendley** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

mark4 wrote:
> On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster
> <usenet200502@tobyinkster.co.uk> wrote:
>
>>To do that, I recommend using Perl. Perl has a module called HTML::Parser
>>which is apparently pretty good at extracting information from malformed
>>HTML files. What's more, it is generally very good at text handling and has
>>decent database modules too.

Mark's right. I don't do the whole "language cheerleader" thing - but for
this particular problem, Perl's an ideal fit.
> Thanks. Being a microserf, I don't normally code in Perl but I
> may look into this. It's either that or WSH JavaScript with
> its regular expressions.

There's Perl for Windows, you know. It integrates nicely with WSH too.

<http://www.activestate.com>

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org

**Philip Herlihy** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

Access can link to HTML (direct from the web) and will recognise tables.
You might be lucky! It would make a very quick solution. File > Get
External Data > Link... and then choose HTML. I was surprised how well it
worked when I tried it on a table I'd created in FrontPage.

--
####################
## PH, London
####################
"mark4" <mark4asp@#notthis#ntlworld.com> wrote in message
news:8eb521h18o9m8s8l4dgcfvl61riho36r65@4ax.com...
> Hello,
>
> Are there any utilities to help me extract Content from HTML ?
>
> I'd like to store this data in a database.
>
> The HTML consists of about 10,000 files with a total size of
> about 160 MB. Each file is a thread from a message forum. Each
> thread has several contributions. The threads are in linear
> order of date posted with filenames such as 000125633.html. The
> HTML is marked up with <table>, etc. tags. This HTML is very
> badly formed with crucial tags missing (such as <TR>, <BODY>,
> etc.). There is no coherence to this; no system - sometimes tags
> are missing and sometimes they are present. Despite this, the
> threads seem to render correctly; such is the forgiving nature
> of modern browsers.
>
> Fields for each post are usually identified by an attribute
> (usually an attribute of a <TD> or <SPAN>).
>
> Sometimes I need to actually store HTML with the content (for
> instance when a post includes a link, colored writing or text
> formatted with <PRE> tags).
>
> My purpose in storing this in a database is to make the content
> (a) easier to search and (b) use a more efficient storage
> medium.
>
> The original database from which these web-forum posts were
> taken is no longer available on the web nor does it look like it
> ever will be again. Nor can I contact the person who 'owns' it.
> If I did contact them, they would be unlikely to release the
> data.
>
> Despite this, there are no copyright issues here. Every single
> post made to the forum was made using an alias and no forum
> poster wants to be identified, nor do any posters wish to claim
> "ownership" of their contributions.

**Jim Royal** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

In article <8eb521h18o9m8s8l4dgcfvl61riho36r65@4ax.com>, mark4
<mark4asp@#notthis#ntlworld.com> wrote:

> Are there any utilities to help me extract Content from HTML ?

BBEdit has a simple menu command to remove markup from an HTML page,
leaving only the content. You can then run whatever regex operations
you need to massage the data before saving it.

To process all those files, it should be a pretty simple matter to
write an AppleScript to automate this procedure.

However, this solution is Macintosh-only.
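
For anyone not on a Mac, the same strip-and-batch idea can be sketched with Python's standard library. This is a rough sketch and makes no attempt to match what BBEdit's command does; the names `TextOnly`, `strip_markup` and `convert_directory`, the `*.html` pattern, and the UTF-8 encoding are all assumptions.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# Tags that normally separate pieces of text when rendered.
BREAKING_TAGS = {"td", "tr", "p", "br", "div", "table"}


class TextOnly(HTMLParser):
    """Keep character data, drop tags, skip <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BREAKING_TAGS:
            self.parts.append(" ")  # stop adjacent cells running together

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_markup(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def convert_directory(src_dir, dst_dir):
    """Write a .txt version of every .html file; return how many were done."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob("*.html")):
        # Assumed encoding; undecodable bytes are replaced, not fatal.
        html = path.read_text(encoding="utf-8", errors="replace")
        out = dst / (path.stem + ".txt")
        out.write_text(strip_markup(html), encoding="utf-8")
        count += 1
    return count
```

Stripping everything loses the per-post fields, so this suits full-text searching better than loading a structured database.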

--
Jim Royal
"Understanding is a three-edged sword"

http://JimRoyal.com

**Chrissy Cruiser** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

On Mon, 28 Feb 2005 08:32:19 GMT, mark4 wrote:
>>> Nor can I contact the person who 'owns' it. If I did contact them, they
>>> would be unlikely to release the data.
>>>
>>> Despite this, there are no copyright issues here. Every single post made
>>> to the forum was made using an alias and no forum poster wants to be
>>> identified, nor do any posters wish to claim "ownership" of their
>>> contributions.
>>
>>Sounds to me like there are *major* copyright issues!
>
> I can't see what those issues are.

By law, those posts are copyrighted and owned by the posters.

**John Fitzsimons** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

On Mon, 28 Feb 2005 06:06:36 GMT, mark4
<mark4asp@#nott his#ntlworld.co m> wrote:
>Hello,
>
>Are there any utilities to help me extract Content from HTML ?

< snip >

Notetab ? Modify - Strip HTML tags ?

http://www.notetab.com/

Not sure whether that is in the freeware version or not.

Regards, John.

**Toby Inkster** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

mark4 wrote:
> Thanks. Being a microserf, I don't normally code in Perl but I
> may look into this.

I am told ActiveState's Windows port of Perl is pretty good. Alternatively
there is also a Cygwin version of Perl.
> I can't see what those issues are. Who owns the data?

Its original authors, unless they explicitly signed away the copyright.
> All those original authors have an alias and don't actually want to be
> identified.

Publishing anonymously or under a pseudonym does not mean you forgo
copyright.
> So long as I don't republish it.

If you are keeping the database for private use, then you can probably
"get away with it", but the natural assumption on alt.html is that posters
are wanting to publish their efforts to the web, unless it's explicitly
stated otherwise.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

**ggrothendieck@volcanomail.com** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

> >To do that, I recommend using Perl. Perl has a module called HTML::Parser
> >which is apparently pretty good at extracting information from malformed
> >HTML files. What's more, it is generally very good at text handling and has
> >decent database modules too.
>
> Thanks. Being a microserf, I don't normally code in Perl but I
> may look into this. It's either that or WSH JavaScript with
> its regular expressions. Fortunately I already have a top
> level design and it looks pretty simple. I may look into this
> Perl module but it will probably be easier to use microserf
> technology, with which I'm intimate. I shall probably store
> it in MSSQL.

You could use the InternetExplorer.Application COM object.
That would give you the facilities for performing HTML
parsing without regexps. It would therefore be
more robust and readily doable in your favorite language.
Try google for examples.

**mbstevens** · Jul 23 '05, 11:39 PM

Re: Extract Content from HTML ?

mark4 wrote:
> Hello,
>
> Are there any utilities to help me extract Content from HTML ?

lynx -dump http://whateverTheHeck.com > temp.txt

... is the shortest program I know of for this kind of thing.
The '>' redirection to temp.txt may vary somewhat between operating systems.
--
mbstevens http://www.mbstevens.com
