Extract Content from HTML ?

  • mark4

    Extract Content from HTML ?

    Hello,

    Are there any utilities to help me extract Content from HTML ?

    I'd like to store this data in a database.

    The HTML consists of about 10,000 files with a total size of
    about 160 MB. Each file is a thread from a message forum. Each
    thread has several contributions. The threads are in linear
    order of date posted with filenames such as 000125633.html. The
    HTML is marked up with <table> and similar tags. This HTML is very
    badly formed with crucial tags missing (such as <TR>, <BODY>,
    etc.). There is no coherence to this; no system - sometimes tags
    are missing and sometimes they are present. Despite this, the
    threads seem to render correctly; such is the forgiving nature
    of modern browsers.

    Fields for each post are usually identified by an attribute
    (usually an attribute of a <TD> or <SPAN>).

    Sometimes I need to actually store HTML with the content (for
    instance when a post includes a link, colored writing or text
    formatted with <PRE> tags).

    My purpose in storing this in a database is to (a) make the
    content easier to search and (b) use a more efficient storage
    medium.

    The original database from which these web-forum posts were
    taken is no longer available on the web nor does it look like it
    ever will be again. Nor can I contact the person who 'owns' it.
    If I did contact them, they would be unlikely to release the
    data.

    Despite this, there are no copyright issues here. Every single
    post made to the forum was made using an alias and no forum
    poster wants to be identified, nor do any posters wish to claim
    "ownership" of their contributions.

  • Toby Inkster

    #2
    Re: Extract Content from HTML ?

    mark4 wrote:

    > Are there any utilities to help me extract Content from HTML ?
    > I'd like to store this data in a database.

    Looks to me like you'd have to write your own customised program to
    extract the data.

    To do that, I recommend using Perl. Perl has a module called HTML::Parser
    which is apparently pretty good at extracting information from malformed
    HTML files. What's more, it is generally very good at text handling and has
    decent database modules too.
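
For readers who would rather not use Perl, the same event-driven approach can be sketched in Python. This is only an illustration, not the method anyone in this thread used: it swaps in Python's standard `html.parser` module for HTML::Parser, and it assumes the fields are marked with a `class` attribute on `<td>` or `<span>` (the original post does not say which attribute). The names `FieldExtractor` and `extract_fields` are hypothetical.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect text from <td>/<span> elements that carry a class attribute."""

    FIELD_TAGS = {"td", "span"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = []      # list of (class_name, text) pairs
        self._name = None     # class of the field being collected, if any
        self._tag = None      # tag that opened the current field
        self._chunks = []     # text fragments of the current field

    def _flush(self):
        # Store the field collected so far (if any) and reset the state.
        if self._name is not None:
            text = " ".join("".join(self._chunks).split())
            self.fields.append((self._name, text))
        self._name, self._tag, self._chunks = None, None, []

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("class")
        if tag in self.FIELD_TAGS and name:
            self._flush()     # a missing close tag ends the previous field here
            self._name, self._tag = name, tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._flush()

    def handle_data(self, data):
        if self._name is not None:
            self._chunks.append(data)

    def close(self):
        super().close()
        self._flush()         # keep a field that was never closed


def extract_fields(html):
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


# Deliberately malformed: no <tr>, and the first <td> is never closed.
sample = '<table><td class="author">mark4<td class="body">Hello, <b>world</b></td>'
print(extract_fields(sample))
```

The sketch treats the start of the next field as the end of the previous one, which is one way to cope with missing close tags. It does not handle a classed `<span>` nested inside a classed `<td>`; real files would need that decided case by case.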

    > Nor can I contact the person who 'owns' it. If I did contact them, they
    > would be unlikely to release the data.
    >
    > Despite this, there are no copyright issues here. Every single post made
    > to the forum was made using an alias and no forum poster wants to be
    > identified, nor do any posters wish to claim "ownership" of their
    > contributions.

    Sounds to me like there are *major* copyright issues!

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact

    • mark4

      #3
      Re: Extract Content from HTML ?

      On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster
      <usenet200502@tobyinkster.co.uk> wrote:

      >mark4 wrote:
      >
      >> Are there any utilities to help me extract Content from HTML ?
      >> I'd like to store this data in a database.
      >
      >Looks to me like you'd have to write your own customised program to
      >extract the data.

      I expected as much.

      >To do that, I recommend using Perl. Perl has a module called HTML::Parser
      >which is apparently pretty good at extracting information from malformed
      >HTML files. What's more, it is generally very good at text handling and has
      >decent database modules too.

      Thanks. Being a microserf, I don't normally code in Perl but I
      may look into this. It's either that or WSH JavaScript with
      its regular expressions. Fortunately I already have a top-level
      design and it looks pretty simple. I may look into this
      Perl module but it will probably be easier to use microserf
      technology, with which I'm intimate. I shall probably store
      it in MSSQL.
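
The storage side might look roughly like the sketch below. It uses Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; it is not MSSQL, and the table name, column names and the helpers `open_store`, `save_thread` and `search` are all hypothetical. The body is stored as HTML because the original post asks to keep links, colours and `<PRE>` formatting.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the posts table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               thread_id TEXT NOT NULL,    -- e.g. the filename stem, 000125633
               position  INTEGER NOT NULL, -- order of the post within the thread
               author    TEXT,
               body_html TEXT,             -- kept as HTML so links and <PRE> survive
               PRIMARY KEY (thread_id, position)
           )"""
    )
    return conn


def save_thread(conn, thread_id, posts):
    """posts is an iterable of (author, body_html) tuples in thread order."""
    rows = [(thread_id, i, a, b) for i, (a, b) in enumerate(posts, start=1)]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)", rows
        )


def search(conn, term):
    """Return (thread_id, position, author) for posts whose body contains term."""
    cur = conn.execute(
        "SELECT thread_id, position, author FROM posts "
        "WHERE body_html LIKE ? ORDER BY thread_id, position",
        (f"%{term}%",),
    )
    return cur.fetchall()


conn = open_store()
save_thread(conn, "000125633", [("mark4", "Hello"), ("toby", "Try <b>Perl</b>")])
print(search(conn, "perl"))
```

A plain `LIKE` search also matches text inside tags and attribute values. If that matters, a second column holding the tag-stripped text would be the simplest fix.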

      >> Nor can I contact the person who 'owns' it. If I did contact them, they
      >> would be unlikely to release the data.
      >>
      >> Despite this, there are no copyright issues here. Every single post made
      >> to the forum was made using an alias and no forum poster wants to be
      >> identified, nor do any posters wish to claim "ownership" of their
      >> contributions.
      >
      >Sounds to me like there are *major* copyright issues!

      I can't see what those issues are. Who owns the data? Not the
      original forum provider. The data posted to a forum is the
      copyright of the original author, no matter what ToS may be
      specified in the forum. All those original authors have an alias
      and don't actually want to be identified. What I'm doing is no
      more a violation of copyright than someone keeping newspaper
      clippings.

      So long as I don't republish it.

      • Sherm Pendley

        #4
        Re: Extract Content from HTML ?

        mark4 wrote:

        > On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster
        > <usenet200502@tobyinkster.co.uk> wrote:
        >
        >>To do that, I recommend using Perl. Perl has a module called HTML::Parser
        >>which is apparently pretty good at extracting information from malformed
        >>HTML files. What's more, it is generally very good at text handling and has
        >>decent database modules too.

        Toby's right. I don't do the whole "language cheerleader" thing - but for
        this particular problem, Perl's an ideal fit.

        > Thanks. Being a microserf, I don't normally code in Perl but I
        > may look into this. It's either that or WSH JavaScript with
        > its regular expressions.

        There's Perl for Windows, you know. It integrates nicely with WSH too.

        <http://www.activestate.com>

        sherm--

        --
        Cocoa programming in Perl: http://camelbones.sourceforge.net
        Hire me! My resume: http://www.dot-app.org

        • Philip Herlihy

          #5
          Re: Extract Content from HTML ?

          Access can link to HTML (direct from the web) and will recognise tables.
          You might be lucky! It would make a very quick solution. File > Get
          External Data > Link... and then choose HTML. I was surprised how well it
          worked when I tried it on a table I'd created in FrontPage.

          --
          ############### #####
          ## PH, London
          ############### #####

          • Jim Royal

            #6
            Re: Extract Content from HTML ?

            In article <8eb521h18o9m8s8l4dgcfvl61riho36r65@4ax.com>, mark4
            <mark4asp@#notthis#ntlworld.com> wrote:

            > Are there any utilities to help me extract Content from HTML ?

            BBEdit has a simple menu command to remove markup from an HTML page,
            leaving only the content. You can then perform whatever regex
            operations are needed to massage the data before saving it.

            To process all those files, it should be a pretty simple matter to
            write an AppleScript to automate this procedure.

            However, this solution is Macintosh-only.
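
A cross-platform version of the same two steps (strip the markup, then tidy the text with regular expressions) could be sketched in Python as follows. This is not what BBEdit does internally; it is a rough equivalent built on the standard `html.parser` and `re` modules, and the names `TextOnly` and `strip_markup` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep text nodes, drop tags; skip <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0        # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "tr", "div"):
            self.parts.append("\n")   # treat block-level tags as line breaks

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_markup(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t\r\f\v]+", " ", text)   # collapse runs of spaces and tabs
    text = re.sub(r" *\n[ \n]*", "\n", text)    # collapse blank lines
    return text.strip()


print(strip_markup("<p>Hello   <b>world</b><br>second line &amp; more</p>"))
```

Note that this throws away all markup, so it does not cover the posts where links or `<PRE>` formatting need to be kept.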

            --
            Jim Royal
            "Understanding is a three-edged sword"

            • Chrissy Cruiser

              #7
              Re: Extract Content from HTML ?

              On Mon, 28 Feb 2005 08:32:19 GMT, mark4 wrote:

              >>> Nor can I contact the person who 'owns' it. If I did contact them, they
              >>> would be unlikely to release the data.
              >>>
              >>> Despite this, there are no copyright issues here. Every single post made
              >>> to the forum was made using an alias and no forum poster wants to be
              >>> identified, nor do any posters wish to claim "ownership" of their
              >>> contributions.
              >>
              >>Sounds to me like there are *major* copyright issues!
              >
              > I can't see what those issues are.

              By law, those posts are copyrighted and owned by the posters.

              • John Fitzsimons

                #8
                Re: Extract Content from HTML ?

                On Mon, 28 Feb 2005 06:06:36 GMT, mark4
                <mark4asp@#notthis#ntlworld.com> wrote:

                >Hello,

                >Are there any utilities to help me extract Content from HTML ?

                < snip >

                Notetab ? Modify - Strip HTML tags ?

                Not sure whether that is in the freeware version or not.

                Regards, John.

                • Toby Inkster

                  #9
                  Re: Extract Content from HTML ?

                  mark4 wrote:

                  > Thanks. Being a microserf, I don't normally code in Perl but I
                  > may look into this.

                  I am told ActiveState's Windows port of Perl is pretty good. Alternatively
                  there is also a Cygwin version of Perl.

                  > I can't see what those issues are. Who owns the data?

                  Its original authors, unless they explicitly signed away the copyright.

                  > All those original authors have an alias and don't actually want to be
                  > identified.

                  Publishing anonymously or under a pseudonym does not mean you forgo
                  copyright.

                  > So long as I don't republish it.

                  If you are keeping the database for private use, then you can probably
                  "get away with it", but the natural assumption on alt.html is that posters
                  want to publish their efforts to the web, unless it's explicitly
                  stated otherwise.

                  --
                  Toby A Inkster BSc (Hons) ARCS
                  Contact Me ~ http://tobyinkster.co.uk/contact

                  • ggrothendieck@volcanomail.com

                    #10
                    Re: Extract Content from HTML ?

                    > >To do that, I recommend using Perl. Perl has a module called HTML::Parser
                    > >which is apparently pretty good at extracting information from malformed
                    > >HTML files. What's more, it is generally very good at text handling and has
                    > >decent database modules too.
                    >
                    > Thanks. Being a microserf, I don't normally code in Perl but I
                    > may look into this. It's either that or WSH JavaScript with
                    > its regular expressions. Fortunately I already have a top-level
                    > design and it looks pretty simple. I may look into this
                    > Perl module but it will probably be easier to use microserf
                    > technology, with which I'm intimate. I shall probably store
                    > it in MSSQL.

                    You could use the InternetExplorer.Application COM object.
                    That would give you the facilities for performing HTML
                    parsing without regexps. It would therefore be
                    more robust and readily doable in your favorite language.
                    Try Google for examples.

                    • mbstevens

                      #11
                      Re: Extract Content from HTML ?

                      mark4 wrote:

                      > Hello,
                      >
                      > Are there any utilities to help me extract Content from HTML ?

                      lynx -dump http://whateverTheHeck.com > temp.txt

                      ... is the shortest program I know of for this kind of thing.
                      The '>' redirection to temp.txt may vary somewhat between operating systems.
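
Since the files in question are on disk rather than on the web, the same command can be pointed at local files and repeated for a whole directory. The Python wrapper below is only a sketch of that idea: it assumes lynx is installed and on the PATH, and the helper names `lynx_command` and `dump_all` are hypothetical. It captures lynx's output in Python instead of relying on shell redirection, which sidesteps the operating system differences mentioned above.

```python
import subprocess
from pathlib import Path


def lynx_command(html_path):
    # -dump renders the page as plain text on stdout; lynx also accepts local files
    return ["lynx", "-dump", str(html_path)]


def dump_all(src_dir, out_dir):
    """Write one .txt per .html file; requires lynx to be installed and on PATH."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for html_path in sorted(Path(src_dir).glob("*.html")):
        result = subprocess.run(
            lynx_command(html_path), capture_output=True, check=True
        )
        # Write the raw bytes so no assumption is made about the page encoding.
        (out / (html_path.stem + ".txt")).write_bytes(result.stdout)
```

Calling `dump_all("threads", "text")` (both directory names are made up) would produce `text/000125633.txt` from `threads/000125633.html`, and so on. Like the one-liner, this loses all markup, so it suits searching better than faithful archiving.
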
                      --
                      mbstevens http://www.mbstevens.com
