Tool needed: to strip some HTML

This topic is closed.
  • Voetleuce en f?nsievry

    Tool needed: to strip some HTML

    G'day

    I have some pages written by a bot and much of the code does not
    concern the visible content on the site. I'd like to strip all the
    codes that do not affect or influence the visible stuff (although I'd
    like to keep the nested tables, if possible). Some of this can be
    stripped using Search/Replace, but some of it contains codes which
    differ from page to page.

    How many pages? About 750, totalling 80 megabytes of data, which I'm
    hoping to reduce when I "clean" the code.

    Do you know of any tool that can do this? A tool that can be set to
    strip all codes except HTML 2.0 would, for example, also be useful
    except I'll lose the nested tables (which is not a *gigantic*
    loss...).

    I tried converting everything to TXT but most HTML2TXT programs
    deliver very poor results. I did find some code strippers that
    attempt to maintain the tables layout (but that is even less
    preferred). If the stuff is gonna be in plaintext, then there should
    be an intelligent way of dealing with nested tables.

    Any advice, people? What tool can you recommend? Preferably for W95x
    (but Linux would be fine too as long as it is newbie-friendly),
    preferably freeware (or shareware, but I don't intend buying).
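
One way to approach the "strip everything except a basic set of tags" idea is an allowlist filter. The following is only a minimal sketch using Python's standard `html.parser`; the `KEEP` set is an illustrative assumption (roughly basic HTML plus table tags, not the exact HTML 2.0 element list), and the names `AllowlistStripper` and `strip_html` are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative allowlist (an assumption, not the exact HTML 2.0 element set).
KEEP = {
    "html", "head", "title", "body", "h1", "h2", "h3", "h4", "h5", "h6",
    "p", "br", "hr", "a", "ul", "ol", "li", "pre", "b", "i", "em", "strong",
    "table", "tr", "td", "th",
}
# Elements whose content is dropped entirely.
SKIP_CONTENT = {"script", "style"}


class AllowlistStripper(HTMLParser):
    """Keeps allowlisted tags (without attributes, except href) and text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.out.append(f'<a href="{escape(href)}">')
            else:
                self.out.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit the bare tag once.
        if tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_html(text):
    parser = AllowlistStripper()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

Comments are dropped because the parser's default comment handler does nothing. Nested tables survive as long as `table`, `tr`, `td` and `th` stay in the allowlist.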
  • Vigil

    #2
    Re: Tool needed: to strip some HTML

    On Wed, 07 Apr 2004 02:34:39 -0700, Voetleuce en f?nsievry wrote:
    > Any advice, people? What tool can you recommend?

    If you only knew Perl...


    • Rijk van Geijtenbeek

      #3
      Re: Tool needed: to strip some HTML

      On 7 Apr 2004 02:34:39 -0700, Voetleuce en f?nsievry
      <carlitan@websurfer.co.za> wrote:
      > G'day
      >
      > I have some pages written by a bot and much of the code does not
      > concern the visible content on the site. I'd like to strip all the
      > codes that do not affect or influence the visible stuff (although I'd
      > like to keep the nested tables, if possible). Some of this can be
      > stripped using Search/Replace, but some of it contains codes which
      > differ from page to page.
      >
      > How many pages? About 750, totalling 80 megabytes of data, which I'm
      > hoping to reduce when I "clean" the code.
      >
      > Do you know of any tool that can do this? A tool that can be set to
      > strip all codes except HTML 2.0 would, for example, also be useful
      > except I'll lose the nested tables (which is not a *gigantic*
      > loss...).

      HTML Tidy might help a lot. It can be set to 'clean' the pages; it will
      then drop all presentational markup.
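
For a batch of about 750 files, the Tidy run could be scripted. This is a rough sketch that assumes the `tidy` command-line program is installed and on the PATH; the helper names `tidy_command` and `clean_tree` and the directory layout are hypothetical. The options used (`-q`, `-clean`, `-asxhtml`, `-numeric`, `-o`) are standard Tidy command-line options.

```python
import shutil
import subprocess
from pathlib import Path


def tidy_command(src, dst):
    """Build the argument list for one Tidy run (nothing is executed here)."""
    return [
        "tidy",
        "-q",         # suppress non-essential output
        "-clean",     # replace presentational tags and attributes
        "-asxhtml",   # write well-formed XHTML
        "-numeric",   # numeric rather than named entities
        "-o", str(dst),
        str(src),
    ]


def clean_tree(src_dir, dst_dir):
    """Run Tidy over every *.html file in src_dir; return names that failed."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy was not found on PATH")
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(src_dir).glob("*.html")):
        result = subprocess.run(tidy_command(src, dst_dir / src.name), check=False)
        # Tidy exits with 0 (no problems), 1 (warnings) or 2 (errors).
        if result.returncode >= 2:
            failed.append(src.name)
    return failed
```

Writing to a separate output directory keeps the originals untouched, so the size of both trees can be compared afterwards.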



      --
      Rijk van Geijtenbeek

      The Web is a procrastination apparatus:
      It can absorb as much time as is required to ensure that you
      won't get any real work done. - J.Nielsen


      • Pierre Goiffon

        #4
        Re: Tool needed: to strip some HTML

        "Voetleuce en f?nsievry" <carlitan@websurfer.co.za> wrote in
        message news:f0401042.0404070134.4818b817@posting.google.com
        > I have some pages written by a bot and much of the code does not
        > concern the visible content on the site. I'd like to strip all the
        > codes that do not affect or influence the visible stuff (although I'd
        > like to keep the nested tables, if possible). Some of this can be
        > stripped using Search/Replace, but some of it contains codes which
        > differ from page to page.

        Can you give a URL for a sample? The answer will depend on where the text
        to delete is located in your HTML pages...


        • William Park

          #5
          Re: Tool needed: to strip some HTML

          Voetleuce en f?nsievry <carlitan@websurfer.co.za> wrote:
          > Any advice, people? What tool can you recommend? Preferably for W95x
          > (but Linux would be fine too as long as it is newbie-friendly),
          > preferably freeware (or shareware, but I don't intend buying).

          Any example?

          --
          William Park, Open Geometry Consulting, <opengeometry@yahoo.ca>
          Linux solution/training/migration, Thin-client


          • Voetleuce en f?nsievry

            #6
            Re: Tool needed: to strip some HTML

            "Pierre Goiffon" <pgoiffon@nowhere.invalid> wrote in message news:<4074256e$0$21152$626a14ce@news.free.fr>...
            > "Voetleuce en f?nsievry" <carlitan@websurfer.co.za> wrote in
            > message news:f0401042.0404070134.4818b817@posting.google.com
            > > I have some pages written by a bot and much of the code does not
            > > concern the visible content on the site. I'd like to strip all the
            > > codes that do not affect or influence the visible stuff (although I'd
            > > like to keep the nested tables, if possible). Some of this can be
            > > stripped using Search/Replace, but some of it contains codes which
            > > differ from page to page.
            > Can you give a URL for a sample? The answer will depend on where the text
            > to delete is located in your HTML pages...

            True. Here goes:
            http://leuce.com/translate/tempfile/22265.html (100 kb)
            http://leuce.com/translate/tempfile/22265.zip (30 kb)

            These files are Yahoo group message files from a mailing list group we
            have, but you see, Yahoo's message search feature is rubbish and we'd
            like to make the archive of old messages available for new members so
            that they can *search* the old messages and not ask the same questions
            over and over again.

            The file for download mentioned above is from a guest login, but the
            files I have were saved while logged in, which means the e-mail
            addresses show up (we'll remove these manually later).

            Anything to reduce the fluff would be nice. We're considering
            putting the messages on a web site for Google to index (which would be
            *excellent*) but our bandwidth bill will kill us at present.

            TIA.

              • Voetleuce en f?nsievry

                #8
                Re: Tool needed: to strip some HTML

                Vigil <me@privacy.net> wrote in message news:<pan.2004.04.07.14.04.45.9556@privacy.net>...
                > On Wed, 07 Apr 2004 02:34:39 -0700, Voetleuce en f?nsievry wrote:
                > > Any advice, people? What tool can you recommend?
                > If you only knew Perl...

                I have a Perl interpreter installed here... :-)

                  • Voetleuce en f?nsievry

                    #10
                    SORRY! NEW URL: Re: Tool needed: to strip some HTML

                    "Pierre Goiffon" <pgoiffon@nowhere.invalid> wrote in message news:<4074256e$0$21152$626a14ce@news.free.fr>...
                    > Can you give a URL for a sample? The answer will depend on where the text
                    > to delete is located in your HTML pages...

                    SORRY! WRONG URL! Here are the correct ones:

                    http://leuce.com/tempfile/22265.html (100 kb)
                    http://leuce.com/tempfile/22265.zip (30 kb)

                      • Andy Dingley

                        #12
                        Re: Tool needed: to strip some HTML

                        On 7 Apr 2004 02:34:39 -0700, carlitan@websurfer.co.za (Voetleuce en
                        f?nsievry) wrote:
                        > Do you know of any tool that can do this?

                        Use HTMLTidy to turn it into XHTML, then run XSLT on that.
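
The second step could look roughly like the sketch below. Note that it swaps XSLT for a plain Python tree walk with `xml.etree.ElementTree`, since the Python standard library has no XSLT processor; the principle is the same: once the page is well-formed XHTML it can be parsed as XML and filtered. The `DROP_ELEMENTS` and `DROP_ATTRS` sets and the function names are assumptions chosen for illustration. The input should use numeric entities (for example from Tidy's `-numeric` option), because named entities such as `&nbsp;` are not known to a plain XML parser.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Illustrative choices (assumptions): what counts as "fluff" depends on the pages.
DROP_ELEMENTS = {"script", "style"}
DROP_ATTRS = {"bgcolor", "align", "valign", "width", "height",
              "cellpadding", "cellspacing", "border", "class", "style"}


def local_name(tag):
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def remove_keep_tail(parent, child):
    """Remove child but keep the text that followed it in the document."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def strip_presentational(xhtml_text):
    # Serialize XHTML elements without an "ns0:" prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml_text)
    for element in list(root.iter()):
        for child in list(element):
            if local_name(child.tag) in DROP_ELEMENTS:
                remove_keep_tail(element, child)
        for name in list(element.attrib):
            if name in DROP_ATTRS:
                del element.attrib[name]
    return ET.tostring(root, encoding="unicode")
```

Table elements are left alone, so nested tables keep their structure and lose only the layout attributes.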


                          • Pierre Goiffon

                            #14
                            Re: Tool needed: to strip some HTML

                            "Voetleuce en f?nsievry" <carlitan@websurfer.co.za> wrote in
                            message news:f0401042.0404080020.61fc6220@posting.google.com
                            > These files are Yahoo group message files from a mailing list group we
                            > have, but you see, Yahoo's message search feature is rubbish and we'd
                            > like to make the archive of old messages available for new members so
                            > that they can *search* the old messages and not ask the same questions
                            > over and over again.

                            OK, here are some ideas:
                            - a batch job that fetches the messages from a POP or IMAP account subscribed
                            to your list. It just has to insert the messages into a database (really
                            easy to do with a PHP script)
                            - a script that extracts the data from the Yahoo Groups pages

                            For this kind of need, I would definitely choose the first solution! The
                            Yahoo Groups HTML is not really standards-compliant, and its internal
                            structure seems to vary a lot from one page to another, so it would be
                            almost impossible to use the DOM: you would have to do it all yourself, using
                            string manipulation functions such as regular expressions. It could work, but
                            why not simply capture structured data, i.e. the messages sent via email?
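
The first idea could be sketched as follows. This uses Python and SQLite instead of the PHP script mentioned above, purely for illustration, and it covers only the parse-and-store step; fetching the raw messages (for instance with the standard `poplib` or `imaplib` modules) is left out. The table layout and the names `open_archive`, `store_message` and `search` are assumptions.

```python
import sqlite3
from email import message_from_bytes, policy


def open_archive(path=":memory:"):
    """Open (or create) the message archive; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "message_id TEXT PRIMARY KEY, sender TEXT, subject TEXT, "
        "sent TEXT, body TEXT)"
    )
    return conn


def store_message(conn, raw):
    """Parse one raw RFC 822 message (bytes) and insert it once."""
    msg = message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""
    conn.execute(
        "INSERT OR IGNORE INTO messages VALUES (?, ?, ?, ?, ?)",
        (
            str(msg.get("Message-ID", "")),
            str(msg.get("From", "")),
            str(msg.get("Subject", "")),
            str(msg.get("Date", "")),
            body,
        ),
    )
    conn.commit()


def search(conn, word):
    """Return the subjects of messages whose body contains the word."""
    rows = conn.execute(
        "SELECT subject FROM messages WHERE body LIKE ?", (f"%{word}%",)
    )
    return [row[0] for row in rows]
```

Using the Message-ID as the primary key means a message fetched twice is stored only once, which matters if the batch job is rerun.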
