Remove spaces and line wraps from html?

  • RiGGa

    Remove spaces and line wraps from html?

    Hi,

    I have an HTML file that I need to process, and it contains text in this
    format:

    <TD><SPAN class=xf id=EmployeeNo
    title="Employee Number">0123456 </SPAN></TD></TR>

    (Note: the split over two lines is as it appears in the source file.)

    I would like to use Python (or anything else, really) to put it all on one
    line, i.e.

    <TD><SPAN class=xf id=EmployeeNo title="Employee
    Number">0123456 </SPAN></TD></TR>

    (Note: this has wrapped onto a second line here, but it would be a single
    line in the file.)

    The reason I would like to do this is to make it easier to pull the
    information out of the file. I am interested in the contents of the title=
    field and the data immediately after the > (in this case 0123456). I have
    written a basic Python program to handle this, but in its current form it
    goes wrong when the tag is split over a line, as in my first example.

    Hope this all makes sense.

    Any help appreciated.


  • Paramjit Oberoi

    #2
    Re: Remove spaces and line wraps from html?

    > I have an HTML file that I need to process, and it contains text in this
    > format:

    Try:

    http://groups.google.com/groups?q=HT...ail.com&rnum=1

    (or search c.l.p for "HTMLPrinter")


    • RiGGa

      #3
      Re: Remove spaces and line wraps from html?

      Paramjit Oberoi wrote:

      >> I have an HTML file that I need to process, and it contains text in this
      >> format:
      >
      > Try:
      >
      > http://groups.google.com/groups?q=HT...ail.com&rnum=1
      >
      > (or search c.l.p for "HTMLPrinter")

      Thanks, I forgot to mention I am new to Python, so I don't yet know how to use
      that example :(



      • Paramjit Oberoi

        #4
        Re: Remove spaces and line wraps from html?

        >> http://groups.google.com/groups?q=HT...ail.com&rnum=1
        >>
        >> (or search c.l.p for "HTMLPrinter")
        >
        > Thanks, I forgot to mention I am new to Python, so I don't yet know how to
        > use that example :(

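        Python has an HTMLParser module in the standard library:

        http://www.python.org/doc/lib/module-HTMLParser.html
        http://www.python.org/doc/lib/htmlparser-example.html

        It looks complicated if you are new to all this, but it's fairly simple
        really. Using it is much better than dealing with HTML syntax yourself.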

        A small example:

        --------------------------------------------------
        from HTMLParser import HTMLParser

        class MyHTMLParser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                print "Encountered the beginning of a %s tag" % tag
            def handle_endtag(self, tag):
                print "Encountered the end of a %s tag" % tag

        my_parser = MyHTMLParser()

        html_data = """
        <html>
        <head>
        <title>hi</title>
        </head>
        <body> hi </body>
        </html>
        """

        my_parser.feed(html_data)
        --------------------------------------------------

        will produce the result:
        Encountered the beginning of a html tag
        Encountered the beginning of a head tag
        Encountered the beginning of a title tag
        Encountered the end of a title tag
        Encountered the end of a head tag
        Encountered the beginning of a body tag
        Encountered the end of a body tag
        Encountered the end of a html tag

        You'll be able to figure out the rest using the
        documentation and some experimentation.
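As a rough sketch of where that experimentation might lead for the original question (this is not part of the reply above): the same handler approach can collect the title= value and the text inside the SPAN, and the line break inside the tag makes no difference to the parser. It assumes Python 3, where the module is called html.parser instead of HTMLParser. The class name SpanExtractor and its attributes are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name for the HTMLParser module


class SpanExtractor(HTMLParser):
    """Collect (title, text) pairs from <span> tags that carry a title attribute."""

    def __init__(self):
        super().__init__()
        self.results = []    # finished (title, text) pairs
        self._title = None   # title of the <span> currently open, if any
        self._chunks = []    # text pieces seen inside that <span>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "span":
            title = dict(attrs).get("title")
            if title is not None:
                self._title = title
                self._chunks = []

    def handle_data(self, data):
        # Only keep text while we are inside a titled <span>.
        if self._title is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            text = "".join(self._chunks).strip()
            self.results.append((self._title, text))
            self._title = None


# The tag is split over two lines, exactly as in the original question.
html_data = """<TD><SPAN class=xf id=EmployeeNo
title="Employee Number">0123456 </SPAN></TD></TR>"""

parser = SpanExtractor()
parser.feed(html_data)
parser.close()
print(parser.results)  # [('Employee Number', '0123456')]
```

Because the parser works on tags, not lines, there is no need to join the lines of the file first.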

        HTH,
        -param


        • RiGGa

          #5
          Re: Remove spaces and line wraps from html?

          Thank you!! that was just the kind of help I was
          looking for.

          Best regards

          Rigga


          • RiGGa

            #6
            Re: Remove spaces and line wraps from html?

            I have just tried your example exactly as you typed
            it (copy and paste) and I get a syntax error every time
            I run it. It always fails at the line starting:

            def handle_starttag(self, tag, attrs):

            And the error message shown on the command line is:

            DeprecationWarning: Non-ASCII character '\xa0'

            What does this mean?

            Many thanks

            R



            • RiGGa

              #7
              Re: Remove spaces and line wraps from html?

              Ignore that, I retyped it manually and it now works. It must have been a hidden
              character that my IDE didn't like.

              Thanks again for your help. No doubt I will post back later with more
              questions :)

              Thanks
              R


              • Peter Otten

                #8
                Re: Remove spaces and line wraps from html?

                RiGGa wrote:

                > I have just tried your example exactly as you typed
                > it (copy and paste) and I get a syntax error every time
                > I run it. It always fails at the line starting:
                >
                > def handle_starttag(self, tag, attrs):
                >
                > And the error message shown on the command line is:
                >
                > DeprecationWarning: Non-ASCII character '\xa0'
                >
                > What does this mean?

                You get a deprecation warning when your source code contains non-ASCII
                characters and you have no encoding declared (read PEP 263 for details).
                Those characters have a different meaning depending on the encoding, which
                makes the code ambiguous.

                However, what's really going on in your case is that (some) space characters
                in the source code were replaced by chr(160), which happens sometimes with
                newsgroup postings for reasons unknown to me. What makes that nasty is that
                chr(160) looks just like the normal space character.
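Since chr(160) cannot be told apart from a space by eye, it can help to let Python point at the offending positions. Below is a minimal sketch, assuming Python 3 and source text that has already been read into a string. The helper name find_nbsp is hypothetical, not something from this thread.

```python
NBSP = "\xa0"  # chr(160), the non-breaking space


def find_nbsp(source):
    """Return 1-based (line, column) pairs for every chr(160) in source."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == NBSP:
                hits.append((line_no, col))
    return hits


# Second line is indented with four non-breaking spaces instead of real spaces.
sample = (
    "class MyHTMLParser(HTMLParser):\n"
    "\xa0\xa0\xa0\xa0def handle_starttag(self, tag, attrs):\n"
)
print(find_nbsp(sample))  # [(2, 1), (2, 2), (2, 3), (2, 4)]
```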

                If you run the following from the command line with a space after python
                (replace xxx.py with the source file and yyy.py with the name of the new
                cleaned-up file), Paramjit's code should work as expected.

                python -c 'file("yyy.py","w").write(file("xxx.py").read().replace(chr(160),chr(32)))'
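The one-liner above relies on the Python 2 built-in file(), which no longer exists in Python 3. A rough Python 3 equivalent might look like the sketch below. The function name replace_nbsp is made up for illustration, and the latin-1 default is an assumption: it is chosen because byte 0xA0 decodes to chr(160) there, which matches the byte-level replacement in the original command.

```python
from pathlib import Path


def replace_nbsp(src_path, dst_path, encoding="latin-1"):
    """Copy src_path to dst_path, turning every chr(160) into a plain space.

    Returns the number of characters that were replaced.
    """
    text = Path(src_path).read_text(encoding=encoding)
    cleaned = text.replace("\xa0", " ")
    Path(dst_path).write_text(cleaned, encoding=encoding)
    return text.count("\xa0")
```

Calling replace_nbsp("xxx.py", "yyy.py") would then play the same role as the command line above, with xxx.py as the source file and yyy.py as the cleaned-up copy.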

                Peter
