Remove spaces and line wraps from html?

**Paramjit Oberoi** · Jul 18 '05, 11:59 AM

Re: Remove spaces and line wraps from html?

> I have a html file that I need to process and it contains text in this
> format:

Try:

http://groups.google.com/groups?q=HTMLPrinter&hl=en&lr=&ie=UTF-8&c2coff=1&selm=pan.2004.03.27.22.05.55.384482%40hotmail.com&rnum=1

(or search c.l.p for "HTMLPrinter")

**RiGGa** · Jul 18 '05, 11:59 AM

Re: Remove spaces and line wraps from html?

Paramjit Oberoi wrote:

>> I have a html file that I need to process and it contains text in this
>> format:
>
> Try:
>
> http://groups.google.com/groups?q=HTMLPrinter&hl=en&lr=&ie=UTF-8&c2coff=1&selm=pan.2004.03.27.22.05.55.384482%40hotmail.com&rnum=1
>
> (or search c.l.p for "HTMLPrinter")

Thanks, I forgot to mention I am new to Python so I don't yet know how to use
that example :(

**Paramjit Oberoi** · Jul 18 '05, 11:59 AM

Re: Remove spaces and line wraps from html?

>> http://groups.google.com/groups?q=HT...ail.com&rnum=1
>>
>> (or search c.l.p for "HTMLPrinter")
>
> Thanks, I forgot to mention I am new to Python so I don't yet know how to
> use that example :(

Python has a HTMLParser module in the standard library:

http://www.python.org/doc/lib/module-HTMLParser.html
http://www.python.org/doc/lib/htmlparser-example.html

It looks complicated if you are new to all this, but it's fairly simple
really. Using it is much better than dealing with HTML syntax yourself.

A small example:

--------------------------------------------------
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag
    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

my_parser = MyHTMLParser()

html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

my_parser.feed(html_data)
--------------------------------------------------

will produce the result:
Encountered the beginning of a html tag
Encountered the beginning of a head tag
Encountered the beginning of a title tag
Encountered the end of a title tag
Encountered the end of a head tag
Encountered the beginning of a body tag
Encountered the end of a body tag
Encountered the end of a html tag
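
The example above is written for Python 2. On Python 3 the module is `html.parser` and `print` is a function, so a port might look like the sketch below. The `events` list is an addition for illustration only (it is not in the original post); it just records what the handlers saw.

```python
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    """Python 3 port of the example: report every start and end tag."""

    def __init__(self):
        super().__init__()
        self.events = []  # illustrative addition: remember each tag event

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        print("Encountered the beginning of a %s tag" % tag)

    def handle_endtag(self, tag):
        self.events.append(("end", tag))
        print("Encountered the end of a %s tag" % tag)


html_data = """
<html>
<head>
<title>hi</title>
</head>
<body> hi </body>
</html>
"""

my_parser = MyHTMLParser()
my_parser.feed(html_data)
my_parser.close()
```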

You'll be able to figure out the rest using the
documentation and some experimentation.

HTH,
-param
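
For the question in the thread title, one possible direction is to collect the text in `handle_data` and then squeeze the whitespace. This is only a sketch for Python 3, assuming the goal is to extract the text and turn every run of spaces and line breaks into a single space; `TextCollector` and `collapse_whitespace` are hypothetical names, not something from the posts above.

```python
from html.parser import HTMLParser


def collapse_whitespace(text):
    """Replace every run of spaces, tabs and newlines with one space."""
    return " ".join(text.split())


class TextCollector(HTMLParser):
    """Gather the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called with each piece of text found between tags.
        self.chunks.append(data)

    def text(self):
        # Joining with a space keeps words from adjacent elements apart,
        # at the cost of also separating text split by inline tags.
        return collapse_whitespace(" ".join(self.chunks))


collector = TextCollector()
collector.feed("<p>some   text\n   wrapped over\n two   lines</p>\n<p>more</p>")
collector.close()
print(collector.text())  # prints: some text wrapped over two lines more
```

Whether the chunks should be joined with a space or with nothing depends on the real input, which the original question does not show.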

**RiGGa** · Jul 18 '05, 11:59 AM

Re: Remove spaces and line wraps from html?

Thank you!! that was just the kind of help I was
looking for.

Best regards

Rigga

**RiGGa** · Jul 18 '05, 11:59 AM

Re: Remove spaces and line wraps from html?

I have just tried your example exactly as you typed
it (copy and paste) and I get a syntax error every time
I run it. It always fails at the line starting:

def handle_starttag(self, tag, attrs):

And the error message shown in the command line is:

DeprecationWarning: Non-ASCII character '\xa0'

What does this mean?

Many thanks

R

**RiGGa** · Jul 18 '05, 11:59 AM

Re: Remove spaces and line wraps from html?

RiGGa wrote:
> I have just tried your example exactly as you typed
> it (copy and paste) and I get a syntax error every time
> I run it. It always fails at the line starting:
>
> def handle_starttag(self, tag, attrs):
>
> And the error message shown in the command line is:
>
> DeprecationWarning: Non-ASCII character '\xa0'
>
> What does this mean?
>
> Many thanks
>
> R

Ignore that, I retyped it manually and it now works. It must have been a hidden
character that my IDE didn't like.

Thanks again for your help, no doubt I will post back later with more
questions :)

Thanks
R

**Peter Otten** · Jul 18 '05, 11:59 AM

Re: Remove spaces and line wraps from html?

RiGGa wrote:

> I have just tried your example exactly as you typed
> it (copy and paste) and I get a syntax error every time
> I run it. It always fails at the line starting:
>
> def handle_starttag(self, tag, attrs):
>
> And the error message shown in the command line is:
>
> DeprecationWarning: Non-ASCII character '\xa0'
>
> What does this mean?

You get a deprecation warning when your source code contains non-ASCII
characters and you have no encoding declared (read PEP 263 for details).
Those characters have a different meaning depending on the encoding, which
makes the code ambiguous.

However, what's really going on in your case is that (some) space characters
in the source code were replaced by chr(160), which happens sometimes with
newsgroup postings for reasons unknown to me. What makes that nasty is that
chr(160) looks just like the normal space character.
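
Because chr(160) is invisible in most editors, it can help to list where such characters sit before replacing them. A small sketch for Python 3; `find_non_ascii` is a hypothetical helper written for this illustration, not a standard library function.

```python
def find_non_ascii(source):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if ord(char) > 127:
                hits.append((line_number, column, char))
    return hits


# Sample input: the indentation of the second line uses chr(160), not spaces.
sample = "class A:\n\xa0\xa0\xa0\xa0def f(self):\n        pass\n"
for line_number, column, char in find_non_ascii(sample):
    print("line %d, column %d: %r" % (line_number, column, char))
```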

If you run the following from the command line (note the space after python;
replace xxx.py with the source file and yyy.py with the name of the new
cleaned-up file), Paramjit's code should work as expected.

python -c 'file("yyy.py","w").write(file("xxx.py").read().replace(chr(160),chr(32)))'
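
The `file()` built-in used in that one-liner does not exist in Python 3. A rough equivalent is sketched below, assuming the source file uses a single-byte encoding such as Latin-1 in which byte 160 is the non-breaking space. The function name `replace_nbsp` and the `encoding` default are assumptions for illustration.

```python
def replace_nbsp(src_path, dst_path, encoding="latin-1"):
    """Copy src_path to dst_path, turning non-breaking spaces into spaces."""
    # newline="" leaves the original line endings untouched.
    with open(src_path, encoding=encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=encoding, newline="") as dst:
        dst.write(text.replace("\xa0", " "))
```

It would be called as `replace_nbsp("xxx.py", "yyy.py")`; for a UTF-8 file, pass `encoding="utf-8"` instead.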

Peter
