ElementTree.fromstring(unicode_html)

**John Machin** · Jan 26 '08, 03:15 AM

Re: ElementTree.fromstring(unicode_html)

On Jan 26, 1:11 pm, globophobe <globoph...@gmail.com> wrote:

> This is likely an easy problem; however, I couldn't think of
> appropriate keywords for google:
>
> Basically, I have some raw data that needs to be preprocessed before
> it is saved to the database e.g.
>
> In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
>
> I need to turn this into an elementtree, but some of the data is
> japanese whereas the rest is html. This string contains a <br />.

>>> import unicodedata as ucd
>>> s = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
>>> [ucd.name(c) if ord(c) >= 128 else c for c in s]

['HIRAGANA LETTER SA', 'HIRAGANA LETTER MU', 'HIRAGANA LETTER I',
'FULLWIDTH SOLIDUS', u'\r', u'\n', 'HIRAGANA LETTER TU', 'HIRAGANA
LETTER ME', 'HIRAGANA LETTER TA', 'HIRAGANA LETTER I', u'\r', u'\n']

>>>

Where in there is the <br />?

**Fredrik Lundh** · Jan 27 '08, 06:45 PM

Re: ElementTree.fromstring(unicode_html)

globophobe wrote:

> In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
>
> I need to turn this into an elementtree, but some of the data is
> japanese whereas the rest is html. This string contains a <br />.

where? <br /> is an element, not a character. "\r" and "\n" are
characters, not elements.

If you want to build a tree where "\r\n" is replaced with a <br />
element, you can encode the string as UTF-8, use the replace method to
insert the element, and then call fromstring.
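A rough sketch of that first approach, for anyone who wants to try it. It is written for Python 3 (the thread itself predates it), and the `<data>` wrapper element is an assumption added here so that the parser sees a single root. It also assumes the raw text contains no `&` or `<` characters that would need escaping first.

```python
import xml.etree.ElementTree as ET

unicode_html = '\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# Encode to UTF-8 first, then swap each "\r\n" for a <br /> element.
encoded = unicode_html.encode('utf-8')
markup = encoded.replace(b'\r\n', b'<br />')

# Assumption: wrap everything in a <data> root so the bytes form one
# well-formed XML document. Without an XML declaration the parser
# treats the bytes as UTF-8.
elem = ET.fromstring(b'<data>' + markup + b'</data>')

# The text before the first <br /> lands in elem.text; the text after
# each <br /> lands in that element's .tail.
print(ET.tostring(elem, encoding='unicode'))
```

Because the input ends with "\r\n", this variant produces a trailing empty `<br />` as well, which the splitlines-based approach does not.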

Alternatively, you can build the tree yourself:

import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

parts = unicode_html.splitlines()

elem = ET.Element("data")
elem.text = parts[0]
for part in parts[1:]:
    ET.SubElement(elem, "br").tail = part

print(ET.tostring(elem))

</F>
