ElementTree.fromstring(unicode_html)

  • globophobe

    ElementTree.fromstring(unicode_html)

    This is likely an easy problem; however, I couldn't think of
    appropriate keywords for google:

    Basically, I have some raw data that needs to be preprocessed before
    it is saved to the database e.g.

    In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

    I need to turn this into an elementtree, but some of the data is
    japanese whereas the rest is html. This string contains a <br />.

    In [2]: e = ET.fromstring('<data>%s</data>' % unicode_html)
    In [3]: e.text
    Out[3]: u'\u3055\u3080\u3044\uff0f\n\u3064\u3081\u305f\u3044\n'
    In [4]: len(e)
    Out[4]: 0

    How can I decode the unicode html <br /> into a string that
    ElementTree can understand?

  • John Machin

    #2
    Re: ElementTree.fromstring(unicode_html)

    On Jan 26, 1:11 pm, globophobe <globoph...@gmail.com> wrote:
    > This is likely an easy problem; however, I couldn't think of
    > appropriate keywords for google:
    >
    > Basically, I have some raw data that needs to be preprocessed before
    > it is saved to the database e.g.
    >
    > In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
    >
    > I need to turn this into an elementtree, but some of the data is
    > japanese whereas the rest is html. This string contains a <br />.

    >>> import unicodedata as ucd
    >>> s = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
    >>> [ucd.name(c) if ord(c) >= 128 else c for c in s]
    ['HIRAGANA LETTER SA', 'HIRAGANA LETTER MU', 'HIRAGANA LETTER I',
     'FULLWIDTH SOLIDUS', u'\r', u'\n', 'HIRAGANA LETTER TU',
     'HIRAGANA LETTER ME', 'HIRAGANA LETTER TA', 'HIRAGANA LETTER I',
     u'\r', u'\n']
    >>>

    Where in there is the <br />??


    • Fredrik Lundh

      #3
      Re: ElementTree.fromstring(unicode_html)

      globophobe wrote:
      > In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
      >
      > I need to turn this into an elementtree, but some of the data is
      > japanese whereas the rest is html. This string contains a <br />.

      where? <br /> is an element, not a character. "\r" and "\n" are
      characters, not elements.
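
      To illustrate the distinction, here is a small sketch (not part of
      the original post) that parses the same string and compares it with
      markup that really does contain a <br />. The sample 'a<br />b' input
      is made up for the comparison, and the string is encoded to UTF-8
      bytes before parsing as an assumption about how it would be fed in.

```python
import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# The line breaks are plain characters, so they stay inside the text
# of <data> and no child element is created.
elem = ET.fromstring((u'<data>%s</data>' % unicode_html).encode('utf-8'))
print(len(elem))        # number of child elements
print(repr(elem.text))  # the parser normalizes "\r\n" to "\n"

# Hypothetical comparison input: here <br /> is real markup,
# so it does show up as a child element.
with_br = ET.fromstring(b'<data>a<br />b</data>')
print(len(with_br), with_br[0].tag)
```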

      If you want to build a tree where "\r\n" is replaced with a <br />
      element, you can encode the string as UTF-8, use the replace method to
      insert the element, and then call fromstring.
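
      A minimal sketch of that replace approach might look like the
      following. It assumes the text contains no "&" or "<" characters
      that would need escaping first, and the <data> wrapper is simply
      the root element name used in the question.

```python
import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# Encode to UTF-8 bytes, then swap each "\r\n" for <br /> markup.
# Assumption: the data has no "&" or "<" that would need escaping.
data = unicode_html.encode('utf-8').replace(b'\r\n', b'<br />')

# Wrap the result in a root element and parse it.
elem = ET.fromstring(b'<data>' + data + b'</data>')

print(len(elem))  # number of <br /> children
print(ET.tostring(elem))
```

      Note that, unlike splitting the string into lines, this also turns
      the trailing "\r\n" into a final <br /> element.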

      Alternatively, you can build the tree yourself:

      import xml.etree.ElementTree as ET

      unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

      # one list item per line, with the line endings stripped
      parts = unicode_html.splitlines()

      elem = ET.Element("data")
      elem.text = parts[0]
      for part in parts[1:]:
          # each later line becomes the tail of a new <br /> element
          ET.SubElement(elem, "br").tail = part

      print(ET.tostring(elem))

      </F>
