Html character entity conversion

  • pak.andrei@gmail.com

    Html character entity conversion

    Here is my script:

    from mechanize import *
    from BeautifulSoup import *
    import StringIO
    b = Browser()
    f = b.open("http://www.translate.ru/text.asp?lang=ru")
    b.select_form(nr=0)
    b["source"] = "hello python"
    html = b.submit().get_data()
    soup = BeautifulSoup(html)
    print soup.find("span", id = "r_text").string

    OUTPUT:
    &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;
    ----------
    In Russian it looks like:
    "привет питон"

    How can I translate this using standard Python libraries??

    --
    Pak Andrei, http://paxoblog.blogspot.com, icq://97449800

  • Claudio Grondi

    #2
    Re: Html character entity conversion

    pak.andrei@gmail.com wrote:
    > How can I translate this using standard Python libraries??

    Translate to what and with what purpose?

    Assuming your intention is to get a Python Unicode string, what about:

    strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
    strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
    strUnicode = eval("u'%s'"%strUnicodeHexCode)

    ?

    I am sure there is a more elegant and direct solution, but I just wanted
    to give a quick response here.

    Claudio Grondi


    • danielx

      #3
      Re: Html character entity conversion

      pak.andrei@gmail.com wrote:
      > How can I translate this using standard Python libraries??

      I'm having trouble understanding how your script works (what would a
      "BeautifulSoup" function do?), but assuming your intent is to find
      character references in an HTML document, you might try using
      the HTMLParser class in the HTMLParser module. This class delegates
      to several handler methods that you override. One of them is
      handle_charref. It is called with one argument, the name of the
      reference, which includes only the number part. HTMLParser is a lot
      more powerful than that, though. There may be something more
      lightweight out there that will accomplish what you want. Then again,
      you might be able to find a use for all that power :P.
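
A minimal sketch of the handle_charref idea, written for Python 3, where the module is called html.parser rather than HTMLParser. The class name CharRefDecoder and the helper decode_charrefs are made up for this illustration. Note the assumption that convert_charrefs=False is passed: on recent Python 3 versions the parser otherwise converts the references itself and never calls handle_charref.

```python
from html.parser import HTMLParser  # Python 3 home of the old HTMLParser module


class CharRefDecoder(HTMLParser):
    """Collects text, turning numeric character references into characters."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref()
        # instead of silently converting the references itself.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Ordinary text between references is kept unchanged.
        self.parts.append(data)

    def handle_charref(self, name):
        # name is '1087' for &#1087; and 'x43f' for &#x43f;
        if name[0] in 'xX':
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))


def decode_charrefs(text):
    """Hypothetical helper: feed text through the parser and join the pieces."""
    parser = CharRefDecoder()
    parser.feed(text)
    parser.close()
    return ''.join(parser.parts)


print(decode_charrefs('&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'))
```

Because handle_data receives everything that is not a reference, surrounding text such as "100x " passes through untouched.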


      • pak.andrei@gmail.com

        #4
        Re: Html character entity conversion


        Claudio Grondi wrote:
        > Assuming your intention is to get a Python Unicode string, what about:
        >
        > strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
        > strUnicode = eval("u'%s'"%strUnicodeHexCode)

        Thank you, Claudio.
        A really interesting solution, but it doesn't work...

        In [19]: strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

        In [20]: strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')

        In [21]: strUnicode = eval("u'%s'"%strUnicodeHexCode)

        In [22]: print strUnicode
        ---------------------------------------------------------------------------
        exceptions.UnicodeEncodeError             Traceback (most recent call last)

        C:\Documents and Settings\dron\<ipython console>

        C:\usr\lib\encodings\cp866.py in encode(self, input, errors)
             16     def encode(self,input,errors='strict'):
             17
        ---> 18         return codecs.charmap_encode(input,errors,encoding_map)
             19
             20     def decode(self,input,errors='strict'):

        UnicodeEncodeError: 'charmap' codec can't encode characters in position
        0-5: character maps to <undefined>

        In [23]: print strUnicode.encode("utf-8")
        сВЗсВИсВАсБ┤сБ╖сВР сВЗсВАсВРсВЖсВЕ
        <-- it's not my string "привет питон"

        In [24]: strUnicode.encode("utf-8")
        Out[24]:
        '\xe1\x82\x87\xe1\x82\x88\xe1\x82\x80\xe1\x81\xb4\xe1\x81\xb7\xe1\x82\x90
        \xe1\x82\x87\xe1\x82\x80\xe1\x82\x90\xe1\x82\x86\xe1\x82\x85'
        <-- and too many chars


        • pak.andrei@gmail.com

          #5
          Re: Html character entity conversion

          danielx wrote:
          > I'm having trouble understanding how your script works (what would a
          > "BeautifulSoup" function do?)

          Thank you for the response.
          It doesn't matter what 'BeautifulSoup' is...
          The general question is:

          How can I convert the encoded string

          sEncodedHtmlText = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

          into the human-readable form:

          sDecodedHtmlText == 'привет питон'
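
For readers on a current interpreter: the question as stated has a direct standard-library answer in Python 3.4 and later, html.unescape, which did not exist when this thread was written. The sketch below reuses the poster's variable names in cleaned-up form and assumes the input is already a text (str) object.

```python
import html

# The encoded string from the question above.
encoded = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

# html.unescape converts numeric (&#1087;, &#x43f;) and named (&amp;)
# character references and leaves all other text as it is.
decoded = html.unescape(encoded)

print(decoded)
```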


          • Marc 'BlackJack' Rintsch

            #6
            Re: Html character entity conversion

            In <1154266972.154519.175040@m73g2000cwd.googlegroups.com>,
            pak.andrei@gmail.com wrote:
            > How can I translate this using standard Python libraries??

            Have you tried a more recent version of BeautifulSoup? IIRC current
            versions always decode text to unicode objects before returning them.

            Ciao,
            Marc


            • Claudio Grondi

              #7
              Re: Html character entity conversion

              pak.andrei@gmail.com wrote:
              > In [22]: print strUnicode
              > [...]
              > UnicodeEncodeError: 'charmap' codec can't encode characters in position
              > 0-5: character maps to <undefined>

              Have you considered that the HTML page specifies charset=windows-1251
              in its
              <meta http-equiv="Content-Type" content="text/html;
              charset=windows-1251"> tag?
              You are apparently on Linux or so, so I can't track this problem down
              having only a Windows box here, but in the meantime I have found that
              there is another problem with it:
              I erroneously assumed that the numbers in &#1087; are hexadecimal,
              but they are decimal, so it is necessary to do hex(int('1087')) on them
              to get the right code to put into eval().
              Now that you know the idea, I hope you will succeed as I did with:
              >>> lstIntUnicodeDecimalCode = strHTML.replace('&#','').split(';')
              >>> lstIntUnicodeDecimalCode
              ['1087', '1088', '1080', '1074', '1077', '1090', ' 1087', '1080',
              '1090', '1086', '1085', '']
              >>> lstIntUnicodeDecimalCode = lstIntUnicodeDecimalCode[:-1]
              >>> lstHexUnicode = [ hex(int(item)) for item in lstIntUnicodeDecimalCode]
              >>> lstHexUnicode
              ['0x43f', '0x440', '0x438', '0x432', '0x435', '0x442', '0x43f', '0x438',
              '0x442', '0x43e', '0x43d']
              >>> eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
              u'\u043f\u0440\u0438\u0432\u0435\u0442\u043f\u0438\u0442\u043e\u043d'
              >>> strUnicode = eval(
              ... 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
              >>> print strUnicode
              приветпитон

              Sorry for the mess of not taking the space into consideration, but I
              think you can get the idea anyway.

              Claudio Grondi
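
The decimal/hexadecimal mix-up described in this post can be checked in a couple of lines. This is only an illustrative sketch in Python 3 (chr() stands in for Python 2's unichr()); the variable names are invented for the example.

```python
code = '1087'  # the digits from the reference &#1087;

# Reading the digits as hexadecimal gives U+1087, a character from the
# Myanmar block; this is what the first eval() attempt produced.
wrong = chr(int(code, 16))

# Reading them as decimal gives U+043F, CYRILLIC SMALL LETTER PE.
right = chr(int(code, 10))

print(repr(wrong), wrong.encode('utf-8'))
print(repr(right), hex(int(code)))
```

The UTF-8 bytes of the wrong character are the same three bytes that open the Out[24] dump quoted earlier in the thread.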


              • John Machin

                #8
                Re: Html character entity conversion

                Claudio Grondi wrote:
                > >>> strUnicode = eval(
                > ... 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
                > >>> print strUnicode
                > приветпитон
                >
                > Sorry for that mess not taking the space into consideration, but I think
                > you can get the idea anyway.

                I hope he *doesn't* get that "idea".

                #>>> strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
                #>>> strUnicode = [unichr(int(x)) for x in strHTML.replace('&#','').split(';') if x]
                #>>> strUnicode
                [u'\u043f', u'\u0440', u'\u0438', u'\u0432', u'\u0435', u'\u0442',
                u'\u043f', u'\u0438', u'\u0442', u'\u043e', u'\u043d']
                #>>>


                • Claudio Grondi

                  #9
                  Re: Html character entity conversion

                  John Machin wrote:
                  > I hope he *doesn't* get that "idea".
                  >
                  > #>>> strUnicode = [unichr(int(x)) for x in strHTML.replace('&#','').split(';') if x]

                  Knowing about the built-in function unichr() is a good thing, but
                  there are still drawbacks, because (not tested!) e.g.:
                  '100x hallo Python' translates to
                  '100x &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1055;&#1080;&#1090;&#1086;&#1085;'
                  and this can't be handled by improving the core idea through use of
                  unichr() instead of the eval() stuff, because the approach itself is
                  wrong: .replace() and .split() work only on the given example, not in
                  the general case.
                  I am just too lazy to sit down and work on code that extracts the
                  &#....; sequences from the HTML and converts only them, leaving the
                  rest of the string unchanged, in order to arrive at a solution that
                  works in the general case (it should not be hard, and I suppose the OP
                  has it already :-) if he is at a Python skill level of playing around
                  with the mechanize module).
                  I am still convinced that there must be a more elegant and direct
                  solution, so the subject is still fully open for improvements towards
                  the actual final goal.
                  I suppose that, in addition to unichr(), one can also use unicode() as
                  a replacement for eval().

                  To Andrei: can you please post here what you have finally arrived at?

                  Claudio Grondi
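
The drawback raised in this post (marked "not tested!" there) can be reproduced with a short sketch. It is a Python 3 transcription of the replace/split idea, with chr() in place of unichr(); the function name decode_by_split is made up for the illustration.

```python
def decode_by_split(text):
    # The replace/split approach from the thread: strip '&#', split on ';',
    # and convert every remaining piece as a decimal code point.
    return ''.join(chr(int(x)) for x in text.replace('&#', '').split(';') if x)


pure = '&#1087;&#1080;&#1090;&#1086;&#1085;'          # references only
mixed = '100x &#1087;&#1080;&#1090;&#1086;&#1085;'    # plain text plus references

print(decode_by_split(pure))

try:
    decode_by_split(mixed)
except ValueError as exc:
    # '100x 1087' is not a number, so int() raises.
    print('fails on mixed text:', exc)
```

A space directly in front of a reference does not raise, because int(' 1087') is accepted, but the space is silently dropped, which is how the two words ran together earlier in the thread.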


                  • Duncan Booth

                    #10
                    Re: Html character entity conversion

                    pak.andrei@gmail.com wrote:
                    > How can I convert encoded string
                    >
                    > sEncodedHtmlText = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
                    >
                    > into human readable:
                    >
                    > sDecodedHtmlText == 'привет питон'

                    How about:
                    >>> import re
                    >>> sEncodedHtmlText = 'text: &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
                    >>> def unescape(m):
                    ...     return unichr(int(m.group(0)[2:-1]))
                    ...
                    >>> print re.sub('&#[0-9]+;', unescape, sEncodedHtmlText)
                    text: ???????????

                    I'm afraid my newsreader couldn't cope with either your original text or my
                    output, but I think this gives the string you wanted. You probably also
                    ought to decode sEncodedHtmlText to unicode first, otherwise anything which
                    isn't an entity escape will be converted to unicode using the default ascii
                    encoding.
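
A Python 3 sketch of the same regular-expression idea, extended to hexadecimal references and including the "decode first" step mentioned above. The names decode_numeric_charrefs and _replace are hypothetical, and the windows-1251 charset is an assumption taken from the meta tag quoted earlier in the thread.

```python
import re

# Matches decimal (&#1087;) and hexadecimal (&#x43f;) character references.
_CHARREF = re.compile(r'&#([0-9]+|[xX][0-9a-fA-F]+);')


def _replace(match):
    ref = match.group(1)
    if ref[0] in 'xX':
        return chr(int(ref[1:], 16))
    return chr(int(ref))


def decode_numeric_charrefs(text):
    """Replace only the numeric references; all other text is left alone."""
    return _CHARREF.sub(_replace, text)


# Bytes as they might arrive from the server (assumed charset: windows-1251).
raw = b'text: &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

# Decode the bytes to text first, then expand the references.
text = raw.decode('windows-1251')
print(decode_numeric_charrefs(text))
```

Named entities such as &amp; are deliberately left untouched here, since the pattern matches numeric references only.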


                    • yichun

                      #11
                      Re: Html character entity conversion

                      pak.andrei@gmail.com wrote:
                      > Thank you for response.
                      > It doesn't matter what is 'BeautifulSoup' ...

                      However, the best solution is to ask BeautifulSoup to do that for you.
                      If you do

                      soup = BeautifulSoup(your_html_page, convertEntities="html")

                      you should not have to worry about the problem you had. This converts
                      all the HTML entities (the five you see as soup.entitydefs) and all the
                      "&#xxx;" stuff to their Python unicode strings.

                      yichun
