Python 2.7 re is not able to find characters öÖäÄåÅ

  • gintare
    New Member
    • Mar 2007
    • 103

    Python 2.7 re is not able to find characters öÖäÄåÅ

    OS: Windows 7, 64-bit
    Python 2.7

    Code:
    text="""AU  - Huang, Zhipeng
    AU  - Geyer, Nadine
    AU  - Werner, Peter
    AU  - de Boor, Johannes
    AU  - Gösele, Ulrich
    TI  - Metal-Assisted Chemical Etching of Silicon: A Review"""
    
    auths=re.findall('AU  \- [öÖäÄåÅa-zA-Z.,\s]+', text)
    print(auths)
    I am getting the result:
    ['AU - Huang, Zhipeng', 'AU - Geyer, Nadine', 'AU - Werner, Peter', 'AU - de Boor, Johannes', 'AU - G']
    Python is not able to find "Gösele".
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    I am not having that issue in Python 2.7.2. Try this:
    Code:
    # coding=utf-8
    import re
    
    text="""AU  - Huang, Zhipeng
    AU  - Geyer, Nadine
    AU  - Werner, Peter
    AU  - de Boor, Johannes
    AU  - Gösele, Ulrich
    TI  - Metal-Assisted Chemical Etching of Silicon: A Review"""
    
    auths=re.findall('AU  \- [öÖäÄåÅa-zA-Z, ]+', text)
    for item in auths:
        print item
    The results:
    Code:
    >>> AU  - Huang, Zhipeng
    AU  - Geyer, Nadine
    AU  - Werner, Peter
    AU  - de Boor, Johannes
    AU  - Gösele, Ulrich
    >>> 
    >>> print auths
    ['AU  - Huang, Zhipeng', 'AU  - Geyer, Nadine', 'AU  - Werner, Peter', 'AU  - de Boor, Johannes', 'AU  - G\xc3\xb6sele, Ulrich']
    >>>
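One plausible reading of the two results above (an assumption, the thread does not confirm it): in Python 2 both the pattern and the text are byte strings, so the character class matches single bytes, not characters. If the pattern and the text are encoded the same way, "ö" still matches; if they are not, the match stops at "G". A minimal sketch, where `find_bytes`, `LINE` and `PATTERN` are names made up for illustration:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

LINE = 'AU  - Gösele, Ulrich'
PATTERN = 'AU  - [öÖäÄåÅa-zA-Z, ]+'


def find_bytes(pattern_encoding, text_encoding='utf-8'):
    """Match on byte strings, encoding pattern and text independently."""
    pattern = PATTERN.encode(pattern_encoding)
    return re.findall(pattern, LINE.encode(text_encoding))


# Pattern and text are both UTF-8 bytes: the two bytes of "ö" are in the class.
same_encoding = find_bytes('utf-8')

# Pattern is Latin-1 bytes, text is UTF-8 bytes: the match stops after "G".
mixed_encoding = find_bytes('latin-1')

# Pattern and text are both Unicode strings: "ö" is matched as one character.
unicode_match = re.findall(PATTERN, LINE)

print(len(same_encoding[0]), len(mixed_encoding[0]), len(unicode_match[0]))
```

Matching Unicode against Unicode avoids depending on how the script file or the console happens to be encoded.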


    • gintare
      New Member
      • Mar 2007
      • 103

      #3
      I am sorry for the misinformation. The text is actually not a string literal in the script; it is read from a file. The error appears when the text comes from the file:
      Code:
      fcit=codecs.open('C:/Users/Gintare/Downloads/Citations.txt','r',encoding='utf-8')
      text=fcit.readlines()
      Thanks for pointing that out. For now I am just copying the file contents into the Python script, and everything works. But if you know how to read the file correctly, could you please post it?


      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        You are calling readlines(), which returns a list of lines, while re.findall() needs a single string. You should use read(), or iterate over the list returned by readlines(). If the file is saved with UTF-8 encoding, this should work:
        Code:
        # coding=utf-8
        import re
        import codecs
        
        fcit=codecs.open("data.txt", encoding="utf-8")
        text=fcit.read()
        auths=re.findall(codecs.decode('AU  \- [öÖäÄåÅa-zA-Z, ]+', "utf-8"), text)
        
        for item in auths:
            print item
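The two options mentioned above (read() versus iterating over the lines) can be sketched as follows. This is only an illustration under assumptions: `find_authors`, `find_authors_by_line` and the temporary sample file are made up here, and `io.open` with a `u''`-style Unicode pattern is swapped in for `codecs.open` and `codecs.decode` so the same code also runs on Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import io
import os
import re
import tempfile

# Unicode pattern, so "ö" is matched as one character, not as two bytes.
AUTHOR_RE = re.compile('AU  - [öÖäÄåÅa-zA-Z, ]+')


def find_authors(path):
    """Read the whole UTF-8 file as one string and search it once."""
    with io.open(path, encoding='utf-8') as handle:
        return AUTHOR_RE.findall(handle.read())


def find_authors_by_line(path):
    """Search line by line; findall() gets one string per call."""
    matches = []
    with io.open(path, encoding='utf-8') as handle:
        for line in handle:
            matches.extend(AUTHOR_RE.findall(line))
    return matches


# Hypothetical sample file standing in for Citations.txt.
sample = ('AU  - de Boor, Johannes\n'
          'AU  - Gösele, Ulrich\n'
          'TI  - Metal-Assisted Chemical Etching of Silicon: A Review\n')
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(sample_path, 'w', encoding='utf-8') as handle:
    handle.write(sample)

authors = find_authors(sample_path)
authors_by_line = find_authors_by_line(sample_path)
os.remove(sample_path)

print(len(authors), len(authors_by_line))
```

Passing the list from readlines() straight to re.findall() raises a TypeError, which is why both helpers hand it a string.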


        • gintare
          New Member
          • Mar 2007
          • 103

          #5
          Thanks, it works with file.read()
