Parsing strings -> numbers

This topic is closed.
  • Tuang

    Parsing strings -> numbers

    I've been looking all over in the docs, but I can't figure out how
    you're *supposed* to parse formatted strings into numbers (and other
    data types, for that matter) in Python.

    In C#, you can say

    int.Parse(myString)

    and it will turn a string like "-12,345" into a proper int. It works
    for all sorts of data types with all sorts of formats, and you can
    pass it locale parameters to tell it, for example, to parse a German
    "12.345,67" into 12345.67. Java does this, too.
    (Integer.parseInt(myStr), IIRC).

    What's the equivalent in Python?

    And if the only problem is comma thousand-separators (e.g.,
    "12,345.67"), is there a higher-performance way to convert that into
    the number 12345.67 than using Python's formal parsers?

    Thanks.
  • Skip Montanaro

    #2
    Re: Parsing strings -> numbers


    tuanglen> I've been looking all over in the docs, but I can't figure out
    tuanglen> how you're *supposed* to parse formatted strings into numbers
    tuanglen> (and other data types, for that matter) in Python.

    Check out the locale module. From "pydoc locale":

    Help on module locale:

    NAME
    locale - Locale support.

    FILE
    /Users/skip/local/lib/python2.4/locale.py

    MODULE DOCS
    http://www.python.org/doc/current/li...le-locale.html

    DESCRIPTION
    The module provides low-level access to the C lib's locale APIs
    and adds high level number formatting APIs as well as a locale
    aliasing engine to complement these.

    ...

    FUNCTIONS
    atof(str, func=<type 'float'>)
    Parses a string as a float according to the locale settings.

    atoi(str)
    Converts a string to an integer according to the locale settings.

    ...

    Skip

    • Miki Tebeka

      #3
      Re: Parsing strings -> numbers

      Hello Tuang,
      > In C#, you can say
      >
      > int.Parse(myString)
      >
      > and it will turn a string like "-12,345" into a proper int. It works
      > for all sorts of data types with all sorts of formats, and you can
      > pass it locale parameters to tell it, for example, to parse a German
      > "12.345,67" into 12345.67. Java does this, too.
      > (Integer.parseInt(myStr), IIRC).
      >
      > What's the equivalent in Python?
      Python has built-in "int", "long" and "float" functions. However,
      they are more limited than what you want.
      > And if the only problem is comma thousand-separators (e.g.,
      > "12,345.67"), is there a higher-performance way to convert that into
      > the number 12345.67 than using Python's formal parsers?
      x = float("12,345.67".replace(",", ""))  # float(), because int() rejects "12345.67"

      HTH.
      Miki
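
A minimal sketch of the replace-then-convert approach, assuming modern Python 3 and input that uses "," only as a thousands separator. The helper name `parse_plain_number` is hypothetical, not part of the standard library.

```python
def parse_plain_number(text, thousands_sep=","):
    """Parse a string such as "12,345.67" by dropping the separators first."""
    cleaned = text.strip().replace(thousands_sep, "")
    # int() rejects a decimal point, so fall back to float() when int() fails.
    try:
        return int(cleaned)
    except ValueError:
        return float(cleaned)


print(parse_plain_number("-12,345"))
print(parse_plain_number("12,345.67"))
```

This does no validation of where the separators sit, so a malformed string like "1,2,3" would be accepted as 123.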

      • Tuang

        #4
        Re: Parsing strings -> numbers

        Skip Montanaro <skip@pobox.com> wrote in message news:<mailman.1040.1069710514.702.python-list@python.org>...
        > tuanglen> I've been looking all over in the docs, but I can't figure out
        > tuanglen> how you're *supposed* to parse formatted strings into numbers
        > tuanglen> (and other data types, for that matter) in Python.
        >
        > Check out the locale module. From "pydoc locale":
        >
        > Help on module locale:
        >
        > NAME
        > locale - Locale support.
        >
        > FILE
        > /Users/skip/local/lib/python2.4/locale.py
        >
        > MODULE DOCS
        > http://www.python.org/doc/current/li...le-locale.html
        >
        > DESCRIPTION
        > The module provides low-level access to the C lib's locale APIs
        > and adds high level number formatting APIs as well as a locale
        > aliasing engine to complement these.
        >
        > ...
        >
        > FUNCTIONS
        > atof(str, func=<type 'float'>)
        > Parses a string as a float according to the locale settings.
        >
        > atoi(str)
        > Converts a string to an integer according to the locale settings.
        >
        > ...
        >

        Thanks for taking a shot at it, but it doesn't appear to work:
        >>> import locale
        >>> locale.atoi("-12,345")
        Traceback (most recent call last):
          File "<interactive input>", line 1, in ?
          File "C:\Python2321\lib\locale.py", line 179, in atoi
            return atof(str, int)
          File "C:\Python2321\lib\locale.py", line 175, in atof
            return func(str)
        ValueError: invalid literal for int(): -12,345
        >>> locale.getdefaultlocale()
        ('en_US', 'cp1252')
        >>> locale.atoi("-12345")
        -12345

        Given the locale it thinks I have, it should be able to parse
        "-12,345" if it can handle formats containing thousands separators,
        but apparently it can't.

        If Python doesn't actually have its own parsing of formatted numbers,
        what's the preferred Python approach for taking data, perhaps
        formatted currencies such as "-$12,345.00" scraped off a Web page, and
        turning it into numerical data?

        Thanks.

        • Duncan Booth

          #5
          Re: Parsing strings -> numbers

          tuanglen@hotmail.com (Tuang) wrote in
          news:df045d93.0311250127.67395ae@posting.google.com:
          > >>> locale.getdefaultlocale()
          > ('en_US', 'cp1252')
          > >>> locale.atoi("-12345")
          > -12345
          >
          > Given the locale it thinks I have, it should be able to parse
          > "-12,345" if it can handle formats containing thousands separators,
          > but apparently it can't.
          >
          > If Python doesn't actually have its own parsing of formatted numbers,
          > what's the preferred Python approach for taking data, perhaps
          > formatted currencies such as "-$12,345.00" scraped off a Web page, and
          > turning it into numerical data?
          >

          The problem is that by default the numeric locale is not set up to parse
          those numbers. You have to set that up separately:
          >>> import locale
          >>> locale.getlocale(locale.LC_NUMERIC)
          (None, None)
          >>> locale.getlocale()
          ['English_United Kingdom', '1252']
          >>> locale.setlocale(locale.LC_NUMERIC, "English")
          'English_United States.1252'
          >>> locale.atof('1,234')
          1234.0
          >>> locale.setlocale(locale.LC_NUMERIC, "French")
          'French_France.1252'
          >>> locale.atof('1,234')
          1.234

          Unless I've missed something, it doesn't support ignoring currency symbols
          when parsing numbers, so you still can't handle "-$12,345.00" even if you
          do set the numeric and monetary locales.
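
The session above can be wrapped in a small helper. This is only a sketch, assuming Python 3: `atof_in_locale` is a hypothetical name, and the locale names passed to it are assumptions that differ between platforms (for example "en_US.UTF-8" on many Unix systems, "English_United States.1252" on Windows). It sets LC_NUMERIC temporarily, calls `locale.atof()`, and restores the previous setting.

```python
import locale


def atof_in_locale(text, candidate_names):
    """Try locale.atof() under the first candidate locale that can be set.

    Returns None when no candidate is available or the string does not parse.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        for name in candidate_names:
            try:
                locale.setlocale(locale.LC_NUMERIC, name)
            except locale.Error:
                continue  # this locale is not installed; try the next name
            try:
                return locale.atof(text)
            except ValueError:
                return None
        return None
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # always restore


print(atof_in_locale("1,234", ["en_US.UTF-8", "English_United States.1252"]))
```

Because `setlocale()` changes process-wide state, a helper like this is best kept out of multi-threaded code.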

          --
          Duncan Booth duncan@rcp.co.uk
          int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
          "\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?

          • Skip Montanaro

            #6
            Re: Parsing strings -> numbers

            tuang> Thanks for taking a shot at it, but it doesn't appear to work:
            >>> import locale
            >>> locale.atoi("-12,345")
            Traceback (most recent call last):
              File "<interactive input>", line 1, in ?
              File "C:\Python2321\lib\locale.py", line 179, in atoi
                return atof(str, int)
              File "C:\Python2321\lib\locale.py", line 175, in atof
                return func(str)
            ValueError: invalid literal for int(): -12,345
            >>> locale.getdefaultlocale()
            ('en_US', 'cp1252')
            >>> locale.atoi("-12345")
            -12345

            Take a look at the output of locale.localeconv() with various locales set.
            I think you'll find that locale.localeconv()['thousands_sep'] is '', not ','.
            Failing that, you might want to simply replace the commas and dollar signs
            with empty strings before passing to int() or float(), as someone else
            suggested.
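
A quick way to check this, sketched for Python 3: with LC_NUMERIC set to the portable "C" locale, `localeconv()` reports an empty thousands separator, which is why `locale.atoi("-12,345")` fails until a numeric locale is set explicitly.

```python
import locale

# The "C" locale is always available; it is also the LC_NUMERIC default at startup.
locale.setlocale(locale.LC_NUMERIC, "C")
conv = locale.localeconv()
print(repr(conv["thousands_sep"]), repr(conv["decimal_point"]))
```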

            Be careful if you're scraping web pages which might not use the same charset
            as you do. You may find something like:

            $123.456,78

            as a quote price on a European website. I don't know how to tell what the
            remote site used as its locale when formatting numeric data. Perhaps
            knowing the charset of the page is sufficient to make an educated guess.

            Skip
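
A minimal sketch of the strip-and-convert approach for currency strings, assuming Python 3 and that the caller already knows which separators the page uses. The name `parse_amount` and its parameters are hypothetical; `Decimal` is used here instead of `float` so that amounts like 123456.78 stay exact.

```python
import re
from decimal import Decimal


def parse_amount(text, thousands_sep=",", decimal_point="."):
    """Parse a formatted amount such as "-$12,345.00" into a Decimal."""
    cleaned = text.strip().replace(thousands_sep, "")
    cleaned = cleaned.replace(decimal_point, ".")
    # Drop currency symbols and anything else that is not a sign, digit or point.
    cleaned = re.sub(r"[^0-9.\-]", "", cleaned)
    return Decimal(cleaned)


print(parse_amount("-$12,345.00"))
print(parse_amount("$123.456,78", thousands_sep=".", decimal_point=","))
```

The separators are passed in explicitly because, as noted above, nothing in the string itself says which convention the remote site used.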

            • Tuang

              #7
              Re: Parsing strings -> numbers

              Skip Montanaro <skip@pobox.com> wrote
              >
              > Be careful if you're scraping web pages which might not use the same charset
              > as you do. You may find something like:
              >
              > $123.456,78
              >
              > as a quote price on a European website. I don't know how to tell what the
              > remote site used as its locale when formatting numeric data. Perhaps
              > knowing the charset of the page is sufficient to make an educated guess.

              Thanks, Skip. I'm not planning some sort of shady screen scraping
              operation or anything of that sort. This is more of a generic question
              about how to use Python as a convenient utility language.

              Sometimes I'll find a table of interesting data somewhere as I'm just
              surfing around the Web, and I'll want to grab the data and play with
              it a bit. At that scale of operation, I can just look at the page
              source and figure out the encoding, what the currency is, etc. I know
              how to turn a formatted string into a usable number in other languages
              that I use (though I might have to check the docs in some cases to
              remind myself of the details), and since the docs didn't really make
              it obvious what the "one clear and obvious way to do it" was in
              Python, I thought I'd ask.

              It appears as though Python doesn't (yet) have the same formal support
              for format parsing and internationalization that languages like C# and
              Java have, but that's okay for now. I just wanted to make sure I
              didn't start creating my own naive, homemade equivalents of functions
              that are already part of the standard API.
