any other best way of reading the file

  • psbasha
    Contributor
    • Feb 2007
    • 440

    #31
    Thanks BV. You are really great; you are very good at regular expressions and file parsing.

    If the file contains the Point data as shown below
    Code:
    Sample
    Point  *3280505         0               1.28286145-03  1.28286145E+03
    *       -2.01004501+02
    The output should be
    [3280505, 0, 0.00128286145, 1282.8614500000001, -201.00450099999998]

    But we are getting the output as
    [3280505, 0, 1.28286145, 3, 1282.8614500000001, -2.0100450099999998, 2]

    How can the exponent data above be fixed?

    -PSB
    Last edited by psbasha; Dec 28 '07, 07:08 AM. Reason: Content was not visible

    Comment

    • bvdet
      Recognized Expert Specialist
      • Oct 2006
      • 2851

      #32
      You are welcome. :)

      Your data is invalid because there is no 'E' indicating exponential notation. You will need to correct the data before processing it so that it can be converted to a floating point number. This pattern matches the invalid data:[code=Python]pattinvalid = re.compile(r'''
          \d+\.\d+\+\d+|    # invalid eng notation +
          \d+\.\d+-\d+      # invalid eng notation -
          ''', re.X
          )[/code]
      This code corrects the data:[code=Python]if pattinvalid.search(line):
          for item in pattinvalid.findall(line):
              line = line.replace(item, item.replace('-', 'E-').replace('+', 'E+'))[/code]
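
As an alternative sketch (not the code from the post above), the same correction could probably be done in one pass with re.sub. This assumes Python 3, that every mantissa contains a decimal point, and that fields are separated by whitespace; the names _MISSING_E and add_missing_exponent are made up for illustration.

```python
import re

# A mantissa with a decimal point followed directly by a signed exponent,
# e.g. "1.28286145-03" or "2.01004501+02" (the 'E' is missing).
_MISSING_E = re.compile(r'(\d+\.\d*|\.\d+)([+-]\d+)')

def add_missing_exponent(line):
    """Insert 'E' between mantissa and exponent wherever it is missing."""
    return _MISSING_E.sub(r'\1E\2', line)

# Values that already contain 'E' are left untouched.
print(add_missing_exponent("Point  *3280505  0  1.28286145-03  1.28286145E+03"))
print(add_missing_exponent("*       -2.01004501+02"))
```

Because the pattern requires the sign to follow the mantissa digits immediately, a leading minus sign on the number itself is not affected.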

      Comment

      • psbasha
        Contributor
        • Feb 2007
        • 440

        #33
        Hi BV,

        Could you please help me understand the piece of code below in simpler terms?

        fileList = [item.strip() for item in open(fn).readlines()\
        if not item.startswith('$')]

        After the 'for' loop and the 'if' condition there is no ':' to begin a block.
        How is this different from an ordinary 'for' and 'if' with ':'? Are the two equivalent, with this form used to reduce the number of lines and improve readability? What is this way of writing called in Python?

        Can you provide links for learning these concepts?

        I was trying to add the invalid-data handling to the main code you provided, but I have not succeeded. If I understand the concept above, I hope I can implement the invalid-data logic easily.
        Thanks
        PSB
        Last edited by psbasha; Dec 29 '07, 07:31 AM. Reason: Not able to see my posting data

        Comment

        • alijannaty52
          New Member
          • Dec 2007
          • 17

          #34
          This is the best way I could find for you. Go through the link; I hope it will be helpful.

          BestFileReading Method

          Comment

          • bvdet
            Recognized Expert Specialist
            • Oct 2006
            • 2851

            #35
            The code assigned to fileList creates a list, as the variable name implies, and is called a list comprehension. This list comprehension is equivalent to:[code=Python]f = open(fn)
            fileList = []
            for line in f:
                if not line.startswith('$'):
                    fileList.append(line.strip())
            f.close()[/code]To read more about list comprehensions - LINK
            For more links, do a web search on 'list comprehension python'.
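
To make the equivalence concrete, here is a small sketch that runs both forms on the same in-memory text and compares the results. It assumes Python 3 and uses io.StringIO in place of a real file; the function names are hypothetical.

```python
import io

def data_lines_comprehension(stream):
    """List comprehension: filter and strip in a single expression."""
    return [line.strip() for line in stream if not line.startswith('$')]

def data_lines_loop(stream):
    """The same logic written as an ordinary for/if block."""
    result = []
    for line in stream:
        if not line.startswith('$'):
            result.append(line.strip())
    return result

sample = "$ Line Details\nRect 2 1 2 3 7 6\n$\nTria 5 1 7 2 11\n"
print(data_lines_comprehension(io.StringIO(sample)))
print(data_lines_loop(io.StringIO(sample)))
```

Both calls return the same list; the comprehension is just a more compact spelling of the loop.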

            The full source code for parsing your sample data file:[code=Python]import re

            def convert_data(s):
                # try int first, then float; return the string unchanged if both fail
                for func in (int, float):
                    try:
                        return func(s)
                    except ValueError:
                        pass
                return s

            pattnum = re.compile(r'''
                -\d+\.\d+E\+\d+|    # engineering notation -+
                \d+\.\d+E\+\d+|     # engineering notation ++
                -\d+\.\d+E-\d+|     # engineering notation --
                \d+\.\d+E-\d+|      # engineering notation +-
                -\d+\.\d+|          # negative float format
                \d+\.\d+|           # positive float format
                -\d+\.|             # negative float format
                \d+\.|              # positive float format
                -\.\d+|             # negative float format
                \.\d+|              # positive float format
                \d+                 # positive integer
                ''', re.X
                )

            pattinvalid = re.compile(r'''
                \d+\.\d+\+\d+|      # invalid eng notation +
                \d+\.\d+-\d+        # invalid eng notation -
                ''', re.X
                )

            def parseData(fn, *kargs):
                # read all lines, skipping comment lines that start with '$'
                fileList = [item.strip() for item in open(fn).readlines()
                            if not item.startswith('$')]

                pattkey = re.compile('|'.join([r'\b(%s)' % item for item in kargs]))

                # create dictionary with keys from kargs
                masterDict = dict(zip(kargs, [[] for _ in kargs]))
                inData = False
                for line in fileList:

                    # check for invalid data
                    if pattinvalid.search(line):
                        for item in pattinvalid.findall(line):
                            line = line.replace(item, item.replace('-', 'E-').replace('+', 'E+'))

                    if inData and line.startswith('*'):
                        # continuation line of the current record
                        data.extend(re.findall(pattnum, line))
                    elif inData and not line.startswith('*'):
                        # the current record has ended; store it
                        masterDict[m.group(0)].append([convert_data(item)
                                                       for item in data])
                        inData = False
                        m = pattkey.match(line)
                        if m:
                            # m.group(0) is the current keyword
                            if '*' in line:
                                inData = True
                                data = re.findall(pattnum, line)
                            else:
                                data = re.findall(pattnum, line)
                                masterDict[m.group(0)].append([convert_data(item)
                                                               for item in data])
                    else:
                        m = pattkey.match(line)
                        if m:
                            # m.group(0) is the current keyword
                            if '*' in line:
                                inData = True
                                data = re.findall(pattnum, line)
                            else:
                                data = re.findall(pattnum, line)
                                masterDict[m.group(0)].append([convert_data(item)
                                                               for item in data])
                return masterDict

            if __name__ == '__main__':
                fn = 'sample_points.txt'
                keywords = ['Point', 'Othr', 'Rect', 'PRect', 'PLine', 'Line', 'Tria']
                dd = parseData(fn, *keywords)
                for key in dd:
                    print(key)
                    for item in dd[key]:
                        print(' %s' % item)

            ''' Output
            >>> Point
            [1, 0.0, 0.0, 0.0]
            [2, 1.0, 0.0, 0.0]
            [3, 2.0, 0.0, 0.0]
            [4, -3.0, 0.0, 0.0]
            [5, 0.0, 1.0, 0.0]
            [6, 1.0, 1.0, 0.0]
            [7, 2.0, 1.0, 0.0]
            [8, 4.0, 1.0, 0.0]
            [9, 0.0, -2.0, 0.0]
            [3280504, 0, 1282.8614500000001, 1282.8614500000001, -201.004501]
            [3280606, 0, 0.0069264000650000003, -1282.8614500000001, -10100.4501, -0.014385767359999999]
            PLine
            [1, 6, 1.5, 9.375, 0.001, -0.001]
            Tria
            [5, 1, 7, 2, 11]
            PRect
            [4, 11, 15, 16, 10, 11, 0.29999999999999999]
            Line
            [1, 1, 1, 2, 0.0029520000000000002, 0.99254699999999996, 0.121827]
            Rect
            [2, 1, 2, 3, 7, 6]
            [3, 1, 3, 4, 8, 7]
            [4, 1, 5, 6, 10, 11, 0.0]
            Othr
            [1, 1, 5, 6, 10, 11, 0.0, 0.0, 10, 11, 0.0, 1.0]
            >>>
            '''

            ''' Data File Contents
            $$$$$
            START
            COLOR RED
            LINETYPE SOLID
            END
            $$$$$$$
            PLine 1 6 1.5 9.375 .001 -.001
            $ Line Details
            Line* 1 1 1 2
            * .002952 .992547 .121827
            $
            Rect 2 1 2 3 7 6
            Rect 3 1 3 4 8 7
            PRect* 4 11 15 16
            * 10 11 0.3
            Rect* 4 1 5 6
            * 10 11 0.
            Othr* 1 1 5 6
            * 10 11 0. 0.
            * 10 11 0. 1.0
            $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
            Tria 5 1 7 2 11
            $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
            Point 1 0.0 0.0 0.0
            Point 2 1.0 0.0 0.0
            Point 3 2.0 0.0 0.0
            Point 4 -3.0 0.0 0.0
            Point 5 0.0 1.0 0.0
            Point 6 1.0 1.0 0.0
            Point 7 2.0 1.0 0.0
            Point 8 4.0 1.0 0.0
            Point* 9 0.0 -2.0
            * 0.0
            Point *3280504 0 1.28286145E+03 1.28286145+03
            * -2.01004501E+02
            #
            Point *3280606 0 6.926400065-03 -1.28286145+03
            * -1.01004501+04 -1.438576736-02
            $
            END
            '''[/code]You should test this on real data for valid results. I cannot guarantee that this is a final solution for you.
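
For readers following along, here is a small standalone sketch of how the convert_data step treats individual tokens. It mirrors the function in the post above but is written for Python 3, and it assumes ValueError is the only failure worth catching.

```python
def convert_data(token):
    """Return token as an int if possible, else a float, else the string itself."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            continue  # this cast did not fit; try the next one
    return token

tokens = ['3280505', '0', '1.28286145E-03', '0.', '.001', 'Point']
print([convert_data(t) for t in tokens])
```

Trying int before float matters here: it keeps identifiers such as 3280505 as integers instead of turning them into 3280505.0.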

            Comment

            • psbasha
              Contributor
              • Feb 2007
              • 440

              #36
              Code:
              SampleFile
              $$$$$
              START
              COLOR RED
              LINETYPE SOLID
              END
              $$$$$$$
              PLine   1        6      1.5     9.375   .001    .001
              $ Line Details
              Line*   1               1                1              2
              *       .002952         .992547         .121827
              $
              
              Rect    2        1       2       3       7       6
              Rect    3        1       3       4       8       7
              PRect*  4               11              15              16
              *       10              11              0.3
              Rect*   4               1               5               6
              *       10              11              0.
              Othr*   1               1               5               6
              *       10              11              0.              0.
              *       10              11              0.              1.0
              Oth1*   1               1               5               6
              *       10              11              0.              0.
              *       10              11              0.              1.0
              *       10              11              0.              1.0
              *       10              11              0.              1.0
              Rect*   5               1               5               6
              *       10              11              0.
              Rect    1000000010000000200000007000000060000000
              Rect    2000000020000000300000008000000070000000
              Rect    3000000030000000400000009000000080000000
              Tria    40000000400000005000000090000000
              $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$  $
              Tria     5        1       7       2       11
              $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$  
              Point   1               0.0     0.0     0.0
              Point   2               1.0     0.0     0.0
              Point   3               2.0     0.0     0.0
              Point   4               3.0     0.0     0.0
              Point   5               0.0     1.0     0.0
              Point   6               1.0     1.0     0.0
              Point   7               2.0     1.0     0.0
              Point   8               4.0     1.0     0.0
              Point*  9                              0.0             2.0
              *       0.0
              Point  *3280504         0               1.28286145E+03  1.28286145E+03
              *       -2.01004501E+02
              Point  *3280505         0               1.28286145-03  1.28286145+03
              *       -2.01004501+02
              Point   100000000.      0.      0.
              Point   200000005.      0.      0.
              Point   3000000010.     0.      0.
              Point   4000000015.     0.      0.
              Point   5000000020.     0.      0.
              Point   600000000.      5.      0.
              Point   700000005.      5.      0.
              Point   8000000010.     5.      0.
              Point   9000000015.     5.      0.
              $
              END
              Last edited by psbasha; Dec 29 '07, 11:07 PM. Reason: Added one field

              Comment

              • psbasha
                Contributor
                • Feb 2007
                • 440

                #37
                In the format above (i.e. 8-digit and 16-digit fields), if a column is completely filled with 8 digits, the output is incorrect.

                Find the output below:

                Code:
                Output
                >>> Point
                    [1, 0.0, 0.0, 0.0]
                    [2, 1.0, 0.0, 0.0]
                    [3, 2.0, 0.0, 0.0]
                    [4, 3.0, 0.0, 0.0]
                    [5, 0.0, 1.0, 0.0]
                    [6, 1.0, 1.0, 0.0]
                    [7, 2.0, 1.0, 0.0]
                    [8, 4.0, 1.0, 0.0]
                    [9, 0.0, 2.0, 0.0]
                    [3280504, 0, 1282.8614500000001, 1282.8614500000001, -201.004501]
                    [3280505, 0, 0.0012828614500000001, 1282.8614500000001, -201.004501]
                    [100000000.0, 0.0, 0.0]
                    [200000005.0, 0.0, 0.0]
                    [3000000010.0, 0.0, 0.0]
                    [4000000015.0, 0.0, 0.0]
                    [5000000020.0, 0.0, 0.0]
                    [600000000.0, 5.0, 0.0]
                    [700000005.0, 5.0, 0.0]
                    [8000000010.0, 5.0, 0.0]
                    [9000000015.0, 5.0, 0.0]
                PLine
                    [1, 6, 1.5, 9.375, 0.001, 0.001]
                Tria
                    [40000000400000005000000090000000L]
                    [5, 1, 7, 2, 11]
                PRect
                    [4, 11, 15, 16, 10, 11, 0.29999999999999999]
                Line
                    [1, 1, 1, 2, 0.0029520000000000002, 0.99254699999999996, 0.121827]
                Rect
                    [2, 1, 2, 3, 7, 6]
                    [3, 1, 3, 4, 8, 7]
                    [4, 1, 5, 6, 10, 11, 0.0]
                    [5, 1, 5, 6, 10, 11, 0.0]
                    [1000000010000000200000007000000060000000L]
                    [2000000020000000300000008000000070000000L]
                    [3000000030000000400000009000000080000000L]
                Othr
                    [1, 1, 5, 6, 10, 11, 0.0, 0.0, 10, 11, 0.0, 1.0]
                The incorrect output entries are:
                Code:
                Incorrect
                
                    [100000000.0, 0.0, 0.0]
                    [200000005.0, 0.0, 0.0]
                    [3000000010.0, 0.0, 0.0]
                    [4000000015.0, 0.0, 0.0]
                    [5000000020.0, 0.0, 0.0]
                    [600000000.0, 5.0, 0.0]
                    [700000005.0, 5.0, 0.0]
                    [8000000010.0, 5.0, 0.0]
                    [9000000015.0, 5.0, 0.0]
                Tria
                    [40000000400000005000000090000000L]
                
                Rect
                    [1000000010000000200000007000000060000000L]
                    [2000000020000000300000008000000070000000L]
                    [3000000030000000400000009000000080000000L]
                The incorrect output data above has to be separated by commas. How can this be fixed for the case where a field is completely filled in the 8-digit or 16-digit format?

                Thanks
                PSB

                Comment

                • psbasha
                  Contributor
                  • Feb 2007
                  • 440

                  #38
                  Shown below is the input data file, corrected into the proper format.

                  Thanks
                  PSB
                  Last edited by psbasha; Dec 30 '07, 12:31 AM. Reason: Corrected the Input file

                  Comment

                  • psbasha
                    Contributor
                    • Feb 2007
                    • 440

                    #39
                    Code:
                    Correctly Formatted Input File
                    $$$$$
                    START
                    COLOR RED
                    LINETYPE SOLID
                    END
                    $$$$$$$
                    PLine   1        6      1.5     9.375   .001    .001
                    $ Line Details
                    Line*   1               1                1              2
                    *       .002952         .992547         .121827
                    $
                    
                    Rect    2        1      2       3       7       6
                    Rect    3        1      3       4       8       7
                    PRect*  4               11              15              16
                    *       10              11              0.3
                    Rect*   4               1               5               6
                    *       10              11              0.
                    Othr*   1               1               5               6
                    *       10              11              0.              0.
                    *       10              11              0.              1.0
                    Oth1*   1               1               5               6
                    *       10              11              0.              0.
                    *       10              11              0.              1.0
                    *       10              11              0.              1.0
                    *       10              11              0.              1.0
                    Rect*   5               1               5               6
                    *       10              11              0.
                    Rect    10000000	10000000200000007000000060000000
                    Rect    20000000	20000000300000008000000070000000
                    Rect    30000000	30000000400000009000000080000000
                    Tria    40000000	400000005000000090000000
                    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$  $
                    Tria    5        1      7       2       11
                    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$  
                    Point   1               0.0     0.0     0.0
                    Point   2               1.0     0.0     0.0
                    Point   3               2.0     0.0     0.0
                    Point   4               3.0     0.0     0.0
                    Point   5               0.0     1.0     0.0
                    Point   6               1.0     1.0     0.0
                    Point   7               2.0     1.0     0.0
                    Point   8               4.0     1.0     0.0
                    Point*  9                               0.0             2.0
                    *       0.0
                    Point  *3280504         0               1.28286145E+03  1.28286145E+03
                    *       -2.01004501E+02
                    Point  *3280505         0               1.28286145-03  1.28286145+03
                    *       -2.01004501+02
                    Point   10000000	0.      0.      0.
                    Point   20000000	5.      0.      0.
                    Point   30000000	10.     0.      0.
                    Point   40000000	15.     0.      0.
                    Point   50000000	20.     0.      0.
                    Point   60000000	0.      5.      0.
                    Point   70000000	5.      5.      0.
                    Point   80000000	10.     5.      0.
                    Point   90000000	15.     5.      0.
                    $
                    END
                    We can use the input data above for testing.
                    Thanks
                    PSB

                    Comment

                    • bvdet
                      Recognized Expert Specialist
                      • Oct 2006
                      • 2851

                      #40
                      You will need to make a small change to regex pattern pattnum:[code=Python]
                      pattnum = re.compile(r'''
                      -\d+\.\d+E\+\d+| # engineering notation -+
                      \d+\.\d+E\+\d+| # engineering notation ++
                      -\d+\.\d+E-\d+| # engineering notation --
                      \d+\.\d+E-\d+| # engineering notation +-
                      -\d+\.\d+| # negative float format
                      \d+\.\d+| # positive float format
                      -\d+\.| # negative float format
                      \d+\.| # positive float format
                      -\.\d+| # negative float format
                      \.\d+| # positive float format
                      \d{1,8} # positive integer
                      ''', re.X
                      )[/code]This will prevent the matching of more than 8 digits at a time. Further adjustments may be required.
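
A minimal sketch of the effect of that change, assuming Python 3 and that integer fields are never wider than 8 digits (as in the sample data). Only the integer alternative of the pattern is shown, and split_packed_ints is a hypothetical helper name.

```python
import re

# Integer alternative only; capped at 8 digits so packed fields are split.
INT_FIELD = re.compile(r'\d{1,8}')

def split_packed_ints(text):
    """Split a run of digits into integer fields of at most 8 digits each."""
    return [int(tok) for tok in INT_FIELD.findall(text)]

# Four full-width 8-digit fields written with no separators between them.
packed = "10000000" "20000000" "70000000" "60000000"
print(split_packed_ints(packed))

# With an uncapped \d+ the whole run would come back as one huge integer.
print(re.findall(r'\d+', packed))
```

Note that the match is greedy from the left, so this only gives the intended split when every field before the last one is completely filled.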

                      Comment

                      • psbasha
                        Contributor
                        • Feb 2007
                        • 440

                        #41
                        Thanks, BV, for your suggestion.

                        I tried the pattern you suggested, but I am still getting incorrect data.

                        Tria
                        [40000000, 400000005000000090000000L]

                        Rect
                        [10000000, 10000000200000007000000060000000L]
                        [20000000, 20000000300000008000000070000000L]
                        [30000000, 30000000400000009000000080000000L]


                        -PSB

                        Comment

                        • bvdet
                          Recognized Expert Specialist
                          • Oct 2006
                          • 2851

                          #42
                          Look carefully at the suggested pattern. That pattern produces the following output from your corrected sample data:
                          [code=Python]>>> Point
                          [1, 0.0, 0.0, 0.0]
                          [2, 1.0, 0.0, 0.0]
                          [3, 2.0, 0.0, 0.0]
                          [4, 3.0, 0.0, 0.0]
                          [5, 0.0, 1.0, 0.0]
                          [6, 1.0, 1.0, 0.0]
                          [7, 2.0, 1.0, 0.0]
                          [8, 4.0, 1.0, 0.0]
                          [9, 0.0, 2.0, 0.0]
                          [3280504, 0, 1282.8614500000001, 1282.8614500000001, -201.004501]
                          [3280505, 0, 0.0012828614500000001, 1282.8614500000001, -201.004501]
                          [10000000, 0.0, 0.0, 0.0]
                          [20000000, 5.0, 0.0, 0.0]
                          [30000000, 10.0, 0.0, 0.0]
                          [40000000, 15.0, 0.0, 0.0]
                          [50000000, 20.0, 0.0, 0.0]
                          [60000000, 0.0, 5.0, 0.0]
                          [70000000, 5.0, 5.0, 0.0]
                          [80000000, 10.0, 5.0, 0.0]
                          [90000000, 15.0, 5.0, 0.0]
                          PLine
                          [1, 6, 1.5, 9.375, 0.001, 0.001]
                          Tria
                          [40000000, 40000000, 50000000, 90000000]
                          [5, 1, 7, 2, 11]
                          PRect
                          [4, 11, 15, 16, 10, 11, 0.29999999999999999]
                          Line
                          [1, 1, 1, 2, 0.0029520000000000002, 0.99254699999999996, 0.121827]
                          Oth1
                          [1, 1, 1, 5, 6, 10, 11, 0.0, 0.0, 10, 11, 0.0, 1.0, 10, 11, 0.0, 1.0, 10, 11, 0.0, 1.0]
                          Rect
                          [2, 1, 2, 3, 7, 6]
                          [3, 1, 3, 4, 8, 7]
                          [4, 1, 5, 6, 10, 11, 0.0]
                          [5, 1, 5, 6, 10, 11, 0.0]
                          [10000000, 10000000, 20000000, 70000000, 60000000]
                          [20000000, 20000000, 30000000, 80000000, 70000000]
                          [30000000, 30000000, 40000000, 90000000, 80000000]
                          Othr
                          [1, 1, 5, 6, 10, 11, 0.0, 0.0, 10, 11, 0.0, 1.0]
                          >>> [/code]

                          Comment

                          • psbasha
                            Contributor
                            • Feb 2007
                            • 440

                            #43
                            You are right, BV. Sorry, I did not copy the entire pattern you suggested. I copied the last line of the pattern into my code and so missed one line of the pattern.

                            Thanks for your suggestion and help BV.

                            -PSB

                            Comment

                            • psbasha
                              Contributor
                              • Feb 2007
                              • 440

                              #44
                              BV,

                              Please suggest books and links on regular expressions, starting with the basics and moving on to advanced concepts.

                              Thanks
                              PSB

                              Comment

                              • bvdet
                                Recognized Expert Specialist
                                • Oct 2006
                                • 2851

                                #45
                                This link has some good introductory and intermediate information on regular expressions - LINK

                                I have been using Kodos for experimenting with and testing regular expressions, and I mostly learned by practicing with them and incorporating them into my scripts when needed. I do not consider myself an expert on re. Trial and error may be the hard way, but that's the way I learned what I know about Python.

                                Comment
