any other best way of reading the file

  • psbasha
    Contributor
    • Feb 2007
    • 440

    #46
    Originally posted by bvdet
    This link has some good introductory and intermediate information on regular expressions - LINK

    I have been using Kodos for experimenting and testing regular expressions and mostly learned by practicing with and incorporating into my scripts when needed. I do not consider myself an expert on re. Trial and error may be the hard way, but that's the way I learned what I know about Python.

    Code:
    Re
    pattnum = re.compile(r'''
                          -\d+\.\d+E\+\d+|          # engineering notation -+
                          \d+\.\d+E\+\d+|           # engineering notation ++
                          -\d+\.\d+E-\d+|           # engineering notation --
                          \d+\.\d+E-\d+|            # engineering notation +-
                          -\d+\.\d+|                # negative float format
                          \d+\.\d+|                 # positive float format
                          -\d+\.|                   # negative float format
                          \d+\.|                    # positive float format
                          -\.\d+|                   # negative float format
                          \.\d+|                    # positive float format
                          \d{1,8}                   # positive integer
                          ''', re.X)
    
    
    key_patt = re.compile(r'/([A-Za-z_-]+)/')
    data_patt = re.compile(r'\d+\.\d+|\d+|\w+')
    Hi BV,

    Could you please explain the patterns above with simple examples? What does each line of the pattern stand for, and how do we decide which kind of pattern to use?

    Thanks
    PSB

    Comment

    • bvdet
      Recognized Expert Specialist
      • Oct 2006
      • 2851

      #47
      Each line in pattnum matches a slightly different number format, as noted in the comments. The last line (''', re.X) contains the VERBOSE flag, which tells the compiler to ignore unescaped whitespace and comments. The next-to-last line (\d{1,8}) greedily matches between 1 and 8 digits at a time. That is what we fixed earlier to work with your formatted data.
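A small sketch of the VERBOSE flag and the \d{1,8} quantifier described above, written for Python 3 (the thread itself uses Python 2.4). The name `chunk_patt` and the sample string are made up for illustration:

```python
import re

# re.X (re.VERBOSE) lets the pattern span several lines and carry comments;
# unescaped whitespace inside the pattern is ignored.
chunk_patt = re.compile(r'''
    \d{1,8}     # greedily take up to 8 digits at a time
    ''', re.X)

# Four 8-digit fields packed together with no separator
packed = '10000000200000007000000060000000'
chunks = chunk_patt.findall(packed)
print(chunks)
```

Because the quantifier is capped at 8, a run of 32 digits comes back as four separate 8-digit strings instead of one long number.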

      key_patt matches words like this:
      /ABC_abc-def/
      The brackets '[......]' tell the compiler to match the set of characters enclosed. Since the slash characters are outside the brackets, they must enclose the word in a given string to match. That's how we matched your keywords.

      data_patt matches a floating point number, an integer, or a run of word characters (letters, digits and underscores). The character '|' tells the compiler to match the pattern to the left OR the pattern to the right in a given string.
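To make the two patterns concrete, here is a short Python 3 sketch. The input strings are invented for the example:

```python
import re

key_patt = re.compile(r'/([A-Za-z_-]+)/')
data_patt = re.compile(r'\d+\.\d+|\d+|\w+')

# The slashes must surround the word; group(1) is the word itself
m = key_patt.search('/ABC_abc-def/')
keyword = m.group(1)

# Without the enclosing slashes there is no match at all
no_match = key_patt.search('ABC_abc-def')

# Alternatives are tried left to right at each position
tokens = data_patt.findall('12.5 7 abc')
print(keyword, no_match, tokens)
```

The order of the alternatives matters: if `\d+` came before `\d+\.\d+`, the text `12.5` would be split into `12` and `5`.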

      Comment

      • psbasha
        Contributor
        • Feb 2007
        • 440

        #48
        Originally posted by bvdet
        You will need to make a small change to regex pattern pattnum:[code=Python]
        pattnum = re.compile(r'''
        -\d+\.\d+E\+\d+| # engineering notation -+
        \d+\.\d+E\+\d+| # engineering notation ++
        -\d+\.\d+E-\d+| # engineering notation --
        \d+\.\d+E-\d+| # engineering notation +-
        -\d+\.\d+| # negative float format
        \d+\.\d+| # positive float format
        -\d+\.| # negative float format
        \d+\.| # positive float format
        -\.\d+| # negative float format
        \.\d+| # positive float format
        \d{1,8} # positive integer
        ''', re.X
        )[/code]This will prevent the matching of more than 8 digits at a time. Further adjustments may be required.
        Code:
        SampleData
        Line1*  1               1                1              2
        *       .002952         .992547         .121827
        $
        
        Rect2   2        1      2       3       7       6
        Rect    3        1      3       4       8       7
        PRect2* 4               11              15              16
        *       10              11              0.3
        Rect2*   4               1               5               6
        *       10              11              0.
        Othr*   1               1               5               6
        *       10              11              0.              0.
        *       10              11              0.              1.0
        Oth1*   1               1               5               6
        *       10              11              0.              0.
        *       10              11              0.              1.0
        *       10              11              0.              1.0
        *       10              11              0.              1.0
        Rect*   5               1               5               6
        *       10              11              0.
        Rect    10000000	10000000200000007000000060000000
        Rect    20000000	20000000300000008000000070000000
        Rect    30000000	30000000400000009000000080000000
        Tria3   40000000	400000005000000090000000
        $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$  $
        Tria    6        1      7       2       11
        $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$  
        Point   1               0.0     0.0     0.0
        Point   2               1.0     0.0     0.0
        Point   3               2.0     0.0     0.0
        Point   4               3.0     0.0     0.0
        Point   5               0.0     1.0     0.0
        Point   6               1.0     1.0     0.0
        Point   7               2.0     1.0     0.0
        Point   8               4.0     1.0     0.0
        Point*  9                               0.0             2.0
        *       0.0
        Point  *3280504         0               1.28286145E+03  1.28286145E+03
        *       -2.01004501E+02
        Point  *3280505         0               1.28286145-03  1.28286145+03
        *       -2.01004501+02
        Point   10000000	0.      0.      0.
        Point   20000000	5.      0.      0.
        Point   30000000	10.     0.      0.
        Point   40000000	15.     0.      0.
        Point   50000000	20.     0.      0.
        Point   60000000	0.      5.      0.
        Point   70000000	5.      5.      0.
        Point   80000000	10.     5.      0.
        Point   90000000	15.     5.      0.
        $
        END
        If the keywords are defined as below:
        keywords = ['Point', 'Othr', 'Rect2', 'Rect', 'PRect', 'PLine', 'Line1', 'Tria3', 'Oth1']
        the output we are getting is incorrect.
        Code:
        Output
        $Incorrect output
        Line1
            [1, 1, 1, 1, 2, 0.0029520000000000002, 0.99254699999999996, 0.121827]
        Tria3
            [3,40000000, 40000000, 50000000, 90000000]
        
        Oth1
            [1, 1, 1, 5, 6, 10, 11, 0.0, 0.0, 10, 11, 0.0, 1.0, 10, 11, 0.0, 1.0, 10, 11, 0.0, 1.0]
        
        Rect2
            [2, 2, 1, 2, 3, 7, 6]
            [2, 4, 1, 5, 6, 10, 11, 0.0]
        $Correct output is 
        Line1
            [1, 1, 1, 2, 0.0029520000000000002, 0.99254699999999996, 0.121827]
        Tria3
            [40000000, 40000000, 50000000, 90000000]
        
        Oth1
            [ 1, 1, 5, 6, 10, 11, 0.0, 0.0, 10, 11, 0.0, 1.0, 10, 11, 0.0, 1.0, 10, 11, 0.0, 1.0]
        
        Rect2
            [ 2, 1, 2, 3, 7, 6]
            [ 4, 1, 5, 6, 10, 11, 0.0]
        The code is also picking up the digit from the keyword itself: the '2' in 'Rect2', and similarly for Line1 and Tria3.

        Comment

        • bvdet
          Recognized Expert Specialist
          • Oct 2006
          • 2851

          #49
          Try making adjustments to pattnum and pattkey:[code=Python]
          # last line in pattnum
          # matches integers of length between 1 and 8 digits,
          # if not preceded by an alpha character
          # matching as many repetitions possible
          (?<![a-zA-Z])\d{1,8} # positive integer

          # matches keywords listed in kargs
          # may or may not have a trailing asterisk
          # there must be a word boundary both ends
          pattkey = re.compile('|'.join([r'\b(%s)\*?\b' % item for item in kargs]))[/code]
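A minimal Python 3 sketch of how the negative lookbehind and the keyword pattern behave on one of the sample lines. The keyword list here is a hypothetical subset, and `int_patt` is just the integer alternative on its own, not the full pattnum:

```python
import re

# Integer alternative only: 1 to 8 digits, not directly preceded by a letter
int_patt = re.compile(r'(?<![a-zA-Z])\d{1,8}')

keywords = ['Rect', 'Rect2', 'Tria3']   # hypothetical keyword list
pattkey = re.compile('|'.join([r'\b(%s)\*?\b' % k for k in keywords]))

line = 'Rect2   2        1      2       3       7       6'
key = pattkey.match(line).group(0)   # the keyword found at the start
ids = int_patt.findall(line)         # the digit inside 'Rect2' is skipped
print(key, ids)
```

One limitation worth knowing: the lookbehind only inspects the single character before the match, so a keyword ending in two or more digits (say a made-up `Rect10`) would still leak its second digit, because that digit is preceded by a digit and not by a letter.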

          Comment

          • psbasha
            Contributor
            • Feb 2007
            • 440

            #50
            I tried adding the suggested pattern at the end:

            Code:
            Pat
            pattnum = re.compile(r'''
                                  -\d+\.\d+E\+\d+|          # engineering notation -+
                                  \d+\.\d+E\+\d+|           # engineering notation ++
                                  -\d+\.\d+E-\d+|           # engineering notation --
                                  \d+\.\d+E-\d+|            # engineering notation +-
                                  -\d+\.\d+|                # negative float format
                                  \d+\.\d+|                 # positive float format
                                  -\d+\.|                   # negative float format
                                  \d+\.|                    # positive float format
                                  -\.\d+|                   # negative float format
                                  \.\d+|                    # positive float format
                                  \d{1,8}|
                                  \(?<![a-zA-Z])\d{1,8}  # positive integer
                                  ''', re.X
                                 )
            and the line

            pattkey = re.compile('|'.join([r'\b(%s)\*?\b' % item for item in kargs]))

            I think there is a syntax error in the pattern

            \(?<![a-zA-Z])\d{1,8} # positive integer

            I am getting the following error:
            Code:
            Error
              File "C:\\Sample.py", line 12, in ?
                pattnum = re.compile(r'''
              File "C:\Python24\lib\sre.py", line 180, in compile
                return _compile(pattern, flags)
              File "C:\Python24\lib\sre.py", line 227, in _compile
                raise error, v # invalid expression
            error: unbalanced parenthesis

            Comment

            • bvdet
              Recognized Expert Specialist
              • Oct 2006
              • 2851

              #51
              You copied the pattern incorrectly. The characters '\(' are interpreted as the literal character '(', hence the unbalanced parenthesis. The pattern I suggested should replace the last line in pattnum, not be appended to the end of pattnum.
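The difference is easy to reproduce in isolation. This Python 3 sketch uses a hypothetical helper, `try_compile`, that reports whether a pattern compiles:

```python
import re

def try_compile(pattern):
    """Return None if the pattern compiles, else the error message."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)

# Escaped '(' is a literal character, so the later ')' has no partner
bad = try_compile(r'\(?<![a-zA-Z])\d{1,8}')
# Unescaped '(?<!' opens a negative lookbehind that ')' closes
good = try_compile(r'(?<![a-zA-Z])\d{1,8}')
print(bad)
print(good)
```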

              Comment

              • psbasha
                Contributor
                • Feb 2007
                • 440

                #52
                I tried replacing the last line as shown below, but I am still not getting the expected results.

                Code:
                Pat
                -\.\d+|                   # negative float format
                \.\d+|                    # positive float format                                            
                \\(?<![a-zA-Z])\d{1,8}  # positive integer
                Code:
                Output
                >>> Point
                    [0.0, 0.0, 0.0]
                    [1.0, 0.0, 0.0]
                    [2.0, 0.0, 0.0]
                    [3.0, 0.0, 0.0]
                    [0.0, 1.0, 0.0]
                    [1.0, 1.0, 0.0]
                    [2.0, 1.0, 0.0]
                    [4.0, 1.0, 0.0]
                    [0.0, 2.0, 0.0]
                    [1282.8614500000001, 1282.8614500000001, -201.004501]
                    [0.0012828614500000001, 1282.8614500000001, -201.004501]
                    [0.0, 0.0, 0.0]
                    [5.0, 0.0, 0.0]
                    [10.0, 0.0, 0.0]
                    [15.0, 0.0, 0.0]
                    [20.0, 0.0, 0.0]
                    [0.0, 5.0, 0.0]
                    [5.0, 5.0, 0.0]
                    [10.0, 5.0, 0.0]
                    [15.0, 5.0, 0.0]
                PLine
                Tria
                    []
                Line1
                    [0.0029520000000000002, 0.99254699999999996, 0.121827]
                PRect
                Rect2
                    []
                Oth1
                    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
                Rect
                    []
                    [0.0]
                    []
                    []
                    []
                Othr
                    [0.0, 0.0, 0.0, 1.0]
                I am getting the output without IDs in some cases, and some lists are empty.

                Can you help me with the pattern: how should I replace or add to the last lines?

                Thanks
                PSB

                Comment

                • bvdet
                  Recognized Expert Specialist
                  • Oct 2006
                  • 2851

                  #53
                  [code=Python]pattnum = re.compile(r'''
                  -\d+\.\d+E\+\d+ # engineering notation -+
                  | # or
                  \d+\.\d+E\+\d+ # engineering notation ++
                  | # or
                  -\d+\.\d+E-\d+ # engineering notation --
                  | # or
                  \d+\.\d+E-\d+ # engineering notation +-
                  | # or
                  -\d+\.\d+ # negative float format
                  | # or
                  \d+\.\d+ # positive float format
                  | # or
                  -\d+\. # negative float format
                  | # or
                  \d+\. # positive float format
                  | # or
                  -\.\d+ # negative float format
                  | # or
                  \.\d+ # positive float format
                  | # or
                  (?<![a-zA-Z])\d{1,8} # positive integer
                  ''', re.X
                  )[/code][code=Python]pattkey = re.compile('|'.join([r'\b(%s)\*?\b' % item for item in kargs]))[/code]Output:[code=Python]>>> Point
                  [86010101, 0.0, 0.0, 0.0]
                  [86010102, 1.0, 0.0, 0.0]
                  [86010103, 2.0, 0.0, 0.0]
                  [86010104, 3.0, 0.0, 0.0]
                  [86010106, 0.0, 1.0, 0.0]
                  [96010104, 1.0, 1.0, 0.0]
                  [86010110, 2.0, 1.0, 0.0]
                  [86010201, 4.0, 1.0, 0.0]
                  [86010115, 0.0, 2.0, 0.0]
                  [96010403, 0, 1282.8614500000001, 1282.8614500000001, -201.004501]
                  [3280505, 0, 0.0012828614500000001, 1282.8614500000001, -201.004501]
                  [10000000, 0.0, 0.0, 0.0]
                  [20000000, 5.0, 0.0, 0.0]
                  [30000000, 10.0, 0.0, 0.0]
                  [40000000, 15.0, 0.0, 0.0]
                  [50000000, 20.0, 0.0, 0.0]
                  [60000000, 0.0, 5.0, 0.0]
                  [70000000, 5.0, 5.0, 0.0]
                  [80000000, 10.0, 5.0, 0.0]
                  [90000000, 15.0, 5.0, 0.0]
                  PLine
                  [1, 6, 1.5, 9.375, 0.001, 0.001]
                  Tria
                  [40000000, 40000000, 50000000, 90000000]
                  [5, 1, 7, 2, 11]
                  Tria3
                  [60000000, 80000000, 90000000, 10000000]
                  PRect
                  [4, 11, 15, 16, 10, 11, 0.29999999999999999]
                  Line
                  [10001101, 1, 1, 2, 0.0029520000000000002, 0.99254699999999996, 0.121827]
                  Oth1
                  [1, 1, 5, 6, 10, 11, 0.0, 0.0, 10, 11, 0.0, 1.0, 10, 11, 0.0, 1.0, 10, 11, 0.0, 1.0]
                  Rect1
                  [6, 2, 7, 8, 11, 12, 1.7612000000000001]
                  Rect
                  [10001102, 1, 2, 3, 7, 6]
                  [10001103, 1, 3, 4, 8, 7]
                  [10001104, 1, 5, 6, 10, 11, 0.0]
                  [5, 1, 5, 6, 10, 11, 0.0]
                  [10000000, 10000000, 20000000, 70000000, 60000000]
                  [20000000, 20000000, 30000000, 80000000, 70000000]
                  [30000000, 30000000, 40000000, 90000000, 80000000]
                  Othr
                  [1, 1, 5, 6, 10, 11, 0.0, 0.0, 10, 11, 0.0, 1.0]
                  >>> [/code]I don't know where you got the leading backslashes. I made up some of the data for testing the patterns.
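The leading backslashes matter because, in a raw string, `\\` asks for a literal backslash in the data, so the integer alternative can never match and the IDs disappear. The Python 3 sketch below shows the effect. `FLOATS` is a condensed form of the float alternatives written for this example (an assumption, not the exact pattnum from the thread), and `make_pattnum` is a hypothetical helper:

```python
import re

# Condensed stand-in for the float alternatives of pattnum
FLOATS = r'''
    -?\d+\.\d+E[+-]\d+   # exponent notation such as 1.28286145E+03
  | -?\d+\.\d*           # floats such as 0.3, 10. and -687.
  | -?\.\d+              # floats such as .002952
'''

def make_pattnum(int_alt):
    """Join the float alternatives with the given integer alternative."""
    return re.compile(FLOATS + '|' + int_alt, re.X)

good = make_pattnum(r'(?<![a-zA-Z])\d{1,8}')
bad = make_pattnum(r'\\(?<![a-zA-Z])\d{1,8}')   # stray leading backslashes

line = 'Point   4               3.0     0.0     0.0'
print(good.findall(line))
print(bad.findall(line))
```

With the stray backslashes the ID `4` is dropped and only the three floats remain, which is consistent with the ID-less lists reported earlier in the thread.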

                  Comment

                  • psbasha
                    Contributor
                    • Feb 2007
                    • 440

                    #54
                    Thanks BV for your help.

                    I have to work more on regular expressions. I have gone through the link you shared with me, but I still need to study it in detail.

                    How can we verify whether a pattern we have defined is correct? Do we have to use Kodos to verify it, or are there other tools available?

                    -PSB

                    Comment

                    • bvdet
                      Recognized Expert Specialist
                      • Oct 2006
                      • 2851

                      #55
                      You are welcome. The best way to test the patterns is thorough testing on real data. Remember that the keyword pattern requires the keyword to be a distinct word (there must be a space between the word and the first data field).
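One way to make that testing repeatable is a small table of input lines and expected results. The Python 3 sketch below checks the keyword pattern only. The keyword list and the test lines are assumptions modelled on the sample data:

```python
import re

keywords = ['Point']   # hypothetical keyword list
pattkey = re.compile('|'.join([r'\b(%s)\*?\b' % k for k in keywords]))

# (input line, expected keyword or None)
cases = [
    ('Point   1               0.0     0.0     0.0', 'Point'),
    ('Point*  9                               0.0', 'Point'),
    # no boundary between the keyword and the digit, so no match
    ('Point1  1               0.0     0.0     0.0', None),
]

results = []
for text, expected in cases:
    m = pattkey.match(text)
    found = m.group(0) if m else None
    results.append(found == expected)

print(results)
```

Adding a new line from a real data file to `cases` each time a surprise turns up keeps earlier fixes from silently breaking.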

                      Comment

                      • psbasha
                        Contributor
                        • Feb 2007
                        • 440

                        #56
                        Code:
                        InputData
                        Point   5               0.0     1.0     0.0
                        Point   6               1.0     1.0     0.0
                        Point   7               2.0     1.0     0.0
                        Point   86010206        1471.51 -165.842139.
                        Point   86090104        1403.56 -148.237126.7
                        Point   86090129        1708.72 -722.4  232.274
                        Point   86120127        1664.09 -687.   225.852
                        $
                        END
                        BV,

                        I am not able to fix the pattern for the float format above, where the fields have to be split every 8 characters.

                        The output is incorrect, as shown:
                        [86010206, 1471.51, -165.842139]
                        [86090104, 1403.5599999999999, -148.23712599999999, 0.69999999999999996]

                        It should be:
                        [86010206, 1471.51, -165.842, 139.0]
                        [86090104, 1403.56, -148.237, 126.7]

                        I need to get the X, Y and Z coordinates separately. Your help is required here.

                        Thanks
                        PSB
                        Last edited by psbasha; Jan 9 '08, 10:54 PM. Reason: Edited

                        Comment

                        • bvdet
                          Recognized Expert Specialist
                          • Oct 2006
                          • 2851

                          #57
                          The regex sequence pattnum depends on word boundaries except when matching integers. If your data consistently uses 8 character fields, the parsing would be much simpler.[code=Python]# Read data file with 8 character fields

                          import re

                          def convert_data(s):
                              # try int first, then float; otherwise keep the string
                              for func in (int, float):
                                  try:
                                      return func(s)
                                  except ValueError:
                                      pass
                              return s

                          def split_fields(s, fw=8):
                              # cut the line into consecutive fields of width fw
                              outList = []
                              while s:
                                  outList.append(s[:fw])
                                  s = s[fw:]
                              return outList

                          def parse_data(fn, *kargs):
                              # skip comment lines, which start with '$'
                              fileList = [item.strip() for item in open(fn).readlines()
                                          if not item.startswith('$')]

                              pattkey = re.compile('|'.join([r'\b(%s)\*?\b' %
                                                             item for item in kargs]))

                              # create dictionary with keys from kargs
                              mDict = dict(zip(kargs, [[] for _ in kargs]))
                              for line in fileList:
                                  m = pattkey.match(line)
                                  if m:
                                      lineList = [s.strip() for s in split_fields(line)]
                                      # drop empty fields, then drop the keyword itself
                                      mDict[m.group(0)].append([convert_data(item) for item
                                                                in lineList if item][1:])

                              return mDict

                          if __name__ == '__main__':
                              fn = r'H:\TEMP\temsys\sample_points100.txt'
                              dd = parse_data(fn, *['Point',])
                              for key in dd:
                                  print(key)
                                  for item in dd[key]:
                                      print(' %s' % item)

                          [/code]Output:[code=Python]>>> Point
                          [5, 0.0, 1.0, 0.0]
                          [6, 1.0, 1.0, 0.0]
                          [7, 2.0, 1.0, 0.0]
                          [86010206, 1471.51, -165.84200000000001, 139.0]
                          [86090104, 1403.5599999999999, -148.23699999999999, 126.7]
                          [86090129, 1708.72, -722.39999999999998, 232.274]
                          [86120127, 1664.0899999999999, -687.0, 225.852]
                          >>> [/code]
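The core of the fixed-width idea can be tried without a data file. This is a Python 3 sketch that works on a single line rebuilt from the sample data. The names `split_fields` and `convert` mirror the code above but are rewritten for the example:

```python
def split_fields(line, width=8):
    """Cut a line into consecutive fixed-width fields, stripped of padding."""
    return [line[i:i + width].strip() for i in range(0, len(line), width)]

def convert(field):
    """Return an int if possible, else a float, else the original text."""
    for func in (int, float):
        try:
            return func(field)
        except ValueError:
            pass
    return field

# Rebuild the sample line: 8-character fields with no separator
# between the Y and Z values
parts = ['Point', '86010206', '', '1471.51', '-165.842', '139.']
line = ''.join(p.ljust(8) for p in parts)

# Drop empty fields, then drop the keyword
values = [convert(f) for f in split_fields(line) if f][1:]
print(values)
```

Slicing by position separates `-165.842` from `139.` even though nothing stands between them, which a pattern that relies on whitespace between numbers cannot do.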

                          Comment

                          • psbasha
                            Contributor
                            • Feb 2007
                            • 440

                            #58
                            Sorry BV,

                            You have misunderstood. My file data has 8-character fields, 16-character fields, or a combination of both field formats. I have given sample data that the pattern definition is not able to read correctly.
                            To support that, we need to define the pattern for the X, Y and Z coordinates properly for both 8- and 16-character fields.

                            Could you please help me in fixing this problem.

                            Thanks
                            PSB
                            Last edited by psbasha; Jan 10 '08, 12:02 AM. Reason: Edited

                            Comment

                            • psbasha
                              Contributor
                              • Feb 2007
                              • 440

                              #59


                              Please find the Sample input data.

                              Code:
                              InputFileData
                              SampleData
                              Line1*  1               1                1              2
                              *       .002952         .992547         .121827
                              $
                               
                              Rect2   2        1      2       3       7       6
                              Rect    3        1      3       4       8       7
                              PRect2* 4               11              15              16
                              *       10              11              0.3
                              Rect2*   4               1               5               6
                              *       10              11              0.
                              Othr*   1               1               5               6
                              *       10              11              0.              0.
                              *       10              11              0.              1.0
                              Oth1*   1               1               5               6
                              *       10              11              0.              0.
                              *       10              11              0.              1.0
                              *       10              11              0.              1.0
                              *       10              11              0.              1.0
                              Rect*   5               1               5               6
                              *       10              11              0.
                              Rect    10000000    10000000200000007000000060000000
                              Rect    20000000    20000000300000008000000070000000
                              Rect    30000000    30000000400000009000000080000000
                              Tria3   40000000    400000005000000090000000
                              $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$    $
                              Tria    6        1      7       2       11
                              $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$    
                              Point   1               0.0     0.0     0.0
                              Point   2               1.0     0.0     0.0
                              Point   3               2.0     0.0     0.0
                              Point   4               3.0     0.0     0.0
                              Point   5               0.0     1.0     0.0
                              Point   6               1.0     1.0     0.0
                              Point   7               2.0     1.0     0.0
                              Point   8               4.0     1.0     0.0
                              Point*  9                               0.0             2.0
                              *       0.0
                              Point  *3280504         0               1.28286145E+03  1.28286145E+03
                              *       -2.01004501E+02
                              Point  *3280505         0               1.28286145-03  1.28286145+03
                              *       -2.01004501+02
                              Point   10000000        0.      0.      0.
                              Point   20000000        5.      0.      0.
                              Point   30000000        10.     0.      0.
                              Point   40000000        15.     0.      0.
                              Point   50000000        20.     0.      0.
                              Point   60000000        0.      5.      0.
                              Point   70000000        5.      5.      0.
                              Point   80000000        10.     5.      0.
                              Point   90000000        15.     5.      0.
                              Point   86010206        1471.51 -165.842139.
                              Point   86090104        1403.56 -148.237126.7
                              Point   86090129        1708.72 -722.4  232.274
                              Point   86120127        1664.09 -687.   225.852
                              $
                              END


                              • bvdet
                                Recognized Expert Specialist
                                • Oct 2006
                                • 2851

                                #60
                                Originally posted by psbasha
                                Sorry BV,

                                You have misunderstood. My file data uses 8-digit fields, 16-digit fields, or a combination of both field formats. I have given sample data that the pattern definition cannot read correctly.
                                To support that, we need to define the pattern for the X, Y and Z coordinates properly for both the 8-digit and 16-digit formats.

                                Could you please help me fix this problem?

                                Thanks
                                PSB
                                The only thing I can think of is to correct the data on the fly. Before extracting all the numbers, split the string and test each list element to see if it has two or more decimal points. If True, add a word boundary (space character or other delimiter), then extract the data.[code=Python]
                                .......snip........
                                '''
                                Before extracting all the numbers with pattnum, split the string and test
                                each list element to see if it has two or more decimal points. If True,
                                add a word boundary (space character or other delimiter), then extract
                                the data.
                                '''
                                def chk_str(s, fw=8, delim=' '):
                                    sList = s.split()
                                    for i, item in enumerate(sList):
                                        if item.count('.') > 1:
                                            sList[i] = fix_fields(item, fw, delim)
                                    return delim.join(sList)

                                def fix_fields(item, fw, delim):
                                    # insert delim after every fw characters of the fused item
                                    offset = 0
                                    for i in range(item.count('.')):
                                        item = '%s%s%s' % (item[:fw+offset], delim, item[fw+offset:])
                                        offset += fw + len(delim)
                                    return item

                                def parseData(fn, *kargs):
                                    fileList = [item.strip() for item in open(fn).readlines()
                                                if not item.startswith('$')]

                                    pattkey = re.compile('|'.join([r'\b(%s)\*?\b' % item for item in kargs]))

                                    # create dictionary with keys from kargs
                                    masterDict = dict(zip(kargs, [[] for _ in kargs]))
                                    inData = False
                                    for line in fileList:

                                        # test data string for missing word boundary between floating point numbers
                                        # field width = 8
                                        line = chk_str(line, 8)

                                        # check for invalid data
                                        if pattinvalid.search(line):
                                            for item in pattinvalid.findall(line):
                                                line = line.replace(item, item.replace('-', 'E-').replace('+', 'E+'))
                                .......snip........[/code]
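For anyone who wants to try the two repairs in isolation, here is a minimal standalone sketch run against lines from the sample data. The names `split_fused_fields`, `restore_exponent` and `BARE_EXPONENT` are made up for this illustration and are not the snipped code from the script above. It assumes the values are left-justified in 8-character fields, so a fused token can be cut every 8 characters, and that a sign directly following a digit or decimal point always means a missing 'E'.

```python
import re

# Hypothetical pattern (an assumption, not the thread's pattinvalid):
# a sign that directly follows a digit or a decimal point, as in
# 1.28286145-03, where the 'E' of the exponent was dropped.
BARE_EXPONENT = re.compile(r'(?<=[\d.])([+-]\d+)')


def split_fused_fields(line, fw=8, delim=' '):
    """Cut any token holding 2+ decimal points into fw-wide pieces."""
    tokens = line.split()
    for i, tok in enumerate(tokens):
        if tok.count('.') > 1:
            pieces = [tok[j:j + fw] for j in range(0, len(tok), fw)]
            tokens[i] = delim.join(pieces)
    return delim.join(tokens)


def restore_exponent(line):
    """Rewrite 1.28286145-03 as 1.28286145E-03 so float() accepts it."""
    return BARE_EXPONENT.sub(r'E\1', line)


if __name__ == '__main__':
    sample = [
        'Point   86010206        1471.51 -165.842139.',
        'Point   86090104        1403.56 -148.237126.7',
        '*       -2.01004501+02',
    ]
    for raw in sample:
        # split first, so two fused numbers are never mistaken for an exponent
        print(restore_exponent(split_fused_fields(raw)))
```

Splitting before restoring the exponent matters: once every number is its own token, a sign in the middle of a token can only belong to an exponent.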
