Regex for replacing missing value period

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • akselo
    New Member
    • Sep 2007
    • 21

    Regex for replacing missing value period

    I am parsing a csv data file with comma as the delimiter. The file mostly contains integers but also some decimal numbers/floats.

    It has some missing values, which are marked by a period ("."). In order to insert into a database, I want to replace the missing value periods, but NOT the decimal periods.

    With python's regex flavor and the re.sub construct applied to each line read as a string, can someone help what the pattern might be for this?

    Best,
    Aksel
    Last edited by akselo; Dec 20 '10, 06:55 PM. Reason: correction
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    There is no need to use regex for this.
    Code:
    >>> line = "12,15,45.1,.,16"
    >>> lineList = line.split(',')
    >>> output = []
    >>> for item in lineList:
    ... 	if item == ".":
    ... 		output.append(0.0)
    ... 	else:
    ... 		output.append(float(item))
    ... 		
    >>> output
    [12.0, 15.0, 45.100000000000001, 0.0, 16.0]
    If you must:
    Code:
    >>> import re
    >>> patt = re.compile(r', ?. ?,')
    >>> re.sub(patt,',0.0,',"12,15,45.1,.,16")
    '12,15,45.1,0.0,16'
    >>> re.sub(patt,',0.0,',"12,15,45.1, . ,16")
    '12,15,45.1,0.0,16'
    >>>

    Comment

    • Thekid
      New Member
      • Feb 2007
      • 145

      #3
      Wouldn't something like this work where you could replace the 'comma period comma' with just a 'comma':

      Code:
      >>> line='2.5,3,.,100,.,4.10,.,8'
      >>> line.replace(',.,',',')
      '2.5,3,100,4.10,8'

      Comment

      • akselo
        New Member
        • Sep 2007
        • 21

        #4
        That's what I initially hoped, but it fails to replace all periods if several periods are adjacent because the last quote is seemingly used by the first match and skipped as the first quote in the next match. It accordingly skips every second period.
        Code:
        line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
        line.replace(',.,',',')
        '2.5,3,100,4.10,8,.,8.9,.'

        Comment

        • akselo
          New Member
          • Sep 2007
          • 21

          #5
          I like this one, but it doesn't like repeated period values, like so:
          Code:
          patt = re.compile(r', ?. ?,')
          line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
          re.sub(patt,',0.0,',line)
          '2.5,0.0,.,100,0.0,4.10,0.0,8,0.0,.,0.0,8.9,.'

          Comment

          • bvdet
            Recognized Expert Specialist
            • Oct 2006
            • 2851

            #6
            I was a bit hasty with my previous post. See if this works:
            Code:
            >>> patt = re.compile(r', ?[.] ?')
            >>> line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
            >>> re.sub(patt,',0.0',line)
            '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
            >>>

            Comment

            • Endkill Emanon
              New Member
              • Dec 2010
              • 9

              #7
              It looks like all your floats are "0.#" format. Why not try to replace just the single value ".,". This way the search is looking the actual value that needs replacing. And although my python isn't perfect you get the idea with...

              Code:
              # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
              # line.replace('.,','0.0,')
              # '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
              This way you don't loose the field and mess the database up completely. If you don't need the "0.0" you can write it in to the code that it doesn't display.

              Comment

              • akselo
                New Member
                • Sep 2007
                • 21

                #8
                The option of searching for the literal is at one level compelling
                Code:
                # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
                # line.replace('.,','0.0,')
                # '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'
                But this doesn't work if the last item is a period and not a number as the replacement looks for a trailing comma. This could of course be handled separately, but I was hoping for a both general and clean solution to this issue.

                Also, what is the "best" way of dealing with the carriage return in the last item of each line when iterating over lines in a file? Replacing \n with blank space?

                Comment

                • Thekid
                  New Member
                  • Feb 2007
                  • 145

                  #9
                  I was going to suggest just replacing the 'period comma' with a blank space which works except if the last value is a period:

                  Code:
                  # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
                  # line.replace('.,','')
                  # '2.5,3,100,4.10,8,8.9,.'

                  Comment

                  • Endkill Emanon
                    New Member
                    • Dec 2010
                    • 9

                    #10
                    You could switch the comma placement from "0.0," to ",0.0".
                    Code:
                    # line.replace(',.',',0.0')
                    And generally the new line at the end of the is to keep the format of the table or database.

                    If I split a database in to fields. Every field needs to have some value thus the ".".

                    AA|AB|AC|AD
                    BA|BB|BC|BD
                    CA|CB|CC|CD
                    DA|DB|DC|DD

                    If I have a 4 field database and I kill the \n it becomes a run-on and then the parsing engine looses everything beyond it's initial 4 fields.

                    AA|AB|AC|AD|BA| BB|BC|BD|CA|CB| CC|CD|DA|DB|DC| DD

                    to the engine looks like AA|AB|AC|AD

                    Now If you just want a longlist of numbers and dont care what the numbers mean you can in most languages just replace the "\n".

                    Code:
                    # # line='2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.\n2.5,3,.,100,.,4.10,.,8,.,.,.,8.9,.'
                    # # line.replace(',.',',0.0')
                    # # line.replace('\n',',')
                    # # '2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0,2.5,3,0.0,100,0.0,4.10,0.0,8,0.0,0.0,0.0,8.9,0.0'

                    Comment

                    • Thekid
                      New Member
                      • Feb 2007
                      • 145

                      #11
                      I think another issue will be if a value is a fraction like:
                      line='.,.45,.25 ,.,100,7.5,.'

                      Comment

                      • akselo
                        New Member
                        • Sep 2007
                        • 21

                        #12
                        Last column values are often 'period'. How about a pattern like
                        Code:
                        #line='ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415\n'
                        #patt = re.compile(r'\D\.\D')
                        #'ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,0.0,.,0.0,.,0.0,.,0.0,.,0.0,.,0.0,.,0.0,593,474,159,49,110,0,315,119,541772500,438522500,103250000,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415\n'
                        It doesn't give me the desired result, but conceptually straightforward : return a match whereever a period exists if surrounded by non-digits. Repeat periods cause grief here.

                        Comment

                        • Endkill Emanon
                          New Member
                          • Dec 2010
                          • 9

                          #13
                          From the code that you were showing at the beginning all your fractions were in were in a "0.#" format. But if that is an issue try...

                          Code:
                          line.replace(',.',',0.')
                          line.replace('0.,',',0.0,')
                          line.replace('\n',',')
                          ( On a side note I think that the regex for the find would be ",\.," or ",[\.],". And to keep out the numbers and letters, ",(*[^0-9a-zA-Z\s][^\.]*[^0-9a-zA-Z\s])," but my regex is not perfect.)

                          Comment

                          • Thekid
                            New Member
                            • Feb 2007
                            • 145

                            #14
                            I don't have python on this particular computer but how about:
                            Code:
                            line='ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415\n' 
                            
                            newline1=line.replace('.,','')
                            newline2=newline1.replace(',.\n','') #incase last value is a period
                            ...or you could make your replacement patterns contain '.,',,.\n','\n'
                            Last edited by Thekid; Dec 20 '10, 11:10 PM. Reason: Forgot closing 'code' tag.

                            Comment

                            • bvdet
                              Recognized Expert Specialist
                              • Oct 2006
                              • 2851

                              #15
                              This is a CSV file, right? All data is separated by a comma or other known delimiter. This is the easiest way:
                              Code:
                              line = '.,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,.,.,.,.,.,.,.,.,.,.,.,.,.,593,474,159,49,110,0,315,119,541772500,438522500,103250000,.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,.\n'
                              lineList = line.split(',')
                              output = []
                              for item in lineList:
                                  if item.strip() == ".":
                                      output.append("***")
                                  else:
                                      output.append(item)
                              print ",".join(output)
                              Output:
                              Code:
                              >>> ***,ACSSF,2009e5,ca,000,0100,0013344,309,11,0,0,0,0,11,0,10,0,0,0,0,10,0,93,0,0,0,38,55,0,54,0,0,0,12,42,0,43,12,0,8,21,2,0,20,11,0,0,9,0,0,78,33,39,6,0,0,0,593,0,0,0,0,0,0,0,0,17,0,0,0,0,0,11,0,0,0,0,17,10,45,308,185,789200,909500,1000001,***,***,***,***,***,***,***,***,***,***,***,***,***,593,474,159,49,110,0,315,119,541772500,438522500,103250000,***,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,***,593,474,0,0,0,0,0,0,0,0,0,17,20,22,62,69,284,119,0,0,0,0,0,32,0,0,19,30,38,2909,4001,628,2035900,1966500,69300,593,474,49,48,91,81,51,49,8,39,58,0,119,77,42,0,0,0,0,0,0,0,0,19.4,23.0,7.7,593,0,0,0,0,0,0,0,41,0,0,0,10,31,0,358,144,81,44,31,58,0,194,163,0,7,8,16,0,593,0,0,0,32,0,19,30,21,17,0,37,22,415,***
                              >>>

                              Comment

                              Working...