[Re] Checking each item in m.group()?

  • nospam@nospam.com

    [Re] Checking each item in m.group()?

    Hello

    I need to go through each line of a CSV file, and extract some fields
    using a regex. Then, I need to check each retrieved field, and if it
    looks like "", turn this into NULL so that it's correct SQL.

    I tried a few things, but still can't get it working:
    ========
    #Second field might be empty - ""
    #"Col1",""
    #"Col1","Col 2"
    p = re.compile('^"(.+?)","(.*?)"$')

    for line in textlines:
        m = p.search(line)
        if m:
            #Check each column: if '', then turn into NULL

            """
            for col in line:
                if col == "":
                    col = "NULL"
            """

            """
            for col in m.group():
                if col == "":
                    col = "NULL"
            """

            """
            for col in m.group(0):
                if col == "":
                    col = "NULL"
            """

            """
            for i in range(0, len(line)):
                if line[i] == "":
                    line[i] = "NULL"
            """

            """
            for i in range(1, len(m.group(0))):
                if m.group(i) == "":
                    m.group(i) = "NULL"
            """

            sql = "INSERT INTO mytable (col1, col2) VALUES ('%s','%s')" % (m.group(1), m.group(2))
            print sql
    f.close()
    ========

    Does someone know the correct syntax?

    Thank you.
  • Matimus

    #2
    Re: Checking each item in m.group()?

    On Jun 2, 11:42 am, "nos...@nospam.com" <Gilles@...> wrote:
    >I need to go through each line of a CSV file, and extract some fields
    >using a regex. Then, I need to check each retrieved field, and if it
    >looks like "", turn this into NULL so that it's correct SQL.
    >
    >Does someone know the correct syntax?
    I think you want to use 'groups' instead of 'group'.

    Here is a brief example. Note that this code is very insecure and
    susceptible to a SQL injection attack. Hopefully these csv files are
    from a trusted source.

    sql = "INSERT INTO mytable (col1, col2) VALUES ('%s','%s')"%tu ple(
    (c, "NULL")[c == ''] for c in m.groups()
    )

    Also, check out the csv module for parsing your csv file.
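
    Something roughly like this (untested; the data file name is made up,
    and it assumes the table already exists) shows both ideas together --
    csv does the parsing, and binding None gives you a real NULL:

    import csv
    import sqlite3

    db = sqlite3.connect("test.sqlite")
    cursor = db.cursor()

    f = open("data.csv", "rb")
    # csv.reader handles the quoting; empty fields come back as ''
    rows = ((row[0], row[1] or None) for row in csv.reader(f) if len(row) >= 2)

    # the adapter turns None into a SQL NULL and does the quoting itself
    cursor.executemany("INSERT INTO mytable (col1, col2) VALUES (?, ?)", rows)
    db.commit()
    f.close()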

    Matt


    • miller.paul.w@gmail.com

      #3
      Re: Checking each item in m.group()?

      On Jun 2, 5:06 pm, Peter Otten <__pete...@web.de> wrote:
      >You are taking the wrong approach here.
      >
      >Don't build SQL statements as strings; you are enabling the next SQL
      >injection attack. Pass parameters using the DB API instead.
      >
      >Don't use regular expressions to parse a CSV file. Python's csv module is
      >more likely to deal correctly with the quirks of that standard.
      >
      I'd like to second both these statements. Regardless of whether these
      CSV files are from a trusted source or not, it's a virtual truism of
      programming that eventually, any application will be used in ways it
      was not intended. Since using a parameterized query is a simple way
      to avoid a common security hole, even if such a thing could never be
      exploited by the app in its current configuration, you should do
      things the Right Way. That way, even if your code is twisted to some
      other use in the future, it's less likely to cause problems.
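
      For what it's worth, a minimal before/after illustration with the
      stdlib sqlite3 module (the table and column names are just examples):

      import sqlite3

      db = sqlite3.connect(":memory:")
      db.execute("CREATE TABLE mytable (col1 TEXT, col2 TEXT)")

      col1, col2 = "Col1", None            # None is stored as a SQL NULL

      # Risky: values pasted into the statement text; quoting is your problem.
      #db.execute("INSERT INTO mytable VALUES ('%s','%s')" % (col1, col2))

      # Parameterized: the DB API binds the values, no quoting or escaping needed.
      db.execute("INSERT INTO mytable (col1, col2) VALUES (?, ?)", (col1, col2))
      db.commit()
      print db.execute("SELECT * FROM mytable").fetchall()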


      • Gilles Ganault

        #4
        Re: Checking each item in m.group()?

        On Mon, 2 Jun 2008 12:06:21 -0700 (PDT), Matimus <mccredie@gmail.com>
        wrote:
        >Here is a brief example. Note that this code is very insecure and
        >susceptible to a SQL injection attack. Hopefully these csv files are
        >from a trusted source.
        Yes they are, and this script will only run on my PC, so it doesn't
        need to be more secure than this.
        >sql = "INSERT INTO mytable (col1, col2) VALUES ('%s','%s')"%tu ple(
        (c, "NULL")[c == ''] for c in m.groups()
        )
        >
        I don't understand this syntax :-/
        >Also, check out the csv module for parsing your csv file.
        Will do. Thank you.


        • rurpy@yahoo.com

          #5
          Re: Checking each item in m.group()?

          miller.paul.w@gmail.com wrote:
          >On Jun 2, 5:06 pm, Peter Otten <__pete...@web.de> wrote:
          >>Don't build SQL statements as strings; you are enabling the next SQL
          >>injection attack. Pass parameters using the DB API instead.
          >
          >I'd like to second both these statements. Regardless of whether these
          >CSV files are from a trusted source or not, it's a virtual truism of
          >programming that eventually, any application will be used in ways it
          >was not intended. Since using a parameterized query is a simple way
          >to avoid a common security hole, even if such a thing could never be
          >exploited by the app in its current configuration, you should do
          >things the Right Way. That way, even if your code is twisted to some
          >other use in the future, it's less likely to cause problems.
          I don't have a problem with a response saying "it's a good idea
          to use parameterized queries" and explaining why, but I have
          seen way too many responses like this which are basically FUD.
          I'm not sure what a "virtual" truism is, but if it is
          like a truism, it's not true. There are many cases
          where one can accurately predict that the code will
          not be used in the future in some different app.
          I don't know what the OP was doing, but I have done
          many data conversion jobs where I have done things
          similar to the OP. The jobs were one-time, move data
          from system A to system B (with some munging in between)
          and I could and did predict the conversion code would
          not get reused. My accuracy rate is 100%.
          And if you do reuse code where you feed it untrusted
          input in a security sensitive context,
          and you don't bother to verify the security of said
          code, you already have so many problems, one more
          probably won't make much difference.

          To the OP:
          The advice to use parameterized queries is good
          but overstated. There are cases when it is quite
          safe to use non-parameterized statements:
          * When you control the data going into the query
            (e.g., you've generated it yourself).
          * When the data come from trusted sources (including
            something like sys.stdin if the only people with
            access to the program are trusted).
          * When you can reliably check the data yourself,
            for example in:

                sql = "SELECT * FROM foo WHERE id=%d" % int(some_string)
                cursor.execute(sql)

            it doesn't really matter what "some_string" contains
            (if you are prepared for a Python exception). But note
            that checking and escaping strings in more general or
            complicated cases can be quite tricky.

          In most cases a good reason to use a parameterized query
          is that it is no harder than not using one, so why not
          do it and get the additional safety for free? A parameterized
          query can often run faster than a non-parameterized one
          since the database can reuse the cached compiled query.
          (But sometimes the opposite is true, see below.)
          A parameterized form of the above is:

              sql = "SELECT * FROM foo WHERE id=?"
              cursor.execute(sql, (int(some_string),))

          so it is just as easy. There are times though when
          it is slightly harder. If idnums is an arbitrary list
          of ints:

              sql = "SELECT * FROM foo WHERE id IN (%s)" % ','.join(map(str, idnums))
              cursor.execute(sql)

          Using a parameterized query might look like:

              sql = "SELECT * FROM foo WHERE id IN (%s)" % ','.join(['?'] * len(idnums))
              cursor.execute(sql, idnums)
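
          As a complete toy example of that placeholder idiom (sqlite3,
          throwaway in-memory table; all names are made up):

              import sqlite3

              db = sqlite3.connect(":memory:")
              db.execute("CREATE TABLE foo (id INTEGER, txt TEXT)")
              db.executemany("INSERT INTO foo VALUES (?, ?)",
                             [(1, "a"), (2, "b"), (3, "c")])

              idnums = [1, 3]
              # one '?' per id, then bind the whole list
              sql = "SELECT * FROM foo WHERE id IN (%s)" % ','.join(['?'] * len(idnums))
              print db.execute(sql, idnums).fetchall()    # rows with id 1 and 3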

          When you have written such code a few times it becomes
          a natural idiom, but if you only do so occasionally,
          you are in a hurry, and the conditions above apply,
          then there is no reason not to go with the first form.

          And if you already have a text string of comma-separated
          digits, the ease of using the direct SQL form becomes
          even greater:

              sql = "SELECT * FROM foo WHERE id IN (%s)" % idnums_string
              cursor.execute(sql)

          But of course, if "idnums_string" came from an untrusted
          source, then you need to validate it first, e.g.:

              if idnums_string.strip("0123456789 ,"): raise ValueError

          There are also times when using a parameterized query
          can dramatically (and I mean two or three *orders of
          magnitude*) slow down your query. For example:

              sql = "SELECT * FROM foo WHERE txt LIKE '%s%%'" % some_string
              cursor.execute(sql)

          can be expected to run quite quickly in most database
          systems, since the database knows that the searched for
          text starts with a constant string and can thus use an
          index. The parameterized form:

          sql = "SELECT * FROM foo WHERE txt LIKE ?"
          cursor.execute (sql, [some_string + "%"])

          will often run very very slowly, because when the query
          is prepared the database has no idea whether the argument
          will start with a constant string or not, and thus can
          only assume the worst and prepare the query so that
          it doesn't use an index.

          The bottom line is, as in all things, understanding
          the issues will lead to much better decisions than
          blindly following some dumbed down advice like,
          "always use parameterized queries".

          (And to the OP: if you already know all this, my
          apologies if I sound like I'm talking down to you,
          but perhaps other people may benefit. I get tired
          of reading simplistic "do X, period." responses
          sometimes.)


          • Scott David Daniels

            #6
            Re: Checking each item in m.group()?

            Gilles Ganault wrote:
            >On Mon, 2 Jun 2008 12:06:21 -0700 (PDT), Matimus <mccredie@gmail.com>
            >wrote:
            >>Here is a brief example. Note that this code is very insecure ....
            >>
            >>sql = "INSERT INTO mytable (col1, col2) VALUES ('%s','%s')" % tuple(
            >>    (c, "NULL")[c == ''] for c in m.groups()
            >>)
            >
            >I don't understand this syntax :-/
            (c, "NULL") is a tuple; it is being indexed by the boolean "c == ''"
            Since False is 0, and True is 1, the expression picks out "NULL"
            exactly when c is the zero-length string.

            A more idiomatic Python way of writing this (for the very special case
            of '' or 0 or 0.0, or ...) is
            sql = "INSERT INTO mytable (col1, col2) VALUES ('%s','%s')" % tuple(
            (c or "NULL") for c in m.groups())

            You can avoid problems w/ possible 0s or 0.0s or ... by using:

                (str(c) or "NULL") for c in m.groups()

            The "or" (or a similar "and" trick) will continue to work. The
            different "empty" or "nothing" values of a data type are treated as
            false in part to allow such shenanigans. If used sparingly, it makes
            your code clearer, but over-use can make the reader scratch his head
            in wonder.
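
            A quick illustration at the interactive prompt of what both
            spellings do with an empty string, and how str() rescues a 0:

                >>> row = ('Col1', '')
                >>> tuple((c, "NULL")[c == ''] for c in row)
                ('Col1', 'NULL')
                >>> tuple((c or "NULL") for c in row)
                ('Col1', 'NULL')
                >>> row = ('Col1', 0)
                >>> tuple((c or "NULL") for c in row)       # 0 is false, so it becomes NULL too
                ('Col1', 'NULL')
                >>> tuple((str(c) or "NULL") for c in row)  # str(0) is '0', which is true
                ('Col1', '0')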

            --Scott David Daniels
            Scott.Daniels@Acm.Org


            • Gilles Ganault

              #7
              Re: Checking each item in m.group()?

              On Mon, 02 Jun 2008 17:49:11 -0700, Scott David Daniels
              <Scott.Daniels@Acm.Org> wrote:
              >(c, "NULL") is a tuple; it is being indexed by the boolean "c == ''"
              >Since False is 0, and True is 1, the expression picks out "NULL"
              >exactly when c is the zero-length string.
              Thanks Scott, and also to Peter above, and sorry for not having
              answered earlier.

              I'm having two problems: The first code doesn't strip the
              double quotes when the string is empty. IOW, I want "not empty"
              or NULL, but not "NULL".

              The second code hits "IndexError: list index out of range"
              because of yield, which I don't seem to be using correctly:

              ==========
              p = re.compile("^(\d+)\t(.*?)")
              for line in textlines:
                  m = p.search(line)
                  if m:
                      sql = sql + 'INSERT INTO mytable (col1,col2) VALUES ("%s","%s"); ' % tuple((c, "NULL")[c == ''] for c in m.groups())

              #cursor.execute(sql)
              connection.close(True)
              print sql
              ==========
              import csv
              import sqlite3 as sqlite

              def records(infile):
                  for row in csv.reader(infile):
                      #IndexError: list index out of range
                      #BAD yield row[0], row[1] or None
                      #BAD yield row[0] or None, row[1] or None

              def main():
                  db = sqlite.connect("test.sqlite")
                  cursor = db.cursor()

                  #How to handle empty columns, ie. <TAB><TAB>?
                  cursor.executemany("insert into mytable (col1,col2) values (?,?);",
                                     records("test.sqlite"))

              if __name__ == "__main__":
                  main()
              ==========

              Thank you.
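
              One likely cause of the IndexError here: records() is given the
              file *name*, so csv.reader iterates over the characters of that
              string and yields one-element rows, and row[1] then blows up.
              The data also looks tab-separated, so csv.reader needs
              delimiter='\t'. A rough, untested sketch along those lines (the
              input file name "data.txt" is made up):

              import csv
              import sqlite3 as sqlite

              def records(infile):
                  # infile must be an open file object, not a file name
                  for row in csv.reader(infile, delimiter='\t'):
                      if len(row) < 2:        # skip blank/short lines instead of raising
                          continue
                      yield row[0] or None, row[1] or None   # '' -> None -> SQL NULL

              def main():
                  db = sqlite.connect("test.sqlite")
                  cursor = db.cursor()
                  f = open("data.txt", "rb")
                  cursor.executemany("insert into mytable (col1,col2) values (?,?);",
                                     records(f))
                  f.close()
                  db.commit()

              if __name__ == "__main__":
                  main()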


              • Gilles Ganault

                #8
                Re: Checking each item in m.group()?

                On Sun, 15 Jun 2008 18:15:38 -0700, Dennis Lee Bieber
                <wlfraed@ix.netcom.com> wrote:
                > I don't know quite what the reason for the sql = sql + ... is -- if
                >you are trying to package more than one discrete statement into a single
                >query you should be advised that not all adapters/DBMS support that
                >function (I think one uses "executescript()" to signify multiple
                >distinct statements.
                The script will go through about 3,000 lines of text, so I wanted to
                create a transaction with BEGIN/COMMIT. It seems like APSW (the
                simpler version of PySQLite) doesn't do transactions silently.
                >A decent adapter should convert Python's None object into a proper DBMS
                >Null. The adapter is also responsible for doing any needed quoting or
                >escaping of the data supplied, hence no quoting of the placeholders!
                Thanks for the tip. However, after looking at the code you gave, I'm
                getting an error when running cur.execute(), so it only runs once
                and the program exits:

                =========
                import sys, re, apsw, os

                connection = apsw.Connection("test.sqlite")
                cursor = connection.cursor()

                textlines = []
                textlines.append("123\titem1\titem2\titem3\titem4\t345\titem6")
                textlines.append("123\titem1\t\titem3\titem4\t345\titem6")

                p = re.compile("^(\d+)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(\d+)\t(.+?)$")
                for line in textlines:
                    m = p.search(line)
                    if m:
                        sql = 'INSERT INTO test (col1,col2,col3,col4,col5,col6,col7) VALUES (?,?,?,?,?,?,?);'

                        """
                        cursor.execute(sql, tuple((c, None)[c == ""] for c in m.groups()))
                          File "apsw.c", line 3518, in resetcursor
                        apsw.ConstraintError: ConstraintError: not an error
                        apsw.ConnectionNotClosedError: apsw.Connection on "test.sqlite". The
                        destructor has been called, but you haven't closed the connection. All
                        connections must be explicitly closed. The SQLite database object is
                        being leaked.
                        """
                        cursor.execute(sql, tuple((c, None)[c == ""] for c in m.groups()))

                connection.commit()
                connection.close(True)

                =========

                I read the online sample for APSW, but didn't find what it could be.
                Any idea?

                Thank you.
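
                In case it helps, a minimal sketch of batching the inserts in one
                explicit transaction with APSW (untested; it assumes the "test"
                table already exists and that its columns allow NULL -- a NOT NULL
                constraint on one of them would be one way to get a ConstraintError
                when an empty field is bound as None):

                import re
                import apsw

                connection = apsw.Connection("test.sqlite")
                cursor = connection.cursor()

                textlines = [
                    "123\titem1\titem2\titem3\titem4\t345\titem6",
                    "123\titem1\t\titem3\titem4\t345\titem6",
                ]

                p = re.compile("^(\d+)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(\d+)\t(.+?)$")
                sql = "INSERT INTO test (col1,col2,col3,col4,col5,col6,col7) VALUES (?,?,?,?,?,?,?);"

                # APSW leaves SQLite in autocommit unless you start a transaction
                # yourself, so wrap the ~3,000 inserts in an explicit BEGIN/COMMIT
                cursor.execute("BEGIN")
                for line in textlines:
                    m = p.search(line)
                    if m:
                        cursor.execute(sql, tuple(c or None for c in m.groups()))
                cursor.execute("COMMIT")

                connection.close(True)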
