Trying to fix Invalid CSV File

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Ryan Rosario

    Trying to fix Invalid CSV File

    I have a very large CSV file that contains double quoted fields (since
    they contain commas). Unfortunately, some of these fields also contain
    other double quotes and I made the painful mistake of forgetting to
    escape or double the quotes inside the field:

    123,"Here is some, text "and some quoted text" where the quotes should
    have been doubled",321

    Has anyone dealt with this problem before? Any ideas of an algorithm I
    can use for a Python script to create a new, repaired CSV file?

    TIA,
    Ryan
  • Emile van Sebille

    #2
    Re: Trying to fix Invalid CSV File

    Ryan Rosario wrote:
    I have a very large CSV file that contains double quoted fields (since
    they contain commas). Unfortunately, some of these fields also contain
    other double quotes and I made the painful mistake of forgetting to
    escape or double the quotes inside the field:
    >
    123,"Here is some, text "and some quoted text" where the quotes should
    have been doubled",321
    >

    rec = '''123,"Here is some, text "and some quoted text" where the quotes
    should have been doubled",321'''

    import csv

    csv.reader([rec.replace('," ',',"""')
    .replace('",',' """,')
    .replace('"""', "'''")
    .replace('"','" "')
    .replace("'''", '"')]).next()

    ['123', 'Here is some, text "and some quoted text" where the quotes
    should have been doubled', '321']

    :))

    Emile

    Has anyone dealt with this problem before? Any ideas of an algorithm I
    can use for a Python script to create a new, repaired CSV file?
    >
    TIA,
    Ryan
    --

    >

    Comment

    • Ryan Rosario

      #3
      Re: Trying to fix Invalid CSV File

      On Aug 3, 10:38 pm, Emile van Sebille <em...@fenx.com wrote:
      Ryan Rosario wrote:
      I have a very large CSV file that contains double quoted fields (since
      they contain commas). Unfortunately, some of these fields also contain
      other double quotes and I made the painful mistake of forgetting to
      escape or double the quotes inside the field:
      >
      123,"Here is some, text "and some quoted text" where the quotes should
      have been doubled",321
      >
      rec = '''123,"Here is some, text "and some quoted text" where the quotes
      should have been doubled",321'''
      >
      import csv
      >
      csv.reader([rec.replace('," ',',"""')
                      .replace('",',' """,')
                      .replace('"""', "'''")
                      .replace('"','" "')
                      .replace("'''", '"')]).next()
      >
      ['123', 'Here is some, text "and some quoted text" where the quotes
      should have been doubled', '321']
      >
      :))
      >
      Emile
      >
      Has anyone dealt with this problem before? Any ideas of an algorithm I
      can use for a Python script to create a new, repaired CSV file?
      >>
      >
      Thanks Emile! Works almost perfectly, but is there some way I can
      adapt this to quote fields that contain a comma in them?

      TIA,
      Ryan

      Comment

      • John Machin

        #4
        Re: Trying to fix Invalid CSV File

        On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
        >
        Thanks Emile! Works almost perfectly, but is there some way I can
        adapt this to quote fields that contain a comma in them?
        >
        You originally said "I have a very large CSV file that contains double
        quoted fields (since they contain commas)". Are you now saying that
        if a field contained a comma, you didn't wrap the field in quotes? Or
        is this a separate question unrelated to your original problem?

        Comment

        • Ryan Rosario

          #5
          Re: Trying to fix Invalid CSV File

          On Aug 4, 1:01 am, John Machin <sjmac...@lexic on.netwrote:
          On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
          >
          >
          >
          Thanks Emile! Works almost perfectly, but is there some way I can
          adapt this to quote fields that contain a comma in them?
          >
          You originally said "I have a very large CSV file that contains double
          quoted fields (since they contain commas)". Are you now saying  that
          if a field contained a comma, you didn't wrap the field in quotes? Or
          is this a separate question unrelated to your original problem?
          I enclosed all text fields within quotes. The problem is that I have
          quotes embedded inside those text fields as well and I did not double/
          escape them. Emile's snippet takes care of the escaping but it strips
          the outer quotes from the text fields and if there are commas inside
          the text field, the field is split into multiple fields. Of course, it
          is possible that I am not using the snippet correctly I suppose.

          Comment

          • John Machin

            #6
            Re: Trying to fix Invalid CSV File

            On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
            On Aug 4, 1:01 am, John Machin <sjmac...@lexic on.netwrote:
            >
            On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
            >
            Thanks Emile! Works almost perfectly, but is there some way I can
            adapt this to quote fields that contain a comma in them?
            >
            You originally said "I have a very large CSV file that contains double
            quoted fields (since they contain commas)". Are you now saying that
            if a field contained a comma, you didn't wrap the field in quotes? Or
            is this a separate question unrelated to your original problem?
            >
            I enclosed all text fields within quotes. The problem is that I have
            quotes embedded inside those text fields as well and I did not double/
            escape them. Emile's snippet takes care of the escaping but it strips
            the outer quotes from the text fields and if there are commas inside
            the text field, the field is split into multiple fields. Of course, it
            is possible that I am not using the snippet correctly I suppose.
            Without you actually showing how you are using it, I can only surmise:

            Emile's snippet is pushing it through the csv reading process, to
            demonstrate that his series of replaces works (on your *sole* example,
            at least). Note carefully his output for one line is a *list* of
            fields. The repr() of that list looks superficially like a line of csv
            input. It looks like you are csv-reading it a second time, using
            quotechar="'", after stripping off the enclosing []. If this guess is
            not correct, please show what you are actually doing.

            If (as you said) you require a fixed csv file, you need to read the
            bad file line by line, use Emile's chain of replaces, and write each
            fixed line out to the new file.

            Comment

            • Emile van Sebille

              #7
              Re: Trying to fix Invalid CSV File

              John Machin wrote:
              On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
              >On Aug 4, 1:01 am, John Machin <sjmac...@lexic on.netwrote:
              >>
              >>On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
              >>>Thanks Emile! Works almost perfectly, but is there some way I can
              >>>adapt this to quote fields that contain a comma in them?
              <snip>
              Emile's snippet is pushing it through the csv reading process, to
              demonstrate that his series of replaces works (on your *sole* example,
              at least).
              Exactly -- just print out the results of the passed argument:
              >>>
              rec.replace('," ',",'''").repla ce('",',"''',") .replace('"','" "').replace("'' '",'"')

              '123,"Here is some, text ""and some quoted text"" where the quotes
              should have been doubled",321'

              Where it won't work is if any of the field embedded quotes are next to
              commas.

              I'd run it against the file. Presumably, you've got a consistent field
              count expectation per record. Any resulting record not matching is
              suspect and will identify records this approach won't address.

              There's probably better ways, but sometimes it's fun to create
              executable line noise. :)

              Emile

              Comment

              • Ryan Rosario

                #8
                Re: Trying to fix Invalid CSV File

                On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.com wrote:
                John Machin wrote:
                On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
                On Aug 4, 1:01 am, John Machin <sjmac...@lexic on.netwrote:
                >
                >On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
                >>Thanks Emile! Works almost perfectly, but is there some way I can
                >>adapt this to quote fields that contain a comma in them?
                >
                <snip>
                >
                Emile's snippet is pushing it through the csv reading process, to
                demonstrate that his series of replaces works (on your *sole* example,
                at least).
                >
                Exactly -- just print out the results of the passed argument:
                >
                 >>>
                rec.replace('," ',",'''").repla ce('",',"''',") .replace('"','" "').replace("'' '",'"')
                >
                '123,"Here is some, text ""and some quoted text"" where the quotes
                should have been doubled",321'
                >
                Where it won't work is if any of the field embedded quotes are next to
                commas.
                >
                I'd run it against the file.  Presumably, you've got a consistent field
                count expectation per record.  Any resulting record not matching is
                suspect and will identify records this approach won't address.
                >
                There's probably better ways, but sometimes it's fun to create
                executable line noise.  :)
                >
                Emile
                Thanks for your responses. I think John may be right that I am reading
                it a second time. I will take a look at the CSV reader documentation
                and see if that helps. Then once I run it I can see if I need to worry
                about the comma-next-to-quote issue.

                Comment

                • Larry Bates

                  #9
                  Re: Trying to fix Invalid CSV File

                  Ryan Rosario wrote:
                  On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.com wrote:
                  >John Machin wrote:
                  >>On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
                  >>>On Aug 4, 1:01 am, John Machin <sjmac...@lexic on.netwrote:
                  >>>>On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
                  >>>>>Thanks Emile! Works almost perfectly, but is there some way I can
                  >>>>>adapt this to quote fields that contain a comma in them?
                  ><snip>
                  >>
                  >>Emile's snippet is pushing it through the csv reading process, to
                  >>demonstrate that his series of replaces works (on your *sole* example,
                  >>at least).
                  >Exactly -- just print out the results of the passed argument:
                  >>
                  > >>>
                  >rec.replace(', "',",'''").repl ace('",',"'''," ).replace('"',' ""').replace("' ''",'"')
                  >>
                  >'123,"Here is some, text ""and some quoted text"" where the quotes
                  >should have been doubled",321'
                  >>
                  >Where it won't work is if any of the field embedded quotes are next to
                  >commas.
                  >>
                  >I'd run it against the file. Presumably, you've got a consistent field
                  >count expectation per record. Any resulting record not matching is
                  >suspect and will identify records this approach won't address.
                  >>
                  >There's probably better ways, but sometimes it's fun to create
                  >executable line noise. :)
                  >>
                  >Emile
                  >
                  Thanks for your responses. I think John may be right that I am reading
                  it a second time. I will take a look at the CSV reader documentation
                  and see if that helps. Then once I run it I can see if I need to worry
                  about the comma-next-to-quote issue.
                  This is a perfect demonstration of why tab delimited files are so much better
                  than comma and quote delimited. Virtually all software can handle table
                  delimited as well as comma and quote delimited, but you would have none of these
                  problems if you had used tab delimited. The chances of tabs being embedded in
                  most data is virtually nil.

                  -Larry

                  Comment

                  • John Machin

                    #10
                    Re: Trying to fix Invalid CSV File

                    On Aug 5, 6:56 am, Larry Bates <larry.ba...@we bsafe.com`wrote :
                    Ryan Rosario wrote:
                    On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.com wrote:
                    John Machin wrote:
                    >On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
                    >>On Aug 4, 1:01 am, John Machin <sjmac...@lexic on.netwrote:
                    >>>On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
                    >>>>Thanks Emile! Works almost perfectly, but is there some way I can
                    >>>>adapt this to quote fields that contain a comma in them?
                    <snip>
                    >
                    >Emile's snippet is pushing it through the csv reading process, to
                    >demonstrate that his series of replaces works (on your *sole* example,
                    >at least).
                    Exactly -- just print out the results of the passed argument:
                    >
                    rec.replace('," ',",'''").repla ce('",',"''',") .replace('"','" "').replace("'' '",'"')
                    >
                    '123,"Here is some, text ""and some quoted text"" where the quotes
                    should have been doubled",321'
                    >
                    Where it won't work is if any of the field embedded quotes are next to
                    commas.
                    >
                    I'd run it against the file. Presumably, you've got a consistent field
                    count expectation per record. Any resulting record not matching is
                    suspect and will identify records this approach won't address.
                    >
                    There's probably better ways, but sometimes it's fun to create
                    executable line noise. :)
                    >
                    Emile
                    >
                    Thanks for your responses. I think John may be right that I am reading
                    it a second time. I will take a look at the CSV reader documentation
                    and see if that helps. Then once I run it I can see if I need to worry
                    about the comma-next-to-quote issue.
                    >
                    This is a perfect demonstration of why tab delimited files are so much better
                    than comma and quote delimited.
                    No, it's a perfect demonstration of what happens when a protocol is
                    not followed.
                    Virtually all software can handle table
                    delimited as well as comma and quote delimited, but you would have none of these
                    problems if you had used tab delimited. The chances of tabs being embedded in
                    most data is virtually nil.
                    >
                    There may be no tabs in *your* data. There is no guarantee that there
                    are no tabs in a VARCHAR(n) column in somebody else's database. I've
                    seen all of \x00, \t, \n, \r and \x1a (Ctrl-Z (EOF in CP/M, *DOS and
                    Windows text files)).

                    The possibilities include (1) Don't check (2) check if '\t' in field
                    and raise an exception (3) silently remove tabs; what do you
                    recommend?

                    Comment

                    • Ryan Rosario

                      #11
                      Re: Trying to fix Invalid CSV File

                      On Aug 4, 1:56 pm, Larry Bates <larry.ba...@we bsafe.com`wrote :
                      Ryan Rosario wrote:
                      On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.com wrote:
                      John Machin wrote:
                      >On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
                      >>On Aug 4, 1:01 am, John Machin <sjmac...@lexic on.netwrote:
                      >>>On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gm ail.comwrote:
                      >>>>Thanks Emile! Works almost perfectly, but is there some way I can
                      >>>>adapt this to quote fields that contain a comma in them?
                      <snip>
                      >
                      >Emile's snippet is pushing it through thecsvreading process, to
                      >demonstrate that his series of replaces works (on your *sole* example,
                      >at least).
                      Exactly -- just print out the results of the passed argument:
                      >
                      rec.replace('," ',",'''").repla ce('",',"''',") .replace('"','" "').replace("'' '",'"')
                      >
                      '123,"Here is some, text ""and some quoted text"" where the quotes
                      should have been doubled",321'
                      >
                      Where it won't work is if any of the field embedded quotes are next to
                      commas.
                      >
                      I'd run it against the file.  Presumably, you've got a consistent field
                      count expectation per record.  Any resulting record not matching is
                      suspect and will identify records this approach won't address.
                      >
                      There's probably better ways, but sometimes it's fun to create
                      executable line noise.  :)
                      >
                      Emile
                      >
                      Thanks for your responses. I think John may be right that I am reading
                      it a second time. I will take a look at theCSVreader documentation
                      and see if that helps. Then once I run it I can see if I need to worry
                      about the comma-next-to-quote issue.
                      >
                      This is a perfect demonstration of why tab delimited files are so much better
                      than comma and quote delimited.  Virtually all software can handle table
                      delimited as well as comma and quote delimited, but you would have none of these
                      problems if you had used tab delimited.  The chances of tabs being embedded in
                      most data is virtually nil.
                      >
                      -Larry
                      Thank you for all the help. I wasn't using Emile's code correctly. It
                      fixed 99% of the problem, reducing 30,000 bad lines to about 300. The
                      remaining cases were too difficult to pin a pattern on, so I just
                      spent an hour fixing those lines. It was typically just adding one
                      more " to one that was already there.

                      Next time I am going to be much more careful. Tab delimited is
                      probably better for my purpose, but I can definitely see there being
                      issues with invisible tab characters and other weirdness.

                      Ryan

                      Comment

                      • Roel Schroeven

                        #12
                        Re: Trying to fix Invalid CSV File

                        Ryan Rosario schreef:
                        Next time I am going to be much more careful. Tab delimited is
                        probably better for my purpose, but I can definitely see there being
                        issues with invisible tab characters and other weirdness.
                        No matter which delimiter you use, there will always be data that
                        includes that delimiter, and you need some way to deal with it.

                        I prefer the approach that esr suggests in "The Art of Unix Programming"
                        (http://www.catb.org/~esr/writings/ta...ch05s02.html): define a
                        delimiter (preferably but necessary one that doesn't occur frequently in
                        your data) and an escape character. On output, escape all occurrences of
                        delimiter and escape character in your data. On input, you can trivially
                        and unambiguously distinguish delimiters in the data from delimiters
                        between data, and unescape everything.

                        Cheers,
                        Roel

                        --
                        The saddest aspect of life right now is that science gathers knowledge
                        faster than society gathers wisdom.
                        -- Isaac Asimov

                        Roel Schroeven

                        Comment

                        Working...