Troubles with CSV file

  • Vladimir Ignatov

    Troubles with CSV file

    Hello!

    I have a big CSV file that I must read and process.
    Unfortunately, I can't figure out how to use the standard *csv* module in my
    situation. The problem is that some records look like:

    ""read this, man"", 1

    which should be decoded back into:

    "read this, man"
    1

    ... which looks pretty "natural" to me. Instead I got:

    read this
    man""
    1

    as output. In other words, the csv reader does not understand the use of "" here.
    A quick experiment showed me that the *csv* module (with the default 'excel'
    dialect) expects something like

    """read this, man""", 1

    in my situation - the quotes actually must be tripled. I don't understand this
    and can't figure out how to proceed with my CSV file. Maybe some
    *alternative* CSV parsers can help? Any suggestions are welcome.

    Vladimir Ignatov
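
The behaviour described in this post can be reproduced in a few lines. This is a minimal sketch against the standard csv module, written in current Python 3 syntax (an assumption, since the thread predates it); the variable names are illustrative.

```python
import csv

# The problematic record from the post: the field is wrapped in *doubled* quotes.
raw = '""read this, man"", 1'

# Default 'excel' dialect: quotechar='"', doublequote=True.
doubled = next(csv.reader([raw]))

# The same record with tripled quotes, which the 'excel' dialect accepts.
tripled = next(csv.reader(['"""read this, man""", 1']))

print(doubled)  # ['read this', ' man""', ' 1']
print(tripled)  # ['"read this, man"', ' 1']
```

The first list matches the three-way split reported above, apart from the leading blanks that the default dialect keeps after each comma.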


  • Peter Hansen

    #2
    Re: Troubles with CSV file

    Vladimir Ignatov wrote:
    > I have a big CSV file that I must read and process.
    > Unfortunately, I can't figure out how to use the standard *csv* module in my
    > situation. The problem is that some records look like:
    >
    > ""read this, man"", 1
    >
    > which should be decoded back into:
    >
    > "read this, man"
    > 1

    Do you have anything that already accepts this particular dialect?
    It seems to me that the above could just as easily be interpreted
    as three fields (using parentheses as delimiters):

    (""read this) ( man"") ( 1)

    Is it possible that what you have is not really any standard CSV
    format, but just something home-brewed? In that case, you may
    well need to massage it before feeding it to the csv module.

    Or, if you can define how your example works in terms of delimiters,
    quoting and such, maybe there's a way to make the csv module handle
    it without complaints.

    As far as I can see, you want either the doubled quotation marks to
    be treated as single quotation marks, or you want the outer quotation
    marks to magically quote the whole string containing the comma even
    though it contains the quotation marks already. I don't think CSV
    can handle the latter (and it's probably an impossible goal), so you
    must really want the former. In that case, unfortunately, you
    are also screwed because the doubling of quotation marks must mean
    that 'doublequote' is True, but then 'quotechar' must have been '"'
    in the first place and that first field would now have triple quotes
    around it, like the Excel dialect.

    Can you just blindly substitute all double quotes with triple quotes
    in the input string first? That might be the easiest approach.
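
A minimal sketch of the blind substitution suggested here, in Python 3 syntax. The helper name read_doubled_quote_rows is hypothetical, and the approach assumes every "" in the file is one of these wrappers: a genuinely empty quoted field or an already escaped quote would be mangled by it.

```python
import csv


def read_doubled_quote_rows(lines):
    """Yield parsed rows from lines whose fields are wrapped in doubled quotes.

    Hypothetical helper, not part of the csv module.
    """
    # Turn every "" into """ so the 'excel' dialect sees an ordinary quoted
    # field whose content starts and ends with an escaped quote.
    patched = (line.replace('""', '"""') for line in lines)
    # skipinitialspace=True drops the blank that follows each comma.
    yield from csv.reader(patched, skipinitialspace=True)


rows = list(read_doubled_quote_rows(['""read this, man"", 1']))
print(rows)  # [['"read this, man"', '1']]
```

Because the helper takes any iterable of lines, an open file object can be passed in directly, so a big file does not have to be loaded into memory first.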

    -Peter


    • Fuzzyman

      #3
      Re: Troubles with CSV file

      "Vladimir Ignatov" <vignatov@colorpilot.com> wrote in message news:<mailman.8.1084529146.4157.python-list@python.org>...
      > Hello!
      >
      > I have a big CSV file that I must read and process.
      > Unfortunately, I can't figure out how to use the standard *csv* module in my
      > situation. The problem is that some records look like:
      >
      > ""read this, man"", 1
      >
      > which should be decoded back into:
      >
      > "read this, man"
      > 1
      >
      > ... which looks pretty "natural" to me. Instead I got:
      >
      > read this
      > man""
      > 1
      >
      > as output. In other words, the csv reader does not understand the use of "" here.
      > A quick experiment showed me that the *csv* module (with the default 'excel'
      > dialect) expects something like
      >
      > """read this, man""", 1
      >
      > in my situation - the quotes actually must be tripled. I don't understand this
      > and can't figure out how to proceed with my CSV file. Maybe some
      > *alternative* CSV parsers can help? Any suggestions are welcome.
      >
      > Vladimir Ignatov


      I have written a very simple CSV parser which uses a simple function
      'unquote' to unquote quoted elements.
      It would be *very* simple to amend unquote to handle double-quoted
      elements.
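
Fuzzyman's parser is not shown in the thread, so the following is only a guess at what an amended unquote might look like, not his actual code. It is a hypothetical helper that works on a single element that has already been split out, so the splitting step still has to keep commas inside quotes together.

```python
def unquote(element):
    """Strip surrounding quotes from one already-split element (hypothetical helper)."""
    element = element.strip()
    # Doubled wrapper: ""text"" -> "text" (one pair of quotes is kept).
    if len(element) >= 4 and element.startswith('""') and element.endswith('""'):
        return '"' + element[2:-2] + '"'
    # Ordinary wrapper: "text" -> text.
    if len(element) >= 2 and element[0] == '"' and element[-1] == '"':
        return element[1:-1]
    # Unquoted element: only the surrounding whitespace is removed.
    return element


print(unquote('""read this, man""'))  # "read this, man"
print(unquote(' 1'))                  # 1
```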



      Regards,

      Fuzzy


      • Paul McGuire

        #4
        Re: Troubles with CSV file

        "Vladimir Ignatov" <vignatov@colorpilot.com> wrote in message
        news:mailman.8.1084529146.4157.python-list@python.org...
        > Hello!
        >
        > I have a big CSV file that I must read and process.
        > Unfortunately, I can't figure out how to use the standard *csv* module in my
        > situation. The problem is that some records look like:
        >
        > ""read this, man"", 1
        >
        > which should be decoded back into:
        >
        > "read this, man"
        > 1
        >
        > ... which looks pretty "natural" to me. Instead I got:
        >
        > read this
        > man""
        > 1
        >
        > as output. In other words, the csv reader does not understand the use of "" here.
        > A quick experiment showed me that the *csv* module (with the default 'excel'
        > dialect) expects something like
        >
        > """read this, man""", 1
        >
        > in my situation - the quotes actually must be tripled. I don't understand this
        > and can't figure out how to proceed with my CSV file. Maybe some
        > *alternative* CSV parsers can help? Any suggestions are welcome.
        >
        > Vladimir Ignatov
        Vladimir -

        Here is the CSV example that is provided with pyparsing (with some slight
        edits). I wrote this for exactly the situation you describe - just
        splitting on commas doesn't always do the right thing.

        You can download pyparsing at http://pyparsing.sourceforge.net .

        -- Paul

        ==========================
        # commasep.py
        #
        # Comma-separated list example, to illustrate the advantages of using
        # the pyparsing commaSeparatedList as opposed to string.split(","):
        # - leading and trailing whitespace is implicitly trimmed from list elements
        # - list elements can be quoted strings, which can safely contain commas
        #   without breaking into separate elements

        from pyparsing import commaSeparatedList

        testData = [
            "a,b,c,100.2,,3",
            "d, e, j k , m ",
            "'Hello, World', f, g , , 5.1,x",
            "John Doe, 123 Main St., Cleveland, Ohio",
            "Jane Doe, 456 St. James St., Los Angeles , California ",
            "",
        ]

        # Compare a naive split on commas with the pyparsing result for each line.
        for line in testData:
            print("input:", repr(line))
            print("split:", line.split(","))
            print("parse:", commaSeparatedList.parseString(line))
            print()

        ==========================
        Output:
        input: 'a,b,c,100.2,,3'
        split: ['a', 'b', 'c', '100.2', '', '3']
        parse: ['a', 'b', 'c', '100.2', '', '3']

        input: 'd, e, j k , m '
        split: ['d', ' e', ' j k ', ' m ']
        parse: ['d', 'e', 'j k', 'm']

        input: "'Hello, World', f, g , , 5.1,x"
        split: ["'Hello", " World'", ' f', ' g ', ' ', ' 5.1', 'x']
        parse: ["'Hello, World'", 'f', 'g', '', '5.1', 'x']

        input: 'John Doe, 123 Main St., Cleveland, Ohio'
        split: ['John Doe', ' 123 Main St.', ' Cleveland', ' Ohio']
        parse: ['John Doe', '123 Main St.', 'Cleveland', 'Ohio']

        input: 'Jane Doe, 456 St. James St., Los Angeles , California '
        split: ['Jane Doe', ' 456 St. James St.', ' Los Angeles ', ' California ']
        parse: ['Jane Doe', '456 St. James St.', 'Los Angeles', 'California']

        input: ''
        split: ['']
        parse: ['']




        • Dennis Lee Bieber

          #5
          Re: Troubles with CSV file

          On Fri, 14 May 2004 14:08:15 +0400, "Vladimir Ignatov"
          <vignatov@colorpilot.com> declaimed the following in comp.lang.python:

          > as output. In other words, the csv reader does not understand the use of "" here.
          > A quick experiment showed me that the *csv* module (with the default 'excel'
          > dialect) expects something like
          >
          > """read this, man""", 1
          >
          > in my situation - the quotes actually must be tripled. I don't understand this

          Which is standard behavior in almost all programming languages.
          The first " signals the beginning of a quoted string. Within a quoted
          string, double "s flag an escape, being replaced with a single " in the
          text. Then a final " ends the quoted string.

          "This is a ""quoted"" string"
          becomes
          This is a "quoted" string
          internally.
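
The doubling rule described here is also what the csv module's default dialect implements, in both directions. A short Python 3 sketch (variable names are illustrative):

```python
import csv
import io

line = '"This is a ""quoted"" string"'

# Reading: the outer quotes are removed and each "" becomes a single ".
(field,) = next(csv.reader([line]))
print(field)  # This is a "quoted" string

# Writing: csv.writer re-applies the same escaping to the field.
buffer = io.StringIO()
csv.writer(buffer).writerow([field])
print(buffer.getvalue().rstrip("\r\n"))  # "This is a ""quoted"" string"
```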

          I don't know why you got the "" on the trailing segment of your
          text -- maybe a bug in the CSV module, as I'd parse your line like
          this (use a fixed-width font):

                  ""read this, man"", 1
          start---|
          end------| ie, an empty quoted string
          unquoted--^^^^^^^^^
          comma-split--------|
          unquoted------------^^^^
          start-------------------|
          end----------------------| another empty quoted string
          comma-split---------------|
          unquoted-------------------^^

          whereas

                  """read this, man""", 1
          start---|
          end?-----| could be empty string
          NO-doubled| no, it's a " inside the string
          quoted-----^^^^^^^^^^^^^^
          end?---------------------| end of string?
          NO-doubled----------------| no, another " inside the string
          end------------------------| not doubled so end of string
          comma-split-----------------|
          unquoted---------------------^^
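
The two traces above amount to a small state machine. The sketch below is one possible Python 3 rendering of that reading, where a quote may open at any point in a field; split_fields is a hypothetical name, and this is not how the csv module itself is implemented.

```python
def split_fields(line, delimiter=",", quote='"'):
    """Split one line using the quote-doubling rule (hypothetical helper)."""
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    # Doubled quote inside a quoted string: keep one literal quote.
                    current.append(quote)
                    i += 1
                else:
                    # Single quote: end of the quoted string.
                    in_quotes = False
            else:
                current.append(ch)
        elif ch == quote:
            # Start of a quoted string.
            in_quotes = True
        elif ch == delimiter:
            # Comma outside quotes: the field is complete.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


print(split_fields('""read this, man"", 1'))    # ['read this', ' man', ' 1']
print(split_fields('"""read this, man""", 1'))  # ['"read this, man"', ' 1']
```

Under this reading the doubled-quote record yields two empty quoted strings and three fields, while the tripled-quote record yields the two fields Vladimir wanted.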


          --
          > ============================================================== <
          > wlfraed@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
          > wulfraed@dm.net | Bestiaria Support Staff <
          > ============================================================== <
          > Home Page: <http://www.dm.net/~wulfraed/> <
          > Overflow Page: <http://wlfraed.home.netcom.com/> <


          • Skip Montanaro

            #6
            Re: Troubles with CSV file

            Dennis>         ""read this, man"", 1
            Dennis> start---|
            Dennis> end------| ie, an empty quoted string
            Dennis> unquoted--^^^^^^^^^
            Dennis> comma-split--------|
            Dennis> unquoted------------^^^^
            Dennis> start-------------------|
            Dennis> end----------------------| another empty quoted string
            Dennis> comma-split---------------|
            Dennis> unquoted-------------------^^

            I'm not sure what "correct" interpretation of this should be since no
            separator was placed after the first '""' and before the second. Given that
            the input is ill-defined, just about any output could be considered
            "valid". ;-)

            Skip
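
If ill-defined input like this should be rejected rather than guessed at, the csv module's strict option is one way to find out early. A small Python 3 sketch, assuming the default 'excel' dialect otherwise:

```python
import csv

raw = '""read this, man"", 1'

try:
    # strict=True makes the reader raise csv.Error on malformed quoting
    # instead of silently falling back to an unquoted field.
    next(csv.reader([raw], strict=True))
    rejected = False
except csv.Error as exc:
    rejected = True
    print("rejected:", exc)
```

Running a file through a strict reader first can show whether it really follows the quote-doubling convention before any preprocessing is chosen.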
