Detecting line endings

  • Fuzzyman

    Detecting line endings

    Hello all,

    I'm trying to detect line endings used in text files. I *might* be
    decoding the files into unicode first (which may be encoded using
    multi-byte encodings) - which is why I'm not letting Python handle the
    line endings.

    Is the following safe and sane:

    text = open('test.txt', 'rb').read()
    if encoding:
        text = text.decode(encoding)
    ending = '\n' # default
    if '\r\n' in text:
        text = text.replace('\r\n', '\n')
        ending = '\r\n'
    elif '\n' in text:
        ending = '\n'
    elif '\r' in text:
        text = text.replace('\r', '\n')
        ending = '\r'


    My worry is that if '\n' *doesn't* signify a line break on the Mac,
    then it may exist in the body of the text - and trigger ``ending =
    '\n'`` prematurely?

    All the best,

    Fuzzyman


  • Sybren Stuvel

    #2
    Re: Detecting line endings

    Fuzzyman enlightened us with:
    > My worry is that if '\n' *doesn't* signify a line break on the Mac,
    > then it may exist in the body of the text - and trigger ``ending =
    > '\n'`` prematurely?

    I'd count the number of occurrences of '\r\n', '\n' without a preceding
    '\r', and '\r' without a following '\n', and let the majority decide.
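
    A minimal sketch of that counting idea (the ``guess_ending`` name and
    the lookaround regexes are mine, not a tested implementation):

```python
import re

def guess_ending(text):
    # Count each ending separately: '\r\n' pairs, bare '\n' without a
    # preceding '\r', and bare '\r' without a following '\n'.
    counts = [
        (len(re.findall('\r\n', text)), '\r\n'),
        (len(re.findall('(?<!\r)\n', text)), '\n'),
        (len(re.findall('\r(?!\n)', text)), '\r'),
    ]
    # Let the majority decide; ties fall back to string ordering here.
    counts.sort()
    return counts[-1][1]

print(repr(guess_ending('one\r\ntwo\r\nthree\n')))
```

    For 'one\r\ntwo\r\nthree\n' this picks '\r\n', since the two CRLF
    pairs outnumber the single bare '\n'.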

    Sybren
    --
    The problem with the world is stupidity. Not saying there should be a
    capital punishment for stupidity, but why don't we just take the
    safety labels off of everything and let the problem solve itself?
    Frank Zappa


    • Fuzzyman

      #3
      Re: Detecting line endings


      Sybren Stuvel wrote:
      > Fuzzyman enlightened us with:
      > > My worry is that if '\n' *doesn't* signify a line break on the Mac,
      > > then it may exist in the body of the text - and trigger ``ending =
      > > '\n'`` prematurely?
      >
      > I'd count the number of occurrences of '\r\n', '\n' without a preceding
      > '\r', and '\r' without a following '\n', and let the majority decide.

      Sounds reasonable, edge cases for small files be damned. :-)

      Fuzzyman

      > Sybren
      > --
      > The problem with the world is stupidity. Not saying there should be a
      > capital punishment for stupidity, but why don't we just take the
      > safety labels off of everything and let the problem solve itself?
      > Frank Zappa


      • Fuzzyman

        #4
        Re: Detecting line endings


        Sybren Stuvel wrote:
        > Fuzzyman enlightened us with:
        > > My worry is that if '\n' *doesn't* signify a line break on the Mac,
        > > then it may exist in the body of the text - and trigger ``ending =
        > > '\n'`` prematurely?
        >
        > I'd count the number of occurrences of '\r\n', '\n' without a preceding
        > '\r', and '\r' without a following '\n', and let the majority decide.

        This is what I came up with. As you can see from the docstring, it
        attempts to do sensible(-ish) things in the event of a tie, or of no
        line endings at all.

        Comments/corrections welcomed. I know the tests aren't very useful
        (because they make no *assertions*, they won't tell you if anything
        breaks), but you can see what's going on:

        import re
        import os

        rn = re.compile('\r\n')
        r = re.compile('\r(?!\n)')
        n = re.compile('(?<!\r)\n')

        # Sequence of (regex, literal, priority) for each line ending
        line_ending = [(n, '\n', 3), (rn, '\r\n', 2), (r, '\r', 1)]


        def find_ending(text, default=os.linesep):
            """
            Given a piece of text, use a simple heuristic to determine the line
            ending in use.

            Returns the value assigned to default if no line endings are found.
            This defaults to ``os.linesep``, the native line ending for the
            machine.

            If there is a tie between two endings, the priority chain is
            ``'\n', '\r\n', '\r'``.
            """
            results = [(len(exp.findall(text)), priority, literal) for
                       exp, literal, priority in line_ending]
            results.sort()
            print results
            if not sum([m[0] for m in results]):
                return default
            else:
                return results[-1][-1]

        if __name__ == '__main__':
            tests = [
                'hello\ngoodbye\nmy fish\n',
                'hello\r\ngoodbye\r\nmy fish\r\n',
                'hello\rgoodbye\rmy fish\r',
                'hello\rgoodbye\n',
                '',
                '\r\r\r\n\n',
                '\n\n\r\n\r\n',
                '\n\n\r\r\r\n',
                '\n\r\n\r\n\r',
            ]
            for entry in tests:
                print repr(entry)
                print repr(find_ending(entry))
                print

        All the best,


        Fuzzyman
        http://www.voidspace.org.uk/python/index.shtml

        > Sybren
        > --
        > The problem with the world is stupidity. Not saying there should be a
        > capital punishment for stupidity, but why don't we just take the
        > safety labels off of everything and let the problem solve itself?
        > Frank Zappa


        • Alex Martelli

          #5
          Re: Detecting line endings

          Fuzzyman <fuzzyman@gmail.com> wrote:

          > Hello all,
          >
          > I'm trying to detect line endings used in text files. I *might* be
          > decoding the files into unicode first (which may be encoded using

          Open the file with 'rU' mode, and check the file object's newlines
          attribute.

          > My worry is that if '\n' *doesn't* signify a line break on the Mac,

          It does, and has for a few years now, since Mac OS X is a version of
          Unix to all practical intents and purposes.
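
          For a concrete check, something like this sketch (temp file and
          sample data made up; 'rU' is the Python 2 spelling, and universal
          newlines are the default for text mode in later Pythons):

```python
import io
import os
import tempfile

# Write a sample file with Windows-style CRLF endings (made-up data).
fd, path = tempfile.mkstemp()
os.write(fd, b'hello\r\nworld\r\n')
os.close(fd)

# Open in universal-newline mode and read everything; afterwards the
# .newlines attribute records which ending(s) were actually seen.
f = io.open(path, 'r')
f.read()
seen = f.newlines
f.close()
os.remove(path)
print(repr(seen))
```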


          Alex


          • Sybren Stuvel

            #6
            Re: Detecting line endings

            Fuzzyman enlightened us with:
            > This is what I came up with. [...] Comments/corrections welcomed.

            You could use a few more comments in the code, but apart from that
            it looks nice.

            Sybren
            --
            The problem with the world is stupidity. Not saying there should be a
            capital punishment for stupidity, but why don't we just take the
            safety labels off of everything and let the problem solve itself?
            Frank Zappa


            • Fuzzyman

              #7
              Re: Detecting line endings


              Alex Martelli wrote:
              > Fuzzyman <fuzzyman@gmail.com> wrote:
              >
              > > Hello all,
              > >
              > > I'm trying to detect line endings used in text files. I *might* be
              > > decoding the files into unicode first (which may be encoded using
              >
              > Open the file with 'rU' mode, and check the file object's newlines
              > attribute.

              Ha, so long as it works with Python 2.2, that makes things a bit
              easier.

              Rats, I liked that snippet of code (I'm a great fan of list
              comprehensions). :-)
              > > My worry is that if '\n' *doesn't* signify a line break on the Mac,
              >
              > It does, and has for a few years now, since Mac OS X is a version of
              > Unix to all practical intents and purposes.

              I wondered if that might be the case. I think I've worried about this
              more than enough now.

              Thanks

              Fuzzyman

              > Alex


              • Fuzzyman

                #8
                Re: Detecting line endings


                Alex Martelli wrote:
                > Fuzzyman <fuzzyman@gmail.com> wrote:
                >
                > > Hello all,
                > >
                > > I'm trying to detect line endings used in text files. I *might* be
                > > decoding the files into unicode first (which may be encoded using
                >
                > Open the file with 'rU' mode, and check the file object's newlines
                > attribute.

                Do you know if this works for multi-byte encodings? Do files have
                metadata associated with them showing the line ending in use?

                I suppose I could test this...

                All the best,


                Fuzzy
                > > My worry is that if '\n' *doesn't* signify a line break on the Mac,
                >
                > It does, and has for a few years now, since Mac OS X is a version of
                > Unix to all practical intents and purposes.
                >
                > Alex


                • Arthur

                  #9
                  Re: Detecting line endings

                  Alex Martelli wrote:
                  > Fuzzyman <fuzzyman@gmail.com> wrote:
                  >
                  > > Hello all,
                  > >
                  > > I'm trying to detect line endings used in text files. I *might* be
                  > > decoding the files into unicode first (which may be encoded using
                  >
                  > Open the file with 'rU' mode, and check the file object's newlines
                  > attribute.

                  Do you think it would be sensible for file.readline to have universal
                  newline support by default?

                  I just got flummoxed by this issue, working with a (pre-alpha) package
                  by very experienced Python programmers who fed file.readline to
                  tokenizer.py without universal newline support. I went on a long (and
                  educational) journey trying to figure out why my file was not being
                  processed as expected.

                  Are there circumstances that it would be sensible to have tokenizer
                  process files without universal newline support?

                  The result here was having tokenizer detect indentation inconsistencies
                  that did not exist - in the sense that the files compiled and ran
                  fine under python.exe.

                  Art


                  • Arthur

                    #10
                    Re: Detecting line endings

                    Arthur wrote:
                    > Alex Martelli wrote:
                    >
                    > I just got flummoxed by this issue, working with a (pre-alpha) package
                    > by very experienced Python programmers who fed file.readline to
                    > tokenizer.py without universal newline support. I went on a long (and
                    > educational) journey trying to figure out why my file was not being
                    > processed as expected.

                    For example, the widely used MoinMoin source code colorizer sends files
                    to tokenizer without universal newline support:



                    Is my premise correct that tokenizer needs universal newline support
                    to be reliable?

                    What else could put it out of sync with the compiler?

                    Art


                    • Bengt Richter

                      #11
                      Re: Detecting line endings

                      On 6 Feb 2006 06:35:14 -0800, "Fuzzyman" <fuzzyman@gmail.com> wrote:

                      >Hello all,
                      >
                      >I'm trying to detect line endings used in text files. I *might* be
                      >decoding the files into unicode first (which may be encoded using
                      >multi-byte encodings) - which is why I'm not letting Python handle the
                      >line endings.
                      >
                      >Is the following safe and sane:
                      >
                      >text = open('test.txt', 'rb').read()
                      >if encoding:
                      >    text = text.decode(encoding)
                      >ending = '\n' # default
                      >if '\r\n' in text:
                      >    text = text.replace('\r\n', '\n')
                      >    ending = '\r\n'
                      >elif '\n' in text:
                      >    ending = '\n'
                      >elif '\r' in text:
                      >    text = text.replace('\r', '\n')
                      >    ending = '\r'
                      >
                      >My worry is that if '\n' *doesn't* signify a line break on the Mac,
                      >then it may exist in the body of the text - and trigger ``ending =
                      >'\n'`` prematurely?
                      Are you guaranteed that text bodies don't contain escape or quoting
                      mechanisms for binary data where it would be a mistake to convert
                      or delete an '\r' ? (E.g., I think XML CDATA might be an example).

                      Regards,
                      Bengt Richter


                      • Alex Martelli

                        #12
                        Re: Detecting line endings

                        Fuzzyman <fuzzyman@gmail.com> wrote:
                        ...
                        > > Open the file with 'rU' mode, and check the file object's newlines
                        > > attribute.
                        >
                        > Do you know if this works for multi-byte encodings? Do files have

                        You mean when you open them with the codecs module?
                        > metadata associated with them showing the line ending in use?

                        Not in the filesystems I'm familiar with (they used to, in
                        filesystems used on VMS and other ancient OSs, but that was a
                        very long time ago).


                        Alex


                        • Fuzzyman

                          #13
                          Re: Detecting line endings


                          Bengt Richter wrote:
                          > On 6 Feb 2006 06:35:14 -0800, "Fuzzyman" <fuzzyman@gmail.com> wrote:
                          >
                          > >Hello all,
                          > >
                          > >I'm trying to detect line endings used in text files. I *might* be
                          > >decoding the files into unicode first (which may be encoded using
                          > >multi-byte encodings) - which is why I'm not letting Python handle the
                          > >line endings.
                          > >
                          > >Is the following safe and sane:
                          > >
                          > >text = open('test.txt', 'rb').read()
                          > >if encoding:
                          > >    text = text.decode(encoding)
                          > >ending = '\n' # default
                          > >if '\r\n' in text:
                          > >    text = text.replace('\r\n', '\n')
                          > >    ending = '\r\n'
                          > >elif '\n' in text:
                          > >    ending = '\n'
                          > >elif '\r' in text:
                          > >    text = text.replace('\r', '\n')
                          > >    ending = '\r'
                          > >
                          > >My worry is that if '\n' *doesn't* signify a line break on the Mac,
                          > >then it may exist in the body of the text - and trigger ``ending =
                          > >'\n'`` prematurely?
                          >
                          > Are you guaranteed that text bodies don't contain escape or quoting
                          > mechanisms for binary data where it would be a mistake to convert
                          > or delete an '\r'? (E.g., I think XML CDATA might be an example.)

                          My personal use case is for reading config files in arbitrary encodings
                          (so it's not an issue).

                          How would Python handle opening such files when not in binary mode?
                          That may be an issue even on Linux - if you open a Windows file and
                          use splitlines, does Python convert '\r\n' to '\n'? (Or does it leave
                          the extra '\r's in place, which is *different* from the behaviour
                          under Windows?)
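
                          A quick check suggests splitlines itself is platform-independent
                          - it splits on '\r\n', '\r' and '\n' alike (sample string made
                          up), though newline *translation* only happens when a file is
                          read in text mode:

```python
# str.splitlines treats '\r\n', '\r' and '\n' all as line boundaries,
# regardless of the platform the code runs on.
text = 'one\r\ntwo\rthree\nfour'
print(text.splitlines())
```

                          So no stray '\r's are left behind by splitlines, but joining the
                          lines back up would of course lose the original endings.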

                          All the best,

                          Fuzzyman

                          > Regards,
                          > Bengt Richter


                          • Fuzzyman

                            #14
                            Re: Detecting line endings


                            Alex Martelli wrote:
                            > Fuzzyman <fuzzyman@gmail.com> wrote:
                            > ...
                            > > > Open the file with 'rU' mode, and check the file object's newlines
                            > > > attribute.
                            > >
                            > > Do you know if this works for multi-byte encodings? Do files have
                            >
                            > You mean when you open them with the codecs module?

                            No, if I open a UTF16 encoded file in universal mode - will it still
                            have the correct newlines attribute?

                            I can't open with a codec unless an encoding is explicitly supplied. I
                            still want to detect UTF16 even if the encoding isn't specified.

                            As I said, I ought to test this... Without metadata, I wonder how
                            Python determines it?
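
                            One way to test it might be this sketch (sample data made up;
                            the behaviour shown is what I'd expect when decoding happens
                            before the newline handling, as in later Pythons):

```python
import io
import os
import tempfile

# Write a UTF-16 encoded file that uses CRLF line endings (made-up data).
fd, path = tempfile.mkstemp()
os.write(fd, u'hello\r\nworld\r\n'.encode('utf-16'))
os.close(fd)

# Opening with an explicit encoding decodes the bytes first, so the
# universal-newline machinery sees real '\r\n' pairs rather than the
# interleaved NUL bytes of the raw UTF-16 stream.
f = io.open(path, 'r', encoding='utf-16')
f.read()
seen = f.newlines
f.close()
os.remove(path)
print(repr(seen))
```

                            Detecting UTF16 *without* being told the encoding is a separate
                            problem (a BOM sniff, perhaps) - this only covers the case where
                            the encoding is known.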

                            All the best,

                            Fuzzyman

                            > > metadata associated with them showing the line ending in use?
                            >
                            > Not in the filesystems I'm familiar with (they used to, in
                            > filesystems used on VMS and other ancient OSs, but that was a
                            > very long time ago).
                            >
                            > Alex


                            • ajsiegel@optonline.com

                              #15
                              Re: Detecting line endings


                              Arthur wrote:
                              > Arthur wrote:

                              > Is my premise correct that tokenizer needs universal newline support
                              > to be reliable?
                              >
                              > What else could put it out of sync with the compiler?

                              Anybody out there?

                              Is my question, and the real-world issue that provoked it, unclear?

                              Is the answer too obvious?

                              Have I made *everybody's* killfile?

                              Isn't it a prima facie issue if the tokenizer fails in ways
                              incompatible with what the compiler is seeing?

                              Is this just easy, and I am making it hard? As I apparently do with
                              Python more generally.

                              Art

