Scanning a file

  • Lasse Vågsæther Karlsen

    #61
    Re: Scanning a file

    David Rasmussen wrote:
    <snip>
    > If you must know, the above one-liner actually counts the number of
    > frames in an MPEG2 file. I want to know this number for a number of
    > files for various reasons. I don't want it to take forever.
    <snip>

    Don't you risk getting more "frames" than the file actually has? What
    if the encoded data happens to have the magic byte values for something
    else?

    --
    Lasse Vågsæther Karlsen

    mailto:lasse@vkarlsen.no
    PGP KeyID: 0x2A42A1C2


    • Steven D'Aprano

      #62
      Re: Scanning a file

      David Rasmussen wrote:
      > Steven D'Aprano wrote:
      >
      >> On Fri, 28 Oct 2005 06:22:11 -0700, pinkfloydhomer@gmail.com wrote:
      >>
      >>> Which is quite fast. The only problem is that the file might be huge.
      >>
      >>
      >> What *you* call huge and what *Python* calls huge may be very different
      >> indeed. What are you calling huge?
      >>
      >
      > I'm not saying that it is too big for Python. I am saying that it is too
      > big for the systems it is going to run on. These files can be 22 MB or 5
      > GB or ..., depending on the situation. It might not be okay to run a
      > tool that claims that much memory, even if it is available.

      If your files can reach multiple gigabytes, you will
      definitely need an algorithm that avoids reading the
      entire file into memory at once.


      [snip]
      > print file("filename", "rb").count("\x00\x00\x01\x00")
      >
      > (or something like that)
      >
      > instead of the original
      >
      > print file("filename", "rb").read().count("\x00\x00\x01\x00")
      >
      > it would be exactly what I am after.

      I think I can say, without risk of contradiction, that
      there is no built-in method to do that.

      > What is the conceptual difference?
      > The first solution should be at least as fast as the second. I have to
      > read and compare the characters anyway. I just don't need to store them
      > in a string. In essence, I should be able to use the "count occurrences"
      > functionality on more things, such as a file, or even better, a file
      > read through a buffer with a size specified by me.

      Of course, if you feel like coding the algorithm and
      submitting it to be included in the next release of
      Python... :-)


      I can't help feeling that a generator with a buffer is
      the way to go, but I just can't *quite* deal with the
      case where the pattern overlaps the boundary... it is
      very annoying.

      But not half as annoying as it must be to you :-)
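The boundary case becomes manageable if each buffer is prefixed with the last len(pattern) - 1 bytes of the previous one. The following is only a minimal sketch, not code posted in the thread: it is written for Python 3 (the thread itself uses Python 2), the names `count_overlapping` and `chunk_size` are made up for illustration, and it counts overlapping occurrences, which may or may not be what the OP needs.

```python
import io

def count_overlapping(f, pattern, chunk_size=1024 * 1024):
    """Count occurrences of `pattern` (bytes) in the binary file object `f`,
    overlapping ones included, reading at most `chunk_size` bytes at a time."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    keep = len(pattern) - 1          # bytes carried over between buffers
    tail = b""
    count = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:                # end of file
            break
        buf = tail + chunk
        pos = buf.find(pattern)
        while pos != -1:
            count += 1
            pos = buf.find(pattern, pos + 1)   # +1 so overlapping matches count
        # The carried tail is shorter than the pattern, so a match that was
        # already counted can never be found again in the next buffer.
        tail = buf[-keep:] if keep else b""
    return count

# Small in-memory stand-in for a real file opened with open(path, "rb").
sample = io.BytesIO(b"0010010")
result = count_overlapping(sample, b"0010", chunk_size=2)
```

With a deliberately tiny `chunk_size` of 2, `result` is 2 for the "0010010" example, so matches that straddle a buffer boundary are still found. Memory use stays near `chunk_size` regardless of the file size.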

      However, there may be a simpler solution *fingers
      crossed* -- you are searching for a sub-string
      "\x00\x00\x01\x00", which is hex 0x100. Surely you
      don't want any old substring of "\x00\x00\x01\x00", but
      only the ones which align on word boundaries?

      So "ABCD\x00\x00\x01\x00" would match (in hex, it is
      0x41424344 0x100), but "AB\x00\x00\x01\x00CD" should
      not, because that is 0x41420000 0x1004344 in hex.

      If that is the case, your problem is simpler: you don't
      have to worry about the pattern crossing a boundary, so
      long as your buffer is a multiple of four bytes.
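If the word-alignment assumption really holds (nothing in the thread confirms that it does for MPEG2 files), the scan could look roughly like this. It is a hedged Python 3 sketch: `count_aligned` and `words_per_read` are illustrative names, and it assumes `f.read(n)` returns full blocks until end of file, as it does for ordinary files opened in binary mode.

```python
import io

def count_aligned(f, pattern, words_per_read=64 * 1024):
    """Count occurrences of `pattern` that start at an offset which is a
    multiple of len(pattern), reading whole words at a time."""
    n = len(pattern)
    if n == 0:
        raise ValueError("pattern must not be empty")
    count = 0
    while True:
        # Every block holds a whole number of words, so an aligned
        # pattern can never straddle two blocks.
        block = f.read(n * words_per_read)
        if not block:
            break
        for i in range(0, len(block) - n + 1, n):
            if block[i:i + n] == pattern:
                count += 1
    return count

marker = b"\x00\x00\x01\x00"
aligned = count_aligned(io.BytesIO(b"ABCD" + marker), marker)
misaligned = count_aligned(io.BytesIO(b"AB" + marker + b"CD"), marker)
```

Here `aligned` is 1 and `misaligned` is 0, matching the two examples above. The per-word Python loop is slow for multi-gigabyte files; it is meant to show the idea, not to be fast.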



      --
      Steven.


      • Paul Watson

        #63
        Re: Scanning a file

        Alex Martelli wrote:
        ....
        > >>> gc.garbage
        >
        > [<__main__.a object at 0x64cf0>, <__main__.b object at 0x58510>]
        >
        > So, no big deal -- run a gc.collect() and parse through gc.garbage for
        > any instances of your "wrapper of file" class, and you'll find ones that
        > were forgotten as part of a cyclic garbage loop and you can check
        > whether they were explicitly closed or not.
        >
        >
        > Alex

        Since everyone needs this, how about building it in such that files
        which are closed by the runtime, and not user code, are reported or
        queryable? Perhaps a command line switch to either invoke or suppress
        reporting them on exit.

        Is there any facility for another program to peer into the state of a
        Python program? Would this be a security problem?


        • Steve Holden

          #64
          Re: Scanning a file

          Paul Watson wrote:
          > Alex Martelli wrote:
          > ...
          >
          >> >>> gc.garbage
          >>
          >>[<__main__.a object at 0x64cf0>, <__main__.b object at 0x58510>]
          >>
          >>So, no big deal -- run a gc.collect() and parse through gc.garbage for
          >>any instances of your "wrapper of file" class, and you'll find ones that
          >>were forgotten as part of a cyclic garbage loop and you can check
          >>whether they were explicitly closed or not.
          >>
          >>
          >>Alex
          >
          >
          > Since everyone needs this, how about building it in such that files
          > which are closed by the runtime, and not user code, are reported or
          > queryable? Perhaps a command line switch to either invoke or suppress
          > reporting them on exit.
          >
          This is a rather poor substitute for correct program design and
          implementation. It also begs the question of exactly what constitutes a
          "file". What about a network socket that the user has run makefile() on?
          What about a pipe to another process? This suggestion is rather ill-defined.

          > Is there any facility for another program to peer into the state of a
          > Python program? Would this be a security problem?

          It would indeed be a security problem, and there are enough of those
          already without adding more.

          regards
          Steve
          --
          Steve Holden +44 150 684 7255 +1 800 494 3119
          Holden Web LLC www.holdenweb.com
          PyCon TX 2006 www.python.org/pycon/


          • Bengt Richter

            #65
            Re: Scanning a file

            On Mon, 31 Oct 2005 09:41:02 +0100, Lasse Vågsæther Karlsen <lasse@vkarlsen.no> wrote:

            >David Rasmussen wrote:
            ><snip>
            >> If you must know, the above one-liner actually counts the number of
            >> frames in an MPEG2 file. I want to know this number for a number of
            >> files for various reasons. I don't want it to take forever.
            ><snip>
            >
            >Don't you risk getting more "frames" than the file actually has? What
            >if the encoded data happens to have the magic byte values for something
            >else?
            >
            Good point, but perhaps the bit pattern the OP is looking for is guaranteed
            (e.g. by some kind of HDLC-like bit or byte stuffing or escaping) not to occur
            except as frame marker (which might make sense re the problem of re-synching
            to frames in a glitched video stream).

            The OP probably knows. I imagine this thread would have gone differently if the
            title had been "How to count frames in an MPEG2 file?" and the OP had supplied
            the info about what marks a frame and whether it is guaranteed not to occur
            in the data ;-)

            Regards,
            Bengt Richter


            • Bengt Richter

              #66
              Re: Scanning a file

              On Mon, 31 Oct 2005 09:19:10 +0100, Peter Otten <__peter__@web.de> wrote:

              >Bengt Richter wrote:
              >
              >> I still smelled a bug in the counting of substring in the overlap region,
              >> and you motivated me to find it (obvious in hindsight, but aren't most ;-)
              >>
              >> A substring can get over-counted if the "overlap" region joins
              >> infelicitously with the next input. E.g., try counting 'xx' in 10*'xx'
              >> with a read chunk of 4 instead of 1024*1024:
              >>
              >> Assuming corrections so far posted as I understand them:
              >>
              >> >>> def byblocks(f, blocksize, overlap):
              >> ...     block = f.read(blocksize)
              >> ...     yield block
              >> ...     if overlap>0:
              >> ...         while True:
              >> ...             next = f.read(blocksize-overlap)
              >> ...             if not next: break
              >> ...             block = block[-overlap:] + next
              >> ...             yield block
              >> ...     else:
              >> ...         while True:
              >> ...             next = f.read(blocksize)
              >> ...             if not next: break
              >> ...             yield next
              >> ...
              >> >>> def countsubst(f, subst, blksize=1024*1024):
              >> ...     count = 0
              >> ...     for block in byblocks(f, blksize, len(subst)-1):
              >> ...         count += block.count(subst)
              >> ...     f.close()
              >> ...     return count
              >> ...
              >>
              >> >>> from StringIO import StringIO as S
              >> >>> countsubst(S('xx'*10), 'xx', 4)
              >> 13
              >> >>> ('xx'*10).count('xx')
              >> 10
              >> >>> list(byblocks(S('xx'*10), 4, len('xx')-1))
              >> ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']
              >>
              >> Of course, a large read chunk will make the problem either
              >> go away
              >>
              >> >>> countsubst(S('xx'*10), 'xx', 1024)
              >> 10
              >>
              >> or might make it low probability depending on the data.
              >
              >[David Rasmussen]
              >
              >> First of all, this isn't a text file, it is a binary file. Secondly,
              >> substrings can overlap. In the sequence 0010010 the substring 0010
              >> occurs twice.
              The OP didn't reply to my post re the above for some reason
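One way to avoid the over-count while keeping the non-overlapping semantics of str.count() is to remember where the last match ended and never carry already-consumed bytes into the next buffer. This is a hedged Python 3 sketch rather than code from the thread; `count_like_str_count` is a hypothetical name.

```python
import io

def count_like_str_count(f, pattern, chunk_size=1024 * 1024):
    """Count non-overlapping occurrences of `pattern` (bytes) in `f`,
    giving the same result as f.read().count(pattern) without holding
    the whole file in memory."""
    n = len(pattern)
    if n == 0:
        raise ValueError("pattern must not be empty")
    count = 0
    tail = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        pos = 0
        while True:
            i = buf.find(pattern, pos)
            if i == -1:
                break
            count += 1
            pos = i + n              # resume after the match, as count() does
        # Carry over only bytes that could still begin a match: at most
        # n - 1 of them, and none that a counted match already consumed.
        tail = buf[max(pos, len(buf) - (n - 1)):]
    return count

chunked = count_like_str_count(io.BytesIO(b"xx" * 10), b"xx", chunk_size=4)
```

For the 'xx' example with a read chunk of 4, `chunked` is 10, the same as `('xx'*10).count('xx')`, instead of the 13 shown above.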

              >
              >Coincidentally the "always overlap" case seems the easiest to fix. It
              >suffices to replace the count() method with
              >
              >def count_overlap(s, token):
              >    pos = -1
              >    n = 0
              >    while 1:
              >        try:
              >            pos = s.index(token, pos+1)
              >        except ValueError:
              >            break
              >        n += 1
              >    return n
              >
              >Or so I hope, without the thorough tests that are indispensable as we should
              >have learned by now...
              >
              Unfortunately, there is such a thing as a correct implementation of an incorrect spec ;-)
              I have some doubts about the OP's really wanting to count overlapping patterns as above,
              which is what I asked about in the above referenced post. Elsewhere he later reveals:

              [David Rasmussen]
              >> If you must know, the above one-liner actually counts the number of
              >> frames in an MPEG2 file. I want to know this number for a number of
              >> files for various reasons. I don't want it to take forever.

              In which case I doubt whether he wants to count as above. Scanning for the
              particular 4 bytes would assume that non-frame-marker data is escaped
              one way or another so it can't contain the marker byte sequence.
              (If it did, you'd want to skip it, not count it, I presume.) A robust streaming video
              format would presumably be designed for unambiguous re-synching, meaning
              the data stream can't contain the sync mark. But I don't know if that
              is guaranteed in conversion from file to stream a la HDLC or some link packet protocol
              or whether it is actually encoded with escaping in the file. If framing in the file is with
              length-specifying packet headers and no data escaping, then the filebytes.count(pattern)
              approach is not going to do the job reliably, as Lasse was pointing out.

              Requirements, requirements ;-)

              Regards,
              Bengt Richter


              • Paul Watson

                #67
                Re: Scanning a file

                Steve Holden wrote:
                >> Since everyone needs this, how about building it in such that files
                >> which are closed by the runtime, and not user code, are reported or
                >> queryable? Perhaps a command line switch to either invoke or suppress
                >> reporting them on exit.
                >>
                > This is a rather poor substitute for correct program design and
                > implementation. It also begs the question of exactly what constitutes a
                > "file". What about a network socket that the user has run makefile() on?
                > What about a pipe to another process? This suggestion is rather
                > ill-defined.
                >
                >> Is there any facility for another program to peer into the state of a
                >> Python program? Would this be a security problem?
                >
                > It would indeed be a security problem, and there are enough of those
                > already without adding more.
                >
                > regards
                > Steve

                All I am looking for is the runtime to tell me when it is doing things
                that are outside the language specification and that the developer
                should have coded.

                How "ill" will things be when large bodies of code cannot run
                successfully on a future version of Python or a non-CPython
                implementation which does not close files? Might as well put file
                closing on exit into the specification.

                The runtime knows it is doing it. Please allow the runtime to tell me
                what it knows it is doing. Thanks.


                • Steve Holden

                  #68
                  Re: Scanning a file

                  Paul Watson wrote:
                  > Steve Holden wrote:
                  >
                  >>>Since everyone needs this, how about building it in such that files
                  >>>which are closed by the runtime, and not user code, are reported or
                  >>>queryable? Perhaps a command line switch to either invoke or suppress
                  >>>reporting them on exit.
                  >>>
                  >>
                  >>This is a rather poor substitute for correct program design and
                  >>implementation. It also begs the question of exactly what constitutes a
                  >>"file". What about a network socket that the user has run makefile() on?
                  >>What about a pipe to another process? This suggestion is rather
                  >>ill-defined.
                  >>
                  >>
                  >>>Is there any facility for another program to peer into the state of a
                  >>>Python program? Would this be a security problem?
                  >>
                  >>It would indeed be a security problem, and there are enough of those
                  >>already without adding more.
                  >>
                  >>regards
                  >> Steve
                  >
                  >
                  > All I am looking for is the runtime to tell me when it is doing things
                  > that are outside the language specification and that the developer
                  > should have coded.
                  >
                  > How "ill" will things be when large bodies of code cannot run
                  > successfully on a future version of Python or a non-CPython
                  > implementation which does not close files. Might as well put file
                  > closing on exit into the specification.
                  >
                  > The runtime knows it is doing it. Please allow the runtime to tell me
                  > what it knows it is doing. Thanks.

                  In point of fact I don't believe the runtime does any such thing
                  (though I must admit I haven't checked the source, so you may prove me
                  wrong).

                  As far as I know, Python simply relies on the operating system to close
                  files left open at the end of the program.

                  regards
                  Steve
                  --
                  Steve Holden +44 150 684 7255 +1 800 494 3119
                  Holden Web LLC www.holdenweb.com
                  PyCon TX 2006 www.python.org/pycon/


                  • John J. Lee

                    #69
                    Re: Scanning a file

                    Paul Watson <pwatson@redlinepy.com> writes:
                    [...]
                    > How "ill" will things be when large bodies of code cannot run
                    > successfully on a future version of Python or a non-CPython
                    > implementation which does not close files. Might as well put file
                    > closing on exit into the specification.
                    [...]

                    There are many, many ways of making a large body of code "ill".

                    Closing off this particular one would make it harder to get benefit of
                    non-C implementations of Python, so it has been judged "not worth it".
                    I think I agree with that judgement.


                    John


                    • Paul Rubin

                      #70
                      Re: Scanning a file

                      jjl@pobox.com (John J. Lee) writes:
                      > Closing off this particular one would make it harder to get benefit of
                      > non-C implementations of Python, so it has been judged "not worth it".
                      > I think I agree with that judgement.

                      The right fix is PEP 343.
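PEP 343 is the proposal that introduced the `with` statement, under which a file is closed when the block ends on any Python implementation, not only where reference counting happens to do it. A minimal Python 3 sketch follows; `read_prefix` and the scratch file are illustrative and do not come from the thread.

```python
import os
import tempfile

def read_prefix(path, size):
    """Return the first `size` bytes of the file and whether the file
    object was closed once the with block was left."""
    with open(path, "rb") as f:
        data = f.read(size)
    # Leaving the block closed the file, even if read() had raised.
    return data, f.closed

# Usage: write a small scratch file, read it back, then clean up.
fd, scratch = tempfile.mkstemp()
os.write(fd, b"\x00\x00\x01\x00rest of the data")
os.close(fd)
prefix, was_closed = read_prefix(scratch, 4)
os.remove(scratch)
```

After this runs, `prefix` holds the four marker bytes and `was_closed` is True, with no explicit call to close().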


                      • Alex Martelli

                        #71
                        Re: Scanning a file

                        Steve Holden <steve@holdenweb.com> wrote:
                        ...
                        > > The runtime knows it is doing it. Please allow the runtime to tell me
                        > > what it knows it is doing. Thanks.
                        >
                        > In point of fact I don't believe the runtime does any such thing
                        > (though I must admit I haven't checked the source, so you may prove me
                        > wrong).
                        >
                        > As far as I know, Python simply relies on the operating system to close
                        > files left open at the end of the program.

                        Nope, see
                        <http://cvs.sourceforge.net/viewcvs.p...src/Objects/fileobject.c?rev=2.164.2.3&view=markup>:

                        """
                        static void
                        file_dealloc(PyFileObject *f)
                        {
                            int sts = 0;
                            if (f->weakreflist != NULL)
                                PyObject_ClearWeakRefs((PyObject *) f);
                            if (f->f_fp != NULL && f->f_close != NULL) {
                                Py_BEGIN_ALLOW_THREADS
                                sts = (*f->f_close)(f->f_fp);
                        """
                        etc.

                        Exactly how the OP wants to "allow the runtime to tell [him] what it
                        knows it is doing", that is not equivalent to reading the freely
                        available sources of that runtime, is totally opaque to me, though.

                        "The runtime" (implementation of built-in object type `file`) could be
                        doing or not doing a bazillion things (in its ..._dealloc function as
                        well as many other functions), up to and including emailing the OP's
                        cousin if it detects the OP is up later than his or her bedtime -- the
                        language specs neither mandate nor forbid such behavior. How, exactly,
                        does the OP believe the language specs should "allow" (presumably,
                        REQUIRE) ``the runtime'' to communicate the sum total of all that it's
                        doing or not doing (beyond whatever the language specs themselves may
                        require or forbid it to do) on any particular occasion...?!


                        Alex


                        • Paul Watson

                          #72
                          Re: Scanning a file

                          Paul Rubin wrote:
                          > jjl@pobox.com (John J. Lee) writes:
                          >
                          >>Closing off this particular one would make it harder to get benefit of
                          >>non-C implementations of Python, so it has been judged "not worth it".
                          >>I think I agree with that judgement.
                          >
                          >
                          > The right fix is PEP 343.

                          I am sure you are right. However, PEP 343 will not change the existing
                          body of Python source code. Nor will it, alone, change the existing
                          body of Python programmers who are writing code which does not close files.


                          • Paul Watson

                            #73
                            Re: Scanning a file

                            Alex Martelli wrote:
                            > Steve Holden <steve@holdenweb.com> wrote:
                            > ...
                            >
                            >>>The runtime knows it is doing it. Please allow the runtime to tell me
                            >>>what it knows it is doing. Thanks.
                            >>
                            >>In point of fact I don't believe the runtime does any such thing
                            >>(though I must admit I haven't checked the source, so you may prove me
                            >>wrong).
                            >>
                            >>As far as I know, Python simply relies on the operating system to close
                            >>files left open at the end of the program.
                            >
                            >
                            > Nope, see
                            > <http://cvs.sourceforge.net/viewcvs.p...src/Objects/fileobject.c?rev=2.164.2.3&view=markup>:
                            >
                            > """
                            > static void
                            > file_dealloc(PyFileObject *f)
                            > {
                            >     int sts = 0;
                            >     if (f->weakreflist != NULL)
                            >         PyObject_ClearWeakRefs((PyObject *) f);
                            >     if (f->f_fp != NULL && f->f_close != NULL) {
                            >         Py_BEGIN_ALLOW_THREADS
                            >         sts = (*f->f_close)(f->f_fp);
                            > """
                            > etc.
                            >
                            > Exactly how the OP wants to "allow the runtime to tell [him] what it
                            > knows it is doing", that is not equivalent to reading the freely
                            > available sources of that runtime, is totally opaque to me, though.
                            >
                            > "The runtime" (implementation of built-in object type `file`) could be
                            > doing or not doing a bazillion things (in its ..._dealloc function as
                            > well as many other functions), up to and including emailing the OP's
                            > cousin if it detects the OP is up later than his or her bedtime -- the
                            > language specs neither mandate nor forbid such behavior. How, exactly,
                            > does the OP believe the language specs should "allow" (presumably,
                            > REQUIRE) ``the runtime'' to communicate the sum total of all that it's
                            > doing or not doing (beyond whatever the language specs themselves may
                            > require or forbid it to do) on any particular occasion...?!
                            >
                            >
                            > Alex

                            The OP wants to know which files the runtime is closing automatically.
                            This may or may not occur on other or future Python implementations.
                            Identifying this condition will accelerate remediation efforts to avoid
                            the deleterious impact of failure to close().


                            The mechanism to implement such a capability might be similar to the -v
                            switch which traces imports, reporting to stdout. It might be a
                            callback function.
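For readers on current Python: something close to this request exists in Python 3 (it was not available when this thread was written). CPython 3.2 and later emit a ResourceWarning when a file object is destroyed without having been closed; the warning is ignored by default and can be switched on with a command-line option such as `-W default`. The sketch below records it programmatically. `unclosed_file_warnings` is an illustrative name, and the exact warning text is a CPython detail that other implementations may not share.

```python
import gc
import os
import tempfile
import warnings

def unclosed_file_warnings(path):
    """Open `path`, drop the file object without closing it, and return
    the ResourceWarning messages recorded while doing so."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # ResourceWarning is ignored by default
        f = open(path, "rb")
        del f                            # deliberately no close()
        gc.collect()                     # helps on runtimes without refcounting
    return [str(w.message) for w in caught
            if issubclass(w.category, ResourceWarning)]

# Usage with an empty scratch file.
fd, scratch = tempfile.mkstemp()
os.close(fd)
messages = unclosed_file_warnings(scratch)
os.remove(scratch)
```

On CPython, `messages` should contain one entry naming the unclosed file, which is roughly the report asked for above; the filter mechanism of the warnings module plays the role of the suggested switch.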


                            • Fredrik Lundh

                              #74
                              Re: Scanning a file

                              Alex Martelli wrote:

                              >> As far as I know, Python simply relies on the operating system to close
                              >> files left open at the end of the program.
                              >
                              > Nope, see
                              > <http://cvs.sourceforge.net/viewcvs.p...src/Objects/fileobject.c?rev=2.164.2.3&view=markup>

                              that's slightly misleading: CPython will make a decent attempt (via the module cleanup
                              mechanism: http://www.python.org/doc/essays/cleanup.html ), but any files that are not
                              closed by that process will be handled by the OS.

                              CPython is not designed to run on an OS that doesn't reclaim memory and other
                              resources upon program exit.

                              </F>




                              • David Rasmussen

                                #75
                                Re: Scanning a file

                                Lasse Vågsæther Karlsen wrote:
                                > David Rasmussen wrote:
                                > <snip>
                                >
                                >> If you must know, the above one-liner actually counts the number of
                                >> frames in an MPEG2 file. I want to know this number for a number of
                                >> files for various reasons. I don't want it to take forever.
                                >
                                > Don't you risk getting more "frames" than the file actually has? What
                                > if the encoded data happens to have the magic byte values for something
                                > else?
                                >

                                I am not too sure about the details, but I've been told from a reliable
                                source that 0x00000100 only occurs as a "begin frame" marker, and not
                                anywhere else. So far, it has been true on the files I have tried it on.

                                /David
