1 - 2 millions files in one folder?

  • b007uk@gmail.com

    1 - 2 millions files in one folder?

    Hi all,
    I really need your advice.
    I have to store over a million files, 10 - 15 KB each, in one folder.
    The files are created by my PHP script; sometimes the old files are
    deleted and new ones are written.
    So, basically, on every connection my script reads/deletes/writes files
    from/to that folder.
    Right now I have only around 300,000 files in that folder, and it feels
    like it's getting slower for that script to work. It does work at the
    moment, but I am not sure what will happen when there are over a million
    files there...
    Is there any limit on the number of files that can be stored in a folder?
    Would it be better for me to use MySQL? I am not sure how MySQL will
    cope with millions of writes/reads.
    What would you recommend?
    Thank you very much!
    P.S. I am running Linux, Fedora Core 3.

  • Steve

    #2
    Re: 1 - 2 millions files in one folder?

    On Thu, 27 Jul 2006 02:49:10 -0700, b007uk wrote:
    >Hi all,
    >I really need your advice.
    >I have to store over a million files, 10 - 15 kb each, in one folder.
    >The files are created by my php script, sometimes the old files are
    >deleted and new ones are written.
    >So, basically on every connection my script reads/deletes/writes files
    >from/to that folder.
    >Right now i have only around 300 000 files in that folder, and it feels
    >like its getting slower for that script to work. It does work at the
    >moment, but i am not sure what will happen when there is over a million
    >files there...
    >Are there any limits of files that can be stored in a folder?
    >Would it be better for me to use mysql? I am not sure how mysql will
    >cope with millions of writes/reads
    >What would you recommend?
    >Thank you very much!
    >p.s.I am running linux, fedora core 3

    In a word... *you're crazy*!!! Look at the way that files are stored under
    Linux, with the different file systems. With ext2/3, god knows how many
    levels of indirection you'll be going through to even manage to index the
    directory.

    You need to do a lot of reading, a lot of customization, and a load of
    benchmarking to get this to work. And, tbh, I'd find another solution.
    There must be a way to subdivide this data to get an acceptable number of
    files ( thousands or less!!! ) in each directory.

    Steve


    • Erwin Moller

      #3
      Re: 1 - 2 millions files in one folder?

      b007uk@gmail.com wrote:
      >Hi all,
      >I really need your advice.
      >I have to store over a million files, 10 - 15 kb each, in one folder.
      >The files are created by my php script, sometimes the old files are
      >deleted and new ones are written.
      >So, basically on every connection my script reads/deletes/writes files
      >from/to that folder.
      >Right now i have only around 300 000 files in that folder, and it feels
      >like its getting slower for that script to work. It does work at the
      >moment, but i am not sure what will happen when there is over a million
      >files there...
      >Are there any limits of files that can be stored in a folder?
      >Would it be better for me to use mysql? I am not sure how mysql will
      >cope with millions of writes/reads
      >What would you recommend?
      >Thank you very much!
      >p.s.I am running linux, fedora core 3

      Hi,

      Don't. :P
      If you know that folder will contain millions of files, the underlying OS
      will need more and more time to find the right file.
      Some OSes are smarter than others; I do not know the details.

      But better safe than sorry, so if possible, use a database. These things
      are set up to easily handle massive table lookups by means of (smart)
      indexing.

      A simple approach:
      (Postgres notation, not MySQL, which I avoid)

      Create a table that holds your content:
      create table files (
        fileid serial primary key,
        filename text,
        content text
      );

      Now you can get the content of each file based on its id very fast because
      fileid is primary key, and thus indexed.
      If you want to use the filename, index that one too.

      So this is very fast:
      SELECT content FROM files WHERE (fileid=238756);
      because fileid is indexed.

      And if you indexed filename too, this will also be very fast:
      SELECT content FROM files WHERE (filename='myfile_3_4_2006.txt');
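The table and the indexed lookups above could be exercised from PHP roughly as follows. This is only a sketch under assumptions that are not in the thread: it uses PDO with an in-memory SQLite database so that it runs without a server (the post itself uses Postgres notation), and the helper names saveFile() and loadFile() and the index name files_filename_idx are made up for illustration.

```php
<?php
// Sketch only: in-memory SQLite via PDO stands in for Postgres/MySQL.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Same three columns as the table in the post (SQLite spelling of "serial").
$pdo->exec('CREATE TABLE files (
    fileid   INTEGER PRIMARY KEY AUTOINCREMENT,
    filename TEXT NOT NULL,
    content  TEXT NOT NULL
)');
// Index the filename so lookups by name do not scan the whole table.
$pdo->exec('CREATE UNIQUE INDEX files_filename_idx ON files (filename)');

// Hypothetical helper: store a file, overwriting a row with the same name.
function saveFile(PDO $pdo, string $filename, string $content): void
{
    // REPLACE INTO is SQLite/MySQL syntax, not Postgres.
    $stmt = $pdo->prepare('REPLACE INTO files (filename, content) VALUES (?, ?)');
    $stmt->execute([$filename, $content]);
}

// Hypothetical helper: return the stored content, or null if there is none.
function loadFile(PDO $pdo, string $filename): ?string
{
    $stmt = $pdo->prepare('SELECT content FROM files WHERE filename = ?');
    $stmt->execute([$filename]);
    $content = $stmt->fetchColumn();
    return $content === false ? null : $content;
}

saveFile($pdo, 'myfile_3_4_2006.txt', 'hello');
echo loadFile($pdo, 'myfile_3_4_2006.txt'), "\n";
```

The unique index serves two purposes here: it makes the lookup by name an index search, and it lets the write path replace an old "file" instead of piling up duplicates.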

      Alternatively: maybe you can move your whole approach to a database and
      produce the results when needed instead of making millions of files.
      This is hard to say since I do not know the underlying problem, but in
      general you can solve this with a well-designed database.

      Hope that helps,

      Regards,
      Erwin Moller


      • Geoff Berrow

        #4
        Re: 1 - 2 millions files in one folder?

        Message-ID: <44c8961b$0$4523$e4fe514c@news.xs4all.nl> from Erwin Moller
        contained the following:
        >A simple approach:
        >(Postgres notation, not mySQL which I avoid)
        Last time I checked, MySQL was being used for tinyurl.com

        http://tinyurl.com/

        --
        Geoff Berrow (put thecat out to email)
        It's only Usenet, no one dies.
        My opinions, not the committee's, mine.
        Simple RFDs http://www.ckdog.co.uk/rfdmaker/


        • Erwin Moller

          #5
          Re: 1 - 2 millions files in one folder?

          Geoff Berrow wrote:
          >Message-ID: <44c8961b$0$4523$e4fe514c@news.xs4all.nl> from Erwin Moller
          >contained the following:
          >
          >>A simple approach:
          >>(Postgres notation, not mySQL which I avoid)
          >
          >Last time I checked it mysql was being used for tinyurl.com
          >
          >http://tinyurl.com/

          So what?
          My choice of database is not based on tinyurl.com using something or not.
          ;-)

          The reason I prefer PostgreSQL over MySQL has more to do with support for
          foreign keys (which failed silently in MySQL), transactions, etc.
          When I picked my preferred database a few years ago, MySQL was no
          comparison to Postgres.

          Yes, I know: since MySQL offered the use of InnoDB in combination with mysqli,
          they solved these serious shortcomings.

          Also: I do not say MySQL sucks or anything, it is just that MySQL matured only a
          short while ago and you'll have to tweak it before you can use FKs and
          transactions and the like, but yeah, it can be done with InnoDB.

          I think the main reason for MySQL's popularity lies in the fact that they offered
          it on M$ systems, where Postgres was only running on *nix. That too is in
          the past, by the way: PostgreSQL has been available on M$ for years now.

          Regards,
          Erwin Moller


          • b007uk@gmail.com

            #6
            Re: 1 - 2 millions files in one folder?

            You are right :(
            That's probably why I can't even enter that folder now, it takes ages :(
            I'll try to change it to work with MySQL...
            To tell you the truth, I am a bit afraid of MySQL; I was always storing
            data in folders/files, but I think I'll manage...
            And no, I can't predict the file names, so I can't organize them; the files are
            generated from user input.
            Is it possible to store complete HTML files with all their tags in MySQL?
            Thank you very much!


            • Geoff Berrow

              #7
              Re: 1 - 2 millions files in one folder?

              Message-ID: <44c89b83$0$4524$e4fe514c@news.xs4all.nl> from Erwin Moller
              contained the following:
              >>>A simple approach:
              >>>(Postgres notation, not mySQL which I avoid)
              >>
              >>Last time I checked it mysql was being used for tinyurl.com
              >>
              >>http://tinyurl.com/
              >
              >So what?
              >My choice of database is not based on tinyurl.com using something or not.
              >;-)

              I know that, but it may have been seen as indicating that you thought
              MySQL could not handle large numbers.

              >The reason I prefer Postgresql above mySQL has more to do with support of
              >Foreign Keys (which failed silently in mySQL), transactions, etc.
              >When I made my pick of prefered database a few years ago, mySQL was no
              >comparision to Postgres.

              Not quite sure what you mean by foreign keys failing silently but
              perhaps this is off topic for this group.


              • Markus Ernst

                #8
                Re: 1 - 2 millions files in one folder?

                b007uk@gmail.com wrote:
                >[...]
                >Is it possible to store complete html files with all its tags in mysql?

                Rather store the HTML code in a text field. To display it, just get the
                contents of that field with PHP, echo it and exit. No need to make it a
                file at all!

                --
                Markus


                • Erwin Moller

                  #9
                  Re: 1 - 2 millions files in one folder?

                  Geoff Berrow wrote:
                  >Message-ID: <44c89b83$0$4524$e4fe514c@news.xs4all.nl> from Erwin Moller
                  >contained the following:
                  >
                  >>>>A simple approach:
                  >>>>(Postgres notation, not mySQL which I avoid)
                  >>>
                  >>>Last time I checked it mysql was being used for tinyurl.com
                  >>>
                  >>>http://tinyurl.com/
                  >>
                  >>So what?
                  >>My choice of database is not based on tinyurl.com using something or not.
                  >>;-)

                  Hi Geoff,

                  >I know that, but it may have been seen as indicating that you thought
                  >MySQL could not handle large numbers

                  No, that was not what I meant.
                  I just meant: if you start using a database, Postgres has (had) some
                  advantages over MySQL.

                  >>The reason I prefer Postgresql above mySQL has more to do with support of
                  >>Foreign Keys (which failed silently in mySQL), transactions, etc.
                  >>When I made my pick of prefered database a few years ago, mySQL was no
                  >>comparision to Postgres.
                  >
                  >Not quite sure what you mean by foreign keys failing silently but
                  >perhaps this is off topic for this group.

                  Yes, a little off topic, but that happens all the time in here. ;-)

                  What I mean by 'failing silently' is this:
                  create table tbluser(
                    userid serial primary key,
                    username text
                  );

                  create table tblarticle(
                    articleid serial primary key,
                    writtenby integer references tbluser(userid),
                    title text,
                    content text
                  );

                  and then:
                  insert into tbluser (username) values ('Geoff');

                  Suppose the userid for that insert (serial/autonumber) is 1.
                  Now with MySQL, if I insert an illegal value for writtenby in tblarticle,
                  like this:
                  insert into tblarticle (writtenby, title, content) VALUES
                  (33, 'my title', 'bla');

                  it just fails to check the constraint, and boldly inserts 33 for writtenby,
                  which should actually give an error (foreign key constraint violation, or
                  something like that).

                  I would rather have MySQL say: "What does 'references' mean in your
                  table definition? I do not know that word."
                  instead of pretending it understands, but never enforcing the constraint.

                  I had the same kind of trouble with transactions in MySQL, that is why I
                  said it matured just a short while ago (with InnoDB and mysqli).

                  For simple data storage, this presents no problem, but once your database
                  gets more complex, you really want to be able to rely on FK constraints.
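The difference between a constraint that is only parsed and one that is enforced can be reproduced in a few lines. The sketch below does not use MySQL: as a stand-in it uses SQLite through PDO, where foreign key enforcement is switched per connection with PRAGMA foreign_keys. The helper name insertOrphan() is made up for illustration; the tables follow the example above.

```php
<?php
// Sketch only: SQLite stands in for the databases discussed in the thread.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE tbluser (
    userid   INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT
)');
$pdo->exec('CREATE TABLE tblarticle (
    articleid INTEGER PRIMARY KEY AUTOINCREMENT,
    writtenby INTEGER REFERENCES tbluser (userid),
    title     TEXT,
    content   TEXT
)');
$pdo->exec("INSERT INTO tbluser (username) VALUES ('Geoff')"); // gets userid 1

// Hypothetical helper: tries to insert an article written by the
// non-existent user 33 and reports whether the database accepted it.
function insertOrphan(PDO $pdo): bool
{
    try {
        $pdo->exec("INSERT INTO tblarticle (writtenby, title, content)
                    VALUES (33, 'my title', 'bla')");
        return true;
    } catch (PDOException $e) {
        return false;
    }
}

// Enforcement off: REFERENCES is parsed but not checked.
$pdo->exec('PRAGMA foreign_keys = OFF');
$acceptedWithoutEnforcement = insertOrphan($pdo);

// Enforcement on: the same insert is rejected with an exception.
$pdo->exec('PRAGMA foreign_keys = ON');
$acceptedWithEnforcement = insertOrphan($pdo);

var_dump($acceptedWithoutEnforcement, $acceptedWithEnforcement);
```

With enforcement off the orphan row goes in without any complaint, which is the "silent" behaviour described above; with enforcement on the insert fails loudly.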

                  Anyway, this is off topic indeed. :-)

                  Regards,
                  Erwin Moller


                  • Erwin Moller

                    #10
                    Re: 1 - 2 millions files in one folder?

                    b007uk@gmail.com wrote:
                    >You are right :(
                    >Thats probably why i can't even enter that folder now, it takes ages :(
                    >I'll try to change it to work with mysql...
                    >To tell you the truth i am a bit affraid of mysql, was always storing
                    >data in folders/files, but i think i'll manage...
                    >And no, i can't predict the file names, so cant organize it, files are
                    >generated from the user input.

                    In that case make sure you put an index on the filename in your table.
                    It will speed up the lookups a lot, but it will slow down the
                    inserts/updates.

                    So if your system is busy looking up most of the time: use an index.
                    In case you are inserting almost all the time, leave it out.

                    That is just a very general rule of thumb.
                    If you really want to know what is going on, use a profiler.
                    But forget about the profiler for now, and start learning SQL. :-)
                    (If you want to use MySQL, nothing wrong with that. I was just taking a
                    jab at MySQL. It can probably easily do what you want. So just go with MySQL.)
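                    >Is it possible to store complete html files with all its tags in mysql?

                    Yes, no problem.
                    For a column of type TEXT or VARCHAR, the HTML is just a bunch of
                    characters. (In MySQL, TEXT holds up to 65,535 bytes, which covers
                    10 - 15 KB files; VARCHAR was limited to 255 characters before MySQL 5.0.3.)
                    Pay attention to escaping, however.
                    SQL uses ' as the string delimiter, so if you use ' in your HTML, be
                    sure you escape it. MySQL has all kinds of escaping functions, as does PHP;
                    prefer prepared statements (PDO or mysqli), or a database-aware function
                    such as mysqli_real_escape_string(), over the generic addslashes().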
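One way to sidestep manual escaping altogether is a prepared statement with placeholders. The sketch below assumes PDO with an in-memory SQLite database so that it runs as-is; against MySQL only the connection string would differ. The table name pages and its columns are made up for illustration.

```php
<?php
// Sketch only: in-memory SQLite via PDO; table and column names are hypothetical.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE pages (name TEXT PRIMARY KEY, html TEXT NOT NULL)');

// HTML containing both kinds of quotes, the case that breaks naive SQL strings.
$html = "<html><body><p class='note'>It's a \"quoted\" page</p></body></html>";

// Placeholders keep the data separate from the SQL text, so the quotes in
// $html never need manual escaping.
$insert = $pdo->prepare('INSERT INTO pages (name, html) VALUES (:name, :html)');
$insert->execute([':name' => 'this-is-one-file', ':html' => $html]);

// Read the page back; to serve it, echo the column and exit.
$select = $pdo->prepare('SELECT html FROM pages WHERE name = :name');
$select->execute([':name' => 'this-is-one-file']);
$stored = $select->fetchColumn();

echo $stored;
```

Because the value travels separately from the statement text, the same code works whether the user-supplied HTML contains quotes, backslashes or anything else.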

                    Regards,
                    Erwin Moller
                    >Thank you very much!


                    • b007uk@gmail.com

                      #11
                      Re: 1 - 2 millions files in one folder?

                      Thanks a lot, I'll do that!
                      I guess it's time to learn MySQL @)


                      • Jerry Stuckle

                        #12
                        Re: 1 - 2 millions files in one folder?

                        Erwin Moller wrote:
                        >Geoff Berrow wrote:
                        >
                        >>Message-ID: <44c89b83$0$4524$e4fe514c@news.xs4all.nl> from Erwin Moller
                        >>contained the following:
                        >>
                        >>>>>A simple approach:
                        >>>>>(Postgres notation, not mySQL which I avoid)
                        >>>>
                        >>>>Last time I checked it mysql was being used for tinyurl.com
                        >>>>
                        >>>>http://tinyurl.com/
                        >>>
                        >>>So what?
                        >>>My choice of database is not based on tinyurl.com using something or not.
                        >>>;-)
                        >
                        >Hi Geoff,
                        >
                        >>I know that, but it may have been seen as indicating that you thought
                        >>MySQL could not handle large numbers
                        >
                        >No, that was not what I ment.
                        >I just ment: if you start using a database, Postgres has (had) some
                        >advantages above mySQL.

                        Sure. And MySQL has advantages over Postgres. Both are good
                        databases, with their own advantages and disadvantages.

                        >>>The reason I prefer Postgresql above mySQL has more to do with support of
                        >>>Foreign Keys (which failed silently in mySQL), transactions, etc.
                        >>>When I made my pick of prefered database a few years ago, mySQL was no
                        >>>comparision to Postgres.
                        >>
                        >>Not quite sure what you mean by foreign keys failing silently but
                        >>perhaps this is off topic for this group.
                        >
                        >Yes a little off topic, but that happens all the time in here. ;-)
                        >
                        >What I mean by 'failing silently' is this:
                        >create table tbluser(
                        >userid serial primary key,
                        >username text
                        >)
                        >
                        >create table tblarticle(
                        >articleid serial primary key,
                        >writtenby integer references tbluser.userid,
                        >title text,
                        >content text
                        >)
                        >
                        >and then:
                        >insert into tbluser (username) values ('Geoff');
                        >
                        >suppose the userid for that insert (serial/autonumber) is 1.
                        >Now with mySQL, if I insert an illegal value for writtenby in tblarticle,
                        >like this:
                        >insert into tblarticle (writtenby,title,content) VALUES
                        >(33, 'my title', 'bla');
                        >
                        >it just fails to check the contstraint, and boldly inserts 33 for writtenby,
                        >which should actually give an error (Foreign Key contraint violation, os
                        >something like that).

                        Not a failure. Documented behavior when not using InnoDB.

                        >I rather had mySQL say: "What does 'references' mean in your
                        >tabledefinition? I do not know that word."
                        >instead of pretending it understands, but never enforcing the constraint.

                        But REFERENCES is part of the SQL standard, and they are trying to
                        adhere to the standard.

                        >I had the same kind of trouble with transactions with mySQL, that is why I
                        >said it matured just a short while ago (with innoDB and imysql).

                        Again, documented behavior.

                        >For simple datastorage, this presents no problem, but once your database
                        >gets more complex, you really want to be able to rely on FK constraints.

                        True. But the bottom line is: know your tools!

                        >Anyway, this is off topic indeed. :-)
                        >
                        >Regards,
                        >Erwin Moller

                        --
                        ==================
                        Remove the "x" from my email address
                        Jerry Stuckle
                        JDS Computer Training Corp.
                        jstucklex@attglobal.net
                        ==================


                        • Miguel Cruz

                          #13
                          Re: 1 - 2 millions files in one folder?

                          b007uk@gmail.com wrote:
                          >I have to store over a million files, 10 - 15 kb each, in one folder.
                          >The files are created by my php script, sometimes the old files are
                          >deleted and new ones are written.
                          >So, basically on every connection my script reads/deletes/writes files
                          >from/to that folder.
                          >Right now i have only around 300 000 files in that folder, and it feels
                          >like its getting slower for that script to work. It does work at the
                          >moment, but i am not sure what will happen when there is over a million
                          >files there...
                          >Are there any limits of files that can be stored in a folder?

                          No (depends on the filesystem but in general no).

                          However, with many filesystems the search time will get really bad when
                          you have so many files in one folder.

                          Instead you can make a little hash structure, it's easy to do and will
                          provide you a significant performance boost.

                          Let's say your files are all named with a sequence of 6 random letters
                          (like "rjudfx" and "qopmnu" and "zsijpa").

                          Make yourself 26 directories inside of your one large directory: 'a',
                          'b', 'c', 'd', 'e', etc.

                          Then store the files in the directory named after the first letter. file
                          "rjudfx" would go inside 'r', and so on.

                          You can make some quick, easy functions to add the directory prefix onto
                          the names when you are reading and writing them.

                          function hashname($filename)
                          {
                              // The first character of the name selects the subdirectory.
                              return $filename[0] . "/{$filename}";
                          }

                          Then, instead of doing fopen($filename, 'r'), just do
                          fopen(hashname($filename), 'r').

                          This way the search space is cut to roughly 1/26 of what it was before
                          (assuming the first letters are spread evenly), and accessing the files
                          will be much faster.

                          miguel
                          --
                          Photos from 40 countries on 5 continents: http://travel.u.nu
                          Latest photos: Malaysia; Thailand; Singapore; Spain; Morocco
                          Airports of the world: http://airport.u.nu


                          • b007uk@gmail.com

                            #14
                            Re: 1 - 2 millions files in one folder?

                            Thank you!
                            Maybe I won't need to use MySQL after all!
                            File names are words or phrases that may have digits, separated by '-',
                            like this: this-is-one-file.txt this-1-is-another.txt and-more.txt
                            I'll try to change '-' to '/' and save it like this:
                            ../t/this/is/one/file.txt
                            It should work.
                            Thank you for the idea again!


                            • Miguel Cruz

                              #15
                              Re: 1 - 2 millions files in one folder?

                              b007uk@gmail.com wrote:
                              >Thank you!
                              >Maybe I won't need to use mysql after all!
                              >File names are words or frases that may have digits, separated by '-',
                              >like this: this-is-one-file.txt this-1-is-another.txt and-more.txt
                              >ill try to change '-' to '/' and save it like that:
                              >./t/this/is/one/file.txt
                              >it should work
                              >Thank you for the idea again!

                              If you do this you will have to make a lot of directories all the time.

                              If the names are pretty unpredictable like that, how about just taking
                              the md5() of the name and using the first character of that? That way
                              you get 16 buckets to spread them out over.
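The md5() idea could look roughly like the sketch below. The function names bucketPath() and bucketWrite(), the $depth parameter and the temporary demo directory are assumptions made for illustration; they are not from the thread.

```php
<?php
// Hypothetical helper: maps a file name to "<base>/<bucket>/<name>", where the
// bucket is the first $depth hex characters of md5($filename).
function bucketPath(string $baseDir, string $filename, int $depth = 1): string
{
    $bucket = substr(md5($filename), 0, $depth);
    return $baseDir . '/' . $bucket . '/' . $filename;
}

// Hypothetical helper: writes $content under its bucket (default depth 1),
// creating the bucket directory on first use, and returns the full path.
function bucketWrite(string $baseDir, string $filename, string $content): string
{
    $path = bucketPath($baseDir, $filename);
    $dir = dirname($path);
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    file_put_contents($path, $content);
    return $path;
}

// Demo in a throwaway directory under the system temp dir.
$base = sys_get_temp_dir() . '/bucket_demo_' . getmypid();
$path = bucketWrite($base, 'this-is-one-file.txt', 'hello');
echo $path, "\n";
```

Assuming the hashes spread the names about evenly, one hex character gives 16 buckets, so a million files is still about 62,500 per directory; two characters ($depth = 2) give 256 buckets and roughly 3,900 files each, which is closer to the "thousands or less" suggested earlier in the thread. Since the names come from user input, they should also be checked for '/' and '..' before being used in a path.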

                              miguel
