which is the better option for directory hashing to store large number of image files?

  • theCancerus

    which is the better option for directory hashing to store large number of image files?

    Hi All,

    I am not sure if this is the right place to ask this question, but I am
    sure some of you have faced this problem. I have already found some
    posts related to this, but not the answer I am looking for.

    My problem is that I have to upload images and store them. I am using
    the filesystem for that.

    The setup is something like this: there will be items/groups/users, each of
    which can have up to 6 images, and each image needs to be scaled to 4
    different sizes, i.e. every item can have up to 24 images of varying sizes.

    Now, the standard way of storing these files would be to store them in
    subdirectories based on some hash.

    My partial solution is to split the four types of files into four
    fixed base folders, one for each dimension.

    Since the filename is in the format "YmdHis", I decided to use the
    directory structure Y/m/d/<filename>,
    but I realize that even this could be inefficient.

    So now I am thinking about going further by creating a
    Y/m/d/H/i/<filename> directory structure.

    Now my question is how to go about creating subdirectories below the base
    folders. Will my scheme hold, or should I use an md5 hash over the
    filename, as suggested by others, then take 2-3 characters and create one
    or two levels of directory structure and store the files there?
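
For reference, the date-based layout described in this post could be sketched in Python roughly as follows. This is only an illustration: the function name, the `images` base folder, and the `.jpg` extension are assumptions, not details from the post.

```python
from datetime import datetime
from pathlib import PurePosixPath


def date_sharded_path(filename: str, base: str = "images") -> PurePosixPath:
    """Map a 'YmdHis'-style filename to base/Y/m/d/H/i/filename."""
    stem = filename.split(".", 1)[0]
    # strptime validates the timestamp; a malformed name raises ValueError.
    ts = datetime.strptime(stem[:14], "%Y%m%d%H%M%S")
    return PurePosixPath(base, ts.strftime("%Y/%m/%d/%H/%M"), filename)


print(date_sharded_path("20070917000914.jpg"))
# images/2007/09/17/00/09/20070917000914.jpg
```

Note that every file uploaded within the same minute still lands in the same directory under this scheme.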

    Regards,
    Amit

  • Jerry Stuckle

    #2
    Re: which is the better option for directory hashing to store large number of image files?

    theCancerus wrote:
    > [...]
    > now my question is how to go about creating subdirectories below base
    > folders, will my scheme hold or should i use md5 hash as suggested by
    > others, over the filename and then take 2-3 characters and create one
    > or two level of directory structure and then store the files?

    I use databases for this.

    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    jstucklex@attglobal.net
    ==================


    • NoDude

      #3
      Re: which is the better option for directory hashing to store large number of image files?

      I personally use something like /images/front/controller/row_id/ -
      that way I only need to store the name of the image.

      On Sep 17, 2:49 pm, Jerry Stuckle <jstuck...@attglobal.net> wrote:
      > theCancerus wrote:
      > > [...]
      > > now my question is how to go about creating subdirectories below base
      > > folders, will my scheme hold or should i use md5 hash as suggested by
      > > others, over the filename and then take 2-3 characters and create one
      > > or two level of directory structure and then store the files?
      >
      > I use databases for this.


      • Steve

        #4
        Re: which is the better option for directory hashing to store large number of image files?

        > Moral: Programming, as well as life, is not always an either-or.
        > Sometimes a compromise/hybrid is the best solution.
        >
        > --
        > Shelly

        ahhh, but shelly, the thing i like most is that in programming, it is always
        either/or: on/off. to say otherwise is to not know programming. the same
        holds true for life. you either do or do not. any notions about the nobility
        or superiority of human action in his contemplation of life are simply
        false, save the fact that there is none of either. do or do not is all that
        remains and that directly linked to his own survivability - as is the
        impetus of all animals.

        compromise. chuckle.



        • Shelly

          #5
          Re: which is the better option for directory hashing to store large number of image files?


          "Steve" <no.one@example.com> wrote in message
          news:3rwHi.805$3C.788@newsfe05.lga...
          >> Moral: Programming, as well as life, is not always an either-or.
          >> Sometimes a compromise/hybrid is the best solution.
          >>
          >> --
          >> Shelly
          >
          > ahhh, but shelly, the thing i like most is that in programming, it is
          > always either/or: on/off. to say otherwise is to not know programming. the
          > same holds true for life. you either do or do not. any notions about the
          > nobility or superiority of human action in his contemplation of life are
          > simply false, save the fact that there is none of either. do or do not is
          > all that remains and that directly linked to his own survivability - as is
          > the impetus of all animals.
          >
          > compromise. chuckle.

          So, I take it that if you are fed a meal that is a wonderfully prepared,
          10-pound filet mignon, you either (a) eat all of it or (b) eat none of it?

          or,

          If you are faced with a court appearance for excessive speeding in your car
          you should either be acquitted or should get the death sentence?

          On one project about 25 years ago I needed to modify a very large
          application that was written in Fortran. I needed dynamic allocation.
          According to you, I should have been faced with two choices. One was to
          emulate dynamic allocation by setting aside a large part of memory and doing
          my own allocation from that memory heap. A second would have been to
          totally rewrite that entire (largggggeeeee) application in C. I chose a
          "compromise". I wrote a small module in C and used that in conjunction with
          the rest of the Fortran code.

          The point here is that there are two extremes in handling his situation.
          Either avoid a database and just use the file system, or avoid the file
          system and put all of the contents of the file into a blob field in the
          database. Often, the better way is to use the database as a rapid search
          engine for a file in the file system.
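
A minimal sketch of that hybrid approach, using Python's built-in `sqlite3` module: the database holds only the lookup keys and the path, while the image bytes stay on the file system. The table layout, the column names, and the example path are all illustrative assumptions, not something specified in this thread.

```python
import sqlite3
from typing import Optional


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the index that maps an image to its path on disk."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS images (
               item_id INTEGER NOT NULL,
               slot    INTEGER NOT NULL,   -- which of the item's images (1..6)
               size    TEXT    NOT NULL,   -- one of the four scaled sizes
               path    TEXT    NOT NULL,   -- location on the file system
               PRIMARY KEY (item_id, slot, size)
           )"""
    )
    return conn


def register_image(conn, item_id: int, slot: int, size: str, path: str) -> None:
    """Record where an already-stored image file lives."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO images VALUES (?, ?, ?, ?)",
            (item_id, slot, size, path),
        )


def find_image(conn, item_id: int, slot: int, size: str) -> Optional[str]:
    """Return the stored path, or None if the image is unknown."""
    row = conn.execute(
        "SELECT path FROM images WHERE item_id = ? AND slot = ? AND size = ?",
        (item_id, slot, size),
    ).fetchone()
    return row[0] if row else None


conn = open_index()
register_image(conn, 42, 1, "thumb", "thumb/3f/a9/20070917000914.jpg")
print(find_image(conn, 42, 1, "thumb"))  # thumb/3f/a9/20070917000914.jpg
```

Because the application looks paths up by key, the directory layout on disk can change later without touching anything except the `path` column.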

          I guess you aren't married? I have been for over four decades. Believe me,
          "all or nothing" just doesn't work. Even with a switch for the lights you
          can always add a dimmer.

          By the way, I have been programming for over forty years. We are not
          talking ones and zeros, true or false, here. We are talking design
          philosophy -- and that is usually a compromise among various alternatives to
          achieve the most efficient results in the shortest time for the least cost.

          Shelly



          • Andy Hassall

            #6
            Re: which is the better option for directory hashing to store large number of image files?

            On Mon, 17 Sep 2007 00:09:14 -0700, theCancerus <thecancerus@gmail.com> wrote:
            >My problem is that i have to upload images and store them. I am using
            >filesystem for that.
            >
            >setup is something like this, their will be items/groups/user each can
            >have upto 6 images which needs to be scaled to 4 different sizes ie
            >every item can have upto 24 images of varying sizes.
            >
            >now the standard way of storing these files would be to store them in
            >subdirectories based on some hash.
            >
            >my partial solution is to split the four types of files into four
            >fixed base folders for each dimension,
            >
            >since filename is in format "YmdHis" i decided to use directory
            >structure as Y/m/d/<filename>.
            >but i realize that even this could be inefficient.
            >
            >so now i am thinking about going one more level by creating
            >Y/m/d/H/i/<filename> directory structure.
            >
            >now my question is how to go about creating subdirectories below base
            >folders, will my scheme hold or should i use md5 hash as suggested by
            >others, over the filename and then take 2-3 characters and create one
            >or two level of directory structure and then store the files?

            Splitting the files by date (down to whatever resolution) is potentially still
            susceptible to a large number arriving at the same time, and ending up with a
            large number of files in a single directory. If the goal is to spread the files
            across a number of directories, then you probably want the value that
            determines the directories to be approximately randomly distributed, and to
            have a bounded and reasonable number of possible directory names.

            md5 of some property (name? or even contents?) likely fits this reasonably
            well. The number of bytes you use for subdirectories depends on however many
            images you have. If you don't actually expose the
            hash-used-for-storage-directory in the URL, then you're free to re-hash the
            images' directories if you end up needing more levels to split the directories
            (if it was in the URL, then it would change the URLs of all your images, which
            is something to be avoided).
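
As a rough illustration of the md5 approach, here is a small Python sketch. The function name, the `images` base folder, and the defaults of two levels with two hex characters each are assumptions chosen for the example; the number of levels and characters would need to match the expected image count.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(filename: str, base: str = "images",
                levels: int = 2, chars: int = 2) -> PurePosixPath:
    """Place a file under `levels` subdirectories named after md5 hex prefixes."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Consecutive slices of the digest become the nested directory names.
    parts = [digest[i * chars:(i + 1) * chars] for i in range(levels)]
    return PurePosixPath(base, *parts, filename)


print(hashed_path("20070917000914.jpg"))
print(hashed_path("20070917000914.jpg", levels=2, chars=3))
```

The path is a pure function of the name, so if the public URL is built from an ID or the filename alone, the files can later be moved to a layout with different `levels` or `chars` without changing any URL.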

            Substrings of just the name may work as well, although there could be a bias
            to particular letters or numbers depending on where the names come from and
            what language they're in.
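
The bias is easy to check empirically. The following Python sketch compares how many buckets are used, and how full the fullest one gets, when the directory key is a raw name prefix versus an md5 prefix. The file names are synthetic, timestamp-like strings made up for the example.

```python
import hashlib
from collections import Counter


def bucket_counts(names, key):
    """Count how many names fall into each bucket produced by `key`."""
    return Counter(key(name) for name in names)


def name_prefix(name: str) -> str:
    return name[:2]


def md5_prefix(name: str) -> str:
    return hashlib.md5(name.encode("utf-8")).hexdigest()[:2]


# Synthetic names that all begin with the same date digits.
names = [f"20070917{n:06d}.jpg" for n in range(10_000)]

by_name = bucket_counts(names, name_prefix)
by_hash = bucket_counts(names, md5_prefix)

# Every name starts with "20", so the raw prefix yields a single bucket.
print(len(by_name), max(by_name.values()))
# The md5 prefix spreads the same names over up to 256 buckets.
print(len(by_hash), max(by_hash.values()))
```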


            There's more than one way to do it, as ever, and the way to go depends on what
            exactly you're doing. Have you checked whether your initial assumption is true,
            though? Whilst "large number of entries in a directory is slow" is true in many
            filesystems, it's not a universal truth. What's the threshold for your
            filesystem, and are you planning on getting anywhere close to it in the
            foreseeable future? (after overestimating it a bit to be safely pessimistic)

            --
            Andy Hassall :: andy@andyh.co.uk :: http://www.andyh.co.uk
            http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool


            • Jerry Stuckle

              #7
              Re: which is the better option for directory hashing to store large number of image files?

              Shelly wrote:
              > "Steve" <no.one@example.com> wrote in message
              > news:3rwHi.805$3C.788@newsfe05.lga...
              >>> Moral: Programming, as well as life, is not always an either-or.
              >>> Sometimes a compromise/hybrid is the best solution.
              >>>
              >>> --
              >>> Shelly
              >> ahhh, but shelly, the thing i like most is that in programming, it is
              >> always either/or: on/off. to say otherwise is to not know programming. the
              >> same holds true for life. you either do or do not. any notions about the
              >> nobility or superiority of human action in his contemplation of life are
              >> simply false, save the fact that there is none of either. do or do not is
              >> all that remains and that directly linked to his own survivability - as is
              >> the impetus of all animals.
              >>
              >> compromise. chuckle.
              >
              > So, I take it that if you are fed a meal that is a wonderfully prepared,
              > 10-pound filet mignon, you either (a) eat all of it or (b) eat none of it?

              (a). (b) is not even an option!

              > or,
              >
              > If you are faced with a court appearance for excessive speeding in your car
              > you should either be acquitted or should get the death sentence?

              No, but I should either be acquitted or found guilty. And if found
              guilty, I should receive the appropriate punishment. The death sentence
              is not appropriate for all infractions.

              > On one project about 25 years ago I needed to modify a very large
              > application that was written in Fortran. I needed dynamic allocation.
              > According to you, I should have been faced with two choices. One was to
              > emulate dynamic allocation by setting aside a large part of memory and doing
              > my own allocation from that memory heap. A second would have been to
              > totally rewrite that entire (largggggeeeee) application in C. I chose a
              > "compromise". I wrote a small module in C and used that in conjunction with
              > the rest of the Fortran code.

              What is your point?

              > The point here is that there are two extremes in handling his situation.
              > Either avoid a database and just use the file system, or avoid the file
              > system and put all of the contents of the file into a blob field in the
              > database. Often, the better way is to use the database as a rapid search
              > engine for a file in the file system.

              Sure, there are extremes. But have you actually tried storing the data
              in a blob field and tuning your database for it? I thought not. Access
              is quite fast - virtually always faster than a mix of the two, because
              you don't have to make both a database and a file system call. Less
              overhead - the database returns the blob just as effectively as it does
              a file name.

              > I guess you aren't married? I have been for over four decades. Believe me,
              > "all or nothing" just doesn't work. Even with a switch for the lights you
              > can always add a dimmer.

              Sure it does. If I don't let my wife have her own way ALL the time, I
              get "nothing". :-)

              > By the way, I have been programming for over forty years. We are not
              > talking ones and zeros, true or false, here. We are talking design
              > philosophy -- and that is usually a compromise among various alternatives to
              > achieve the most efficient results in the shortest time for the least cost.
              >
              > Shelly

              Sure we are. Everything in programming comes down to ones and zeros.
              It's just the approach to getting there that differs.

              --
              ==================
              Remove the "x" from my email address
              Jerry Stuckle
              JDS Computer Training Corp.
              jstucklex@attglobal.net
              ==================


              • theCancerus

                #8
                Re: which is the better option for directory hashing to store large number of image files?

                On Sep 17, 11:29 pm, Andy Hassall <a...@andyh.co.uk> wrote:
                > [...]
                > There's more than one way to do it, as ever, and the way to go depends on what
                > exactly you're doing. Have you checked whether your initial assumption is true,
                > though? Whilst "large number of entries in a directory is slow" is true in many
                > filesystems, it's not a universal truth. What's the threshold for your
                > filesystem, and are you planning on getting anywhere close to it in the
                > foreseeable future? (after overestimating it a bit to be safely pessimistic)

                Hi Andy,

                Thanks for the sensible reply.
                We need to upload around 2.5 million images as seed data for the
                website. We are using a Linux system (CentOS), so any ideas what would be
                a reasonable number of files per directory?

                And unless thousands of users want to upload images at the same time, I
                am sure it will never happen that there are a large number of files in
                one directory every minute.

                Anyway, I have decided to go with MD5, as a 3/3 letter combination gives
                me a good spread for a long time :)
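
To put numbers on that choice, here is a small worked calculation in Python. It assumes that "3/3" means two directory levels of three hex characters each, that the hash output is spread evenly, and that all 2.5 million files sit under one base folder; none of this is confirmed in the thread.

```python
HEX_DIGITS = 16  # md5 hex output uses the characters 0-9 and a-f


def leaf_directories(levels: int, chars: int) -> int:
    """Number of distinct leaf directories for a given prefix scheme."""
    return (HEX_DIGITS ** chars) ** levels


def average_per_directory(n_files: int, levels: int, chars: int) -> float:
    """Average files per leaf directory, assuming an even hash spread."""
    return n_files / leaf_directories(levels, chars)


SEED_IMAGES = 2_500_000

print(leaf_directories(2, 3))                               # 16777216
print(round(average_per_directory(SEED_IMAGES, 2, 3), 3))   # 0.149
print(round(average_per_directory(SEED_IMAGES, 2, 2), 1))   # 38.1
print(round(average_per_directory(SEED_IMAGES, 1, 3), 1))   # 610.4
```

Under these assumptions a 3/3 split has more leaf directories than seed files, so most directories would hold zero or one image. A 2/2 split or a single three-character level keeps the per-directory count modest with far fewer directories.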


                • Andy Hassall

                  #9
                  Re: which is the better option for directory hashing to store large number of image files?

                  On Tue, 18 Sep 2007 05:26:12 -0000, theCancerus <thecancerus@gmail.com> wrote:
                  >On Sep 17, 11:29 pm, Andy Hassall <a...@andyh.co.uk> wrote:
                  >>
                  >> There's more than one way to do it, as ever, and the way to go depends on what
                  >> exactly you're doing. Have you checked whether your initial assumption is true,
                  >> though? Whilst "large number of entries in a directory is slow" is true in many
                  >> filesystems, it's not a universal truth. What's the threshold for your
                  >> filesystem, and are you planning on getting anywhere close to it in the
                  >> foreseeable future? (after overestimating it a bit to be safely pessimistic)
                  >
                  >thanks for sensible reply.
                  >we need to upload around 2.5 million images as seed data for the
                  >website. we are using linux system(centos ) so any ideas what would be
                  >the reasonable number of files per directory?

                  So, you're probably using the ext3 filesystem? This has an option for "hashed
                  b-tree" storage of directory entries, which helps with the
                  large-number-of-files issue (at least, the relevant part of it - obviously it
                  still takes a while to iterate through them all, but accessing one file that
                  you already know the filename of doesn't have the same problems as older
                  filesystems that do a linear scan every time).

                  On my CentOS system:

                  # tune2fs -l /dev/mapper/VolGroup00-LogVol00 | grep features
                  Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file

                  The "dir_index" option says it's turned on for me, and I didn't change it, so
                  it must be the default.
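
If the check needs to be scripted, the feature line can be parsed with a few lines of Python. This sketch only parses text in the format shown above; the helper name is hypothetical, and obtaining the text (by running `tune2fs -l` as root on the right device) is left out.

```python
def filesystem_features(tune2fs_output: str) -> set:
    """Extract the feature flags from the text printed by `tune2fs -l`."""
    for line in tune2fs_output.splitlines():
        if line.startswith("Filesystem features:"):
            # Everything after the first colon is a space-separated flag list.
            return set(line.split(":", 1)[1].split())
    return set()


sample = (
    "Filesystem features: has_journal ext_attr resize_inode dir_index "
    "filetype needs_recovery sparse_super large_file\n"
)

print("dir_index" in filesystem_features(sample))  # True
```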

                  I don't know what the limits of this are, though.

                  --
                  Andy Hassall :: andy@andyh.co.uk :: http://www.andyh.co.uk
                  http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
