Suggestion needed on data storage format in text file

This topic is closed.
  • Manish

    Suggestion needed on data storage format in text file


    The project I am developing doesn't involve a database. I want to parse
    the mailbox file (.mbx) and store a summary in a text file for fast
    retrieval and display of information on the Inbox page.

    The suggested formats are:

    #1

    ID [4 bytes]: Subject [100 bytes]: To Address[100 bytes]: From
    Address[100 bytes]...etc...

    #2

    Instead of preassigning a fixed size to each variable (the actual data
    may be much shorter, or may grow longer), we can store the values
    contiguously, separated by some unique separator (#|#, *#*, ...)

    1324#|#Hi, How are you#|#me@google.com#|#you@google.com#|# ... and so
    on


    Which of these will be the more efficient one (there will be frequent
    inserts/deletes/updates of individual items, e.g. set message as
    read ..., delete message ..., new message ...)?

    Also, please suggest how to determine the size of each variable (100
    bytes as in #1), assign that size to the variable accordingly, and read
    it back (telling the variables apart) when required.

    Thanks.

    Manish

  • Jerry Stuckle

    #2
    Re: Suggestion needed on data storage format in text file

    Manish wrote:
    > The project I am developing doesn't involve a database. I want to parse
    > the mailbox file (.mbx) and store a summary in a text file for fast
    > retrieval and display of information on the Inbox page.
    >
    > The suggested formats are:
    >
    > #1
    >
    > ID [4 bytes]: Subject [100 bytes]: To Address[100 bytes]: From
    > Address[100 bytes]...etc...
    >
    > #2
    >
    > Instead of preassigning a fixed size to each variable (the actual data
    > may be much shorter, or may grow longer), we can store the values
    > contiguously, separated by some unique separator (#|#, *#*, ...)
    >
    > 1324#|#Hi, How are you#|#me@google.com#|#you@google.com#|# ... and so
    > on
    >
    > Which of these will be the more efficient one (there will be frequent
    > inserts/deletes/updates of individual items, e.g. set message as
    > read ..., delete message ..., new message ...)?
    >
    > Also, please suggest how to determine the size of each variable (100
    > bytes as in #1), assign that size to the variable accordingly, and read
    > it back (telling the variables apart) when required.
    >
    > Thanks.
    >
    > Manish

    Personally, I'd use a database. I wouldn't even try a flat file for
    this. Too much work trying to keep things straight.

    But you asked about the formats. The fixed-length fields will waste
    space any time the data is shorter than the space reserved. Then you
    run into the problem of someone who gets very verbose with their
    subject line and exceeds the 100 characters. And 4 bytes, stored as
    decimal digits, allows IDs only up to 9999. Is that enough? Or are you
    going to try to read/write binary (not easy in PHP)?

    The second one is problematic because the user may include your
    separator in the Subject: line (or even in a name/address if you pick
    the wrong character).

    Two other ways: use CSV format, which is well documented and supported
    by PHP and other programs. Or add a length field at the beginning of
    each field, specifying how many characters are in the field that follows.

    But I'd still use a database.
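
For readers who want to try the CSV route mentioned in this reply, here is a minimal sketch using PHP's built-in fputcsv() and fgetcsv(). The field order (id, subject, to, from, read flag), the function names, and the temporary file are assumptions made for illustration, and the type declarations assume PHP 7 or later. It is not a complete index implementation.

```php
<?php
// Sketch only: one message summary per CSV line.
// Field order (id, subject, to, from, is_read) is an assumption.

function appendSummary(string $path, array $row): void
{
    $fh = fopen($path, 'a');             // append mode: new rows go at the end
    fputcsv($fh, $row, ',', '"', '\\');  // fields containing commas or quotes get quoted
    fclose($fh);
}

function readSummaries(string $path): array
{
    $rows = [];
    $fh = fopen($path, 'r');
    while (($row = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;                  // every value comes back as a string
    }
    fclose($fh);
    return $rows;
}

$path = tempnam(sys_get_temp_dir(), 'idx');
appendSummary($path, [1324, 'Hi, how are you', 'me@google.com', 'you@google.com', 0]);
appendSummary($path, [1325, 'A "quoted" subject', 'me@google.com', 'you@google.com', 1]);
$rows = readSummaries($path);
unlink($path);
```

The comma inside the first subject and the quotes inside the second survive the round trip because the CSV functions handle the quoting, which is the problem a home-made separator would otherwise have to solve.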


    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    jstucklex@attglobal.net
    ==================


    • Andy Jeffries

      #3
      Re: Suggestion needed on data storage format in text file

      On Tue, 18 Jul 2006 21:28:13 -0700, Manish wrote:
      > #1
      > ID [4 bytes]: Subject [100 bytes]: To Address[100 bytes]: From Address[100
      > bytes]...etc...
      >
      > #2
      > 1324#|#Hi, How are you#|#me@google.com#|#you@google.com#|# ... and so on
      >
      > Which of these will be the more efficient one (there will be frequent
      > inserts/deletes/updates of individual items, e.g. set message as
      > read ..., delete message ..., new message ...)?

      The first one will be more efficient from a search/replace point of view;
      the second will be more efficient from a space usage point of view.
      Efficiency is subjective.

      > Also, please suggest how to determine the size of each variable (100
      > bytes as in #1), assign that size to the variable accordingly, and read
      > it back (telling the variables apart) when required.

      substr would be used to cut out the various portions of the string (e.g. 100
      characters starting at position 4), and sprintf to build the padded string
      in the first place (or fprintf, available in PHP 5, to format and write in
      one step).
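
As a rough illustration of the substr/sprintf approach described here, the sketch below packs and unpacks one fixed-width record. The widths follow format #1 from the question (4-byte ID, then 100 bytes each for subject, to, and from). The constant and function names are hypothetical, and the type declarations assume PHP 7 or later.

```php
<?php
// Sketch only: fixed-width record, 4 + 100 + 100 + 100 = 304 bytes.
// %-100.100s left-justifies, pads with spaces to 100, and cuts off at 100.
const RECORD_FORMAT = '%04d%-100.100s%-100.100s%-100.100s';

function packRecord(int $id, string $subject, string $to, string $from): string
{
    // Note: an ID above 9999 would overflow its 4-byte slot; sprintf does not truncate numbers.
    return sprintf(RECORD_FORMAT, $id, $subject, $to, $from);
}

function unpackRecord(string $record): array
{
    return [
        'id'      => (int) substr($record, 0, 4),
        'subject' => rtrim(substr($record, 4, 100)),   // strip the padding again
        'to'      => rtrim(substr($record, 104, 100)),
        'from'    => rtrim(substr($record, 204, 100)),
    ];
}

$record = packRecord(1324, 'Hi, how are you', 'me@google.com', 'you@google.com');
$fields = unpackRecord($record);
```

These functions count bytes, not characters, so a multi-byte subject could be cut in the middle of a character; that is one of the costs of the fixed-width format.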

      If you need more than a pointer to the right functions, then it's starting
      to sound like a homework assignment and I wish you luck with it...

      Cheers,


      Andy


      --
      Andy Jeffries MBCS CITP ZCE | gPHPEdit Lead Developer
      http://www.gphpedit.org | PHP editor for Gnome 2
      http://www.andyjeffries.co.uk | Personal site and photos


      • ronverdonk
        Recognized Expert Specialist
        • Jul 2006
        • 4259

        #4
        Don't reinvent the wheel!
        Either use a database or, if you must, stay with well documented formats like CSV or XML.

        Ron :cool:


        • ronverdonk
          Recognized Expert Specialist
          • Jul 2006
          • 4259

          #5
          Originally posted by ronverdonk
          > Don't reinvent the wheel!
          > Either use a database or, if you must, stay with well documented formats like CSV or XML.
          >
          > Ron :cool:

          And to contradict myself:
          I just saw a new class 'Variable Length Coding' at PHP Classes, link:
          http://www.phpclasses.org/browse/package/3232.html

          Its description reads:
          This class can be used to compress and uncompress data using variable length encoding.

          It can read a stream of data and pack it using a pure PHP implementation of the variable length encoding algorithm.

          It can also do the opposite, reading a variable length encoded stream of data and unpacking it to restore the original uncompressed data.

          So, if you still want variable length coding, check it out!

          Ronald :cool:


          • Chung Leong

            #6
            Re: Suggestion needed on data storage format in text file

            Manish wrote:
            > The project I am developing doesn't involve a database. I want to parse
            > the mailbox file (.mbx) and store a summary in a text file for fast
            > retrieval and display of information on the Inbox page.
            >
            > The suggested formats are:
            >
            > #1
            >
            > ID [4 bytes]: Subject [100 bytes]: To Address[100 bytes]: From
            > Address[100 bytes]...etc...
            >
            > #2
            >
            > Instead of preassigning a fixed size to each variable (the actual data
            > may be much shorter, or may grow longer), we can store the values
            > contiguously, separated by some unique separator (#|#, *#*, ...)
            >
            > 1324#|#Hi, How are you#|#me@google.com#|#you@google.com#|# ... and so
            > on
            >
            > Which of these will be the more efficient one (there will be frequent
            > inserts/deletes/updates of individual items, e.g. set message as
            > read ..., delete message ..., new message ...)?
            >
            > Also, please suggest how to determine the size of each variable (100
            > bytes as in #1), assign that size to the variable accordingly, and read
            > it back (telling the variables apart) when required.
            >
            > Thanks.
            >
            > Manish

            That's the kind of project that SQLite was designed for. It's worth
            looking into.
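
To give an idea of what the SQLite suggestion involves, here is a small sketch using PDO. It assumes the pdo_sqlite extension is available. The table and column names are hypothetical, and an in-memory database stands in for a real index file.

```php
<?php
// Sketch only: message summaries in SQLite through PDO.
// 'sqlite::memory:' keeps the example self-contained; a real index would
// use a file DSN such as 'sqlite:' followed by a path of your choosing.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical schema mirroring the summary fields from the question.
$db->exec('CREATE TABLE summary (
    id        INTEGER PRIMARY KEY,
    subject   TEXT,
    to_addr   TEXT,
    from_addr TEXT,
    is_read   INTEGER DEFAULT 0
)');

$insert = $db->prepare(
    'INSERT INTO summary (id, subject, to_addr, from_addr) VALUES (?, ?, ?, ?)'
);
$insert->execute([1324, 'Hi, how are you', 'me@google.com', 'you@google.com']);

// "Set message as read" becomes a one-row update; no file rewriting by hand.
$db->prepare('UPDATE summary SET is_read = 1 WHERE id = ?')->execute([1324]);

$select = $db->prepare('SELECT subject, is_read FROM summary WHERE id = ?');
$select->execute([1324]);
$row = $select->fetch(PDO::FETCH_ASSOC);
```

The data still lives in a single file on disk with no server process, which may matter if the "no database" requirement is really a "no database server" requirement.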


            • ImOk

              #7
              Re: Suggestion needed on data storage format in text file

              My suggestion is to use XML. PHP and JavaScript both have DOM classes that
              support this format very well. It's also easily extensible. And best of
              all it's a text file.

              Sample:

              <mailbox name="some user">
                <email>
                  <id>1234</id>
                  <subject>Send me the check</subject>
                  <to>nospam@nospam.com</to>
                  <from>someone@someone.com</from>
                  <message><![CDATA[blah blah blah blah blah
              blah blah blah]]></message>
                  <attach>path to attach 1</attach>
                  <attach>path to attach 2</attach>
                </email>
                <email>
                  <id>5678</id>
                  <subject>Send me the check</subject>
                  <to>nospam@nospam.com</to>
                  <from>someone@someone.com</from>
                  <message><![CDATA[blah blah blah ]]></message>
                  <attach>path to attach 1</attach>
                  <attach>path to attach 2</attach>
                </email>
                ....etc...
              </mailbox>
              <mailbox name="some other user">
                ....
              </mailbox>

              Chung Leong wrote:
              > Manish wrote:
              >> The project I am developing doesn't involve a database. I want to parse
              >> the mailbox file (.mbx) and store a summary in a text file for fast
              >> retrieval and display of information on the Inbox page.
              >>
              >> The suggested formats are:
              >>
              >> #1
              >>
              >> ID [4 bytes]: Subject [100 bytes]: To Address[100 bytes]: From
              >> Address[100 bytes]...etc...
              >>
              >> #2
              >>
              >> Instead of preassigning a fixed size to each variable (the actual data
              >> may be much shorter, or may grow longer), we can store the values
              >> contiguously, separated by some unique separator (#|#, *#*, ...)
              >>
              >> 1324#|#Hi, How are you#|#me@google.com#|#you@google.com#|# ... and so
              >> on
              >>
              >> Which of these will be the more efficient one (there will be frequent
              >> inserts/deletes/updates of individual items, e.g. set message as
              >> read ..., delete message ..., new message ...)?
              >>
              >> Also, please suggest how to determine the size of each variable (100
              >> bytes as in #1), assign that size to the variable accordingly, and read
              >> it back (telling the variables apart) when required.
              >>
              >> Thanks.
              >>
              >> Manish
              >
              > That's the kind of project that SQLite was designed for. It's worth
              > looking into.


              • Manish

                #8
                Re: Suggestion needed on data storage format in text file

                Hi Jerry Stuckle, the project specifies that no database be used;
                otherwise it would definitely have been much easier. I have to store all
                the information in the file itself. Thanks for pointing out that
                whichever separator is chosen, however improbable, it can still occur in
                a subject line. Maybe we should use an escape character for it, as the
                mailbox file does: every new mail starts with "From ", and if that occurs
                in the message itself, it's replaced by ">From ". I will also look into
                the CSV format for storing the data.
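
The escape-character idea raised here could look roughly like the sketch below. The choice of '|' as separator and backslash as escape character is an assumption, as are the function names. The built-in CSV functions solve the same problem with less code. Type declarations assume PHP 7 or later.

```php
<?php
// Sketch only: '|' separates fields, backslash escapes '|', itself, and newlines.

function encodeFields(array $fields): string
{
    $escaped = [];
    foreach ($fields as $field) {
        // strtr() applies all three replacements in a single pass,
        // so an escape that was just inserted is never escaped again.
        $escaped[] = strtr((string) $field, ['\\' => '\\\\', '|' => '\\|', "\n" => '\\n']);
    }
    return implode('|', $escaped);
}

function decodeFields(string $line): array
{
    $fields = [''];
    $i = 0;
    $len = strlen($line);
    for ($p = 0; $p < $len; $p++) {
        $ch = $line[$p];
        if ($ch === '\\' && $p + 1 < $len) {
            $next = $line[++$p];                 // character after the escape
            $fields[$i] .= ($next === 'n') ? "\n" : $next;
        } elseif ($ch === '|') {
            $fields[++$i] = '';                  // unescaped separator: start next field
        } else {
            $fields[$i] .= $ch;
        }
    }
    return $fields;
}

$in = ['1324', 'a|b \\ c', "line1\nline2", 'me@google.com'];
$line = encodeFields($in);
$out = decodeFields($line);
```

Because newlines are escaped as well, each record stays on one physical line of the index file.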


                Hi Andy Jeffries, we are using PHP 5, so sprintf/fprintf can be used. I
                haven't come across using pointers in PHP. I will definitely try to
                learn it.


                Hi ImOk, our initial data structure was in XML format
                (an individual XML file for every user). As there can be thousands of
                emails, the file will grow large, and reading/writing may be slow and
                error-prone. So it was suggested that we use a text file.

                -----------------------------------------------------------------------------------------------------------------------------
                This is how the datastructure is
                -----------------------------------------------------------------------------------------------------------------------------

                <mails>
                  <details id="">
                    <!-- Mail type (incoming, outgoing) -->
                    <mailtype></mailtype>
                    <!-- Whether the message is saved as a template (Yes: 1, No: 0) -->
                    <istemplate></istemplate>
                    <!-- The id of the mailbox in which the mail resides (id for Inbox,
                         Personal Folders, Trash ... ) -->
                    <mailboxid></mailboxid>
                    <!-- Message priority (Normal: 1, High Priority: 2) -->
                    <priority></priority>
                    <!-- Is the message starred (Yes: 1, No: 0) -->
                    <isstarred></isstarred>
                    <!-- Is the message read (Yes: 1, No: 0) -->
                    <isread></isread>
                    <!-- Has the message been replied to (Yes: 1, No: 0) -->
                    <isreplied></isreplied>
                    <!-- Has the message been forwarded to any email (Yes: 1, No: 0) -->
                    <isforwarded></isforwarded>

                    <!-- Does the message have an attachment (Yes: 1, No: 0) -->
                    <hasattachment></hasattachment>
                    <!-- Attachment details -->
                    <attachments>
                      <attdetails id="">
                        <!-- Attachment file name -->
                        <filename></filename>
                        <!-- Attachment file size -->
                        <filesize></filesize>
                      </attdetails>
                    </attachments>

                    <!-- Sender name -->
                    <fromname></fromname>
                    <!-- Sender email -->
                    <fromemail></fromemail>
                    <!-- Total emails in the conversation (1, 2, ... ) -->
                    <totalconversation></totalconversation>
                    <!-- Main email detail id (sno), from which the conversation started -->
                    <mainemailsno></mainemailsno>
                    <!-- Emails in To field -->
                    <toemails></toemails>
                    <!-- Emails in CC field -->
                    <ccemails></ccemails>

                    <!-- Mail content in HTML format -->
                    <htmlcontent></htmlcontent>
                    <!-- Mail content in text format -->
                    <textcontent></textcontent>
                    <!-- Date and time when the message was sent -->
                    <sentdatetime></sentdatetime>
                    <!-- Message size in KB -->
                    <messagesize></messagesize>

                    <!-- Offset in mbx file -->
                    <offsetinmbx></offsetinmbx>

                    <!-- Extra details for incoming/outgoing type emails -->
                    <incomingdetails>
                    </incomingdetails>
                    <outgoingdetails>
                      <!-- Emails in BCC field -->
                      <bccemails></bccemails>
                      <!-- Message status (sent, pending) -->
                      <msgstatus></msgstatus>
                      <!-- Id of the signature to be appended to the message -->
                      <signatureid></signatureid>
                      <!-- Scheduled date and time (24 hour format) for sending the mail
                           to recipients (MM/DD/YYYY hh:mm) -->
                      <scheduledtime></scheduledtime>
                      <!-- Whether to request a return receipt (Yes: 1, No: 0) -->
                      <requestreceipt></requestreceipt>
                      <!-- Message send status (pending, sent) -->
                      <sendstatus></sendstatus>
                    </outgoingdetails>

                  </details>

                </mails>

                -----------------------------------------------------------------------------------------------------------------------------

                But the other settings will still be in an XML file.

                We are using SimpleXML functions (to get and update values) and DOM
                (to insert). The delete functionality is still not working. We are
                thinking of implementing it with preg_replace().
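
On the delete problem mentioned here: rather than editing XML text with preg_replace(), a node can be removed through DOM. The sketch below assumes the `<mails>`/`<details id="...">` structure shown above. The function name and the sample ids are made up for illustration, and type declarations assume PHP 7 or later.

```php
<?php
// Sketch only: delete one <details> element by its id attribute using DOM.

function deleteMail(DOMDocument $doc, string $id): bool
{
    $xpath = new DOMXPath($doc);
    // The id is concatenated into the XPath expression, so this sketch
    // assumes it contains no double quotes.
    $nodes = $xpath->query('/mails/details[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return false;                       // nothing to delete
    }
    $node = $nodes->item(0);
    $node->parentNode->removeChild($node);  // detach the element from the tree
    return true;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<mails>'
    . '<details id="1"><isread>0</isread></details>'
    . '<details id="2"><isread>1</isread></details>'
    . '</mails>'
);
$deleted = deleteMail($doc, '1');
$remaining = $doc->getElementsByTagName('details')->length;
```

If the rest of the code works with SimpleXML objects, dom_import_simplexml() can convert an element to its DOM node so that the same removeChild() call applies.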

                Thanks.

                Manish


                • Andy Jeffries

                  #9
                  Re: Suggestion needed on data storage format in text file

                  On Wed, 19 Jul 2006 21:07:06 -0700, Manish wrote:
                  > Hi Andy Jeffries, we are using PHP 5, so sprintf/fprintf can be used. I
                  > haven't come across using pointers in PHP. I will definitely try to learn
                  > it.

                  It's not pointers but string parsing (extracting a section of a string,
                  and formatting a string so that each field has an exact length).

                  > Hi ImOk, our initial data structure was in XML format
                  > (an individual XML file for every user). As there can be thousands of emails,
                  > the file will grow large, and reading/writing may be slow and error-prone. So
                  > it was suggested that we use a text file.

                  I don't wish to sound offensive, but if you can't correctly write to an
                  XML file without errors, why do you think you'll be able to do it to a
                  flat file using functions/methods you don't know?

                  Also, bear in mind if you use a database it will also handle locking from
                  multiple processes easily, which you will have to handle yourself in this
                  situation.

                  Don't think "we'll only have one user accessing their account through a
                  single web instance so we won't have concurrency issues" - people these
                  days may use browser tabs to work on their mail concurrently.

                  And you really do run the risk of data loss/corruption if you don't
                  correctly lock access to the file.
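
The locking that a flat-file design has to do for itself can be sketched with PHP's flock(). The function name and the temporary file are hypothetical. flock() is advisory, so this only helps if every reader and writer of the file goes through the same locking. Type declarations assume PHP 7 or later.

```php
<?php
// Sketch only: append a line while holding an exclusive advisory lock.

function appendLocked(string $path, string $line): bool
{
    $fh = fopen($path, 'c+');        // create if missing, never truncate
    if ($fh === false) {
        return false;
    }
    if (!flock($fh, LOCK_EX)) {      // blocks until other lock holders release
        fclose($fh);
        return false;
    }
    fseek($fh, 0, SEEK_END);         // find the end only once we own the lock
    $ok = fwrite($fh, $line . "\n") !== false;
    fflush($fh);                     // push data out before releasing the lock
    flock($fh, LOCK_UN);
    fclose($fh);
    return $ok;
}

$path = tempnam(sys_get_temp_dir(), 'idx');
appendLocked($path, 'first');
appendLocked($path, 'second');
$contents = file_get_contents($path);
unlink($path);
```

Two browser tabs updating the same index would then take turns instead of interleaving their writes.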

                  Cheers,



                  Andy



                  --
                  Andy Jeffries MBCS CITP ZCE | gPHPEdit Lead Developer
                  http://www.gphpedit.org | PHP editor for Gnome 2
                  http://www.andyjeffries.co.uk | Personal site and photos


                  • Jerry Stuckle

                    #10
                    Re: Suggestion needed on data storage format in text file

                    Manish wrote:
                    > Hi Jerry Stuckle, the project specifies that no database be used;
                    > otherwise it would definitely have been much easier. I have to store all
                    > the information in the file itself. Thanks for pointing out that
                    > whichever separator is chosen, however improbable, it can still occur in
                    > a subject line. Maybe we should use an escape character for it, as the
                    > mailbox file does: every new mail starts with "From ", and if that occurs
                    > in the message itself, it's replaced by ">From ". I will also look into
                    > the CSV format for storing the data.
                    >
                    > Hi Andy Jeffries, we are using PHP 5, so sprintf/fprintf can be used. I
                    > haven't come across using pointers in PHP. I will definitely try to
                    > learn it.
                    >
                    > Hi ImOk, our initial data structure was in XML format
                    > (an individual XML file for every user). As there can be thousands of
                    > emails, the file will grow large, and reading/writing may be slow and
                    > error-prone. So it was suggested that we use a text file.
                    >
                    -----------------------------------------------------------------------------------------------------------------------------
                    This is how the datastructure is
                    -----------------------------------------------------------------------------------------------------------------------------
                    <snip>
                    -----------------------------------------------------------------------------------------------------------------------------
                    >
                    > But the other settings will still be in an XML file.
                    >
                    > We are using SimpleXML functions (to get and update values) and DOM
                    > (to insert). The delete functionality is still not working. We are
                    > thinking of implementing it with preg_replace().
                    >
                    > Thanks.
                    >
                    > Manish

                    Manish,

                    If the problem is speed, a flat file isn't going to help you much.
                    You'll still have to encode and decode the data, no matter which
                    format you use. And even if it's faster now, all you're doing is
                    delaying the inevitable. You definitely need a database.

                    If it were me, I'd go back to them and explain why they need a database.
                    But I'm only a consultant...

                    --
                    ==================
                    Remove the "x" from my email address
                    Jerry Stuckle
                    JDS Computer Training Corp.
                    jstucklex@attglobal.net
                    ==================


                    • Chung Leong

                      #11
                      Re: Suggestion needed on data storage format in text file

                      ImOk wrote:
                      > My suggestion is to use XML. PHP and JavaScript both have DOM classes that
                      > support this format very well. It's also easily extensible. And best of
                      > all it's a text file.

                      XML, like any text format, is very inefficient when updates/deletions are
                      frequent, as you have to rewrite the file every time. For a mailbox,
                      that's unacceptable since the file will likely be fairly large. A
                      suitable format requires a directory of sorts storing the offsets of
                      records, so you can quickly seek to them and modify them in place.
                      Whatever you come up with will end up resembling a database. So why
                      not just use what's there already?


                      • ImOk

                        #12
                        Re: Suggestion needed on data storage format in text file

                        Agreed,

                        But I believe there are database engines whose native format is XML.
                        It's probably fixed length.

                        Chung Leong wrote:
                        > ImOk wrote:
                        >> My suggestion is to use XML. PHP and JavaScript both have DOM classes that
                        >> support this format very well. It's also easily extensible. And best of
                        >> all it's a text file.
                        >
                        > XML, like any text format, is very inefficient when updates/deletions are
                        > frequent, as you have to rewrite the file every time. For a mailbox,
                        > that's unacceptable since the file will likely be fairly large. A
                        > suitable format requires a directory of sorts storing the offsets of
                        > records, so you can quickly seek to them and modify them in place.
                        > Whatever you come up with will end up resembling a database. So why
                        > not just use what's there already?


                        • Manish

                          #13
                          Re: Suggestion needed on data storage format in text file

                          >I don't wish to sound offensive, but if you can't correctly write to an
                          >XML file without errors, why do you think you'll be able to do it to a
                          >flat file using functions/methods you don't know?
                          >Also, bear in mind if you use a database it will also handle locking from
                          >multiple processes easily, which you will have to handle yourself in this situation.
                          >Don't think "we'll only have one user accessing their account through a
                          >single web instance so we won't have concurrency issues" - people these
                          >days may use browser tabs to work on their mail concurrently.
                          >And you really do run the risk of data loss/corruption if you don't
                          >correctly lock access to the file.
                          It's definitely a serious issue. Opening the same files concurrently, once
                          for each browser tab, and then updating the content of the index file will
                          be less efficient.

                          E.g. there can be >1000 messages, of which, say, 2 are unread. When the user
                          reads 1 message, to update that message's status from unread to read we
                          have to update a single byte position for that message. It's critical from a
                          performance (response to user) point of view. If we do it in a database,
                          it will be much faster.
                          >If the problem is speed, a flat file isn't going to help you that much more. You'll
                          >still have to encode and decode the data, no matter which format you use. And
                          >even if it's faster now, all you're doing is delaying the inevitable. You definitely
                          >need a database.
                          >If it were me, I'd go back to them and explain why they need a database.
                          >But I'm only a consultant...
                          Surely. We will also suggest the database.
                          >XML, like any text format, is very inefficient when updates/deletions are
                          >frequent, as you have to rewrite the file every time. For a mailbox,
                          >that's unacceptable since the file will likely be fairly large. A
                          >suitable format requires a directory of sorts storing the offsets of
                          >records, so you can quickly seek to them and modify them in place.
                          The mailbox file (.mbx) will be there. We will parse it and store only
                          some of the details (including the mailbox file offset for each message) in
                          the index file (.idx or .xml, though a database would surely be best).


                          • Chung Leong

                            #14
                            Re: Suggestion needed on data storage format in text file

                            Manish wrote:
                            > Surely. We will also suggest the database.

                            Keep in mind that using a "database" doesn't necessarily imply a
                            full-blown, standalone RDBMS. An embedded database like SQLite or
                            Sleepycat would work very well in these types of situations.


                            • Andy Jeffries

                              #15
                              Re: Suggestion needed on data storage format in text file

                              On Thu, 20 Jul 2006 13:58:59 -0700, Chung Leong wrote:
                              > XML, like any text format, is very inefficient when
                              > updates/deletions are frequent, as you have to rewrite the file
                              > every time.

                              Actually, that's not strictly true. If your text format file has fixed
                              length records, you can do random writes to it if you fopen it using
                              "r+" access mode and then use fseek/fwrite to overwrite just the record
                              you're interested in (append modes such as "a+" always write at the end
                              of the file, whatever fseek says). No need to rewrite the whole file
                              every time.
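
A minimal sketch of the in-place update described here, for the "mark message as read" case. It opens the file with 'r+', because in PHP the append modes ('a', 'a+') always write at the end of the file regardless of fseek(). The 8-byte record layout (4-digit ID, one flag byte, two spaces, newline) and the names are hypothetical. Type declarations assume PHP 7 or later.

```php
<?php
// Sketch only: flip a one-byte "read" flag inside a fixed-length record.
const RECORD_SIZE = 8;   // hypothetical: 4-digit ID + flag + 2 pad bytes + "\n"
const FLAG_OFFSET = 4;   // the flag byte sits right after the ID

function markRead(string $path, int $recordNo): bool
{
    $fh = fopen($path, 'r+');    // read/write, no truncation, writes honor fseek
    if ($fh === false) {
        return false;
    }
    flock($fh, LOCK_EX);         // keep concurrent writers out
    fseek($fh, $recordNo * RECORD_SIZE + FLAG_OFFSET);
    $ok = fwrite($fh, '1') === 1;  // overwrite exactly one byte
    flock($fh, LOCK_UN);
    fclose($fh);
    return $ok;
}

$path = tempnam(sys_get_temp_dir(), 'idx');
file_put_contents($path, "13240  \n13250  \n13260  \n");
$marked = markRead($path, 1);    // second record, counting from zero
$after = file_get_contents($path);
unlink($path);
```

Only the one flag byte changes and the file length stays the same, which is what makes fixed-length records attractive for frequent small updates.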

                              Cheers,


                              Andy

                              --
                              Andy Jeffries MBCS CITP ZCE | gPHPEdit Lead Developer
                              http://www.gphpedit.org | PHP editor for Gnome 2
                              http://www.andyjeffries.co.uk | Personal site and photos
