personal document mgmt system idea

  • Sandy Norton

    personal document mgmt system idea

    Hi folks,

    I have been mulling over an idea for a very simple python-based
    personal document management system. The source of this possible
    solution is the following typical problem:

    I accumulate a lot of files (documents, archives, pdfs, images, etc.)
    on a daily basis and storing them in a hierarchical file system is
    simple but unsatisfactory:

    - deeply nested hierarchies are a pain to navigate
    and to reorganize
    - different file systems have inconsistent and weak schemes
    for storing metadata, e.g. compare the variety of incompatible
    schemes in Windows alone (Office docs vs. PDFs etc.).

    I would like a personal document management system that:

    - is of adequate and usable performance
    - can accommodate data files of up to 50MB
    - is simple and easy to use
    - promotes maximum programmability
    - allows for the selective replication (or backup) of data
    over a network
    - allows for multiple (custom) classification schemes
    - is portable across operating systems

    The system should promote the following simple pattern:

    receive file -> drop it into 'special' folder

    after an arbitrary period of doing the above n times -> run
    application

    for each file in folder:
        if automatic metadata extraction is possible:
            scan file for metadata and populate fields accordingly
            fill in missing metadata
        else:
            enter metadata
        store file

    every now and then:
        run replicator function of application -> will backup data
        over a network
        # this will make specified files available to co-workers
        # accessing a much larger web-based non-personal version of the
        # docmanagement system.
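The "for each file in folder" step above could be sketched roughly as follows in present-day Python 3. This is only an illustration under assumptions: the extractor registry, the required field names and the `ask` callback are hypothetical and not part of any existing library or of the prototype described in this post.

```python
import os
import shutil


def extract_text_metadata(path):
    """Toy extractor (hypothetical): use the first line of a text file as its title."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return {"title": handle.readline().strip()}


# Hypothetical registry mapping a file extension to an automatic extractor.
EXTRACTORS = {".txt": extract_text_metadata}
# Hypothetical set of fields every stored document must carry.
REQUIRED_FIELDS = ("title", "subject")


def ingest(inbox, store, ask):
    """Move every file from `inbox` to `store` and return its metadata records.

    `ask(filename, field)` supplies a value for any field that could not be
    extracted automatically, for example by prompting the user.
    """
    records = []
    for name in sorted(os.listdir(inbox)):
        source = os.path.join(inbox, name)
        if not os.path.isfile(source):
            continue
        # Use automatic extraction when an extractor exists for this file type.
        extractor = EXTRACTORS.get(os.path.splitext(name)[1].lower())
        metadata = extractor(source) if extractor else {}
        # Fill in whatever is still missing by asking.
        for field in REQUIRED_FIELDS:
            if not metadata.get(field):
                metadata[field] = ask(name, field)
        target = os.path.join(store, name)
        shutil.move(source, target)
        metadata["path"] = target
        records.append(metadata)
    return records
```

A call such as `ingest("inbox", "store", lambda name, field: input("%s for %s: " % (field, name)))` would prompt for each missing field; both directories are assumed to exist already.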

    My initial prototyping efforts involved creating a single test table
    in MySQL (later to include fields for Dublin Core metadata elements)
    and a BLOB field for the data itself. My present dev platform is
    Windows XP Pro, MySQL 4.1.1-alpha, MySQL-python connector v.0.9.2
    and Python 2.3.3. However, I will be testing the same app on Mac OS X
    and Linux Mandrake 9.2 as well.

    The first problem I've run into is that MySQL or the MySQL
    connector crashes when the size of one BLOB reaches a certain point:
    in this case an .avi file of 7.2 MB.

    Here's the code:

    <code>

    import sys, time, os, zlib
    import MySQLdb, _mysql


    def initDB(db='test'):
        connection = MySQLdb.Connect("localhost", "sa")
        cursor = connection.cursor()
        cursor.execute("use %s;" % db)
        return (connection, cursor)

    def close(connection, cursor):
        connection.close()
        cursor.close()

    def drop_table(cursor):
        try:
            cursor.execute("drop table tstable")
        except:
            pass

    def create_table(cursor):
        cursor.execute('''create table tstable
            ( id INTEGER PRIMARY KEY AUTO_INCREMENT,
              name VARCHAR(100),
              data BLOB
            );''')

    def process(data):
        data = zlib.compress(data, 9)
        return _mysql.escape_string(data)

    def populate_table(cursor):
        files = [(f, os.path.join('testdocs', f)) for f in
                 os.listdir('testdocs')]
        for filename, filepath in files:
            t1 = time.time()
            data = open(filepath, 'rb').read()
            data = process(data)
            # IMPORTANT: you have to quote the binary txt even after
            # escaping it.
            cursor.execute('''insert into tstable (id, name, data)
                values (NULL, '%s', '%s')''' % (filename, data))
            print time.time() - t1, 'seconds for ', filepath


    def main ():
        connection, cursor = initDB()
        # doit
        drop_table(cursor)
        create_table(cursor)
        populate_table(cursor)
        close(connection, cursor)


    if __name__ == "__main__":
        t1 = time.time()
        main ()
        print '=> it took total ', time.time() - t1, 'seconds to complete'

    </code>

    <traceback>

    >pythonw -u "test_blob.py"
    0.155999898911 seconds for testdocs\business plan.doc
    0.0160000324249 seconds for testdocs\concept2businessprocess.pdf
    0.0160000324249 seconds for testdocs\diagram.vsd
    0.0149998664856 seconds for testdocs\logo.jpg
    Traceback (most recent call last):
      File "test_blob.py", line 59, in ?
        main ()
      File "test_blob.py", line 53, in main
        populate_table(cursor)
      File "test_blob.py", line 44, in populate_table
        cursor.execute('''insert into tstable (id, name, data) values
        (NULL, '%s', '%s')''' % (filename, data))
      File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 95, in execute
        return self._execute(query, args)
      File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 114, in _execute
        self.errorhandler(self, exc, value)
      File "C:\Engines\Python23\Lib\site-packages\MySQLdb\connections.py", line 33, in defaulterrorhandler
        raise errorclass, errorvalue
    _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
    >Exit code: 1

    </traceback>

    My Questions are:

    - Is my test code at fault?

    - Is this the wrong approach to begin with: i.e. is it a bad idea to
    store the data itself in the database?

    - Am I using the wrong database? (or is the connector just buggy?)


    Thanks to all.

    best regards,

    Sandy Norton
  • John J. Lee

    #2
    Re: personal document mgmt system idea

    sandskyfly@hotmail.com (Sandy Norton) writes:

    > I have been mulling over an idea for a very simple python-based
    > personal document management system. The source of this possible
    > solution is the following typical problem:
    >
    > I accumulate a lot of files (documents, archives, pdfs, images, etc.)
    > on a daily basis and storing them in a hierarchical file system is
    > simple but unsatisfactory:
    >
    > - deeply nested hierarchies are a pain to navigate
    > and to reorganize
    > - different file systems have inconsistent and weak schemes
    > for storing metadata e.g. compare variety of incompatible
    > schemes in windows alone (office docs vs. pdfs etc.).
    >
    > I would like a personal document management system that:
    [...]
    > The system should promote the following simple pattern:
    [...]

    Pybliographer 2 is aiming at these features (but a lot more besides).
    Work has been slow for a long while, but several new releases of
    pyblio 1 have come out recently, and work is taking place on pyblio 2.
    There are design documents on the web at pybliographer.org. Why not
    muck in and implement what you want with Pyblio?

    [...]
    > My initial prototyping efforts involved creating a single test table
    > in mysql (later to include fields for dublin-core metadata elements)
    > and a BLOB field for the data itself. My present dev platform is
    > windows XP pro, mysql 4.1.1-alpha, MySQL-python connector v.0.9.2
    > and python 2.3.3. However, I will be testing the same app on Mac OS X
    > and Linux Mandrake 9.2 as well.

    ATM Pyblio only runs on GNOME, but that's going to change.

    > The first problem I've run into is that mysql or the MySQL
    > connector crashes when the size of one BLOB reaches a certain point:
    > in this case an .avi file of 7.2 mb.
    >
    > Here's the code:
    [...]
    > _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone
    > away')
    > >Exit code: 1
    >
    > </traceback>
    >
    > My Questions are:
    >
    > - Is my test code at fault?
    >
    > - Is this the wrong approach to begin with: i.e. is it a bad idea to
    > store the data itself in the database?

    Haven't read your code, but the error certainly strongly suggests a
    MySQL configuration problem.
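Two server-side limits are worth checking in a case like this, though neither is confirmed as the cause in this thread. The MySQL manual lists a statement larger than the server's `max_allowed_packet` setting as one cause of error 2006, and a plain `BLOB` column holds at most 65,535 bytes, so larger values need `MEDIUMBLOB` or `LONGBLOB`. The sketch below just encodes those documented size limits; the helper names are made up for illustration.

```python
# Maximum payload sizes of MySQL's binary column types, in bytes.
BLOB_LIMITS = [
    ("TINYBLOB", 2**8 - 1),
    ("BLOB", 2**16 - 1),
    ("MEDIUMBLOB", 2**24 - 1),
    ("LONGBLOB", 2**32 - 1),
]


def smallest_blob_type(size):
    """Return the smallest column type able to hold `size` bytes."""
    for name, limit in BLOB_LIMITS:
        if size <= limit:
            return name
    raise ValueError("value too large for any BLOB type")


def fits_in_packet(statement_size, max_allowed_packet):
    """True if a statement of this many bytes is within the packet limit.

    `max_allowed_packet` is whatever the server is configured with; the
    caller has to look it up, it is not guessed here.
    """
    return statement_size <= max_allowed_packet
```

For the 7.2 MB file mentioned above, `smallest_blob_type(int(7.2 * 1024 * 1024))` gives `"MEDIUMBLOB"`, and a 50 MB file would need `"LONGBLOB"`.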


    John


    • John Roth

      #3
      Re: personal document mgmt system idea

      I wouldn't put the individual files in a database - that's what
      file systems are for. The exception is small files (and by the
      time you say ".doc" in MS Word, it's no longer a small
      file) where you can save substantial space by consolidating
      them.

      John Roth



      • Stephan Diehl

        #4
        Re: personal document mgmt system idea

        Sandy Norton wrote:

        Hi Sandy,

        looks like this will be the year of personal document management projects.
        Since I'm involved in a similar project (hope I can go Open Source with it),
        here are some of my thoughts.

        > Hi folks,
        >
        > I have been mulling over an idea for a very simple python-based
        > personal document management system. The source of this possible
        > solution is the following typical problem:

        [...]

        > The first problem I've run into is that mysql or the MySQL
        > connector crashes when the size of one BLOB reaches a certain point:
        > in this case an .avi file of 7.2 mb.

        Just dump your files somewhere in the filesystem and keep a record of it in
        your database.
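A minimal sketch of that idea, assuming Python 3 and using `sqlite3` from the standard library as a stand-in for the MySQL setup discussed in this thread. The table layout and function names are hypothetical; the point is only that the database row stores the path and a little metadata while the bytes stay on disk.

```python
import os
import shutil
import sqlite3


def open_catalog(db_path):
    """Open (or create) the catalog database and its single table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               name TEXT NOT NULL,
               path TEXT NOT NULL UNIQUE,
               size INTEGER NOT NULL
           )"""
    )
    return conn


def add_document(conn, source, store_dir):
    """Copy `source` into `store_dir` and record where it went."""
    os.makedirs(store_dir, exist_ok=True)
    name = os.path.basename(source)
    target = os.path.join(store_dir, name)
    shutil.copy2(source, target)
    # Parameter binding (the ? placeholders) avoids manual escaping.
    cursor = conn.execute(
        "INSERT INTO documents (name, path, size) VALUES (?, ?, ?)",
        (name, target, os.path.getsize(target)),
    )
    conn.commit()
    return cursor.lastrowid


def find_by_name(conn, pattern):
    """Return stored paths whose file name matches a SQL LIKE pattern."""
    rows = conn.execute(
        "SELECT path FROM documents WHERE name LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]
```

With this layout the size of a file no longer matters to the database, since only a short row is written per document.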

        In addition, a real (text) search engine might be of help. I'm using swish-e
        (www.swish-e.org) and am very pleased with it.

        Maybe, before you invest too much time into such a project, you should check
        out the following:

        Chandler (http://www.osafoundation.org)
        if it's finished, it will do exactly what you are aiming for (and it's
        written in Python)

        ReiserFS (see www.namesys.com -> Future Vision)

        Gnome Storage (http://www.gnome.org/~seth/storage)

        WinFS
        (http://msdn.microsoft.com/Longhorn/u...S/default.aspx)

        Hope that helps

        Stephan




        • Sandy Norton

          #5
          Re: personal document mgmt system idea

          John J. Lee:

          > Pybliographer 2 is aiming at these features (but a lot more besides).
          > Work has been slow for a long while, but several new releases of
          > pyblio 1 have come out recently, and work is taking place on pyblio 2.
          > There are design documents on the web at pybliographer.org. Why not
          > muck in and implement what you want with Pyblio?

          Thanks for the reference, Pyblio definitely seems interesting and I
          will be looking into this project closely.

          cheers.

          Sandy


          • Sandy Norton

            #6
            Re: personal document mgmt system idea

            John Roth wrote:

            > I wouldn't put the individual files in a data base - that's what
            > file systems are for. The exception is small files (and by the
            > time you say ".doc" in MS Word, it's no longer a small
            > file) where you can save substantial space by consolidating
            > them.

            There seems to be consensus that I shouldn't store files in the
            database. This makes sense as filesystems seem to be optimized for,
            um, files (-;

            As I want to get away from deeply nested directories, I'm going to
            test two approaches:

            1. store everything in a single folder and hash each file name to give
            a unique id

            2. create a directory structure based upon a calendar year and store
            the daily downloads automatically.
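As a rough illustration of the first approach (Python 3, helper names hypothetical), the id can be derived with `hashlib`. One caveat: hashing only the file name gives the same id to two different files that happen to share a name, so a content-based id is shown as an alternative. This is a sketch under those assumptions, not a tested design.

```python
import hashlib
import os


def name_id(filename):
    """Derive a fixed-length id from the file name, keeping the extension."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    return digest + os.path.splitext(filename)[1].lower()


def content_id(path, chunk_size=65536):
    """Alternative: derive the id from the file's bytes instead of its name."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large files are never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Either id can then serve as the stored file name in the single folder, with the original name kept in the metadata record.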

            I can finally use some code I'd written before for something like this
            purpose:

            <code>

            from pprint import pprint
            import os
            import calendar


            class Calendirs:

                months = {
                    1 : 'January',
                    2 : 'February',
                    3 : 'March',
                    4 : 'April',
                    5 : 'May',
                    6 : 'June',
                    7 : 'July',
                    8 : 'August',
                    9 : 'September',
                    10 : 'October',
                    11 : 'November',
                    12 : 'December'
                }

                wkdays = {
                    0 : 'Monday',
                    1 : 'Tuesday',
                    2 : 'Wednesday',
                    3 : 'Thursday',
                    4 : 'Friday',
                    5 : 'Saturday',
                    6 : 'Sunday'
                }

                def __init__(self, year):
                    self.year = year

                def calendir(self):
                    '''returns list of calendar matrices'''
                    mc = calendar.monthcalendar
                    cal = [(self.year, m) for m in range(1,13)]
                    return [mc(y,m) for (y, m) in cal]

                def yearList(self):
                    res=[]
                    weekday = calendar.weekday
                    m = 0
                    for month in self.calendir():
                        lst = []
                        m += 1
                        for week in month:
                            for day in week:
                                if day:
                                    day_str = Calendirs.wkdays[weekday(self.year,
                                                                       m, day)]
                                    lst.append( (str(m)+'.'+Calendirs.months[m],
                                                 str(day)+'.'+day_str) )
                        res.append(lst)
                    return res

                def make(self):
                    for month in self.yearList():
                        for m, day in month:
                            path = os.path.join(str(self.year), m, day)
                            os.makedirs(path)

            Calendirs(2004).make()

            </code>


            I don't know which method will perform better or be more usable...
            testing testing testing.
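A lighter variant of the calendar approach, sketched here in Python 3 as an assumption rather than as part of the code above, creates only the directory for a given day when it is first needed, using the same `year/month-number.MonthName/day-number.WeekdayName` layout. The function names are hypothetical, and the English month and weekday names rely on the default locale being left unchanged.

```python
import calendar
import datetime
import os


def daily_dir(root, day):
    """Return e.g. root/2004/1.January/1.Thursday for the given date."""
    # calendar.month_name and calendar.day_name follow the current locale;
    # they are English unless the program has called locale.setlocale().
    month_part = "%d.%s" % (day.month, calendar.month_name[day.month])
    day_part = "%d.%s" % (day.day, calendar.day_name[day.weekday()])
    return os.path.join(root, str(day.year), month_part, day_part)


def ensure_daily_dir(root, day=None):
    """Create the directory for `day` (today by default) and return its path."""
    path = daily_dir(root, day or datetime.date.today())
    os.makedirs(path, exist_ok=True)
    return path
```

Creating directories on demand avoids a tree of mostly empty folders and does not fail when run a second time in the same year.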

            regards,

            Sandy


            • Sandy Norton

              #7
              Re: personal document mgmt system idea

              Stephan Diehl wrote:

              [...]

              > Just dump your files somewhere in the filesystem and keep a record of it in
              > your database.

              I think I will go with this approach. (see other posting for details)

              > In addition, a real (text) search engine might be of help. I'm using swish-e
              > (www.swish-e.org) and am very pleased with it.

              Just downloaded it... looks good. Now if it also had a python api (-;

              > Maybe, before you invest too much time into such a project, you should check
              > out the following:
              >
              > Chandler (http://www.osafoundation.org)
              > if it's finished, it will do exactly what you are aiming for (and
              > it's written in Python)

              Still early stages... I see they dropped the ZODB.

              > ReiserFS (see www.namesys.com -> Future Vision)
              > Gnome Storage (http://www.gnome.org/~seth/storage)
              > WinFS
              > (http://msdn.microsoft.com/Longhorn/u...S/default.aspx)

              Wow! Very exciting stuff... I guess we'll just have to wait and see what develops.

              > Hope that helps

              Yes. Very informative. Cheers for the help.

              > Stephan

              Sandy


              • John Abel

                #8
                Re: personal document mgmt system idea

                Have you looked at the modules available from divmod.org for your text
                searching?



                • Stephan Diehl

                  #9
                  Re: personal document mgmt system idea

                  Sandy Norton wrote:

                  > Stephan Diehl wrote:
                  >
                  > [...]

                  [...]
                  >> In addition, a real (text) search engine might be of help. I'm using
                  >> swish-e (www.swish-e.org) and am very pleased with it.
                  >
                  > Just downloaded it... looks good. Now if it also had a python api (-;

                  I'm just using the command line interface via os.system and the popenX
                  calls.
                  The only thing that (unfortunately) is not possible is to remove a document
                  from the index :-(
                  If you need any help, just drop me a line.
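For illustration, wrapping a command line tool from Python might look like the sketch below. It uses the `subprocess` module of current Python 3 rather than `os.system` or the popenX calls, and the swish-e flags shown (`-w` for the query, `-f` for the index file) are an assumption to be checked against the swish-e documentation; the helper names are made up.

```python
import subprocess


def run_tool(command):
    """Run an external command and return its standard output as a list of lines.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def swish_search_command(query, index_file, binary="swish-e"):
    """Build the argument list for a search (the flags are an assumption)."""
    return [binary, "-w", query, "-f", index_file]
```

A search would then be something like `run_tool(swish_search_command("python", "docs.index"))`, with the output lines parsed by the caller. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.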

                  >> Maybe, before you invest too much time into such a project, you should
                  >> check out the following:
                  >>
                  >> Chandler (http://www.osafoundation.org)
                  >> if it's finished, it will do exactly what you are aiming for
                  >> (and it's written in Python)
                  >
                  > Still early stages... I see they dropped the ZODB.

                  Did they? If they succeed, Chandler will rock. My personal opinion is that
                  they are trying to do too much at once. I guess that a better filesystem will make
                  most of the document management type applications obsolete.
                  The big problem, of course, is to define 'better' in a meaningful way.

                  >> ReiserFS (see www.namesys.com -> Future Vision)
                  >> Gnome Storage (http://www.gnome.org/~seth/storage)
                  >> WinFS
                  >> (http://msdn.microsoft.com/Longhorn/u...S/default.aspx)
                  >
                  > Wow! Very exciting stuff... I guess we'll just have to wait and see what
                  > develops.

                  Or go the other way: build a new filesystem prototype application in Python,
                  see if it works out as intended, and then build a proper file system.

                  >> Hope that helps
                  >
                  > Yes. Very informative. Cheers for the help.
                  >
                  >> Stephan
                  >
                  > Sandy
