Concurrency issues with "Tree" structures.

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Robin Tucker

    Concurrency issues with "Tree" structures.

    Hi,

    I'm currently implementing a database with a tree structure in a table. The
    nodes in the tree are stored as records with a column called "Parent". The
    root of the tree has a "NULL" parent. The path to each node is stored in
    the column "Path" and is of the form "\000001\000002 \000003\" etc. The
    latter enabling me to fetch subtrees using the "LIKE" predicate. I also
    have created the relation "ID" <-> "ID_Parent, effectively the table is
    related to itself. I did this so that an attempt to remove a parent when
    that parent still has children will fail due to referential integrity.
    Unfortunately, in order to delete subtrees, I have to first set all of the
    nodes in the subtree to point to the "NULL" parent so that I don't get
    caught with integrity errors during deletes.

    Unlike a typical linear system, any given node record is related to more
    than one of the other records. What I mean is it is possible to follow a
    chain from any given node back to the root (ancestors) or collect a series
    of branches and leaves (descendants) and so it is not reasonable to consider
    any given node in isolation. Changes to any given descendant can trigger
    changes which are propagated to it's ancestors. I am not using recursion to
    do this, rather, I am using an iterative approach to select and update the
    parent, then the parents parent, etc. according to it's new state (given by
    the state of it's immediate descendants), right back up to the root.

    Now, the problem comes when I consider concurrency with respect to this
    scheme. It seems to me, that locking the record I am updating is not
    sufficient to ensure clients are kept synchronised or the integrity of the
    tree structure is correct. I think I need to lock all ancestors of the tree
    (HOLDLOCK) before performing any operation on a given node. Is this
    reasonable? Also, consider the "delete" problem given above. I really
    should HOLDLOCK on the entire subtree of any node I wish to delete as I am
    going to set the entire subtrees parent values to NULL. I don't want
    another client to perform a read on part of the subtree while the nodes are
    "parentless " pending deletion.

    Secondly, I am not sure how to handle synchronization of the tree for each
    client. How does each client know when a change has been made to the tree?
    Consider a client holding a copy of the tree in memory. Another client
    deletes a subtree. The first client attempts to update one of the deleted
    subtree nodes and fails because the node no longer exists. At this point,
    in the eyes of the first client, all of the ancestors and all of the
    descendants of the node in question must become suspect. Should the
    software now attempt to re-build this part of the tree? It seems that any
    operation on any of the nodes in the tree will potentially make a lot of
    other nodes suspect and so my application will be doing a lot of updating.

    I would be interested to hear any insights people have on these issues,
    particularly those implementing a "file system" structure in their database
    or similar. How do you deal with these concurrency issues when manipulating
    trees?

    Thanks




    Robin



  • Thomas R. Hummel

    #2
    Re: Concurrency issues with &quot;Tree&quot ; structures.

    I believe that Joe Celko has a book that is devoted to tree structures
    in a SQl environment. Do a search on Amazon for Celko and you should
    see it. I'm sure that Joe will probably chime in here as well. My
    thoughts on the subject though...
    [color=blue]
    > The path to each node is stored in the column "Path" and is of the[/color]
    form[color=blue]
    > "\000001\000002 \000003\"[/color]
    Ack, yuck, ick! I hate that method of storing tree information myself.
    Check Joe's book for several different ways to store trees in a RDBMS.
    [color=blue]
    > also have created the relation "ID" <-> "ID_Parent, effectively the[/color]
    table is[color=blue]
    > related to itself. I did this so that an attempt to remove a parent[/color]
    when[color=blue]
    > that parent still has children will fail due to referential[/color]
    integrity.[color=blue]
    > Unfortunately, in order to delete subtrees, I have to first set all[/color]
    of the[color=blue]
    > nodes in the subtree to point to the "NULL" parent so that I don't[/color]
    get[color=blue]
    > caught with integrity errors during deletes.[/color]
    So, what you seem to be saying is that you want RI so that you can't
    delete any rows by mistake, but you want to be able to easily delete
    rows. You can't have it both ways... you either need to explicitly
    delete the children (or set the FK's to NULL) or you need to remove the
    RI.
    [color=blue]
    > Changes to any given descendant can trigger changes which are[/color]
    propagated to it's[color=blue]
    > ancestors[/color]
    This sounds like a design problem with regards to normalization to me,
    but without knowing the specifics I really can't say.

    As for the client application tracking changes to the tree, that is
    really dependent on the business requirements for your application. You
    can reload the client every X seconds or you could wait for the client
    to take an action then check to see if the action is still valid on the
    tree in its current form. To limit the number of client refreshes that
    you have to perform, each node could use a last_updated datetime column
    to track changes and you could retrieve only those nodes and their
    subtrees where the last_updated column has a value that is greater than
    the last time that you refreshed your tree.

    My suggestion would be to check out Joe's book (or Google past posts by
    Joe) to find alternative tree representations . Some of them are rather
    ingenious and easy to work with.

    HTH,
    -Tom.

    Comment

    • Erland Sommarskog

      #3
      Re: Concurrency issues with &quot;Tree&quot ; structures.

      Robin Tucker (idontwanttobes pammedanymore@r eallyidont.com) writes:[color=blue]
      > Now, the problem comes when I consider concurrency with respect to this
      > scheme. It seems to me, that locking the record I am updating is not
      > sufficient to ensure clients are kept synchronised or the integrity of
      > the tree structure is correct. I think I need to lock all ancestors of
      > the tree (HOLDLOCK) before performing any operation on a given node. Is
      > this reasonable?[/color]

      UPDLOCK would be better. Else you could run into conversion deadlocks.
      An UPDLOCK is a shared lock, so other processes can read. But only one
      can have an UPDLOCK.
      [color=blue]
      > Also, consider the "delete" problem given above. I
      > really should HOLDLOCK on the entire subtree of any node I wish to
      > delete as I am going to set the entire subtrees parent values to NULL.
      > I don't want another client to perform a read on part of the subtree
      > while the nodes are "parentless " pending deletion.[/color]

      I'm not really sure why you need to do this set NULL thing. In fact
      that is something I would avoid like the plague. But then I know
      very little of your actual business problem.

      I think Joe Celko's trick for tress is to number each node in a way so
      that a subtree is a contiguous range. Then you can blow away to whole
      subtree in one delete.

      Then again, you already had that path. Would not:

      DELETE tbl WHERE path = 'a/b/c' or path LIKE 'a/b/c/%'

      work?

      Of course, even with set-based statements, you can have interesting
      effects if two clients are it at the same time.
      [color=blue]
      > Secondly, I am not sure how to handle synchronization of the tree for
      > each client. How does each client know when a change has been made to
      > the tree?[/color]

      You will have to ask you tech lead about that. :-) Seriously, with
      no knowledge of the requirements etc, it is very difficult to answer.
      As Thomas said, you can refresh automatically with some frequency.
      For more bells and whistle you could push the change by activating
      some signaling mechanism from a trigger. But you make have to ask for
      a bigger budget to do this.



      --
      Erland Sommarskog, SQL Server MVP, esquel@sommarsk og.se

      Books Online for SQL Server SP3 at
      Get the flexibility you need to use integrated solutions, apps, and innovations in technology with your data, wherever it lives—in the cloud, on-premises, or at the edge.

      Comment

      • --CELKO--

        #4
        Re: Concurrency issues with &quot;Tree&quot ; structures.

        >> I would be interested to hear any insights people have on these
        issues, .. <<

        Get a copy of TREES & HIERARCHIES IN SQL and you can see several
        different approaches.
        [color=blue][color=green]
        >> in order to delete subtrees, I have to first set all of the nodes in[/color][/color]
        the subtree to point to the "NULL" parent so that I don't get caught
        with integrity errors during deletes. <<

        Look up the nested sets model; this one atomic DELETE FROM statement.

        Comment

        • Robin Tucker

          #5
          Re: Concurrency issues with &quot;Tree&quot ; structures.


          Hi Tom,
          [color=blue]
          > Ack, yuck, ick! I hate that method of storing tree information myself.
          > Check Joe's book for several different ways to store trees in a RDBMS.
          >[/color]

          Ok, I don't want to get into a religious war about the best representation
          of trees. I am aware of Celkos nested sets model but prefer the clarify of
          the path string approach (my system isn't going to be operating with 1,000
          users or anything, so I don't mind if it's performance isn't so good); of
          course when I refer to clarity, I mean my own personal ability to visualise
          whats going on and write code to perform the various manipulations required!
          However, I will look once again at Celkos methods because I rejected them a
          year ago for various reasons (the potential renumbering of large numbers of
          nodes during inserts for example) and am now a bit wiser. I also tried out
          nested sets with Tropashenkos' (apologies if I misspelled the name)
          materialised path (binary fractions) and found this to be unbelievably slow
          and restrictive (for various reasons mainly based on numeric accuracy).
          [color=blue]
          > So, what you seem to be saying is that you want RI so that you can't
          > delete any rows by mistake, but you want to be able to easily delete
          > rows. You can't have it both ways... you either need to explicitly
          > delete the children (or set the FK's to NULL) or you need to remove the
          > RI.[/color]

          You are right of course, I need to throw away the RI on this, then deletion
          of subtrees will be atomic and I will not have to set parents to NULL.
          [color=blue]
          >[color=green]
          >> Changes to any given descendant can trigger changes which are[/color]
          > propagated to it's[color=green]
          >> ancestors[/color]
          > This sounds like a design problem with regards to normalization to me,
          > but without knowing the specifics I really can't say.
          >[/color]

          I will solve this by locking the ancestors (as another person says in his
          post, using UPDLOCK rather than HOLDLOCK) and then processing them in any
          transaction that changes state. Thinking about it though, it will still be
          possible for a client to read a node that has not yet been processed while
          another client is currently processing. So that first client may well
          receive some processed and some unprocessed nodes back after a read
          operation (is this true though? If the processing occurs within a
          transaction, can other users read the uncommitted data?). The reason for
          this is because the algorithm for state propagation cannot be atomic (as
          fetching "ancestors" of any given node requires iteration).

          However, I suppose it is possible for the client to detect problems like
          this, by checking ancestors itself for consistency and forcing a refresh if
          it detects an inconsistency when it reads back parts of the tree. Extra
          work I think but bearing in mind the above (Celko nested sets), I suppose it
          may be possible to UPDATE all ancestors atomically. If this is indeed the
          case, then I might not need an iterative approach. But I doubt this is
          possible, as I would have to ensure updates on the set of ancestors were
          performed in order of their "depth" (leaves first of course). I don't think
          SQL has such a mechanism (tables are sets after all, not sequences). I
          would love to know how, using a path approach it is possible to fetch all
          ancestors of a given node. I don't think it can be done with just a path
          string and ID_Parent relation.

          The specifics of this is each node has a "condition" (green, yellow, red,
          undefined) according to the condition of it's children. For nodes of type
          0, it's condition is the highest condition of each of it's children, for
          nodes of type 1, it's condition is the condition of it's "most recently
          added child". For nodes of type 2, the condition is fixed by the client
          when the record is created. Condition propagation basically is tasked with
          ensuring these conditions are consistent after deletes, updates or inserts.
          [color=blue]
          > As for the client application tracking changes to the tree, that is
          > really dependent on the business requirements for your application. You
          > can reload the client every X seconds or you could wait for the client
          > to take an action then check to see if the action is still valid on the
          > tree in its current form. To limit the number of client refreshes that
          > you have to perform, each node could use a last_updated datetime column
          > to track changes and you could retrieve only those nodes and their
          > subtrees where the last_updated column has a value that is greater than
          > the last time that you refreshed your tree.
          >[/color]

          I have another idea about this (I will timestamp the nodes). I have a
          worker thread that serializes access to the database from the various client
          controls, firing callbacks when it performs an operation. This thread can
          check through the tree in it's idle time to ensure it is synchronized. Of
          course, if I execute an update and fail (ROWCOUNT = 0) or a delete come to
          think of it, I can force a refresh of the entire subtree and it's ancestors.
          But I would consider the entire subtree and all it's ancestors to be suspect
          if I detected any given node had a later timestamp than that expected.
          [color=blue]
          > My suggestion would be to check out Joe's book (or Google past posts by
          > Joe) to find alternative tree representations . Some of them are rather
          > ingenious and easy to work with.
          >
          > HTH,
          > -Tom.
          >[/color]

          Thanks for all of your ideas.



          Robin


          Comment

          Working...