Parsing XML file

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • zmbd
    Recognized Expert Moderator Expert
    • Mar 2012
    • 5501

    Parsing XML file

    Looking for tricks or methods of tracking the Node>Children>S ubChildren in an XML traverse.

    Loading the XML isn't an issue, I have working code for that, it's determining the level of recursion as I traverse the tree, branches, twigs, leaves. What I need to do is relate back the Leaves to the Twigs to Branches
    (yes, I know: Nodes, Child Nodes, Subchild nodes, etc...)

    Tables I'm looking at (simplified)
    [t_leaves]
    [PK][FK_MM_T2B][Information from the leaf]

    [MM_T2B]
    [PK][t_Branches][t_twigs]

    [t_twigs]
    [PK][Information about the twig]

    [t_branches]
    [PK][information about the branch]

    When I tried to do the simple External>XML-Import my poor database had some 185 tables based on the nodes in the tree... very sad indeed... and there were no relationships to link the data in one table to the next.

    No, I do not know where the XML came from.
    This isn't an urgent as right now I'm manipulating the data by hand in Excel - but it's painful to do.

    HOWEVER, if I open the file in excel it will place the data in the columns by level, etc...
    So a group label in Col(A) spans the rows under that group etc... so I know the XML parser can skim the data. I'm almost to the point of instancing Excel, opening the XML in Excel and then walking the rows! However, I know there is an elegant way of doing this if I can just figure out how to solve tacking the recursion...

    This is a Very Simplified version of the XML I'm working with and unfortunately the information is confidential.
    Code:
    'AIR CODE example xml
    <Root>
      <Group>
        <UUID>1234<UUID />
        <Name>GroupName<Name />
        <!-- Blah Blah Blah ->
        <Group>
          <UUID>7890<UUID />
          <Name>CatagoryName<Name />
          <!-- Blah Blah Blah ->
          <Entry>
            <UUID>ABCD<UUID />
            <ItemID>InternalIdOfItem<ItemID />
            <ItemName>InternalNameOfItem<ItemName />
            <!-- Blah Blah Blah ->
          <Entry />
        <Group />
        <Group>
          <!-- Here we have another Root\Group\child that follows the first child, different UUID etc... ->
        <Group />
      <Group />
    <Root />
    so when I parse the file I'd like to distribute the information so that:
    Code:
    [t_leaves]
    [AutoNum_PK][1][ABCD][InternalIdOfItem][InternalNameOfItem]
    [AutoNum_PK][2][EFGH][InternalIdOfItem2][InternalNameOfItem2]
    Code:
    [MM_T2B]
    [1][1][20]
    [2][2][1]
    [3][15][null]
    [...]
    Code:
    [t_Twigs]
    [AutoNum_PK=1][7890][CatagoryName]
    [AutoNum_PK=2][A123][CatagoryName2]
    Code:
    [t_Branches]
    [AutoNumber_PK=1][0001][GroupName]
    [...]
    [AutoNumber_PK=20][1234][GroupName20]
    (Yes I thought about a pedigree type self linking within the [t_branches] and that may be how I end up merging the data; however, it doesn't help with tracking the parent to the child or subchild.

    I'd post my VBA; however, it's very dirty right now dozens of Commented out lines of code and notations.

    This is my recursion call:
    Code:
    Call SpiderChildren(zSpider.childNodes, zRecursion, zSpiderGroupLevel)
    zRecursion As Long - it's passed into the sub, incremented by one within the recursion scope

    zSpiderGroupLev el as long - testing the <tag> to see if it's a group and if so then zSpiderGroupLev el = zRecursion

    This is my print line to the text file:
    Code:
    If PrintSpider Then _
    Print #zFreeFile, String(zRecursion, ".") & _
      " (" & " ParentNode: " & _
      zSpider.ParentNode.basename & _
      ") (Spider: " & zSpider.basename & _
      ")(text: " & zSpider.Text & ")"
    I found a few example files to work with so attached are the results. These have really nothing to do with the actual XML data - just fodder for the recursion code while I develop the parser.

    SiteXML.XML.txt <remove the .txt>

    xmlParse.txt is the output file.
    (the periods are added String(zRecursi on, ".") ) the end of file is "<*>" just tracking for me so that I know that we actually finished parsing the big file.
    Attached Files
    Last edited by zmbd; Jul 26 '18, 10:36 PM.
  • PhilOfWalton
    Recognized Expert Top Contributor
    • Mar 2016
    • 1430

    #2
    I know nothing of XML and the only thing I know about Trees & leaves is they block my view if the river & sea in summer.

    That said, and I am probably way off track, I have a routine that loads a web page, then scans the page source for a particular piece of information. I use it mainly to extract a share price from differently formatted Web pages. It works perfectly on most pages, but on some pages, the information is buried a little deeper, and I guess that is where your SubNodes come in.

    If that could be of any help, let me know.

    Phil

    Comment

    • zmbd
      Recognized Expert Moderator Expert
      • Mar 2012
      • 5501

      #3
      Good Morning PhilOfWalton
      Thank you for the offer.
      Can the code you have scraping the webpage relate the childnode back to and element in the calling node's structure

      I think the secret is as I am passing the parameters on to the next recursion I have to have pulled <UUID> and/or <Name> node and pass them in the same way I've been tracking the recursion level.

      My current thought the main loop might be (air code):
      (btw - spider name from Web-Search Spiders)
      Code:
      for each zSpider in zSpiders
      if zSpider.haschildNodes then
      if zSpider.tagName = "Group" then
      '? Flag to pull <UUID> and <NAME> passed to the next recursion to put in module level variable?
      '? Once in the module level variable how to pull out the values - would a TempVars or a global level collection be the way to go?
      'As it stands now when zRecursion is passed the value is at the recursive called scope
      Call SpiderChildren(zSpider.childNodes, zRecursion, zSpiderGroupLevel, zFlagToPullUuidName)
      end if
      Print #zFreeFile, String(zRecursion, ".") & " (" & " ParentNode: " & zSpider.ParentNode.basename & ") (Spider: " & zSpider.basename & ")(text: " & zSpider.Text & ")"

      Comment

      • PhilOfWalton
        Recognized Expert Top Contributor
        • Mar 2016
        • 1430

        #4
        Good Evening, zmbd. Sorry, as I said I nothing about nodes.

        My method is to hunt for clues.
        so here is a portion of a Web Pagehttps://www.londonstockexchange.com/...GBGBXSTMM.html

        Code:
        <div class="commonTable table-responsive">
        <table width="100%" cellspacing="0" cellpadding="0" summary="Price data">
        <tbody>
        <tr class="even">
        <td class="name">Price&nbsp;(GBX)</td>
        <td>1,155.00</td>
        <td class="name">Var % (+/-)</td>
        <td>
        <span class="green">+1.23%</span>
        <span class="green">
        (<img src="/media-sme/img/icon/up.gif" alt="Up" />
        +14.00)
        I have a table of "Clues" called Codes

        This is a picture of the "clues" to find the price


        So I scan the website page source until I find a sequence of
        <td class="name">Pr ice&nbsp;(GBX)</td>
        <td>
        </td>

        The price I am looking for is designated by "Share Price" on the second line (though I don't think it matters that the words "Share Price" is used.

        Sorry, I can't help you with the arboriculture

        Phil

        Comment

        • zmbd
          Recognized Expert Moderator Expert
          • Mar 2012
          • 5501

          #5
          Looks like you're opening the source and then reading the text directly?

          There are over 100 such patterns in the file I'm trying to parse...
          In one instance I get something like
          Code:
          <Group>
            <UUID></UUID>
            <Name></Name>
            <Comment />
            <DefaultAuthUserStatusReq />
            <Group>
               <!--Blah Blah Blah-->
               <DefaultAuthUserStatusReq>AsstMngr</DefaultAuthUserStatusReq>
            </Group>
          <Group>
          Another instance would be
          Code:
          <Group>
            <UUID></UUID>
            <Name></Name>
            <DefaultAuthUserStatusReq>Mngr</DefaultAuthUserStatusReq >
            <DefaultAuthUserStatusReq />
            <Group>
               <!--Blah Blah Blah-->
               <DefaultAuthUserStatusReq />
            </Group>
          <Group>
          So take for instance the <DefaultAuthUse rStatusReq />
          in the first case the Default user status to pull items under the first group is not set so anyone has default access to the items under this group; however, in one of the sub-groups the default is set so that at least an assistant manager (or supervisor) is required to authorize the

          In the second case the default is set to a manager or department-lead or higher and the subgroup also doesn't override that requirement

          Right now I'm just trying to pull the Group level UUID and Names - say something like KeePass
          Group for internet
          .SubGroup for Work
          ..Entries for Work
          .SubGroup for Personal
          ..Entries for Personal

          If you xport that file you get a real mess with Root/Groups/Groups etc... somehow when you import this file into a new KeePass it parses the XML and sets up the tree without any effort at all...

          so maybe this will help
          I'm using the MSXML object
          Code:
          '>>>This is an Abstract of the Main loop<<<
          'No module level variables - yet
          '
          Sub ParseXmlToTables()
          '
          Dim zMsXML as Object
          '(...)
          'assume all of the other variables are properly declared.
          'their names should provide context for usage - if unclear ask
          'Error trapping is also setup
          '
          '(... setup for the file path etc...)
          ' I should move this to the module level... so that I don't have to keep closing the file and reopening... I'll change that after this post.
          '
            Set zMsXml = CreateObject("MSXML2.DOMDocument.6.0")
          '
          'set the parser parameters and load the document
            With zMsXml
              .resolveExternals = False
              .validateOnParse = False
              .async = True
              .Load (zXmlSrcPath)
            End With
          '
          'fail if the document didn't properly load and tell why
             If (zMsXml.parseError.errorCode) Then _
                 Err.Raise Number:=zMsXml.parseError.errorCode, _
                 Source:="XML Document Load Error", _
                 Description:=zMsXml.parseError.reason
             End If
          '
          'set the xPath call
          xPath = "//Root"
          '
          'Pull the nodes from the XMLDocument
            Set zNodeList = zMsXml.selectNodes(zXpath)
          '
          'some stuff for the text file here - Freefile, Open, Print, Close
          '
          'Cannot pass the zNodeList directly - no children
          'just items so each item's child nodes have to be passed into the recursion
          '
            For zNodeLength = 0 To (zNodeList.length - 1)
              Call SpiderChildren(zNodeList.Item(zNodeLength).childNodes, 0, 0, zXmlParceFile)
          zFreeFile = FreeFile
          Open zXmlParceFile For Append As zFreeFile
          Print #zFreeFile, vbCrLf & "<*> <*> <*> <*> <*> <*> <*> <*> <*> <*> <*> <*> <*> <*> <*> <*>" & vbCrLf
          Close
            Next zNodeLength
          '
          'Cleanup and error traping...
          '
          End Sub
          Code:
          '>This is not air-code, this is the actual working code that produced the above text file.
          'Here I'm going to just post the messy VBA
          'It's a work in progress so the code isn't cleaned up in the slightest
          'A rare insight as to how my mind processes information
          '
          Sub SpiderChildren(ByRef inNode As Object, ByVal zRecursion As Integer, ByVal zSpiderGroupLevel As Integer, zXmlParceFile As String)
            Dim zSpiders As Object
            Dim zSpider As Object
            Dim zPrintSpider As Boolean
          '
          On Error GoTo zerrtrap
          '
            zPrintSpider = True
          '
          '
            Set zSpiders = inNode
            For Each zSpider In zSpiders
              'here's the recursion limit... currently set to stop at 100 deep
              If zRecursion <= 100 Then
          ''
          ''<<I'll be moving this to a module level pointer
          ''<<so that I don't have to close and re-open
          ''<<the output text file with each recursion
          ''<<the file pointer was being lost
          ''<<mostly because this is *brain-storm* code
          ;;
          ''output to a text file, find a free number, close any open files, open the text file for output
          Dim zFreeFile As Long
          Close
          zFreeFile = FreeFile
          Open zXmlParceFile For Append As zFreeFile
          '
                If zSpider.haschildNodes Then
          Print #zFreeFile, String(zRecursion, ".") & " " & zSpider.tagName & "::"
                  'Debug.Print String(zRecursion, ".") & " " & zSpider.tagName & "::"
                  zRecursion = zRecursion + 1
                  'group level
                  If zSpider.tagName = "Group" Then
                    zSpiderGroupLevel = zRecursion
          Print #zFreeFile, ">Group level: " & zSpiderGroupLevel
                    'Debug.Print ">Group level: " & zSpiderGroupLevel
                  End If
                  ''pass the child nodes to the next recursion
                  Call SpiderChildren(zSpider.childNodes, zRecursion, zSpiderGroupLevel, zXmlParceFile)
                  '
                  'don't print the parent node's text
                  zPrintSpider = False
                End If
              Else
                'let the values print as is until the recusion is under the limit... not sure what will happen here...
                zPrintSpider = False
              End If
              If zPrintSpider Then
          Print #zFreeFile, String(zRecursion, ".") & " (" & " ParentNode: " & zSpider.ParentNode.basename & ") (Spider: " & zSpider.basename & ")(text: " & zSpider.Text & ")"
                'Debug.Print String(zRecursion, ".") & " (" & " ParentNode: " & zSpider.ParentNode.basename & ") (Spider: " & zSpider.basename & ")(text: " & zSpider.Text & ")"
              End If
          ''    Debug.Print String(zRecursion, ".") & " " & " ParentNode: " & zSpider.ParentNode.baseName & " -- " & zSpider.baseName & ": " & zSpider.Text
              zPrintSpider = True
            Next zSpider
          '
          zExitCleanUp:
          On Error Resume Next
            Close zFreeFile
            If zRecursion > 0 Then zRecursion = zRecursion - 1
            If Not zSpider Is Nothing Then Set zSpider = Nothing
            If Not zSpiders Is Nothing Then Set zSpiders = Nothing
          Exit Sub
          zerrtrap:
          Stop
            MsgBox Prompt:=Err.Source & vbCrLf & Err.Number & vbCrLf & Err.Description, title:="Error Trap"
            Resume zExitCleanUp
          End Sub

          Comment

          • PhilOfWalton
            Recognized Expert Top Contributor
            • Mar 2016
            • 1430

            #6
            Correct.

            I have found by putting sufficient "Clues" in before the info I want, I get to the correct answer.

            I probably should have pointed out that the "Clues" must follow each other as in the image, with no gaps or intervening statements.

            I guess the code could be modified, so that having got the first result, continue reading the XML until the next set of "Clues" is solved.

            Phil

            Comment

            • Rabbit
              Recognized Expert MVP
              • Jan 2007
              • 12517

              #7
              So if you're recursing and you're passing in the info of the level you're at, doesn't that give you what you need?

              Comment

              • zmbd
                Recognized Expert Moderator Expert
                • Mar 2012
                • 5501

                #8
                Rabbit,
                That's the part I'm having issues with, the information isn't available until after the recursive call.
                SO the UUID=123 and Name=ABC isn't available until the recursive call is made to pull the childnodes to Line 1


                Code:
                  <Group>
                     <UUID>123</UUID>
                     <Name>ABC</Name>
                     <Comment />
                     <DefaultAuthUserStatusReq />
                     <Group>
                     <UUID>689</UUID>
                     <Name>FGH</Name>
                     <ItemDetail>
                        <UUID>ItemJKL</UUID>
                        <!-- Blah Blah Blah -->
                     </ItemDetail>
                So when the Line 1 <Group> is parsed
                Look at the code for
                Sub SpiderChildren
                Line 33 If zSpider.haschil dNodes Then
                Grabs the information
                This evaluates to true (UUID and Name are child nodes>
                So the information about the childnodes isn't available until the recursive call ...

                I think I need to study the parsed file a bit more - this like relating Houston to Texas without knowing you're in Texas then to USA without knowing you're in Texas first to Earth to Solar System without knowing your in the Milky way Galaxy first.

                I think I need to pull the XML down to something simple, maybe the <Root>/<Group>/<Name> and then a single child <Group>/<Name> instead of the larger data files...

                Comment

                • zmbd
                  Recognized Expert Moderator Expert
                  • Mar 2012
                  • 5501

                  #9
                  "By Jove, Watson, I've got it! ... "

                  Or at least getting closer

                  Code:
                          If zSpider.tagName = "Group" Then
                            Dim GroupSpiderName As Object
                            Set GroupSpiderName = zSpider.selectSingleNode("Name")
                            Debug.Print GroupSpiderName.Text
                            If Not GroupSpiderName Is Nothing Then Set GroupSpiderName = Nothing
                          End If
                  In the immediate pane I get just the <Group>\<name>. text on each recursion so if I pass this through to the recursion call append the name, use the UBound(split()) +1 to determine the level of the grouping...

                  Haven't implemented yet - didn't think the selectSinglNode was specific to the current node/childNode but I was running out of ways of getting that information.

                  I should be able to reduce the number of recursions to just specific spider.tagname in each Group node...

                  Comment

                  • Rabbit
                    Recognized Expert MVP
                    • Jan 2007
                    • 12517

                    #10
                    You should only need to recurse if there is another group node embedded inside a group node.
                    Code:
                    For Each child node
                       Select node name
                          Case "UUID"
                             iUUD = ...
                          Case "Group"
                             recurse
                          Case "Name"
                             strName = ...
                       ......
                    Or something like that.

                    Comment

                    • PhilOfWalton
                      Recognized Expert Top Contributor
                      • Mar 2016
                      • 1430

                      #11
                      If you would care to share, I would be very interested in seeing your final Db, because, as I have mentioned, I can pick up multiple bits of information from the Page Source, but other things, I can't find, but appear on the "Inspect Element".

                      Phil

                      Comment

                      • zmbd
                        Recognized Expert Moderator Expert
                        • Mar 2012
                        • 5501

                        #12
                        Phil - absolutely once I have things a bit more polished.

                        Comment

                        • PhilOfWalton
                          Recognized Expert Top Contributor
                          • Mar 2016
                          • 1430

                          #13
                          Thanks, I really appreciate that

                          I have just written a db for pulling multiple values from a Web Page (or XML text file) but it is a modified bit of code from a much larger project, so has a lot of irrelevant code.

                          I will clean it up and send you a copy to put straight into your delete folder.

                          Phil

                          Comment

                          • PhilOfWalton
                            Recognized Expert Top Contributor
                            • Mar 2016
                            • 1430

                            #14
                            @ zmbd

                            I don't know how well you got on with your trees & leaves, but have just spent a few days in hospital (home now) and to pass the time, I knocked up this. It is from various sources, so is messy in the extreme, but it extracts some data from Websites an the data from your test file.

                            I have just chucked the results into the Debug window. In practice, I dare say they would be used to update a table.

                            Phil

                            Comment

                            • zmbd
                              Recognized Expert Moderator Expert
                              • Mar 2012
                              • 5501

                              #15
                              I'll have to take a look, I've moved on to the Universe, Galaxies, StarSystems, Planets, and Countries... gave up on the trees :-)

                              I've a 3/4 parsed XML now; however, haven't had the time to sit back down and finish cleaning up the code. I'll post when done :)

                              Comment

                              Working...