how to parse a S expression and build an binary decision tree?

**Glenton** · May 24 '10, 01:47 PM

Okay, wow, that's quite a big question. You got any code you're trying, or more specific questions to narrow it down?

1. Parsing the string into usable data.
regular expressions will be your friend here.

2. The tree structure.
I'll give some more thought about this, but the question is basically can you use one of the standard python data structures to hold this kind of thing in a neat way, or do you need to build a data structure to hold it. I suspect a clever useage of defaultdict (from collections) could work.

But I can't write now, I'm afraid!

**kaushik1221** · May 24 '10, 01:56 PM

Originally posted by Glenton

Okay, wow, that's quite a big question. You got any code you're trying, or more specific questions to narrow it down?

1. Parsing the string into usable data.
regular expressions will be your friend here.

2. The tree structure.
I'll give some more thought about this, but the question is basically can you use one of the standard python data structures to hold this kind of thing in a neat way, or do you need to build a data structure to hold it. I suspect a clever useage of defaultdict (from collections) could work.

But I can't write now, I'm afraid!

Code:

import sys

def _gen_tokens(file):
  for line in file:
    line_len = len(line)
    left = 0

    while left < line_len:
      c = line[left]

      if c.isspace():
        left += 1
      elif c in '()':
        yield c
        left += 1

      else:
        right = left + 1
        while right < line_len:
          c = line[right]
          if c.isspace() or c in '()':
            break

          right += 1

        token = line[left:right]
        if token.isdigit():
          token = int(token)
        yield token

        left = right

def read_all(file=sys.stdin):
  stack = []
  for token in _gen_tokens(file):
    if token == '(':
      stack.append([])

    elif token == ')':
      top = stack.pop()
      if len(stack) == 0:
        yield top
      else:
        stack[-1].append(top)

    else:
      stack[-1].append(token)

  assert len(stack) == 0

def s_expr_to_str(e):
  if type(e) is ListType:
    return '(%s)' % ' '.join(map(s_expr_to_str, e))
  else:
    return str(e)

s_expr=(1.0 + 3.5)
# for s_expr in s_expr_gen.read_all():
print s_expr_to_str(s_expr)

basically my type of data will be in S expr form,so this is the code i am trying to parse the S expression.but this code isnt running..it is giving me 2-3 error.would be happy if u can sort this..
As a first step thinking of parsing the expression ,then will see on build a decision tree..
No i guess decision tree is required as it is the one used widely for pattern recognization and classification of large data.

Please find the bug in the above code.

thanks for the reply

**Glenton** · May 25 '10, 02:10 AM

Hi

Starting with the overall objective of making a decision tree. So you have a string with ((Question)(Lef t_Node)(right_n ode))
1. Can you tell us if the Question is unique or if there are multiple nodes with the same question?
2. Can you tell us what Left_Node and right_node actually are? Are they references to another question, and if so how is it done?
3. Can you tell us how big the tree is? Ie how efficient does the code need to be?
I'm not disputing that you need a decision tree - I'm just wondering what the best way is of coding it!

Regarding your code, it would be helpful if you told us what the various functions were meant to achieve! Little details make a big difference and we can't really tell. Eg give us an example of how you would run it, what you expect and what you got instead. This will make it more likely that you get reasonable help!

But here for example is some code for a _gen_token function using regular expressions:

Code:

import re

def _gen_tokens(myfile):
    pattern=r"\(\((.*)\)\((.*)\)\((.*)\)\)"
    for line in myfile:
        match=re.search(pattern,line.strip())
        if match:
            yield match.group(1)
            yield match.group(2)
            yield match.group(3)


f=open("T1.txt")
for token in _gen_tokens(f):
    print token

This gives:

Code:

question
leftnode
rightnode
question1
leftnode1
rightnode1
question2
leftnode2
rightnode3
etc...

which I think solves the parsing part of the problem, right? Let us know if that works for you, and if not what the problem is, and if there's anything else.

when run from the same directory as a file T1.txt that looks like this:

Code:

((question)(leftnode)(rightnode))
((question1)(leftnode1)(rightnode1))
((question2)(leftnode2)(rightnode2))
((question3)(leftnode3)(rightnode3))
((question4)(leftnode4)(rightnode4))
((question5)(leftnode5)(rightnode5))
((question6)(leftnode6)(rightnode6))
((question7)(leftnode7)(rightnode7))
((question8)(leftnode8)(rightnode8))

**Glenton** · May 25 '10, 02:15 AM

Originally posted by kaushik1221

Code:

import sys

def _gen_tokens(file):
  for line in file:
    line_len = len(line)
    left = 0

    while left < line_len:
      c = line[left]

      if c.isspace():
        left += 1
      elif c in '()':
        yield c
        left += 1

      else:
        right = left + 1
        while right < line_len:
          c = line[right]
          if c.isspace() or c in '()':
            break

          right += 1

        token = line[left:right]
        if token.isdigit():
          token = int(token)
        yield token

        left = right

def read_all(file=sys.stdin):
  stack = []
  for token in _gen_tokens(file):
    if token == '(':
      stack.append([])

    elif token == ')':
      top = stack.pop()
      if len(stack) == 0:
        yield top
      else:
        stack[-1].append(top)

    else:
      stack[-1].append(token)

  assert len(stack) == 0

def s_expr_to_str(e):
  if type(e) is ListType:
    return '(%s)' % ' '.join(map(s_expr_to_str, e))
  else:
    return str(e)

s_expr=(1.0 + 3.5)
# for s_expr in s_expr_gen.read_all():
print s_expr_to_str(s_expr)

basically my type of data will be in S expr form,so this is the code i am trying to parse the S expression.but this code isnt running..it is giving me 2-3 error.would be happy if u can sort this..
As a first step thinking of parsing the expression ,then will see on build a decision tree..
No i guess decision tree is required as it is the one used widely for pattern recognization and classification of large data.

Please find the bug in the above code.

thanks for the reply

Oh, looking at your _gen_token function I see that you were simply using the name of the file in the "for line in file:" of line 4 (you get it from your stdin). This won't work. You need to use a file object - see the code from my previous post for an example.

This is a classic example of why commenting your code is necessary (I know I didn't with my code ;P). When I read your _gen_token function and saw the line 4 I assumed you were passing a file object. A comment about what you were doing (for yourself in the future and us now) would have made the error obvious!

**kaushik1221** · May 25 '10, 02:36 AM

I am working on speech synthesis.In this i have a large number of pronunciation for each phone i.e alphabet and need to classify them according to few feature such as segment size(int) and alphabet itself(string) into a smaller set suitable for a particular context. For this purpose,i have decided to use decision tree for classification. the data to be parsed is in the S expression format.eg:((que stion)(LEFTNODE )(RIGHTNODE)).

example of how question can be at root question can segment size<1.5,if yes left node else right node.again left node might have a question such a segment size <.8,for further clasification of pronunciation or may have a diff question such as alphabet = "a"..
question wil be different at each node.
left and right node which have whole part of left tree and right tree respectively..

I am giving u a sample of file,which is my target to parse and build the decision tree is here.

((segment_durat ion < 0.145)
((n.ph_cvox is -)
((n.ph_vfront is 3)
((p.ph_cplace is v)
((segment_durat ion < 0.119999)
((p.name is en:)
((((253 9.2873) (267 8.75463) (269 9.66194) (312 8.5538) (384 12.0333) (467 10.9705) (496 10.078) (503 12.0626) (565 9.90752) (581 9.75047) (628 11.2111) (710 9.70235) (746 9.18527) (780 8.73101) (891 9.45934) (901 9.50958) (905 8.82014) (1002 10.4217) (1214 12.9305) (1410 9.78915) (1680 10.6755) (1690 10.4012) (1698 9.17619) (1847 11.0898) (2055 9.69433) (2141 10.4624) (2276 9.80647) (2342 11.8977) (2346 11.8854) (2395 9.69119) (2450 12.8875) (2481 9.93514) (2491 13.5101) (2527 11.7152) (2739 10.2178) (2809 9.32473)) 10.3664))
((p.name is an:)
((((56 10.7208) (108 9.96128) (192 12.9366) (219 9.62509) (262 9.67294) (299 10.6707) (303 9.27964) (306 9.93745) (313 10.1076) (348 10.7988) (354 9.1955) (403 10.9634) (408 10.3513) (533 10.7836) (569 11.7902) (618 9.58224) (696 8.51287) (708 10.6939) (769 10.6282) (791 10.4681) (792 12.8401) (823 10.9245) (861 11.3897) (880 8.44017) (945 10.6206) (1015 9.53572) (1029 9.73096) (1257 9.61424) (1611 11.9002) (1999 10.3606) (2012 9.60825) (2688 9.17272) (2903 12.7348)) 10.4107))
((((74 12.7488) (392 12.9928) (402 13.958) (495 12.6443) (539 10.8991) (754 11.8237) (831 12.5241) (1039 9.25672) (1325 11.1247) (1383 13.317) (1653 11.2169) (1822 10.6504) (1831 10.1751) (1916 11.5109) (2047 12.5661) (2189 11.7423) (2244 11.3622) (2312 12.2735) (2340 11.0929) (2500 12.4542) (2961 11.5023)) 11.8017))))
((R:SylStructur e.parent.parent .R:Word.p.gpos is 0)
((((59 7.99577) (65 8.07265) (82 9.05552) (117 8.04792) (168 7.55799) (189 11.229) (206 9.65302) (216 7.95128) (238 7.21761) (287 7.84751) (317 11.4078) (370 8.85067) (378 12.0625) (410 10.4587) (421 9.69372) (462 9.31811) (497 10.5316) (516 7.84422) (527 8.38548) (540 8.24113) (564 7.79119) (575 9.11389) (656 8.65479) (667 11.5971) (692 7.78092) (734 9.50851) (742 7.21358) (743 7.74665) (788 8.2656) (805 8.52475) (833 7.86942) (842 9.54321) (866 8.77619) (890 7.56151) (944 8.03321) (987 11.184) (1042 8.13802) (1447 8.64641) (1598 10.1785) (1607 11.7318) (1618 9.42431) (2021 11.203) (2088 8.42375) (2120 10.4004) (2137 10.7255) (2156 12.4177) (2158 9.69772) (2488 7.10815) (2591 13.2943) (2612 11.102)) 9.26156))

**Glenton** · May 25 '10, 03:51 AM

Hi. I'm sorry to be dense, but I'm struggling to see how this is a
((question)(lef tnode)(rightnod e)) format?

Can you take us through the meaning of what you've got here a bit more?

Also, where are the line breaks, and what do they mean?

And what is the end product you're after. Sure, a decision tree, but for printing graphically, or for a computer program to go through step by step or what?

**kaushik1221** · May 25 '10, 04:25 AM

Originally posted by Glenton

Hi. I'm sorry to be dense, but I'm struggling to see how this is a
((question)(lef tnode)(rightnod e)) format?

Can you take us through the meaning of what you've got here a bit more?

Also, where are the line breaks, and what do they mean?

And what is the end product you're after. Sure, a decision tree, but for printing graphically, or for a computer program to go through step by step or what?

Basically,assum e that there are 10,000 toys and there is need for classifying them and there are different question need to be answers to categories them,so that each leaf has just around 20 toys of one type..
my aim is given a toy along with its few feature which will help in categorizing it,the computer should parse through the decision tree build for 10,000 toys and give a result only the leaf node contents for the given toy category inputed earlier by checking its feature..
I guess should be clear by now,it is similar like simplifying the search by now searching from instead of 10,000 toys,from only 20.

The sample i gave is in that format but recursively i guess:
((Question)
((LEFT NODE CONTENT:questio n at left node if left node is selected from root node)
((few more question = to the depth of the tree as 1 question parse increses the depth by 1) and so on not only questions but also left and right part at that node )))
((Right NODE CONTENT::questi on at right node if right node is selected from root node)
((few more question = to the depth of the tree as 1 question parse increses the depth by 1) and so on not only question but left & right part at that node)))

it is similar pattern recursively at each node and each node has a list of the toys in that particular category such a after one question answered there will be 5000 toys on left node category and 5000 on right node category..

((Question)((L1 :Question)((L2: Question)(L2:le ft leaf)(L2:right leaf))(L1:left child)(L1:right child))((R1:Que stion)((R2:Ques tion)(R2:left leaf)(R2:right leaf))(R1:left child)(R1:right child))

should be clear now i guess,
please help me,not having a clue for building a decision tree..
If not able to understand my requirement still.
tell me how to do with what u got,i wil try to make changes to suit my need..
at present i am clueless in building the decision tree.please suggest methods.

Sorry if my question previously given wasnt clear..earlier i wanted to be simple now i elaborated my whole problem.
thank u

**Glenton** · May 25 '10, 06:26 AM

I had understood most of that already.

But if I understand correctly, there should be at most 3 brackets in a row at the lowest level. Eg in your example:
((Question)((L1 :Question)((L2: Question)(L2:le ft leaf)(L2:right leaf))(L1:left child)(L1:right child))((R1:Que stion)((R2:Ques tion)(R2:left leaf)(R2:right leaf))(R1:left child)(R1:right child))

L2 has the form ((...)(...)(... )), while L1 also has that form, and the second bracket is just L2.

But in your actual example data you have many (...) in a row. How are you suppose to tell what those all mean? And it ends with (2612 11.102)) 9.26156), so the last number doesn't seem to have an open bracket.

In short the example structure you give is not compatible with your explanation.

Actually, even your simple example doesn't make much sense when looked at in detail.
Here's my picture understanding

Code:

                                Q
                       /                  \
                  Q0                      Q1
             /       \                       /      \
     Q00           Q01          Q10           Q11
    /      \         /      \        /       \        /      \
A000 A001 A010 A011 A100 A101 A110 A111

I'm drawing this out in painful detail because we seem not to be getting each other. Obviously in this example they all go down to the bottom level, but you could have A11 and get rid of A110 and A111. Is this understanding of the decision tree structure correct?

Now to convert this to the format you've got it would be like this:

Code:

(
  (Q)
  (
     (Q0)
     (
        (Q00)
        (A000)
        (A001)
     )
     (
        (Q01)
        (A010)
        (A011)
     )
  )
  (
     (Q1)
     (
        (Q10)
        (A100)
        (A101)
     )
     (
        (Q11)
        (A110)
        (A111)
     )
  )
)

I'm tab formatting it to (hopefully) make it clearer. This is clearly a recursive structure.

Now here is yours (I've changed the labelling to be consistent):

Code:

(
   (Q)
   (
      (Q0)
      (
         (Q00)
         (A000)
         (A001)
      )
      (???L1:left child)
      (???L1:right child)
   )
   (
      (Q1)
      (
         (Q10)
         (Q100)
         (Q101)
      )
      (???R1:left child)
      (???R1:right child)
   )

There appears to be a mismatch of questions and answers ((total number of questions) should equal (total number of answers minus 1)). Eg Q0 and Q1 (corresponding to your L1 and R1) both have 3 answers at the same level.

So I ask again will you please explain your existing data structure (without assuming I'm an idiot, but being more precise)

**Glenton** · May 25 '10, 07:20 AM

By the way, with the structure as I made it, it's relatively easy to make a decision tree.

With a text file like this:

Code:

(
  (Q)
  (
     (Q0)
     (
        (Q00)
        (A000)
        (A001)
     )
     (
        (Q01)
        (A010)
        (A011)
     )
  )
  (
     (Q1)
     (
        (Q10)
        (A100)
        (A101)
     )
     (
        (Q11)
        (A110)
        (A111)
     )
  )
)

You can run this code:

Code:

import re

def questionAndNodes(s):
    """extracts Q, A0 and A1 from a bottom level pattern
    of form (Q)(A0)(A1)"""
    pattern=r"\((.*)\)\((.*)\)\((.*)\)"
    match=re.search(pattern,s.strip())
    if match:
        return match.group(1), match.group(2), match.group(3)

#Create continuous string out of file
f=open("T1.txt")                 #file with the string
s=""
for line in f:
    s+=line.strip().replace(") (",")(")              #create one long string from file"
f.close()


#Dictionary of questions with nodes
tree=dict()

#Pattern for finding bottom level  (...)(...)(...), where ... has no brackets.
pattern1=r"\([^()]*\)\([^()]*\)\([^()]*\)"

while True:
    print "*****************"
    myfindall=re.findall(pattern1,s)  #Find all bottom nodes
    if len(myfindall)==0: break          #If there aren't any more quit
    for v in myfindall:
        q,l,r=questionAndNodes(v)     #extract the question and nodes
        print q, l, r                             
        tree[q]=[l,r]                           #Add the nodes to the dictionary
        s=s.replace(v,q)                #replace the whole node with the question

print tree

def myask(q,tree):
    "Recursive function for navigating down the tree"
    if q in tree.keys():
        print "Question:"
        print q
        ans=raw_input("Enter 0, 1, or (q)uit: ")
        while ans not in "01qQ":
            ans=raw_input("Invalid answer. Enter 0, 1, or (q)uit: ")
        if ans=="q":
            print "User quit"
        if ans=="0":
            myask(tree[q][0],tree)
        if ans=="1":
            myask(tree[q][1],tree)
    else:    
        print "End.  Answer:"
        print q

#Initialise the function with the top node question "Q"
myask("Q",tree)

Presumably this isn't what you want. :,(

But it's kinda cool!

**kaushik1221** · May 25 '10, 07:45 AM

Originally posted by Glenton

By the way, with the structure as I made it, it's relatively easy to make a decision tree.

With a text file like this:

Code:

(
  (Q)
  (
     (Q0)
     (
        (Q00)
        (A000)
        (A001)
     )
     (
        (Q01)
        (A010)
        (A011)
     )
  )
  (
     (Q1)
     (
        (Q10)
        (A100)
        (A101)
     )
     (
        (Q11)
        (A110)
        (A111)
     )
  )
)

You can run this code:

Code:

import re

def questionAndNodes(s):
    """extracts Q, A0 and A1 from a bottom level pattern
    of form (Q)(A0)(A1)"""
    pattern=r"\((.*)\)\((.*)\)\((.*)\)"
    match=re.search(pattern,s.strip())
    if match:
        return match.group(1), match.group(2), match.group(3)

#Create continuous string out of file
f=open("T1.txt")                 #file with the string
s=""
for line in f:
    s+=line.strip().replace(") (",")(")              #create one long string from file"
f.close()


#Dictionary of questions with nodes
tree=dict()

#Pattern for finding bottom level  (...)(...)(...), where ... has no brackets.
pattern1=r"\([^()]*\)\([^()]*\)\([^()]*\)"

while True:
    print "*****************"
    myfindall=re.findall(pattern1,s)  #Find all bottom nodes
    if len(myfindall)==0: break          #If there aren't any more quit
    for v in myfindall:
        q,l,r=questionAndNodes(v)     #extract the question and nodes
        print q, l, r                             
        tree[q]=[l,r]                           #Add the nodes to the dictionary
        s=s.replace(v,q)                #replace the whole node with the question

print tree

def myask(q,tree):
    "Recursive function for navigating down the tree"
    if q in tree.keys():
        print "Question:"
        print q
        ans=raw_input("Enter 0, 1, or (q)uit: ")
        while ans not in "01qQ":
            ans=raw_input("Invalid answer. Enter 0, 1, or (q)uit: ")
        if ans=="q":
            print "User quit"
        if ans=="0":
            myask(tree[q][0],tree)
        if ans=="1":
            myask(tree[q][1],tree)
    else:    
        print "End.  Answer:"
        print q

#Initialise the function with the top node question "Q"
myask("Q",tree)

Presumably this isn't what you want. :,(

But it's kinda cool!

Hey,sorry its was a common mistake made by both of us..I forgot to mention that the leaf node will have different format.as there will be no question in it.and only list to stored..i.e the reason we need to check everytime while parsing whether it is leaf node or not..the last value which does not start with an opening left parenthesis is of no use to me and can be ignored..
In the sample given the information is in the format of ( X Y)co ordinates of which only X is the information of need to be stored in the decision tree.

thank you very much for the support given

Now answer to your previous post,ya i think i made a mistake i meant the same structure what you have showed,except that have to check whether it is a leaf node or not to coz it wil have a different format.
how can we discard the one which does not start with parenthesis??
how to store the information at each node ??

The code looks seriously cool,but ya doesnt satisfy my requirement..
I want the system to automatically parse the decision tree,and give me and display of content of the leaf node..
Assuming the input will be a letter "a" with feature such as an list ( 'seg_size','pit ch','width') .match them to the question at each level and answer them and parse till the leaf :)

**Glenton** · May 25 '10, 09:05 AM

I'm sure I can help if you help me understand what the structure is.

**kaushik1221** · May 25 '10, 09:40 AM

ya trying to get a better way of understanding the structure my self and for giving good explanation...w ill get back in some time...thanks for giving supports and confidence

**kaushik1221** · May 25 '10, 11:07 AM

Originally posted by Glenton

I'm sure I can help if you help me understand what the structure is.

The structure is pretty much the same what is told you.
for normal node:((Question )(left node)(right node))
for leaf(content).. you may see in the sample where there are two separate sets with only values and no question in between within some braces(((( ))))
Please dont give much stress on the structure..
Help me in building the decision tree,which will parse the information and display the contents of leaf nodes..
i guess this can be using data structure concepts.

def parse( node )
check if leaf
if yes:parse the leaf,create object,create object pointer
return an object pointer
Else :
question: parse(1st field)
left: parse(2nd field)
right: parse(3rd field)
these three can be made a class called binary_decision _node and return current object identifier
we should perform this recursively called until leaf node is reached.

While parsing have to store the content of format(X Y) in some list,index etc and also store the questions in some field to answer

atlast we should have all pointer to traverse from root to the leaf.

i have this kind of concept in mind but as new to python not able to implement this looking forward for your help in programming this..
Please help...

**Glenton** · May 26 '10, 02:48 AM

Have you tried the program I wrote a couple of posts back? I think it works very well. The basic structure is a dictionary with entries like this:
d[Q] = [QL,QR] for normal nodes, and
d[Q] = [AL,AR] for leaf nodes.

Therefore to navigate it, you start with d[Q], and depending on your answer to that you move to d[QL] or d[QR].

Please look at that example in detail and if it doesn't work for you, please let me know why.

There's also some details about parsing the structure from a string to the dictionary. I'm relatively confident that tweaking that for your irregular leaves will not be too difficult.

Anyway, maybe you can have a detailed look at that, and come back with questions and a way forward.

how to parse a S expression and build an binary decision tree?

how to parse a S expression and build an binary decision tree?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment