Comparing somewhat irregular data, counting and printing!

  • dechen
    New Member
    • Jan 2010
    • 8

    Comparing somewhat irregular data, counting and printing!

    Here's the problem:

    Word POS

    AB'C' NNP
    DEF' CC
    GH'I' NNP
    JKL' CD
    MN'O' CG

    -> In this input, the first column contains words and the second column gives the part of speech (POS) of each word. The ' character marks a syllable boundary, so a word may consist of one syllable or several. [Unlike English, our language does not mark word boundaries. It only has syllable and sentence boundaries.]

    Input2:
    [AB'C'DEF'GH'I'] [JKL'MN'O']
    -> This input consists of phrases without word boundaries. It contains the same words as the first input, but nothing marks where one word ends and the next begins. The task is to compare the two input files and produce a result in which the words of each phrase in Input2 are separated by spaces. Then, for each word in the resulting file, the number of syllables and its corresponding POS must be written to a file.

    Output:
    [AB'C' DEF' GH'I'] [JKL' MN'O']
    Here the words are separated by spaces in each phrase. No. of words in the first phrase is 3 while the second phrase has only 2 words.

    Final output:
    Using the above result, the number of syllables and the POS of each word must be found, as shown below:

    [2/NNP 1/CC 2/NNP] [1/CD 2/CG]

    Help me!

    Thanks.
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    dechen,

    We cannot do your work for you. Please show some effort to solve this problem yourself, and we will be glad to help you. Please see the posting guidelines.

    BV - Moderator


    • dechen
      New Member
      • Jan 2010
      • 8

      #3
      Yes, I have been trying, with no luck. I saved the first file as a dictionary of words and their corresponding POS. I saved the second file as a list of phrases, each enclosed in a pair of square brackets.
      Code:
      #!/usr/bin/python
      # -*- coding: utf-8 -*-
      
      # Build a word -> POS dictionary from the tab-separated tagged corpus.
      dictfile = open("pos-taged-corpus.txt", "r")
      file2 = open("mergeOP.txt", "w")
      
      dictionary = {}
      for row in dictfile.read().strip().split("\n"):
          fields = row.split("\t")
          # Skip rows that do not have both a word and a POS column.
          if len(fields) >= 2:
              dictionary[fields[0].strip()] = fields[1].strip()
      
      # Read the phrases; each phrase is enclosed in square brackets.
      line = open('corpus_with_phrase_break.txt', 'r')
      line1 = line.read()
      phrase = [t.strip() for t in line1.split(' ')]
      
      for phrase1 in phrase:
          phrase2 = phrase1.strip('[').strip(']')
      
          ## provided phrase2 had word breaks as well with * as word boundary which is not the case in my problem
      
          for word in phrase2.split('*'):
              # Look the candidate word up in the dictionary.
              v = dictionary.get(word, None)
              if v:
                  file2.write("The values were found! ")
                  file2.write(word + '\t' + v)
                  file2.write('\n')
              else:
                  file2.write('There is nothing, no v:')
                  file2.write('\n')
      dictfile.close()
      line.close()
      file2.close()
      But since there are no word breaks in the second file, it is hard to compare it with the dictionary. I tried using a counter: counting the number of syllables for each word entry and then trying to count off the same number of syllables in the phrases, but I cannot implement it the way I want and it is getting messy. I need to identify the words in the phrases by comparing them to the dictionary entries. On what basis should I compare the two? Just give me a hint. I have run out of ideas.
      Last edited by bvdet; Jan 24 '10, 02:46 PM. Reason: Add code tags


      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        Please use code tags when posting code. Please read "Posting Guidelines, How To Ask A Question".

        BV - Moderator


        • bvdet
          Recognized Expert Specialist
          • Oct 2006
          • 2851

          #5
          Let's assume you have created a dictionary named dd:
          Code:
          >>> dd
          {"GH'I'": 'NNP', "JKL'": 'CD', "MN'O'": 'CG', "DEF'": 'CC', "AB'C'": 'NNP'}
          >>>
          The text of the second input file is stored in a variable input2.
          Code:
          input2 = """[AB'C'DEF'GH'I'] [JKL'MN'O']"""
          Initialize an empty list named results.
          Iterate on input2.split() with the built-in function enumerate(). Each iteration assigns a count value (j) and a string value (item).
          Append an empty list to results, so that it is available as results[j].
          Iterate on the keys of dictionary dd.
          If the dictionary key is in item, append the following string to results[j]:
          Code:
          "%d/%s" % (key.count("'"), dd[key])
          String method count() returns the number of occurrences of the ' character in key, and dd[key] returns the POS.

          If coded in this way:
          Code:
          >>> results
          [['2/NNP', '1/CC', '2/NNP'], ['1/CD', '2/CG']]
          >>>
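
          The steps above can be sketched roughly as follows. This is only a sketch under some assumptions: find_words is a hypothetical helper name, and sorting the matches by their position in the phrase is an addition to the steps, so that the output follows phrase order instead of dictionary order. The spaced version of each phrase is built from the same matches.

```python
# Sample data from this thread.
dd = {"GH'I'": 'NNP', "JKL'": 'CD', "MN'O'": 'CG', "DEF'": 'CC', "AB'C'": 'NNP'}
input2 = """[AB'C'DEF'GH'I'] [JKL'MN'O']"""


def find_words(item, lexicon):
    """Hypothetical helper: lexicon keys found in item, in phrase order."""
    # item.find(key) gives the index of the first occurrence of key.
    found = [(item.find(key), key) for key in lexicon if key in item]
    return [key for position, key in sorted(found)]


results = []
spaced = []
for j, item in enumerate(input2.split()):
    results.append([])
    words = find_words(item, dd)
    for key in words:
        # Syllable count is the number of ' characters in the word.
        results[j].append("%d/%s" % (key.count("'"), dd[key]))
    spaced.append("[%s]" % " ".join(words))

spaced_output = " ".join(spaced)
final_output = " ".join("[%s]" % " ".join(tags) for tags in results)
print(spaced_output)  # [AB'C' DEF' GH'I'] [JKL' MN'O']
print(final_output)   # [2/NNP 1/CC 2/NNP] [1/CD 2/CG]
```

          Note the limits of plain substring matching: a word that occurs twice in a phrase is reported only once, and a short dictionary key that happens to occur inside a longer word will also match.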


          • dechen
            New Member
            • Jan 2010
            • 8

            #6
            Thanks a lot. It worked!


            • dechen
              New Member
              • Jan 2010
              • 8

              #7
              Word POS
              AB'C' NNP
              DEF' CC
              GH'I' NNP
              JKL' CD
              MN'O' CG
              DEF' CG

              What happens when the same dictionary key has two different values (as above)? Then the POS of the word in the output may or may not be right. Is there a way to track that, or to prevent wrong values from being printed?


              • bvdet
                Recognized Expert Specialist
                • Oct 2006
                • 2851

                #8
                How would you be able to distinguish which POS is correct for a given phrase? You can store the POS values for each key as a list or tuple and decide, using rules, which one is correct. The dictionary could be created like this:
                Code:
                dd = {}
                for item in input1.split("\n"):
                    key, value = item.split()
                    dd.setdefault(key, []).append(value)
                .....and will look like this:
                Code:
                >>> dd
                {"GH'I'": ['NNP'], "JKL'": ['CD'], "MN'O'": ['CG'], "DEF'": ['CC', 'CG'], "AB'C'": ['NNP']}
                >>>
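
                One possible way to keep track of ambiguous words, rather than silently printing a tag that may be wrong, is to print every candidate POS and to list the words that have more than one. A minimal sketch, assuming input1 holds the lexicon text from this thread without its header line; format_tag and the "|" separator are hypothetical choices, not something fixed by the problem.

```python
# Sample lexicon from this thread, including the duplicate DEF' entry.
input1 = """AB'C' NNP
DEF' CC
GH'I' NNP
JKL' CD
MN'O' CG
DEF' CG"""

# Map each word to the list of every POS seen for it.
dd = {}
for item in input1.split("\n"):
    key, value = item.split()
    dd.setdefault(key, []).append(value)


def format_tag(key, lexicon):
    """Hypothetical helper: syllable count plus all candidate POS tags."""
    return "%d/%s" % (key.count("'"), "|".join(lexicon[key]))


# Words that need a rule (or a human) to pick the right POS.
ambiguous = sorted(key for key in dd if len(dd[key]) > 1)

print(format_tag("DEF'", dd))   # 1/CC|CG
print(format_tag("AB'C'", dd))  # 2/NNP
print(ambiguous)                # ["DEF'"]
```

                With the candidates visible in the output, the rules for choosing one of them can be added later without rebuilding the dictionary.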
