Comparing somewhat irregular data, counting and printing!

**bvdet** · Jan 20 '10, 02:46 PM

dechen,

We cannot do your work for you. Please show some effort to solve this problem yourself, and we will be glad to help you. Please see the posting guidelines.

BV - Moderator

**dechen** · Jan 24 '10, 01:37 PM

Yes, I've been trying, with no luck. I saved the first file as a dictionary of words paired with their corresponding POS. I saved the second file as a list of phrases, each phrase enclosed in a pair of square brackets.

Code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Build a word -> POS dictionary from the tab-separated tagged corpus.
dictionary = {}
dictfile = open("pos-taged-corpus.txt", "r")
for entry in dictfile.read().strip().split("\n"):
    fields = entry.split("\t")
    # Skip malformed lines so words and POS tags cannot get misaligned.
    if len(fields) >= 2:
        dictionary[fields[0].strip()] = fields[1].strip()
dictfile.close()

# Read the phrases; each phrase is enclosed in square brackets.
phrasefile = open("corpus_with_phrase_break.txt", "r")
phrases = [t.strip() for t in phrasefile.read().split(' ')]
phrasefile.close()

file2 = open("mergeOP.txt", "w")
for phrase1 in phrases:
    phrase2 = phrase1.strip('[').strip(']')

    # This assumes phrase2 marks word boundaries with *,
    # which is not the case in my data.
    for word in phrase2.split('*'):
        v = dictionary.get(word, None)
        if v:
            file2.write("The values were found! ")
            file2.write(word + '\t' + v)
            file2.write('\n')
        else:
            file2.write('There is nothing, no v:')
            file2.write('\n')
file2.close()

But since there are no word breaks in the second file, it is hard to compare it with the dictionary. I tried using a counter: counting the number of syllables in each word entry and then trying to count off the same number of syllables in the phrases, but I cannot implement it the way I want and it is getting messy. I need to identify the words in the phrases by comparing them to the dictionary entries. On what basis should I compare the two? Just give me a hint. I have run out of ideas.

**bvdet** · Jan 24 '10, 02:49 PM

Please use code tags when posting code. Please read "Posting Guidelines, How To Ask A Question".

BV - Moderator

**bvdet** · Jan 24 '10, 03:42 PM

Let's assume you have created a dictionary named dd:

Code:

>>> dd
{"GH'I'": 'NNP', "JKL'": 'CD', "MN'O'": 'CG', "DEF'": 'CC', "AB'C'": 'NNP'}
>>>

The text of the second input file is stored in a variable input2.

Code:

input2 = """[AB'C'DEF'GH'I'] [JKL'MN'O']"""

Initialize an empty list named results.
Iterate on input2.split() with built-in function enumerate(). Each iteration will assign a count value (j) and string value (item).
Append an empty list to results, so that results[j] collects the matches for item.
Iterate on the keys of dictionary dd.
If the dictionary key is in item, append the following string to results[j]:

Code:

"%d/%s" % (key.count("'"), dd[key])

String method count() returns the number of occurrences of the ' character in key, and dd[key] returns the POS.

If coded in this way, the results look like this:

Code:

>>> results
[['2/NNP', '1/CC', '2/NNP'], ['1/CD', '2/CG']]
>>>
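
The steps above can be put together roughly as follows. This is only a sketch using the sample dd and input2 from this post. One detail is an addition not described in the steps: the keys are sorted by where they occur in the phrase (via item.find), so that the output follows phrase order instead of depending on dictionary order.

```python
# Sample data copied from the post above.
dd = {"GH'I'": 'NNP', "JKL'": 'CD', "MN'O'": 'CG', "DEF'": 'CC', "AB'C'": 'NNP'}
input2 = """[AB'C'DEF'GH'I'] [JKL'MN'O']"""

results = []
for j, item in enumerate(input2.split()):
    # One list of matches per phrase.
    results.append([])
    # Assumption: sort keys by their position in the phrase. Keys that do
    # not occur get -1 from find() and are skipped by the test below.
    for key in sorted(dd, key=item.find):
        if key in item:
            # Apostrophe count of the word, then its POS tag.
            results[j].append("%d/%s" % (key.count("'"), dd[key]))

print(results)
```

Note that this is plain substring matching: a key that happens to be contained in a longer word will also match, and a word that occurs twice in one phrase is reported only once.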

**dechen** · Jan 28 '10, 07:39 AM

Thanks a lot. It worked!

**dechen** · Feb 3 '10, 10:49 AM

Word POS
AB'C' NNP
DEF' CC
GH'I' NNP
JKL' CD
MN'O' CG
DEF' CG

What happens when the same dictionary key has two different values (as above)? Then the POS in the output may or may not be right, right? Is there a way to track that, or to prevent the wrong values from getting printed?

**bvdet** · Feb 3 '10, 01:44 PM

How would you be able to distinguish which POS is correct for a given phrase? You can store a list or tuple of POS values for each dictionary key and then use rules to decide which one is correct. The dictionary could be created like this:

Code:

dd = {}
for item in input1.split("\n"):
    key, value = item.split()
    dd.setdefault(key, []).append(value)

.....and will look like this:

Code:

>>> dd
{"GH'I'": ['NNP'], "JKL'": ['CD'], "MN'O'": ['CG'], "DEF'": ['CC', 'CG'], "AB'C'": ['NNP']}
>>>
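
One possible way to track the ambiguous words, as a sketch only: build dd as above, list every key that has more than one POS, and print all candidate tags for such a word instead of picking one silently. Here input1 is assumed to hold the word/POS lines from the question without the header row, and tag_phrase is a hypothetical helper name, not something from the earlier code. Choosing the correct tag would still need rules of your own.

```python
# Assumed sample data: the word/POS lines from the question, no header row.
input1 = """AB'C' NNP
DEF' CC
GH'I' NNP
JKL' CD
MN'O' CG
DEF' CG"""

dd = {}
for item in input1.split("\n"):
    key, value = item.split()
    dd.setdefault(key, []).append(value)

# Words that carry more than one POS and therefore need a rule.
ambiguous = sorted(key for key, values in dd.items() if len(values) > 1)


def tag_phrase(item, dd):
    """Hypothetical helper: list count/POS strings for the words in item."""
    tags = []
    # Sort keys by position in the phrase so tags follow phrase order.
    for key in sorted(dd, key=item.find):
        if key in item:
            # Join all candidate tags with | so ambiguity stays visible.
            tags.append("%d/%s" % (key.count("'"), "|".join(dd[key])))
    return tags


print(ambiguous)
print(tag_phrase("[AB'C'DEF'GH'I']", dd))
```

With this sample data the ambiguous list contains only DEF', and its entry in the phrase output shows both CC and CG.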
