Case tagging and python

  • Fred Mangusta

    Case tagging and python

    Hi,

    I'm relatively new to programming in general, and totally new to python,
    and I've been told that this language is particularly good for what I
    need to do. Let me explain.
    I have a large corpus of English text, in the form of several files.

    First of all I would like to scan each file. Then, for each word I find,
    I'd like to examine its case status, and write the (lower case) word back
    to another text file - with, appended, a tag stating the case it had in
    the original file.

    An example. Suppose we have three possible "case conditions"
    -all lowercase
    -all uppercase
    -initial uppercase only

    Three corresponding tags for each of these might be, respectively:
    -nocap
    -allcaps
    -cap

    Therefore, given the string

    "The Chairman of BP was asleep"

    I would like to produce

    "the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap"

    and write this into a file.


    I have the following algorithm in mind:

    -open input file
    -open output file
    -get line of text
    -split line into words
    -for each word
        -tag = checkCase(word)
        -newword = lowercase(word) + append(tag)
    -rejoin words into line
    -write line into output file

    Now, I managed to write the following initial code

    lines = 0
    for s in file:
        lines += 1
        if lines % 1000 == 0:
            print('%d lines' % lines)  # report progress every 1000 lines
        sent = s.split()  # split the line on whitespace
        #...


    But then I don't quite know what would be the fastest/best way to do
    this. Could I use the join function to reform the string? And, regarding
    the casetest() function, what do you suggest to do? Should I test each
    character of each word, or are there faster methods?

    Thanks very much,

    F.



  • bearophileHUGS@lycos.com

    #2
    Re: Case tagging and python

    Fred Mangusta:
    > Could I use the join function to reform the string?
    You can write a function to split the words, for example one that also
    takes the periods into account, etc.

    > And, regarding the casetest() function, what do you suggest to do?
    Python strings have isupper, islower and istitle methods; they may be
    enough for your purposes.
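
As a quick illustration of those three methods, here is a minimal sketch written for Python 3. The function name `case_tag` and the `other` fallback tag are assumptions made for this example, not something from the thread. Note that a one-letter uppercase word such as "I" satisfies both isupper() and istitle(), so the order of the tests matters.

```python
def case_tag(word):
    """Return a case tag for word using only built-in str methods."""
    if word.islower():
        return 'nocap'
    if word.isupper():
        # also catches one-letter words such as 'I', which are title case too
        return 'allcaps'
    if word.istitle():
        return 'cap'
    # hypothetical fallback: mixed case, digits only, or bare punctuation
    return 'other'


for w in "The Chairman of BP was asleep".split():
    print(w, case_tag(w))
```

Words with attached punctuation (for example "asleep.") still get a sensible tag here, because these methods only look at the cased characters in the string.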

    > -open input file
    > -open output file
    > -get line of text
    > -split line into words
    > -for each word
    >     -tag = checkCase(word)
    >     -newword = lowercase(word) + append(tag)
    > -rejoin words into line
    > -write line into output file
    It seems good. To join the words of a line there's str.join. Now you
    can write a function that splits lines, and another to check the case,
    then you can show them to us.
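
A minimal sketch of the rejoining step with str.join, written for Python 3. The helper name `tag_line` and its inline tagging rule are hypothetical and kept deliberately small so that the example stands alone.

```python
def tag_line(line):
    """Lowercase each whitespace-separated word, append a case tag, rejoin."""
    tagged = []
    for word in line.split():
        if word.isupper():
            tag = 'allcaps'
        elif word[0].isupper():
            tag = 'cap'
        else:
            tag = 'nocap'
        tagged.append(word.lower() + '/' + tag)
    # str.join is called on the separator and takes the list of pieces
    return ' '.join(tagged)


print(tag_line("The Chairman of BP was asleep"))
# the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap
```

Collecting the pieces in a list and joining once is generally preferable to growing a string with repeated `+` inside the loop.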

    Still, I don't see how much use your output file can have :-)

    Bye,
    bearophile


    • Fred Mangusta

      #3
      Re: Case tagging and python

      Hi, I came up with the following procedure

      import re

      ALLCAPS = "|ALLCAPS"
      NOCAPS = "|NOCAPS"
      MIDCAPS = "|MIDCAPS"
      CAPS = "|CAPS"
      DIGIT = "|DIGIT"

      def test_case(w):

          w_out = ''

          if w.isalpha():  # only when no comma (or other punctuation) is attached
              if w.isupper():
                  w_out = w.lower() + ALLCAPS
                  return w_out
              elif w.islower():
                  w_out = w + NOCAPS
                  return w_out
              else:
                  m = re.match("^[A-Z]", w)
                  if m:
                      w_out = w.lower() + CAPS  # not sure about this..
                      return w_out
                  else:
                      w_out = w.lower() + MIDCAPS
                      return w_out
          elif w.isdigit():
              w_out = w + DIGIT
              return w_out
          # anything else (e.g. a word with punctuation attached) returns None

      Called in here:
      #=========================
      lines = 0
      for s in file:
          lines += 1
          if lines % 1000 == 0:
              print('%d lines' % lines)
          #sent = sent.replace(", ","")
          sent = s.split()  # split string by spaces
          for w in sent:
              wout = test_case(w)
      #=========================

      But I don't know whether what I'm doing is sensible. Moreover:

      - test_case has a problem: whenever it finds a punctuation character
      attached to a word, it doesn't tag that word. I was thinking of
      stripping the punctuation from the line before calling split on it (see
      the commented-out line), but I don't know whether I have to call
      replace() once for every punctuation character.
      - Is there a way to write the tagged text to a file with the punctuation
      included?
      - Is my test_case a good start? Would you use regular expressions?
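
One possible answer to the punctuation questions, sketched for Python 3 with the standard `re` module. Instead of calling replace() once per punctuation character, it splits each line into tokens with a single pattern, tags only the alphabetic and numeric tokens, and passes punctuation through unchanged. The pattern and the names `TOKEN`, `tag_token` and `tag_line` are assumptions for illustration. The pattern is ASCII-only, apostrophes and hyphens count as plain punctuation, and spacing around punctuation is normalised to single spaces. Unlike the original test_case, CAPS here requires the rest of the word to be lowercase, so "McDonald" is tagged MIDCAPS.

```python
import re

# one token is a run of letters, a run of digits, or any single other
# non-space character (punctuation)
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")


def tag_token(tok):
    """Tag alphabetic and numeric tokens; return punctuation untouched."""
    if tok.isdigit():
        return tok + '|DIGIT'
    if not tok.isalpha():
        return tok
    if tok.islower():
        return tok + '|NOCAPS'
    if tok.isupper():
        return tok.lower() + '|ALLCAPS'
    if tok[0].isupper() and tok[1:].islower():
        return tok.lower() + '|CAPS'
    return tok.lower() + '|MIDCAPS'


def tag_line(line):
    """Tokenise one line and rejoin the tagged tokens with single spaces."""
    return ' '.join(tag_token(t) for t in TOKEN.findall(line))


print(tag_line("Hello, McDonald ate 2 BIG pies."))
# hello|CAPS , mcdonald|MIDCAPS ate|NOCAPS 2|DIGIT big|ALLCAPS pies|NOCAPS .
```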

      Thanks very much!
      F.


      • chrispoliquin@gmail.com

        #4
        Re: Case tagging and python


        I second the idea of just using the islower(), isupper(), and
        istitle() methods.
        So, you could have a function - let's call it checkCase() - that
        returns a string with the tag you want...

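        def checkCase(word):

            if word.islower():
                tag = 'nocap'
            elif word.isupper():
                tag = 'allcaps'
            elif word.istitle():
                tag = 'cap'
            else:
                # fallback (not in the original post) so tag is always defined,
                # e.g. for mixed-case words, numbers or bare punctuation
                tag = 'other'

            return tag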

        Then let's take an input file and pass every word through the
        function...

        f = open('path:to:file', 'r')  # placeholder for the input path
        corpus_text = f.read()
        f.close()

        tagged_corpus = ''
        all_words = corpus_text.split()

        for w in all_words:
            tagtext = checkCase(w)
            tagged_corpus = tagged_corpus + ' ' + w + '/' + tagtext

        # placeholder for the output path (use a different file than the input)
        output_file = open('path:to:file', 'w')
        output_file.write(tagged_corpus)
        output_file.close()
        print('All Done!')
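
For a large corpus, a line-by-line variant avoids holding the whole text in memory and keeps the original line breaks. This is only a sketch for Python 3: the names `check_case` and `tag_file` and the `other` fallback tag are hypothetical, the paths are supplied by the caller, and words are lowercased as in the original request.

```python
import os
import tempfile


def check_case(word):
    """Return a case tag based on the built-in str methods."""
    if word.islower():
        return 'nocap'
    if word.isupper():
        return 'allcaps'
    if word.istitle():
        return 'cap'
    return 'other'  # hypothetical fallback tag


def tag_file(in_path, out_path):
    """Tag in_path line by line and write the result to out_path."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            words = [w.lower() + '/' + check_case(w) for w in line.split()]
            dst.write(' '.join(words) + '\n')


# small self-contained demonstration using temporary files
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'in.txt')
out_path = os.path.join(tmp_dir, 'out.txt')
with open(in_path, 'w') as f:
    f.write("The Chairman of BP was asleep\n")
tag_file(in_path, out_path)
with open(out_path) as f:
    result = f.read()
print(result)
```

The `with` blocks close both files even if an error occurs partway through, which the explicit open/close calls do not guarantee.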



        Also, if you're doing natural language processing in Python, you
        should get NLTK.
