Counting words from text file

  • Aplus212
    New Member
    • Jun 2012
    • 1

    Counting words from text file

    Hi guys,

    Very new to Python and was hoping you guys could give me some help.

    I have a book about the Great War and want to count how many times each country appears in it. So far I have this:

    Accessing the book
    Code:
    >>> from __future__ import division 
    >>> import nltk, re, pprint
    >>> from urllib import urlopen
    >>> url = "http://www.gutenberg.org/files/29270/29270.txt"
    >>> raw = urlopen(url).read() 
    >>> type(raw)
    <type 'str'>
    >>> len(raw)
    1067008
    >>> raw[:75]
    'The Project Gutenberg EBook of The Story of the Great War, Volume II (of\r\nV'
    
    Tokenizing
    >>> tokens = nltk.word_tokenize(raw)
    >>> type(tokens)
    <type 'list'>
    >>> len(tokens)
    189743
    >>> tokens[:10]	
    ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Story', 'of', 'the', 'Great']
    
    Slicing
    >>> text = nltk.Text(tokens)
    >>> type(text)
    <class 'nltk.text.Text'>
    >>> text[1020:1060]	
    ['Battles', 'of', 'the', 'Polish', 'Campaign', '462', 'LXXX.', 'Winter', 'Battles', 'in', 'East', 'Prussia', '478', 'LXXXI.', 'Results', 'of', 'First', 'Six', 'Months', 'of', 'Russo-German', 'Campaign', '482', 'PART', 'VIII.--TURKEY', 'AND', 'THE', 'DARDANELLES', 'LXXXII.', 'First', 'Moves', 'of', 'Turkey', '493', 'LXXXIII.', 'The', 'First', 'Blow', 'Against', 'the']
    >>> text.collocations() 
    Building collocations list
    General von; Project Gutenberg-tm; East Prussia; Von Kluck; von Kluck;
    General Staff; General Joffre; army corps; General Foch; crown prince;
    Project Gutenberg; von Buelow; Sir John; Third Army; right wing; Crown
    Prince; Field Marshal; Von Buelow; First Army; Army Corps
    
    Correcting the start and ending
    >>> raw.find("PART I") 
    2629
    >>> raw.rfind("End of the Project Gutenberg")	
    1047663
    >>> raw = raw[2629:1047663]	
    >>> raw.find("PART I")	
    0
    For counting words I have this:

    Code:
    def getWordFrequencies(text):
        # Map each word in text to the number of times it occurs.
        frequencies = {}

        for c in re.split(r'\W+', text):
            if c:  # re.split can yield empty strings at the ends
                frequencies[c] = frequencies.get(c, 0) + 1

        return frequencies

    # <HERE THE BOOK SHOULD BE INSERTED, I THINK>

    result = dict([(w, Book.count(w)) for w in Book.split()])

    for i in result.items(): print("%s\t%d" % i)
    ----------

    Unfortunately, I have no idea how to feed the book into the word count. My ideal outcome would be something like this:

    Germany 2000
    United Kingdom 1500
    USA 1000
    Holland 50
    Belgium 150

    etc.


    Please help!
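
One way the two pieces above might be connected, as a rough sketch rather than a definitive answer: trim the text between the same two markers used in the session, count every token in one pass, then look up the countries of interest. The helper names `trim_gutenberg` and `word_frequencies` and the sample string are made up for illustration, and the code is written for Python 3 rather than the Python 2 used in the session.

```python
import re
from collections import Counter

def trim_gutenberg(raw, start_marker="PART I",
                   end_marker="End of the Project Gutenberg"):
    """Return the text between the two markers (hypothetical helper)."""
    start = raw.find(start_marker)
    end = raw.rfind(end_marker)
    # find/rfind return -1 when a marker is missing; fall back to the full text.
    if start == -1:
        start = 0
    if end == -1:
        end = len(raw)
    return raw[start:end]

def word_frequencies(text):
    """Count every run of word characters in a single pass."""
    return Counter(re.findall(r"\w+", text))

# Tiny stand-in for the downloaded book (assumption: real use passes `raw`).
sample = ("Header text about Germany. PART I Germany invaded Belgium. "
          "Belgium resisted; Germany advanced. "
          "End of the Project Gutenberg EBook")
body = trim_gutenberg(sample)
freq = word_frequencies(body)
for country in ["Germany", "Belgium", "Holland"]:
    print("%s\t%d" % (country, freq[country]))
```

Single-token counts like these cannot see multi-word names such as "United Kingdom", so those would need a phrase search instead.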
    Last edited by zmbd; Dec 14 '12, 05:37 AM. Reason: [Z{Format posted code/HTML/SQL using the <CODE/> button}]
  • Glenton
    Recognized Expert Contributor
    • Nov 2008
    • 391

    #2
    Hi

    Do you have a seed list of countries that you are going to look for? Are you going to look for "USA", "United States", "US of A", "United States of America", "America" as separate items? If so, you need also to be cognisant that "America" would also be found when "United States of America" is found etc.

    There tends to be a lot of messy details with real life data mining!!

    Also, please use the code tags - it will make it much clearer to us - especially with Python where indents are important.

    Anyway, you seem to have got some results so far - hopefully you can clarify your question a little more...
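
To illustrate the overlap problem described above, here is a minimal sketch, assuming a hand-written seed list: every alias maps to the name it should be counted under, and the aliases are tried longest first so that "United States of America" is matched as a whole and its trailing "America" is not counted again. The `ALIASES` table and `count_countries` are hypothetical names, and the code targets Python 3.

```python
import re
from collections import Counter

# Hypothetical seed list: each alias maps to the name it is counted under.
ALIASES = {
    "United States of America": "USA",
    "United States": "USA",
    "USA": "USA",
    "America": "USA",
    "Germany": "Germany",
    "Great Britain": "United Kingdom",
    "United Kingdom": "United Kingdom",
}

def count_countries(text, aliases=ALIASES):
    # Longest aliases first: regex alternation takes the first alternative
    # that matches, so the full phrase wins over its shorter parts.
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b")
    counts = Counter()
    for match in pattern.finditer(text):
        counts[aliases[match.group(0)]] += 1
    return counts

sample = ("The United States of America stayed neutral while Germany advanced. "
          "America later joined Great Britain against Germany.")
for country, n in sorted(count_countries(sample).items()):
    print("%s\t%d" % (country, n))
```

The `\b` word boundaries keep "American" from being counted as "America"; whether that is the right call depends on what you want the numbers to mean.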


    • eGrove Systems
      New Member
      • Dec 2012
      • 6

      #3
      Code:
      def count_words(file_name):
          # Count the whitespace-separated words in the file at file_name.
          num_words = 0

          with open(file_name, 'r') as f:
              for line in f:
                  num_words += len(line.split())

          return num_words

      words_count = count_words(file_name)  # file_name: file name with absolute path
      Last edited by zmbd; Dec 14 '12, 05:37 AM. Reason: [Z{Please format code using the <CODE/> button}{In the future explain your code or it may be deleted!}]


      • kttr
        New Member
        • Nov 2012
        • 12

        #4
        I agree with Glenton. Data analysis in real life, especially finding specific words that match a given meaning, is complicated. However, I will assume that you only want to find the number of times "Germany" is found and NOT "Prussia". Either way, this difference should be made clear.

        So far, your code is a little hard to read. Try adding more comments so that readers know what each part of your program is doing.

        All criticism aside, here is what I would try, and why:

        (Tell me if the code works or not!!!!!)

        Before doing anything, just copy and paste the text from the online book onto a notepad document and save it as "ww1.txt" (quotes not included in name). That way, you can avoid any troubles that might arise by reading the file over the internet.

        Once you've done that, here is what the code might look like (I have explained it using inline comments).

        Code:
        filehandler = open("ww1.txt", "r")  #read-only access is enough

        #start a counter variable for each country.
        #Every time you find the word, the corresponding variable will increase by 1
        #inside the 'for' loop below.
        germany_counter = 0
        holland_counter = 0
        #add more countries' counters here

        for line in filehandler:
            #each line is already a string, so find() can be used on it directly.
            #note: this counts lines that contain the word, not every occurrence.
            if line.find("Germany") != -1: #if the word Germany is found AT ALL in the line, execute the following code.
                germany_counter = germany_counter + 1 #increase counter by one when word is found in line
            if line.find("Holland") != -1:
                holland_counter = holland_counter + 1 #same principle as above applies here

        filehandler.close()

        #when it's done reading all of the lines, print out the country's name and the counter
        print("Germany %d" % germany_counter)
        print("Holland %d" % holland_counter)

        #obviously, you must add more counters and if statements to the code for other countries.
        #PLEASE NOTE: you will have to change the search string depending on what you're looking for. Do you want Prussia or Germany? Edit the search string to see the difference.
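
One caveat with testing `find(...) != -1` once per line: it counts lines that mention a country, so a line naming Germany twice only adds one. The sketch below, offered as one possible variation rather than the definitive fix, counts every occurrence and keeps the totals in a dict instead of one variable per country. `count_mentions` is a hypothetical helper, `io.StringIO` stands in for `open("ww1.txt")`, and the code is written for Python 3.

```python
import io

def count_mentions(lines, countries):
    """Return (occurrences, lines_with_match) dicts keyed by country name."""
    occurrences = dict.fromkeys(countries, 0)
    matching_lines = dict.fromkeys(countries, 0)
    for line in lines:
        for country in countries:
            n = line.count(country)  # non-overlapping occurrences in this line
            occurrences[country] += n
            if n:
                matching_lines[country] += 1
    return occurrences, matching_lines

# StringIO stands in for open("ww1.txt"); any iterable of lines works.
fake_file = io.StringIO(
    "Germany attacked, and Germany advanced.\n"
    "Holland stayed neutral.\n"
    "Germany paused.\n"
)
occurrences, matching_lines = count_mentions(fake_file, ["Germany", "Holland"])
for country in ["Germany", "Holland"]:
    print("%s\t%d occurrences on %d lines"
          % (country, occurrences[country], matching_lines[country]))
```

Because `str.count` matches substrings, "Germany's" is counted as a mention of Germany here; a word-boundary regex would be needed if that is not wanted.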


        • dwblas
          Recognized Expert Contributor
          • May 2008
          • 626

          #5
          The original code is from June and the OP has not responded. It was either solved months ago or forgotten.
