Hi guys,
I'm very new to Python and was hoping you could give me some help.
I have a book about the Great War and want to count how many times each country appears in it. So far I have this:
Accessing the book
Code:
>>> from __future__ import division
>>> import nltk, re, pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/29270/29270.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1067008
>>> raw[:75]
'The Project Gutenberg EBook of The Story of the Great War, Volume II (of\r\nV'

Tokenizing
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
189743
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Story', 'of', 'the', 'Great']

Slicing
>>> text = nltk.Text(tokens)
>>> type(text)
<class 'nltk.text.Text'>
>>> text[1020:1060]
['Battles', 'of', 'the', 'Polish', 'Campaign', '462', 'LXXX.', 'Winter',
 'Battles', 'in', 'East', 'Prussia', '478', 'LXXXI.', 'Results', 'of',
 'First', 'Six', 'Months', 'of', 'Russo-German', 'Campaign', '482', 'PART',
 'VIII.--TURKEY', 'AND', 'THE', 'DARDANELLES', 'LXXXII.', 'First', 'Moves',
 'of', 'Turkey', '493', 'LXXXIII.', 'The', 'First', 'Blow', 'Against', 'the']
>>> text.collocations()
Building collocations list
General von; Project Gutenberg-tm; East Prussia; Von Kluck; von Kluck;
General Staff; General Joffre; army corps; General Foch; crown prince;
Project Gutenberg; von Buelow; Sir John; Third Army; right wing; Crown
Prince; Field Marshal; Von Buelow; First Army; Army Corps

Correcting the start and ending
>>> raw.find("PART I")
2629
>>> raw.rfind("End of the Project Gutenberg")
1047663
>>> raw = raw[2629:1047663]
>>> raw.find("PART I")
0
For counting words I have this:
Code:
def getWordFrequencies(text):
    # Count how often each word occurs in text.
    frequencies = {}
    for c in re.split(r'\W+', text):
        frequencies[c] = frequencies.get(c, 0) + 1
    return frequencies

# <HERE THE BOOK SHOULD BE INSERTED, I THINK>

result = dict([(w, Book.count(w)) for w in Book.split()])
for i in result.items():
    print "%s\t%d" % i
Unfortunately, I have no idea how to feed the book into the word count. My ideal outcome would be something like this:
Germany 2000
United Kingdom 1500
USA 1000
Holland 50
Belgium 150
etc.
Please help!
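One possible way to get from the `raw` string to a table like the one above, as a minimal sketch rather than a definitive answer: search the raw text for each country name directly instead of going through single-word tokens, since a name like "United Kingdom" spans two tokens. The function name `count_countries`, the `countries` list, and the `sample` string are all assumptions made for illustration; in the real session `sample` would be replaced by `raw`. The code avoids the Python 2 print statement so it should run on Python 2.7 and Python 3 alike.

```python
import re

def count_countries(text, countries):
    """Return {country: number of whole-word occurrences in text}."""
    counts = {}
    for name in countries:
        # Escape each word of the name and allow any whitespace between
        # words, so "United\r\nKingdom" split across a line still matches.
        body = r"\s+".join(re.escape(part) for part in name.split())
        # \b keeps "USA" from matching inside a longer word such as "USAF".
        pattern = r"\b" + body + r"\b"
        counts[name] = len(re.findall(pattern, text))
    return counts

# Hypothetical list; extend it with the names the book actually uses.
countries = ["Germany", "United Kingdom", "USA", "Holland", "Belgium"]

# Stand-in for the `raw` string from the session above.
sample = "Germany invaded Belgium. Belgium resisted; Germany pressed on."

result = count_countries(sample, countries)

# Print the most frequent names first, as "name<TAB>count".
for name, n in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print("%s\t%d" % (name, n))
```

The match is case-sensitive, so all-caps chapter headings would not be counted unless `re.IGNORECASE` is passed to `re.findall`. Alternative spellings ("England", "Great Britain", "United States") are separate strings and would need their own entries in the list.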