Hi guys,
I'm very new to Python and was hoping you could give me some help.
I have a book about the Great War and want to count how many times each country is mentioned in it. So far I have this:
Accessing the book
Code:
>>> from __future__ import division
>>> import nltk, re, pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/29270/29270.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1067008
>>> raw[:75]
'The Project Gutenberg EBook of The Story of the Great War, Volume II (of\r\nV'
Tokenizing
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
189743
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Story', 'of', 'the', 'Great']
Slicing
>>> text = nltk.Text(tokens)
>>> type(text)
<class 'nltk.text.Text'>
>>> text[1020:1060]
['Battles', 'of', 'the', 'Polish', 'Campaign', '462', 'LXXX.', 'Winter', 'Battles', 'in', 'East', 'Prussia', '478', 'LXXXI.', 'Results', 'of', 'First', 'Six', 'Months', 'of', 'Russo-German', 'Campaign', '482', 'PART', 'VIII.--TURKEY', 'AND', 'THE', 'DARDANELLES', 'LXXXII.', 'First', 'Moves', 'of', 'Turkey', '493', 'LXXXIII.', 'The', 'First', 'Blow', 'Against', 'the']
>>> text.collocations()
Building collocations list
General von; Project Gutenberg-tm; East Prussia; Von Kluck; von Kluck;
General Staff; General Joffre; army corps; General Foch; crown prince;
Project Gutenberg; von Buelow; Sir John; Third Army; right wing; Crown
Prince; Field Marshal; Von Buelow; First Army; Army Corps
Trimming the start and end
>>> raw.find("PART I")
2629
>>> raw.rfind("End of the Project Gutenberg")
1047663
>>> raw = raw[2629:1047663]
>>> raw.find("PART I")
0
For counting words I have this:
Code:
import re

def getWordFrequencies(text):
    # Map each word in text to the number of times it occurs
    frequencies = {}
    for c in re.split(r'\W+', text):
        frequencies[c] = frequencies.get(c, 0) + 1
    return frequencies
<HERE THE BOOK SHOULD BE INSERTED, I THINK>
result = dict([(w, Book.count(w)) for w in Book.split()])
for i in result.items(): print "%s\t%d"%i
Unfortunately, I have no idea how to feed the book into the word count. My ideal outcome would be something like this:
Germany 2000
United Kingdom 1500
USA 1000
Holland 50
Belgium 150
etc.
Please help!
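For what it's worth, here is a minimal sketch of one way the book text could be fed into a country count. It rests on a few assumptions: COUNTRIES is a hypothetical hand-written list, count_countries is an illustrative helper name, and the regex approach is swapped in for the word-frequency dict above because names like "United Kingdom" span two tokens. It only counts exact spellings, so "German" or "Britain" would need their own entries. It should run under Python 2.7 as well as Python 3.

```python
from __future__ import print_function
import re
from collections import Counter

# Hypothetical list of names to look for; extend it as needed.
COUNTRIES = ["Germany", "United Kingdom", "USA", "Holland", "Belgium"]

def count_countries(text, countries=COUNTRIES):
    """Return a Counter mapping each name to its number of whole-word matches."""
    counts = Counter()
    for name in countries:
        # Join the words of a name with \s+ so that a name broken across a
        # line ("United\r\nKingdom") still matches; \b avoids partial words.
        parts = [re.escape(part) for part in name.split()]
        pattern = r"\b" + r"\s+".join(parts) + r"\b"
        counts[name] = len(re.findall(pattern, text))
    return counts

# Small made-up sample standing in for the book text.
sample = ("Germany invaded Belgium. Belgium resisted; the United\r\n"
          "Kingdom then declared war on Germany.")

# Print the names from most to least frequent, tab-separated.
for name, n in count_countries(sample).most_common():
    print("%s\t%d" % (name, n))
```

In the session above, the trimmed string would presumably be passed in directly, as in count_countries(raw), since raw already holds the book without the Project Gutenberg header and footer.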