Python Program That Finds Most Frequent Word In A .txt File, Must Print Word And Its Count

As of right now, I have a function to replace the countChars function, def countWords(lines): wordDict = {} for line in lines: wordList = lines.split() for word in word

Solution 1:

If you need to count the words in a passage, a regular expression is a good way to split the text into words.

Let's start with a simple example:

import re

my_string = "Wow! Is this true? Really!?!? This is crazy!"

words = re.findall(r'\w+', my_string) #This finds words in the document

Result:

>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']

Note that "Is" and "is" are counted as two different words. My guess is that you want to count them as the same word, so we can convert all the words to uppercase and then count them.

from collections import Counter

cap_words = [word.upper() for word in words] #converts all the words to uppercase

word_counts = Counter(cap_words) #counts how many times each word appears

Result:

>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})

Are you good up to here?

Now we need to do exactly the same thing we did above, just this time we are reading the text from a file.

import re
from collections import Counter

with open('your_file.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word.upper() for word in words]

word_counts = Counter(cap_words)
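The question asks for the single most frequent word and its count, so one more step is needed after building word_counts. Here is a minimal sketch of that step. The function name most_frequent_word and the sample sentence are made up for illustration. In practice you would pass in the passage read from your file.

```python
import re
from collections import Counter

def most_frequent_word(passage):
    """Return (word, count) for the most frequent word, ignoring case."""
    words = re.findall(r'\w+', passage)
    word_counts = Counter(word.upper() for word in words)
    # most_common(1) returns a one-element list of (word, count) pairs
    return word_counts.most_common(1)[0]

# Hypothetical sample text standing in for the contents of your file
sample = "The cat saw the dog, and the dog ran."
word, count = most_frequent_word(sample)
print(word, count)
```

Note that this raises an IndexError on an empty passage, so you may want to check for that case if your file can be empty.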

Solution 2:

This program is actually a 4-liner, if you use the powerful tools at your disposal:

import re
import collections

with open(yourfile) as f:
    text = f.read()

words = re.compile(r"[\w']+", re.U).findall(text)   # re.U == re.UNICODE
counts = collections.Counter(words)

The regular expression will find all words, regardless of the punctuation adjacent to them (but counting apostrophes as part of the word).

A Counter acts almost exactly like a dictionary, but you can also do things like counts.most_common(10), add counters together, etc. See help(Counter).
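To make those Counter features concrete, here is a small sketch with made-up word lists. The variable names first, second and total are just for illustration.

```python
from collections import Counter

# Two hypothetical batches of words, e.g. from two different files
first = Counter("spam spam eggs".split())
second = Counter("spam ham".split())

# Adding two counters sums the counts word by word
total = first + second

# most_common(1) gives a list with the single top (word, count) pair
top_word, top_count = total.most_common(1)[0]
print(top_word, top_count)

# Unlike a plain dict, a missing key gives 0 instead of raising KeyError
print(total["bacon"])
```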

I would also suggest that you not make functions printBy..., since only functions without side-effects are easy to reuse.

def countsSortedAlphabetically(counter, **kw):
    return sorted(counter.items(), **kw)

#def countsSortedNumerically(counter, **kw):
#    return sorted(counter.items(), key=lambda x:x[1], **kw)
#### use counter.most_common(n) instead

# `from pprint import pprint as pp` is also useful
def printByLine(tuples):
    print( '\n'.join(' '.join(map(str,t)) for t in tuples) )

Demo:

>>> words = Counter(['test','is','a','test'])
>>> printByLine( countsSortedAlphabetically(words, reverse=True) )
test 2
is 1
a 1

Edit to address Mateusz Konieczny's comment: replaced [a-zA-Z'] with [\w']. The character class \w, according to the Python docs, "Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched." (It does not match an apostrophe, which is why ' is listed separately.) However, \w includes _ and 0-9, so if you don't want those and you aren't working with Unicode, you can use [a-zA-Z']; if you are working with Unicode, you'd need a negative assertion or something similar to subtract [0-9_] from the \w character class.
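The difference between these character classes is easier to see side by side. The sketch below uses a made-up sample string. The third pattern is one common way to do the subtraction: it uses a doubly negated class, [^\W\d_], rather than a negative assertion, so treat it as an alternative and not as the method meant above.

```python
import re

# Hypothetical sample text; \u00ef is "i" with diaeresis, \u00e9 is "e" with acute
text = "na\u00efve caf\u00e9 don't stop_2 go"

# \w plus apostrophe: keeps accented letters, but also digits and underscores
unicode_words = re.findall(r"[\w']+", text)

# ASCII letters plus apostrophe: drops digits and underscores, but splits accented words
ascii_words = re.findall(r"[a-zA-Z']+", text)

# [^\W\d_] means "a word character that is neither a digit nor an underscore"
letters_only = re.findall(r"(?:[^\W\d_]|')+", text)

print(unicode_words)
print(ascii_words)
print(letters_only)
```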

Solution 3:

You have a simple typo, words where you want word.

Edit: You appear to have edited the source. Please use copy and paste to get it right the first time.

Edit 2: Apparently you're not the only one prone to typos. The real problem is that you have lines where you want line. I apologize for accusing you of editing the source.
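For reference, here is roughly what the function might look like with that fix applied. The code in the question is cut off, so everything after the inner for loop is an assumed completion and not the asker's actual code. The sample lines are made up as well.

```python
def countWords(lines):
    """Count how often each whitespace-separated word occurs in an iterable of lines."""
    wordDict = {}
    for line in lines:
        wordList = line.split()  # the fix: split the current line, not `lines`
        for word in wordList:
            # Assumed completion: the original code is truncated at this point
            wordDict[word] = wordDict.get(word, 0) + 1
    return wordDict

# Hypothetical input standing in for the lines of the .txt file
counts = countWords(["this is a test", "a test of a function"])
top = max(counts, key=counts.get)  # the key with the largest count
print(top, counts[top])
```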

Solution 4:

words = ['red', 'green', 'black', 'pink', 'black', 'white', 'black',
'eyes','white', 'black', 'orange', 'pink', 'pink', 'red', 'red',
'white', 'orange', 'white', "black", 'pink', 'green', 'green', 'pink',
'green', 'pink','white', 'orange', "orange", 'red']

from collections import Counter
counts = Counter(words)
top_four = counts.most_common(4)
print(top_four)

Solution 5:

Here is a possible solution, not as elegant as ninjagecko's, but still:

from collections import defaultdict

dicto = defaultdict(int)

with open('yourfile.txt') as f:
    for line in f:
        s_line = line.rstrip().split(',')  # assuming ',' is the delimiter
        for ele in s_line:
            dicto[ele] += 1

# dicto contains words as keys, word counts as values
for k, v in dicto.items():
    print(k, v)
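The loop above prints every word with its count. To get only the most frequent word, which is what the question asks for, something like the following sketch should work. The sample_lines list is hypothetical data standing in for the file, and it assumes the same ',' delimiter.

```python
from collections import defaultdict

# Hypothetical comma-delimited lines standing in for the contents of yourfile.txt
sample_lines = ["red,green,red\n", "blue,red,green\n"]

dicto = defaultdict(int)
for line in sample_lines:
    for ele in line.rstrip().split(','):
        dicto[ele] += 1

# max() with key=dicto.get returns the key whose count is largest
top_word = max(dicto, key=dicto.get)
print(top_word, dicto[top_word])
```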
