How To Split A Text File To Its Words In Python?
Solution 1:
It depends on how you define words, or what you regard as the delimiters.
Note that str.split in Python takes an optional delimiter argument, so you can pass one like this:
for lines in content[0].split():
    for word in lines.split(','):
        print(word)
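As a quick illustration of how the delimiter argument changes the result, here is a small sketch. The sample string is made up for this example.

```python
# Sample text invented for illustration
text = "red, green,blue  yellow"

# No argument: split on runs of whitespace, never producing empty strings
by_whitespace = text.split()

# Explicit delimiter: split at every comma, keeping any surrounding spaces
by_comma = text.split(",")

print(by_whitespace)
print(by_comma)
```

Note that with an explicit delimiter the spaces stay attached to the pieces, which is why a single call is rarely enough for real text.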
Unfortunately, str.split accepts only a single delimiter, so you may need multi-level splitting like this:
for lines in content[0].split():
    for split0 in lines.split(' '):
        for split1 in split0.split(','):
            for split2 in split1.split('.'):
                for split3 in split2.split('?'):
                    for split4 in split3.split('!'):
                        for word in split4.split(':'):
                            if word != "":
                                print(word)
Looks ugly, right? Luckily we can use iteration instead:
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words
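The loop above can be wrapped in a small function. This is only a minimal sketch: the name split_on_all is hypothetical, the sample lines are invented, and dropping empty strings at the end is an added assumption about what you want from the result.

```python
def split_on_all(pieces, delimiters):
    """Split every string in `pieces` on each delimiter in turn."""
    words = list(pieces)
    for delimiter in delimiters:
        new_words = []
        for word in words:
            new_words += word.split(delimiter)
        words = new_words
    # Adjacent delimiters leave empty strings behind; drop them
    return [word for word in words if word]


# Sample input invented for illustration
lines = ["Hello, world!", "How are you?"]
result = split_on_all(lines, [" ", ",", ".", "?", "!", ":"])
print(result)
```

Taking a list of strings as input matches how the loop treats content, where each element is split independently.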
EDITED: Or we could simply use the regular expression module:
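import re
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
# Escape each delimiter so characters like '.' and '?' are matched literally;
# content must be a single string here
words = re.split('|'.join(map(re.escape, delimiters)), content)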
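An alternative to listing delimiters is to describe the words themselves. The sketch below uses re.findall with the pattern \w+, which is a different technique from the re.split call above. The sample text is invented for illustration.

```python
import re

# Sample text invented for illustration
text = "Hello, world! How are you?\nFine: thanks."

# \w+ matches runs of letters, digits, and underscores,
# so punctuation and whitespace are skipped automatically
words = re.findall(r"\w+", text)
print(words)
```

One trade-off to be aware of: \w does not match apostrophes or hyphens, so a word like "don't" would come out as two pieces.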
Solution 2:
with open(r"C:\...\...\...\record-13.txt") as f:
    for line in f:
        for word in line.split():
            print(word)
Or, this gives you a list of words:
with open(r"C:\...\...\...\record-13.txt") as f:
    words = [word for line in f for word in line.split()]
Or, this gives you a list of lines, with each line as a list of words:
with open(r"C:\...\...\...\record-13.txt") as f:
    words = [line.split() for line in f]
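To see the difference between the two list shapes, here is a small sketch that uses io.StringIO as a stand-in for an open file, so it runs without any file on disk. The sample text is invented.

```python
import io

# Sample content invented for illustration
sample = "the quick brown fox\njumps over\n"

# Flat list: every word in the "file", in order
with io.StringIO(sample) as f:
    flat = [word for line in f for word in line.split()]

# Nested list: one inner list of words per line
with io.StringIO(sample) as f:
    nested = [line.split() for line in f]

print(flat)
print(nested)
```

The flat form suits counting or searching, while the nested form keeps track of which line each word came from.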
Solution 3:
I would use the Natural Language Toolkit (NLTK), since split() does not deal well with punctuation.
import nltk
# `file` is assumed to be an already-open text file object
for line in file:
    words = nltk.word_tokenize(line)
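To illustrate the punctuation issue mentioned above without requiring NLTK to be installed, the sketch below shows what plain split() returns. The cleanup step with str.strip and string.punctuation is only a rough standard-library approximation, not what nltk.word_tokenize does, and the sample sentence is invented.

```python
import string

# Sample sentence invented for illustration
sentence = "Hello, world! Is this working?"

# Plain split() keeps punctuation glued to the neighbouring word
raw = sentence.split()

# Rough cleanup: strip leading and trailing punctuation from each token
cleaned = [token.strip(string.punctuation) for token in raw]

print(raw)
print(cleaned)
```

A proper tokenizer also handles cases this misses, such as contractions and punctuation inside a token.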
Solution 4:
I'm surprised nobody has suggested a generator. Here's how I would do it:
def words(stringIterable):
    # convert the argument to an iterator; if it already is one, it stays the same
    lineStream = iter(stringIterable)
    for line in lineStream:  # iterate over the lines
        for word in line.split():  # break each line down into words
            yield word
Now this can be used both on simple lists of sentences that you might have in memory already:
listOfLines = ['hi there', 'how are you']
for word in words(listOfLines):
    print(word)
But it will work just as well on a file, without needing to read the whole file into memory:
with open('words.py', 'r') as myself:
    for word in words(myself):
        print(word)
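Because the generator produces words on demand, it can be combined with itertools.islice to take only the first few words. This is a minimal sketch: io.StringIO stands in for a large open file, the sample text is invented, and the generator is restated in a shortened form so the example runs on its own.

```python
import io
from itertools import islice


def words(string_iterable):
    """Yield whitespace-separated words from any iterable of strings."""
    for line in string_iterable:
        for word in line.split():
            yield word


# io.StringIO stands in for a large open file
stream = io.StringIO("one two three\nfour five six\nseven eight\n")

# islice stops the generator after three words,
# so the remaining words are never produced
first_three = list(islice(words(stream), 3))
print(first_three)
```

The same pattern works with any iterable of strings, including a real file object opened with open().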
Solution 5:
The most flexible approach is to use a list comprehension to generate a list of words:
with open(r"C:\...\...\...\record-13.txt") as f:
    words = [word
             for line in f
             for word in line.split()]
# Do what you want with the words list
You can then iterate over the list, pass it to a collections.Counter, or do anything else you please.
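As one example of what to do with the list, here is a small sketch of counting word frequencies with collections.Counter. The words list below is invented and stands in for the one read from the file.

```python
from collections import Counter

# Sample list standing in for the words read from the file
words = ["the", "cat", "and", "the", "hat", "and", "the", "bat"]

# Counter accepts any iterable and tallies how often each item occurs
counts = Counter(words)

# most_common(n) returns the n most frequent items with their counts
top_two = counts.most_common(2)
print(top_two)
```

Counter also accepts a generator, so the intermediate list is optional if only the counts are needed.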