How To Split A Text File Into Its Words In Python?

February 24, 2023

I am very new to Python and have not worked with text before. I have 100 text files, each with around 100 to 150 lines of unstructured text describing a patient's condition. I read

Solution 1:

It depends on how you define a word, that is, what you regard as the delimiters.
Note that str.split in Python takes an optional separator argument, so you could pass one like this:

for lines in content[0].split():
    for word in lines.split(','):
        print(word)

Unfortunately, str.split accepts only a single separator, so you may need multi-level splitting like this:

for lines in content[0].split():
    for split0 in lines.split(' '):
        for split1 in split0.split(','):
            for split2 in split1.split('.'):
                for split3 in split2.split('?'):
                    for split4 in split3.split('!'):
                        for word in split4.split(':'): 
                            if word != "":
                                print(word)

Looks ugly, right? Luckily we can use iteration instead:

delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words
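
The loop above can be wrapped in a small function for reuse. This is only a minimal sketch, assuming the input is a list of strings; the name split_on_all and the sample text are made up for illustration. Because adjacent delimiters (for example a comma followed by a space) leave empty strings behind, the sketch drops them at the end.

```python
def split_on_all(chunks, delimiters):
    """Split every string in chunks on each delimiter in turn."""
    words = list(chunks)
    for delimiter in delimiters:
        new_words = []
        for word in words:
            new_words += word.split(delimiter)
        words = new_words
    # Adjacent delimiters (such as ", ") leave empty strings behind; drop them.
    return [word for word in words if word]


# Hypothetical sample input standing in for the lines of one file.
sample_lines = ["Patient is stable, no fever.\n", "Follow-up: two weeks!\n"]
result = split_on_all(sample_lines, ['\n', ' ', ',', '.', '?', '!', ':'])
print(result)
```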

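Or we could simply use the regular expression module. Here content must be a single string, and the delimiters are escaped because '.' and '?' have a special meaning in a regular expression:

import re
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = re.split('|'.join(map(re.escape, delimiters)), content)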
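
A slightly fuller sketch of the regular expression route, assuming the text is held in one string: each delimiter is passed through re.escape so that characters like '.' and '?' match literally, runs of delimiters are collapsed with '+', and empty strings are filtered out. The sample text and the variable names are invented for illustration.

```python
import re

delimiters = ['\n', ' ', ',', '.', '?', '!', ':']
# re.escape makes '.' and '?' match literally; '+' collapses runs of delimiters.
pattern = '(?:' + '|'.join(re.escape(d) for d in delimiters) + ')+'

# Hypothetical sample text standing in for the contents of one file.
text = "Is the patient stable? Yes: no fever, no pain.\n"
# re.split can still return an empty string at either end, so filter those out.
tokens = [w for w in re.split(pattern, text) if w]
print(tokens)
```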

Solution 2:

# Raw string, so backslashes in the Windows path (such as \r) are not treated as escapes
with open(r"C:\...\...\...\record-13.txt") as f:
    for line in f:
        for word in line.split():
            print(word)

Or, this gives you a list of words:

with open(r"C:\...\...\...\record-13.txt") as f:
    words = [word for line in f for word in line.split()]

Or, this gives you a list of lines, with each line as a list of words:

with open(r"C:\...\...\...\record-13.txt") as f:
    words = [line.split() for line in f]
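
Since the question mentions about 100 files, the same comprehension could be applied to every file in a folder. The following is a rough sketch under assumptions: the function name words_per_file, the "*.txt" pattern, and the temporary demo files are all hypothetical and only there to make the example runnable.

```python
import tempfile
from pathlib import Path


def words_per_file(folder, pattern="*.txt"):
    """Map each matching file name to the list of words it contains."""
    result = {}
    for path in sorted(Path(folder).glob(pattern)):
        with path.open() as f:
            result[path.name] = [word for line in f for word in line.split()]
    return result


# Demo on two throwaway files; in practice, pass the folder holding the records.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "record-1.txt").write_text("fever and cough\nno pain\n")
    Path(folder, "record-2.txt").write_text("stable\n")
    by_file = words_per_file(folder)

print(by_file)
```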

Solution 3:

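I would use the Natural Language Toolkit (NLTK), as the split() approach does not deal well with punctuation.

import nltk

# word_tokenize relies on NLTK's Punkt tokenizer data, which is fetched once via nltk.download()
with open(r"C:\...\...\...\record-13.txt") as f:
    for line in f:
        words = nltk.word_tokenize(line)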
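
To see the punctuation problem concretely, here is a small standard-library sketch. It is not what NLTK does (word_tokenize returns punctuation marks as separate tokens); it only shows how plain split() leaves punctuation glued to words, and one simple way to strip it. The sample sentence is invented.

```python
import string

# Hypothetical sample line.
line = "Patient reports pain, fever; no cough."

# Plain whitespace splitting keeps punctuation attached to the words.
naive = line.split()

# Strip leading/trailing punctuation from each token, then drop tokens that become empty.
cleaned = [w.strip(string.punctuation) for w in naive]
cleaned = [w for w in cleaned if w]

print(naive)
print(cleaned)
```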

Solution 4:

Nobody has suggested a generator, I'm surprised. Here's how I would do it:

def words(stringIterable):
    #upcast the argument to an iterator, if it's an iterator already, it stays the same
    lineStream = iter(stringIterable)
    for line in lineStream: #enumerate the lines
        for word in line.split(): #further break them down
            yield word

Now this can be used both on simple lists of sentences that you might have in memory already:

listOfLines = ['hi there', 'how are you']
for word in words(listOfLines):
    print(word)

But it will work just as well on a file, without needing to read the whole file into memory:

with open('words.py', 'r') as myself:
    for word in words(myself):
        print(word)
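
Because the generator is lazy, a consumer can stop early and no further lines are requested. A minimal sketch of that, assuming an in-memory io.StringIO object as a stand-in for a real file; the names fake_file and first_four are made up for the example.

```python
import io
from itertools import islice


def words(string_iterable):
    """Yield words one at a time from any iterable of strings."""
    for line in string_iterable:
        for word in line.split():
            yield word


# io.StringIO behaves like an open text file for the purposes of iteration.
fake_file = io.StringIO("first line here\nsecond line\nthird\n")

# islice stops pulling from the generator once four words have been produced.
first_four = list(islice(words(fake_file), 4))
print(first_four)
```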

Solution 5:

The most flexible approach is to use a list comprehension to generate a list of words:

with open(r"C:\...\...\...\record-13.txt") as f:
    words = [word
             for line in f
             for word in line.split()]

# Do what you want with the words list

You can then iterate over the list, feed it to a collections.Counter, or do anything else you please.
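
As an illustration of the collections.Counter idea, here is a short sketch that counts word frequencies. It uses an in-memory list of lines (invented sample data) instead of a file so that it runs on its own.

```python
from collections import Counter

# Hypothetical sample lines standing in for the lines of a file.
lines = ["no fever no cough", "fever returned"]
words = [word for line in lines for word in line.split()]

# Counter tallies how many times each word occurs.
counts = Counter(words)
print(counts.most_common(2))
```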
