I'm using Python 3.5.2.
I have two lists:
- a list of about 750,000 "sentences" (long strings)
- a list of about 20,000 "words" that I would like to delete from my 750,000 sentences
So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.
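To make the whole-word requirement concrete, here is a toy illustration (the sentence and the word "cat" are made up for this example, not taken from my real data). It shows why a plain substring replacement is not enough:

```python
import re

sentence = "the cat concatenates"  # made-up example sentence

# Plain str.replace also removes the "cat" inside "concatenates" (not what I want)
naive = sentence.replace("cat", "")

# What I want: only the standalone word "cat" is removed
wanted = re.sub(r'\bcat\b', "", sentence)

print(repr(naive))   # 'the  conenates'
print(repr(wanted))  # 'the  concatenates'
```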
I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter:

    import re

    # one compiled pattern per word, anchored at word boundaries
    compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my "sentences":

    for sentence in sentences:
        for word in compiled_words:
            sentence = re.sub(word, "", sentence)
        # put sentence into a growing list
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
- Is there a way to use the str.replace method (which I believe is faster), while still requiring that replacements only happen at word boundaries?
- Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when the word is longer than the sentence, but it's not much of an improvement.
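For reference, the length check mentioned above looks roughly like this. It is only a minimal sketch: the two small lists are hypothetical stand-ins for my real data, the `cleaned` list name is illustrative, and the `re.escape` call is an extra precaution added here in case a word contains regex metacharacters (my original loop does not use it).

```python
import re

# Hypothetical toy data standing in for the real lists
my20000words = ["cat", "extraordinarily"]
sentences = ["the cat sat", "a dog"]

# Keep each word's length next to its compiled pattern
compiled_words = [(len(word), re.compile(r'\b' + re.escape(word) + r'\b'))
                  for word in my20000words]

cleaned = []
for sentence in sentences:
    for length, pattern in compiled_words:
        # A word longer than the sentence cannot match, so skip the regex call
        if length > len(sentence):
            continue
        sentence = pattern.sub("", sentence)
    cleaned.append(sentence)

print(cleaned)
```

Since most of my sentences are much longer than any single word, the check rarely triggers, which is probably why the gain is so small.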
Thank you for any suggestions.