FlashText is a Python library built for one purpose: finding and replacing keywords in text quickly. Regular expressions can do the same job, but FlashText is typically much faster when the keyword list is large (from a few hundred keywords to millions).
Its speed comes from its algorithm, which is inspired by Aho-Corasick and built on a Trie data structure. Instead of scanning the text once per keyword, FlashText stores all keywords in the trie and then walks the text once, checking each position against the trie. As a result, its running time depends mainly on the length of the text, not on the number of keywords you are searching for.
Key Concepts
- Keyword Processor: The main object in FlashText (the KeywordProcessor class) that stores your dictionary of keywords.
- Trie Data Structure: A tree-like data structure that stores the keyword dictionary efficiently for fast lookups.
- Single Pass: FlashText processes the entire text in a single pass, which is the key to its speed. It doesn't re-scan the text for each keyword.
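The trie and single-pass ideas can be illustrated with a small pure-Python sketch. This is not FlashText's actual implementation: the names build_trie and extract are hypothetical, the word-boundary rule (ASCII letters, digits and underscore) is an assumption, and matching is always case-insensitive here. It only shows how one left-to-right scan over the text can check all keywords at once.

```python
import string

_END = "_end_"  # marker key that stores the clean name where a keyword ends

# Characters treated as part of a word; anything else is a boundary (assumption).
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(keyword_dict):
    """Build a character trie from {clean_name: [variants]}, lowercased."""
    trie = {}
    for clean_name, variants in keyword_dict.items():
        for variant in variants:
            node = trie
            for ch in variant.lower():
                node = node.setdefault(ch, {})
            node[_END] = clean_name
    return trie


def extract(text, trie):
    """Scan the text left to right and return the clean names of matches."""
    found = []
    lowered = text.lower()
    i, n = 0, len(lowered)
    while i < n:
        # Only start matching at the beginning of a word.
        starts_word = lowered[i] in WORD_CHARS and (
            i == 0 or lowered[i - 1] not in WORD_CHARS
        )
        if not starts_word:
            i += 1
            continue
        node, j = trie, i
        best_end, best_name = None, None
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Accept a keyword only if it ends on a word boundary.
            if _END in node and (j == n or lowered[j] not in WORD_CHARS):
                best_end, best_name = j, node[_END]
        if best_name is not None:
            found.append(best_name)
            i = best_end  # continue after the longest match
        else:
            i += 1
    return found


keyword_dict = {
    "python": ["Python", "Jython"],
    "java": ["Java", "J2EE"],
    "data science": ["Data Science", "Data Scientist"],
}
trie = build_trie(keyword_dict)
print(extract("I am a Data Scientist and I love to code in Python and Java.", trie))
# ['data science', 'python', 'java']
```

Adding more keywords makes the trie larger but does not add extra scans of the text, which is the property the list above describes.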
To use the examples, you first need to install FlashText: pip install flashtext
Code Examples
1. Extracting Keywords from a Sentence
This is the most basic use case: finding which keywords from your dictionary are present in a given text.
from flashtext import KeywordProcessor
# 1. Create a KeywordProcessor object
keyword_processor = KeywordProcessor()
# 2. Add keywords to the processor
# Here they come from a dict that maps each "clean" name to a list of variants
keyword_dict = {
"python": ["Python", "Jython"],
"java": ["Java", "J2EE"],
"data science": ["Data Science", "Data Scientist"]
}
keyword_processor.add_keywords_from_dict(keyword_dict)
# 3. Find keywords in a sentence
sentence = "I am a Data Scientist and I love to code in Python and Java."
found_keywords = keyword_processor.extract_keywords(sentence)
print(f"Sentence: '{sentence}'")
print(f"Found Keywords: {found_keywords}")
# Output: ['data science', 'python', 'java']
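For comparison, the regular-expression approach mentioned in the introduction might look roughly like the following. It is a stdlib-only sketch, not part of FlashText; the helper name extract_with_regex and the inverted variant_to_clean mapping are hypothetical. It shows what the clean-name mapping does: every variant found in the text is reported under its dictionary key.

```python
import re

keyword_dict = {
    "python": ["Python", "Jython"],
    "java": ["Java", "J2EE"],
    "data science": ["Data Science", "Data Scientist"],
}

# Invert the dictionary: lowercased variant -> clean name.
variant_to_clean = {
    variant.lower(): clean
    for clean, variants in keyword_dict.items()
    for variant in variants
}

# One alternation of all variants, longest first, matched on word boundaries.
variants = sorted(variant_to_clean, key=len, reverse=True)
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
    re.IGNORECASE,
)


def extract_with_regex(text):
    """Return the clean name for each variant found, in order of appearance."""
    return [variant_to_clean[m.group(0).lower()] for m in pattern.finditer(text)]


sentence = "I am a Data Scientist and I love to code in Python and Java."
print(extract_with_regex(sentence))  # ['data science', 'python', 'java']
```

With a handful of keywords this works fine; the alternation grows with every keyword added, which is the situation FlashText is designed to avoid.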
2. Replacing Keywords
FlashText can also replace keywords with a "clean" or standardized term. It does this in a single pass, avoiding issues where a replaced word might match another keyword.
from flashtext import KeywordProcessor
# Build a processor with the same keywords as in the previous example
keyword_processor = KeywordProcessor()
keyword_dict = {
"python": ["Python", "Jython"],
"java": ["Java", "J2EE"],
"data science": ["Data Science", "Data Scientist"]
}
# The dictionary keys are the "clean" terms that will be used for replacement.
keyword_processor.add_keywords_from_dict(keyword_dict)
sentence = "I am a Data Scientist and I love to code in Python and Jython."
new_sentence = keyword_processor.replace_keywords(sentence)
print(f"Original Sentence: '{sentence}'")
print(f"New Sentence: '{new_sentence}'")
# Output: 'I am a data science and I love to code in python and python.'
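To see why replacing in a single pass matters, here is a stdlib-only sketch that compares chained str.replace calls with a one-pass substitution. The replacements mapping and both function names are hypothetical, chosen so that the output of one rule is the keyword of another. The one-pass version uses a regular expression rather than FlashText's trie, so it illustrates the behavior, not the library's internals.

```python
import re

# Hypothetical mapping in which one rule's output is another rule's keyword.
replacements = {"Jython": "Python", "Python": "Py"}


def replace_chained(text, mapping):
    """Apply each rule in turn; later rules also see text produced by earlier ones."""
    for keyword, clean in mapping.items():
        text = text.replace(keyword, clean)
    return text


def replace_single_pass(text, mapping):
    """Scan the text once; each match is replaced based on the original text only."""
    # Longest keywords first so longer alternatives win over shorter ones.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


sentence = "I code in Jython and Python."
print(replace_chained(sentence, replacements))      # I code in Py and Py.
print(replace_single_pass(sentence, replacements))  # I code in Python and Py.
```

In the chained version, "Jython" first becomes "Python" and is then rewritten again to "Py". The single-pass version never re-reads its own output, so each keyword is replaced exactly once.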
3. Case Sensitivity
By default, FlashText is case-insensitive. You can easily change this during initialization.
from flashtext import KeywordProcessor
# Initialize the processor with case_sensitive=True
case_sensitive_processor = KeywordProcessor(case_sensitive=True)
case_sensitive_processor.add_keyword('Python')
case_sensitive_processor.add_keyword('java')
sentence = "I like Python but not python. I also like Java."
found_keywords = case_sensitive_processor.extract_keywords(sentence)
print(f"Sentence: '{sentence}'")
print(f"Found Keywords (Case-Sensitive): {found_keywords}")
# Output: ['Python'] (it does not match 'python' or 'Java')
4. Extracting Keywords with Span Information
Sometimes, you need to know where in the text a keyword was found. FlashText can provide the start and end character indices for each match.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('data science')
keyword_processor.add_keyword('Python')
sentence = "I am a data science professional who uses Python."
# Set span_info=True to get the start and end positions
found_keywords_with_span = keyword_processor.extract_keywords(sentence, span_info=True)
print(f"Sentence: '{sentence}'")
print(f"Found Keywords with Span Info: {found_keywords_with_span}")
# Output: [('data science', 7, 19), ('Python', 42, 48)]
# You can use this info to highlight the words
for keyword, start, end in found_keywords_with_span:
    print(f"Found '{keyword}' at index {start}:{end} -> '{sentence[start:end]}'")
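Span tuples can also be used to build a highlighted copy of the text. The sketch below is stdlib-only: the helper highlight is hypothetical, and the spans are computed with str.find in the same (keyword, start, end) shape rather than taken from a live FlashText call. It assumes the spans do not overlap.

```python
def highlight(text, spans, left="[", right="]"):
    """Wrap each (keyword, start, end) span in markers; spans must not overlap."""
    pieces = []
    cursor = 0
    for _keyword, start, end in sorted(spans, key=lambda s: s[1]):
        pieces.append(text[cursor:start])              # text before the match
        pieces.append(left + text[start:end] + right)  # the marked match
        cursor = end
    pieces.append(text[cursor:])                       # trailing text
    return "".join(pieces)


sentence = "I am a data science professional who uses Python."
# Build spans in the (keyword, start, end) form, with an exclusive end index.
spans = [
    (kw, sentence.find(kw), sentence.find(kw) + len(kw))
    for kw in ("data science", "Python")
]
print(highlight(sentence, spans))
# I am a [data science] professional who uses [Python].
```

Because the end index is exclusive, sentence[start:end] yields exactly the matched text, which is what makes this kind of slicing straightforward.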