
FlashText

Unit:1 Foundations of Python and Its Applications in Machine Learning

FlashText is a specialized Python library built for one job: finding and replacing keywords in text quickly. Regular expressions can do the same work, but FlashText is significantly faster when the keyword list is large (from a few hundred to millions of terms).
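
For comparison, the regular-expression approach mentioned above might look like the sketch below. It uses only Python's standard `re` module; the keyword list and sentence are illustrative. All keywords are joined into one alternation, so the pattern itself grows with every keyword added.

```python
import re

# Illustrative keyword list (not taken from any real project)
keywords = ["Python", "Java", "Data Scientist"]

# Join all keywords into one alternation, longest first so that
# longer phrases are preferred over shorter ones.
alternation = "|".join(
    re.escape(k) for k in sorted(keywords, key=len, reverse=True)
)
pattern = re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)

sentence = "I am a Data Scientist and I love to code in Python and Java."
matches = pattern.findall(sentence)
print(matches)  # ['Data Scientist', 'Python', 'Java']
```

This works well for a handful of keywords; the case FlashText targets is the one where the list becomes very large.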

Its speed comes from its underlying algorithm, which is inspired by the Aho-Corasick algorithm and built on a Trie data structure. Instead of scanning the text once per keyword, FlashText loads all keywords into a Trie-based dictionary and then walks through the text a single time, matching characters against that dictionary as it goes. As a result, its running time depends mainly on the length of the text, not on the number of keywords you are searching for.

Key Concepts

Keyword Processor: The main object in FlashText that stores your dictionary of keywords.
Trie Data Structure: A tree-like data structure that stores the keyword dictionary efficiently for fast lookups.
Single Pass: FlashText processes the entire text in a single pass, which is the key to its speed. It doesn't re-scan the text for each keyword.
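
To make the Trie and single-pass ideas concrete, here is a deliberately simplified sketch in plain Python. It is not FlashText's actual implementation: the function names `build_trie` and `extract`, the `"_end_"` sentinel key, and the word-boundary rule (letters, digits and underscore count as word characters) are all assumptions made for illustration.

```python
def build_trie(keywords):
    """Build a nested-dict trie from a {keyword: clean_name} mapping."""
    trie = {}
    for keyword, clean_name in keywords.items():
        node = trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node["_end_"] = clean_name  # sentinel marks a complete keyword
    return trie


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


def extract(text, trie):
    """Walk the text left to right, collecting the longest keyword at each position."""
    found = []
    text = text.lower()
    i, n = 0, len(text)
    while i < n:
        # A keyword may only start at a word boundary.
        if i > 0 and is_word_char(text[i - 1]):
            i += 1
            continue
        node, j = trie, i
        match, match_end = None, i
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            at_boundary = j == n or not is_word_char(text[j])
            if "_end_" in node and at_boundary:
                match, match_end = node["_end_"], j  # remember longest match so far
        if match is not None:
            found.append(match)
            i = match_end  # resume after the matched keyword
        else:
            i += 1
    return found


trie = build_trie({
    "python": "python",
    "jython": "python",
    "java": "java",
    "data science": "data science",
    "data scientist": "data science",
})
result = extract("I am a Data Scientist and I love to code in Python and Java.", trie)
print(result)  # ['data science', 'python', 'java']
```

Note that the lookup at each character is a dictionary access into the current trie node, so adding more keywords makes the trie larger but does not add extra scans of the text.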

To use the examples, you first need to install FlashText: pip install flashtext

Code Examples

1. Extracting Keywords from a Sentence

This is the most basic use case: finding which keywords from your dictionary are present in a given text.

from flashtext import KeywordProcessor

# 1. Create a KeywordProcessor object
keyword_processor = KeywordProcessor()

# 2. Add keywords to the processor
# Keywords can be added one by one, from a list, or (as here) from a
# dictionary that maps each "clean" name to the variants to match
keyword_dict = {
    "python": ["Python", "Jython"],
    "java": ["Java", "J2EE"],
    "data science": ["Data Science", "Data Scientist"]
}
keyword_processor.add_keywords_from_dict(keyword_dict)

# 3. Find keywords in a sentence
sentence = "I am a Data Scientist and I love to code in Python and Java."
found_keywords = keyword_processor.extract_keywords(sentence)

print(f"Sentence: '{sentence}'")
print(f"Found Keywords: {found_keywords}")
# Output: Found Keywords: ['data science', 'python', 'java']

2. Replacing Keywords

FlashText can also replace keywords with a "clean" or standardized term. It does this in a single pass, so text that has already been replaced is never scanned again and cannot accidentally match another keyword.

from flashtext import KeywordProcessor

# Build a processor with the same dictionary as in the previous example
keyword_processor = KeywordProcessor()
keyword_dict = {
    "python": ["Python", "Jython"],
    "java": ["Java", "J2EE"],
    "data science": ["Data Science", "Data Scientist"]
}

# The dictionary keys are the "clean" terms that will be used for replacement.
keyword_processor.add_keywords_from_dict(keyword_dict)

sentence = "I am a Data Scientist and I love to code in Python and Jython."
new_sentence = keyword_processor.replace_keywords(sentence)

print(f"Original Sentence: '{sentence}'")
print(f"New Sentence: '{new_sentence}'")
# Output: New Sentence: 'I am a data science and I love to code in python and python.'
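
The problem that single-pass replacement avoids can be shown without FlashText at all. The sketch below uses a toy replacement table (hypothetical data) and compares chained `str.replace` calls with a one-pass substitution built on the standard `re` module. The regex version only mimics the single-pass behaviour; it is not how FlashText is implemented.

```python
import re

# Toy replacement table: note that the output of the first rule ("dog")
# is also the input of the second rule.
replacements = {"cat": "dog", "dog": "bird"}
text = "my cat chased a dog"

# Chained replacement: each rule re-scans the already modified text.
chained = text
for old, new in replacements.items():
    chained = chained.replace(old, new)
print(chained)  # my bird chased a bird

# Single pass: every match is looked up against the original text only.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, replacements)) + r")\b")
single_pass = pattern.sub(lambda m: replacements[m.group(0)], text)
print(single_pass)  # my dog chased a bird
```

In the chained version the word "cat" ends up as "bird" because the second rule matches text produced by the first. The single-pass version replaces each original word exactly once.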

3. Case Sensitivity

By default, FlashText is case-insensitive. You can easily change this during initialization.

from flashtext import KeywordProcessor

# Initialize the processor with case_sensitive=True
case_sensitive_processor = KeywordProcessor(case_sensitive=True)
case_sensitive_processor.add_keyword('Python')
case_sensitive_processor.add_keyword('java')

sentence = "I like Python but not python. I also like Java."
found_keywords = case_sensitive_processor.extract_keywords(sentence)

print(f"Sentence: '{sentence}'")
print(f"Found Keywords (Case-Sensitive): {found_keywords}")
# Output: ['Python'] (it does not match 'python' or 'Java')

4. Extracting Keywords with Span Information

Sometimes, you need to know where in the text a keyword was found. FlashText can provide the start and end character indices for each match.
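
As a reminder of what start and end character indices mean, the following sketch computes them by hand with `str.find` from the standard library. It assumes the usual Python convention that the start index is inclusive and the end index is exclusive, so slicing the sentence with the pair returns the matched text.

```python
sentence = "I am a data science professional who uses Python."
keyword = "data science"

start = sentence.find(keyword)   # index of the first matched character
end = start + len(keyword)       # index just past the last matched character

print((keyword, start, end))     # ('data science', 7, 19)
print(sentence[start:end])       # data science
```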

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('data science')
keyword_processor.add_keyword('Python')

sentence = "I am a data science professional who uses Python."

# Set span_info=True to get the start and end positions
found_keywords_with_span = keyword_processor.extract_keywords(sentence, span_info=True)