Regex for Extracting Email Addresses from Mixed Text - Handling Duplicates and Invalid Formats

👀 Views: 45 💬 Answers: 1 📅 Created: 2025-08-06

I'm writing unit tests and, after trying several solutions I found online, I still can't figure this out. I'm relatively new to this (I've been using Python for about a year), so bear with me.

I'm trying to extract email addresses from a block of text using regex in Python, but I'm running into issues with duplicates and invalid formats. My current code is:

```python
import re

text = "You can reach me at test@example.com, or at info@example.com. Also, test@example.com is another alias I use!"

pattern = r'\b[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,}\b'
emails = re.findall(pattern, text)

print(set(emails))  # Using set to remove duplicates
```

While this does extract the emails correctly, the list returned by `re.findall` still contains duplicates when an address occurs more than once in the text. For instance, `test@example.com` appears twice in `emails`. I'm also concerned about edge cases where the email format is slightly off, such as a missing `.com` or invalid characters. I want to keep only valid emails and avoid duplicates in the final list.

Does anyone have suggestions on how to improve this regex or the extraction method? If the regex could also be optimized for performance, that would be a bonus, since it needs to run efficiently as part of a larger text processing pipeline.

For context: I'm using Python 3.9.6 on Linux, and this is for a desktop app running on Windows 11.

How would you solve this? Is there a simpler solution I'm overlooking? Thanks in advance!
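One possible direction, as a rough sketch rather than a definitive answer: compile a loose candidate pattern once, run a stricter second check on each candidate, and deduplicate with a dict so the first-seen order is preserved. The names `CANDIDATE_RE`, `STRICT_RE`, and `extract_emails` are hypothetical, the strict rules (no leading, trailing, or consecutive dots in the local part, no domain label starting or ending with a hyphen) are a simplification and not full RFC 5322 validation, and comparing addresses case-insensitively is an assumption that may not suit every use case.

```python
import re

# Loose pattern to find candidates; compiled once so it can be reused
# across many calls in a pipeline.
CANDIDATE_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Stricter check applied to each candidate (simplified rules, an assumption):
# dot-separated local part with no empty segments, domain labels that do not
# start or end with a hyphen, and an alphabetic TLD of at least 2 letters.
STRICT_RE = re.compile(
    r"[A-Za-z0-9_%+-]+(?:\.[A-Za-z0-9_%+-]+)*"
    r"@(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}"
)


def extract_emails(text):
    """Return valid-looking addresses in first-seen order, without duplicates."""
    seen = {}  # lowercased address -> first spelling encountered
    for match in CANDIDATE_RE.finditer(text):
        candidate = match.group(0)
        if not STRICT_RE.fullmatch(candidate):
            continue  # drop candidates that fail the stricter check
        # Assumption: addresses differing only by case are the same address.
        seen.setdefault(candidate.lower(), candidate)
    return list(seen.values())


text = (
    "You can reach me at test@example.com, or at info@example.com. "
    "Also, test@example.com is another alias I use!"
)
print(extract_emails(text))
```

Unlike `set`, the dict keeps the addresses in the order they first appear, which tends to make unit tests easier to write because the result is deterministic.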