Regex for Extracting Phone Numbers from Mixed Content - Handling Formatting Variations
I've been banging my head against this for hours. I'm converting an old project and writing unit tests for it, and I'm trying to extract phone numbers from a large dataset of mixed content: plain text, HTML, and several phone number formats. The formats I need to handle are: (123) 456-7890, 123-456-7890, and 1234567890.

I initially tried the following regex:

```python
import re

text = "Call me at (123) 456-7890 or 123-456-7890, but not 1234-567-890!"
pattern = r'\(?(?P<area>\d{3})\)?[ -]?\d{3}[ -]?\d{4}'
phone_numbers = re.findall(pattern, text)
print(phone_numbers)
```

However, this only returns the area code, not the full phone number. I expected a list like ['(123) 456-7890', '123-456-7890'], but instead I got ['123', '123'].

I've tried adjusting the pattern, but I'm unsure how to capture the full phone numbers, especially when they are formatted differently. I also want to make sure it doesn't capture similar-looking numbers such as 1234-567-890, which aren't valid phone numbers.

Can anyone suggest a regex that captures valid U.S. phone numbers while excluding invalid formats? Any help would be appreciated!

I'm working with Python in a Docker container on Ubuntu 20.04. This issue appeared after updating to Python 3.9.
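For reference, here is a minimal sketch of one possible approach, not a definitive answer. `re.findall` returns the contents of the capturing group whenever a pattern contains one, which would explain getting only area codes back. The sketch below avoids capturing groups entirely and spells out each of the three formats as its own alternative, with lookarounds so that a match cannot sit inside a longer run of digits or hyphens. The names `PHONE_RE` and `extract_phone_numbers` are hypothetical, and the pattern assumes only the three formats listed above count as valid.

```python
import re

# Hypothetical helper: accepts only the three formats from the question.
# Non-capturing group (?:...) so findall returns the whole match, not a group.
PHONE_RE = re.compile(
    r"(?<![\d-])"                 # not preceded by a digit or hyphen
    r"(?:"
    r"\(\d{3}\) \d{3}-\d{4}"      # (123) 456-7890
    r"|\d{3}-\d{3}-\d{4}"         # 123-456-7890
    r"|\d{10}"                    # 1234567890
    r")"
    r"(?![\d-])"                  # not followed by a digit or hyphen
)


def extract_phone_numbers(text):
    """Return every substring of text that matches one of the three formats."""
    return PHONE_RE.findall(text)


text = "Call me at (123) 456-7890 or 123-456-7890, but not 1234-567-890!"
print(extract_phone_numbers(text))
# ['(123) 456-7890', '123-456-7890']
```

If the named group for the area code is still needed, `re.finditer` with `match.group(0)` for the full number and `match.group('area')` for the area code would be an alternative to `re.findall`.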