Regex Not Matching HTML Tags with Attributes in Python - Whitespace Handling

👀 Views: 39 💬 Answers: 1 📅 Created: 2025-06-11

I've looked through the documentation and I'm still having trouble using regex to match HTML tags that may include attributes, especially when there's whitespace involved. I need to extract tags like `<a href='link' class='btn'>` or `<img src="image.jpg" />`, but my current regex gives unexpected results when there are varying amounts of whitespace or when attributes appear in different orders.

Here's what I've tried so far:

```python
import re

test_string = "<a href='link' class='btn'>Some Link</a> <img src= \"image.jpg\" />"
regex_pattern = r'<(\w+)\s+([^>]*)>(.*?)</\1>|<(\w+)([^>]*)/>'
matches = re.findall(regex_pattern, test_string)
print(matches)
```

The above regex captures the tags, but it does not capture the attributes correctly when there are extra spaces. For instance, if I use something like this:

```html
< a href = 'link' >
```

it fails to match entirely. I'm getting empty results or incorrectly parsed attributes when whitespace is inconsistent. I also notice that when attribute values contain quotes, the regex does not handle escaped quotes well.

Is there a better way to structure my regex so that it handles varying whitespace and parses attributes correctly regardless of their order? Or is there a better approach altogether, and am I missing something obvious? Pointers to the relevant documentation would also help.

I'm using Python 3.9.1 and the `re` module on Windows 11. Any suggestions or best practices would be greatly appreciated. Thanks in advance!
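
One possible direction, as a minimal sketch rather than a definitive solution: split the work into two passes. The first regex finds an opening tag while tolerating whitespace after `<` and before `>`. The second regex walks the attribute text with `finditer`, so attribute order no longer matters. The names `TAG_RE`, `ATTR_RE`, and `parse_tags` are hypothetical and chosen only for illustration, and the sketch assumes reasonably well-formed input (no comments, no `<script>` bodies).

```python
import re

# Opening tag: optional whitespace after '<', a tag name, then attribute text.
# Quoted values are consumed as a unit, so a '>' inside quotes does not end the tag.
TAG_RE = re.compile(r"""<\s*(\w+)((?:[^>"']|"[^"]*"|'[^']*')*?)\s*(/?)\s*>""")

# One attribute: a name, optionally followed by '=' and a double-quoted,
# single-quoted, or unquoted value, with optional whitespace around '='.
ATTR_RE = re.compile(r"""([\w:-]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?""")


def parse_tags(html):
    """Return (tag_name, attrs_dict, is_self_closing) for each opening tag."""
    results = []
    for tag in TAG_RE.finditer(html):
        name, raw_attrs, slash = tag.groups()
        attrs = {}
        for attr in ATTR_RE.finditer(raw_attrs):
            # Exactly one of the three value groups is set; none for bare attributes.
            value = next((g for g in attr.groups()[1:] if g is not None), None)
            attrs[attr.group(1)] = value
        results.append((name, attrs, slash == "/"))
    return results


test_string = "<a href='link' class='btn'>Some Link</a> <img src= \"image.jpg\" />"
print(parse_tags(test_string))
# [('a', {'href': 'link', 'class': 'btn'}, False), ('img', {'src': 'image.jpg'}, True)]

print(parse_tags("< a href = 'link' >"))
# [('a', {'href': 'link'}, False)]
```

Closing tags such as `</a>` are skipped because the pattern requires a word character after the optional whitespace. On the escaped-quote point: HTML attribute values do not use backslash escapes, and a literal quote inside a value is normally written as an entity such as `&quot;`, so the sketch does not try to handle `\"`. For real-world HTML, the standard library's `html.parser.HTMLParser` is generally more robust than any regex.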