Regex for Extracting H1 Tags in HTML for SEO Analysis - Handling Nested Tags and Attributes

📅 Created: 2025-09-21

I've tried everything I can think of, but I'm still stuck. While refactoring our microservices to improve SEO analysis, I need to extract H1 tags from HTML content returned by our endpoint... The challenge is that nested tags and varying attributes can obscure the actual text. For example, the HTML might look like this:

```html
<h1 class="title">Main Title <span>with a span</span></h1>
<h1>Another Title</h1>
```

My goal is to retrieve just the text inside the H1 tags, without the markup of any nested tags. I tried a simple regex, `/<h1[^>]*>(.*?)<\/h1>/g`, which captures the H1 elements, but the captured group still includes the nested tags. To refine this, I tried a more complex regex:

```regex
/<h1[^>]*>(?:(?!<\/h1>).)*<\/h1>/g
```

The intent is for the negative lookahead to stop the match at the first closing H1 tag. Unfortunately, it returns unexpected results when there are additional H1 elements in the same string, leading to incomplete matches.

I have also experimented with the dot-all flag (`s`) so that `.` matches newlines as well. However, regex performance is vital given the scale of content we handle, and I am concerned this may slow processing down.

I'm currently using Node.js with the `express` framework, and I'm open to using libraries like `cheerio` or `jsdom` for a DOM-based approach if regex proves too cumbersome.

Does anyone have suggestions on how to reliably extract the text of these H1 tags without losing nested content, or an alternative approach that balances performance and accuracy? I'm coming from a different tech stack and still learning JavaScript. What are your experiences with this?
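To make the goal concrete, here is a minimal sketch of one possible two-step regex approach: first capture the inner HTML of each H1, then strip any nested tags from the captured text. The helper name `extractH1Text` is hypothetical and not part of any library. The sketch assumes reasonably well-formed markup: it does not decode HTML entities, and it can misfire on attribute values containing `>`, on HTML comments, or on H1-like text inside scripts. A DOM parser such as `cheerio` or `jsdom` would likely be more robust for those cases.

```javascript
// Hypothetical helper (not from any library): returns the plain text
// of every <h1> element found in an HTML string.
function extractH1Text(html) {
  // [\s\S]*? matches lazily across newlines, so the dot-all flag is not needed.
  // \b keeps "<h1" from matching longer tag names such as "<h10>".
  const h1Pattern = /<h1\b[^>]*>([\s\S]*?)<\/h1>/gi;
  const results = [];

  for (const match of html.matchAll(h1Pattern)) {
    const text = match[1]
      .replace(/<[^>]*>/g, '') // drop nested tags but keep their text
      .replace(/\s+/g, ' ')    // collapse runs of whitespace and newlines
      .trim();
    results.push(text);
  }
  return results;
}

const sample = `<h1 class="title">Main Title <span>with a span</span></h1>
<h1>Another Title</h1>`;

console.log(extractH1Text(sample));
// → [ 'Main Title with a span', 'Another Title' ]
```

Splitting the work into two simple patterns avoids the per-character lookahead of the tempered pattern above, and the lazy quantifier stops at the first closing tag, so consecutive H1 elements are matched separately.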