CodexBloom - Programming Q&A Platform

Regex for Extracting Version Numbers from Software Documentation - Trouble Handling Different Formats

👀 Views: 0 💬 Answers: 1 📅 Created: 2025-08-06
regex python string-manipulation

I'm working on a Python script that needs to extract version numbers from a software documentation file. The problem is that the version numbers can appear in various formats, such as `v1.0`, `version 2.1.3`, or `3.0.0-beta`. I've tried using a regex pattern to capture these formats, but it's not behaving as expected.

Here's the regex pattern I started with:

```python
import re

pattern = r'v?\d+\.\d+(\.\d+)?(-[a-zA-Z]+)?'
file_content = '''
Software Release Notes:
- v1.0 released on 2021-01-01
- version 2.1.3 is now available
- 3.0.0-beta is in testing
- v4.0.0
'''
# re.findall returns only the capture groups when the pattern contains
# groups, so collect the full text of each match with finditer instead.
matches = [m.group(0) for m in re.finditer(pattern, file_content)]
print(matches)
```

When I run this, I get the output `['v1.0', '2.1.3', '3.0.0-beta', 'v4.0.0']`, which is correct, but it seems to miss versions that are formatted differently or that contain extra spaces.

I also tried tweaking the regex to allow for more variations, such as optional spaces or different prefixes, but that just complicates it further. For instance, when I include optional spaces like this:

```python
pattern = r'v?\s*\d+\.\d+(\.\d+)?(-[a-zA-Z]+)?'
```

I started getting unexpected matches like `v 1.0`, which is not in the original text. I'm not sure how to refine the regex to handle these cases without capturing unwanted strings.

Is there a more robust regex pattern I can use to capture these version numbers across varying formats without introducing false positives? Any suggestions for refining my approach would be greatly appreciated!

This is for an application running on Debian, and I'm using Python 3.11 in this project. How would you solve this?
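For reference, here is a minimal sketch of one possible direction, not a definitive answer. It assumes the only prefixes of interest are `v` and `version`, that a version has two or three numeric parts, and that a pre-release tag starts with a letter (for example `-beta` or `-rc.1`). The names `VERSION_RE` and `extract_versions` are hypothetical and chosen only for illustration. Because `re.findall` returns the captured groups rather than the whole match whenever a pattern contains capturing groups, the sketch uses `re.finditer` with a single named group, and it relies on lookarounds instead of `\s*` to avoid picking up stray whitespace or fragments of longer tokens.

```python
import re

# Hypothetical pattern for illustration; the prefixes and suffix rules are
# assumptions and should be adjusted to the actual documentation format.
VERSION_RE = re.compile(
    r"""
    (?<![\w.-])                  # not glued to a preceding word char, dot, or hyphen
    (?:v|version[ \t]+)?         # optional prefix: "v", or "version" plus spaces/tabs
    (?P<number>
        \d+\.\d+(?:\.\d+)?       # major.minor with an optional patch part
        (?:-[A-Za-z][0-9A-Za-z]*(?:\.[0-9A-Za-z]+)*)?   # optional tag: -beta, -rc.1
    )
    (?![\w-]|\.\d)               # not followed by more word chars or another ".digit"
    """,
    re.VERBOSE | re.IGNORECASE,
)


def extract_versions(text: str) -> list[str]:
    """Return the version strings found in text, without any v/version prefix."""
    return [m.group("number") for m in VERSION_RE.finditer(text)]


sample = """
Software Release Notes:
- v1.0 released on 2021-01-01
- version 2.1.3 is now available
- 3.0.0-beta is in testing
- v4.0.0
"""

print(extract_versions(sample))
# ['1.0', '2.1.3', '3.0.0-beta', '4.0.0']
```

The trailing lookahead is what rejects dotted runs with four or more parts, such as IP addresses, while still allowing a version at the end of a sentence (`v2.0.`). The leading lookbehind means a version glued to a hyphen, as in `release-1.0`, is skipped; relax it if that form occurs in your files.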