Regex for Multi-Line CSV Parsing in Python - Capturing Quoted Fields

📅 Created: 2025-06-03

I've been struggling with this for a few days now and could really use some help. While writing unit tests for a tutorial project, I'm trying to parse a multi-line CSV where some fields are enclosed in quotes, and these fields may also contain newline characters. I thought I could use a regex to extract the fields, but I'm running into issues when a quoted field spans multiple lines.

My current code is as follows:

```python
import re

# Sample data: the last quoted field contains an embedded newline
csv_data = '''"value1","value2"
"value3, with, commas","value4"
"value5,\nwith newline"'''

# Lazily match everything between a pair of double quotes
pattern = r'"([^"]*?)"'
matches = re.findall(pattern, csv_data)
print(matches)
```

This code correctly captures value1 through value4, but it fails to capture value5 because of the newline within the quotes. The output I'm getting is: `['value1', 'value2', 'value3, with, commas', 'value4']`.

I've tried modifying the regex pattern to account for newlines, but I can't seem to find a combination that works. I also tried the `re.DOTALL` flag, but it didn't help: I still can't seem to capture quoted fields that span lines. The documentation isn't particularly clear on handling multi-line matches within quotes, so I'm experimenting.

Is there a more effective way to write the regex to capture quoted fields, including those containing newlines? Or should I consider a different approach for parsing this CSV format altogether?

The project is a web app built with Python 3.9. Any advice would be greatly appreciated. Thanks in advance for your help!
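On the "different approach" question, here is a minimal sketch using Python's built-in `csv` module, which treats commas and newlines inside quoted fields as part of the field. It assumes the data uses standard double-quote CSV quoting and that the whole text is available as one string. The helper name `parse_quoted_csv` is made up for illustration.

```python
import csv
import io

# Same sample data as in the question; the last field has an embedded newline
csv_data = '"value1","value2"\n"value3, with, commas","value4"\n"value5,\nwith newline"'


def parse_quoted_csv(text):
    """Parse CSV text into a list of rows (hypothetical helper).

    newline='' leaves line endings untouched so the csv reader can
    decide which newlines end a record and which belong to a quoted field.
    """
    reader = csv.reader(io.StringIO(text, newline=''))
    return list(reader)


rows = parse_quoted_csv(csv_data)
for row in rows:
    print(row)
```

As a side note on the regex itself: `re.DOTALL` only changes what `.` matches. A negated class such as `[^"]` matches newline characters either way. If a quoted field is still missed, the cause may lie elsewhere, for example the input being read line by line before matching, or fields containing escaped quotes (`""`). That is a guess rather than a diagnosis.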