CodexBloom - Programming Q&A Platform

Regex Not Capturing UTF-8 Characters in Node.js - Need Help with Multilingual Strings

👀 Views: 36 💬 Answers: 1 📅 Created: 2025-06-12
regex node.js utf-8 JavaScript

I've looked through the documentation and I'm still confused. I am trying to extract specific patterns from multilingual strings using regex in Node.js, but I can't seem to get it right when it comes to non-ASCII (UTF-8 encoded) characters. My current regex is designed to capture sequences of letters, digits, and underscores, but it fails when the strings contain characters from other scripts, such as Chinese or Arabic. Here's the regex I'm using:

```javascript
const pattern = /[A-Za-z0-9_]+/g;
const input = 'Hello 世界, this is a test 123! مرحبا';
const matches = input.match(pattern);
console.log(matches);
```

I expected the output to include both English and non-English segments, but it only returns the English and numeric parts:

```
[ 'Hello', 'this', 'is', 'a', 'test', '123' ]
```

I've tried changing the regex to use Unicode property escapes, like this:

```javascript
const pattern = /[\p{L}\p{N}_]+/gu;
```

But now I'm getting an error: `SyntaxError: Invalid regular expression: /[\p{L}\p{N}_]+/gu: Unicode property escapes are not allowed in this context`. It seems that my version of Node.js (v12.22.9) does not support the `\p{}` syntax. I've also looked into using the `unicode` flag, but I'm not sure how to implement it correctly.

Are there any workarounds or other regex patterns I should consider for this task? Any help would be greatly appreciated! I'm working on a web app that needs to handle this.
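
One possible workaround, as a minimal sketch rather than a definitive fix: try to build the property-escape pattern at runtime and fall back to explicit code point ranges if the engine rejects it. The helper names `buildWordPattern` and `fallbackPattern` are hypothetical, and the fallback ranges are an assumption that covers only ASCII, the CJK Unified Ideographs block, and the basic Arabic block, so other scripts would need their own ranges.

```javascript
// Hypothetical fallback: explicit ranges for ASCII word characters,
// CJK Unified Ideographs (U+4E00-U+9FFF) and the Arabic block (U+0600-U+06FF).
// This is an approximation: the Arabic block also contains some punctuation.
function fallbackPattern() {
  return /[A-Za-z0-9_\u4E00-\u9FFF\u0600-\u06FF]+/g;
}

// Hypothetical helper: prefer Unicode property escapes when the engine
// accepts them, otherwise use the explicit ranges above.
function buildWordPattern() {
  try {
    // Built from a string so an unsupported engine throws here at runtime
    // instead of failing when the whole file is parsed.
    return new RegExp('[\\p{L}\\p{N}_]+', 'gu');
  } catch (err) {
    return fallbackPattern();
  }
}

const input = 'Hello 世界, this is a test 123! مرحبا';
const matches = input.match(buildWordPattern());
console.log(matches);
// → [ 'Hello', '世界', 'this', 'is', 'a', 'test', '123', 'مرحبا' ]
```

Unicode property escapes are an ES2018 feature and only work together with the `u` flag, so printing `process.version` from the same process can confirm which runtime is really executing the pattern.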