[Django]-How to account for accent characters for regex in Python?

34👍

✅

Try the following:

hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

Regex101 Demo

EDIT
Check the useful comment below from Martijn Pieters.
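
As a minimal sketch of the call above under Python 3: the sample string is hypothetical, since the original str1 is not shown here, and the accented letter is written as an escape so the example does not depend on how the source file is encoded.

```python
import re

# Hypothetical sample text; \u00fc is a precomposed "ü" and \u00e9 a precomposed "é".
str1 = "Loving #yogenfr\u00fcz and #caf\u00e9_au_lait today"

# For str patterns in Python 3, Unicode matching is already the default,
# so re.UNICODE is redundant there. It matters for unicode strings in Python 2.
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)
print(hashtags)
```

Because the pattern uses a capture group, findall returns the tag text without the leading #.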

19👍

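I know this question is a little old, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex. Note that this range only covers the Latin-1 Supplement letters, and it also includes the symbols × (215) and ÷ (247).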

hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

which will return ['yogenfrüz'] (the capture group excludes the leading #).

Hope this helps someone else.
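
To see where the explicit range helps and where it falls short, here is a small sketch. The sample hashtags are made up for illustration, and the range is written with escapes (\u00c0 is À, \u00ff is ÿ).

```python
import re

# Same character class as above, with the À-ÿ range spelled as escapes.
pattern = re.compile(r'#([A-Za-z0-9_\u00c0-\u00ff]+)')

# ü (U+00FC) lies inside the range, so the whole tag is captured.
inside = pattern.findall("#yogenfr\u00fcz")

# ő (U+0151) lies outside the range, so the match stops before it.
outside = pattern.findall("#Erd\u0151s")

# × (U+00D7) is a symbol, not a letter, but it sits inside the range.
symbol = pattern.findall("#2\u00d73")

print(inside, outside, symbol)
```

So the range is a reasonable fit for Western European text, but it is not a general substitute for \w.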

4👍

You may also want to use

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

This comes from an answer to the question "How do I convert all those escape characters into their respective characters, e.g. a Unicode à into a standard a?" Assuming you have loaded your text into a variable called my_unicode, normalizing à into a is as simple as the snippet above. In Python 3 the result is a bytes object, so append .decode('ascii') to get a str back.
Explicit example (Python 2 session):

>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'

This answer helped me a lot: How to convert unicode accented characters to pure ascii without accents?
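
A Python 3 version of the same idea might look like the sketch below. The helper name strip_accents is made up for illustration. Note that letters without a canonical decomposition, such as ø, are not converted but dropped entirely.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accents removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' drops every non-ASCII code point,
    # including the combining marks; decode to get a str back.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('\u00e0\u00e0'))            # input is "àà"
print(strip_accents('yogenfr\u00fcz'))          # input is "yogenfrüz"
print(strip_accents('sm\u00f8rrebr\u00f8d'))    # ø has no decomposition and is dropped
```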

0👍

Here’s an update to Ibrahim Najjar’s original answer based on the comment Martijn Pieters made to the answer and another answer Martijn Pieters gave in https://stackoverflow.com/a/16467505/5302861:

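import re
import unicodedata

s = "#ábá123"
# Compose base letters and combining marks into single code points so \w can match them.
n = unicodedata.normalize('NFC', s)

print(n)
# re.UNICODE is already the default for str patterns in Python 3.
c = ''.join(re.findall(r'#\w+', n, re.UNICODE))
print(s, len(s), c, len(c))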
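
The reason for normalizing first can be shown with a short sketch. It assumes Python 3 and builds the decomposed form explicitly, so the result does not depend on how the string was typed.

```python
import re
import unicodedata

composed = "#\u00e1b\u00e1123"                        # "#ábá123" with precomposed á
decomposed = unicodedata.normalize('NFD', composed)   # each á becomes a + U+0301

# The re module's \w does not match the combining mark U+0301,
# so on the decomposed string the match ends right after the first "a".
before = re.findall(r'#\w+', decomposed)

# After recomposing with NFC, every character is a letter or digit again.
after = re.findall(r'#\w+', unicodedata.normalize('NFC', decomposed))

print(len(composed), len(decomposed))
print(before, after)
```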

0👍

Building on all the other answers:

The key problem is that the re module differs in significant ways from other regular expression engines. In theory, Unicode's definition of the \w metacharacter would do what the question requires, but the re module does not implement that definition. For example, its \w does not match combining marks, so an accented letter in decomposed form ends the match.

The easy solution is to swap in a more compatible regular expression engine. The simplest way is to install the third-party regex module and use it in place of re. The code that some of the other answers have given will then work as the question needs.

import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#(\w+)', ud.normalize("NFC",str1))

Or, if you only want to focus on Latin script, including non-spacing marks (i.e. combining diacritics), use the following. Note that, unlike \w, this character class excludes digits and underscores.

import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#([\p{Latin}\p{Mn}]+)', ud.normalize("NFC",str1))

P.S. I have used unicodedataplus, which is a drop-in replacement for unicodedata. It has additional methods, and it is kept up to date with Unicode versions. With the standard unicodedata module, updating the Unicode version requires updating Python.
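
If you want to check which Unicode version your interpreter's built-in unicodedata module is tied to, a quick sketch using only the standard library is:

```python
import unicodedata

# Version of the Unicode Character Database bundled with this Python build.
version = unicodedata.unidata_version
print(version)
```

The value changes only when Python itself is upgraded, which is the limitation described above.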
