34
Try the following:
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)
EDIT
Check the useful comment below from Martijn Pieters.
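As a minimal sketch of the suggestion above, assuming Python 3 and a made-up sample string (the question's str1 is not shown in this thread): for str patterns, \w is already Unicode-aware in Python 3, so the re.UNICODE flag is redundant but harmless.

```python
import re

# Hypothetical sample text; \u00fc is a precomposed ü, \u00e9 a precomposed é.
sample = "I love #yogenfr\u00fcz and #caf\u00e9"

# re.UNICODE is kept only to mirror the answer; Python 3 applies
# Unicode matching to str patterns by default.
hashtags = re.findall(r'#(\w+)', sample, re.UNICODE)
hashtags_default = re.findall(r'#(\w+)', sample)

print(hashtags)
```

Both calls give the same list here, because the accented letters are single precomposed code points.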
19
I know this question is a little outdated, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex.
hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)
which will return ['yogenfrüz'] (the capture group leaves out the leading #).
Hope this’ll help anyone else.
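One caveat with the À-ÿ range, sketched below with a made-up sample string: the range also covers the multiplication sign × (215) and the division sign ÷ (247), so a stricter class may want to skip them. The names LOOSE and STRICT are illustrative and not part of the answer.

```python
import re

# The answer's character class: everything from U+00C0 (À) to U+00FF (ÿ).
LOOSE = re.compile(r'#([A-Za-z0-9_\u00c0-\u00ff]+)')

# Same range with U+00D7 (×) and U+00F7 (÷) carved out.
STRICT = re.compile(r'#([A-Za-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]+)')

# Hypothetical sample: "#yogenfrüz costs #2×3"
sample = "#yogenfr\u00fcz costs #2\u00d73"

loose_tags = LOOSE.findall(sample)
strict_tags = STRICT.findall(sample)
print(loose_tags, strict_tags)
```

The loose pattern treats × as part of the second tag, while the strict one stops in front of it.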
4
You may also want to use
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?
Assume you have loaded your Unicode string into a variable called my_unicode. Normalizing à into a is this simple:
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
Explicit example (Python 2 session):
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
Check this answer, it helped me a lot: How to convert unicode accented characters to pure ascii without accents?
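In Python 3 the same call returns bytes rather than str, so a decode step is needed to get text back. The sketch below wraps it in a hypothetical helper, strip_accents, which is not part of unicodedata.

```python
import unicodedata

def strip_accents(text):
    """Drop accents by decomposing (NFD) and discarding non-ASCII code points.

    Characters without an ASCII base letter (for example ø, which has no
    canonical decomposition) are removed entirely by errors='ignore'.
    """
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('\u00e0\u00e0'))     # input is 'àà'
print(strip_accents('yogenfr\u00fcz'))   # input is 'yogenfrüz'
```

Because anything that does not decompose to ASCII is silently dropped, this suits hashtag matching on Latin text better than general-purpose transliteration.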
0
Here’s an update to Ibrahim Najjar’s original answer based on the comment Martijn Pieters made to the answer and another answer Martijn Pieters gave in https://stackoverflow.com/a/16467505/5302861:
import re
import unicodedata
s = "#ábá123"
n = unicodedata.normalize('NFC', s)
print(n)
c = ''.join(re.findall(r'#\w+', n, re.UNICODE))
print(s, len(s), c, len(c))
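To see why the NFC step matters, here is a small sketch that assumes the input arrives in decomposed form, with the accents stored as separate combining characters (U+0301). The variable names are illustrative.

```python
import re
import unicodedata

# "#ábá123" written with combining acute accents instead of
# precomposed letters: 9 code points instead of 7.
decomposed = "#a\u0301ba\u0301123"

# Without normalization, \w stops at the first combining mark.
before = re.findall(r'#\w+', decomposed)

# NFC merges each base letter and its accent into one code point.
composed = unicodedata.normalize('NFC', decomposed)
after = re.findall(r'#\w+', composed)

print(before, after, len(decomposed), len(composed))
```

The match is cut short before normalization because the standard re module does not count combining marks as word characters.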
0
Building on all the other answers:
The key problem is that the re module differs in significant ways from other regular expression engines. In theory, Unicode's definition of the \w metacharacter would do what the question requires, but the re module does not implement Unicode's definition of \w.
The easy solution is to swap the regular expression engine, using a solution that is more compatible. The easiest way is to install the regex module and use it. The code that some of the other answers have given will then work as the question needs.
import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#(\w+)', ud.normalize("NFC",str1))
Or, if you only want to focus on the Latin script, including non-spacing marks (i.e. combining diacritics):
import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#([\p{Latin}\p{Mn}]+)', ud.normalize("NFC",str1))
P.S. I have used unicodedataplus, which is a drop-in replacement for unicodedata. It has additional methods and is kept up to date with Unicode versions. With the unicodedata module, updating the Unicode version requires updating Python.
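If installing the regex module is not an option, a rough standard-library approximation is possible. This is not what the answer proposes: it is a hand-rolled scanner, with hypothetical helper names is_word_char and find_hashtags, that accepts combining marks (general categories starting with M) alongside letters, digits and underscore.

```python
import unicodedata

def is_word_char(ch):
    """Letters, digits, underscore, plus any combining mark (Mn, Mc, Me)."""
    return ch == "_" or ch.isalnum() or unicodedata.category(ch).startswith("M")

def find_hashtags(text):
    """Return hashtag bodies (without '#'), keeping combining marks attached."""
    tags = []
    i = 0
    while i < len(text):
        if text[i] == "#":
            j = i + 1
            # Consume the longest run of word characters after '#'.
            while j < len(text) and is_word_char(text[j]):
                j += 1
            if j > i + 1:
                tags.append(text[i + 1:j])
            i = j
        else:
            i += 1
    return tags

# x followed by a combining acute accent has no precomposed form,
# so NFC normalization alone cannot turn it into a single letter.
print(find_hashtags("try #x\u0301yz now"))
```

This matters mostly for base letters that have no precomposed accented form, where normalizing to NFC before matching does not help.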