Django slug, `\w` doesn't detect Korean + Chinese

1👍

From the python documentation:

\w:
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

In Python 2 you need to do two things for this to work: pass the re.UNICODE flag, and make the string a unicode object (write it as u'mystring' or convert it with unicode(string)).

>>> re.findall(r'\w+', '/review_metas/2108/발견/24986/')
['review_metas', '2108', '24986']

>>> re.findall(r'\w+', u'/review_metas/2108/발견/24986/', re.UNICODE)
[u'review_metas', u'2108', u'\ubc1c\uacac', u'24986']

In your example:

>>> expr = r'^/review_metas/(?P<review_meta_id>\d+)/(?P<slug>[-~\w]+)/(?P<review_thread_id>\d+)/$'
>>> url = u'/review_metas/2108/발견/24986/'

>>> print(re.match(expr, url))
None

>>> f = re.match(expr, url, re.UNICODE)
>>> f
<_sre.SRE_Match at 0x7f2e08dd8620>
>>> f.group('slug')
u'\ubc1c\uacac'

Just by passing a proper unicode string and adding the re.UNICODE flag, your pattern matches as expected.
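
For anyone on Python 3, here is a minimal sketch of the same check. It assumes Python 3, where str patterns are Unicode-aware by default, so re.UNICODE is redundant there; re.ASCII is used only to reproduce the old ASCII-only behaviour of \w. The names EXPR and URL are just for illustration.

```python
import re

# Same pattern and URL as in the example above.
EXPR = r'^/review_metas/(?P<review_meta_id>\d+)/(?P<slug>[-~\w]+)/(?P<review_thread_id>\d+)/$'
URL = '/review_metas/2108/발견/24986/'

# re.ASCII restricts \w to [a-zA-Z0-9_], like Python 2 without re.UNICODE.
ascii_match = re.match(EXPR, URL, re.ASCII)

# Default Python 3 behaviour for str patterns: \w follows the Unicode database.
unicode_match = re.match(EXPR, URL)

print(ascii_match)                  # None
print(unicode_match.group('slug'))  # 발견
```

So on Python 3 the original pattern should already accept Korean and Chinese slugs unless something explicitly asks for ASCII-only matching.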


I don't know how Django handles URLs internally (I have never used Django), but if there is no way to pass the Unicode flag to Django, you can replace your slug pattern [-~\w]+ with [^/]+.

r'^/review_metas/(?P<review_meta_id>\d+)/(?P<slug>[^/]+)/(?P<review_thread_id>\d+)/$'

It reads as "one or more characters other than '/'".
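
A rough sketch of why this workaround does not depend on any flag. It assumes Python 3 and passes re.ASCII on purpose, to mimic an engine where \w would be ASCII-only. The helper slug_of is hypothetical and exists only for this illustration.

```python
import re

# The slug is "one or more characters that are not a slash".
EXPR = r'^/review_metas/(?P<review_meta_id>\d+)/(?P<slug>[^/]+)/(?P<review_thread_id>\d+)/$'

def slug_of(url, flags=0):
    """Return the captured slug, or None if the URL does not match."""
    m = re.match(EXPR, url, flags)
    return m.group('slug') if m else None

# re.ASCII is passed to show that [^/]+ is unaffected by how \w is defined.
korean = slug_of('/review_metas/2108/발견/24986/', re.ASCII)
chinese = slug_of('/review_metas/2108/发现/24986/', re.ASCII)
empty = slug_of('/review_metas/2108//24986/', re.ASCII)
loose = slug_of('/review_metas/2108/a b!/24986/', re.ASCII)

print(korean, chinese, empty, repr(loose))
```

The trade-off is that [^/]+ is much looser than \w: it also lets through spaces and punctuation (the last call returns 'a b!'), so you may want to validate the slug separately.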

1👍

Use
re.findall(pattern, string, flags=re.U)
or just
re.findall(pattern, string, re.U)

You will run into the same problem with any language whose letters fall outside the ASCII range (e.g., Czech, Russian or Chinese).
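
To illustrate with a few languages at once, here is a small sketch. It assumes Python 3, where re.U is already the default for str patterns, so re.ASCII stands in for the Python 2 behaviour without the flag. The sample text is made up for this example.

```python
import re

# Czech, Russian, Chinese and English words for "river".
text = 'řeka река 河 river'

# Without Unicode matching, \w covers only [a-zA-Z0-9_], so words written
# with non-ASCII letters are truncated or dropped entirely.
ascii_words = re.findall(r'\w+', text, re.ASCII)

# With re.U, \w follows the Unicode character database.
unicode_words = re.findall(r'\w+', text, re.U)

print(ascii_words)
print(unicode_words)
```

With re.U all four words come back whole; with re.ASCII only the plain English word survives intact.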

👤DenisK

0👍

Use this:

r'^/review_metas/(?P<review_meta_id>\d+)/(?P<slug>.*)/(?P<review_thread_id>\d+)/$'
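
One caveat worth checking before using .* here: it is greedy and also matches slashes and the empty string. A quick sketch comparing it with [^/]+, assuming Python 3. The names GREEDY, STRICT and slug_of are hypothetical and only serve this comparison.

```python
import re

GREEDY = r'^/review_metas/(?P<review_meta_id>\d+)/(?P<slug>.*)/(?P<review_thread_id>\d+)/$'
STRICT = r'^/review_metas/(?P<review_meta_id>\d+)/(?P<slug>[^/]+)/(?P<review_thread_id>\d+)/$'

def slug_of(pattern, url):
    """Return the captured slug, or None if the URL does not match."""
    m = re.match(pattern, url)
    return m.group('slug') if m else None

normal = '/review_metas/2108/발견/24986/'
nested = '/review_metas/2108/a/b/c/24986/'
empty = '/review_metas/2108//24986/'

for url in (normal, nested, empty):
    # Print both results side by side for each URL.
    print(url, repr(slug_of(GREEDY, url)), repr(slug_of(STRICT, url)))
```

Both patterns accept the Korean slug, but .* also captures 'a/b/c' and an empty slug, which [^/]+ rejects. Whether that matters depends on what URLs your site can receive.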
👤anand
