48
import unicodedata as ud

latin_letters = {}

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha())  # isalpha suggested by John Machin
>>> only_roman_chars(u"ελληνικά means greek")
False
>>> only_roman_chars(u"frappé")
True
>>> only_roman_chars(u"hôtel lœwe")
True
>>> only_roman_chars(u"123 ångstrom ð áß")
True
>>> only_roman_chars(u"russian: гага")
False
37
The top answer to this by @tzot is great, but IMO there should really be a library for this that works for all scripts. So, I made one (heavily based on that answer).
pip install alphabet-detector
and then use it directly:
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u'سماوي يدور', 'ARABIC')
ad.only_alphabet_chars(u'שלום', 'HEBREW')
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe 67", "LATIN") #True
ad.only_alphabet_chars(u"det forårsaker første", "LATIN") #True
ad.only_alphabet_chars(u"Cyrillic and кириллический", "LATIN") #False
ad.only_alphabet_chars(u"кириллический", "CYRILLIC") #True
There are also a few convenience methods for major scripts:
ad.is_cyrillic(u"Поиск") #True
ad.is_latin(u"howdy") #True
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True
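For readers who would rather not add a dependency, the same idea can be sketched with the standard library alone. This is not the library's actual implementation, only an illustration of the Unicode-name technique it is said to build on; the function name only_script_chars is hypothetical.

```python
import unicodedata

def only_script_chars(text, script):
    """Return True if every alphabetic character in text has `script`
    (e.g. 'LATIN', 'GREEK', 'CYRILLIC') as a word in its Unicode name.
    Digits, spaces and punctuation are ignored."""
    return all(
        script in unicodedata.name(ch, '').split()
        for ch in text
        if ch.isalpha()
    )

print(only_script_chars("ελληνικά means greek", "LATIN"))
print(only_script_chars("ελληνικά", "GREEK"))
print(only_script_chars("кириллический", "CYRILLIC"))
```

Matching on whole words of the name, rather than on a substring, avoids accidental hits when a script name happens to appear inside a longer word.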
4
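The printable constant in the standard string module contains the ASCII letters, digits, punctuation and whitespace. You can remove these characters from the text, and if anything is left, the text contains non-ASCII characters (this includes accented Latin letters such as á, as the first example shows). I did that: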
In [1]: from string import printable
In [2]: def is_latin(text):
   ...:     return not bool(set(text) - set(printable))
   ...:
In [3]: is_latin('Hradec Králové District,,Czech Republic,')
Out[3]: False
In [4]: is_latin('Hradec Krlov District,,Czech Republic,')
Out[4]: True
I have no way to check all non-Latin characters and if anyone can do that, please let me know. Thanks.
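One possible way to extend this check so that accented Latin letters pass is to fall back on the character's Unicode name whenever it is not in printable. The sketch below assumes that "Latin" means "has LATIN in its Unicode name"; the function name is_latin_text is made up for illustration.

```python
import unicodedata
from string import printable

PRINTABLE = set(printable)

def is_latin_text(text):
    """Accept ASCII printable characters plus any character whose
    Unicode name contains the word LATIN (e.g. á, é, œ)."""
    for ch in text:
        if ch in PRINTABLE:
            continue
        # name() returns '' for unnamed characters, which are rejected
        if 'LATIN' not in unicodedata.name(ch, '').split():
            return False
    return True

print(is_latin_text('Hradec Králové District,,Czech Republic,'))
print(is_latin_text('Прага'))
```

Note that non-ASCII symbols such as € are still rejected by this sketch, because their names do not contain LATIN.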
1
Check the code in django.template.defaultfilters.slugify:
import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
This is what you are looking for; you can then compare the resulting string with the original. (The final .decode('ascii') is needed on Python 3, where encode returns bytes.)
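The answer does not spell out how to do the comparison, so the following is only one possible reading: fold the text to ASCII, then check whether any base character was dropped along the way. The helper names ascii_fold and survives_ascii_fold are hypothetical.

```python
import unicodedata

def ascii_fold(value):
    """Decompose accented characters and drop everything non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', value)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

def survives_ascii_fold(value):
    """True if only combining marks (accents) were lost in the fold,
    i.e. every base character had an ASCII equivalent."""
    decomposed = unicodedata.normalize('NFKD', value)
    base_chars = [c for c in decomposed if not unicodedata.combining(c)]
    return len(ascii_fold(value)) == len(base_chars)

print(ascii_fold('frappé'))
print(survives_ascii_fold('frappé'))
print(survives_ascii_fold('гага'))
```

A limitation of this approach is that Latin letters without a Unicode decomposition, such as œ, ð or ß, are dropped by the fold, so text containing them is reported as not surviving.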
1
For what you say you want to do, your approach is about right. If you are running on Windows, I’d suggest using cp1252 instead of iso-8859-1. You might also allow cp1250; this would pick up eastern European countries like Poland, Czech Republic, Slovakia, Romania, Slovenia, Hungary, Croatia, etc., where the alphabet is Latin-based. Other cp125x code pages cover further Latin-based alphabets, such as Turkish (cp1254).
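The encoding test described here can be sketched as follows: the text is accepted if it can be encoded in at least one of the allowed code pages. The function name encodable_in and the default list of code pages are assumptions for illustration, not part of the original question.

```python
def encodable_in(text, encodings=('cp1252', 'cp1250')):
    """Return True if text can be encoded in at least one of the
    given single-byte code pages."""
    for enc in encodings:
        try:
            text.encode(enc)
        except UnicodeEncodeError:
            continue  # try the next code page
        return True
    return False

print(encodable_in('Hradec Králové'))  # fits in cp1252
print(encodable_in('Łódź'))            # needs cp1250
print(encodable_in('北京'))             # fits in neither
```

Each string must fit entirely within a single code page; a string mixing characters that exist only in cp1252 with characters that exist only in cp1250 is rejected.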
You may also like to consider transcription from Cyrillic to Latin; as far as I know there are several systems, one of which may be endorsed by the UPU (Universal Postal Union).
I’m a little intrigued by your comment “Our shipping department doesn’t want to have to fill out labels with, e.g., Chinese addresses” … three questions: (1) do you mean “addresses in country X” or “addresses written in X-ese characters” (2) wouldn’t it be better for your system to print the labels? (3) how does the order get shipped if it fails your test?
1
Checking for ISO-8859-1 would miss reasonable Western characters like ‘œ’ and ‘€’. The solution depends on how you define “Western”, and how you want to handle non-letters. Here’s one approach:
import unicodedata

def is_permitted_char(char):
    cat = unicodedata.category(char)[0]
    if cat == 'L':  # Letter
        return 'LATIN' in unicodedata.name(char, '').split()
    elif cat == 'N':  # Number
        # Only DIGIT ZERO - DIGIT NINE are allowed
        return '0' <= char <= '9'
    elif cat in ('S', 'P', 'Z'):  # Symbol, Punctuation, or Space
        return True
    else:
        return False

def is_valid(text):
    return all(is_permitted_char(c) for c in text)
0
Maybe this will do if you’re a Django user?
from django.template.defaultfilters import slugify
def justroman(s):
    return len(slugify(s)) == len(s)
0
To simplify tzot’s answer using the built-in unicodedata module, this seems to work for me on single words:
import unicodedata as ud

def is_latin(word):
    # name(c, '') avoids a ValueError for characters that have no Unicode name
    return all('LATIN' in ud.name(c, '') for c in word)