4👍
Put both collections into frozensets.
little performance test:
import random
from timeit import Timer
def random_strings(size):
alpha = 'abcdefghijklmnopqrstuvwxyz'
min = 3
max = 8
strings = []
for count in xrange(1, size):
current = ''
for x in random.sample(alpha, random.randint(min,max)):
current += x
strings.append(current)
return strings
string_list_1 = random_strings(10000)
string_list_2 = random_strings(10000)
def string_test():
common = filter(lambda x: x in string_list_2, string_list_1)
return common
def set_test():
string_set_1 = frozenset(string_list_1)
string_set_2 = frozenset(string_list_2)
common = string_set_1 & string_set_2
return common
string_timer = Timer("__main__.string_test()", "import __main__")
set_timer = Timer("__main__.set_test()", "import __main__")
print string_timer.timeit(10)
# 22.6108954005
print set_timer.timeit(10)
# 0.0226439453
As you can see, set is exponentially faster. Should perform better than dictionary, too.
Of important note is that I included the time necessary to make the sets. This overhead will affect your performance, too, but with the exception of having one set that is much much smaller than the other, you’ll net a large gain.
1👍
For a search like this I would go with a binary search. One of the fasted methods for long SORTED lists. If it isn’t sorted, then don’t use binary search.
- [Django]-How to find duplicate records based on certain fields in Django
- [Django]-Specify submit button when testing a form view in Django
0👍
You have two million strings that you need to match against one million other strings‽
A couple of things to try:
- Use a set instead of a list for those 2 million items.
- If that doesn’t speed it up, try having the strings as keys in a dictionary.
- If that also doesn’t help, get a nice binary tree implementation and use that.
Update:
As mentioned in the comments, sets and dicts don’t use binary trees, they use hash tables. This should be faster than a list, and in fact probably even faster than a binary search.
- [Django]-Django model manager or django queryset
- [Django]-Django – NameError: name 'Post' is not defined in many to many relationship
- [Django]-Django sequence:item 4 expected string or unicode int found
0👍
off the top of my head, with so little info on why you are doing this a few million times:
1.) can you import the csv into a table and do the checks in sql?
2.) how about sorting and indexing the list for quick access?
cheers,
P
- [Django]-Django rest framework – session auth vs token auth, csrf
- [Django]-AssertionError: `create()` did not return an object instance
- [Django]-Stop capturing any string based on / for urls in Django (regex)
- [Django]-InlineModelAdmin not showing up on admin page