[Django]-Speeding up a large process run over some data obtained from a database

1👍

In your question you are describing an ETL process, so I suggest you use an ETL tool.

To mention one Python ETL tool, I can point to Pygrametl, written by Christian Thomsen; in my opinion it runs nicely and its performance is impressive. Test it and come back with the results.

I can’t post this answer without mentioning MapReduce. This programming model can fit your requirements if you are planning to distribute the task across nodes.
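As a rough, purely local illustration of the map/reduce split (not the author's code: the sample rows, the hashing and the multiprocessing pool are stand-ins for the question's real query, its generate_hash logic and an actual cluster framework):

import hashlib
from collections import defaultdict
from multiprocessing import Pool

def load_rows():
    # Placeholder for the real database query (hypothetical sample data).
    return [("ES", "X1234567"), ("FR", "Y7654321"), ("ES", "Z0000001")]

def map_row(row):
    # "Map" step: emit one (country, hash) pair per input row.
    country, passport_id = row
    digest = hashlib.sha256((country + "<" + passport_id).encode()).hexdigest()
    return country, digest

def reduce_pairs(pairs):
    # "Reduce" step: group the hashes by country.
    grouped = defaultdict(list)
    for country, digest in pairs:
        grouped[country].append(digest)
    return grouped

if __name__ == "__main__":
    with Pool() as pool:
        pairs = pool.map(map_row, load_rows())      # map phase, in parallel
    for country, digests in reduce_pairs(pairs).items():  # reduce phase
        print(country, len(digests))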

1👍

It looks like you have a file for each country that you append hashes to. Instead of opening and closing handles to these files 10 million+ times, you should open each one once and close them all at the end.

import os

# PROJECT_ROOT, get_agreements, generate_hash and the Postcode model are
# taken from the code in the question.
countries = {}  # country -> open file handle
with open(os.path.join(PROJECT_ROOT, 'list_countries')) as country_file:
    for line in country_file:
        country = line.strip()
        countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")

for country in countries:
    agreements = get_agreements(country)
    for postcode in Postcode.objects.filter(nationality=country):
        for agreement in agreements:
            countries[agreement].write(generate_hash(postcode.nationality + "<" + postcode.id_passport, agreement) + "\n")

for handle in countries.values():
    handle.close()

I don’t know how big a list of Postcode objects Postcode.objects.filter(nationality=country) will return; if it is massive and memory is an issue, you will have to start thinking about chunking/paginating the query using limits.
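If memory does become a problem, two common ways to bound it are sketched below, assuming the Postcode model from the question; iterator(chunk_size=...) needs a reasonably recent Django (the chunk_size argument arrived in Django 2.0), while queryset slicing works as LIMIT/OFFSET pagination on any version:

# Option 1: stream rows from the database cursor instead of loading them all.
for postcode in Postcode.objects.filter(nationality=country).iterator(chunk_size=2000):
    pass  # process one row at a time

# Option 2: explicit pagination via queryset slicing (LIMIT/OFFSET).
page_size = 10000
queryset = Postcode.objects.filter(nationality=country).order_by("pk")
offset = 0
while True:
    page = list(queryset[offset:offset + page_size])
    if not page:
        break
    for postcode in page:
        pass  # process one row at a time
    offset += page_size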

You are using sets for your list of countries and their agreements. If that is because your file containing the list of countries is not guaranteed to be unique, the dictionary solution will open a second handle to the same file and overwrite the first one in the dictionary. This can be avoided by adding a simple check to see whether the country is already a key in countries, as sketched below.
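A minimal sketch of that check, reusing PROJECT_ROOT and the file layout from the loop above:

countries = {}  # country -> open file handle
with open(os.path.join(PROJECT_ROOT, 'list_countries')) as country_file:
    for line in country_file:
        country = line.strip()
        if country not in countries:  # skip duplicates so each file is opened only once
            countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")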
