[Django]-PostgreSQL ArrayField vs ForeignKey: which one is more performant?

4👍

If you build a GIN index on the array column, and Django writes its queries in such a way that they can use that index, then read performance will be quite similar between the two.
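
For reference, here is a minimal sketch of what the array-column design could look like on the Django side (the Contact model and field names are illustrative, not taken from the question). ArrayField's contains lookup compiles to the @> operator, so a GinIndex declared on the field can be used:

# Hypothetical models.py for the array-column design.
from django.contrib.postgres.fields import ArrayField
from django.contrib.postgres.indexes import GinIndex
from django.db import models

class Contact(models.Model):
    name = models.CharField(max_length=100)
    # All phone numbers stored inline on the contact row.
    phones = ArrayField(models.CharField(max_length=20), default=list)

    class Meta:
        indexes = [GinIndex(fields=["phones"], name="contact_phones_gin")]

# __contains on an ArrayField is translated to the @> operator,
# so this filter can be answered from the GIN index:
# Contact.objects.filter(phones__contains=["+15551234567"])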

It is very unlikely that the performance difference will be the driving factor behind this choice. The modelling question matters more: do you need more information about each phone number than just the number itself, such as when it was added, when it was last used, or whether it is a mobile phone or something else?
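
If you do need that kind of per-number metadata, the foreign-key design accommodates it naturally. A sketch, with illustrative field names:

# Hypothetical foreign-key design: one row per phone number,
# with room for per-number attributes.
from django.db import models

class Contact(models.Model):
    name = models.CharField(max_length=100)

class PhoneNumber(models.Model):
    contact = models.ForeignKey(Contact, on_delete=models.CASCADE,
                                related_name="phone_numbers")
    number = models.CharField(max_length=20, db_index=True)
    added_at = models.DateTimeField(auto_now_add=True)
    last_used = models.DateTimeField(null=True, blank=True)
    kind = models.CharField(max_length=10, default="mobile")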

The array column should be faster, because a lookup only has to consult one index and one table rather than two of each. It will also be more compact, and so more cacheable.
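
In ORM terms the difference looks like this (assuming the two sketches above): the array lookup touches a single table and its GIN index, while the foreign-key lookup has to join through the phone table.

# Array design: one table, one index.
Contact.objects.filter(phones__contains=["+15551234567"])

# Foreign-key design: the lookup starts from PhoneNumber's index on
# number and joins back to Contact, so two tables and two indexes.
Contact.objects.filter(phone_numbers__number="+15551234567")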

On the other hand, the statistical estimates for your array column will have a problem estimating rare values, which you are likely to have here, as no phone number is likely to be shared by a large number of people. This misestimate can have a devastating effect on query performance. For example, in a little test, overestimating the number of rows by a factor of many thousands caused the planner to launch parallel workers for a single-row query, making it about 20-fold slower than with parallelization turned off, and 10 times slower than the foreign-key representation, which does not suffer from the estimation problem.

For example:

create table contact as
  select md5(floor(random()*50000000)::text) as name,
         array_agg(floor(random()*100000000)::int) as phones
  from generate_series(1,100000000) f(x)
  group by name;
vacuum analyze contact;
create index on contact using gin (phones);
explain analyze select * from contact where phones @> ARRAY[123456];
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=3023.30..605045.19 rows=216167 width=63) (actual time=0.668..8.071 rows=2 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on contact  (cost=2023.30..582428.49 rows=90070 width=63) (actual time=0.106..0.110 rows=1 loops=3)
         Recheck Cond: (phones @> '{123456}'::integer[])
         Heap Blocks: exact=2
         ->  Bitmap Index Scan on contact_phones_idx  (cost=0.00..1969.25 rows=216167 width=0) (actual time=0.252..0.252 rows=2 loops=1)
               Index Cond: (phones @> '{123456}'::integer[])
 Planning Time: 0.820 ms
 Execution Time: 8.137 ms

You can see that it estimates there will be 216167 rows, but in fact there are only 2. (For convenience I used ints rather than the text type you would probably use for phone numbers, but this doesn't change anything fundamental.)

If this is really vital to you, then you should do the test and see, using your own data and your own architecture. It will depend on what does and does not fit in memory, what kinds of queries you are doing (do you ever look up numbers in bulk? join them to other tables besides the foreign key discussed here?), and maybe on how your driver/library handles columns and parameters with array types.
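
For instance, a bulk lookup is expressed quite differently in the two designs (again assuming the sketches above; the list of numbers is illustrative). On an ArrayField, the overlap lookup compiles to the && operator, which a GIN index can also serve:

numbers = ["+15551234567", "+15557654321"]

# Array design: && (overlap) against the GIN index.
Contact.objects.filter(phones__overlap=numbers)

# Foreign-key design: an IN list against the indexed number column;
# distinct() avoids duplicate contacts when several numbers match.
Contact.objects.filter(phone_numbers__number__in=numbers).distinct()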

👤jjanes
