[Django]-Postgres int arrays vs a intermediary table to hold user ids for access control list

4๐Ÿ‘

โœ…

Iโ€™ve gone out to test this out myself.
Iโ€™ve created a simple users table with just a serial id, a records table of 10Mil rows distributed
across 10K users and 5 accounts.
The records table have an account field (multi-tenant) and an acl column of integer[] as well as a
separate access table to restrict access to users. Records also have a private boolean flag which
determines if a record is public (to the account) or private to the user + users on the ACL.

Records is populated with the 10Mil rows and random ACL lists โ€“ some are empty, some are variable length
values. The assumption is a user can see a record if:

  • it has no ACL and is not private
  • it has an ACL, but the user is in it

Same applies for the table method: if the user exists in the access table or if the left join is null
(no access specified) and the private flag is false, access is allowed.
The private flag is populated randomly based for rows with empty/no ACL.

Running the two last query (under explain analyse) below and more realistic variations of them using
limits result in what appears to be a clear winner for the array approach. Granted, the data used
for both approaches is not similar in the sense that it was randomly created twice separately, but I
assume with this number rows I can guage a close enough approximation of a winner.

Please critique and correct me test if you can.

-- create the intarray extension.
create extension intarray;

-- Create users table. Just an id will do fot this test
create table users (
    id serial primary key
);

-- A records table. This one is controled by the ACL
create table records (
    id serial primary key,
    account integer,
    private bool default false,
    acl integer[]
);

-- access table to limit record access for users. this is the second option vs array.
create table access (
    record_id integer,
    user_id integer
);


-- generate 10000 users
insert into users
    select from generate_series(1,10000);


-- some indexes on both methods:
create index aclindex on records using gin(acl gin__int_ops);
create index accounts on records (account);
create index useraccess on access(record_id, user_id);


-- function to geneate random acls
DROP FUNCTION IF EXISTS make_acl();
CREATE FUNCTION make_acl() RETURNS integer[] AS $$
DECLARE
    acl integer[];
    count integer;
    rand integer;
    done bool;
BEGIN
    count := (trunc(random() * 9 ));
    WHILE count != 0 LOOP
        rand := (trunc(random() * 9999 + 1));
        acl := array_append(acl, rand);
        count := count -1;
    END LOOP;
    RETURN acl;
END;
$$ LANGUAGE PLPGSQL VOLATILE;
SELECT make_acl();

-- populate records table
insert into records(acl, account)
    SELECT make_acl(), (trunc(random() * 5 + 1))
    from generate_series(1,10000000);

-- set private randomly on all records without ACL, and true where acl exists
UPDATE records set private = (trunc(random() * 10 + 1)) > 5
WHERE acl is null;
update records set private = true where acl is not null;

-- populate access table
insert into access(record_id, user_id)
select  (trunc(random() * 99999 + 1)), (trunc(random() * 9999 + 1))
from generate_series(1, 10000000);

-- Select using access table
explain analyze
select records.id
from records
left join access on records.id=access.record_id
where records.account = 1
and ((records.private = true and (access.user_id = 25 or access.user_id is null)) 
or records.private = false);


-- Select using ACL array
explain analyze
select * from records
where account = 1
and ((private = true and (acl @> array [25] or acl is null)) or private = false);

The results:

Results for intermediary table approach:

                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=275544.43..634615.43 rows=1933800 width=4) (actual time=1765.044..6152.780 rows=2093789 loops=1)
   Hash Cond: (access.record_id = records.id)
   Filter: ((records.private AND ((access.user_id = 25) OR (access.user_id IS NULL))) OR (NOT records.private))
   Rows Removed by Filter: 1878666
   ->  Seq Scan on access  (cost=0.00..183085.23 rows=9458923 width=8) (actual time=100.433..842.632 rows=10000000 loops=1)
   ->  Hash  (cost=243059.43..243059.43 rows=1980000 width=5) (actual time=1662.915..1662.915 rows=2001411 loops=1)
         Buckets: 16384  Batches: 32  Memory Usage: 2281kB
         ->  Bitmap Heap Scan on records  (cost=37065.43..243059.43 rows=1980000 width=5) (actual time=219.929..1349.931 rows=2001411 loops=1)
               Recheck Cond: (account = 1)
               Rows Removed by Index Recheck: 4713065
               Heap Blocks: exact=50052 lossy=105473
               ->  Bitmap Index Scan on accounts  (cost=0.00..36570.43 rows=1980000 width=0) (actual time=210.267..210.267 rows=2001411 loops=1)
                     Index Cond: (account = 1)
 Planning time: 0.283 ms
 Execution time: 6208.045 ms
(15 rows)

Results for array approach:

                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on records  (cost=37053.90..247997.90 rows=1933853 width=48) (actual time=215.179..1769.639 rows=223663 loops=1)
   Recheck Cond: (account = 1)
   Rows Removed by Index Recheck: 4713065
   Filter: ((private AND ((acl @> '{25}'::integer[]) OR (acl IS NULL))) OR (NOT private))
   Rows Removed by Filter: 1777748
   Heap Blocks: exact=50052 lossy=105473
   ->  Bitmap Index Scan on accounts  (cost=0.00..36570.43 rows=1980000 width=0) (actual time=205.250..205.250 rows=2001411 loops=1)
         Index Cond: (account = 1)
 Planning time: 0.106 ms
 Execution time: 1775.743 ms
(10 rows)
๐Ÿ‘คHarel

Leave a comment