[Django]-SQL query to filter on group of related rows

3👍

With group by and having, we can do Tag 1 AND Tag 2:

with cte_tags as (
select book_id
  from tags
 where tag in ('Tag 1', 'Tag 2')
 group by book_id
 having count(*)=2)
select b.id as book_id,
       b.name
  from books b
  join cte_tags t
    on b.id = t.book_id;

EDIT:

  1. If tag in tags for a book_id can be duplicated, use the following:
with cte_tags as (
select book_id
  from tags
 where tag in ('Tag 1', 'Tag 2')
 group by book_id
 having count(distinct tag)=2)
select b.id as book_id,
       b.name
  from books b
  join cte_tags t
    on b.id = t.book_id;
  1. If looking for tag1 OR tag2, or tag1 AND tag2, use the following:
with cte_tags as (
select book_id
  from tags
 where tag in ('Tag 1', 'Tag 2')
 group by book_id
 having count(distinct tag) between 1 and 2)
select b.id as book_id,
       b.name
  from books b
  join cte_tags t
    on b.id = t.book_id;
👤JHH

1👍

If I understand it correctly you want books that only have tags ‘Tag1’ and ‘Tag2’. I.e. no other tags. I’m not aware of an official name for this problem, maybe exclusive contains.

It amounts to finding books meeting two conditions:

  • have all tags in ("Tag1", "Tag2")
  • have two unique tags (or don’t have other or zero tags)

You’re looking for a solution that "works with any ORM". Of course, that’s impossible. However, there is a LINQ solution that will probably work with any LINQ-based ORM. It works with EF anyway.

This query can be used when the tag names are not unique per book:

var tags = new[] { "Tag1", "Tag2" };
var books = context.Books
    .Where(b => b.Tags.All(t => tags.Contains(t.Tag))
        && b.Tags.Select(t => t.Tag).Distinct().Count() == tags.Count());

This, if the tag names are unique per book (which I assume):

var books = context.Books
    .Where(b => b.Tags.All(t => tags.Contains(t.Tag))
        && b.Tags.Count() > 0);

The second condition is necessary, otherwise books without any tags would also be selected (that’s the semantics of All).

This generates SQL queries that you’ll have to settle with (as with any LINQ-based ORM). For the second case, using EF-core 6:

  SELECT [b].[Id]
  FROM [Books] AS [b]
  WHERE NOT EXISTS (
      SELECT 1
      FROM [Tags] AS [t]
      WHERE ([b].[Id] = [t].[BookId]) AND [t].[Tag] NOT IN (N'Tag1', N'Tag2'))
      AND ((
          SELECT COUNT(*)
          FROM [Tags] AS [t0]
          WHERE [b].[Id] = [t0].[BookId]) > 0)

Of course it’s possible to write SQL queries manually that do a (far) better job, performance-wise, in a specific constellation of indexes and statistics. Also, most (if not all) ORMs will offer a way to execute raw SQL. In that sense, a SQL query can be used with "any" ORM. I don’t consider that an "ORM solution" though. It doesn’t use the ORM’s core machinery, merely its database connection.

1👍

If performance is important, you should try various queries on your server with actual data and measure their performance.

I have a general note.

A query like this:

select book_id
from tags
where tag in ('Tag1', 'Tag2')

or like this:

select book_id
from tags
where tag = 'Tag 1' OR tag = 'Tag2'

usually would result in the scan of the whole table tags even if it had index on the tag column.

On the other hand, a query like this:

select book_id
from tags
where tag = 'Tag1'

would usually use an index.

So, we can expand a query with OR into two separate queries and then combine their results.

WITH
CTE_BookIDs
AS
(
    select book_id
    from tags
    where tag = 'Tag1'

    INTERSECT

    select book_id
    from tags
    where tag = 'Tag2'
)
SELECT
    books.*
FROM
    books
    INNER JOIN CTE_BookIDs ON CTE_BookIDs.book_id = books.id
;

Here is a query over a sample data set:

CREATE TABLE #Tags
    (ID int IDENTITY NOT NULL PRIMARY KEY
    ,BookID int NOT NULL
    ,Tag varchar(50) NOT NULL);

INSERT INTO #Tags VALUES
(1, 'Tag1'),
(1, 'Tag2'),
(1, 'Tag3'),
(1, 'Tag4'),
(2, 'Tag1'),
(3, 'Tag2'),
(4, 'Tag1'),
(4, 'Tag2'),
(4, 'Tag3'),
(5, 'Tag3'),
(5, 'Tag4'),
(5, 'Tag5'),
(6, 'Tag1'),
(6, 'Tag3'),
(6, 'Tag5'),
(7, 'Tag2'),
(7, 'Tag3'),
(8, 'Tag1'),
(8, 'Tag2');

CREATE INDEX IX_Tag ON #Tags
(
    Tag, BookID
);

WITH
CTE_BookIDs
AS
(
    select BookID
    from #Tags
    where tag = 'Tag1'

    INTERSECT

    select BookID
    from #Tags
    where tag = 'Tag2'
)
SELECT *
FROM CTE_BookIDs
;

DROP TABLE #Tags;

Result

+--------+
| BookID |
+--------+
|      1 |
|      4 |
|      8 |
+--------+

Execution plan

execution plan

0👍

Try next solution:

First create index for speed up this query

create index on tags (tag, book_id);

Second, check next query

SELECT * FROM books
JOIN tags t1 on t1.tag = 'Tag1' AND t2.book_id = books.id
JOIN tags t2 on t2.tag = 'Tag2' AND t2.book_id = books.id;

If you want to deal with arbitrary list of any number of tags:

SELECT 
   books.id,
   count(distinct tags.id) as tags_count
FROM books
JOIN tags on tags.tag = ANY(['Tag1', 'Tag2', ...]) AND tags.book_id = books.id
GROUP BY books.id
HAVING 
   count(distinct tags.id) = <number of tags>

For speed up second query check if next index wil help

create index on tags (book_id, tag);

Leave a comment