Well if you need a result (genotype) per Probe for every Subject, then a standard many-to-many intermediary table (Genotype) is going to get huge indeed.
With 500k probes and 1000 Subjects you’d have 500 million records.
If you could store the genotype values encoded/serialized in one or more columns, that would reduce the number of records drastically. Saving 500k results encoded in a single column would be a problem, but if you can split them into groups, it should be workable. With one column per group, this reduces the number of records to the number of Subjects. Another possibility is to group the Probes into ProbeGroups, which gives: number of ProbeResults = number of Subjects * number of ProbeGroups.
First option would be something like:
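class SubjectProbeResults(models.Model):
    # on_delete is required from Django 2.0 onwards; CASCADE was the old implicit default
    subject = models.ForeignKey(Subject, on_delete=models.CASCADE, related_name='probe_results')
    pg_a_genotypes = models.TextField()
    # ..
    pg_n_genotypes = models.TextField()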
This will of course make it more difficult to search/filter results, but shouldn’t be too hard if the saved format is simple.
You can have the following format in genotype columns: “probe1_id|genotype1,probe2_id|genotype2,probe3_id|genotype3,…”
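A minimal sketch of how such a column could be written and read back in plain Python. The helper names `encode_genotypes` and `decode_genotypes` are made up for illustration, and it assumes probe ids are integers and that genotype values never contain `|` or `,`:

```python
def encode_genotypes(results):
    """Serialize {probe_id: genotype} into 'probe_id|genotype,...'."""
    parts = []
    for probe_id, genotype in results.items():
        genotype = str(genotype)
        # The separators must not appear inside a value, or decoding breaks.
        if "|" in genotype or "," in genotype:
            raise ValueError("genotype contains a separator: %r" % genotype)
        parts.append("%s|%s" % (probe_id, genotype))
    return ",".join(parts)


def decode_genotypes(text):
    """Parse 'probe_id|genotype,...' back into {probe_id: genotype}."""
    if not text:
        return {}
    results = {}
    for pair in text.split(","):
        probe_id, genotype = pair.split("|", 1)
        results[int(probe_id)] = genotype  # assumption: integer probe ids
    return results


encoded = encode_genotypes({1: "AA", 2: "AG", 3: "GG"})
print(encoded)                    # 1|AA,2|AG,3|GG
print(decode_genotypes(encoded))  # {1: 'AA', 2: 'AG', 3: 'GG'}
```

Keeping the format this simple is what makes the string lookups on the column feasible.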
To retrieve a queryset of subjects for a specific probe + genotype:
a. Determine which group the probe belongs to, e.g. “Group C” -> pg_c_genotypes
b. Query the respective column for the probe_id + genotype combination.
from django.db.models import Q
qstring = "%s|%s" % (probe_id, genotype)
# Match the entry in the middle, at the start, at the end, or as the only entry
subjects = Subject.objects.filter(
    Q(probe_results__pg_c_genotypes__contains=',%s,' % qstring) |
    Q(probe_results__pg_c_genotypes__startswith='%s,' % qstring) |
    Q(probe_results__pg_c_genotypes__endswith=',%s' % qstring) |
    Q(probe_results__pg_c_genotypes=qstring)
)
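To see why the comma boundaries matter, here is a pure-Python sketch of the same matching logic, handy for checking the idea without a database. The function name `matches` is hypothetical; it mirrors the start, middle and end checks, together with an exact-match check for a column that holds a single entry:

```python
def matches(column_value, probe_id, genotype):
    """Boundary-aware test for one 'probe_id|genotype' entry in a column."""
    qstring = "%s|%s" % (probe_id, genotype)
    return (
        column_value == qstring                     # the only entry
        or column_value.startswith(qstring + ",")   # first entry
        or column_value.endswith("," + qstring)     # last entry
        or ("," + qstring + ",") in column_value    # entry in the middle
    )


column = "112|AA,7|AG,12|GG"
print(matches(column, 12, "GG"))  # True
print(matches(column, 12, "AA"))  # False
print("12|AA" in column)          # True: a bare substring test hits 112|AA
```

Without the separators around the search string, probe 12 would wrongly match the entry for probe 112.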
The other option that I’ve mentioned is to have a ProbeGroup model too, where each Probe has a ForeignKey to ProbeGroup. And then:
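class SubjectProbeResults(models.Model):
    # on_delete is required from Django 2.0 onwards; CASCADE was the old implicit default
    subject = models.ForeignKey(Subject, on_delete=models.CASCADE, related_name='probe_results')
    probe_group = models.ForeignKey(ProbeGroup, on_delete=models.CASCADE, related_name='probe_results')
    genotypes = models.TextField()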
You can query the genotypes field the same way, except that you now filter on the group directly instead of first working out which column to search.
This way, if you have e.g. 1000 probes per group, that gives 500 groups. Then for 1000 Subjects you’ll have 500K SubjectProbeResults, still a lot, but certainly more manageable than 500M. You could also have fewer groups; you’d have to test what works best.
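For a quick comparison, the row counts of the three layouts can be worked out like this. It's just the arithmetic from above, using the 1000 Subjects and 500k probes figures; `row_counts` is a made-up helper name:

```python
def row_counts(subjects, probes, probes_per_group):
    """Number of rows each storage layout would need."""
    groups = -(-probes // probes_per_group)  # ceiling division
    return {
        "row per subject and probe": subjects * probes,
        "row per subject and probe group": subjects * groups,
        "row per subject (one column per group)": subjects,
    }


counts = row_counts(subjects=1000, probes=500000, probes_per_group=1000)
for layout, rows in counts.items():
    print("%-40s %12d" % (layout, rows))
```

Changing `probes_per_group` shows the trade-off: fewer, larger groups mean fewer rows but longer text fields to scan per row.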