[Chart.js] Querying MongoDB for some chart data – my pipeline seems convoluted

I can trim a lot of fat out of this and keep it compatible with MongoDB 3.2 (which you must be using at least, since preserveNullAndEmptyArrays requires it) with a few simple actions. Mostly this means joining the arrays immediately after each $lookup, which is the best place to do it:

Short Optimization

db.brands.aggregate([
  // Join both account collections onto each brand document
  { "$lookup": {
    "from": "facebookaccounts",
    "localField": "facebookAccounts",
    "foreignField": "_id",
    "as": "facebookAccounts"
  }},
  { "$lookup": {
    "from": "twitteraccounts",
    "localField": "twitterAccounts",
    "foreignField": "_id",
    "as": "twitterAccounts"
  }},
  // Merge the two joined arrays into one straight away
  { "$project": {
    "name": 1,
    "all": {
      "$concatArrays": [ "$facebookAccounts", "$twitterAccounts" ]
    }
  }},
  // First $match removes whole documents with no days in range
  { "$match": {
    "all.years.months.days.date": {
      "$gte": new Date("2017-06-16"), "$lte": new Date("2017-06-18")
    }
  }},
  { "$unwind": "$all" },
  { "$unwind": "$all.years" },
  { "$unwind": "$all.years.months" },
  { "$unwind": "$all.years.months.days" },
  // Second $match filters the individual day entries once unwound
  { "$match": {
    "all.years.months.days.date": {
      "$gte": new Date("2017-06-16"), "$lte": new Date("2017-06-18")
    }
  }},
  // Sum likes and followers together per brand, per day
  { "$group": {
    "_id": {
      "brand": "$name",
      "date": "$all.years.months.days.date"
    },
    "total": {
      "$sum": {
        "$sum": [
          { "$ifNull": [ "$all.years.months.days.likes", 0 ] },
          { "$ifNull": [ "$all.years.months.days.followers", 0 ] }
        ]
      }
    }
  }},
  { "$sort": { "_id": 1 } },
  // Roll up into parallel date/stat arrays per brand for charting
  { "$group": {
    "_id":  "$_id.brand",
    "date": { "$push": "$_id.date" },
    "stat": { "$push": "$total" }
  }}
])

This gives the result:

{
        "_id" : "Brand1",
        "date" : [
                ISODate("2017-06-16T00:00:00Z"),
                ISODate("2017-06-17T00:00:00Z"),
                ISODate("2017-06-18T00:00:00Z")
        ],
        "stat" : [
                973415,
                69397,
                973773
        ]
}

With MongoDB 3.4 we could probably speed it up a “little” more by filtering the arrays and breaking them down before we eventually $unwind to make this work across documents, or perhaps not worry about going across documents at all if the “name” in “brands” is unique. The pipeline operations to compact the arrays “in place” are quite cumbersome to code, though, even if a “little” better on performance.
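
A sketch of that MongoDB 3.4 form, nesting $reduce with $concatArrays to flatten everything down to the “days” level server-side, so that a single $unwind replaces the four above. Cumbersome, as noted, for a likely small gain:

db.brands.aggregate([
  { "$lookup": {
    "from": "facebookaccounts",
    "localField": "facebookAccounts",
    "foreignField": "_id",
    "as": "facebookAccounts"
  }},
  { "$lookup": {
    "from": "twitteraccounts",
    "localField": "twitterAccounts",
    "foreignField": "_id",
    "as": "twitterAccounts"
  }},
  // Flatten accounts -> years -> months -> days entirely server-side,
  // then keep only the days in range
  { "$project": {
    "name": 1,
    "days": {
      "$filter": {
        "input": {
          "$reduce": {
            "input": {
              "$reduce": {
                "input": {
                  "$reduce": {
                    "input": { "$concatArrays": [ "$facebookAccounts", "$twitterAccounts" ] },
                    "initialValue": [],
                    "in": { "$concatArrays": [ "$$value", "$$this.years" ] }
                  }
                },
                "initialValue": [],
                "in": { "$concatArrays": [ "$$value", "$$this.months" ] }
              }
            },
            "initialValue": [],
            "in": { "$concatArrays": [ "$$value", "$$this.days" ] }
          }
        },
        "as": "day",
        "cond": {
          "$and": [
            { "$gte": [ "$$day.date", new Date("2017-06-16") ] },
            { "$lte": [ "$$day.date", new Date("2017-06-18") ] }
          ]
        }
      }
    }
  }},
  // A single $unwind replaces the four in the pipeline above
  { "$unwind": "$days" },
  { "$group": {
    "_id": { "brand": "$name", "date": "$days.date" },
    "total": {
      "$sum": {
        "$sum": [
          { "$ifNull": [ "$days.likes", 0 ] },
          { "$ifNull": [ "$days.followers", 0 ] }
        ]
      }
    }
  }},
  { "$sort": { "_id": 1 } },
  { "$group": {
    "_id": "$_id.brand",
    "date": { "$push": "$_id.date" },
    "stat": { "$push": "$total" }
  }}
])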

You seem to be doing this “per brand” or for a small sample, so it’s likely of little consequence.

As for the Chart.js data format, I have not been able to get my hands on what I believe is a different data format from the array format here, but again this should have little bearing.
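
For what it’s worth, the parallel date/stat arrays drop straight into the common Chart.js datasets structure. A minimal sketch, assuming a line chart, Chart.js 2.x, and a canvas element with id "myChart" (all assumptions here, not anything from your question):

// "doc" stands for one aggregated result, e.g.
// { "_id": "Brand1", "date": [ ... ], "stat": [ ... ] }
var ctx = document.getElementById('myChart').getContext('2d');
var chart = new Chart(ctx, {
  type: 'line',
  data: {
    labels: doc.date.map(d => d.toISOString().slice(0, 10)), // "YYYY-MM-DD" labels
    datasets: [{
      label: doc._id,   // the brand name
      data: doc.stat    // one summed total per date
    }]
  }
});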

The main point addressed here is that we can easily move away from your previous output, which separated the “facebook” and “twitter” data, and simply aggregate by date, bringing all of the data together “before” the arrays are constructed.

That last point then obviates the need for further “convoluted” operations to “merge” those two documents and the arrays produced.


Alternate Optimization

As an alternate approach, where you do not in fact need to aggregate across documents, you can essentially apply the “filter” to the arrays in place and then simply sum and reshape the received result in client code.

db.brands.aggregate([
  { "$lookup": {
    "from": "facebookaccounts",
    "localField": "facebookAccounts",
    "foreignField": "_id",
    "as": "facebookAccounts"
  }},
  { "$lookup": {
    "from": "twitteraccounts",
    "localField": "twitterAccounts",
    "foreignField": "_id",
    "as": "twitterAccounts"
  }},
  // Filter each nested "days" array in place; nothing is unwound here
  { "$project": {
    "name": 1,
    "all": {
      "$map": {
        "input": { "$concatArrays": [ "$facebookAccounts", "$twitterAccounts" ] },
        "as": "all",
        "in": {
          "years": {
            "$map": {
              "input": "$$all.years",
              "as": "year",
              "in": {
                "months": {
                  "$map": {
                    "input": "$$year.months",
                    "as": "month",
                    "in": {
                      "days": {
                        "$filter": {
                          "input": "$$month.days",
                          "as": "day",
                          "cond": {
                            "$and": [
                              { "$gte": [ "$$day.date", new Date("2017-06-16") ] },
                              { "$lte": [ "$$day.date", new Date("2017-06-18") ] }
                            ]
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }}
]).map(doc => {
  // Flatten years -> months -> days into a single array of day entries
  doc.all = [].concat.apply([],
    [].concat.apply([],
      [].concat.apply([], doc.all.map(d => d.years)).map(d => d.months)
    ).map(d => d.days)
  );
  // Sum likes/followers into a single stat per distinct date
  doc.all = doc.all.reduce((a,b) => {
    var idx = a.findIndex( d => d.date.valueOf() == b.date.valueOf() );
    var stat = (b.hasOwnProperty('likes')) ? (b.likes || 0) : (b.followers || 0);
    if ( idx != -1 ) {
      a[idx].stat += stat;
    } else {
      a = a.concat([{ date: b.date, stat: stat }]);
    }
    return a;
  },[]);

  // Reshape into the parallel arrays used for charting
  doc.date = doc.all.map(d => d.date);
  doc.stat = doc.all.map(d => d.stat);
  delete doc.all;
  return doc;
})

This really leaves all the things that “need” to happen on the server, on the server. It is then a fairly trivial task to “flatten” the array, “sum up” the entries, and reshape them. This means less load on the server, and the data returned is not really that much larger per document.
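
If the chained [].concat.apply calls above read as opaque, the same flattening can be expressed with plain reduce calls. A sketch with identical behavior, assuming the same nested shape:

// accounts -> years -> months -> days, one level at a time
var days = doc.all
  .reduce((acc, a) => acc.concat(a.years), [])
  .reduce((acc, y) => acc.concat(y.months), [])
  .reduce((acc, m) => acc.concat(m.days), []);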

Gives the same result of course:

[
        {
                "_id" : ObjectId("5943f427e7c11ac3ad3652b0"),
                "name" : "Brand1",
                "date" : [
                        ISODate("2017-06-16T00:00:00Z"),
                        ISODate("2017-06-17T00:00:00Z"),
                        ISODate("2017-06-18T00:00:00Z")
                ],
                "stat" : [
                        973415,
                        69397,
                        973773
                ]
        }
]

Committing to the Diet

The biggest problem you really have is the multiple collections and the heavily nested documents. Neither of these is doing you any favors here, and with larger results they will cause real performance problems.

The nesting in particular is completely unnecessary, and it is also not very maintainable, since there are limitations on “updates” where you have nested arrays. See the positional $ operator documentation, as well as the many posts about this.
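
To illustrate, with hypothetical values: only one positional $ is allowed per update path, so an individual “day” entry inside the current schema cannot be targeted directly at all:

db.facebookaccounts.update(
  { "years.months.days.date": new Date("2017-06-17") },
  // Fails: the positional "$" cannot be used more than once in a path,
  // so the inner "months" and "days" positions are unreachable
  { "$set": { "years.$.months.$.days.$.likes": 69400 } }
)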

Instead you really want a single collection with all those “days” entries in it. You can always work with that source easily, for both query and aggregation purposes, and it should look something like this:

{
        "_id" : ObjectId("5948cd5cd6eb0b7d6ac38097"),
        "date" : ISODate("2017-06-16T00:00:00Z"),
        "likes" : 904025,
        "__t" : "Facebook",
        "account" : ObjectId("5943f427e7c11ac3ad3652ac")
}
{
        "_id" : ObjectId("5948cd5cd6eb0b7d6ac38098"),
        "date" : ISODate("2017-06-17T00:00:00Z"),
        "likes" : null,
        "__t" : "Facebook",
        "account" : ObjectId("5943f427e7c11ac3ad3652ac")
}
{
        "_id" : ObjectId("5948cd5cd6eb0b7d6ac38099"),
        "date" : ISODate("2017-06-18T00:00:00Z"),
        "likes" : 904345,
        "__t" : "Facebook",
        "account" : ObjectId("5943f427e7c11ac3ad3652ac")
}
{
        "_id" : ObjectId("5948cd5cd6eb0b7d6ac3809a"),
        "date" : ISODate("2017-06-16T00:00:00Z"),
        "followers" : 69390,
        "__t" : "Twitter",
        "account" : ObjectId("5943f427e7c11ac3ad3652aa")
}
{
        "_id" : ObjectId("5948cd5cd6eb0b7d6ac3809b"),
        "date" : ISODate("2017-06-17T00:00:00Z"),
        "followers" : 69397,
        "__t" : "Twitter",
        "account" : ObjectId("5943f427e7c11ac3ad3652aa")
}
{
        "_id" : ObjectId("5948cd5cd6eb0b7d6ac3809c"),
        "date" : ISODate("2017-06-18T00:00:00Z"),
        "followers" : 69428,
        "__t" : "Twitter",
        "account" : ObjectId("5943f427e7c11ac3ad3652aa")
}
{
        "_id" : ObjectId("5948cd5cd6eb0b7d6ac3809d"),
        "date" : ISODate("2017-06-19T00:00:00Z"),
        "followers" : 69457,
        "__t" : "Twitter",
        "account" : ObjectId("5943f427e7c11ac3ad3652aa")
}

Combining those referenced in the brands collection as well:

{
        "_id" : ObjectId("5943f427e7c11ac3ad3652b0"),
        "name" : "Brand1",
        "accounts" : [
                ObjectId("5943f427e7c11ac3ad3652ac"),
                ObjectId("5943f427e7c11ac3ad3652aa")
        ]
}
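
With this flat shape, ordinary queries also become trivial, and a compound index supports both the join and the date filter. A sketch; the index and the query values here are illustrative:

db.social.createIndex({ "account": 1, "date": 1 })

db.social.find({
  "account": ObjectId("5943f427e7c11ac3ad3652ac"),
  "date": { "$gte": new Date("2017-06-16"), "$lte": new Date("2017-06-18") }
})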

Then you simply aggregate like this:

db.brands.aggregate([
  // One flat collection to join, one document per account per day
  { "$lookup": {
    "from": "social",
    "localField": "accounts",
    "foreignField": "account",
    "as": "accounts"
  }},
  // $unwind then $match directly after $lookup: the server "hoists"
  // both stages into the $lookup itself (see the explain output below)
  { "$unwind": "$accounts" },
  { "$match": {
    "accounts.date": {
      "$gte": new Date("2017-06-16"), "$lte": new Date("2017-06-18")
    }
  }},
  // Sum likes and followers per brand, per day
  { "$group": {
    "_id": {
      "brand": "$name",
      "date": "$accounts.date"
    },
    "stat": {
      "$sum": {
        "$sum": [
          { "$ifNull": [ "$accounts.likes", 0 ] },
          { "$ifNull": [ "$accounts.followers", 0 ] }
        ]
      }
    }
  }},
  { "$sort": { "_id": 1 } },
  // Parallel date/stat arrays per brand
  { "$group": {
    "_id": "$_id.brand",
    "date": { "$push": "$_id.date" },
    "stat": { "$push": "$stat" }
  }}
])

This is actually the most efficient thing you can do, and that is mostly because of what actually happens on the server. We need to look at the “explain” output to see what happens to the pipeline.
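
You can see this for yourself by passing the explain option to aggregate(). A minimal sketch in the mongo shell, with only the stages that matter for the optimization:

db.brands.aggregate(
  [
    { "$lookup": {
      "from": "social",
      "localField": "accounts",
      "foreignField": "account",
      "as": "accounts"
    }},
    { "$unwind": "$accounts" },
    { "$match": {
      "accounts.date": {
        "$gte": new Date("2017-06-16"), "$lte": new Date("2017-06-18")
      }
    }}
  ],
  { "explain": true }
)

The relevant portion of the output shows how those stages are combined: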

{
    "$lookup" : {
      "from" : "social",
      "as" : "accounts",
      "localField" : "accounts",
      "foreignField" : "account",
        "unwinding" : {
          "preserveNullAndEmptyArrays" : false
        },
        "matching" : {
           "$and" : [
             {
               "date" : {
                 "$gte" : ISODate("2017-06-16T00:00:00Z")
               }
             },
             {
               "date" : {
                 "$lte" : ISODate("2017-06-18T00:00:00Z")
               }
             }
           ]
        }
    }
}

This is what happens when you send $lookup -> $unwind -> $match to the server: the latter two stages are “hoisted” into the $lookup itself. This reduces the results returned by the actual “query” run against the collection being joined.

Without that sequence, $lookup potentially pulls in “a lot of data” with no constraint, and would break the 16MB BSON limit under most normal loads.

So not only is the process a lot simpler in the altered form, it actually “scales” where the present structure will not. This is something you should seriously consider.
