External URLs are not allowed in your sitemap (or rather, they won’t have the desired effect of being indexed by Google as part of your site’s content).
I think your best option is to dedicate a path on your site like /hosted/pdf/xxxx.pdf that rewrites everything to cloudfront.com/pdf/xxxx.pdf or similar using mod_rewrite/location patterns/regex.
That way you can use a local site URL in your sitemap and still have the browser sent directly to the CloudFront-served content. I think this might even be a good use of the 302 HTTP status code.
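The redirect itself would normally live in the web server config, but the mapping is easy to sketch in Python. The following is only an illustration under assumptions: `CDN_BASE`, `cdn_url_for` and `app` are hypothetical names, the CDN host is the placeholder from the paths above, and the small WSGI app stands in for whatever mod_rewrite or location rule you end up using.

```python
import re

# Assumption: placeholder CDN host from the example paths above;
# replace it with your real distribution domain.
CDN_BASE = "https://cloudfront.com"

# Only plain file names ending in .pdf are accepted (no slashes),
# so "../" style paths cannot slip through.
HOSTED_PDF = re.compile(r"/hosted/pdf/(?P<name>[\w.-]+\.pdf)")


def cdn_url_for(path):
    """Return the CDN URL for a local /hosted/pdf/ path, or None if it does not match."""
    match = HOSTED_PDF.fullmatch(path)
    if match is None:
        return None
    return f"{CDN_BASE}/pdf/{match.group('name')}"


def app(environ, start_response):
    """Minimal WSGI app: answer 302 for hosted PDFs, 404 for everything else."""
    target = cdn_url_for(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    start_response("302 Found", [("Location", target)])
    return [b""]
```

The sitemap then lists only the local /hosted/pdf/... URLs, and a crawler or browser following one of them is sent on to the CDN.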
In the Sitemap class there is an items() method that returns what is to be included in the sitemap.xml, and you could create your own class that extends it and adds additional data.
You can hardcode the data in the method, but I think the preferred option is to create a Model that represents each remotely hosted file and contains the information necessary to output it in the sitemap. (This also lets you add properties such as visibility on a per-file basis and lets you manage the files via the admin, assuming you set up a ModelAdmin for the model.)
I think you might be able to do something similar to what they show in http://docs.djangoproject.com/en/1.9/ref/contrib/sitemaps with the BlogSitemap class that extends Sitemap. Be sure to check the heading “Sitemap for static views” on that page as well.
My suggestion is that you choose the model approach to represent the files, so you have your hosted PDFs (or other CDN content) as a model called StaticHostedFile or similar, and you iterate through all of them in the items() method. It does require you to index all the current PDFs to create model instances for them, and to create a new instance whenever a new PDF is added (but that could be automated).
It can be good to know that a sitemap.xml can act as an index that “includes” other sitemaps, so you might be able to split the site content into two sitemaps (content + PDFs) and reference both from sitemap.xml, for instance:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.example.com/original_sitemap.xml</loc>
<lastmod>2016-07-12T09:12Z</lastmod>
</sitemap>
<sitemap>
<loc>http://www.example.com/pdf_sitemap.xml</loc>
<lastmod>2016-07-15T08:55Z</lastmod>
</sitemap>
</sitemapindex>
This still requires local URLs and rewrites as described above, but it can be a nifty trick when you have several separate sitemaps to combine. (For instance, if you run a Django site under one subdirectory and a WordPress site under another.)
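If you would rather generate that index file than maintain it by hand, something along these lines should work with just the standard library. This is a minimal sketch: `build_sitemap_index` is a hypothetical helper name, and the URLs and dates are simply the ones from the example above.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(entries):
    """Build a sitemap index from (loc, lastmod) pairs and return it as an XML string."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for loc, lastmod in entries:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


index_xml = build_sitemap_index([
    ("http://www.example.com/original_sitemap.xml", "2016-07-12T09:12Z"),
    ("http://www.example.com/pdf_sitemap.xml", "2016-07-15T08:55Z"),
])
print(index_xml)
```

Django's sitemap framework also ships an index view (django.contrib.sitemaps.views.index) that can produce such a file for the sitemaps it manages, so this helper is mostly useful when some of the sitemaps come from outside Django.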