Sitecore BuiltWith Statistics: How Hard to Detect Headless Sites?

Anton Tishchenko

Sitecore BuiltWith Statistics: Is It Harder to Detect Headless Sites?

Recently, there has been a lot of discussion about Sitecore BuiltWith statistics.

Sitecore BuiltWith Statistics

Different people suggested different theories about why the count of Sitecore websites dropped. One theory was that it became much harder to tell whether a website runs on Sitecore when it is built with a headless approach on Sitecore XM Cloud. Let’s figure it out: did something change? Did detection become easier, harder, or did nothing change?

First of all, we have no idea how BuiltWith does it. Considering that BuiltWith is managed by one person, the result is decent. Gary Brewer, if you are reading this post, feel free to improve your BuiltWith Sitecore identification based on this article.

Why do people want to know?

Site technology identification makes sense for a few categories of people.

  1. Hackers. If you know the technologies a website is built on, you know its existing vulnerabilities and can exploit them. Fortunately, Sitecore is a secure platform. Only a few security bulletins were released over the last few years, and all of them were hard to exploit without additional conditions. Hiding the platform is crucial for less secure platforms, but not for Sitecore. I don’t know a lot about WordPress security, but a few times I saw someone try to run a WP exploit against our Sitecore website. Of course, without success.
  2. Sales. If salespeople know a website’s technology, they can sell its owner something additional. Is that bad? No. Let them try. Or they can try to sell a different platform. You should not be so unsure of your work and your choice of platform that you need to hide it.
  3. Developers and other tech people. Curiosity is the lust of the mind. When you see something interesting, you want to know how it was built. There is nothing bad in that. Let people know.
  4. Market analysts. They want to analyze the market and predict the future. Depending on your vendor policies, it can make sense to hide the platform from them. But for you as a client or integrator, it isn’t worth the effort to hide it.

How can you tell if it is a Sitecore MVC website?

There are plenty of ways to identify a Sitecore MVC website. Thanks to Jammy, Mark, Rob, and Michael, you can read about them on StackOverflow. I will sum them up and extend them:

  • Media assets. When you use Sitecore, you most probably use the Media Library. Even if you have a Digital Asset Management system in place, you may still use a few images from the Media Library. Media items from the Sitecore Media Library have the extension .ashx (ASP.NET handler) and the prefix /~/media or /-/media. Also, if no media is present on the page, you can request known media files: /~/media/System/Template%20Thumbnails/audio.ashx
  • Presence of /sitecore/shell/sitecore.version.xml or other Sitecore-specific files or paths. The Sitecore security hardening guide recommends restricting access to them. But, unfortunately, you can still find them on many Sitecore sites.
  • Sitecore Analytics cookies. If you use Sitecore XP, the SC_ANALYTICS_GLOBAL_COOKIE and SC_ANALYTICS_SESSION_COOKIE cookies need to be present to track your visitors’ journeys.
  • Sitecore Analytics files. layouts/system/VisitorIdentificationCSS.aspx, /layouts/system/VIChecker.aspx, /layouts/system/VisitorIdentification.js.
  • Sitecore default items. If you don’t remove default Sitecore content items that come out of the box, most probably you will be able to access them using these URLs: /sitecore/content/home or ?sc_itemid={0DE95AE4-41AB-4D01-9EB0-67441B7C2450}
  • The presence of Sitecore SXA-related artifacts also tells you that it is Sitecore. The base-themes/core-libraries/scripts/optimized-min.js script indicates that the Sitecore site is SXA-based. Even if the file is renamed, it can be identified by its content. The same goes for styles present on the page. You can guess Sitecore from known CSS class names, and the styles assigned to them, that correspond to known SXA components, like Column Splitter: row component column-splitter.
  • The sxa_site cookie tells us the Sitecore SXA site name.

You don’t need to have all of them at once. The more signals you detect, the more likely it is that the website is Sitecore-based.
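The signals listed above can be combined into a simple checker. Below is a minimal Python sketch, not a description of how BuiltWith works. The function names, the signal labels, and the idea of simply collecting the matching signals are assumptions made for illustration. The paths, cookie names, and CSS class come from the list above. The code only inspects HTML and cookie names that you have already fetched, so it makes no network requests itself.

```python
import re

# HTML patterns from the list above. The labels on the left are
# hypothetical names invented for this sketch.
MVC_HTML_SIGNALS = {
    "media_library": re.compile(r"/[~-]/media/", re.IGNORECASE),
    "visitor_identification": re.compile(
        r"/layouts/system/VisitorIdentification", re.IGNORECASE
    ),
    "sxa_core_script": re.compile(
        r"base-themes/core-libraries/scripts/optimized-min\.js", re.IGNORECASE
    ),
    "sxa_column_splitter": re.compile(
        r'class="[^"]*\bcolumn-splitter\b', re.IGNORECASE
    ),
}

# Cookie names from the list above, again with hypothetical labels.
MVC_COOKIE_SIGNALS = {
    "analytics_global_cookie": "SC_ANALYTICS_GLOBAL_COOKIE",
    "analytics_session_cookie": "SC_ANALYTICS_SESSION_COOKIE",
    "sxa_site_cookie": "sxa_site",
}

# Paths that can be requested when the page itself gives no hints.
PROBE_PATHS = [
    "/sitecore/shell/sitecore.version.xml",
    "/~/media/System/Template%20Thumbnails/audio.ashx",
    "/layouts/system/VisitorIdentification.js",
    "/sitecore/content/home",
]


def detect_mvc_signals(html, cookie_names):
    """Return the labels of all signals found in the page HTML and cookies."""
    found = [
        label for label, pattern in MVC_HTML_SIGNALS.items() if pattern.search(html)
    ]
    lowered = {name.lower() for name in cookie_names}
    found += [
        label
        for label, cookie in MVC_COOKIE_SIGNALS.items()
        if cookie.lower() in lowered
    ]
    return found


def probe_urls(base_url):
    """Build the list of URLs worth requesting for a given site root."""
    return [base_url.rstrip("/") + path for path in PROBE_PATHS]
```

The longer the list returned by detect_mvc_signals, the more confident you can be. A single match, such as a /-/media/ path, is only a hint.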

How can you tell if it is a Sitecore headless website?

It was easy to detect Sitecore in the case of the MVC approach, but how about headless? Is it harder to identify a Sitecore headless website? Let’s go through the possible attributes that tell us a site is Sitecore-based.

  • Media assets. Media items are still in place. You can look for /-/jssmedia/ or /-/media/ in the media paths. And the trick with known media files still works: /-/jssmedia/System/Template%20Thumbnails/audio.ashx
  • You still can’t hide analytics. SC_ANALYTICS_GLOBAL_COOKIE is still present. Moreover, if there is an advanced analytics implementation, you will see requests to the /sitecore/api/jss/track/event endpoint.
  • The presence of <script id="__NEXT_DATA__" type="application/json"> on the page with Sitecore layout details JSON tells us that the website is based on JSS Next.js. It can be disabled if you don’t need single-page application navigation and personalization. But most Sitecore Next.js websites still have it.
  • The presence of <script id="__JSS_STATE__" type="application/json"> on the page with Sitecore layout details JSON tells us that the website is JSS-based but runs on React, Angular, or Vue (non-Next.js). You can easily identify Angular, React, or Vue, but that is another topic.
  • The sxa_site cookie can tell us that the website is headless and managed by SXA.
  • The Sitecore default items trick (/sitecore/content/home and ?sc_itemid={0DE95AE4-41AB-4D01-9EB0-67441B7C2450}) will work for React, Vue, and Angular, but not for Next.js.

Some circumstantial evidence remained, some changed, and some disappeared. But you still can easily detect Sitecore-based websites. Even SSG on Next.js, even on XM Cloud.
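The two script-tag checks can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive detector. The function names and return labels are hypothetical. The element ids are the ones Next.js (`__NEXT_DATA__`) and JSS (`__JSS_STATE__`) emit. Treating a "sitecore" key anywhere in the embedded JSON as the sign of Sitecore layout details is an assumption about the layout data shape, so verify it against a real page.

```python
import json
import re

# Matches the state script emitted by Next.js or by other JSS apps.
STATE_SCRIPT_RE = re.compile(
    r'<script[^>]*\bid="(__NEXT_DATA__|__JSS_STATE__)"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def contains_key(node, key):
    """Recursively check whether a parsed JSON value contains the given key."""
    if isinstance(node, dict):
        return key in node or any(contains_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(contains_key(v, key) for v in node)
    return False


def jss_flavour(html):
    """Return a hypothetical label for the JSS flavour, or None if not detected."""
    match = STATE_SCRIPT_RE.search(html)
    if not match:
        return None
    script_id, body = match.group(1), match.group(2)
    try:
        state = json.loads(body)
    except json.JSONDecodeError:
        return None
    # Assumption: Sitecore layout details appear under a "sitecore" key.
    if not contains_key(state, "sitecore"):
        return None
    if script_id == "__NEXT_DATA__":
        return "jss-nextjs"
    return "jss-react-angular-vue"


def has_jss_media(html):
    """Check for the media path prefixes mentioned in the list above."""
    return "/-/jssmedia/" in html or "/-/media/" in html
```

A plain Next.js site also ships a `__NEXT_DATA__` script, which is why the sketch looks inside the JSON instead of stopping at the element id.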

Funny thing about Sitecore Astro implementation

If you remember, our company made a fully-featured Sitecore JSS SDK for Astro. I went through the attributes described above for the Sitecore headless approach, and they didn’t work. (To be honest, the Analytics signal would work, but it wasn’t enabled for the demo website.) It wasn’t intentional; it is just how the Astro approach works. If you want a secure, blazing-fast, and undetectable Sitecore-based website, we have a good option for you. Feel free to contact us!

Conclusion

Now we know that detecting a Sitecore headless website is still a piece of cake. If developers don’t make any effort to hide it, it is detectable. We don’t know anything about how BuiltWith performs its detection. I checked a few Sitecore headless websites, and it worked fine for them. It is always up to you whether to trust their statistics. But the reason for the drop in the count of Sitecore websites is not that Sitecore headless sites are hard to detect. It could be an actual drop in the number of websites, detection mistakes on the BuiltWith side, earlier false detections, or anything else.

And a reminder about our statistics for Sitecore headless SDK usage: usage of the Sitecore JSS SDK is growing! Next.js crossed 100k installations in a month and grew 75% over the last year. Does that mean that Sitecore usage is growing? Not necessarily; it is just another statistic. You need to gather information from different sources and build a holistic picture for yourself.

Update from 8 August 2024

I got a response from Gary Brewer, the founder of BuiltWith. Some of the approaches described in the article will be added to the detection algorithm. Sitecore detection by BuiltWith will become more accurate. Thanks, Gary!