RUM Archive Third-Party Resource Data
By Nic Jansma on
Table of Contents
We're excited to announce a preview of a new type of data available in the RUM Archive: Third-Party Resources
At mPulse, we are using boomerang.js to capture RUM data. Boomerang uses the ResourceTiming API to gather data about all of the sub-resource fetches on the page (such as JavaScript, CSS, images, etc) for every Page Load beacon. This data is compressed and beaconed to mPulse for analytics purposes.
From this data, viewed in aggregate across all of our customers, we can easily spot common third-party components such as:
- Tag Managers (Google Tag Manager, GTag)
- Analytics (Google Analytics, mPulse)
- Widgets (Pinterest, Twitter, Bazaar Voice, Youtube)
- Fonts (Google Fonts)
- Beacons (Meta Pixel, mPulse, Google Analytics, Doubleclick)
- Ads (Google, Amazon, Criteo)
- Libraries: (jQuery)
For this new table, we're going to begin exporting the Top 500+ third-party resources we see each day, similar to how we share Page Load data. Each third-party resource URL will have information on its popularity, performance and size. These URL trends should be trackable over time as well.
Example Queries
Let's look at some example queries for how you can use this data.
It's easy to view the most popular third-party URLs by day. For example, here are the Top 10 third-party URLs on 2023-10-15:
SELECT URL,
SUM(FETCHES) AS FETCHES
FROM akamai-mpulse-rumarchive.rumarchive.rumarchive_resources
WHERE DATE = '2023-10-15'
GROUP BY URL
ORDER BY SUM(FETCHES) DESC
LIMIT 10
Rank | URL | FETCHES |
---|---|---|
1 | https://www.googletagmanager.com/gtag/js |
14,191,855 |
2 | https://www.googletagmanager.com/gtm.js |
9,077,607 |
3 | https://www.facebook.com/tr/ |
7,848,302 |
4 | https://www.google-analytics.com/j/collect |
7,835,923 |
5 | https://www.google-analytics.com/collect |
6,579,589 |
6 | https://www.google-analytics.com/analytics.js |
4,645,448 |
7 | https://stats.g.doubleclick.net/j/collect |
3,755,276 |
8 | https://analytics.google.com/g/collect |
3,296,602 |
9 | https://c.go-mpulse.net/api/config.json |
2,673,639 |
10 | https://securepubads.g.doubleclick.net/pcs/view |
2,544,007 |
Each URL has multiple dimensions associated with it, such as:
DEVICETYPE
USERAGENTFAMILY
andUSERAGENTVERSION
DEVICEMODEL
OS
andOSVERSION
BEACONTYPE
COUNTRY
PROTOCOL
INITIATORTYPE
ASSETTYPE
Let's look at one of those dimensions. initiatorTypes reflects what DOM element triggered the resource fetch. Here are all of the initiatorTypes
for these third-party resource URLs in our data, ordered by popularity:
SELECT INITIATORTYPE,
COUNT(*) AS CNT
FROM akamai-mpulse-rumarchive.rumarchive.rumarchive_resources
WHERE DATE = '2023-10-15'
GROUP BY INITIATORTYPE
ORDER BY CNT DESC
INITIATORTYPE | CNT |
---|---|
img |
421,072 |
script |
283,536 |
xmlhttprequest |
184,936 |
beacon |
76,421 |
fetch |
68,216 |
iframe |
59,296 |
other |
42,581 |
link |
16,156 |
css |
10,844 |
(null) | 565 |
You can split, filter and group by other dimensions as well. For example, let's look at how Google's GTag's median total time (TOTALHISTOGRAM
) and download time (DOWNLOADHISTOGRAM
) changes by country (GROUP BY COUNTRY
) on mobile (DEVICETYPE=Mobile
) devices:
SELECT COUNTRY,
SUM(FETCHES) AS FETCHES,
`akamai-mpulse-rumarchive.rumarchive.PERCENTILE_APPROX`(
ARRAY_AGG(TOTALHISTOGRAM),
[0.50],
10,
false) as TOTAL_MEDIAN,
`akamai-mpulse-rumarchive.rumarchive.PERCENTILE_APPROX`(
ARRAY_AGG(DOWNLOADHISTOGRAM),
[0.50],
10,
false) as DOWNLOAD_MEDIAN,
FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_resources`
WHERE DATE = '2023-10-15'
AND DEVICETYPE = 'Mobile'
AND URL = 'https://www.googletagmanager.com/gtag/js'
GROUP BY COUNTRY
ORDER BY FETCHES DESC
LIMIT 10
COUNTRY | FETCHES | TOTAL_MEDIAN | DOWNLOAD_MEDIAN |
---|---|---|---|
US | 3,936,631 | 112 | 25 |
IN | 723,496 | 362 | 158 |
JP | 536,711 | 131 | 87 |
GB | 404,121 | 72 | 7 |
CA | 289,541 | 117 | 10 |
IT | 184,021 | 96 | 128 |
BR | 181,061 | 243 | 157 |
DE | 179,368 | 81 | 7 |
FR | 171,739 | 79 | 7 |
MX | 165,870 | 213 | 335 |
What URLs are included
Our goal is to share data about the top third-party resource URLs seen across the web. To start, we plan on sharing the 500+ most popular third-party URLs on a daily basis.
We only consider a URL as being a third-party URL if it shows up in at least 10 different mPulse customer accounts. We are also reviewing (allowlist-ing) all URLs to ensure they don't contain any customer-specific information.
All URLs have their query string removed, so no sensitive data is shared.
Once a URL makes it into the allowlist, we will continue querying for it even if it drops out of the top 500. This means the list of URLs will grow slowly over time.
Some URLs share a common pattern with minor differences, such as the .cc
country code or a v1.0.0
version number. Where possible, we'll assign a wildcard URLGROUP
pattern to these URLs so they can all be tracked as a group.
Caveats
As with any set of data, this Third-Party Resource data comes with some caveats:
- All ResourceTiming data is subject to visibility concerns:
- Over 30% of requests are completely hidden from RUM (due to being in a cross-origin
IFRAME
) - Over 45% of requests don't have detailed timings or sizes (due to cross-origin restrictions and no
Timing-Allow-Origin
header)
- Over 30% of requests are completely hidden from RUM (due to being in a cross-origin
- The mPulse Resources dataset is exactly that -- a reflection of mPulse customers, not of the entire internet
- The mPulse data is sampled and raw numbers (e.g. of fetches) should be only used for relative weighting
- All of the other tips we have on querying RUM Archive data apply
Feedback Welcome
This is a brand new set of data we're making publicly available. It may have errors, inconsistencies, quirks, etc.
We'd love to have web performance nerds take a look at the data and give us feedback. Is it interesting? Useful? Does anything seem off?
Please open an Issue in Github with any feedback you have!
Thanks!