# Querying

## Table of Contents

## Introduction

RUM Archive datasets are stored in Google BigQuery.

To execute queries against RUM Archive datasets, you will need a Google account.

**Note**: BigQuery has costs for storage and querying. You'll want to review what's available in the free tier, and estimate how much querying the RUM Archive may cost. You can read our tips on how to reduce your BigQuery costs.

## Datasets

Each provider of RUM Archive data will store their datasets in a different location.

Review the datasets documentation for information on where each dataset is located within BigQuery.

The rest of this guide will assume you're querying against the Akamai mPulse RUM dataset.

## Accessing BigQuery

The first step in querying RUM Archive data is gaining access to Google BigQuery and the desired datasets.

Follow the steps below to link (or create) a BigQuery project to the desired RUM Archive datasets.

- Log into the Google Cloud Projects starting page:

- Create a new, or use an existing Project:

- You can name a new Google Cloud Project whatever you wish:

- Head over to the BigQuery Console for your new project:

- We're going to add the desired datasets to your BigQuery console. Click on
*+ Add Data*and select*Star a project by name*:

- The shared BigQuery project name should be listed in the dataset definitions. In this case, we'll add the Akamai mPulse RUM dataset which is under the
`akamai-mpulse-rumarchive`

project:

- You should now see the desired dataset starred in your BigQuery Console:

You're all set, and ready to query RUM Archive data.

## Aggregation

All of the data represented in RUM Archive BigQuery tables is **aggregated**. This means that each row represents *more than one* datapoint. In fact, some rows can represent hundreds, thousands or millions of datapoints.

To understand how the aggregated rows are generated, please review the methodology guide.

When querying RUM Archive data, you will be executing standard BigQuery SQL queries against these pre-aggregated rows of data. From those rows, you should be able to execute any of the calculations listed below.

### High-Level Example

To illustrate how RUM Archive data is stored with a simple example, please consider 4 unique page loads, using two dimensions (User Agent Family, Country) that track a single Timer (Page Load Time):

Page Load # | User Agent Family | Country | Page Load Time (ms) |
---|---|---|---|

1 | Chrome | US | 100 |

2 | Chrome | DE | 200 |

3 | Safari | US | 300 |

4 | Safari | US | 350 |

Stored in RUM Archive format, there are 3 unique tuples of dimensions, so there would be 3 rows of aggregated data:

User Agent Family | Country | PLTCount | PLTHistogram | PLTAvg | PLTSumLn |
---|---|---|---|---|---|

Chrome | US | 1 | `{"1":[100,1]}` |
100 | 4.605 |

Chrome | DE | 1 | `{"2":[200,1]}` |
200 | 5.298 |

Safari | US | 2 | `{"3":[325,2]}` |
325 | 11.561 |

With data in this format, one can calculate many statistics such as any (approximate) percentile, averages, and geometric means.

More information:

## Querying

Each row will contain **dimensions**, **counts** and **timers/metrics**.

**Dimensions**are used to**aggregate**,**filter**and**group by**data.**Counts**are used to understand relative weights of each dimension, and used in most of the statistical calculations- Each
**Timer and Metric**has several columns associated with it to aid in the statistical calculations below

For details on how each RUM Archive table is structured, please see the tables documentation.

**Note**: Most RUM Archive queries should be limited to a single `DATE`

(or range) as the RUM Archive tables are partitioned by it. If you exclude a `DATE`

clause, you will be scanning an entire dataset, which can be costly.

### Aggregating

Since RUM Archive data is pre-aggregated, you can apply standard SQL commands to the `BEACONS`

(overall matches) and `xyzCOUNT`

columns to understand relative totals of the data.

`SELECT COUNT(*) as ROWCOUNT,`

SUM(BEACONS) as BEACONCOUNT

FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_page_loads`

WHERE DATE = "2022-09-01"

`[{`

"rowCount": "1535341",

"beaconCount": "268409228"

}]

So with the Akamai mPulse RUM dataset on `2022-09-01`

, there are 1.5 million aggregation rows that represent 268 million page loads (from the **sampled** and **aggregated** data set).

### Grouping

RUM Archive data can utilize standard `GROUP BY`

SQL clauses to group data.

For example, to look at the distinct HTTP Protocol (`PROTOCOL`

) dimensions for one day:

`SELECT PROTOCOL`

FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_page_loads`

WHERE DATE = "2022-09-01"

GROUP BY PROTOCOL

### Filtering

RUM Archive data can utilize standard `WHERE`

SQL clauses to filter dimensions.

For example, to filter to just `COUNTRY = 'US'`

:

`SELECT PROTOCOL,`

SUM(BEACONS) as COUNT

FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_page_loads`

WHERE DATE = "2022-09-01"

AND COUNTRY = 'US'

GROUP BY PROTOCOL

ORDER BY COUNT(*) DESC

## Calculations

From these rows and columns, you should be able to calculate:

- Relative weights (popularity) of dimensions within the dataset
- Statistics about each timer and metric:
- Approximate histograms
- Approximate percentiles (for all percentiles 0 through 100)
- Weighted averages
- Geometric means

### Relative Weights

An important aspect of RUM Archive data is that many datasets will provide **sampled** data.

As a result, the `BEACONS`

and `xyzCOUNT`

columns do not represent 100% of the data that happened in the real world. One should not look at a `BEACONS`

count of 100 and assume there were exactly 100 experiences that match the dimensions on that `DATE`

.

What can be understood from the `BEACONS`

and `xyzCOUNT`

columns is relative weighting -- that is, for a single `DATE`

, if you look at all matching values for a Dimension and **compare them** to other matching values, you can understand the relative weighting for that day of the dimension.

`SELECT PROTOCOL,`

SUM(BEACONS) as COUNT

FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_page_loads`

WHERE DATE = "2022-09-01"

GROUP BY PROTOCOL

ORDER BY COUNT(*) DESC

### Combined Histograms

For each row, each timer and metric in the RUM Archive contains a histogram of values based on fixed-sized buckets.

You can combine multiple row's histograms to calculate a histogram (frequency distribution) for aggregated rows.

The `akamai-mpulse-rumarchive`

project provides a public BigQuery function `COMBINE_HISTOGRAMS()`

that can combine multiple row's histograms into a combined histogram:

`CREATE OR REPLACE FUNCTION `akamai-mpulse-rumarchive.rumarchive.COMBINE_HISTOGRAMS`(`

histograms ARRAY<STRING>

)

RETURNS STRING

LANGUAGE js

AS ...

The `COMBINE_HISTOGRAMS()`

function takes a single argument:

`histograms ARRAY<STRING>`

: You can utilize the BigQuery`ARRAY_AGG()`

function to pass in the timer or metric`xyzHistogram`

column of choice

Example:

`SELECT COUNTRY,`

COUNT(*) AS ROWCOUNT,

SUM(BEACONS) AS BEACONCOUNT,

`akamai-mpulse-rumarchive.rumarchive.COMBINE_HISTOGRAMS`(ARRAY_AGG(PLTHISTOGRAM)) AS PLTHISTOGRAM

FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_page_loads`

WHERE DATE = "2022-09-01"

GROUP BY COUNTRY

ORDER BY SUM(BEACONS) DESC

Note the use of the BigQuery `ARRAY_AGG()`

to pass the `PLTHISTOGRAM`

column to the `COMBINE_HISTOGRAMS()`

function.

### Approximate Percentiles

By Combining Histograms of multiple rows, approximate percentiles can be calculated for the matched data.

The `akamai-mpulse-rumarchive`

project provides a public BigQuery function `PERCENTILE_APPROX()`

that can calculate the approximate percentile for any matching rows.

`CREATE OR REPLACE FUNCTION `akamai-mpulse-rumarchive.rumarchive.PERCENTILE_APPROX`(`

histograms ARRAY<STRING>,

percentiles ARRAY<FLOAT64>,

highPrecisionBucketWidth INTEGER,

includeZero BOOL)

RETURNS STRING

LANGUAGE js

AS ...

The `PERCENTILE_APPROX()`

function takes several arguments:

`histograms ARRAY<STRING>`

: You can utilize the BigQuery`ARRAY_AGG()`

function to pass in the timer or metric`xyzHistogram`

column of choice`percentiles ARRAY<FLOAT64>`

: An array of percentiles you'd like calculated- A single value in the array (e.g.
`[0.5]`

) will return an`INTEGER`

result - Multiple values in the array (e.g.
`[0.25, 0.5, 0.75]`

) will return a JSON map of input percentiles to their calculations

- A single value in the array (e.g.
`highPrecisionBucketWidth INTEGER`

: The width of the timer or metric's High Precision bucket, i.e.`100`

for Page Load Time`includeZero BOOL`

: Whether or not to include`0`

values in the percentile calculation

Example:

`SELECT COUNTRY,`

COUNT(*) AS ROWCOUNT,

SUM(BEACONS) AS BEACONCOUNT,

`akamai-mpulse-rumarchive.rumarchive.PERCENTILE_APPROX`(

ARRAY_AGG(PLTHISTOGRAM),

[0.0, 0.25, 0.50, 0.75, 0.90, 0.95, 1.0],

100,

false) as PERCENTILES,

`akamai-mpulse-rumarchive.rumarchive.PERCENTILE_APPROX`(

ARRAY_AGG(PLTHISTOGRAM),

[0.50],

100,

false) as MEDIAN

FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_page_loads`

WHERE DATE = "2022-09-01"

GROUP BY COUNTRY

ORDER BY SUM(BEACONS) DESC

`includeZero`

determines whether or not to include `0`

values in the percentile calculation. If set to `false`

, percentile calculations will start at values greater than `0`

.

For some timers or metrics, it may make sense ot include zeros, such as Cumulative Layout Shift (CLS), where a value of `0`

is valid and considered "good".

For example, setting `includeZero`

for metrics like CLS will affect the results:

`SELECT COUNTRY,`

COUNT(*) AS ROWCOUNT,

SUM(BEACONS) AS BEACONCOUNT,

PARSE_NUMERIC(`akamai-mpulse-rumarchive.rumarchive.PERCENTILE_APPROX`(

ARRAY_AGG(CLSHISTOGRAM),

[0.50],

10,

false)) / 1000 as MEDIAN_WITHOUT_ZEROS, -- includeZero: false

PARSE_NUMERIC(`akamai-mpulse-rumarchive.rumarchive.PERCENTILE_APPROX`(

ARRAY_AGG(CLSHISTOGRAM),

[0.50],

10,

true)) / 1000 as MEDIAN_WITH_ZEROS -- includeZero: true

FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_page_loads`

WHERE DATE = "2022-09-01"

AND CLSCOUNT > 0

GROUP BY COUNTRY

ORDER BY SUM(BEACONS) DESC

A few differences from the last query:

`includeZero`

is set to`true`

for one of the columns,`false`

for the other- CLS has a High Width of
`10`

- CLS should be divided by
`1000`

, and utilizes`PARSE_NUMERIC()`

to do so - The
`CLSCOUNT`

clause ensures the query only looks at rows that had a CLS calculation

### Weighted Averages

Each timer and metric contains two columns that can be used to calculate a weighted average (weighted arithmetic mean).

**Note**: We generally don't recommend using averages when looking at performance metrics, as the distributions tend to be normal (or log-normal), and outliers can affect the data. Use approximate percentiles instead. For a longer explanation, SpeedCurve has a great article on Averages, medians & percentiles.

- A
`xyzCOUNT`

column - A
`xyzAVG`

column

Example calculating the weighted average of Page Load Time by Country (and comparing it to the Median, so you can see how they differ):

`SELECT COUNTRY,`

COUNT(*) AS ROWCOUNT,

SUM(BEACONS) AS BEACONCOUNT,

ROUND(SUM(PLTAVG * PLTCOUNT) / SUM(PLTCOUNT)) AS PLTAVG,

`akamai-mpulse-rumarchive.rumarchive.PERCENTILE_APPROX`(

ARRAY_AGG(PLTHISTOGRAM),

[0.50],

100,

false) as PLTMEDIAN

FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_page_loads`

WHERE DATE = "2022-09-01"

GROUP BY COUNTRY

ORDER BY SUM(BEACONS) DESC

### Weighted Geometric Means

Each timer and metric contains two columns that can be used to calculate a weighted geometric mean.

- A
`xyzCOUNT`

column - A
`xyzSUMLN`

column

Example calculating the weighted geometric mean of Page Load Time by Country (and comparing it to the weighted geometric mean, and Median, so you can see how they differ):

`SELECT COUNTRY,`

COUNT(*) AS ROWCOUNT,

SUM(BEACONS) AS BEACONCOUNT,

ROUND(SUM(PLTAVG * PLTCOUNT) / SUM(PLTCOUNT)) AS PLTAVG,

ROUND(EXP(SUM(PLTSUMLN) / SUM(PLTCOUNT))) AS PLTGEOMEAN,

`akamai-mpulse-rumarchive.rumarchive.PERCENTILE_APPROX`(

ARRAY_AGG(PLTHISTOGRAM),

[0.50],

100,

false) as PLTMEDIAN

FROM `akamai-mpulse-rumarchive.rumarchive.rumarchive_page_loads`

WHERE DATE = "2022-09-01"

GROUP BY COUNTRY

ORDER BY SUM(BEACONS) DESC