Tech report: category-level aggregations #928

Open
rviscomi opened this issue Sep 4, 2024 · 1 comment
Labels
Tech Report HTTP Archive Technology Report

Comments


rviscomi commented Sep 4, 2024

Use case: when comparing technologies within the same category, it can be useful to know how they all compare to some kind of category-level aggregation over all pages within the category.

Mockup: [image]

The blue line represents an aggregation of all pages within the CMS category, so a user can see how it compares to specific technologies within that category. It could also be possible to compare entire categories.

The technical implementation could look something like this:

  • update the technologies table schema to include a field indicating whether the row pertains to a technology or a category aggregation
    • all dimensions supported: rank, client, geo
    • backfill all historical data
  • provide a param in the API endpoints to distinguish between the two, only returning data for the selected aggregation type (default: technology)
  • add categories to the UI, similar to the special "ALL" technology
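
The API parameter described above could behave roughly like the following minimal sketch. The function name `select_rows`, the row layout, and the `agg_type` parameter name are assumptions for illustration only, not the actual API.

```python
# Hypothetical sketch: rows are plain dicts; nothing here is the real API.
VALID_TYPES = ("technology", "category")


def select_rows(rows, agg_type="technology", **dimensions):
    """Return rows of one aggregation type that match the given dimensions."""
    if agg_type not in VALID_TYPES:
        raise ValueError(f"unknown aggregation type: {agg_type!r}")
    return [
        row
        for row in rows
        # Assumption: rows without a "type" key predate the backfill and
        # are treated as technology-level rows.
        if row.get("type", "technology") == agg_type
        and all(row.get(name) == value for name, value in dimensions.items())
    ]


ROWS = [
    {"type": "technology", "category": "CMS", "app": "WordPress",
     "client": "desktop", "geo": "ALL", "rank": "ALL"},
    {"type": "category", "category": "CMS", "app": "All CMSs",
     "client": "desktop", "geo": "ALL", "rank": "ALL"},
]

# Default behaviour is unchanged: only technology rows come back.
technology_rows = select_rows(ROWS, category="CMS")
# Opting in to the category-level aggregation.
category_rows = select_rows(ROWS, agg_type="category", category="CMS")
```

Defaulting to `technology` keeps existing callers working without changes, which matches the default proposed in the list.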

In terms of the schema changes, we currently have the following fields:

  • date (2024-08-01)
  • geo (ALL)
  • rank (ALL)
  • category (CMS)
  • app (WordPress)
  • client (desktop)
  • [stats] where each field is aggregated over the set of pages that use WordPress for the given dimensions

The updated schema would look something like this for the CMS-level aggregation:

  • date (2024-08-01)
  • type (category)
  • geo (ALL)
  • rank (ALL)
  • category (CMS)
  • app (All CMSs)
  • client (desktop)
  • [stats] where each field is aggregated over the set of pages that use one or more CMSs for the given dimensions
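
As a rough illustration of the two row shapes, here is a sketch using a Python dataclass. The class name `ReportRow`, the `key` helper, and the `stats` representation are hypothetical; only the field names and example values come from the lists above.

```python
from dataclasses import dataclass, field
from typing import Dict, Literal


@dataclass
class ReportRow:
    """Hypothetical in-memory view of one row of the technologies table."""
    date: str
    geo: str
    rank: str
    category: str
    app: str
    client: str
    # Placeholder for the [stats] fields; their real shape is not specified here.
    stats: Dict[str, float] = field(default_factory=dict)
    # Proposed new field; defaulting to "technology" keeps existing rows valid.
    type: Literal["technology", "category"] = "technology"

    def key(self):
        """Dimensions that identify a row, including the new type field."""
        return (self.date, self.type, self.geo, self.rank,
                self.category, self.app, self.client)


wordpress = ReportRow("2024-08-01", "ALL", "ALL", "CMS", "WordPress", "desktop")
all_cms = ReportRow("2024-08-01", "ALL", "ALL", "CMS", "All CMSs", "desktop",
                    type="category")
```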

Calculating category-level data from the technology-level aggregations won't work because percentiles cannot be accurately combined. At best we could take a weighted average of the medians, but that would still not deduplicate origins that appear multiple times in a category because they use multiple technologies. For example, jQuery UI is always used with jQuery within the JS libraries category, so those websites would be counted twice. The implementation would therefore need to process the raw origin-level data.
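
To make the deduplication problem concrete, here is a small sketch with made-up numbers. The origins, metric values, and function names are all hypothetical, and it assumes a single metric value per origin.

```python
from collections import defaultdict
from statistics import median

# (origin, technology, metric value); invented data for illustration.
RAW = [
    ("a.example", "jQuery", 100),
    ("a.example", "jQuery UI", 100),
    ("b.example", "jQuery", 200),
    ("b.example", "jQuery UI", 200),
    ("c.example", "jQuery", 300),
    ("d.example", "React", 1000),
    ("e.example", "React", 1100),
]


def technology_medians(rows):
    """Median and page count per technology (the existing aggregation level)."""
    by_tech = defaultdict(list)
    for _origin, tech, value in rows:
        by_tech[tech].append(value)
    return {tech: (median(vals), len(vals)) for tech, vals in by_tech.items()}


def weighted_average_of_medians(rows):
    """Approximation built only from technology-level rows."""
    stats = list(technology_medians(rows).values())
    total = sum(count for _median, count in stats)
    return sum(med * count for med, count in stats) / total


def category_median(rows):
    """Median over raw origin-level data, counting each origin once."""
    per_origin = {origin: value for origin, _tech, value in rows}
    return median(per_origin.values())


approximate = weighted_average_of_medians(RAW)  # weights sum to 7 rows
exact = category_median(RAW)                    # computed over 5 unique origins
```

In this toy data the weighted average counts `a.example` and `b.example` twice, so it weights seven rows instead of five origins and lands on a value that is not the median of the category.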


rviscomi commented Sep 4, 2024

We decided to put this feature in the backlog for now, given the complexity of the implementation and the relatively low value it'd bring to the UX. If anyone feels strongly about it, feel free to add your 👍 to the comment above.

@rviscomi rviscomi added the Tech Report HTTP Archive Technology Report label Jan 15, 2025