Technology categories are missing in 2024-12-01 crawl #31

Open
max-ostapenko opened this issue Jan 9, 2025 · 3 comments

Comments


max-ostapenko commented Jan 9, 2025

Checked like this:

SELECT
  date,
  client,
  category,
  COUNT(DISTINCT root_page)
FROM crawl.pages
LEFT JOIN pages.technologies AS tech
LEFT JOIN tech.categories AS category
WHERE
  date >= '2024-11-01'
  AND rank <= 10000
  AND tech.technology = 'WordPress'
GROUP BY 1,2,3
ORDER BY 1,2,3;
date        client   category  f0_
2024-11-01  desktop  Blogs     545
2024-11-01  desktop  CMS       545
2024-11-01  mobile   Blogs     832
2024-11-01  mobile   CMS       832
2024-12-01  desktop            537
2024-12-01  desktop  Blogs     47
2024-12-01  desktop  CMS       47
2024-12-01  mobile             815
2024-12-01  mobile   Blogs     50
2024-12-01  mobile   CMS       50
2025-01-01  desktop  Blogs     534
2025-01-01  desktop  CMS       534
2025-01-01  mobile   Blogs     809
2025-01-01  mobile   CMS       809

@pmeenan do you have an idea?
Any way to restore?

Member

pmeenan commented Jan 9, 2025

Likely a result of this.

I can revert it and take another run at the change after looking closer to see why it didn't work as expected (maybe something about the inferred technologies not getting picked up).

If we still have the _detected_apps and _detected in the page JSON payload, we may be able to reconstruct it, but if they are being stripped out, we won't be able to.

Author

max-ostapenko commented Jan 10, 2025

We stripped all the duplicates...

Please add any additional objects to the payload to make the changes comparable.
Then we can look into them on staging and choose the best version for production.

@max-ostapenko
Author

It's not the technologies but NULL categories:

SELECT
  date,
  client,
  COUNT(DISTINCT root_page)
FROM crawl.pages
WHERE
  date >= '2024-10-01'
  AND 'WordPress' IN UNNEST(technologies.technology)
GROUP BY 1,2
ORDER BY 1,2;
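
A complementary check could count, per crawl and client, how many WordPress sites have no categories attached. This is only a sketch: it assumes the same schema as the first query (`technologies` is an array of structs with a `technology` string and a `categories` array) and that missing categories show up as an empty or NULL array. The column aliases are made up for illustration.

SELECT
  date,
  client,
  -- Sites where the WordPress entry has no categories (empty or NULL array)
  COUNT(DISTINCT IF(IFNULL(ARRAY_LENGTH(tech.categories), 0) = 0, root_page, NULL)) AS sites_without_categories,
  -- All sites where WordPress was detected
  COUNT(DISTINCT root_page) AS sites_total
FROM crawl.pages,
  UNNEST(technologies) AS tech
WHERE
  date >= '2024-11-01'
  AND rank <= 10000
  AND tech.technology = 'WordPress'
GROUP BY date, client
ORDER BY date, client;

If the assumption holds, sites_without_categories should be zero for the 2024-11-01 and 2025-01-01 crawls and close to sites_total for 2024-12-01.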

I'll fix this.

Why were categories NULLed when using _detected_technologies?

@max-ostapenko max-ostapenko changed the title Technologies are missing in 2024-12-01 crawl Technology categories are missing in 2024-12-01 crawl Jan 14, 2025