-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle numeric values with commas in rating processor #104
base: main
Are you sure you want to change the base?
Handle numeric values with commas in rating processor #104
Conversation
return value | ||
|
||
|
||
def _to_float(value: Any) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should call it something different, like to_rating_float
, to make it somewhat clear that it should not be used for values that could be greater than 999.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about moving these as local functions within rating_processor
? Then it would be clear their usage is restricted to this function.
def rating_processor(value: Any, page: Any) -> Any:
def _to_int(value: Any) -> Any:
...
def _to_float(value: Any) -> Any:
...
...
That was my original idea but I've thought that perhaps we could re-use these in some other place so I left them at the top level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think renaming is better as we can still reuse the renamed ones if we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wRAR what naming would you propose...? Because to_rating_float
sounds like something that should be used only for, well, rating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that these functions could be useful elsewhere. I suggested to_rating_float
, but something more generic, like to_small_float
, would also work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't guess from name alone what a function to_small_float
does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m open to other names.
But we don’t really need for the name to make it obvious what the function can do, I think the most important thing is for the name to make people think twice before they use the function for a value that could surpass 999. The details can be explained in a docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, we also have https://github.com/zytedata/zyte-parsers/blob/8b161cf69df8caeb145e5244601ff93e41b287e6/zyte_parsers/aggregate_rating.py#L101 which does a similar thing for floats. I like how ints are handled in this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function in zyte-parsers also swallows the exceptions, and here it's not; it seems not silencing them is better for the case processor is processing the value returned by human-written code (though I'm not 100% sure).
if "," in value: | ||
value = value.replace(",", ".") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: why is if
necesary? Is it an optimization, and if so - does it actually help (it can go both ways - it's 2 passes over string instead of one)?
Ratting processor currently accept dictionaries with standard keys
ratingValue
,bestRating
andreviewCount
. It attempts to convert values usingfloat
andint
calls, so that string values are handled:If the value of rating uses comma instead of a dot to separate decimal places, an exception is raised:
Similarly, number of reviews can have comma in string value (seen in Amazon pages), e.g.
12,070 ratings
. Strings such as these also raise exceptions.Because standard schema defines that
ratingValue
andbestRating
useNumber
type andrevewCount
usesInteger
type, decided to treat strings differently depending on used type:,
with.
,