-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Isse with running Baysor segmentation on some but not all patches #123
Comments
Hello @marsdenl, thanks for reporting this weird issue. I have never seen this before. Could you share an example of a transcript.csv file that failed (together with the corresponding |
Hey @quentinblampey, thanks for getting back to me. Here is the info: The transcripts.csv above is the one that contains both the missing columns error and the float64 error. For missing columns see row 17612 for ex. For float64, I have no idea where it's coming from...I've only attached the first 50,000 rows. Thank you! |
In case this helps, @quentinblampey, I also tried Comseg and get a similar issue with poor formatting on some patches but not all patches (some rows have fewer or more columns). Code: sopa segmentation comseg region_0.zarr I get the error on my second (but not first) patch - ParserError: Error tokenizing data. C error: Expected 10 fields in line 112260, saw 14. Here is a portion of the transcripts.csv containing 2 misformatted rows. Config.json Let me know what you think! Thank you very much. |
Indeed I think the problem is happening when creating the Do you know if the issue always happen on the same files if you run it again? Is it random or deterministic? |
I repeated patchify on the same sample with the same config file then simply counted the number of rows that did not have 9 columns in the patch files (I couldn't test re-segmentation with baysor on the patches since the Baysor update, same as issue #125) . Despite getting the same number of patches everytime I patchify, the transcripts.csv for a given patch (say patch 0) has slightly different dimensions and therefore also has a different number of rows that do not have 9 columns. Not too sure how to test for the float64 issue as I don't know where it's coming from... |
Alright, thanks for the details. Can you also let me know if this happens on the toy dataset ( |
Hey, thanks for trying to reproduce the issue :) Just tried with the --technology uniform dataset but for baysor patchify it only creates one patch so I don't know if it'll reproduce the issue where some patches are okay and others aren't. Also since the baysor update I can't run baysor segmentation to check if it all runs smoothly on that single patch (see issue 125). Looking at the transcripts file for that patch though, it seems okay :) Hope this helps. Let me know if I can send anything else over. |
Can you try to make smaller patches using |
Of course, yes sorry xD Made 8 patches on the tutorial dataset and ran baysor (made the changes needed in the config file for v7.0) and it worked perfectly fine. I tried again on my dataset using the same config and again I get the same errors/warnings:
Maybe something to do with the formatting of the input detected_transcripts.csv? Thanks |
Thanks for trying! So, now, my hypothesis is either (1) the To test these:
|
Here is what I have done: Check initial detected_trascripts.csvimport pandas as pd data = pd.read_csv('Z:Queries/Data/Batch5_region0/region_0/detected_transcripts.csv', header=None, low_memory=False) incomplete_rows = data[data.apply(lambda x: len(x.dropna()) != 11, axis=1)] # => this gave 0 rows Check dask written csvimport spatialdata sdata = sopa.io.merscope("Z:Queries/Data/Batch5_region0/region_0/") sdata.points {'Batch5_region0_region_0_transcripts': Dask DataFrame Structure: points_df = sdata.points['Batch5_region0_region_0_transcripts'] check if any row has fewer than 9 columns -> this gave my 0 rowswritten_panda_df[written_panda_df.apply(lambda x: len(x.dropna()) != 9, axis=1)] check if any row has more than 9 columns -> this gave my 0 rowswritten_panda_df[written_panda_df.apply(lambda x: len(x) > 9, axis=1)] At first hand, it seems both files - the inital csv and the dask written csv - look okay in terms of number of columns (although let me know if code above is not correct). Could it be patchify itself? |
When checking for nan values, can you use Also, on the CSV file you gave me, I saw that the header is also weird, as it has three empty columns at the end, see below:
This is also not expected. Do you have something similar with the raw data? Since the header is weird, I suspect there is something wrong before even making the patches. |
Sure thing!
data.isna().any(axis=1).mean() data.isna().any(axis=1).value_counts() The only true count being the column names.
written_panda_df.isna().any(axis=1).mean() written_panda_df.isna().any(axis=1).value_counts() Regarding the csv I sent you, that's not in the original file, I must have added that... Happy to send you the original file by email. |
Do you mean that the only row with NaN values are the columns? The command
Yes, please, can you send it to me (just the file of the patch that has an issue) at [email protected]? Btw, I'll be on vacations for two weeks, I'll answer you in mid-October :) |
Hi @marsdenl, do you also have access to a Linux or MacOS machine by any chance? If yes, do you have the same issue? I think it may be a Windows-specific issue, which would explain why I wasn't able to reproduce it. |
Hey @quentinblampey, you're absolutely right. Tried sopa patchify baysor in Mac OS and ran baysor on each patch on my windows machine and it ran perfectly fine. That's good to know! Thank you very much for your help. I'll go ahead and close the issue as I'm likely going to move forward with Proseg. |
Hi @marsdenl, thanks for trying out! It's good to know that it works on MacOS for you. I'll re-open the issue though, because other Windows users may want to have this fixed (even though I'm afraid it will be complex to fix, as it may be related to parallel writing on-disk with dask). I'm interested in having your feedback about Proseg: how good do you think it is? And how scalable is it? FYI: |
Sure thing :) Best of luck with the windows issue. I am a big fan of proseg - it looks really good. I definitely think you should add it to your segmentation portfolio. I've tried Comseg, Baysor, Cellpose, Proseg and so far Proseg is the most performing method for my tissue type (cortex). It's quite fast - could run 9 million transcripts in ~ 45mn on 64 GB RAM and incredibly easy to use (for MERSCOPE, all you need is a detected_transcripts.csv with a cell_id column giving nuclear prior info. That being said it is highly dependent on good nuclear segmentation (see Proseg issue #34). According to DC Jones: "Currently proseg is somewhat anchored to this (nuclear) input. It can't merge or split cells, it only tries to infer better borders (from existing nuclear borders), so if it's initialized with too many cells it's not able to correct that and has to make due." So bad nuclear segmentation will lead to poor Proseg performance. So I would definitely recommend training a cellpose 2 model on your dataset to get a very good nuclear segmentation, then finishing it off with Proseg. My other reservation for Proseg is the decimal value cell_gene matrix since the assignment of a transcript to a cell is probabilistic, not deterministic. DC Jones mentionned you could round it to integer value or alternatively use the maxpost-counts.csv matrix instead which is an integer count matrix which filters out ambiguous transcripts. Hope this helps :) Thanks again for all your help Quentin |
Thanks @marsdenl for your very detailed comment! It's not the first time I hear good things about proseg. Indeed, if it's highly dependent on a good prior, it could be a good idea to run it after cellpose. Regarding the decimal value for the cell-by-gene table, this is also a concern to me. This is actually one of the current blockers for adding |
Hey Quentin,
I've been trying to use the CLI workflow to segment a MERSCOPE dataset. After successfully creating patches I see that the formatting for the transcript.csv file was quite variable across patches. In some patches it was completely fine and segmentation ran well, in others some rows were shifted/missing values and I got warning messages so I removed those rows for simplicity (after which segmentation worked). Finally, some patches failed even after removing problematic rows due to:
[16:38:11] Info: Run R493595882
[16:38:12] Info: (2024-09-12) Run Baysor v0.6.2
[16:38:12] Info: Loading data...
ERROR: MethodError: Cannot
convert
an object of type InlineStrings.String31 to an object of type Float64.Closest candidates are:
convert(::Type{T}, ::ColorTypes.Gray) where T<:Real
@ ColorTypes C:\Users\marsdenl.julia\packages\ColorTypes\vpFgh\src\conversions.jl:113
convert(::Type{T}, ::ColorTypes.Gray24) where T<:Real
@ ColorTypes C:\Users\marsdenl.julia\packages\ColorTypes\vpFgh\src\conversions.jl:114
convert(::Type{T}, ::Ratios.SimpleRatio{S}) where {T<:AbstractFloat, S}
@ Ratios C:\Users\marsdenl.julia\packages\Ratios\FsiCW\src\Ratios.jl:51
...
Stacktrace:
[1] setindex!(A::Vector{Float64}, x::InlineStrings.String31, i1::Int64)
@ Base .\array.jl:1021
[2] _unsafe_copyto!(dest::Vector{Float64}, doffs::Int64, src::Vector{Union{Missing, InlineStrings.String31}}, soffs::Int64, n::Int64)
@ Base .\array.jl:299
[3] unsafe_copyto!
@ .\array.jl:353 [inlined]
[4] _copyto_impl!
@ .\array.jl:376 [inlined]
[5] copyto!
@ .\array.jl:363 [inlined]
[6] copyto!
@ .\array.jl:385 [inlined]
[7] copyto_axcheck!
@ .\abstractarray.jl:1177 [inlined]
[8] Vector{Float64}(x::Vector{Union{Missing, InlineStrings.String31}})
@ Base .\array.jl:673
[9] convert(::Type{Vector{Float64}}, a::Vector{Union{Missing, InlineStrings.String31}})
@ Base .\array.jl:665
[10] load_df(data_path::String; min_molecules_per_gene::Int64, exclude_genes::Vector{String}, kwargs::@kwargs{x_col::Symbol, y_col::Symbol, z_col::Symbol, gene_col::Symbol, drop_z::Bool, filter_cols::Bool})
@ Baysor.DataLoading C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\data_loading\data.jl:84
[11] load_df
@ C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\data_loading\data.jl:69 [inlined]
[12] load_df(coordinates::String, data_opts::Baysor.Utils.DataOptions; kwargs::@kwargs{filter_cols::Bool})
@ Baysor.DataLoading C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\data_loading\cli_wrappers.jl:165
[13] run(coordinates::String, prior_segmentation::String; config::Baysor.Utils.RunOptions, x_column::String, y_column::String, z_column::String, gene_column::String, min_molecules_per_cell::Int64, scale::Float64, scale_std::String, n_clusters::Int64, prior_segmentation_confidence::Float64, output::String, plot::Bool, save_polygons::String, no_ncv_estimation::Bool, count_matrix_format::String)
@ Baysor.CommandLine C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\cli\main.jl:100
[14] run
@ C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\cli\main.jl:51 [inlined]
[15] command_main(ARGS::Vector{String})
@ Baysor.CommandLine C:\Users\marsdenl.julia\packages\Comonicon\F3QqZ\src\codegen\julia.jl:343
[16] command_main()
@ Baysor.CommandLine C:\Users\marsdenl.julia\packages\Comonicon\F3QqZ\src\codegen\julia.jl:90
[17] command_main(; kwargs::@kwargs{})
@ Baysor C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\Baysor.jl:41
[18] top-level scope
@ none:1
Quick visual inspection shows the files look fine. Weirdly enough, the patches that failed due to that error came at repeating intervals. For instance, if I generated 11 patches total, then patch 1, 4, 7, 10 failed. If I generated 15 patches, 2, 6, 10 and 14 failed (1/3 patches). Do you have any idea why some of the patches failed and not others?
Code run:
cd PATH.zarr.sopa_cache\baysor_boundaries\6
C:\Users\marsdenl.julia\bin\baysor.cmd run --save-polygons GeoJSON -c config.toml transcripts.csv
Thanks!
The text was updated successfully, but these errors were encountered: