-
-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FFTW3 FFT via FFI #23
base: master
Are you sure you want to change the base?
Conversation
This reverts commit 944d2c5.
…l counterparts based on the FFTW3 library via FFI.
The only single thing that I don't understand here is the problem that after initialization of gsl-shell the very first num.fft transform gives zero output and transforms the input to zero as well. All following calls work correctly. Any ideas on that? |
Hi Benjamin, I was trying your branch and I stumbled on the same problem. It seems that I found the solution by reading the FFTW user's manual here: http://www.fftw.org/fftw3_doc/Complex-One_002dDimensional-DFTs.html#Complex-One_002dDimensional-DFTs.
and before they write:
So in practice when you create a plan with FFTW_MEASURE the library does mess with your input even before you run the plan. I've seens the problem disappear if you use FFTW_ESTIMATE instead. Probably a better solution is to keep a flag to know weather a given plan has been already measured or not. If it is not yet measured you should create a copy of the input and copy back the data after the creation of the plan. I'm also wondering if they don't have lower level interface to make the MEASURE stuff explicit. Francesco |
Excellent points. |
Well, of course we need to change something because right now it doesn't work :-) I hesitate to use ESTIMATE because if you doesn't really gain in performance what the point of using FFTW? Right now I was reading the manual to figure out if there is a smarter approach to use MEASURE and yet do not ask the user to plan ahead of time. |
Hehe, sure, just didn't want to work on in if you are reworking the code at the same time. So my understanding is, for normal fft calls we use ESTIMATE (because we don't want the measuring overhead for just this call) but for "create plan", "execute plan" we use MEASURE. If we make that point clear in the documentation I guess it is not too much to ask from the user to choose one or the other if the performance increase is so vast. |
I was wondering about using the New Array Interface http://www.fftw.org/fftw3_doc/New_002darray-Execute-Functions.html#New_002darray-Execute-Functions which may be adapted to our case. We need also to look the the stride of the input matrix (can be > 1 even for column matrix). The stride is "m.tda" in the general case. I was also wondering about using fftw_malloc for matrix data allocation to guarantee data alignment or otherwise give the flag FFTW_UNALIGNED. My plan was: use fftw_malloc to allocate matrix data block, always. Use the new-array-interface by caching plans for the most common used array size & strides. May be when stride is greater that one we can give up and use ESTIMATE to optimize only the common case with stride = 1. It is different from your approach in that we still does not ask the user to plan. Otherwise your approach seems reasonable but you are giving up very soon by not looking at all to the more advanced interfaces. |
In the current implementation Probably you are right and we should enforce allocating the matrix data via About the caching, do you mean precaching or caching when the user is requesting the FFT of a certain size? Otherwise caching would be a nice idea to "hide" the plans all along from the user. Then we could always use MEASURE and be sure, that the fastest possible scheme is used. I am not sure what the allocation size of a normal plan is and if we would have memory issue if we allocated the first 2^n, n from 4 to 8 or 16 plans. |
I was thinking to use the new array interface also for ordinary "one shot" fft and rfft calls.
Yes.
Yes, this is my nice idea :-) I mean just caching, not precaching. The old FFT implementation was already doing that to avoid allocating a new workspace each time.
We don't to guess. We can just keep in memory the plan for the previous call and if does match just reuse the plan. otherwise we can keep something like a fixed number of plans in memory and overwrite the least recently used when a new is requested and the slots are all already used. This latter schema is significantly more complex. I advice for just caching the last used plan for each kind of transforms, I.e. r2c_1d, c2c_1d, r2c_2d etc. This schema is simple to implement and cover the most common case when a user transform many times arrays of the same size. |
I can modify the code with regard to caching and |
…rdft, rdftinv which are called by the fft, fft2 and fftn functions. Added test file and removed old fft-tests which was the same as the demo file.
Hey Francesco, I have rewritten the whole fftw interface and reduced it down to three fundamental functions The To address the problem that input was nullified during "planning", I allocate extra input memory for this function but I use the original input when I execute the plans via
There is some if/else overhead in the The caching makes the whole thing very fast, just with a very basic comparison to the python FFT implementation I get 3x speed increase. I am happy with it and I look forward to your comments. |
Remark to the fact that the output location can be given as a paramter: This significantly increases performance: time gsl-shell -e "test=matrix.cnew(256,1,|i,j|i-1); for i = 1,1e6 do bla=num.fft(test) end"
gsl-shell -e 2.68s user 0.74s system 99% cpu 3.426 total vs. time gsl-shell -e "test=matrix.cnew(256,1,|i,j|i-1); test2=num.fft(test); for i = 1,1e6 do num.fft(test, test2) end"
gsl-shell -e 0.88s user 0.01s system 99% cpu 0.886 total |
Hi Benjamin, that's looks great but unfortunately it seems there is a problem. On my computer if I do demo'fft1' from the interactive shell I get a SEGFAULT, not always but often. The same happens for fft2 and fft3. I've seen that you didn't fix the FFTW_UNALIGNED vs fftw_malloc but it seems that this is not the cause of the SEGFAULT. I guess there is a subtle error in the allocation/free of some vector. Francesco |
Hey, I will look into it. I did not fix the fftw_malloc yet because we wanted to I will come back to you. On Thu, Feb 13, 2014 at 9:49 PM, Francesco [email protected] wrote:
|
Hi Benjamin, I'm planning to merge the FFTW branch soon but some extra work is needed before it can be merged in the master branch. The most serious thing to be done right now is that we ignore the "tda" field in the column matrices. This is ok in many cases because "tda" is one for matrices created as column matrices but is the most general case it can be > 1. For example this happens if your column matrix is a column view of a matrix with multiple columns. This is actually the only thing that prevent me from merging with master. Otherwise I would like to get rid of UNALIGNED and use fftw_malloc for all matrix data allocations but this can be done later. Then, more on the cosmetic side, we can remove some cruft from the FFTW ffi c definitions. Define fftw_complex as gsl_complex (you can identiify them) and other minor things like that. I hope I will find the time soon to implement these changes and merge the branch. I really like the work that you have done and the FFTW is really interesting in terms of speed. Francesco |
… data (tda of gsl blocks)/
Hey Francesco, I am happy you like it. So I looked into FFTW Advanced Interface which allows to give the stride as well. Now I am giving It works on my machine but I get a segfault when exiting the shell after running the following code: num.fft(matrix.vec{1,2,3,4+0i})
c=matrix.cnew(4,4,|i|1)
num.fft(c:slice(1,1,4,1)) Maybe you have an idea how to fix that. Anyway, with the advanced interface the tda problems should be solved. |
So it is strange: c=matrix.cnew(4,4,|i|1)
num.fft(c:slice(1,1,4,1), c:slice(1,1,4,1)) gives no error when exiting gsl-shell and shows cool inplace fft transformation while c=matrix.cnew(4,4,|i|1)
num.fft(c:slice(1,1,4,1)) gives the segfault. Is the GC trying to free some memory that already has been freed? It seems to be a problem with the output. |
Hi Benjamin, very good work, thank your for your persistence :-) It seems I found the error. Here a fix for the test case: diff --git a/fft.lua b/fft.lua
index b78d36c..0c756ef 100644
--- a/fft.lua
+++ b/fft.lua
@@ -47,7 +47,7 @@ local function dft(invec, dimlist, outvec, direction)
--Or create a new plan
else
--allcoate input for plan making but planning can invalidate input, therefore we use extra array
- local inputtest = ffi.new("fftw_complex[?]", size)
+ local inputtest = ffi.new("fftw_complex[?]", size * invec.tda)
local howmany = 1
local idist,odist = 0,0 It is because the inputtest vector need to have the same memory layout of "invec" itself. If stride is > 1 you have to allocate a bigger one. Please note that i didn't fix the problem in other similar functions. |
… few memory for FFTW input plans
Great, that solved it. |
While changing the documentation and the examples accordingly I came across the problem, that the input is being altered by the |
Hi Benjamin, sorry for the late reply. I've studied the problem and I didn't find anything wrong. I was even able to reproduce the problem with a simple C program: #include <fftw3.h>
int
main()
{
double x1[256], x2[256];
fftw_complex z[129], ztest[129];
double intest[256];
int n = 256, i;
for (i = 0; i < n; i++) {
x1[i] = (i < 86 ? 0 : (i < 172 ? 1 : 0));
}
fftw_plan plan_r2c = fftw_plan_many_dft_r2c(1, &n, 1,
intest, NULL, 1, 0,
z, NULL, 1, 0,
FFTW_MEASURE | FFTW_UNALIGNED);
fftw_execute_dft_r2c(plan_r2c, x1, z);
for (i = 0; i < 129; i++) {
printf("%g + %gi\n", z[i][0], z[i][1]);
}
printf("\n\n\n");
fftw_plan plan_c2r = fftw_plan_many_dft_c2r(1, &n, 1,
ztest, NULL, 1, 0,
x2, NULL, 1, 0,
FFTW_MEASURE | FFTW_UNALIGNED);
fftw_execute_dft_c2r(plan_c2r, z, x2);
for (i = 0; i < 129; i++) {
printf("%g + %gi\n", z[i][0], z[i][1]);
}
} May be you can ask to the FFTW mailing list, probably they should be able to help us. |
Okay, I got the same altered results on my machine which is very curious. I will indeed as the FFTW mailing list. Stay tuned. |
Okay, so I got an answer, that was quick. "Second, the inverse transform (complex to real) has the side-effect of overwriting its input array, by default. Neither of these inconveniences should pose a serious problem for users, but it is important to be aware of them." and also "As noted above, the c2r transform destroys its input array even for out-of-place transforms. This can be prevented, if necessary, by including FFTW_PRESERVE_INPUT in the flags, with unfortunately some sacrifice in performance. This flag is also not currently supported for multi-dimensional real DFTs (next section)." So either we mention it in the documentation OR we copy the input every time OR we implement the FFTW_PRESERVE_INPUT for 1D and mention the altered input for 2D+ functions. It's speed vs. usability again ;) |
Wow! They didn't even mention that in the documentation, because I've read quite carefully the user manual. Anyway, I propose to use the FLAG PRESERVE_INPUT for 1d and always make a copy of the input for the 2d+ case. This can be bad surprise for the users that can make them angry. Look to our case: only the mailing list was able to clarify the issue and the user manual was incomplete. this is the worst you can do for the users! |
…ctions: fft, fftinv, rfft, rfftinv, fft2, fft2inv, rfft2, rfft2inv, fftn, fftninv, rfftn, rfftninv.
…use of strange side effect of all c2r fftw function calls (also documented, see function)
Hey Francesco, sorry for the delay, holidays, work travel, presentations and such delayed my additional modifications to the documentations a bit. I have now tidied up every remaining bit of the original FFT implementation and substituted with the new FFTW wrap. The comments should be self explanatory. Personally I have the FFTN funning at my project and it performs very well. The demos are working for me and the documentation compiles with the proper formatting/LATEX formatting. For me this was fun. |
Hi Benjamin, that looks great, thank you for taking the time for finalizing the work. I will just need the time to review the work but I should be able to merge the fftw branch. Just one question, do you have some predefined tests for the FFT routines ? |
No problem. However, if you want some basic test functions for the test section in order to easily test if the fft module works at all I can supply those right away. BTW: Installed gsl-shell on our cluster and first had the problem that FFTW3 (libfftw3.so) was not installed as a shared library. I have put a note into the INSTALL file concerning this and I hope this will be sufficient. If not installed, a "module not found" error will point toward the reason anyway. Cheers, |
Hi Benjamin, I was willing to make some checks and then merge the fftw branch. I've got only one problem, on Windows the fftw crash quite often. It is actually enough to call demo("fft2") two or three times to have a crash. For the monent I don't have any idea bout that problem. On Linux everything seems fine. Do you have any idea ? Some basic tests that can help to point out the problem ? |
I would like to replicate the bug in my Windows-VM but before I do that I would like to cross check with you, how you compile gsl-shell on windows? Are you using cygwin, gcc, manually download AGG & FOX library, install blas, fftw, etc.. via cygwin interface and then change makepackages accordingly? |
I'm using on windows (32bit) the mingw toolchain. I compile myself everything but for the "fftw" library I took the binary package for windows from the fftw website. Yet I think that it is simpler that I do the tests myself since I have everything ready on my windows box. I don't want to give you a lot of extra work. May be you can suggest me some basic tests I can perform or otherwise I can figure out, I can even begin testing everything methodically! May be I can also try to compile myself the fftw library but I suspect that the problem does not came from the precompiled library. |
Hey Francesco,
here is a branch ("fftw") with the fftw functions replaced by wrapper calls for the FFTW library. Why did I do that? For one thing I wanted 2D, 3D and nD FFT which are far better supported by FFTW. Also I know you like to have the best speed there is and FFTW is faster than the GSL implementation (in case of many FFTs).
I have done some benchmarks that you maybe also want to try.
Original version (4.5 sec):
New version (0.9 sec):
I just want to know your opinion about it and if I should proceed with this (documentation, tests, maybe 2D demos).