But for now, it's still a proof of concept, just to get the logic implemented on Metal. I will definitely look at GPU performance once I have this running on iOS.
Apple Silicon, on iOS or macOS, is all I care about.
Using private buffers imposed another 50% speed reduction. It should be faster than this, shouldn't it? I think I will go ahead and file a DTS ticket. Have you looked at the performance information from Capture GPU Frame? No; my knowledge of SIMD computing dates from before GPUs were popular. Unfortunately, I don't think you get the fine-grained stats on Intel/discrete GPUs that are available on A10X and newer SoCs. This code might never actually be deployed on Intel. If you expect the code to live into the Apple Silicon era, exploring performance on a modern iPad might not be a waste of time.
These two test cases are the absolute easiest and simplest ones there are. It is doing a Mercator map projection, so this is all there is for the Metal code:

```
kernel void project_mercator_s(
    device const float2 * xy_in [[ … ]],
    device const spherical_params * params [[ … ]],
    … )
{
    …
    xy_out.x = params->x + params->scale * x;
    xy_out.y = params->y + params->scale * y;
}
```

and the equivalent OpenCL kernel:

```
__kernel void pl_project_mercator_s( … )
```

I have removed all the Metal setup code from the time comparison. My test data set has 200,000 pairs of floats. I tried using a private buffer, but my dataset is so small that the time spent copying into the private buffer was longer than just using a shared memory buffer. Is this normal? Is there something I'm missing?

I worked on the code and got it to the point where OpenCL was only about 3 times as fast as Metal. I was doing these tests on my 2014 MacBook Pro, usually with integrated graphics, running 10.16. When I tried it on my 2017 MacBook Pro, with both discrete and integrated graphics, the difference in performance was much smaller. I was also using an older version of the underlying PROJ library for comparison, and most of the overall GPU speedup I was seeing was due to the newer, slower version of PROJ. Metal still runs at about half the speed of OpenCL, but I guess I can live with this.

I have learned a few things about Metal optimization. I tried 3 different optimization schemes:

1) Use buffers big enough for all threads so I don't have to worry about reduced efficiency or boundary checks.
2) Use loops in the Metal code to avoid per-pixel threads that are supposedly less efficient.

None of these optimization strategies yielded any improvement.
It is still several times better than the CPU version, but why would it be so slow compared to OpenCL?
I am working on porting some code to Metal. I have a popular open source library that runs entirely on the CPU, and another package that performs the same functionality in OpenCL. I am trying to port the OpenCL package to Metal so I can get GPU performance on all devices. The OpenCL package has a nice test set that compares its own output against the reference project. On the two simplest test cases, OpenCL runs about 14 and 24 times as fast as on the CPU. My Metal version is consistently 4 times slower than OpenCL.