Friday, April 20, 2018

ispc_texcomp BC7 issues

Been studying ispc_texcomp today to better understand why it's so slow compared to my encoder (currently by a factor of 2x at max quality). We do many of the same things, so why is it slower? Overall, there are many clever/smart things in there (overall it's surprisingly well done!), but it's being held back by weak vectorization, some missing optimizations, and it only supports linear RGB metrics.

Here's what I've found so far:

- The inner loops are bogged down with gathers and scatters. Definitely not good. There's even a set of helper functions at the top with a comment of "(perf warning expected)". (Umm - the compiler perf warnings are there for a reason!) For an example, check out block_quant().

The innermost loops should not have gathers, period.

- The partition estimation code is using full PCA. I've found this to be unnecessary in my testing - just using Waveren's bounding box approximation works well and is trivially vectorizable. After all, we're not trying to compute the actual output, just a reasonable approximation.

So ispc_texcomp goes into overkill mode and computes PCA while estimating the partition. At least the way it computes each subset's PCA is smart: it first computes the overall block's statistics/covariance, then it subtracts out the statistics of each partition's active (masked) pixels to compute each subset's individual covar.

Also, it's only computing what looks like an upper bound on the error from the block statistics, not an approximation of the actual error. The approximation of the actual error (factoring in quantization to 4/8/16 selectors) is extremely fast to compute with SIMD code, so it's not clear to me what's better yet.

Overall, the author seems to be favoring cleverness vs. exploiting the properties of fast but simple SIMD code.

- It uses squish-style iterative refinement after choosing the partition: Basically, it computes the PCA, comes up with some initial selectors, uses least squares to optimize the endpoints, then it computes new selectors and tries all over again a few times. In my tests, the PSNR gain from this method is too low (fraction of a dB) to justify the repeated LS computation and selector selection costs. Just do it once and then optionally try other things. It's more effective to just vary the selectors in simple ways (simplified cluster fit) in each trial.

- There's no support for perceptual colorspace metrics in there. This indirectly impacts performance (against other codecs that do support perceptual metrics) because it's stuck competing against RGB PSNR, and getting RGB PSNR up in BC7 is VERY computational intensive. You basically hit a steep quality wall, then it takes massively more compute to get it up above that wall even by a fraction of a dB.

If it supported perceptual metrics (where error in R,G,B is allowed to become a little unbalanced by approx. .25 - 1.5 dB, favoring G and R) it wouldn't have to try as hard because it would gain ~1.5 dB or more instantly before hitting the wall.

- First, the good news: The selector quantizer (see block_quant()) is using a clever algorithm: It dots the desired color by the subset's axis, converts that to a scaled int by rounding, clamps that to between [1,num_selectors-1], then it computes full squared euclidean error between the desired color and the subset's interpolated colors (s-1) and s. It only has to compute the full distance to 2 colors vs. all of them, which is cool.

I've compared this method vs. full distance to all colors and the results are super close (~1/1000th of a dB) on many images (but not all - I've seen .1 dB RGB PSNR loss on some images).

Now the bad news: The implementation is just bad. First, it recomputes the subset axis for every pixel (even though there are only 2 or 3 of them in BC7). And it uses multiple gather's to fetch the endpoints! This is done for all 16 pixels in the block - ouch! There's also a per-pixel divide in there.

Also, with good SIMD computing full distance to all subset colors isn't that expensive, at least for 4 and maybe 8 color blocks. I've implemented optimized forms of full search vs. ispc_texcomp's method. At least with AVX, all the fetches into the weighted_colors[] array (one for each lane) just slow the method down. Brute force leads to simpler code once vectorized and seems to slightly win out overall for 4 and 8 color blocks. With 16 color blocks the smarter method wins.

- After iterative refinement it doesn't have any more ways of improving quality. Trying to vary the selectors in keys ways (say by incrementing the lowest values and decrementing the highest values - to exploit extrapolation) and then LS optimizing the results helps a lot (.3-.5 dB) and is very fast if you SIMD optimize the trial solution evaluator function, yet it doesn't do that.

- Its mode 0 encoder suffers from a lot of quantization error - which is indicative of some weaknesses in its endpoint selection:

ispc_texcomp mode 0 only:

My encoder mode 0 only (no dithering - just stronger endpoint selection):


- ispc_texcomp is weak with grayscale images, by around .6-1.2 dB in my testing. Granted, once you're over ~60dB it doesn't matter much.

The "slow" profile is solidly in the quality "wall" region I described earlier. The basic and faster profiles are in much healthier regions.

No comments:

Post a Comment