initial commit

2025-12-15 16:07:46 +01:00 · 2018-04-16 04:23:07 +02:00
parent 31f7827d7d
commit 784ca88cc3
45 changed files with 12034 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -1,2 +1,182 @@
-# ebsynth
-Fast Example-based Image Synthesis and Style Transfer
+# Ebsynth: A Fast Example-based Image Synthesizer
+
+`ebsynth` is a versatile tool for by-example synthesis of images.
+It can be used for a variety of image synthesis tasks, including guided
+texture synthesis, artistic style transfer, content-aware inpainting
+and super-resolution.
+
+The focus of `ebsynth` is on preserving the fidelity of the source material.
+Unlike other recent approaches, `ebsynth` doesn't rely on neural networks.
+Instead, it uses a state-of-the-art implementation of non-parametric
+texture synthesis algorithms. Thanks to its patch-based nature, `ebsynth`
+produces crisp results, which preserve all the fine detail present in the
+original image.
+
+## Basic usage
+
+```
+ebsynth -style <style.png> -guide <source.png> <target.png> -output <output.png>
+```
+
+## Options
+```
+-style <style.png>
+-guide <source.png> <target.png>
+-weight <value>
+-uniformity <value>
+-patchsize <value>
+-pyramidlevels <number>
+-searchvoteiters <number>
+-patchmatchiters <number>
+```
+
+## Download
+
+A pre-built Windows binary can be downloaded from [http://jamriska.cz/ebsynth/ebsynth-win64.zip](http://jamriska.cz/ebsynth/ebsynth-win64.zip).
+
+# Examples
+
+## Texture-by-numbers
+
+The first example shows how to perform a basic guided texture synthesis with `ebsynth`.
+This use case was first proposed in the original Image Analogies paper [1], where it was
+called 'texture-by-numbers'. We start with a photograph of a natural scene together
+with its segmentation (e.g., rock is painted green and sky blue):
+
+<p align='center'>
+  <img src='doc/images/texbynum.jpg'/>
+</p>
+
+```
+ebsynth -style source_photo.png -guide source_segment.png target_segment.png -output output.png
+```
+
+Next, we paint a target segmentation by hand, and we ask `ebsynth` to produce
+a new 'photograph' that would match it. In the language of style transfer: we want
+to transfer the *style* of the source photograph onto the target segmentation in
+a way that would respect the individual segments. The segmentation acts as a *guide*
+for the synthesis.
+
+## StyLit: Illumination-Guided Stylization of 3D Renderings
+<p align='center'>
+  <img src='doc/images/stylit-teaser.jpg'/>
+</p>
+
+This example shows how to achieve a non-photorealistic rendering with `ebsynth`.
+It is based on the work of Fišer et al. [7]. The goal is to render a 3D model the way
+an artist would. Specifically, we want to capture how an artist conveys
+the different illumination effects, like highlights, contact shadows, and indirect
+bounces. To that end, we set up a simple reference scene with an illuminated ball,
+and let the artist draw it in their own style. We use an off-the-shelf path tracer
+to produce the separate render passes, e.g., full global illumination, just the
+direct diffuse component, just the indirect bounce, etc. Next, we render the same
+set of passes for the target 3D model and use them as guides for `ebsynth`.
+
+<p align='center'>
+  <img src='doc/images/stylit.jpg'/>
+</p>
+
+```
+ebsynth -style source_style.png
+        -guide source_fullgi.png target_fullgi.png -weight 0.66
+        -guide source_dirdif.png target_dirdif.png -weight 0.66
+        -guide source_indirb.png target_indirb.png -weight 0.66
+        -output output.png
+```
+
+Compared to texture-by-numbers, the main difference here is that we now have *multiple*
+guiding channels. Note that the guides always come in pairs: source guide first, target
+guide second. For better results, we might want to boost the contribution of guides
+relative to the style. In the example above, the style has a default weight of 1.0,
+while the guide channels have a weight of 0.66 each. The total guide weight is therefore
+about 2.0 (3 × 0.66 = 1.98), resulting in a roughly 2:1 guide-to-style ratio.
+
+## FaceStyle: Example-based Stylization of Face Portraits
+
+<p align='center'>
+  <img src='doc/images/facestyle-teaser.jpg'/>
+</p>
+
+This example demonstrates how one can use `ebsynth` to transfer the style of
+a portrait painting onto another person's photograph. It is based on the work
+of Fišer et al. [8]. The goal is to reproduce the fine nuances of the source
+painting, while preserving the identity of the target person. That is, we want
+the person to still be recognizable after the synthesis.
+
+Unlike with StyLit, in this setting we don't have the reference 3D geometry
+to use as a guide. However, we can exploit the fact that both the source painting
+and the target photo contain a human face, which has a well-defined structure.
+We will use this structure to infer the necessary guiding information.
+
+<p align='center'>
+  <img src='doc/images/facestyle.jpg'/>
+</p>
+
+```
+ebsynth -style source_painting.png
+        -guide source_Gapp.png target_Gapp.png -weight 2.0
+        -guide source_Gseg.png target_Gseg.png -weight 1.5
+        -guide source_Gpos.png target_Gpos.png -weight 1.5
+        -output output.png
+```
+
+Specifically, we detect the facial landmarks in both the target and source images,
+and use them to produce a soft segmentation guide `Gseg`, and a positional guide
+`Gpos`, which is essentially a dense warp field that maps every target pixel to its
+corresponding position in the source. To preserve the person's identity, we use the
+appearance guide `Gapp`, which is a grayscale version of the target photo that was
+equalized to match the luminance of the source painting.
+
+--------------------------------------------------------------------------
+
+## Requirements
+
+`ebsynth` needs a CUDA-capable GPU in order to run. Besides CUDA, there are no other external dependencies. A CPU-only version that doesn't require CUDA will be released later.
+
+## License
+
+The code is released into the public domain. You can do anything you want with it.
+
+However, you should be aware that the code implements the PatchMatch algorithm, which is patented by Adobe (U.S. Patent 8,861,869). Other techniques might be patented as well. It is your responsibility to make sure you're not infringing any patent holders' rights by using this code. 
+
+## Citation
+
+If you find this code useful for your research, please cite:
+
+```
+@misc{Jamriska2018,
+  author = {Jamriska, Ondrej},
+  title = {Ebsynth: Fast Example-based Image Synthesis and Style Transfer},
+  year = {2018},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/jamriska/ebsynth}},
+}
+```
+
+## References
+
+1. Image Analogies  
+   Aaron Hertzmann, Chuck Jacobs, Nuria Oliver, Brian Curless, David H. Salesin  
+   In SIGGRAPH 2001 Conference Proceedings, 327–340.  
+2. Texture optimization for example-based synthesis  
+   Vivek Kwatra, Irfan A. Essa, Aaron F. Bobick, Nipun Kwatra  
+   ACM Transactions on Graphics 24, 3 (2005), 795–802.  
+3. Space-Time Completion of Video  
+   Yonatan Wexler, Eli Shechtman, Michal Irani  
+   IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 3 (2007), 463–476.  
+4. PatchMatch: A randomized correspondence algorithm for structural image editing  
+   Connelly Barnes, Eli Shechtman, Adam Finkelstein, Dan B. Goldman  
+   ACM Transactions on Graphics 28, 3 (2009), 24.  
+5. Self Tuning Texture Optimization  
+   Alexandre Kaspar, Boris Neubert, Dani Lischinski, Mark Pauly, Johannes Kopf  
+   Computer Graphics Forum 34, 2 (2015), 349–360.  
+6. LazyFluids: Appearance Transfer for Fluid Animations  
+   Ondřej Jamriška, Jakub Fišer, Paul Asente, Jingwan Lu, Eli Shechtman, Daniel Sýkora  
+   ACM Transactions on Graphics 34, 4 (2015), 92.  
+7. StyLit: Illumination-Guided Example-Based Stylization of 3D Renderings  
+   Jakub Fišer, Ondřej Jamriška, Michal Lukáč, Eli Shechtman, Paul Asente, Jingwan Lu, Daniel Sýkora  
+   ACM Transactions on Graphics 35, 4 (2016), 92.  
+8. Example-Based Synthesis of Stylized Facial Animations  
+   Jakub Fišer, Ondřej Jamriška, David Simons, Eli Shechtman, Jingwan Lu, Paul Asente, Michal Lukáč, Daniel Sýkora  
+   ACM Transactions on Graphics 36, 4 (2017), 155.  
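
The guide-to-style weighting described in the README's StyLit example can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of `ebsynth`: the helper names `totalGuideWeight` and `guideToStyleRatio` are hypothetical, and it assumes the overall guide influence is simply the sum of the per-guide `-weight` values, as the README's own calculation implies.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical helper: sum of all per-guide weights passed via -weight.
float totalGuideWeight(const std::vector<float>& guideWeights) {
  return std::accumulate(guideWeights.begin(), guideWeights.end(), 0.0f);
}

// Hypothetical helper: total guide weight relative to the style weight
// (the README states that the style weight defaults to 1.0).
float guideToStyleRatio(const std::vector<float>& guideWeights, float styleWeight) {
  return totalGuideWeight(guideWeights) / styleWeight;
}
```

With the README's three guides at 0.66 each, the ratio comes out at about 1.98, i.e. roughly 2:1. The four guides at 0.5 each used in `examples/stylit/run.sh` give exactly 2.0.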
--- a/bin/.gitkeep
+++ b/bin/.gitkeep
--- a/build-linux.sh
+++ b/build-linux.sh
@@ -0,0 +1,2 @@
+#!/bin/sh
+nvcc -arch compute_30 src/ebsynth.cu -o bin/ebsynth -I "include" -std=c++11 -Xcompiler "-DNDEBUG -O6 -D__CORRECT_ISO_CPP11_MATH_H_PROTO"
--- a/build-win32.bat
+++ b/build-win32.bat
@@ -0,0 +1,12 @@
+@echo off
+setlocal ENABLEDELAYEDEXPANSION
+
+for %%V in (15,14,12,11) do if exist "!VS%%V0COMNTOOLS!" call "!VS%%V0COMNTOOLS!..\..\VC\vcvarsall.bat" x86 && goto compile
+
+:compile
+nvcc -arch compute_30 src\ebsynth.cu -m32 -O6 -w -I "include" -o "bin\ebsynth.exe" -Xcompiler "/DNDEBUG /Ox /Oy /Gy /Oi /fp:fast" -Xlinker "/IMPLIB:\"lib\ebsynth.lib\"" || goto error
+goto :EOF
+
+:error
+echo FAILED
+@%COMSPEC% /C exit 1 >nul
--- a/build-win64.bat
+++ b/build-win64.bat
@@ -0,0 +1,12 @@
+@echo off
+setlocal ENABLEDELAYEDEXPANSION
+
+for %%V in (15,14,12,11) do if exist "!VS%%V0COMNTOOLS!" call "!VS%%V0COMNTOOLS!..\..\VC\vcvarsall.bat" amd64 && goto compile
+
+:compile
+nvcc -arch compute_30 src\ebsynth.cu -m64 -O6 -w -I "include" -o "bin\ebsynth.exe" -Xcompiler "/DNDEBUG /Ox /Oy /Gy /Oi /fp:fast" -Xlinker "/IMPLIB:\"lib\ebsynth.lib\"" || goto error
+goto :EOF
+
+:error
+echo FAILED
+@%COMSPEC% /C exit 1 >nul
--- a/doc/images/facestyle-teaser.jpg
+++ b/doc/images/facestyle-teaser.jpg
--- a/doc/images/facestyle.jpg
+++ b/doc/images/facestyle.jpg
--- a/doc/images/stylit-teaser.jpg
+++ b/doc/images/stylit-teaser.jpg
--- a/doc/images/stylit.jpg
+++ b/doc/images/stylit.jpg
--- a/doc/images/texbynum.jpg
+++ b/doc/images/texbynum.jpg
--- a/examples/facestyle/run.bat
+++ b/examples/facestyle/run.bat
@@ -0,0 +1,9 @@
+@echo off
+setlocal
+set PATH=..\..\bin;%PATH%
+
+ebsynth.exe -style source_painting.png ^
+            -guide source_Gapp.png target_Gapp.png -weight 0.5 ^
+            -guide source_Gseg.png target_Gseg.png -weight 1.5 ^
+            -guide source_Gpos.png target_Gpos.png -weight 10 ^
+            -output output.png
--- a/examples/facestyle/run.sh
+++ b/examples/facestyle/run.sh
@@ -0,0 +1,8 @@
+#!/bin/sh
+export PATH=../../bin:$PATH
+
+ebsynth -style source_painting.png \
+        -guide source_Gapp.png target_Gapp.png -weight 0.5 \
+        -guide source_Gseg.png target_Gseg.png -weight 1.5 \
+        -guide source_Gpos.png target_Gpos.png -weight 10 \
+        -output output.png
--- a/examples/facestyle/source_Gapp.png
+++ b/examples/facestyle/source_Gapp.png
--- a/examples/facestyle/source_Gpos.png
+++ b/examples/facestyle/source_Gpos.png
--- a/examples/facestyle/source_Gseg.png
+++ b/examples/facestyle/source_Gseg.png
--- a/examples/facestyle/source_painting.png
+++ b/examples/facestyle/source_painting.png
--- a/examples/facestyle/target_Gapp.png
+++ b/examples/facestyle/target_Gapp.png
--- a/examples/facestyle/target_Gpos.png
+++ b/examples/facestyle/target_Gpos.png
--- a/examples/facestyle/target_Gseg.png
+++ b/examples/facestyle/target_Gseg.png
--- a/examples/stylit/run.bat
+++ b/examples/stylit/run.bat
@@ -0,0 +1,10 @@
+@echo off
+setlocal
+set PATH=..\..\bin;%PATH%
+
+ebsynth.exe -style source_style.png ^
+            -guide source_fullgi.png target_fullgi.png -weight 0.5 ^
+            -guide source_dirdif.png target_dirdif.png -weight 0.5 ^
+            -guide source_dirspc.png target_dirspc.png -weight 0.5 ^
+            -guide source_indirb.png target_indirb.png -weight 0.5 ^
+            -output output.png
--- a/examples/stylit/run.sh
+++ b/examples/stylit/run.sh
@@ -0,0 +1,9 @@
+#!/bin/sh
+export PATH=../../bin:$PATH
+
+ebsynth -style source_style.png \
+        -guide source_fullgi.png target_fullgi.png -weight 0.5 \
+        -guide source_dirdif.png target_dirdif.png -weight 0.5 \
+        -guide source_dirspc.png target_dirspc.png -weight 0.5 \
+        -guide source_indirb.png target_indirb.png -weight 0.5 \
+        -output output.png
--- a/examples/stylit/source_dirdif.png
+++ b/examples/stylit/source_dirdif.png
--- a/examples/stylit/source_dirspc.png
+++ b/examples/stylit/source_dirspc.png
--- a/examples/stylit/source_fullgi.png
+++ b/examples/stylit/source_fullgi.png
--- a/examples/stylit/source_indirb.png
+++ b/examples/stylit/source_indirb.png
--- a/examples/stylit/source_style.png
+++ b/examples/stylit/source_style.png
--- a/examples/stylit/target_dirdif.png
+++ b/examples/stylit/target_dirdif.png
--- a/examples/stylit/target_dirspc.png
+++ b/examples/stylit/target_dirspc.png
--- a/examples/stylit/target_fullgi.png
+++ b/examples/stylit/target_fullgi.png
--- a/examples/stylit/target_indirb.png
+++ b/examples/stylit/target_indirb.png
--- a/examples/texbynum/run.bat
+++ b/examples/texbynum/run.bat
@@ -0,0 +1,5 @@
+@echo off
+setlocal
+set PATH=..\..\bin;%PATH%
+
+ebsynth.exe -patchsize 3 -uniformity 1000 -style source_photo.png -guide source_segment.png target_segment.png -output output.png
--- a/examples/texbynum/run.sh
+++ b/examples/texbynum/run.sh
@@ -0,0 +1,5 @@
+#!/bin/sh
+export PATH=../../bin:$PATH
+
+ebsynth -patchsize 3 -uniformity 1000 -style source_photo.png -guide source_segment.png target_segment.png -output output.png
+
--- a/examples/texbynum/source_photo.png
+++ b/examples/texbynum/source_photo.png
--- a/examples/texbynum/source_segment.png
+++ b/examples/texbynum/source_segment.png
--- a/examples/texbynum/target_segment.png
+++ b/examples/texbynum/target_segment.png
--- a/include/ebsynth.h
+++ b/include/ebsynth.h
@@ -0,0 +1,71 @@
+// This software is in the public domain. Where that dedication is not
+// recognized, you are granted a perpetual, irrevocable license to copy
+// and modify this file as you see fit.
+
+#ifndef EBSYNTH_H
+#define EBSYNTH_H
+
+#ifndef EBSYNTH_API
+  #ifdef WIN32
+    #define EBSYNTH_API __declspec(dllimport)
+  #else
+    #define EBSYNTH_API
+  #endif
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define EBSYNTH_BACKEND_CUDA        0x0002
+
+#define EBSYNTH_MAX_STYLE_CHANNELS  8
+#define EBSYNTH_MAX_GUIDE_CHANNELS  24
+
+#define EBSYNTH_VOTEMODE_PLAIN      0x0001         // weight = 1
+#define EBSYNTH_VOTEMODE_WEIGHTED   0x0002         // weight = 1/(1+error)
+
+EBSYNTH_API
+int ebsynthBackendAvailable(int ebsynthBackend);   // returns non-zero if the specified backend is available
+
+EBSYNTH_API
+void ebsynthRun(int    ebsynthBackend,             // use BACKEND_CUDA for maximum speed
+
+                int    numStyleChannels,
+                int    numGuideChannels,
+
+                int    sourceWidth,
+                int    sourceHeight,
+                void*  sourceStyleData,            // (width * height * numStyleChannels) bytes, scan-line order
+                void*  sourceGuideData,            // (width * height * numGuideChannels) bytes, scan-line order
+
+                int    targetWidth,
+                int    targetHeight,
+                void*  targetGuideData,            // (width * height * numGuideChannels) bytes, scan-line order
+                void*  targetModulationData,       // (width * height * numGuideChannels) bytes, scan-line order; pass NULL to switch off the modulation
+
+                float* styleWeights,               // (numStyleChannels) floats
+                float* guideWeights,               // (numGuideChannels) floats
+
+                                                   // guideError(txy,sxy,ch) = guideWeights[ch] * (targetModulation[txy][ch]/255) * (targetGuide[txy][ch]-sourceGuide[sxy][ch])^2
+
+                float  uniformityWeight,           // reasonable values are between 500-15000, 3500 is a good default
+
+                int    patchSize,                  // odd sizes only, use 5 for 5x5 patch, 7 for 7x7, etc.
+                int    voteMode,                   // use VOTEMODE_WEIGHTED for sharper result
+
+                int    numPyramidLevels,
+
+                int*   numSearchVoteItersPerLevel, // how many search/vote iters to perform at each level (array of ints, coarse first, fine last)
+                int*   numPatchMatchItersPerLevel, // how many Patch-Match iters to perform at each level (array of ints, coarse first, fine last)
+
+                int*   stopThresholdPerLevel,      // stop improving pixel when its change since last iteration falls under this threshold
+
+                void*  outputData                  // (width * height * numStyleChannels) bytes, scan-line order
+                );
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
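
The comments in `ebsynth.h` give a formula for the per-channel guide error and for the two vote modes. The sketch below restates them as plain host-side C++ so they can be checked by hand. It is an illustration under assumptions, not the library's implementation: the function names are hypothetical, and the inputs are taken to be 8-bit values, following the "bytes" wording in the header.

```cpp
#include <cassert>
#include <cstdint>

// Per-channel guide error, following the formula in the header comment:
//   guideWeights[ch] * (targetModulation/255) * (targetGuide - sourceGuide)^2
// Hypothetical helper; 8-bit inputs are an assumption based on the header.
float guideErrorChannel(float guideWeight,
                        std::uint8_t targetModulation,
                        std::uint8_t targetGuide,
                        std::uint8_t sourceGuide) {
  const float modulation = float(targetModulation) / 255.0f;
  const float diff = float(targetGuide) - float(sourceGuide);
  return guideWeight * modulation * diff * diff;
}

// Vote weights as documented next to the EBSYNTH_VOTEMODE_* constants.
float voteWeightPlain(float /*error*/) { return 1.0f; }              // weight = 1
float voteWeightWeighted(float error) { return 1.0f / (1.0f + error); } // weight = 1/(1+error)
```

For example, a guide weight of 2.0 with full modulation (255) and guide values 10 and 7 gives 2 × 1 × 3² = 18, while a modulation of 0 removes that channel's contribution entirely.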
--- a/lib/.gitkeep
+++ b/lib/.gitkeep
--- a/src/cudacheck.h
+++ b/src/cudacheck.h
@@ -0,0 +1,20 @@
+#ifndef CUDACHECK_H_
+#define CUDACHECK_H_
+
+template<typename T>
+bool checkCudaError_(T result,char const* const func,const char* const file,int const line)
+{
+  if (result)
+  {
+    printf("CUDA error at %s:%d code=%d \"%s\"\n",file,line,static_cast<unsigned int>(result),func);
+    return true;
+  }
+  else
+  {
+    return false;
+  }
+}
+
+#define checkCudaError(val) checkCudaError_((val),#val,__FILE__,__LINE__)
+
+#endif
--- a/src/ebsynth.cu
+++ b/src/ebsynth.cu
--- a/src/jzq.h
+++ b/src/jzq.h
--- a/src/memarray2.h
+++ b/src/memarray2.h
@@ -0,0 +1,73 @@
+// This software is in the public domain. Where that dedication is not
+// recognized, you are granted a perpetual, irrevocable license to copy
+// and modify this file as you see fit.
+
+#ifndef MEMARRAY2_H_
+#define MEMARRAY2_H_
+
+#include "jzq.h"
+//#include "cudacheck.h"
+
+template<typename T>
+struct MemArray2
+{
+  T* data;
+  int width;
+  int height;
+
+  MemArray2() : data(0),width(0),height(0) {}
+
+  MemArray2(const V2i& size)
+  {
+    width = size(0);
+    height = size(1);
+    checkCudaError(cudaMalloc(&data,width*height*sizeof(T)));
+  }
+
+  MemArray2(int _width,int _height)
+  {
+    width = _width;
+    height = _height;
+    checkCudaError(cudaMalloc(&data,width*height*sizeof(T)));
+  }
+  /*
+  int       __device__ operator()(int i,int j)
+  {
+    return data[i+j*width];
+  }
+
+  const int& __device__ operator()(int i,int j) const
+  {
+    return data[i+j*width];
+  }
+  */
+
+  void destroy()
+  {
+    checkCudaError( cudaFree(data) );
+  }
+};
+
+template<typename T>
+void copy(MemArray2<T>* out_dst,const Array2<T>& src)
+{
+  assert(out_dst != 0);
+  MemArray2<T>& dst = *out_dst;
+  assert(dst.width == src.width());
+  assert(dst.height == src.height());
+
+  checkCudaError(cudaMemcpy(dst.data, src.data(), src.width()*src.height()*sizeof(T), cudaMemcpyHostToDevice));
+}
+
+template<typename T>
+void copy(Array2<T>* out_dst,const MemArray2<T>& src)
+{
+  assert(out_dst != 0);
+  const Array2<T>& dst = *out_dst;
+  assert(dst.width() == src.width);
+  assert(dst.height() == src.height);
+
+  checkCudaError(cudaMemcpy((void*)dst.data(),src.data, src.width*src.height*sizeof(T), cudaMemcpyDeviceToHost));
+}
+
+#endif
--- a/src/patchmatch_gpu.h
+++ b/src/patchmatch_gpu.h
@@ -0,0 +1,342 @@
+// This software is in the public domain. Where that dedication is not
+// recognized, you are granted a perpetual, irrevocable license to copy
+// and modify this file as you see fit.
+
+#ifndef PATCHMATCH_GPU_H_
+#define PATCHMATCH_GPU_H_
+
+#include <cfloat>
+
+#include <curand.h>
+#include <curand_kernel.h>
+
+#include "texarray2.h"
+#include "memarray2.h"
+
+typedef Vec<1,float> V1f;
+typedef Array2<Vec<1,float>> A2V1f;
+
+__global__ void krnlInitRngStates(const int width,
+                                  const int height,
+                                  curandState* rngStates)
+{
+  const int x = blockDim.x*blockIdx.x + threadIdx.x;
+  const int y = blockDim.y*blockIdx.y + threadIdx.y;
+
+  if (x<width && y<height)
+  {
+    const int idx = x+y*width;
+    curand_init((1337 << 20) + idx, 0, 0, &rngStates[idx]);
+  }
+}
+
+curandState* initGpuRng(const int width,
+                        const int height)
+{
+  curandState* gpuRngStates;
+  checkCudaError(cudaMalloc(&gpuRngStates,width*height*sizeof(curandState)));
+
+  const dim3 threadsPerBlock(16,16);
+  const dim3 numBlocks((width+threadsPerBlock.x)/threadsPerBlock.x,
+                       (height+threadsPerBlock.y)/threadsPerBlock.y);
+
+  krnlInitRngStates<<<numBlocks,threadsPerBlock>>>(width,height,gpuRngStates);
+
+  return gpuRngStates;
+}
+
+template<int N,typename T,int M>
+struct PatchSSD
+{
+  const TexArray2<N,T,M> A;
+  const TexArray2<N,T,M> B;
+  const Vec<N,float> weights;
+
+  PatchSSD(const TexArray2<N,T,M>& A,
+           const TexArray2<N,T,M>& B,
+           const Vec<N,float>& weights)
+
+  : A(A),B(B),weights(weights) {}
+
+   __device__ float operator()(int patchWidth,
+                               const int ax,
+                               const int ay,
+                               const int bx,
+                               const int by,
+                               const float ebest)
+   {
+    const int hpw = patchWidth/2;
+    float ssd = 0;
+
+    for(int py=-hpw;py<=+hpw;py++)
+    {
+      for(int px=-hpw;px<=+hpw;px++)
+      {
+        const Vec<N,T> pixelA = A(ax + px, ay + py);
+        const Vec<N,T> pixelB = B(bx + px, by + py);
+        for(int i=0;i<N;i++)
+        {
+          const float diff = float(pixelA[i])-float(pixelB[i]);
+          ssd += weights[i]*diff*diff;
+        }
+      }
+
+      if (ssd>ebest) { return ssd; }
+    }
+
+    return ssd;
+   }
+};
+
+template<typename FUNC>
+__global__ void krnlEvalErrorPass(const int patchWidth,
+                                  FUNC patchError,
+                                  const TexArray2<2,int> NNF,
+                                  TexArray2<1,float> E)
+{
+  const int x = blockDim.x*blockIdx.x + threadIdx.x;
+  const int y = blockDim.y*blockIdx.y + threadIdx.y;
+
+  if (x<NNF.width && y<NNF.height)
+  {
+    const V2i n = NNF(x,y);
+    E.write(x,y,V1f(patchError(patchWidth,x,y,n[0],n[1],FLT_MAX)));
+  }
+}
+
+void __device__ updateOmega(MemArray2<int>& Omega,const int patchWidth,const int bx,const int by,const int incdec)
+{
+  const int r = patchWidth/2;
+
+  for(int oy=-r;oy<=+r;oy++)
+  for(int ox=-r;ox<=+r;ox++)
+  {
+    const int x = bx+ox;
+    const int y = by+oy;
+    atomicAdd(&Omega.data[x+y*Omega.width],incdec);
+    //Omega.data[x+y*Omega.width] += incdec;
+  }
+}
+
+int __device__ patchOmega(const int patchWidth,const int bx,const int by,const MemArray2<int>& Omega)
+{
+  const int r = patchWidth/2;
+
+  int sum = 0;
+
+  for(int oy=-r;oy<=+r;oy++)
+  for(int ox=-r;ox<=+r;ox++)
+  {
+    const int x = bx+ox;
+    const int y = by+oy;
+    sum += Omega.data[x+y*Omega.width]; /// XXX: atomic read instead ??
+  }
+
+  return sum;
+}
+
+template<typename FUNC>
+__device__ void tryPatch(const  V2i& sizeA,
+                         const  V2i& sizeB,
+                                MemArray2<int>& Omega,
+                         const  int patchWidth,
+                         FUNC   patchError,
+                         const  float lambda,
+                         const  int ax,
+                         const  int ay,
+                         const  int bx,
+                         const  int by,
+                         V2i&   nbest,
+                         float& ebest)
+{
+  const float omegaBest = (float(sizeA(0)*sizeA(1)) /
+                           float(sizeB(0)*sizeB(1))) * float(patchWidth*patchWidth);
+
+  const float curOcc = (float(patchOmega(patchWidth,nbest(0),nbest(1),Omega))/float(patchWidth*patchWidth))/omegaBest;
+  const float newOcc = (float(patchOmega(patchWidth,      bx,      by,Omega))/float(patchWidth*patchWidth))/omegaBest;
+
+  const float curErr = ebest;
+  const float newErr = patchError(patchWidth,ax,ay,bx,by,curErr+lambda*curOcc);
+
+  if ((newErr+lambda*newOcc) < (curErr+lambda*curOcc))
+  {
+    updateOmega(Omega,patchWidth,      bx,      by,+1);
+    updateOmega(Omega,patchWidth,nbest(0),nbest(1),-1);
+    nbest = V2i(bx,by);
+    ebest = newErr;
+  }
+}
+
+template<typename FUNC>
+__device__ void tryNeighborsOffset(const int x,
+                                   const int y,
+                                   const int ox,
+                                   const int oy,
+                                   V2i& nbest,
+                                   float& ebest,
+                                   const V2i& sizeA,
+                                   const V2i& sizeB,
+                                         MemArray2<int>& Omega,
+                                   const int patchWidth,
+                                   FUNC patchError,
+                                   const float lambda,
+                                   const TexArray2<2,int>& NNF)
+{
+  const int hpw = patchWidth/2;
+
+  const V2i on = NNF(x+ox,y+oy);
+  const int nx = on(0)-ox;
+  const int ny = on(1)-oy;
+
+  if (nx>=hpw && nx<sizeB(0)-hpw &&
+      ny>=hpw && ny<sizeB(1)-hpw)
+  {
+    tryPatch(sizeA,sizeB,Omega,patchWidth,patchError,lambda,x,y,nx,ny,nbest,ebest);
+  }
+}
+
+template<typename FUNC>
+__global__ void krnlPropagationPass(const V2i sizeA,
+                                    const V2i sizeB,
+                                          MemArray2<int> Omega,
+                                    const int patchWidth,
+                                    FUNC  patchError,
+                                    const float lambda,
+                                    const int r,
+                                    const TexArray2<2,int> NNF,
+                                    TexArray2<2,int> NNF2,
+                                    TexArray2<1,float> E,
+                                    TexArray2<1,unsigned char> mask)
+{
+  const int x = blockDim.x*blockIdx.x + threadIdx.x;
+  const int y = blockDim.y*blockIdx.y + threadIdx.y;
+
+  if (x<sizeA(0) && y<sizeA(1))
+  {
+    V2i   nbest = NNF(x,y);
+    float ebest = E(x,y)(0);
+
+    if (mask(x,y)[0]==255)
+    {
+      tryNeighborsOffset(x,y,-r,0,nbest,ebest,sizeA,sizeB,Omega,patchWidth,patchError,lambda,NNF);
+      tryNeighborsOffset(x,y,+r,0,nbest,ebest,sizeA,sizeB,Omega,patchWidth,patchError,lambda,NNF);
+      tryNeighborsOffset(x,y,0,-r,nbest,ebest,sizeA,sizeB,Omega,patchWidth,patchError,lambda,NNF);
+      tryNeighborsOffset(x,y,0,+r,nbest,ebest,sizeA,sizeB,Omega,patchWidth,patchError,lambda,NNF);
+    }
+
+    E.write(x,y,V1f(ebest));
+    NNF2.write(x,y,nbest);
+  }
+}
+
+template<typename FUNC>
+__device__ void tryRandomOffsetInRadius(const int r,
+                                        const V2i& sizeA,
+                                        const V2i& sizeB,
+                                              MemArray2<int>& Omega,
+                                        const int patchWidth,
+                                        FUNC  patchError,
+                                        const float lambda,
+                                        const int x,
+                                        const int y,
+                                        const V2i& norg,
+                                        V2i&  nbest,
+                                        float& ebest,
+                                        curandState* rngState)
+{
+  const int hpw = patchWidth/2;
+
+  const int xmin = max(norg(0)-r,hpw);
+  const int xmax = min(norg(0)+r,sizeB(0)-1-hpw);
+  const int ymin = max(norg(1)-r,hpw);
+  const int ymax = min(norg(1)+r,sizeB(1)-1-hpw);
+
+  const int nx = xmin+(curand(rngState)%(xmax-xmin+1));
+  const int ny = ymin+(curand(rngState)%(ymax-ymin+1));
+
+  tryPatch(sizeA,sizeB,Omega,patchWidth,patchError,lambda,x,y,nx,ny,nbest,ebest);
+}
+
+template<typename FUNC>
+__global__ void krnlRandomSearchPass(const V2i sizeA,
+                                     const V2i sizeB,
+                                     MemArray2<int> Omega,
+                                     const int patchWidth,
+                                     FUNC  patchError,
+                                     const float lambda,
+                                     TexArray2<2,int> NNF,
+                                     TexArray2<1,float> E,
+                                     TexArray2<1,unsigned char> mask,
+                                     curandState* rngStates)
+{
+  const int x = blockDim.x*blockIdx.x + threadIdx.x;
+  const int y = blockDim.y*blockIdx.y + threadIdx.y;
+
+  if (x<sizeA(0) && y<sizeA(1))
+  {
+    if (mask(x,y)[0]==255)
+    {
+      V2i nbest = NNF(x,y);
+      float ebest = E(x,y)(0);
+
+      const V2i norg = nbest;
+
+      for(int r=1;r<max(sizeB(0),sizeB(1))/2;r=r*2)
+      {
+        tryRandomOffsetInRadius(r,sizeA,sizeB,Omega,patchWidth,patchError,lambda,x,y,norg,nbest,ebest,&rngStates[x+y*NNF.width]);
+      }
+
+      E.write(x,y,V1f(ebest));
+      NNF.write(x,y,nbest);
+    }
+  }
+}
+
+template<typename FUNC>
+void patchmatchGPU(const V2i sizeA,
+                   const V2i sizeB,
+                   MemArray2<int>& Omega,
+                   const int patchWidth,
+                   FUNC patchError,
+                   const float lambda,
+                   const int numIters,
+                   const int numThreadsPerBlock,
+                   TexArray2<2,int>& NNF,
+                   TexArray2<2,int>& NNF2,
+                   TexArray2<1,float>& E,
+                   TexArray2<1,unsigned char>& mask,
+                   curandState* rngStates)
+{
+  const dim3 threadsPerBlock = dim3(numThreadsPerBlock,numThreadsPerBlock);
+  const dim3 numBlocks = dim3((NNF.width+threadsPerBlock.x)/threadsPerBlock.x,
+                              (NNF.height+threadsPerBlock.y)/threadsPerBlock.y);
+
+  krnlEvalErrorPass<<<numBlocks,threadsPerBlock>>>(patchWidth,patchError,NNF,E);
+
+  checkCudaError(cudaDeviceSynchronize());
+
+  for(int i=0;i<numIters;i++)
+  {
+    krnlPropagationPass<<<numBlocks,threadsPerBlock>>>(sizeA,sizeB,Omega,patchWidth,patchError,lambda,4,NNF,NNF2,E,mask); std::swap(NNF,NNF2);
+
+    checkCudaError(cudaDeviceSynchronize());
+
+    krnlPropagationPass<<<numBlocks,threadsPerBlock>>>(sizeA,sizeB,Omega,patchWidth,patchError,lambda,2,NNF,NNF2,E,mask); std::swap(NNF,NNF2);
+
+    checkCudaError(cudaDeviceSynchronize());
+
+    krnlPropagationPass<<<numBlocks,threadsPerBlock>>>(sizeA,sizeB,Omega,patchWidth,patchError,lambda,1,NNF,NNF2,E,mask); std::swap(NNF,NNF2);
+
+    checkCudaError(cudaDeviceSynchronize());
+
+    krnlRandomSearchPass<<<numBlocks,threadsPerBlock>>>(sizeA,sizeB,Omega,patchWidth,patchError,lambda,NNF,E,mask,rngStates);
+
+    checkCudaError(cudaDeviceSynchronize());
+  }
+
+  krnlEvalErrorPass<<<numBlocks,threadsPerBlock>>>(patchWidth,patchError,NNF,E);
+
+  checkCudaError(cudaDeviceSynchronize());
+}
+
+#endif
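
The acceptance test in `tryPatch` combines the patch error with a uniformity (occupancy) penalty. The following host-side sketch mirrors that arithmetic on plain arrays so the rule can be followed without a GPU. It is a reading aid under assumptions rather than the CUDA code itself: `occupancy` and `acceptsCandidate` are hypothetical names, `Omega` is modelled as a `std::vector<int>` in row-major order, and the atomic updates of the original are omitted.

```cpp
#include <cassert>
#include <vector>

// Sum of the usage counts over the patchWidth x patchWidth window centred at (bx, by).
// The caller must keep the window inside the array, as the CUDA code does.
int patchOmega(const std::vector<int>& omega, int omegaWidth,
               int patchWidth, int bx, int by) {
  const int r = patchWidth / 2;
  int sum = 0;
  for (int oy = -r; oy <= r; ++oy)
    for (int ox = -r; ox <= r; ++ox)
      sum += omega[(bx + ox) + (by + oy) * omegaWidth];
  return sum;
}

// Normalised occupancy: mean usage count inside the patch, divided by the
// count expected if the areaA target patches were spread evenly over the
// areaB source pixels (omegaBest in tryPatch).
float occupancy(int omegaSum, int patchWidth, int areaA, int areaB) {
  const float patchArea = float(patchWidth * patchWidth);
  const float omegaBest = (float(areaA) / float(areaB)) * patchArea;
  return (float(omegaSum) / patchArea) / omegaBest;
}

// A candidate replaces the current match only if its error plus the
// lambda-weighted occupancy penalty is strictly lower.
bool acceptsCandidate(float curErr, float curOcc,
                      float newErr, float newOcc, float lambda) {
  return (newErr + lambda * newOcc) < (curErr + lambda * curOcc);
}
```

With `lambda` set to 0 the test reduces to a plain error comparison. With a larger `lambda`, a candidate with a slightly lower error can still be rejected when its source patch is already heavily used.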
--- a/src/stb_image.h
+++ b/src/stb_image.h
--- a/src/stb_image_write.h
+++ b/src/stb_image_write.h
--- a/src/texarray2.h
+++ b/src/texarray2.h
@@ -0,0 +1,300 @@
+// This software is in the public domain. Where that dedication is not
+// recognized, you are granted a perpetual, irrevocable license to copy
+// and modify this file as you see fit.
+
+#ifndef TEXARRAY2_H_
+#define TEXARRAY2_H_
+
+#include "jzq.h"
+#include "cudacheck.h"
+
+#include <cuda_runtime.h>
+
+template<int N, typename T>
+struct CudaVec { };
+
+template<> struct CudaVec<1, unsigned char> { typedef uchar1 type; };
+template<> struct CudaVec<2, unsigned char> { typedef uchar2 type; };
+template<> struct CudaVec<4, unsigned char> { typedef uchar4 type; };
+
+template<> struct CudaVec<1, int> { typedef int1 type; };
+template<> struct CudaVec<2, int> { typedef int2 type; };
+template<> struct CudaVec<4, int> { typedef int4 type; };
+
+template<> struct CudaVec<1, float> { typedef float1 type; };
+template<> struct CudaVec<2, float> { typedef float2 type; };
+template<> struct CudaVec<4, float> { typedef float4 type; };
+
+template<typename T>
+struct CudaKind { };
+
+template<> struct CudaKind<unsigned char> { static const cudaChannelFormatKind kind = cudaChannelFormatKindUnsigned; };
+template<> struct CudaKind<int>           { static const cudaChannelFormatKind kind = cudaChannelFormatKindSigned; };
+template<> struct CudaKind<float>         { static const cudaChannelFormatKind kind = cudaChannelFormatKindFloat; };
+
+inline __device__ Vec<1, unsigned char> cuda2jzq(const uchar1& vec) { return Vec<1, unsigned char>(vec.x); }
+inline __device__ Vec<2, unsigned char> cuda2jzq(const uchar2& vec) { return Vec<2, unsigned char>(vec.x, vec.y); }
+inline __device__ Vec<4, unsigned char> cuda2jzq(const uchar4& vec) { return Vec<4, unsigned char>(vec.x, vec.y, vec.z, vec.w); }
+
+inline __device__ Vec<1, int> cuda2jzq(const int1& vec) { return Vec<1, int>(vec.x); }
+inline __device__ Vec<2, int> cuda2jzq(const int2& vec) { return Vec<2, int>(vec.x, vec.y); }
+inline __device__ Vec<4, int> cuda2jzq(const int4& vec) { return Vec<4, int>(vec.x, vec.y, vec.z, vec.w); }
+
+inline __device__ Vec<1, float> cuda2jzq(const float1& vec) { return Vec<1, float>(vec.x); }
+inline __device__ Vec<2, float> cuda2jzq(const float2& vec) { return Vec<2, float>(vec.x, vec.y); }
+inline __device__ Vec<4, float> cuda2jzq(const float4& vec) { return Vec<4, float>(vec.x, vec.y, vec.z, vec.w); }
+
+#define N_LAYERS(N,M) (1+((N)-1)/(M))
+
+template<int N, typename T>
+struct TexLayer2
+{
+  size_t pitch;
+  void* data;
+  cudaTextureObject_t texObj;
+
+  TexLayer2() : pitch(0), data(0), texObj(0) { }
+
+  TexLayer2(int width, int height)
+  {
+    checkCudaError(cudaMallocPitch(&data, &pitch, width*N*sizeof(T), height));
+
+    const int bits = 8 * sizeof(T);
+
+    const int bitsTable[4][4] = { { bits,    0,    0,    0 },
+                                  { bits, bits,    0,    0 },
+                                  {   -1,   -1,   -1,   -1 },  // 3-channel textures are not supported
+                                  { bits, bits, bits, bits } };
+
+    cudaResourceDesc resDesc;
+    memset(&resDesc, 0, sizeof(resDesc));
+    resDesc.resType = cudaResourceTypePitch2D;
+    resDesc.res.pitch2D.devPtr = data;
+    resDesc.res.pitch2D.pitchInBytes = pitch;
+    resDesc.res.pitch2D.width = width;
+    resDesc.res.pitch2D.height = height;
+    resDesc.res.pitch2D.desc = cudaCreateChannelDesc(bitsTable[N - 1][0],
+      bitsTable[N - 1][1],
+      bitsTable[N - 1][2],
+      bitsTable[N - 1][3],
+      CudaKind<T>::kind);
+
+    cudaTextureDesc texDesc;
+    memset(&texDesc, 0, sizeof(texDesc));
+    texDesc.addressMode[0] = cudaAddressModeClamp;
+    texDesc.addressMode[1] = cudaAddressModeClamp;
+    texDesc.filterMode = cudaFilterModePoint;
+    texDesc.readMode = cudaReadModeElementType;
+    texDesc.normalizedCoords = 0;
+
+    texObj = 0;
+    checkCudaError(cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL));
+  }
+
+  Vec<N, T> __device__ operator()(int x, int y) const
+  {
+    return cuda2jzq(tex2D<typename CudaVec<N, T>::type>(texObj, x, y));
+  }
+
+  void __device__ write(int x, int y, const Vec<N, T>& value)
+  {
+    Vec<N, T>* ptr = (Vec<N, T>*)&((unsigned char*)data)[x*sizeof(Vec<N, T>) + y*pitch];
+    *ptr = value;
+  }
+
+  void destroy()
+  {
+    checkCudaError( cudaDestroyTextureObject(texObj) );
+    checkCudaError( cudaFree(data) );
+  }
+};
+
+template<int N, typename T, int M = N<3 ? N : 4>
+struct TexArray2
+{
+  int width;
+  int height;
+
+  TexLayer2<M, T> texLayers[N_LAYERS(N, M)];
+
+  size_t tmp_pitch;
+  void*  tmp_data;
+
+  TexArray2() : width(0),height(0),tmp_pitch(0),tmp_data(0) { }
+
+  TexArray2(const V2i& size)
+  {
+    width = size(0);
+    height = size(1);
+
+    checkCudaError(cudaMallocPitch(&tmp_data, &tmp_pitch, width*N*sizeof(T), height));
+
+    for (int i = 0; i < N_LAYERS(N, M); ++i)
+      texLayers[i] = TexLayer2<M, T>(width, height);
+  }
+
+  TexArray2(int width, int height)
+  {
+    this->width = width;
+    this->height = height;
+
+    checkCudaError(cudaMallocPitch(&tmp_data, &tmp_pitch, width*N*sizeof(T), height));
+
+    for (int i = 0; i < N_LAYERS(N, M); ++i)
+      texLayers[i] = TexLayer2<M, T>(width, height);
+  }
+
+  Vec<N, T> __device__ operator()(int x, int y) const
+  {
+    Vec<N, T> ret;
+    Vec<M, T> tmp;
+
+    for (int i = 0; i < N / M; ++i){
+      tmp = texLayers[i](x, y);
+      for (int j = 0; j < M; ++j)
+        ret[i*M + j] = tmp[j];
+    }
+
+    if (N % M != 0){
+      tmp = texLayers[N / M](x, y);
+      for (int j = 0; j < N % M; ++j)
+        ret[(N / M)*M + j] = tmp[j];
+    }
+
+    return ret;
+  }
+
+  void __device__ write(int x, int y, const Vec<N, T>& value)
+  {
+    Vec<M, T> tmp;
+
+    for (int i = 0; i < N / M; ++i){
+      for (int j = 0; j < M; ++j)
+        tmp[j] = value[i*M + j];
+      texLayers[i].write(x, y, tmp);
+    }
+
+    if (N % M != 0){
+      for (int j = 0; j < N % M; ++j)
+        tmp[j] = value[(N / M)*M + j];
+      texLayers[N / M].write(x, y, tmp);
+    }
+  }
+
+  V2i size() const
+  {
+    return V2i(width,height);
+  }
+
+  void destroy()
+  {
+    for (int i = 0; i < N_LAYERS(N, M); ++i)
+    {
+      texLayers[i].destroy();
+    }
+
+    checkCudaError( cudaFree(tmp_data) );
+  }
+};
+
+template<int N, typename T, int M>
+__global__ void tmpToLayers(TexArray2<N, T, M> A)
+{
+  const int x = blockDim.x*blockIdx.x + threadIdx.x;
+  const int y = blockDim.y*blockIdx.y + threadIdx.y;
+
+  if (x<A.width && y<A.height)
+  {
+    Vec<N, T>* ptr = (Vec<N, T>*)&((unsigned char*)A.tmp_data)[x*sizeof(Vec<N, T>) + y*A.tmp_pitch];
+    A.write(x, y, *ptr);
+  }
+}
+
+template<int N, typename T, int M>
+__global__ void layersToTmp(const TexArray2<N, T, M> A)
+{
+  const int x = blockDim.x*blockIdx.x + threadIdx.x;
+  const int y = blockDim.y*blockIdx.y + threadIdx.y;
+
+  if (x<A.width && y<A.height)
+  {
+    Vec<N, T> value = A(x, y);
+    Vec<N, T>* ptr = (Vec<N, T>*)&((unsigned char*)A.tmp_data)[x*sizeof(Vec<N, T>) + y*A.tmp_pitch];
+    *ptr = value;
+  }
+}
+
+template<int N, typename T, int M>
+void copy(TexArray2<N, T, M>* out_dst,const Array2<Vec<N, T>>& src)
+{
+  assert(out_dst != 0);
+  const TexArray2<N, T, M>& dst = *out_dst;
+  assert(dst.width == src.width());
+  assert(dst.height == src.height());
+
+  const int srcWidthInBytes = src.width()*sizeof(Vec<N, T>);
+  const int srcPitchInBytes = srcWidthInBytes;
+
+  checkCudaError(cudaMemcpy2D(dst.tmp_data, dst.tmp_pitch, src.data(), srcPitchInBytes, srcWidthInBytes, src.height(), cudaMemcpyHostToDevice));
+
+  const int numThreadsPerBlock = 16;
+  const dim3 threadsPerBlock = dim3(numThreadsPerBlock, numThreadsPerBlock);
+  const dim3 numBlocks = dim3((src.width() + threadsPerBlock.x - 1) / threadsPerBlock.x,
+                              (src.height() + threadsPerBlock.y - 1) / threadsPerBlock.y);
+
+  tmpToLayers<<<numBlocks, threadsPerBlock>>>(dst);
+}
+
+template<int N, typename T, int M>
+void copy(TexArray2<N, T, M>* out_dst,void* src_data)
+{
+  assert(out_dst != 0);
+  const TexArray2<N, T, M>& dst = *out_dst;
+
+  const int srcWidthInBytes = dst.width*sizeof(Vec<N,T>);
+  const int srcPitchInBytes = srcWidthInBytes;
+
+  checkCudaError(cudaMemcpy2D(dst.tmp_data, dst.tmp_pitch, src_data, srcPitchInBytes, srcWidthInBytes, dst.height, cudaMemcpyHostToDevice));
+
+  const int numThreadsPerBlock = 16;
+  const dim3 threadsPerBlock = dim3(numThreadsPerBlock, numThreadsPerBlock);
+  const dim3 numBlocks = dim3((dst.width + threadsPerBlock.x - 1) / threadsPerBlock.x,
+                              (dst.height + threadsPerBlock.y - 1) / threadsPerBlock.y);
+
+  tmpToLayers<<<numBlocks, threadsPerBlock>>>(dst);
+}
+
+template<int N, typename T, int M>
+void copy(Array2<Vec<N, T>>* out_dst, const TexArray2<N, T, M>& src)
+{
+  assert(out_dst != 0);
+  const Array2<Vec<N, T>>& dst = *out_dst;
+  assert(dst.width() == src.width);
+  assert(dst.height() == src.height);
+
+  const int numThreadsPerBlock = 16;
+  const dim3 threadsPerBlock = dim3(numThreadsPerBlock, numThreadsPerBlock);
+  const dim3 numBlocks = dim3((dst.width() + threadsPerBlock.x - 1) / threadsPerBlock.x,
+                              (dst.height() + threadsPerBlock.y - 1) / threadsPerBlock.y);
+
+  layersToTmp<<<numBlocks, threadsPerBlock>>>(src);
+
+  const int dstPitchInBytes = dst.width()*sizeof(Vec<N, T>);
+  checkCudaError(cudaMemcpy2D((void*)dst.data(), dstPitchInBytes, src.tmp_data, src.tmp_pitch, src.width*N*sizeof(T), src.height, cudaMemcpyDeviceToHost));
+}
+
+template<int N, typename T, int M>
+void copy(void** out_dst_data, const TexArray2<N, T, M>& src)
+{
+  const int numThreadsPerBlock = 16;
+  const dim3 threadsPerBlock = dim3(numThreadsPerBlock, numThreadsPerBlock);
+  const dim3 numBlocks = dim3((src.width + threadsPerBlock.x - 1) / threadsPerBlock.x,
+                              (src.height + threadsPerBlock.y - 1) / threadsPerBlock.y);
+
+  layersToTmp<<<numBlocks, threadsPerBlock>>>(src);
+
+  const int dstPitchInBytes = src.width*sizeof(Vec<N, T>);
+  checkCudaError(cudaMemcpy2D((void*)*out_dst_data, dstPitchInBytes, src.tmp_data, src.tmp_pitch, src.width*N*sizeof(T), src.height, cudaMemcpyDeviceToHost));
+}
+
+#endif