ID 577363 - CopyCat training fails with Tensor size errors or crashes Nuke in certain node configurations

Problem summary:
CopyCat training fails with Tensor size errors or crashes Nuke in certain node configurations
 
This behavior appears to occur when the following conditions are met, although it may occur in other situations as well:

  • A Crop or Reformat is located before the CopyCat node and has an Expression referencing input.height
  • Three or more Input/Ground Truth image pairs are used
  • Not all of the Image Pairs share the same format/resolution
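
The conditions above can be expressed as a small check, which may help when deciding whether a script is likely to be affected. This is a minimal sketch only: the function name and its arguments are hypothetical, it treats the three conditions as jointly required (the report does not confirm this), and gathering the sizes and expression text from a live Nuke script is left out.

```python
def matches_reported_conditions(pair_sizes, upstream_expressions):
    """Return True if a setup matches the three conditions listed in this report.

    pair_sizes: list of (width, height) tuples, one per Input/Ground Truth
        image pair (hypothetical input, collected by the caller).
    upstream_expressions: list of expression strings found on Crop or
        Reformat nodes placed before the CopyCat node (hypothetical input).
    """
    # Condition 1: an upstream expression references input.height.
    has_height_expression = any(
        "input.height" in expr for expr in upstream_expressions
    )
    # Condition 2: three or more image pairs are used.
    has_three_or_more_pairs = len(pair_sizes) >= 3
    # Condition 3: the pairs do not all share one format/resolution.
    has_mixed_formats = len(set(pair_sizes)) > 1
    return has_height_expression and has_three_or_more_pairs and has_mixed_formats


# Example: three pairs, one at a different resolution, with a Crop expression.
sizes = [(1920, 1080), (1920, 1080), (1280, 720)]
expressions = ["input.height - 100"]
print(matches_reported_conditions(sizes, expressions))
```

A False result does not mean a script is safe, since the report notes the problem could occur in other situations as well.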
 
Customer reported version:
Nuke 15.0v1
 
Customer reported platform:
Windows 10
 
Steps to reproduce:
1) Download the attached copycat_training.nk file.
2) Launch Nuke and open this script.

3) Open the CopyCat1 node's Properties, and select a valid Data Directory.
4) Press the Start Training button, and observe the following error that appears:
ERROR: CopyCat1: start (149) + length (64) exceeds dimension size (181).
Exception raised from narrow at W:\.conan\3204d2\1\PyTorch_src\aten\src\ATen\native\TensorShape.cpp:1064 (most recent call first):
00007FFE9FFED19200007FFE9FFED130 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FFE9FFECC1E00007FFE9FFECBD0 c10.dll!c10::detail::torchCheckFail [<unknown file> @ <unknown line number>]
00007FFE1D59B59A00007FFE1D59B3D0 torch_cpu.dll!at::native::narrow [<unknown file> @ <unknown line number>]
00007FFE1DFD9BE100007FFE1DFA78E0 torch_cpu.dll!at::compositeimplicitautograd::broadcast_to [<unknown file> @ <unknown line number>]
00007FFE1D883A5E00007FFE1D8838C0 torch_cpu.dll!at::_ops::narrow::call [<unknown file> @ <unknown line number>]
00007FFE1D15389800007FFE1D153880 torch_cpu.dll!at::narrow [<unknown file> @ <unknown line number>]
00007FFE6A1D010400007FFE6A1CFA10 CopyCat.dll!c10::ivalue::Future::exception_ptr [<unknown file> @ <unknown line number>]
00007FFE6A1CFE3100007FFE6A1CFA10 CopyCat.dll!c10::ivalue::Future::exception_ptr [<unknown file> @ <unknown line number>]
00007FFE6A1E400000007FFE6A1E1800 CopyCat.dll!c10::ivalue::Future::synchronizeWithCurrentStreams [<unknown file> @ <unknown line number>]
00007FFE6A1DA7BE00007FFE6A1DA360 CopyCat.dll!c10::ivalue::Future::markCompleted [<unknown file> @ <unknown line number>]
00007FFE6A1E478D00007FFE6A1E4720 CopyCat.dll!c10::ivalue::Future::tryRetrieveErrorMessageInternal [<unknown file> @ <unknown line number>]
00007FFE6A1C867000007FFE6A1BB9F0 CopyCat.dll!RIP::Compute::Exceptions::UnverifiedTypeException::operator= [<unknown file> @ <unknown line number>]
00007FFE7A10178600007FFE7A0CD070 nuke-15.0.4.dll!Nuke::Analytics::StartAnalytics [<unknown file> @ <unknown line number>]
00007FFE7A103D1200007FFE7A0CD070 nuke-15.0.4.dll!Nuke::Analytics::StartAnalytics [<unknown file> @ <unknown line number>]
00007FFE7A100E6100007FFE7A0CD070 nuke-15.0.4.dll!Nuke::Analytics::StartAnalytics [<unknown file> @ <unknown line number>]
00007FFE7A68065900007FFE7A590740 nuke-15.0.4.dll!
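
The error is raised from PyTorch's narrow operation, which selects a slice of a given length starting at a given offset along one dimension of a tensor. The arithmetic behind the message can be sketched in plain Python as below. This is an illustration of the bounds check only, not the PyTorch or CopyCat source; the function name is hypothetical, and the report does not state what the values 149, 64 and 181 correspond to inside CopyCat.

```python
def narrow_bounds_error(start, length, dim_size):
    """Return the error text if the slice [start, start + length) runs past
    the end of a dimension of size dim_size, otherwise return None.

    Hypothetical helper that mirrors the check described by the error message.
    """
    if start + length > dim_size:
        return (
            f"start ({start}) + length ({length}) "
            f"exceeds dimension size ({dim_size})."
        )
    return None


# Values taken from the reported error: 149 + 64 = 213, which is 32 past 181.
print(narrow_bounds_error(149, 64, 181))

# A slice that fits inside the dimension produces no error.
print(narrow_bounds_error(100, 64, 181))
```

With the reported values, the requested slice ends 32 elements beyond the dimension, which is consistent with a slice offset being computed for a different image size than the one actually supplied.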

 
Expected behavior:
The CopyCat node should train without errors or crashing Nuke.
 
Actual behavior:
Tensor errors appear when attempting to train a CopyCat node if its Image Pairs do not all share the same resolution and a Crop node with an Expression containing input.height is placed before the CopyCat node. This issue will occur regardless of the Crop's Reformat or Black Outside knobs being enabled/disabled.
 
In most configurations, Nuke appears to crash consistently, without producing any error message, when CopyCat's Use Multi Resolution Training option is disabled. However, some node configurations appear to crash Nuke every time, whether or not Use Multi Resolution Training is enabled.
 
In NukeX 15.0v1 and 14.1v1, the Tensor errors do not appear, but CopyCat training quits unexpectedly without any feedback. These versions still appear to crash in the crash-prone configurations (for example, with Use Multi Resolution Training disabled).
 
In versions before Nuke 14.0v5, these CopyCat node networks appear to train without error, but if a training session is cancelled and started again, Nuke will occasionally report that the "Input and ground truth sequences are of different sizes".
 
Workaround:
No known workaround at this time.
 
Reproduced by Support in:
NukeX 15.0v4 - Windows 10, macOS 13 Ventura
NukeX 15.0v3 - Windows 10, macOS 13 Ventura
NukeX 15.0v2 - Windows 10, macOS 13 Ventura
NukeX 14.1v4 - Windows 10
NukeX 14.1v3 - Windows 10
NukeX 14.1v2 - Windows 10
 
Training fails without any Tensor errors, crashes still occur:
NukeX 15.0v1 - Windows 10, macOS 13 Ventura
NukeX 14.1v1 - Windows 10
 
Training crashes Nuke without any Tensor errors:
NukeX 14.1v4 - macOS 13 Ventura
NukeX 14.1v3 - macOS 13 Ventura
NukeX 14.1v2 - macOS 13 Ventura
NukeX 14.1v1 - macOS 13 Ventura
NukeX 14.0v7 - Windows 10
NukeX 14.0v6 - Windows 10
NukeX 14.0v5 - Windows 10 - Regression
 
Unable to reproduce bug in:
NukeX 14.0v4 - Windows 10
 
Earliest version tested:
NukeX 14.0v4 - The issue does not occur in this version; it is a regression first seen in 14.0v5.
