Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015)

mcremon-meta wants to merge 2 commits into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19015

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 1 Cancelled Job, 4 Unrelated Failures as of commit 1736792 with merge base 89600b3.

NEW FAILURES: The following jobs have failed:
CANCELLED JOB: The following job was cancelled. Please retry:
FLAKY: The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK: The following jobs failed but were also failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@mcremon-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100712787.
Hi, great work! I have written some comments below, but overall I think it might be easier for us to land the PR and let you run a final internal check rather than the other way around, since we have somewhat easier access to CI and a broader scope in what the backend needs to handle (e.g. dim-order input). The resulting behavior should be very similar from your perspective. Does that sound good to you?
c4dbf62 to 7199184 (Compare)
Summary: Pull Request resolved: #19002

Move 6 permute optimization passes and their shared infrastructure from `executorch/backends/cadence/aot/` to `executorch/backends/transforms/` so they can be shared between the Cadence and Arm backends without a cross-backend dependency.

New files:
- `permute_pass_utils.py`: base classes (`HierarchicalInplacePassInterface`, `RemoveOrReplacePassInterface`, `FuseOpPairsAcrossBranchesPass`) and utilities (`get_arg`, `set_arg`, `get_transposed_dims`, `get_permuted_dims`, `get_shape`, `get_edge_overload_packet`)
- `fuse_cascaded_transpose_or_permute_ops.py`
- `fuse_cascaded_view_ops.py`
- `fuse_transpose_or_permute_op_pairs_pass.py`
- `remove_permutes_around_elementwise_ops.py`
- `postpone_permute_below_squeeze_view.py`
- `replace_nop_transpose_or_permute_with_view.py`

The shared versions omit `register_cadence_pass` decorators and Cadence-specific ops from default op sets. Cadence files will subclass these and re-add the decorators and ops.

Added OSS tests (`test_permute_optimization_passes.py`) for the 4 passes that can be imported without quantized op registration: `FuseCascadedTransposeOrPermuteOps`, `FuseCascadedViewOps`, `PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView`, and `ReplaceNopTransposeOrPermuteWithViewPass`. These run in GitHub CI via pytest and are discovered automatically through `pytest.ini` testpaths.

Differential Revision: D101459577

Reviewed By: ethansfng
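The core idea behind a pass like `FuseCascadedTransposeOrPermuteOps` is that two back-to-back permutes compose into a single one. A minimal sketch of the composition rule in plain Python (`compose_permutes` is a hypothetical helper for illustration, not the pass's actual API):

```python
def compose_permutes(first, second):
    """Compose two successive permutes into one.

    If y = x.permute(first) and z = y.permute(second), then
    z's dim i comes from x's dim first[second[i]], so a single
    permute with dims [first[d] for d in second] is equivalent.
    """
    return [first[d] for d in second]


nchw_to_nhwc = [0, 2, 3, 1]
nhwc_to_nchw = [0, 3, 1, 2]

# Back-to-back inverse permutes compose to the identity permute,
# which a nop-removal pass can then delete outright.
fused = compose_permutes(nchw_to_nhwc, nhwc_to_nchw)
print(fused)  # -> [0, 1, 2, 3]
```

When the fused dims are the identity, the pair is a no-op; otherwise the cascade still shrinks to one permute node.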
Summary: Pull Request resolved: #19015

Replace implicit `tosa_dim_order`-based layout handling with explicit `permute_copy` ops around TOSA operators that require NHWC layout.

### Rewrite passes insert explicit NCHW↔NHWC permutes

`RewriteConvPass`, `RewriteAvgPool2dPass`, and `RewriteMaxPool2dPass` now insert `aten.permute_copy` nodes (NCHW→NHWC before the TOSA op, NHWC→NCHW after) instead of relying on `ToTosaMemoryFormatPass` for layout conversion. This makes layout transitions visible in the graph.

### Grouped conv decomposition in NHWC

`RewriteConvPass` decomposes grouped convolutions (non-depthwise) into per-group `TOSA.CONV2D` ops operating entirely in NHWC, with a single input/output permute pair wrapping the whole group. Supports INT8 and INT16 (with and without bias) quantisation paths, including the full INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) → RESCALE(INT32→INT16).

### `ToTosaMemoryFormatPass` scoped down

Now only assigns non-identity dim_order to parameter/buffer placeholders (for weight serialisation) and graph I/O. Inserts `permute_copy` instead of `tosa.TRANSPOSE`. Skips users that already carry a matching permute (inserted by the rewrite passes).

### TOSA dialect op metas expect NHWC

All TOSA op meta functions (`CONV2D`, `CONV3D`, `DEPTHWISE_CONV2D`, `AVG_POOL2D`, `MAX_POOL2D`, `TRANSPOSE_CONV2D`) now assume NHWC input layout and produce NHWC output shapes.

### Removed `tosa_dim_order` shape remapping

`tosa_shape()` no longer reorders dimensions; it just resolves symints. `_get_matching_fake_tensor()` returns `node.meta["val"]` directly. Serialisation mapping always uses identity dim_order.

### Operator serialisation simplified

`op_amax`, `op_amin`, `op_any`, `op_cat`, `op_sum`, and `op_permute` no longer remap reduction/concat axes through `dim_order`, since tensors are already in the layout expected by TOSA.

### Permute optimisation passes added

Six shared passes from `executorch/backends/transforms/` are now run after TOSA lowering to fuse, cancel, and simplify the permutes introduced above:

- `RemovePermutesAroundElementwiseOps` (extended for `RESCALE`)
- `FuseTransposeOrPermuteOpPairsPass` (extended for `RESCALE`)
- `ReplaceNopTransposeOrPermuteWithViewPass`
- `PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView`
- `FuseCascadedTransposeOrPermuteOps`
- `FuseCascadedViewOps`

### Removed passes

`DecomposeConvWithInt16ActivationPass` and `DecomposeGroupedConvPass` are removed; their logic is now handled inline by `RewriteConvPass`. `RewriteSlicePass` is repositioned after the permute optimisations.

### Ethos-U55 partitioner simplified

The dual NCHW/NHWC permute constraint check is removed, since tensors are always in the expected layout at partition time.

Differential Revision: D100712787
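The shape bookkeeping that the rewrite passes perform around an NHWC-only TOSA op can be sketched as follows. This is only an illustration of the dim-order arithmetic (`permute_shape` is a hypothetical helper, not part of the backend), assuming a shape-preserving op between the two permutes:

```python
def permute_shape(shape, dims):
    # Shape propagation for a permute_copy node: output dim i
    # takes its size from input dim dims[i].
    return tuple(shape[d] for d in dims)


NCHW_TO_NHWC = (0, 2, 3, 1)
NHWC_TO_NCHW = (0, 3, 1, 2)

nchw_in = (1, 8, 32, 32)                            # N, C, H, W
nhwc_in = permute_shape(nchw_in, NCHW_TO_NHWC)      # (1, 32, 32, 8)
# ... NHWC-layout TOSA op (e.g. a shape-preserving pooling) ...
nhwc_out = nhwc_in
nchw_out = permute_shape(nhwc_out, NHWC_TO_NCHW)    # back to (1, 8, 32, 32)
```

Making both permutes explicit graph nodes is what lets the later optimisation passes see and cancel adjacent NHWC→NCHW / NCHW→NHWC pairs between consecutive conv/pool ops.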
7199184 to 40bde3c (Compare)
40bde3c to 1736792 (Compare)
Review comment on:

```python
        return False
    return not self._is_depthwise_conv2d(node)

def _handle_grouped_conv(  # noqa: C901
```
This goes back to something we discussed in the meeting: here we are trying to live in a world where the expectation is that everything not between two permutes is always NCHW. For an NHWC-only accelerator, shouldn't it be the other way around during lowering? Is it too much to ask users to do:
```python
example_input = example_input.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)
model.forward(example_input)  # or export and lower
```
Or even do this after partitioning, assume it's always channels-last, and deal with the ops that are memory-format sensitive?
See example
Tensors do not know anything about layouts, right? They only know dimensions. If the graph tracer gives dim0 and the next op needs dim1, then the compiler inserts a permute; the same happens at the output. This is the most consistent way we've handled those cases.
Only two sets of ops (convolutions and pooling) will need those, and if they're back to back, or the ops between them are dimension-insensitive, then you should always get exactly what you describe (the graph will have a permute on each input and a permute on each output) and everything else will cancel.
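The cancellation described here boils down to detecting that one op's output permute is the exact inverse of the next op's input permute. A minimal sketch of that check (hypothetical helper names, not the backend's actual API):

```python
def inverse_permute(dims):
    # inv satisfies inv[dims[i]] = i: applying dims and then inv
    # restores the original dimension order.
    inv = [0] * len(dims)
    for i, d in enumerate(dims):
        inv[d] = i
    return inv


def permutes_cancel(first, second):
    # True when the second permute undoes the first,
    # i.e. the back-to-back pair is a no-op and can be removed.
    return list(second) == inverse_permute(list(first))


# The NHWC->NCHW permute after one conv cancels against the
# NCHW->NHWC permute feeding the next conv/pool.
assert permutes_cancel([0, 3, 1, 2], [0, 2, 3, 1])
```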
The way to achieve that is very important though, in my opinion.
Last thing I want to mention here: it's probably not too much to ask, but I think it's the wrong thing to ask. Users shouldn't have to even know they need to do this if the flow behaves properly. But maybe that's a personal take :)
digantdesai left a comment
LGTM, I will let @AdrianLundell stamp it, since he is also looking into this.
I do want to make sure though that we are at least handling NHWC model/inputs properly.
Agreed, I do think I had to fix that in the changes, and it basically added a permute in that case. Will confirm locally that it's working as intended. That permute should then cancel out with the first conv/pool layer.