Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015)

Open

mcremon-meta wants to merge 2 commits into main from export-D100712787

Conversation

@mcremon-meta (Contributor) commented Apr 21, 2026

Summary:

Replace implicit tosa_dim_order-based layout handling with explicit
permute_copy ops around TOSA operators that require NHWC layout.

Rewrite passes insert explicit NCHW↔NHWC permutes

RewriteConvPass, RewriteAvgPool2dPass, and RewriteMaxPool2dPass
now insert aten.permute_copy nodes (NCHW→NHWC before the TOSA op,
NHWC→NCHW after) instead of relying on ToTosaMemoryFormatPass for
layout conversion. This makes layout transitions visible in the graph.
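As a rough illustration (plain Python, helper names are illustrative, not the actual pass code), the permute pair the rewrite passes wrap around an NHWC-only TOSA op looks like:

```python
NCHW_TO_NHWC = (0, 2, 3, 1)  # permute_copy inserted before the TOSA op
NHWC_TO_NCHW = (0, 3, 1, 2)  # permute_copy inserted after it

def permute_shape(shape, order):
    """Reorder a shape the way aten.permute_copy reorders dims."""
    return tuple(shape[i] for i in order)

nchw = (1, 8, 16, 16)                      # N, C, H, W
nhwc = permute_shape(nchw, NCHW_TO_NHWC)   # N, H, W, C

# The two orders are inverses, so a back-to-back pair is a no-op and can
# later be cancelled by the permute optimisation passes.
roundtrip = permute_shape(nhwc, NHWC_TO_NCHW)
```

Because both permutes are explicit graph nodes, the optimisation passes below can see and cancel them, which the old implicit dim_order handling could not expose.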

Grouped conv decomposition in NHWC

RewriteConvPass decomposes grouped convolutions (non-depthwise) into
per-group TOSA.CONV2D ops operating entirely in NHWC, with a single
input/output permute pair wrapping the whole group. Supports INT8,
INT16 (with and without bias) quantisation paths, including the full
INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) →
RESCALE(INT32→INT16).
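The shape bookkeeping of the decomposition can be sketched as follows (a toy model, not the pass itself; function name is hypothetical): input channels are split per group, each group gets its own NHWC CONV2D, and the per-group outputs are concatenated along the channel (last) axis, all inside a single input/output permute pair.

```python
def grouped_conv2d_shapes(n, h, w, c_in, c_out, groups):
    """Per-group NHWC tensor shapes for a grouped-conv decomposition
    (illustrative: assumes a shape-preserving conv, e.g. 1x1 stride-1)."""
    assert c_in % groups == 0 and c_out % groups == 0
    per_in, per_out = c_in // groups, c_out // groups
    group_in = [(n, h, w, per_in)] * groups    # per-group CONV2D inputs
    group_out = [(n, h, w, per_out)] * groups  # per-group CONV2D outputs
    # Concatenating along the NHWC channel axis restores the full output.
    full_out = (n, h, w, per_out * groups)
    return group_in, group_out, full_out

gi, go, full = grouped_conv2d_shapes(1, 8, 8, 6, 12, groups=3)
```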

ToTosaMemoryFormatPass scoped down

Now only assigns non-identity dim_order to parameter/buffer
placeholders (for weight serialisation) and graph I/O. Inserts
permute_copy instead of tosa.TRANSPOSE. Skips users that already
carry a matching permute (inserted by the rewrite passes).

TOSA dialect op metas expect NHWC

All TOSA op meta functions (CONV2D, CONV3D, DEPTHWISE_CONV2D,
AVG_POOL2D, MAX_POOL2D, TRANSPOSE_CONV2D) now assume NHWC
input layout and produce NHWC output shapes.

Removed tosa_dim_order shape remapping

tosa_shape() no longer reorders dimensions—just resolves symints.
_get_matching_fake_tensor() returns node.meta["val"] directly.
Serialisation mapping always uses identity dim_order.

Operator serialisation simplified

op_amax, op_amin, op_any, op_cat, op_sum, and op_permute
no longer remap reduction/concat axes through dim_order since
tensors are already in the layout expected by TOSA.
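A small sketch of why the remapping could be dropped (helper name is illustrative): with a non-identity dim_order, a reduction or concat axis had to be translated to its serialized position, but under an identity dim_order that translation is a no-op.

```python
def remap_axis(axis, dim_order):
    """Old behaviour: find where a logical NCHW axis landed in the
    serialized tensor under a given dim_order."""
    return dim_order.index(axis)

NHWC_DIM_ORDER = (0, 2, 3, 1)  # non-identity order used previously
IDENTITY = (0, 1, 2, 3)        # what serialisation always uses now

old = remap_axis(1, NHWC_DIM_ORDER)  # channel axis had moved to position 3
new = remap_axis(1, IDENTITY)        # identity: axis is unchanged
```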

Permute optimisation passes added

Six shared passes from executorch/backends/transforms/ are now run
after TOSA lowering to fuse, cancel, and simplify the permutes
introduced above:

  • RemovePermutesAroundElementwiseOps (extended for RESCALE)
  • FuseTransposeOrPermuteOpPairsPass (extended for RESCALE)
  • ReplaceNopTransposeOrPermuteWithViewPass
  • PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView
  • FuseCascadedTransposeOrPermuteOps
  • FuseCascadedViewOps
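A toy model of what the cancellation passes achieve (not the actual pass code): adjacent permutes that compose to the identity are removed, so the NHWC↔NCHW pairs inserted by the rewrite passes disappear whenever nothing layout-sensitive sits between them.

```python
def cancel_inverse_permutes(ops):
    """Drop adjacent permute pairs that compose to the identity.
    `ops` is a list of (op_name, permute_order_or_None) tuples."""
    out = []
    for op in ops:
        if out and out[-1][0] == "permute" and op[0] == "permute":
            p, q = out[-1][1], op[1]
            # Applying p then q gives combined order p[q[i]] per dim.
            if tuple(p[i] for i in q) == tuple(range(len(p))):
                out.pop()  # the pair is a no-op: drop both
                continue
        out.append(op)
    return out

graph = [
    ("permute", (0, 2, 3, 1)),  # NCHW -> NHWC before a pool
    ("max_pool2d", None),
    ("permute", (0, 3, 1, 2)),  # NHWC -> NCHW after it
    ("permute", (0, 2, 3, 1)),  # NCHW -> NHWC before the next pool
]
simplified = cancel_inverse_permutes(graph)
```

In this sketch the trailing NHWC→NCHW / NCHW→NHWC pair between the two pools cancels, leaving a single input permute, which mirrors how back-to-back conv/pool layers end up with only boundary permutes.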

Removed passes

DecomposeConvWithInt16ActivationPass and DecomposeGroupedConvPass
are removed—their logic is now handled inline by RewriteConvPass.
RewriteSlicePass is repositioned after the permute optimisations.

Ethos-U55 partitioner simplified

The dual NCHW/NHWC permute constraint check is removed since tensors
are always in the expected layout at partition time.

Differential Revision: D100712787

pytorch-bot Bot commented Apr 21, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19015

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 1 Cancelled Job, 4 Unrelated Failures

As of commit 1736792 with merge base 89600b3:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed, but the failures were already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 21, 2026
meta-codesync Bot commented Apr 21, 2026

@mcremon-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100712787.

github-actions Bot commented:

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@3l1 3l1 added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm ciflow/trunk module: arm Issues related to arm backend labels Apr 21, 2026
@AdrianLundell (Collaborator) commented

Hi, great work! I have written some comments below, but overall I think it might be easier for us to land the PR and let you run a final internal check rather than vice versa, since we have somewhat easier access to CI and a broader scope in what the backend needs to handle (e.g. dim-order input). The resulting behaviour should be very similar from your perspective. Does that sound good to you?

  • I am getting an increased number of transposes on some graphs in our transpose-count suite, mainly due to the following issues:

    • Transposes are not fused across elementwise branching in all cases
    • Transposes are not fused in upward/downward forks

    As long as we don't see regressions on important models and have good regression tests, I'm fine with landing the current behaviour for now and following up with improvements later.

  • The override and beartype dependencies will have to be removed from permute_pass_utils

  • A number of tests are failing:

  • There is no reason to remove the various validations done for the max/avg-pool2d TOSA ops.

  • Some aesthetic nits:

    • The max_pool2d/avg_pool rewrite passes are completely rewritten as call-passes; a minimal diff that only inserts transposes in the call_operator pass would be preferable.
    • The way the new passes are introduced into the Arm backend differs from how we normally do it; it would be good to follow the general structure.
    • You can remove the ToTosaMemoryFormat and tosa.TRANSPOSE traces and the permute op helpers completely since they are not used anymore

Summary:
Pull Request resolved: #19002

Move 6 permute optimization passes and their shared infrastructure from
executorch/backends/cadence/aot/ to executorch/backends/transforms/ so
they can be shared between the Cadence and Arm backends without a
cross-backend dependency.

New files:
- permute_pass_utils.py: base classes (HierarchicalInplacePassInterface,
  RemoveOrReplacePassInterface, FuseOpPairsAcrossBranchesPass) and
  utilities (get_arg, set_arg, get_transposed_dims, get_permuted_dims,
  get_shape, get_edge_overload_packet)
- fuse_cascaded_transpose_or_permute_ops.py
- fuse_cascaded_view_ops.py
- fuse_transpose_or_permute_op_pairs_pass.py
- remove_permutes_around_elementwise_ops.py
- postpone_permute_below_squeeze_view.py
- replace_nop_transpose_or_permute_with_view.py

The shared versions omit register_cadence_pass decorators and
cadence-specific ops from default op sets. Cadence files will subclass
these and re-add the decorators and ops.
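As an illustration of the kind of utility collected in permute_pass_utils (the real helper names and signatures may differ), inverting a permutation is the core operation behind detecting cancelling permute pairs:

```python
def invert_permutation(order):
    """Return the order that undoes `order` (sketch of the sort of helper
    a shared permute-pass utility module would provide; the actual
    function names in permute_pass_utils.py may differ)."""
    inverse = [0] * len(order)
    for pos, dim in enumerate(order):
        inverse[dim] = pos
    return tuple(inverse)

inv = invert_permutation((0, 2, 3, 1))  # NCHW->NHWC
```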

Added OSS tests (test_permute_optimization_passes.py) for the 4 passes
that can be imported without quantized op registration:
FuseCascadedTransposeOrPermuteOps, FuseCascadedViewOps,
PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView, and
ReplaceNopTransposeOrPermuteWithViewPass. These run in GitHub CI via
pytest and are discovered automatically through pytest.ini testpaths.

Differential Revision: D101459577

Reviewed By: ethansfng
mcremon-meta added a commit that referenced this pull request Apr 22, 2026
@meta-codesync meta-codesync Bot changed the title Replace tosa_dim_order with explicit NCHW↔NHWC permutes Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015) Apr 22, 2026
return False
return not self._is_depthwise_conv2d(node)

def _handle_grouped_conv( # noqa: C901

@digantdesai (Contributor) commented Apr 22, 2026
This goes back to something we discussed in the meeting, here we are trying to live in the world where the expectation is everything which is not between two permutes is always NCHW. For a NHWC-only accelerator shouldn't it be other way around, during lowering? Is it too much to ask from users to do,

example_input = example_input.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)
model.forward(example_input)  # or export and lower

Or even do this after partitioning and assume its always channels last and deal with ops which are memory format sensitive?

See example

@mcremon-meta (Contributor, Author) replied:

Tensors do not know anything about layouts, right? They only know dimensions. If the graph tracer gives dim0 and the next op needs dim1, the compiler inserts a permute. At the output, same thing. This is the most consistent way we've handled those cases.
Only 2 sets of ops (convolutions and pooling) will need those, and if they're back to back or the ops between them are dimension-insensitive, then you should get exactly what you describe (the graph will have a permute on each input and a permute on each output) and everything else will cancel.
The way to achieve that is very important though, in my opinion.
Last thing I want to mention here: it's probably not too much to ask, but I think it's the wrong thing to ask. Users shouldn't even have to know they need to do this, if the flow behaves properly. But maybe that's a personal take :)

@digantdesai (Contributor) left a comment

LGTM, I will let @AdrianLundell stamp it, since he is also looking into this.

I do want to make sure though that we are at least handling NHWC model/inputs properly.

@mcremon-meta (Contributor, Author) replied:

> LGTM, I will let @AdrianLundell stamp it, since he is also looking into this.
>
> I do want to make sure though that we are at least handling NHWC model/inputs properly.

Agreed, I did have to fix that in the changes, and it basically adds a permute in that case. Will confirm locally that it's working as intended. That permute should then cancel out with the first conv/pool layer.


Labels

ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported module: arm Issues related to arm backend partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm
