Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015)

Open

mcremon-meta wants to merge 2 commits into main from export-D100712787

Conversation

@mcremon-meta (Contributor) commented Apr 21, 2026

Summary:

Replace implicit tosa_dim_order-based layout handling with explicit
permute_copy ops around TOSA operators that require NHWC layout.

Rewrite passes insert explicit NCHW↔NHWC permutes

RewriteConvPass, RewriteAvgPool2dPass, and RewriteMaxPool2dPass
now insert aten.permute_copy nodes (NCHW→NHWC before the TOSA op,
NHWC→NCHW after) instead of relying on ToTosaMemoryFormatPass for
layout conversion. This makes layout transitions visible in the graph.
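As a rough illustration (plain Python, helper names are illustrative, not the actual pass code), the permute pair the rewrite passes wrap around an NHWC-only TOSA op looks like:

```python
NCHW_TO_NHWC = (0, 2, 3, 1)  # permute_copy inserted before the TOSA op
NHWC_TO_NCHW = (0, 3, 1, 2)  # permute_copy inserted after it

def permute_shape(shape, order):
    """Reorder a shape the way aten.permute_copy reorders dims."""
    return tuple(shape[i] for i in order)

nchw = (1, 8, 16, 16)                      # N, C, H, W
nhwc = permute_shape(nchw, NCHW_TO_NHWC)   # N, H, W, C

# The two orders are inverses, so a back-to-back pair is a no-op and can
# later be cancelled by the permute optimisation passes.
roundtrip = permute_shape(nhwc, NHWC_TO_NCHW)
```

Because both permutes are explicit graph nodes, the optimisation passes below can see and cancel them, which the old implicit dim_order handling could not expose.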

Grouped conv decomposition in NHWC

RewriteConvPass decomposes grouped convolutions (non-depthwise) into
per-group TOSA.CONV2D ops operating entirely in NHWC, with a single
input/output permute pair wrapping the whole group. Supports INT8,
INT16 (with and without bias) quantisation paths, including the full
INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) →
RESCALE(INT32→INT16).
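The shape bookkeeping of the decomposition can be sketched as follows (a toy model, not the pass itself; function name is hypothetical): input channels are split per group, each group gets its own NHWC CONV2D, and the per-group outputs are concatenated along the channel (last) axis, all inside a single input/output permute pair.

```python
def grouped_conv2d_shapes(n, h, w, c_in, c_out, groups):
    """Per-group NHWC tensor shapes for a grouped-conv decomposition
    (illustrative: assumes a shape-preserving conv, e.g. 1x1 stride-1)."""
    assert c_in % groups == 0 and c_out % groups == 0
    per_in, per_out = c_in // groups, c_out // groups
    group_in = [(n, h, w, per_in)] * groups    # per-group CONV2D inputs
    group_out = [(n, h, w, per_out)] * groups  # per-group CONV2D outputs
    # Concatenating along the NHWC channel axis restores the full output.
    full_out = (n, h, w, per_out * groups)
    return group_in, group_out, full_out

gi, go, full = grouped_conv2d_shapes(1, 8, 8, 6, 12, groups=3)
```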

ToTosaMemoryFormatPass scoped down

Now only assigns non-identity dim_order to parameter/buffer
placeholders (for weight serialisation) and graph I/O. Inserts
permute_copy instead of tosa.TRANSPOSE. Skips users that already
carry a matching permute (inserted by the rewrite passes).

TOSA dialect op metas expect NHWC

All TOSA op meta functions (CONV2D, CONV3D, DEPTHWISE_CONV2D,
AVG_POOL2D, MAX_POOL2D, TRANSPOSE_CONV2D) now assume NHWC
input layout and produce NHWC output shapes.

Removed tosa_dim_order shape remapping

tosa_shape() no longer reorders dimensions—just resolves symints.
_get_matching_fake_tensor() returns node.meta["val"] directly.
Serialisation mapping always uses identity dim_order.

Operator serialisation simplified

op_amax, op_amin, op_any, op_cat, op_sum, and op_permute
no longer remap reduction/concat axes through dim_order since
tensors are already in the layout expected by TOSA.
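A small sketch of why the remapping could be dropped (helper name is illustrative): with a non-identity dim_order, a reduction or concat axis had to be translated to its serialized position, but under an identity dim_order that translation is a no-op.

```python
def remap_axis(axis, dim_order):
    """Old behaviour: find where a logical NCHW axis landed in the
    serialized tensor under a given dim_order."""
    return dim_order.index(axis)

NHWC_DIM_ORDER = (0, 2, 3, 1)  # non-identity order used previously
IDENTITY = (0, 1, 2, 3)        # what serialisation always uses now

old = remap_axis(1, NHWC_DIM_ORDER)  # channel axis had moved to position 3
new = remap_axis(1, IDENTITY)        # identity: axis is unchanged
```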

Permute optimisation passes added

Six shared passes from executorch/backends/transforms/ are now run
after TOSA lowering to fuse, cancel, and simplify the permutes
introduced above:

  • RemovePermutesAroundElementwiseOps (extended for RESCALE)
  • FuseTransposeOrPermuteOpPairsPass (extended for RESCALE)
  • ReplaceNopTransposeOrPermuteWithViewPass
  • PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView
  • FuseCascadedTransposeOrPermuteOps
  • FuseCascadedViewOps
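A toy model of what the cancellation passes achieve (not the actual pass code): adjacent permutes that compose to the identity are removed, so the NHWC↔NCHW pairs inserted by the rewrite passes disappear whenever nothing layout-sensitive sits between them.

```python
def cancel_inverse_permutes(ops):
    """Drop adjacent permute pairs that compose to the identity.
    `ops` is a list of (op_name, permute_order_or_None) tuples."""
    out = []
    for op in ops:
        if out and out[-1][0] == "permute" and op[0] == "permute":
            p, q = out[-1][1], op[1]
            # Applying p then q gives combined order p[q[i]] per dim.
            if tuple(p[i] for i in q) == tuple(range(len(p))):
                out.pop()  # the pair is a no-op: drop both
                continue
        out.append(op)
    return out

graph = [
    ("permute", (0, 2, 3, 1)),  # NCHW -> NHWC before a pool
    ("max_pool2d", None),
    ("permute", (0, 3, 1, 2)),  # NHWC -> NCHW after it
    ("permute", (0, 2, 3, 1)),  # NCHW -> NHWC before the next pool
]
simplified = cancel_inverse_permutes(graph)
```

In this sketch the trailing NHWC→NCHW / NCHW→NHWC pair between the two pools cancels, leaving a single input permute, which mirrors how back-to-back conv/pool layers end up with only boundary permutes.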

Removed passes

DecomposeConvWithInt16ActivationPass and DecomposeGroupedConvPass
are removed—their logic is now handled inline by RewriteConvPass.
RewriteSlicePass is repositioned after the permute optimisations.

Ethos-U55 partitioner simplified

The dual NCHW/NHWC permute constraint check is removed since tensors
are always in the expected layout at partition time.

Differential Revision: D100712787

pytorch-bot Bot commented Apr 21, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19015

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 1 Cancelled Job, 4 Unrelated Failures

As of commit 1736792 with merge base 89600b3:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed, but the failures were already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 21, 2026
meta-codesync Bot commented Apr 21, 2026

@mcremon-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100712787.

github-actions Bot commented:

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@3l1 3l1 added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm ciflow/trunk module: arm Issues related to arm backend labels Apr 21, 2026
@AdrianLundell (Collaborator) commented

Hi, great work! I have written some comments below, but overall I think it might be easier for us to land the PR and let you run a final internal check rather than vice versa, since we have somewhat easier access to CI and a broader scope in what the backend needs to handle (e.g. dim-order input). The resulting behaviour should be very similar from your perspective. Does that sound good to you?

  • I am getting an increased number of transposes on some graphs in our transpose-count suite, mainly due to the following issues:

    • Transposes are not fused across elementwise branching in all cases
    • Transposes are not fused in upward/downward forks

    As long as we don't see regressions on important models and have good regression tests, I'm fine with landing the current behaviour for now and following up with improvements later.

  • The override and beartype dependencies will have to be removed from permute_pass_utils

  • A number of tests are failing:

  • There is no reason to remove the various validations done for the max/avg-pool2d TOSA ops.

  • Some aesthetic nits:

    • The max_pool2d/avg_pool rewrite passes are completely rewritten as call-passes; a minimal diff that only inserts transposes in the call_operator pass would be preferable.
    • The way the new passes are introduced into the Arm backend differs from how we normally do it; it would be good to follow the general structure.
    • You can remove the ToTosaMemoryFormat and tosa.TRANSPOSE traces and the permute op helpers completely since they are not used anymore

Summary:
Pull Request resolved: #19002

Move 6 permute optimization passes and their shared infrastructure from
executorch/backends/cadence/aot/ to executorch/backends/transforms/ so
they can be shared between the Cadence and Arm backends without a
cross-backend dependency.

New files:
- permute_pass_utils.py: base classes (HierarchicalInplacePassInterface,
  RemoveOrReplacePassInterface, FuseOpPairsAcrossBranchesPass) and
  utilities (get_arg, set_arg, get_transposed_dims, get_permuted_dims,
  get_shape, get_edge_overload_packet)
- fuse_cascaded_transpose_or_permute_ops.py
- fuse_cascaded_view_ops.py
- fuse_transpose_or_permute_op_pairs_pass.py
- remove_permutes_around_elementwise_ops.py
- postpone_permute_below_squeeze_view.py
- replace_nop_transpose_or_permute_with_view.py

The shared versions omit register_cadence_pass decorators and
cadence-specific ops from default op sets. Cadence files will subclass
these and re-add the decorators and ops.
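As an illustration of the kind of utility collected in permute_pass_utils (the real helper names and signatures may differ), inverting a permutation is the core operation behind detecting cancelling permute pairs:

```python
def invert_permutation(order):
    """Return the order that undoes `order` (sketch of the sort of helper
    a shared permute-pass utility module would provide; the actual
    function names in permute_pass_utils.py may differ)."""
    inverse = [0] * len(order)
    for pos, dim in enumerate(order):
        inverse[dim] = pos
    return tuple(inverse)

inv = invert_permutation((0, 2, 3, 1))  # NCHW->NHWC
```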

Added OSS tests (test_permute_optimization_passes.py) for the 4 passes
that can be imported without quantized op registration:
FuseCascadedTransposeOrPermuteOps, FuseCascadedViewOps,
PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView, and
ReplaceNopTransposeOrPermuteWithViewPass. These run in GitHub CI via
pytest and are discovered automatically through pytest.ini testpaths.

Differential Revision: D101459577

Reviewed By: ethansfng
mcremon-meta added a commit that referenced this pull request Apr 22, 2026
@meta-codesync meta-codesync Bot changed the title Replace tosa_dim_order with explicit NCHW↔NHWC permutes Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015) Apr 22, 2026
return False
return not self._is_depthwise_conv2d(node)

def _handle_grouped_conv( # noqa: C901

@digantdesai (Contributor) commented Apr 22, 2026
This goes back to something we discussed in the meeting, here we are trying to live in the world where the expectation is everything which is not between two permutes is always NCHW. For a NHWC-only accelerator shouldn't it be other way around, during lowering? Is it too much to ask from users to do,

example_input = example_input.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)
model.forward(example_input)  # or export and lower

Or even do this after partitioning and assume its always channels last and deal with ops which are memory format sensitive?

See example

@mcremon-meta (Contributor, Author) replied:

Tensors do not know anything about layouts, right? They only know dimensions. If the graph tracer gives dim0 and the next op needs dim1, the compiler inserts a permute. At the output, same thing. This is the most consistent way we've handled those cases.
Only 2 sets of ops (convolutions and pooling) will need those, and if they're back to back or the ops between them are dimension-insensitive, then you should get exactly what you describe (the graph will have a permute on each input and a permute on each output) and everything else will cancel.
The way to achieve that is very important though, in my opinion.
Last thing I want to mention here: it's probably not too much to ask, but I think it's the wrong thing to ask. Users shouldn't even have to know they need to do this, if the flow behaves properly. But maybe that's a personal take :)

@digantdesai (Contributor) left a comment

LGTM, I will let @AdrianLundell stamp it, since he is also looking into this.

I do want to make sure though that we are at least handling NHWC model/inputs properly.

@mcremon-meta (Contributor, Author) replied:

> LGTM, I will let @AdrianLundell stamp it, since he is also looking into this.
>
> I do want to make sure though that we are at least handling NHWC model/inputs properly.

Agreed, I did have to fix that in the changes, and it basically adds a permute in that case. Will confirm locally that it's working as intended. That permute should then cancel out with the first conv/pool layer.


Labels

ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported module: arm Issues related to arm backend partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm
