Commit Graph

5684 Commits

Author SHA1 Message Date
Peilin Ye e78411948d tc/skbmod: Introduce SKBMOD_F_ECN option
Recently we added SKBMOD_F_ECN option support to the kernel; support it in
the tc-skbmod(8) front end, and update its man page accordingly.

The 2 least significant bits of the Traffic Class field in IPv4 and IPv6
headers are used to represent different ECN states [1]:

	0b00: "Non ECN-Capable Transport", Non-ECT
	0b10: "ECN Capable Transport", ECT(0)
	0b01: "ECN Capable Transport", ECT(1)
	0b11: "Congestion Encountered", CE

This new option, "ecn", marks ECT(0) and ECT(1) IPv{4,6} packets as CE,
which is useful for ECN-based rate limiting.  For example:

	$ tc filter add dev eth0 parent 1: protocol ip prio 10 \
		u32 match ip protocol 1 0xff flowid 1:2 \
		action skbmod \
		ecn
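
The marking rule described above can be sketched on a single Traffic Class
byte. This is only an illustration of the codepoint table, not the kernel
implementation; mark_ce() is a hypothetical helper name, and a real IPv4
packet would also need its header checksum updated.

```python
# ECN codepoints live in the two least significant bits of the
# IPv4 TOS / IPv6 Traffic Class byte.
ECN_MASK = 0b11
NOT_ECT = 0b00
ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11


def mark_ce(traffic_class: int) -> int:
    """Return the byte with ECN set to CE if it was ECT(0) or ECT(1)."""
    ecn = traffic_class & ECN_MASK
    if ecn in (ECT_0, ECT_1):
        return traffic_class | CE
    # Non-ECT packets and packets already marked CE are left unchanged.
    return traffic_class
```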

The updated tc-skbmod SYNOPSIS looks like the following:

	tc ... action skbmod { set SETTABLE | swap SWAPPABLE | ecn } ...

Only one of "set", "swap" or "ecn" shall be used in a single tc-skbmod
command.  Trying to use more than one of them at a time is considered
undefined behavior; pipe multiple tc-skbmod commands together instead.
"set" and "swap" only affect Ethernet packets, while "ecn" only affects
IP packets.

Depends on kernel patch "net/sched: act_skbmod: Add SKBMOD_F_ECN option
support", as well as iproute2 patch "tc/skbmod: Remove misinformation
about the swap action".

[1] https://en.wikipedia.org/wiki/Explicit_Congestion_Notification

Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-08-08 11:56:55 -06:00
David Ahern 09d8ce3db1 Merge branch 'main' into next
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-08-04 09:24:12 -06:00
David Ahern e8763fc9ab Merge branch 'ipv6-oam' into next
Justin Iurman says:

====================

The IOAM patchset was merged recently (see net-next commits [1,2,3,4,5,6]).
Therefore, this patchset provides support for IOAM inside iproute2, as well as
manpage documentation. Here is a summary of added features inside iproute2.

(1) configure IOAM namespaces and schemas:

$ ip ioam
Usage:  ip ioam { COMMAND | help }
        ip ioam namespace show
        ip ioam namespace add ID [ data DATA32 ] [ wide DATA64 ]
        ip ioam namespace del ID
        ip ioam schema show
        ip ioam schema add ID DATA
        ip ioam schema del ID
        ip ioam namespace set ID schema { ID | none }

(2) provide a new encap type to insert the IOAM pre-allocated trace:

$ ip -6 ro ad fc00::1/128 encap ioam6 trace prealloc type 0x800000 ns 1 size 12 dev eth0

  [1] db67f219fc9365a0c456666ed7c134d43ab0be8a
  [2] 9ee11f0fff205b4b3df9750bff5e94f97c71b6a0
  [3] 8c6f6fa6772696be0c047a711858084b38763728
  [4] 3edede08ff37c6a9370510508d5eeb54890baf47
  [5] de8e80a54c96d2b75377e0e5319a64d32c88c690
  [6] 968691c777af78d2daa2ee87cfaeeae825255a58

====================

Signed-off-by: David Ahern <dsahern@kernel.org>
2021-08-02 11:34:09 -06:00
Justin Iurman 78832863ef IOAM man8
This patch provides man8 documentation for IOAM inside ip, ip-ioam and ip-route.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-08-02 11:33:35 -06:00
Justin Iurman 32f4969d44 New IOAM6 encap type for routes
This patch provides a new encap type for routes to insert an IOAM pre-allocated
trace:

$ ip -6 ro ad fc00::1/128 encap ioam6 trace prealloc type 0x800000 ns 1 size 12 dev eth0

where:
 - "trace" and "prealloc" may look redundant, but they anticipate future
   implementations of other IOAM option types.
 - "type" is a bitfield (=u32) defining the IOAM pre-allocated trace type (see
   the corresponding uapi).
 - "ns" is an IOAM namespace ID attached to the pre-allocated trace.
 - "size" is the trace pre-allocated size in bytes; must be a 4-octet multiple;
   limited size (see IOAM6_TRACE_DATA_SIZE_MAX).
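
The constraints on "size" can be sketched as a small check. The helper name
trace_size_ok() is hypothetical, the upper limit is passed in by the caller
rather than hard-coded (see IOAM6_TRACE_DATA_SIZE_MAX in the uapi header),
and rejecting a size of 0 is an assumption.

```python
def trace_size_ok(size: int, size_max: int) -> bool:
    """Check that a pre-allocated trace size is a 4-octet multiple within the limit."""
    if size <= 0:          # assumption: an empty trace is not useful
        return False
    if size > size_max:    # limit comes from the uapi header, supplied by the caller
        return False
    return size % 4 == 0   # must be a multiple of 4 octets
```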

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-08-02 11:33:31 -06:00
Justin Iurman 2909812583 Add, show, link, remove IOAM namespaces and schemas
This patch provides support for adding, listing and removing IOAM namespaces
and schemas with iproute2. When adding an IOAM namespace, both "data" (=u32)
and "wide" (=u64) are optional. Therefore, you can either have none, one of
them, or both at the same time. When adding an IOAM schema, there is no
restriction on "DATA" except its size (see IOAM6_MAX_SCHEMA_DATA_LEN). By
default, an IOAM namespace has no active IOAM schema (meaning an IOAM namespace
is not linked to an IOAM schema), and an IOAM schema is not considered
as "active" (meaning an IOAM schema is not linked to an IOAM namespace). It is
possible to link an IOAM namespace with an IOAM schema, thanks to the last
command below (meaning the IOAM schema will be considered as "active" for the
specific IOAM namespace).

$ ip ioam
Usage:	ip ioam { COMMAND | help }
	ip ioam namespace show
	ip ioam namespace add ID [ data DATA32 ] [ wide DATA64 ]
	ip ioam namespace del ID
	ip ioam schema show
	ip ioam schema add ID DATA
	ip ioam schema del ID
	ip ioam namespace set ID schema { ID | none }

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-08-02 11:33:05 -06:00
David Ahern e53f4cd504 Import ioam6 uapi headers
Import ioam6 uapi headers from kernel headers at last sync commit.

Signed-off-by: David Ahern <dsahern@kernel.org>
2021-08-02 11:32:26 -06:00
David Ahern 236696e52c Update kernel headers
Update kernel headers to commit:
    1187c8c4642d ("net: phy: mscc: make some arrays static const, makes object smaller")

Signed-off-by: David Ahern <dsahern@kernel.org>
2021-08-02 10:25:09 -06:00
Gokul Sivakumar cf866f0a5a ipneigh: add support to print brief output of neigh cache in tabular format
Make use of the already available brief flag and print the basic details of
the IPv4 or IPv6 neighbour cache in a tabular format for better readability
when the brief output is expected.

$ ip -br neigh
172.16.12.100                           bridge0          b0:fc:36:2f:07:43
172.16.12.174                           bridge0          8c:16:45:2f:bc:1c
172.16.12.250                           bridge0          04:d9:f5:c1:0c:74
fe80::267b:9f70:745e:d54d               bridge0          b0:fc:36:2f:07:43
fd16:a115:6a62:0:8744:efa1:9933:2c4c    bridge0          8c:16:45:2f:bc:1c
fe80::6d9:f5ff:fec1:c74                 bridge0          04:d9:f5:c1:0c:74

And add "ip neigh show" to the list of ip sub commands mentioned in the man
page that support the brief output in tabular format.

Signed-off-by: Gokul Sivakumar <gokulkumar792@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-08-02 10:14:50 -06:00
Peilin Ye c06d313d86 tc/skbmod: Remove misinformation about the swap action
Currently man 8 tc-skbmod says that "...the swap action will occur after
any smac/dmac substitutions are executed, if they are present."

This is false.  In fact, trying to "set" and "swap" in a single skbmod
command causes the "set" part to be completely ignored.  As an example:

	$ tc filter add dev eth0 parent 1: protocol ip prio 10 \
		matchall action skbmod \
		set dmac AA:AA:AA:AA:AA:AA smac BB:BB:BB:BB:BB:BB \
		swap mac

The above command simply does a "swap", without setting DMAC or SMAC to
AA's or BB's.  The root cause of this is in the kernel, see
net/sched/act_skbmod.c:tcf_skbmod_init():

	parm = nla_data(tb[TCA_SKBMOD_PARMS]);
	index = parm->index;
	if (parm->flags & SKBMOD_F_SWAPMAC)
		lflags = SKBMOD_F_SWAPMAC;
		^^^^^^^^^^^^^^^^^^^^^^^^^^

Doing a "=" instead of "|=" clears all other "set" flags when doing a
"swap".  Discourage using "set" and "swap" in the same command by
documenting it as undefined behavior, and update the "SYNOPSIS" section
as well as tc -help text accordingly.
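
The effect of "=" versus "|=" can be reproduced with a simplified model of
the flag handling. The flag values and both function names below are
illustrative assumptions, not copied from the kernel sources.

```python
# Illustrative flag values; the authoritative ones are in the kernel uapi header.
SKBMOD_F_DMAC = 0x1
SKBMOD_F_SMAC = 0x2
SKBMOD_F_SWAPMAC = 0x8


def lflags_with_assign(requested: int) -> int:
    """Model of the current behaviour: the swap flag overwrites everything else."""
    lflags = requested & (SKBMOD_F_DMAC | SKBMOD_F_SMAC)
    if requested & SKBMOD_F_SWAPMAC:
        lflags = SKBMOD_F_SWAPMAC      # "=" drops the "set" bits collected so far
    return lflags


def lflags_with_or(requested: int) -> int:
    """Model of what "|=" would do: the swap flag is added to the others."""
    lflags = requested & (SKBMOD_F_DMAC | SKBMOD_F_SMAC)
    if requested & SKBMOD_F_SWAPMAC:
        lflags |= SKBMOD_F_SWAPMAC
    return lflags
```

With all three flags requested, the first model keeps only the swap bit,
which matches the "set" part being ignored.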

If one really needs to e.g. "set" DMAC to all AA's then "swap" DMAC and
SMAC, one should do two separate commands and "pipe" them together.

Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-22 15:14:29 -07:00
Roi Dayan 71d36000dc police: Fix normal output back to what it was
The JSON support change altered the normal (non-JSON) output.
Set it back to what it was.
Print overhead with print_size().
Print newline before ref.

Fixes: 0d5cf51e0d ("police: Add support for json output")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-17 11:14:30 -07:00
Lahav Schlesinger f760bff328 ipmonitor: Fix recvmsg with ancillary data
A successful call to recvmsg() causes msg.msg_controllen to contain the length
of the received ancillary data. However, the current code in the 'ip' utility
doesn't reset this value after each recvmsg().

This means that if a call to recvmsg() doesn't have ancillary data, then
'msg.msg_controllen' will be set to 0, causing future recvmsg() calls which do
contain ancillary data to get MSG_CTRUNC set in msg.msg_flags.

This fixes 'ip monitor' running with the all-nsid option. With this option the
kernel passes the nsid as ancillary data. If an event on the current netns is
received while 'ip monitor' is running, then no ancillary data will be sent,
causing 'msg.msg_controllen' to be set to 0, which causes 'ip monitor' to
indefinitely print "[nsid current]" instead of the real nsid.

Fixes: 449b824ad1 ("ipmonitor: allows to monitor in several netns")
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-17 11:13:36 -07:00
Stephen Hemminger 7a7e9ed98f uapi: headers update
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-17 11:12:47 -07:00
Christian Schürmann 1f2c908d53 man8/ip-tunnel.8: fix typo, 'encaplim' is not a valid option
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-15 09:31:51 -07:00
Alexander Mikhalitsyn 115e987035 libnetlink: check error handler is present before a call
Fix nullptr dereference of errhndlr from rtnl_dump_filter_arg
struct in rtnl_dump_done and rtnl_dump_error functions.

Fixes: 459ce6e3d7 ("ip route: ignore ENOENT during save if RT_TABLE_MAIN is being dumped")
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Roi Dayan <roid@nvidia.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Reported-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-11 10:33:44 -07:00
Stephen Hemminger 0015ada629 libnetlink: cosmetic changes
Don't initialize arguments that are NULL, and format initialization
in a more logical way.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-07 07:39:07 -07:00
Alexander Mikhalitsyn 459ce6e3d7 ip route: ignore ENOENT during save if RT_TABLE_MAIN is being dumped
We started to use the in-kernel filtering feature, which allows fetching
only the needed tables (see iproute_dump_filter()). On the kernel side it is
implemented in net/ipv4/fib_frontend.c (inet_dump_fib) and net/ipv6/ip6_fib.c
(inet6_dump_fib). The problem is that the behaviour of "ip route save"
changed after
c7e6371bc ("ip route: Add protocol, table id and device to dump request").
If filters are used, the kernel returns an ENOENT error when the requested
table is absent, but in a newly created net namespace even the RT_TABLE_MAIN
table doesn't exist. It is only allocated later, for instance after issuing
"ip l set lo up".

Reproducer is fairly simple:
$ unshare -n ip route save > dump
Error: ipv4: FIB table does not exist.
Dump terminated

Expected result here is to get empty dump file (as it was before this
change).

v2: reworked, so, now it takes into account NLMSGERR_ATTR_MSG
(see nl_dump_ext_ack_done() function). We want to suppress error messages
in stderr about absent FIB table from kernel too.

v3: reworked to make code clearer. Introduced rtnl_suppressed_errors(),
rtnl_suppress_error() helpers. User may suppress up to 3 errors (may be
easily extended by changing SUPPRESS_ERRORS_INIT macro).

v4: reworked, rtnl_dump_filter_errhndlr() was introduced. Thanks
to Stephen Hemminger for comments and suggestions

v5: space fixes, commit message reformat, empty initializers

Fixes: c7e6371bc ("ip route: Add protocol, table id and device to dump request")
Cc: David Ahern <dsahern@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-07 07:32:56 -07:00
Stephen Hemminger 8f85d085fe uapi: update kernel headers from 5.14-rc1
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-06 17:07:24 -07:00
Martynas Pumputis 83d4d61bc9 libbpf: fix attach of prog with multiple sections
When loading BPF programs which consist of multiple executable sections
via iproute2+libbpf (configured with LIBBPF_FORCE=on), we noticed that a
wrong section can be attached to a device. E.g.:

    # tc qdisc replace dev lxc_health clsact
    # tc filter replace dev lxc_health ingress prio 1 \
        handle 1 bpf da obj bpf_lxc.o sec from-container
    # tc filter show dev lxc_health ingress filter protocol all
        pref 1 bpf chain 0 filter protocol all pref 1 bpf chain 0
        handle 0x1 bpf_lxc.o:[__send_drop_notify] <-- WRONG SECTION
        direct-action not_in_hw id 38 tag 7d891814eda6809e jited

After taking a closer look into load_bpf_object() in lib/bpf_libbpf.c,
we noticed that the filter used in the program iterator does not check
whether a program section name matches a requested section name
(cfg->section). This can lead to a wrong prog FD being used to attach
the program.
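
The missing check can be sketched as follows. This is a toy model of the
iteration, not the libbpf API: programs are represented as hypothetical
(section name, fd) pairs and both function names are made up.

```python
from typing import Iterable, Optional, Tuple

Prog = Tuple[str, int]  # (section name, program fd); a simplified stand-in


def pick_first_prog(progs: Iterable[Prog]) -> Optional[int]:
    """Model of the faulty behaviour: take a program without comparing names."""
    for _section, fd in progs:
        return fd
    return None


def pick_prog_by_section(progs: Iterable[Prog], wanted: str) -> Optional[int]:
    """Model of the fix: only accept the program whose section name matches."""
    for section, fd in progs:
        if section == wanted:
            return fd
    return None
```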

Fixes: 6d61a2b557 ("lib: add libbpf support")
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Acked-by: Hangbin Liu <haliu@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-07-06 16:59:39 -07:00
David Ahern 02c06ffc13 Merge branch 'main' into next
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-07-01 14:29:42 +00:00
Stephen Hemminger fc3511962d lib: remove blank line at eof
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-06-29 13:20:44 -07:00
Stephen Hemminger 0e7ea3e8fe v5.13.0 2021-06-29 11:24:17 -07:00
Ben Hutchings 33cf9306c8 devlink: Fix printf() type mismatches on 32-bit architectures
devlink currently uses "%lu" to format values of type uint64_t,
but on 32-bit architectures uint64_t is defined as unsigned
long long and this does not work correctly.

Fix this by using the standard macro PRIu64 instead.

Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-06-29 11:10:14 -07:00
Ben Hutchings 4ac0383a59 utils: Fix BIT() to support up to 64 bits on all architectures
devlink and vdpa use BIT() together with 64-bit flag fields.  devlink
is already using bit numbers greater than 31 and so does not work
correctly on 32-bit architectures.

Fix this by making BIT() use uint64_t instead of unsigned long.
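
Why the type width matters can be modelled with plain integers. The sketch
below assumes an unsigned long that is 32 bits wide on the affected
architectures and simply discards bits that do not fit; in C the oversized
shift is undefined rather than guaranteed to yield 0, so this shows only that
the flag cannot be represented. bit_with_width() is a hypothetical name.

```python
def bit_with_width(nr: int, width: int) -> int:
    """Model BIT(nr) evaluated in an unsigned type that is `width` bits wide."""
    mask = (1 << width) - 1
    return (1 << nr) & mask  # bits beyond the type's width are lost
```

For example, bit 35 is lost in the 32-bit model but preserved in the 64-bit
one.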

Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-06-29 11:10:14 -07:00
Stephen Hemminger c73fb66070 uapi: update headers to 5.13
Final 5.13 header update

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-06-28 10:19:08 -07:00
Roi Dayan 6f15c21719 devlink: Fix link errors on some systems
On some systems we fail to link because the math library is missing.
Add -lm to devlink.

    LINK     devlink
../lib/libutil.a(utils_math.o): In function `get_rate':
utils_math.c:(.text+0xcc): undefined reference to `floor'
../lib/libutil.a(utils_math.o): In function `get_size':
utils_math.c:(.text+0x384): undefined reference to `floor'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:16: devlink] Error 1
make: *** [Makefile:64: all] Error 2

Fixes: 6c70aca76e ("devlink: Add port func rate support")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-06-26 14:57:27 -07:00
Asbjørn Sloth Tønnesen 2ff4761db4 tc: pedit: add decrement operation
Implement a decrement operation for ttl and hoplimit.

Since this is just syntactic sugar, it goes that:

  tc filter add ... action pedit ex munge ip ttl dec ...
  tc filter add ... action pedit ex munge ip6 hoplimit dec ...

is just a more readable version of this:

  tc filter add ... action pedit ex munge ip ttl add 0xff ...
  tc filter add ... action pedit ex munge ip6 hoplimit add 0xff ...
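
The equivalence follows from 8-bit wraparound arithmetic, since ttl and
hoplimit are both single-octet fields. A minimal sketch, with add_u8() as a
hypothetical helper that models the add operation on one octet:

```python
def add_u8(value: int, addend: int) -> int:
    """Add two values and keep only the low 8 bits, as an 8-bit field would."""
    return (value + addend) & 0xFF
```

Adding 0xff to any value from 1 to 255 yields that value minus one; a field
that is already 0 wraps around to 255.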

This feature was suggested by some pseudo tc examples in Mellanox's
documentation[1], but was present in neither their mlnx-iproute2
nor iproute2.

Tested with skip_sw on Mellanox ConnectX-6 Dx.

[1] https://docs.mellanox.com/pages/viewpage.action?pageId=47033989

v3:
   - Use dedicated flags argument in parse_cmd() (David Ahern)
   - Minor rewording of the man page

v2:
   - Fix whitespace issue (Stephen Hemminger)
   - Add to usage info in explain()

Signed-off-by: Asbjørn Sloth Tønnesen <asbjorn@asbjorn.st>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-26 04:45:19 +00:00
Asbjørn Sloth Tønnesen bc5e8473aa tc: pedit: parse_cmd: add flags argument
This patch just prepares the flags argument, so it's
available to the next patch.

Signed-off-by: Asbjørn Sloth Tønnesen <asbjorn@asbjorn.st>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-26 04:44:35 +00:00
Sergey Ryazanov 6acccd52a2 iplink: support for WWAN devices
The WWAN subsystem has been extended to generalize the management of
per-data-channel network interfaces. This change implements support for
handling WWAN links, and actively uses the earlier introduced ip-link
capability to specify the parent by its device name.

The WWAN interface for a new data channel should be created with a
command like this:

ip link add dev wwan0-2 parentdev wwan0 type wwan linkid 2

Where: wwan0 is the modem HW device name (should be taken from
/sys/class/wwan) and linkid is an identifier of the opened data
channel.

Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-26 04:40:57 +00:00
Sergey Ryazanov 362da458a4 iplink: add support for parent device
Add support for specifying a parent device (struct device) by its name
during the link creation and printing parent name in the links list.
This option will be used to create WWAN links and possibly by other
device classes that do not have a "natural parent netdev".

Add the parent device bus name printing for links list info
completeness. But do not add a corresponding command line argument, as
we do not have a use case for this attribute.

Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-26 04:40:22 +00:00
David Ahern 083e2706e1 Import wwan.h uapi file
Import wwan.h uapi file at version from last kernel headers sync.

Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-26 04:39:47 +00:00
Stephen Hemminger 8316825a52 man: fix syntax for ip link property
The ip link property add/delete requires a device; but the
device argument was not shown on the man page.
It is correct in the usage message.

Fixes: 3aa0e51be6 ("ip: add support for alternative name addition/deletion/list")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-06-24 11:54:04 -07:00
Paolo Lungaroni 3e26254f31 seg6: add support for SRv6 End.DT46 Behavior
We introduce the new "End.DT46" action for supporting the SRv6 End.DT46
Behavior in iproute2.
The SRv6 End.DT46 Behavior, defined in RFC 8986 [1] section 4.8, can be
used to implement L3 VPNs based on Segment Routing over IPv6 networks in
multi-tenant environments, and it is capable of handling both IPv4 and
IPv6 tenant traffic at the same time.
The SRv6 End.DT46 Behavior decapsulates the received packets and it
performs the IPv4 or IPv6 routing lookup in the routing table of the
tenant.

As for the End.DT4 and for the End.DT6 in VRF mode, the SRv6 End.DT46
Behavior leverages a VRF device in order to force the routing lookup into
the associated routing table using the "vrftable" attribute.

To make the End.DT46 work properly, it must be guaranteed that the
routing table used for routing lookup operations is bound to one and
only one VRF during the tunnel creation. Such constraint has to be
enforced by enabling the VRF strict_mode sysctl parameter, i.e.:

 $ sysctl -wq net.vrf.strict_mode=1

Note that the same approach is used for the End.DT4 Behavior and for the
End.DT6 Behavior in VRF mode.

An SRv6 End.DT46 Behavior instance can be created as follows:

 $ ip -6 route add 2001:db8::1 encap seg6local action End.DT46 vrftable 100 dev vrf100

Standard Output:
 $ ip -6 route show 2001:db8::1
 2001:db8::1  encap seg6local action End.DT46 vrftable 100 dev vrf100 metric 1024 pref medium

JSON Output:
$ ip -6 -j -p route show 2001:db8::1
[ {
        "dst": "2001:db8::1",
        "encap": "seg6local",
        "action": "End.DT46",
        "vrftable": 100,
        "dev": "vrf100",
        "metric": 1024,
        "flags": [ ],
        "pref": "medium"
} ]

This patch updates the route.8 man page and the ip route help with the
information related to End.DT46.
Considering that the same information was missing for the SRv6 End.DT4 and
the End.DT6 Behaviors, we have also added it.

[1] https://www.rfc-editor.org/rfc/rfc8986.html#name-enddt46-decapsulation-and-s

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Paolo Lungaroni <paolo.lungaroni@uniroma2.it>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-22 15:36:17 +00:00
David Ahern 1d11326a57 Update kernel headers
Update kernel headers to commit:
    ef2c3ddaa4ed ("ibmvnic: Use strscpy() instead of strncpy()")

Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-22 15:33:45 +00:00
Guillaume Nault f8879e85f0 utils: bump max args number to 512 for batch files
Large tc filters can have many arguments. For example the following
filter matches the first 7 MPLS LSEs, pops all of them, then updates
the Ethernet header and redirects the resulting packet to eth1.

filter add dev eth0 ingress handle 44 priority 100 \
  protocol mpls_uc flower mpls                     \
    lse depth 1 label 1040076 tc 4 bos 0 ttl 175   \
    lse depth 2 label 89648 tc 2 bos 0 ttl 9       \
    lse depth 3 label 63417 tc 5 bos 0 ttl 185     \
    lse depth 4 label 593135 tc 5 bos 0 ttl 67     \
    lse depth 5 label 857021 tc 0 bos 0 ttl 181    \
    lse depth 6 label 239239 tc 1 bos 0 ttl 254    \
    lse depth 7 label 30 tc 7 bos 1 ttl 237        \
  action mpls pop protocol mpls_uc pipe            \
  action mpls pop protocol mpls_uc pipe            \
  action mpls pop protocol mpls_uc pipe            \
  action mpls pop protocol mpls_uc pipe            \
  action mpls pop protocol mpls_uc pipe            \
  action mpls pop protocol mpls_uc pipe            \
  action mpls pop protocol ipv6 pipe               \
  action vlan pop_eth pipe                         \
  action vlan push_eth                             \
    dst_mac 00:00:5e:00:53:7e                      \
    src_mac 00:00:5e:00:53:03 pipe                 \
  action mirred egress redirect dev eth1

This filter has 149 arguments, so it can't be used with tc -batch
which is limited to 100.

Let's bump the limit to 512. That should leave a lot of room for big
batch commands.
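
The kind of limit being raised can be sketched as below. This is only an
approximation: it uses shlex for splitting, which may differ from how the
batch parser tokenizes a line, and MAX_ARGS, split_batch_line() and the
exact boundary condition are assumptions made for illustration.

```python
import shlex

MAX_ARGS = 512  # hypothetical constant name; the value is the new limit


def split_batch_line(line: str, max_args: int = MAX_ARGS) -> list:
    """Split one batch line into arguments and reject lines that exceed the limit."""
    args = shlex.split(line)
    if len(args) > max_args:
        raise ValueError("Too many arguments: %d > %d" % (len(args), max_args))
    return args
```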

v2:
   -Define the limit in utils.h (Stephen Hemminger)
   -Bump the limit even higher (256 -> 512) (Stephen Hemminger)

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-18 02:57:05 +00:00
Stephen Hemminger e1d3ac755d uapi: update kernel headers to 5.13-rc6
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-06-17 15:54:05 -07:00
David Ahern d8b3b9d32d Merge branch 'devlink-rate-support' into next
Dmytro Linkin says:

====================

Series implements devlink rate commands, which are:
- Dump particular or all rate objects (JSON or non-JSON)
- Add/Delete node rate object
- Set tx rate share/max values for rate object
- Set/Unset parent rate object for other rate object

Examples:

Display all rate objects:

    # devlink port function rate show
    pci/0000:03:00.0/1 type leaf parent some_group
    pci/0000:03:00.0/2 type leaf tx_share 12Mbit
    pci/0000:03:00.0/some_group type node tx_share 1Gbps tx_max 5Gbps

Display leaf rate object bound to the 1st devlink port of the
pci/0000:03:00.0 device:

    # devlink port function rate show pci/0000:03:00.0/1
    pci/0000:03:00.0/1 type leaf

Display node rate object with name some_group of the pci/0000:03:00.0
device:

    # devlink port function rate show pci/0000:03:00.0/some_group
    pci/0000:03:00.0/some_group type node

Display leaf rate object rate values using IEC units:

    # devlink -i port function rate show pci/0000:03:00.0/2
    pci/0000:03:00.0/2 type leaf 11718Kibit

Display pci/0000:03:00.0/2 leaf rate object as pretty JSON output:

    # devlink -jp port function rate show pci/0000:03:00.0/2
    {
        "rate": {
            "pci/0000:03:00.0/2": {
                "type": "leaf",
                "tx_share": 1500000
            }
        }
    }

Create node rate object with name "1st_group" on pci/0000:03:00.0 device:

    # devlink port function rate add pci/0000:03:00.0/1st_group

Create node rate object with specified parameters:

    # devlink port function rate add pci/0000:03:00.0/2nd_group \
        tx_share 10Mbit tx_max 30Mbit parent 1st_group

Set parameters to the specified leaf rate object:

    # devlink port function rate set pci/0000:03:00.0/1 \
        tx_share 2Mbit tx_max 10Mbit

Set leaf's parent to "1st_group":

    # devlink port function rate set pci/0000:03:00.0/1 parent 1st_group

Unset leaf's parent:

    # devlink port function rate set pci/0000:03:00.0/1 noparent

Delete node rate object:

    # devlink port function rate del pci/0000:03:00.0/2nd_group

Rate values can be specified in bits or bytes per second (bit|bps), with
any SI (k, m, g, t) or IEC (ki, mi, gi, ti) prefix. A bare number means
bits per second. Units are also printed in "show" command output, but not
necessarily the same ones that were specified with the "set" or "add" command.
The -i/--iec switch forces output in IEC units. JSON output always prints
values as bytes per sec.
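
The unit rules in this paragraph can be sketched as a small parser. It is a
simplified model of the described syntax, not the devlink parser itself; the
function names are hypothetical and prefixes are assumed to be
case-insensitive.

```python
import re

# Decimal (SI) and binary (IEC) multipliers named in the text above.
MULTIPLIERS = {
    "": 1,
    "k": 1000, "m": 1000**2, "g": 1000**3, "t": 1000**4,
    "ki": 1024, "mi": 1024**2, "gi": 1024**3, "ti": 1024**4,
}

_RATE_RE = re.compile(
    r"^(\d+(?:\.\d+)?)(ki|mi|gi|ti|k|m|g|t)?(bit|bps)?$", re.IGNORECASE
)


def rate_to_bytes_per_sec(text: str) -> int:
    """Convert a rate such as '12Mbit' or '5MBps' to bytes per second."""
    match = _RATE_RE.match(text.strip())
    if match is None:
        raise ValueError("cannot parse rate: %r" % text)
    number, prefix, unit = match.groups()
    value = float(number) * MULTIPLIERS[(prefix or "").lower()]
    if (unit or "bit").lower() == "bit":  # a bare number means bits per second
        value /= 8
    return int(value)


def bytes_per_sec_to_kibit(rate: int) -> int:
    """Express a bytes-per-second rate in whole Kibit per second."""
    return rate * 8 // 1024
```

Under these assumptions, 12Mbit is 1500000 bytes per second and 11718 Kibit
per second, which agrees with the JSON and IEC outputs shown in the examples.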

====================

Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-12 04:38:34 +00:00
Dmytro Linkin dedf895184 devlink: Add ISO/IEC switch
Add -i/--iec switch to print rate values using binary prefixes.
Update devlink(8) and devlink-rate(8) pages.

Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-12 04:38:13 +00:00
Dmytro Linkin 6c70aca76e devlink: Add port func rate support
Implement user commands to manage devlink port func rate objects.
List all rate commands:

    $ devlink port func rate help

or just

    $ devlink port func rate

To list all rate objects or a particular one:

    $ devlink port func rate show
    pci/0000:03:00.0/some_group: type node
    pci/0000:03:00.0/0: type leaf
    pci/0000:03:00.0/1: type leaf

    $ devlink port func rate show pci/0000:03:00.0/1
    pci/0000:03:00.0/1: type leaf

    $ devlink port func rate show pci/0000:03:00.0/some_group
    pci/0000:03:00.0/some_group: type node

A rate object of type "leaf" is created by its driver, and its name is the
name of the corresponding devlink port. A rate object of type "node" represents
a rate group created by the user using commands:

    $ devlink port func rate add pci/0000:03:00.0/some_group

or with defining tx rate limits

    $ devlink port func rate add pci/0000:03:00.0/some_group \
        tx_share 10kbit tx_max 100mbit

NOTE: node name cannot be a decimal value because it conflicts with
devlink port indexes.
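
The naming rule can be sketched as a classifier on the last component of the
handle. This is an illustration under the assumption that a purely decimal
name always denotes a port index; rate_object_kind() is a hypothetical name.

```python
def rate_object_kind(handle: str) -> str:
    """Classify 'bus/device/name' as 'leaf' (port index) or 'node' (group name)."""
    _bus, _device, name = handle.split("/", 2)
    # A purely decimal name is read as a devlink port index, hence a leaf.
    if name.isascii() and name.isdigit():
        return "leaf"
    return "node"
```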

To delete node object:

    $ devlink port func rate del pci/0000:03:00.0/some_group

Set rate limits of existing rate object:

    $ devlink port func rate set pci/0000:03:00.0/0 \
        tx_share 5MBps tx_max 25GBps

    $ devlink port func rate set pci/0000:03:00.0/some_group \
        tx_share 0

Both SET and ADD commands accept any units of rates defined in IEC
60027-2 standard.

NOTE: rate value 0 means that rate is unlimited. Such value is also
omitted in show command output.

NOTE: In SHOW command output rate values will be printed with suffixes
as well, but in JSON output they are always units of Bps.

Set or unset parent of existing rate object:

    $ devlink port func rate set pci/0000:03:00.0/0 parent some_group

    $ devlink port func rate set pci/0000:03:00.0/0 noparent

NOTE: Due to kernel logic, setting the parent to an empty ("") name unsets
the parent. It shouldn't be used, to avoid unexpected parent unsets.

Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-12 04:38:06 +00:00
Dmytro Linkin 95339955c5 devlink: Add helper function to validate object handler
Every handler argument is validated in two steps. The first step, form
checking, expects the identifier to be a few words separated by slashes.
For device and region handlers, it only checks that the identifier has the
expected number of slashes.
Add generic function to do that and make code cleaner & consistent.
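
The form check amounts to counting slashes. A minimal sketch of such a
generic helper, with a hypothetical name and the expected counts inferred
from the handle examples in this log (one slash for a device, two for a
port or rate object):

```python
def has_expected_slashes(identifier: str, expected: int) -> bool:
    """Form check: the identifier must contain exactly `expected` slashes."""
    return identifier.count("/") == expected
```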

Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-12 04:37:21 +00:00
David Ahern 85903c9a29 Update kernel headers
Update kernel headers to commit:
    76cf404c40ae ("Merge branch 'ipa-mem-2'")

Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-11 02:38:23 +00:00
Parav Pandit fbd4b581cb devlink: Add optional controller user input
A user can optionally provide the external controller number when
creating a devlink port for an external controller.

An example on eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-11 02:28:49 +00:00
Roi Dayan 0d5cf51e0d police: Add support for json output
Change to use the print wrappers instead of fprintf().

This is example output of the options part before this commit:

        "options": {
            "handle": 1,
            "in_hw": true,
            "actions": [ {
                    "order": 1 police 0x2 ,
                    "control_action": {
                        "type": "drop"
                    },
                    "control_action": {
                        "type": "continue"
                    }overhead 0b linklayer unspec
        ref 1 bind 1
,
                    "used_hw_stats": [ "delayed" ]
                } ]
        }

This is the output of the same dump with this commit:

        "options": {
            "handle": 1,
            "in_hw": true,
            "actions": [ {
                    "order": 1,
                    "kind": "police",
                    "index": 2,
                    "control_action": {
                        "type": "drop"
                    },
                    "control_action": {
                        "type": "continue"
                    },
                    "overhead": 0,
                    "linklayer": "unspec",
                    "ref": 1,
                    "bind": 1,
                    "used_hw_stats": [ "delayed" ]
                } ]
        }
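
Note that the sample output above still carries two "control_action" keys in
the same object. A minimal sketch, assuming a consumer reads this output with
Python's json module: by default the last duplicate wins, while an
object_pairs_hook can keep both. The SAMPLE string is adapted from the output
above, and keep_duplicates is a hypothetical helper, not part of iproute2.

```python
import json

# Fragment adapted from the "after" output above, wrapped in braces so it
# parses as a standalone JSON document.
SAMPLE = """
{
    "options": {
        "handle": 1,
        "in_hw": true,
        "actions": [ {
                "order": 1,
                "kind": "police",
                "index": 2,
                "control_action": {
                    "type": "drop"
                },
                "control_action": {
                    "type": "continue"
                },
                "overhead": 0,
                "linklayer": "unspec",
                "ref": 1,
                "bind": 1,
                "used_hw_stats": [ "delayed" ]
            } ]
    }
}
"""


def keep_duplicates(pairs):
    """Build a dict; values of a repeated key are collected into a list."""
    result = {}
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + 1
        if counts[key] == 1:
            result[key] = value
        elif counts[key] == 2:
            result[key] = [result[key], value]
        else:
            result[key].append(value)
    return result


# Default parsing: the last "control_action" silently replaces the first.
plain_action = json.loads(SAMPLE)["options"]["actions"][0]

# With the hook: both "control_action" objects survive, in document order.
full_action = json.loads(
    SAMPLE, object_pairs_hook=keep_duplicates
)["options"]["actions"][0]

print(plain_action["control_action"])
print(full_action["control_action"])
```

Scripts that need both control actions should not rely on default
dictionary-based parsing of this object.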

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-11 02:28:36 +00:00
Eric Dumazet 52f136f640 tc: fq: add horizon attributes
Commit 39d010504e6b ("net_sched: sch_fq: add horizon attribute")
added kernel support for horizon attributes in Linux 5.8.

$ tc -s -d qd sh dev wlp2s0
qdisc fq 8006: root refcnt 2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 690924 bytes 3234 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 112 (inactive 104 throttled 0)
  gc 0 highprio 0 throttled 2 latency 8.25us

$ tc qd change dev wlp2s0 root fq horizon 500ms horizon_cap

$ tc -s -d qd sh dev wlp2s0
qdisc fq 8006: root refcnt 2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 500ms horizon_cap
 Sent 831220 bytes 3844 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 122 (inactive 120 throttled 0)
  gc 0 highprio 0 throttled 2 latency 8.25us
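
The difference between horizon_cap and horizon_drop can be sketched as
follows. This is an illustration only, assuming (from the kernel side, not
from this commit message) that a packet whose departure timestamp lies beyond
now + horizon is either dropped or has its timestamp capped to that limit.
apply_horizon and its parameters are hypothetical names.

```python
def apply_horizon(now_ns, tstamp_ns, horizon_ns, horizon_drop):
    """Return the departure time to use for a packet, or None if dropped.

    Assumption: a packet is "beyond the horizon" when its requested
    departure time is later than now + horizon.
    """
    limit = now_ns + horizon_ns
    if tstamp_ns <= limit:
        # Within the horizon: keep the requested departure time.
        return tstamp_ns
    if horizon_drop:
        # horizon_drop: discard packets scheduled too far in the future.
        return None
    # horizon_cap: keep the packet but pull its departure time back.
    return limit


MS = 1_000_000  # nanoseconds per millisecond

# A packet asking to leave 2 s from now, with a 500 ms horizon.
print(apply_horizon(0, 2000 * MS, 500 * MS, horizon_drop=True))
print(apply_horizon(0, 2000 * MS, 500 * MS, horizon_drop=False))
```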

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-07 02:56:01 +00:00
Hangbin Liu 7ae2585b86 configure: convert LIBBPF environment variables to command-line options
Signed-off-by: Hangbin Liu <haliu@redhat.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-03 03:25:59 +00:00
Hangbin Liu a9c3d70d90 configure: add options ability
More and more global environment variables are scattered throughout
configure, which makes it hard for users to know which one does what.
Command-line options would make it easier for users to learn and
remember the configuration options.

This patch starts by converting the INCLUDE variable to a command-line
option. For compatibility with the old way of setting the INCLUDE path,
configure checks whether the first argument starts with '-'.
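
The compatibility check can be sketched roughly as below. configure itself is
a shell script; this Python version only illustrates the logic. The function
name parse_configure_args, the default path, and the option name
--include_dir are assumptions, not taken from this commit message.

```python
def parse_configure_args(argv, default_include="include"):
    """Return the include path chosen by the given argument list.

    Legacy form:  configure PATH
    Option form:  configure --include_dir PATH   (option name assumed)
    """
    # Old method: the first argument does not start with '-', so treat it
    # as the INCLUDE path.
    if argv and not argv[0].startswith("-"):
        return argv[0]

    include = default_include
    i = 0
    while i < len(argv):
        if argv[i] == "--include_dir":
            if i + 1 >= len(argv):
                raise ValueError("--include_dir requires a path")
            include = argv[i + 1]
            i += 2
        else:
            raise ValueError(f"unknown option: {argv[i]}")
    return include


print(parse_configure_args(["/usr/src/linux/include"]))
print(parse_configure_args(["--include_dir", "/usr/src/linux/include"]))
print(parse_configure_args([]))
```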

Signed-off-by: Hangbin Liu <haliu@redhat.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-06-03 03:25:11 +00:00
Roman Mashak 9d9b1a84a5 ss: update ss man page
The '-b' option requests BPF filter opcodes. However, the kernel
currently returns only classic BPF filters, so reflect this in the
man page.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-06-01 15:55:06 -07:00
Ariel Levkovich 825bd5dacb tc: f_flower: Add missing ct_state flags to usage description
Add the ct_state flags rpl and inv to the command's usage
description.

Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-05-27 14:40:05 +00:00
Ariel Levkovich 7fda6c588a tc: f_flower: Add option to match on related ct state
Add support for matching on the "related" ct_state flag.
The related state indicates a packet is associated with an existing
connection.

Example:
$ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \
  ct_state -est-rel+trk \
  action mirred egress redirect dev ens1f0_1

$ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \
  ct_state +rel+trk \
  action mirred egress redirect dev ens1f0_1
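
The ct_state strings in the examples above can be read as a key/mask pair:
a '+' prefix requires the flag to be set and a '-' prefix requires it to be
clear. The sketch below illustrates that reading; it is not the tc parser.
The bit values in CT_FLAGS and the helper names are illustrative assumptions.

```python
import re

# Illustrative bit assignments (assumption, for demonstration only).
CT_FLAGS = {
    "new": 1 << 0,
    "est": 1 << 1,
    "rel": 1 << 2,
    "trk": 1 << 3,
    "inv": 1 << 4,
    "rpl": 1 << 5,
}


def parse_ct_state(spec):
    """Turn a string such as '-est-rel+trk' into a (key, mask) pair."""
    if not re.fullmatch(r"(?:[+-][a-z]+)+", spec):
        raise ValueError(f"malformed ct_state: {spec!r}")
    key = 0
    mask = 0
    for sign, name in re.findall(r"([+-])([a-z]+)", spec):
        if name not in CT_FLAGS:
            raise ValueError(f"unknown ct_state flag: {name!r}")
        bit = CT_FLAGS[name]
        mask |= bit          # every named flag takes part in the match
        if sign == "+":
            key |= bit       # '+' means the flag must be set
    return key, mask


def matches(state, key, mask):
    """True if the packet's ct state satisfies the key/mask pair."""
    return (state & mask) == key


key, mask = parse_ct_state("+rel+trk")
tracked_related = CT_FLAGS["trk"] | CT_FLAGS["rel"]
print(matches(tracked_related, key, mask))
```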

Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
2021-05-27 14:39:14 +00:00
Florian Westphal d3740fdc26 libgenl: make genl_add_mcast_grp set errno on error
genl_add_mcast_grp doesn't set errno in all cases.

On kernels that support mptcp but lack event support (all kernels <= 5.11),
MPTCP_PM_EV_GRP_NAME won't be found, and ip will exit with:

    "can't subscribe to mptcp events: Success"

Set errno to a meaningful value (ENOENT) when the group name isn't found
and also cover other spots where it returns nonzero with errno unset.
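
The misleading "Success" text follows from formatting an error message while
errno is still 0. A minimal sketch of the failure mode and the fix, in Python
for illustration (the real code is C): find_group_buggy, find_group_fixed and
describe are hypothetical helpers, and the exact text for error code 0
depends on the C library (glibc reports "Success").

```python
import errno
import os


def find_group_buggy(groups, name):
    """Return (result, err); on failure err is left at 0, as before the fix."""
    if name in groups:
        return groups[name], 0
    return -1, 0


def find_group_fixed(groups, name):
    """Return (result, err); on failure err is set to ENOENT."""
    if name in groups:
        return groups[name], 0
    return -1, errno.ENOENT


def describe(err):
    """Format the message the way a perror-style caller would."""
    return f"can't subscribe to mptcp events: {os.strerror(err)}"


# A kernel that knows about mptcp but exposes no event group.
groups = {}

ret, err = find_group_buggy(groups, "mptcp_pm_events")
print(ret, describe(err))   # failure reported with the text for error 0

ret, err = find_group_fixed(groups, "mptcp_pm_events")
print(ret, describe(err))   # failure reported with the ENOENT text
```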

Fixes: ff619e4fd3 ("mptcp: add support for event monitoring")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2021-05-17 11:59:37 -07:00