The new attribute indicates that the kernel copies DMA pages on fork,
hence fork support through madvise and MADV_DONTFORK is not needed.
If the attribute is not reported (expected on older kernels),
copy-on-fork is disabled.
Example:
$ rdma sys
netns shared copy-on-fork on
Signed-off-by: Gal Pressman <galpress@amazon.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
When an address is added with a port, the intent is to send an ADD_ADDR
to the remote peer, so the signal flag must be set.
Fixes: 42fbca91cd ("mptcp: add support for port based endpoint")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David Ahern <dsahern@kernel.org>
The default behavior for source MACVLAN is to duplicate packets to
appropriate type source devices, and then do the normal destination MACVLAN
flow. This patch adds an option to skip destination MACVLAN processing if
any matching source MACVLAN device has the option set.
This allows setting up a "catch all" device for source MACVLAN: create one
or more devices with type source nodst, and one device with e.g. type vepa,
and incoming traffic will be received on exactly one device.
Signed-off-by: Jethro Beekman <kernel@jbeekman.nl>
Signed-off-by: David Ahern <dsahern@kernel.org>
Leon Romanovsky says:
====================
This is the user space part of a series already accepted into the kernel
that extends the RDMA netlink interface to return uverbs context and SRQ
information.
The accepted kernel series can be seen here:
https://lore.kernel.org/linux-rdma/20210422133459.GA2390260@nvidia.com/
====================
Signed-off-by: David Ahern <dsahern@kernel.org>
Sample output:
$ rdma res show srq
dev ibp8s0f0 srqn 0 type BASIC pdn 3 comm [ib_ipoib]
dev ibp8s0f0 srqn 4 type BASIC lqpn 125-128,130-140 pdn 9 pid 3581 comm ibv_srq_pingpon
dev ibp8s0f0 srqn 5 type BASIC lqpn 141-156 pdn 10 pid 3584 comm ibv_srq_pingpon
dev ibp8s0f0 srqn 6 type BASIC lqpn 157-172 pdn 11 pid 3590 comm ibv_srq_pingpon
dev ibp8s0f1 srqn 0 type BASIC pdn 3 comm [ib_ipoib]
dev ibp8s0f1 srqn 1 type BASIC lqpn 329-344 pdn 4 pid 3586 comm ibv_srq_pingpon
$ rdma res show srq lqpn 126-141
dev ibp8s0f0 srqn 4 type BASIC lqpn 126-128,130-140 pdn 9 pid 3581 comm ibv_srq_pingpon
dev ibp8s0f0 srqn 5 type BASIC lqpn 141 pdn 10 pid 3584 comm ibv_srq_pingpon
$ rdma res show srq lqpn 127
dev ibp8s0f0 srqn 4 type BASIC lqpn 127 pdn 9 pid 3581 comm ibv_srq_pingpon
Reviewed-by: Ido Kalir <idok@nvidia.com>
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Sample output:
$ rdma res show ctx
dev ibp8s0f0 ctxn 0 pid 980 comm ibv_rc_pingpong
dev ibp8s0f0 ctxn 1 pid 981 comm ibv_rc_pingpong
dev ibp8s0f0 ctxn 2 pid 992 comm ibv_rc_pingpong
dev ibp8s0f1 ctxn 0 pid 984 comm ibv_rc_pingpong
dev ibp8s0f1 ctxn 1 pid 987 comm ibv_rc_pingpong
$ rdma res show ctx dev ibp8s0f1
dev ibp8s0f1 ctxn 0 pid 984 comm ibv_rc_pingpong
dev ibp8s0f1 ctxn 1 pid 987 comm ibv_rc_pingpong
Reviewed-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Update kernel headers to commit:
99ba0ea616aa ("sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues")
Signed-off-by: David Ahern <dsahern@kernel.org>
In bpf_{send,recv}_map_fds(), when connect() fails after the socket has
been opened successfully, the functions return an error without closing
the socket.
Fix this by closing the socket if it was opened, and by using a single
return point in both functions.
Fixes: 6256f8c9e4 ("tc, bpf: finalize eBPF support for cls and act front-end")
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
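The single-return-point pattern described in the message above can be
sketched as follows. This is a minimal illustration, not the iproute2
code: probe_unix_socket() and its behaviour are hypothetical, and only
the cleanup structure is the point.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper (not an iproute2 function): try to connect to a
 * UNIX datagram socket at 'path'. Returns 0 on success, -1 on failure. */
int probe_unix_socket(const char *path)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd, ret = -1;

	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (fd < 0)
		goto out;

	/* addr is zero-initialized, so the path stays NUL-terminated. */
	strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;	/* the socket is still closed below */

	ret = 0;
out:
	/* Single exit: close the socket only if it was actually opened. */
	if (fd >= 0)
		close(fd);
	return ret;
}
```

Every path, including the connect() failure, goes through the same
label, so the descriptor cannot be forgotten when a new error path is
added later.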
As stated in its man page, open() returns a non-negative integer as a
file descriptor. Hence, when checking whether its return value is ok, we
should include 0 as a valid value.
This fixes a covscan warning about a missing close() in this function.
Fixes: ecb05c0f99 ("bpf: improve error reporting around tail calls")
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
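A minimal sketch of the return value check discussed above. The helper
probe_file() is hypothetical and only shows the comparison; it is not
the function touched by the commit.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if 'path' can be opened for reading,
 * -1 otherwise. */
int probe_file(const char *path)
{
	int fd = open(path, O_RDONLY);

	/* 0 is a valid descriptor (for example when stdin was closed
	 * earlier), so the failure test is "< 0", not "<= 0". With "<= 0"
	 * a descriptor equal to 0 would be treated as an error and leaked. */
	if (fd < 0)
		return -1;

	close(fd);
	return 0;
}
```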
envp_run is dynamically allocated with malloc() and not freed in the
out: return path. This commit fixes it.
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
In functions netns_pids() and netns_identify_pid(), the netns file is
not closed on some error paths.
Fix this by using a conditional close and a single return point in both
functions.
Fixes: 44b563269e ("ip-nexthop: support flush by id")
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
When I added support for new vlan rtm dumping, I made a mistake in the
output format when there are no vlans on the port. This patch fixes it by
not printing ports without vlan entries (similar to current situation).
Example (no vlans):
$ bridge -d vlan show
port vlan-id
Fixes: e5f87c8341 ("bridge: vlan: add support for the new rtm dump call")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
The 'ip' utility hardcodes the assumption of being a 2-char command, where
any follow-on characters are passed as an argument:
$ ./ip-full help
Object "-full" is unknown, try "ip help".
This confusing behaviour isn't seen with 'tc' for example, and was added in
a 2005 commit without documentation. It was noticed during testing of 'ip'
variants built/packaged with different feature sets (e.g. w/o BPF support).
Mitigate the problem by redoing the command without the 2-char assumption
if the follow-on characters fail to parse as a valid command.
Fixes: 351efcde4e ("Update header files to 2.6.14")
Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
The build of iproute2 relies on having a correct copy of the sanitized
kernel headers. The vdpa utility introduced a dependency on
the vdpa related headers, but these headers were not present
in the iproute2 repo.
Fixes: c2ecc82b9d ("vdpa: Add vdpa tool")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The function get_task_name() is used to get the name of a process from
its pid, and its implementation is similar to ip/iptuntap.c:pid_name().
Move it to lib/fs.c to use a single implementation and make it easily
reusable.
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Nikolay Aleksandrov says:
====================
From: Nikolay Aleksandrov <nikolay@nvidia.com>
This set extends the bridge vlan code to use the new vlan RTM calls,
which allow dumping detailed per-port, per-vlan information and
manipulating the per-vlan options. It also allows monitoring any vlan
changes (add/del/option change). The rtm vlan dumps have an extensible
format which allows us to add new options and attributes easily, and
also to request the kernel to filter on different vlan information when
dumping. The new kernel dump code tries to use compressed vlan format as
much as possible (it includes netlink attributes for vlan start and
end) to reduce the number of generated messages and netlink traffic.
The iproute2 support is activated by using the "-d" flag when showing
vlan information. That causes it to use the new rtm dump call and get
all the detailed information. If "-s" is also specified, it dumps
per-vlan statistics as well; obviously, in that case the vlans cannot be
compressed. To change per-vlan options (currently only STP state is
supported) a new vlan command is added: "set". It can be used to set
options of bridge or port vlans, and vlan ranges can be used. All of the
new vlan option code uses extack to show more understandable errors.
The set adds the first supported per-vlan option - STP state.
Man pages and usage information are updated accordingly.
Example:
$ bridge -d vlan show
port vlan-id
ens13 1 PVID Egress Untagged
state forwarding
bridge 1 PVID Egress Untagged
state forwarding
$ bridge vlan set vid 1 dev ens13 state blocking
$ bridge -d vlan show
port vlan-id
ens13 1 PVID Egress Untagged
state blocking
bridge 1 PVID Egress Untagged
state forwarding
====================
Signed-off-by: David Ahern <dsahern@kernel.org>
Add support for vlan activity monitoring: vlan notifications are
displayed on vlan add/del/option changes. The man page and help are also
updated accordingly.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Use the new bridge vlan rtm dump helper to dump all of the available
vlan information when -details (-d) is used with vlan show. It is also
capable of dumping vlan stats if -statistics (-s) is added.
Currently this is the only interface capable of dumping per-vlan
options. The vlan dump format is compatible with the current vlan show,
as it uses the same helpers to dump vlan information. The new addition is one
line which will contain the per-vlan options (similar to ip -d link show
for ports). Currently only the vlan STP state is printed.
The call uses compressed vlan format by default.
Example:
$ bridge -s -d vlan show
port vlan-id
virbr1 1 PVID Egress Untagged
state forwarding
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Add rtnl bridge vlan dump request helper which will be used to retrieve
bridge vlan information and options.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Add a new per-vlan option set command. It allows manipulating vlan
options, which can be bridge-wide or per-port depending on which device
is specified. The first option that can be set is the vlan STP state,
which is identical to the bridge port STP state. The man page is also
updated accordingly.
Example:
$ bridge vlan set vid 10 dev br0 state learning
or a range:
$ bridge vlan set vid 10-20 dev swp1 state blocking
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Add a helper which parses an STP state string to its numeric value.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
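A sketch of what such a string-to-value helper can look like. The table
below follows the kernel's BR_STATE_* numbering (disabled 0, listening 1,
learning 2, forwarding 3, blocking 4); the function name and the
case-sensitive matching are assumptions made for this example, not
necessarily what the patch does.

```c
#include <stddef.h>
#include <string.h>

/* Index in this table is the numeric STP state (BR_STATE_* ordering). */
static const char *const stp_state_names[] = {
	"disabled", "listening", "learning", "forwarding", "blocking",
};

/* Hypothetical helper: returns the numeric state for 'arg', or -1 if
 * the string does not name a state. Matching is case-sensitive here. */
int stp_state_from_string(const char *arg)
{
	size_t i;

	for (i = 0; i < sizeof(stp_state_names) / sizeof(stp_state_names[0]); i++)
		if (strcmp(stp_state_names[i], arg) == 0)
			return (int)i;
	return -1;
}
```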
Rename print_portstate to print_stp_state in preparation for use by vlan
code as well (per-vlan state), and export it. To be in line with the new
naming rename also port_states to stp_states as they'll be used for
vlans, too.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
This adds iproute2 support for mptcp event monitoring, e.g. creation,
establishment, address announcements from the peer, subflow establishment
and so on.
While the kernel-generated events are primarily aimed at mptcpd (e.g. for
subflow management), this is also useful for debugging.
This adds print support for the existing events.
Sample output of 'ip mptcp monitor':
[ CREATED] token=83f3a692 remid=0 locid=0 saddr4=10.0.1.2 daddr4=10.0.1.1 sport=58710 dport=10011
[ ESTABLISHED] token=83f3a692 remid=0 locid=0 saddr4=10.0.1.2 daddr4=10.0.1.1 sport=58710 dport=10011
[SF_ESTABLISHED] token=83f3a692 remid=0 locid=1 saddr4=10.0.2.2 daddr4=10.0.1.1 sport=40195 dport=10011 backup=0
[ CLOSED] token=83f3a692
Signed-off-by: Florian Westphal <fw@strlen.de>
libmnl defines MNL_CB_OK as 1 and MNL_CB_ERROR as -1. rdma uses these
return codes, and stat_qp_show_parse_cb() should do the same.
Fixes: 16ce4d2366 ("rdma: stat: initialize ret in stat_qp_show_parse_cb()")
Reported-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
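The return code convention above can be illustrated without libmnl. The
constants below are local stand-ins carrying the values quoted in the
message; parse_entries() and reject_negative() are hypothetical and only
show why the result is initialized to the "ok" code.

```c
#include <stddef.h>

/* Local stand-ins with the values the message quotes for libmnl. */
enum { CB_ERROR = -1, CB_OK = 1 };

typedef int (*entry_cb)(int entry);

/* Run 'cb' over each entry and stop at the first one that fails. */
int parse_entries(const int *entries, size_t n, entry_cb cb)
{
	int ret = CB_OK;	/* well-defined result even when n == 0 */
	size_t i;

	for (i = 0; i < n; i++) {
		ret = cb(entries[i]);
		if (ret != CB_OK)
			break;
	}
	return ret;
}

/* Example callback: negative entries are treated as errors. */
int reject_negative(int entry)
{
	return entry < 0 ? CB_ERROR : CB_OK;
}
```

Returning 0 for the empty case would not mean "ok" to a caller that
compares against these codes, which is why the initial value matters.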
In the unlikely case in which the mnl_attr_for_each_nested() cycle is
not executed, this function returns an uninitialized value.
Fix this by initializing ret to 0.
Fixes: 5937552b42 ("rdma: Add "stat qp show" support")
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
grps is dynamically allocated with calloc() and not freed in a return
path inside the for cycle. This commit fixes it.
While at it, make the function use a single return point.
Fixes: 63df8e8543 ("Add support for nexthop objects")
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
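A sketch of the leak pattern and its fix, assuming a simplified group
syntax of slash-separated ids ("1/2"). parse_group_ids() is hypothetical
and is not the iproute2 function; it only shows a temporary array that
is freed on every path through one exit label.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser: copies the ids from 'spec' into 'out' and returns
 * how many were parsed, or -1 on any error. */
int parse_group_ids(const char *spec, unsigned int *out, size_t out_len)
{
	size_t count = 1, i;
	unsigned int *ids;
	const char *p;
	int ret = -1;

	for (p = spec; *p; p++)
		if (*p == '/')
			count++;

	ids = calloc(count, sizeof(*ids));
	if (!ids)
		return -1;

	p = spec;
	for (i = 0; i < count; i++) {
		char *end;
		unsigned long v = strtoul(p, &end, 10);

		/* Malformed id: leave through 'out' so ids is freed. */
		if (end == p || (*end != '/' && *end != '\0'))
			goto out;
		ids[i] = (unsigned int)v;
		p = (*end == '/') ? end + 1 : end;
	}

	if (count > out_len)
		goto out;
	memcpy(out, ids, count * sizeof(*ids));
	ret = (int)count;
out:
	free(ids);
	return ret;
}
```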
In cake_parse_opt(), *argv is checked not to be null when parsing the
overhead and mpu parameters. However, this check is redundant, since
*argv has just been matched against "overhead" or "mpu".
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The return value of strslashrsplit() is not checked in
__dl_argv_handle(), despite the fact that it can return EINVAL.
This commit fixes it and makes __dl_argv_handle() return an error if
strslashrsplit() returns an error code.
Fixes: 2f85a9c535 ("devlink: allow to parse both devlink and port handle in the same time")
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
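The error propagation can be sketched like this. Both functions are
hypothetical stand-ins written for this example (they are not the
devlink code), and the "bus/device" handle format is assumed for
illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in splitter: cuts 'str' at its last '/' in place.
 * Returns 0 on success, -EINVAL if there is no slash. */
int split_at_last_slash(char *str, char **before, char **after)
{
	char *slash = strrchr(str, '/');

	if (!slash)
		return -EINVAL;
	*slash = '\0';
	*before = str;
	*after = slash + 1;
	return 0;
}

/* Stand-in caller: parses "bus/device" and propagates a splitter error
 * instead of continuing with unset pointers. */
int parse_handle(char *str, char **bus, char **dev)
{
	int err = split_at_last_slash(str, bus, dev);

	if (err)
		return err;
	return 0;
}
```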
The erspan/erspan6 output is not valid JSON, as for version 2 a
valueless key is printed. The direction should be the value and
erspan_dir should be the key.
Fixes: 2897636267 ("erspan: add erspan version II support")
Cc: u9012063@gmail.com
Reported-by: Christian Pössinger <christian@poessinger.com>
Signed-off-by: Christian Pössinger <christian@poessinger.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
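The intended key/value shape can be sketched as below. The formatting
helper is hypothetical (iproute2 uses its own JSON print helpers), and
the "ingress"/"egress" strings are assumed values for the direction.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical formatter: writes the direction as a proper JSON member,
 * with erspan_dir as the key and the direction as its string value. */
int format_erspan_dir(char *buf, size_t len, int egress)
{
	return snprintf(buf, len, "\"erspan_dir\":\"%s\"",
			egress ? "egress" : "ingress");
}
```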
Since the id is unique for a nexthop, dumping all nexthops just to flush
one of them is heavy.
Use the existing delete_nexthop to support flush by id.
Signed-off-by: Chunmei Xu <xuchunmei@linux.alibaba.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
To avoid code duplication, tipc should be converted to use the helper
functions for working with libmnl in lib/mnl_utils.c
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David Ahern <dsahern@kernel.org>
Allow a policer action to enforce a rate-limit based on packets-per-second,
configurable using a packet-per-second rate and burst parameters.
e.g.
# $TC actions add action police pkts_rate 1000 pkts_burst 200 index 1
# $TC actions ls action police
total acts 1
action order 0: police 0x1 rate 0bit burst 0b mtu 4096Mb pkts_rate 1000 pkts_burst 200
ref 1 bind 0
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
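To make the pkts_rate and pkts_burst semantics concrete, here is a
minimal token bucket counted in packets rather than bytes. It is a
sketch of the general technique only, not the kernel's police
implementation; the struct and function names are hypothetical, and the
arithmetic assumes a non-zero rate and short intervals (no overflow
handling).

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct pkt_policer {
	uint64_t rate_pps;	/* tokens (packets) added per second, > 0 */
	uint64_t burst;		/* bucket capacity in packets */
	uint64_t tokens;	/* packets currently allowed */
	uint64_t last_ns;	/* time up to which tokens were credited */
};

void policer_init(struct pkt_policer *p, uint64_t rate_pps,
		  uint64_t burst, uint64_t now_ns)
{
	p->rate_pps = rate_pps;
	p->burst = burst;
	p->tokens = burst;	/* start with a full bucket */
	p->last_ns = now_ns;
}

/* Returns true if a packet arriving at 'now_ns' conforms to the rate. */
bool policer_allow(struct pkt_policer *p, uint64_t now_ns)
{
	uint64_t earned = (now_ns - p->last_ns) * p->rate_pps / NSEC_PER_SEC;

	if (earned > 0) {
		p->tokens += earned;
		if (p->tokens > p->burst)
			p->tokens = p->burst;
		/* Advance only by the time that produced whole tokens. */
		p->last_ns += earned * NSEC_PER_SEC / p->rate_pps;
	}
	if (p->tokens == 0)
		return false;
	p->tokens--;
	return true;
}

/* Convenience for experiments: how many of 'attempts' packets pass. */
unsigned int policer_count_allowed(struct pkt_policer *p, uint64_t now_ns,
				   unsigned int attempts)
{
	unsigned int passed = 0;

	while (attempts--)
		if (policer_allow(p, now_ns))
			passed++;
	return passed;
}
```

With the values from the example above (rate 1000, burst 200), a burst
of back-to-back packets is cut off after 200, and 100 ms later another
100 packets conform.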
- Open Routing is using ID 99 for its installed routes
- https://github.com/facebook/openr
- Kernel has accepted 99 in `rtnetlink.h`
Signed-off-by: Cooper Lees <me@cooperlees.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
After the commit cited below, batch mode neglects to set the global
variable batch_mode to a non-zero value. Netns and VRF commands use this
variable, and break in batch mode. Fix by setting the value again.
Fixes: 1d9a81b8c9 ("Unify batch processing across tools")
Reported-by: Tim Rice <trice@posteo.net>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
This patch adds support for setting and displaying the Traffic Flow
Confidentiality attribute for an XFRM state, which allows padding ESP
packets to a specified length.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David Ahern <dsahern@kernel.org>
The out of date documentation was removed in 2017, but the instructions
in the README were not removed.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Petr Machata says:
====================
Support for resilient next-hop groups was recently accepted to Linux
kernel[1]. Resilient next-hop groups add a layer of indirection between the
SKB hash and the next hop. Thus the hash is used to reference a hash table
bucket, which is then used to reference a particular next hop. This allows
the system more flexibility when assigning SKB hash space to next hops.
Previously, each next hop had to be assigned a continuous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this
mends issues with traffic flow redirection resulting from next hop removal
or adjustments in next-hop weights.
In this patch set, introduce support for resilient next-hop groups to
iproute2.
- Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date.
- Patches #2 and #3 add new helpers that will be useful later.
- Patch #4 extends the ip/nexthop sub-tool to accept group type as a
command line argument, and to dispatch based on the specified type.
- Patch #5 adds the support for resilient next-hop groups.
- Patch #6 adds the support for resilient next-hop group bucket interface.
To illustrate the usage, consider the following commands:
# ip nexthop add id 1 via 192.0.2.2 dev dummy1
# ip nexthop add id 2 via 192.0.2.3 dev dummy1
# ip nexthop add id 10 group 1/2 type resilient \
buckets 8 idle_timer 60 unbalanced_timer 300
The last command creates a resilient next-hop group. It will have 8
buckets, each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.
And this is how the next-hop group bucket interface looks:
# ip nexthop bucket show id 10
id 10 index 0 idle_time 5.59 nhid 1
id 10 index 1 idle_time 5.59 nhid 1
id 10 index 2 idle_time 8.74 nhid 2
id 10 index 3 idle_time 8.74 nhid 2
id 10 index 4 idle_time 8.74 nhid 1
id 10 index 5 idle_time 8.74 nhid 1
id 10 index 6 idle_time 8.74 nhid 1
id 10 index 7 idle_time 8.74 nhid 1
[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2a0186a37700b0d5b8cc40be202a62af44f02fa2
====================
Signed-off-by: David Ahern <dsahern@kernel.org>
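The indirection described in the cover letter above (hash to bucket,
bucket to next hop) can be sketched as follows. This is an illustration
only: the structure, the function names and the use of a plain modulo to
pick a bucket are assumptions made for this example, not the kernel's
data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 8

/* Each bucket records the id of the next hop currently serving it. */
struct res_group {
	uint32_t bucket_nhid[NUM_BUCKETS];
};

/* The hash selects a bucket, and the bucket selects the next hop. */
uint32_t res_group_select(const struct res_group *g, uint32_t skb_hash)
{
	return g->bucket_nhid[skb_hash % NUM_BUCKETS];
}

/* Move only the buckets owned by 'old_nhid' to 'new_nhid'. Flows that
 * hash to any other bucket keep their next hop. Returns the number of
 * buckets that were migrated. */
size_t res_group_replace(struct res_group *g, uint32_t old_nhid,
			 uint32_t new_nhid)
{
	size_t i, moved = 0;

	for (i = 0; i < NUM_BUCKETS; i++) {
		if (g->bucket_nhid[i] == old_nhid) {
			g->bucket_nhid[i] = new_nhid;
			moved++;
		}
	}
	return moved;
}
```

Starting from the bucket layout shown in the "ip nexthop bucket show"
output above (buckets 2 and 3 on nhid 2, the rest on nhid 1), replacing
nhid 2 touches exactly two buckets and leaves the other six untouched.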
Add ability to dump multiple nexthop buckets and get a specific one.
Example:
# ip nexthop add id 10 group 1/2 type resilient buckets 8
# ip nexthop
id 1 via 192.0.2.2 dev dummy10 scope link
id 2 via 192.0.2.19 dev dummy20 scope link
id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0
# ip nexthop bucket
id 10 index 0 idle_time 28.1 nhid 2
id 10 index 1 idle_time 28.1 nhid 2
id 10 index 2 idle_time 28.1 nhid 2
id 10 index 3 idle_time 28.1 nhid 2
id 10 index 4 idle_time 28.1 nhid 1
id 10 index 5 idle_time 28.1 nhid 1
id 10 index 6 idle_time 28.1 nhid 1
id 10 index 7 idle_time 28.1 nhid 1
# ip nexthop bucket show nhid 1
id 10 index 4 idle_time 53.59 nhid 1
id 10 index 5 idle_time 53.59 nhid 1
id 10 index 6 idle_time 53.59 nhid 1
id 10 index 7 idle_time 53.59 nhid 1
# ip nexthop bucket get id 10 index 5
id 10 index 5 idle_time 81 nhid 1
# ip -j -p nexthop bucket get id 10 index 5
[ {
"id": 10,
"bucket": {
"index": 5,
"idle_time": 104.89,
"nhid": 1
},
"flags": [ ]
} ]
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Add ability to configure resilient nexthop groups and show their current
configuration. Example:
# ip nexthop add id 10 group 1/2 type resilient buckets 8
# ip nexthop show id 10
id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0
# ip -j -p nexthop show id 10
[ {
"id": 10,
"group": [ {
"id": 1
},{
"id": 2
} ],
"type": "resilient",
"resilient_args": {
"buckets": 8,
"idle_timer": 120,
"unbalanced_timer": 0
},
"flags": [ ]
} ]
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>