Add capability to ip-monitor to listen and dump nexthop messages.
Since the nexthop group is 32, which exceeds the max groups bit
field, 2 separate flags are needed: one that defaults on to indicate
the nexthop group is joined by default, and a second that indicates a
specific selection by the user (e.g., ip mon nexthop route).
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Add nhid option for routes to use nexthop objects by id.
Example:
$ ip nexthop add id 1 via 10.99.1.2 dev veth1
$ ip route add 10.100.1.0/24 nhid 1
$ ip route ls
...
10.100.1.0/24 nhid 1 via 10.99.1.2 dev veth1
Signed-off-by: David Ahern <dsahern@gmail.com>
Add nexthop subcommand to ip. Implement basic commands for creating,
deleting and dumping nexthop objects. Syntax follows 'nexthop' syntax
from existing 'ip route' command.
Examples:
1. Single path
$ ip nexthop add id 1 via 10.99.1.2 dev veth1
$ ip nexthop ls
id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
2. ECMP
$ ip nexthop add id 2 via 10.99.3.2 dev veth3
$ ip nexthop add id 1001 group 1/2
--> creates a nexthop group with 2 component nexthops:
id 1 and id 2, both with the same weight
$ ip nexthop ls
id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
id 1001 group 1/2
3. Weighted multipath
$ ip nexthop add id 1002 group 1,10/2,20
--> creates a nexthop group with 2 component nexthops:
id 1 with a weight of 10 and id 2 with a weight of 20
$ ip nexthop ls
id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
id 1001 group 1/2
id 1002 group 1,10/2,20
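
The group syntax in examples 2 and 3 can be read as a '/'-separated list
of 'id[,weight]' entries. A minimal Python sketch of that reading follows.
parse_group is a hypothetical helper, not part of iproute2, and the default
weight of 1 for entries written without a weight is an assumption.

```python
def parse_group(spec, default_weight=1):
    """Split a group spec such as "1,10/2,20" into (id, weight) pairs.

    Hypothetical helper for illustration only; default_weight=1 is an
    assumption for entries written without an explicit weight.
    """
    members = []
    for entry in spec.split("/"):
        # "1,10" -> id "1", weight "10"; "1" -> id "1", weight ""
        nh_id, _, weight = entry.partition(",")
        members.append((int(nh_id), int(weight) if weight else default_weight))
    return members
```

With this reading, "1/2" yields two members of equal weight and
"1,10/2,20" yields weights 10 and 20.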
Signed-off-by: David Ahern <dsahern@gmail.com>
Export print_rt_flags and print_rta_if for use by the nexthop
command.
Change print_rta_gateway to take the family versus rtmsg struct and
export for use by the nexthop command.
Signed-off-by: David Ahern <dsahern@gmail.com>
Groups > 31 have to be joined using setsockopt. Since the nexthop
group is 32, add a helper to allow 'ip monitor' to listen for nexthop
messages.
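
A rough Python sketch of the distinction, assuming the bind-time group
bitmask only carries groups 1..31 as stated above. group_to_mask and
join_group are hypothetical names, not the helper added by this patch.
The numeric constants are the Linux values for SOL_NETLINK and
NETLINK_ADD_MEMBERSHIP.

```python
import socket

# Linux netlink constants, defined here rather than relying on the
# socket module exporting them.
SOL_NETLINK = 270
NETLINK_ADD_MEMBERSHIP = 1
RTNLGRP_NEXTHOP = 32  # group number stated in the commit message


def group_to_mask(group):
    """Return the bitmask bit for a group, or None if it cannot be expressed."""
    if group < 1 or group > 31:
        return None
    return 1 << (group - 1)


def join_group(sock, group):
    """Join a multicast group by number, which also works for groups > 31."""
    sock.setsockopt(SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, group)
```

On Linux, a caller would open a socket with
socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
and call join_group(sock, RTNLGRP_NEXTHOP), since group_to_mask(32)
returns None.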
Signed-off-by: David Ahern <dsahern@gmail.com>
lwt_parse_encap currently assumes the encap attribute is RTA_ENCAP
and the type is RTA_ENCAP_TYPE. Change lwt_parse_encap to take these
as input arguments for reuse by nexthop code which has the attributes
as NHA_ENCAP and NHA_ENCAP_TYPE.
Signed-off-by: David Ahern <dsahern@gmail.com>
ctinfo is a tc action restoring data stored in conntrack marks to
various fields. At present it has two independent modes of operation,
restoration of DSCP into IPv4/v6 diffserv and restoration of conntrack
marks into packet skb marks.
It understands a number of parameters specific to this action in
addition to the usual action syntax. Each operating mode is
independent of the other, so all options are optional; however, not
specifying at least one mode is a bit pointless.
Usage: ... ctinfo [dscp mask [statemask]] [cpmark [mask]] [zone ZONE]
[CONTROL] [index <INDEX>]
DSCP mode
dscp enables copying of a DSCP stored in the conntrack mark into the
ipv4/v6 diffserv field. The mask is a 32bit field and specifies where
in the conntrack mark the DSCP value is located. It must be 6
contiguous bits long. eg. 0xfc000000 would restore the DSCP from the
upper 6 bits of the conntrack mark.
The DSCP copying may be optionally controlled by a statemask. The
statemask is a 32bit field, usually with a single bit set and must not
overlap the dscp mask. The DSCP restore operation will only take place
if the corresponding bit/s in conntrack mark ANDed with the statemask
yield a non zero result.
eg. dscp 0xfc000000 0x01000000 would retrieve the DSCP from the top 6
bits, whilst using bit 25 as a flag to do so. Bit 26 is unused in this
example.
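
The mask rules above can be sketched in Python as follows. This is an
illustrative model of the described behaviour, not the kernel or tc code.
dscp_shift and restore_dscp are hypothetical names.

```python
def dscp_shift(dscp_mask):
    """Return the bit offset of a mask made of exactly 6 contiguous bits."""
    if dscp_mask == 0:
        raise ValueError("dscp mask must not be zero")
    # Position of the lowest set bit in the mask.
    shift = (dscp_mask & -dscp_mask).bit_length() - 1
    if (dscp_mask >> shift) != 0x3F:
        raise ValueError("dscp mask must be 6 contiguous bits")
    return shift


def restore_dscp(ct_mark, dscp_mask, statemask=0):
    """Return the DSCP stored in ct_mark, or None if the statemask test fails."""
    if dscp_mask & statemask:
        raise ValueError("statemask must not overlap the dscp mask")
    shift = dscp_shift(dscp_mask)
    # With a statemask, restore only if the flagged bit(s) are set in the mark.
    if statemask and (ct_mark & statemask) == 0:
        return None
    return (ct_mark & dscp_mask) >> shift
```

For example, with dscp 0xfc000000 0x01000000 a conntrack mark of
0xb9000000 restores DSCP 46, while 0xb8000000 restores nothing because
the flag bit is clear.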
CPMARK mode
cpmark enables copying of the conntrack mark to the packet skb mark. In
this mode it is completely equivalent to the existing act_connmark
action. Additional functionality is provided by the optional mask
parameter, whereby the stored conntrack mark is logically ANDed with the
cpmark mask before being stored into skb mark. This allows shared usage
of the conntrack mark between applications.
eg. cpmark 0x00ffffff would restore only the lower 24 bits of the
conntrack mark, thus may be useful in the event that the upper 8 bits
are used by the DSCP function.
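
As a minimal sketch of the masking step, assuming an all-ones mask when
none is given (which matches the "equivalent to act_connmark" case above).
restore_skb_mark is a hypothetical name used only for illustration.

```python
def restore_skb_mark(ct_mark, cpmark_mask=0xFFFFFFFF):
    """Return the value copied into the skb mark: ct_mark AND cpmark_mask."""
    return ct_mark & cpmark_mask
```

With cpmark 0x00ffffff, a conntrack mark of 0xb9001234 gives an skb mark
of 0x00001234, leaving the upper 8 bits free for the DSCP function.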
Usage: ... ctinfo [dscp mask [statemask]] [cpmark [mask]] [zone ZONE]
[CONTROL] [index <INDEX>]
where :
dscp MASK is the bitmask to restore DSCP
STATEMASK is the bitmask to determine conditional restoring
cpmark MASK mask applied to restored packet mark
ZONE is the conntrack zone
CONTROL := reclassify | pipe | drop | continue | ok |
goto chain <CHAIN_INDEX>
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
The following TDC test case:
b776 - Replace simple action with invalid goto chain control
checks if the kernel correctly validates the 'goto chain' control action,
when it is specified in 'act_simple' rules. The test systematically fails
because the control action is hardcoded in parse_simple(), i.e. it is not
parsed by command line arguments, so its value is constantly TC_ACT_PIPE.
Because of that, the following command:
# tc action add action simple sdata "test" drop index 7
installs an 'act_simple' rule that never drops packets, and whose 'index'
is the first IDR available, plus an 'act_gact' rule with 'index' equal to
7, that drops packets.
Use parse_action_control_dflt(), like we did on many other TC actions, to
make the control action configurable also with 'act_simple'. The expected
results of test b776 are summarized below:
iproute2
v kernel->| 5.1-rc2 (and previous) | 5.1-rc3 (and subsequent)
------------------+-------------------------+-------------------------
5.1.0 | FAIL (bad IDR) | FAIL (bad IDR)
5.1.0(patched) | FAIL (no rule/bad sdata)| PASS
Changes since v1:
- reword commit message, thanks Stephen Hemminger
Fixes: 087f46ee4e ("tc: introduce simple action")
CC: Andrea Claudi <aclaudi@redhat.com>
CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The following operation fails:
% sudo tc actions add action pipe index 1
% sudo tc filter add dev lo parent ffff: \
protocol ip pref 10 u32 match ip src 127.0.0.2 \
flowid 1:10 action gact index 1
Bad action type index
Usage: ... gact <ACTION> [RAND] [INDEX]
Where: ACTION := reclassify | drop | continue | pass | pipe |
goto chain <CHAIN_INDEX> | jump <JUMP_COUNT>
RAND := random <RANDTYPE> <ACTION> <VAL>
RANDTYPE := netrand | determ
VAL : = value not exceeding 10000
JUMP_COUNT := Absolute jump from start of action list
INDEX := index value used
However, passing a control action of gact rule during filter binding works:
% sudo tc filter add dev lo parent ffff: \
protocol ip pref 10 u32 match ip src 127.0.0.2 \
flowid 1:10 action gact pipe index 1
Binding by reference, i.e. by index, has to consistently work with
any tc action.
Since tc is sensitive to the order of keywords passed on the command line,
we can teach gact to skip parsing arguments as soon as it sees 'gact'
followed by 'index' keyword.
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
A device name on the mdev bus is 36 characters long, following the
standard UUID format of RFC 4122.
This is probably the longest name that a kernel will return for a
device.
Hence increase the buffer size to 64 bytes.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
while at it, fix missing square bracket near 'ptype' and a typo in the
action description (it's -> its).
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Parav Pandit says:
====================
The RDMA subsystem can run in either of two modes:
(a) Sharing RDMA devices among multiple net namespaces or
(b) Exclusive mode where RDMA device is bound to single net namespace
This patch series adds
(1) query command to query rdma subsystem sharing mode
(2) set command to change rdma subsystem sharing mode
(3) assign rdma device to a net namespace
rdma tool examples:
(a) Query current rdma subsys net namespace sharing mode
$ rdma sys show
netns shared
(b) Change rdma subsys mode to exclusive mode
$ rdma sys set netns exclusive
$ rdma sys show
netns exclusive
(c) Assign rdma device to a specific newly created net namespace
$ ip netns add foo
$ rdma dev set mlx5_1 netns foo
====================
Signed-off-by: David Ahern <dsahern@gmail.com>
Add man page to describe additional set netns command
for rdma device.
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Enrich rdmatool with an option to set network namespace of RDMA
device. After successful execution of it, rdma device will
be accessible only in assigned network namespace.
rdma tool command examples and output.
First set netns mode to exclusive.
$ rdma system set netns exclusive
Now create network namespace and assign RDMA device to this
network namespace.
$ ip netns add foo
$ rdma dev set mlx5_1 netns foo
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Enrich rdmatool with an option to query rdma subsystem parameter
whether rdma devices are shared among multiple network namespaces
or exclusive to single network namespace.
rdma tool command examples and output.
$ rdma system show
netns shared
$ rdma system set netns exclusive
$ rdma system show
netns exclusive
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
It will obviously fail. This is a follow up of the
commit 757837230a ("lib: suppress error msg when filling the cache").
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
While I fixed the mdb json output, I did overlook the text output.
This patch returns the original text output format:
dev <bridge> port <port> grp <mcast group> <temp|permanent> <flags> <timer>
Example (old format, restored by this patch):
dev br0 port eth8 grp 239.1.1.11 temp
Example (changed format after the commit below):
23: br0 eth8 239.1.1.11 temp
We had some reports of failing scripts which were parsing the output.
Also the old format matches the bridge mdb command syntax which makes
it easier to build commands out of the output.
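
To illustrate why scripts depend on the keyword form, here is a sketch of
the kind of parser that works on the restored format. parse_mdb_line is a
hypothetical example, not one of the reported scripts.

```python
def parse_mdb_line(line):
    """Parse 'dev <bridge> port <port> grp <group> <state> ...' into a dict."""
    tokens = line.split()
    entry = {}
    # The first three fields are keyword/value pairs: dev, port, grp.
    for i in range(0, 6, 2):
        entry[tokens[i]] = tokens[i + 1]
    entry["state"] = tokens[6]
    entry["rest"] = tokens[7:]  # optional flags and timer, left unparsed
    return entry
```

A parser like this finds no 'dev', 'port' or 'grp' keywords in the
changed format and fails on it.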
Fixes: c7c1a1ef51 ("bridge: colorize output and use JSON print library")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
sscanf truncates read port values silently without any error. As sscanf
man says:
(...) sscanf() conform to C89 and C99 and POSIX.1-2001. These standards
do not specify the ERANGE error.
Replace sscanf with safer get_be16 that returns error when value is out
of range.
Example:
tc filter add dev eth0 protocol ip parent ffff: prio 1 flower ip_proto
tcp dst_port 70000 hw_tc 1
Would result in filter for port 4464 without any warning.
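
The arithmetic behind the example, with a range-checked alternative,
sketched in Python. parse_port and truncate_to_u16 are hypothetical names
and only model the behaviour described here, not get_be16 itself.

```python
def truncate_to_u16(value):
    """Model silent truncation of a value to 16 bits."""
    return value & 0xFFFF


def parse_port(text):
    """Parse a decimal port and reject values that do not fit in 16 bits."""
    value = int(text, 10)
    if not 0 <= value <= 0xFFFF:
        raise ValueError("port out of range: %s" % text)
    return value
```

truncate_to_u16(70000) gives 4464 (70000 - 65536), while
parse_port("70000") raises an error instead.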
Fixes: 8930840e67 ("tc: flower: Classify packets based port ranges")
Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Before the patch:
$ ip netns add foo
$ ip link add name veth1 address 2a:a5:5c:b9:52:89 type veth peer name veth2 address 2a:a5:5c:b9:53:90 netns foo
RTNETLINK answers: No such device
RTNETLINK answers: No such device
But the command was successful. This may break scripts. Let's remove those
error messages.
Fixes: 55870dfe7f ("Improve batch and dump times by caching link lookups")
Reported-by: Philippe Guibert <philippe.guibert@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The mirred act admits an optional control action, defaulting
to TC_ACT_PIPE. The parsing code currently emits an error message
if the control action is not provided on the command line, even
if the command itself completes with no error.
This change silences the error message, using the appropriate
parsing helper.
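
The intended behaviour, an optional control action with a default and no
error when it is absent, can be sketched as below. parse_control is a
hypothetical Python model, not the tc helper. The keyword set is a reduced
one and omits forms such as 'goto chain'.

```python
# Reduced, illustrative set of control keywords.
CONTROL_ACTIONS = {"reclassify", "pipe", "drop", "continue", "ok"}


def parse_control(tokens, default="pipe"):
    """Return (control, remaining_tokens); fall back to default silently."""
    if tokens and tokens[0] in CONTROL_ACTIONS:
        return tokens[0], tokens[1:]
    # No control keyword given: use the default without reporting an error.
    return default, tokens
```

For instance, parse_control(["index", "5"]) returns the default "pipe"
and leaves both tokens for the caller.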
Fixes: e67aba5595 ("tc: actions: add helpers to parse and print control actions")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Every tool in the iproute2 package has one or more functions to show
a help message to the user. Some of these functions print the help
line by line with a series of printf calls, e.g. ip/xfrm_state.c does
60 fprintf calls.
If we group all the calls to a single one and just concatenate strings,
we save a lot of libc calls and thus object size. The size difference
of the compiled binaries calculated with bloat-o-meter is:
ip/ip:
add/remove: 0/0 grow/shrink: 5/15 up/down: 103/-4796 (-4693)
Total: Before=672591, After=667898, chg -0.70%
ip/rtmon:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-54 (-54)
Total: Before=48879, After=48825, chg -0.11%
tc/tc:
add/remove: 0/2 grow/shrink: 31/10 up/down: 882/-6133 (-5251)
Total: Before=351912, After=346661, chg -1.49%
bridge/bridge:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-459 (-459)
Total: Before=70502, After=70043, chg -0.65%
misc/lnstat:
add/remove: 0/1 grow/shrink: 1/0 up/down: 48/-486 (-438)
Total: Before=9960, After=9522, chg -4.40%
tipc/tipc:
add/remove: 0/0 grow/shrink: 1/1 up/down: 18/-62 (-44)
Total: Before=79182, After=79138, chg -0.06%
While at it, indent some strings which were starting at column 0,
and use tabs where possible, to have a consistent style across helps.
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Fix typo in usnic_udp node type and add a string for the unspecified
node type.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
import asm-generic/sockios.h to fix the compile errors from the
movement of timestamp macros.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Update kernel headers to commit
b970afcfcabd ("Merge tag 'powerpc-5.2-1'")
and import asm-generic/sockios.h to fix the compile errors from the
movement of timestamp macros.
Signed-off-by: David Ahern <dsahern@gmail.com>
Allow limiting 'ip xfrm {state|policy} list' output to a certain address
family and to delete all states/policies by family.
Although preferred_family was already set in filters, the filter
function ignored it. To enable filtering despite the lack of other
selectors, filter.use has to be set if family is not AF_UNSPEC.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Follow these steps:
# ip netns add net1
# export MALLOC_MMAP_THRESHOLD_=0
# ip netns list
Then a segmentation fault (core dumped) will occur.
In get_netnsid_from_name(), answer is freed before
rta_getattr_u32(tb[NETNSA_NSID]) is called, where tb[] refers to answer's
content. If we set MALLOC_MMAP_THRESHOLD_=0, malloc uses mmap to
allocate memory, which is released immediately when free() is
called. So reading tb[NETNSA_NSID] after free(answer) accesses
released memory.
Here, we call rta_getattr_u32(tb[NETNSA_NSID]) before free(answer).
Fixes: 86bf43c7c2 ("lib/libnetlink: update rtnl_talk to support malloc buff at run time")
Reported-by: Huiying Kou <kouhuiying@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The command is supposed to allow users to filter events related to
certain objects, but returns an error when an object is specified:
# devlink mon dev
Command "dev" not found
Fix this by allowing the command to process the specified objects.
Example:
# devlink/devlink mon dev &
# echo "10 1" > /sys/bus/netdevsim/new_device
[dev,new] netdevsim/netdevsim10
# devlink/devlink mon port &
# echo "11 1" > /sys/bus/netdevsim/new_device
[port,new] netdevsim/netdevsim11/0: type notset flavour physical
[port,new] netdevsim/netdevsim11/0: type eth netdev eth1 flavour physical
# devlink/devlink mon &
# echo "12 1" > /sys/bus/netdevsim/new_device
[dev,new] netdevsim/netdevsim12
[port,new] netdevsim/netdevsim12/0: type notset flavour physical
[port,new] netdevsim/netdevsim12/0: type eth netdev eth2 flavour physical
Fixes: a3c4b484a1 ("add devlink tool")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
This allows a cycle-time and a cycle-time-extension to be specified.
Specifying a cycle-time will truncate that cycle, so when that instant
is reached, the cycle will start from its beginning.
A cycle-time-extension may cause the last entry of a cycle, just
before the start of a new schedule (the base-time of the "admin"
schedule) to be extended by at most "cycle-time-extension"
nanoseconds. The idea of this feature, as described by IEEE
802.1Q, is to avoid too narrow gate states.
Example:
tc qdisc change dev IFACE parent root handle 100 taprio \
sched-entry S 0x1 1000000 \
sched-entry S 0x0 2000000 \
sched-entry S 0x1 3000000 \
sched-entry S 0x0 4000000 \
cycle-time-extension 100000 \
cycle-time 9000000 \
base-time 12345678900000000
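
The effect of cycle-time on the example above can be modelled with a
short Python sketch. This is a simplified model for illustration, not the
kernel implementation: cycle-time-extension is ignored, and
active_gate_mask is a hypothetical name.

```python
def active_gate_mask(now, base_time, cycle_time, entries):
    """Return the gate mask in effect at time `now` (all times in ns).

    entries is a list of (gate_mask, interval_ns) tuples in schedule order.
    Simplified model: returns None before base_time and for any part of
    the cycle not covered by the entries; cycle-time-extension is ignored.
    """
    if now < base_time:
        return None
    # Position within the current cycle; the cycle restarts at cycle_time.
    offset = (now - base_time) % cycle_time
    elapsed = 0
    for gate_mask, interval in entries:
        elapsed += interval
        if offset < elapsed:
            return gate_mask
    return None
```

The four entries in the example sum to 10000000 ns, so a cycle-time of
9000000 cuts the last entry from 4 ms to 3 ms. At 9500000 ns after
base-time the model is already 500000 ns into the next cycle and returns
gate mask 0x1.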
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David Ahern <dsahern@gmail.com>