* The argument to src_mac and dst_mac may now take an optional mask
to limit the scope of matching.
* This address is is documented as a LLADDR in keeping with ip-link(8).
* The formats accepted match those already output when dumping flower
filters from the kernel.
Example of use of LLADDR with and without a mask:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent ffff: flower indev eth0 \
src_mac 52:54:01:00:00:00/ff:ff:00:00:00:01 action drop
tc filter add dev eth0 protocol ip parent ffff: flower indev eth0 \
src_mac 52:54:00:00:00:00/23 action drop
tc filter add dev eth0 protocol ip parent ffff: flower indev eth0 \
src_mac 52:54:00:00:00:00 action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
* The argument to src_ip, dst_ip, enc_src_ip and enc_dst_ip take an
optional prefix length which is used to provide a mask to limit the scope
of matching.
* This is documented as a PREFIX in keeping with ip-route(8).
Example of uses of IPv4 and IPv6 prefixes
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent ffff: flower \
indev eth0 dst_ip 192.168.1.1 action drop
tc filter add dev eth0 protocol ip parent ffff: flower \
indev eth0 src_ip 10.0.0.0/8 action drop
tc filter add dev eth0 protocol ipv6 parent ffff: flower \
indev eth0 src_ip 2001:DB8:1::/48 action drop
tc filter add dev eth0 protocol ipv6 parent ffff: flower \
indev eth0 dst_ip 2001:DB8::1 action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
* The argument to src_mac and dst_mac may now take an optional mask
to limit the scope of matching.
* This address is is documented as a LLADDR in keeping with ip-link(8).
* The formats accepted match those already output when dumping flower
filters from the kernel.
Example of use of LLADDR with and without a mask:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent ffff: flower indev eth0 \
src_mac 52:54:01:00:00:00/ff:ff:00:00:00:01 action drop
tc filter add dev eth0 protocol ip parent ffff: flower indev eth0 \
src_mac 52:54:00:00:00:00/23 action drop
tc filter add dev eth0 protocol ip parent ffff: flower indev eth0 \
src_mac 52:54:00:00:00:00 action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
* The argument to src_ip, dst_ip, enc_src_ip and enc_dst_ip take an
optional prefix length which is used to provide a mask to limit the scope
of matching.
* This is documented as a PREFIX in keeping with ip-route(8).
Example of uses of IPv4 and IPv6 prefixes
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent ffff: flower \
indev eth0 dst_ip 192.168.1.1 action drop
tc filter add dev eth0 protocol ip parent ffff: flower \
indev eth0 src_ip 10.0.0.0/8 action drop
tc filter add dev eth0 protocol ipv6 parent ffff: flower \
indev eth0 src_ip 2001:DB8:1::/48 action drop
tc filter add dev eth0 protocol ipv6 parent ffff: flower \
indev eth0 dst_ip 2001:DB8::1 action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Enhance tunnel key action parameters by adding destination UDP port.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Enhance IP tunnel parameters by adding destination UDP port.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Support matching on ICMP type and code.
Example usage:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent ffff: flower \
indev eth0 ip_proto icmp type 8 code 0 action drop
tc filter add dev eth0 protocol ipv6 parent ffff: flower \
indev eth0 ip_proto icmpv6 type 128 code 0 action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Introduce enum flower_endpoint and use it instead of a bool
as the type for paramatising source and destination.
This is intended to improve read-ability and provide some type
checking of endpoint parameters.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Make use of flower_port_attr_type() safe:
* flower_port_attr_type() may return a valid index into tb[] or -1.
Only access tb[] in the case of the former.
* Do not access null entries in tb[]
Also make usage silent - it is valid for ip_proto to be invalid,
for example if it is not specified as part of the filter.
Fixes: a1fb0d4842 ("tc: flower: Support matching on SCTP ports")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
This action could be used before redirecting packets to a shared tunnel
device, or when redirecting packets arriving from a such a device.
The 'unset' action is optional. It is used to explicitly unset the
metadata created by the tunnel device during decap. If not used, the
metadata will be released automatically by the kernel.
The 'set' operation, will set the metadata with the specified values for
the encap.
For example, the following flower filter will forward all ICMP packets
destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before
redirecting, a metadata for the vxlan tunnel is created using the
tunnel_key action and it's arguments:
$ tc filter add dev net0 protocol ip parent ffff: \
flower \
ip_proto 1 \
dst_ip 11.11.11.2 \
action tunnel_key set \
src_ip 11.11.0.1 \
dst_ip 11.11.0.2 \
id 11 \
action mirred egress redirect dev vxlan0
Signed-off-by: Amir Vadai <amir@vadai.me>
Introduce classifying by metadata extracted by the tunnel device.
Outer header fields - source/dest ip and tunnel id, are extracted from
the metadata when classifying.
For example, the following will add a filter on the ingress Qdisc of shared
vxlan device named 'vxlan0'. To forward packets with outer src ip
11.11.0.2, dst ip 11.11.0.1 and tunnel id 11. The packets will be
forwarded to tap device 'vnet0':
$ tc filter add dev vxlan0 protocol ip parent ffff: \
flower \
enc_src_ip 11.11.0.2 \
enc_dst_ip 11.11.0.1 \
enc_key_id 11 \
dst_ip 11.11.11.1 \
action mirred egress redirect dev vnet0
Signed-off-by: Amir Vadai <amir@vadai.me>
This work moves the bpf loader into the iproute2 library and reworks
the tc specific parts into generic code. It's useful as we can then
more easily support new program types by just having the same ELF
loader backend. Joint work with Thomas Graf. I hacked a rough start
of a test suite to make sure nothing breaks [1] and looks all good.
[1] https://github.com/borkmann/clsact/blob/master/test_bpf.sh
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Support matching on SCTP ports in the same way that matching
on TCP and UDP ports is already supported.
Example usage:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent ffff: \
flower indev eth0 ip_proto sctp dst_port 80 \
action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Remove left over usage from removal of eth_type argument.
Fixes: 488b41d020 ('tc: flower no need to specify the ethertype')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
So far, only the 'egress' direction was implemented.
Allow specifying 'ingress' as the direction packet appears on the target
interface.
For example, this takes incoming 802.1q frames on veth0 and redirects
them for input on dummy0:
# tc filter add dev veth0 parent ffff: pref 1 protocol 802.1q basic \
action mirred ingress redirect dev dummy0
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Since 5cd1adba79 ("Update to current iptables headers") compilation
of iproute2 broke for systems without iptables-devel package [1].
Reason is that even though we fall back to build m_ipt.c, the include
depends on a xtables-version.h header, which only ships with
iptables-devel. Machines not having this package fail compilation with:
[...]
CC m_ipt.o
In file included from ../include/iptables.h:5:0,
from m_ipt.c:17:
../include/xtables.h:34:29: fatal error: xtables-version.h: No such file or directory
compilation terminated.
../Config:31: recipe for target 'm_ipt.o' failed
make[1]: *** [m_ipt.o] Error 1
The configure script only barks that package xtables was not found in
the pkg-config search path. The generated Config then only contains f.e.
TC_CONFIG_IPSET. In tc's Makefile we thus fall back to adding m_ipt.o
to TCMODULES. m_ipt.c then includes the local include/iptables.h header
copy, which includes the include/xtables.h copy. Latter then includes
xtables-version.h, which only ships with iptables-devel.
One way to resolve this is to skip this whole mess when pkg-config has
no xtables config available. I've carried something along these lines
locally for a while now, but it's just too annyoing. :/ Build works fine
now also when xtables.pc is not available.
[1] http://www.spinics.net/lists/netdev/msg366162.html
Fixes: 5cd1adba79 ("Update to current iptables headers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add support for controling hardware offload using (now standard)
skip_sw and skip_hw flags in cls_bpf.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
sudo $TC filter add dev $ETH parent ffff: prio 2 protocol ip \
u32 match u32 0 0 flowid 1:1 \
action ok
sudo $TC filter add dev $ETH parent ffff: prio 1 protocol ip \
u32 match ip protocol 1 0xff flowid 1:10 \
action ok
now dump to see all rules..
$TC -s filter ls dev $ETH parent ffff: protocol ip
....
filter pref 1 u32
filter pref 1 u32 fh 801: ht divisor 1
filter pref 1 u32 fh 801::800 order 2048 key ht 801 bkt 0 flowid 1:10 (rule hit 0 success 0)
match 00010000/00ff0000 at 8 (success 0 )
action order 1: gact action drop
random type none pass val 0
index 6 ref 1 bind 1 installed 4 sec used 4 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
filter pref 2 u32
filter pref 2 u32 fh 800: ht divisor 1
filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 336 success 336)
match 00000000/00000000 at 0 (success 336 )
action order 1: gact action pass
random type none pass val 0
index 5 ref 1 bind 1 installed 38 sec used 4 sec
Action statistics:
Sent 24864 bytes 336 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
....
..get filter 801::800
$TC -s filter get dev $ETH parent ffff: protocol ip \
handle 801:0:800 prio 2 u32
....
filter parent ffff: protocol ip pref 1 u32 fh 801::800 order 2048 key ht 801 bkt 0 flowid 1:10 (rule hit 260 success 130)
match 00010000/00ff0000 at 8 (success 130 )
action order 1: gact action drop
random type none pass val 0
index 6 ref 1 bind 1 installed 348 sec used 0 sec
Action statistics:
Sent 11440 bytes 130 pkt (dropped 130, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
....
..get other one
$TC -s filter get dev $ETH parent ffff: protocol ip \
handle 800:0:800 prio 2 u32
....
filter parent ffff: protocol ip pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 514 success 514)
match 00000000/00000000 at 0 (success 514 )
action order 1: gact action pass
random type none pass val 0
index 5 ref 1 bind 1 installed 506 sec used 4 sec
Action statistics:
Sent 35544 bytes 514 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
....
..try something that doesnt exist
$TC -s filter get dev $ETH parent ffff: protocol ip handle 800:0:803 prio 2 u32
.....
RTNETLINK answers: No such file or directory
We have an error talking to the kernel
.....
Note, added NLM_F_ECHO is for backward compatibility. old kernels never
before Eric's patch will not respond without it and newer kernels (after Erics patch)
will ignore it.
In old kernels there is a side effect:
In addition to a response to the GET you will receive an event (if you do tc mon).
But this is still better than what it was before (not working at all).
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
This action is intended to be an upgrade from a usability perspective
from pedit (as well as operational debugability).
Compare this:
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe
to:
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15
Or worse, try to debug a policy with destination mac, source mac and
etherype. Then make that a hundred rules and you'll get my point.
The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesn't get dropped or confuse an interconnecting (learning) switch
or dropped by a target machine (which looks at the dst mac).
In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
In linux-4.9 fq packet scheduler got a new stat :
unthrottle_latency in nano second units.
Gives a good indication of system load or timer implementation
latencies.
Signed-off-by: Eric Dumazet <edumazet@google.com>
The 'vlan modify' action allows to replace an existing 802.1q tag
according to user provided settings.
It accepts same arguments as the 'vlan push' action.
For example, this replaces vid 6 with vid 5:
# tc filter add dev veth0 parent ffff: pref 1 protocol 802.1q \
basic match 'meta(vlan mask 0xfff eq 6)' \
action vlan modify id 5 continue
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Currently, 'linkid' input by the user is parsed but 'handle' is appended to the netlink message.
# tc filter add dev enp1s0f1 protocol ip parent ffff: prio 99 u32 ht 800: \
order 1 link 1: offset at 0 mask 0f00 shift 6 plus 0 eat match ip \
protocol 6 ff
resulted in:
filter protocol ip pref 99 u32 fh 800::1 order 1 key ht 800 bkt 0
match 00060000/00ff0000 at 8
offset 0f00>>6 at 0 eat
This patch results in:
filter protocol ip pref 99 u32 fh 800::1 order 1 key ht 800 bkt 0 link 1:
match 00060000/00ff0000 at 8
offset 0f00>>6 at 0 eat
Signed-off-by Sushma Sitaram: Sushma Sitaram <sushma.sitaram@intel.com>
since get_qdisc_handle() truncates the input value to 16 bit, return an
error and prompt "invalid qdisc ID" in case input 'handle' parameter needs
more than 16 bit to be stored.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Phil Sutter <phil@nwl.cc>
The current vlan push action supports only vid and protocol options.
Add priority option.
Example script that adds vlan push action with vid and priority:
tc filter add dev veth0 protocol ip parent ffff: \
flower \
indev veth0 \
action vlan push id 100 priority 5
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Classification according to vlan id and vlan priority.
Example script that adds vlan filter:
# add ingress qdisc
tc qdisc add dev ens4f0 ingress
# add a flower filter with vlan id and priority classification
tc filter add dev ens4f0 protocol 802.1Q parent ffff: \
flower \
indev ens4f0 \
vlan_ethtype ipv4 \
vlan_id 100 \
vlan_prio 3 \
action vlan pop
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
The matchall classifier matches every packet and allows the user to apply
actions on it. In addition, it supports the skip_sw and skip_hw (as can
be found on u32 and flower filter) that direct the kernel to skip the
software/hardware processing of the actions.
This filter is very useful in usecases where every packet should be
matched. For example, packet mirroring (SPAN) can be setup very easily
using that filter.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Before this patch:
# ./tc/tc actions add action drop index 11
RTNETLINK answers: File exists
We have an error talking to the kernel
Command "(null)" is unknown, try "tc actions help".
After this patch:
# ./tc/tc actions add action drop index 11
RTNETLINK answers: File exists
We have an error talking to the kernel
Cc: Stephen Hemminger <shemming@brocade.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
The function returns zero on success.
Reported-by: Mark Bloch <markb@mellanox.com>
Fixes: 69f5aff63c ("tc: use action_a2n() everywhere")
Signed-off-by: Phil Sutter <phil@nwl.cc>
When switching to C99 initializers, I forgot to add this one. This means
that when trying to set an estimator value, tc would complain about
spurious duplicate estimator parameter. But much worse, the random
variable content is sent to the kernel regardless of whether an
estimator was given or not.
Fixes: d17b136f7d ("Use C99 style initializers everywhere")
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
It's a pitty this function is used nowhere, so let's polish it for use:
* Loop over branch names, makes it clear that every former conditional
was exactly identical.
* Support 'pipe' branch name, too.
* Make number parsing optional.
Signed-off-by: Phil Sutter <phil@nwl.cc>
* Drop 'extern' keyword before function declarations.
* Add parameter names where they were missing for matters of
consistency.
* Drop fancy indenting (e.g. tab between type and name).
* Break long lines to not exceed 80 columns.
Signed-off-by: Phil Sutter <phil@nwl.cc>
The optional mask which may be added to int values is considered by the
kernel only if it is non-zero, therefore tc should only then also print
it.
Without this, not passing a mask value like so:
| # tc filter add dev d0 parent 8001: \
| basic match meta\(vlan eq 1\) \
| classid 8001:1
Would lead to tc printing an all-zero mask later:
| # tc filter show dev d0
| filter parent 8001: protocol all pref 49151 basic
| filter parent 8001: protocol all pref 49151 basic handle 0x1 flowid 8001:1
| meta(vlan mask 0x00000000 eq 1)
This is obviously confusing as an all-zero mask strictly means to
eliminate all bits from the value, but the opposite is the case.
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Since parse_rtattr_flags() calls memset already, there is no need for
callers to do so themselves.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
This only replaces occurrences where the newly allocated memory is
cleared completely afterwards, as in other cases it is a theoretical
performance hit although code would be cleaner this way.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
This big patch was compiled by vimgrepping for memset calls and changing
to C99 initializer if applicable. One notable exception is the
initialization of union bpf_attr in tc/tc_bpf.c: changing it would break
for older gcc versions (at least <=3.4.6).
Calls to memset for struct rtattr pointer fields for parse_rtattr*()
were just dropped since they are not needed.
The changes here allowed the compiler to discover some unused variables,
so get rid of them, too.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
This improves my initial change in the following points:
- Flatten embedded struct's initializers.
- No need to initialize variables to zero as the key feature of C99
initializers is to do this implicitly.
- By relocating the declaration of struct rtattr *tail, it can be
initialized at the same time.
Fixes: a0a73b298a ("tc: m_action: Use C99 style initializers for struct req")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Use the official BPF ELF e_machine value that was assigned recently [1]
and will be propagated to glibc, libelf et al. LLVM will switch to it
in 3.9 release, therefore we need to prepare tc to check for EM_ELF as
well, older version still have the EM_NONE.
[1] 36b9c09330
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
On devices that support TC flower offloads, these flags enable a filter to be
added only to HW or only to SW. skip_sw and skip_hw are mutually exclusive
flags. By default without any flags, the filter is added to both HW and SW,
but no error checks are done in case of failure to add to HW.
With skip-sw, failure to add to HW is treated as an error.
Here is a sample script that adds 2 filters, one with skip_sw and the other
with skip_hw flag.
# add ingress qdisc
tc qdisc add dev enp0s9 ingress
# enable hw tc offload.
ethtool -K enp0s9 hw-tc-offload on
# add a flower filter with skip-sw flag.
tc filter add dev enp0s9 protocol ip parent ffff: flower \
ip_proto 1 indev enp0s9 skip_sw \
action drop
# add a flower filter with skip-hw flag.
tc filter add dev enp0s9 protocol ip parent ffff: flower \
ip_proto 3 indev enp0s9 skip_hw \
action drop
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
I'll make a formal submission sans the header when the kernel patches
makes it in. This version is for someone who wants to play around with
the net-next kernel patches i sent
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Instead of initializing fields after (or sometimes even before) zeroing
the whole struct via memset(), initialize the whole thing at declaration
time.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Since commit 5cd1adb ("Update to current iptables headers") the build
with m_ipt.o and the following config will fail:
TC_CONFIG_XT:=n
TC_CONFIG_XT_OLD:=n
TC_CONFIG_XT_OLD_H:=n
This patch renames "iptables_target" to "xtables_target" and some other
things which gets renamed and I noticed while reading iptables git log.
Functions which are not used in m_ipt.c and not exported by the header
are removed, if they still used in m_ipt.c I added a static to the function.
Reported-by: Clemens Gruber <clemens.gruber@pqgruber.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
This pulls common code from parse_ipt() and print_ipt() functions
together.
While here, also fix for incorrect use of the global 'optarg' variable
in print_ipt().
Signed-off-by: Phil Sutter <phil@nwl.cc>
After dropping the unused decrement of argc in the function's tail, it
can fully take over what iargc has been used for.
Signed-off-by: Phil Sutter <phil@nwl.cc>
By exiting early if xtables_find_target() fails, one indenting level can
be dropped. Some of the wrongly indented code then happens to sit at the
right spot by accident which is why this patch is smaller than expected.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Without this, the following call to tc would segfault:
| tc filter add dev d0 parent ffff: u32 match u32 0 0 \
| action xt -j MARK --set-mark 0x1 \
| action xt -j MARK --set-mark 0x1
The reason is basically the same as for 6e2e5ec28b ("fix print_ipt:
segfault if more then one filter with action -j MARK.") but in
parse_ipt() instead of print_ipt().
Signed-off-by: Phil Sutter <phil@nwl.cc>
Iptables standard targets like DROP or REJECT don't implement the print
callback in libxtables. Hence the following command would segfault:
| tc filter add dev d0 parent ffff: u32 match u32 0 0 action xt -j DROP
With this patch standard targets still can't be used (and are not really
useful anyway), but at least it doesn't crash anymore.
Signed-off-by: Phil Sutter <phil@nwl.cc>
On devices that support TC U32 offloads, these flags enable a filter to be
added only to HW or only to SW. skip_sw and skip_hw are mutually exclusive
flags. By default without any flags, the filter is added to both HW and SW,
but no error checks are done in case of failure to add to HW.
With skip-sw, failure to add to HW is treated as an error.
Here is a sample script that adds 2 filters, one with skip_sw and the other
with skip_hw flag.
# add ingress qdisc
tc qdisc add dev p4p1 ingress
# enable hw tc offload.
ethtool -K p4p1 hw-tc-offload on
# add u32 filter with skip-sw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:1 u32 ht 800: flowid 800:1 \
skip-sw \
match ip src 192.168.1.0/24 \
action drop
# add u32 filter with skip-hw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:2 u32 ht 800: flowid 800:2 \
skip-hw \
match ip src 192.168.2.0/24 \
action drop
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
"handle" was being used several times for different things.
Fix the 80 character limit abuse and other little issues while at it.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
The user must at least specify a choice of the token bucket or
ewma policing or late binding index. TB policing requires at minimal
a rate and burst.
In addition fix formatting issues (80 chars etc).
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Similar to the Linux kernel and perf add infrastructure to reduce the
amount of output tossed to a user during a build. Full build output
can be obtained with 'make V=1'
Builds go from:
make[1]: Leaving directory `/home/dsa/iproute2.git/lib'
make[1]: Entering directory `/home/dsa/iproute2.git/ip'
gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ip.o ip.c
gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ipaddress.o ipaddress.c
to:
...
AR libutil.a
ip
CC ip.o
CC ipaddress.o
...
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Failed compile
m_simple.c: In function ‘parse_simple’:
m_simple.c:154:6: warning: too many arguments for format [-Wformat-extra-args]
*argv);
^
m_simple.c:103:14: warning: unused variable ‘maybe_bind’ [-Wunused-variable]
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
following late binding didn't work
sudo tc actions add action ife encode \
type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
We need to fill handle when provided by the user, even if no further
argument is provided. Thus, move the test for arg to the correct location,
so that it works correctly:
# tc filter show dev foo egress
filter protocol all pref 1 bpf
filter protocol all pref 1 bpf handle 0x1 bpf.o:[classifier] direct-action
filter protocol all pref 1 bpf handle 0x2 bpf.o:[classifier] direct-action
# tc filter del dev foo egress prio 1 handle 2 bpf
# tc filter show dev foo egress
filter protocol all pref 1 bpf
filter protocol all pref 1 bpf handle 0x1 bpf.o:[classifier] direct-action
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In ingress and clsact qdisc TCA_OPTIONS are ignored, since it's
parameterless. In tc, we add an empty addattr_l(... TCA_OPTIONS,
NULL, 0) to the netlink message nevertheless. This has the
side effect that when someone tries a 'tc qdisc replace' and
already an existing such qdisc is present, tc fails with
EINVAL here.
Reason is that in the kernel, this invokes qdisc_change() when
such requested qdisc is already present. When TCA_OPTIONS are
passed to modify parameters, it looks whether qdisc implements
.change() callback, and if not present (like in both cases here)
it returns with error. Rather than adding an empty stub to the
kernel that ignores TCA_OPTIONS again, just don't add TCA_OPTIONS
to the netlink message in the first place.
Before:
# tc qdisc replace dev foo clsact # first try
# tc qdisc replace dev foo clsact # second one
RTNETLINK answers: Invalid argument
After:
# tc qdisc replace dev foo clsact
# tc qdisc replace dev foo clsact
# tc qdisc replace dev foo clsact
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Brings it closer to more serious actions (adding branching
and allowing for late binding)
Unfortunately this breaks old syntax of the simple action.
But because simple is a pedagogical example unlikely to be used
in production environments (i.e its role is to serve as an example
on how to write actions), then this is ok.
New syntax for simple has new keyword "sdata". Example usage is:
sudo tc actions add action simple sdata "foobar" index 1
or
tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action simple sdata "foobar"
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
This action allows for a sending side to encapsulate arbitrary metadata
which is decapsulated by the receiving end.
The sender runs in encoding mode and the receiver in decode mode.
Both sender and receiver must specify the same ethertype.
At some point we hope to have a registered ethertype and we'll
then provide a default so the user doesnt have to specify it.
For now we enforce the user specify it.
Described in netdev01 paper:
"Distributing Linux Traffic Control Classifier-Action Subsystem"
Authors: Jamal Hadi Salim and Damascene M. Joachimpillai
Also refer to IETF draft-ietf-forces-interfelfb-04.txt
Lets show example usage where we encode icmp from a sender towards
a receiver with an skbmark of 17; both sender and receiver use
ethertype of 0xdead to interop.
YYYY: Lets start with Receiver-side policy config:
xxx: add an ingress qdisc
sudo tc qdisc add dev $ETH ingress
xxx: any packets with ethertype 0xdead will be subjected to ife decoding
xxx: we then restart the classification so we can match on icmp at prio 3
sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify
xxx: on restarting the classification from above if it was an icmp
xxx: packet, then match it here and continue to the next rule at prio 4
xxx: which will match based on skb mark of 17
sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue
xxx: match on skbmark of 0x11 (decimal 17) and accept
sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
handle 0x11 fw flowid 1:1 \
action ok
xxx: Lets show the decoding policy
sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
xxx:
filter pref 2 u32
filter pref 2 u32 fh 800: ht divisor 1
filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 0 success 0)
match 00000000/00000000 at 0 (success 0 )
action order 1: ife decode action reclassify type 0x0
allow mark allow prio
index 11 ref 1 bind 1 installed 45 sec used 45 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
Observe that above lists all metadatum it can decode. Typically these
submodules will already be compiled into a monolithic kernel or
loaded as modules
YYYY: Lets show the sender side now ..
xxx: Add an egress qdisc on the sender netdev
sudo tc qdisc add dev $ETH root handle 1: prio
xxx:
xxx: Match all icmp packets to 192.168.122.237/24, then
xxx: tag the packet with skb mark of decimal 17, then
xxx: Encode it with:
xxx: ethertype 0xdead
xxx: add skb->mark to whitelist of metadatum to send
xxx: rewrite target dst MAC address to 02:15:15:15:15:15
xxx:
sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 u32 \
match ip dst 192.168.122.237/24 \
match ip protocol 1 0xff \
flowid 1:2 \
action skbedit mark 17 \
action ife encode \
type 0xDEAD \
allow mark \
dst 02:15:15:15:15:15
xxx: Lets show the encoding policy
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 (rule hit 118 success 0)
match c0a87a00/ffffff00 at 16 (success 0 )
match 00010000/00ff0000 at 8 (success 0 )
action order 1: skbedit mark 17
index 11 ref 1 bind 1 installed 3 sec used 3 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 2: ife encode action pipe type 0xDEAD
allow mark dst 02:15:15:15:15:15
index 12 ref 1 bind 1 installed 3 sec used 3 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
Now test by sending ping from sender to destination
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
We need limits.h for PATH_MAX, fixes:
tc_bpf.c: In function ‘bpf_map_selfcheck_pinned’:
tc_bpf.c:222:12: error: ‘PATH_MAX’ undeclared (first use in this
function)
char file[PATH_MAX], buff[4096];
Signed-off-by: Gustavo Zacarias <gustavo@zacarias.com.ar>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Follow-up to kernel commit 6c9059817432 ("bpf: pre-allocate hash map
elements"). Add flags support, so that we can pass in BPF_F_NO_PREALLOC
flag for disallowing preallocation. Update examples accordingly and also
remove the BPF_* map helper macros from them as they were not very useful.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Make it easier to spot issues when loading the object file fails. This
includes reporting in what pinned object specs differ, better indication
when we've reached instruction limits. Don't retry to load a non relo
program once we failed with bpf(2), and report out of bounds tail call key.
Also, add truncation of huge log outputs by default. Sometimes errors are
quite easy to spot by only looking at the tail of the verifier log, but
logs can get huge in size e.g. up to few MB (due to verifier checking all
possible program paths). Thus, by default limit output to the last 4096
bytes and indicate that it's truncated. For the full log, the verbose option
can be used.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There is only a single user who needs it to be reentrant (not really,
but it's safer like this), add rt_addr_n2a_r() for it to use.
Signed-off-by: Phil Sutter <phil@nwl.cc>
There are only three users which require it to be reentrant, the rest is
fine without. Instead, provide a reentrant format_host_r() for users
which need it.
Signed-off-by: Phil Sutter <phil@nwl.cc>
As Jamal suggested, BRANCH is the wrong name, as these keywords go
beyond simple branch control - e.g. loops are possible, too. Therefore
rename the non-terminal to CONTROL instead which should be more
appropriate.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The retain value was wrong for u16 and u8 types.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
This was tricky to get right:
- The 'stride' value used for 8 and 16 bit values must behave inverse to
the value's intra word offset to work correctly with big-endian data
act_pedit is editing.
- The 'm' array's values are in host byte order, so they have to be
converted as well (and the ordering was just inverse, for some
reason).
- The only sane way of getting this right is to manipulate value/mask in
host byte order and convert the output.
- TIPV4 (i.e. 'munge ip src/dst') had it's own pitfall: the address
parser converts to network byte order automatically. This patch fixes
this by converting it back before calling pack_key32, which is a hack
but at least does not require to implement a completely separate code
flow.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Break overlong function definitions and remove one extraneous
whitespace.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Since the IP Header Length field is just half a byte, adjust retain to
only match these bits so the Version field is not overwritten by
accident.
The whole concept is actually broken due to dependency on endianness
which pedit ignores.
Signed-off-by: Phil Sutter <phil@nwl.cc>
This was horribly broken:
* pack_key8() and pack_key16() ...
* missed to invert retain value when applying it to the mask,
* did not sanitize val by ANDing it with retain,
* and ignored the mask which is necessary for 'invert' command.
* pack_key16() did not convert mask to network byte order.
* Changing the retain value for 'invert' or 'retain' operation seems
just plain wrong.
* While here, also got rid of unnecessary offset sanitization in
pack_key32().
* Simplify code a bit by always assigning the local mask variable to
tkey->mask before calling any of the pack_key*() variants.
Signed-off-by: Phil Sutter <phil@nwl.cc>
After lookup of the layered op submodule, pedit would pass argv and argc
including the layered op identifier at first position which confused the
submodule parser. Fix this by calling NEXT_ARG() before calling the
parse_peopt() callback.
Signed-off-by: Phil Sutter <phil@nwl.cc>
This seems to have been a hidden feature, though it's very useful and
necessary at least when combining multiple pedit actions.
Signed-off-by: Phil Sutter <phil@nwl.cc>
b3 buffer has been deleted previously so b2 is followed by b4
which is not consistent.
Signed-off-by: Dmitrii Shcherbakov <fw.dmitrii@yandex.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Phil Sutter <phil@nwl.cc>
Remove printing according to the previously used encoding of mpu and
overhead values within the tc_ratespec's mpu field. This encoding is
no longer being used as a separate 'overhead' field in the ratespec
structure has been introduced.
Signed-off-by: Dmitrii Shcherbakov <fw.dmitrii@yandex.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Phil Sutter <phil@nwl.cc>
Don't reimplement them and rather use the macros from the gelf header,
that is, GELF_ST_BIND()/GELF_ST_TYPE().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Provide some more hints to the user/developer when relos have been found
that don't point to ld64 imm instruction. Ran couple of times into relos
generated by clang [1], where the compiler tried to uninline inlined
functions with eBPF and emitted BPF_JMP | BPF_CALL opcodes. If this seems
the case, give a hint that the user should do a work-around to use
always_inline annotation.
[1] https://llvm.org/bugs/show_bug.cgi?id=26243#c3
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
With a bit larger, branchy eBPF programs f.e. already ~BPF_MAXINSNS/7 in
size, it happens rather quickly that bpf(2) rejects also valid programs
when only the verifier log buffer size we have in tc is too small.
Change that, so by default we don't do any logging, and only in error
case we retry with logging enabled. If we should fail providing a
reasonable dump of the verifier analysis, retry few times with a larger
log buffer so that we can at least give the user a chance to debug the
program.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Commit 8f80d450c3 ("tc: fix compilation with old gcc (< 4.6)") was reverted
to ease the merge of the net-next branch.
Here is the new version.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add a test that symbol from relocation entry is actually related
to map section and bail out with an error message if it's not the
case; in relation to [1].
[1] https://llvm.org/bugs/show_bug.cgi?id=26243
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
eBPF llvm backend can support different BPF formats, make sure the object
we're trying to load matches with regards to endiannes and while at it, also
check for other attributes related to BPF ELFs.
# llc --version
LLVM (http://llvm.org/):
LLVM version 3.8.0svn
Optimized build.
Built Jan 9 2016 (02:08:10).
Default target: x86_64-unknown-linux-gnu
Host CPU: ivybridge
Registered Targets:
bpf - BPF (host endian)
bpfeb - BPF (big endian)
bpfel - BPF (little endian)
[...]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
When extracting sections, we better check for name and type. Noticed
that some llvm versions emit .strtab and .shstrtab (e.g. saw it on pre
3.7), while more recent ones only seem to emit .strtab. Thus, make sure
we get the right sections.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Clean it up a bit, we can also get rid of some ugly ifdefs as in our case
TC_H_INGRESS is always defined.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
since all tc classifiers are required to specify ethertype as part of grammar
By not allowing eth_type to be specified we remove contradiction for
example when a user specifies:
tc filter add ... priority xxx protocol ip flower eth_type ipv6
This patch removes that contradiction
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
gcc < 4.6 does not handle C11 syntax for the static initialization of
anonymous struct/union, hence the following error:
tc_bpf.c:260: error: unknown field map_type specified in initializer
Signed-off-by: Julien Floret <julien.floret@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Fix a whitespace in bpf_dump_error() usage, and also a missing closing
bracket in ntohl() macro for eBPF programs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Since we have all infrastructure in place now, allow atomic live updates
on program arrays. This can be very useful e.g. in case programs that are
being tail-called need to be replaced, f.e. when classifier functionality
needs to be changed, new protocols added/removed during runtime, etc.
Thus, provide a way for in-place code updates, minimal example: Given is
an object file cls.o that contains the entry point in section 'classifier',
has a globally pinned program array 'jmp' with 2 slots and id of 0, and
two tail called programs under section '0/0' (prog array key 0) and '0/1'
(prog array key 1), the section encoding for the loader is <id/key>.
Adding the filter loads everything into cls_bpf:
tc filter add dev foo parent ffff: bpf da obj cls.o
Now, the program under section '0/1' needs to be replaced with an updated
version that resides in the same section (also full path to tc's subfolder
of the mount point can be passed, e.g. /sys/fs/bpf/tc/globals/jmp):
tc exec bpf graft m:globals/jmp obj cls.o sec 0/1
In case the program resides under a different section 'foo', it can also
be injected into the program array like:
tc exec bpf graft m:globals/jmp key 1 obj cls.o sec foo
If the new tail called classifier program is already available as a pinned
object somewhere (here: /sys/fs/bpf/tc/progs/parser), it can be injected
into the prog array like:
tc exec bpf graft m:globals/jmp key 1 fd m:progs/parser
In the kernel, the program on key 1 is being atomically replaced and the
old one's refcount dropped.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
The recently introduced object pinning can be further extended in order
to allow sharing maps beyond tc namespace. F.e. maps that are being pinned
from tracing side, can be accessed through this facility as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Make use of the new show_fdinfo() facility and verify that when a
pinned map is being fetched that its basic attributes are the same
as the map we declared from the ELF file. I.e. when placed into the
globalns, collisions could occur. In such a case warn the user and
bail out.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Now that we have the possibility of sharing maps, it's time we get the
ELF loader fully working with regards to tail calls. Since program array
maps are pinned, we can keep them finally alive. I've noticed two bugs
that are being fixed in bpf_fill_prog_arrays() with this patch. Example
code comes as follow-up.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
This larger work addresses one of the bigger remaining issues on
tc's eBPF frontend, that is, to allow for persistent file descriptors.
Whenever tc parses the ELF object, extracts and loads maps into the
kernel, these file descriptors will be out of reach after the tc
instance exits.
Meaning, for simple (unnested) programs which contain one or
multiple maps, the kernel holds a reference, and they will live
on inside the kernel until the program holding them is unloaded,
but they will be out of reach for user space, even worse with
(also multiple nested) tail calls.
For this issue, we introduced the concept of an agent that can
receive the set of file descriptors from the tc instance creating
them, in order to be able to further inspect/update map data for
a specific use case. However, while that is more tied towards
specific applications, it still doesn't easily allow for sharing
maps accross multiple tc instances and would require a daemon to
be running in the background. F.e. when a map should be shared by
two eBPF programs, one attached to ingress, one to egress, this
currently doesn't work with the tc frontend.
This work solves exactly that, i.e. if requested, maps can now be
_arbitrarily_ shared between object files (PIN_GLOBAL_NS) or within
a single object (but various program sections, PIN_OBJECT_NS) without
"loosing" the file descriptor set. To make that happen, we use eBPF
object pinning introduced in kernel commit b2197755b263 ("bpf: add
support for persistent maps/progs") for exactly this purpose.
The shipped examples/bpf/bpf_shared.c code from this patch can be
easily applied, for instance, as:
- classifier-classifier shared:
tc filter add dev foo parent 1: bpf obj shared.o sec egress
tc filter add dev foo parent ffff: bpf obj shared.o sec ingress
- classifier-action shared (here: late binding to a dummy classifier):
tc actions add action bpf obj shared.o sec egress pass index 42
tc filter add dev foo parent ffff: bpf obj shared.o sec ingress
tc filter add dev foo parent 1: bpf bytecode '1,6 0 0 4294967295,' \
action bpf index 42
The toy example increments a shared counter on egress and dumps its
value on ingress (if no sharing (PIN_NONE) would have been chosen,
map value is 0, of course, due to the two map instances being created):
[...]
<idle>-0 [002] ..s. 38264.788234: : map val: 4
<idle>-0 [002] ..s. 38264.788919: : map val: 4
<idle>-0 [002] ..s. 38264.789599: : map val: 5
[...]
... thus if both sections reference the pinned map(s) in question,
tc will take care of fetching the appropriate file descriptor.
The patch has been tested extensively on both, classifier and
action sides.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add missing spaces around operators to increase readability. Aside from
that, make "preference" match a real synonym for "tos" and "dsfield" as
it's effect was identical to them.
Signed-off-by: Phil Sutter <phil@nwl.cc>
This fixes a few syntax errors and changes route filter help text to use
classid instead of flowid to be consistent with other filters' help
texts.
Signed-off-by: Phil Sutter <phil@nwl.cc>
After the patch, the most minimal command to load an eBPF action
for late binding with auto index selection through tc is:
tc actions add action bpf obj prog.o
We already set TC_ACT_PIPE in tc as default opcode, so if nothing
further has been specified, just use it. Also, allow "ok" next to
"pass" for matching cmdline on TC_ACT_OK.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When having optional classid, most minimal command can be sth
like:
tc filter add dev foo parent X: bpf obj prog.o
Therefore, adapt the code so that a next argument will not be
enforced as the case currently.
Also, minor cleanup on the classid, where we should rather
have used addattr32(), and add flags for exec configuration,
for example (using short notation):
tc filter add dev foo parent X: bpf da obj prog.o
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
linux-3.19 fq packet scheduler got a new attribute, controlling
number of 'flows' holding packets not attached to a socket
(forwarding usage)
kernel commit is 06eb395fa9856b5a87cf7d80baee2a0ed3cdb9d7
("pkt_sched: fq: better control of DDOS traffic")
This patch adds corresponding code to tc command.
tc qd replace dev eth0 root fq orphan_mask 511
Signed-off-by: Eric Dumazet <edumazet@google.com>
Code to parse and export this tuneable via netlink is already present in
sched_fq.c of the kernel, so not making it accessible for users would be
a waste of resources.
Signed-off-by: Phil Sutter <phil@nwl.cc>
This patch follows the changes of commit 4d98ab0 ("Fix FSF address in
file headers"), fixing file headers added after it.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Frontend support for kernel commit a5c90b29e5cc ("act_bpf: properly
support late binding of bpf action to a classifier").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Error was:
f_bpf.o: In function `bpf_parse_opt':
f_bpf.c:(.text+0x88f): undefined reference to `secure_getenv'
m_bpf.o: In function `parse_bpf':
m_bpf.c:(.text+0x587): undefined reference to `secure_getenv'
collect2: error: ld returned 1 exit status
There is no special reason to use the secure version of getenv, thus let's
simply use getenv().
CC: Daniel Borkmann <daniel@iogearbox.net>
Fixes: 88eea53954 ("tc: {f,m}_bpf: allow to retrieve uds path from env")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
Kernel commit 04fd61ab36ec ("bpf: allow bpf programs to tail-call other
bpf programs") added support for tail calls, this patch here adds tc
front end parts for the object parser to prepopulate a given eBPF prog
array before the root prog is pushed down for classifier creation. The
prepopulation works with any number of prog arrays in any dependencies,
e.g. prog or normal maps could also be used from progs that are
tail-called themself, etc.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The initializers are simply not needed.
These if-blocks are outright dead code, because '0 > unsigned' is always
false, so only else clause triggers and regardless of which clause triggers
it only updates 'ind' which is later unconditionally written to before
being used anyway.
Otherwise we get errors from clang:
m_pedit.c:166:8: error: comparison of 0 > unsigned expression is always false [-Werror,-Wtautological-compare]
if (0 > tkey->off) {
~ ^ ~~~~~~~~~
m_pedit.c:209:8: error: comparison of 0 > unsigned expression is always false [-Werror,-Wtautological-compare]
if (0 > tkey->off) {
~ ^ ~~~~~~~~~
2 errors generated.
Change-Id: I3c9e9092915088fc56f992e5df736851541a4458
The for loop should only probe up to G[i]bit rates, so that we
end up with T[i]bit as the last max units[] slot for snprintf(3),
and not possibly an invalid pointer in case rate is multiple of
kilo.
Fixes: 8cecdc2837 ("tc: more user friendly rates")
Reported-by: Jose R. Guzman Mosqueda <jose.r.guzman.mosqueda@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There have been several instances where response from kernel
has overrun the stack buffer from the caller. Avoid future problems
by passing a size argument.
Also drop the unused peer and group arguments to rtnl_talk.
Allow the qdisc limit to be set, which is particularly useful when
the default VQ is not configured with RED parameters.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
codel & fq_codel packet schedulers are now able to have a threshold
for CE marking packets, regardless of the drop/nodrop decision taken by
CoDel.
This is particularly useful for dctcp and variants, that do not use
traditional ECN.
Note that fq_codel users would have to specify noecn if ce_threshold is
used, otherwise results would be not very interesting, as ecn is default
on for fq_codel.
$ tc -s qdisc show dev eth1
qdisc codel 8002: root refcnt 45 limit 1000p target 5.0ms ce_threshold
1.0ms interval 100.0ms
Sent 4908469888317 bytes 3351813967 pkt (dropped 0, overlimits 0
requeues 21624365)
rate 37671Mbit 3231836pps backlog 4904740b 250p requeues 21624365
count 0 lastcount 0 ldelay 1.1ms drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 410861803
Signed-off-by: Eric Dumazet <edumazet@google.com>
In the GRED kernel source code, both of the terms "drop parameters"
(DP) and "virtual queue" (VQ) are used to refer to the same thing.
Each "DP" is better understood as a "set of drop parameters", since
it has values for limit, min, max, avpkt, etc. This terminology can
result in confusion when creating a GRED qdisc having multiple DPs.
Netlink attributes and struct members with the DP name seem to have
been left intact for compatibility, while the term VQ was otherwise
adopted in the code, which is more intuitive.
Use the VQ term in the tc command syntax and output (but maintain
compatibility with the old syntax).
Rewrite the usage text to be concise and similar to other qdiscs.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
DPs, def_DP, and DP are unsigned values that are sent and received
in TCA_GRED_* netlink attributes; handle them properly when they
are parsed or printed. Use MAX_DPs as the initial value for def_DP
and DP, and fix the operator used for bounds checking them.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Make the output more consistent with the RED qdisc, and only show
details/statistics if the appropriate flag is set when calling tc.
Show the parameters used with "gred setup". Add missing statistics
"pdrop" and "other". Fix format specifiers for unsigned values.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
This is more helpful to the user, since the command takes two forms,
and the message that would otherwise appear about missing parameters
assumes one of those forms.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
The "bandwidth" parameter is optional, but ensure the user is aware
of its default value, to proactively avoid configuration problems.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
It is used when parsing three different parameters, only one of
which is Wlog. Change the name to make the code less confusing.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
When deleting a specific basic filter with handle,
tc command always ignores the 'handle' option, so
tcm_handle is always 0 and kernel deletes all filters
in the selected group. This is wrong, we should respect
'handle' in cmdline.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Currently, only on error we get a log dump, but I found it useful when
working with eBPF to have an option to also dump the log on success.
Also spotted a typo in a header comment, which is fixed here as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
This work follows upon commit 6256f8c9e4 ("tc, bpf: finalize eBPF
support for cls and act front-end") and takes up the idea proposed by
Hannes Frederic Sowa to spawn a shell (or any other command) that holds
generated eBPF map file descriptors.
File descriptors, based on their id, are being fetched from the same
unix domain socket as demonstrated in the bpf_agent, the shell spawned
via execvpe(2) and the map fds passed over the environment, and thus
are made available to applications in the fashion of std{in,out,err}
for read/write access, for example in case of iproute2's examples/bpf/:
# env | grep BPF
BPF_NUM_MAPS=3
BPF_MAP1=6 <- BPF_MAP_ID_QUEUE (id 1)
BPF_MAP0=5 <- BPF_MAP_ID_PROTO (id 0)
BPF_MAP2=7 <- BPF_MAP_ID_DROPS (id 2)
# ls -la /proc/self/fd
[...]
lrwx------. 1 root root 64 Apr 14 16:46 0 -> /dev/pts/4
lrwx------. 1 root root 64 Apr 14 16:46 1 -> /dev/pts/4
lrwx------. 1 root root 64 Apr 14 16:46 2 -> /dev/pts/4
[...]
lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
The advantage (as opposed to the direct/native usage) is that now the
shell is map fd owner and applications can terminate and easily reattach
to descriptors w/o any kernel changes. Moreover, multiple applications
can easily read/write eBPF maps simultaneously.
To further allow users for experimenting with that, next step is to add
a small helper that can get along with simple data types, so that also
shell scripts can make use of bpf syscall, f.e to read/write into maps.
Generally, this allows for prepopulating maps, or any runtime altering
which could influence eBPF program behaviour (f.e. different run-time
classifications, skb modifications, ...), dumping of statistics, etc.
Reference: http://thread.gmane.org/gmane.linux.network/357471/focus=357860
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
The warning was:
m_simple.c: In function ‘parse_simple’:
m_simple.c:142:4: warning: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘size_t’ [-Wformat]
Useful to be able to compile with -Werror.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Add ability to add the netfilter connmark support.
Typical usage:
...lets tag outgoing icmp with mark 0x10..
iptables -tmangle -A PREROUTING -p icmp -j CONNMARK --set-mark 0x10
..add on ingress of $ETH an extractor for connmark...
tc filter add dev $ETH parent ffff: prio 4 protocol ip \
u32 match ip protocol 1 0xff \
flowid 1:1 \
action connmark continue
...if the connmark was 0x11, we police to a ridic rate of 10Kbps
tc filter add dev $ETH parent ffff: prio 5 protocol ip \
handle 0x11 fw flowid 1:1 \
action police rate 10kbit burst 10k
Other ways to use the connmark is to supply the zone, index and
branching choice. Refer to help.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
This work finalizes both eBPF front-ends for the classifier and action
part in tc, it allows for custom ELF section selection, a simplified tc
command frontend (while keeping compat), reusing of common maps between
classifier and actions residing in the same object file, and exporting
of all map fds to an eBPF agent for handing off further control in user
space.
It also adds an extensive example of how eBPF can be used, and a minimal
self-contained example agent that dumps map data. The example is well
documented and hopefully provides a good starting point into programming
cls_bpf and act_bpf.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
If '-nm' specified that do not fail if there is no
default class names file in /etc/iproute2.
Changed default class name file cls_names -> tc_cls.
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
This work adds the tc frontend for kernel commit e2e9b6541dd4 ("cls_bpf:
add initial eBPF support for programmable classifiers").
A C-like classifier program (f.e. see e2e9b6541dd4) is being compiled via
LLVM's eBPF backend into an ELF file, that is then being passed to tc. tc
then loads, if any, eBPF maps and eBPF opcodes (with fixed-up eBPF map file
descriptors) out of its dedicated sections, and via bpf(2) into the kernel
and then the resulting fd via netlink down to cls_bpf. cls_bpf allows for
annotations, currently, I've used the file name for that, so that the user
can easily identify his filter when dumping configurations back.
Example usage:
clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
tc filter add dev em1 parent 1: bpf run object-file cls.o classid x:y
tc filter show dev em1 [...]
filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid x:y cls.o
I placed the parser bits derived from Alexei's kernel sample, into tc_bpf.c
as my next step is to also add the same support for BPF action, so we can
have a fully fledged eBPF classifier and action in tc.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Next argument after the tc opcode/verdict is optional, using NEXT_ARG()
requires to have another argument after that one otherwise tc will bail
out. Therefore, we need to advance to the next argument manually as done
elsewhere.
Fixes: 86ab59a666 ("tc: add support for BPF based actions")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Left-overs when copying this over from cls_bpf. ;) Lets remove them.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
When specified in a graph such as:
action vlan ... action foobar
the vlan action chewed more than it can swallow
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
The man page and the "fail" example are missing an underscore in the
nf_mark ematch.
eg.
tc filter add dev eth0 parent ffff: basic match 'meta(nfmark gt 24)'
classid 2:4
meta: unknown meta id
... >>meta(nfmark gt 24)<< ...
... meta(>>nfmark<< gt 24)...
Usage: meta(OBJECT { eq | lt | gt } OBJECT)
where: OBJECT := { META_ID | VALUE }
META_ID := id [ shift SHIFT ] [ mask MASK ]
Example: meta(nfmark gt 24)
meta(indev shift 1 eq "ppp")
meta(tcindex mask 0xf0 eq 0xf0)
For a list of meta identifiers, use meta(list).
Illegal "ematch"
meta(list) does correctly show nf_mark and the above test works with
nf_mark.
Signed-off-by: Andy Furniss adf.lists@gmail.com
Was broken by commit 288abf513f
Lets not be too clever and have a separate call to print flushed
actions info.
Broken looks like:
root@moja-1:~# tc actions add action drop index 4
root@moja-1:~# tc -s actions ls action gact
action order 0: gact action drop
random type none pass val 0
index 4 ref 1 bind 0 installed 9 sec used 4 sec
The fixed version looks like:
action order 0: gact action drop
random type none pass val 0
index 4 ref 1 bind 0 installed 9 sec used 4 sec
Sent 108948 bytes 1297 pkts (dropped 1297, overlimits 0)
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
First, the default value for 1-k is documented as being 0, but is
currently being set to 1. (100%). This causes all packets to be dropped
in the good state if 1-k is not explicitly specified. Fix this by setting
the default to 0.
Second, the 1-h option is parsed correctly, however, the kernel is
expecting "h", not 1-h. Fix this by inverting the "1-h" percentage before
sending to and after receiving from the kernel. This does change the
behavior, but makes it consistent with the netem documentation and the
literature on the Gilbert-Elliot model, which refer to "1-h" and "1-k,"
not "h" or "k" directly.
Last, fix a minor formatting issue for the options reporting.
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
When limit<burst latency becomes <0, for example:
# tc qdisc add dev eth0 root handle 1: tbf limit 100K burst 256K rate 256kbit
# tc qdisc show
qdisc tbf 1: dev eth0 root refcnt 2 rate 256Kbit burst 256Kb lat 4290.0s
If latency<0 there is no reason to show it. Limit will be printed instead of
latency when latency<0:
# tc qdisc show
qdisc tbf 1: dev eth0 root refcnt 2 rate 256Kbit burst 256Kb limit 100Kb
Signed-off-by: Sergey V. Lobanov <sergey@lobanov.in>
This also fixes a long standing bug of not sanely reporting the
action chain ordering
Sample scenario test
on window 1(event window):
run "tc monitor" and observe events
on window 2:
sudo tc actions add action drop index 10
sudo tc actions add action ok index 12
sudo tc actions ls action gact
sudo tc actions flush action gact
See the event window reporting two entries
(doing another listing should show empty generic actions)
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
We need limits.h for LONG_MIN and LONG_MAX, sys/param.h for MIN and
sys/select for struct timeval.
This fixes the following compile errors with musl libc:
f_bpf.c: In function 'bpf_parse_opt':
f_bpf.c:181:12: error: 'LONG_MIN' undeclared (first use in this function)
if (h == LONG_MIN || h == LONG_MAX) {
^
...
tc_util.o: In function `print_tcstats2_attr':
tc_util.c:(.text+0x13fe): undefined reference to `MIN'
tc_util.c:(.text+0x1465): undefined reference to `MIN'
tc_util.c:(.text+0x14ce): undefined reference to `MIN'
tc_util.c:(.text+0x154c): undefined reference to `MIN'
tc_util.c:(.text+0x160a): undefined reference to `MIN'
tc_util.o:tc_util.c:(.text+0x174e): more undefined references to `MIN' follow
...
tc_stab.o: In function `print_size_table':
tc_stab.c:(.text+0x40f): undefined reference to `MIN'
...
fdb.c:247:30: error: 'ULONG_MAX' undeclared (first use in this function)
(vni >> 24) || vni == ULONG_MAX)
^
lnstat.h:28:17: error: field 'last_read' has incomplete type
struct timeval last_read; /* last time of read */
^
Signed-off-by: Natanael Copa <ncopa@alpinelinux.org>
BUG: tc filter show ... produce a segmentation fault if more than one
filter rule with action -j MARK exists.
Reason: In print_ipt(...) xtables will be initialzed with a
pointer to the static struct tcipt_globals at xtables_init_all().
Later on the fields .opts and .options_offset of tcipt_globals are
modified. The call of xtables_free_opts(1) at the end of print(...)
does not restore the original values of tcipt_globals for the
modified fields. It only frees some allocated memory and sets
.opts to NULL. This leads to a segmentation fault when print_ipt()
is called for the next filter rule with action -j MARK.
Fix: Cloneing tcipt_globals on the stack as tmp_tcipt_globals and
use it instead of tcipt_globals, so tcipt_globals will be not
modified.
Signed-off-by: Andreas Greve <andreas.greve@a-greve.de>
The display of the entire netem loss state is shown as if it
were gemodel state, as the loss state information is assigned to the
wrong pointer. Correct this by assigning the loss state to the correct
pointer.
Additionally, attempting to set netem loss state will result in
random values in the p14 state probability because the option value
passed to the kernel by tc netem is not parsed or initialized. Fix this
by supplying a default value of 0 for p14 and parsing the p14 value if
one is supplied.
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
The direct_qlen command option is used with qdisc operation.
It happened to be implemented in htb_parse_class_opt() which is called
with class operation.
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
netem support 64bit rates start from linux-3.13.
Add 64bit rates support in tc tools.
tc qdisc show dev eth0
qdisc netem 1: dev eth4 root refcnt 2 limit 1000 rate 35Gbit
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Eric Dumazet <edumazet@google.com>
To avoid loss when transforming burst to buffer in userspace, send
burst/mtu to kernel directly.
Kernel commit 2e04ad424b("sch_tbf: add TBF_BURST/TBF_PBURST attribute")
make it can handle burst/mtu.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Proportional Integral controller Enhanced (PIE) is a scheduler to address the
bufferbloat problem.
We present here a lightweight design, PIE(Proportional Integral controller
Enhanced) that can effectively control the average queueing latency to a target
value. Simulation results, theoretical analysis and Linux testbed results have
shown that PIE can ensure low latency and achieve high link utilization under
various congestion situations. The design does not require per-packet
timestamp, so it incurs very small overhead and is simple enough to implement
in both hardware and software. "
For more information, please see technical paper about PIE in the IEEE
Conference on High Performance Switching and Routing 2013. A copy of the paper
can be found at ftp://ftpeng.cisco.com/pie/.
Please also refer to the IETF draft submission at
http://tools.ietf.org/html/draft-pan-tsvwg-pie-00
All relevant code, documents and test scripts and results can be found at
ftp://ftpeng.cisco.com/pie/.
For problems with the iproute2/tc or Linux kernel code, please contact Vijay
Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) Mythili Prabhu
(mysuryan@cisco.com)
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
CC: Dave Taht <dave.taht@bufferbloat.net>
attached.
cheers,
jamal
commit 58d78f9f6447df324cdeb99262442c5e3f1f924b
Author: Jamal Hadi Salim <jhs@mojatatu.com>
Date: Sun Dec 22 10:34:18 2013 -0500
dont skip displaying of action chains or lists by TCA_ACT_MAX_PRIO
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
attached.
cheers,
jamal
commit d7869e6167c3553e93e254940b0647032b40fed8
Author: Jamal Hadi Salim <jhs@mojatatu.com>
Date: Sun Dec 22 07:46:28 2013 -0500
print new line at the end for aesthetics
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
attached.
cheers,
jamal
commit b82057d9ec851a8aba8a295b959190ef5098f330
Author: Jamal Hadi Salim <jhs@mojatatu.com>
Date: Sat Dec 21 17:00:11 2013 -0500
After a decade of trying to deprecate the old policer syntax,
I believe it is time to kill it. The kernel build option for old
policer is gone for at least 5 years now (although backward
compatibility is still there). Being backward compatible meant
hijacking the keyword "action" and was obstructing policies like:
tc filter add dev eth0 parent ffff: protocol ip pref 10 \
u32 match ip protocol 1 0xff flowid 1:10 \
action skbedit mark 1 \
action police rate 10kbit burst 10k pipe \
action skbedit mark 2 \
action police rate 20kbit burst 20k pipe \
action action mirred egress mirror dev dummy0
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Display more user friendly rates.
10Mbit is more readable than 10000Kbit
Before :
class htb 1:2 root prio 0 rate 10000Kbit ceil 10000Kbit ...
After:
class htb 1:2 root prio 0 rate 10Mbit ceil 10Mbit ...
Signed-off-by: Eric Dumazet <edumazet@google.com>
tbf support 64bit rates start from linux-3.13.
Add 64bit rates support in tc tools.
tc qdisc show dev eth0
qdisc tbf 1: root refcnt 2 rate 40000Mbit burst 230000b peakrate 50000Mbit minburst 87500b lat 50.0ms
This is a followup to ("htb: support 64bit rates").
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Cc: Eric Dumazet <edumazet@google.com>
This is the iproute2 part of the kernel patch "net: sched:
add BPF-based traffic classifier".
[Will re-submit later again for iproute2 when window for
-next submissions opens.]
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
There are two global variables in tc/tc_class.c:
__u32 filter_qdisc;
__u32 filter_classid;
These are not re-initialized for each line received in -batch mode:
class show dev eth0 parent 1: classid 1:1
class show dev eth0 parent 1: classid 1:1
Error: duplicate "classid": "1:1" is the second value.
This patch fixes the issue by initializing the two globals when we
enter print_class().
Signed-off-by: Nigel Kukard <nkukard@lbsd.net>
Some qdisc like htb want the parse_qopt to be called even if no options
present. Fixes regression caused by:
e9e78b0db0 is the first bad commit
commit e9e78b0db0
Author: Stephen Hemminger <stephen@networkplumber.org>
Date: Mon Aug 26 08:41:19 2013 -0700
tc: allow qdisc without options
If you taketh you giveth.
I Went the LinuxWay and copied this for m_simple.c and noticed
this one typo (I wonder where it came from?;->).
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Simple action is already in the kernel for years now as an
example. This complements it with user space control.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
TCA_HTB_DIRECT_QLEN attribute is supported since linux-3.10
HTB classes use an internal pfifo queue, which limit was not reported
by tc, and value inherited from device tx_queue_len at setup time.
With this patch, tc displays the value and can change it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Since linux-3.11, rate estimator can provide TCA_STATS_RATE_EST64
when rate (bytes per second) is above 2^32 (~34 Mbits)
Change tc to use this attribute for high rates.
Signed-off-by: Eric Dumazet <edumazet@google.com>
This iproute2 tc patch is connected to the kernel
- commit 8a8e3d84b17 (net_sched: restore "linklayer atm" handling)
The rate table calculated by tc, have gotten replaced in the kernel
and is no-longer used for lookups.
This happened in kernel release v3.8 caused by kernel
- commit 56b765b79 ("htb: improved accuracy at high rates").
This change unfortunately caused breakage of tc overhead and
linklayer parameters.
Kernel overhead handling got fixed in kernel v3.10 by
- commit 01cb71d2d47 (net_sched: restore "overhead xxx" handling)
Kernel linklayer handling got fixed in kernel v3.11 by
- commit 8a8e3d84b17 (net_sched: restore "linklayer atm" handling)
The linklayer fix introduced a struct change, that allow the linklayer
attribute to be transferred between tc and kernel. This patch make use
of this linklayer attribute.
The linklayer setting is transfer to the kernel. And linklayer
setting received from the kernel is printed with a prefixed
"linklayer" when listing current configuration. The default
TC_LINKLAYER_ETHERNET is only printed in detailed output mode.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
On Mon, 2013-06-03 at 16:36 +0100, Ben Hutchings wrote:
> Oops, I read this as being strtol() currently, not strtod(). Currently
> '1.5gbit' will work, but this change will break that. So I think you
> need to keep bps as a double.
Arg
> Then here I think the check should be *rate != floor(bps), i.e. accept
> rounding down of a non-integer number of bytes but any other change is
> assumed to be overflow.
Thanks Ben, here is v4 then ;)
[PATCH v4] get_rate: detect 32bit overflows
Current rate limit is 34.359.738.360 bit per second, and
unfortunately 40Gbps links are above it.
overflows in get_rate() are currently not detected, and some
users are confused. Let's detect this and complain.
Note that some qdisc are ready to get extended range, but this will
need additional attributes and new iproute2
With help from Ben Hutchings
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
"tc class show dev ..." omits the overhead attribute for HTB.
After patch I have :
tc class add dev $DEV parent 1: classid 1:1 est 1sec 4sec htb \
rate 12Mbit mtu 1500 quantum 1514 overhead 20
tc class show dev $DEV
class htb 1:1 root prio 0 rate 12000Kbit overhead 20 ceil 12000Kbit
burst 1500b cburst 1500b
Signed-off-by: Eric Dumazet <edumazet@google.com>
In trying to build on a RHEL6.3 I ran into several build issues that are
addressed in this patch.
The first is that xtables_merge_options only has 3 parameters. It appears
this is how this code was originally. As such for the case where the version
is less than 6 I am assuming it would be correct to maintain the original
setup that only had 3 parameters being passed instead of 4.
I also ran into an issue with the define for __ALIGN_KERNEL not being present.
I believe this may be due to the fact that __ALIGN_KERNEL was moved into a
separate header from ALIGN after the UAPI changes. In order to just cover all
of the bases I have moved the main definition for the macros into
__ALIGN_KERNEL_MASK and __ALIGN_KERNEL and if ALIGN is also needed then it is
just a direct redefine to __ALIGN_KERNEL.
Cc: Hasan Chowdhury <shemonc@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Clearer error messages for fifo and tbf qdiscs:
- Say who is complaining
- Don't just say a parameter is bad, show the offending parameter
- Be clearer about duplicate parameters vs illegal pairs of parameters
- Try to give multiple error messages rather than let the user discover the errors one by one
- When there are parameter aliases, try to use the variant that was used, or at least mention them all
Note that in the old version an empty parameter list to tbf would just cause an explain() message
without a specific error message. By simply removing the relevant error check, the code now
handles this error more gracefully by printing an error message for all mandatory parameters.
It still prints the explain() message.
Signed-off-by: Kees van Reeuwijk <reeuwijk@few.vu.nl>
On openSUSE 12.2 (at least) xtables.h is not installed in the system-wide
include dir but in /usr/include/iptables-1.4.16.3/. This results in the
following build failure:
em_ipset.c:26:21: fatal error: xtables.h: No such file or directory
Other includers of xtables.h already call out to pkg-config
Fixes breakage with xtables API starting with version 1.4.10
Signed-off-by: Hasan Chowdhury <shemonc@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Rather than hard coding `pkg-config`, use ${PKG_CONFIG} so people can
override it to their specific version (like when cross-compiling).
This is the same way the upstream pkg-config code works.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Hi,
When compiling iproute2-3.6.0 on a host that doesn't have iptables available, I get the following error:
gcc -Wall -Wstrict-prototypes -O2 -I../include -DRESOLVE_HOSTNAMES
-DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE
-DCONFIG_GACT -DCONFIG_GACT_PROB -DYY_NO_INPUT -c -o em_ipset.o
em_ipset.c
em_ipset.c:26:21: fatal error: xtables.h: No such file or directory
Fixed by the following patch, which guards the building of em_ipset.o on
the presence of suitable headers.
Thanks,
Matt.
This ematch enables effective filtering of CAN frames (AF_CAN) based
on CAN identifiers with masking of compared bits. Implementation
utilizes bitmap based classification for standard frame format (SFF)
which is optimized for minimal overhead.
Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
example usage:
tc filter add dev $dev parent $id: basic match not ipset'(foobar src)' ..
also updates iproute2/ematch_map, else tc complains:
Error: Unable to find ematch "ipset" in /etc/iproute2/ematch_map
Please assign a unique ID to the ematch kind the suggested entry is:
8 ipset
when trying to use this ematch.
(text ematch (5) only exists in kernel, a vlan ematch (6) exists neither in
kernel nor userspace, but kernel headers define TCF_EM_VLAN == 6).
Since the get_rate() code incorrectly interpreted bare number, the
behavior is not the same as man page and comment described.
We need to change the man page and comment for compatible with the
existing usage by scripts.
Because we use the high 16 bits of tcm_info to pass prio value to
kernel, thus it's range would be [0, 0xffff], without validation
in tc when user pass a lager(>65535) priority, the actual priority
set in kernel would confuse the user.
So, add a validation to ensure prio in the range.
On current firstfrag filter, all non fragmented packets are matched.
firstfrag should check MF bit.
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
The off of icmp_code is not 20 but 21. Also offmask should be 0 unless
nexthdr+ is specified.
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Fair Queue Codel packet scheduler
Principles :
- Packets are classified (internal classifier or external) on flows.
- This is a Stochastic model (as we use a hash, several flows might
be hashed on same slot)
- Each flow has a CoDel managed queue.
- Flows are linked onto two (Round Robin) lists,
so that new flows have priority on old ones.
- For a given flow, packets are not reordered (CoDel uses a FIFO)
- head drops only.
- ECN capability is on by default.
- Very low memory footprint (64 bytes per flow)
tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
[ target TIME ] [ interval TIME ] [ noecn ]
[ quantum BYTES ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Van Jacobson <van@pollere.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Changli Gao <xiaosuo@gmail.com>
An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.
http://queue.acm.org/detail.cfm?id=2209336
This AQM main input is no longer queue size in bytes or packets, but the
delay packets stay in (FIFO) queue.
As we don't have infinite memory, we still can drop packets in enqueue()
in case of massive load, but mean of CoDel is to drop packets in
dequeue(), using a control law based on two simple parameters :
target : target sojourn time (default 5ms)
interval : width of moving time window (default 100ms)
Selected packets are dropped, unless ECN is enabled and packets can get
ECN mark instead.
Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
[ interval TIME ] [ ecn ]
qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
maxpacket 1514 ecn_mark 84399 drop_overlimit 0
CoDel must be seen as a base module, and should be used keeping in mind
there is still a FIFO queue. So a typical setup will probably need a
hierarchy of several qdiscs and packet classifiers to be able to meet
whatever constraints a user might have.
One possible example would be to use fq_codel, which combines Fair
Queueing and CoDel, in replacement of sfq / sfq_red.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
This patch provides support for marking packets with ECN instead of
dropping them with netem. This makes it possible to make use of the
netem ECN marking feature that was added recently to the kernel.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Define where is the are located the iproute2 config files.
Get rid of trailing slashes for paths in several file.
Signed-off-by: Christoph J. Thompson <cjsthompson@gmail.com>
As reported by Thomas Mühlgrabner <muehltom@cable.vol.at>
in http://bugs.debian.org/662979 :
When showing htb class configuration with "tc -iec class show",
the output for Mibit is actually the value for bit.
Example: configure a class with a ceil of 1000Mibit.
Output states 1048576000 Mibit.
The cause is missing parenteses in the display code of tc....
(Please also note that a lower value of 100Mibit will be displayed
as 102400 Kibit, which I think is kind of ugly.)
Reported-by: Thomas Mühlgrabner <muehltom@cable.vol.at>
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
LIBNETLINK will be defined in the main Makefile, so
both ../lib/libnetlink.a ../lib/libutil.a will be
automatically appended during linking. Otherwise
../lib/libnetlink.a ../lib/libutil.a will appear
twice during linking.
Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
(Resending patch since it looks like my earlier mail did not make it to
netdev).
netem reordering requires that the delay parameter be given. Currently, if no
delay is given, tc prints the error message but still installs the qdisc. Fix
this by printing the usage and failing cleanly.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
TCA_CHOKE_MAX_P permits to express high resolution RED probability.
tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec choke \
limit 90 ecn min 10 max 30 probability 0.05 bandwidth 10Mbit
Before patch :
tc -s -d qdisc show dev eth3
qdisc ... limit 90p min 10p max 30p ecn ewma 3 Plog 19 Scell_log 13
After :
qdisc ... limit 90p min 10p max 30p ecn ewma 3 probability 0.05
Scell_log 13
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Adds an optional Random Early Detection on each SFQ flow queue.
Traditional SFQ limits count of packets, while RED permits to also
control number of bytes per flow, and adds ECN capability as well.
1) We dont handle the idle time management in this RED implementation,
since each 'new flow' begins with a null qavg. We really want to address
backlogged flows.
2) if headdrop is selected, we try to ecn mark first packet instead of
currently enqueued packet. This gives faster feedback for tcp flows
compared to traditional RED [ marking the last packet in queue ]
Example of use :
tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
limit 3000 headdrop flows 512 divisor 16384 \
redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn
qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
ewma 6 min 8000b max 60000b probability 0.2 ecn
prob_mark 0 prob_mark_head 4876 prob_drop 6131
forced_mark 0 forced_mark_head 0 forced_drop 0
Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007
requeues 0)
rate 99483Kbit 8219pps backlog 689392b 456p requeues 0
In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled
flows, we can see number of packets CE marked is smaller than number of
drops (for non ECN flows)
If same test is run, without RED, we can check backlog is much bigger.
qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Enable Adaptative RED algo, using :
tc qdisc ... red limit BYTES ... adaptative ...
Support of high precision probability/max_p setting and reporting, with
support of old kernels.
With a new kernel, "Plog ..." is replaced in tc output by "probability
value" :
qdisc red 10: dev eth3 parent 1:1 limit 360Kb min 30Kb max 90Kb ecn ewma
5 probability 0.09 Scell_log 15
This patch add rate shaping as well as cell support. The link-rate can be
specified via rate options. Three optional arguments control the cell
knobs: packet-overhead, cell-size, cell-overhead. To ratelimit eth0 root
queue to 5kbit/s, with a 20 byte packet overhead, 100 byte cell size and
a 5 byte per cell overhead:
tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Both rtnl_talk and rtnl_dump had a callback for handling portions
of netlink message that do not match the correct pid or seq.
But this callback was never used by any part of iproute2 so remove
it.
Add harddrop support (kernel support added a long time ago), and various
cleanups.
min BYTES, max BYTES are now optional and follow Sally Floyd's
recommendations.
By the way, our default 2% probability is a bit low, Sally recommends 10%.
Not a big deal if upcoming adaptative algo is deployed.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Documentation advises to set burst to (min+min+max)/(3*avpkt)
Let tc do this automatically if user doesnt provide burst himself.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
get_distribution() returns an int.
cppcheck reported:
[tc/q_netem.c:243]: (style) Checking if unsigned variable 'dist_size' is less than zero.
The mismatch actually rendered the error checking
after get_distribution() ineffective.
Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
This patch adds detailed documentation for HFSC scheduler. It roughly
follows HFSC paper, but tries to not rely too much on math side of things.
Post-paper/Linux specific subjects (timer resolution, ul service curve, etc.)
are also discussed.
I've read it many times over, but it's a lengthy chunk of text - so try
to be understanding in case I made some mistakes.
tc-hfsc(7): explains algorithm in detail (very long)
tc-hfsc(8): explains command line options briefly
tc(8): adds references to new man pages
Makefile: adds man7 directory to install target
q_hfsc.c: minimal help text changes, consistency with tc-hfsc(8)
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Building iproute2 in parallel might hit the race failure:
emp_ematch.l:2:30: fatal error: emp_ematch.yacc.h:
No such file or directory
make[1]: *** [emp_ematch.lex.o] Error 1
This is because we currently allow the yacc/lex files to generate and
compile in parallel. So add a simple dependency to make sure yacc has
finished before we attempt to compile the lex output.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
iptables/xtables apparently changed API again.... Now you need to pass
and extra parameter (orig_opts) which was not needed before.
Sprinkle some lovely pre-processor magic to be compatible with both older
and new versions. In the beginning of times XTABLES_VERSION_CODE didn't
exist. Then it was (0x10000 * major + 0x100 * minor + patch) when it was
first introduced (according to git), but now it's at 6...
Don't know what official iptables releases has defined it to over time.
Lets just hope none of the older versions with is has the define
higher then 6 is still around.... so only the "current" versioning
scheme is supported.... lets see how long this lasts now.
For the API change in xtables, see:
http://git.netfilter.org/cgi-bin/gitweb.cgi?p=iptables.git;a=commitdiff;h=600f38db82548a683775fd89b6e136673e924097
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
Add the iproute2 support for the ACT_CSUM action. Can be used as
following, certainly in conjunction with the ACT_PEDIT action (pedit):
# In order to DNAT (stateless) IPv4 packet from 192.168.1.100 to
# 0x12345678 (18.52.86.120), and update the IPv4 header checksum and
# the UDP checksum (the last one, only if the packet is UDP).
tc filter add eth0 prio 1 protocol ip parent ffff: \
u32 match ip src 192.168.1.100/32 flowid :1 \
action pedit munge offset 16 u32 set 0x12345678 \
pipe csum ip and udp
# In order to alter destination address of IPv6 TCP packets from fc00::1
# and correct the TCP checksum (nothing happened? except maybe for
# checksums in the TCP payload ...).
tc filter add eth0 prio 1 protocol ipv6 parent ffff: \
u32 match ip6 src fc00::1/128 match ip6 protocol 0x06 0xff flowid :1 \
action pedit munge offset 24 u32 set 0x12345678 \
pipe csum tcp
We can use rxhash to classify the traffic into flows. As rxhash maybe
supplied by NIC or RPS, it is cheaper.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
iptables dropped the xtables_set_revision() function around version 1.4.9,
so set the rev directly ourselves. This should be compatible back to the
original version m_xt itself is designed for.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Fixes problems with xtables based MARK target ("ipt" module).
When tc loads the "ipt" (xt) module it kept the symbols local,
this made loading of libxtables not find the required struct.
currently ipt/xt is the only tc action module.
iproute2 never seem to do dlclose.
hopefully the modules doesn't export more symbols then needed.
In this situation hopefully the RTLD_GLOBAL flag won't hurt us.
I've been using this patch in the Debian package of iproute for
the last 3 weeks and noone has complained.
( This fixes http://bugs.debian.org/584898 )
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
This patch adds ipv6 filter priority/traffic class function
static int parse_ip6_class(int *argc_p, char ***argv_p, struct tc_u32_sel *sel)
shifting filter value to 5th bit and ignoring "at" as header position
is exactly given.
Signed-off-by: Petr Lautrbach <plautrba@redhat.com>
The recent commit "iproute2: add option to build m_xt as a tc module"
(ab814d6355) looks like it wrongly included debug changes in the
install target. So drop the `echo` so the tc binary actually gets
installed again.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
When building on an old environment, the flex generated
tc/emp_ematch.lex.c file would not compile. The error given was:
emp_ematch.lex.c:1686: error: expected â;â, â,â or â)â before numeric constant
The emp_ematch.l uses 'str' as a start symbol name, and flex would create
a '#define str 1' statement. This particular version of flex,
unfortunately, used 'str' as names of string variables in the generated
parser functions. This is line 1686 in the generated file:
YY_BUFFER_STATE ematch__scan_string (yyconst char * str )
This patch just substitutes 'str' for 'lexstr' in emp_ematch.l to avoid
the collision.
This will build the xt module (action ipt) of tc as a
shared object that is linked at runtime by tc if used,
rather then built into tc.
This is similar to how the atm qdisc support
is handled (q_atm.so).
Signed-off-by: Andreas Henriksson <andreas@xxxxxxxx>
Try to automatically detect iptables modules directory.
Make the configure script look for iptables modules.
This also makes it possible to specify it on the
command line while building via "make IPT_LIB_DIR=/foo/bar".
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
parsing a mark as a classid allows for acceptance of strange
informal input.
cheers,
jamal
commit aad0da6507ff8a95a63ed8e529c05f52be5b0e75
Author: Jamal Hadi Salim <hadi@cyberus.ca>
Date: Mon Feb 15 06:45:29 2010 -0500
skbedit: use get_u32 for parsing mark
get_u32 is the more appropriate parser for a mark.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
This adds the required changes to gain access to
the head drop classfull queuing discipline named
pfifo_head_drop. In difference to pfifo or pfifo_fast
this queuing discipline will drop the first packet
in the case of queue congestion. As a result the queue
contain always the freshest packets.
To replace the current a root queueing discipline
for eth0:
$ tc qdisc replace dev eth0 root pfifo_head_drop
And show statistics:
$ tc -s qdisc show dev eth0
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Since there aren't any targets that currently use this pattern rule, this
is more of a proactive fix.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
This adds support for setting the skb mark.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Add a new cleaned up m_xt.c based on m_xt_old.c
The new m_xt.c has been updated to use the new names and new api
that xtables exposes in iptables 1.4.5.
All the old internal api cruft has also been dropped.
Additionally, a configure script test is added to check for
the new xtables api and set the TC_CONFIG_XT flag in Config.
(tc/Makefile already handles this flag in previous commit.)
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
Move the file and rename the configure flags.
The file is being kept around for iptables < 1.4.5 compatibility.
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
The kernel takes a lack of options as indication that the fw classifier
should operate in compatibility mode, where marks are mapped directly to
classids.
Commit e22b42a (tc mask patch) broke this by adding an empty TCA_OPTIONS
attribute even if no handle is specified. Restore the old behaviour.
Signed-off-by: Patrick McHardy <kaber@trash.net>
A bunch of source files look like they're copy & pasted from other files,
and some include header files that they don't actually need. Since dlfcn
has very specific usage (and is a pain on a static-only system), drop it
where it isn't really needed.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
The iptables code supports a "no shared libs" mode where it can be used
without requiring dlfcn related functionality. This adds similar support
to iproute2 so that it can easily be used on systems like nommu Linux (but
obviously with a few limitations -- no dynamic plugins).
Rather than modify every location that uses dlfcn.h, I hooked the dlfcn.h
header with stub functions when shared library support is disabled. Then
symbol lookup is done via a local static lookup table (which is generated
automatically at build time) so that internal symbols can be found.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Sometimes while dividing bandwidth by classes it is useful to see how some
specific class doing things live.
Which my simple patch it is possible to do
watch -n1 "tc -s -d class show dev eth0.2022 classid 1:1520"
and to get live statistics, how packets queued or dropped, and how much
bandwidth used (if estimator defined) for specific class.
Signed-off-by: Denys Fedoryshchenko <denys@visp.net.lb>
This change was forgotten by Stephen in the last release
Signed-off-by: Denys Fedoryschenko <denys@visp.net.lb>
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Many thanks to Yevgeny Kosarzhevsky <yevg@pisem.net> for reporting
and a lot of testing
Thanks to Jan Engelhardt <jengelh@medozas.de> for a lot of advice
Thanks to Denys Fedoryschenko <denys@visp.net.lb> for some sample
code that he tried and thanks to Andreas Henriksson <andreas@fatal.se>
(who maintains iproute2 on debian) for the persistent followup.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Original from: Alexander Duyck <alexander.h.duyck@intel.com>
A bug was found in which the memory for the tc_skbedit struct was being
used uninitialized to 0. Alternative version of original fix
using initializer rather than memset.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
1)optind according iptables sources have to be set to 0. If it is set to 1, in
batch it will mess up things. Also in iptables sources i notice that ->tflags
and ->used need to be reset.
2)Since target->t = fw_calloc(1, size); allocated memory in function build_st,
it have to be freed at the end, or in batch we will have memory leak. TODO:
Probably it must be freed in all "return -1" cases in parse_ipt after
build_st. About this i am not sure, up to Stephen.
3)new_name was malloc'ed, but not freed
Add support for multiq qdisc
This patch adds the ability to configure the multiq qdisc. Since the qdisc does not require any input it will pull the number of bands directly from the device that it is added to the root of.
usage: tc qdisc add dev <DEV> root handle <HANDLE> multiq
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Provides ability to edit queue_mapping field
Provides ability to edit priority field
usage: action skbedit [queue_mapping QUEUE_MAPPING] [priority PRIORITY]
at least one option must be select, or both at the same time
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Hello Rafael Almeida.
I noticed your patch adding DESTDIR support in the latest iproute2 release.
Much appreciated! Soon the debian packages might be able to move to actually
using "make install" rather then it's own installation procedure when
building packages. I've noticed something that will break though....
Debian packages usually sets DESTDIR=debian/tmp/ and packages the contents
of that directory as if it where the root file system. This will break
the /usr/lib/{tc,ip}/ module loading, because they DESTDIR (/usr) will be
/whatever-the-build-path-was/debian/tmp/lib/{tc,ip}/.
I beleive others usually call this the LIBDIR to make the separation between
DISTDIR being the (possibly temporary) place things are put when build is
done, and LIBDIR (and others) are used for actual runtime paths.
I'm attaching a patch that I think fixes this, but would be really happy if
you could have a look at to verify I'm not screwing something up.
--
Regards,
Andreas Henriksson
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Patch adds generic size table that is similiar to rate table, with
difference that size table stores link layer packet size.
Based on patch by Patrick McHardy
http://marc.info/?l=linux-netdev&m=115201979221729&w=2
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
After changing the DESTDIR the installated binaries have some issues
due to hard coded paths. For example, using distributions on NetEm
would segfault.
I've changed iplink.c and tc_util.c so they are now aware of DESTDIR.
Along with that change I needed to change the main Makefile so it
defines the DESTDIR macro when calling gcc.
I also changed the paths so that during the installation sbin, etc,
share and lib directories are created directly inside of the DESTDIR,
instead of creating a usr directory inside that. That's the behaviour
of most packages out there, so I think most users will be expecting
that to happen.
> # tc filter show dev eth1 | grep 4:29:d1
> filter parent 1: protocol ip pref 5 u32 fh 4:29:d1 order 209 key ht 4
> bkt 29 flowid 1:b7aa
>
> # tc filter del dev eth1 parent 1: pref 5 handle 4:29:d1 u32
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
>
> after rollback to package"sys-apps/iproute2-2.6.24.20080108" all
> deleted normal...
The current iproute version uses "protocol all" by default
if its not specified. This is actually only useful for creating
new filters, on deletion an unset protocol is treated as wildcard.
And last for now ..
cheers,
jamal
[PATCH 3/3] [TC/U32] Infrastructure for pretty printing
This patch makes it easy to add pretty printers of different protocols.
For starters it makes use of ipv4 and raw printers.
Add more later ...
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
makes protocol accessible ..
cheers,
jamal
[PATCH 2/3] [TC/FILTERS] Expose the filter protocol
Expose the filter protocol so it can be used by underlying
classifiers when they need it.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Don't break scripts that depend on previous offset/value format.
Introduce a new -pretty flag for decoding, and (*gasp*) document
the formatting arguments.
commit c504ffd627ac211eebf5ed34ef0fbfd7f1dbb347
Author: Patrick McHardy <kaber@trash.net>
Date: Wed Mar 26 07:38:43 2008 +0100
[IPROUTE]: Fix classifier help
The new check whether the user has specified a protocol makes
"ip filter <type> help" fails with "protocol is required".
This could be fixed by moving it further down, but a more user-friendly
way it to simply use ETH_P_ALL as default if nothing is specified.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Introducing the function that does the ATM cell alignment, and
modifying tc_calc_rtable() to use this based upon a linklayer
parameter.
Modified from original to use constants from atm.h and
fix all the usages of rtable in same patch.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
For police, implement overhead parameter parsing.
The change is ABI (Application Binary Interface) backward compatible
with older kernels, but will first have effect from kernel 2.6.24.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
For CBQ, implement overhead parameter parsing.
The change is ABI (Application Binary Interface) backward compatible
with older kernels, but will first have effect from kernel 2.6.24.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Change CBQ to use matches() function instead of strcmp().
This resembels the usage in other parse functions, and allows
partial command parameter matching.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
For TBF, implement overhead parameter parsing.
The change is ABI (Application Binary Interface) backward compatible
with older kernels, but will first have effect from kernel 2.6.24.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
update: Fix the spelling of "hexidecimal"
This updates the help output to specify that CLASSID should be hexidecimal.
This makes sure that a user entering "flowid 1:10" gets his flow put into
band 15 (0x10) and knows why.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
commit 94e9cba778cb97d77d9146dc3bd38ff195bc2c8a
Author: Patrick McHardy <kaber@trash.net>
Date: Sat Feb 2 18:22:16 2008 +0100
[IPROUTE]: cls_flow: add vlan-tag support
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
[IPROUTE]: Add flow classifier support
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
[...]
> Commands like "tc filter add dev ppp0 parent ffff: protocol ip prio 50
> u32 match ip src 0.0.0.0/0 police rate 4mbit burst 10k drop flowid :1"
> apparently no longer works. The flowid is not accepted anymore.
> Reverting commit 720a2e8d99... which you authored seems to "fix" this.
[...]
After further investigation it seems clear to me that reverting the
commit 720a2e8d990707749b2... is the correct thing to do, since the real
fix for the problem this commit was supposed to fix was instead fixed in
commit c29391c7c68f031e246c...
Whatever you specify after a u32 police you will now get a syntax error,
and according to "tc filter add u32 help" there are several things that
you are supposed to be able to specify after a police.
This reverts commit 720a2e8d99.
New iptables 1.4.0 has some library names changed from libipt to libxt.
It is prefferable also to open libxt_ first, as newer "style".
Signed-off-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Patrick McHardy, Cite: 'its better to overestimate than underestimate
to stay in control of the queue'.
Illustrating the rate table array:
Legend description
rtab[x] : Array index x of rtab[x]
xmit_sz : Transmit size contained in rtab[x] (normally transmit time)
maps[a-b] : Packet sizes from a to b, will map into rtab[x]
Current/old rate table mapping (cell_log:3):
rtab[0]:=xmit_sz:0 maps[0-7]
rtab[1]:=xmit_sz:8 maps[8-15]
rtab[2]:=xmit_sz:16 maps[16-23]
rtab[3]:=xmit_sz:24 maps[24-31]
rtab[4]:=xmit_sz:32 maps[32-39]
rtab[5]:=xmit_sz:40 maps[40-47]
rtab[6]:=xmit_sz:48 maps[48-55]
New rate table mapping, with kernel cell_align support.
rtab[0]:=xmit_sz:8 maps[0-8]
rtab[1]:=xmit_sz:16 maps[9-16]
rtab[2]:=xmit_sz:24 maps[17-24]
rtab[3]:=xmit_sz:32 maps[25-32]
rtab[4]:=xmit_sz:40 maps[33-40]
rtab[5]:=xmit_sz:48 maps[41-48]
rtab[6]:=xmit_sz:56 maps[49-56]
New TC util on a kernel WITHOUT support for cell_align
rtab[0]:=xmit_sz:8 maps[0-7]
rtab[1]:=xmit_sz:16 maps[8-15]
rtab[2]:=xmit_sz:24 maps[16-23]
rtab[3]:=xmit_sz:32 maps[24-31]
rtab[4]:=xmit_sz:40 maps[32-39]
rtab[5]:=xmit_sz:48 maps[40-47]
rtab[6]:=xmit_sz:56 maps[48-55]
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Change tc_calc_rtable() to take a tc_ratespec struct as an
argument. (cell_log still needs to be passed on as a parameter,
because -1 indicate that the cell_log needs to be computed by the
function.).
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
The only current user is HTB. HTB overhead argument is now passed on
to the kernel (in the struct tc_ratespec). Also correct the data
types.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Hello Stephen,
As the current maintainer of iproute2 package, you could be interested
in including the attached patch that allow using masks in the fw filter
of the tc utility (very useful at least for me). AFAK, it works at least
from iproute2 version 2.6.20-?. Feel free to make the appropriate
cleaning changes if necessary, or contact me if you see any trouble.
Best regards,
François Delawarde.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
tc_core_time2big only used in tc/q_netem.c where it gets passed an unsigned.
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Follow up patch to "Fix overflow in time2tick / tick2time." which switches
the remaining two helper functions from long to unsigned as well.
These functions are only used in "tc/q_hfsc.c" where both the passed argument
and the place the return value is stored are unsigned/u32 variables, so this
change should be safe to make but hasn't been tested as extensively as the
time2tick patch.
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
The helper functions gets passed an unsigned int, which gets cast to long
and overflows. See http://bugs.debian.org/175462
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
The help/usage screen of ematch cmp and nbyte say recognised symbolic
values for "layer FOO" are link, header and next-header, but the code
does _not_ implement that: it will recognise "next-header" as what is
supposed to be "header" and will not recognise "header". The right
symbolic values seem to be link, network, transport. Here is a patch
that changes the help/usage screen to match the code.
(http://bugs.debian.org/438653)
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
em_meta doesn't send 0 values to the kernel. breaking matching on them and
resulting in "Missing value TLV" messages on dump.
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch applies on top of Patrick McHardy's RTNETLINK
patches to add nested compat attributes. This is needed to maintain
ABI for sch_{rr|prio} in the kernel with respect to tc. A new option,
namely multiqueue, was added to sch_prio and sch_rr. This will allow
a user to turn multiqueue support on for sch_prio or sch_rr at loadtime.
Also, tc qdisc ls will display whether or not multiqueue is enabled on
that qdisc. When in multiqueue mode, a user can specify a value of 0 for
bands, and the number of bands will be created to match the number of
queues on the device.
This patch is to support the new sch_rr (round-robin) qdisc being proposed
in NET for multiqueue network device support in the Linux network stack.
It uses q_prio.c as the template, since the qdiscs are nearly identical,
outside of the ->dequeue() routine.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>>That command is from a script that used to work with iproute2-ss020116
>>(2002!), which had the following in tc/m_police.c:
>>
>>210 } else if (strcmp(*argv, "action") == 0) {
>>211 NEXT_ARG();
>>212 if (get_police_result(&p.action, &presult, *argv)) {
>>
>>I don't know when that bit was dropped, but it used to be there. :-)
>
>
>
> Indeed, I missed that. I'll fix up the patch ..
OK this patch fixes parsing of "action ...". I've removed
the erroring on unknown arguments again since in that case
the caller should continue parsing.
>
> Is it a bug that:
>
> # tc filter add dev eth0 parent 1: protocol ip prio 0 handle 0xfffffff
> fw police rate 1 burst 1 mpu 0 mtu 1 action drop
> ^^^^^^^^^^^
> creates a filter that looks like:
>
> # tc filter ls dev eth0
> filter parent 1: protocol ip pref 49152 fw
> filter parent 1: protocol ip pref 49152 fw handle 0xfffffff police 0x1
> rate 0bit burst 0b mtu 1b action reclassify
> ^^^^^^^^^^^^^^^^^
> ref -543190236 bind 4
>
> (which reclassifies and thus lets 0xfffffff-marked packets through).
>
> I'm pretty sure this used to work under 2.4.x (though I no longer have a
> 2.4 box to test with), but it hasn't worked on any of the 2.6.x kernels
> I've tried (with both iproute2-ss060323 and 070710).
Good catch. It seems this is merely a parsing error, iproute doesn't
have an "action" parameter and aborts parsing, so it uses the default
value of "RECLASSIFY". It never had this parameter, so this patch
removes it from the help text and makes it return an error.
Make netem static rather than shared library. It saves problems
on 64 bit platforms.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
This one also makes sense for the release I guess.
-------- Original Message --------
Subject: Re: more iproute2 issues (not critical)
Date: Sat, 31 Mar 2007 16:16:56 +0200
From: Patrick McHardy <kaber@trash.net>
To: Denys <denys@visp.net.lb>
CC: Stephen Hemminger <shemminger@linux-foundation.org>,
netdev@vger.kernel.org
References: <20070321175951.M73913@visp.net.lb>
<46026717.9060909@trash.net> <20070322124533.M79867@visp.net.lb>
<46027FF2.6020001@trash.net> <20070322101224.3e6bb899@freekitty>
<20070331021401.M17326@visp.net.lb> <20070331023011.M8101@visp.net.lb>
Denys wrote:
> Ooops, sorry, it seems my fault, no library exist on this system.
> But i guess it must not coredump in this case? Is it possible to check if
> library not exist and just print some nice message?
> It is trivial i guess.
The problem is that lib_dir is NULL when calling get_target_names.
This patch fixes it.
[IPROUTE]: m_ipt: fix crash when dumping rules
lib_dir is NULL when calling get_target_name, causing a NULL pointer
dereference in the strlen call.
Signed-off-by: Patrick McHardy <kaber@trash.net>
In order to support these new flags add current
linux/if.h into the directory with the local copies.
This caused troubles with outdated redefinitions from net/if.h
so I've removed the dependency on it.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
This reverts fd784ccaf6 commit.
Thanks Stephen, but actually I think the last patch (increase clock
resolution) shouldn't go in yet. I'm not done yet looking at all
the compatibility issues and it does change the range of valid
values for everything dealing with times. Most places I looked
at still accept reasonable ranges, but I would feel more comfortable
to make sure everything is fine first.
> It is in current git tree.
A small fix attached after some testing.
Please dont forget to apply my other patches. When you have them let me
know so i can do some more testing.
cheers,
jamal
[TC] Get iptables path selection to set correct path
A small tweak on top of Stephens patch
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
[IPROUTE]: Increase internal clock resolution to nsec
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
[IPROUTE]: Handle different kernel clock resolutions
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
[IPROUTE]: Add sprint_ticks() function and use in CBQ
Add helper function to print ticks to avoid assumptions about clock
resolution in CBQ.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
[IPROUTE]: Replace "usec" by "time" in function names
Rename functions containing "usec" since they don't necessarily return
usec units anymore.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
[IPROUTE]: Introduce TIME_UNITS_PER_SEC to represent internal clock resolution
Introduce TIME_UNITS_PER_SEC and conversion functions between internal
resolution and resolution expected by the kernel (currently implemented as
NOPs, only needed by HFSC, which currently always uses microseconds).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
[IPROUTE]: Introduce tc_calc_xmitsize and use where appropriate
Add tc_calc_xmitsize() as complement to tc_calc_xmittime(), which calculates
the size that can be transmitted at a given rate during a given time.
Replace all expressions of the form "size = rate*tc_core_tick2usec(time))/1000000"
by tc_calc_xmitsize() calls.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>