iproute2

Commit Graph

Author	SHA1	Message	Date
Petr Machata	01bb0bcd00	tc: Add helpers to support qevent handling Introduce a set of helpers to make it easy to add support for qevents to qdiscs. The idea behind this is that qevent types will be generally reused between qdiscs, rather than each having a completely idiosyncratic set of qevents. The qevent module holds functions for parsing, dumping and formatting of these common qevent types, and for dispatch to the appropriate set of handlers based on the qevent name. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David Ahern <dsahern@kernel.org>	2020-07-05 15:37:27 +00:00
Po Liu	07d5ee70b5	iproute2-next:tc:action: add a gate control action Introduce an ingress frame gate control flow action. The tc gate action works like this: assume there is a gate that lets specified ingress frames pass during certain time slots and drops them during others. A tc filter selects the ingress frames, and the tc gate action specifies in which time slots those frames are passed to the device and in which they are dropped. The gate action provides an entry list that tells how long the gate stays open and how long it stays closed, and it assigns a start time that tells when the entry list starts. The driver then repeats the gate entry list cyclically. For the software simulation, the gate action requires the user to assign a clock type. Below is an example setup in user space. A tc filter matches a stream whose source IP address is 192.168.0.20, and the gate action owns two time slots: in the first, lasting 200ms, the gate is open and frames pass; in the second, lasting 100ms, the gate is closed and frames are dropped. # tc qdisc add dev eth0 ingress # tc filter add dev eth0 parent ffff: protocol ip \ flower src_ip 192.168.0.20 \ action gate index 2 clockid CLOCK_TAI \ sched-entry open 200000000ns -1 8000000b \ sched-entry close 100000000ns # tc chain del dev eth0 ingress chain 0 The "sched-entry" syntax follows the taprio naming style. The gate state is "open"/"close", followed by the period in nanoseconds. The next value, -1, is the internal priority value, which selects the ingress queue the frames are put into; "-1" means wildcard. The last, optional value specifies the maximum number of MSDU octets permitted to pass the gate during the specified time interval; frames over the limit are dropped. The example below filters a stream with destination MAC address 10:00:80:00:00:00 and IP protocol ICMP, followed by the gate action. The gate action runs with a single close time slot, which means the gate always stays closed. The total cycle time is 200000000ns. 
The base-time is calculated as 1357000000000 + (N + 1) * cycletime; the value that lies in the future becomes the start time. Here the cycletime is 200000000ns. # tc filter add dev eth0 parent ffff: protocol ip \ flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \ action gate index 12 base-time 1357000000000ns \ sched-entry CLOSE 200000000ns \ clockid CLOCK_TAI Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: David Ahern <dsahern@gmail.com>	2020-05-13 02:19:46 +00:00
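To make the gate schedule concrete, here is a small Python model of how an entry list repeats cyclically and how the start time follows from base-time. It is a sketch, not kernel or iproute2 code: the function names are hypothetical, N is taken to be the number of whole cycles elapsed since base-time (an assumption, since the text does not define N), and a base-time that is still in the future is assumed to be used unchanged.

```python
from typing import List, Tuple

# (state, interval_ns) pairs, mirroring "sched-entry open 200000000ns".
Entry = Tuple[str, int]

def cycle_time(entries: List[Entry]) -> int:
    """One cycle is the sum of all sched-entry intervals."""
    return sum(interval for _state, interval in entries)

def start_time(base_time: int, cycle: int, now: int) -> int:
    """base_time + (N + 1) * cycle, with N = whole cycles elapsed by now.

    Assumption: a base_time that is already in the future is used as is.
    """
    if base_time > now:
        return base_time
    n = (now - base_time) // cycle
    return base_time + (n + 1) * cycle

def gate_state(entries: List[Entry], start: int, t: int) -> str:
    """Gate state at time t >= start, with the entry list repeating."""
    offset = (t - start) % cycle_time(entries)
    for state, interval in entries:
        if offset < interval:
            return state
        offset -= interval
    raise AssertionError("unreachable: offset is always < cycle time")
```

For the first example, `[("open", 200_000_000), ("close", 100_000_000)]` gives a 300ms cycle that is open for the first 200ms; for the second example, a single close entry makes `gate_state` always return "close".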
Mohit P. Tahiliani	9dced637f8	tc: add support for FQ-PIE packet scheduler This patch adds support for the FQ-PIE packet scheduler. Principles: - Packets are classified into flows. - This is a stochastic model (as we use a hash, several flows might be hashed to the same slot). - Each flow has a PIE-managed queue. - Flows are linked onto two (round-robin) lists, so that new flows have priority over old ones. - For a given flow, packets are not reordered. - Drops during enqueue only. - ECN capability is off by default. - ECN threshold (if ECN is enabled) is at 10% by default. - Uses timestamps to calculate queue delay by default. Usage: tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ] [ target TIME ] [ tupdate TIME ] [ alpha NUMBER ] [ beta NUMBER ] [ quantum BYTES ] [ memory_limit BYTES ] [ ecn_prob PERCENTAGE ] [ [no]ecn ] [ [no]bytemode ] [ [no_]dq_rate_estimator ] Defaults: limit: 10240 packets, flows: 1024, target: 15 ms, tupdate: 15 ms (in jiffies), alpha: 1/8, beta: 5/4, quantum: device MTU, memory_limit: 32 MB, ecn_prob: 10%, ecn: off, bytemode: off, dq_rate_estimator: off Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com> Signed-off-by: V. Saicharan <vsaicharan1998@gmail.com> Signed-off-by: Mohit Bhasi <mohitbhasi1998@gmail.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2020-02-04 03:24:39 -08:00
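Two of the FQ-PIE principles, stochastic flow classification and the ECN threshold, can be sketched in a few lines of Python. This is a hypothetical model: CRC32 stands in for the kernel's flow hash, and the reading of the 10% ECN threshold (mark instead of drop while the drop probability is at or below ecn_prob) is an assumption, not something stated in the commit.

```python
import zlib

def flow_index(five_tuple, flows=1024):
    """Map a flow's 5-tuple to one of `flows` slots.

    CRC32 over the tuple's repr is a stand-in for the kernel's hash;
    distinct flows can land in the same slot (the "stochastic" part).
    """
    return zlib.crc32(repr(five_tuple).encode()) % flows

def early_drop_verdict(drop_prob_pct, ecn=False, ecn_prob=10):
    """Verdict for a packet PIE has chosen to drop early.

    Assumption: with ECN on, the packet is marked instead of dropped while
    the drop probability (in percent) is at most ecn_prob. This model
    ignores whether the packet is actually ECN-capable.
    """
    if ecn and drop_prob_pct <= ecn_prob:
        return "mark"
    return "drop"
```

Because the slot depends only on the 5-tuple, all packets of a flow share one PIE queue, which is why packets within a flow are not reordered.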
Stephen Hemminger	d80d22d5fd	Merge branch 'master' of git://git.kernel.org/pub/scm/network/iproute2/iproute2-next Resolved conflict in tc/f_flower.c	2020-01-29 05:44:53 -08:00
Ethan Sommer	5f78bc3e1d	make yacc usage POSIX compatible config: put YACC in config.mk and use the environment variable if present. ss: use the YACC variable instead of hardcoding bison; place options before the source file argument; use -b to specify the file prefix instead of the output file, as -o isn't POSIX compatible (this generates ssfilter.tab.c instead of ssfilter.c); replace any references to ssfilter.c with references to ssfilter.tab.c. tc: use the -p flag to set the name prefix instead of the bison-specific api.prefix directive; remove unneeded bison-specific directives; use -b instead of -o; replace references to the previously generated emp_ematch.yacc.[ch] with references to the newly generated emp_ematch.tab.[ch]. Signed-off-by: Ethan Sommer <e5ten.arch@gmail.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2020-01-20 09:43:22 -08:00
Petr Machata	d2773f1261	tc: Add support for ETS Qdisc Add a new module to generate and parse options specific to the ETS Qdisc. Example output: bands 8 strict 3 priomap 0 1 2 3 4 5 6 7 qdisc ets 1: root refcnt 2 offloaded bands 8 strict 3 quanta 1514 1514 1514 1514 1514 priomap 0 1 2 3 4 5 6 7 7 7 7 7 7 7 7 7 [ { "kind": "ets", "handle": "1:", "root": true, "refcnt": 2, "offloaded": true, "options": { "bands": 8, "strict": 3, "quanta": [1514, 1514, 1514, 1514, 1514], "priomap": [0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7] } } ] Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David Ahern <dsahern@gmail.com>	2020-01-18 21:54:12 +00:00
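The example output shows how the ETS arguments expand: 8 bands with 3 strict bands leave 5 bands that carry a quantum, and the 8-entry priomap is reported as 16 entries padded with the last band. A Python sketch of that expansion follows; the padding rule and the 1514-byte quantum are inferred from the output above rather than taken from the sch_ets source, and the function name is hypothetical.

```python
TC_PRIO_MAX = 15  # skb priorities 0..15, i.e. 16 priomap slots

def ets_layout(bands, strict, priomap, quantum=1514):
    """Expand ETS arguments into (quanta, full 16-entry priomap).

    Assumptions inferred from the example output, not from sch_ets:
    every non-strict band gets one quantum (1514 bytes there), and
    priomap slots that were not given map to the last band.
    """
    if not 0 <= strict <= bands:
        raise ValueError("strict must be between 0 and bands")
    priomap = list(priomap)
    if len(priomap) > TC_PRIO_MAX + 1 or any(not 0 <= b < bands for b in priomap):
        raise ValueError("invalid priomap")
    quanta = [quantum] * (bands - strict)
    full = priomap + [bands - 1] * (TC_PRIO_MAX + 1 - len(priomap))
    return quanta, full
```

With `ets_layout(8, 3, range(8))` this reproduces the "quanta 1514 1514 1514 1514 1514" and "priomap 0 1 2 3 4 5 6 7 7 7 7 7 7 7 7 7" fields of the example.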
Paul Blakey	c8a494314c	tc: Introduce tc ct action New tc action to send packets to the conntrack module, commit them, and set a zone, labels, mark, and NAT on the connection. It can also clear the packet's conntrack state by using clear. Usage: ct clear | ct commit [force] [zone] [mark] [label] [nat] | ct [nat] [zone] Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: David Ahern <dsahern@gmail.com>	2019-07-18 15:41:02 -07:00
John Hurley	fb57b0920f	tc: add mpls actions Create a new action type for TC that allows the pushing, popping, and modifying of MPLS headers. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David Ahern <dsahern@gmail.com>	2019-07-10 14:06:32 -07:00
Kevin Darbyshire-Bryant	d7f2bccd0f	tc: add support for action act_ctinfo ctinfo is a tc action restoring data stored in conntrack marks to various fields. At present it has two independent modes of operation: restoration of DSCP into the IPv4/v6 diffserv field, and restoration of conntrack marks into packet skb marks. It understands a number of parameters specific to this action in addition to the usual action syntax. Each operating mode is independent of the other, so all options are optional; however, not specifying at least one mode is a bit pointless. Usage: ... ctinfo [dscp mask [statemask]] [cpmark [mask]] [zone ZONE] [CONTROL] [index <INDEX>] DSCP mode: dscp enables copying of a DSCP stored in the conntrack mark into the IPv4/v6 diffserv field. The mask is a 32-bit field and specifies where in the conntrack mark the DSCP value is located. It must be 6 contiguous bits long, e.g. 0xfc000000 would restore the DSCP from the upper 6 bits of the conntrack mark. The DSCP copying may optionally be controlled by a statemask. The statemask is a 32-bit field, usually with a single bit set, and must not overlap the dscp mask. The DSCP restore operation will only take place if the corresponding bit(s) in the conntrack mark ANDed with the statemask yield a non-zero result, e.g. dscp 0xfc000000 0x01000000 would retrieve the DSCP from the top 6 bits, whilst using bit 25 as a flag to do so. Bit 26 is unused in this example. CPMARK mode: cpmark enables copying of the conntrack mark to the packet skb mark. In this mode it is completely equivalent to the existing act_connmark action. Additional functionality is provided by the optional mask parameter, whereby the stored conntrack mark is logically ANDed with the cpmark mask before being stored into the skb mark. This allows shared usage of the conntrack mark between applications, e.g. cpmark 0x00ffffff would restore only the lower 24 bits of the conntrack mark, which may be useful in the event that the upper 8 bits are used by the DSCP function. Usage: ... 
ctinfo [dscp mask [statemask]] [cpmark [mask]] [zone ZONE] [CONTROL] [index <INDEX>] where: dscp MASK is the bitmask used to restore the DSCP; STATEMASK is the bitmask that determines conditional restoring; cpmark MASK is the mask applied to the restored packet mark; ZONE is the conntrack zone; CONTROL := reclassify | pipe | drop | continue | ok | goto chain <CHAIN_INDEX> Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David Ahern <dsahern@gmail.com>	2019-06-10 10:24:38 -07:00
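The ctinfo DSCP and CPMARK restore rules reduce to a few mask operations. This is a minimal Python model with hypothetical names: it returns the 6-bit DSCP codepoint rather than writing it into a packet header, and it assumes the DSCP is right-aligned by shifting down to the mask's lowest set bit.

```python
def _shift(mask):
    """Index of the lowest set bit of a non-zero mask."""
    return (mask & -mask).bit_length() - 1

def check_dscp_masks(dscp_mask, statemask=0):
    """Enforce the rules above: 6 contiguous bits, no overlap."""
    if dscp_mask == 0 or (dscp_mask >> _shift(dscp_mask)) != 0x3F:
        raise ValueError("dscp mask must be 6 contiguous bits")
    if dscp_mask & statemask:
        raise ValueError("statemask must not overlap the dscp mask")

def restore_dscp(ctmark, dscp_mask, statemask=0):
    """DSCP codepoint to restore, or None if the statemask check fails."""
    check_dscp_masks(dscp_mask, statemask)
    if statemask and not (ctmark & statemask):
        return None
    return (ctmark & dscp_mask) >> _shift(dscp_mask)

def restore_cpmark(ctmark, mask=0xFFFFFFFF):
    """skb mark restored from the conntrack mark, ANDed with the cpmark mask."""
    return ctmark & mask
```

With `dscp 0xfc000000 0x01000000`, a conntrack mark of 0xB9000000 restores DSCP 46 (EF), while 0xB8000000 (flag bit clear) restores nothing.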
Paolo Abeni	c865c52365	tc: add support for plug qdisc sch_plug can be used to perform functional qdisc unit tests, explicitly controlling the queuing behaviour from user space. tc has lacked plug support since the qdisc's introduction in 2012. This change introduces basic support for controlling the plug qdisc state from tc. v1 -> v2: - use the SPDX identifier Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David Ahern <dsahern@gmail.com>	2019-05-04 09:22:14 -07:00
Syrone Wong	6ddb36c3a9	tc: fix xtables incorrect usage of LDFLAGS The incorrect setting of LDFLAGS causes the error below: > em_ipt.o: In function `em_ipt_print_epot': > em_ipt.c:(.text.em_ipt_print_epot+0x2e): undefined reference to > `xtables_init_all' em_ipt.c is built when TC_CONFIG_XT=y, which requires xtables, but tc/Makefile doesn't pass the flags correctly: it adds '-lxtables' to LDFLAGS instead of LDLIBS. Fixes: `dd296215` ("tc: add em_ipt ematch for calling xtables matches from tc matching context") Signed-off-by: Syrone Wong <wong.syrone@gmail.com> Acked-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2018-12-13 11:38:43 -08:00
Luca Boccassi	1a03ac6b05	Pass CPPFLAGS to the compiler When building Debian packages pre-processor flags are passed via CPPFLAGS, as the convention indicates. Specifically, the hardening -D_FORTIFY_SOURCE=2 flag is used. Pass CPPFLAGS to all calls of QUIET_CC together with CFLAGS. Signed-off-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2018-11-09 08:07:18 -08:00
Vinicius Costa Gomes	0dd1644935	tc: Add support for configuring the taprio scheduler This traffic scheduler allows traffic class states (transmission allowed/not allowed, in the simplest case) to be scheduled according to a pre-generated time sequence. This is the basis of the IEEE 802.1Qbv specification. Example configuration: tc qdisc replace dev enp3s0 parent root handle 100 taprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@1 2@2 \ base-time 1528743495910289987 \ sched-entry S 01 300000 \ sched-entry S 02 300000 \ sched-entry S 04 300000 \ clockid CLOCK_TAI The configuration format is similar to mqprio. The main difference is the presence of a schedule, built from multiple "sched-entry" definitions; each entry has the following format: sched-entry <CMD> <GATE MASK> <INTERVAL> The only supported <CMD> is "S", which means "SetGateStates", following the IEEE 802.1Qbv-2015 definition (Table 8-6). <GATE MASK> is a bitmask where each bit is associated with a traffic class, so bit 0 (the least significant bit) being "on" means that traffic class 0 is "active" for that schedule entry. <INTERVAL> is a time duration in nanoseconds that specifies how long the state defined by <CMD> and <GATE MASK> should be held before moving to the next entry. This schedule is circular, that is, after the last entry is executed it starts again from the first one, indefinitely. The other parameters can be defined as follows: - base-time: specifies the instant when the schedule starts; if 'base-time' is a time in the past, the schedule will start at base-time + (N * cycle-time), where N is the smallest integer such that the resulting time is greater than "now", and "cycle-time" is the sum of all the intervals of the entries in the schedule; - clockid: specifies the reference clock to be used. The parameters should be similar to what the IEEE 802.1Q family of specifications defines. 
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David Ahern <dsahern@gmail.com>	2018-10-07 10:32:08 -07:00
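The base-time rule and the gate masks of the example taprio configuration can be checked with a small Python model. It is a sketch with hypothetical names, assuming times in nanoseconds and that a base-time that is not in the past is used unchanged; it models only the "S" command.

```python
def cycle_time(entries):
    """cycle-time: the sum of all entry intervals."""
    return sum(interval for _cmd, _mask, interval in entries)

def taprio_start(base_time, cycle, now):
    """Start of the schedule per the base-time rule (all values in ns)."""
    if base_time >= now:
        return base_time
    # Smallest N such that base_time + N * cycle > now.
    n = (now - base_time) // cycle + 1
    return base_time + n * cycle

def open_tcs(entries, start, t, num_tc):
    """Traffic classes whose gate is open at time t >= start."""
    offset = (t - start) % cycle_time(entries)
    for _cmd, mask, interval in entries:
        if offset < interval:
            return [tc for tc in range(num_tc) if (mask >> tc) & 1]
        offset -= interval
    raise AssertionError("unreachable: offset is always < cycle time")

# The example schedule and its "map" argument (priority -> traffic class).
ENTRIES = [("S", 0x01, 300000), ("S", 0x02, 300000), ("S", 0x04, 300000)]
PRIO_TO_TC = [2, 2, 1, 0] + [2] * 12
```

In this example each traffic class gets one 300µs window per 900µs cycle; for instance, priority 3 maps to traffic class 0, which may transmit only during the first entry.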
Nishanth Devarajan	141b55f854	Add SKB Priority qdisc support in tc(8) sch_skbprio is a qdisc that prioritizes packets according to their skb->priority field. Under congestion, it drops already-enqueued lower priority packets to make space available for higher priority packets. Skbprio was conceived as a solution for denial-of-service defenses that need to route packets with different priorities as a means to overcome DoS attacks. Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com> Reviewed-by: Michel Machado <michel@digirati.com.br> Signed-off-by: David Ahern <dsahern@gmail.com>	2018-08-14 07:06:43 -07:00
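The skbprio drop behaviour can be sketched as a bounded set of per-priority FIFOs. This is a hypothetical Python model, assuming that a larger skb->priority value means more important, that priorities are clamped to 64 levels, that the evicted packet is the newest one at the lowest queued priority, and that an incoming packet no more important than anything queued is itself dropped.

```python
from collections import deque

class SkbPrioModel:
    """Toy model of skbprio: a bounded queue that evicts low priorities."""

    def __init__(self, limit, num_prios=64):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.queues = [deque() for _ in range(num_prios)]
        self.length = 0

    def _lowest(self):
        return min(p for p, q in enumerate(self.queues) if q)

    def _highest(self):
        return max(p for p, q in enumerate(self.queues) if q)

    def enqueue(self, pkt, priority):
        """Queue pkt; return the packet dropped to make room, if any."""
        prio = min(priority, len(self.queues) - 1)  # clamp out-of-range values
        if self.length < self.limit:
            self.queues[prio].append(pkt)
            self.length += 1
            return None
        lowest = self._lowest()
        if prio <= lowest:
            return pkt  # nothing queued is less important: drop the newcomer
        dropped = self.queues[lowest].pop()  # evict newest low-priority packet
        self.queues[prio].append(pkt)
        return dropped

    def dequeue(self):
        """Return the oldest packet of the highest non-empty priority."""
        if self.length == 0:
            return None
        self.length -= 1
        return self.queues[self._highest()].popleft()
```

Under congestion the queue length stays at `limit`, but its contents shift toward higher priorities, which is the property the DoS-defense use case relies on.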
Toke Høiland-Jørgensen	714444c0cb	Add support for CAKE qdisc sch_cake is intended to squeeze the most bandwidth and latency out of even the slowest ISP links and routers, while presenting an API simple enough that even an ISP can configure it. Example of use on a cable ISP uplink: tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter To shape a cable download link (ifb and tc-mirred setup elided): tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort Cake is filled with: * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel-derived Flow Queuing system, which autoconfigures based on the bandwidth. * A novel "triple-isolate" mode (the default) which balances per-host and per-flow FQ even through NAT. * A deficit-based shaper that can also be used in an unlimited mode. * 8-way set-associative hashing to reduce flow collisions to a minimum. * A reasonable interpretation of various diffserv latency/loss tradeoffs. * Support for zeroing diffserv markings for entering and exiting traffic. * Support for interacting well with Docsis 3.0 shaper framing. * Support for DSL framing types and shapers. * Support for ack filtering. * Extensive statistics for measuring loss, ECN markings, and latency variation. Various versions have been available as an out-of-tree build for kernel versions going back to 3.10, as the embedded router world has been running a few years behind mainline Linux. A stable version has been generally available on lede-17.01 and later. sch_cake replaces a combination of iptables, tc filter, htb and fq_codel in the sqm-scripts, with sane defaults and vastly simpler configuration. Cake's principal author is Jonathan Morton, with contributions from Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller, Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht, and Loganaden Velvindron. 
Testing from Pete Heist, Georgios Amanakis, and the many other members of the cake@lists.bufferbloat.net mailing list. Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David Ahern <dsahern@gmail.com>	2018-07-19 09:23:46 -07:00
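The "8-way set-associative hashing" bullet can be pictured with a toy Python flow table. Within the 8-slot set its hash selects, a flow first looks for its own queue and then for a free one, and it shares a queue only when that whole set is busy. This is a simplified model, not sch_cake's code; the 1024-queue size, the CRC32 hash, and the class name are assumptions, and queues are never released here.

```python
import zlib

class SetAssocFlowTable:
    """Toy set-associative flow-to-queue table (hypothetical model)."""

    def __init__(self, queues=1024, ways=8):
        if queues % ways:
            raise ValueError("queues must be a multiple of ways")
        self.tags = [None] * queues  # hash of the flow owning each queue
        self.ways = ways

    def lookup(self, flow_key: bytes) -> int:
        h = zlib.crc32(flow_key)
        idx = h % len(self.tags)
        base = idx - idx % self.ways  # first slot of idx's set
        # 1) A queue in this set already tagged with the flow's hash.
        for k in range(self.ways):
            slot = base + (idx + k) % self.ways
            if self.tags[slot] == h:
                return slot
        # 2) Otherwise, a free queue in the same set.
        for k in range(self.ways):
            slot = base + (idx + k) % self.ways
            if self.tags[slot] is None:
                self.tags[slot] = h
                return slot
        # 3) Set full: the flow collides in its hashed slot.
        self.tags[idx] = h
        return idx
```

With a plain hash table, any two flows hashing to the same slot collide; here a collision needs nine or more active flows landing in the same 8-slot set, which is the reduction the bullet refers to.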
Vinicius Costa Gomes	7da5ef2200	tc: Add support for the ETF Qdisc The "Earliest TxTime First" (ETF) queueing discipline allows precise control of the transmission time of packets by providing a sorted time-based scheduling of packets. The syntax is: tc qdisc add dev DEV parent NODE etf delta <DELTA> clockid <CLOCKID> [offload] [deadline_mode] Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David Ahern <dsahern@gmail.com>	2018-07-11 17:50:10 -07:00
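The sorted, time-based scheduling that ETF provides can be modelled with a min-heap keyed on each packet's txtime. This is a hypothetical Python sketch. It assumes that delta means a packet is released delta ns before its txtime and that packets whose txtime has already passed are rejected; offload and deadline_mode are not modelled.

```python
import heapq
import itertools

class EtfModel:
    """Toy ETF queue: packets sorted by txtime, released delta ns early."""

    def __init__(self, delta_ns):
        self.delta = delta_ns
        self._heap = []                # (txtime, seq, pkt), earliest first
        self._seq = itertools.count()  # tie-breaker keeps FIFO order

    def enqueue(self, pkt, txtime, now):
        # Assumption: a packet whose txtime has already passed is rejected.
        if txtime < now:
            return False
        heapq.heappush(self._heap, (txtime, next(self._seq), pkt))
        return True

    def dequeue(self, now):
        # Release the earliest packet once now >= txtime - delta.
        if self._heap and now >= self._heap[0][0] - self.delta:
            return heapq.heappop(self._heap)[2]
        return None
```

Packets enqueued out of txtime order still leave in txtime order, which is the "earliest txtime first" property the qdisc is named for.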
Eyal Birger	dd29621578	tc: add em_ipt ematch for calling xtables matches from tc matching context This commit adds a new tc ematch for using netfilter xtables matches. This allows early classification as well as mirroring/redirecting traffic based on logic implemented in netfilter extensions. The currently supported use case is classification based on the incoming IPSec state used during decapsulation, using the 'policy' iptables extension (xt_policy). The matcher uses libxtables for parsing the input parameters. Example use for matching an IPSec state with reqid 1: tc qdisc add dev eth0 ingress tc filter add dev eth0 protocol ip parent ffff: \ basic match 'ipt(-m policy --dir in --pol ipsec --reqid 1)' \ action drop This is the user-space counterpart of kernel commit ccc007e4a746 ("net: sched: add em_ipt ematch for calling xtables matches") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com>	2018-02-27 09:43:16 -08:00
Stephen Hemminger	6054c1ebf7	SPDX license identifiers For all files in iproute2 which do not have an obvious license identification, mark them with SPDX GPL-2 Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2017-11-24 12:21:35 -08:00
Vinicius Costa Gomes	c9681ac1b3	tc: Add support for the CBS qdisc The Credit-Based Shaper (CBS) queueing discipline allows bandwidth reservation with sub-millisecond precision. It is defined by the 802.1Q-2014 specification (section 8.6.8.2 and Annex L). The syntax is: tc qdisc add dev DEV parent NODE cbs locredit <LOCREDIT> hicredit <HICREDIT> sendslope <SENDSLOPE> idleslope <IDLESLOPE> (The order is not important) Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2017-11-01 22:22:48 +01:00
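The four CBS parameters are not independent; they are related by the credit-based shaper formulas of IEEE 802.1Q Annex L. The minimal Python sketch below assumes the commonly cited forms of those formulas and units of kbit/s for rates and bytes for credits: sendslope = idleslope - port rate, locredit = max frame size × sendslope / port rate, and hicredit = max interference size × idleslope / port rate, the last valid only for the highest-priority CBS class. Check the specification before relying on the numbers; the function names are hypothetical.

```python
def cbs_sendslope(idleslope, port_rate):
    """sendslope = idleslope - portTransmitRate (kbit/s, assumed units)."""
    return idleslope - port_rate

def cbs_locredit(max_frame, sendslope, port_rate):
    """locredit = maxFrameSize * sendslope / portTransmitRate (bytes)."""
    return max_frame * sendslope / port_rate

def cbs_hicredit(max_interference, idleslope, port_rate):
    """hicredit = maxInterferenceSize * idleslope / portTransmitRate (bytes).

    Assumption: only valid for the highest-priority CBS class.
    """
    return max_interference * idleslope / port_rate
```

For a 1 Gbit/s port (1000000 kbit/s) reserving 20000 kbit/s with 1500-byte frames, this gives sendslope -980000, locredit -1470 and hicredit 30, which would be passed to the qdisc as the corresponding cbs arguments.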
Stephen Hemminger	5f1df307b4	config: put CFLAGS/LDLIBS in config.mk This renames Config to config.mk and includes more Make input. Now configure generates all the required CFLAGS and LDLIBS for the optional libraries. Also, use pkg-config to test for libelf, rather than using a test program. This makes it consistent with other libraries. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2017-08-23 10:03:09 -07:00
Daniel Borkmann	8cc360fe48	bpf: unbreak libelf linkage for bpf obj loader Commit `69fed534a5` ("change how Config is used in Makefile's") moved HAVE_MNL specific CFLAGS/LDLIBS for building with libmnl out of the top level Makefile into sub-Makefiles. However, it also removed the HAVE_ELF specific CFLAGS/LDLIBS entirely, which breaks the BPF object loader for tc and ip with "No ELF library support compiled in." despite having libelf detected in configure script. Fix it similarly as in `69fed534a5` for HAVE_ELF. Fixes: `69fed534a5` ("change how Config is used in Makefile's") Reported-by: Jeffrey Panneman <jeffrey.panneman@tno.nl> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2017-08-10 16:40:02 -07:00
Stephen Hemminger	6ff66acc60	tc, ip: more Makefile updates for LIBMNL Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2017-08-09 08:38:51 -07:00
Roman Mashak	cba134ae70	tc: fix Makefile to build skbmod Signed-off-by: Roman Mashak <mrv@mojatatu.com>	2017-05-22 13:33:51 -07:00
Amir Vadai	f3e1b2448a	pedit: Introduce ipv6 support Add support for modifying IPv6 headers using pedit. Signed-off-by: Amir Vadai <amir@vadai.me>	2017-05-15 15:05:20 -07:00
Amir Vadai	3cd5149ecd	tc/pedit: p_eth: ETH header editor For example, forward tcp traffic to veth0 and set destination mac address to 11:22:33:44:55:66 : $ tc filter add dev enp0s9 protocol ip parent ffff: \ flower \ ip_proto tcp \ action pedit ex munge \ eth dst set 11:22:33:44:55:66 \ action mirred egress \ redirect dev veth0 Signed-off-by: Amir Vadai <amir@vadai.me>	2017-05-01 09:22:16 -07:00
Jiri Kosina	be67f81297	iproute2: tc: introduce build dependency on libnetlink Rebuilding libnetlink doesn't trigger rebuild of tc, which is wrong (especially so for builds where libnetlink.a gets statically linked into tc). Fix that by introducing an explicit dependency. Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2017-02-24 15:11:32 -08:00
Yotam Gigi	0b1abd84fb	tc: Add support for the sample tc action The sample tc action allows sampling packets matching a classifier. It randomly picks packets and samples them using the psample netlink channel. The user can specify the psample group the packet will be sampled to, the sampling rate, and the packet truncation size (to save kernel-user traffic). The sampled packets contain informative metadata, for example, the input interface and the original packet length. The action syntax: tc filter add [...] \ action sample rate <RATE> group <GROUP> [trunc <SIZE>] [...] Where: RATE := The sampling rate which is the ratio of packets observed at the data source to the samples generated GROUP := the psample module sampling group SIZE := optional truncation size An example for a common use case of the sample tc action: to sample ingress traffic from interface eth1, one may use the commands: tc qdisc add dev eth1 handle ffff: ingress tc filter add dev eth1 parent ffff: \ matchall action sample rate 12 group 4 Where the first command adds an ingress qdisc and the second starts sampling randomly with an average of one sampled packet per 12 packets on dev eth1 to psample group 4. Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>	2017-02-06 14:24:52 -08:00
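The meaning of RATE can be illustrated with a short Python model: each packet is picked independently with probability 1/RATE, and the picked copy is truncated to SIZE bytes. The independent per-packet draw is an assumption about how the random choice is made; the function name and parameters are hypothetical.

```python
import random

def sample_packets(packets, rate, trunc=None, rng=None):
    """Yield a copy of each packet picked with probability 1/rate.

    Copies are cut to `trunc` bytes when a truncation size is given.
    """
    if rng is None:
        rng = random.Random()
    for pkt in packets:
        if rng.randrange(rate) == 0:  # true for 1 in `rate` draws on average
            yield pkt if trunc is None else pkt[:trunc]
```

With `rate=12`, about 10000 of 120000 packets are sampled, matching the "one sampled packet per 12 packets" average of the example; the exact count varies from run to run.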
David Michael	bb18c98198	tc: make tc linking depend on libtc.a There was a race condition where the command to link the tc binary could (rarely) run before the libtc.a archive existed.	2017-01-09 12:06:58 -08:00
Amir Vadai	d57639a475	tc/act_tunnel: Introduce ip tunnel action This action could be used before redirecting packets to a shared tunnel device, or when redirecting packets arriving from such a device. The 'unset' action is optional. It is used to explicitly unset the metadata created by the tunnel device during decap. If not used, the metadata will be released automatically by the kernel. The 'set' operation will set the metadata with the specified values for the encap. For example, the following flower filter will forward all ICMP packets destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before redirecting, metadata for the vxlan tunnel is created using the tunnel_key action and its arguments: $ tc filter add dev net0 protocol ip parent ffff: \ flower \ ip_proto 1 \ dst_ip 11.11.11.2 \ action tunnel_key set \ src_ip 11.11.0.1 \ dst_ip 11.11.0.2 \ id 11 \ action mirred egress redirect dev vxlan0 Signed-off-by: Amir Vadai <amir@vadai.me>	2016-12-02 14:12:09 -08:00
Daniel Borkmann	e42256699c	bpf: make tc's bpf loader generic and move into lib This work moves the bpf loader into the iproute2 library and reworks the tc-specific parts into generic code. It's useful as we can then more easily support new program types by just having the same ELF loader backend. Joint work with Thomas Graf. I hacked a rough start of a test suite to make sure nothing breaks [1], and it all looks good. [1] https://github.com/borkmann/clsact/blob/master/test_bpf.sh Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Thomas Graf <tgraf@suug.ch>	2016-11-29 12:35:32 -08:00
Daniel Borkmann	4710e46ec3	tc, ipt: don't enforce iproute2 dependency on iptables-devel Since `5cd1adba79` ("Update to current iptables headers") compilation of iproute2 broke for systems without the iptables-devel package [1]. The reason is that even though we fall back to building m_ipt.c, the include depends on an xtables-version.h header, which only ships with iptables-devel. Machines not having this package fail compilation with: [...] CC m_ipt.o In file included from ../include/iptables.h:5:0, from m_ipt.c:17: ../include/xtables.h:34:29: fatal error: xtables-version.h: No such file or directory compilation terminated. ../Config:31: recipe for target 'm_ipt.o' failed make[1]: *** [m_ipt.o] Error 1 The configure script only barks that package xtables was not found in the pkg-config search path. The generated Config then only contains f.e. TC_CONFIG_IPSET. In tc's Makefile we thus fall back to adding m_ipt.o to TCMODULES. m_ipt.c then includes the local include/iptables.h header copy, which includes the include/xtables.h copy. The latter then includes xtables-version.h, which only ships with iptables-devel. One way to resolve this is to skip this whole mess when pkg-config has no xtables config available. I've carried something along these lines locally for a while now, but it's just too annoying. :/ The build now also works fine when xtables.pc is not available. [1] http://www.spinics.net/lists/netdev/msg366162.html Fixes: `5cd1adba79` ("Update to current iptables headers") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2016-10-26 10:58:22 -07:00
Yotam Gigi	d5cbf3ff05	tc: Add support for the matchall traffic classifier. The matchall classifier matches every packet and allows the user to apply actions on it. In addition, it supports the skip_sw and skip_hw flags (as found on the u32 and flower filters) that direct the kernel to skip the software/hardware processing of the actions. This filter is very useful in use cases where every packet should be matched. For example, packet mirroring (SPAN) can be set up very easily using that filter. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com>	2016-09-01 08:37:01 -07:00
David Ahern	57bdf8b764	Make builds default to quiet mode Similar to the Linux kernel and perf add infrastructure to reduce the amount of output tossed to a user during a build. Full build output can be obtained with 'make V=1' Builds go from: make[1]: Leaving directory `/home/dsa/iproute2.git/lib' make[1]: Entering directory `/home/dsa/iproute2.git/ip' gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ip.o ip.c gcc -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wformat=2 -O2 -I../include -DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -c -o ipaddress.o ipaddress.c to: ... AR libutil.a ip CC ip.o CC ipaddress.o ... Signed-off-by: David Ahern <dsa@cumulusnetworks.com>	2016-05-31 12:13:07 -07:00
Jamal Hadi Salim	d3e511223f	tc: introduce IFE action This action allows for a sending side to encapsulate arbitrary metadata which is decapsulated by the receiving end. The sender runs in encoding mode and the receiver in decode mode. Both sender and receiver must specify the same ethertype. At some point we hope to have a registered ethertype and we'll then provide a default so the user doesn't have to specify it. For now we require the user to specify it. Described in netdev01 paper: "Distributing Linux Traffic Control Classifier-Action Subsystem" Authors: Jamal Hadi Salim and Damascene M. Joachimpillai Also refer to IETF draft-ietf-forces-interfelfb-04.txt Let's show example usage where we encode icmp from a sender towards a receiver with an skbmark of 17; both sender and receiver use ethertype of 0xdead to interop. YYYY: Let's start with Receiver-side policy config: xxx: add an ingress qdisc sudo tc qdisc add dev $ETH ingress xxx: any packets with ethertype 0xdead will be subjected to ife decoding xxx: we then restart the classification so we can match on icmp at prio 3 sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \ u32 match u32 0 0 flowid 1:1 \ action ife decode reclassify xxx: on restarting the classification from above if it was an icmp xxx: packet, then match it here and continue to the next rule at prio 4 xxx: which will match based on skb mark of 17 sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \ u32 match ip protocol 1 0xff flowid 1:1 \ action continue xxx: match on skbmark of 0x11 (decimal 17) and accept sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \ handle 0x11 fw flowid 1:1 \ action ok xxx: Let's show the decoding policy sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead xxx: filter pref 2 u32 filter pref 2 u32 fh 800: ht divisor 1 filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 0 success 0) match 00000000/00000000 at 0 (success 0 ) action order 1: ife decode action 
reclassify type 0x0 allow mark allow prio index 11 ref 1 bind 1 installed 45 sec used 45 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 xxx: Observe that the above lists all metadata it can decode. Typically these submodules will already be compiled into a monolithic kernel or loaded as modules. YYYY: Let's show the sender side now. xxx: Add an egress qdisc on the sender netdev sudo tc qdisc add dev $ETH root handle 1: prio xxx: xxx: Match all icmp packets to 192.168.122.237/24, then xxx: tag the packet with skb mark of decimal 17, then xxx: Encode it with: xxx: ethertype 0xdead xxx: add skb->mark to whitelist of metadatum to send xxx: rewrite target dst MAC address to 02:15:15:15:15:15 xxx: sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 u32 \ match ip dst 192.168.122.237/24 \ match ip protocol 1 0xff \ flowid 1:2 \ action skbedit mark 17 \ action ife encode \ type 0xDEAD \ allow mark \ dst 02:15:15:15:15:15 xxx: Let's show the encoding policy filter pref 10 u32 filter pref 10 u32 fh 800: ht divisor 1 filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 (rule hit 118 success 0) match c0a87a00/ffffff00 at 16 (success 0 ) match 00010000/00ff0000 at 8 (success 0 ) action order 1: skbedit mark 17 index 11 ref 1 bind 1 installed 3 sec used 3 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 action order 2: ife encode action pipe type 0xDEAD allow mark dst 02:15:15:15:15:15 index 12 ref 1 bind 1 installed 3 sec used 3 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 xxx: Now test by sending ping from sender to destination Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>	2016-05-16 11:13:26 -07:00
Daniel Borkmann	8f9afdd531	tc, clsact: add clsact frontend Add the tc part for the kernel commit 1f211a1b929c ("net, sched: add clsact qdisc"). Quoting example usage from that commit description: Example, adding qdisc: # tc qdisc add dev foo clsact # tc qdisc show dev foo qdisc mq 0: root qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc clsact ffff: parent ffff:fff1 Adding filters (deleting etc. works analogously by specifying ingress/egress): # tc filter add dev foo ingress bpf da obj bar.o sec ingress # tc filter add dev foo egress bpf da obj bar.o sec egress # tc filter show dev foo ingress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action # tc filter show dev foo egress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action The ingress parent alias can also be used with the ingress qdisc. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2016-01-18 11:41:27 -08:00
Jiri Pirko	30eb304ecd	tc: add support for Flower classifier Signed-off-by: Jiri Pirko <jiri@resnulli.us>	2015-05-21 15:22:49 -07:00
Daniel Borkmann	4bd624467b	tc: built-in eBPF exec proxy This work follows upon commit `6256f8c9e4` ("tc, bpf: finalize eBPF support for cls and act front-end") and takes up the idea proposed by Hannes Frederic Sowa to spawn a shell (or any other command) that holds generated eBPF map file descriptors. File descriptors, based on their id, are being fetched from the same unix domain socket as demonstrated in the bpf_agent, the shell spawned via execvpe(2) and the map fds passed over the environment, and thus are made available to applications in the fashion of std{in,out,err} for read/write access, for example in case of iproute2's examples/bpf/: # env | grep BPF BPF_NUM_MAPS=3 BPF_MAP1=6 <- BPF_MAP_ID_QUEUE (id 1) BPF_MAP0=5 <- BPF_MAP_ID_PROTO (id 0) BPF_MAP2=7 <- BPF_MAP_ID_DROPS (id 2) # ls -la /proc/self/fd [...] lrwx------. 1 root root 64 Apr 14 16:46 0 -> /dev/pts/4 lrwx------. 1 root root 64 Apr 14 16:46 1 -> /dev/pts/4 lrwx------. 1 root root 64 Apr 14 16:46 2 -> /dev/pts/4 [...] lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map The advantage (as opposed to the direct/native usage) is that now the shell is map fd owner and applications can terminate and easily reattach to descriptors w/o any kernel changes. Moreover, multiple applications can easily read/write eBPF maps simultaneously. To further allow users for experimenting with that, next step is to add a small helper that can get along with simple data types, so that also shell scripts can make use of bpf syscall, f.e to read/write into maps. Generally, this allows for prepopulating maps, or any runtime altering which could influence eBPF program behaviour (f.e. different run-time classifications, skb modifications, ...), dumping of statistics, etc. Reference: http://thread.gmane.org/gmane.linux.network/357471/focus=357860 Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com>	2015-04-27 16:39:23 -07:00
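The exec proxy passes the map descriptors to the spawned shell through environment variables. As the `env | grep BPF` output in the commit message shows, these are `BPF_NUM_MAPS` plus one `BPF_MAP<i>=<fd>` per map. The sketch below, which assumes a POSIX shell, shows how a script running inside that shell might enumerate them. The function name `list_bpf_map_fds` is hypothetical. It only reads the environment and does not issue any bpf(2) calls itself:

```shell
# Hypothetical helper: list the eBPF map fds that tc's exec proxy exported
# into this shell's environment as BPF_NUM_MAPS and BPF_MAP<i>=<fd>.
list_bpf_map_fds() {
    n=${BPF_NUM_MAPS:-0}
    i=0
    while [ "$i" -lt "$n" ]; do
        # Indirect lookup of BPF_MAP$i; POSIX sh needs eval for this.
        eval "fd=\${BPF_MAP$i:-}"
        if [ -z "$fd" ]; then
            echo "map $i: BPF_MAP$i is not set" >&2
        else
            echo "map $i -> fd $fd"
        fi
        i=$((i + 1))
    done
}
```

With the environment from the commit message, this prints `map 0 -> fd 5` through `map 2 -> fd 7`. Those are the same descriptors that appear as `anon_inode:bpf-map` in the `ls -la /proc/self/fd` listing, because child processes of the shell inherit them.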
Felix Fietkau	b8d5c9a71b	tc: add support for connmark action Add ability to add the netfilter connmark support. Typical usage: ...lets tag outgoing icmp with mark 0x10.. iptables -tmangle -A PREROUTING -p icmp -j CONNMARK --set-mark 0x10 ..add on ingress of $ETH an extractor for connmark... tc filter add dev $ETH parent ffff: prio 4 protocol ip \ u32 match ip protocol 1 0xff \ flowid 1:1 \ action connmark continue ...if the connmark was 0x11, we police to a ridic rate of 10Kbps tc filter add dev $ETH parent ffff: prio 5 protocol ip \ handle 0x11 fw flowid 1:1 \ action police rate 10kbit burst 10k Other ways to use the connmark is to supply the zone, index and branching choice. Refer to help. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>	2015-04-13 10:49:45 -07:00
Daniel Borkmann	11c39b5e98	tc: add eBPF support to f_bpf This work adds the tc frontend for kernel commit e2e9b6541dd4 ("cls_bpf: add initial eBPF support for programmable classifiers"). A C-like classifier program (f.e. see e2e9b6541dd4) is being compiled via LLVM's eBPF backend into an ELF file, that is then being passed to tc. tc then loads, if any, eBPF maps and eBPF opcodes (with fixed-up eBPF map file descriptors) out of its dedicated sections, and via bpf(2) into the kernel and then the resulting fd via netlink down to cls_bpf. cls_bpf allows for annotations, currently, I've used the file name for that, so that the user can easily identify his filter when dumping configurations back. Example usage: clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o tc filter add dev em1 parent 1: bpf run object-file cls.o classid x:y tc filter show dev em1 [...] filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid x:y cls.o I placed the parser bits derived from Alexei's kernel sample, into tc_bpf.c as my next step is to also add the same support for BPF action, so we can have a fully fledged eBPF classifier and action in tc. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com>	2015-03-24 15:45:23 -07:00
Jiri Pirko	86ab59a666	tc: add support for BPF based actions Signed-off-by: Jiri Pirko <jiri@resnulli.us>	2015-02-05 10:38:13 -08:00
Jiri Pirko	1d129d191a	tc: push bpf common code into separate file Signed-off-by: Jiri Pirko <jiri@resnulli.us>	2015-02-05 10:38:13 -08:00
Vadim Kochan	67e1d73be1	tc: Allow to easy change network namespace Added new '-netns' option to simplify executing following cmd: ip netns exec NETNS tc OPTIONS COMMAND OBJECT to tc -n[etns] NETNS OPTIONS COMMAND OBJECT e.g.: tc -net vnet0 qdisc Signed-off-by: Vadim Kochan <vadim4j@gmail.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us>	2014-12-27 10:22:34 -08:00
Jiri Pirko	8b1c0216d8	tc: add support for vlan tc action Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Reviewed-by: Cong Wang <cwang@twopensource.com>	2014-12-03 09:29:21 -08:00
Terry Lam	ac74bd2a71	support for Heavy Hitter Filter (HHF) qdisc $tc qdisc add dev eth0 hhf help Usage: ... hhf [ limit PACKETS ] [ quantum BYTES] [ hh_limit NUMBER ] [ reset_timeout TIME ] [ admit_bytes BYTES ] [ evict_timeout TIME ] [ non_hh_weight NUMBER ] $tc -s -d qdisc show dev eth0 qdisc hhf 8005: root refcnt 32 limit 1000p quantum 1514 hh_limit 2048 reset_timeout 40.0ms admit_bytes 131072 evict_timeout 1.0s non_hh_weight 2 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 drop_overlimit 0 hh_overlimit 0 tot_hh 0 cur_hh 0 HHF qdisc parameters: - limit: max number of packets in qdisc (default 1000) - quantum: max deficit per RR round (default 1 MTU) - hh_limit: max number of HHs to keep states (default 2048) - reset_timeout: time to reset HHF counters (default 40ms) - admit_bytes: counter thresh to classify as HH (default 128KB) - evict_timeout: threshold to evict idle HHs (default 1s) - non_hh_weight: DRR weight for mice (default 2) Signed-off-by: Terry Lam <vtlam@google.com>	2014-05-09 12:10:47 -07:00
Vijay Subramanian	80dd880dd0	PIE: Proportional Integral controller Enhanced Proportional Integral controller Enhanced (PIE) is a scheduler to address the bufferbloat problem. We present here a lightweight design, PIE (Proportional Integral controller Enhanced) that can effectively control the average queueing latency to a target value. Simulation results, theoretical analysis and Linux testbed results have shown that PIE can ensure low latency and achieve high link utilization under various congestion situations. The design does not require per-packet timestamp, so it incurs very small overhead and is simple enough to implement in both hardware and software. For more information, please see technical paper about PIE in the IEEE Conference on High Performance Switching and Routing 2013. A copy of the paper can be found at ftp://ftpeng.cisco.com/pie/. Please also refer to the IETF draft submission at http://tools.ietf.org/html/draft-pan-tsvwg-pie-00 All relevant code, documents and test scripts and results can be found at ftp://ftpeng.cisco.com/pie/. For problems with the iproute2/tc or Linux kernel code, please contact Vijay Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) Mythili Prabhu (mysuryan@cisco.com) Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Mythili Prabhu <mysuryan@cisco.com> CC: Dave Taht <dave.taht@bufferbloat.net>	2014-01-09 22:50:47 -08:00
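At its core, PIE periodically recomputes a drop probability from the estimated queueing delay. Each update combines two inputs: the current delay's distance from the target, and the change in delay since the previous update. The sketch below is minimal and uses integer-only POSIX shell arithmetic. It assumes the basic update law p += alpha * (qdelay - target) + beta * (qdelay - qdelay_old), with the alpha = 0.125/s and beta = 1.25/s values recommended in RFC 8033. Delays are in microseconds and the probability is in parts per million. The sketch omits the probability-dependent scaling of alpha and beta, along with every other detail of the Linux `pie` qdisc. The function name `pie_update` is hypothetical:

```shell
# Hypothetical sketch of the basic PIE update (no alpha/beta auto-tuning):
#   p += alpha * (qdelay - target) + beta * (qdelay - qdelay_old)
# alpha = 0.125/s and beta = 1.25/s; with delays in microseconds and p in
# parts per million, alpha * d_us = d_us / 8 ppm and beta * d_us = d_us * 5 / 4 ppm.
# Usage: pie_update P_PPM QDELAY_US QDELAY_OLD_US TARGET_US
pie_update() {
    p=$1 qdelay=$2 qdelay_old=$3 target=$4
    p=$(( p + (qdelay - target) / 8 + (qdelay - qdelay_old) * 5 / 4 ))
    # Clamp to the valid probability range [0, 1000000] ppm.
    if [ "$p" -lt 0 ]; then p=0; fi
    if [ "$p" -gt 1000000 ]; then p=1000000; fi
    echo "$p"
}
```

Consider `pie_update 0 20000 15000 15000`: the queueing delay is 5 ms above a 15 ms target, up from 15 ms at the previous update. It yields 625 + 6250 = 6875 ppm, a drop probability of about 0.69%. A delay below target with no growth drives p toward 0.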
Daniel Borkmann	d05df6861f	tc: add cls_bpf frontend This is the iproute2 part of the kernel patch "net: sched: add BPF-based traffic classifier". [Will re-submit later again for iproute2 when window for -next submissions opens.] Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Thomas Graf <tgraf@suug.ch>	2013-10-30 16:45:05 -07:00
Jamal Hadi Salim	087f46ee4e	tc: introduce simple action Simple action is already in the kernel for years now as an example. This complements it with user space control. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>	2013-09-30 21:29:34 -07:00
Eric Dumazet	bc113e46a3	pkt_sched: fq: Fair Queue packet scheduler Support for FQ packet scheduler $ tc qd add dev eth0 root fq help Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ] [ quantum BYTES ] [ initial_quantum BYTES ] [ maxrate RATE ] [ buckets NUMBER ] [ [no]pacing ] $ tc -s -d qd qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140 Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14) backlog 0b 0p requeues 14 511 flows (511 inactive, 0 throttled) 110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit limit : max number of packets on whole Qdisc (default 10000) flow_limit : max number of packets per flow (default 100) quantum : the max deficit per RR round (default is 2 MTU) initial_quantum : initial credit for new flows (default is 10 MTU) maxrate : max per flow rate (default : unlimited) buckets : number of RB trees (default : 1024) in hash table. (consumes 8 bytes per bucket) [no]pacing : disable/enable pacing (default is enable) Usage : tc qdisc add dev $ETH root fq tc qdisc del dev $ETH root 2>/dev/null tc qdisc add dev $ETH root handle 1: mq for i in `seq 1 4`; do tc qdisc add dev $ETH parent 1:$i est 1sec 4sec fq; done Signed-off-by: Eric Dumazet <edumazet@google.com>	2013-09-20 09:43:40 -07:00
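The usage loop in the fq commit attaches one fq instance to each of four mq classes. The minimal dry-run sketch below prints the same commands for an arbitrary number of TX queues. The function name `fq_mq_cmds` is hypothetical, and the sketch assumes that mq's per-queue classes are numbered `1:1` through `1:N`. tc parses the minor part of a class ID as hexadecimal, so the sketch formats it with `printf %x`. As a result, the tenth queue is addressed as `1:a` rather than `1:10`, which a plain `$i` would silently turn into class 0x10:

```shell
# Hypothetical dry-run helper: print the commands that put an mq root on
# device $1 and attach one fq instance to each of its $2 per-queue classes.
fq_mq_cmds() {
    dev=$1 nqueues=$2
    echo "tc qdisc del dev $dev root 2>/dev/null"
    echo "tc qdisc add dev $dev root handle 1: mq"
    i=1
    while [ "$i" -le "$nqueues" ]; do
        # tc reads class IDs as hex: queue 10 is class 1:a, not 1:10.
        echo "tc qdisc add dev $dev parent 1:$(printf '%x' "$i") est 1sec 4sec fq"
        i=$((i + 1))
    done
}
```

Printing the commands instead of running them makes the output easy to review. Piping it to `sh` as root would apply it.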
Benjamin Poirier	5ab3a4de5e	Use pkg-config to obtain xtables.h path On openSUSE 12.2 (at least) xtables.h is not installed in the system-wide include dir but in /usr/include/iptables-1.4.16.3/. This results in the following build failure: em_ipset.c:26:21: fatal error: xtables.h: No such file or directory Other includers of xtables.h already call out to pkg-config	2013-02-11 09:19:54 -08:00
Mike Frysinger	e4fc4ada33	allow pkg-config to be customized Rather than hard coding `pkg-config`, use ${PKG_CONFIG} so people can override it to their specific version (like when cross-compiling). This is the same way the upstream pkg-config code works. Signed-off-by: Mike Frysinger <vapier@gentoo.org>	2012-11-11 16:21:34 -08:00
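The two build changes in these commits fit a common pattern. First, ask pkg-config for the xtables compiler flags instead of assuming `xtables.h` sits in the default include path. Second, let `${PKG_CONFIG}` override which pkg-config binary is used, for example when cross-compiling. The minimal shell sketch below assumes that iptables installs an `xtables` pkg-config module. The function name `xtables_cflags` is hypothetical, and this is not iproute2's actual build code:

```shell
# Hypothetical helper: print the compiler flags needed to find xtables.h.
# ${PKG_CONFIG} may name a cross or versioned pkg-config; it falls back to
# plain pkg-config when unset or empty. It is left unquoted on purpose so a
# value with extra arguments is split into words.
xtables_cflags() {
    ${PKG_CONFIG:-pkg-config} --cflags xtables
}
```

On a system like the openSUSE 12.2 case described in the commit, the output would be expected to include an `-I` flag pointing at the versioned iptables include directory.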

117 Commits