This work follows upon commit 6256f8c9e4 ("tc, bpf: finalize eBPF
support for cls and act front-end") and takes up the idea proposed by
Hannes Frederic Sowa to spawn a shell (or any other command) that holds
generated eBPF map file descriptors.
File descriptors, based on their id, are being fetched from the same
unix domain socket as demonstrated in the bpf_agent, the shell spawned
via execvpe(2) and the map fds passed over the environment, and thus
are made available to applications in the fashion of std{in,out,err}
for read/write access, for example in case of iproute2's examples/bpf/:
# env | grep BPF
BPF_NUM_MAPS=3
BPF_MAP1=6 <- BPF_MAP_ID_QUEUE (id 1)
BPF_MAP0=5 <- BPF_MAP_ID_PROTO (id 0)
BPF_MAP2=7 <- BPF_MAP_ID_DROPS (id 2)
# ls -la /proc/self/fd
[...]
lrwx------. 1 root root 64 Apr 14 16:46 0 -> /dev/pts/4
lrwx------. 1 root root 64 Apr 14 16:46 1 -> /dev/pts/4
lrwx------. 1 root root 64 Apr 14 16:46 2 -> /dev/pts/4
[...]
lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
The advantage (as opposed to the direct/native usage) is that now the
shell is map fd owner and applications can terminate and easily reattach
to descriptors w/o any kernel changes. Moreover, multiple applications
can easily read/write eBPF maps simultaneously.
To further allow users for experimenting with that, next step is to add
a small helper that can get along with simple data types, so that also
shell scripts can make use of bpf syscall, f.e to read/write into maps.
Generally, this allows for prepopulating maps, or any runtime altering
which could influence eBPF program behaviour (f.e. different run-time
classifications, skb modifications, ...), dumping of statistics, etc.
Reference: http://thread.gmane.org/gmane.linux.network/357471/focus=357860
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Add ability to add the netfilter connmark support.
Typical usage:
...lets tag outgoing icmp with mark 0x10..
iptables -tmangle -A PREROUTING -p icmp -j CONNMARK --set-mark 0x10
..add on ingress of $ETH an extractor for connmark...
tc filter add dev $ETH parent ffff: prio 4 protocol ip \
u32 match ip protocol 1 0xff \
flowid 1:1 \
action connmark continue
...if the connmark was 0x11, we police to a ridic rate of 10Kbps
tc filter add dev $ETH parent ffff: prio 5 protocol ip \
handle 0x11 fw flowid 1:1 \
action police rate 10kbit burst 10k
Other ways to use the connmark is to supply the zone, index and
branching choice. Refer to help.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
This work adds the tc frontend for kernel commit e2e9b6541dd4 ("cls_bpf:
add initial eBPF support for programmable classifiers").
A C-like classifier program (f.e. see e2e9b6541dd4) is being compiled via
LLVM's eBPF backend into an ELF file, that is then being passed to tc. tc
then loads, if any, eBPF maps and eBPF opcodes (with fixed-up eBPF map file
descriptors) out of its dedicated sections, and via bpf(2) into the kernel
and then the resulting fd via netlink down to cls_bpf. cls_bpf allows for
annotations, currently, I've used the file name for that, so that the user
can easily identify his filter when dumping configurations back.
Example usage:
clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
tc filter add dev em1 parent 1: bpf run object-file cls.o classid x:y
tc filter show dev em1 [...]
filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid x:y cls.o
I placed the parser bits derived from Alexei's kernel sample, into tc_bpf.c
as my next step is to also add the same support for BPF action, so we can
have a fully fledged eBPF classifier and action in tc.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Proportional Integral controller Enhanced (PIE) is a scheduler to address the
bufferbloat problem.
We present here a lightweight design, PIE(Proportional Integral controller
Enhanced) that can effectively control the average queueing latency to a target
value. Simulation results, theoretical analysis and Linux testbed results have
shown that PIE can ensure low latency and achieve high link utilization under
various congestion situations. The design does not require per-packet
timestamp, so it incurs very small overhead and is simple enough to implement
in both hardware and software. "
For more information, please see technical paper about PIE in the IEEE
Conference on High Performance Switching and Routing 2013. A copy of the paper
can be found at ftp://ftpeng.cisco.com/pie/.
Please also refer to the IETF draft submission at
http://tools.ietf.org/html/draft-pan-tsvwg-pie-00
All relevant code, documents and test scripts and results can be found at
ftp://ftpeng.cisco.com/pie/.
For problems with the iproute2/tc or Linux kernel code, please contact Vijay
Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) Mythili Prabhu
(mysuryan@cisco.com)
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
CC: Dave Taht <dave.taht@bufferbloat.net>
This is the iproute2 part of the kernel patch "net: sched:
add BPF-based traffic classifier".
[Will re-submit later again for iproute2 when window for
-next submissions opens.]
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Simple action is already in the kernel for years now as an
example. This complements it with user space control.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
On openSUSE 12.2 (at least) xtables.h is not installed in the system-wide
include dir but in /usr/include/iptables-1.4.16.3/. This results in the
following build failure:
em_ipset.c:26:21: fatal error: xtables.h: No such file or directory
Other includers of xtables.h already call out to pkg-config
Rather than hard coding `pkg-config`, use ${PKG_CONFIG} so people can
override it to their specific version (like when cross-compiling).
This is the same way the upstream pkg-config code works.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Hi,
When compiling iproute2-3.6.0 on a host that doesn't have iptables available, I get the following error:
gcc -Wall -Wstrict-prototypes -O2 -I../include -DRESOLVE_HOSTNAMES
-DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" -D_GNU_SOURCE
-DCONFIG_GACT -DCONFIG_GACT_PROB -DYY_NO_INPUT -c -o em_ipset.o
em_ipset.c
em_ipset.c:26:21: fatal error: xtables.h: No such file or directory
Fixed by the following patch, which guards the building of em_ipset.o on
the presence of suitable headers.
Thanks,
Matt.
This ematch enables effective filtering of CAN frames (AF_CAN) based
on CAN identifiers with masking of compared bits. Implementation
utilizes bitmap based classification for standard frame format (SFF)
which is optimized for minimal overhead.
Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
example usage:
tc filter add dev $dev parent $id: basic match not ipset'(foobar src)' ..
also updates iproute2/ematch_map, else tc complains:
Error: Unable to find ematch "ipset" in /etc/iproute2/ematch_map
Please assign a unique ID to the ematch kind the suggested entry is:
8 ipset
when trying to use this ematch.
(text ematch (5) only exists in kernel, a vlan ematch (6) exists neither in
kernel nor userspace, but kernel headers define TCF_EM_VLAN == 6).
Fair Queue Codel packet scheduler
Principles :
- Packets are classified (internal classifier or external) on flows.
- This is a Stochastic model (as we use a hash, several flows might
be hashed on same slot)
- Each flow has a CoDel managed queue.
- Flows are linked onto two (Round Robin) lists,
so that new flows have priority on old ones.
- For a given flow, packets are not reordered (CoDel uses a FIFO)
- head drops only.
- ECN capability is on by default.
- Very low memory footprint (64 bytes per flow)
tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
[ target TIME ] [ interval TIME ] [ noecn ]
[ quantum BYTES ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Van Jacobson <van@pollere.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Changli Gao <xiaosuo@gmail.com>
An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.
http://queue.acm.org/detail.cfm?id=2209336
This AQM main input is no longer queue size in bytes or packets, but the
delay packets stay in (FIFO) queue.
As we don't have infinite memory, we still can drop packets in enqueue()
in case of massive load, but mean of CoDel is to drop packets in
dequeue(), using a control law based on two simple parameters :
target : target sojourn time (default 5ms)
interval : width of moving time window (default 100ms)
Selected packets are dropped, unless ECN is enabled and packets can get
ECN mark instead.
Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
[ interval TIME ] [ ecn ]
qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
maxpacket 1514 ecn_mark 84399 drop_overlimit 0
CoDel must be seen as a base module, and should be used keeping in mind
there is still a FIFO queue. So a typical setup will probably need a
hierarchy of several qdiscs and packet classifiers to be able to meet
whatever constraints a user might have.
One possible example would be to use fq_codel, which combines Fair
Queueing and CoDel, in replacement of sfq / sfq_red.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
Define where is the are located the iproute2 config files.
Get rid of trailing slashes for paths in several file.
Signed-off-by: Christoph J. Thompson <cjsthompson@gmail.com>
LIBNETLINK will be defined in the main Makefile, so
both ../lib/libnetlink.a ../lib/libutil.a will be
automatically appended during linking. Otherwise
../lib/libnetlink.a ../lib/libutil.a will appear
twice during linking.
Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
Building iproute2 in parallel might hit the race failure:
emp_ematch.l:2:30: fatal error: emp_ematch.yacc.h:
No such file or directory
make[1]: *** [emp_ematch.lex.o] Error 1
This is because we currently allow the yacc/lex files to generate and
compile in parallel. So add a simple dependency to make sure yacc has
finished before we attempt to compile the lex output.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Add the iproute2 support for the ACT_CSUM action. Can be used as
following, certainly in conjunction with the ACT_PEDIT action (pedit):
# In order to DNAT (stateless) IPv4 packet from 192.168.1.100 to
# 0x12345678 (18.52.86.120), and update the IPv4 header checksum and
# the UDP checksum (the last one, only if the packet is UDP).
tc filter add eth0 prio 1 protocol ip parent ffff: \
u32 match ip src 192.168.1.100/32 flowid :1 \
action pedit munge offset 16 u32 set 0x12345678 \
pipe csum ip and udp
# In order to alter destination address of IPv6 TCP packets from fc00::1
# and correct the TCP checksum (nothing happened? except maybe for
# checksums in the TCP payload ...).
tc filter add eth0 prio 1 protocol ipv6 parent ffff: \
u32 match ip6 src fc00::1/128 match ip6 protocol 0x06 0xff flowid :1 \
action pedit munge offset 24 u32 set 0x12345678 \
pipe csum tcp
The recent commit "iproute2: add option to build m_xt as a tc module"
(ab814d6355) looks like it wrongly included debug changes in the
install target. So drop the `echo` so the tc binary actually gets
installed again.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
This will build the xt module (action ipt) of tc as a
shared object that is linked at runtime by tc if used,
rather then built into tc.
This is similar to how the atm qdisc support
is handled (q_atm.so).
Signed-off-by: Andreas Henriksson <andreas@xxxxxxxx>
Try to automatically detect iptables modules directory.
Make the configure script look for iptables modules.
This also makes it possible to specify it on the
command line while building via "make IPT_LIB_DIR=/foo/bar".
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
Since there aren't any targets that currently use this pattern rule, this
is more of a proactive fix.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Move the file and rename the configure flags.
The file is being kept around for iptables < 1.4.5 compatibility.
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
The iptables code supports a "no shared libs" mode where it can be used
without requiring dlfcn related functionality. This adds similar support
to iproute2 so that it can easily be used on systems like nommu Linux (but
obviously with a few limitations -- no dynamic plugins).
Rather than modify every location that uses dlfcn.h, I hooked the dlfcn.h
header with stub functions when shared library support is disabled. Then
symbol lookup is done via a local static lookup table (which is generated
automatically at build time) so that internal symbols can be found.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Many thanks to Yevgeny Kosarzhevsky <yevg@pisem.net> for reporting
and a lot of testing
Thanks to Jan Engelhardt <jengelh@medozas.de> for a lot of advice
Thanks to Denys Fedoryschenko <denys@visp.net.lb> for some sample
code that he tried and thanks to Andreas Henriksson <andreas@fatal.se>
(who maintains iproute2 on debian) for the persistent followup.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Add support for multiq qdisc
This patch adds the ability to configure the multiq qdisc. Since the qdisc does not require any input it will pull the number of bands directly from the device that it is added to the root of.
usage: tc qdisc add dev <DEV> root handle <HANDLE> multiq
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Provides ability to edit queue_mapping field
Provides ability to edit priority field
usage: action skbedit [queue_mapping QUEUE_MAPPING] [priority PRIORITY]
at least one option must be select, or both at the same time
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Hello Rafael Almeida.
I noticed your patch adding DESTDIR support in the latest iproute2 release.
Much appreciated! Soon the debian packages might be able to move to actually
using "make install" rather then it's own installation procedure when
building packages. I've noticed something that will break though....
Debian packages usually sets DESTDIR=debian/tmp/ and packages the contents
of that directory as if it where the root file system. This will break
the /usr/lib/{tc,ip}/ module loading, because they DESTDIR (/usr) will be
/whatever-the-build-path-was/debian/tmp/lib/{tc,ip}/.
I beleive others usually call this the LIBDIR to make the separation between
DISTDIR being the (possibly temporary) place things are put when build is
done, and LIBDIR (and others) are used for actual runtime paths.
I'm attaching a patch that I think fixes this, but would be really happy if
you could have a look at to verify I'm not screwing something up.
--
Regards,
Andreas Henriksson
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Patch adds generic size table that is similiar to rate table, with
difference that size table stores link layer packet size.
Based on patch by Patrick McHardy
http://marc.info/?l=linux-netdev&m=115201979221729&w=2
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
After changing the DESTDIR the installated binaries have some issues
due to hard coded paths. For example, using distributions on NetEm
would segfault.
I've changed iplink.c and tc_util.c so they are now aware of DESTDIR.
Along with that change I needed to change the main Makefile so it
defines the DESTDIR macro when calling gcc.
I also changed the paths so that during the installation sbin, etc,
share and lib directories are created directly inside of the DESTDIR,
instead of creating a usr directory inside that. That's the behaviour
of most packages out there, so I think most users will be expecting
that to happen.
[IPROUTE]: Add flow classifier support
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>