iproute2/tc
Daniel Borkmann 32e93fb7f6 {f,m}_bpf: allow for sharing maps
This larger work addresses one of the bigger remaining issues on
tc's eBPF frontend, that is, to allow for persistent file descriptors.
Whenever tc parses the ELF object, extracts and loads maps into the
kernel, these file descriptors will be out of reach after the tc
instance exits.

Meaning, for simple (unnested) programs which contain one or
multiple maps, the kernel holds a reference, and they will live
on inside the kernel until the program holding them is unloaded,
but they will be out of reach for user space, even worse with
(also multiple nested) tail calls.

For this issue, we introduced the concept of an agent that can
receive the set of file descriptors from the tc instance creating
them, in order to be able to further inspect/update map data for
a specific use case. However, while that is more tied towards
specific applications, it still doesn't easily allow for sharing
maps across multiple tc instances and would require a daemon to
be running in the background. For example, when a map should be shared by
two eBPF programs, one attached to ingress, one to egress, this
currently doesn't work with the tc frontend.

This work solves exactly that, i.e. if requested, maps can now be
_arbitrarily_ shared between object files (PIN_GLOBAL_NS) or within
a single object (but various program sections, PIN_OBJECT_NS) without
"losing" the file descriptor set. To make that happen, we use eBPF
object pinning introduced in kernel commit b2197755b263 ("bpf: add
support for persistent maps/progs") for exactly this purpose.
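
As a rough userspace illustration of what pinning provides, the sketch
below pins an already loaded object at a path and later fetches a fresh
file descriptor for it through the bpf(2) system call. It is not the code
tc uses: the helper names sys_bpf(), obj_pin() and obj_get() are made up
for this example, and the caller is assumed to supply a path on a mounted
bpf filesystem.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the bpf(2) system call (hypothetical helper). */
static int sys_bpf(int cmd, union bpf_attr *attr)
{
	return (int)syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/*
 * Pin the map or program behind 'fd' at 'path'.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int obj_pin(int fd, const char *path)
{
	union bpf_attr attr;

	assert(path != NULL);
	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	attr.bpf_fd = (uint32_t)fd;

	return sys_bpf(BPF_OBJ_PIN, &attr);
}

/*
 * Fetch a new file descriptor for the object pinned at 'path'.
 * Returns the fd on success, -1 with errno set on failure.
 */
static int obj_get(const char *path)
{
	union bpf_attr attr;

	assert(path != NULL);
	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;

	return sys_bpf(BPF_OBJ_GET, &attr);
}
```

A first process would call obj_pin() once after creating the map; any
later process can then call obj_get() with the same path, which is what
removes the need for a long-running agent holding the descriptors.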

The shipped examples/bpf/bpf_shared.c code from this patch can be
easily applied, for instance, as:

 - classifier-classifier shared:

  tc filter add dev foo parent 1: bpf obj shared.o sec egress
  tc filter add dev foo parent ffff: bpf obj shared.o sec ingress

 - classifier-action shared (here: late binding to a dummy classifier):

  tc actions add action bpf obj shared.o sec egress pass index 42
  tc filter add dev foo parent ffff: bpf obj shared.o sec ingress
  tc filter add dev foo parent 1: bpf bytecode '1,6 0 0 4294967295,' \
     action bpf index 42

The toy example increments a shared counter on egress and dumps its
value on ingress (had no sharing (PIN_NONE) been chosen, the dumped
map value would of course stay 0, since two separate map instances
would be created):

  [...]
          <idle>-0     [002] ..s. 38264.788234: : map val: 4
          <idle>-0     [002] ..s. 38264.788919: : map val: 4
          <idle>-0     [002] ..s. 38264.789599: : map val: 5
  [...]

... thus if both sections reference the pinned map(s) in question,
tc will take care of fetching the appropriate file descriptor.

The patch has been tested extensively on both the classifier and
action sides.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2015-11-23 16:10:44 -08:00
.gitignore Add ignore files to make using git easier 2006-08-08 12:04:38 -07:00
Makefile tc: add support for Flower classifier 2015-05-21 15:22:49 -07:00
README.last tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
e_bpf.c {f,m}_bpf: allow for sharing maps 2015-11-23 16:10:44 -08:00
em_canid.c Ematch used to classify CAN frames according to their identifiers 2012-08-20 13:11:55 -07:00
em_cmp.c Fix wrong comparison in cmp_print_eopt() 2011-10-07 11:16:15 -07:00
em_ipset.c tc: add ipset ematch 2012-08-13 08:33:50 -07:00
em_meta.c add missing underscore to man page and example nf_mark ematch 2014-10-09 08:24:00 -07:00
em_nbyte.c tc: remove dlfcn.h from files that dont need it 2009-11-13 14:14:07 -08:00
em_u32.c tc: remove dlfcn.h from files that dont need it 2009-11-13 14:14:07 -08:00
emp_ematch.l fix build issues with flex ver 2.5 2010-04-22 15:27:42 -07:00
emp_ematch.y tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
f_basic.c tc: fill in handle before checking argc 2015-05-11 09:13:20 -07:00
f_bpf.c {f,m}_bpf: allow for sharing maps 2015-11-23 16:10:44 -08:00
f_cgroup.c discourage use of direct policer interface 2014-10-09 08:26:57 -07:00
f_flow.c whitespace cleanup 2014-12-20 15:47:17 -08:00
f_flower.c tc: improve filter help texts a bit 2015-10-23 15:37:26 -07:00
f_fw.c discourage use of direct policer interface 2014-10-09 08:26:57 -07:00
f_route.c tc: improve filter help texts a bit 2015-10-23 15:37:26 -07:00
f_rsvp.c tc: improve filter help texts a bit 2015-10-23 15:37:26 -07:00
f_tcindex.c tcindex classifier support for multiple actions 2014-10-09 08:26:56 -07:00
f_u32.c tc: u32 filter coding style cleanup 2015-10-23 15:37:26 -07:00
m_action.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
m_bpf.c {f,m}_bpf: allow for sharing maps 2015-11-23 16:10:44 -08:00
m_connmark.c comment: Fix remaining listings of wrong FSF address 2015-09-23 15:58:54 -07:00
m_csum.c csum action, fix typo 2012-03-15 14:24:59 -07:00
m_ematch.c Fix NULL pointer reference when using basic match 2010-07-29 18:03:35 -07:00
m_ematch.h include needed files 2012-12-23 11:49:06 -08:00
m_estimator.c ip: make local functions static 2013-02-12 11:38:35 -08:00
m_gact.c ip: make local functions static 2013-02-12 11:38:35 -08:00
m_ipt.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
m_mirred.c More minor spelling fixes 2013-08-04 15:10:05 -07:00
m_nat.c action: typo nat fix 2013-09-30 21:31:40 -07:00
m_pedit.c iproute2: tc/m_pedit.c - remove dead code 2015-06-25 08:52:06 -04:00
m_pedit.h Remove trailing whitespace 2006-12-05 10:10:22 -08:00
m_police.c Remove unnecessary debug statement 2014-05-28 16:54:26 -07:00
m_simple.c tc: fix compilation warning on 32bits arch 2015-04-27 11:41:46 -07:00
m_skbedit.c whitespace cleanup 2014-12-20 15:47:17 -08:00
m_vlan.c actions: Get vlan action to work in pipeline 2015-01-13 17:22:44 -08:00
m_xt.c whitespace cleanup 2014-12-20 15:47:17 -08:00
m_xt_old.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
p_icmp.c Remove trailing whitespace 2006-12-05 10:10:22 -08:00
p_ip.c Remove trailing whitespace 2006-12-05 10:10:22 -08:00
p_tcp.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
p_udp.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
q_atm.c Convert to use rta_getattr_ functions 2012-04-10 08:47:55 -07:00
q_cbq.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
q_choke.c whitespace cleanup 2014-12-20 15:47:17 -08:00
q_codel.c codel: add ce_threshold support to codel & fc_codel 2015-05-21 15:25:05 -07:00
q_drr.c Convert to use rta_getattr_ functions 2012-04-10 08:47:55 -07:00
q_dsmark.c Convert to use rta_getattr_ functions 2012-04-10 08:47:55 -07:00
q_fifo.c iproute2: clearer error messages for fifo and tbf qdiscs 2013-02-21 08:34:34 -08:00
q_fq.c fq: fix whitespace 2015-09-25 12:40:00 -07:00
q_fq_codel.c codel: add ce_threshold support to codel & fc_codel 2015-05-21 15:25:05 -07:00
q_gred.c tc: gred: Add support for TCA_GRED_LIMIT attribute 2015-05-21 15:30:39 -07:00
q_hfsc.c HFSC (7) & (8) documentation + assorted changes 2011-11-02 16:33:50 -07:00
q_hhf.c support for Heavy Hitter Filter (HHF) qdisc 2014-05-09 12:10:47 -07:00
q_htb.c htb: Move direct_qlen code part to htb_parse_opt(). 2014-03-21 14:20:06 -07:00
q_ingress.c tc: minor cleanup on ingress 2015-05-11 09:18:10 -07:00
q_mqprio.c ip: make local functions static 2013-02-12 11:38:35 -08:00
q_multiq.c whitespace cleanup 2014-12-20 15:47:17 -08:00
q_netem.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
q_pie.c PIE: Proportional Integral controller Enhanced 2014-01-09 22:50:47 -08:00
q_prio.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
q_qfq.c Convert to use rta_getattr_ functions 2012-04-10 08:47:55 -07:00
q_red.c tc: red: Mark "bandwidth" parameter as optional in usage text 2015-05-21 14:16:03 -07:00
q_rr.c ip: make local functions static 2013-02-12 11:38:35 -08:00
q_sfb.c tc : SFB flow scheduler 2011-04-12 14:27:37 -07:00
q_sfq.c whitespace cleanup 2014-12-20 15:47:17 -08:00
q_tbf.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
static-syms.c Fix build when shared libraries are disabled 2013-03-13 08:29:59 -07:00
tc.c tc : add timestamps to tc monitor 2015-09-25 12:35:46 -07:00
tc_bpf.c {f,m}_bpf: allow for sharing maps 2015-11-23 16:10:44 -08:00
tc_bpf.h {f,m}_bpf: allow for sharing maps 2015-11-23 16:10:44 -08:00
tc_cbq.c Replace "usec" by "time" in function names 2007-03-13 14:42:17 -07:00
tc_cbq.h (Logical change 1.3) 2004-04-15 20:56:59 +00:00
tc_class.c libnetlink: add size argument to rtnl_talk 2015-05-27 13:00:21 -07:00
tc_common.h tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
tc_core.c htb: support 64bit rates 2013-11-22 17:36:18 -08:00
tc_core.h htb: support 64bit rates 2013-11-22 17:36:18 -08:00
tc_estimator.c Introduce TIME_UNITS_PER_SEC to represent internal clock resolution 2007-03-13 14:42:16 -07:00
tc_exec.c tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
tc_filter.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
tc_monitor.c tc : add timestamps to tc monitor 2015-09-25 12:35:46 -07:00
tc_qdisc.c libnetlink: add size argument to rtnl_talk 2015-05-27 13:00:21 -07:00
tc_red.c red: give a hint about burst value 2011-12-01 09:23:43 -08:00
tc_red.h (Logical change 1.3) 2004-04-15 20:56:59 +00:00
tc_stab.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
tc_util.c tc: remove extra whitespace 2015-10-23 15:43:28 -07:00
tc_util.h tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00

README.last

Kernel code and interface.
--------------------------

* Compile time switches

There is only one, but very important, compile time switch.
It is not settable by "make config", but should be selected
manually, after a bit of thinking, in <include/net/pkt_sched.h>.

PSCHED_CLOCK_SOURCE can take three values:

	PSCHED_GETTIMEOFDAY
	PSCHED_JIFFIES
	PSCHED_CPU


 PSCHED_GETTIMEOFDAY

The default setting is the most conservative one, PSCHED_GETTIMEOFDAY.
It is very slow, both because of the odd slowness of do_gettimeofday()
and because it forces code to use the unnatural "timeval" format,
where the microseconds and seconds fields are separate.
Besides that, it will misbehave when delays exceed 2 seconds
(e.g. on very slow links or classes bound to a small slice of bandwidth).
To sum up: as soon as you get it working, select the correct clock
source and forget about PSCHED_GETTIMEOFDAY forever.
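
To illustrate why the split "timeval" format is awkward, here is a
minimal sketch of computing an elapsed time from two timestamps. It is
not the kernel's code; the function name tv_diff_usec() is made up for
this example. Every interval needs two subtractions and a rescaling
across fields, where a flat counter would need a single subtraction.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

#define USEC_PER_SEC 1000000LL

/*
 * Microseconds elapsed from *earlier to *later (hypothetical helper).
 * Both timestamps are expected to be normalized, i.e. tv_usec must
 * lie in [0, 1000000).
 */
static int64_t tv_diff_usec(const struct timeval *later,
			    const struct timeval *earlier)
{
	assert(later->tv_usec >= 0 && later->tv_usec < USEC_PER_SEC);
	assert(earlier->tv_usec >= 0 && earlier->tv_usec < USEC_PER_SEC);

	/* The two fields are subtracted separately ... */
	int64_t sec = (int64_t)later->tv_sec - (int64_t)earlier->tv_sec;
	int64_t usec = (int64_t)later->tv_usec - (int64_t)earlier->tv_usec;

	/* ... and then merged; usec may be negative, which acts as a borrow. */
	return sec * USEC_PER_SEC + usec;
}
```

For instance, the interval from 10.900000 s to 13.100000 s comes out as
2200000 microseconds.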


 PSCHED_JIFFIES

The clock is derived from jiffies. On architectures with HZ=100,
the granularity of this clock is not enough to make reasonable
bindings to real time. However, taking into account Linux
architecture problems, which force us to use an artificial
integrated clock in any case, this switch is not so bad
for scheduling even on high speed networks, though policing
is not reliable.
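
As a rough worked example of the granularity concern, the sketch below
computes how long one jiffy lasts and how many bytes a link carries
during it. The helper names and the 100 Mbit/s figure are assumptions
chosen for illustration only, not values taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Length of one jiffy in microseconds for a given HZ (hypothetical helper). */
static uint32_t tick_usec(uint32_t hz)
{
	assert(hz > 0);
	return 1000000u / hz;
}

/* Bytes a link of 'rate_bps' bits per second carries during one jiffy. */
static uint64_t bytes_per_tick(uint64_t rate_bps, uint32_t hz)
{
	assert(hz > 0);
	return rate_bps / 8 / hz;
}
```

With HZ=100, tick_usec(100) gives 10000 microseconds, and
bytes_per_tick(100000000, 100) gives 125000 bytes: on a 100 Mbit/s link
the clock cannot distinguish events closer together than about 125000
bytes of traffic.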


 PSCHED_CPU

It is available only for alpha and pentiums with a correct
CPU timestamp. It is the fastest way; use it when it is available,
but remember: not all pentiums have this facility, and
a lot of them have a clock broken by APM etc.