iproute2/tc
Daniel Borkmann 4bd624467b tc: built-in eBPF exec proxy
This work follows up on commit 6256f8c9e4 ("tc, bpf: finalize eBPF
support for cls and act front-end") and takes up the idea proposed by
Hannes Frederic Sowa to spawn a shell (or any other command) that holds
the generated eBPF map file descriptors.

File descriptors are fetched, based on their id, from the same Unix
domain socket as demonstrated in the bpf_agent. The shell is then
spawned via execvpe(2) with the map fds advertised via the environment,
so the maps are available to applications for read/write access in the
same fashion as std{in,out,err}, for example in case of iproute2's
examples/bpf/:

  # env | grep BPF
  BPF_NUM_MAPS=3
  BPF_MAP1=6        <- BPF_MAP_ID_QUEUE (id 1)
  BPF_MAP0=5        <- BPF_MAP_ID_PROTO (id 0)
  BPF_MAP2=7        <- BPF_MAP_ID_DROPS (id 2)

  # ls -la /proc/self/fd
  [...]
  lrwx------. 1 root root 64 Apr 14 16:46 0 -> /dev/pts/4
  lrwx------. 1 root root 64 Apr 14 16:46 1 -> /dev/pts/4
  lrwx------. 1 root root 64 Apr 14 16:46 2 -> /dev/pts/4
  [...]
  lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
  lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
  lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
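
The fetch step relies on the standard Unix domain socket facility for
passing descriptors between processes: SCM_RIGHTS ancillary data. Below
is a minimal, generic sketch of sending and receiving one descriptor.
It does not reproduce the actual message layout used by tc or bpf_agent,
which also carries map ids and other metadata. The helper names
send_fd() and recv_fd() are hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one fd over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char byte = 0;  /* SCM_RIGHTS must ride along with >= 1 data byte */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {         /* union guarantees cmsghdr alignment of buf */
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctl;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctl.buf, .msg_controllen = sizeof(ctl.buf),
	};
	struct cmsghdr *cmsg;

	memset(&ctl, 0, sizeof(ctl));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctl;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctl.buf, .msg_controllen = sizeof(ctl.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	/* The kernel installed a fresh descriptor for the same open file. */
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

With a socketpair(AF_UNIX, SOCK_STREAM, 0, sv), send_fd(sv[0], fd)
followed by recv_fd(sv[1]) yields a new descriptor number that refers
to the same underlying object. A map therefore stays alive as long as
any such descriptor remains open.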

The advantage (as opposed to direct/native usage) is that the shell
now owns the map fds, so applications can terminate and easily reattach
to the descriptors without any kernel changes. Moreover, multiple
applications can easily read/write eBPF maps simultaneously.
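
Child processes of the spawned shell inherit the map fds, and the
numbers are advertised in BPF_NUM_MAPS and BPF_MAP<n>. A (re)started
application can therefore attach by reading its environment. The
sketch below assumes only the variable names visible in the env output
above. The helper bpf_env_map_fd() is hypothetical. It uses
fcntl(F_GETFD) to check that the advertised descriptor is actually
open.

```c
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a non-negative int from an environment variable, or -1. */
static int env_int(const char *name)
{
	const char *val = getenv(name);
	char *end;
	long v;

	if (!val || !*val)
		return -1;
	errno = 0;
	v = strtol(val, &end, 10);
	if (errno || *end != '\0' || v < 0 || v > INT_MAX)
		return -1;
	return (int)v;
}

/* Hypothetical helper: inherited fd for map id `id`, or -1. */
static int bpf_env_map_fd(int id)
{
	char name[32];
	int num = env_int("BPF_NUM_MAPS");
	int fd;

	if (num < 0 || id < 0 || id >= num)
		return -1;
	snprintf(name, sizeof(name), "BPF_MAP%d", id);
	fd = env_int(name);
	/* Make sure the advertised descriptor is really open here. */
	if (fd < 0 || fcntl(fd, F_GETFD) < 0)
		return -1;
	return fd;
}
```

For the session shown above, bpf_env_map_fd(1) would return 6. That is
the descriptor for BPF_MAP_ID_QUEUE.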

To make further experimentation easier for users, the next step is to
add a small helper that handles simple data types, so that shell
scripts can also make use of the bpf syscall, e.g. to read/write map
entries.

In general, this allows for prepopulating maps, altering them at
runtime to influence eBPF program behaviour (e.g. different run-time
classifications, skb modifications, ...), dumping statistics, etc.

Reference: http://thread.gmane.org/gmane.linux.network/357471/focus=357860
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
2015-04-27 16:39:23 -07:00
.gitignore Add ignore files to make using git easier 2006-08-08 12:04:38 -07:00
Makefile tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
README.last (Logical change 1.3) 2004-04-15 20:56:59 +00:00
e_bpf.c tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
em_canid.c Ematch used to classify CAN frames according to their identifiers 2012-08-20 13:11:55 -07:00
em_cmp.c Fix wrong comparison in cmp_print_eopt() 2011-10-07 11:16:15 -07:00
em_ipset.c tc: add ipset ematch 2012-08-13 08:33:50 -07:00
em_meta.c add missing underscore to man page and example nf_mark ematch 2014-10-09 08:24:00 -07:00
em_nbyte.c tc: remove dlfcn.h from files that dont need it 2009-11-13 14:14:07 -08:00
em_u32.c tc: remove dlfcn.h from files that dont need it 2009-11-13 14:14:07 -08:00
emp_ematch.l fix build issues with flex ver 2.5 2010-04-22 15:27:42 -07:00
emp_ematch.y emp: fix warning on deprecated bison directive 2014-10-09 08:31:10 -07:00
f_basic.c discourage use of direct policer interface 2014-10-09 08:26:57 -07:00
f_bpf.c tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
f_cgroup.c discourage use of direct policer interface 2014-10-09 08:26:57 -07:00
f_flow.c whitespace cleanup 2014-12-20 15:47:17 -08:00
f_fw.c discourage use of direct policer interface 2014-10-09 08:26:57 -07:00
f_route.c route classifier support for multiple actions 2014-10-09 08:26:57 -07:00
f_rsvp.c discourage use of direct policer interface 2014-10-09 08:26:57 -07:00
f_tcindex.c tcindex classifier support for multiple actions 2014-10-09 08:26:56 -07:00
f_u32.c whitespace cleanup 2014-12-20 15:47:17 -08:00
m_action.c tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
m_bpf.c tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
m_connmark.c tc: add support for connmark action 2015-04-13 10:49:45 -07:00
m_csum.c csum action, fix typo 2012-03-15 14:24:59 -07:00
m_ematch.c Fix NULL pointer reference when using basic match 2010-07-29 18:03:35 -07:00
m_ematch.h include needed files 2012-12-23 11:49:06 -08:00
m_estimator.c ip: make local functions static 2013-02-12 11:38:35 -08:00
m_gact.c ip: make local functions static 2013-02-12 11:38:35 -08:00
m_ipt.c whitespace cleanup 2014-12-20 15:47:17 -08:00
m_mirred.c More minor spelling fixes 2013-08-04 15:10:05 -07:00
m_nat.c action: typo nat fix 2013-09-30 21:31:40 -07:00
m_pedit.c whitespace cleanup 2014-12-20 15:47:17 -08:00
m_pedit.h Remove trailing whitespace 2006-12-05 10:10:22 -08:00
m_police.c Remove unnecessary debug statement 2014-05-28 16:54:26 -07:00
m_simple.c tc: fix compilation warning on 32bits arch 2015-04-27 11:41:46 -07:00
m_skbedit.c whitespace cleanup 2014-12-20 15:47:17 -08:00
m_vlan.c actions: Get vlan action to work in pipeline 2015-01-13 17:22:44 -08:00
m_xt.c whitespace cleanup 2014-12-20 15:47:17 -08:00
m_xt_old.c whitespace cleanup 2014-12-20 15:47:17 -08:00
p_icmp.c Remove trailing whitespace 2006-12-05 10:10:22 -08:00
p_ip.c Remove trailing whitespace 2006-12-05 10:10:22 -08:00
p_tcp.c Remove trailing whitespace 2006-12-05 10:10:22 -08:00
p_udp.c Remove trailing whitespace 2006-12-05 10:10:22 -08:00
q_atm.c Convert to use rta_getattr_ functions 2012-04-10 08:47:55 -07:00
q_cbq.c linklayer interface between kernel and tc/userspace 2013-09-03 08:21:24 -07:00
q_choke.c whitespace cleanup 2014-12-20 15:47:17 -08:00
q_codel.c tc-codel: Update usage text 2012-05-24 15:02:05 -07:00
q_drr.c Convert to use rta_getattr_ functions 2012-04-10 08:47:55 -07:00
q_dsmark.c Convert to use rta_getattr_ functions 2012-04-10 08:47:55 -07:00
q_fifo.c iproute2: clearer error messages for fifo and tbf qdiscs 2013-02-21 08:34:34 -08:00
q_fq.c fq: allow options of fair queue set to ~0U 2014-06-09 12:42:36 -07:00
q_fq_codel.c fq_codel: Fair Queue Codel AQM 2012-05-22 14:17:49 -07:00
q_gred.c whitespace cleanup 2014-12-20 15:47:17 -08:00
q_hfsc.c HFSC (7) & (8) documentation + assorted changes 2011-11-02 16:33:50 -07:00
q_hhf.c support for Heavy Hitter Filter (HHF) qdisc 2014-05-09 12:10:47 -07:00
q_htb.c htb: Move direct_qlen code part to htb_parse_opt(). 2014-03-21 14:20:06 -07:00
q_ingress.c tc: remove stale code 2010-01-21 10:13:01 -08:00
q_mqprio.c ip: make local functions static 2013-02-12 11:38:35 -08:00
q_multiq.c whitespace cleanup 2014-12-20 15:47:17 -08:00
q_netem.c whitespace cleanup 2014-12-20 15:47:17 -08:00
q_pie.c PIE: Proportional Integral controller Enhanced 2014-01-09 22:50:47 -08:00
q_prio.c tc: prio: Perform more strict check on priomap. 2012-06-18 12:25:08 -07:00
q_qfq.c Convert to use rta_getattr_ functions 2012-04-10 08:47:55 -07:00
q_red.c Convert to use rta_getattr_ functions 2012-04-10 08:47:55 -07:00
q_rr.c ip: make local functions static 2013-02-12 11:38:35 -08:00
q_sfb.c tc : SFB flow scheduler 2011-04-12 14:27:37 -07:00
q_sfq.c whitespace cleanup 2014-12-20 15:47:17 -08:00
q_tbf.c Fixed 'tc qdisc show' for tbf when latency<0 2014-05-28 17:08:16 -07:00
static-syms.c Fix build when shared libraries are disabled 2013-03-13 08:29:59 -07:00
tc.c tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
tc_bpf.c tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
tc_bpf.h tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
tc_cbq.c Replace "usec" by "time" in function names 2007-03-13 14:42:17 -07:00
tc_cbq.h (Logical change 1.3) 2004-04-15 20:56:59 +00:00
tc_class.c tc class: Show classes as ASCII graph 2014-12-27 10:16:51 -08:00
tc_common.h tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
tc_core.c htb: support 64bit rates 2013-11-22 17:36:18 -08:00
tc_core.h htb: support 64bit rates 2013-11-22 17:36:18 -08:00
tc_estimator.c Introduce TIME_UNITS_PER_SEC to represent internal clock resolution 2007-03-13 14:42:16 -07:00
tc_exec.c tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
tc_filter.c tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00
tc_monitor.c ip: make local functions static 2013-02-12 11:38:35 -08:00
tc_qdisc.c whitespace cleanup 2014-12-20 15:47:17 -08:00
tc_red.c red: give a hint about burst value 2011-12-01 09:23:43 -08:00
tc_red.h (Logical change 1.3) 2004-04-15 20:56:59 +00:00
tc_stab.c iproute2: various header include fixes for compiling with musl libc 2014-05-28 16:51:39 -07:00
tc_util.c tc util: Fix possible buffer overflow when print class id 2015-04-20 10:06:02 -07:00
tc_util.h tc: built-in eBPF exec proxy 2015-04-27 16:39:23 -07:00

README.last

Kernel code and interface.
--------------------------

* Compile time switches

There is only one compile time switch, but it is a very important one.
It cannot be set via "make config"; it has to be selected manually,
after a bit of thinking, in <include/net/pkt_sched.h>.

PSCHED_CLOCK_SOURCE can take three values:

	PSCHED_GETTIMEOFDAY
	PSCHED_JIFFIES
	PSCHED_CPU
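
A sketch of what such a compile time selection looks like. It assumes
the three names are preprocessor macros with distinct values. The
numeric values and psched_clock_name are illustrative and are not
copied from the kernel header.

```c
/* Illustrative values only: three distinct clock source ids. */
#define PSCHED_GETTIMEOFDAY 1
#define PSCHED_JIFFIES      2
#define PSCHED_CPU          3

/* The one manual choice, edited by hand in the header. */
#define PSCHED_CLOCK_SOURCE PSCHED_JIFFIES

/* Hypothetical: the choice is resolved entirely by the preprocessor. */
#if PSCHED_CLOCK_SOURCE == PSCHED_GETTIMEOFDAY
static const char psched_clock_name[] = "gettimeofday";
#elif PSCHED_CLOCK_SOURCE == PSCHED_JIFFIES
static const char psched_clock_name[] = "jiffies";
#elif PSCHED_CLOCK_SOURCE == PSCHED_CPU
static const char psched_clock_name[] = "cpu";
#else
#error "PSCHED_CLOCK_SOURCE must be one of the three values above"
#endif
```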


 PSCHED_GETTIMEOFDAY

The default setting is the most conservative one, PSCHED_GETTIMEOFDAY.
It is very slow, both because do_gettimeofday() is itself oddly slow
and because it forces the code to use the unnatural "timeval" format,
where seconds and microseconds are kept in separate fields.
Besides that, it misbehaves when delays exceed 2 seconds
(e.g. on very slow links, or for classes bound to a small slice of
bandwidth). To sum up: as soon as you get things working, select the
correct clock source and forget about PSCHED_GETTIMEOFDAY forever.
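
The cost of the split-field "timeval" format shows up even in a simple
time difference. The minimal sketch below contrasts it with a clock kept
as a single integer count. The helpers tv_delta_us() and scalar_delta()
are hypothetical; they are not the kernel's PSCHED_* macros.

```c
#include <sys/time.h>

/* Hypothetical: timeval delta needs two subtractions plus a
 * multiply-add to merge the fields. A negative usec part is simply
 * folded into the total (1 s and -700000 us -> 300000 us). */
static long long tv_delta_us(const struct timeval *later,
			     const struct timeval *earlier)
{
	long long sec  = (long long)later->tv_sec  - earlier->tv_sec;
	long long usec = (long long)later->tv_usec - earlier->tv_usec;

	return sec * 1000000LL + usec;
}

/* Hypothetical: with a single integer clock (jiffies, CPU cycles)
 * a delta is one subtraction; unsigned math also survives wraparound. */
static unsigned long scalar_delta(unsigned long later,
				  unsigned long earlier)
{
	return later - earlier;
}
```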


 PSCHED_JIFFIES

The clock is derived from jiffies. On architectures with HZ=100, the
granularity of this clock is too coarse to bind reasonably to real
time. However, given the Linux architecture problems that force us to
use an artificial integrated clock in any case, this switch is not so
bad for scheduling, even on high speed networks, though policing is
not reliable.
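
A quick worked calculation shows why HZ=100 is coarse. One tick is
10 ms. A 1500-byte frame at 100 Mbit/s takes only 120 µs to transmit,
so about 83 full-size packets fit into a single tick. The helper names
below are hypothetical; the numbers follow directly from the
definitions.

```c
/* Hypothetical: length of one jiffy in microseconds for a given HZ. */
static unsigned long jiffy_us(unsigned long hz)
{
	return 1000000UL / hz;
}

/* Hypothetical: time to put `bytes` on a link of `bits_per_sec`, in us.
 * 64-bit intermediate avoids overflow of bytes * 8 * 10^6. */
static unsigned long tx_time_us(unsigned long bytes,
				unsigned long bits_per_sec)
{
	return (unsigned long)((unsigned long long)bytes * 8ULL *
			       1000000ULL / bits_per_sec);
}
```

For example, jiffy_us(100) / tx_time_us(1500, 100000000UL) gives
10000 / 120 = 83 packets per tick.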


 PSCHED_CPU

It is available only on Alpha and on Pentiums with a correct CPU
timestamp counter. It is the fastest option; use it when it is
available, but remember: not all Pentiums have this facility, and many
of them have a clock broken by APM etc.
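
On x86, the CPU timestamp is the time stamp counter, read with the
RDTSC instruction. The sketch below assumes GCC or Clang, which provide
__rdtsc() in <x86intrin.h>, and falls back to clock_gettime() on other
architectures. The Alpha cycle counter is not covered. The sketch
converts nothing to real time, because the counter rate is CPU specific
and needs calibration. It also does not detect a counter broken by APM
or similar, which is exactly the caveat above. cpu_clock() and
cpu_clock_delta() are hypothetical names.

```c
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Hypothetical: read the time stamp counter via the GCC/Clang intrinsic. */
static uint64_t cpu_clock(void)
{
	return __rdtsc();
}
#else
/* Hypothetical fallback: a monotonic nanosecond clock. */
static uint64_t cpu_clock(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
#endif

/* Elapsed ticks between two readings: a single (wrapping) subtraction. */
static uint64_t cpu_clock_delta(uint64_t later, uint64_t earlier)
{
	return later - earlier;
}
```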