iproute2

Commit Graph

Author	SHA1	Message	Date
Jamal Hadi Salim	fdf1bdd0f1	tc simple action update and breakage Brings it closer to more serious actions (adding branching and allowing for late binding) Unfortunately this breaks old syntax of the simple action. But because simple is a pedagogical example unlikely to be used in production environments (i.e its role is to serve as an example on how to write actions), then this is ok. New syntax for simple has new keyword "sdata". Example usage is: sudo tc actions add action simple sdata "foobar" index 1 or tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action simple sdata "foobar" Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>	2016-05-16 11:15:12 -07:00
Jamal Hadi Salim	43726b750a	tc: don't ignore ok as an action branch This is what used to happen before: tc filter add dev tap1 parent ffff: protocol 0xfefe prio 10 \ u32 match u32 0 0 flowid 1:16 \ action ife decode allow mark ok tc -s filter ls dev tap1 parent ffff: filter protocol [65278] pref 10 u32 filter protocol [65278] pref 10 u32 fh 800: ht divisor 1 filter protocol [65278] pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:16 match 00000000/00000000 at 0 action order 1: ife decode action pipe index 2 ref 1 bind 1 installed 4 sec used 4 sec type: 0x0 Metadata: allow mark Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 action order 2: gact action pass random type none pass val 0 index 1 ref 1 bind 1 installed 4 sec used 4 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Note the extra action added at the end.. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>	2016-05-16 11:13:58 -07:00
Jamal Hadi Salim	d3e511223f	tc: introduce IFE action This action allows for a sending side to encapsulate arbitrary metadata which is decapsulated by the receiving end. The sender runs in encoding mode and the receiver in decode mode. Both sender and receiver must specify the same ethertype. At some point we hope to have a registered ethertype and we'll then provide a default so the user doesn't have to specify it. For now we enforce the user specify it. Described in netdev01 paper: "Distributing Linux Traffic Control Classifier-Action Subsystem" Authors: Jamal Hadi Salim and Damascene M. Joachimpillai Also refer to IETF draft-ietf-forces-interfelfb-04.txt Let's show example usage where we encode icmp from a sender towards a receiver with an skbmark of 17; both sender and receiver use ethertype of 0xdead to interop. YYYY: Let's start with Receiver-side policy config: xxx: add an ingress qdisc sudo tc qdisc add dev $ETH ingress xxx: any packets with ethertype 0xdead will be subjected to ife decoding xxx: we then restart the classification so we can match on icmp at prio 3 sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \ u32 match u32 0 0 flowid 1:1 \ action ife decode reclassify xxx: on restarting the classification from above if it was an icmp xxx: packet, then match it here and continue to the next rule at prio 4 xxx: which will match based on skb mark of 17 sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \ u32 match ip protocol 1 0xff flowid 1:1 \ action continue xxx: match on skbmark of 0x11 (decimal 17) and accept sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \ handle 0x11 fw flowid 1:1 \ action ok xxx: Let's show the decoding policy sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead xxx: filter pref 2 u32 filter pref 2 u32 fh 800: ht divisor 1 filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 0 success 0) match 00000000/00000000 at 0 (success 0 ) action order 1: ife decode action reclassify type 0x0 allow mark allow prio index 11 ref 1 bind 1 installed 45 sec used 45 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 xxx: Observe that above lists all metadata it can decode. Typically these submodules will already be compiled into a monolithic kernel or loaded as modules YYYY: Let's show the sender side now .. xxx: Add an egress qdisc on the sender netdev sudo tc qdisc add dev $ETH root handle 1: prio xxx: xxx: Match all icmp packets to 192.168.122.237/24, then xxx: tag the packet with skb mark of decimal 17, then xxx: Encode it with: xxx: ethertype 0xdead xxx: add skb->mark to whitelist of metadata to send xxx: rewrite target dst MAC address to 02:15:15:15:15:15 xxx: sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 u32 \ match ip dst 192.168.122.237/24 \ match ip protocol 1 0xff \ flowid 1:2 \ action skbedit mark 17 \ action ife encode \ type 0xDEAD \ allow mark \ dst 02:15:15:15:15:15 xxx: Let's show the encoding policy filter pref 10 u32 filter pref 10 u32 fh 800: ht divisor 1 filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 (rule hit 118 success 0) match c0a87a00/ffffff00 at 16 (success 0 ) match 00010000/00ff0000 at 8 (success 0 ) action order 1: skbedit mark 17 index 11 ref 1 bind 1 installed 3 sec used 3 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 action order 2: ife encode action pipe type 0xDEAD allow mark dst 02:15:15:15:15:15 index 12 ref 1 bind 1 installed 3 sec used 3 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 xxx: Now test by sending ping from sender to destination Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>	2016-05-16 11:13:26 -07:00
Gustavo Zacarias	5c5a0f3df9	iproute2: tc_bpf.c: fix building with musl libc We need limits.h for PATH_MAX, fixes: tc_bpf.c: In function ‘bpf_map_selfcheck_pinned’: tc_bpf.c:222:12: error: ‘PATH_MAX’ undeclared (first use in this function) char file[PATH_MAX], buff[4096]; Signed-off-by: Gustavo Zacarias <gustavo@zacarias.com.ar> Acked-by: Daniel Borkmann <daniel@iogearbox.net>	2016-04-11 22:09:57 +00:00
Daniel Borkmann	4dd3f50af4	tc, bpf: add support for map pre/allocation Follow-up to kernel commit 6c9059817432 ("bpf: pre-allocate hash map elements"). Add flags support, so that we can pass in BPF_F_NO_PREALLOC flag for disallowing preallocation. Update examples accordingly and also remove the BPF_* map helper macros from them as they were not very useful. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2016-04-11 21:54:47 +00:00
Daniel Borkmann	afc1a2000b	tc, bpf: further improve error reporting Make it easier to spot issues when loading the object file fails. This includes reporting in what pinned object specs differ, better indication when we've reached instruction limits. Don't retry to load a non relo program once we failed with bpf(2), and report out of bounds tail call key. Also, add truncation of huge log outputs by default. Sometimes errors are quite easy to spot by only looking at the tail of the verifier log, but logs can get huge in size e.g. up to few MB (due to verifier checking all possible program paths). Thus, by default limit output to the last 4096 bytes and indicate that it's truncated. For the full log, the verbose option can be used. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2016-04-11 21:53:58 +00:00
Jiri Pirko	4952b45946	include: add linked list implementation from kernel Rename hlist.h to list.h while adding it to be aligned with kernel Signed-off-by: Jiri Pirko <jiri@mellanox.com>	2016-03-27 10:56:11 -07:00
Stephen Hemminger	e9e9365b56	scrub out whitespace issues Run script that removes trailing whitespace everywhere.	2016-03-27 10:50:14 -07:00
Phil Sutter	7faf1588a7	lib/utils: introduce rt_addr_n2a_rta() This simple macro eases calling rt_addr_n2a() with data from an rt_attr pointer. Signed-off-by: Phil Sutter <phil@nwl.cc>	2016-03-27 10:37:35 -07:00
Phil Sutter	2e96d2ccd0	utils: make rt_addr_n2a() non-reentrant by default There is only a single user who needs it to be reentrant (not really, but it's safer like this), add rt_addr_n2a_r() for it to use. Signed-off-by: Phil Sutter <phil@nwl.cc>	2016-03-27 10:37:34 -07:00
Phil Sutter	a418e45164	make format_host non-reentrant by default There are only three users which require it to be reentrant, the rest is fine without. Instead, provide a reentrant format_host_r() for users which need it. Signed-off-by: Phil Sutter <phil@nwl.cc>	2016-03-27 10:37:34 -07:00
Phil Sutter	51011dac36	tc/m_vlan.c: mention CONTROL option in help text Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2016-03-27 10:34:48 -07:00
Phil Sutter	1672f42195	tc: connmark, pedit: Rename BRANCH to CONTROL As Jamal suggested, BRANCH is the wrong name, as these keywords go beyond simple branch control - e.g. loops are possible, too. Therefore rename the non-terminal to CONTROL instead which should be more appropriate. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2016-03-27 10:34:42 -07:00
Phil Sutter	a33786b582	tc: pedit: Fix raw op The retain value was wrong for u16 and u8 types. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2016-03-27 10:34:36 -07:00
Phil Sutter	77bed404d0	tc: pedit: Fix for big-endian systems This was tricky to get right: - The 'stride' value used for 8 and 16 bit values must behave inverse to the value's intra word offset to work correctly with big-endian data act_pedit is editing. - The 'm' array's values are in host byte order, so they have to be converted as well (and the ordering was just inverse, for some reason). - The only sane way of getting this right is to manipulate value/mask in host byte order and convert the output. - TIPV4 (i.e. 'munge ip src/dst') had its own pitfall: the address parser converts to network byte order automatically. This patch fixes this by converting it back before calling pack_key32, which is a hack but at least does not require to implement a completely separate code flow. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2016-03-27 10:34:33 -07:00
Phil Sutter	952f89deba	tc/p_ip.c: Minor coding style cleanup Break overlong function definitions and remove one extraneous whitespace. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2016-03-27 10:34:22 -07:00
Stephen Hemminger	32a121cba2	tc: code cleanup Use checkpatch to fix whitespace and other style issues.	2016-03-21 11:48:36 -07:00
Luca Lemmo	4733b18a5e	tc: q_{codel,fq_codel}: add missing space in help text Signed-off-by: Luca Lemmo <luca@linux.com>	2016-03-21 11:42:13 -07:00
Luca Lemmo	725f2a872d	tc: f_u32: trivial coding style cleanups Signed-off-by: Luca Lemmo <luca@linux.com>	2016-03-21 11:42:12 -07:00
Luca Lemmo	dd0c8d193f	tc: f_u32: add missing spaces around operators Signed-off-by: Luca Lemmo <luca@linux.com>	2016-03-21 11:42:12 -07:00
Phil Sutter	338b003bcc	tc: pedit: Fix retain value for ihl adjustments Since the IP Header Length field is just half a byte, adjust retain to only match these bits so the Version field is not overwritten by accident. The whole concept is actually broken due to dependency on endianness which pedit ignores. Signed-off-by: Phil Sutter <phil@nwl.cc>	2016-03-06 12:53:11 -08:00
Phil Sutter	f440e9d8c2	tc: pedit: Fix parse_cmd() This was horribly broken: * pack_key8() and pack_key16() ... * missed to invert retain value when applying it to the mask, * did not sanitize val by ANDing it with retain, * and ignored the mask which is necessary for 'invert' command. * pack_key16() did not convert mask to network byte order. * Changing the retain value for 'invert' or 'retain' operation seems just plain wrong. * While here, also got rid of unnecessary offset sanitization in pack_key32(). * Simplify code a bit by always assigning the local mask variable to tkey->mask before calling any of the pack_key*() variants. Signed-off-by: Phil Sutter <phil@nwl.cc>	2016-03-06 12:53:11 -08:00
Phil Sutter	ec0ceeec49	tc: pedit: Fix layered op parsing After lookup of the layered op submodule, pedit would pass argv and argc including the layered op identifier at first position which confused the submodule parser. Fix this by calling NEXT_ARG() before calling the parse_peopt() callback. Signed-off-by: Phil Sutter <phil@nwl.cc>	2016-03-06 12:53:11 -08:00
Phil Sutter	c024acc641	tc: pedit: document branch control in help output This seems to have been a hidden feature, though it's very useful and necessary at least when combining multiple pedit actions. Signed-off-by: Phil Sutter <phil@nwl.cc>	2016-03-04 15:27:52 -08:00
Dmitrii Shcherbakov	467f9fce60	htb: rename b4 buffer to b3 to make its name more consistent b3 buffer has been deleted previously so b2 is followed by b4 which is not consistent. Signed-off-by: Dmitrii Shcherbakov <fw.dmitrii@yandex.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Phil Sutter <phil@nwl.cc>	2016-02-17 17:50:14 -08:00
Dmitrii Shcherbakov	1aea7fea26	htb: remove printing of a deprecated overhead value Remove printing according to the previously used encoding of mpu and overhead values within the tc_ratespec's mpu field. This encoding is no longer being used as a separate 'overhead' field in the ratespec structure has been introduced. Signed-off-by: Dmitrii Shcherbakov <fw.dmitrii@yandex.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Phil Sutter <phil@nwl.cc>	2016-02-17 17:49:47 -08:00
Daniel Borkmann	5230a2ede0	tc, bpf: use bind/type macros from gelf Don't reimplement them and rather use the macros from the gelf header, that is, GELF_ST_BIND()/GELF_ST_TYPE(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2016-02-07 11:27:38 -08:00
Daniel Borkmann	a576c6b977	tc, bpf: give some more hints wrt false relos Provide some more hints to the user/developer when relos have been found that don't point to ld64 imm instruction. Ran couple of times into relos generated by clang [1], where the compiler tried to uninline inlined functions with eBPF and emitted BPF_JMP \| BPF_CALL opcodes. If this seems the case, give a hint that the user should do a work-around to use always_inline annotation. [1] https://llvm.org/bugs/show_bug.cgi?id=26243#c3 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2016-02-07 11:27:38 -08:00
Daniel Borkmann	f31645d138	tc, bpf: improve verifier logging With a bit larger, branchy eBPF programs f.e. already ~BPF_MAXINSNS/7 in size, it happens rather quickly that bpf(2) rejects also valid programs when only the verifier log buffer size we have in tc is too small. Change that, so by default we don't do any logging, and only in error case we retry with logging enabled. If we should fail providing a reasonable dump of the verifier analysis, retry few times with a larger log buffer so that we can at least give the user a chance to debug the program. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.r.fastabend@intel.com>	2016-02-07 11:27:38 -08:00
Nicolas Dichtel	67584e3ab2	tc: fix compilation with old gcc (< 4.6) (bis) Commit `8f80d450c3` ("tc: fix compilation with old gcc (< 4.6)") was reverted to ease the merge of the net-next branch. Here is the new version. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2016-02-05 11:46:18 +11:00
Daniel Borkmann	2486337aac	tc, bpf: make sure relo is in relation with map section Add a test that symbol from relocation entry is actually related to map section and bail out with an error message if it's not the case; in relation to [1]. [1] https://llvm.org/bugs/show_bug.cgi?id=26243 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>	2016-02-02 16:04:11 +11:00
Stephen Hemminger	62392ecbbb	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2	2016-02-02 15:57:23 +11:00
Daniel Borkmann	8187b01273	tc, bpf: more header checks on loading elf eBPF llvm backend can support different BPF formats, make sure the object we're trying to load matches with regards to endianness and while at it, also check for other attributes related to BPF ELFs. # llc --version LLVM (http://llvm.org/): LLVM version 3.8.0svn Optimized build. Built Jan 9 2016 (02:08:10). Default target: x86_64-unknown-linux-gnu Host CPU: ivybridge Registered Targets: bpf - BPF (host endian) bpfeb - BPF (big endian) bpfel - BPF (little endian) [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>	2016-01-18 11:41:27 -08:00
Daniel Borkmann	cce3d4664c	tc, bpf: check section names and type everywhere When extracting sections, we better check for name and type. Noticed that some llvm versions emit .strtab and .shstrtab (e.g. saw it on pre 3.7), while more recent ones only seem to emit .strtab. Thus, make sure we get the right sections. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>	2016-01-18 11:41:27 -08:00
Daniel Borkmann	8f9afdd531	tc, clsact: add clsact frontend Add the tc part for the kernel commit 1f211a1b929c ("net, sched: add clsact qdisc"). Quoting example usage from that commit description: Example, adding qdisc: # tc qdisc add dev foo clsact # tc qdisc show dev foo qdisc mq 0: root qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc clsact ffff: parent ffff:fff1 Adding filters (deleting, etc works analogous by specifying ingress/egress): # tc filter add dev foo ingress bpf da obj bar.o sec ingress # tc filter add dev foo egress bpf da obj bar.o sec egress # tc filter show dev foo ingress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action # tc filter show dev foo egress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action The ingress parent alias can also be used with ingress qdisc. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2016-01-18 11:41:27 -08:00
Daniel Borkmann	0d45c4b420	tc, ingress: clean up ingress handling a bit Clean it up a bit, we can also get rid of some ugly ifdefs as in our case TC_H_INGRESS is always defined. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2016-01-18 11:41:27 -08:00
Stephen Hemminger	2505780c20	Merge branch 'net-next'	2016-01-18 09:37:45 -08:00
Stephen Hemminger	bc223ab861	Revert "tc: fix compilation with old gcc (< 4.6)" This reverts commit `8f80d450c3`.	2016-01-18 09:37:38 -08:00
Jamal Hadi Salim	488b41d020	tc: flower no need to specify the ethertype since all tc classifiers are required to specify ethertype as part of grammar By not allowing eth_type to be specified we remove contradiction for example when a user specifies: tc filter add ... priority xxx protocol ip flower eth_type ipv6 This patch removes that contradiction Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>	2016-01-11 08:24:01 -08:00
Julien Floret	8f80d450c3	tc: fix compilation with old gcc (< 4.6) gcc < 4.6 does not handle C11 syntax for the static initialization of anonymous struct/union, hence the following error: tc_bpf.c:260: error: unknown field map_type specified in initializer Signed-off-by: Julien Floret <julien.floret@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>	2016-01-11 08:23:36 -08:00
Phil Sutter	de7db5d857	tc: m_connmark: Fix help text When specifying a conntrack zone, the 'zone' keyword has to be used before the actual zone index. Signed-off-by: Phil Sutter <phil@nwl.cc>	2016-01-07 10:35:08 -08:00
Stephen Hemminger	e49b51d663	monitor: fix file handle leak In some cases passing file to monitor left file open. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2015-12-30 17:26:38 -08:00
Daniel Borkmann	fd7f9c7fd1	bpf: minor fix in api and bpf_dump_error() usage Fix a whitespace in bpf_dump_error() usage, and also a missing closing bracket in ntohl() macro for eBPF programs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2015-12-17 17:22:25 -08:00
Daniel Borkmann	91d88eeb10	{f,m}_bpf: allow updates on program arrays Since we have all infrastructure in place now, allow atomic live updates on program arrays. This can be very useful e.g. in case programs that are being tail-called need to be replaced, f.e. when classifier functionality needs to be changed, new protocols added/removed during runtime, etc. Thus, provide a way for in-place code updates, minimal example: Given is an object file cls.o that contains the entry point in section 'classifier', has a globally pinned program array 'jmp' with 2 slots and id of 0, and two tail called programs under section '0/0' (prog array key 0) and '0/1' (prog array key 1), the section encoding for the loader is <id/key>. Adding the filter loads everything into cls_bpf: tc filter add dev foo parent ffff: bpf da obj cls.o Now, the program under section '0/1' needs to be replaced with an updated version that resides in the same section (also full path to tc's subfolder of the mount point can be passed, e.g. /sys/fs/bpf/tc/globals/jmp): tc exec bpf graft m:globals/jmp obj cls.o sec 0/1 In case the program resides under a different section 'foo', it can also be injected into the program array like: tc exec bpf graft m:globals/jmp key 1 obj cls.o sec foo If the new tail called classifier program is already available as a pinned object somewhere (here: /sys/fs/bpf/tc/progs/parser), it can be injected into the prog array like: tc exec bpf graft m:globals/jmp key 1 fd m:progs/parser In the kernel, the program on key 1 is being atomically replaced and the old one's refcount dropped. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>	2015-11-29 11:55:16 -08:00
Daniel Borkmann	f6793eec46	{f, m}_bpf: allow for user-defined object pinnings The recently introduced object pinning can be further extended in order to allow sharing maps beyond tc namespace. F.e. maps that are being pinned from tracing side, can be accessed through this facility as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>	2015-11-29 11:55:16 -08:00
Daniel Borkmann	9e607f2e72	{f, m}_bpf: check map attributes when fetching as pinned Make use of the new show_fdinfo() facility and verify that when a pinned map is being fetched that its basic attributes are the same as the map we declared from the ELF file. I.e. when placed into the globalns, collisions could occur. In such a case warn the user and bail out. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>	2015-11-29 11:55:16 -08:00
Daniel Borkmann	910b543dcc	{f,m}_bpf: make tail calls working Now that we have the possibility of sharing maps, it's time we get the ELF loader fully working with regards to tail calls. Since program array maps are pinned, we can keep them finally alive. I've noticed two bugs that are being fixed in bpf_fill_prog_arrays() with this patch. Example code comes as follow-up. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>	2015-11-29 11:55:16 -08:00
Daniel Borkmann	32e93fb7f6	{f,m}_bpf: allow for sharing maps This larger work addresses one of the bigger remaining issues on tc's eBPF frontend, that is, to allow for persistent file descriptors. Whenever tc parses the ELF object, extracts and loads maps into the kernel, these file descriptors will be out of reach after the tc instance exits. Meaning, for simple (unnested) programs which contain one or multiple maps, the kernel holds a reference, and they will live on inside the kernel until the program holding them is unloaded, but they will be out of reach for user space, even worse with (also multiple nested) tail calls. For this issue, we introduced the concept of an agent that can receive the set of file descriptors from the tc instance creating them, in order to be able to further inspect/update map data for a specific use case. However, while that is more tied towards specific applications, it still doesn't easily allow for sharing maps across multiple tc instances and would require a daemon to be running in the background. F.e. when a map should be shared by two eBPF programs, one attached to ingress, one to egress, this currently doesn't work with the tc frontend. This work solves exactly that, i.e. if requested, maps can now be _arbitrarily_ shared between object files (PIN_GLOBAL_NS) or within a single object (but various program sections, PIN_OBJECT_NS) without "losing" the file descriptor set. To make that happen, we use eBPF object pinning introduced in kernel commit b2197755b263 ("bpf: add support for persistent maps/progs") for exactly this purpose. The shipped examples/bpf/bpf_shared.c code from this patch can be easily applied, for instance, as: - classifier-classifier shared: tc filter add dev foo parent 1: bpf obj shared.o sec egress tc filter add dev foo parent ffff: bpf obj shared.o sec ingress - classifier-action shared (here: late binding to a dummy classifier): tc actions add action bpf obj shared.o sec egress pass index 42 tc filter add dev foo parent ffff: bpf obj shared.o sec ingress tc filter add dev foo parent 1: bpf bytecode '1,6 0 0 4294967295,' \ action bpf index 42 The toy example increments a shared counter on egress and dumps its value on ingress (if no sharing (PIN_NONE) would have been chosen, map value is 0, of course, due to the two map instances being created): [...] <idle>-0 [002] ..s. 38264.788234: : map val: 4 <idle>-0 [002] ..s. 38264.788919: : map val: 4 <idle>-0 [002] ..s. 38264.789599: : map val: 5 [...] ... thus if both sections reference the pinned map(s) in question, tc will take care of fetching the appropriate file descriptor. The patch has been tested extensively on both, classifier and action sides. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2015-11-23 16:10:44 -08:00
Stephen Hemminger	037660b351	qfq: fix parse_opt dead code Fix Coverity warning from dead code.	2015-10-27 15:46:20 +09:00
Stephen Hemminger	86c392f958	Merge branch 'master' into net-next	2015-10-23 15:46:08 -07:00
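
Reading note on the IFE commit (d3e511223f): its sender-side dump shows the u32 keys `match c0a87a00/ffffff00 at 16` and `match 00010000/00ff0000 at 8` for the selectors `match ip dst 192.168.122.237/24` and `match ip protocol 1 0xff`. The sketch below is only an illustration of how those hex value/mask pairs follow from the IPv4 header layout (destination address at byte offset 16, protocol at byte 9, i.e. inside the 32-bit word at offset 8). It is not code from iproute2, and the function names are hypothetical.

```python
import ipaddress


def u32_key_for_ipv4_dst(cidr):
    """Return ("value/mask", offset) in the style of tc's u32 dump for 'match ip dst CIDR'.

    Hypothetical helper for illustration; not part of iproute2.
    """
    # strict=False lets a host address such as 192.168.122.237/24 be
    # reduced to its network address, 192.168.122.0.
    net = ipaddress.ip_network(cidr, strict=False)
    value = int(net.network_address)
    mask = int(net.netmask)
    # The IPv4 destination address starts at byte offset 16 of the header.
    return "%08x/%08x" % (value, mask), 16


def u32_key_for_ip_protocol(proto, mask=0xFF):
    """Return ("value/mask", offset) for 'match ip protocol PROTO MASK'.

    Hypothetical helper for illustration; not part of iproute2.
    """
    # The word at offset 8 holds TTL (byte 8), protocol (byte 9) and the
    # header checksum (bytes 10-11), so the protocol byte sits in bits 16..23.
    value = (proto & mask) << 16
    return "%08x/%08x" % (value, mask << 16), 8


if __name__ == "__main__":
    print(u32_key_for_ipv4_dst("192.168.122.237/24"))
    print(u32_key_for_ip_protocol(1))
```

Because the /24 prefix masks out the host part, the dumped key matches every destination in 192.168.122.0/24, not only 192.168.122.237.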


534 Commits