Discussion:
[lxc-users] lxc container network occasional problem with bridge network on bonding device
toshinao
2018-09-17 04:02:08 UTC
Hi.

I am experiencing an occasional network problem with containers running on Ubuntu Server 18.04.1. The containers
can always communicate with the host's IP, and they can sometimes communicate with other hosts, but they
occasionally lose connectivity. When the problem occurs, pings from a container to external hosts do not get
through at all; very rarely, connectivity recovers on its own, for example several hours later. Disconnection
happens far more readily than recovery.

The host network is configured with netplan in the following topology.

           +-eno1-< <--lan_cable--> >-+
br0--bond0-+                          +-- Cisco 3650
           +-eno2-< <--lan_cable--> >-+

The bonding mode is balance-a1b. I also found that if one of the LAN cables is physically disconnected,
the problem never happens.
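
For reference, the bonding mode the kernel actually applied can be confirmed through the bonding proc
interface (bond0 is the bond device from the configuration below); with balance-alb it should report
adaptive load balancing:

host# grep -i "bonding mode" /proc/net/bonding/bond0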

Using iptraf-ng, I watched the bridge device (br0, shown below) as well as the slave devices. When the
containers cannot communicate, no ping packets are detected on any of these devices even while a container
is pinging external hosts; when communication works, iptraf-ng does see the ping packets.
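
For anyone who wants to cross-check the same observation with tcpdump instead of iptraf-ng (a minimal
sketch; the interface names are the ones from the setup below):

host# tcpdump -ni br0 icmp      # watch ICMP on the bridge
host# tcpdump -ni bond0 icmp    # and on the bond underneath it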

I suspect this could be a low-level problem in the virtual networking. Are there any suggestions for
solving it?

Here are the details of the setup.

host's netplan setting

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: no
    eno2:
      dhcp4: no
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      parameters:
        mode: balanec-a1b
  bridges:
    br0:
      interfaces:
        - bond0
      addresses: [10.1.2.3/24]
      gateway4: 10.1.2.254
      dhcp4: no
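
In case it is useful as a cross-check, netplan itself can be asked to re-parse and trial-apply this file
(generic commands, nothing specific to this problem):

host# netplan generate    # parse the YAML and regenerate the systemd-networkd configuration
host# netplan try         # apply it, rolling back automatically unless the change is confirmed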

host network interface status

host# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 0b:25:b5:f2:e1:34 brd ff:ff:ff:ff:ff:ff
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 0b:25:b5:f2:e1:35 brd ff:ff:ff:ff:ff:ff
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 0a:1a:6c:85:ff:ed brd ff:ff:ff:ff:ff:ff
inet 10.1.2.3/24 brd 10.1.2.255 scope global br0
valid_lft forever preferred_lft forever
inet6 fe80::81a:6cff:fe85:ffed/64 scope link
valid_lft forever preferred_lft forever
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP group default qlen 1000
link/ether 0a:54:4b:f2:d7:10 brd ff:ff:ff:ff:ff:ff
7: ***@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP group default qlen 1000
link/ether fe:ca:07:3e:2b:2d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::fcca:7ff:fe3e:2b2d/64 scope link
valid_lft forever preferred_lft forever
9: ***@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP group default qlen 1000
link/ether fe:85:f0:ef:78:b2 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::fc85:f0ff:feef:78b2/64 scope link
valid_lft forever preferred_lft forever

container's network interface status

***@bionic0:~# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
6: ***@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:16:3e:cb:ef:ce brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.1.2.20/24 brd 10.1.2.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::216:3eff:fecb:efce/64 scope link
valid_lft forever preferred_lft forever
Tomasz Chmielewski
2018-09-17 08:09:57 UTC
FYI, I've seen a similar phenomenon when launching new containers.

Sometimes, connectivity freezes for several seconds afterwards.

What usually "helps" is sending an arping to the gateway IP from an
affected container.
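
Something along these lines, assuming the iputils arping and the gateway/interface from the original post:

container# arping -c 3 -I eth0 10.1.2.254

Presumably that works because the container itself sources a few ARP frames, refreshing the MAC learning
along the bond/switch path.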


Tomasz
Andrey Repin
2018-09-17 17:47:06 UTC
Greetings, toshinao!
Post by toshinao
Hi.
I am experiencing an occasional network problem with containers running on Ubuntu Server 18.04.1. The
containers can always communicate with the host's IP, and they can sometimes communicate with other
hosts, but they occasionally lose connectivity. When the problem occurs, pings from a container to
external hosts do not get through at all; very rarely, connectivity recovers on its own, for example
several hours later. Disconnection happens far more readily than recovery.
The host network is configured with netplan in the following topology.
           +-eno1-< <--lan_cable--> >-+
br0--bond0-+                          +-- Cisco 3650
           +-eno2-< <--lan_cable--> >-+
The bonding mode is balance-a1b.
ALB
Adaptive Load Balancing
Post by toshinao
I also found that if one of the LAN cables is physically disconnected, the problem never happens.
How do you connect containers to the bridge?
Post by toshinao
Using iptraf-ng, I watched the bridge device (br0, shown below) as well as the slave devices. When the
containers cannot communicate, no ping packets are detected on any of these devices even while a
container is pinging external hosts; when communication works, iptraf-ng does see the ping packets.
I suspect this could be a low-level problem in the virtual networking. Are there any suggestions for
solving it?
Can containers talk to each other when this happens?
Can the host talk to the world at the same time?
Post by toshinao
Here are the details of the setup.
host's netplan setting
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: no
    eno2:
      dhcp4: no
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      parameters:
        mode: balanec-a1b
And netplan did not yell at you?
--
With best regards,
Andrey Repin
Monday, September 17, 2018 20:41:41

Sorry for my terrible English...
toshinao
2018-09-19 15:20:28 UTC
Hi, Andrey. Thanks for the reply.

It took some time to reproduce the problem, but I have now found a reliable way to trigger it.
Post by Andrey Repin
How do you connect containers to the bridge?
Here is what the LXD default profile shows.

# lxc profile show default
config: {}
description: Default LXD profile
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
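
For completeness, the expanded per-container view of the same network config can be dumped with a
standard LXD command (bionic0 is one of the affected containers):

# lxc config show bionic0 --expanded
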
Post by Andrey Repin
Can containers talk to each other when this happens?
Yes.
Post by Andrey Repin
Can the host talk to the world at the same time?
Yes.

I am not attaching the ping output since it is trivial.
Post by Andrey Repin
And netplan did not yell at you?
I identified the time when the problem happened and inspected /var/log/syslog around that time;
there was nothing.

I found a way to reproduce the problem quickly. The procedure is:

(1) connect both of the LAN cables
(2) stop all containers (I am not sure whether “all” is necessary)
(3) start some of the containers
(4) the problem occurs on the started containers either immediately or several minutes after the restart
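
In terms of commands, this is roughly the following (a sketch; bionic0 stands for one of the affected
containers, and any external host works as the ping target):

host# lxc stop --all                   # step (2): stop all containers
host# lxc start bionic0                # step (3): start one (or some) of them again
container# ping <some-external-host>   # step (4): the pings stop getting through just after, or a few minutes later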

Regards,
_______________________________________________
lxc-users mailing list
http://lists.linuxcontainers.org/listinfo/lxc-users