[lxc-users] Does cpu cgroup has been enabled in lxc/lxd

Post by kemi
Hi, Everyone
I am a newcomer to the LXC/LXD community, and want to run a container on a limited CPU set.
a) lxd init
b) lxc launch ubuntu:18.04 first
c) lxc stop first
d) lxc config set first limits.cpu 0 // set container running on CPU 0

I'm not sure, but I believe "0" here means all cpu, and not pin to cpu 0?

Try changing this to "1", "0-0", and "1-2". Observe the difference.
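
A minimal sketch of the difference between those three values, assuming the documented LXD behaviour that a bare number in limits.cpu is a CPU count while a range or comma-separated list is a set of CPU indices to pin to. The function name classify_cpu_limit is hypothetical and is not part of LXD; the special meaning of "0" discussed above is deliberately not modelled.

```python
import re


def classify_cpu_limit(value):
    """Classify a limits.cpu string as a CPU count or a pinned CPU set.

    Assumption: a bare integer is a count, anything with '-' or ',' is a
    list of CPU indices. This mirrors the LXD documentation as I read it.
    """
    value = value.strip()
    if re.fullmatch(r"[0-9]+", value):
        # "1" means "one CPU, wherever LXD places it", not "CPU index 1".
        return ("count", int(value))
    cpus = set()
    for part in value.split(","):
        low, _, high = part.partition("-")
        first = int(low)
        last = int(high) if high else first
        if last < first:
            raise ValueError(f"descending range: {part!r}")
        cpus.update(range(first, last + 1))
    # "0-0" pins to CPU 0 only; "1-2" pins to CPUs 1 and 2.
    return ("pin", sorted(cpus))


for example in ("1", "0-0", "1-2"):
    print(example, "->", classify_cpu_limit(example))
```

So to pin the container to CPU 0, the range form "0-0" is the one to use.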

Post by kemi
e) lxc start first
f) lxc exec first -- bash
g) nproc // the expected result would be 1, however, it still equals
to cpu number of host
h) ls /sys/devices/system/cpu // the expected result should only
include cpu0 directory, however, it's not

g) and h) read files from /proc, not cgroup. You need lxcfs. You should
already have that on ubuntu though.

Post by kemi
So, it seems that the CPU cgroup has not been enabled in LXC/LXD, right? The
version of lxc is 2.0.11 on Ubuntu 16.04.
Can anyone help me with that? thx very much.

If this is a new install, I highly suggest you just switch to ubuntu 18.04
+ lxd-3 host.
If you simply want to have "correct" /proc entries, make sure lxcfs is
installed, and then restart lxd (if needed).

--
Fajar

kemi

2018-11-01 05:35:04 UTC

Hi, Fajar
thx for your reply.

I'm not sure, but I believe "0" here means all cpu, and not pin to cpu 0?

Hmm, I expected to pin on CPU 0. Seems I misunderstood the *limit* configuration here:)
I will try use another number as you suggested.

Post by Fajar A. Nugraha
Try changing this to "1", "0-0", and "1-2". Observe the difference.

g) and h) read files from /proc, not cgroup. You need lxcfs. You should
already have that on ubuntu though.

Hmm, I will take a look at it. thx for suggestion

Post by kemi
So, it seems that CPU cgroup has not been enabled in LXC/LXD, right? The
version of lxc is 2.0.11 on Ubuntu 16.04.
Anyone can help me on that, thx very much.

If this is a new install, I highly suggest you just switch to ubuntu 18.04
+ lxd-3 host.

LXD offers simple online testing using lxd-3.6. May I use it for testing purposes?
https://linuxcontainers.org/lxd/try-it/

Post by Fajar A. Nugraha
If you simply want to have "correct" /proc entries, make sure lxcfs is
installed, and then restart lxd (if needed).

I not only want to get the correct number of CPUs in the container, but also hope to have an
independent, virtualized procfs and sysfs in the container.

Post by Fajar A. Nugraha
_______________________________________________
lxc-users mailing list
http://lists.linuxcontainers.org/listinfo/lxc-users

kemi

2018-11-01 06:38:38 UTC

Post by kemi
Hi, Fajar
thx for your reply.

Post by kemi
Hi, Everyone
I am new comer of LXC/LXD community, and want to run a container on a
limited cpu set.
a) lxd init
b) lxc launch Ubuntu:18.04 first
c) lxc stop first
d) lxc config set first limits.cpu 0 // set container running on CPU 0

I'm not sure, but I believe "0" here means all cpu, and not pin to cpu 0?

Hmm, I expected to pin on CPU 0. Seems I misunderstood the *limit* configuration here:)
I will try use another number as you suggested.

Post by Fajar A. Nugraha
Try changing this to "1", "0-0", and "1-2". Observe the difference.

have tried. `nproc` works well.

g) and h) read files from /proc, not cgroup. You need lxcfs. You should
already have that on ubuntu though.

/proc/cpuinfo also matches the expected result.
However, it seems that sysfs in the container is still shared with the host /sys file system.
Right?

Fajar A. Nugraha

2018-11-01 06:53:31 UTC

Post by Fajar A. Nugraha
g) and h) read files from /proc, not cgroup. You need lxcfs. You should
already have that on ubuntu though.

/proc/cpuinfo also matches the expected result.
However, it seems that sysfs in container still shares with host /sys file system.
Right?

Correct. See https://linuxcontainers.org/lxcfs/introduction/

--
Fajar

kemi

2018-11-01 07:16:17 UTC

Post by Fajar A. Nugraha
g) and h) read files from /proc, not cgroup. You need lxcfs. You should
already have that on ubuntu though.

/proc/cpuinfo also matches the expected result.
However, it seems that sysfs in container still shares with host /sys file system.
Right?

Correct. See https://linuxcontainers.org/lxcfs/introduction/

OK, then I have a question on scalability and security issues on running multiple containers.

Background: Our customers hope to run hundreds or even thousands of containers in their production environment.

Sharing sysfs of containers with host sysfs in lxc/lxd may have:
a) security issue.
If a malicious program in a container changes a sensitive file in /sys,
e.g. reduces the CPU frequency, does it really work? Does it affect other running containers?

b) Scalability issue.
E.g. when launching an Ubuntu OS (not kernel) or Android OS in a container, it usually uses udev/ueventd
to manage its devices. This device manager daemon will read or write the uevent file in /sys, and the kernel
then broadcasts a uevent to all the listeners (udev daemons) via netlink. If there are already hundreds
of containers in the system, all of the udev daemons need to deal with it, which would lead to the long boot
latency that we have observed in docker.

Any way to fix that?

Fajar A. Nugraha

2018-11-01 07:30:54 UTC

Post by Fajar A. Nugraha
g) and h) read files from /proc, not cgroup. You need lxcfs. You should
already have that on ubuntu though.

/proc/cpuinfo also matches the expected result.
However, it seems that sysfs in container still shares with host /sys file system.
Right?

Correct. See https://linuxcontainers.org/lxcfs/introduction/

OK, then I have a question on scalability and security issues on running
multiple containers.
Background: Our customers hope to run hundreds or even thousands of
containers in their production environment.
a) security issue.
If a malicious program in a container changes a sensitive file in /sys,
e.g. reduce CPU frequency, does it really works? Does it affect other running containers?

Why don't you try it and see :)

Even privileged container should get something like this

# echo 1000000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq
-su: /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq: Read-only
file system
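
To check this without attempting a write, one option is to look at the mount options for /sys from inside the container. The following is a small sketch that parses text in the /proc/mounts format; the function names are hypothetical, and the sample line in the usage note is made up for illustration, not captured from a real container.

```python
def mount_options(mounts_text, mount_point="/sys"):
    """Return the option list of the last mount on mount_point, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fstype, options, ...
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later mount shadows earlier ones
    return options


def is_read_only(options):
    """True if the option list contains 'ro'."""
    return options is not None and "ro" in options


def sys_is_read_only():
    """Read the real /proc/mounts (Linux only) and report on /sys."""
    with open("/proc/mounts") as handle:
        return is_read_only(mount_options(handle.read()))
```

For example, feeding it the line "sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0" reports a read-only mount, which is consistent with the error shown above.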

There were some known security issues with /sys in the past (not cpufreq
though), but even back then it should be non issue for the default lxd
containers, which are unprivileged.

b) Scalability issue.

Post by kemi
E.g. when launching an Ubuntu OS (not kernel) or Android OS in a container, it usually uses udev/ueventd
to manage its devices. This device manager daemon will read or write the uevent file in /sys, and the kernel
then broadcasts a uevent to all the listeners (udev daemons) via netlink. If there are already hundreds
of containers in the system, all of the udev daemons need to deal with it, which would lead to the long boot
latency that we have observed in docker.

LXD containers don't use udev.

Post by kemi
Anyway to fix that?

Try it, and if you find anything wrong, ask.

--
Fajar

kemi

2018-11-01 08:04:00 UTC

Post by Fajar A. Nugraha
g) and h) read files from /proc, not cgroup. You need lxcfs. You should
already have that on ubuntu though.

/proc/cpuinfo also matches the expected result.
However, it seems that sysfs in container still shares with host /sys file system.
Right?

Correct. See https://linuxcontainers.org/lxcfs/introduction/

OK, then I have a question on scalability and security issues on running
multiple containers.
Background: Our customers hope to run hundreds or even thousands of
containers in their production environment.
a) security issue.
If a malicious program in a container changes a sensitive file in /sys,
e.g. reduce CPU frequency, does it really works? Does it affect other running containers?

Why don't you try it and see :)

this is just an example I used to help describe my question clearly.
I believe there are some other security issues in lxc/lxd due to shared sysfs.

Post by Fajar A. Nugraha
Even privileged container should get something like this
# echo 1000000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq
-su: /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq: Read-only
file system
There were some known security issues with /sys in the past (not cpufreq
though), but even back then it should be non issue for the default lxd
containers, which are unprivileged.

Yes. Using unprivileged container is a workaround, not very good though.

Post by Fajar A. Nugraha
b) Scalability issue.

LXD containers don't use udev.

I have no idea how LXD containers work yet.
Maybe udev is disabled or some other mechanism may be used to manage device.

Post by kemi
Anyway to fix that?

Try it, and if you find anything wrong, ask.

The reason why I have not tried it is that there is no Android image available on the existing
image servers for LXD containers. Do you know something about that?
So, I have to make an Android image first. Anyway, I will try it next.
Thx for your answers again.

Fajar A. Nugraha

2018-11-01 08:30:33 UTC

Post by kemi
The reason why I have not tried it is there is no available android image
provided on existed
images server for LXD container. Do you know something about that?

I don't believe anybody has successfully run android in lxd yet (success as
in "you can use vnc or similar to view the screen"). Perhaps porting a
working docker setup is a good start (I assume this is not you or your
team, apologies if you're already familiar with it) :
https://github.com/butomo1989/docker-android

There were some hints on debugging android on lxd in the list archive:
https://discuss.linuxcontainers.org/t/how-to-debug-container-boot-on-lxd/849
, which might be relevant to what you want.

--
Fajar

Jäkel, Guido

2018-11-01 10:00:17 UTC

Post by kemi
I have no idea on how LXD container works now.
Maybe udev is disabled or some other mechanism may be used to manage device.

LXC/LXD is not limited to this, but probably the most common model for containers is that they should form a turnkey environment for a Linux userland to run applications, where all resources are already provided at a high level: access to the file system may be provided by a bind mount, and the network connection is provided by a virtual Ethernet device. This is obviously a good way to keep a concrete container independent of the concrete hosting platform through separation of concerns.

But if you want or need to, you may also put the border at lower levels. You may pass in access to a mount point and have the container be responsible for driving a filesystem on it. Or pass in access to a NIC of the host and have the container configure it. And even lower, you may give access to devices and let the container bind drivers to them. The lower this border, the more complex (and problematic in terms of security) and the more dependent on the host this will become. And you may run into current limits of "LXC" or of the kernel features it is based on.

I have no deeper experience with Android at all. That said, I wonder if it's possible to provide an "Android device" in a container in a general manner. I could imagine that it's possible to run a kind of "Android emulator". A quick Google search points me to https://medium.com/@AndreSand/android-emulator-on-docker-container-f20c49b129ef . It describes running an Android emulator on Docker, and this should show the way to do it with LXC.

Andrey Repin

2018-11-01 16:20:40 UTC

Greetings, kemi!

Post by kemi
Yes. Using unprivileged container is a workaround, not very good though.

You're confusing containerization with virtualization.
Container not supposed to have direct access to devices on the host.
It provides a ready system for **userspace** applications to run.
Said that, what kind of hundreds of containers your customer wants to run
which require access to host hardware?

--
With best regards,
Andrey Repin
Thursday, November 1, 2018 19:18:15

Sorry for my terrible english...

kemi

2018-11-02 01:44:22 UTC

Post by Andrey Repin
Greetings, kemi!

Post by kemi
Yes. Using unprivileged container is a workaround, not very good though.

You're confusing containerization with virtualization.

Maybe. I am not familiar with containerization.

Post by Andrey Repin
Container not supposed to have direct access to devices on the host.
It provides a ready system for **userspace** applications to run.
Said that, what kind of hundreds of containers your customer wants to run
which require access to host hardware?

thx for your question.
In our case, our customers want to run Android games within containers in the cloud.
There are two problems we know of.
The first one occurs during Android OS boot: the coldboot of Android needs to
write to uevent files in /sys, which triggers a uevent broadcast to all of the listeners
(udev daemons) in user space (this uevent is sent from the kernel via netlink).
As the number of containers increases (200+), we found the boot latency
reaches 1~2 mins, and the latency would be intolerable when the number reaches 500.

The second one occurs when an app in a container begins to run: it reads the
/sys/devices/system/cpu/online file to get the number of available CPUs before creating
threads accordingly. The problem is that sysfs is now shared with the host, so
the app gets a CPU count equal to the host's thread count even if the CPU count
of the container is limited.
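
To illustrate the second problem, here is a minimal sketch comparing the two ways an app might size its thread pool. It assumes the online file uses the kernel's usual cpu-list format (e.g. "0-3,8") and that the scheduler affinity mask reflects the container's cpuset, which is my understanding but is not confirmed in this thread. The function names are hypothetical.

```python
import os


def count_cpu_list(text):
    """Count CPUs in a kernel cpu-list string such as '0-3,8'."""
    total = 0
    for part in text.strip().split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        first = int(low)
        last = int(high) if high else first
        total += last - first + 1
    return total


def cpus_from_sysfs(path="/sys/devices/system/cpu/online"):
    """What an app sees when it reads sysfs (the host's view if /sys is shared)."""
    with open(path) as handle:
        return count_cpu_list(handle.read())


def cpus_from_affinity():
    """What the scheduler allows for this process (Linux only)."""
    return len(os.sched_getaffinity(0))
```

If the app can be changed, sizing the thread pool from cpus_from_affinity() rather than cpus_from_sysfs() would sidestep the shared sysfs entirely.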

Fajar A. Nugraha

2018-11-02 12:05:44 UTC

Post by kemi
thx for your question.
In our case, our customers want to run android games within containers on cloud.

It might be possible for you to adjust https://anbox.io/ to run on lxd
instead of lxc. YMMV.

There are two problems we have known.

Post by kemi
The first one occurs during Android OS boot, the coldboot of Android requires to
write uevent file in /sys, this will trigger an uevent broadcast to all of listeners
(udev daemons) in user space (this uevent is sent from kernel via netlink),
with the increase of container number (200+), we found the boot latency has
reached 1~2 mins. And latency would be intolerable when the number reaches 500.

I don't see udev running inside its lxc container, so perhaps they've
managed to solve that issue

The second one occurs when an app in container begins to run, it will read

Post by kemi
/sys/devices/system/cpu/online file to get available cpu number before creating
threads accordingly. Then the problem is, sysfs now is shared with host,
it will get the CPU number equals to host thread number even if the cpu number
of container is limited.

If it simply reads the file, you could simply mount a text file on it.
Similar to what lxcfs does, but simpler.
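
A rough sketch of producing such a text file, assuming the replacement only needs to hold the container's CPUs in the same cpu-list format the kernel uses for the online file. The helper names are hypothetical, and how the file is then mounted over /sys/devices/system/cpu/online inside the container is left out here because it depends on the container configuration.

```python
def format_cpu_list(cpus):
    """Render CPU indices in cpu-list format, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    ordered = sorted(set(cpus))
    parts = []
    start = 0
    while start < len(ordered):
        end = start
        # Extend the run while the next index is consecutive.
        while end + 1 < len(ordered) and ordered[end + 1] == ordered[end] + 1:
            end += 1
        if start == end:
            parts.append(str(ordered[start]))
        else:
            parts.append(f"{ordered[start]}-{ordered[end]}")
        start = end + 1
    return ",".join(parts)


def write_fake_online(path, cpus):
    """Write a stand-in for the 'online' file; path is chosen by the caller."""
    with open(path, "w") as handle:
        handle.write(format_cpu_list(cpus) + "\n")
```

This only covers the one file; any other /sys entry the app reads would need the same treatment.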

--
Fajar

kemi

2018-11-05 05:12:35 UTC

Post by kemi
thx for your question.
In our case, our customers want to run android games within containers on cloud.

It might be possible for you to adjust https://anbox.io/ to run on lxd
instead of lxc. YMMV.

anbox provides a GUI interface to run android in container.
We don't need that GUI which leads to extra overhead. Also,
Anbox can't offer thousands of containers running in parallel.

Post by Fajar A. Nugraha
There are two problems we have known.

I don't see udev running inside its lxc container, so perhaps they've
managed to solve that issue

Post by Fajar A. Nugraha
The second one occurs when an app in container begins to run, it will read

If it simply reads the file, you could simply mount a text file on it.
Similar to what lxcfs does, but simpler.

Good suggestion. We are considering this workaround.
But it may not be a common solution, because no one knows which file in /sys
will be used by app in userspace.


Christian Brauner

2018-11-05 16:32:07 UTC

Post by kemi
thx for your question.
In our case, our customers want to run android games within containers on cloud.

It might be possible for you to adjust https://anbox.io/ to run on lxd
instead of lxc. YMMV.

anbox provides a GUI interface to run android in container.
We don't need that GUI which leads to extra overhead. Also,
Anbox can't offer thousands of containers running in parallel.

Post by Fajar A. Nugraha
There are two problems we have known.

Post by kemi
The first one occurs during Android OS boot, the coldboot of Android requires to
write uevent file in /sys, this will trigger an uevent broadcast to all of listeners
(udev daemons) in user space (this uevent is sent from kernel via netlink),
with the increase of container number (200+), we found the boot latency has
reached 1~2 mins. And latency would be intolerable when the number reaches 500.

That is no longer true from kernels 4.17 onwards. I should really write a
blogpost about my patchset it seems. This keeps popping up every now and then.
So, I'm going to explain this in a little more detail here.
Uevents were previously broadcast into all network namespaces. This was
obviously problematic:
- You could be smarter than you should be and trick the system into running a
second udev daemon in a non-initial network namespace that is owned by the
initial user namespace. That has the potential to wreck the system. However
this only affects privileged containers that would be dumb enough to mount /sys
read-write.
- You could see an insane performance hit when you ran large numbers of
containers that each ran a udev daemon since the kernel would broadcast these
events to all of them. This is made worse by the fact that in non-initial
network namespaces that are owned by non-initial user namespaces the kernel
would not fix up the uid and gid relative to the owning user namespace of the
network namespace. That meant user space would see those events with
INVALID_{G,U}ID which causes udev to ignore those events. Effectively, the
kernel was screaming uevents into the void for absolutely no good reason.
Moreover the id permissions weren't even fixed up for namespaced devices such
as network devices that can be owned by different network namespaces (e.g.
moving a physical network device into an unprivileged container)
- You could technically spy on the hosts device events from an unprivileged
container. It's probably not an attack vector but it is definitely an
information leak.
- You had no way of delegating a device to a container since uevents that were
received for it were unusable (cf. above) but you also had no way of
injecting/forwarding uevents to a container.

For all those reasons I wrote several patches that namespace uevents and allow
injecting uevents:

- 94e5e3087a67c765be98592b36d8d187566478d5
- 692ec06d7c92af8ca841a6367648b9b3045344fd
- 26045a7b14bc7a5455e411d820110f66557d6589
- a3498436b3a0f8ec289e6847e1de40b4123e1639

So, the first two patches make it possible to forward/inject uevents into other
network namespaces if the caller has CAP_NET_ADMIN in the owning user namespace
of the target network namespace. This effectively allows for device namespaces.
Any forwarded/injected uevent should strip/not add a sequence number. The
kernel will append the correct sequence number to the buffer itself.
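
As an illustration of the sequence-number rule, here is a minimal sketch that parses a kernel uevent message and drops SEQNUM before forwarding. It assumes the usual kernel wire format of an "action@devpath" header followed by NUL-separated KEY=VALUE pairs. The function names are hypothetical, and the listener uses the numeric protocol value 15 for NETLINK_KOBJECT_UEVENT because I am not sure Python's socket module exports that constant.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # value from <linux/netlink.h>; assumed, see above


def parse_uevent(payload):
    """Split a uevent into its header and a dict of KEY=VALUE fields."""
    fields = [field for field in payload.split(b"\x00") if field]
    header = fields[0].decode()
    env = {}
    for field in fields[1:]:
        key, sep, value = field.decode().partition("=")
        if sep:
            env[key] = value
    return header, env


def strip_seqnum(payload):
    """Return the message without SEQNUM, ready to be forwarded/injected."""
    fields = [field for field in payload.split(b"\x00") if field]
    kept = [field for field in fields if not field.startswith(b"SEQNUM=")]
    return b"\x00".join(kept) + b"\x00"


def receive_one_uevent():
    """Block until one kernel uevent arrives (Linux only, not called here)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    try:
        sock.bind((0, 1))  # pid 0 = let the kernel pick, group 1 = kernel events
        return sock.recv(8192)
    finally:
        sock.close()
```

The actual injection into another network namespace is not shown; it requires CAP_NET_ADMIN as described above.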

The following two patches are concerned with isolating uevents aka namespacing
them more cleanly. Because #legacybehavior we came up with the following logic:
uevents are restricted to all network namespaces that are owned by the initial
user namespace. This implies that all non-initial network namespaces that are
owned by non-initial user namespaces do not receive any uevents unless the
kobject (in-kernel device representation) (e.g. network devices) carries a
namespace tag or a uevent is forwarded/injected. My patches ensure that network
namespace specific uevents and forwarded/injected uevents get their permissions
fixed-up according to the owning user namespace of the target network
namespace. This has the nice consequence that delegated network devices
(physical, virtual, SR-IOV) can now be seen by udev inside unprivileged
containers.
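
The delivery logic described above can be written down as a toy model. This is only a sketch of the rules as stated in this mail, not kernel code; the NetNs class, its field names, and receives_uevent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetNs:
    name: str
    owned_by_initial_userns: bool  # True for the host and privileged containers


def receives_uevent(listener: NetNs,
                    tagged_netns: Optional[NetNs] = None,
                    injected_into: Optional[NetNs] = None) -> bool:
    """Decide whether a udev listener in `listener` would see the uevent."""
    if injected_into is not None:
        # Forwarded/injected uevents go only to the target namespace.
        return listener == injected_into
    if tagged_netns is not None:
        # Namespaced devices (e.g. network devices) stay in their namespace.
        return listener == tagged_netns
    # Everything else reaches only namespaces owned by the initial user namespace.
    return listener.owned_by_initial_userns


host = NetNs("host", True)
privileged = NetNs("privileged-container", True)
unprivileged = NetNs("unprivileged-container", False)

for ns in (host, privileged, unprivileged):
    print(ns.name, "sees an untagged uevent:", receives_uevent(ns))
```

In this model a privileged container still sees ordinary uevents, because its network namespace is owned by the initial user namespace.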

So if uevents were a bottleneck for you then it shouldn't be the case anymore
for unprivileged containers at least. The in-kernel locking is also improved by
my patches and I have plans to further improve it. I just need to find the
time.

If you're running privileged containers and uevents are still a bottleneck for
you we can think about a per-network-namespace sysctl that might allow you to
opt-in or out per network namespace. Although I doubt that's a clean enough
option.

Post by Fajar A. Nugraha
I don't see udev running inside its lxc container, so perhaps they've
managed to solve that issue

Udevd will usually not run in unprivileged containers since /sys is
mounted ro so it won't start. However, in unprivileged containers /sys can
safely be mounted rw and udev will start.
This also makes sense on kernels with my patches added. (cf. above).


kemi

2018-11-06 09:15:31 UTC

Hi, Christian
I appreciate your detailed explanation here:)

Post by Christian Brauner
That is no longer true from kernels 4.17 onwards.

Yes, it should be.
I googled for a solution for this issue and luckily found your patch series.
The evaluation work is ongoing and I plan to port your patch series to Ubuntu
16.04 with kernel version 4.4.98 in my case.

Post by Christian Brauner
I should really write a
blogpost about my patchset it seems. This keeps popping up every now and then.
So, I'm going to explain this in a little more detail here.
Uevents were previously broadcast into all network namespaces. This was
obviously problematic:
- You could be smarter than you should be and trick the system into running a
second udev daemon in a non-initial network namespace that is owned by the
initial user namespace. That has the potential to wreck the system. However
this only affects privileged containers that would be dumb enough to mount /sys
read-write.
- You could see an insane performance hit when you ran large numbers of
containers that each ran a udev daemon since the kernel would broadcast these
events to all of them. This is made worse by the fact that in non-initial
network namespaces that are owned by non-initial user namespaces the kernel
would not fix up the uid and gid relative to the owning user namespace of the
network namespace. That meant user space would see those events with
INVALID_{G,U}ID which causes udev to ignore those events.

Agree.
But why does broadcasting the uevent to all of the listeners (ueventd in Android)
lead to a long response latency in ueventd?
The uevent is broadcast to all of the listeners in parallel, right?
If all of the listeners get the notification at the same time, we should not observe a long
response latency in ueventd. I have probably misunderstood something here, please correct me.

Post by Christian Brauner
Effectively, the
kernel was screaming uevents into the void for absolutely no good reason.
Moreover the id permissions weren't even fixed up for namespaced devices such
as network devices that can be owned by different network namespaces (e.g.
moving a physical network device into an unprivileged container)
- You could technically spy on the hosts device events from an unprivileged
container. It's probably not an attack vector but it is definitely an
information leak.
- You had no way of delegating a device to a container since uevents that were
received for it were unusable (cf. above) but you also had no way of
injecting/forwarding uevents to a container.
For all those reasons I wrote several patches that namespace uevents and allow
injecting uevents:
- 94e5e3087a67c765be98592b36d8d187566478d5
- 692ec06d7c92af8ca841a6367648b9b3045344fd
- 26045a7b14bc7a5455e411d820110f66557d6589
- a3498436b3a0f8ec289e6847e1de40b4123e1639
So, the first two patches make it possible to forward/inject uevents into other
network namespaces if the caller has CAP_NET_ADMIN in the owning user namespace
of the target network namespace. This effectively allows for device namespaces.
Any forwarded/injected uevent should strip/not add a sequence number. The
kernel will append the correct sequence number to the buffer itself.
The following two patches are concerned with isolating uevents aka namespacing
them more cleanly. Because #legacybehavior we came up with the following logic:
uevents are restricted to all network namespaces that are owned by the initial
user namespace. This implies that all non-initial network namespaces that are
owned by non-initial user namespaces do not receive any uevents unless the
kobject (in-kernel device representation) (e.g. network devices) carries a
namespace tag or a uevent is forwarded/injected. My patches ensure that network
namespace specific uevents and forwarded/injected uevents get their permissions
fixed-up according to the owning user namespace of the target network
namespace. This has the nice consequence that delegated network devices
(physical, virtual, SR-IOV) can now be seen by udev inside unprivileged
containers.
So if uevents were a bottleneck for you then it shouldn't be the case anymore
for unprivileged containers at least.

Yes. The ueventd should not be started in a non-privileged container.
We will try to use non-privileged containers in the future, but it takes time.
Currently, we are using privileged containers.

Post by Christian Brauner
The in-kernel locking is also improved by
my patches and I have plans to further improve it. I just need to find the
time.
If you're running privileged containers and uevents are still a bottleneck for
you we can think about a per-network-namespace sysctl that might allow you to
opt-in or out per network namespace. Although I doubt that's a clean enough
option.

If I understand correctly, even in privileged containers, the uevent broadcast issue
will not be a problem with your patch series above, since the uevent will only be
forwarded to the particular listener which has the same network namespace id as that uevent.

Wang, Kemi

2018-10-31 12:39:24 UTC