TL;DR
- With a bit of kernel tuning, I was able to get up to 523k simultaneous open connections from 4 client boxes to 1 MegaComet server.
- Memory and CPU usage were minimal (128M across the server’s processes, maybe 24% CPU at 4000 connections/sec).
- I’ll try to improve the kernel tuning to get to 1M, by checking /var/log/kern.log next time.
- Libev basically runs on the smell of an oily rag.
Setup
I started 5 EC2 Large 64-bit servers, using the Amazon Linux image ‘ami-221fec4b’ (aka amzn-ami-2011.02.1.x86_64). One of these was the server, and the other 4 were the clients, each trying to open 250k connections. These are vanilla EC2 Large instances, with the following kernel tuning (credit to the metabrew article):
Tuning
The following increases the user limit for number of open file descriptors (TCP connections are file descriptors):
echo "* soft nofile 1048576" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 1048576" | sudo tee -a /etc/security/limits.conf
After the above is done, you have to log out and back in again.
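A quick way to confirm the new limits took effect after logging back in is to compare what ulimit reports against the 1048576 configured above. This is only a sketch: the helper name limit_at_least is made up for illustration, and it assumes a POSIX-style shell where ulimit -Sn and ulimit -Hn print the soft and hard open-file limits.

```shell
# Succeed if a limit printed by ulimit ($1) is at least the required value ($2).
# ulimit may print "unlimited", which always passes.
limit_at_least() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

want=1048576   # the nofile value written to limits.conf above

for kind in S H; do
    have=$(ulimit -${kind}n)   # -Sn is the soft limit, -Hn the hard limit
    if limit_at_least "$have" "$want"; then
        echo "ok:      ulimit -${kind}n = $have"
    else
        echo "too low: ulimit -${kind}n = $have (wanted $want)"
    fi
done
```

If either line reports "too low", the most likely cause is that the session was started before limits.conf was edited.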
To tune the kernel to allow 1M connections, the following was appended to /etc/sysctl.conf:
# Settings from http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3
# Config needed to have enough tcp stack memory:
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 16384 33554432
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.tcp_max_tw_buckets = 360000
net.core.netdev_max_backlog = 2500
vm.min_free_kbytes = 65536
vm.swappiness = 0
# This is for the outgoing connections max:
net.ipv4.ip_local_port_range = 1024 65535
# I added this to set the system wide file max:
fs.file-max = 1100000
# Reduce the time sockets stay in time_wait: http://forums.theplanet.com/lofiversion/index.php/t62399.html
net.ipv4.tcp_fin_timeout = 12
To apply it, run: sudo sysctl -p
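After applying, it can be worth reading a few values back to confirm the kernel accepted them. The sketch below spot-checks four of the settings listed above with sysctl -n. The helper names normalize and check are hypothetical, and the whitespace normalization is there because sysctl typically prints multi-value settings separated by tabs.

```shell
# Collapse runs of spaces/tabs/newlines to single spaces and trim both ends,
# so "786432<TAB>1048576" compares equal to "786432 1048576".
normalize() {
    echo "$1" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Compare the live value of a sysctl key ($1) with the expected value ($2).
check() {
    key=$1
    want=$(normalize "$2")
    have=$(normalize "$(sysctl -n "$key" 2>/dev/null)")
    if [ "$have" = "$want" ]; then
        echo "ok       $key = $have"
    else
        echo "MISMATCH $key: want '$want', have '$have'"
    fi
}

# Expected values copied from the sysctl.conf settings above.
check net.ipv4.tcp_mem "786432 1048576 26777216"
check net.ipv4.ip_local_port_range "1024 65535"
check fs.file-max 1100000
check net.ipv4.tcp_fin_timeout 12
```

On a machine that has not been tuned, every line should print MISMATCH, which is a handy way to tell a fresh instance from a configured one.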
I believe this tuning still needs work. Next time I run the tests I’ll check the kernel log to see if anything in the TCP stack has maxed out.
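One sanity check on the tuning is to translate net.ipv4.tcp_mem into real memory, since the kernel measures those three thresholds (min, pressure, max) in pages rather than bytes. A minimal sketch, assuming getconf PAGESIZE is available; pages_to_mib is a made-up helper name:

```shell
# Convert a page count to whole MiB: pages_to_mib PAGES PAGE_SIZE_BYTES
pages_to_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

# Page size of this machine in bytes (typically 4096 on x86_64 Linux).
page_size=$(getconf PAGESIZE)

# The three tcp_mem thresholds from the settings above: min, pressure, max.
for pages in 786432 1048576 26777216; do
    echo "$pages pages = $(pages_to_mib "$pages" "$page_size") MiB"
done
```

With 4096-byte pages this works out to 3072 MiB, 4096 MiB and 104598 MiB (rounded down). The last figure is far more than the roughly 7.6 GB of RAM that top reports on the server, so the tcp_mem maximum is unlikely to be the limit that was hit.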
Steps
To reproduce my tests, you can follow the steps used to configure the vanilla instances:
# Install compiler / tools
sudo yum -y install 'gcc*' 'git*' make
# Install libev
wget http://dist.schmorp.de/libev/libev-4.04.tar.gz
tar -zxvf libev-4.04.tar.gz
cd libev-4.04
./configure && make && sudo make install
sudo sh -c "echo /usr/local/lib > /etc/ld.so.conf.d/usr-local-lib.conf"
sudo ldconfig
# Install MC
cd ~
git clone git://github.com/chrishulbert/MegaComet.git
cd MegaComet
# Now do the kernel tuning as mentioned above
# To run the server:
cd ~/MegaComet
make
./start
# To run the clients:
cd ~/MegaComet/testing
make
./megatest X Y
# X is a, b, c or d, depending on which client box this is
# Y is the IP address of the comet server
Results
The clients got up to 142k, 144k, 105k, and 132k connections respectively before attempts to open new connections started timing out. That is a total of 523k connections, just over half a million! RAM and CPU usage on the server were minimal throughout the test. Here’s a screenshot of top while the tests were running at approx 4000 new connections/second, to give an idea of CPU and memory usage:
top - 11:03:28 up 1:12, 2 users, load average: 0.25, 0.58, 0.48
Tasks: 77 total, 2 running, 75 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.8%us, 2.7%sy, 0.0%ni, 95.1%id, 0.0%wa, 0.1%hi, 1.1%si, 0.2%st
Mem: 7652552k total, 1441076k used, 6211476k free, 22144k buffers
Swap: 0k total, 0k used, 0k free, 823848k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
22612 ec2-user  20   0  8664  340  264 S  0.0  0.0  0:00.00 megamanager
22614 ec2-user  20   0 25552  17m  460 S  4.7  0.2  0:08.22 megacomet
22615 ec2-user  20   0 25552  17m  460 S  5.0  0.2  0:08.08 megacomet
22616 ec2-user  20   0 25556  17m  460 S  5.0  0.2  0:08.16 megacomet
22617 ec2-user  20   0 25556  17m  460 S  5.0  0.2  0:08.28 megacomet
22618 ec2-user  20   0 25580  17m  460 R  5.0  0.2  0:08.01 megacomet
22619 ec2-user  20   0 25556  17m  460 S  5.0  0.2  0:08.38 megacomet
22620 ec2-user  20   0 25556  17m  460 S  4.7  0.2  0:08.23 megacomet
22621 ec2-user  20   0 25552  17m  460 S  4.7  0.2  0:08.16 megacomet
I forgot to grab a top screenshot when the connections were all opened, but the memory usage was no different, and CPU was zero.
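The connection arithmetic from the results can be checked in a few lines of shell. The sketch below totals the per-client peaks and, for comparison, counts the ephemeral ports that the ip_local_port_range setting of 1024 to 65535 gives each client. The sum helper is hypothetical.

```shell
# Print the sum of all arguments.
sum() {
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    echo "$total"
}

# Peak connections per client, in thousands, from the results above.
clients="142 144 105 132"

# $clients is left unquoted on purpose so it splits into four arguments.
echo "total connections: $(sum $clients)k"

# Source ports available to one client for a single server address and port.
ports=$((65535 - 1024 + 1))
echo "ephemeral ports per client: $ports"
```

This gives 523k in total and 64512 ports. Every client went past 64512 connections, which one client address can only do if its connections are spread over more than one server address or port. That is an inference from the numbers, not something measured here.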
Conclusions
I really can’t believe the CPU and RAM usage are so small when the half-million connections are live and idle! At this stage, I’m not really testing for performance when passing messages around. I hope to get to 1M (static) open connections, and then start testing messaging. I’m optimistic: it looks promising. Next time I try this, I’ll keep a close eye on the kernel log (/var/log/kern.log) and see if I can find any bottlenecks.
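For that kernel-log check, a starting point might be to grep for a handful of messages the Linux kernel is known to print when networking limits are hit. The list of patterns is an assumption about what could show up, not a record of what this test produced. The helper name scan_kernel_log is made up, and on some distributions kernel messages land in /var/log/messages (or are only visible via dmesg) rather than /var/log/kern.log.

```shell
# Print lines from a log file ($1) that look like network limits being hit.
# "|| true" keeps a no-match result from being treated as an error.
scan_kernel_log() {
    grep -E -i \
        'out of socket memory|too many (of )?orphaned sockets|possible SYN flooding|conntrack: table full' \
        "$1" || true
}

# Log path: first argument if given, else the file named in the post.
log=${1:-/var/log/kern.log}

if [ -r "$log" ]; then
    scan_kernel_log "$log"
else
    echo "cannot read $log; try another path, or pipe dmesg output to a file first"
fi
```

Running it on the server and on each client right after the connection attempts start timing out should show whether one side ran into a kernel limit first.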
References
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiwe...
http://www.cs.wisc.edu/condor/condorg/linux_scalability.html