We’ve had issues open for a while now about benchmarking and improving the performance of the ara callback plugin as well as the API server, and I only recently took some time to tackle the callback.

It paid off: we can already see significant performance benefits in the 1.5.3 release of ara, and it lays a foundation for future improvements.

If you’d like to see the raw unformatted data that was used for this post, check out this gist on GitHub.

A benchmarking playbook

Whenever you want to improve something, it’s important to measure it first so you know how much better (or worse!) things become as a result of your changes.

The first step was to create a standardized benchmarking playbook that we could run across a variety of configurations and parameters.

The playbook was designed to run a specified number of tasks against a specified number of hosts.

It’s available in the git repository but it’s simple and small enough to include here:

# Copyright (c) 2020 The ARA Records Ansible authors
# GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)

- name: Create many hosts
  hosts: localhost
  gather_facts: no
  vars:
    benchmark_host_count: 25
  tasks:
    - name: Add a host to the inventory
      add_host:
        ansible_connection: local
        hostname: "host-{{ item }}"
        groups: benchmark
      with_sequence: start=1 end={{ benchmark_host_count }}

- name: Run tasks on many hosts
  hosts: benchmark
  vars:
    benchmark_task_file: "{{ playbook_dir }}/benchmark_tasks.yaml"
    # Run N tasks per host
    benchmark_task_count: 50
    # Off by default to prevent accidental load spike on localhost
    benchmark_gather_facts: no
  gather_facts: "{{ benchmark_gather_facts }}"
  tasks:
    - name: Include a task file
      include_tasks: "{{ benchmark_task_file }}"
      with_sequence: start=1 end={{ benchmark_task_count }}

and then the benchmark_tasks.yaml file:

# Copyright (c) 2020 The ARA Records Ansible authors
# GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)

# These are tasks meant to be imported by benchmark.yaml

- name: Run a task
  debug:
    msg: "{{ inventory_hostname }} running task {{ item }}/{{ benchmark_task_count }}"

Methodology

All tests ran under Ansible 2.10.2 with ANSIBLE_FORKS=50, using the default sqlite database backend for ara.

The benchmark playbook was run three times:

# 25 hosts and 50 tasks: 1276 results
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml

# 100 hosts and 50 tasks: 5101 results
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml \
    -e benchmark_host_count=100

# 200 hosts and 200 tasks: 40201 results
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml \
    -e benchmark_host_count=200 \
    -e benchmark_task_count=200
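
For the runs with ara enabled, the callback and its API client were configured through environment variables, roughly as follows. This is a sketch based on ara’s documented settings; refer to the ara documentation for the full list:

# make Ansible aware of the ara callback plugin
export ANSIBLE_CALLBACK_PLUGINS=$(python3 -m ara.setup.callback_plugins)

# offline client: the callback uses an embedded API server (the default)
export ARA_API_CLIENT=offline

# http client: the callback sends results to a running API server
export ARA_API_CLIENT=http
export ARA_API_SERVER=http://127.0.0.1:8000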

Note: localhost and the two bootstrap tasks are included in the results below.

ansible without ara

tasks   hosts   results   duration
52      26      1276      0m 11s
52      101     5101      0m 41s
202     201     40201     6m 01s

This is our control: how long these playbooks take to run without ara enabled, so we can calculate the overhead and performance impact of enabling the ara callback.

ansible with ara 1.5.1

api client   api server   tasks   hosts   results   duration
offline      django       52      26      1276      0m 56s
http         django       52      26      1276      0m 32s
http         gunicorn     52      26      1276      0m 36s
offline      django       52      101     5101      3m 31s
http         django       52      101     5101      1m 39s
http         gunicorn     52      101     5101      2m 19s
offline      django       202     201     40201     30m 22s
http         django       202     201     40201     17m 28s
http         gunicorn     202     201     40201     21m 38s

1.5.1 is the latest version that didn’t implement threading inside the callback plugin.

It’s curious that the django built-in webserver outperformed running with gunicorn when using the http client. I was not able to reproduce this result in 1.5.3.

I knew there was an overhead when enabling the callback, but until taking the time to measure it accurately, I never realized the performance hit was this large:

results   without ara   1.5.1     overhead
1276      11s           32s       21s
5101      41s           1m 39s    58s
40201     6m 01s        17m 28s   ~11.4m

ansible with ara 1.5.3

api client   api server   tasks   hosts   results   duration
offline      django       52      26      1276      0m 52s
http         django       52      26      1276      0m 30s
http         gunicorn     52      26      1276      0m 20s
offline      django       52      101     5101      3m 22s
http         django       52      101     5101      1m 37s
http         gunicorn     52      101     5101      1m 09s
offline      django       202     201     40201     29m 25s
http         django       202     201     40201     17m 24s
http         gunicorn     202     201     40201     13m 47s

1.5.2 introduced threading in the callback and 1.5.3 was subsequently released to work around an issue when using the offline client, forcing it to use a single thread for now.

From the table above, we can tell:

  • Running a single thread with the offline client performs about the same as 1.5.1 without threading, if only a little faster.
  • There is a significant performance improvement from the multi-threading when using the http client.
  • Running the API server with gunicorn outperforms the built-in django development server (see the sketch below).
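
For reference, here is roughly what the two ways of serving the API compared above look like. This is a minimal sketch: ara-manage wraps the django management commands and the gunicorn invocation assumes the WSGI application lives at ara.server.wsgi, so check the ara documentation for the authoritative commands.

# django built-in development server (not meant for production)
ara-manage runserver

# gunicorn serving the ara API with a few workers
gunicorn --workers=4 ara.server.wsgi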

1.5.3 reduced the overhead of the callback plugin quite a bit compared to 1.5.1:

results   without ara   1.5.3     overhead
1276      11s           20s       9s
5101      41s           1m 09s    28s
40201     6m 01s        13m 47s   ~7.8m

For science: ara 0.16.8

tasks   hosts   results   duration
52      26      1276      0m 28s
52      101     5101      1m 56s
202     201     40201     19m 05s

Although ara 0.x is no longer supported, it turns out it still works if you use the stable/0.x git branch... even with Ansible 2.10!

It was interesting to run the same benchmark against 0.x because it runs a completely different backend. It uses flask instead of django and doesn’t provide an API: the callback talks directly to the database through flask-sqlalchemy.
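
If you’d like to try it, installing from that branch looks something like this (a sketch, assuming the stable/0.x branch lives in the ansible-community/ara repository on GitHub):

# install ara 0.x from the stable/0.x git branch
python3 -m pip install "git+https://github.com/ansible-community/ara@stable/0.x"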

Putting it all together

Tallying up the numbers, we can see that we’re on the right track and performance is improving:

tasks   hosts   results   without ara   0.16.8    1.5.1     1.5.3
52      26      1276      11s           28s       32s       20s
52      101     5101      41s           1m 56s    1m 39s    1m 09s
202     201     40201     6m 01s        19m 05s   17m 28s   13m 47s

There is definitely more work to do and more opportunities to find for improving performance. There will unfortunately always be some overhead, but it needs to be low enough to be worth it without sacrificing simplicity.

In the future, it could be interesting to measure the impact of other parameters on performance, like the following (a sketch of varying a few of them comes after the list):

  • Ansible forks – what difference does having 25, 100 or 200 forks make?
  • Callback threads – is there a benefit to running more threads in the threadpool?
  • Version of Python – is there any difference between python 3.5 and 3.9?
  • Version of Ansible – were there any performance improvements or regressions between 2.8 and 2.10?
  • Database backend – is sqlite faster than mysql? what about postgresql?
  • Application backend – is gunicorn faster than uwsgi? what about apache mod_wsgi?
  • Latency – what’s the impact on performance of adding a jump box? what about 50ms? 250ms?
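
As a rough sketch, a few of these parameters are easy to vary from the command line or the environment. The ARA_DATABASE_* variables below are based on ara’s documented settings; double-check the names against the documentation for your version:

# vary the number of Ansible forks for a run
ANSIBLE_FORKS=100 ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml

# point the API server at postgresql instead of the default sqlite
export ARA_DATABASE_ENGINE=django.db.backends.postgresql
export ARA_DATABASE_NAME=ara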

If you’d like to help, have a look at the issues on GitHub or come chat with us on Slack or IRC!

See you around o/