[Twisted-Python] Waiting time for tests running on Travis CI and Buildbot

Glyph Lefkowitz glyph at twistedmatrix.com
Mon Aug 15 14:38:03 MDT 2016


> On Aug 15, 2016, at 6:06 AM, Adi Roiban <adi at roiban.ro> wrote:
> 
> On 15 August 2016 at 00:10, Glyph Lefkowitz <glyph at twistedmatrix.com> wrote:
>> 
>>> On Aug 14, 2016, at 3:38 AM, Adi Roiban <adi at roiban.ro> wrote:
>>> 
>>> Hi,
>>> 
>>> We now have 5 concurrent jobs on Travis-CI for the whole Twisted organization.
>>> 
>>> If we want to reduce the waste of running push tests for a PR, we
>>> should check that the other repos in the Twisted organization are
>>> doing the same.
>>> 
>>> In twisted/twisted we now have 9 jobs per build ... and for each push
>>> to a PR we run the tests twice, once for the push and once for the PR
>>> merge ... so that is 18 jobs per commit.
>>> 
>>> twisted/mantissa has 7 jobs per build, twisted/epsilon 3 jobs per
>>> build, twisted/nevow 14 jobs, twisted/axiom 6 jobs, twisted/txmongo 16
>>> jobs
>>> 
>>> .... so we are a bit over the limit of 5 jobs
>> 
>> Well, we're not "over the limit".  It's just 5 concurrent.  Most of the projects that I work on have more than 5 entries in their build matrix.
>> 
>>> I have asked Travis-CI how we can improve the waiting time for
>>> twisted/twisted jobs and for $6000 per year they can give us 15
>>> concurrent jobs for the Twisted organization.
>>> 
>>> This will not give us access to a faster waiting line for the OSX jobs.
>>> 
>>> Also, I don't think that we can have twisted/twisted take priority
>>> inside the organization.
>>> 
>>> If you think that we can raise $6000 per year for sponsoring our
>>> Travis-CI and that is worth increasing the queue size I can follow up
>>> with Travis-CI.
>> 
>> I think that this is definitely worth doing.
> 
> Do we have the budget for this, or do we need to do a fundraising drive?
> Can The Software Freedom Conservancy handle the payment for Travis-CI?

We have almost no budget, so we would definitely need to raise money.  OTOH sometime soon we're going to run out of money just to keep the lights on on our tummy.com server, so we probably need to do that anyway :).

> Even if we speed up the build time, with 5 jobs we would still have
> only 0.5 concurrent complete builds ... or even less.
> So I think that increasing to 15 jobs would be needed anyway.

Faster is better, of course, but I don't see buildbot completing all its builds that much faster than Travis right now, so I'm not sure why you think this is so critical?

>>> I have also asked Circle CI for a free ride on their OSX builders, but
>>> it was put on hold as Glyph told me that Circle CI is slower than
>>> Travis.
>>> 
>>> I have never used Circle CI. If you have had a good experience with OSX
>>> on Circle CI, I can continue the phone interview with Circle CI so that
>>> we get the free access and see how it goes.
>> 
>> The reason I'm opposed to Circle is simply that their idiom for creating a build matrix is less parallelism-friendly than Travis.  Travis is also more popular, so more contributors will be interested in participating.
>> 
> 
> OK. No problem. Thanks for the feedback.
> I also would like to have fewer providers, as we already have
> buildbot/travis/appveyor :)
> 
> My push is for a single provider -> buildbot ... but I am aware that
> it might not be feasible

Yeah I just don't think buildbot is hardened enough for this sort of thing (although I would be happy to be proven wrong).

>>> There are multiple ways in which we can improve the time a test takes
>>> to run on Travis-CI, but it will never be faster than buildbot with a
>>> slave which is always active, ready to start a job in 1 second, and
>>> which already has 99% of the virtualenv dependencies installed.
>> 
>> There's a lot that we can do to make Travis almost that fast, with pre-built Docker images and cached dependencies.  We haven't done much in the way of aggressive optimization yet.  As recently discussed we're still doing twice as many builds as we need to just because we've misconfigured branch / push builds :).
> 
> Hm... pre-built Docker images also take effort to keep updated... and
> then we would have a KVM VM starting inside a Docker container in which
> we run the tests...
> 
> ...and we would not be able to test the inotify part.

Not true:

 - We can have one non-containerized builder (sudo: true) for testing
   inotify; no need for KVM-in-Docker (also you can't do that without a
   privileged container, so, good thing); see the sketch below.
 - Docker has multiple storage backends, and only one (overlayfs) doesn't
   work with inotify.
 - It's work to keep buildbot updated too :)
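
To make that concrete, here is a rough and untested sketch of the kind of
.travis.yml I have in mind (the env names are made up for illustration),
folding in the other points from this thread: only building pushes to trunk
so each PR builds once, caching pip downloads, and routing one job to a full
VM where inotify works:

    language: python
    sudo: false               # container-based jobs by default
    cache: pip                # reuse downloaded packages between builds

    branches:
      only:
        - trunk               # avoid duplicate push + PR-merge builds

    matrix:
      include:
        - python: 2.7
          env: TOXENV=py27-tests
        - python: 2.7
          env: TOXENV=py27-tests-inotify   # hypothetical inotify-heavy env
          sudo: required                   # full VM, so inotify works normally

    install:
      - pip install tox
    script:
      - tox -e $TOXENV

(Whether a per-job "sudo" key routes that one entry to the non-container
infrastructure is something to double-check against the Travis docs; treat
this as a starting point, not a known-good config.)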

> ... and if something goes wrong and we need to debug on the host,
> I am not sure how much fun that will be.
> 
>>> AFAIK the main concern with buildbot is that the slaves are always
>>> running, so a malicious person could create a PR with some malware and
>>> then all our slaves would execute that malware.
>> 
>> Not only that, but the security between the buildmaster and the builders themselves is weak.  Now that we have the buildmaster on a dedicated machine, this is less of a concern, but it still has access to a few secrets (an SSL private key, github oauth tokens) which we would rather not leak if we can avoid it.
> 
> If we have all slaves in RAX and Azure I hope that communication
> between the slaves and the buildmaster is secure.

https://www.agwa.name/blog/post/cloudflare_ssl_added_and_removed_here

> github token is only for publishing the commit status ... and I hope
> that we can make that token public :)

I am pretty sure that if we do that, Github will revoke it.  Part of the deal with OAuth tokens is that you keep them secret.

> I have little experience with running public infrastructure for open
> source projects... but are there really that many malicious people who
> would want to exploit a GitHub commit-status-only token?

I don't know, because nobody I am aware of is foolish enough to make this public :-).  But spammers will find ANY open access point to shove spam links into and start exploiting it, and commit statuses are such a place.  Even if you forget about spammers, this is a great way to get a high-value target, like a sysadmin or infrastructure developer, to click on a malicious link.

>>> One way to mitigate this is to use latent buildslaves and stop and
>>> reset a slave after each build, but this will also slow the build and
>>> lose the virtualenv ... which for Docker-based slaves should not be a
>>> problem... but if we want Windows latent slaves it might increase the
>>> build time.
>> 
>> It seems like fully latent slaves would be slower than Travis by a lot, since Travis is effectively doing the same thing, but they have a massive economy of scale with pre-warmed pre-booted VMs that they can keep in a gigantic pool and share between many different projects.
> 
> Yes... without pre-warmed VMs, latent slaves in the cloud might be slow.
> 
> I don't have experience with Azure/Amazon/Rackspace VMs and their
> snapshot capabilities... I have only recently started using Azure and
> Rackspace... and I saw that Rackspace VM or image creation is very
> slow.
> 
> I am using VirtualBox on my local system, and starting a VM from a
> saved state or restoring a snapshot is fast... so I was expecting to
> get the same experience from a cloud VM.

Booting from a cached image is still booting; it's not the same as saving a suspended VM.  (I don't think you can do this with most cloud providers due to the problems it would create for networking, but despite working at a cloud provider I'm not really an expert in that layer of the stack...)

>>> What do you say to protecting our buildslaves with a firewall which
>>> only allows outgoing connections to the buildmaster and GitHub ... and
>>> having the slaves run only on RAX + Azure to simplify the firewall
>>> configuration?
>>> 
>>> Would a malicious person still be interested in exploiting the slaves?
>>> 
>>> I would be happy to help with the buildbot configuration, as I think
>>> that for TDD, buildbot_try with slaves which are always connected and a
>>> virtualenv already created is the only acceptable CI system.
>> 
>> 
>> Personally, I just want to stop dealing with so much administrative overhead.  I am willing to wait for slightly longer build times in order to do that.  Using Travis for everything means we don't need to worry about these issues, or have these discussions; we can just focus on developing Twisted, and have sponsors throw money at the problem.  There's also the issue that deploying new things to the buildmaster will forever remain a separate permission level, but proposing changes to the travis configuration just means sending a PR.
> 
> I also would prefer not to do administrative work, and to just work on Twisted.
> 
> But I think that, with or without buildbot, we still have a problem with Travis-CI.
> 
> The OSX build is slow (12 minutes) and there is not much we can do about it.

The OS X build on buildbot is 9 minutes: https://buildbot.twistedmatrix.com/builders/osx10.10-py2.7/builds/1418 which sort of reinforces my point: this is a lot of administrative energy to spend on a 25% speedup.

> The OSX build on Travis-CI is now green and soon we might want to enforce it.

Well this is good news at least!

>> There are things we could do to reduce both overhead and the risk impact further though.  For example, we could deploy buildbot as a docker container instead of as a VM, making it much faster to blow away the VM if we have a security problem, and limiting its reach even more.
>> 
>> On the plus side, it would be nice to be dogfooding Twisted as part of our CI system, and buildbot uses it heavily.  So while I think overall the tradeoffs are in favor of travis, I wouldn't say that I'm 100% in favor.  And _most_ of the things we used to need a buildbot config change for now are just tox environment changes; if we can move to 99% of the job configuration being in tox.ini as opposed to the buildmaster, that would eliminate another objection.
> 
> We are now pretty close to running all buildbot jobs using only tox.

Awesome!
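
For anyone following along, the shape of that tox-driven setup is roughly
this; it is an illustrative sketch rather than our actual tox.ini, and the
env names are made up:

    [tox]
    envlist = py27-tests, pyflakes

    [testenv]
    deps =
        pyflakes: pyflakes
    commands =
        py27-tests: trial {posargs:twisted}
        pyflakes: pyflakes twisted

With that in place, every CI provider (buildbot, Travis, appveyor) only
needs to know how to run "tox -e <something>", and the interesting
configuration lives in the tree where anyone can change it with a PR.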

> I am happy to work on a simple buildbot builder configuration
> which always runs the tests based on the branch's configuration.
> 
> I think that the only thing left is the bdist_wheel job
> (https://twistedmatrix.com/trac/ticket/8676) but with the latest
> changes in the appveyor script, this should be solved.

Yay!

As we're working through the parts of this process where we don't have total agreement it is REALLY nice to see all the parts where there _is_ consensus moving forward so fast :).  And I should be clear: I'm doing basically none of this work.  Thank you Adi, and also everyone else helping with the CI maintenance!

>> I'd also be interested in hearing from as many contributors as possible about this though.  The point of all of this is to make contributing easier and more enjoyable, so the majority opinion is kind of important here :).
> 
> Working with Travis-CI, but also with the current buildbot, is not an
> enjoyable experience.
> 
> I am using Linux as my dev system, and if I want to fix an OSX or
> Windows issue I don't want to switch my working environment to run the
> tests in a local VM.

One other solution to this would be to get verified fakes.
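
(By "verified fakes" I mean test doubles that are checked against the same
interface as the real implementation, so platform-specific code paths can be
exercised from any OS.  Twisted already ships some; twisted.internet.task.Clock
standing in for the reactor's timekeeping is a small illustration:

    from zope.interface.verify import verifyObject
    from twisted.internet.interfaces import IReactorTime
    from twisted.internet.task import Clock

    # Clock provides the same IReactorTime interface as a real reactor,
    # so time-dependent tests run deterministically on any platform.
    clock = Clock()
    verifyObject(IReactorTime, clock)

    calls = []
    clock.callLater(5, calls.append, "fired")
    clock.advance(5)          # no real waiting, no platform-specific reactor
    assert calls == ["fired"]

The same idea applied to inotify or the Windows APIs would let you develop
those fixes from Linux without a round trip through CI.)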

> If I use the current configuration, even if I want to do TDD for a
> specific test on a specific OS (OSX or Windows) I have to commit
> and push each step in the TDD process and have the tests executed
> across all systems... which is a waste of resources and of my time.
> 
> For my project I am using buildbot_try, and I can target a specific
> builder and a specific test using `buildbot_try --builder SOMETHING
> --properties=testcase=TEST_RUN_ARGS` ... and with a little change in
> buildbot try I can wait for the results and have them printed in the
> console.
> 
> I would enjoy using something similar when working on Twisted.
> 
> Maybe Travis-CI already has this capability ... or is planning to add
> it soon... but I was not able to find it.

The way you do this with Travis is to open a PR and watch its build status.  If you want to see individual test results quickly, you can edit the test matrix on your branch to run just the part you want.  There's also command-line support for watching the build finish, via 'travis monitor'.  It's not quite as nice as 'buildbot_try' but it can get the job done.
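
Roughly, and from memory, so double-check against the travis.rb client's own
help output:

    $ gem install travis                  # the travis.rb command-line client
    $ travis login                        # authenticate with GitHub credentials/token
    $ travis monitor -r twisted/twisted   # stream build/job state changes for the repo
    $ travis logs                         # tail a job log (defaults to the latest build)

Between that and trimming the matrix on a scratch branch, the
edit/push/watch loop is reasonably tight.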

> but my biggest problem with Twisted is still the huge wait times for
> the review queue ... at least for my tickets :)
> 
> I would enjoy it if a branch were reviewed within 1 or 2 days.


I wish we had a graph of review latency, because it certainly _feels_ like review queue times are going down with all the activity we've had in the last couple of months.  But we should keep making progress...

-glyph
