An untapped resource to reproduce studies

Science is generally accepted to operate by conducting specially designed, structured observations (such as experiments and case studies) and then interpreting the results to build generalised knowledge (sometimes called theories or models). An important, nay necessary, feature of the social operation of science is transparency in the design, conduct, and interpretation of these structured observations. We’re going to work from the view that security research is science just like any other, though of course as its own discipline it has its own tools, topics, and challenges. This means that studies in security should be replicable, reproducible, or at least able to be corroborated. Spring and Hatleback argue that transparency is just as important for computer science as it is for experimental biology. Rossow et al. also persuasively argue that transparency is a key feature for malware research in particular. But how can we judge whether a paper is transparent enough? The natural answer would seem to be whether it is possible to make a replication attempt from the materials and information in the paper. Set aside for now how often such replications succeed, although we know that publication biases and other factors muddy that picture.

So how many security papers published in major conferences contain enough information to attempt a reproduction? In short, we don’t know. As anecdotal evidence, Jono and a couple of students looked through the IEEE S&P 2012 proceedings in 2013, and the results were pretty grim. But heroic effort from a few interested parties is not a sustainable answer to this question. We’re here to propose a slightly more robust solution: master’s students in security should attempt to reproduce published papers as their capstone thesis work. This has several benefits, and several challenges. In the following we hope to convince you that the challenges can be mitigated and the benefits are worth it.

This should be a choice, but one that master’s students should want to make. If anyone has a great new idea to pursue, they should be encouraged to do so. However, here in the UK, the dissertation process is compressed into the summer and there’s not always time to prototype and pilot study designs. Selecting a paper to reproduce, with a documented methodology already in place, lets the student get to work faster. There is still a start-up cost; students will likely have to read several abstracts to shortlist a few workable papers, and then read those few papers in detail to select a good candidate. But learning to read, shortlist, and study academic papers is an important skill that all master’s students should be attempting to, well, master. This style of project would give them an opportunity to practise these skills.

Briefly, let’s be clear what we mean by reproduction of published work.
Reproduction isn’t just one thing: there’s reproduce, replicate, corroborate, and controlled variation (see Feitelson for details). Not everything is amenable to reproduction. For example, case studies (such as attack papers) or natural experiments are often interesting precisely because they are unique. Corroborating some aspect of the case may be possible with a new study, and such a study is also valuable. But this is not the sort of reproduction we have in mind to advocate here.

Excluding case studies is not enough to pin down what, exactly, a student should reproduce. Systems are sensitive; they depend on all of their settings and on their inputs. Often a small change in the settings, or different inputs, can completely change the performance or output of a system. So how can a student ever tell whether a reproduction attempt genuinely yields a different result, rather than an artefact of different parameters or implementation details? This is an important question. But if it were really a fatal worry, why is the original research not also fatally wounded by it? It is hard to know which parts of a system are vital to reproduce, and that question can only be answered case by case. It is easier to answer if the mechanism by which the results occur is explainable, or at least intelligible. But if the system is really so chaotic that any tiny change makes it fail to reproduce, and that is unexplained, then that result is itself fascinating. What is special about those settings? Without trying, and figuring out the range of viable parameters, we cannot begin to answer that. This sort of deeper understanding of the phenomenon of interest, which one needs in order to design a meaningful reproduction, also gets to the heart of our point. Reproducing past work is not some mechanical task. There is real work in studying the past work and in documenting and designing an adequate reproduction. Work that is worthy of an MSc.
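To make the parameter-sensitivity worry concrete, here is a minimal, entirely hypothetical sketch; it is not drawn from any particular paper, and every name and number in it (run_experiment, the score distributions, the thresholds) is our own invention. It shows a toy anomaly detector whose headline figure swings several-fold depending on a single, easily-omitted threshold.

# A toy illustration: score 1,000 synthetic events and report the fraction
# flagged as anomalous. The only thing that varies between runs is a single
# threshold parameter of the kind a paper might forget to report.
import random
import statistics

def run_experiment(threshold, seed=0):
    """Return the fraction of synthetic events flagged as anomalous."""
    rng = random.Random(seed)
    # 950 benign events with low scores, 50 malicious events with higher scores.
    benign = [rng.gauss(0.30, 0.10) for _ in range(950)]
    malicious = [rng.gauss(0.55, 0.10) for _ in range(50)]
    scores = benign + malicious
    return sum(1 for s in scores if s > threshold) / len(scores)

for threshold in (0.45, 0.50, 0.55):
    rates = [run_experiment(threshold, seed) for seed in range(20)]
    print(f"threshold={threshold:.2f}  "
          f"mean flagged={statistics.mean(rates):.3f}  "
          f"stdev={statistics.stdev(rates):.3f}")

Running this, the flagged fraction changes roughly three-fold across a threshold range a write-up could easily gloss over. Nothing here is specific to security tooling; the same pattern shows up with feature thresholds, time-outs, dataset cut-off dates, and random seeds, and hunting down and documenting exactly those choices is part of the real work of a reproduction.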

Not all CS work is even empirical; some students are interested in theoretical aspects of computer science. Doing a proof of a cryptographic primitive again is of significantly less value than re-conducting an empirical observation of hardware performance or network behaviour, for example. This argument says that our reproduction plan would unfairly disadvantage these theoretical students. We have three responses. First, proofs are still socially agreed evidence. It is not uncommon for a mathematical proof to be shown wrong because the author overlooked something; for example, Wiles’ initial proof of Fermat’s Last Theorem. In our chosen example, Wiles had the opportunity to return to and fix his proof specifically because another member of the community found the error. Doing a proof again, but in a different way, would be a valuable contribution of evidence. Secondly, comparative theoretical work seems viable here instead of reproduction; systematising or consolidating past publications can involve its own theoretical work. Thirdly, and although this stretches what is usually meant by theoretical, many such papers include performance measurements or claims. Re-implementing the theoretical aspects of the paper to reproduce these measurements would be valuable, could perhaps include additional comparative benchmarks to increase the value further, and would straightforwardly fall under our heading of reproductions.

There remains the problem of what material resources the university can provide to the students, and of matching these with possible projects. For example, some studies require privileged access to the systems of a certain company or organisation, access that a student cannot readily be expected to have. Beyond access, material or computing resources are also limited. We are not claiming this strategy is magic; this problem remains. But it is no worse than under the current system.

Another objection to the idea of master’s students reproducing prior studies comes not from the students, but from the supervisors of the projects. Those supervisors often want publications, at least from a good project. However, we argue that reproductions will reliably yield just as many publications. There are venues that would welcome a collection of seven or so reproduction attempts and the lessons learned from the successes and failures. For example, the LASER CFP solicits: “adequate reporting of experiments, leading to an ability to understand the approach taken and reproduce results.” How better to evidence adequate reporting than to attempt a reproduction and make these methods explicit? The ACM DTRAP journal, with its emphasis on both academic rigour and the reliability needed to transition to industry, would likely welcome such papers. If the couple of highly self-motivated students still produce good, independently publishable dissertations, then encouraging the other students to do reproductions should actually increase the publishable work coming out of the group of students as a whole.

A more serious concern can be lodged about the risk to students completing the project. MSc projects need to be done on short and inflexible time scales. As such, they need a relatively low risk of surprises that could derail the project entirely. We might recommend reducing these risks by limiting the choice of studies rather carefully, for example preferring studies with pre-published data sets, such as those from the Cambridge Cybercrime Centre or IMPACT (formerly PREDICT). But this is not really a full answer. One reason we have proposed this topic is to investigate whether studies are reproducible as published, and possibly to increase transparency. So this requires a conversation about what a successful MSc dissertation consists of. It seems to us there is a good case to be made that a dissertation is a good one if it clearly documents the steps taken to reproduce a study, notes where results or study artefacts diverged from expectations, measures those divergences carefully, and reasons out their impact on the conclusions of the target work. But this is different from not being able to get off the ground at all, which is the real problem. It will be important to take care to reduce this risk to the same level as current practice.

There is of course also the purer appeal to the purported ideals of the academy. Reproducing work improves our understanding of its reliability and robustness, and this improves the collective creation of general knowledge in the field. In psychology, publication bias seems to be rampant: published results differ from expected statistical distributions by about 50 percentage points, and only 36–40% of studies were successfully reproduced when roughly 90% should have been. There is no reason to believe computer security is better than this, and a couple of reasons to believe it is worse. If we institutionalise master’s-student reproduction of published studies, the students would learn how to conduct studies more readily, and we would all learn more about which studies are reliable and which were statistical accidents. Given that snake oil gets sold as security, this would be valuable to sort out.

Reproduction information is valuable, but long-form MSc dissertations are not widely read. As written, they do not lend themselves readily to aggregation, and there is no established way of linking a reproduction to a target paper so as to lend it support. Whether the work is communicated directly to the target study’s authors, when, and by whom, may be handled on a case-by-case basis. In general, wider dissemination issues would be a nice problem to have, but they are worth considering. It may be valuable to add one small further requirement to a reproduction study: a 500 to 1000 word summary, suitable for independent consumption, which might well follow the format of a structured abstract. This would support combining several such reproductions into the sort of published papers discussed under venues above.
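As a purely illustrative sketch of how such summaries might be aggregated, the record below is our own guess at what a minimal machine-readable companion to the written summary could contain; there is no established schema for this, and every field name and value is a placeholder.

# A hypothetical, minimal record of one reproduction attempt. The field names
# are illustrative only; nothing here follows an established standard.
reproduction_record = {
    "target_paper": {
        "title": "Example target paper title",
        "doi": "10.0000/example.doi",   # placeholder identifier
        "venue": "Example conference 2012",
    },
    "reproduction": {
        "author": "MSc student name",
        "institution": "University name",
        "year": 2019,
        "materials_available": ["code", "data subset"],
        "outcome": "partial",           # e.g. "supported", "partial", "not supported"
        "divergences": [
            "Placeholder description of where results diverged",
            "Placeholder note on a data source that was no longer available",
        ],
        "summary_word_count": 850,      # the 500-1000 word structured summary
    },
}

# With many such records collected, simple aggregate questions become easy to ask,
# for example how many attempts supported their target's main claim.
records = [reproduction_record]
supported = sum(1 for r in records if r["reproduction"]["outcome"] == "supported")
print(f"{supported} of {len(records)} reproduction attempts supported the target's main claim")

A shared record along these lines would let the seven-or-so attempts mentioned above be combined into a single analysis without anyone having to re-read every dissertation in full.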

Therefore, for the good of the students, the supervisors, and the academy as a whole, we urge that dissertations incorporate an institutionalised element of attempting to reproduce past results.


Thanks to Steven Murdoch for constructive comments on a prior draft of this post.
