Resolving disputes through computer evidence: lessons from the Post Office Trial

On Monday, the final judgement in the Post Office trial was handed down, finding in favour of the claimants on all counts. The outcome will be of particular interest to the group of 587 claimants who brought the case against Post Office Limited, but the judgement also illustrates problems in handling evidence generated by computers that have much broader applicability. I think this trial demonstrates that the way such disputes are resolved is not fit for purpose, and that changes are needed both in how computers generate evidence and in how such evidence is reasoned about in litigation.

This case centres on disputes between Post Office Limited and sub-postmasters who operate Post Office branches on its behalf. Post Office Limited supplies these sub-postmasters with products to sell, and the computer accounting system – Horizon – for managing the branch. The claimants contend that shortfalls between the money that was actually in their branch and what Horizon says should be there result from bugs in Horizon or from someone maliciously accessing it. The Post Office instead claims that the shortfalls are real, and that it is the responsibility of the sub-postmaster to reimburse the Post Office.

Such disputes have resulted in sub-postmasters being bankrupted, and others have even been jailed because the Post Office contends that evidence produced by Horizon demonstrates fraud by the sub-postmaster. The judgement vindicates the sub-postmasters, concluding that Horizon “was not remotely robust”.

This trial is actually the second in this case, with the prior one also finding in favour of the sub-postmasters – that the contractual terms set by the Post Office regarding how it investigates and handles shortfalls are unfair. There would have been at least two more trials, had the parties not settled last week with Post Office Limited offering an apology and £58m in compensation. Of this, the vast majority will go towards legal costs and to the fund which bankrolled the litigation – leaving claimants lucky to get much more than £10k each on average. Disappointing, sure, but better than nothing – and nothing is what they could have got had the trials and inevitable appeals continued.

As would be expected for a trial depending on highly technical arguments, expert evidence featured heavily. The Post Office expert took a quantitative approach, presenting a statistical argument that the claimants’ losses were implausibly high. The argument proceeded by making a rough approximation of the total losses of all sub-postmasters resulting from bugs in Horizon. Then, assuming that these losses were spread equally over all sub-postmasters, the losses of the 587 claimants would be no more than £25,000 – far less than the £18.7 million claimed. On this basis, the Post Office said that it is implausible for Horizon bugs to be the cause of the losses, and that they are instead the fault of the affected sub-postmasters.

This argument is fundamentally flawed; I said so at the time, as did others. The claimant group was selected specifically as people who thought they were victims of Horizon bugs, so it’s quite reasonable to think this group might indeed be disproportionately affected by Horizon bugs. The judge agreed, saying, “The group has a bias, in statistical terms. They plainly cannot be treated, in statistical terms, as though they are a random group of 587 [sub-postmasters]”. This error can be corrected, but the argument becomes circular and a statistical approach adds little new information. As the judgement concludes, “probability theory only takes one so far in this case, and that is not very far”.
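To make the selection-bias point concrete, here is a minimal sketch with entirely hypothetical numbers – the population size, the fraction of branches hit by a bug, and the size of each loss are all made up – showing why an average taken over every sub-postmaster says little about a group selected precisely because its members suffered losses.

```python
import random

random.seed(0)

# Hypothetical numbers, for illustration only: a population of sub-postmasters,
# a fraction of whom suffer a large bug-induced loss.
POPULATION = 10_000
AFFECTED_FRACTION = 0.10
LOSS_IF_AFFECTED = 20_000  # pounds, hypothetical

losses = [LOSS_IF_AFFECTED if random.random() < AFFECTED_FRACTION else 0
          for _ in range(POPULATION)]

# The expert's style of reasoning: spread the total loss evenly over everyone.
population_average = sum(losses) / POPULATION

# But the claimants are not a random sample: they are drawn (here, exactly)
# from those who actually suffered a loss.
claimants = [loss for loss in losses if loss > 0][:587]
claimant_average = sum(claimants) / len(claimants)

print(f"Average over all sub-postmasters:       £{population_average:,.0f}")
print(f"Average within the self-selected group: £{claimant_average:,.0f}")
```

With these made-up numbers, the population-wide average is around £2,000 while every member of the self-selected group lost £20,000 – spreading losses equally over everyone tells you almost nothing about a group chosen because they suffered losses.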

Continue reading Resolving disputes through computer evidence: lessons from the Post Office Trial

Measuring mobility without violating privacy – a case study of the London Underground

In the run-up to this year’s Privacy Enhancing Technologies Symposium (PETS 2019), I noticed some decidedly non-privacy-enhancing behaviour. Transport for London (TfL) announced that they will be tracking the wifi MAC addresses of devices carried through London Underground stations. Before a MAC address is stored it will be hashed with a key, but since this key will remain unchanged for an extended period (2 years), it will be possible to track the movements of an individual over this period through this pseudonymous ID. These traces are likely to be enough to link records back to an individual, given some knowledge of that person’s distinctive travel patterns. Also, for as long as the key is retained it would be trivial for TfL (or someone who stole the key) to convert someone’s MAC address into its pseudonymised form and so indisputably learn that person’s movements.

TfL argues that under the General Data Protection Regulation (GDPR), they don’t need the consent of the individuals they monitor because they are acting in the public interest. Indeed, others have pointed out the value to society of knowing how people typically move through underground stations. But the GDPR also requires that organisations minimise the amount of personal data they collect. Could the same goal be achieved if TfL irreversibly anonymised wifi MAC addresses rather than just pseudonymising them? For example, they could truncate the hashed MAC address so that many devices all share the same truncated anonymous ID. How would this affect the calculation of statistics of movement patterns within underground stations? I posed these questions in a presentation at the PETS 2019 rump session, and in this article, I’ll explain how a set of algorithms designed to violate people’s privacy can be applied to collect wifi mobility information while protecting passenger privacy.
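To make the difference between the two approaches concrete, here is a minimal sketch – the key, the MAC address, and the truncation length are all made up for the example – of keyed hashing (pseudonymisation) versus truncating the keyed hash so that many devices share one anonymous ID.

```python
import hmac
import hashlib

# Hypothetical key, for illustration only; TfL's actual scheme and key are not shown here.
SECRET_KEY = b"example-key"

def pseudonymise(mac: str) -> str:
    """Keyed hash of a MAC address: stable for as long as the key is kept,
    so an individual's movements remain linkable."""
    return hmac.new(SECRET_KEY, mac.encode(), hashlib.sha256).hexdigest()

def anonymise(mac: str, keep_bytes: int = 1) -> str:
    """Truncate the keyed hash so that many distinct devices collide on the
    same short ID, making it impractical to follow any single device."""
    return pseudonymise(mac)[: keep_bytes * 2]  # 2 hex characters per byte

mac = "aa:bb:cc:dd:ee:ff"
print(pseudonymise(mac))  # a unique, linkable pseudonym
print(anonymise(mac))     # with 1 byte kept there are only 256 possible IDs
```

With only one byte of the hash kept, each of the 256 possible IDs is shared by many devices at a busy station, so aggregate flows can still be counted while following any single device becomes impractical.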

It’s important to emphasise that TfL’s goal is not to track past Underground customers but to predict the behaviour of future passengers. Inferring past behaviours from the traces of wifi records may be one means to this end, but it is not the end in itself, and TfL creates legal risk for itself by holding this data. The inferences from this approach aren’t even going to be correct: wifi users are unlikely to be typical passengers and behaviour will change over time. TfL’s hope is that the inferred profiles will be useful enough to inform business decisions. Privacy-preserving measurement techniques should be judged by the business value of the passenger models they create, not by how accurately they could follow individual passengers around underground stations in the past. As the saying goes, “all models are wrong, but some are useful”.

Simulating privacy-preserving mobility measurement

To explore this space, I built a simple simulation of Euston Station inspired by one of the TfL case studies. In my simulation, there are two platforms (A and B) and six types of passengers. Some travel from platform A to B; some from B to A; others enter and leave the station at one platform (A or B). Passengers that travel between platforms can take either the fast route (taking 2 minutes on average) or the slow route (taking 4 minutes on average). Passengers enter the station at a Poisson arrival rate averaging one per second. The probabilities that each new passenger is of a particular type are shown in the figure below. The goal of the simulation is to infer the number of passengers of each type from observations of wifi measurements taken at platforms A and B.
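A minimal sketch of such a simulation might look like the following. The passenger-type probabilities are placeholders (the real values are in the figure from the original post), and the code only generates arrival events rather than the full wifi-observation model.

```python
import random

random.seed(1)

# Placeholder probabilities for the six passenger types; the real values are
# given in the figure from the original post.
PASSENGER_TYPES = {
    "A_to_B_fast": 0.20,  # between platforms, fast route (2 minutes on average)
    "A_to_B_slow": 0.10,  # between platforms, slow route (4 minutes on average)
    "B_to_A_fast": 0.20,
    "B_to_A_slow": 0.10,
    "A_only":      0.20,  # enter and leave the station at platform A
    "B_only":      0.20,  # enter and leave the station at platform B
}

ARRIVAL_RATE = 1.0         # passengers per second (Poisson arrivals)
SIMULATION_SECONDS = 3600  # simulate one hour

def simulate():
    """Generate (arrival_time, passenger_type) events for one hour."""
    events, t = [], 0.0
    types, weights = list(PASSENGER_TYPES), list(PASSENGER_TYPES.values())
    while t < SIMULATION_SECONDS:
        t += random.expovariate(ARRIVAL_RATE)  # exponential inter-arrival times
        events.append((t, random.choices(types, weights)[0]))
    return events

print(f"{len(simulate())} passengers arrived in one simulated hour")
```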

Continue reading Measuring mobility without violating privacy – a case study of the London Underground

TESSERACT’s evaluation framework and its use of MaMaDroid

In this blog post, we will describe and comment on TESSERACT, a system introduced in a paper to appear at USENIX Security 2019, and previously published as a pre-print. TESSERACT is a publicly available framework for the evaluation and comparison of systems based on statistical classifiers, with a particular focus on Android malware classification. The authors used DREBIN and our MaMaDroid paper as examples for this evaluation. They chose these because they are two of the most important state-of-the-art papers, tackling the challenge from different angles, using different models and different machine learning algorithms. Moreover, DREBIN has already been reproduced by researchers even though the code is no longer available; MaMaDroid’s code is publicly available (the parsed data and the list of samples are available on request). I am one of MaMaDroid’s authors, and I am particularly interested in projects like TESSERACT. Therefore, I will go through this interesting framework and attempt to clarify a few misinterpretations made by the authors about MaMaDroid.

The need for evaluation frameworks

The information security community – and, in particular, its systems-focused part – often feels that papers are rejected on the basis of questionable decisions, while at the same time arguing that papers should be more rigorous and respect certain important characteristics. Researchers from Dutch universities published a survey of papers published at top venues in 2010 and 2015, evaluating whether these works committed “crimes” affecting the completeness, relevancy, soundness, and reproducibility of the work. They showed that the more recent publications exhibit more flaws. Even though the authors included their own works among those analysed and did not word the paper as a wall of shame pointing the finger at specific articles, the paper has been seen as an attack on the community rather than an encouragement to produce more complete papers. To the best of my knowledge, unfortunately, the paper has not yet been accepted for publication. TESSERACT is another example of researchers’ efforts to make the community’s work more rigorous: most systems papers present accuracies close to 100% in all the tests performed; however, when some of them have been tested on different datasets, their accuracy was worse than a coin toss.
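To make concrete the point about near-perfect in-dataset accuracy collapsing on other data, here is a minimal, generic sketch. The synthetic data is a crude stand-in for datasets collected at different times; this is not TESSERACT’s code, data, or API.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic two-class data standing in for, e.g., apps collected over time.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=5,
                           random_state=0)
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.5, random_state=0)

# Crude stand-in for distribution drift in the "newer" data.
rng = np.random.default_rng(0)
X_new = X_new + rng.normal(0.0, 2.0, size=X_new.shape)

X_train, X_test, y_train, y_test = train_test_split(X_old, y_old, test_size=0.3,
                                                    random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# The first number is what papers usually report; the second is closer to what a
# TESSERACT-style evaluation on different data forces you to confront.
print("In-dataset accuracy:   ", accuracy_score(y_test, clf.predict(X_test)))
print("Cross-dataset accuracy:", accuracy_score(y_new, clf.predict(X_new)))
```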

These two works are part of a trend that I personally find important for our community: allowing works that chronologically follow others to be evaluated more fairly. Let me explain with a personal example: I recall my supervisor telling me that, at the beginning, he was not optimistic about MaMaDroid being accepted at the first attempt (NDSS 2017), because most of the previous literature reports accuracy consistently above 98%, and a gap of a few percentage points can be enough for some reviewers to reject a paper. When we asked a colleague for his opinion on the paper, before submitting it for peer review, his comment on the ML part was: “I actually think the ML part is super solid, and I’ve never seen a paper with so many experiments on this topic.” We can see two completely different reactions to the same specific part of the work.

TESSERACT

The goal of this post is to show TESSERACT’s potential while pointing out the small misinterpretations of MaMaDroid present in the current version of the paper. The authors contacted us to let us read the paper and check whether there had been any misinterpretation, and I had a constructive meeting with them where we also had the opportunity to exchange opinions on the work. Following the description of TESSERACT, there will be a section on the misinterpretations of MaMaDroid in the paper. The authors told me that the next version of the paper will be updated according to what we discussed.

Continue reading TESSERACT’s evaluation framework and its use of MaMaDroid

Measuring and Modeling the Vivino Wine Social Network

Over the past few years, food and drink have become an essential part of our social media footprints. This shouldn’t come as a surprise – after all, eating and drinking were social activities long before the first #foodporn hashtag on Instagram. In fact, scientific studies have shown that what we gobble up or gulp down is shaped by social and regional influences, and that we tend to mirror the habits of people with whom we share social connections.

Nowadays, we have an unprecedented opportunity to study eating and drinking habits at scale, as people share more and more of them online, not only on popular social networks like Instagram, Twitter, and Facebook, but also on “dedicated” apps like Yummly or Untappd.

Along these lines is our paper, “Of Wines and Reviews: Measuring and Modeling the Vivino Wine Social Network,” recently presented at the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2018) in Barcelona. The study – co-authored by former UCL undergraduate student Neema Kotonya, Italian wine journalist Paolo De Cristofaro, and UCL faculty Emiliano De Cristofaro – presents a preliminary big-data and social network analysis of how users worldwide consume, rate, and review wines. We did so through the lens of Vivino, a popular wine social network. (And, yes, Paolo is Emiliano’s brother! 😊)

What is Vivino?

Vivino.com is an online community for wine enthusiasts, available both as a web and a mobile application. It was founded in 2009 by Heini Zachariassen, with his colleague Theis Sondergaard joining in 2010. In a nutshell, Vivino allows users to review and purchase wines through third-party vendors. The mobile app also provides a “wine scanner” functionality – i.e., users can upload pictures of wine labels and access reviews and details about the wine/winery.

But Vivino is actually a social network, as it allows wine enthusiasts to communicate with and follow each other, as well as share reviews and recommendations. As of September 2018, it had 32 million users, 9.7 million wines covering a multitude of wine styles, grapes, and geographical regions, as well as 103.7 million ratings and almost 35 million reviews.

Continue reading Measuring and Modeling the Vivino Wine Social Network

Incentives in Security Protocols

The 2018 edition of the International Security Protocols Workshop took place last week. The theme this year was “fail-safe and fail-deadly concepts in protocol design”.

One common theme at this year’s workshop was that of threat models and incentives, which is covered by the majority of accepted papers. One of these is our (Sarah Azouvi, Alexander Hicks and Steven Murdoch) submission – Incentives in Security Protocols. The aim of the paper is to discuss how incentives can be considered and incorporated in the security of systems. In line with the workshop theme, the focus is on fail-safe and fail-deadly cases, which we examine for the EMV protocol, consensus in cryptocurrencies, and non-economic systems such as Tor. This post will summarise the main ideas laid out in the paper.

Fail safe, fail deadly and people

Systems can fail, and system designers must give some thought to accounting for these failures. From this setting comes the idea of fail-safe protocols: even if the protocol fails, the failure can be dealt with or the protocol can be aborted to limit damage. The idea of a fail-deadly setting is an extension of this, where failure is defended against through deterrence, as in the case of nuclear deterrence (sometimes a realistic case).

Human input often plays a role in the use of a system, particularly when decisions are required, as in fail-safe and fail-deadly instances. These decisions are then made according to incentives, which can be aligned to make the system robust to failure. For a fail-deadly alignment, this means that a person in a position to prevent system failure will be harmed by the failure. In the fail-safe case, the innocent parties should be protected from the consequences of system failure. The two concepts are really two sides of the same coin that assigns liability.

It is often said that people are the weakest link in security, but that is an easy excuse for broken protocols. If security incentives are aligned properly, then humans are the strongest link.

The EMV protocol, adding incentives after the fact

As a first example, we consider the case of the EMV protocol, which is used for the majority of smart card payments worldwide, as well as smartphone and card-based contactless payments. Over the years, many vulnerabilities have been identified and removed. Fraud still exists, however, due not to unexpected protocol vulnerabilities but to decisions made by banks (e.g., omitting the ability for cards to produce digital signatures), merchants (e.g., omitting PIN verification) and payment networks (e.g., not sending transaction details back to banks). These are intentional choices, aiming to save costs and cut transaction times, but they make fraud harder to detect.

Continue reading Incentives in Security Protocols

Practicing a science of security

Recently, at NSPW 2017, Tyler Moore, David Pym, and I presented our work on practicing a science of security. The main argument is that security work – both in academia and in industry – already looks a lot like other sciences. The paper is also an introduction to modern philosophy of science for security, and a survey of the existing science-of-security discussion within computer science. The goal is to help us ask more useful questions about what we can do better in security research, rather than get distracted by asking whether security can be scientific.

Most people writing about a science of security conclude that security work is not a science, or at best rather hopefully conclude that it is not a science yet but could be. We identify five common reasons people present as to why security is not a science: (1) experiments are untenable; (2) reproducibility is impossible; (3) there are no laws of nature in security; (4) there is no single ontology of terms to discuss security; and (5) security is merely engineering.

Through our introduction to modern philosophy of science, we demonstrate that all five of these complaints are misguided. They rely on an old conception of what counts as science that was largely abandoned in the 1970s, when the features of biology came to be recognized as important and independent from the features of physics. One way to understand what the five complaints actually allege is that security is not physics. But that’s much less impactful than claiming it is not science.

More importantly, we have a positive message on how to overcome these challenges and practice a science of security. Instead of complaining about untenable experiments, we can discuss structured observations of the empirical world. Experiments are just one type of structured observation. We need to know what counts as a useful structure to help us interpret the results as evidence. We provide recommendations for the use of randomized controlled trials as well as references for useful design of experiments that collect qualitative empirical data. Ethical constraints are also important; the Menlo Report provides a good discussion on addressing them when designing structured observations and interventions in security.

Complaints about reproducibility are really targeted at the challenge of interpreting results. Astrophysics and paleontology do not reproduce experiments either, but are clearly still sciences. There are different senses of “reproduce,” from repeat exactly to corroborate by similar observations in a different context. There are also notions of statistical reproducibility, such as using the right tests and having enough observations to justify a statistical claim. The complaint is unfair in essentially demanding all eight types of reproducibility at once, when realistically any individual study will only be able to probe a couple of types at best. Seen with this additional nuance, security has similar challenges in reproducibility and interpreting evidence as other sciences.

A law of nature is a very strange thing to ask for when we have constructed the devices we are studying. The word “law” has had a lot of sticking power within science. The word was perhaps used in the 1600s and 1700s to imply a divine designer, thereby making the Church more comfortable with the work of the early scientists. The intellectual function we really care about is that a so-called “law” lets us generalize from particular observations. Mechanistic explanations of phenomena provide a more useful and approachable goal for our generalizations. A mechanism “for a phenomenon consists of entities (or parts) whose activities and interactions are organized so as to be responsible for the phenomenon” (pg 2).

MITRE wrote the original statement that a single ontology was needed for a science of security. They also happen to have a big research group funded to create such an ontology. We synthesize a more realistic view from Galison, Mitchell, and Craver. Basically, diverse fields contribute to a science of security by collaboratively adding constraints on the available explanations for a phenomenon. We should expect our explanations of complex topics to reflect that complexity, and so complexity may be a mark of maturity, rather than (as is commonly taken) a mark that security has as yet failed to become a science by simplifying everything into one language.

Finally, we address the relationship between science and engineering. In short, people have tried to reduce science to engineering and engineering to science. Neither attempt is convincing. The line between the two is blurry, but it is useful. Engineers generate knowledge, and scientists generate knowledge. Scientists tend to want to explain why, whereas engineers tend to want to predict a change in the future based on something they make. Knowing why may help us make changes. Making changes may help us understand why. We draw on the work of Dear and Leonelli to bring out this nuanced, mutually supportive relationship between science and engineering.

Security already can accommodate all of these perspectives. There is nothing here that makes it seem any less scientific than life sciences. What we hope to gain from this reorientation is to refocus the question about cybersecurity research from ‘is this process scientific’ to ‘why is this scientific process producing unsatisfactory results’.

Security intrusions as mechanisms

The practice of security often revolves around figuring out what (malicious act) happened to a system. This historical inquiry is the focus of forensics, specifically when the inquiry regards the violation of a policy (such as a law). The results of forensic investigation might be used to fix the impacted system, attribute the attack to adversaries, or build more resilient systems going forwards. However, to accomplish any of these purposes, the investigator first must discover the mechanism of the intrusion.

As discussed at an ACE seminar last October, one common framework for this discovery task is the intrusion kill chain. Mechanisms, mechanistic explanation, and mechanism discovery have highly developed meanings in the biological and social sciences, but these terms are not often used in information security. In a recent paper, we argue that incident response and forensics investigators would be well-served to make use of the existing literature on mechanisms, as thinking about intrusion kill chains as mechanisms is a productive and useful way to frame the work.

To some extent, thinking mechanistically is a description of what (certain) scientists do. But the mechanisms literature within philosophy of science is not merely descriptive. The normative benefits extolled include that thinking mechanistically is an effective heuristic for searching out useful explanations; that mechanisms provide the most coherent unity to complex fields of study; and that mechanistic explanation is necessary to guide selection among potential studies given limited experimental resources, experiment design decisions, and the interpretation of statistical results. I previously argued that capricious use of biological metaphors is bad for information security. We are keenly aware that these benefits of mechanistic explanation need to apply to security as and for security, not merely because they work in other sciences.

Our paper demonstrates how we can cast the intrusion kill chain, the diamond model, and other models of security intrusions as mechanistic models. This casting begins to demonstrate the mosaic unity of information security. Campaigns are made up of attacks. Attacks, as modeled by the kill chain, have multiple steps. In a specific attack, the delivery step might be accomplished by a drive-by-download. So we demonstrate how drive-by-downloads are a mechanism, one among many possible delivery mechanisms. This description is a schema to be filled in during a particular drive-by download incident with a specific URL and specific javascript, etc. The mechanistic schema of the delivery mechanism informs the investigator because it indicates what types of network addresses to look for, and how to fit them into the explanation quickly. This process is what Lindley Darden calls schema instantiation in the mechanism discovery literature.
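As a rough sketch of what a mechanistic schema and its instantiation could look like in code – the field names and example values below are invented for illustration, not taken from the paper – the delivery step can be represented as a schema with holes that a particular incident fills in.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeliveryMechanism:
    """Schema for the kill chain's delivery step: the parts and activities are
    named, but the specifics are left as holes to fill during an investigation."""
    method: str                      # e.g. "drive-by-download"
    landing_url: Optional[str] = None
    payload: Optional[str] = None    # e.g. the specific JavaScript observed

@dataclass
class Attack:
    """A kill-chain style mechanism: an ordered series of steps, each itself a mechanism."""
    reconnaissance: Optional[str] = None
    delivery: Optional[DeliveryMechanism] = None
    exploitation: Optional[str] = None

# Schema instantiation: fill the holes with specifics recovered from evidence
# (the values here are made up).
incident = Attack(
    delivery=DeliveryMechanism(
        method="drive-by-download",
        landing_url="http://example.invalid/landing",
        payload="obfuscated JavaScript dropper",
    )
)
print(incident.delivery.method)
```

The schema tells the investigator what kinds of evidence to look for (a landing URL, a payload) before any of it has been found, which is the sense in which schema instantiation guides discovery.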

Our argument is not that good forensics investigators fail to use such mechanism discovery strategies. Rather, it is precisely that good investigators do use them. But we need to describe what it is that good investigators in fact do. We do not currently do so, and that lack makes teaching new investigators particularly difficult. Thinking about intrusions as mechanisms unlocks an expansive literature on good ways to do mechanism discovery. This literature will make it easier to codify what good investigators do, which among other benefits allows us to better teach sound methodological practices to incoming investigators.

Our paper on this topic was published in the open-access Journal of Cybersecurity, as Thinking about intrusion kill chains as mechanisms, by Jonathan M. Spring and Eric Hatleback.

Category errors in (information) security: how logic can help

It can be argued, pretty strongly, that (information) security is the process by which it is ensured that just the right agents have just the right access to just the right (information) resources at just the right time. Of course, one can refine this rather pithy definition somewhat, and apply tailored versions of it to one’s favourite applications and scenarios.

A convenient taxonomy for information security is determined by the concepts of confidentiality, integrity, and availability, or CIA; informally:

Confidentiality
the property that just the right agents have access to specified information or systems;
Integrity
the property that specified information or systems are as they should be;
Availability
the property that specified information or systems can be accessed or used when required.

Alternatives to confidentiality, integrity, and availability are sensitivity and criticality, in which sensitivity amounts to confidentiality together with some aspects of integrity and criticality amounts to availability together with some aspects of integrity.

But the key point about these categories of phenomena is that they are declarative; that is, they provide a statement of what is required. For example, that all documents marked ‘company private’ be accessible only to the company’s employees (confidentiality), or that all passengers on the aircraft be free of weapons (integrity), or that the company’s servers be up and running 99.99% of the time (availability).

It’s all very well stating, declaratively, one’s security objectives, but how are they to be achieved? Declarative concepts should not be confused with operational concepts; that is, ones that describe how something is done. For example, passwords and encryption are used to ensure that documents remain confidential, security searches ensure that passengers do not carry weapons onto an aircraft, and RAID servers are employed to ensure adequate system availability. So, along with each declarative aim there is a collection of operational tools that can be used to achieve it.
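To illustrate the distinction, here is a minimal sketch – the policy, the employee list, and the check are all invented for the example – pairing a declarative confidentiality requirement with one operational mechanism that serves it.

```python
# Declarative: a statement of what is required.
POLICY = "Documents marked 'company private' are accessible only to employees."

# Operational: one way of enforcing it (among several possible mechanisms).
EMPLOYEES = {"alice", "bob"}  # stands in for a directory lookup

def may_read(user: str, document_marking: str) -> bool:
    """An access-control check is an operational tool serving the declarative aim."""
    if document_marking == "company private":
        return user in EMPLOYEES
    return True

assert may_read("alice", "company private")
assert not may_read("mallory", "company private")
```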

Continue reading Category errors in (information) security: how logic can help

Mathematical Modelling in the Two Cultures

Models, mostly based on mathematics of one kind or another, are used everywhere to help organizations make decisions about their design, policies, investment, and operations. They are indispensable.

But if modelling is such a great idea, and such a great help, why do so many things go wrong? Well, there’s good modelling and less good modelling. And it’s hard for the consumers of models — in companies, the Civil Service, government agencies — to know when they’re getting the good stuff. Worse, there’s a lot of comment and advice out there which at best doesn’t help, and perhaps makes things worse.

In 1959, the celebrated scientist and novelist C. P. Snow delivered the Rede Lecture on ‘The Two Cultures’. Snow later published a book developing the ideas as ‘The Two Cultures and the Scientific Revolution’.

A famous passage from Snow’s lecture is the following (it can be found on Wikipedia):

‘A good many times I have been present at gatherings of people who, by the standards of the traditional culture, are thought highly educated and who have with considerable gusto been expressing their incredulity at the illiteracy of scientists. Once or twice I have been provoked and have asked the company how many of them could describe the Second Law of Thermodynamics. The response was cold: it was also negative. Yet I was asking something which is the scientific equivalent of: Have you read a work of Shakespeare’s?

‘I now believe that if I had asked an even simpler question — such as, What do you mean by mass, or acceleration, which is the scientific equivalent of saying, Can you read? — not more than one in ten of the highly educated would have felt that I was speaking the same language. So the great edifice of modern physics goes up, and the majority of the cleverest people in the western world have about as much insight into it as their neolithic ancestors would have had.’

Over the decades since, society has come to depend upon mathematics, and on mathematical models in particular, to a very great extent. Alas, the mathematical sophistication of the great majority of consumers of models has not really improved. Perhaps it has even deteriorated.

So, as mathematicians and modellers, we need to make things work. The starting point for good modelling is communication with the client.

Continue reading Mathematical Modelling in the Two Cultures