MAMADROID: Detecting Android Malware by Building Markov Chains of Behavioral Models

Now making up 85% of mobile devices, Android smartphones have become profitable targets for cybercriminals, allowing them to bypass two factor authentication or steal sensitive information such as credit cards details or login credentials.

Smartphones have limited battery and memory available, therefore, the defences that can be deployed on them have limitations. For these reasons, malware detection is usually performed in a centralised fashion by Android market operators. As previous work and even recent news have shown, however, even Google Play Store is not able to detect all malicious apps; to make things even worse, there are countries in which Google Play Store is blocked. This forces users to resort to third party markets, which are usually performing less careful malware checks.

Previous malware detection studies focused on models based on permissions or on specific API calls. While the first method is prone to false positives, the latter needs constant retraining, because apps as well as the Android framework itself are constantly changing.

Our intuition is that, while malicious and benign apps may call the same API calls during their execution, the reason why those calls are made may be different, resulting in them being called in a different order. For this reason, we decided to rely on sequences of calls that, as explained later, we abstract to higher level for performance, feasibility, and robustness reasons. To implement this idea we created MaMaDroid, a system for Android malware detection.


MaMaDroid is built by combining four different phases:

  • Call graph extraction: starting from the apk file of an app, we extract the call graph of the analysed sample.
  • Sequence extraction: from the call graph, we extract the different potential paths as sequences of API calls and abstract all those calls to higher levels.
  • Markov Chain modelling: all the samples got their sequences of abstracted calls, and these sequences can be modelled as transitions among states of a Markov Chain.
  • Classification: Given the probabilities of transition between states of the chains as features set, we apply machine learning to detect malicious apps.
Four phases of MaMaDroid

Call graph extraction

MaMaDroid is a system based only on static analysis. To analyse the app, we use off-the-shelf tools, such as Soot and FlowDroid for the first step of the system.

Sequence Extraction

Taking the call graph as input, we extract the sequences of functions potentially called by the program and, by identifying the set of entry nodes, enumerate all the possible paths and output them as sequences of API calls.

Example call graph in which we can observe 3 different potential paths, or sequences, starting from the root node

Continue reading MAMADROID: Detecting Android Malware by Building Markov Chains of Behavioral Models

Battery Status Not Included: Assessing Privacy in W3C Web Standards

Designing standards with privacy in mind should be a standard in itself. Historically this was not always the case and the idea of designing systems with privacy is relatively new – it dates from the beginning of this decade. One of the milestones is accepting this view on the IETF level, dating 2013.

This note and the referenced research work focusses on designing standards with privacy in mind. The World Wide Web Consortium (W3C) is one of the most important standardization bodies. W3C is standardizing something everyone – billions of people – use every day for entertainment or work, the web and how web browsers work. It employs a focused, lengthy, but very open and consensus-driven environment in order to make sure that community voice is always heard during the drafting of new specifications. The actual inner workings of W3C are surprisingly complex. One interesting observation is that the W3C Process Document document does not actually mention privacy reviews at all. Although in practice this kind of reviews are always the case.

However, as the web along with its complexity and web browser features constantly grow, and as browsers and the web are expected to be designed in an ever-rapid manner, considering privacy becomes a challenge. But this is what is needed in order to obtain a good specification and well-thought browser features, with good threat models, and well designed privacy strategies.

My recent work, together with the Princeton team, Arvind Narayanan and Steven Englehardt, is providing insight into the standardization process at W3C, specifically the privacy areas of standardization. We point out a number of issues and room for improvements. We also provide recommendations as well as broad evidence of a wide misuse and abuse (in tracking and fingerprinting) of a browser mechanism called Battery Status API.

Battery Status API is a browser feature that was meant to allow websites access the information concerning the battery state of a user device. This seemingly innocuous mechanism initially had no identified privacy concerns. However, following my previous research work in collaboration with Gunes Acar this view has changed (Leaking Battery and a later note).

The work I authored in 2015 identifies a number or privacy risks of the API. Specifically – the issue of potential exploitation of the API to tracking, as well as the potential possibility of recovering the raw value of the battery capacity – something that was not foreseen to be accessed by web sites. Ultimately, this work has led to a fix in Firefox browser and an update to W3C specification. But the matter was not over yet.

Continue reading Battery Status Not Included: Assessing Privacy in W3C Web Standards

What the CIA hack and leak teaches us about the bankruptcy of current “Cyber” doctrines

Wikileaks just published a trove of documents resulting from a hack of the CIA Engineering Development Group, the part of the spying agency that is in charge of developing hacking tools. The documents seem genuine and catalog, among other things, a number of exploits against widely deployed commodity devices and systems, including Android, iPhone, OS X and Windows. Also smart TVs. This hack, with appropriate background, teaches us a lesson or two about the direction of public policy related to “cyber” in the US and the UK.

Routine proliferation of weaponry and tactics

The CIA hack is in many ways extraordinary, in that it allowed the attackers to gain access to the source code of the hacking tools of the agency – an extraordinary act of proliferation of attack technologies. In other ways, it is mundane in that it is neither the first, nor probably the last hack or leak of catastrophic proportions to occur to a US/UK government department in charge of offensive cyber operations.

This list of leaks of government attack technologies, illustrates that when it comes to cyber-weaponry the risk of proliferation is not merely theoretical, but very real. In fact it seems to be happening all the time.

I find it particularly amusing – and those in charge of those agencies should probably find it embarrassing – that NSA and GCHQ go around presenting themselves as national technical authorities in assurance; they provide advice to others on how to not get hacked; they keep asserting that they can be trusted to operate extremely dangerous spying infrastructures; and handle in secret extremely dangerous zero-day exploits. Yet, they seem to be routinely hacked and have their secret documents leaked. Instead of chasing whistleblowers and journalists, policy makers should probably take note that there is not a high-enough level of assurance to secure cyber-weaponry, and for sure it is not to be found within those agencies.

In fact the risk of proliferation is at the very heart of cyber attack, and integral to it, even without hacking or leaking from inside government. Many of us quietly laughed at the bureaucratic nightmare discussed in the recent CIA leak, describing the difficulty of classifying the cyber attack techniques while at the same time deploying them on target system. As the press release summarizes:

To attack its targets, the CIA usually requires that its implants communicate with their control programs over the internet. If CIA implants, Command & Control and Listening Post software were classified, then CIA officers could be prosecuted or dismissed for violating rules that prohibit placing classified information onto the Internet. Consequently the CIA has secretly made most of its cyber spying/war code unclassified.

This illustrates very clearly a key dynamic in hacking: once a hacker uses an exploit against an adversary system, there is a very real risk the exploit is captured by monitoring and intrusion detection systems of the target, and then weponized to hack other computers, at a low cost. This is very well established and researched, and such “honey pot” infrastructures have been used in the academic and commercial community for some time to detect and study potentially new attacks. This is not the premise of sophisticated defenders, the explanation of how honeypots work is on Wikipedia! The Flame malware, and Stuxnet before, were in fact found in the wild.

In that respect cyber-war is not like war at all. The weapons you use will be turned against you immediately, and your effective use of weapons relies on your very own infrastructures being utterly vulnerable to them.

What “Cyber” doctrine?

The constant leaks and hacks, leading to proliferation of exploits and hacking tools from the heart of government, as well through operations, should deeply inform policy makers when making choices about “cyber” doctrines. First, it is probably time to ditch the awkward term “Cyber”.

Continue reading What the CIA hack and leak teaches us about the bankruptcy of current “Cyber” doctrines