#import "../lib.typ": todo, etal, eg
|
|
#import "X_var.typ": *
|
|
#import "X_lib.typ": *
|
|
|
|
== Methodology <sec:rasta-methodology>

=== Collecting Tools

#figure({
  show table: set text(size: 0.80em)
  show "#etal": etal
  let show_citekeys(keys) = [
    #keys.split(",").map(
      citekey => cite(label(citekey))).join([]
    ) (#keys.split(",").map(
      citekey => cite(label(citekey), form: "year")
    ).join([]))
  ]
  table(
    columns: 7,
    inset: (x: 0% + 5pt, y: 0% + 2pt),
    stroke: none,
    align: center + horizon,
    table.hline(),
    table.header(
      table.cell(colspan: 7, inset: 3pt)[],
      table.cell(rowspan: 2)[*Tool*],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(colspan: 3)[*Availability*],
      table.vline(end: 3),
      table.vline(start: 4),
      [*Repo*],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(rowspan: 2)[*Decision*],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(rowspan: 2)[*Comments*],

      [Bin],
      [Src],
      [Doc],
      [type],
    ),
    table.cell(colspan: 7, inset: 3pt)[],
    table.hline(),
    table.cell(colspan: 7, inset: 3pt)[],
    ..rasta_tool_data
      .map(entry => (
        [#entry.tool #show_citekeys(entry.citekey)],
        str2sym(entry.binary),
        str2sym(entry.source),
        str2sym(entry.documentation),
        link(entry.url, entry.repo),
        str2sym(entry.decision),
        entry.why,
      )).flatten(),
    table.cell(colspan: 7, inset: 3pt)[],
    table.hline(),
    table.cell(colspan: 7, inset: 3pt)[],
    table.hline(),
  )
  [
    *binaries, sources*: #nr: not relevant, #ok: available, #bad: partially available, #ko: not provided\
    *documentation*: #okk: excellent, with MWE, #ok: few inconsistencies, #bad: bad quality, #ko: not available\
    *decision*: #ok: considered; #bad: considered but not built; #ko: out of scope of the study
  ]},
  caption: [Considered tools @Li2017: availability and usage reliability],
) <tab:rasta-tools>

We collected the static analysis tools from @Li2017, plus one additional tool encountered during our review of the state of the art (DidFail @klieberAndroidTaintFlow2014).
They are listed in @tab:rasta-tools, with the original release date and associated paper.
We intentionally limited the collected tools to the ones selected by Li #etal @Li2017 for several reasons.
First, excluding recent tools guarantees a gap of at least five years between a tool's publication and the most recent APK files, which lets us measure the reusability of past contributions over a reasonable period of time.
Second, collecting new tools would require describing them in depth, similarly to what was done by Li #etal @Li2017, which is not the primary goal of this paper.
Additionally, selecting a subset of more recent tools would require criteria such as the publication venue or the number of citations, and thus an additional methodology.
These possible contributions are left for future work.

Some tools use hybrid analysis (both static and dynamic): A3E @DBLPconfoopslaAzimN13, A5 @vidasA5AutomatedAnalysis2014, Android-app-analysis @geneiatakisPermissionVerificationApproach2015, StaDynA @zhauniarovichStaDynAAddressingProblem2015.
They have been excluded from this paper.
We manually searched for the tool repository when the website mentioned in the paper was no longer available (#eg when the repository had been migrated from Google Code to GitHub), and for each tool we looked for:

- an optional binary version of the tool that could be used as a fallback (if the sources cannot be compiled for any reason);
- the source code of the tool;
- the documentation for building and using the tool, with a MWE (Minimal Working Example).

In @tab:rasta-tools we rated the quality of these artifacts with "#ok" when available, possibly with minor inconsistencies, "#bad" when too many inconsistencies (inaccurate remarks about the sources, dead links or missing parts) were found, "#ko" when nothing was found, and a double "#okk" for the documentation when it covers all our expectations (building process, usage, MWE).
Results show that documentation is often missing or of very poor quality (#eg Lotrack), which makes the rebuilding process and the first analysis of a MWE very complex.

We finally excluded Choi #etal @CHOI2014620, as their tool works on the sources of Android applications, and Poeplau #etal @DBLPconfndssPoeplauFBKV14, which focuses on Android hardening.
In summary, we end up with #nbtoolsselected tools to compare.
Some specific cases should be noted.
The IC3 tool is duplicated in our experiments because two versions are available: the original version from the authors and a fork used by other tools such as IccTa.
For Androguard, the default task consists of unpacking the bytecode, the resources, and the Manifest.
Cross-references are also built between methods and classes.
Because such a task is relatively simple to perform, we decided to duplicate this tool and ask Androguard to decompile an APK and create a control flow graph of the code using its decompiler, DAD.
We refer to this usage variant as androguard_dad.
Because Thresher and Lotrack cannot be built, we excluded them from the experiments.

Finally, starting from the #nbtools tools of @tab:rasta-tools and adding the two variations of IC3 and Androguard, we have in total #nbtoolsselectedvariations static analysis tools to evaluate, of which two cannot be built and will be considered as always failing.

=== Source Code Selection and Building Process

#figure({
  show table: set text(size: 0.80em)
  show "#etal": etal
  let show_citekeys(keys) = [
    #keys.split(",").map(
      citekey => cite(label(citekey))).join([]
    )
  ]
  table(
    columns: 8,
    inset: (x: 0% + 5pt, y: 0% + 2pt),
    stroke: none,
    align: center + horizon,
    table.hline(),
    table.header(
      table.cell(colspan: 8, inset: 3pt)[],
      table.cell(rowspan: 2)[*Tool*],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(colspan: 2)[*Origin*],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(colspan: 2)[*Alive Forks*],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(rowspan: 2)[*Last Commit \ Date*],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(rowspan: 2)[*Authors \ Reached*],
      table.vline(end: 3),
      table.vline(start: 4),
      [*Environment*],

      [Stars],
      [Alive],
      [Nb],
      [Usable],
      [Language -- OS],
    ),
    table.cell(colspan: 8, inset: 3pt)[],
    table.hline(),
    table.cell(colspan: 8, inset: 3pt)[],
    ..rasta_tool_data
      .filter(entry => entry.exclude != "EXCLUDE")
      .map(entry => (
        [#entry.tool #show_citekeys(entry.citekey)],
        entry.stars,
        str2sym(entry.alive),
        entry.nbaliveforks,
        str2sym(entry.forkusable),
        entry.selecteddate,
        str2sym(entry.authorconfirmed),
        [#entry.lang -- #entry.os]
      )).flatten(),
    table.cell(colspan: 8, inset: 3pt)[],
    table.hline(),
    table.cell(colspan: 8, inset: 3pt)[],
    table.hline(),
  )
  [#ok: yes, #ko: no, UX.04: Ubuntu X.04]},
  caption: [Selected tools, forks, selected commits and running environment],
) <tab:rasta-sources>

In a second step, we explored which sources to select among the possible forks of each tool.
We reported some indicators about the explored forks and our decision about the selected one in @tab:rasta-sources.
For each original source code repository (column "Origin"), we reported in @tab:rasta-sources the number of GitHub stars given by users and whether the project is still alive (#ok in the Alive column when a commit exists within the last two years).
Then, we analyzed the fork tree of the project.
We recursively searched whether any forked repository contains a commit more recent than the last one on the branch mentioned in the documentation of the original repository.
If such a commit is found (their number is reported in the "Alive Forks Nb" column), we manually looked at the reasons behind this commit and considered whether we should prefer this more up-to-date repository over the original one (column "Alive Forks Usable").
As reported in @tab:rasta-sources, we excluded all forks, because they contain experimental code with no guarantee of stability, except for IC3, for which we selected the fork JordanSamhi/ic3.
For example, a fork of Aparecium contains a port to Windows 7, which does not suggest an improvement in the stability of the tool.
For IC3, the fork seems promising: it has been updated to run on a recent operating system (Ubuntu 22.04 instead of Ubuntu 12.04 for the original version) and is used as a dependency by IccTa.
We decided to keep these two versions of the tool (IC3 and IC3_fork) to compare their results.
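
To give an idea of how this fork exploration can be automated, the following sketch queries the GitHub API for forks holding commits newer than a given date (a hypothetical script: the `newer_forks` helper, the recursion depth and the placeholder repository are ours and only approximate the manual inspection described above):

```python
import requests

API = "https://api.github.com/repos"

def newer_forks(owner, repo, last_commit_date, depth=0, max_depth=3):
    """Recursively list forks that hold commits newer than the original branch."""
    if depth > max_depth:
        return []
    resp = requests.get(f"{API}/{owner}/{repo}/forks", timeout=30)
    resp.raise_for_status()
    candidates = []
    for fork in resp.json():
        fork_owner, fork_name = fork["full_name"].split("/")
        # Commits pushed to the fork after the last commit of the original branch.
        commits = requests.get(
            f"{API}/{fork_owner}/{fork_name}/commits",
            params={"since": last_commit_date},
            timeout=30,
        ).json()
        if commits:
            candidates.append(fork["full_name"])
        # Forks of forks may also contain newer commits.
        candidates += newer_forks(fork_owner, fork_name, last_commit_date,
                                  depth + 1, max_depth)
    return candidates

# Placeholder repository and cut-off date (unauthenticated calls are rate-limited):
print(newer_forks("some-org", "some-tool", "2015-01-01T00:00:00Z"))
```
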
Then, we self-allocated a maximum of four days for each tool to successfully read and follow the documentation, compile the tool and obtain the expected result when analyzing a MWE.
We sent an email to the authors of each tool to confirm that we used the most suitable version of the code and the most suitable command line to analyze an application, and, in some cases, to request help with issues in the building process.
We reported in @tab:rasta-sources the authors who answered our request and confirmed our decisions.

From this building phase, several observations can be made.
On a recent operating system, it is almost impossible to rebuild, in a reasonable amount of time, a tool released years ago.
Too many dependencies, even for Java-based programs, trigger compilation or execution problems.
Thus, if the documentation mentions a specific operating system, we use a Docker image of this OS.
// For example, Dare is a dependency of several tools (Didfail, IC3) and depends on 32 bits libraries such as lib32stdc++ and ia32-libs.
// Those libraries are only available on Ubuntu 12 or previous versions.
//
Most of the time, tools require additional external components to be fully functional.
These can be resources such as the android.jar file for each version of the SDK, a database, additional libraries or tools.
Depending on the quality of the documentation, setting up those components can take hours to days.
This is why we automated in a Dockerfile the setup of the environment in which each tool is built and run#footnote[To guarantee reproducibility, we published the results, datasets, Dockerfiles and containers: https://github.com/histausse/rasta, https://zenodo.org/records/10144014, https://zenodo.org/records/10980349 and on Docker Hub as `histausse/rasta-<toolname>:icsr2024`].

=== Runtime Conditions

#figure(
  image(
    "figs/running.svg",
    width: 80%,
    alt: "A diagram representing the methodology. The word 'Tool' is linked to a box labeled 'Docker image' by an arrow labeled 'building'. The box 'Docker image' is linked to a box labeled 'Singularity image' by an arrow labeled 'conversion'. The box 'Singularity image' is linked to a box labeled 'Execution monitoring' by a dotted arrow labeled 'Manual tests' and to an image of a server labeled 'Singularity cluster' by an arrow labeled 'deployment'. An image of three Android logos labeled 'apks' is also linked to the 'Singularity cluster' by an arrow labeled 'running the tool analysis'. The 'Singularity cluster' image is linked to the 'Execution monitoring' box by an arrow labeled 'log capture'. The 'Execution monitoring' box is linked to the words 'Exit status' by an unlabeled arrow.",
  ),
  caption: [Methodology overview],
) <fig:rasta-overview>

As shown in @fig:rasta-overview, before benchmarking the tools, we built and installed them in Docker containers to facilitate reuse by other researchers.
We then converted them into Singularity containers, because we had access to a Singularity cluster and because this technology is often used by the HPC community to ensure the reproducibility of experiments.
//The Docker container allows a user to interact more freely with the bundled tools.
//Then, we converted this image to a Singularity image.
We performed manual tests using these Singularity images to check:

- the location where the tool writes on disk. For the best performance, we expect the tools to write on a mount point backed by an SSD. Some tools wrote data at unexpected locations, which required small patches on our side.
- the amount of memory allocated to the tool. We checked that the tool could run a MWE within a #ramlimit RAM limit.
- the network connections opened by the tool, if any. We expect the tool not to perform any network operation, such as downloading Android SDKs. Thus, we prepared the required files and cached them in the images during the building phase. In a few cases, we patched the tool to disable the download of resources.

A test campaign consists in executing the #nbtoolsvariationsrun selected tools on all APKs of a dataset.
The constraints applied on the cluster are the following (a sketch of how such limits can be enforced is given after the list):

- No network connection is authorized, in order to limit any execution of malicious software.
- The allocated RAM for a task is #ramlimit.
- The maximum allocated time is 1 hour.
- The allocated object space / stack space is 64 GB / 16 GB if the tool is a Java-based program.
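
The following minimal sketch illustrates how such per-task limits can be enforced when launching one analysis; it is an illustration only (the command line, the 64 GB value used as the memory budget and the `run_one_apk` helper are assumptions, not the exact cluster configuration):

```python
import resource
import subprocess

RAM_LIMIT = 64 * 1024**3   # illustrative memory budget, in bytes
TIME_LIMIT = 3600          # 1 hour per (tool, APK) pair, in seconds

def limit_memory():
    # Applied in the child process before exec: cap its address space.
    resource.setrlimit(resource.RLIMIT_AS, (RAM_LIMIT, RAM_LIMIT))

def run_one_apk(tool_cmd, apk):
    """Run one tool on one APK and return its exit status (illustrative helper)."""
    try:
        completed = subprocess.run(
            tool_cmd + [apk],
            preexec_fn=limit_memory,
            timeout=TIME_LIMIT,
            capture_output=True,
        )
        return completed.returncode
    except subprocess.TimeoutExpired:
        return "timeout"

# Example with a Java-based tool: the heap is capped via -Xmx (stack settings omitted).
status = run_one_apk(["java", "-Xmx64g", "-jar", "tool.jar"], "app.apk")
```
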
For the disk files, we use a mount point stored on an SSD, with no particular size limit.
Note that, because an allocation of #ramlimit could be insufficient for some tools, we evaluated the results of the tools on 20% of our dataset (described later in @sec:rasta-dataset) with both 128 GB of RAM and #ramlimit of RAM, and checked that the results were similar.
With this confirmation, we continued our evaluations with #ramlimit of RAM only.

=== Dataset <sec:rasta-dataset>

/*
DATASET

First-seen year: not in the official Androzoo databases; we use the minimum of the AndroZoo "added" date and the date of the VT analysis.

Years: 2010 to 2023

7% of malware

0 detections in VT: goodware
5+ => malware
0-5 detections: excluded


Size ranges are Androzoo deciles (minus the 1% extremes).
For each year and each size range, 500 applications are randomly selected (with the right proportion of malware) = one bucket.

Problem: this is not representative of the population: there are probably not 7% of malware in each Androzoo decile for each year.
Problem 2: for sampling we use APK size deciles, but for our plots we use DEX file size deciles.

500*10*14 = 70000

*/

// Two datasets are used in the experiments of this section.
// The first one is *Drebin* @Arp2014, from which we extracted the malware part (5479 samples that we could retrieve) for comparison purposes only.
// It is a well-known and very old dataset that should not be used anymore because it contains temporal and spatial biases @Pendlebury2018.
// We intend to compare the rate of success on this old dataset with a more recent one.
// The second one,
We built a dataset named *Rasta* to cover all dates between 2010 and 2023.
This dataset is a random extract of Androzoo @allixAndroZooCollectingMillions2016, in which we balanced applications across years and sizes.
For each year and each inter-decile size range of Androzoo, 500 applications have been extracted, with an arbitrary proportion of 7% of malware.
This ratio has been chosen because it is the goodware/malware ratio that we observed when performing a raw extract of Androzoo.
To check the maliciousness of an Android application, we rely on the VirusTotal detection indicators.
If more than 5 antivirus engines have flagged the application as malicious, we consider it a malware.
If no antivirus engine has reported the application as malicious, we consider it a goodware.
Applications in between are dropped.
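
As an illustration, the labelling rule and the per-bucket sampling can be sketched as follows (hypothetical code: the `vt_detections` field, the `sample_bucket` helper and the variable names are ours, not part of the Androzoo tooling):

```python
import random

MALWARE_THRESHOLD = 5   # more than 5 detections => malware
MALWARE_RATIO = 0.07    # arbitrary proportion of malware per bucket
BUCKET_SIZE = 500       # applications per (year, size decile) bucket

def label(vt_detections):
    """Label an APK from its VirusTotal detection count, or drop it."""
    if vt_detections == 0:
        return "goodware"
    if vt_detections > MALWARE_THRESHOLD:
        return "malware"
    return None  # in-between applications are dropped

def sample_bucket(apks):
    """Sample one bucket of 500 applications containing 7% of malware."""
    goodware = [a for a in apks if label(a["vt_detections"]) == "goodware"]
    malware = [a for a in apks if label(a["vt_detections"]) == "malware"]
    nb_malware = round(BUCKET_SIZE * MALWARE_RATIO)
    return (random.sample(malware, nb_malware)
            + random.sample(goodware, BUCKET_SIZE - nb_malware))
```
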
For computing the release date of an application, we contacted the authors of Androzoo to compute the minimum between the date of submission to Androzoo and the date of first upload to VirusTotal.
Such a computation is more reliable than using the DEX date, which is often obfuscated when packaging the application.
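
For illustration, assuming both timestamps are available as strings (the format and field values below are hypothetical), the computation boils down to taking the earliest of the two dates:

```python
from datetime import datetime

def release_date(androzoo_added, vt_first_seen):
    """Estimate the release date as the earliest of the two timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return min(datetime.strptime(androzoo_added, fmt),
               datetime.strptime(vt_first_seen, fmt))

# Hypothetical timestamps for one APK:
print(release_date("2016-03-01 10:22:05", "2015-12-24 08:00:00"))
```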

// #todo[Transition] // no more space :-(