
#import "../lib.typ": APK, etal, ART, SDK, DEX, eg,
#import "../lib.typ": todo, jm-note, jfl-note
#import "@preview/diagraph:0.3.5": raw-render

//== Android Reverse Engineering Techniques <sec:bg-techniques>

//#todo[swap with tool section ?]


== Static Analysis <sec:bg-static>

Over the past fifteen years, the research community has released many tools to detect or analyse malicious behaviors in applications.
Two main approaches can be distinguished: static and dynamic analysis~@Li2017.
Dynamic analysis requires running the application in a controlled environment to observe runtime values and/or interactions with the operating system.
For example, an Android emulator with a patched kernel can capture these interactions, but applying the required modifications is not a trivial task.
Such an approach is limited by the time required to execute even a limited part of the application, with no guarantee on the code coverage obtained.
Dynamic analysis is also limited by evasion techniques that may prevent the execution of the malicious parts of the code.
As a consequence, a lot of effort has been put into static approaches. //, which is the focus of this paper.

Static analysis programs examine an #APK file without executing it in order to extract information from it.
Basic static analysis can include extracting information from the `AndroidManifest.xml` file or decompiling bytecode to Java code.
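
As an illustration of such a basic inspection, the sketch below lists the bytecode files contained in an #APK without executing anything, relying only on the fact that an #APK is a ZIP archive.
The class name `ApkEntries`, its helper methods and the loose file name pattern are assumptions made for this example, and are not taken from any existing tool.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ApkEntries {
    // Loose pattern (an assumption of this sketch): classes.dex, classes2.dex, ...
    // located at the root of the archive.
    private static final Pattern DEX_NAME = Pattern.compile("classes\\d*\\.dex");

    // True if the archive entry name looks like a top-level bytecode file.
    static boolean isDexEntry(String entryName) {
        return DEX_NAME.matcher(entryName).matches();
    }

    // Lists the bytecode files of an APK by reading its ZIP directory only.
    static List<String> dexEntries(String apkPath) throws IOException {
        List<String> result = new ArrayList<>();
        try (ZipFile zip = new ZipFile(Paths.get(apkPath).toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (isDexEntry(name)) {
                    result.add(name);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ApkEntries path/to/app.apk
        for (String name : dexEntries(args[0])) {
            System.out.println(name);
        }
    }
}
```

A real tool would then parse each listed file, as well as the binary `AndroidManifest.xml`, both of which require dedicated parsers that are out of the scope of this sketch.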

More advanced analyses consist in computing the control-flow of an application and its data-flow~@Li2017.

The most basic form of control-flow analysis is to build a call graph.
A call graph is a graph where the nodes represent the methods in the application, and the edges represent calls from one method to another.
@fig:bg-fizzbuzz-cg-cfg b) shows the call graph of the code in @fig:bg-fizzbuzz-cg-cfg a).
A more advanced control-flow analysis consists in building the control-flow graph.
This time, instead of methods, the nodes represent instructions, and the edges indicate which instruction can follow which instruction.
@fig:bg-fizzbuzz-cg-cfg c) represents the control-flow graph of @fig:bg-fizzbuzz-cg-cfg a), with code statements instead of bytecode instructions.

#todo[Add alt text for @fig:bg-fizzbuzz-cg and @fig:bg-fizzbuzz-cfg]

#figure({
  set align(center)
  stack(dir: ttb,[
  #figure(
    ```java
    public static void fizzBuzz(int n) {
      for (int i = 1; i <= n; i++) {
        if (i % 3 == 0 && i % 5 == 0) {
          Buzzer.fizzBuzz();
        } else if (i % 3 == 0) {
          Buzzer.fizz();
        } else if (i % 5 == 0) {
          Buzzer.buzz();
        } else {
          Log.e("fizzbuzz", String.valueOf(i));
        }
      }
    }
    ```,
    supplement: none,
    kind: "bg-fizzbuzz-cg-cfg subfig",
    caption: [a) A Java program],
  ) <fig:bg-fizzbuzz-java>], v(2em), stack(dir: ltr, [
  #figure(
    raw-render(```
      digraph {
        rankdir=LR
        "fizzBuzz(int)" -> "Buzzer.fizzBuzz()"
        "fizzBuzz(int)" -> "Buzzer.fizz()"
        "fizzBuzz(int)" -> "Buzzer.buzz()"
        "fizzBuzz(int)" -> "String.valueOf(int)"
        "fizzBuzz(int)" -> "Log.e(String, String)"
      }
      ```,
      width: 40%,
      alt: "",
    ),
    supplement: none,
    kind: "bg-fizzbuzz-cg-cfg subfig",
    caption: [b) Corresponding Call Graph]
  ) <fig:bg-fizzbuzz-cg>],[
  #figure(
    raw-render(```
      digraph {
        l1
        l2
        l3
        l4
        l5
        l6
        l7
        l9

        l1 -> l2
        l2 -> l3
        l3 -> l1
        l2 -> l4
        l4 -> l5
        l5 -> l1
        l4 -> l6
        l6 -> l7
        l7 -> l1
        l6 -> l9
        l9 -> l1
      }
      ```,
      labels: (
        "l1": `for (int i = 1; i <= n; i++) {`,
        "l2": `if (i % 3 == 0 && i % 5 == 0) {`,
        "l3": `Buzzer.fizzBuzz();`,
        "l4": `} else if (i % 3 == 0) {`,
        "l5": `Buzzer.fizz();`,
        "l6": `} else if (i % 5 == 0) {`,
        "l7": `Buzzer.buzz();`,
        "l9": `Log.e("fizzbuzz", String.valueOf(i));`,
      ),
      width: 50%,
      alt: "",
    ),
    supplement: none,
    kind: "bg-fizzbuzz-cg-cfg subfig",
    caption: [c) Corresponding Control-Flow Graph]
  ) <fig:bg-fizzbuzz-cfg>]))
  h(1em)},
  supplement: [Figure],
  caption: [Source code for a simple Java method and its Call and Control Flow Graphs],
)<fig:bg-fizzbuzz-cg-cfg>
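
To make the notion of call graph more concrete, the following minimal sketch stores the call graph of @fig:bg-fizzbuzz-cg-cfg b) as an adjacency map and computes the methods reachable from an entry point.
The `CallGraph` class and its method names are hypothetical and only serve this illustration; actual tools build the graph from the bytecode rather than by hand.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class CallGraph {
    // Maps each caller to the set of methods it calls.
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    // Records an edge "caller -> callee".
    void addCall(String caller, String callee) {
        edges.computeIfAbsent(caller, k -> new LinkedHashSet<>()).add(callee);
    }

    // Returns every method transitively reachable from entryPoint (itself included).
    Set<String> reachableFrom(String entryPoint) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        worklist.push(entryPoint);
        while (!worklist.isEmpty()) {
            String method = worklist.pop();
            if (seen.add(method)) {
                for (String callee : edges.getOrDefault(method, Collections.<String>emptySet())) {
                    worklist.push(callee);
                }
            }
        }
        return seen;
    }

    // The edges of the call graph drawn for the fizzBuzz method.
    static CallGraph fizzBuzzExample() {
        CallGraph graph = new CallGraph();
        graph.addCall("fizzBuzz(int)", "Buzzer.fizzBuzz()");
        graph.addCall("fizzBuzz(int)", "Buzzer.fizz()");
        graph.addCall("fizzBuzz(int)", "Buzzer.buzz()");
        graph.addCall("fizzBuzz(int)", "String.valueOf(int)");
        graph.addCall("fizzBuzz(int)", "Log.e(String, String)");
        return graph;
    }

    public static void main(String[] args) {
        for (String method : fizzBuzzExample().reachableFrom("fizzBuzz(int)")) {
            System.out.println(method);
        }
    }
}
```

A control-flow graph can be stored in the same way, with statements or instructions as nodes instead of methods.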

Once the control-flow graph is computed, it can be used to compute data-flows.
Data-flow analysis, also called taint-tracking, makes it possible to follow the flow of information in the application.
By defining a list of methods and fields that can generate critical information (taint sources) and a list of methods that can consume information (taint sinks), taint-tracking can detect potential data leaks (if a data-flow links a taint source to a taint sink).
For example, `TelephonyManager.getImei()` returns a unique, persistent device identifier.
This can be used to identify the user, and it cannot be changed if #jfl-note[compromised][replace by: this imei is dislaxd (illegible) \ jm: ???].
This makes `TelephonyManager.getImei()` a good candidate as a taint source.
On the other hand, `UrlRequest.start()` sends a request to an external server, making it a taint sink.
If a data-flow is found linking `TelephonyManager.getImei()` to `UrlRequest.start()`, this means the application is potentially leaking critical information to an external entity, a behavior that is probably not wanted by the user.
Data-flow analysis is the subject of many contributions~@weiAmandroidPreciseGeneral2014 @titzeAppareciumRevealingData2015 @bosuCollusiveDataLeak2017 @klieberAndroidTaintFlow2014 @DBLPconfndssGordonKPGNR15 @octeauCompositeConstantPropagation2015 @liIccTADetectingInterComponent2015, the most notable tool being Flowdroid~@Arzt2014a.
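
The principle can be sketched on a toy, straight-line program in which each statement has the form `target = callee(argument)`.
This is a deliberately simplified model, not the algorithm of any of the cited tools: the `ToyTaint` and `Stmt` names are hypothetical, there is no branching, no aliasing and no field access, and `UrlRequest.start` is treated as if it received the leaked data directly as an argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyTaint {
    // One statement "target = callee(argument)"; target and argument may be null.
    static final class Stmt {
        final String target;
        final String callee;
        final String argument;

        Stmt(String target, String callee, String argument) {
            this.target = target;
            this.callee = callee;
            this.argument = argument;
        }
    }

    // Assumed source and sink lists, following the example given in the text.
    static final Set<String> SOURCES = Collections.singleton("TelephonyManager.getImei");
    static final Set<String> SINKS = Collections.singleton("UrlRequest.start");

    // Forward propagation: returns the sink calls that receive a tainted variable.
    static List<Stmt> findLeaks(List<Stmt> program) {
        Set<String> tainted = new HashSet<>();
        List<Stmt> leaks = new ArrayList<>();
        for (Stmt stmt : program) {
            boolean argTainted = stmt.argument != null && tainted.contains(stmt.argument);
            if (SINKS.contains(stmt.callee) && argTainted) {
                leaks.add(stmt);
            }
            if (stmt.target != null) {
                if (SOURCES.contains(stmt.callee) || argTainted) {
                    // Conservative rule: any result computed from tainted data is tainted.
                    tainted.add(stmt.target);
                } else {
                    // The variable is overwritten with a value unrelated to any source.
                    tainted.remove(stmt.target);
                }
            }
        }
        return leaks;
    }

    // id = getImei(); msg = String.valueOf(id); UrlRequest.start(msg)
    static List<Stmt> leakyProgram() {
        return Arrays.asList(
            new Stmt("id", "TelephonyManager.getImei", null),
            new Stmt("msg", "String.valueOf", "id"),
            new Stmt(null, "UrlRequest.start", "msg"));
    }

    public static void main(String[] args) {
        System.out.println("leaks found: " + findLeaks(leakyProgram()).size());
    }
}
```

Real analyses propagate taint along the edges of the control-flow graph and across method calls, which is where most of the complexity of the tools cited above lies.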

#todo[Describe the different contributions in relations to the issues they tackle, be more critical]

Static analysis is powerful, as it can detect unwanted behavior in an application even if the behavior does not manifest itself when running the application.
However, static analysis tools must overcome many challenges when analysing Android applications:
/ the Java object-oriented paradigm: A call to a method can in fact correspond to a call to any method overriding the original method in subclasses.
/ the multiplicity of entry points: Each component of an application can be an entry point for the application.
/ the event driven architecture: Methods of the application can be called when events occur, in an unknown order.
/ the interleaving of native code and bytecode: Native code can be called from bytecode and vice versa, but tools often only handle one of those formats.
/ the potential dynamic code loading: An application can run code that was not originally in the application.
/ the use of reflection: Methods can be called from their name as a string object, which is difficult to identify statically.
/ the continual evolution of Android: Each new version of Android brings new features that analysis tools must be aware of.
  For instance, the multi-dex feature presented in @sec:bg-android-code-format was introduced in Android #SDK 21.
  Tools unaware of this feature only analyse the `classes.dex` file and ignore all other `classes<n>.dex` files.
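
Two of these challenges, method overriding and reflection, can be illustrated with a few lines of plain Java.
In the sketch below, whose class and method names are invented for the example, the call in `ring` may resolve to either implementation of `sound`, and the call in `callByName` does not mention its target at all: the method name only exists as a string computed at runtime.

```java
import java.lang.reflect.Method;

public class HardCalls {
    public static class Buzzer {
        public String sound() {
            return "buzz";
        }
    }

    public static class LoudBuzzer extends Buzzer {
        @Override
        public String sound() {
            return "BUZZ";
        }
    }

    // Virtual dispatch: the static type is Buzzer, but the method actually
    // executed depends on the runtime class of the argument.
    static String ring(Buzzer buzzer) {
        return buzzer.sound();
    }

    // Reflection: the callee is selected from a string, so no call edge
    // towards sound() appears explicitly in the code of this method.
    static String callByName(Object receiver, String methodName) {
        try {
            Method method = receiver.getClass().getMethod(methodName);
            return (String) method.invoke(receiver);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflective call failed", e);
        }
    }

    // Builds the name "sound" at runtime to mimic a simple form of string obfuscation.
    static String hiddenName() {
        return new StringBuilder("dnuos").reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(ring(new LoudBuzzer()));
        System.out.println(callByName(new Buzzer(), hiddenName()));
    }
}
```

A sound call graph has to include both implementations of `sound` as possible targets of `ring`, and resolving the target of `callByName` requires recovering the value of the string, which is not always possible statically.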

Many of these more advanced tools rely on common support tools to interact with Android applications or #DEX bytecode~@Li2017.
Recurring examples of such support tools are Apktool (#eg Amandroid~@weiAmandroidPreciseGeneral2014, Blueseal~@shenInformationFlowsPermission2014, SAAF~@hoffmannSlicingDroidsProgram2013), Androguard (#eg Adagio~@gasconStructuralDetectionAndroid2013, Apparecium~@titzeAppareciumRevealingData2015, Mallodroid~@fahlWhyEveMallory2012) or Soot (#eg Blueseal~@shenInformationFlowsPermission2014, DroidSafe~@DBLPconfndssGordonKPGNR15, Flowdroid~@Arzt2014a).

The number of publications related to static analysis can make it difficult to find the right tool for the right task.
Li #etal~@Li2017 published a systematic literature review of Android static analysis covering work published before May 2015.
They analysed 92 publications and classified them by goal, by method used to solve the problem, and by underlying technical solution for handling the bytecode when performing the static analysis.
In particular, they listed 27 approaches with an open-source implementation available.
Nevertheless, they did not perform experiments to evaluate the reusability of the software they listed.
#jfl-note[We believe that the effort of reviewing the literature to build a comprehensive overview of available approaches should be pushed further: a published approach whose software cannot be used for technical reasons endangers both the reproducibility and the reusability of research.][To be highlighted?]
In the next section, we will look at the work that has been done to evaluate different analysis tools.