
#import "../lib.typ": APK, etal, ART, SDK, DEX, eg, midskip
#import "../lib.typ": todo, jm-note, jfl-note
#import "@preview/diagraph:0.3.5": raw-render

=== Static Analysis <sec:bg-static>

A static analysis program examines an #APK file without executing it, in order to extract information about the application.
Basic static analysis can consist of extracting information from the `AndroidManifest.xml` file, disassembling the bytecode to Smali with Apktool, or decompiling it to Java code with Jadx.
Unfortunately, manually reading the resulting code does not scale.
It requires a human analyst, which makes analysing a large number of applications impractical, and even for a single application, the size and complexity of the code can quickly overwhelm the reverse engineer.

Control-flow analysis is often used to mitigate this issue.
The idea is to extract the behaviour, the flow, of the application from the bytecode, and to represent it as a graph.
A graph representation is easier to work with than a list of instructions and can be used for further analysis.
Depending on the level of precision required, different types of graphs can be computed.
The most basic of those graphs is the call graph.
A call graph is a graph where the nodes represent the methods in the application, and the edges represent calls from one method to another.
@fig:bg-fizzbuzz-cg-cfg b) shows the call graph of the code in @fig:bg-fizzbuzz-cg-cfg a).
A more advanced control-flow analysis consists of building the control-flow graph.
This time, instead of methods, the nodes represent instructions, and an edge from one instruction to another indicates that the second can be executed directly after the first.
@fig:bg-fizzbuzz-cg-cfg c) represents the control-flow graph of @fig:bg-fizzbuzz-cg-cfg a), with code statements instead of bytecode instructions.


#figure({
  set align(center)
  stack(dir: ttb,[
  #figure(
    ```java
    public static void fizzBuzz(int n) {
      for (int i = 1; i <= n; i++) {
        if (i % 3 == 0 && i % 5 == 0) {
          Buzzer.fizzBuzz();
        } else if (i % 3 == 0) {
          Buzzer.fizz();
        } else if (i % 5 == 0) {
          Buzzer.buzz();
        } else {
          Log.e("fizzbuzz", String.valueOf(i));
        }
      }
    }
    ```,
    supplement: none,
    kind: "bg-fizzbuzz-cg-cfg subfig",
    caption: [a) A Java program],
  ) <fig:bg-fizzbuzz-java>], v(2em), stack(dir: ltr, [
  #figure(
    raw-render(```
      digraph {
        rankdir=LR
        "fizzBuzz(int)" -> "Buzzer.fizzBuzz()"
        "fizzBuzz(int)" -> "Buzzer.fizz()"
        "fizzBuzz(int)" -> "Buzzer.buzz()"
        "fizzBuzz(int)" -> "String.valueOf(int)"
        "fizzBuzz(int)" -> "Log.e(String, String)"
      }
      ```,
      width: 40%,
      alt: "An oriented graph with arrows going from \"fizzBuzz(int)\" to \"Buzzer.fizzBuzz()\", \"Buzzer.fizz()\", \"Buzzer.buzz()\", \"String.valueOf(int)\", and \"Log.e(String, String)\"",
    ),
    supplement: none,
    kind: "bg-fizzbuzz-cg-cfg subfig",
    caption: [b) Corresponding Call Graph]
  ) <fig:bg-fizzbuzz-cg>],[
  #figure(
    raw-render(```
      digraph {
        l1
        l2
        l3
        l4
        l5
        l6
        l7
        l9

        l1 -> l2
        l2 -> l3
        l3 -> l1
        l2 -> l4
        l4 -> l5
        l5 -> l1
        l4 -> l6
        l6 -> l7
        l7 -> l1
        l6 -> l9
        l9 -> l1
      }
      ```,
      labels: (
        "l1": `for (int i = 1; i <= n; i++) {`,
        "l2": `if (i % 3 == 0 && i % 5 == 0) {`,
        "l3": `Buzzer.fizzBuzz();`,
        "l4": `} else if (i % 3 == 0) {`,
        "l5": `Buzzer.fizz();`,
        "l6": `} else if (i % 5 == 0) {`,
        "l7": `Buzzer.buzz();`,
        "l9": `Log.e("fizzbuzz", String.valueOf(i));`,
      ),
      width: 50%,
      alt: (
        "An oriented graph. ",
        "The node at the top is labelled `for (int i = 1; i <= n; i++) {`. An arrow goes from it to the node below, labelled `if (i % 3 == 0 && i % 5 == 0) {`. ",
        "Two arrows start from this node, one to `Buzzer.fizzBuzz();`, one to `} else if (i % 3 == 0) {`. ",
        "An arrow goes from `Buzzer.fizzBuzz();` to the `for` node at the top. ",
        "Two arrows go from the `else if i % 3 = 0` node, one to `} else if (i % 5 == 0) {` and one to `Buzzer.fizz();`. ",
        "An arrow goes from `Buzzer.fizz();` to the `for` node at the top. ",
        "Two arrows go from the `else if i % 5 = 0` node, one to `Buzzer.buzz();`, and one to `Log.e(\"fizzbuzz\", String.valueOf(i));`. ",
        "Arrows go from both those nodes, back to the `for` node at the top."
      ).join(),
    ),
    supplement: none,
    kind: "bg-fizzbuzz-cg-cfg subfig",
    caption: [c) Corresponding Control-Flow Graph]
  ) <fig:bg-fizzbuzz-cfg>]))
  h(1em)},
  kind: image,
  supplement: [Figure],
  caption: [Source code for a simple Java method and its Call and Control-Flow Graphs],
)<fig:bg-fizzbuzz-cg-cfg>
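
As a rough illustration of how a call graph can be stored and queried, the sketch below represents it as an adjacency map and computes the methods reachable from an entry point.
It is a minimal example under simplifying assumptions: the call sites are given as input instead of being extracted from bytecode, and the names `CallGraphSketch`, `CallSite`, `build` and `reachable` are hypothetical and do not come from any existing tool.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: all names in this class are hypothetical.
class CallGraphSketch {
  // One invocation found in the body of `caller`, targeting `callee`.
  static final class CallSite {
    final String caller;
    final String callee;

    CallSite(String caller, String callee) {
      this.caller = caller;
      this.callee = callee;
    }
  }

  // Groups call sites by caller: each key is a node, each value its outgoing edges.
  static Map<String, Set<String>> build(List<CallSite> sites) {
    Map<String, Set<String>> graph = new LinkedHashMap<>();
    for (CallSite site : sites) {
      graph.computeIfAbsent(site.caller, k -> new LinkedHashSet<>()).add(site.callee);
    }
    return graph;
  }

  // Methods transitively reachable from `entry` by following call edges.
  static Set<String> reachable(Map<String, Set<String>> graph, String entry) {
    Set<String> seen = new LinkedHashSet<>();
    Deque<String> todo = new ArrayDeque<>();
    todo.push(entry);
    while (!todo.isEmpty()) {
      String method = todo.pop();
      if (seen.add(method)) {
        for (String callee : graph.getOrDefault(method, Set.of())) {
          todo.push(callee);
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    // Call sites of the fizzBuzz method, written by hand for the example.
    List<CallSite> sites = List.of(
      new CallSite("fizzBuzz(int)", "Buzzer.fizzBuzz()"),
      new CallSite("fizzBuzz(int)", "Buzzer.fizz()"),
      new CallSite("fizzBuzz(int)", "Buzzer.buzz()"),
      new CallSite("fizzBuzz(int)", "String.valueOf(int)"),
      new CallSite("fizzBuzz(int)", "Log.e(String, String)"));
    System.out.println(reachable(build(sites), "fizzBuzz(int)"));
  }
}
```

A real tool would populate the call sites by scanning the invoke instructions of each method, which is where most of the difficulty lies.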

Once the control-flow graph is computed, it can be used to compute data-flows.
Data-flow analysis, commonly performed as taint-tracking, follows the flow of information in the application.
Given a list of methods and fields that can generate critical information (taint sources) and a list of methods that can consume information (taint sinks), taint-tracking reports a potential data leak whenever a data flow links a taint source to a taint sink.
For example, `TelephonyManager.getImei()` returns a unique, persistent, device identifier.
This identifier can be used to track the user, and it cannot be changed if compromised.
This makes `TelephonyManager.getImei()` a good candidate as a taint source.
On the other hand, `UrlRequest.start()` sends a request to an external server, making it a taint sink.
If a data-flow is found linking `TelephonyManager.getImei()` to `UrlRequest.start()`, this means the application is potentially leaking critical information to an external entity, a behaviour that is probably not wanted by the user.
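
The principle can be sketched on a toy example.
The code below is a deliberately simplified illustration, not the algorithm of any particular tool: it assumes a hypothetical straight-line intermediate representation where each statement has the form `dst = method(args...)`, ignores branches, fields and aliasing, and treats the receiver of a call as an ordinary argument.
The class `TaintSketch` and the intermediate operations `concat` and `buildRequest` are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy forward taint propagation over a hypothetical IR.
class TaintSketch {
  // One statement `dst = method(args...)`; dst is null when the result is unused.
  static final class Call {
    final String dst;
    final String method;
    final List<String> args;

    Call(String dst, String method, String... args) {
      this.dst = dst;
      this.method = method;
      this.args = Arrays.asList(args);
    }
  }

  // Returns the sinks that receive a value derived from a source.
  static List<String> findLeaks(List<Call> program, Set<String> sources, Set<String> sinks) {
    Set<String> tainted = new HashSet<>();
    List<String> leaks = new ArrayList<>();
    for (Call call : program) {
      boolean usesTaint = false;
      for (String arg : call.args) {
        usesTaint |= tainted.contains(arg);
      }
      if (usesTaint && sinks.contains(call.method)) {
        leaks.add(call.method);
      }
      if (call.dst != null) {
        if (usesTaint || sources.contains(call.method)) {
          tainted.add(call.dst);    // the result carries sensitive data
        } else {
          tainted.remove(call.dst); // the variable is overwritten with clean data
        }
      }
    }
    return leaks;
  }

  public static void main(String[] args) {
    List<Call> program = List.of(
      new Call("imei", "TelephonyManager.getImei()", "telephonyManager"),
      new Call("url", "concat", "base", "imei"),     // hypothetical operation
      new Call("request", "buildRequest", "url"),    // hypothetical operation
      new Call(null, "UrlRequest.start()", "request"));
    System.out.println(findLeaks(program,
      Set.of("TelephonyManager.getImei()"), Set.of("UrlRequest.start()")));
  }
}
```

On this input, the taint propagates from `imei` to `url` and then to `request`, so the call to `UrlRequest.start()` is reported.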


Static analysis is powerful as it can detect unwanted behaviour in an application, even if the behaviour does not manifest itself when running the application.
However, static analysis tools must overcome many challenges when analysing Android applications.
/ the Java object-oriented paradigm: A call to a method can, in fact, correspond to a call to any method overriding the original method in subclasses.
/ the multiplicity of entry points: Each component of an application can be an entry point for the application.
/ the event-driven architecture: Methods in the applications can be called when events occur, in an unknown order.
/ the interleaving of native code and bytecode: Native code can be called from bytecode and vice versa, but tools often only handle one of those formats.
/ the potential dynamic code loading: An application can run code that was not originally in the application.
/ the use of reflection: Methods can be invoked by name, with the name given as a string that may only be known at run time, which makes the call target difficult to identify statically.
/ the continual evolution of Android: Each new version of Android brings new features that analysis tools must be aware of.
  For instance, the multi-dex feature presented in @sec:bg-android-apk was introduced in Android #SDK 21.
  Tools unaware of this feature only analyse the `classes.dex` file and will ignore all other `classes<n>.dex` files.
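
Two of these challenges, method overriding and reflection, can be illustrated with a few lines of plain Java.
The sketch below is only an example: the classes `DispatchSketch`, `Base` and `Child` are hypothetical, and the reflective call relies solely on the standard `java.lang.reflect` API.

```java
import java.lang.reflect.Method;

// Illustrative only: all class names here are hypothetical.
class DispatchSketch {
  static class Base {
    public String run() { return "Base.run"; }
  }

  static class Child extends Base {
    @Override
    public String run() { return "Child.run"; }
  }

  // Virtual dispatch: the static type is Base, but the method actually
  // executed depends on the runtime type of `b`.
  static String callVirtual(Base b) {
    return b.run();
  }

  // Reflection: the method name is a string assembled at run time, so it
  // does not appear as a call target in the bytecode of this method.
  static String callByName(Object target, String prefix, String suffix) {
    try {
      Method m = target.getClass().getMethod(prefix + suffix);
      return (String) m.invoke(target);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(callVirtual(new Child()));
    System.out.println(callByName(new Child(), "r", "un"));
  }
}
```

In `callVirtual`, a sound call graph needs an edge to every override of `Base.run`, while in `callByName` the target can only be recovered if the analysis is able to reconstruct the string passed to `getMethod`.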

#todo[It would be good to emphasise dynamic code loading and reflection]

#midskip

With the basics of Android application analysis in mind, we can now examine our problem statements further.