
#import "../lib.typ": todo, APK, etal, ART, eg, jm-note
#import "@preview/diagraph:0.3.3": raw-render

== Android Reverse Engineering Techniques <sec:bg-techniques>

#todo[swap with tool section ?]

In the past fifteen years, the research community has released many tools to detect or analyze malicious behaviors in applications.
Two main approaches can be distinguished: static and dynamic analysis@Li2017.
Dynamic analysis requires running the application in a controlled environment to observe runtime values and/or interactions with the operating system.
For example, an Android emulator with a patched kernel can capture these interactions, but applying the required modifications is not a trivial task.
Such an approach is limited by the time required to execute even a small part of the application, with no guarantee on the code coverage obtained.
For malware, dynamic analysis is also limited by evasion techniques that may prevent the execution of the malicious parts of the code.
//As a consequence, a lot of efforts have been put in static approaches, which is the focus of this paper.

=== Static Analysis <sec:bg-static>

Static analysis programs examine an #APK file without executing it in order to extract information from it.
Basic static analysis can include extracting information from the `AndroidManifest.xml` file or decompiling bytecode to Java code.

More advanced analyses consist in computing the control flow of an application and its data flow@Li2017.

The most basic form of control-flow analysis is to build a call graph.
A call graph is a graph where the nodes represent the methods in the application, and the edges represent calls from one method to another.
@fig:bg-fizzbuzz-cg-cfg b) shows the call graph of the code in @fig:bg-fizzbuzz-cg-cfg a).
A more advanced control-flow analysis consists in building the control-flow graph.
This time, instead of methods, the nodes represent instructions, and an edge from one instruction to another indicates that the second can be executed immediately after the first.
@fig:bg-fizzbuzz-cg-cfg c) represents the control-flow graph of @fig:bg-fizzbuzz-cg-cfg a), with code statements instead of bytecode instructions.

#figure({
  set align(center)
  stack(dir: ttb,[
  #figure(
    ```java
    public static void fizzBuzz(int n) {
      for (int i = 1; i <= n; i++) {
        if (i % 3 == 0 && i % 5 == 0) {
          Buzzer.fizzBuzz();
        } else if (i % 3 == 0) {
          Buzzer.fizz();
        } else if (i % 5 == 0) {
          Buzzer.buzz();
        } else {
          Log.e("fizzbuzz", String.valueOf(i));
        }
      }
    }
    ```,
    supplement: none,
    kind: "bg-fizzbuzz-cg-cfg subfig",
    caption: [a) A Java program],
  ) <fig:bg-fizzbuzz-java>], v(2em), stack(dir: ltr, [
  #figure(
    raw-render(```
      digraph {
        rankdir=LR
        "fizzBuzz(int)" -> "Buzzer.fizzBuzz()"
        "fizzBuzz(int)" -> "Buzzer.fizz()"
        "fizzBuzz(int)" -> "Buzzer.buzz()"
        "fizzBuzz(int)" -> "String.valueOf(int)"
        "fizzBuzz(int)" -> "Log.e(String, String)"
      }
      ```,
      width: 40%
    ),
    supplement: none,
    kind: "bg-fizzbuzz-cg-cfg subfig",
    caption: [b) Corresponding Call Graph]
  ) <fig:bg-fizzbuzz-cg>],[
  #figure(
    raw-render(```
      digraph {
        l1
        l2
        l3
        l4
        l5
        l6
        l7
        l9

        l1 -> l2
        l2 -> l3
        l3 -> l1
        l2 -> l4
        l4 -> l5
        l5 -> l1
        l4 -> l6
        l6 -> l7
        l7 -> l1
        l6 -> l9
        l9 -> l1
      }
      ```,
      labels: (
        "l1": `for (int i = 1; i <= n; i++) {`,
        "l2": `if (i % 3 == 0 && i % 5 == 0) {`,
        "l3": `Buzzer.fizzBuzz();`,
        "l4": `} else if (i % 3 == 0) {`,
        "l5": `Buzzer.fizz();`,
        "l6": `} else if (i % 5 == 0) {`,
        "l7": `Buzzer.buzz();`,
        "l9": `Log.e("fizzbuzz", String.valueOf(i));`,
      ),
      width: 50%
    ),
    supplement: none,
    kind: "bg-fizzbuzz-cg-cfg subfig",
    caption: [c) Corresponding Control-Flow Graph]
  ) <fig:bg-fizzbuzz-cfg>]))
  h(1em)},
  supplement: [Figure],
  caption: [Source code for a simple Java method and its Call and Control Flow Graphs],
)<fig:bg-fizzbuzz-cg-cfg>
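
As an illustration, a call graph can be stored as a simple adjacency map.
The following is only a minimal sketch, not the representation used by any particular tool: the class `CallGraphSketch` and its methods are hypothetical names, and methods are identified by plain strings.
It encodes the call graph of @fig:bg-fizzbuzz-cg-cfg b) and computes the set of methods reachable from an entry point.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: nodes are method names, edges are calls.
class CallGraphSketch {
    // Adjacency map: caller -> set of callees.
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    // Record that `caller` contains a call to `callee`.
    void addCall(String caller, String callee) {
        edges.computeIfAbsent(caller, k -> new LinkedHashSet<>()).add(callee);
        edges.computeIfAbsent(callee, k -> new LinkedHashSet<>());
    }

    // Depth-first traversal: every method reachable from `entry`,
    // including `entry` itself.
    Set<String> reachableFrom(String entry) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        worklist.push(entry);
        while (!worklist.isEmpty()) {
            String method = worklist.pop();
            if (!seen.add(method)) {
                continue; // already visited
            }
            Set<String> callees = edges.getOrDefault(method, Collections.emptySet());
            for (String callee : callees) {
                worklist.push(callee);
            }
        }
        return seen;
    }

    // The call graph of the fizzBuzz method shown in the figure.
    static CallGraphSketch fizzBuzzExample() {
        CallGraphSketch graph = new CallGraphSketch();
        graph.addCall("fizzBuzz(int)", "Buzzer.fizzBuzz()");
        graph.addCall("fizzBuzz(int)", "Buzzer.fizz()");
        graph.addCall("fizzBuzz(int)", "Buzzer.buzz()");
        graph.addCall("fizzBuzz(int)", "String.valueOf(int)");
        graph.addCall("fizzBuzz(int)", "Log.e(String, String)");
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(fizzBuzzExample().reachableFrom("fizzBuzz(int)"));
    }
}
```

A control-flow graph can be stored in the same way, with instructions instead of methods as nodes.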

Once the control-flow graph is computed, it can be used to compute data flows.
Data-flow analysis, in this context often referred to as taint tracking, follows the flow of information through the application.
By defining a list of methods and fields that can generate critical information (taint sources) and a list of methods that can consume information (taint sinks), taint tracking can detect potential data leaks: a leak is reported when a data flow links a taint source to a taint sink.
For example, `TelephonyManager.getImei()` returns a unique, persistent device identifier.
This identifier can be used to identify the user and cannot be changed if compromised.
This makes `TelephonyManager.getImei()` a good candidate for a taint source.
On the other hand, `UrlRequest.start()` sends a request to an external server, making it a taint sink.
If a data flow is found linking `TelephonyManager.getImei()` to `UrlRequest.start()`, the application is potentially leaking critical information to an external entity, a behavior that the user probably does not want.
Data-flow analysis is the subject of many contributions@weiAmandroidPreciseGeneral2014 @titzeAppareciumRevealingData2015 @bosuCollusiveDataLeak2017 @klieberAndroidTaintFlow2014 @DBLPconfndssGordonKPGNR15 @octeauCompositeConstantPropagation2015 @liIccTADetectingInterComponent2015, the most notable being FlowDroid@Arzt2014a.
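
The source and sink reasoning can be made concrete with a small sketch that propagates taint through a straight-line list of simplified call statements.
It is a deliberately minimal illustration under strong assumptions, not the algorithm of any of the tools cited above: there is no branching (so no control-flow graph is needed), receivers are modelled as ordinary arguments, and any tainted argument taints the result.
The class `TaintSketch`, its `Call` type and the method-name strings are hypothetical modelling choices and do not reflect the real API signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of forward taint propagation.
class TaintSketch {
    // A simplified statement: result = method(args...); result may be null.
    static final class Call {
        final String result;
        final String method;
        final List<String> args;

        Call(String result, String method, String... args) {
            this.result = result;
            this.method = method;
            this.args = Arrays.asList(args);
        }
    }

    static final Set<String> SOURCES =
        new HashSet<>(Arrays.asList("TelephonyManager.getImei"));
    static final Set<String> SINKS =
        new HashSet<>(Arrays.asList("UrlRequest.start"));

    // Returns the sinks that receive a value derived from a source.
    static List<String> findLeaks(List<Call> body) {
        Set<String> tainted = new HashSet<>();
        List<String> leaks = new ArrayList<>();
        for (Call call : body) {
            boolean argTainted = call.args.stream().anyMatch(tainted::contains);
            if (SINKS.contains(call.method) && argTainted) {
                leaks.add(call.method);
            }
            if (call.result != null) {
                if (SOURCES.contains(call.method) || argTainted) {
                    tainted.add(call.result);
                } else {
                    tainted.remove(call.result); // overwritten by a clean value
                }
            }
        }
        return leaks;
    }

    // The device identifier flows into the URL of the request.
    static List<Call> leakyExample() {
        return Arrays.asList(
            new Call("imei", "TelephonyManager.getImei", "telephonyManager"),
            new Call("url", "String.concat", "baseUrl", "imei"),
            new Call("request", "UrlRequest.Builder.build", "url"),
            new Call(null, "UrlRequest.start", "request"));
    }

    // The identifier is read but never reaches the request.
    static List<Call> cleanExample() {
        return Arrays.asList(
            new Call("imei", "TelephonyManager.getImei", "telephonyManager"),
            new Call("request", "UrlRequest.Builder.build", "baseUrl"),
            new Call(null, "UrlRequest.start", "request"));
    }
}
```

In this model, `findLeaks` reports `UrlRequest.start` for the first example and nothing for the second.
Handling branches, loops, fields and calls between methods requires an iterative analysis over the control-flow graph and the call graph.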

#todo[Describe the different contributions in relations to the issues they tackle]

Static analysis is powerful, as it can detect unwanted behavior in an application even if this behavior does not manifest itself when running the application.
However, static analysis tools must overcome many challenges when analysing Android applications:
/ the Java object-oriented paradigm: A call to a method can in fact correspond to a call to any method overriding the original method in subclasses
/ the multiplicity of entry points: Each component of an application can be an entry point for the application
/ the event-driven architecture: Methods in the application can be called in many different orders depending on external events
/ the interleaving of native code and bytecode: Native code can be called from bytecode and vice versa, but tools often only handle one of those formats
/ the potential dynamic code loading: An application can run code that was not originally in the application
/ the use of reflection: Methods can be called from their name as a string object, which is not necessarily known statically
/ the continual evolution of Android: Each new version brings new features that analysis tools must be aware of
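
Two of these challenges, method overriding and reflection, can be illustrated with a short Java sketch.
The class and method names (`DispatchSketch`, `Buzzer`, `LoudBuzzer`, `ring`, `ringByName`) are hypothetical and only serve the example.

```java
import java.lang.reflect.Method;

// Hypothetical illustration of call targets that are hard to resolve statically.
class DispatchSketch {
    public static class Buzzer {
        public String sound() {
            return "buzz";
        }
    }

    public static class LoudBuzzer extends Buzzer {
        @Override
        public String sound() {
            return "BUZZ";
        }
    }

    // Overriding: the call site names Buzzer.sound(), but the method that
    // actually runs depends on the runtime class of b.
    static String ring(Buzzer b) {
        return b.sound();
    }

    // Reflection: the method name only exists as a string assembled at run
    // time, so this method contains no direct call to sound().
    static String ringByName(Buzzer b, String prefix, String suffix) {
        try {
            Method method = b.getClass().getMethod(prefix + suffix);
            return (String) method.invoke(b);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflective call failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(ring(new LoudBuzzer()));
        System.out.println(ringByName(new Buzzer(), "so", "und"));
    }
}
```

A call graph builder that only looks at the declared type of `b` would add an edge to `Buzzer.sound()` and miss `LoudBuzzer.sound()`, and it finds no edge at all in `ringByName` unless it can reconstruct the string passed to `getMethod`.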

Tools can share the backend used to interact with the bytecode.
For example, Apktool is often called in a subprocess to extract the bytecode, and the Soot framework is commonly used both to analyse bytecode and to modify it.
The most notable user of Soot is FlowDroid. #todo[formulation]

=== Dynamic Analysis <sec:bg-dynamic>

The alternative to static analysis is dynamic analysis.
With dynamic analysis, the application is actually executed.
The simplest strategies consist in just running the application and examining its behavior.
For instance, Shao #etal #todo[cit] capture the network communication of an application and analyse those traces, while Bhatia #etal #todo[cit] take #jm-note[periodic][meh] snapshots of the memory to deduce the behavior of the application #todo[check the papers].

More advanced methods are more intrusive and require modifying either the #APK, the Android framework, the runtime, or the kernel.
TaintDroid #todo[cit], for example, modifies the Dalvik Virtual Machine (the predecessor of the #ART) to track the data flow of an application at runtime, while AndroBlare #todo[cit] tries to compute the taint flow by hooking system calls from a kernel module. #todo[check papers]
#todo[RealDroid?]

Modifying the Android framework, runtime or kernel is possible thanks to the Android project being open source; however, this is a delicate operation.
Thus, a common issue faced by tools that take this approach is that they are stuck with a specific version of Android.
DroidScope@droidscope180237 and CopperDroid@Tam2015 are two well-known sandboxes faced with this issue. #todo[check, and add android version]
To limit this problem, other sandboxes focus on hooking strategies, like DroidHook and Mirage #todo[cit, check paper], based on the Xposed framework, and CamoDroid #todo[cit and check], based on Frida.

Another known challenge when analysing an application dynamically is code coverage: if some part of the application is not executed, it cannot be analysed.
Considering that Android applications are meant to interact with a user, this can become problematic for automatic analysis.
#todo[runner considered]
GroddDroid uses static analysis to find suspicious code sections and then uses this information to guide a runner that uses the #todo[whatisnameagain?] framework to trigger those suspicious sections of code.
More challenging, some applications will try to detect if they are in a sandbox environment (#eg if they are in an emulator, or if Frida is present in memory) and will refuse to run some sections of code if this is the case.
#todo[name] #etal @ruggia_unmasking_2024 make a list of evasion techniques.
They show that most current analysis frameworks fail to hide themselves correctly and introduce a new sandbox, DroidDungeon, that does avoid detection. #todo[limitation?]
#todo[force execution?]

// Shao et al. Yuru Shao, Jason Ott, Yunhan Jack Jia, Zhiyun Qian, and Z Morley Mao. ‘The Misuse of Android Unix Domain Sockets and Security Implications’. In: ACM SIGSAC Conference on Computer and Communications Security. Vienna, Austria: ACM, Oct. 2016, pp. 80–91.
// Bhatia et al. Rohit Bhatia, Brendan Saltaformaggio, Seung Jei Yang, Aisha Ali-Gombe, Xiangyu Zhang, Dongyan Xu, and Golden G Richard III. ‘"Tipped Off by Your Memory Allocator": Device-Wide User Activity Sequencing from Android Memory Images’. In: (Feb. 2018).

- #todo[evasion: droid DroidDungeon @ruggia_unmasking_2024]
- #todo[Xposed: DroidHook / Mirage: Toward a stealthier and modular malware analysis sandbox for android]
- #todo[Frida: CamoDroid]
- #todo[
  modified android framework, framework or kernel:
  - RealDroid
  - AndroBlare, taint analysis, linux module to hook syscalls, developed in-house
  Radoniaina Andriatsimandefitra and Valérie Viet Triem Tong. ‘Detection and identification of Android malware based on information flow monitoring’. In: 2nd International Conference on Cyber Security and Cloud Computing. New York, USA: IEEE, Jan. 2015, pp. 200–203.
  Radoniaina Andriatsimandefitra, Stéphane Geller, and Valérie Viet Triem Tong. ‘Designing information flow policies for Android’s operating system’. In: IEEE International conference on communications. Ottawa, ON, Canada: IEEE, June 2012, pp. 976–981.
  - TaintDroid (check if dynamic? strange, cf Reaves et al)  modifies the Dalvik Virtual Machine (DVM) interpreter to manage taint
]

=== Hybrid Analysis <sec:bg-hybrid>
#todo[merge with other section?]

- #todo[DyDroid, audit of Dynamic Code Loading@qu_dydroid_2017]