
#import "../lib.typ": eg, paragraph, DFG, DEX, API, SDK, APK, ART, AOSP
#import "../lib.typ": todo, jfl-note
#import "X_var.typ": *

== Obfuscation Techniques <sec:cl-obfuscation>

In this section, we present new obfuscation techniques that take advantage of the complexity of the class loading process.
Then, to evaluate their effectiveness, we reviewed common Android reverse engineering tools to see how they behave when collisions occur between classes of the #APK, or between a class of the #APK and classes of Android (#Asdk or #hidec).
We call this collision "*class shadowing*", because the attacker's version of the class shadows the one that will be used at runtime.
To evaluate whether such shadow attacks work, we handcrafted three applications that implement shadowing techniques and tested their impact on static analysis tools.
Then, we manually inspected the output of the tools in order to check their consistency with what Android is really doing at runtime.
For example, for Apktool, we look at the output disassembled code, and for Flowdroid~@Arzt2014a, we check that a flow between `Taint.source()` and `Taint.sink()` is correctly computed.


/*
shadow: create a class collision
hidden: use a class from the hidden API

a class of the APK can be shadowed
a class of the SDK can be shadowed
a hidden class can be shadowed
*/


=== Obfuscation Techniques

From the results presented in @sec:cl-loading, three approaches can be designed to hide the behaviour of an application.

/*
#paragraph([Hidden classes])[
Applications, both malicious and benign, have been known to use hidden APIs to access advanced features #todo[ref ?].
Using #hidec can have an impact on the accuracy of analysis tools because they may not have access to the code of these classes.
]
#todo[Google blacklist/greylist/ect, ref to paper that says this can be bypass]

#todo[Compare classes in android.jar, framework.jar and other, are they hidden whitelisted classes?]

The two previous attacks have a few issues.
Basic shadowing implies having several classes with the same name in the application, which can be detected by some tools.
On the other hand, using #hidec leaves classes without implementation in the application, which can also be detected.
*/

#paragraph([*Self shadow*: shadowing a class with another from #APK])[
This method consists of hiding the implementation of a class with another one by exploiting the possible collision of class names, as described in @sec:cl-collision with multiple #dexfiles.
If reversers or tools ignore the priority order of a multi-dex file, they may consider the wrong version of a class.
]
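
As a minimal sketch of why the priority order matters, the following Java model resolves a class name against an ordered list of #dexfiles and keeps the first definition found.
The types `DexModel` and `FirstMatchResolver` are hypothetical names introduced for illustration; they model the behaviour and are not part of Android or of any analysis tool.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of a dex file: its name and, for each class it defines,
// a label describing that implementation.
final class DexModel {
    final String fileName;
    final Map<String, String> classes;

    DexModel(String fileName, Map<String, String> classes) {
        this.fileName = fileName;
        this.classes = classes;
    }
}

final class FirstMatchResolver {
    // Returns the label of the first implementation found, in priority order.
    static Optional<String> resolve(List<DexModel> orderedDexFiles, String className) {
        for (DexModel dex : orderedDexFiles) {
            String implementation = dex.classes.get(className);
            if (implementation != null) {
                return Optional.of(implementation);
            }
        }
        return Optional.empty();
    }
}
```

Under this model, two tools that iterate over the same files in a different order report different implementations for the same class name, which is the discrepancy a self shadow relies on.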

// priority goes to SDK classes even if a shadow class is defined in the APK (all of this because of Boot)
#paragraph([*SDK shadow*: shadowing a #SDK class])[
This method consists of presenting to the reverser a fake implementation of a class of the #SDK.
This class is embedded in the #APK file and has the same name as one of the #SDK.
Because `BootClassLoader` gives priority to the #Asdk at runtime, the reverser or tool should ignore any version of such a class that is contained in the #APK.
The only constraint when shadowing an #SDK class is that the shadowing implementation must respect the signatures of the real class.
Note that, by introducing a custom class loader, the attacker could invert the priority, but this case is out of the scope of this chapter.
]

// priority goes to hidden classes (because they come from the SDK) even if a shadow class is defined in the APK
#paragraph([*Hidden shadow*: shadowing a hidden class])[
This method is similar to the previous one, except the class that is shadowed is a #hidecsingular.
Because #ART will give priority to the internal version of the class, the version provided in the #APK file will be ignored.
Such shadow attacks are more difficult to detect by a reverse engineer, who may not know the existence of this specific hidden class in Android.
]

=== Impact on Static Analysis Tools <sec:cl-evaltools>

#figure(
  ```java
  public class Main {
      public static void main(Activity ac) {
          String personal_data = Taint.source();
          String obfuscated_personal_data = Obfuscation.hide_flow(personal_data);
          Taint.sink(ac, obfuscated_personal_data);
      }
  }

  // customised for each obfuscation technique
  public class Obfuscation {
      public static String hide_flow(String personal_data) { ... }
  }
  ```,
  caption: [Main body of test apps]
)<lst:cl-testapp>

We selected tools that are commonly used to unpack and reverse Android applications.
In @sec:rasta (@sec:rasta-src-select), we found only two tools to be still actively maintained: Androguard#footnote[https://github.com/androguard/androguard] and Flowdroid#footnote[https://github.com/secure-software-engineering/FlowDroid].
We also noticed that Apktool#footnote[https://apktool.org/] was a common dependency for a lot of the tools we tested in @sec:rasta (see @tab:rasta-rec-deps), and is still used today.
Consequently, we will test the impact of shadow attacks on those three tools.
Lastly, because it is a state-of-the-art decompiler for Android applications, we added Jadx#footnote[https://github.com/skylot/jadx] to the list of tools we tested.

To evaluate the tools, we designed a single application that we can customise for different tests.
@lst:cl-testapp shows the main body implementing:
- a possible flow to evaluate FlowDroid: a flow from a method `Taint.source()` to a method `Taint.sink(Activity, String)` through a method `Obfuscation.hide_flow(String)`.
- a possible use of a #SDK or hidden class inside the class `Obfuscation` to evaluate #platc shadowing for other tools.

We used 4 versions of this application:

+ A control application that does not do anything special: `Obfuscation.hide_flow(String personal_data)` returns `personal_data`.
  It will be used for checking the expected result of tools.
+ A version that implements self-shadowing: the class `Obfuscation` is duplicated. One version is the same as in the control app (`Obfuscation.hide_flow(String)` returns its argument), and the other returns a constant string.
  These two versions are embedded in different #dexfiles of a multi-dex application.
+ The third version implements #SDK shadowing and needs an existing class of the #SDK.
  We used the #SDK class `Pair` as the class to shadow.
  We put data in a new `Pair` instance and reread the data from the `Pair`.
  The colliding `Pair` class we created discards the data at the initialisation and stores `null` instead of the argument values.
  This decoy class breaks the flow of information: Flowdroid will detect the information flow if it uses the actual #SDK implementation of `Pair` to compute the #DFG, but not if it uses the decoy.
+ The last version tests for Hidden #API shadowing.
  As in the third version, we store data in `com.android.okhttp.Request` and then retrieve it.
  Again, the shadowing implementation discards the data.
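
To make the third version more concrete, the sketch below shows what a decoy `Pair` and a matching `hide_flow` body could look like.
It is an illustration written for this explanation and not the exact code of the test application: the body of `hide_flow` and the second constructor argument are assumptions, and the package declaration (`android.util` for the real class) is omitted so that the sketch compiles as a single file.

```java
// Decoy exposing the same public fields and constructor as android.util.Pair.
// In an APK it would be declared in package android.util; the package
// declaration is left out here so that the example stays self-contained.
class Pair<F, S> {
    public final F first;
    public final S second;

    public Pair(F first, S second) {
        // The decoy discards both arguments instead of storing them.
        this.first = null;
        this.second = null;
    }
}

final class Obfuscation {
    // Hypothetical body: stores the data in a Pair and reads it back.
    public static String hide_flow(String personal_data) {
        Pair<String, String> box = new Pair<>(personal_data, "unused");
        return box.first;
    }
}
```

Analysed against this decoy, `hide_flow` appears to return `null` and the flow seems broken; executed against a `Pair` implementation that stores its arguments, as the real #SDK class does, the same bytecode returns `personal_data`.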

We used the 4 selected tools on the 4 versions of the application and compared the results on the control application to the results on the other applications implementing the different obfuscation techniques.
We found that these static analysis tools do not consider the class loading mechanism, either because the tools only look at the content of the application file (#eg a disassembler) or because they consider class loading to be a dynamic feature and thus out of their scope.
In @tab:cl-results, we report on the types of shadowing that can trick each tool.
A plain circle is a shadow attack that leads to a wrong result.
A white circle indicates a tool that emits warnings or displays the two versions of the class.
A cross is a tool not impacted by a shadow attack.
//We explain in more detail in the following the results for each considered tool.

#figure({
  table(
    columns: 5,
    stroke: none,
    align:(left+horizon, center+horizon, center+horizon, center+horizon, center+horizon),
    table.hline(),
    table.header(
      table.cell(colspan: 5, inset: 3pt)[],
      table.cell(rowspan: 2)[Tool],
      table.cell(rowspan: 2)[Version],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(colspan: 3)[Shadow Attack],
      [Self], [#SDK], [Hidden],
    ),
    table.cell(colspan: 5, inset: 3pt)[],
    table.hline(),
    table.cell(colspan: 5, inset: 3pt)[],

    [Jadx], [1.5.0], [#warn], [#ok], [#ok],
    [Apktool], [2.9.3], [#warn], [#ok], [#ok],
    [Androguard], [4.1.2], [#warn], [#ok], [#ok],
    [Flowdroid], [2.13.0], [#ok], [#ko], [#ok],

    table.cell(colspan: 5, inset: 3pt)[],
    table.hline(),
  )
  [#ok: working \ #warn: works but produces a warning or can be seen by the reverser \ #ko: not working]
  },
  caption: [Working attacks against static analysis tools]
) <tab:cl-results>

==== Jadx

//Jadx is a reverse engineering tool that regenerates the Java source code of an application.
Jadx processes all the classes present in the application, but only saves and displays one class per name, even if several versions are present in multiple #dexfiles.
Nevertheless, when multiple classes with the same name are found, Jadx reports it in a comment added to the generated Java source code.
This warning states that a possible collision exists and lists the files that contain the different versions of the class.
Unfortunately, after reviewing the code of Jadx, we believe that the selection of the displayed class is undefined behaviour.
At least for version 1.5.0 that we tested, we found that Jadx selects the wrong implementation when classes with the same name are present in, for example, `classes2.dex` and `classes3.dex`.
We report it with a "#warn" because warnings are issued.

//Using #hidec does not affect Jadx beyond the fact that #hidec are not decompiled, which is to be expected by the user anyway.

Shadowing #Asdk and #hidec is possible in Jadx. There is only one implementation of the class in the application, and Jadx does not have a list of the internal classes of Android, so no warning is issued to the reverser that the displayed class is not the one used by Android.

==== Apktool

//Apktool generates Smali files, an assembler language for #DEX bytecode.
Apktool will store the disassembled classes in a folder that matches the #dexfile that stores the bytecode.
This means that when shadowing a class with two versions in two #dexfiles, the shadow implementations will be disassembled into two directories.
No indication is displayed that a collision exists.
It is left to chance whether the reverser opens the right one.
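
A reverser can compensate for the missing warning with a small script.
The following sketch lists the `.smali` files that appear under more than one output folder; it assumes that Apktool writes one folder per #dexfile whose name starts with `smali` (for example `smali` and `smali_classes2`), and `SmaliCollisionFinder` is a hypothetical helper, not part of Apktool.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SmaliCollisionFinder {
    // Maps each relative .smali path to the folders that contain it,
    // keeping only the paths found in at least two folders.
    static Map<String, List<String>> findCollisions(Path apktoolOutput) throws IOException {
        List<Path> smaliDirs;
        try (Stream<Path> children = Files.list(apktoolOutput)) {
            smaliDirs = children
                .filter(Files::isDirectory)
                .filter(p -> p.getFileName().toString().startsWith("smali"))
                .sorted()
                .collect(Collectors.toList());
        }
        Map<String, List<String>> seen = new TreeMap<>();
        for (Path dir : smaliDirs) {
            List<Path> smaliFiles;
            try (Stream<Path> files = Files.walk(dir)) {
                smaliFiles = files
                    .filter(p -> p.toString().endsWith(".smali"))
                    .collect(Collectors.toList());
            }
            for (Path file : smaliFiles) {
                // Normalise separators so that keys are identical on all platforms.
                String relative = dir.relativize(file).toString().replace('\\', '/');
                seen.computeIfAbsent(relative, k -> new ArrayList<>())
                    .add(dir.getFileName().toString());
            }
        }
        seen.values().removeIf(dirs -> dirs.size() < 2);
        return seen;
    }
}
```

Such a check only reveals self shadowing: a decoy of a platform class appears once in the output and is not reported.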

Similarly to Jadx, shadowing an #Asdk or #hidecsingular is not detected by Apktool, which simply unpacks the fake shadow version.

==== Androguard

Androguard has different usages, with different levels of analysis.
The documentation highlights the analysis commands that compute three types of objects: an #APK object, a list of #DEX objects, and an Analysis object.
The #APK and the list of #dexfiles are a one-to-one representation of the content of an application, and have the same issues that we discussed with Apktool: they provide the different versions of a shadow class contained in multiple #dexfiles.

The Analysis object is used to compute a method call graph, and we found that this algorithm may choose the wrong version of a shadowed class when using the cross-references that are computed.
This leads to an invalid call graph, as shown in @fig:cl-andro_obf_cg: both `doSomething()` methods are represented in the graph, but the one linked to `main()` is the one calling the method `good()`, whereas the method `bad()` is the one actually called when running the application.

Androguard has a method `.is_external()` to detect if the implementation of a class is not provided inside the application and a method `.is_android_api()` to detect if the class is part of the Android #API.
Regrettably, the documentation of `.is_android_api()` explains that the method is still experimental and just checks a few package names.
This means that although those methods are useful, the only indication of the use of an #Asdk or #hidec is the fact that the class is not in the #APK file.
Because of that, like for Apktool and Jadx, Androguard has no way to warn the reverser that the shadow of an #Asdk or #hidec is not the class used when running the application.

#figure({
  set align(center)
  stack(dir: ltr,[
  #figure(
    image(
      "figs/call_graph_expected.svg",
      width: 45%,
      alt: "
      A box diagram.
      Arrows go from MainActivity.onCreate() to Activity.OnCreate() and Main.main(),
      from Main.main() to Obfuscation.doSomething() to Main.bad(),
      from another Obfuscation.doSomething() box to Main.good(),
      from Main.bad() to Log.i() and from Main.bad() to Log.i().
      There are two Obfuscation.doSomething(), the one pointed by Main.main() and that points to Main.bad() is white like the other boxes, the one without arrows pointed at and that points to Main.good() is gray.

      "
    ),
    supplement: [Subfigure],
    caption: [Expected Call Graph]
  ) <fig:cl-andro_non_obf_cg>],[
  #figure(
    image(
      "figs/call_graph_obf.svg",
      width: 45%,
      alt: "
      A box diagram.
      Arrows go from MainActivity.onCreate() to Activity.OnCreate() and Main.main(),
      from Main.main() to Obfuscation.doSomething() to Main.good(),
      from another Obfuscation.doSomething() box to Main.bad(),
      from Main.bad() to Log.i() and from Main.bad() to Log.i().
      There are two boxes Obfuscation.doSomething(), the one pointed by Main.main() and that points to Main.good() is gray, the one without arrows pointed at and that points to bad is white like the other boxes.
      "
    ),
    supplement: [Subfigure],
    caption: [Call Graph Computed by Androguard]
  ) <fig:cl-andro_obf_cg>
  ])
  h(1em)},
  caption: [Call Graphs of an application calling `Main.bad()` from a shadowed `Obfuscation` class],
)<fig:cl-androguard_call_graph>

==== Flowdroid

/*
#jfl-note[Flowdroid~@Arzt2014a is used to detect if an application can leak sensitive information.
To do so, the analyst provides a list of source and sink methods.
The return value of a method marked as source is considered sensitive and the argument of a method marked as sink is considered to be leaked.
By analyzing the bytecode of an application, Flowdroid can detect if data emitted by source methods can be exfiltrated by a sink method.
Flowdroid is built on top of the Soot~@Arzt2013 framework that handles, among other things, the class selection process. ][
already said in chap. 2?

No, but we should have; it will come, and this will need to be updated at that point
]*/

We found that when selecting class implementations in a multi-dex #APK, Soot uses an algorithm close to the one used by #ART:
Soot sorts the `.dex` bytecode files with a specified `prioritizer` (a comparison function that defines an order for #dexfiles) and selects the first implementation found when iterating over the sorted files.
Unfortunately, the `prioritizer` used by Soot is not exactly the same as the one used by #ART.
The Soot `prioritizer` gives priority to `classes.dex`, then to files whose name starts with `classes` over other files, and finally falls back to alphabetical order.
This order is good enough for applications with a small number of #dexfiles generated by Android Studio, but because it uses alphabetical order and does not check the exact format used by Android, a malicious developer could hide the implementation of a class in `classes2.dex` by putting a false implementation in `classes0.dex`, `classes1.dex` or `classes12.dex`.
Because Flowdroid is based on Soot, it inherits this issue.
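
The difference between the two orders can be illustrated with a short Java sketch.
The comparison below is reimplemented from the description given above and is not copied from Soot; the Android-like order assumes that files are read as `classes.dex`, `classes2.dex`, `classes3.dex`, and so on, stopping at the first missing index.
`DexOrderSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class DexOrderSketch {
    // Rank used by the Soot-like order: classes.dex first, then names starting
    // with "classes", then every other file.
    private static int rank(String name) {
        if (name.equals("classes.dex")) return 0;
        if (name.startsWith("classes")) return 1;
        return 2;
    }

    // Compares by rank, then alphabetically.
    static int compareSootLike(String a, String b) {
        int byRank = Integer.compare(rank(a), rank(b));
        return byRank != 0 ? byRank : a.compareTo(b);
    }

    static List<String> sootLikeOrder(List<String> dexNames) {
        List<String> sorted = new ArrayList<>(dexNames);
        sorted.sort(DexOrderSketch::compareSootLike);
        return sorted;
    }

    // Assumed Android order: classes.dex, classes2.dex, classes3.dex, ...
    // Files that do not fit this sequence are never reached.
    static List<String> androidLikeOrder(List<String> dexNames) {
        List<String> ordered = new ArrayList<>();
        String name = "classes.dex";
        for (int i = 2; dexNames.contains(name); i++) {
            ordered.add(name);
            name = "classes" + i + ".dex";
        }
        return ordered;
    }
}
```

For an #APK containing `classes.dex`, `classes0.dex`, `classes12.dex` and `classes2.dex`, the Soot-like order visits `classes0.dex` and `classes12.dex` before `classes2.dex`, whereas under the stated assumption the Android-like order only visits `classes.dex` and `classes2.dex`.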

// TODO This could use more investigation
In addition to self-shadowing, Flowdroid is sensitive to the use of #platc, as it needs the bytecode of those classes to be able to track data flows.
//This is solved for #SDK classes by providing `android.jar` to Flowdroid.
Flowdroid does have a record of #SDK classes, and gives priority to the actual #SDK classes over the classes implemented in the application, thus defeating #SDK shadow attacks.
//Unfortunately, `android.jar` only contains classes from the #Asdk, meaning that using #hidec breaks the flow tracking.
Unfortunately, Flowdroid does not have a record of all platform classes, meaning that using #hidec breaks the flow tracking.
Solving this issue would require finding the bytecode of all the platform classes of the targeted Android version, which, as we said previously, must be extracted from an emulator or a phone.

#v(2em)

We have seen that tools can be impacted by shadow attacks. In the next section, we will investigate whether these attacks are used in the wild.