
#import "../lib.typ": todo, APK, APKs, DEX, JAR, OAT, SDK, eg, ART, jm-note, jfl-note, midskip

== Code Transformation <sec:th-trans>

In this section, we will see how we can transform the application code to make dynamic code loading and reflective calls more analysable by static analysis tools.

=== Transforming Reflection <sec:th-trans-ref>

In Android, reflection allows applications to instantiate a class or call a method without having this class or method appear in the bytecode.
Instead, the bytecode uses the generic classes `Class`, `Method` and `Constructor`, which represent any existing class, method or constructor.
Reflection often starts by retrieving the `Class` object representing the class to use.
This class is usually retrieved using a `ClassLoader` object (though there are other ways to get it).
Once the class is retrieved, it can be instantiated using the deprecated method `Class.newInstance()`, as shown in @lst:-th-expl-cl-new-instance, or a specific method can be retrieved.
The current approach to instantiate a class is to retrieve the specific `Constructor` object, then call `Constructor.newInstance(..)` like in @lst:-th-expl-cl-cnstr.
Similarly, to call a method, the `Method` object must be retrieved, then called using `Method.invoke(..)`, as shown in @lst:-th-expl-cl-call.

Although the process seems to differ between class instantiation and method call from the Java standpoint, the runtime operations are very similar.
When instantiating an object with `Object obj = cst.newInstance("Hello Void")`, the constructor method `<init>(Ljava/lang/String;)V`, represented by the `Constructor` `cst`, is called on the object `obj`.
Thus, even for instantiation, a method is called at some point.

#figure(
  ```java
  ClassLoader cl = MainActivity.class.getClassLoader();
  Class clz = cl.loadClass("com.example.Reflectee");
  Object obj = clz.newInstance();
  ```,
  caption: [Instantiating a class using `Class.newInstance()`]
) <lst:-th-expl-cl-new-instance>

#figure(
  ```java
  Constructor cst = clz.getDeclaredConstructor(String.class);
  Object obj = cst.newInstance("Hello Void");
  ```,
  caption: [Instantiating a class using `Constructor.newInstance(..)`]
) <lst:-th-expl-cl-cnstr>

#figure(
  ```java
  Method mth = clz.getMethod("myMethod", String.class);
  Object[] args = {(Object)"an argument"};
  String retData = (String) mth.invoke(obj, args);
  ```,
  caption: [Calling a method using reflection]
) <lst:-th-expl-cl-call>

One of the main reasons to use reflection is to access classes that are neither platform classes nor in the application bytecode, as is often the case when dealing with classes from dynamically loaded bytecode.
Indeed, if the #ART were to encounter an instruction referencing a class that cannot be loaded by the current class loader, it would crash the application.

To allow static analysis tools to analyse an application that uses reflection, we want to replace the reflection call with a bytecode chunk that actually calls the method and can be analysed by any static analysis tool.
In @sec:th-trans-cl, we deal with the issue of dynamic code loading so that the classes used are, in fact, present in the application.

A notable issue is that a specific reflection call can call different methods.
@lst:th-worst-case-ref illustrates a worst-case scenario where any method can be called at the same reflection call.
In those situations, we cannot guarantee that we know all the methods that can be called (#eg the name of the method called could be retrieved from a remote server).
In addition, the method we propose in @sec:th-dyn is a best-effort approach to collect reflection data: like any dynamic analysis, it is limited by its code coverage.

#figure(
  ```java
  Object myInvoke(Object obj, Method mth, Object[] args) throws .. {
    return mth.invoke(obj, args);
  }
  ```,
  caption: [A reflection call that can call any method]
) <lst:th-worst-case-ref>

To handle those situations, instead of entirely removing the reflection call, we can modify the application code to test if the `Method` (or `Constructor`) object matches any of the methods observed dynamically, and if so, directly call the method.
If the object does not match any expected method, the code can fall back to the original reflection call.
DroidRA~@li_droidra_2016 has a similar solution, except that reflective calls are always evaluated, and the static equivalent follows just after, guarded behind an opaque predicate that is always false at runtime.
@lst:-th-expl-cl-call-trans demonstrates this transformation for the code originally in @lst:-th-expl-cl-call.
Let's suppose that, when monitoring the execution of the code of @lst:-th-expl-cl-call, we dynamically observed a call to the method `Reflectee.myMethod(String)` at line 3.
In @lst:-th-expl-cl-call-trans, at line 25, the `Method` object `mth` is checked using a method we generated and injected in the application (defined at line 2 in the listing).
This method checks if the method name (line 5), its parameters (lines 6-9), its return type (lines 10-11) and its declaring class (lines 13-14) match the expected method.
If it is the case, the method is called directly (line 26) after casting the arguments and the associated object into the types/classes we just checked.
If the check at line 25 does not pass, the original reflective call is made (line 28).
If we were to expect other possible methods to be called in addition to `myMethod`, we would add `else if` blocks between lines 26 and 27, with other check methods reflecting each potential method call.
/*
#jfl-note[It should be noted that we do the transformation at the bytecode level, the code in the listing corresponds to the output of JADX][
  I would have made a separate section on "how we concretely perform these transformations":
  it is more pedagogical to describe the transformations without bytecode, then have a subsection that discusses
  the ways to modify the bytecode (Soot, Apktool, etc.) and explains their limits, then say how you make your modifications
] #todo[Ref to list of common tools?] reformatted for readability.
*/

#figure(
  ```java
  class T {
    static boolean check_is_reflectee_mymethod_e398(Method mth) {
      Class<?>[] paramTys = mth.getParameterTypes();
      return (
        mth.getName().equals("myMethod") &&
        paramTys.length == 1 &&
        paramTys[0].descriptorString().equals(
          String.class.descriptorString()
        ) &&
        mth.getReturnType().descriptorString().equals(
          String.class.descriptorString()
        ) &&
        mth.getDeclaringClass().descriptorString().equals(
          Reflectee.class.descriptorString()
        )
      );
    }
  }

  ...

  Method mth = clz.getMethod("myMethod", String.class);
  Object[] args = {(Object)"an argument"};
  Object objRet;
  if (T.check_is_reflectee_mymethod_e398(mth)) {
    objRet = (Object) ((Reflectee) obj).myMethod((String)args[0]);
  } else {
    objRet = mth.invoke(obj, args);
  }
  String retData = (String) objRet;
  ```,
  caption: [@lst:-th-expl-cl-call after the de-reflection transformation]
) <lst:-th-expl-cl-call-trans>

The check of the `Method` value is done in a separate method injected inside the application to avoid cluttering the application too much.
Because Java (and thus Android) allows several methods of a class to share the same name, checking the method name and its class is not enough: we must check the whole method signature.
We chose to limit the transformation to the specific instruction that calls `Method.invoke(..)`.
This drastically reduces the risks of breaking the application, but leads to a lot of type casting.
Indeed, the reflection call uses the generic `Object` class, but actual methods usually use specific classes (#eg `String`, `Context`, `Reflectee`) or scalar types (#eg `int`, `long`, `boolean`).
This means that the method parameters and the object on which the method is called must be downcast to their actual type before calling the method, then the returned value must be upcast back to an `Object`.
Scalar types especially require special attention.
Java (and Android) distinguishes between scalar types and classes, and they cannot be mixed: a scalar cannot be cast into an `Object`.
However, each scalar type has an associated class that can be used when doing reflection.
For example, the scalar type `int` is associated with the class `Integer`: the method `Integer.valueOf()` converts an `int` scalar to an `Integer` object, and the method `Integer.intValue()` converts an `Integer` object back to an `int` scalar.
Each time the method called by reflection uses scalars, the scalar-object conversion must be made before calling it.
Finally, because the instruction following the reflection call expects an `Object`, the return value of the method must be cast into an `Object`.
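
To illustrate why the name and the declaring class are not enough, the following sketch builds the full method descriptor, in the same notation as the `<init>(Ljava/lang/String;)V` constructor mentioned earlier.
It is written in Python for brevity and is not part of the transformation itself: the helper names `type_descriptor` and `method_descriptor`, the `->` separator (borrowed from Smali) and the second overload taking an `int` are assumptions made for the example.

```python
# Illustrative only: hypothetical helpers, not part of the instrumentation.
PRIMITIVES = {
    "void": "V", "boolean": "Z", "byte": "B", "short": "S", "char": "C",
    "int": "I", "long": "J", "float": "F", "double": "D",
}


def type_descriptor(java_type: str) -> str:
    """Convert a Java type name (e.g. 'java.lang.String', 'int[]') to its descriptor."""
    dims = 0
    while java_type.endswith("[]"):
        java_type = java_type[:-2]
        dims += 1
    if java_type in PRIMITIVES:
        base = PRIMITIVES[java_type]
    else:
        base = "L" + java_type.replace(".", "/") + ";"
    return "[" * dims + base


def method_descriptor(cls: str, name: str, params: list, ret: str) -> str:
    """Build a Smali-style descriptor: class, name, parameter types and return type."""
    args = "".join(type_descriptor(p) for p in params)
    return f"{type_descriptor(cls)}->{name}({args}){type_descriptor(ret)}"


# Two overloads share the class and the name, but not the descriptor.
observed = method_descriptor(
    "com.example.Reflectee", "myMethod", ["java.lang.String"], "java.lang.String"
)
overload = method_descriptor(
    "com.example.Reflectee", "myMethod", ["int"], "java.lang.String"
)
```

Both descriptors start with the same class and method name and only differ in the parameter list, which is the part the injected check method has to compare.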

This back and forth between types might confuse some analysis tools.
This could be improved in future work by analysing the code around the reflection call.
For example, if the result of the reflection call is immediately cast into the expected type (#eg in @lst:-th-expl-cl-call, the result is cast to a `String`), there should be no need to cast it to `Object` in between.
Similarly, it is common to have the method parameter array generated just before the reflection call and never used again (this is due to `Method.invoke(..)` being a varargs method: the array can be generated by the compiler at compile time).
In those cases, the parameters could be used directly without the detour inside an array.

=== Transforming Code Loading (or Not) <sec:th-trans-cl>

#jfl-note[Here I expected to read how the code that loads code is transformed, but I am told about multi-dex]

An application can dynamically import code from several formats like #DEX, #APK, #JAR or #OAT, either stored in memory or in a file.
Because it is an internal, platform-dependent format, we elected to ignore the #OAT format.
Practically, #JAR and #APK files are zip files containing #DEX files.
This means that we only need to find a way to integrate #DEX files into the application.

We saw in @sec:cl the class loading model of Android.
When doing dynamic code loading, an application defines a new `ClassLoader` that handles the new bytecode, and starts accessing its classes using reflection.
We also saw in @sec:cl that Android now uses the multi-dex format, allowing it to handle any number of #DEX files in one class loader.
Therefore, the simplest way to give static analysis tools access to the dynamically loaded code is to add the #DEX files to the application as additional multi-dex bytecode files.
This should not impact the class loading model as long as there is no class collision (we will explore this in @sec:th-class-collision) and as long as the original application did not try to access inaccessible classes (we will develop this issue in @sec:th-limits).

#figure(
  image(
    "figs/dex_insertion.svg",
    width: 80%,
    alt: "A diagram showing a box labelled 'app.apk', a box labelled 'lib.jar', and a single file outside the boxes labelled 'lib.dex'. The lib.jar box contains the files classes.dex and classes2.dex. Inside the app.apk box, the files AndroidManifest.xml, resources.arsc, classes.dex, classes2.dex, classes3.dex and the folders lib, res and assets are circled by dashes and labelled 'original files', and, still inside app.apk, the files classes4.dex, classes5.dex and classes6.dex are circled by dashes and labelled 'Added Files'. Arrows go from lib.dex to classes4.dex, from the classes.dex inside lib.jar to classes5.dex inside app.apk and from classes2.dex inside lib.jar to classes6.dex inside app.apk"
  ),
  caption: [Inserting #DEX files inside an #APK]
) <fig:th-inserting-dex>

In the end, we decided *not* to modify the original code that loads the bytecode.
Most tools already ignore dynamic code loading, and, with the dynamically loaded bytecode added using the multi-dex format, they already have access to it.
At runtime, although the bytecode is already present in the application, the application will still dynamically load the code.
This ensures that the application keeps working as intended, even if the transformation we applied is incomplete.
Specifically, to call dynamically loaded code, an application needs to use reflection.
We saw in @sec:th-trans-ref that we need to keep the reflection calls as a fallback, and those calls in turn need the class loader created when loading the bytecode.

To summarise, we do not modify the existing bytecode that loads code dynamically.
Instead, we add the intercepted bytecode to the application as additional #DEX files using the multi-dex format, as represented in @fig:th-inserting-dex.
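
As an illustration of the naming step, the sketch below computes the names under which the intercepted #DEX files could be added, following the multi-dex convention (`classes.dex`, `classes2.dex`, `classes3.dex`, and so on).
The function names `dex_index` and `next_dex_names` are hypothetical and chosen for this example only; the sketch assumes the original names follow the convention without gaps that matter.

```python
import re
from typing import List, Optional

# Simplification: "classes.dex" is index 1, "classesN.dex" is index N.
DEX_NAME = re.compile(r"^classes(\d+)?\.dex$")


def dex_index(name: str) -> Optional[int]:
    """Return the multi-dex index of an APK entry, or None if it is not a DEX entry."""
    match = DEX_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)) if match.group(1) else 1


def next_dex_names(existing: List[str], count: int) -> List[str]:
    """Return `count` entry names that continue the numbering of the existing DEX files."""
    used = [i for i in map(dex_index, existing) if i is not None]
    start = max(used, default=0) + 1
    return [
        "classes.dex" if i == 1 else f"classes{i}.dex"
        for i in range(start, start + count)
    ]


# Same situation as in the figure: three original DEX files, three added ones.
original_entries = [
    "AndroidManifest.xml", "resources.arsc",
    "classes.dex", "classes2.dex", "classes3.dex",
]
added = next_dex_names(original_entries, 3)
```

With the entries of @fig:th-inserting-dex, the three added files are named `classes4.dex`, `classes5.dex` and `classes6.dex`.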

=== Class Collisions <sec:th-class-collision>

We saw in @sec:cl/*-obfuscation*/ that having several classes with the same name in the same application can be problematic.
In @sec:th-trans-cl, we are adding new code.
By doing so, we increase the probability of having class collisions:
The developer may have reused a helper class in both the dynamically loaded bytecode and the application, or an obfuscation process may have renamed classes without checking for intersection between the two sources of bytecode.
When loaded dynamically, the classes are in a different class loader, and the class resolution is performed at runtime, as we saw in @sec:cl-loading.
We decided to restrict our scope to the use of class loaders from the Android #SDK.
In the absence of class collision, those class loaders behave seamlessly and adding the classes to the application maintains the behaviour.
#jfl-note[An example would help understanding \ jm: I do not have one that does not take 3 pages of listing]

When we detect a collision, we rename one of the colliding classes in order to be able to differentiate between classes.
To avoid breaking the application, we then need to rename all references to this specific class and be careful not to modify references to the other class.
To do so, we group the classes by the class loader that defines them.
Then, for each colliding class name and each class loader, we check the actual class used by the class loader.
If the class has been renamed, we rename all references to this class in the classes defined by this class loader.
To find the class used by a class loader, we reproduce the behaviour of the different class loaders of the Android #SDK.
This is an important step: remember that the delegation process can lead to situations where the class defined by a class loader is not the class that will be loaded when querying the class loader.
The pseudo-code in @lst:renaming-algo shows the three steps of this algorithm:
- First, we detect collisions and rename class definitions to remove the collisions.
- Then, we rename the references to the colliding classes to make sure the right classes are called.
- Finally, we merge the modified #DEX files of each class loader into one Android application.

#figure(
  ```python
  defined_classes = set()
  redefined_classes = set()

  # Rename the definitions of redefined classes
  for cl in class_loaders:
    for clz in defined_classes.intersection(cl.defined_classes):
      cl.rename_definition(clz)
      redefined_classes.add(clz)
    defined_classes.update(cl.defined_classes)

  # Rename the references to redefined classes
  for cl in class_loaders:
    for clz in redefined_classes:
      defining_cl = cl.resolve_class(clz).class_loader
      cl.rename_reference(clz, defining_cl.new_name(clz))

  # Merge the class loaders into a flat APK
  new_apk = Apk()
  for cl in class_loaders:
    for dex in cl.get_dex():
      new_apk.add_dex(dex)
  ```,
  caption: [Pseudo-code of the renaming algorithm]
) <lst:renaming-algo>
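
The resolution step is the least intuitive part of the algorithm, so the following sketch models it in isolation.
It is a toy model, not our implementation: the `Loader` class and its `resolve` method are hypothetical, and it only assumes the parent-first delegation used by class loaders such as `PathClassLoader` (a `DelegateLastClassLoader` would need a different lookup order).

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Loader:
    """Toy class loader: a name, the classes it defines, and an optional parent."""
    name: str
    defined: Set[str]
    parent: Optional["Loader"] = None

    def resolve(self, class_name: str) -> Optional["Loader"]:
        """Return the loader whose definition is used under parent-first delegation."""
        if self.parent is not None:
            found = self.parent.resolve(class_name)
            if found is not None:
                return found  # the parent wins, even if this loader defines the class
        return self if class_name in self.defined else None


# The dynamically loaded code ships its own copy of a helper class.
app = Loader("app", {"com.example.Main", "com.example.Utils"})
dyn = Loader("dynamic", {"com.example.Utils", "com.example.Plugin"}, parent=app)

winner = dyn.resolve("com.example.Utils")
```

Here `winner` is the application loader: the copy of `com.example.Utils` defined by the dynamic loader is never the one loaded, so references to it from the dynamically loaded classes must point to the name kept for the application's copy.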

/*
* Although we limited ourselves to replacing one specific bytecode instruction, we encountered many technical challenges
* #todo[interrupting try blocks: catch block might expect temporary registers to still store the saved value] ?
*/

=== Implementation Details <sec:th-implem>

Our initial idea was to use Apktool, but in @sec:rasta, we found that many errors raised by tools were due to incorrect parsing of Smali.
Thus, we decided to avoid Apktool.

Most of the contributions of the state of the art that perform instrumentation rely on Soot.
Soot works on an intermediate representation, Jimple, that is easier to manipulate.
However, Soot can be cumbersome to set up and use, and we initially wanted better control over the modified bytecode.
In addition, tools based on Soot showed a trend of consuming a lot of memory and failing with unclear errors.
Although this might be due to the more complex analyses they perform, it reinforced our decision to avoid Soot.
For these reasons, we decided to make our own instrumentation library from scratch.

That library, Androscalpel, must be able to parse, modify and generate valid #DEX files.
It was not as difficult as one would expect, thanks to the clear documentation of the Dalvik format from Google#footnote[https://source.android.com/docs/core/runtime/dex-format].
In addition, when we had doubts about the specification, we had the option to check the implementation used by Apktool#footnote[https://github.com/JesusFreke/smali], or the code used by Android to check the integrity of the #DEX files#footnote[https://cs.android.com/android/platform/superproject/main/+/main:art/libdexfile/dex/dex_file_verifier.cc;drc=11bd0da6cfa3fa40bc61deae0ad1e6ba230b0954].

We chose to use Rust to implement this library.
It has both good performance and ergonomics.
For instance, we could parallelise the parsing and generation of #DEX files without much effort.
Because we are not using a high-level intermediate language like Jimple (used by Soot), the management of the Dalvik registers in the methods has to be done manually (by the user of the library), the same way it has to be done when using Apktool.
This poses a few challenges.

A method declares a number of internal registers it will use (let's call this number $n$), and has access to an additional number of registers used to store the parameters (let's call this number $p$).
Each register is referred to by a number from $0$ to $65535$.
The internal registers are numbered from $0$ to $n-1$, and the parameter registers from $n$ to $n+p-1$.
This means that when adding new registers to the method when instrumenting it (let's say we want to add $k$ registers), the new registers will be numbered from $n$ to $n+k-1$, and the parameter registers will be renumbered from $[|n, n+p[|$ to $[|n+k, n+k+p[|$.
In general, this is not an issue, but some instructions can only operate on some registers (#eg `array-length`, which stores the length of an array in a register, encodes its registers on 4 bits and thus only works on registers numbered between $0$ and $16$ excluded).
This means that adding registers to a method can be enough to break it.
We solved this by adding instructions that move the content of registers $[|n+k, n+k+p[|$ to the registers $[|n, n+p[|$, and keeping the original register numbers ($[|n, n+p[|$) for the parameters in the rest of the body of the method.
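
The renumbering can be sketched as follows.
This is a simplified model under stated assumptions: the function name `plan_register_shift` is hypothetical, register types are ignored (a real implementation has to pick a move instruction matching the type of each register), and the list of registers left free after the moves is our reading of the scheme described above.

```python
from typing import List, Tuple


def plan_register_shift(n: int, p: int, k: int) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Plan the prologue moves for a method with n internal registers,
    p parameter registers and k registers added by the instrumentation.

    Returns (moves, spare) where moves is a list of (destination, source)
    register pairs and spare lists the registers free after the moves.
    """
    # After adding k registers, the runtime places the parameters in [n+k, n+k+p[.
    incoming = range(n + k, n + k + p)
    # The original body still expects them in [n, n+p[.
    expected = range(n, n + p)
    # Ascending order is safe even when the two ranges overlap (k < p):
    # a register is always read as a source before being written as a destination.
    moves = list(zip(expected, incoming))
    # Once the parameters are back in place, the k highest registers are unused.
    spare = list(range(n + p, n + p + k))
    return moves, spare


# Example: 3 internal registers, 2 parameters, 2 added registers.
moves, spare = plan_register_shift(3, 2, 2)
```

In this example the parameters arrive in registers 5 and 6, are copied back to registers 3 and 4, and registers 5 and 6 become available to the instrumentation.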

The next challenge arises when we need to use one of the new registers with an instruction that only accepts registers lower than $n+p$.
In such cases, a lower register must be used, and its content will be temporarily saved in one of the new registers.
This is not as easy as it seems: the Dalvik instructions differ depending on whether the register stores a reference or a scalar value, and Android does check that the register types match the instructions.
The type of the register can be computed from the control flow graph of the method (we added the computation of such a graph, with the type of each register, as a feature in Androscalpel).
An edge case that must not be overlooked is that each instruction inside a `try` block can branch to each of the `catch` blocks.
This is a problem: it prevents us from restoring the registers to their original values before entering the `catch` blocks (or, if we restore the values at the beginning of the `catch` blocks and an exception is raised before the value is saved, the register will be overwritten by an invalid value).
This means that when modifying the content of a `try` block, the block must be split into several blocks to prevent unexpected branching.
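
One way to picture the split is to see a `try` block as a range of instruction indices and to carve out the instructions that must not branch to the `catch` blocks.
The sketch below does only that arithmetic: the function name `split_try_range`, the half-open `(start, end)` representation and the choice of which instructions to exclude are assumptions made for the example, not a description of the actual implementation.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open range of instruction indices: [start, end[


def split_try_range(try_range: Range, uncovered: List[Range]) -> List[Range]:
    """Split a try range so that the `uncovered` sub-ranges are no longer protected."""
    start, end = try_range
    pieces = []
    cursor = start
    for u_start, u_end in sorted(uncovered):
        # Clamp the excluded range to the try block.
        u_start, u_end = max(u_start, start), min(u_end, end)
        if u_start >= u_end:
            continue  # no overlap with the try block
        if cursor < u_start:
            pieces.append((cursor, u_start))
        cursor = max(cursor, u_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces


# A try block covering instructions 10 to 29, with instructions 15 to 17 excluded.
pieces = split_try_range((10, 30), [(15, 18)])
```

Each resulting piece would keep the `catch` handlers of the original block, so the instructions that remain covered behave as before.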

One thing we noticed when manually instrumenting applications with Apktool is that sometimes the repackaged applications cannot be installed or run due to some files being stored incorrectly in the new application (#eg native library files must not be compressed).
We also found that some applications deliberately store files with names that will crash the zip library used by Apktool.
For this reason, we also used our own library to modify the #APK files.
We took special care to process as few files as possible in the #APKs, and only strip the #DEX files and signatures, before adding the new modified #DEX files at the end.
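
The repackaging strategy can be approximated with the Python standard library, as in the sketch below.
It is not the code of our library: the names `is_stripped` and `repackage` are hypothetical, the list of signature entries (JAR-style files under `META-INF/`, including `MANIFEST.MF`) is an assumption, and alignment and re-signing are left out.

```python
import io
import re
import zipfile
from typing import Dict

SIGNATURE_SUFFIXES = (".SF", ".RSA", ".DSA", ".EC")


def is_stripped(name: str) -> bool:
    """Entries removed from the original APK: DEX files and JAR-style signature files."""
    if re.fullmatch(r"classes\d*\.dex", name):
        return True
    if name.startswith("META-INF/"):
        return name == "META-INF/MANIFEST.MF" or name.upper().endswith(SIGNATURE_SUFFIXES)
    return False


def repackage(apk: bytes, new_dex: Dict[str, bytes]) -> bytes:
    """Copy every other entry untouched, then append the new DEX files at the end."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(apk)) as zin, zipfile.ZipFile(out, "w") as zout:
        for info in zin.infolist():
            if is_stripped(info.filename):
                continue
            # Passing the original ZipInfo keeps the compression method of the entry,
            # so a file stored uncompressed (e.g. a native library) stays uncompressed.
            zout.writestr(info, zin.read(info.filename))
        for name, data in new_dex.items():
            zout.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return out.getvalue()
```

Reusing the original entry metadata rather than recompressing everything is what keeps the number of processed files low in this sketch.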

Unfortunately, we did not have time to compare the robustness of our solution to existing tools like Apktool and Soot, but we did a quick performance comparison, summarised in @sec:th-lib-perf.
In hindsight, we probably should have taken the time to find a way to use smali/baksmali (the backend of Apktool) as a library or use SootUp to do the instrumentation, but neither option has documentation to instrument applications this way.
At the time of writing, the feature is still being developed, but in the future, Androguard might also become an option to modify #DEX files.
Nevertheless, we published our instrumentation library, Androscalpel, for anyone who wants to use it (see @sec:soft). #todo[Update is CS says no]

#midskip

Now that we have seen the transformations we want to make, we know which runtime information is needed to perform them.
In the next section, we will propose a solution to collect that information.