
#import "../lib.typ": todo, APK, DEX, JAR, OAT, eg, ART, paragraph, jm-note

/*
* Talk about dex lego and the paper that encodes the results of anger in jimple
*
*
*/

== Code Transformation <sec:th-trans>

#todo[Define code loading and reflection somewhere]
#todo[This is a draft, clean this up]
#todo[Reflectif call? Reflection call?]

In this section, we show how the application code can be transformed to make dynamic code loading and reflective calls analysable by static analysis tools.

=== Reflection <sec:th-trans-ref>

In Android, reflection can be used to do two things: instantiate a class, or call a method.
Either way, reflection starts by retrieving the `Class` object representing the class to use.
This class is usually retrieved using a `ClassLoader` object, but can also be retrieved directly from the classloader of the class defining the calling method.
// elaborate? const-class dalvik instruction / MyClass.class in java?
Once the class is retrieved, it can be instantiated using the deprecated method `Class.newInstance()`, as shown in @lst:-th-expl-cl-new-instance, or a specific method can be retrieved.
The current approach to instantiate a class is to retrieve the specific `Constructor` object, then call `Constructor.newInstance(..)`, as in @lst:-th-expl-cl-cnstr.
Similarly, to call a method, the `Method` object must be retrieved, then called using `Method.invoke(..)`, as shown in @lst:-th-expl-cl-call.

Although the process seems to differ between class instantiation and method call from the Java standpoint, the runtime operations are very similar.
When instantiating an object with `Object obj = cst.newInstance("Hello Void")`, the constructor method `<init>(Ljava/lang/String;)V`, represented by the `Constructor` `cst`, is called on the object `obj`.

#figure(
  ```java
  ClassLoader cl = MainActivity.class.getClassLoader();
  Class clz = cl.loadClass("com.example.Reflectee");
  Object obj = clz.newInstance();
  ```,
  caption: [Instantiating a class using `Class.newInstance()`]
) <lst:-th-expl-cl-new-instance>

#figure(
  ```java
  Constructor cst = clz.getDeclaredConstructor(String.class);
  Object obj = cst.newInstance("Hello Void");
  ```,
  caption: [Instantiating a class using `Constructor.newInstance(..)`]
) <lst:-th-expl-cl-cnstr>

#figure(
  ```java
  Method mth = clz.getMethod("myMethod", String.class);
  Object[] args = {(Object)"an argument"};
  String retData = (String) mth.invoke(obj, args);
  ```,
  caption: [Calling a method using reflection]
) <lst:-th-expl-cl-call>

To allow static analysis tools to analyse an application that uses reflection, we want to replace the reflection call with the bytecode that performs the actual call.

One of the main reasons to use reflection is to access classes that are not part of the application.
Although bytecode is allowed to reference classes that do not exist in the application, the application will crash at runtime if those classes are not found in the current classloader.
Similarly, some analysis tools might have trouble analysing applications that call non-existent classes.
@sec:th-trans-cl deals with the issue of adding dynamically loaded bytecode to the application.

A notable issue is that a specific reflection call can call different methods.
@lst:th-worst-case-ref illustrates a worst-case scenario where any method can be called from the same reflection call.
In those situations, we cannot guarantee that we know all the methods that can be called (#eg the name of the method called could be retrieved from a remote server).

#figure(
  ```java
  Object myInvoke(Object obj, Method mth, Object[] args) throws .. {
    return mth.invoke(obj, args);
  }
  ```,
  caption: [A reflection call that can call any method]
) <lst:th-worst-case-ref>

To handle those situations, instead of entirely removing the reflection call, we can modify the application code to test whether the `Method` (or `Constructor`) object matches any expected method and, if so, call the method directly.
If the object does not match any expected method, the code falls back to the original reflection call.
@lst:-th-expl-cl-call-trans demonstrates this transformation on @lst:-th-expl-cl-call.
It should be noted that we do the transformation at the bytecode level: the code in the listing corresponds to the output of JADX #todo[Ref to list of common tools?], reformatted for readability.
The method check is done in a separate method injected inside the application to avoid cluttering the application too much.
Because Java (and thus Android) supports polymorphic methods (several methods of a class can share the same name), checking the method name and its class is not enough: we must check the whole method signature.
We chose to limit the transformation to the specific instruction that calls `Method.invoke(..)`.
This drastically reduces the risk of breaking the application, but leads to a lot of type casting.
Indeed, the reflection call uses the generic `Object` class, but actual methods usually use specific classes (#eg `String`, `Context`, `Reflectee`) or scalar types (#eg `int`, `long`, `boolean`).
This means that the method parameters and the object on which the method is called must be downcast to their actual types before calling the method, and the returned value must then be upcast back to an `Object`.
Scalar types especially require special attention.
Java (and Android) distinguishes between scalar types and classes, and they cannot be mixed: a scalar cannot be cast into an `Object`.
However, each scalar type has an associated class that can be used when doing reflection.
For example, the scalar type `int` is associated with the class `Integer`: the method `Integer.valueOf()` converts an `int` scalar to an `Integer` object, and the method `Integer.intValue()` converts an `Integer` object back to an `int` scalar.
Each time the method called by reflection uses scalars, the scalar-object conversion must be made before calling it.
Finally, because the instructions following the reflection call expect an `Object`, the return value of the method must be cast into an `Object`.
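
As an illustration of the conversions described above, the following sketch replaces a reflective call to a method taking and returning `int` scalars with a direct call. The `Adder` class and the `directAdd` helper are hypothetical names introduced for this example only, and the actual transformation operates on bytecode, not on Java source.

```java
import java.lang.reflect.Method;

// Hypothetical target class, used only for this illustration.
class Adder {
    int add(int a, int b) {
        return a + b;
    }
}

class BoxingSketch {
    // Direct-call equivalent of mth.invoke(obj, args) when mth is Adder.add(int, int).
    static Object directAdd(Object obj, Object[] args) {
        // Downcast each argument to Integer, then unbox it to an int scalar.
        int a = ((Integer) args[0]).intValue();
        int b = ((Integer) args[1]).intValue();
        // Downcast the receiver to its actual type and call the method directly.
        int ret = ((Adder) obj).add(a, b);
        // Box the int result so the code after the call site still receives an Object.
        return Integer.valueOf(ret);
    }

    public static void main(String[] argv) throws Exception {
        Object obj = new Adder();
        Object[] args = {Integer.valueOf(2), Integer.valueOf(3)};
        Method mth = Adder.class.getDeclaredMethod("add", int.class, int.class);
        Object viaReflection = mth.invoke(obj, args);
        Object viaDirectCall = directAdd(obj, args);
        // Both paths yield equal Integer objects.
        System.out.println(viaReflection.equals(viaDirectCall));
    }
}
```

Both paths return an `Integer` object, so the code following the call site is unaffected by the replacement.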

This back and forth between types might confuse some analysis tools.
This could be improved in future work by analysing the code around the reflection call.
For example, if the result of the reflection call is immediately cast into the expected type (#eg in @lst:-th-expl-cl-call, the result is cast to a `String`), there should be no need to cast it to `Object` in between.
Similarly, the method parameter array is commonly generated just before the reflection call and never used again (this is due to `Method.invoke(..)` being a varargs method: the array can be generated by the compiler at compile time).
In those cases, the parameters could be used directly without the detour through an array.

#figure(
  ```java
  class T {
    static boolean check_is_reflectee_mymethod_e398abf7d3ce6ede(Method mth) {
      Class<?>[] paramTys = mth.getParameterTypes();
      return (
        mth.getName().equals("myMethod") &&
        paramTys.length == 1 &&
        paramTys[0].descriptorString().equals(
          String.class.descriptorString()
        ) &&
        mth.getReturnType().descriptorString().equals(
          String.class.descriptorString()
        ) &&
        mth.getDeclaringClass().descriptorString().equals(
          Reflectee.class.descriptorString()
        )
      );
    }
  }

  ...

  Method mth = clz.getMethod("myMethod", String.class);
  Object[] args = {(Object)"an argument"};
  Object objRet;
  if (T.check_is_reflectee_mymethod_e398abf7d3ce6ede(mth)) {
    objRet = (Object) ((Reflectee) obj).myMethod((String)args[0]);
  } else {
    objRet = mth.invoke(obj, args);
  }
  String retData = (String) objRet;
  ```,
  caption: [@lst:-th-expl-cl-call after the de-reflection transformation]
) <lst:-th-expl-cl-call-trans>


=== Code Loading <sec:th-trans-cl>

An application can dynamically import code from several formats, like #DEX, #APK, #JAR or #OAT, either stored in memory or in a file.
Because it is an internal, platform-dependent format, we elected to ignore the #OAT format.
Practically, #JAR and #APK files are zip files containing #DEX files.
This means that we only need to find a way to integrate #DEX files to the application.
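
Since #JAR and #APK files are zip archives, the #DEX files they contain can be enumerated with a standard zip reader. The sketch below is a minimal illustration, not the implementation used in this work: the `DexEntryLister` class is hypothetical, and it assumes that #DEX files can be recognised by their `.dex` extension alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DexEntryLister {
    // Return the names of the entries of a zip archive (JAR or APK) that look like DEX files.
    // Assumption: a DEX file is identified by its ".dex" extension only.
    static List<String> listDexEntries(InputStream archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().endsWith(".dex")) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    // Build a small in-memory archive standing in for a hypothetical lib.jar.
    static byte[] buildSampleArchive() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            String[] entryNames = {"classes.dex", "classes2.dex", "META-INF/MANIFEST.MF"};
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(new byte[] {0}); // placeholder content, not a valid DEX file
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] argv) throws IOException {
        byte[] archive = buildSampleArchive();
        System.out.println(listDexEntries(new ByteArrayInputStream(archive)));
    }
}
```

Reading from an `InputStream` rather than a file path keeps the sketch applicable to archives stored in memory as well as on disk.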

We elected to simply add the #DEX files to the application, using the multi-dex feature introduced with SDK 21 and now used by all applications, as shown in @fig:th-inserting-dex.
This gives static analysis tools access to the dynamically loaded code.

#figure(
  image(
    "figs/dex_insertion.svg",
    width: 80%,
    alt: "A diagram showing a box labelled 'app.apk', a box labelled 'lib.jar', and a single file outside the boxes labelled 'lib.dex'. The lib.jar box contains the files classes.dex and classes2.dex. Inside the app.apk box, the files AndroidManifest.xml, resources.arsc, classes.dex, classes2.dex, classes3.dex and the folders lib, res and assets are circled by dashes and labelled 'original files', and, still inside app.apk, the files classes4.dex, classes5.dex and classes6.dex are circled by dashes and labelled 'Added Files'. Arrows go from lib.dex to classes4.dex, from the classes.dex inside lib.jar to classes5.dex inside app.apk and from classes2.dex inside lib.jar to classes6.dex inside app.apk"
  ),
  caption: [Inserting #DEX files inside an #APK]
) <fig:th-inserting-dex>

We decided to leave untouched the original code that loads the bytecode.
At runtime, although the bytecode is already present in the application, the application will still dynamically load the code.
This ensures that the application keeps working as intended even if the transformations we applied are incomplete.
Specifically, to call dynamically loaded code, an application needs to use reflection, and we saw in @sec:th-trans-ref that we need to keep the reflection calls.
Keeping those calls in turn requires the classloader created when loading the bytecode.

=== Class Collisions <sec:th-class-collision>

We saw in @sec:cl-obfuscation that having several classes with the same name in the same application can be problematic.
In @sec:th-trans-cl, we are adding code from another source.
By doing so, we increase the probability of having class collisions.
When loaded dynamically, the classes are in a different classloader, and class resolution is performed at runtime, as we saw in @sec:cl-loading.
We decided to restrict our scope to the classloaders from the Android SDK.
In the absence of class collisions, those classloaders behave seamlessly, and adding the classes to the application preserves the behavior.

#jm-note[
When we detect a collision, we rename one of the colliding classes in order to be able to differentiate both classes.
To avoid breaking the application, we then need to rename all references to this specific class, and be careful not to modify references to the other class.
To do so, we group the classes by the classloaders defining them, then, for each colliding class name and each classloader, we check the actual class used by the classloader.
If the class has been renamed, we rename all references to this class in the classes defined by this classloader.
To find the class used by a classloader, we reproduce the behavior of the different classloaders of the Android SDK.
This is an important step: remember that the delegation process can lead to situations where the class defined by a classloader is not the class that will be loaded when querying the classloader.
The pseudo-code in @lst:renaming-algo shows the three steps of this algorithm:
- first, we detect collisions and rename class definitions to remove the collisions
- then, we rename the references to the colliding classes to make sure the right classes are called
- finally, we merge the modified #DEX files of each classloader into one Android application
][this is redundant and messy]

#figure(
  ```python
  defined_classes = set()
  redefined_classes = set()

  # Rename the definition of redefined classes
  for cl in class_loaders:
    for clz in defined_classes.intersection(cl.defined_classes):
      cl.rename_definition(clz)
      redefined_classes.add(clz)
    defined_classes.update(cl.defined_classes)

  # Rename references to redefined classes
  for cl in class_loaders:
    for clz in redefined_classes:
      defining_cl = cl.resolve_class(clz).class_loader
      cl.rename_reference(clz, defining_cl.new_name(clz))

  # Merge the classloaders into a flat APK
  new_apk = Apk()
  for cl in class_loaders:
    for dex in cl.get_dex():
      new_apk.add_dex(dex)
  ```,
  caption: [Pseudo-code of the renaming algorithm]
) <lst:renaming-algo>
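
The `resolve_class` step of @lst:renaming-algo depends on the delegation order of each classloader. The following sketch gives a simplified model of that resolution. `LoaderModel` is a hypothetical class, the boot classpath is ignored, and only two strategies are modelled: parent-first delegation, and the delegate-last order where the classloader's own #DEX files are searched before its parent.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of a classloader: a name, an optional parent,
// and the set of class names defined by its own DEX files.
class LoaderModel {
    final String name;
    final LoaderModel parent;
    final boolean delegateLast;
    final Set<String> definedClasses;

    LoaderModel(String name, LoaderModel parent, boolean delegateLast, String... definedClasses) {
        this.name = name;
        this.parent = parent;
        this.delegateLast = delegateLast;
        this.definedClasses = new HashSet<>(Arrays.asList(definedClasses));
    }

    // Return the loader whose definition is used when this loader is asked
    // for className, or null when no loader in the chain defines it.
    LoaderModel resolve(String className) {
        if (delegateLast && definedClasses.contains(className)) {
            // Delegate-last: the loader's own DEX files take priority over the parent.
            return this;
        }
        LoaderModel fromParent = (parent == null) ? null : parent.resolve(className);
        if (fromParent != null) {
            return fromParent;
        }
        // Parent-first loaders only use their own DEX files as a last resort.
        return definedClasses.contains(className) ? this : null;
    }
}

class DelegationSketch {
    public static void main(String[] argv) {
        LoaderModel app = new LoaderModel("app", null, false, "com.example.Util");
        LoaderModel plugin = new LoaderModel("plugin", app, false, "com.example.Util");
        LoaderModel last = new LoaderModel("last", app, true, "com.example.Util");
        // The parent-first loader defines Util, yet the definition from "app" is used.
        System.out.println(plugin.resolve("com.example.Util").name);
        // The delegate-last loader uses its own definition.
        System.out.println(last.resolve("com.example.Util").name);
    }
}
```

In this model, the parent-first loader defines `com.example.Util` but resolves it to its parent's definition, which is the situation where references must follow the resolved class and not the class defined locally.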

/*
* Although we limited ourselves to replacing one specific bytecode instruction, we encountered many technical challenges
* #todo[interrupting try blocks: catch block might expect temporary registers to still store the saved value] ?
*/

=== Limitations

#paragraph()[Custom Classloaders][
The first obvious limitation is that we do not know what custom classloaders do, so we cannot accurately emulate their behavior.
We elected to fall back to the behavior of the `BaseDexClassLoader`, which is the highest Android-specific classloader in the inheritance hierarchy, and whose behavior is shared by all classloaders save `DelegateLastClassLoader`.
The current implementation of the #ART enforces some restrictions on classloader behavior to optimize runtime performance by caching classes.
This gives us some guarantees that custom classloaders will remain somewhat consistent with the standard classloaders.
For instance, a class loaded dynamically must have the same name as the name used in `ClassLoader.loadClass()`.
This makes `BaseDexClassLoader` a good approximation for legitimate classloaders. However, an obfuscated application could use the techniques discussed in @sec:cl-cross-obf, in which case our model would be entirely wrong.
]

#paragraph()[Multiple Classloaders for one `Method.invoke()`][
#todo[explain the problem arises each time a class is compared to another]
Although we managed to handle calls to different methods from one `Method.invoke()` site, we do not handle calling methods from different classloaders with colliding class definitions.
The first reason is that it is quite challenging to compare classloaders statically.
At runtime, each object has a unique identifier that can be used to compare them over the course of the same execution, but this identifier is reset each time the application starts.
This means we cannot use this identifier in an `if` condition to differentiate the classloaders.
Ideally, we would combine the hash of the loaded #DEX files, the classloader class and its parent to make a unique, static identifier, but the #DEX files loaded by a classloader cannot be accessed at runtime without accessing the process memory at arbitrary locations.
For some classloaders, the string representation returned by `Object.toString()` lists the location of the loaded #DEX file on the file system.
This is not the case for the commonly used `InMemoryClassLoader`.
In addition, the #DEX files are often located in the application private folder, whose name is derived from the hash of the #APK itself.
Because we modify the application, the path of the private folder also changes, and so does the string representation of the classloaders.
Checking the classloader of a class can also have side effects on classloaders that delegate to the main application classloader:
because we inject the classes in the #APK, the classes of the classloader are now already in the main application classloader, which in most cases has priority over the other classloaders, leading to the class being loaded by the application classloader instead of the original classloader.
If we checked the classloader, we would need to consider such cases and rename each class of each classloader before reinjecting them into the application.
This would greatly increase the risk of breaking the application during its transformation.
Instead, we elected to ignore the classloaders when selecting the method to invoke.
This can lead to invalid runtime behavior, as the first method matching the class name will be called. The alternative methods from other classloaders still appear in the new application, albeit in a block that might be flagged as dead code by a sufficiently advanced static analyser.
]