#import "@preview/diagraph:0.3.5": render
|
|
|
|
#import "../lib.typ": SDK, num, mypercent, ART, ie, APKs, API, APIs, etal, midskip
|
|
#import "../lib.typ": todo, jfl-note
|
|
#import "X_var.typ": *
|
|
#import "../3_rasta/X_var.typ": NBTOTALSTRING
|
|
|
|
== Results <sec:th-res>
|
|
|
|
To study the impact of our transformation on analysis tools, we reused applications from the dataset we sampled in @sec:rasta/*-dataset*/.
Because we run the applications on a recent version of Android (#SDK 34), we only kept the most recent ones: those collected in 2023.
This represents #num(5000) applications out of the #NBTOTALSTRING applications of the initial dataset.
Among them, we could not retrieve 43 from Androzoo, leaving us with #num(dyn_res.all.nb) applications to test.

We will first present the results of the dynamic analysis and examine the bytecode we intercepted.
Then, we will study the impact of the instrumentation on static analysis tools, notably on their finishing rate.
Finally, we will analyse a handcrafted application to check whether the instrumentation does, in fact, improve the results of analysis tools.

=== Dynamic Analysis Results <sec:th-dyn-failure>

After running the dynamic analysis on our dataset a first time, we realised that our dynamic setup was quite fragile.
We found that #mypercent(dyn_res.all.nb_failed_first_run, dyn_res.all.nb) of the executions failed with various errors.
The majority of those errors were related to failures to connect to the Frida agent or to start the activity from Frida.
Some of those errors seemed to come from Frida, while others seemed related to the emulator failing to start the application.
We found that relaunching the analysis for the applications that failed was the simplest way to fix those issues, and after 6 passes, we went from #num(dyn_res.all.nb_failed_first_run) to #num(dyn_res.all.nb_failed) applications that could not be analysed.
The remaining errors appear to be related to the applications themselves or to Android, with #num(96) being failures to install the application and #num(110) others being null pointer exceptions from Frida.

Unfortunately, even though we managed to launch the applications, the list of activities visited by GroddDroid shows that a majority (#mypercent(dyn_res.all.z_act_visited, dyn_res.all.nb - dyn_res.all.nb_failed)) of the applications stopped before starting a single activity.
Some applications do not have any activities and are not intended to interact with a user, but those are clearly a minority and do not explain such a high number.
We expected some issues related to the use of an emulator, like the lack of x86_64 native libraries in the applications, or countermeasures aborting the application when an emulator is detected.
We manually looked at some applications, but did not find a notable pattern.
In some cases, the application was just broken -- for instance, one application was trying to load a native library that simply does not exist in the package.
In other cases, Frida is to blame: we found situations where calling a method from Frida can confuse the #ART.
In Java, `protected` methods cannot be called from a class other than the one that defined the method or one of its subclasses.
The issue is that a call made from Frida may be attributed by the #ART to another, unrelated class, leading the #ART to abort the application.

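The snippet below is a minimal sketch of this access rule on a standard JVM (it involves neither Frida nor the #ART): looking up a `protected` method through reflection succeeds, but invoking it from an unrelated caller is rejected.

```java
import java.lang.reflect.Method;

// Minimal illustration of the `protected` access rule, outside of Android.
public class ProtectedAccessDemo {
    public static void main(String[] args) throws Exception {
        // Object.clone() is protected: looking it up is allowed...
        Method clone = Object.class.getDeclaredMethod("clone");
        try {
            // ...but invoking it from ProtectedAccessDemo on a plain Object is
            // rejected, because the receiver is not an instance of the caller.
            clone.invoke(new Object());
        } catch (IllegalAccessException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```
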
@tab:th-dyn-visited shows the number of applications we analysed, whether we managed to start at least one activity, and whether we intercepted code loading or reflection.
It also shows the average number of activities visited (when at least one activity was started).
This average is slightly higher than 1, which seems reasonable: a lot of applications do not need more than one activity, but some do, and we did manage to explore at least some of those additional activities.
As shown in the table, even if an application fails to start an activity, it will sometimes still load external code or use reflection.

We later tested the applications on a real phone (model Nothing (2a), Android 15), without Frida but still using GroddRunner.
This time, we managed to visit at least one activity for #num(2130) applications, three times more than in our emulator-based experiment.
This shows that our setup does indeed break applications, but also that another issue remains that we did not identify: more than half of the tested applications still did not display any activity at all.

#figure({
  let nb_col = 7
  table(
    columns: nb_col,
    stroke: none,
    inset: 7pt,
    align: center+horizon,
    table.header(
      table.hline(),
      table.cell(colspan: nb_col, inset: 2pt)[],
      table.cell(rowspan: 2)[],
      table.cell(rowspan: 2)[nb apk],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(colspan: 2, inset: (bottom: 0pt))[nb failed],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(colspan: 2, inset: (bottom: 0pt))[activities visited],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(rowspan: 2)[average nb \ activities when > 0],

      [1#super[st] pass], [6#super[th] pass],
      [0], [$>= 1$],
    ),
    table.cell(colspan: nb_col, inset: 2pt)[],
    table.hline(),
    table.cell(colspan: nb_col, inset: 2pt)[],
    [All], num(dyn_res.all.nb), num(dyn_res.all.nb_failed_first_run), num(dyn_res.all.nb_failed), num(dyn_res.all.z_act_visited), num(dyn_res.all.nz_act_visited), num(dyn_res.all.avg_nz_act),
    [With Reflection], num(dyn_res.reflection.nb), [], [], num(dyn_res.reflection.z_act_visited), num(dyn_res.reflection.nz_act_visited), num(dyn_res.reflection.avg_nz_act),
    [With Code Loading], num(dyn_res.code_loading.nb), [], [], num(dyn_res.code_loading.z_act_visited), num(dyn_res.code_loading.nz_act_visited), num(dyn_res.code_loading.avg_nz_act),
    table.cell(colspan: nb_col, inset: 2pt)[],
    table.hline(),
  )},
  caption: [Summary of the dynamic exploration of the applications from the RASTA dataset collected by Androzoo in 2023]
) <tab:th-dyn-visited>

The high number of applications that did not start an activity means that our results will be highly biased.
Any code that would have been loaded, or any method that would have been called by reflection, from inside those activities is missed because of this limitation of our dynamic execution.
This bias must be kept in mind while reading the next subsection, which studies the bytecode we intercepted.

=== The Bytecode Loaded by the Applications <sec:th-code-collected>

We collected a total of #nb_bytecode_collected files from the #dyn_res.code_loading.nb applications that we detected loading bytecode dynamically.
#num(92) of them were loaded by a `DexClassLoader`, #num(547) by an `InMemoryDexClassLoader`, and #num(1) by a `PathClassLoader`.

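As a reminder of what these three mechanisms look like at the #API level, here is a minimal sketch; the paths, class names, and payload are placeholders and are not taken from the dataset.

```java
import android.content.Context;
import dalvik.system.DexClassLoader;
import dalvik.system.InMemoryDexClassLoader;
import dalvik.system.PathClassLoader;
import java.nio.ByteBuffer;

// Sketch of the three loading mechanisms we intercepted (placeholder names).
public final class LoaderExamples {
    static Class<?> load(Context ctx, byte[] dexBytes) throws ClassNotFoundException {
        ClassLoader parent = ctx.getClassLoader();

        // DexClassLoader: loads a .dex/.jar/.apk file from the filesystem.
        ClassLoader fromFile = new DexClassLoader(
                ctx.getFilesDir() + "/payload.dex", null, null, parent);

        // PathClassLoader: the loader Android itself uses for installed APKs.
        ClassLoader fromPath = new PathClassLoader(
                ctx.getPackageCodePath(), parent);

        // InMemoryDexClassLoader: loads DEX bytes that never touch the disk,
        // so intercepting them requires dumping the ByteBuffer at load time.
        ClassLoader fromMemory = new InMemoryDexClassLoader(
                ByteBuffer.wrap(dexBytes), parent);

        return fromMemory.loadClass("com.example.LoadedEntryPoint");
    }
}
```
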
When we compared the files, we found that we had collected only #num(bytecode_hashes.len()) distinct files, and that #num(bytecode_hashes.at(0).at(0)) of them were identical.
Looking in more detail, we found that most of those files are advertisement libraries.
In total, we collected #num(nb_google) files containing Google ads libraries and #num(nb_facebook) files containing Facebook ads libraries.
In addition, we found #num(nb_appsflyer) files containing code that we believe to be AppsFlyer, a company that provides "measurement, analytics, engagement, and fraud protection technologies".
The remaining #num(nb_bytecode_collected - nb_google - nb_appsflyer - nb_facebook) files were custom code from high-security applications (#ie banking, social security).
@tab:th-bytecode-hashes summarises the information we collected about the most common bytecode files.

#figure(
  table(
    columns: 4,
    stroke: none,
    align: center+horizon,
    table.header(
      [Nb Occurrences], [SHA 256], [Content], [Format]
    ),
    table.hline(),
    ..bytecode_hashes.slice(0, 10)
      .map(
        (e) => (num(e.at(0)), [#e.at(1).slice(0, 10)...], ..e.slice(2))
      ).flatten(),
    table.cell(colspan: 4)[...],
    table.hline(),
  ),
  caption: [Most common dynamically loaded files]
) <tab:th-bytecode-hashes>

=== Impact on Analysis Tools

We took the applications associated with the #num(nb_bytecode_collected - nb_google - nb_appsflyer - nb_facebook) unique #DEX files we found in order to assess the impact of our transformation.

The applications were indeed obfuscated, making a manual analysis tedious.
We did not find visible #DEX or #APK files inside the applications, meaning the applications either download them at runtime or generate them from variables and assets.
To estimate the scope of the code we made available, we used Androguard to generate the call graph of each application before and after instrumentation.
@tab:th-compare-cg shows the number of edges of those call graphs.
The Before and After columns show the total number of edges of the graphs, and the Diff column indicates the number of new edges detected (#ie the number of edges after instrumentation minus the number of edges before).
This number includes edges from the bytecode loaded dynamically, as well as the calls added to materialise reflection calls and the calls to "glue" methods (methods like `Integer.intValue()` used to convert objects to scalar values, or `T.check_is_Xxx_xxx(Method)` used to check whether a `Method` object represents a known method).
The last column, "Added Reflection", counts the non-glue method calls found in the call graph of the instrumented application but neither in the call graph of the original #APK nor in the call graphs of the added bytecode files, which we computed separately.
This corresponds to the calls we added to represent reflection calls.

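To make the notion of glue methods more concrete, the following sketch shows what the generated guard and conversion code could look like; the exact names and checks are illustrative, not the code emitted by our implementation.

```java
import java.lang.reflect.Method;

// Illustrative sketch of the "glue" inserted around a reflective call site.
final class T {
    // Guard: does this Method object designate the known target method?
    static boolean check_is_Malicious_get_data_fe2fa96eab371e46(Method m) {
        return m.getDeclaringClass().getName().equals("com.example.Malicious")
                && m.getName().equals("get_data");
    }
}

final class GlueExample {
    static int callSite(Method m, Object receiver, Object[] args) throws Exception {
        Object result = m.invoke(receiver, args);
        // Conversion glue: the original call site expected a scalar, so the
        // Object returned by invoke() is unboxed explicitly.
        return ((Integer) result).intValue();
    }
}
```
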
The first application, #lower(compared_callgraph.at(0).sha256), is noteworthy.
The instrumented #APK has ten times more edges in its call graph than the original, but only one added reflection call.
This is consistent with the behaviour of a packer: the application loads the main part of its code at runtime and switches from the bootstrap code to the loaded code with a single reflection call.

#figure({
  let nb_col = 5
  table(
    columns: (2fr, 1fr, 1fr, 1fr, 2fr),
    align: center+horizon,
    stroke: none,
    table.hline(),
    table.header(
      //[SHA 256], [Original CG edges], [New CG edges], [Edges added], [Reflection edges added],
      table.cell(rowspan: 2)[#APK SHA 256], table.cell(colspan: nb_col - 1)[Number of Call Graph edges],
      [Before], [After], [Diff], [Added Reflection],
    ),
    table.hline(),
    ..compared_callgraph.map(
      //(e) => ([#lower(e.sha256).slice(0, 10)...], num(e.edges_before), num(e.edges_after), num(e.added), num(e.added_ref_only))
      (e) => (
        [#lower(e.sha256).slice(0, 10)...],
        text(fill: luma(75), num(e.edges_before)),
        text(fill: luma(75), num(e.edges_after)),
        num(e.added),
        num(e.added_ref_only)
      )).flatten(),
    [#lower("5D2CD1D10ABE9B1E8D93C4C339A6B4E3D75895DE1FC49E248248B5F0B05EF1CE").slice(0, 10)...], table.cell(colspan: nb_col - 1)[_Instrumentation Crashed_],
    table.hline(),
  )},
  caption: [Edges added to the call graphs computed by Androguard by instrumenting the applications]
) <tab:th-compare-cg>

Unfortunately, our implementation of the transformation is imperfect and sometimes fails, as illustrated by #lower("5D2CD1D10ABE9B1E8D93C4C339A6B4E3D75895DE1FC49E248248B5F0B05EF1CE") in @tab:th-compare-cg.
However, of the #num(dyn_res.all.nb - dyn_res.all.nb_failed) applications whose dynamic analysis finished in our experiment, #num(nb_patched) were patched successfully.
The remaining #mypercent(dyn_res.all.nb - dyn_res.all.nb_failed - nb_patched, dyn_res.all.nb - dyn_res.all.nb_failed) failed either because of some quirk in the zip format of the #APK file, because of a bug in our implementation when the method reference limit of a single #DEX file is exceeded, or, in the case of #lower("5D2CD1D10ABE9B1E8D93C4C339A6B4E3D75895DE1FC49E248248B5F0B05EF1CE"), because the application reused the original application class loader to load new code instead of instantiating a new one (a behaviour we did not expect, as it is not possible using only the #SDK, but it is enabled by hidden #APIs).
Taking into account the failures from both the dynamic analysis and the instrumentation process, we obtain a #mypercent(dyn_res.all.nb - nb_patched, dyn_res.all.nb) overall failure rate.
This is a reasonable failure rate, but we should keep in mind that it adds to the failure rate of the other tools we want to run on the patched applications.

To check the impact of our instrumentation on the finishing rate, we then ran the same experiment we ran in @sec:rasta.
We ran the tools on the #APKs before and after instrumentation and compared the finishing rates in @fig:th-status-npatched-vs-patched (without taking into account the #APKs we failed to patch#footnote[Due to a handling error during the experiment, the figure shows the results for #nb_patched_rasta #APKs instead of #nb_patched. \ We also ignored the tool from Wognsen #etal due to its high number of timeouts]).

The finishing rate comparison is shown in @fig:th-status-npatched-vs-patched.
We can see that in most cases, the finishing rate is either the same or slightly lower for the instrumented application.
This is consistent with the fact that we add more bytecode to the application, hence more opportunities for failure during analysis.
There are two notable exceptions: Saaf and IC3.
The finishing rate of IC3, which was previously reasonable, dropped to 0 after our instrumentation, while the finishing rate of Saaf jumped to 100%, which is extremely suspicious.
Analysing the logs showed that both cases have the same origin: the bytecode generated by our instrumentation has a version number of 37 (the version introduced by Android 7.0).
Unfortunately, neither the version of Apktool used by Saaf nor Dare (the tool used by IC3 to convert Dalvik bytecode to Java bytecode) recognises this bytecode version, and both thus fail to parse the #APK.
In the case of Dare and IC3, our experiment correctly identifies this as a crash.
On the other hand, Saaf does not detect the Apktool failure, pursues the analysis with no bytecode to analyse, and returns a valid result file, but for an empty application.

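For reference, the DEX version is stored in the 8-byte magic at the very beginning of each `classes.dex` file (`dex\n037\0` for version 37). The sketch below is a minimal way to read it; it is an illustration, not part of our tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch: read the DEX format version from the header magic.
public final class DexVersion {
    public static String versionOf(Path dexFile) throws IOException {
        try (InputStream in = Files.newInputStream(dexFile)) {
            byte[] magic = in.readNBytes(8);
            // The magic is "dex\n" followed by three ASCII digits and '\0',
            // e.g. "037" for the bytecode produced by our instrumentation.
            if (magic.length < 8 || magic[0] != 'd' || magic[1] != 'e'
                    || magic[2] != 'x' || magic[3] != '\n') {
                throw new IOException("not a DEX file");
            }
            return new String(magic, 4, 3, StandardCharsets.US_ASCII);
        }
    }
}
```
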
#todo[alt text @fig:th-status-npatched-vs-patched]
#figure({
  image(
    "figs/comparision-of-exit-status.svg",
    width: 100%,
    alt: "",
  )
  //place(center + horizon, rotate(24deg, text(red.transparentize(0%), size: 20pt, "PRELIMINARY RESULTS")))
  },
  caption: [Exit status of static analysis tools on original #APKs (left) and patched #APKs (right)]
) <fig:th-status-npatched-vs-patched>

#todo[Flowdroid results are inconclusive: some apks have more leaks after and as many apks have fewer? Also, running Flowdroid on the same apk can return a different number of leaks???]

=== Example

In this subsection, we apply our approach to a single #APK to look in more detail at the analysis of the transformed application.
We handcrafted this application to demonstrate how our approach can help a reverse engineer in their work.
Accordingly, this application is quite small and contains both dynamic code loading and reflection.
We defined two methods, `Utils.source()` and `Utils.sink()`, to model a method that collects sensitive data and a method that exfiltrates data.
Those methods are the ones we will use with Flowdroid to track data flows.

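As an illustration, the source/sink pair could be as simple as the following sketch (the exact bodies and signatures in our handcrafted #APK may differ; what matters is that Flowdroid is configured to treat them as source and sink).

```java
import android.util.Log;

// Plausible sketch of the source/sink pair used by the demo application.
public final class Utils {
    // Models an API returning sensitive data (e.g. a device identifier).
    public static String source(String what) {
        return "356938035643809";
    }

    // Models an API exfiltrating data (e.g. a network send).
    public static void sink(String data) {
        Log.i("exfiltration", data);
    }
}
```
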
#figure(
```java
package com.example.theseus;

public class Main {
    private static final String DEX = "ZGV4CjA [...] EAAABEAwAA";
    Activity ac;
    private Key key = new SecretKeySpec("_-_Secret Key_-_".getBytes(), "AES");
    ClassLoader cl = new InMemoryDexClassLoader(ByteBuffer.wrap(Base64.decode(DEX, 2)), Main.class.getClassLoader());

    public void main() throws Exception {
        String[] strArr = {"n6WGYJzjDrUvR9cYljlNlw==", "dapES0wl/iFIPuMnH3fh7g=="};
        Class<?> loadClass = this.cl.loadClass(decrypt("W5f3xRf3wCSYcYG7ckYGR5xuuESDZ2NcDUzGxsq3sls="));
        Object obj = "imei";
        for (int i = 0; i < 2; i++) {
            obj = loadClass.getMethod(decrypt(strArr[i]), String.class, Activity.class).invoke(null, obj, this.ac);
        }
    }
    public String decrypt(String str) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(2, this.key);
        return new String(cipher.doFinal(Base64.decode(str, 2)));
    }

    ...
}
```,
caption: [Code of the main class of the application, as shown by Jadx, before patching],
)<lst:th-demo-before>

A first analysis of the content of the application shows that it contains one `Activity` that instantiates the class `Main` and calls `Main.main()`.
@lst:th-demo-before shows most of the code of `Main` as returned by Jadx.
We can see that the class contains another #DEX file encoded in base64 and loaded by the `InMemoryDexClassLoader` `cl` (line 7).
A class is then loaded from this class loader (line 11), and two of its methods are called by reflection (line 14).
The names of this class and of the methods are not directly accessible: they are encrypted and only decrypted at runtime, just before being used.
Here, the encryption key is available statically (line 6), and in theory, a very good static analyser implementing the Android `Cipher` #API could compute the actual methods called.
However, we could easily imagine an application that gets this key from a remote command-and-control server.
In that case, it would be impossible to recover those methods with static analysis alone.
When we ran Flowdroid on this application, it computed a call graph of 43 edges and found no data leak.
This is not particularly surprising considering the obfuscation methods used.

We then ran the dynamic analysis described in @sec:th-dyn on the application and applied the transformation described in @sec:th-trans to add the dynamic information to it.
This time, Flowdroid computes a larger call graph of 76 edges, and does find a data leak.
Indeed, when looking at the new application with Jadx, we notice a new class `Malicious`, and the code of `Main.main()` is now as shown in @lst:th-demo-after:
the method called in the loop is either `Malicious.get_data()`, `Malicious.send_data()` or `Method.invoke()` (lines 9, 11 and 12).
Although the names are self-explanatory, checking the code of those methods confirms that `get_data()` indeed calls `Utils.source()` and `send_data()` calls `Utils.sink()`.

#figure(
```java
public void main() throws Exception {
    String[] strArr = {"n6WGYJzjDrUvR9cYljlNlw==", "dapES0wl/iFIPuMnH3fh7g=="};
    Class<?> loadClass = this.cl.loadClass(decrypt("W5f3xRf3wCSYcYG7ckYGR5xuuESDZ2NcDUzGxsq3sls="));
    Object obj = "imei";
    for (int i = 0; i < 2; i++) {
        Method method = loadClass.getMethod(decrypt(strArr[i]), String.class, Activity.class);
        Object[] objArr = {obj, this.ac};
        obj = T.check_is_Malicious_get_data_fe2fa96eab371e46(method) ?
            Malicious.get_data((String) objArr[0], (Activity) objArr[1]) :
            T.check_is_Malicious_send_data_ca50fd7916476073(method) ?
            Malicious.send_data((String) objArr[0], (Activity) objArr[1]) :
            method.invoke(null, objArr);
    }
}
```,
caption: [Code of `Main.main()`, as shown by Jadx, after patching],
)<lst:th-demo-after>

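For completeness, a hypothetical reconstruction of the injected `Malicious` class is sketched below; the real code is the bytecode dumped during the dynamic run, but its behaviour matches what we observed: `get_data()` reaches `Utils.source()` and `send_data()` reaches `Utils.sink()`.

```java
import android.app.Activity;

// Hypothetical reconstruction of the dynamically loaded class, for illustration.
public class Malicious {
    public static String get_data(String what, Activity activity) {
        // Collects the sensitive value modelled by Utils.source().
        return Utils.source(what);
    }

    public static String send_data(String data, Activity activity) {
        // Exfiltrates it through the sink modelled by Utils.sink().
        Utils.sink(data);
        return data;
    }
}
```
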
For a higher-level view of the method, we can also look at its call graph.
We used Androguard to generate the call graphs in @fig:th-cg-before and @fig:th-cg-after#footnote[We manually edited the generated .dot files for readability.].
@fig:th-cg-before shows the original call graph, and gives a good idea of the obfuscation methods used: we can see calls to `Main.decrypt(String)`, which itself calls cryptographic #APIs, as well as calls to `ClassLoader.loadClass(String)`, `Class.getMethod(String, Class[])` and `Method.invoke(Object, Object[])`.
This indicates reflection calls based on encrypted strings, but does not reveal what the method actually does.
In comparison, @fig:th-cg-after, the call graph after instrumentation, still shows the cryptographic and reflection calls, as well as four new method calls.
In grey on the figure, we can see the glue methods (`T.check_is_Xxx_xxx(Method)`).
Those methods are part of the instrumentation process presented in @sec:th-trans, but do not add much to the analysis of the call graph.
In red on the figure, however, are the calls that were hidden by reflection in the first call graph, and, thanks to the bytecode of the called methods being injected into the application, we can also see that they call `Utils.source(String)` and `Utils.sink(String)`, the methods we defined for this application as the source of confidential data and the exfiltration method.

#todo[alt text for @fig:th-cg-before and @fig:th-cg-after]
#figure(
  render(
    read("figs/demo_main_main.dot"),
    width: 100%,
    alt: (
      "",
    ).join(),
  ),
  caption: [Call Graph of `Main.main()` generated by Androguard before patching],
) <fig:th-cg-before>

#figure(
  render(
    read("figs/patched_main_main.dot"),
    width: 100%,
    alt: (
      "",
    ).join(),
  ),
  caption: [Call Graph of `Main.main()` generated by Androguard after patching],
) <fig:th-cg-after>

#midskip

To conclude, we showed that our approach indeed improves the results of analysis tools without much impact on their finishing rates.
Unfortunately, we also noticed that our dynamic analysis is suboptimal, either because of our experimental setup or because of the solution we use to explore the applications.
In the next section, we will present in more detail the limitations of our solution, as well as future work that could improve the contributions presented in this chapter.