thesis/5_theseus/4_results.typ

#import "@preview/diagraph:0.3.5": render

#import "../lib.typ": SDK, num, mypercent, ART, ie, APKs, API, APIs
#import "../lib.typ": todo, jfl-note
#import "X_var.typ": *
#import "../3_rasta/X_var.typ": NBTOTALSTRING

== Results <sec:th-res>

#todo[better section name for @sec:th-res]

To studdy the impact of our transformation on analysis tools, we reused applications from the dataset we sampled in @sec:rasta/*-dataset*/.
Because we are running the application on a rescent version of Android (#SDK 34), we only took the most recent applications: the one collected in 2023.
This represent #num(5000) applications over the #NBTOTALSTRING total of the initial dataset.
Among them, we could not retrieve 43 from Androzoo, leaving us with #num(dyn_res.all.nb) applications to test.

We will first look at the results of the dynamic analysis and look closer at the bytecode we intercepted.
Then, we will studdy the impact the instrumentation have on static analysis tools, notably on their success rate, and we will finish with the analysis of an handcrafted application to check the instrumentation does in fact improve the results of analysis tools.

=== Dynamic Analysis Results <sec:th-dyn-failure>

After running the dynamic analysis on our dataset the first time we realised our dynamic setup was quite fragile.
We found that #mypercent(dyn_res.all.nb_failed_first_run, dyn_res.all.nb) of the execution failed with various errors.
The majority of those errors were related to faillures to connect to the Frida agent or start the activity from Frida.
Some of those errors seamed to come from Frida, while other seamed related to the emulator failing to start the application.
We found that relaunching the analysis for the applications that failled was the most simple way to fix those issues, and after 6 passes we went from #num(dyn_res.all.nb_failed_first_run) to #num(dyn_res.all.nb_failed) application that could not be analysed.
The remaining errors look more related to the application itself or Android, with #num(96) errors being a failure to install the application, and #num(110) other beeing a null pointer exception from Frida.

Infortunatly, although we managed to start the applications, we can see from the list of activity visited by GroddDroid that a majority (#mypercent(dyn_res.all.z_act_visited, dyn_res.all.nb - dyn_res.all.nb_failed)) of the application stopped before even starting one activity.
Some applications do not have an activity, and are not intended to interact with a user, but those are clearly a minority and do not explain such a high number.
We expected some issue related to the use of an emulator, like the lack of x86_64 library in the applications, or contermesures aborting the application if the emulator is detected.
We manually looked at some applications, but did not found a notable pattern.
In some cases, the application was just broken -- for instance, an application was trying to load a native library that simply does not exists in the application.
In other case, Frida is to blame: we found some cases where calling a method from Frida can confuse the #ART.
`protected` methods needs to be called from the class that defined the method or one of its children calsses, but Frida might be considered by the #ART as an other class, leading to the #ART aborting the application.
#todo[jfl was suppose to test a few other app #emoji.eyes]
@tab:th-dyn-visited shows the number of applications that we analysed, if we managed to start at least one activity and if we intercepted code loading or reflection.
It also shows the average number of activities visited (when at least one activity was started).
This average slightly higher than 1, which seems reasonable: a lot of applications do not need more than one activity, but some do and we did manage to explore at least some of those additionnal activities.
As shown in the table, even if the application fails to start an activity, some times it will still load external code or use reflection.

#figure({
  let nb_col = 7
  table(
    columns: nb_col,
    stroke: none,
    inset: 7pt,
    align: center+horizon,
    table.header(
      table.hline(),
      table.cell(colspan: nb_col, inset: 2pt)[],
      table.cell(rowspan: 2)[],
      table.cell(rowspan: 2)[nb apk],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(colspan: 2, inset: (bottom: 0pt))[nb failled],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(colspan: 2, inset: (bottom: 0pt))[activities visited],
      table.vline(end: 3),
      table.vline(start: 4),
      table.cell(rowspan: 2)[average nb \ activities when > 0],

      [1#super[st] pass], [6#super[th] pass],
      [0], [$>= 1$],
    ),
    table.cell(colspan: nb_col, inset: 2pt)[],
    table.hline(),
    table.cell(colspan: nb_col, inset: 2pt)[],
    [All], num(dyn_res.all.nb), num(dyn_res.all.nb_failed_first_run), num(dyn_res.all.nb_failed), num(dyn_res.all.z_act_visited), num(dyn_res.all.nz_act_visited), num(dyn_res.all.avg_nz_act),
    [With Reflection], num(dyn_res.reflection.nb), [], [], num(dyn_res.reflection.z_act_visited), num(dyn_res.reflection.nz_act_visited),  num(dyn_res.reflection.avg_nz_act),
    [With Code Loading], num(dyn_res.code_loading.nb), [], [], num(dyn_res.code_loading.z_act_visited), num(dyn_res.code_loading.nz_act_visited),  num(dyn_res.code_loading.avg_nz_act),
    table.cell(colspan: nb_col, inset: 2pt)[],
    table.hline(),
  )},
  caption: [Summary of the dynamic exploration of the applications from the RASTA dataset collected by Androzoo in 2023]
) <tab:th-dyn-visited>

The high number of application that did not start an activity means that our result will be highly biaised.
The code that might be loaded or method that might be called by reflection from inside activities is filtered out by the limit of or dynamic execution.
This biaised must be kept in mind when reading the next subsection that studdy the bytecode that we intercepted.

=== The Bytecode Loaded by Application <sec:th-code-collected>

We collected a total of #nb_bytecode_collected files for #dyn_res.code_loading.nb application that we detected loading bytecode dynamicatlly.
#num(92) of them were loaded by a `DexClassLoader`, #num(547) were loaded by a `InMemoryDexClassLoader` and #num(1) was loaded by a `PathClassLoader`.

Once we compared the files, we found that we only collected #num(bytecode_hashes.len()) distinct files, and that #num(bytecode_hashes.at(0).at(0)) where identicals.
Once we looked more in details, we found that most of those files are advertisement libraries.
In total, we collected #num(nb_google) files containing Google ads librairies and #num(nb_facebook) files containing Facebook ads librairies.
In addition, we found #num(nb_appsflyer) files containing code that we believe to be AppsFlyer, and company that provides "measurement, analytics, engagement, and fraud protection technologies".
The remaining #num(nb_bytecode_collected - nb_google - nb_appsflyer - nb_facebook) files were custom code from high security applications (#ie banking, social security)
@tab:th-bytecode-hashes sumarize the information we collected about the most common bytecode files.

#figure(
  table(
    columns: 4,
    stroke: none,
    align: center+horizon,
    table.header(
      [Nb Occurences], [SHA 256], [Content], [Format]
    ),
    table.hline(),
    ..bytecode_hashes.slice(0, 10)
    .map(
      (e) => (num(e.at(0)), [#e.at(1).slice(0, 10)...], ..e.slice(2))
    ).flatten(),
    table.cell(colspan: 4)[...],
    table.hline(),
  ),
  caption: [Most common dynamically loaded files]
) <tab:th-bytecode-hashes>

=== Impact on Analysis Tools

We took the applications associated with the #num(nb_bytecode_collected - nb_google - nb_appsflyer - nb_facebook) unique #DEX files we found to see the impact of our transformation.

The applications where indeed obfuscated, making a manual analysis tedious.
We did not found visible #DEX files or #APK files inside the applications, meaning the applications are either downloading or generating them from variables and assets at runtime.
To estimate the scope of the code we made available, we use Androguard to generate the call graph of the applications, before and after the instrumentation.
@tab:th-compare-cg shows the number of edges of those call graphs.
The columns before and after shows the total number of edges of the graphs, and the diff columns is the number of new edges detected (#ie the number of edges after instrumentation minus the number of edges before).
This number include edges from the bytecode loaded dynamically, as well as the call added to reflect reflection calls, and calls to "glue" methods (method like `Integer.intValue()` used to convert objects to scalar values, or calls to `T.check_is_Xxx_xxx(Method)` used to check if a `Method` object represent a known method).
The last column, "Added Reflection", is the list of non-glue method calls found in the call graph of the instrumented application but neither in call graph of the original #APK, nor in the call graphes of the added bytecode files that we computed separately.
This correspond to the calls we added to represent reflection calls.

The first application, #lower(compared_callgraph.at(0).sha256), is noticable.
The instrumented #APK has ten times more edges to its call graph than the original, and only one reflection call.
This is consistant with the behaviour of a packer: the application load the main of its code at runtime and switch from the bootstrap code to the loaded code with a single reflection call.

#figure({
  let nb_col = 5
  table(
    columns: (2fr, 1fr, 1fr, 1fr, 2fr),
    align: center+horizon,
    stroke: none,
    table.hline(),
    table.header(
      //[SHA 256], [Original CG edges], [New CG edges], [Edges added], [Reflection edges added],
      table.cell(rowspan: 2)[#APK SHA 256], table.cell(colspan: nb_col - 1)[Number of Call Graph edges], [Before], [After], [Diff], [Added Reflection],
    ),
    table.hline(),
    ..compared_callgraph.map(
      //(e) => ([#lower(e.sha256).slice(0, 10)...], num(e.edges_before), num(e.edges_after), num(e.added), num(e.added_ref_only))
      (e) => (
        [#lower(e.sha256).slice(0, 10)...],
        text(fill: luma(75), num(e.edges_before)),
        text(fill: luma(75), num(e.edges_after)),
        num(e.added),
        num(e.added_ref_only)
    )).flatten(),
    [#lower("5D2CD1D10ABE9B1E8D93C4C339A6B4E3D75895DE1FC49E248248B5F0B05EF1CE").slice(0, 10)...], table.cell(colspan: nb_col - 1)[_Instrumentation Crashed_],
    table.hline(),
  )},
  caption: [Edges added to the call graphes computed by Androguard by instrumenting the applications]
) <tab:th-compare-cg>

Unfortunately, our implementation of the transformation is imperfect and does fails some time, as illustrated by #lower("5D2CD1D10ABE9B1E8D93C4C339A6B4E3D75895DE1FC49E248248B5F0B05EF1CE") in @tab:th-compare-cg.
However, over the #num(dyn_res.all.nb - dyn_res.all.nb_failed) applications whose dynamic analysis finished in our experiment, #num(nb_patched) were patched.
The remaining #mypercent(dyn_res.all.nb - dyn_res.all.nb_failed - nb_patched, dyn_res.all.nb - dyn_res.all.nb_failed) failed either due to some quirk in the zip format of the #APK file, because of a bug in our implementation when exceeding the method reference limit in a single #DEX file, or in the case of #lower("5D2CD1D10ABE9B1E8D93C4C339A6B4E3D75895DE1FC49E248248B5F0B05EF1CE"), because the application reused the original application classloader to load new code instead of instanciated a new classes loader (a behavior we did not expected as not possible using only the #SDK, but enabled by hidden #APIs).
Taking into accound the failure from both dynamic analysis and the instrumentation process, we have a #mypercent(dyn_res.all.nb - nb_patched, dyn_res.all.nb) failure rate.
This is a reasonable failure rate, but we should keep in mind that it adds up to the failure rate of the other tools we want to use on the patched application.

To check the impact on the finishing rate of or instrumentation, we then run the same experiment we run in @sec:rasta.
We run the tools on the #APK before and after instrumentation, and compared the finishing rates in @fig:th-status-npatched-vs-patched (without taking into account #APKs we failed to patch#footnote[Due to an handling error during the experiment, the figure show the results for #nb_patched_rasta #APKs instead of #nb_patched.]).

The finishing rate comparision is shown in @fig:th-status-npatched-vs-patched.
We can see that in most cases, the finishing rate either the same, or slightly lower for the instrumented application.
This is consistent with the fact that we add more bytecode to the application, hence adding more oportunities of failure during analysis.
They are two notable exceptions: Saaf and IC3.
The finishing rate of IC3 which was previously reasibabe drop to 0 after our instrumentation, while the finishing rate of Saaf jump to 100%, which is extremely suspicious.
Analysing the logs of the analysis showed that both cases have the same origin: the bytecode generated by our instrumentation has version number of 37 (the version introduced by Android 7.0).
Infortunately, neither the version of Apktool used by Saaf nor Dare (the tool used by IC3 to convert Dalvik bytecode to Java bytecode) recognize this version of bytecode, and thus failed to parse the #APK.
In the case of Dare and IC3, our experiment correctly identify this a crash.
On the other hand, Saaf do not detect the issue with Apktool and pursue the analysis with no bytecode to analyse and return a valid return file, but for an empty application.

#todo[alt text @fig:th-status-npatched-vs-patched]
#figure({
  image(
    "figs/comparision-of-exit-status.svg",
    width: 100%,
    alt: "",
  )
  place(center + horizon, rotate(24deg, text(red.transparentize(0%), size: 20pt, "PRELIMINARY RESULTS")))
  },
  caption: [Exist status of static analysis tools on original #APKs (left) and patched #APKs (right)]
) <fig:th-status-npatched-vs-patched>

#todo[Flowdroid results are inconclusive: some apks have more leak after and as many apks have less? also, runing flowdroid on the same apk can return a different number of leak???]

=== Example

In this subsection, we use on our approach on a small #APK to look in more details into analysis of the transformed application.
We handcrafted this application for the purpose of demonstrating how this can improve help a reverse engineer in its work.
Accordingly, this application is quite small and contains both dynamic code loading and reflection.
We defined a method `Utils.source()` and `Utils.sink()` to model respectively a method that collect sensitive data and that exfiltrate data.
Those methods are the one we will use with Flowdroid to track data flows.

#figure(
  ```java
package com.example.theseus;

public class Main {
    private static final String DEX = "ZGV4CjA [...] EAAABEAwAA";
    Activity ac;
    private Key key = new SecretKeySpec("_-_Secret Key_-_".getBytes(), "AES");
    ClassLoader cl = new InMemoryDexClassLoader(ByteBuffer.wrap(Base64.decode(DEX, 2)), Main.class.getClassLoader());

    public void main() throws Exception {
        String[] strArr = {"n6WGYJzjDrUvR9cYljlNlw==", "dapES0wl/iFIPuMnH3fh7g=="};
        Class<?> loadClass = this.cl.loadClass(decrypt("W5f3xRf3wCSYcYG7ckYGR5xuuESDZ2NcDUzGxsq3sls="));
        Object obj = "imei";
        for (int i = 0; i < 2; i++) {
            obj = loadClass.getMethod(decrypt(strArr[i]), String.class, Activity.class).invoke(null, obj, this.ac);
        }
    }
    public String decrypt(String str) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(2, this.key);
        return new String(cipher.doFinal(Base64.decode(str, 2)));
    }

    ...
}
  ```,
  caption: [Code of the main class of the application showed by Jadx, before patching],
)<lst:th-demo-before>

A first analysis of the contant of the application shows that the application contains one `Activity` that instanciate the class `Main` and call `Main.main()`.
@lst:th-demo-before shows the most of the code of `Main` as returned by Jadx.
We can see that the class contains another #DEX file encoded in base 64 and loaded in the `InMemoryDexClassLoader` `cl`.
A class is then loaded from this class loader, and two methods from this class loader are called.
The names of this class and methods are not directly accessible as they have been chipĥered and are decoded just before beeing used at runtime.
Here, the encryption key is available statically, and in theorie, a verry good static analyser implementing Android `Cipher` #API could compute the actual methods called.
However, we could easily imagine an application that gets this key from a remote command and control server.
In this case, it would be impossible to compute those methods with static analysis alone.
When running Flowdroid on this application, it computed a callgraph of 43 edges on this application, an no data leaks.
This is not particularly surprising considering the obfusctation methods used.

Then we run the dynamic analysis we described in @sec:th-dyn on the application and apply the transformation described in @sec:th-trans to add the dynamic informations to it.
This time, Flowdroid compute a larger callgraph of 76 edges, and does find a data leak.
Indeed, when looking at the new application with Jadx, we notice a new class `Malicious`, and the code of `Main.main()` is now as shown in @lst:th-demo-after:
the method called in the loop is either `Malicious.get_data`, `Malicious.send_data()` or `Method.invoke()`.
Although self explanatory, verifying the code of those methods indeed confirm that `get_data()` calls `Utils.source()` and `send_data()` calls `Utils.sink()`.

#figure(
  ```java
    public void main() throws Exception {
        String[] strArr = {"n6WGYJzjDrUvR9cYljlNlw==", "dapES0wl/iFIPuMnH3fh7g=="};
        Class<?> loadClass = this.cl.loadClass(decrypt("W5f3xRf3wCSYcYG7ckYGR5xuuESDZ2NcDUzGxsq3sls="));
        Object obj = "imei";
        for (int i = 0; i < 2; i++) {
            Method method = loadClass.getMethod(decrypt(strArr[i]), String.class, Activity.class);
            Object[] objArr = {obj, this.ac};
            obj = T.check_is_Malicious_get_data_fe2fa96eab371e46(method) ?
              Malicious.get_data((String) objArr[0], (Activity) objArr[1]) :
              T.check_is_Malicious_send_data_ca50fd7916476073(method) ?
              Malicious.send_data((String) objArr[0], (Activity) objArr[1]) :
              method.invoke(null, objArr);
        }
    }
    ```,
    caption: [Code of `Main.main()` showed by Jadx, after patching],
)<lst:th-demo-after>

#todo[alt text for @fig:th-cg-before and @fig:th-cg-after]
#todo[comment @fig:th-cg-before and @fig:th-cg-after]
#todo[Conclude and transition]
#figure(
  render(
    read("figs/demo_main_main.dot"),
    width: 100%,
    alt: (
      "",
    ).join(),
  ),
  caption: [Call Graph of `Main.main()` view by Androguard before patching],
) <fig:th-cg-before>

#figure(
  render(
    read("figs/patched_main_main.dot"),
    width: 100%,
    alt: (
      "",
    ).join(),
  ),
  caption: [Call Graph of `Main.main()` view by Androguard after patching],
) <fig:th-cg-after>


#todo[androgard call graph]