Compare commits

..

3 commits

Author SHA1 Message Date
Jean-Marie Mineau
5c3a6955bd
I declare this manuscript finished
2025-10-07 17:16:32 +02:00
Jean-Marie Mineau
9f39ded209
Merge branch 'main' of git.mineau.eu:these-android-re/thesis 2025-10-07 14:21:27 +02:00
Jean-Marie Mineau
550f886977
add xp results 2025-10-07 14:21:18 +02:00
15 changed files with 201 additions and 129 deletions


@ -125,7 +125,7 @@ The contributions of this thesis are the following:
+ We propose an approach to allow static analysis tools to analyse applications that perform dynamic code loading:
We collect at runtime the dynamically loaded bytecode and the information about reflection calls, and patch the #APK file to perform those operations statically.
Finally, we evaluate the impact this transformation has on the tools we containerised previously.
+ We released under the GPL licence #todo[Still waiting for the INRIA to validate] the software we used in the experiments presented in this thesis.
+ We released under the AGPL licence #todo[Still waiting for CS to validate] the software we used in the experiments presented in this thesis.
For @sec:rasta, this includes the code used to test the output of each tool and the code to analyse the results of the experiment, in addition to the containers to run the tested tools.
We also released Androscalpel, a Rust crate to manipulate Dalvik bytecode, which we used to create Theseus, a set of scripts that implement the approach presented in @sec:th.
The complete list and location of the software we release are available in @sec:soft.


@ -4,23 +4,6 @@
=== Static Analysis <sec:bg-static>
A static analysis program examines an #APK file without executing it to extract information from it.
Basic static analysis can include extracting information from the `AndroidManifest.xml` file or decompiling bytecode to Java code with tools like Apktool or Jadx.
Unfortunately, simply reading the bytecode does not scale.
It requires a human analyst, which makes it complicated to analyse a large number of applications; even a single application can quickly overwhelm the reverse engineer with its size and complexity.
Control flow analysis is often used to mitigate this issue.
The idea is to extract the behaviour, the flow, of the application from the bytecode, and to represent it as a graph.
A graph representation is easier to work with than a list of instructions and can be used for further analysis.
Depending on the level of precision required, different types of graphs can be computed.
The most basic of those graphs is the call graph.
A call graph is a graph where the nodes represent the methods in the application, and the edges represent calls from one method to another.
@fig:bg-fizzbuzz-cg-cfg b) shows the call graph of the code in @fig:bg-fizzbuzz-cg-cfg a).
A more advanced control-flow analysis consists of building the control-flow graph.
This time, instead of methods, the nodes represent instructions, and the edges indicate which instruction can follow which instruction.
@fig:bg-fizzbuzz-cg-cfg c) represents the control-flow graph of @fig:bg-fizzbuzz-cg-cfg a), with code statements instead of bytecode instructions.
#figure({
set align(center)
stack(dir: ttb,[
@ -119,6 +102,22 @@ This time, instead of methods, the nodes represent instructions, and the edges i
caption: [Source code for a simple Java method and its Call and Control Flow Graphs],
)<fig:bg-fizzbuzz-cg-cfg>
A static analysis program examines an #APK file without executing it to extract information from it.
Basic static analysis can include extracting information from the `AndroidManifest.xml` file or decompiling bytecode to Java code with tools like Apktool or Jadx.
Unfortunately, simply reading the bytecode does not scale.
It requires a human analyst, which makes it complicated to analyse a large number of applications; even a single application can quickly overwhelm the reverse engineer with its size and complexity.
Control flow analysis is often used to mitigate this issue.
The idea is to extract the behaviour, the flow, of the application from the bytecode, and to represent it as a graph.
A graph representation is easier to work with than a list of instructions and can be used for further analysis.
Depending on the level of precision required, different types of graphs can be computed.
The most basic of those graphs is the call graph.
A call graph is a graph where the nodes represent the methods in the application, and the edges represent calls from one method to another.
@fig:bg-fizzbuzz-cg-cfg b) shows the call graph of the code in @fig:bg-fizzbuzz-cg-cfg a).
A more advanced control-flow analysis consists of building the control-flow graph.
This time, instead of methods, the nodes represent instructions, and the edges indicate which instruction can follow which instruction.
@fig:bg-fizzbuzz-cg-cfg c) represents the control-flow graph of @fig:bg-fizzbuzz-cg-cfg a), with code statements instead of bytecode instructions.
Once the control-flow graph is computed, it can be used to compute data-flows.
Data-flow analysis, also called taint-tracking, is used to follow the flow of information in the application.
By defining a list of methods and fields that can generate critical information (taint sources) and a list of methods that can consume information (taint sinks), taint-tracking detects potential data leaks (if a data flow links a taint source and a taint sink).
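To give a concrete (and purely illustrative) example, the following Java method contains such a flow, assuming `TelephonyManager.getDeviceId()` is declared as a taint source and `SmsManager.sendTextMessage(..)` as a taint sink:
```java
import android.telephony.SmsManager;
import android.telephony.TelephonyManager;

class LeakExample {
    // Hypothetical leak: the device identifier (taint source) reaches an outgoing SMS (taint sink).
    void leak(TelephonyManager tm, SmsManager sms) {
        String id = tm.getDeviceId();                            // taint source: critical information enters the program
        String message = "id=" + id;                             // the taint propagates through the concatenation
        sms.sendTextMessage("5554", null, message, null, null);  // taint sink: the tainted data leaves the device
    }
}
```
A taint-tracking tool would report the flow from `getDeviceId()` to `sendTextMessage(..)` as a potential leak.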


@ -341,7 +341,7 @@ Two datasets are used in the experiments of this section.
The first one is *Drebin*~@Arp2014, from which we extracted the malware part (5479 samples that we could retrieve) for comparison purposes only.
It is a well-known and very old dataset that should not be used anymore because it contains temporal and spatial biases~@Pendlebury2018.
We intend to compare the rate of success on this old dataset with a more recent one.
The second one, *Rasta*, we built to cover all dates between 2010 and 2023.
The second one, *RASTA* (Reusability of Android Static Tools and Analysis), we built to cover all dates between 2010 and 2023.
This dataset is a random extract of Androzoo~@allixAndroZooCollectingMillions2016, for which we balanced applications between years and size.
For each year and inter-decile range of size in Androzoo, 500 applications have been extracted with an arbitrary proportion of 7% of malware.
This ratio has been chosen because it is the ratio of goodware/malware that we observed when performing a raw extract of Androzoo.


@ -4,10 +4,8 @@
== Experiments <sec:rasta-xp>
=== #rq1: Re-Usability Evaluation
#figure(
image(
"figs/exit-status-for-the-drebin-dataset.svg",
@ -71,10 +69,10 @@
wognsen_et_al: a little less than 15% finished, a little less than 20% failed, the rest timed out
"
),
caption: [Exit status for the Rasta dataset],
caption: [Exit status for the RASTA dataset],
) <fig:rasta-exit>
@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and Rasta datasets.
@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and RASTA datasets.
They represent the success/failure rate (green/orange) of the tools.
We distinguished failure to compute a result from timeout (blue) and crashes of our evaluation framework (in grey, probably due to out-of-memory kills of the container itself).
Because they may be caused by a bug in our own analysis stack, exit statuses represented in grey (Other) are considered unknown errors and not failures of the tool.
@ -84,8 +82,8 @@ Results on the Drebin datasets show that 11 tools have a high success rate (grea
The other tools have poor results.
The worst, excluding Lotrack and Tresher, is Anadroid with a success ratio under 20%.
On the Rasta dataset, we observe a global increase in the number of failed statuses: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
The tools that have bad results with Drebin, of course, also have bad results on Rasta.
On the RASTA dataset, we observe a global increase in the number of failed statuses: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
The tools that have bad results with Drebin, of course, also have bad results on RASTA.
Three tools (androguard_dad, blueseal, saaf) that were performing well (higher than 85%) on Drebin surprisingly fall below the 50% success bar.
7 tools keep a high success rate: Adagio, Amandroid, Androguard, Apparecium, Gator, Mallodroid, Redexer.
Regarding IC3, the fork with a simpler build process and support for modern OS has a lower success rate than the original tool.
@ -135,7 +133,7 @@ For the tools that we could run, #resultratio of analyses are finishing successf
supplement: none,
kind: "sub-rasta-exit-evolution"
) <fig:rasta-exit-evolution-not-java>]
), caption: [Exit status evolution for the Rasta dataset]
), caption: [Exit status evolution for the RASTA dataset]
) <fig:rasta-exit-evolution>
To investigate the effect of application dates on the tools, we computed the date of each #APK based on the minimum date between the first upload in AndroZoo and the first analysis in VirusTotal.
@ -293,7 +291,7 @@ The date is also correlated with the success rate for Java-based tools only.
table.hline(),
table.header(
table.cell(colspan: 3/*4*/, inset: 3pt)[],
table.cell(rowspan:2)[*Rasta part*],
table.cell(rowspan:2)[*RASTA part*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Average size* (MB)],
@ -358,7 +356,7 @@ sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size
width: 100%,
alt: "Bar chart showing the % of analyse apk on the y-axis and the tools on the x-axis.
Each tool has two bars, one for goodware and one for malware.
The goodware bars are the same as the one in the figure Exit status for the Rasta dataset.
The goodware bars are the same as the one in the figure Exit status for the RASTA dataset.
The timeout rate looks the same on both bars of each tool.
The finishing rate of the malware bar is a lot higher than in the goodware bar for androguard_dad, blueseal, didfail, iccta, perfchecker and wognsen_et_al.
The finishing rate of the malware bar is higher than in the goodware bar for ic3 and ic3_fork.
@ -366,7 +364,7 @@ sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size
The other tools have similar finishing rates, slightly in favour of malware.
"
),
caption: [Exit status comparing goodware (left bars) and malware (right bars) for the Rasta dataset],
caption: [Exit status comparing goodware (left bars) and malware (right bars) for the RASTA dataset],
) <fig:rasta-exit-goodmal>
/*


@ -137,7 +137,7 @@ Therefore, we investigated the nature of errors globally, without distinction be
width: 100%,
alt: "",
),
caption: [Heatmap of the ratio of error reasons for all tools for the Rasta dataset],
caption: [Heatmap of the ratio of error reasons for all tools for the RASTA dataset],
) <fig:rasta-heatmap>
@fig:rasta-heatmap shows the most frequent error objects for each of the tools.
@ -148,7 +148,7 @@ First, the heatmap helps us to confirm that our experiment is running in adequat
Regarding errors linked to memory, two errors should be investigated: `OutOfMemoryError` and `StackOverflowError`.
The first one only appears for Gator with a low ratio.
Several tools have a low ratio of errors concerning the stack.
These results confirm that the allocated heap and stack are sufficient for running the tools with the Rasta dataset.
These results confirm that the allocated heap and stack are sufficient for running the tools with the RASTA dataset.
Regarding errors linked to disk space, we observe small ratios for the exceptions `IOException`, `FileNotFoundError` and `FileNotFoundException`.
Manual inspections revealed that those errors are often a consequence of a failed Apktool execution.


@ -10,10 +10,10 @@ In this section, we will compare our results with the contributions presented in
Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022, a real-world benchmark and the associated recommendations to build such a benchmark.
These benchmarks confirmed that some tools, such as Amandroid and Flowdroid, are less efficient on real-world applications.
We confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analysis than using handcrafted test applications or old datasets~@luoTaintBenchAutomaticRealworld2022.
In addition, even if Drebin is not hand-crafted, it is quite old and seems to present similar issues as handcrafted datasets when used to evaluate a tool: we obtained much better results on it than on the Rasta dataset -- which is more representative of real-world applications.
In addition, even if Drebin is not hand-crafted, it is quite old and seems to present similar issues as handcrafted datasets when used to evaluate a tool: we obtained much better results on it than on the RASTA dataset -- which is more representative of real-world applications.
Our findings are also consistent with the numerical results of Pauck #etal that showed that #mypercent(106, 180) of DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analysed successfully with the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the RASTA dataset of #NBTOTALSTRING applications.
We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
We confirmed that most tools require a significant amount of work to get them running~@reaves_droid_2016.
Our investigations of crashes also confirmed that dependencies on older versions of Apktool impact the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTa, as already identified by Pauck #etal.


@ -73,6 +73,15 @@ This behaviour implements a priority and avoids redefining by error a core class
This behaviour is useful for overriding specific classes of a class loader while keeping the other classes.
A normal class loader would prioritise the classes of its delegate over its own.
At runtime, Android instantiates for each application three instances of class loaders described previously: `bootClassLoader`, the unique instance of `BootClassLoader`, and two instances of `PathClassLoader`: `systemClassLoader` and `appClassLoader`.
`bootClassLoader` is responsible for loading Android *#platc*.
It is the direct delegate of the two other class loaders instantiated by Android.
`appClassLoader` points to the application `.apk` file, and is used to load the classes inside the application.
`systemClassLoader` is a `PathClassLoader` pointing to `'.'`, the working directory of the application, which is `'/'` by default.
The documentation of `ClassLoader.getSystemClassLoader` reports that this class loader is the default context class loader for the main application thread.
In reality, the #platc are loaded by `bootClassLoader` and the classes from the application are loaded by `appClassLoader`.
`systemClassLoader` is never used in production according to our careful reading of the #AOSP sources.
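As a minimal illustration (hypothetical code, not part of #AOSP), the three instances can be observed from inside an application as follows:
```java
import android.content.Context;
import android.util.Log;

final class ClassLoaderProbe {
    // Sketch: log the three class loaders described above.
    static void dump(Context context) {
        ClassLoader appClassLoader = context.getClassLoader();              // PathClassLoader pointing to the .apk
        ClassLoader systemClassLoader = ClassLoader.getSystemClassLoader(); // PathClassLoader pointing to '.'
        ClassLoader bootClassLoader = appClassLoader.getParent();           // BootClassLoader, delegate of both
        Log.d("ClassLoaderProbe", appClassLoader + " | " + systemClassLoader + " | " + bootClassLoader);
    }
}
```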
#figure(
```python
def get_multi_dex_classes_dex_name(index: int):
@ -98,15 +107,6 @@ A normal class loader would prioritise the classes of its delegate over its own.
caption: [Default Class Loading Algorithm for Android Applications],
) <lst:cl-loading-alg>
At runtime, Android instantiates for each application three instances of class loaders described previously: `bootClassLoader`, the unique instance of `BootClassLoader`, and two instances of `PathClassLoader`: `systemClassLoader` and `appClassLoader`.
`bootClassLoader` is responsible for loading Android *#platc*.
It is the direct delegate of the two other class loaders instantiated by Android.
`appClassLoader` points to the application `.apk` file, and is used to load the classes inside the application.
`systemClassLoader` is a `PathClassLoader` pointing to `'.'`, the working directory of the application, which is `'/'` by default.
The documentation of `ClassLoader.getSystemClassLoader` reports that this class loader is the default context class loader for the main application thread.
In reality, the #platc are loaded by `bootClassLoader` and the classes from the application are loaded by `appClassLoader`.
`systemClassLoader` is never used in production according to our careful reading of the #AOSP sources.
In addition to the class loaders instantiated by ART when starting an application, the developer of an application can use class loaders explicitly, either by instantiating ones from the #Asdk or by writing custom class loaders that inherit from the `ClassLoader` class.
At this point, accurately modelling the complete class loading algorithm becomes impossible: the developer can program any algorithm of their choice.
For this reason, this case is excluded from this chapter, and we focus on the default behaviour where the context class loader is the one pointing to the `.apk` file and where its delegate is `BootClassLoader`.
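For illustration, explicit use of an #Asdk class loader typically looks like the following hypothetical sketch; the loaded class is not present in the `.apk` and is therefore invisible to a purely static analysis:
```java
import android.content.Context;
import dalvik.system.DexClassLoader;

final class PluginLoader {
    // Hypothetical sketch: load extra bytecode fetched at runtime with an SDK class loader.
    static Class<?> loadPlugin(Context context) throws ClassNotFoundException {
        String dexPath = context.getFilesDir() + "/plugin.dex";            // path and file name are illustrative
        DexClassLoader loader = new DexClassLoader(
                dexPath, null, null, context.getClassLoader());            // delegates to the application class loader
        return loader.loadClass("com.example.Plugin");                     // resolved only at runtime
    }
}
```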


@ -107,14 +107,6 @@ We used 4 versions of this application:
As for the third one, we store data in `com.android.okhttp.Request` and then retrieve it.
Again, the shadowing implementation discards the data.
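To give an intuition of this technique, a simplified, hypothetical shadow class (not the exact code of our test application) could look as follows; statically, a tool sees this implementation, but at runtime the class loading mechanism resolves `com.android.okhttp.Request` to the platform version, and the stored data is discarded:
```java
// Hypothetical shadow of a platform class, embedded in the application bytecode.
package com.android.okhttp;

public class Request {
    private static String stash;

    public static void store(String secret) { stash = secret; } // appears to keep the data
    public static String retrieve() { return stash; }           // appears to return it later
}
```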
We used the 4 selected tools on the 4 versions of the application and compared the results on the control application to the results on the other application implementing the different obfuscation techniques.
We found that these static analysis tools do not consider the class loading mechanism, either because the tools only look at the content of the application file (#eg a disassembler) or because they consider class loading to be a dynamic feature and thus out of their scope.
In @tab:cl-results, we report on the types of shadowing that can trick each tool.
A plain circle indicates a shadow attack that leads to a wrong result.
A white circle indicates a tool that emits warnings or displays the two versions of the class.
A cross indicates a tool not impacted by the shadow attack.
//We explain in more detail in the following the results for each considered tool.
#figure({
table(
columns: 5,
@ -147,6 +139,14 @@ A cross is a tool not impacted by a shadow attack.
caption: [Working attacks against static analysis tools]
) <tab:cl-results>
We used the 4 selected tools on the 4 versions of the application and compared the results on the control application to the results on the other application implementing the different obfuscation techniques.
We found that these static analysis tools do not consider the class loading mechanism, either because the tools only look at the content of the application file (#eg a disassembler) or because they consider class loading to be a dynamic feature and thus out of their scope.
In @tab:cl-results, we report on the types of shadowing that can trick each tool.
A plain circle indicates a shadow attack that leads to a wrong result.
A white circle indicates a tool that emits warnings or displays the two versions of the class.
A cross indicates a tool not impacted by the shadow attack.
//We explain in more detail in the following the results for each considered tool.
==== Jadx
//Jadx is a reverse engineering tool that regenerates the Java source code of an application.


@ -143,14 +143,6 @@ Manual inspection of some applications revealed that the two main reasons are:
- Instead of checking if the method's attributes are null inline, like Android does, applications use the method `org.apache.http.util.Args.notNull()`. According to comments in the source code of Android#footnote[https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/java/org/apache/http/params/HttpConnectionParams.java;drc=3bdd327f8532a79b83f575cc62e8eb09a1f93f3d?], the class was forked in 2007 from the Apache 'httpcomponents' project. Looking at the history of the project, the use of `Args.notNull()` was introduced in 2012#footnote[https://github.com/apache/httpcomponents-core/commit/9104a92ea79e338d876b1b60f5cd2b243ba7069f?]. This shows that applications are embedding code from more recent versions of this library without realising their version will not be the one used.
- Very small changes that we found can be attributed to the compilation process (e.g. swapping registers: `v0` is used instead of `v1` and `v1` instead of `v0`); even though we count such classes as different, they are in practice very similar.
The remaining 4.99% of classes that are identical to the Android version are classes where the body of the methods is replaced by stubs that throw `RuntimeException("Stub!")`.
This code corresponds to what we found in `android.jar`, but not the code we found in the emulator, which is surprising.
Nevertheless, we decided to count them as identical, because `android.jar` is the official jar file for developers, and stubs are replaced in the emulator: this behaviour is intended by Google developers.
Other results of @tab:cl-topsdk can be similarly discussed: either they are identical with a high ratio, or they are different because of small variations.
When substantial differences appear, it is mainly because different versions of the same library have been used or an #SDK class is embedded for retro-compatibility.
]
#figure({
show table: set text(size: 0.80em)
table(
@ -202,6 +194,14 @@ When substantial differences appear, it is mainly because different versions of
caption: [Shadow classes compared to #SDK 34 for a dataset of #nbapk applications]
) <tab:cl-topsdk>
The remaining 4.99% of classes that are identical to the Android version are classes where the body of the methods is replaced by stubs that throw `RuntimeException("Stub!")`.
This code corresponds to what we found in `android.jar`, but not the code we found in the emulator, which is surprising.
Nevertheless, we decided to count them as identical, because `android.jar` is the official jar file for developers, and stubs are replaced in the emulator: this behaviour is intended by Google developers.
Other results of @tab:cl-topsdk can be similarly discussed: either they are identical with a high ratio, or they are different because of small variations.
When substantial differences appear, it is mainly because different versions of the same library have been used or an #SDK class is embedded for retro-compatibility.
]
#paragraph([Hidden shadowing])[
For applications redefining hidden classes, on average, 16.1 classes are redefined (cf bottom part of @tab:cl-shadow).
The top 3 packages whose code actually differs from the ones found in Android are `java.util.stream`, `org.ccil.cowan.tagsoup` and `org.json`:


@ -82,25 +82,6 @@ If we were to expect other possible methods to be called in addition to `myMetho
] #todo[Ref to list of common tools?] reformated for readability.
*/
The check of the `Method` value is done in a separate method injected inside the application to avoid cluttering the application too much.
Because Java (and thus Android) uses polymorphic methods, we cannot just check the method name and its class: we must check the whole method signature.
We chose to limit the transformation to the specific instruction that calls `Method.invoke(..)`.
This drastically reduces the risks of breaking the application, but leads to a lot of type casting.
Indeed, the reflection call uses the generic `Object` class, but actual methods usually use specific classes (#eg `String`, `Context`, `Reflectee`) or scalar types (#eg `int`, `long`, `boolean`).
This means that the method parameters and the object on which the method is called must be downcast to their actual types before calling the method, and the returned value must then be upcast back to an `Object`.
Scalar types especially require special attention.
Java (and Android) distinguishes between scalar types and classes, and they cannot be mixed: a scalar cannot be cast into an `Object`.
However, each scalar type has an associated class that can be used when doing reflection.
For example, the scalar type `int` is associated with the class `Integer`, the method `Integer.valueOf()` can convert an `int` scalar to an `Integer` object, and the method `Integer.intValue()` converts back an `Integer` object to an `int` scalar.
Each time the method called by reflection uses scalars, the scalar-object conversion must be made before calling it.
Finally, because the instruction following the reflection call expects an `Object`, the return value of the method must be cast into an `Object`.
This back and forth between types might confuse some analysis tools.
This could be improved in future works by analysing the code around the reflection call.
For example, if the result of the reflection call is immediately cast into the expected type (#eg in @lst:-th-expl-cl-call, the result is cast to a `String`), there should be no need to cast it to an `Object` in between.
Similarly, the method parameter array is commonly generated just before the reflection call and never used again (this is because `Method.invoke(..)` is a varargs method: the array can be generated by the compiler at compile time).
In those cases, the parameters could be used directly, without the detour through an array.
#figure(
```java
class T {
@ -137,6 +118,25 @@ In those cases, the parameters could be used directly without the detour inside
caption: [@lst:-th-expl-cl-call after the de-reflection transformation]
) <lst:-th-expl-cl-call-trans>
The check of the `Method` value is done in a separate method injected inside the application to avoid cluttering the application too much.
Because Java (and thus Android) uses polymorphic methods, we cannot just check the method name and its class: we must check the whole method signature.
We chose to limit the transformation to the specific instruction that calls `Method.invoke(..)`.
This drastically reduces the risks of breaking the application, but leads to a lot of type casting.
Indeed, the reflection call uses the generic `Object` class, but actual methods usually use specific classes (#eg `String`, `Context`, `Reflectee`) or scalar types (#eg `int`, `long`, `boolean`).
This means that the method parameters and the object on which the method is called must be downcast to their actual types before calling the method, and the returned value must then be upcast back to an `Object`.
Scalar types especially require special attention.
Java (and Android) distinguishes between scalar types and classes, and they cannot be mixed: a scalar cannot be cast into an `Object`.
However, each scalar type has an associated class that can be used when doing reflection.
For example, the scalar type `int` is associated with the class `Integer`, the method `Integer.valueOf()` can convert an `int` scalar to an `Integer` object, and the method `Integer.intValue()` converts back an `Integer` object to an `int` scalar.
Each time the method called by reflection uses scalars, the scalar-object conversion must be made before calling it.
Finally, because the instruction following the reflection call expects an `Object`, the return value of the method must be cast into an `Object`.
This back and forth between types might confuse some analysis tools.
This could be improved in future works by analysing the code around the reflection call.
For example, if the result of the reflection call is immediately cast into the expected type (#eg in @lst:-th-expl-cl-call, the result is cast to a `String`), there should be no need to cast it to an `Object` in between.
Similarly, the method parameter array is commonly generated just before the reflection call and never used again (this is because `Method.invoke(..)` is a varargs method: the array can be generated by the compiler at compile time).
In those cases, the parameters could be used directly, without the detour through an array.
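To make the casting and boxing explicit, the following Java sketch approximates what the transformed call site does for a hypothetical target method `Reflectee.compute(int)`; the names of the injected check and of the target are illustrative, and the actual transformation is performed directly on the Dalvik bytecode:
```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Hypothetical class whose method was originally only reached through reflection.
class Reflectee {
    long compute(int x) { return x * 42L; }
}

final class DeReflectionSketch {
    // Injected check: accepts only the exact signature Reflectee.compute(int).
    static boolean checkIsReflecteeCompute(Method m) {
        return m.getDeclaringClass() == Reflectee.class
                && m.getName().equals("compute")
                && m.getParameterCount() == 1
                && m.getParameterTypes()[0] == int.class;
    }

    // Replacement for the single instruction that originally called method.invoke(receiver, args).
    static Object invokeOrDirectCall(Method method, Object receiver, Object[] args)
            throws IllegalAccessException, InvocationTargetException {
        if (checkIsReflecteeCompute(method)) {
            Reflectee target = (Reflectee) receiver;       // downcast the receiver to its actual type
            int argument = ((Integer) args[0]).intValue(); // unbox the scalar parameter
            long result = target.compute(argument);        // direct call, now visible to static analysis
            return Long.valueOf(result);                   // re-box so the caller still receives an Object
        }
        return method.invoke(receiver, args);              // otherwise, keep the original reflective call
    }
}
```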
=== Transforming Code Loading (or Not) <sec:th-trans-cl>
#jfl-note[Ici je pensais lire comment on tranforme le code qui load du code, mais on me parle de multi dex]
@ -270,7 +270,7 @@ We took special care to process the least possible files in the #APKs, and only
Unfortunately, we did not have time to compare the robustness of our solution to existing tools like Apktool and Soot, but we did a quick performance comparison, summarised in @sec:th-lib-perf.
In hindsight, we probably should have taken the time to find a way to use smali/baksmali (the backend of Apktool) as a library or use SootUp to do the instrumentation, but neither option has documentation to instrument applications this way.
At the time of writing, the feature is still being developed, but in the future, Androguard might also become an option to modify #DEX files.
Nevertheless, we published our instrumentation library, Androscalpel, for anyone who wants to use it. #todo[ref to code]
Nevertheless, we published our instrumentation library, Androscalpel, for anyone who wants to use it (see @sec:soft). #todo[Update if CS says no]
#midskip


@ -111,7 +111,8 @@ The remaining #num(nb_bytecode_collected - nb_google - nb_appsflyer - nb_faceboo
table.cell(colspan: 4)[...],
table.hline(),
),
caption: [Most common dynamically loaded files]
caption: [Most common dynamically loaded files],
placement: top,
) <tab:th-bytecode-hashes>
=== Impact on Analysis Tools
@ -167,16 +168,6 @@ This is a reasonable failure rate, but we should keep in mind that it adds up to
To check the impact of our instrumentation on the finishing rate, we then ran the same experiment as in @sec:rasta.
We ran the tools on the #APKs before and after instrumentation, and compared the finishing rates in @fig:th-status-npatched-vs-patched (without taking into account the #APKs we failed to patch#footnote[Due to a handling error during the experiment, the figure shows the results for #nb_patched_rasta #APKs instead of #nb_patched. \ We also ignored the tool from Wognsen #etal due to the high number of timeouts]).
The finishing rate comparison is shown in @fig:th-status-npatched-vs-patched.
We can see that in most cases, the finishing rate is either the same or slightly lower for the instrumented application.
This is consistent with the fact that we add more bytecode to the application, hence adding more opportunities for failure during analysis.
There are two notable exceptions: Saaf and IC3.
The finishing rate of IC3, which was previously reasonable, dropped to 0 after our instrumentation, while the finishing rate of Saaf jumped to 100%, which is extremely suspicious.
Analysing the logs of the analysis showed that both cases have the same origin: the bytecode generated by our instrumentation has a version number of 37 (the version introduced by Android 7.0).
Unfortunately, neither the version of Apktool used by Saaf nor Dare (the tool used by IC3 to convert Dalvik bytecode to Java bytecode) recognises this version of bytecode, and thus failed to parse the #APK.
In the case of Dare and IC3, our experiment correctly identifies this as a crash.
On the other hand, Saaf does not detect the issue with Apktool, pursues the analysis with no bytecode to analyse, and returns a valid result file, but for an empty application.
#todo[alt text @fig:th-status-npatched-vs-patched]
#figure({
image(
@ -189,6 +180,16 @@ On the other hand, Saaf do not detect the issue with Apktool and pursues the ana
caption: [Exit status of static analysis tools on original #APKs (left) and patched #APKs (right)]
) <fig:th-status-npatched-vs-patched>
The finishing rate comparison is shown in @fig:th-status-npatched-vs-patched.
We can see that in most cases, the finishing rate is either the same or slightly lower for the instrumented application.
This is consistent with the fact that we add more bytecode to the application, hence adding more opportunities for failure during analysis.
There are two notable exceptions: Saaf and IC3.
The finishing rate of IC3, which was previously reasonable, dropped to 0 after our instrumentation, while the finishing rate of Saaf jumped to 100%, which is extremely suspicious.
Analysing the logs of the analysis showed that both cases have the same origin: the bytecode generated by our instrumentation has a version number of 37 (the version introduced by Android 7.0).
Unfortunately, neither the version of Apktool used by Saaf nor Dare (the tool used by IC3 to convert Dalvik bytecode to Java bytecode) recognises this version of bytecode, and thus failed to parse the #APK.
In the case of Dare and IC3, our experiment correctly identifies this as a crash.
On the other hand, Saaf does not detect the issue with Apktool, pursues the analysis with no bytecode to analyse, and returns a valid result file, but for an empty application.
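For reference, the bytecode version is stored as three ASCII digits in the #DEX header magic (`dex\n035\0`, `dex\n037\0`, ...); a minimal sketch to read it:
```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class DexVersion {
    // Reads the three version digits from the 8-byte magic of a .dex file.
    static String read(String dexFile) throws IOException {
        byte[] header = Files.readAllBytes(Paths.get(dexFile));
        return new String(header, 4, 3, StandardCharsets.US_ASCII); // e.g. "037" for bytecode version 37
    }
}
```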
#todo[Flowdroid results are inconclusive: some apks have more leaks after and as many apks have fewer? Also, running flowdroid on the same apk can return a different number of leaks???]
=== Example
@ -266,16 +267,8 @@ Although self-explanatory, verifying the code of those methods indeed confirms t
caption: [Code of `Main.main()`, as shown by Jadx, after patching],
)<lst:th-demo-after>
For a higher-level view of the method, we can also look at its call graph.
We used Androguard to generate the call graphs in @fig:th-cg-before and @fig:th-cg-after#footnote[We manually edited the generated .dot files for readability.].
@fig:th-cg-before shows the original call graph, and gives a good idea of the obfuscation methods used: we can see calls to `Main.decrypt(String)`, which itself calls cryptographic #APIs, as well as calls to `ClassLoader.loadClass(String)`, `Class.getMethod(String, Class[])` and `Method.invoke(Object, Object[])`.
This indicates reflection calls based on ciphered strings, but does not reveal what the method actually does.
In comparison, @fig:th-cg-after, the call graph after instrumentation, still shows the cryptographic and reflection calls, as well as four new method calls.
In grey on the figure, we can see the glue methods (`T.check_is_Xxx_xxx(Method)`).
Those methods are part of the instrumentation process presented in @sec:th-trans, but do not contribute much to the analysis of the call graph.
In red on the figure, however, we have the calls that were hidden by reflection in the first call graph, and thanks to the bytecode of the called methods being injected into the application, we can also see that they call `Utils.source(String)` and `Utils.sink(String)`, the methods we defined for this application as the source of confidential data and the exfiltration method.
#todo[alt text for @fig:th-cg-before and @fig:th-cg-after]
#figure([
#figure(
render(
read("figs/demo_main_main.dot"),
@ -297,12 +290,26 @@ In red on the figure however, we have the calls that were hidded by reflection i
),
caption: [Call Graph of `Main.main()` generated by Androguard after patching],
) <fig:th-cg-after>
],
caption: none,
kind: "th-cg-cmp-andro",
supplement: none,
)
For a higher-level view of the method, we can also look at its call graph.
We used Androguard to generate the call graphs in @fig:th-cg-before and @fig:th-cg-after#footnote[We manually edited the generated .dot files for readability.].
@fig:th-cg-before shows the original call graph, and gives a good idea of the obfuscation methods used: we can see calls to `Main.decrypt(String)`, which itself calls cryptographic #APIs, as well as calls to `ClassLoader.loadClass(String)`, `Class.getMethod(String, Class[])` and `Method.invoke(Object, Object[])`.
This indicates reflection calls based on ciphered strings, but does not reveal what the method actually does.
In comparison, @fig:th-cg-after, the call graph after instrumentation, still shows the cryptographic and reflection calls, as well as four new method calls.
In grey on the figure, we can see the glue methods (`T.check_is_Xxx_xxx(Method)`).
Those methods are part of the instrumentation process presented in @sec:th-trans, but do not contribute much to the analysis of the call graph.
In red on the figure, however, we have the calls that were hidden by reflection in the first call graph, and thanks to the bytecode of the called methods being injected into the application, we can also see that they call `Utils.source(String)` and `Utils.sink(String)`, the methods we defined for this application as the source of confidential data and the exfiltration method.
=== Androscalpel Performances <sec:th-lib-perf>
Because we implemented our own instrumentation library, we wanted to compare it to other existing options.
Unfortunately, we did not have time to compare the robustness and correctness of the generated applications.
However, we did compare the performance of our library, Androscalpel, to Apktool and Soot.
However, we did compare the performance of our library, Androscalpel, to Apktool and Soot, over the first 100 applications of RASTA (in alphabetical order of their SHA256).
Due to time constraints, we could not test a complex transformation, as adding registers requires complex operations for both Androscalpel and Apktool (see @sec:th-implem for more details).
We decided to test two operations: traversing the instructions of an application (a read-only operation), and regenerating an application without modification (a read/write operation).
@ -316,19 +323,46 @@ It should be noted that all three of the tested tools have multiprocessing suppo
table.header(
table.cell(colspan: 2)[Tool], [Soot], [Apktool], [Androscalpel],
),
table.cell(rowspan: 2)[Read],
[Time], [], [], [],
[Mem], [], [], [],
table.cell(rowspan: 2)[Read/Write],
[Time], [], [], [],
[Mem], [], [], [],
table.cell(colspan: nb_col, inset: 1pt, stroke: none)[],
table.cell(rowspan: 3)[Read],
[Time (s)], ..for tool in ("soot", "apktool", "androscalpel") {
let res = performance_results.at(tool).read
(num(calc.round(res.cumulative_time / res.nb_results, digits: 2)),)
},
[Mem (GB)], ..for tool in ("soot", "apktool", "androscalpel") {
let res = performance_results.at(tool).read
(num(calc.round(res.cumulative_mem / res.nb_results / 1000000, digits: 2)),)
},
[Detected Crashes], ..for tool in ("soot", "apktool", "androscalpel") {
let res = performance_results.at(tool).read
(num(100 - res.nb_results),)
},
table.cell(colspan: nb_col, inset: 1pt, stroke: none)[],
table.cell(rowspan: 3)[Read/Write],
[Time (s)], ..for tool in ("soot", "apktool", "androscalpel") {
let res = performance_results.at(tool).write
(num(calc.round(res.cumulative_time / res.nb_results, digits: 2)),)
},
[Mem (GB)], ..for tool in ("soot", "apktool", "androscalpel") {
let res = performance_results.at(tool).write
(num(calc.round(res.cumulative_mem / res.nb_results / 1000000, digits: 2)),)
},
[Detected Crashes], ..for tool in ("soot", "apktool", "androscalpel") {
let res = performance_results.at(tool).write
(num(100 - res.nb_results),)
},
)},
caption: [Average time and memory consumption of Soot, Apktool and Androscalpel]
) <tab:th-compare-perf>
@tab:th-compare-perf compares the resources consumed by each tool for each operation.
We can see that for the read-only operation, we are 16 times faster than Soot and 8 times faster than Apktool, while keeping a smaller memory footprint.
When generating an application, the gap lessens, but we are still almost 8 times faster than Soot.
Some of this difference probably comes from implementation choices: Soot and Apktool are implemented in Java, which has a noticeable overhead compared to Rust.
However, a noticeable part of this difference can also be explained by the specialised nature of our library; we did not implement all the features Soot has, and we do not parse Android resources like Apktool does.
Having better performance does not mean that our solution can replace the others in all cases.
#todo[Conclude depending on the results of the experiment]
Nevertheless, it should be noted that over the 100 applications tested, Soot failed to regenerate 10 of them, Apktool 4, and Androscalpel only 1, showing that our efforts to limit crashes were successful.
#midskip


@ -2,12 +2,12 @@
== Conclusion <sec:th-conclusion>
In this chapter, we presented a set of transformations to apply to an application to encode reflection calls and code loaded dynamically inside the application.
In this chapter, we presented a set of transformations to encode reflection calls and code loaded dynamically inside the application.
We also presented a dynamic analysis approach to collect the information needed to perform those transformations.
We then applied this method to a recent subset of applications of our dataset from @sec:rasta.
When comparing the success rate of the tools of @sec:rasta on the applications before and after the transformation, we found that the success rate generally decreases slightly, with a few exceptions.
We also showed that our transformation indeed allows static analysis tools to access and process that runtime information in their analysis.
We also showed that our transformation allows static analysis tools to access and process that runtime information in their analysis.
However, a more in-depth look at the results of our dynamic analysis showed that our code coverage is lacking, and that the great majority of dynamically loaded code we intercepted is from generic advertisement and telemetry libraries.
#v(2em)


@ -119,3 +119,6 @@ F34CE1E7A81F935A5BB2D0B2B3FE81E62C1C8B906C92253C9CA467DA9BB3C9D1,704095,706576,2
#let nb_google = bytecode_hashes.filter((e) => "google" in e.at(2)).map((e) => e.at(0)).sum()
#let nb_facebook = bytecode_hashes.filter((e) => "facebook" in e.at(2)).map((e) => e.at(0)).sum()
#let nb_appsflyer = bytecode_hashes.filter((e) => "appsflyer" in e.at(2)).map((e) => e.at(0)).sum()
#let performance_results = json("./data/performance_results.json")


@ -0,0 +1,38 @@
{
"androscalpel": {
"read": {
"cumulative_time": 438.4299999999998,
"cumulative_mem": 58026844,
"nb_results": 100
},
"write": {
"cumulative_time": 2013.3000000000002,
"cumulative_mem": 117539104,
"nb_results": 99
}
},
"soot": {
"read": {
"cumulative_time": 7395.499999999999,
"cumulative_mem": 233476236,
"nb_results": 100
},
"write": {
"cumulative_time": 14092.539999999999,
"cumulative_mem": 362160516,
"nb_results": 90
}
},
"apktool": {
"read": {
"cumulative_time": 3308.8900000000012,
"cumulative_mem": 138020736,
"nb_results": 100
},
"write": {
"cumulative_time": 7189.420000000001,
"cumulative_mem": 184730724,
"nb_results": 96
}
}
}


@ -215,15 +215,15 @@ Le @tab:rasta-choix-sources résume cette étape.
caption: [Outils sélectionnés, branchements, versions sélectionnées et environnements d'exécution],
) <tab:rasta-choix-sources>
Nous avons ensuite exécuté ces outils sur deux jeux d'applications: Drebin, un jeu de maliciels connu pour être vieux et biaisé, et Rasta, un jeu que nous avons échantillonné nous-mêmes pour représenter l'évolution des caractéristiques des applications entre 2010 et 2023, d'un total de #NBTOTALSTRING #APKs.
Nous avons ensuite exécuté ces outils sur deux jeux d'applications: Drebin, un jeu de maliciels connu pour être vieux et biaisé, et RASTA, un jeu que nous avons échantillonné nous-mêmes pour représenter l'évolution des caractéristiques des applications entre 2010 et 2023, d'un total de #NBTOTALSTRING #APKs.
Après avoir lancé les outils, nous avons collecté les différents résultats et traces d'exécution.
@fig:rasta-exit-drebin-fr et @fig:rasta-exit-fr montrent les résultats des analyses sur les applications de Drebin et Rasta.
@fig:rasta-exit-drebin-fr et @fig:rasta-exit-fr montrent les résultats des analyses sur les applications de Drebin et RASTA.
L'analyse est considérée comme réussie (vert) si un résultat est obtenu, sinon elle a échoué (rouge).
Quand l'analyse met plus d'une heure à finir, elle est avortée (bleue).
On peut voir que les outils ont d'assez bons résultats sur Drebin, avec 11 outils qui ont un taux de finition au-dessus de 85%.
Sur Rasta par contre, #resultunusablenb outils (#resultunusable) ont un taux de finition en dessous de 50%.
Les outils qui avaient des difficultés avec Drebin ont aussi de mauvais résultats sur Rasta, mais d'autres outils avec des résultats acceptables sur Drebin chutent avec Rasta.
Sur RASTA par contre, #resultunusablenb outils (#resultunusable) ont un taux de finition en dessous de 50%.
Les outils qui avaient des difficultés avec Drebin ont aussi de mauvais résultats sur RASTA, mais d'autres outils avec des résultats acceptables sur Drebin chutent avec RASTA.
Ces résultats nous permettent de répondre à notre première question *QR1*:
@ -293,7 +293,7 @@ De plus pour les outils que nous avons pu lancer, #resultratio des analyses ont
wognsen_et_al: a little less than 15% finished, a little less than 20% failed, the rest timed out
"
),
caption: [Taux de finition pour le jeu d'applications Rasta],
caption: [Taux de finition pour le jeu d'applications RASTA],
) <fig:rasta-exit-fr>
Nous avons ensuite étudié l'évolution du taux de finition des outils au cours des ans.
@ -305,7 +305,7 @@ Par exemple, la librairie standard d'Android et le format des applications ont
Un autre changement notable est la taille du code à octets des applications.
Les applications les plus récentes ont notablement plus de code.
Pour déterminer le facteur qui influence le plus le taux de finition, nous avons étudié des sous-ensembles de Rasta avec certains de ces paramètres fixés.
Pour déterminer le facteur qui influence le plus le taux de finition, nous avons étudié des sous-ensembles de RASTA avec certains de ces paramètres fixés.
Par exemple, nous avons tracé l'évolution du taux de finition en fonction de l'année de publication des applications sur l'ensemble des applications dont le code à octets fait entre 4.08 et 5.2 Mo.
#figure(
@ -601,7 +601,7 @@ Nous avons finalement choisi `DexFile.openInMemoryDexFileNative()` et `DexFile.o
Ces méthodes sont les dernières méthodes appelées dans l'environnement Java avant de passer en natif pour analyser et charger le code à octets.
Pour aider à l'exploration des applications, nous avons réutilisé une partie de GroddDroid, un outil dédié à l'exploration dynamique d'applications.
Nous avons lancé notre analyse statique sur les #num(5000) applications publiées en 2023 du jeu d'applications Rasta.
Nous avons lancé notre analyse statique sur les #num(5000) applications publiées en 2023 du jeu d'applications RASTA.
Malheureusement, les résultats semblent indiquer que notre environnement d'exécution est insuffisant et que beaucoup d'applications n'ont pas été visitées correctement.
Malgré tout, nous avons collecté #nb_bytecode_collected fichiers de code à octets.
Toutefois, une fois comparés, nous remarquons que parmi ces fichiers, il n'y a que #num(bytecode_hashes.len()) fichiers distincts.