wip
Jean-Marie Mineau 2025-08-19 23:27:25 +02:00
parent 5a71a9d5dd
commit 81f49f87d3
Signed by: histausse
GPG key ID: B66AEEDA9B645AD2
16 changed files with 267 additions and 202 deletions

@ -1,10 +1,14 @@
#let ADB = link(<acr-adb>)[ADB]
#let AOSP = link(<acr-aosp>)[AOSP]
#let API = link(<acr-api>)[API]
#let APIs = link(<acr-api>)[APIs]
#let APK = link(<acr-apk>)[APK]
#let APKs = link(<acr-apk>)[APKs]
#let ART = link(<acr-art>)[ART]
#let AXML = link(<acr-axml>)[AXML]
#let CFG = link(<acr-cfg>)[CFG]
#let DEX = link(<acr-dex>)[DEX]
#let DFG = link(<acr-dfg>)[DFG]
#let FR = link(<acr-fr>)[FR]
#let HPC = link(<acr-hpc>)[HPC]
#let MWE = link(<acr-mwe>)[MWE]
@ -25,11 +29,14 @@
[Acronyms], [Meanings],
),
ADB, [Android Debug Bridge, a tool to connect to an Android emulator or smartphone to run commands, start applications, send events and perform other operations for testing and debugging purposes <acr-adb>],
AOSP, [Android Open Source Project, the project hosting most of the Android operating system source code <acr-aosp>],
API, [Application Programming Interface, in the Android ecosystem, it is a set of classes with known method signatures that can be called by an application to interact with the Android framework <acr-api>],
APK, [Android Package, the file format used to install applications on Android. The APK format is an extension of the #JAR format <acr-apk>],
ART, [Android RunTime, the runtime environment that executes an Android application. The ART is the successor of the older Dalvik Virtual Machine <acr-art>],
AXML, [Android #XML. The specific flavor of #XML used by Android. The main specificity of AXML is that it can be compiled into a binary version inside an APK <acr-axml>],
CFG, [Control-Flow Graph, a graph representing the control structures (_e.g._ "if" blocks) of the code of a method/application <acr-cfg>],
DEX, [Dalvik Executable, the file format for the bytecode used for applications by Android <acr-dex>],
DFG, [Data-Flow Graph, a graph representing the flow of information in an application <acr-dfg>],
FR, [Finishing Rate, the number of runs that finished over the number of total runs of an analysis <acr-fr>],
HPC, [High-Performance Computing, the use of supercomputers and computer clusters <acr-hpc>],
MWE, [Minimum Working Example, in this context, a small example that can be used to check if a tool is working <acr-mwe>],

@ -70,7 +70,7 @@ Compared to Soot, it has a modernize interface and architecture, but it is not y
=== Frida <sec:bg-frida>
Fidra#footnote[https://frida.re/] is a dynamic intrumentation toolkit.
Frida#footnote[https://frida.re/] is a dynamic instrumentation toolkit.
It allows the reverse engineer to inject and run JavaScript code inside a running application.
To instrument an application, the Frida server must be running as root on the phone, or the Frida library must be injected inside the #APK file before installing it.

@ -1,7 +1,9 @@
#import "../lib.typ": etal, paragraph, DEX
#import "X_var.typ": *
#import "../lib.typ": etal, paragraph, DEX, todo
== State of the Art <sec:cl-soa>
=== SOA CLASS CHAP 4 <sec:cl-soa>
#todo[include in CHAP 2 properly]
#todo[Split Class Loading and Hidden API in subsection]
#paragraph([Class loading])[
Class loading mechanisms have been studied in the general context of the Java language.
@ -18,7 +20,7 @@ Contributions about Android class loading focus on using the capabilities of cla
For instance, Zhou #etal~@zhou_dynamic_2022 extend the class loading mechanism of Android to support regular Java bytecode and Kriz and Maly~@kriz_provisioning_2015 propose a new class loader to automatically load modules of an application without user interaction.
Regarding reverse engineering, class loading mechanisms are frequently used by packers for hiding all or parts of the code of an application~@Duan2018.
The problem to be solved consists in locating secondary #dexfiles that can be unciphered just before being loaded.
The problem to be solved consists in locating secondary #DEX files that can be unciphered just before being loaded.
Dynamic hook mechanisms should be used to intercept the bytecode at load time.
These techniques can be of some help for the reverser, but they require instrumenting the source code of AOSP or the application itself.
The engineering cost is high and anti-debugging techniques can slow down the process.

@ -9,6 +9,7 @@
#include("2_tools.typ")
#include("3_static_analysis.typ")
#include("4_datasets_and_benchmarking.typ")
#include("_chapter_4_soa.typ")
#include("X_dynamic_analysis.typ")
/*

@ -21,11 +21,11 @@ The observation of the success or failure of these analysis enables us to answer
/ RQ3: Does the reusability of tools change when analyzing goodware compared to malware? <rq-3>
/*
As a summary, the contributions of this paper are the following:
As a summary, the contributions of this chapter are the following:
- We provide containers with a compiled version of all studied analysis tools, which ensures the reproducibility of our experiments and an easy way to analyse applications for other researchers. Additionally, recipes for rebuilding such containers are provided.
- We provide a recent dataset of #NBTOTALSTRING applications balanced over the time interval 2010-2023.
- We point out which static analysis tools of Li #etal SLR paper@Li2017 can safely be used and we show that #resultunusable of evaluated tools are unusable (considering that a tool that fails more than 50% of time is unusable). In total, the success rate of the tools we could run is #resultratio on our dataset.
- We point out which static analysis tools of Li #etal SLR~@Li2017 can safely be used and we show that #resultunusable of evaluated tools are unusable (considering that a tool that fails more than 50% of the time is unusable). In total, the success rate of the tools we could run is #resultratio on our dataset.
- We discuss the effect of application features (date, size, SDK version, goodware/malware) on static analysis tools and the nature of the issues we found by studying statistics on the errors captured during our experiments.
*/

@ -108,7 +108,7 @@ We refer to this variant of usage as androguard_dad.
Finally, starting with #nbtools tools of @tab:rasta-tools, with the two variations of IC3 and Androguard, we have in total #nbtoolsselectedvariations static analysis tools to evaluate, of which two cannot be built and will be considered as always failing.
=== Source Code Selection and Building Process
=== Source Code Selection and Building Process <sec:rasta-src-select>
#figure({
show table: set text(size: 0.80em)

@ -5,6 +5,8 @@
== Conclusion <sec:rasta-conclusion>
#todo[It would be nice to submit a PR or two to Jadx/Androguard/Soot anyway]
Since the release of Android, many tools have been published in order to analyse Android applications.
In @sec:bg, we went through contributions benchmarking and comparing some of those tools.
Those contributions suggested that analysing real-world applications might be more of a challenge than expected.

@ -0,0 +1,23 @@
#import "../lib.typ": etal, ie, ART, DEX, APK, SDK
#import "X_var.typ": *
== Introduction
In this chapter, we study how Android handles the loading of classes in the case of multiple versions of the same class.
Such collisions can exist inside the #APK file or between the #APK file and #Asdkc.
We intend to understand if a reverser would be impacted during a static analysis when dealing with such obfuscated code.
Because this problem is already complex enough with the current operations performed by Android, we exclude the case where a developer recodes a specific class loader or replaces a class loader with another one, as is often the case, for example, in packed applications~@Duan2018.
We present a new technique that "shadows" a class, #ie embeds a class in the #APK file and "presents" it to the reverser instead of the legitimate version.
The goal of such an attack is to confuse them during the reversing process: at runtime the real class will be loaded from another location of the #APK file or from the #Asdk, instead of the shadow version.
This attack can be applied to regular classes of the #Asdk or to hidden classes of Android~@he_systematic_2023 @li_accessing_2016.
We show how these attacks can confuse the tools of the reverser when they perform a static analysis.
In order to evaluate if such attacks are already used in the wild, we analysed #nbapk applications from 2023 that we extracted randomly from AndroZoo~@allixAndroZooCollectingMillions2016.
Our main result is that #shadowsdk of these applications contain shadow collisions against the #SDK and #shadowhidden against hidden classes.
Our investigations conclude that most of these collisions are not voluntary attacks, but we highlight one specific malware sample performing strong obfuscation revealed by our detection of one shadow attack.
The chapter is structured as follows.
@sec:cl-loading investigates the internal mechanisms about class loading and presents how a reverser can be confused by these mechanisms.
Then in @sec:cl-obfuscation, we design obfuscation techniques and we show their effect on static analysis tools.
Next, @sec:cl-wild evaluates if these obfuscation techniques are used in the wild, by searching inside #nbapk APKs if they exploit these techniques.
@sec:cl-conclusion expands on the possible countermeasures against those shadow attacks, how they interact with other obfuscation techniques, as well as the limitations of this work and avenues left to explore.
Finally, @sec:cl-conclusion concludes the chapter.

@ -1,17 +1,17 @@
#import "../lib.typ": todo, ie, etal, num, DEX
#import "../lib.typ": todo, ie, etal, num, DEX, ART, SDK, API, APK, APIs, AOSP
#import "X_var.typ": *
== Analyzing the Class Loading Process <sec:cl-loading>
For building obfuscation techniques based on the confusion of tools with class loaders, we manually studied the code of Android that handles class loading.
In this section, we report the inner workings of ART and we focus on the specificities of class loading that can bring confusion.
Because the class loading implementation has evolved over time during the multiple iterations of the Android operating system, we mainly describe the behavior of ART from Android version 14 (SDK 34).
In this section, we report the inner workings of #ART and we focus on the specificities of class loading that can bring confusion.
Because the class loading implementation has evolved over time during the multiple iterations of the Android operating system, we mainly describe the behavior of #ART from Android version 14 (#SDK 34).
=== Class Loaders
When ART needs to access a class, it queries a `ClassLoader` to retrieve its implementation.
When #ART needs to access a class, it queries a `ClassLoader` to retrieve its implementation.
Each class has a reference to the `ClassLoader` that loaded it, and this class loader is the one that will be used to load supplementary classes used by the original class.
For example in @lst:cl-expl-cl-loading, when calling `A.f()`, the ART will load `B` with the class loader that was used to load `A`.
For example in @lst:cl-expl-cl-loading, when calling `A.f()`, the #ART will load `B` with the class loader that was used to load `A`.
#figure(
```java
@ -32,10 +32,9 @@ Moreover, rather than using the Java class loaders `SecureClassLoader` or `URLCl
The left part of @fig:cl-class_loading_classes shows the different class loaders specific to Android in white and the stubs of the original Java class loaders in grey.
The main difference between the original Java class loaders and the ones used by Android is that they do not support the Java bytecode format.
Instead, the Android-specific class loaders load their classes from (many) different file formats specific to Android.
Usually, when used by a programmer, the classes are loaded from memory or from a file using the DEX format (`.dex`).
When used directly by ART, the classes are usually stored in an application file (`.apk`) or in an optimized format (`OAR/ODEX`).
Usually, when used by a programmer, the classes are loaded from memory or from a file using the #DEX format (`.dex`).
When used directly by #ART, the classes are usually stored in an application file (`.apk`) or in an optimized format (`OAT/ODEX`).
#todo[Alt text for cl-class_loading_classes]
#figure([
#image(
"figs/classloaders-crop.svg",
@ -110,7 +109,7 @@ In reality, the #platc are loaded by `bootClassLoader` and the classes from the
In addition to the class loaders instantiated by ART when starting an application, the developer of an application can use class loaders explicitly by calling ones from the #Asdk, or by writing custom class loaders that inherit from the `ClassLoader` class.
At this point, modeling accurately the complete class loading algorithm becomes impossible: the developer can program any algorithm of their choice.
For this reason, this case is excluded from this paper and we focus on the default behavior where the context class loader is the one pointing to the `.apk` file and where its delegate is `BootClassLoader`.
For this reason, this case is excluded from this chapter and we focus on the default behavior where the context class loader is the one pointing to the `.apk` file and where its delegate is `BootClassLoader`.
With such a hypothesis, the delegation process can be modeled by the pseudo-code of method `load_class` given in <lst:cl-listing3>.
In addition, it is important to distinguish the two types of #platc handled by `BootClassLoader` and that both have priority over classes from the application at runtime:
@ -143,16 +142,16 @@ On the top right, a diagram of a web browser open at https//develoer.android.com
"
),
caption: [Location of SDK classes during development and at runtime]
caption: [Location of #SDK classes during development and at runtime]
) <fig:cl-archisdk>
@fig:cl-archisdk shows how classes of Android are used in the development environment and at runtime.
In the development environment, Android Studio uses `android.jar` and the specific classes written by the developer.
After compilation, only the classes of the developer, and sometimes extra classes computed by Android Studio are zipped in the APK file, using the multi-dex format.
After compilation, only the classes of the developer, and sometimes extra classes computed by Android Studio are zipped in the #APK file, using the multi-dex format.
At runtime, the application uses `BootClassLoader` to load the #platc from Android.
Previous works~@he_systematic_2023 @li_accessing_2016 considered both #Asdk and #hidec to be in the file `/system/framework/framework.jar` found on the phone itself, but we found that the classes loaded by `bootClassLoader` are not all present in `framework.jar`.
For example, He #etal~@he_systematic_2023 counted 495 thousand APIs (fields and methods) in Android 12, based on Google documentation on restriction for non SDK interfaces#footnote[https://developer.android.com/guide/app-compatibility/restrictions-non-sdk-interfaces].
However, when looking at the content of `framework.jar`, we only found #num(333) thousand APIs.
For example, He #etal~@he_systematic_2023 counted 495 thousand #APIs (fields and methods) in Android 12, based on the Google documentation on restrictions for non-#SDK interfaces#footnote[https://developer.android.com/guide/app-compatibility/restrictions-non-sdk-interfaces].
However, when looking at the content of `framework.jar`, we only found #num(333) thousand #APIs.
Indeed, classes such as `com.android.okhttp.OkHttpClient` are loaded by `bootClassLoader`, listed by Google, but not in `framework.jar`.
For optimization purposes, classes are now loaded from `boot.art`.
@ -160,10 +159,10 @@ This file is used to speed up the start-up time of applications: it stores a dum
Unfortunately, this format is undocumented and not backward compatible between Android versions, and is thus difficult to parse.
An easier solution to investigate the #platc is to look at the `BOOTCLASSPATH` environment variable in an emulator.
This variable is used to load the classes without the `boot.art` optimization.
We found 25 `.jar` files, including `framework.jar`, in the `BOOTCLASSPATH` of the standard emulator for Android 12 (SDK 32), 31 for Android 13 (SDK 33), and 35 for Android 14 (SDK 35), containing respectively a total of #num(499837), #num(539236) and #num(605098) API methods and fields.
We found 25 `.jar` files, including `framework.jar`, in the `BOOTCLASSPATH` of the standard emulator for Android 12 (#SDK 32), 31 for Android 13 (#SDK 33), and 35 for Android 14 (#SDK 34), containing respectively a total of #num(499837), #num(539236) and #num(605098) API methods and fields.
@tab:cl-platform_apis summarizes the discrepancies we found between Google's list and the #platc we found in Android emulators.
Note that some methods may also be found _only_ in the documentation.
Our manual investigations suggest that the documentation is not well synchronized with the evolution of the #platc and that Google has almost solved this issue in API 34.
Our manual investigations suggest that the documentation is not well synchronized with the evolution of the #platc and that Google has almost solved this issue in #API 34.
#figure({
@ -194,7 +193,7 @@ Our manual investigations suggest that the documentation is not well synchronize
table.hline(),
)},
caption: [Comparison for API methods between documentation and emulators],
caption: [Comparison for #API methods between documentation and emulators],
)<tab:cl-platform_apis>
We conclude that it can be dangerous to trust the documentation and that gathering information from the emulator or phone is the only reliable source.
@ -202,8 +201,8 @@ Gathering the precise list of classes and the associated bytecode is not a trivi
=== Multiple #DEX Files <sec:cl-collision>
For the application class files, Android uses its specific format called DEX: all the classes of an application are loaded from the file `classes.dex`.
With the increasing complexity of Android applications, the need arrised to load more methods than the DEX format could support in one #dexfile.
For the application class files, Android uses its specific format called #DEX: all the classes of an application are loaded from the file `classes.dex`.
With the increasing complexity of Android applications, the need arose to load more methods than the #DEX format could support in one #dexfile.
To solve this problem, Android started storing classes in multiple files named `classesX.dex`, as illustrated by @lst:cl-dexname, which shows the method that generates the filenames read by class loaders.
Android starts loading the file `GetMultiDexClassesDexName(0)` (`classes.dex`), then `GetMultiDexClassesDexName(1)` (`classes2.dex`), and continues until finding a value `n` for which `GetMultiDexClassesDexName(n)` does not exist.
Even if Android emits a warning message when it finds more than 100 #dexfiles, it will still load any number of #dexfiles that way.
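
The enumeration described above can be sketched in a few lines of Python.
This is only an illustrative model, not code from Android: the helper names `multidex_name` and `loaded_dex_files` are hypothetical, and the sketch only assumes the naming rule and the "stop at the first missing file" behavior stated in the text.

```python
from typing import Iterable, List


def multidex_name(index: int) -> str:
    """Naming rule from the text: index 0 is classes.dex, index n is classes<n+1>.dex."""
    return "classes.dex" if index == 0 else f"classes{index + 1}.dex"


def loaded_dex_files(apk_entries: Iterable[str]) -> List[str]:
    """Return the entries that would be read, in order.

    The scan stops at the first generated name that is absent from the
    archive, so files placed after a gap are never considered.
    """
    entries = set(apk_entries)
    loaded = []
    index = 0
    while multidex_name(index) in entries:
        loaded.append(multidex_name(index))
        index += 1
    return loaded
```

With this model, an archive containing `classes.dex`, `classes2.dex` and `classes4.dex` only yields the first two files, and a file named `classes1.dex` is never generated by the rule.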
@ -219,13 +218,13 @@ We will show later in @sec:cl-evaltools that this choice is not the most intuiti
As a conclusion, we model both the multi-dex and delegation behaviors in the pseudo-code of @lst:cl-loading-alg.
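
As a rough illustration of how the two behaviors combine, the following Python sketch resolves a class name under the default hypothesis discussed earlier (the boot class loader is consulted first, then the files of the application in loading order).
It is a simplified model under stated assumptions, not the listing referenced above: the function name, the data structures and the "first file wins" rule are assumptions made for the example.

```python
from typing import List, Optional, Set, Tuple


def load_class(
    name: str,
    boot_classes: Set[str],
    dex_files: List[Tuple[str, Set[str]]],
) -> Optional[str]:
    """Return where the class would be resolved from, or None if not found.

    boot_classes: names of the platform classes (assumed known).
    dex_files: (file name, class names) pairs, already in loading order.
    """
    # Delegation: the boot class loader has priority over the application.
    if name in boot_classes:
        return "boot"
    # Multi-dex: assume the first file defining the class wins.
    for dex_name, classes in dex_files:
        if name in classes:
            return dex_name
    return None
```

In this model, a class defined both in the application and in the platform is always resolved to the platform version, and a class duplicated in two application files is resolved to the file that comes first in loading order.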
#figure(
```C++
```C
std::string DexFileLoader::GetMultiDexClassesDexName(size_t index) {
return (index == 0) ?
"classes.dex" :
StringPrintf("classes%zu.dex", index + 1);
}
```,
caption: [The method generating the .dex filenames from the AOSP]
caption: [The method generating the .dex filenames from the #AOSP]
) <lst:cl-dexname>

@ -1,10 +1,11 @@
#import "../lib.typ": eg, todo, paragraph
#import "../lib.typ": eg, paragraph, DFG, DEX, API, SDK, APK, ART, AOSP
#import "../lib.typ": todo, jfl-note
#import "X_var.typ": *
== Obfuscation Techniques <sec:cl-obfuscation>
In this section, we present new obfuscation techniques that take advantage of the complexity of the class loading process.
Then, in order to evaluate their efficiency, we reviewed some common Android reverse analysis tools to see how they behave when collisions occur between classes of the APK or between a class of the APK and classes of Android (#Asdk or #hidec).
Then, in order to evaluate their efficiency, we reviewed some common Android reverse analysis tools to see how they behave when collisions occur between classes of the #APK or between a class of the #APK and classes of Android (#Asdk or #hidec).
We call this collision "*class shadowing*", because the attacker version of the class shadows the one that will be used at runtime.
To evaluate if such shadow attacks are working, we handcrafted three applications implementing shadowing techniques to test their impact on static analysis tools.
Then, we manually inspected the output of the tools in order to check its consistency with what Android is really doing at runtime.
@ -39,24 +40,24 @@ Basic shadowing imply to have several class with the same name in the applicatio
On the other hand, using #hidec leave classes without implementation in the application, which can also be detected.
*/
#paragraph([*Self shadow*: shadowing a class with another from APK])[
#paragraph([*Self shadow*: shadowing a class with another from #APK])[
This method consists in hiding the implementation of a class with another one by exploiting the possible collision of class names, as described in @sec:cl-collision with multiple #dexfiles.
If reversers or tools ignore the priority order of a multi-dex file, they can take into account the wrong version of a class.
]
// priority to SDK classes even if a shadow class is defined in the APK (all because of Boot)
#paragraph([*SDK shadow*: shadowing a SDK class])[
This method consists in presenting to the reverser a fake implementation of a class of the SDK.
This class is embedded in the APK file and has the same name as the one of the SDK.
Because `BootClassLoader` will give priority to the #Asdk at runtime, the reverser or tool should ignore any version of a class that is contained in the APK.
The only constraint when shadowing an SDK class is that the shadowing implementation must respect the signature of real classes.
Note that, by introducing a custom class loader, the attacker could inverse the priority, but this case is out of the scope of this paper.
#paragraph([*SDK shadow*: shadowing an #SDK class])[
This method consists in presenting to the reverser a fake implementation of a class of the #SDK.
This class is embedded in the #APK file and has the same name as the one of the #SDK.
Because `BootClassLoader` will give priority to the #Asdk at runtime, the reverser or tool should ignore any version of a class that is contained in the #APK.
The only constraint when shadowing an #SDK class is that the shadowing implementation must respect the signature of real classes.
Note that, by introducing a custom class loader, the attacker could invert the priority, but this case is out of the scope of this chapter.
]
// priority to hidden classes (as they are part of the SDK) even if a shadow class is defined in the APK
#paragraph([*Hidden shadow*: shadowing a hidden class])[
This method is similar to the previous one, except the class that is shadowed is a #hidecsingular.
Because ART will give priority to the internal version of the class, the version provided in the APK file will be ignored.
Because #ART will give priority to the internal version of the class, the version provided in the #APK file will be ignored.
Such shadow attacks are more difficult for a reverser to detect, as they may not know of the existence of this specific hidden class in Android.
]
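
To make the three categories concrete, here is a small Python sketch that classifies the collisions of an application given the class names found in each of its bytecode files and two reference sets of platform class names.
The function name `classify_shadows` and the input sets are hypothetical: how the lists of SDK and hidden classes are obtained is outside the scope of this sketch.

```python
from collections import Counter
from typing import Dict, List, Set


def classify_shadows(
    dex_files: List[Set[str]],
    sdk_classes: Set[str],
    hidden_classes: Set[str],
) -> Dict[str, Set[str]]:
    """Group colliding class names by the kind of shadowing they could enable."""
    # Each file is a set, so a count above 1 means the class is defined
    # in more than one file of the application.
    counts = Counter(name for classes in dex_files for name in classes)
    in_apk = set(counts)
    return {
        "self": {name for name, n in counts.items() if n > 1},
        "sdk": in_apk & sdk_classes,
        "hidden": in_apk & hidden_classes,
    }
```

A collision reported by such a check is only a candidate: it says that two definitions exist, not that the duplication is intentional.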
@ -71,33 +72,44 @@ Such shadow attacks are more difficult to detect by a reverser, that may not kno
Taint.sink(ac, obfuscated_personal_data);
}
}
public class Obfuscation { // customized for each obfuscation technique
// customized for each obfuscation technique
public class Obfuscation {
public static String hide_flow(String personal_data) { ... }
}
```,
caption: [Main body of test apps]
)<lst:cl-testapp>
We selected tools that are commonly used to unpack and reverse Android applications: Jadx#footnote[https://github.com/skylot/jadx], a decompiler for Android applications, Apktool#footnote[https://apktool.org/], a disassembler/repackager of applications, Androguard#footnote[https://github.com/androguard/androguard], one of the oldest Python package for manipulating Android applications, and Flowdroid~@Arzt2014a that performs taint flow analysis.
For evaluating the tools, we designed a single application that we can customize for different tests.
We selected tools that are commonly used to unpack and reverse Android applications.
We started from the only two tools that we found to still be maintained in @sec:rasta-src-select: Androguard#footnote[https://github.com/androguard/androguard] and Flowdroid~@Arzt2014a.
We also selected Jadx#footnote[https://github.com/skylot/jadx], a state-of-the-art decompiler for Android applications, as well as Apktool#footnote[https://apktool.org/], a disassembler/repackager used by 9 of the tools tested in @sec:rasta and often used by reversers when Jadx fails.
To evaluate the tools, we designed a single application that we can customize for different tests.
@lst:cl-testapp shows the main body implementing:
- a possible flow to evaluate FlowDroid: a flow from a method `Taint.source()` to a method `Taint.sink(Activity, String)` through a method `Obfuscation.hide_flow(String)`;
- a possible use of a SDK or hidden class inside the class `Obfuscation` to evaluate #platc shadowing for other tools.
- a possible flow to evaluate FlowDroid: a flow from a method `Taint.source()` to a method `Taint.sink(Activity, String)` through a method `Obfuscation.hide_flow(String)`.
- a possible use of an #SDK or hidden class inside the class `Obfuscation` to evaluate #platc shadowing for other tools.
The first application we released is a control application that does not do anything special.
It will be used for checking the expecting result of tools.
The second implements self shadowing: the class `Obfuscation` is duplicated: one is the same as the in the control app (`Obfuscation.hide_flow(String)` returns its arguments), and the other version returns a constant string.
These two versions are embedded in several DEX of a multi-dex application.
The third application tests SDK shadowing and needs an existing class of the SDK.
We used `Pair` that we try to shadow.
We put data in a `Pair` and reread the data from the `Pair`. The colliding `Pair` discards the data and returns null.
The last application tests for Hidden API shadowing.
Like for the third one, we similarly store data in `com.android.okhttp.Request` and then retrieve it.
Again, the shadowing implementation discards the data.
We used 4 versions of this application:
- A control application that does not do anything special: `Obfuscation.hide_flow(String personal_data)` simply returns `personal_data`.
  It will be used for checking the expected results of the tools.
- A version that implements self shadowing: the class `Obfuscation` is duplicated: one version is the same as in the control app (`Obfuscation.hide_flow(String)` returns its argument), and the other version returns a constant string.
  These two versions are embedded in separate #DEX files of a multi-dex application.
- A version that implements #SDK shadowing, which needs an existing class of the #SDK.
  We used the #SDK class `Pair` that we try to shadow.
  We put data in a new `Pair` instance and reread the data from the `Pair`.
  The colliding `Pair` class we created discards the data at initialisation and stores `null` instead of the argument values.
  This decoy class breaks the flow of information: Flowdroid will detect the information flow if it uses the actual #SDK implementation of `Pair` to compute the #DFG, but not if it uses the decoy.
- A version that tests Hidden #API shadowing.
  Like for the third one, we store data in `com.android.okhttp.Request` and then retrieve it.
  Again, the shadowing implementation discards the data.
We used the 4 selected tools on the 4 versions of the application and compared the results on the control application to the results on the other applications implementing the different obfuscation techniques.
We found that these static analysis tools do not consider the class loading mechanism, either because the tools only look at the content of the application file (#eg a disassembler) or because they consider class loading to be a dynamic feature and thus out of their scope.
In @tab:cl-results, we report on the types of shadowing that can be tricked each tool.
In @tab:cl-results, we report on the types of shadowing that can trick each tool.
A plain circle is a shadow attack that leads to a wrong result.
A white circle indicates a tool emitting warnings or that displays the two versions of the class.
A cross is a tool not impacted by a shadow attack.
@ -152,7 +164,7 @@ Shadowing #Asdk and #hidec is possible in Jadx: there is only one implementation
==== Apktool
Apktool generates Smali files, an assembler language for DEX bytecode.
Apktool generates Smali files, an assembler language for #DEX bytecode.
Apktool will store the disassembled classes in a folder that matches the #dexfile that stores the bytecode.
This means that when shadowing a class with two versions in two #dexfiles, the shadow implementations will be disassembled into two directories.
No indication is displayed that a collision is possible.
@ -163,15 +175,15 @@ Similarly to Jadx, using an #Asdk or #hidecsingular will not be detected by the
==== Androguard
Androguard has different usages, with different levels of analysis.
The documentation highlights the analysis commands that compute three types of objects: an APK object, a list of DEX objects, and an Analysis object.
The APK and the list of #dexfiles are a one-to-one representation of the content of an application, and have the same issues that we discussed with Apktool: they provide the different versions of a shadow class contained in multiple #dexfiles.
The documentation highlights the analysis commands that compute three types of objects: an #APK object, a list of #DEX objects, and an Analysis object.
The #APK and the list of #dexfiles are a one-to-one representation of the content of an application, and have the same issues that we discussed with Apktool: they provide the different versions of a shadow class contained in multiple #dexfiles.
The Analysis object is used to compute a method call graph and we found that this algorithm may choose the wrong version of a shadowed class when using the cross references that are computed.
This leads to an invalid call graph as shown in @fig:cl-andro_obf_cg: the two methods `doSomething()` are represented in the graph, but the one linked to `main()` on the graph is the one calling the method `good()` when in fact the method `bad()` is called when running the application.
Androguard has a method `.is_external()` to detect if the implementation of a class is not provided inside the application and a method `.is_android_api()` to detect if the class is part of the Android API.
Androguard has a method `.is_external()` to detect if the implementation of a class is not provided inside the application and a method `.is_android_api()` to detect if the class is part of the Android #API.
Regrettably, the documentation of `.is_android_api()` explains that the method is still experimental and just checks a few package names.
This means that although those methods are useful, the only indication of the use of an #Asdk or #hidec is the fact that the class is not in the APK file.
This means that although those methods are useful, the only indication of the use of an #Asdk or #hidec is the fact that the class is not in the #APK file.
Because of that, like for Apktool and Jadx, Androguard has no way to warn the reverser that the shadow of an #Asdk or #hidec is not the class used when running the application.
#figure({
@ -217,13 +229,17 @@ Because of that, like for Apktool and Jadx, Androguard has no way to warn the re
==== Flowdroid
Flowdroid~@Arzt2014a is used to detect if an application can leak sensitive information.
#jfl-note[Flowdroid~@Arzt2014a is used to detect if an application can leak sensitive information.
To do so, the analyst provides a list of source and sink methods.
The return value of a method marked as source is considered sensitive and the argument of a method marked as sink is considered to be leaked.
By analyzing the bytecode of an application, Flowdroid can detect if data emitted by source methods can be exfiltrated by a sink method.
Flowdroid is built on top of the Soot~@Arzt2013 framework that handles, among other things, the class selection process.
Flowdroid is built on top of the Soot~@Arzt2013 framework that handles, among other things, the class selection process. ][
Already said in chapter 2?
We found that when selecting the classes implementation in a multi-dex APK, Soot uses an algorithm close to what ART is performing:
No, but we should have; it will come, and this will need to be updated at that point.
]
We found that when selecting class implementations in a multi-dex #APK, Soot uses an algorithm close to the one #ART performs:
Soot sorts the `.dex` bytecode files with a specified `prioritizer` (a comparison function that defines an order for #dexfiles) and selects the first implementation found when iterating over the sorted files.
Unfortunately, the `prioritizer` used by Soot is not exactly the same as the one used by the ART.
The Soot `prioritizer` gives priority to `classes.dex`, then to files whose name starts with `classes` over other files, and finally falls back to alphabetical order.
@ -231,67 +247,12 @@ This order is good enough for application with a small number of #dexfiles gener
// TODO This could use more investigation
In addition to self shadowing, Flowdroid is sensitive to the use of #platc, as it needs the bytecode of those classes to be able to track data flows.
This is solved for SDK classes by providing `android.jar` to Flowdroid.
Flowdroid gives priority to the classes from the SDK over the classes implemented in the application, thus defeating SDK shadow attacks.
This is solved for #SDK classes by providing `android.jar` to Flowdroid.
Flowdroid gives priority to the classes from the #SDK over the classes implemented in the application, thus defeating #SDK shadow attacks.
Unfortunately, `android.jar` only contains classes from the #Asdk, meaning that using #hidec breaks the flow tracking.
Solving this issue would require finding the bytecode of all the platform classes of the targeted Android version, which, as we said previously, requires extracting this information from the emulator.
=== Countermeasures <sec:cl-countermeasures>
Countermeasures against shadow attacks depend on each tool and its objectives.
The first important recommendation is to implement the class selection algorithm according to the algorithm described in Listing @lst:cl-loading-alg.
It should solve any case of self-shadowing, except for tools like Apktool, which do not have to select a class for computing the result but show the whole application's content.
For those tools, a clear warning should be added, pointing out that multiple implementations have been found and displaying the one that will be used at runtime.
Countermeasures against SDK shadow and Hidden shadow attacks are more complex to handle: they require the list of platform classes on the target smartphone.
The list of SDK classes can be extracted easily from `android.jar`, but hidden classes need to be obtained by other means.
They could be listed directly from the AOSP tree of the Android source code, or obtained from Android documentation, or extracted from the phone itself.
The first approach requires statically analyzing the source code, which can be difficult to achieve as several programming languages are used, and the code base is large and fragmented.
As discussed earlier in the paper, the documentation can lack some classes.
Consequently, the most reliable source is the smartphone itself.
It should be noted that none of these methods can be generalized for all possible versions of Android, as the exact list will depend on the exact targeted device, possibly modified by the manufacturer.
Thus, to counter shadow attacks, the static analysis tools that we evaluated need to embed multiple lists of platform classes, one for each Android version.
Then, the best heuristic would be to use the list of platform classes that is closest to the target SDK of the analysed application.
Some tools like Flowdroid would require additional countermeasures: to compute the exact flow of data, Flowdroid also needs to analyse the code of platform classes.
For the SDK classes, Flowdroid has already analysed them, but the hidden classes have not.
In addition to the data flow in hidden classes, Flowdroid needs a list of data sources and sinks from those classes.
%Other analysis tools may require additional data from platform classes, which may be too difficult to obtain.
We believe that analysis tools can handle shadow attacks to some degree.
The implementation of the solution will differ depending on the nature of the tool and may not always require the same implementation effort.
=== Relation with Obfuscation Techniques <sec:cl-cross-obf>
As described in the state of the art, reverse engineers face other techniques of obfuscation such as packers or native code.
These techniques rely on custom class loaders that load new parts of the application from ciphered assets or from the network.
The reverse engineers have to study the application dynamically, to recover new classes, and eventually go back to a static phase to understand the behavior of the application.
In this section, we compare shadow attacks with these techniques and we discuss how they interact with them.
Advanced obfuscation techniques relying on packers have a higher impact on the difficulty of performing a static analysis compared to shadow attacks.
Most of the time, the reverse engineer cannot deobfuscate the application without performing a dynamic analysis.
For this reason, approaches have been designed to assist the capture of the bytecode that is loaded dynamically, after the precise time where the deobfuscation methods have been executed~@zhang2015dexhunter @xue2017adaptive @wong2018tackling.
On the contrary, a shadow attack can be easily defeated by implementing our algorithm in the static analysis tool, as discussed earlier in @sec:cl-countermeasures.
Nevertheless, shadow attacks are stealthier than packers or native code.
Packers can be easily spotted by artifacts left behind in the application or by detecting classes implementing a custom class loading mechanism.
On the contrary, an extra class implementing a shadow attack, which would not be executed, could deliberately contain little code compared to the executed class of Android.
Such an attack would be more discreet than a packer that adds a lot of possibly native code to the application.
Combining regular obfuscation techniques with shadow attacks can be achieved in two ways.
First, the attacker could hide the code of a packer or a native call by using a shadow attack.
For example, by colliding a class of the SDK, a control flow analysis could be wrongly computed, leading to consider that part of the code is dead, which would mislead the reverse engineer about the use of this part that contains a packer.
At runtime, this code would be triggered, unpacking new code.
Second, the attacker could use a packer to unpack code at runtime in a first phase.
The reverse engineer would have to perform a dynamic analysis, for example using a tool such as Dexhunter~@zhang2015dexhunter, to recover new DEX files that are loaded by a custom class loader.
Then, the reverse engineer would go back to a new static analysis and could have the problem of solving shadow attacks, for example, if a class is defined multiple times in the loaded DEX files.
Because the interaction between shadow attacks and other obfuscation techniques often relies on a loading mechanism implemented by the developer, investigating these cases requires analysing the Java bytecode that handles the loading.
This problem is left as future work.
//\medskip
#v(2em)
We have seen that tools can be impacted by shadow attacks. In the next section, we will investigate if these attacks are used in the wild.

View file

@ -1,15 +1,20 @@
#import "../lib.typ": num, todo, paragraph
#import "../lib.typ": num, todo, paragraph, SDK, APK, API, ART, DEX
#import "X_var.typ": *
== Shadow Attacks in the Wild <sec:cl-wild>
In this section, we evaluate in the wild if applications that can be found in the Play store or other markets use one of the shadow techniques.
Our goal is to explore the usage of shadow techniques in real applications.
Because we want to include malicious applications (in case such techniques would be used to hide malicious code), we selected #num(50000) applications randomly from AndroZoo~@allixAndroZooCollectingMillions2016 that appeared in 2023.
Malicious applications are spot in our dataset by using a threshold of 3 over the number of antivirus reporting an application as a malware.
Some few applications over the total cannot be retrieved or parsed leading to a final dataset of #nbapk applications.
Because we modeled the behavior of a recent version of Android (#SDK 34), we decided not to use our dataset from @sec:rasta.
The applications in the RASTA dataset span more than 10 years and we cannot guarantee that shadow attacks behaved the same during those 10 years.
At the very least, self-shadowing would not be possible before the introduction of multi-dex in 2014, which concerns about a fourth of the applications in the RASTA dataset.
Instead, we sampled another dataset of recent applications.
We want to include malicious applications (in case such techniques would be used to hide malicious code), so we selected #num(50000) applications randomly from AndroZoo~@allixAndroZooCollectingMillions2016 that appeared in 2023.
Malicious applications are spotted in our dataset by using a threshold of 3 on the number of VirusTotal engines reporting an application as malware.
This number is provided by AndroZoo, for scans performed between January 2023 and January 2024 depending on the application.
A few applications could not be retrieved or parsed, leading to a final dataset of #nbapk applications.
We automatically disassembled the applications to obtain the list of included classes.
Then, we check if any shadow attack occurs in the APK itself or with #platc of SDK 34.
Then, we check if any shadow attack occurs in the #APK itself or with #platc of #SDK 34.
=== Results
@ -76,24 +81,24 @@ comparé à SDK 32 33 34: si la shadow class match, alors match
table.cell(colspan: 9, inset: 3pt)[],
table.hline(),
)},
caption: [Shadow classes compared to SDK 34 for a dataset of #nbapk applications]
caption: [Shadow classes compared to #SDK 34 for a dataset of #nbapk applications]
) <tab:cl-shadow>
//The metadata provided by AndroZoo helps to have the flags reported by antiviruses used by VirusTotal#footnote[https://www.virustotal.com].
We report in the upper part of @tab:cl-shadow the statistics about the whole dataset and the three shadow attacks: "self" when a class shadows another one in the APK, "SDK" when a class of the SDK shadows one of the APK, and "Hidden" when a hidden class of Android shadows one of the APK.
We report in the upper part of @tab:cl-shadow the statistics about the whole dataset and the three shadow attacks: "self" when a class shadows another one in the #APK, "#SDK" when a class of the #SDK shadows one of the #APK, and "Hidden" when a hidden class of Android shadows one of the #APK.
We observe that, on average, a few classes are shadowed by another class.
Note that the median value is 0 meaning that few apps shadow a lot of classes, but the majority of apps do not shadow anything.
The number of applications shadowing a hidden API is low, which is an expected result as these classes should not be known by the developer.
We observe a consequent number of applications, 23.52%, of applications that perform SDK shadowing.
It can be explained by the fact that some classes that newly appear are embedded in the APK for end users that have old versions of Android: it is suggested by the average value of Min SDK which is 21.7 for the whole dataset: on average, an application can be run inside a smartphone with API 21, which would require to embed all new classes from 22 to 34.
The number of applications shadowing a hidden #API is low, which is an expected result as these classes should not be known by the developer.
We observe a substantial share of applications, 23.52%, that perform #SDK shadowing.
This can be explained by the fact that some newly introduced classes are embedded in the #APK for end users that have old versions of Android.
This is suggested by the average value of Min #SDK, which is 21.7 for the whole dataset: on average, an application can run on a smartphone with #API 21, which would require embedding all new classes from 22 to 34.
This hypothesis about missing classes is further investigated later in this section.
In the bottom part of @tab:cl-shadow, we give the same statistics but we excluded applications that do not perform any shadowing.
For those pairs of shadow classes, we disassembled them using Apktool to perform a comparison using instructions represented in the Smali language.
For self-shadow, we compare the pair.
For the shadowing of the SDK or Hidden class, we compare the code found in the APK with implementations found in the emulator and `android.jar` of SDK 32, 33, and 34.
For the shadowing of the #SDK or Hidden class, we compare the code found in the #APK with implementations found in the emulator and `android.jar` of #SDK 32, 33, and 34.
#paragraph([Self-shadowing])[
We observe a low number of applications doing self-shadow attacks.
@ -117,22 +122,22 @@ We investigate later in @sec:cl-malware the case of malicious applications.
The remaining bars are between 0 and 5,000.
"
),
caption: [Redefined SDK classes, sorted by the first SDK they appeared in.]
caption: [Redefined #SDK classes, sorted by the first #SDK they appeared in.]
)<fig:cl-classes_by_first_sdk>
#paragraph([SDK shadowing])[
For the shadowing of SDK classes, we observe a low ratio of identical classes.
This result could lead to the wrong conclusion that developers embed malicious versions of the SDK classes, but our manual investigation shows that the difference is slight and probably due to compiler optimization.
#paragraph([#SDK shadowing])[
For the shadowing of #SDK classes, we observe a low ratio of identical classes.
This result could lead to the wrong conclusion that developers embed malicious versions of the #SDK classes, but our manual investigation shows that the difference is slight and probably due to compiler optimization.
To go further in the investigation, in @fig:cl-classes_by_first_sdk we represent these redefined classes with the following rules:
- The class is classified on the X abscissa in the figure according to the SDK it first appeared in.
- The class is counted as "green" (solid) if it first appeared in the SDK *after* the APK min SDK (retro compatibility purpose).
- The class is counted as "red" (hatched) if it first appeared in the SDK *before* the APK min SDK (which is useless for the application as the SDK version is always available).
- The class is classified on the X abscissa in the figure according to the #SDK it first appeared in.
- The class is counted as "green" (solid) if it first appeared in the #SDK *after* the #APK min #SDK (retro compatibility purpose).
- The class is counted as "red" (hatched) if it first appeared in the #SDK *before* the #APK min #SDK (which is useless for the application as the #SDK version is always available).
We observe that the majority of classes are legitimate retro-compatibility additions of classes, especially after SDK 21 (which is the average min SDK, cf. @tab:cl-shadow).
Abnormal cases are observed for classes that appeared in API versions 7 and before, 8, and 16.
@tab:cl-topsdk reports the top ten classes that shadow the SDK for the three mentioned versions.
For SDK before 7, it mainly concerns HTTP classes: for example, the class `HttpParams` is an interface, containing limited bytecode that mostly matches the class already present on the emulator (98.03% of shadowed classes are identical).
We observe that the majority of classes are legitimate retro-compatibility additions of classes, especially after #SDK 21 (which is the average min #SDK, cf. @tab:cl-shadow).
Abnormal cases are observed for classes that appeared in #API versions 7 and before, 8, and 16.
@tab:cl-topsdk reports the top ten classes that shadow the #SDK for the three mentioned versions.
For #SDK before 7, it mainly concerns HTTP classes: for example, the class `HttpParams` is an interface, containing limited bytecode that mostly matches the class already present on the emulator (98.03% of shadowed classes are identical).
`HttpConnectionParams` on the other hand differs from the platform class and we observe only 4.99% of identical classes.
Manual inspection of some applications revealed that the two main reasons are:
@ -141,11 +146,11 @@ Manual inspection of some applications revealed that the two main reasons are:
- very small changes that we found can be attributed to the compilation process (e.g. swapping registers: `v0` is used instead of `v1` and `v1` instead of `v0`), but even if we consider them different, they are very similar.
The remaining 4.99% of classes that are identical to the Android version are classes where the body of the methods is replaced by stubs that throw `RuntimeException("Stub!")`.
This code corresponds to what we found in android.jar but not the code we found in the emulator, which is surprising.
This code corresponds to what we found in `android.jar` but not the code we found in the emulator, which is surprising.
Nevertheless, we decided to count them as identical, because `android.jar` is the official jar file for developers, and stubs are replaced in the emulator: it is intended by Google developers.
Other results of @tab:cl-topsdk can be similarly discussed: either they are identical with a high ratio, or they are different because of small variations.
When substantial differences appear it is mainly because different versions of the same library have been used or an SDK class is embedded for retro-compatibility.
When substantial differences appear it is mainly because different versions of the same library have been used or an #SDK class is embedded for retro-compatibility.
]
#figure({
@ -196,17 +201,19 @@ When substantial differences appear it is mainly because different versions of t
table.cell(colspan: 3, inset: 2pt)[],
table.hline(),
)},
caption: [Shadow classes compared to SDK 34 for a dataset of #nbapk applications]
caption: [Shadow classes compared to #SDK 34 for a dataset of #nbapk applications]
) <tab:cl-topsdk>
#paragraph([Hidden shadowing])[
For applications redefining hidden classes, on average, 16.1 classes are redefined (cf bottom part of @tab:cl-shadow).
The top 3 packages whose code actually differs from the ones found in Android are `java.util.stream`, `org.ccil.cowan.tagsoup` and `org.json`:
- stream: when looking in more detail, we found that `java.util.stream` was only redefined by 6 applications, but the large number of classes redefined artificially puts the package at the top of the list. // It is explained by the fact that developers have included this library containing a lot of classes colliding with Android.
- tagsoup: `TagSoup` is a library for parsing HTML. // Developers do not know that it is part of Android as hidden classes.
- stream: when looking in more detail, we found that `java.util.stream` was only redefined by 6 applications, but the large number of classes redefined artificially puts the package at the top of the list.
It is explained by the fact that developers have included this library containing a lot of classes colliding with Android.
- tagsoup: `TagSoup` is a library for parsing HTML.
Developers do not know that it is part of Android as hidden classes.
- json: there is only one hidden class in `org.json`, redefined by #num(821) applications: `JSONObject$1`.
`org.json` is a package in Android SDK, not a hidden one.
`org.json` is a package in Android #SDK, not a hidden one.
However, `JSONObject$1` is an anonymous class not provided by `android.jar` because its class `JSONObject` is an empty stub, and thus, does not use `JSONObject$1`.
Thus, this class falls in the category of hidden #platc.
All these hidden shadow classes are libraries included by the developers who probably did not know that they were already embedded in Android.
@ -236,7 +243,7 @@ All these hidden shadow classes are libraries included by the developers who pro
// ...
}
```,
caption: [Implementation of Reflection found un classes11.dex (shadows @lst:cl-refl1)],
caption: [Implementation of Reflection found in `classes11.dex` (shadows @lst:cl-refl1)],
) <lst:cl-refl2>
#figure(
@ -258,7 +265,7 @@ All these hidden shadow classes are libraries included by the developers who pro
// ...
}
```,
caption: [Implementation of Reflection executed by ART (shadowed by @lst:cl-refl2],
caption: [Implementation of Reflection executed by #ART (shadowed by @lst:cl-refl2)],
) <lst:cl-refl1>
The last column of @tab:cl-shadow shows the proportion of applications considered as malware because we arbitrarily fixed a threshold of 3 positive detections from VirusTotal reports.
@ -271,18 +278,19 @@ Additionally, we noticed multiple times internal classes from `com.google.androi
// Nom de l'app: ShareCRM, mais ca a l'air d'exister sur le store donc on va eviter un process et pas la nommer
// https://play.google.com/store/apps/details?id=com.facishare.fsplay&hl=en
The most notable case we found was an application that still exists on the Google Play Store with the same package name#footnote[SHA256: `C46A65EA1A797119CCC03C579B61C94FE8161308A3B6A8F55718D6ADAD112546`]. This application contains a self-shadow class `me.weishu.reflection.Reflection` that can be found in github, in the repository `tiann/FreeReflection`#footnote[https://github.com/tiann/FreeReflection]. This class is used to disable Android restrictions on hidden API.
The most notable case we found was an application that still exists on the Google Play Store with the same package name#footnote[SHA256: `C46A65EA1A797119CCC03C579B61C94FE8161308A3B6A8F55718D6ADAD112546`]. This application contains a self-shadow class `me.weishu.reflection.Reflection` that can be found on GitHub, in the repository `tiann/FreeReflection`#footnote[https://github.com/tiann/FreeReflection]. This class is used to disable Android restrictions on the hidden #API.
At first glance, we believed the shadowing to be done voluntarily for obfuscation purposes.
The shadow class that would be seen by a reverser is given in @lst:cl-refl2: it contains some Java bytecode performing reflection and loading a native library named "free-reflection" (the associated `.so` is missing).
The shadowed class that is really executed is summarized in @lst:cl-refl1.
It contains a more obfuscated code: a `DEX` field storing base64 encoded DEX bytecode that is later used to load some new code.
It contains a more obfuscated code: a `DEX` field storing base64 encoded #DEX bytecode that is later used to load some new code.
When looking at this new code stored in the field, we found that it does almost the same thing as the code in the shadow class.
Thus, we believe that the developer has upgraded their obfuscation techniques, replacing a native library by inline base64 encoded bytecode.
The shadow attack could be unintentional, but it strengthens the masking of the new implementation.
#v(2em)
As a conclusion, we observed that:
- SDK shadowing is performed by #shadowsdk of applications but are unintentional: these classes are embedded for retro-compatibility purpose or because the developer added a library already present in Android;
- #SDK shadowing is performed by #shadowsdk of applications but is unintentional: these classes are embedded for retro-compatibility purposes or because the developer added a library already present in Android;
- Hidden shadowing rarely occurs and is mainly due to the usage of libraries that Android already contains;
- Malware perform more self-shadowing than goodware applications, and we found a sample where self-shadowing would clearly mislead the reverser.

View file

@ -0,0 +1,96 @@
#import "../lib.typ": SDK, AOSP, DEX, ART, jm-note, todo
== Discussion <sec:cl-disc>
#todo[small intro]
=== Countermeasures <sec:cl-countermeasures>
Countermeasures against shadow attacks depend on each tool and its objectives.
The first important recommendation is to implement the class selection algorithm according to the algorithm described in Listing @lst:cl-loading-alg.
It should solve any case of self-shadowing, except for tools like Apktool, which do not have to select a class for computing the result but show the whole application's content.
For those tools, a clear warning should be added, pointing out that multiple implementations have been found and displaying the one that will be used at runtime.
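As an illustration of such a warning, the following minimal sketch lists the classes that are defined in more than one #DEX file.
It assumes a hypothetical input, `classes_by_dex`, mapping each file name to the class names it defines (for example, as extracted by a disassembler), and the helper name `find_self_shadowed` is ours.
It only detects the collisions: deciding which implementation is used at runtime is left to the class selection algorithm.

```python
from collections import defaultdict


def find_self_shadowed(classes_by_dex):
    """Return the classes defined in several DEX files.

    `classes_by_dex` is a hypothetical mapping from a DEX file name to the
    class names it defines. The result maps each colliding class name to
    the list of files defining it, in the iteration order of the input.
    """
    locations = defaultdict(list)
    for dex_name, class_names in classes_by_dex.items():
        # A set avoids counting a class twice for the same file.
        for class_name in sorted(set(class_names)):
            locations[class_name].append(dex_name)
    return {name: files for name, files in locations.items() if len(files) > 1}


if __name__ == "__main__":
    example = {
        "classes.dex": ["Lcom/example/Main;", "Lcom/example/Util;"],
        "classes2.dex": ["Lcom/example/Util;"],
    }
    for name, files in find_self_shadowed(example).items():
        print(f"warning: {name} is defined in {', '.join(files)}")
```

A tool could display the returned dictionary as a warning next to its usual output.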
Countermeasures against #SDK shadow and Hidden shadow attacks are more complex to handle: they require the list of platform classes on the target smartphone.
The list of #SDK classes can be extracted easily from `android.jar`, but hidden classes need to be obtained by other means.
They could be listed directly from the #AOSP tree of the Android source code, or obtained from Android documentation, or extracted from the phone itself.
The first approach requires statically analyzing the source code, which can be difficult to achieve as several programming languages are used, and the code base is large and fragmented.
As discussed earlier in the chapter, the documentation can lack some classes.
Consequently, the most reliable source is the smartphone itself.
It should be noted that none of these methods can be generalized for all possible versions of Android, as the exact list will depend on the exact targeted device, possibly modified by the manufacturer.
Thus, to counter shadow attacks, the static analysis tools that we evaluated need to embed multiple lists of platform classes, one for each Android version.
Then, the best heuristic would be to use the list of platform classes that is closest to the target #SDK of the analysed application.
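This heuristic can be sketched as follows.
The function name and the tie-breaking rule (preferring the more recent level when two levels are equally close) are assumptions made for the example, not part of any existing tool.

```python
def closest_platform_list(target_sdk, available_levels):
    """Pick the SDK level, among those with an embedded platform class list,
    that is closest to the target SDK of the analysed application.

    Assumption: on a tie, the more recent level is preferred.
    """
    levels = list(available_levels)
    if not levels:
        raise ValueError("no platform class list available")
    # Sort key: smallest distance first, then highest level.
    return min(levels, key=lambda level: (abs(level - target_sdk), -level))


if __name__ == "__main__":
    # Example with lists embedded for SDK 32, 33 and 34.
    print(closest_platform_list(31, [32, 33, 34]))
```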
Some tools like Flowdroid would require additional countermeasures: to compute the exact flow of data, Flowdroid also needs to analyse the code of platform classes.
Flowdroid has already analysed the #SDK classes, but not the hidden classes.
In addition to the data flow in hidden classes, Flowdroid needs a list of data sources and sinks from those classes.
Other analysis tools may require additional data from platform classes, which may be too difficult to obtain.
We believe that analysis tools can handle shadow attacks to some degree.
The implementation of the solution will differ depending on the nature of the tool and may not always require the same implementation effort.
=== Relation with Obfuscation Techniques <sec:cl-cross-obf>
As described in the state of the art, reverse engineers face other techniques of obfuscation such as packers or native code.
These techniques rely on custom class loaders that load new parts of the application from ciphered assets or from the network.
The reverse engineers have to study the application dynamically, to recover new classes, and eventually go back to a static phase to understand the behavior of the application.
In this section, we compare shadow attacks with these techniques and we discuss how they interact with them.
Advanced obfuscation techniques relying on packers have a higher impact on the difficulty of performing a static analysis compared to shadow attacks.
Most of the time, the reverse engineer cannot deobfuscate the application without performing a dynamic analysis.
For this reason, approaches have been designed to assist the capture of the bytecode that is loaded dynamically, after the precise time where the deobfuscation methods have been executed~@zhang2015dexhunter @xue2017adaptive @wong2018tackling.
On the contrary, a shadow attack can be easily defeated by implementing our algorithm in the static analysis tool, as discussed earlier in @sec:cl-countermeasures.
Nevertheless, shadow attacks are stealthier than packers or native code.
Packers can be easily spotted by artifacts left behind in the application or by detecting classes implementing a custom class loading mechanism.
On the contrary, an extra class implementing a shadow attack, which would not be executed, could deliberately contain little code compared to the executed class of Android.
Such an attack would be more discreet than a packer that adds a lot of possibly native code to the application.
Combining regular obfuscation techniques with shadow attacks can be achieved in two ways.
First, the attacker could hide the code of a packer or a native call by using a shadow attack.
For example, by colliding with a class of the #SDK, the attacker could cause the control flow analysis to be wrongly computed and part of the code to be considered dead, which would mislead the reverse engineer about the role of this part, which contains a packer.
At runtime, this code would be triggered, unpacking new code.
Second, the attacker could use a packer to unpack code at runtime in a first phase.
The reverse engineer would have to perform a dynamic analysis, for example using a tool such as Dexhunter~@zhang2015dexhunter, to recover new #DEX files that are loaded by a custom class loader.
Then, the reverse engineer would go back to a new static analysis and could have the problem of solving shadow attacks, for example, if a class is defined multiple times in the loaded #DEX files.
Because the interaction between shadow attacks and other obfuscation techniques often relies on a loading mechanism implemented by the developer, investigating these cases requires analysing the Java bytecode that handles the loading.
This problem is left as future work.
=== Limitations <sec:cl-ttv>
During the analysis of the #ART internals, we made the hypothesis that its different operating modes are equivalent: we analysed the loading process for classes stored as non-optimized `.dex` format, and not for the pre-compiled `.oat`.
It is a reasonable hypothesis to suppose that the two implementations have been produced from the same algorithm using two compilation workflows.
Similarly, we assumed that the platform classes stored in `boot.art` are the same as the ones in `BOOTCLASSPATH`.
We confirmed our hypothesis empirically on an Android Emulator, but we may have missed some edge cases.
The comparison of Smali code can lead to underestimated values, for example, if the compilation process performs minor modifications such as instruction reordering.
The ratios reported in this study for the comparison of code are thus a lower bound and would be higher with a more precise comparison.
In addition, platform classes are stored differently in older versions of Android and could not be easily retrieved.
For this reason, we did not compare the classes found in applications to their versions older than #SDK 32, to avoid producing unreliable statistics for those versions.
=== Future Work <sec:cl-futur>
#todo[Develop @sec:cl-futur]
As we said, our comparison technique is quite naive and could use more work.
It could be insightful to detect exactly when two classes come from the same source file, or which version of a library a class belongs to.
More importantly, a better comparison technique would allow detecting cases where the shadowed library contains added malicious bytecode that we could have missed manually.
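As a first step in that direction, a comparison could ignore register naming, which is one of the small compilation differences we observed.
The sketch below is only an illustration under simplifying assumptions: the function name is hypothetical, registers are matched with a simple regular expression (so a string or class name that looks like a register would also be renamed), and parameter and local registers share a single numbering.

```python
import re

# Matches Smali register names such as v0, v12 or p1.
REGISTER = re.compile(r"\b[vp]\d+\b")


def normalize_registers(smali_lines):
    """Rename registers by order of first appearance in a method body.

    Two bodies that differ only by a consistent permutation of register
    names produce the same normalized lines.
    """
    mapping = {}

    def rename(match):
        register = match.group(0)
        if register not in mapping:
            mapping[register] = f"r{len(mapping)}"
        return mapping[register]

    return [REGISTER.sub(rename, line) for line in smali_lines]


if __name__ == "__main__":
    first = ["const/4 v0, 0x1", "const/4 v1, 0x2", "add-int v0, v0, v1"]
    second = ["const/4 v1, 0x1", "const/4 v0, 0x2", "add-int v1, v1, v0"]
    print(normalize_registers(first) == normalize_registers(second))
```

The function has to be applied to one method body at a time, so that the renaming restarts for each method.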
Additionally, the question of dynamic class loaders, used manually by the application developer is interesting.
This is reaching the limits of static analysis, thoses cases involve dynamically loading bytecode, and in many cases the classes loaded by those classe loaders are not even available for analysis.
However, even with dynamic analysis, the behavior of class loaders can still be an issue, especially when the analysis is performed by alternating static and dynamic analysis, as it is often the case in when manually reversing an application.
To handle those cases, it could be interesting to develop a method to model any arbitrary class loaders, either by analysing its bytecode or by interacting with an instance of the class loader dynamically.
In September 2024 (just after we finished this work), Android 15 introduced support for the new version 41 of the #DEX format.
We can expect this version of #DEX to become the norm in a few years.
The most notable change in version 41 is the new container format: instead of storing the bytecode in separate #DEX files, the different files can now be concatenated into a single file.
There is also some permeability between the concatenated files: some structures stored in one file can be used by the subsequent concatenated files.
This significant change in the bytecode storage is similar to the introduction of the multi-dex format.
Considering that self shadowing is only possible because of the multi-dex format, we expect this change to have the potential to introduce new, similar issues.
Thus, we believe that the implementation details of this new version need to be studied and modeled properly to avoid introducing new issues when updating analysis tools to support it.
Just by reading the specification#footnote[https://source.android.com/docs/core/runtime/dex-format#container], we believe that self shadowing between concatenated #DEX files is possible, unless additional checks are enforced by the #ART when loading the file.
#jm-note[Maybe talk about v41 in RASTA? this will break a lot of things]
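As a first step toward supporting the new format, an analysis tool can at least detect which version of the #DEX format it is given.
The sketch below reads the version from the file magic (`dex\n`, three ASCII digits, then a null byte, as described in the format specification).
The function names are hypothetical, and the sketch does not parse the container itself: it only flags files that may require container-aware handling.

```python
import re
from typing import Optional

# Magic of a DEX file: b"dex\n", three ASCII digits (the version), then a null byte.
DEX_MAGIC = re.compile(rb"dex\n(\d{3})\x00")


def dex_version(data: bytes) -> Optional[int]:
    """Return the DEX format version found in the magic, or None if this is not a DEX file."""
    match = DEX_MAGIC.match(data[:8])
    return int(match.group(1)) if match else None


def may_be_container(data: bytes) -> bool:
    """Hypothetical helper: True when the file uses version 41 or later,
    i.e. when several DEX files may be concatenated in it."""
    version = dex_version(data)
    return version is not None and version >= 41


# Hand-written headers for illustration (only the magic is meaningful here).
old_header = b"dex\n035\x00" + bytes(8)
new_header = b"dex\n041\x00" + bytes(8)
print(dex_version(old_header), may_be_container(old_header))
print(dex_version(new_header), may_be_container(new_header))
```

A tool that silently treats a version 41 file as a single #DEX file would only see the first concatenated file, which is the kind of mismatch that could enable new shadow attacks.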

@ -1,13 +0,0 @@
== Threat to Validity <sec:cl-ttv>
During the analysis of the ART internals, we made the hypothesis that its different operating modes are equivalent: we analysed the loading process for classes stored as non-optimized `.dex` format, and not for the pre-compiled `.oat`.
It is a reasonable hypothesis to suppose that the two implementations have been produced from the same algorithm using two compilation workflows.
Similarly, we assumed that the platform classes stored in `boot.art` are the same as the ones in `BOOTCLASSPATH`.
We confirmed our hypothesis empirically on an Android emulator, but we may have missed some edge cases.
The comparison of Smali code can lead to underestimated values, for example, if the compilation process performs minor modifications such as instruction reordering.
The ratios reported in this study for the comparison of code are thus a lower bound and would be higher with a more precise comparison.
In addition, platform classes are stored differently in older versions of Android and could not be easily retrieved.
For this reason, we did not compare the classes found in applications to their counterparts in versions older than SDK 32, to avoid producing unreliable statistics for those versions.

@ -1,16 +1,24 @@
#import "../lib.typ": SDK, pb2, pb2-text, highlight-block, ie
#import "X_var.typ": *
== Conclusion <sec:cl-conclusion>
This paper has presented three shadow attacks that allow malware developers to fool static analysis tools when reversing an Android application.
This chapter has presented three shadow attacks that allow malware developers to fool static analysis tools when reversing an Android application.
By including multiple classes with the same name or by using the same name as a class of the #Asdk, the developer can mislead a reverser or impact the result of a flow analysis, such as the ones of Androguard or Flowdroid.
We explored whether such shadow attacks are present in a dataset of #nbapk applications.
We found that on average, #shadowsdk of applications are shadowing the SDK, mainly for retro-compatibility purposes and library embedding.
We found that on average, #shadowsdk of applications are shadowing the #SDK, mainly for retro-compatibility purposes and library embedding.
More suspiciously, #shadowhidden of applications are shadowing a hidden class, which could lead to unexpected execution as these classes can appear/disappear with the evolution of Android internals.
Investigations for applications that defined classes multiple times suggest that the compilation process or the inclusion of different versions of the same library is the main explanation.
Finally, when investigating malware samples, we found a specific sample containing a shadow attack that would hide a part of the critical code from a reverser studying the application.
Future work concerns the correctness of bytecode analysis.
For now, we rely on the Smali representation of the bytecode but the compilation process makes this comparison difficult.
We intend to better parse the bytecode to summarize it and be able to have a more reliable comparison method.
#v(1.5em)
#align(center, highlight-block(inset: 15pt, width: 75%, breakable: false, block(align(left)[
#pb2: #pb2-text
#v(0.75em)
@lst:cl-loading-alg models the class loading algorithm: platform classes have priority over classes stored in `classes.dex` which have priority over `classes<n>.dex` (where $n in [| 2, +infinity [| $ and $forall i in [| 2, n [|, exists $ `classes<i>.dex`) which has priority over `classes<n+1>.dex`.
Failing to implement this model (#ie by ignoring some platform classes or by sorting the `classes<n>.dex` alphabetically instead of numerically) can cause static analysis tools to compute an incorrect representation of the analyzed application.
])))
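
The ordering described in the box above can be sketched as follows.
This is a minimal illustration under assumptions, not the actual implementation of the ART: the function names and the `"<platform>"` marker are hypothetical, and class sets are given as plain Python sets instead of being parsed from bytecode.

```python
def dex_load_order(file_names):
    """Return the bytecode files in loading order: classes.dex, classes2.dex, ...
    The enumeration stops at the first missing index, so files after a gap are ignored.
    Assumption: without classes.dex, no bytecode file is considered."""
    available = set(file_names)
    if "classes.dex" not in available:
        return []
    ordered = ["classes.dex"]
    n = 2
    while f"classes{n}.dex" in available:
        ordered.append(f"classes{n}.dex")
        n += 1
    return ordered


def resolve_class(class_name, platform_classes, dex_classes):
    """Return where `class_name` is loaded from.
    `dex_classes` maps a file name to the set of classes it defines."""
    if class_name in platform_classes:
        return "<platform>"  # platform classes shadow every application class
    for dex_file in dex_load_order(dex_classes):
        if class_name in dex_classes[dex_file]:
            return dex_file  # the first definition in loading order wins
    return None


# Hypothetical application with ten bytecode files and one class defined twice.
names = ["classes.dex"] + [f"classes{i}.dex" for i in range(2, 11)]
dex = {name: set() for name in names}
dex["classes2.dex"].add("Lcom/example/Payload;")
dex["classes10.dex"].add("Lcom/example/Payload;")

print(resolve_class("Lcom/example/Payload;", set(), dex))  # classes2.dex
print(sorted(names)[:3])  # alphabetical order puts classes10.dex before classes2.dex
```

A tool that iterates over the files in alphabetical order would pick the definition in `classes10.dex`, which is not the one selected by the numerical ordering.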

@ -1,8 +1,3 @@
#import "../lib.typ": etal, ie, ART, DEX, APK, SDK
#import "X_var.typ": *
== Introduction
/*
When building an application with Android Studio, the source codes of applications are compiled to Java bytecode, which is then converted to Dalvik bytecode.
Dalvik bytecode is then put in a zip archive with other resources such as the application manifest, and the zip archive is then signed.
@ -38,27 +33,4 @@ During these steps, the reverser faces the problem of resolving statically, whic
If they, or the tools they use, choose the wrong version of the class, they may obtain wrong conclusions about the code.
Thus, the possibility of shadowing classes could be exploited by an attacker in order to obfuscate the code.
In this paper, we study how Android handles the loading of classes in the case of multiple versions of the same class.
Such collisions can exist inside the APK file or between the APK file and #Asdkc.
We intend to understand if a reverser would be impacted during a static analysis when dealing with such an obfuscated code.
Because this problem is already complex enough with the current operations performed by Android, we exclude the case where a developer recodes a specific class loader or replaces a class loader by another one, as is often the case, for example, in packed applications~@Duan2018.
We present a new technique that "shadows" a class #ie embeds a class in the APK file and "presents" it to the reverser instead of the legitimate version.
The goal of such an attack is to confuse them during the reversing process: at runtime the real class will be loaded from another location of the APK file or from the #Asdk, instead of the shadow version.
This attack can be applied to regular classes of the #Asdk or to hidden classes of Android~@he_systematic_2023 @li_accessing_2016.
We show how these attacks can confuse the tools of the reverser when they perform a static analysis.
In order to evaluate if such attacks are already used in the wild, we analysed #nbapk applications from 2023 that we extracted randomly from AndroZoo~@allixAndroZooCollectingMillions2016.
Our main result is that #shadowsdk of these applications contain shadow collisions against the #SDK and #shadowhidden against hidden classes.
Our investigations conclude that most of these collisions are not voluntary attacks, but we highlight one specific malware sample performing strong obfuscation revealed by our detection of one shadow attack.
The paper is structured as follows.
@sec:cl-soa reviews the state of the art about loading of Android classes and the tools to perform reverse engineering on applications.
Then, @sec:cl-loading investigates the internal mechanisms about class loading and presents how a reverser can be confused by these mechanisms.
In @sec:cl-obfuscation, we design obfuscation techniques and we show their effect on static analysis tools.
Finally, @sec:cl-wild evaluates if these obfuscation techniques are used in the wild, by searching inside #nbapk APKs if they exploit these techniques.
@sec:cl-ttv discusses the limits of this work and @sec:cl-conclusion concludes the paper.
// In addition to the public #Asdk of `android.jar`, other internal classes are also available for the Android Runtime.
// Those classes are called hidden #Asdkc@li_accessing_2016, and are not supposed to be used by applications.
// In reality their use is tolerated and many applications use them to access some of Android features.
// This tolerance is one of the key point that lead to confusion attacks that we describe later in the paper.

@ -15,10 +15,9 @@
])))
#include("0_intro.typ")
#include("1_related_work.typ")
#include("1_intro.typ")
#include("2_classloading.typ")
#include("3_obfuscation.typ")
#include("4_in_the_wild.typ")
#include("5_ttv.typ")
#include("5_discussion.typ")
#include("6_conclusion.typ")