fix 'typos' (yesss, they are definitely typos)

Jean-Marie 'Histausse' Mineau 2025-09-26 04:21:05 +02:00
parent fede0bd9b2
commit 0d87fae9da
Signed by: histausse
GPG key ID: B66AEEDA9B645AD2
11 changed files with 302 additions and 304 deletions


@ -3,31 +3,28 @@
== Introduction
In order to understand the challenges of reverse engineering Android applications, we first need to understand some key concepts and specificities of Android.
In particular, the format in wich application are distributed, as well as the runtime environment that runs those application, are very specific to Android.
To handle those specificities, a reverse engineer must appropriate tools.
Some of those tools are used recurrently, either by the reverse engineer themself, or as basis for other more complexe tools that implement more advance analysis techniques.
In particular, the format in which applications are distributed, as well as the runtime environment that runs those applications, is very specific to Android.
To handle those specificities, a reverse engineer must have appropriate tools.
Some of those tools are used recurrently, either by the reverse engineer themself, or as a basis for other more complex tools that implement more advanced analysis techniques.
Among those techniques, the ones that do not require to run the application are called static analysis.
Over the time, many of those tools have been released.
To compare those different tools, different benchmarks have been proposed, highlighting different strenght and weeknesses of each tools.
Among those techniques, the ones that do not require running the application are called static analysis.
Over time, many of those tools have been released.
To compare those different tools, different benchmarks have been proposed, highlighting different strengths and weaknesses of each tool.
Unfortunately static analysis has its limits.
One such limit is that it cannot analysis what is not inside the application.
Unfortunately, static analysis has its limits.
One such limit is that it cannot analyse what is not inside the application.
Platform classes are classes that are present directly on the smartphone, and not in the application.
Some of those classes are well known and taken into account by analysis tools, but the rest of those classes, often called _hidden #API;_, are not.
In addition to platform classes, classes that are loaded dynamically (#ie at runtime) are also not always available to static analysis.
This led static analysis tools to disregard the class loading process altogether, leaving the subject relativelly unexplored.
This led static analysis tools to disregard the class loading process altogether, leaving the subject relatively unexplored.
When static analysis fails, for instance because of dynamic class loading, the reverse engineer will fallback dynamic analysis.
Dynamic analysis is the counterpart of static analysis: the analysis is based on the analysis of the excecution of the application.
When static analysis fails, for instance, because of dynamic class loading, the reverse engineer will fall back on dynamic analysis.
Dynamic analysis is the counterpart of static analysis: it is based on the analysis of the execution of the application.
Depending on the context, the reverse engineer will then alternate between different techniques, using previous results to improve the next iteration.
Regrettably, analysis tools mostly return results in an ad hoc format, making it difficult to make other tools aware of the retrieved information.
Some tools however encode their result in the form of a new augmented Android application.
The idea beeing that any Android analysis tools must be able to handle an Android application in the first place, so it will have access to those new information.
Some tools, however, encode their result in the form of a new augmented Android application.
The idea is that any Android analysis tool must be able to handle an Android application in the first place, so it will have access to that new information.
We will begin this chapter by a presentation of the bases of the Android ecosystem.
The reader already familliar with Android reverse engineering might want to skip to @sec:bg-probl where we put our problem statements in perspective.
We will then examine the state of the art related to those problem statements @sec:bg-soa, and conclude this chapter in @sec:bg-conclusion.
#todo[synthese a la fin de chaque section soa des problemes]
#todo[Problematique avant soa]
We will begin this chapter with a presentation of the bases of the Android ecosystem.
The reader already familiar with Android reverse engineering might want to skip to @sec:bg-probl, where we put our problem statements in perspective.
We will then examine the state of the art related to those problem statements in @sec:bg-soa, and conclude this chapter in @sec:bg-conclusion.


@ -4,165 +4,165 @@
=== Android <sec:bg-android>
Android is the smartphone operating system developed by Google.
It is based on a Long Term Support Linux Kernel, to which are added patches develloped by the Android community.
On top of the kernel, Android redeveloped many of the usual components used by linux-based operating systems, like the init system or the standart C library, and added new ones, like the #ART that execute the applications.
Those change make Android a verry unique operating system.
It is based on a Long Term Support Linux Kernel, to which patches developed by the Android community are added.
On top of the kernel, Android redeveloped many of the usual components used by Linux-based operating systems, like the init system or the standard C library, and added new ones, like the #ART that executes the applications.
Those changes make Android a unique operating system.
==== Android Applications <sec:bg-android-apk>
Application in the Android ecosystem are distributed in the #APK format.
#APK files are #JAR files with additionnal features, which are themself #ZIP files with additionnal features.
Applications in the Android ecosystem are distributed in the #APK format.
#APK files are #JAR files with additional features, which are themselves #ZIP files with additional features.
A minimal #APK file contains a file `AndroidManifest.xml`, the `META-INF/` folder containing the #JAR manifest and signature files, and an #APK Signing Block at the end of the #ZIP file.
The code of the application is then store in a custom format, the Dalvik bytecode, or in the binary ELF format, called native code in the Android ecosystem, or both.
The code of the application is then stored in a custom format, the Dalvik bytecode, or in the binary ELF format, called native code in the Android ecosystem, or both.
Dalvik bytecode is stored in the files `classes.dex`, `classes2.dex`, `classes3.dex`, ..., while native code is stored in `lib/<arch>/*.so`.
The `res/` folder contains the ressources required for the user interface.
When ressources are present in `res/`, the file `resources.arsc` is also present at the root of the archive.
The `res/` folder contains the resources required for the user interface.
When resources are present in `res/`, the file `resources.arsc` is also present at the root of the archive.
The `assets/` folder contains the files that are used directly by the application code.
Depending on the application and compilation process, any kind of other files and folders can be added to the application.
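The layout described above can be explored with any #ZIP library. The following is a minimal sketch in Python, assuming only the file names and folders listed in this section; the function name `summarize_apk` and the category labels are hypothetical and chosen for illustration.

```python
import re
import zipfile


def summarize_apk(apk_file):
    """Group the entries of an APK by the role described in this section.

    `apk_file` is a path or a binary file-like object.
    """
    summary = {
        "manifest": [], "dex": [], "native": [], "resources": [],
        "assets": [], "signature_v1": [], "other": [],
    }
    with zipfile.ZipFile(apk_file) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                summary["manifest"].append(name)
            elif re.fullmatch(r"classes\d*\.dex", name):
                # classes.dex, classes2.dex, classes3.dex, ...
                summary["dex"].append(name)
            elif re.fullmatch(r"lib/[^/]+/[^/]+\.so", name):
                # lib/<arch>/*.so
                summary["native"].append(name)
            elif name.startswith("res/") or name == "resources.arsc":
                summary["resources"].append(name)
            elif name.startswith("assets/"):
                summary["assets"].append(name)
            elif name.startswith("META-INF/"):
                # JAR manifest and v1 signature files
                summary["signature_v1"].append(name)
            else:
                summary["other"].append(name)
    return summary
```

Calling `summarize_apk("app.apk")` on a hypothetical file returns a dictionary whose `"dex"` and `"native"` lists indicate at a glance whether the application ships Dalvik bytecode, native code, or both.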
#paragraph[*Signature*][
Android applications are cryptographically signed to prove the autorship.
Applicatations signed with the same key are considered developed by the same entity.
This allow to securely update applications, and applications can declare security permission to restrict access to some feature to only application with the same author.
Android applications are cryptographically signed to prove the authorship.
Applications signed with the same key are considered developed by the same entity.
This allows updating the applications securely, and applications can declare security permissions to restrict access to some feature to only applications with the same author.
Android has several signature schemes coexisting:
- The v1 signature scheme is the #JAR signing scheme, where the signature data is stored in the `META-INF/` folder.
- The v2, v3 and v3.1 signature schemes are stored in the '#APK Signing Block' of the #APK.
The v2 signature scheme was introduced in Android 7.0, and to keep retrocompatibility with older version, the v1 scheme is still used in addition to the #APK Signing Block.
The v2 signature scheme was introduced in Android 7.0, and to keep backward compatibility with older versions, the v1 scheme is still used in addition to the #APK Signing Block.
The Signing block is an unindexed binary section added to the #ZIP file, between the #ZIP entries and the Central Directory.
The signature was added in an unindexed section of the #ZIP to avoid interferring with the v1 signature scheme that sign the files inside the archive, and not the archive itself.
The signature was added in an unindexed section of the #ZIP to avoid interfering with the v1 signature scheme, which signs the files inside the archive and not the archive itself.
- The v4 signature scheme is complementary to the v2/v3 signature scheme.
Signature data is stored in an external `.apk.idsig` file.
]
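Because the signing block sits between the #ZIP entries and the Central Directory, it can be located from the end of the file. The sketch below is an illustration under stated assumptions, not a signature verifier: it assumes the standard End of Central Directory record without ZIP64 extensions, and the block layout documented for the v2 scheme (an 8-byte size, the ID-value pairs, the same 8-byte size again, then the 16-byte magic `APK Sig Block 42`). The function name `find_signing_block` is hypothetical.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"   # End of Central Directory record
SIGNING_BLOCK_MAGIC = b"APK Sig Block 42"


def find_signing_block(apk: bytes):
    """Return (start, end) offsets of the APK Signing Block, or None.

    Assumes a ZIP file without ZIP64 extensions.
    """
    eocd = apk.rfind(EOCD_SIGNATURE)
    if eocd < 0:
        raise ValueError("not a ZIP file: no End of Central Directory record")
    # The Central Directory offset is a 4-byte field at offset 16 of the EOCD.
    (cd_offset,) = struct.unpack_from("<I", apk, eocd + 16)
    # The smallest possible block is 8 + 8 + 16 = 32 bytes long.
    if cd_offset < 32 or apk[cd_offset - 16:cd_offset] != SIGNING_BLOCK_MAGIC:
        return None  # no v2/v3 signature, possibly a v1-only application
    # The size field excludes the leading 8-byte copy of the size.
    (size,) = struct.unpack_from("<Q", apk, cd_offset - 24)
    start = cd_offset - size - 8
    if start < 0 or struct.unpack_from("<Q", apk, start)[0] != size:
        raise ValueError("inconsistent APK Signing Block sizes")
    return start, cd_offset
```

A result of `None` only means that no signing block was found; the application may still carry a v1 signature in `META-INF/`.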
#paragraph[*Android Manifest*][
The Android Manifest is stored in the `AndroidManifest.xml`, encoded in the binary #AXML format.
The manifest declare important informations about the application:
- Generic informations like the application name, id, icon.
- The Android compatibility of the applications, in the form of 3 values: the Android `min-sdk`, `target-sdk` and `max-sdk`. Those are the minimum, targeted and maximum version of the Android SDK supported by the application.
- The application componants (Activity, Service, Receiver and Provider) of the application and their associated classes.
- Intent filters to list the intents that can start or be sent to the application componants.
The manifest declares important information about the application:
- Generic information like the application name, ID and icon.
- The Android compatibility of the applications, in the form of 3 values: the Android `min-sdk`, `target-sdk` and `max-sdk`. Those are the minimum, targeted and maximum versions of the Android SDK supported by the application.
- The application components (Activity, Service, Receiver and Provider) of the application and their associated classes.
- Intent filters to list the intents that can start or be sent to the application components.
- Security permissions required by the application.
]
#paragraph[*Code*][
An application usually contains at least a `classes.dex` file containing Dalvik bytecode.
This is the format executed by the Android #ART.
It is common for an application to have more thant one #DEX file, when application need to reference more methods than the format allows in one file
(each method referenced inside a #DEX is associated to a 16 bits number, limiting their number to #num(65536)).
Support for multiple #DEX files was added in the #SDK 21 version of Android, and applications that have multiple #DEX file are sometimes refered to as 'multi-dex'.
It is common for an application to have more than one #DEX file when an application needs to reference more methods than the format allows in one file
(each method referenced inside a #DEX is associated with a 16-bit number, limiting their number to #num(65536)).
Support for multiple #DEX files was added in the #SDK 21 version of Android, and applications that have multiple #DEX files are sometimes referred to as 'multi-dex'.
In addition to #DEX files, and sometimes instead of #DEX files, applications can contain `.so` ELF (Executable and Linkable Format) files in the `lib/` folder.
In the Android ecosystem, binary code is called native code.
Because native code is compiled for a specific architecture, `.so` files are present in different versions, stored in different subfolders, depending on the targetted architecture.
For example `lib/arm64-v8a/libexample.so` is the version of the `example` library compiled for an ARM 64 architecture.
Because native code is compiled for a specific architecture, `.so` files are present in different versions, stored in different subfolders, depending on the targeted architecture.
For example, `lib/arm64-v8a/libexample.so` is the version of the `example` library compiled for an ARM 64 architecture.
Because smartphones mostly use ARM processors, it is not rare to see applications that only have the ARM version of their native code.
]
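The 16-bit limit on method references can be checked directly from the header of a #DEX file. The sketch below assumes the published Dalvik executable layout, where the little-endian field `method_ids_size` is stored at offset `0x58` of a `0x70`-byte header; the function names are hypothetical.

```python
import struct

DEX_MAGIC_PREFIX = b"dex\n"
DEX_HEADER_SIZE = 0x70
METHOD_IDS_SIZE_OFFSET = 0x58
MAX_METHOD_REFS = 1 << 16  # 65536, the limit of a 16-bit index


def method_reference_count(dex: bytes) -> int:
    """Read the number of method references declared in a DEX header."""
    if len(dex) < DEX_HEADER_SIZE or not dex.startswith(DEX_MAGIC_PREFIX):
        raise ValueError("not a DEX file")
    (count,) = struct.unpack_from("<I", dex, METHOD_IDS_SIZE_OFFSET)
    return count


def remaining_method_slots(dex: bytes) -> int:
    """Number of method references that can still fit in this DEX file."""
    return MAX_METHOD_REFS - method_reference_count(dex)
```

When `remaining_method_slots` approaches zero for `classes.dex`, the build has to spill the extra references into `classes2.dex`, which is the multi-dex situation described above.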
#paragraph[*Ressources*][
Developing graphical interfaces for applications require many kind of specific assets, which are stored in `lib/`.
Those ressources include bitmap images, text, layout, etc.
Data like layout, color or text are stored in binary #AXML.
An additionnal file, `resources.arsc`, in a custom binary format, contains a list of the ressources names, ids, and their properties.
#paragraph[*Resources*][
Developing graphical interfaces for applications requires many kinds of specific assets, which are stored in `res/`.
Those resources include bitmap images, text, layout, etc.
Data like layout, colour or text are stored in binary #AXML.
An additional file, `resources.arsc`, in a custom binary format, contains a list of the resource names, ids, and their properties.
]
#paragraph[*Compilation Process*][
For the developer, the compilation process is handled by Android Studio and is mostly transparent.
Behind the scene, Android Studio rely on Gradle to orchestrate the different compilation steps:
Behind the scenes, Android Studio relies on Gradle to orchestrate the different compilation steps.
The sources #XML files like `AndroidManifest.xml` and the one in `res/` are compile to binary #AXML by `aapt`, which also generate the ressource table `resources.arsc` and a `R.java` file that define for each ressources variable named after the variable, set to the id of the ressouce.
The `R.java` file allows the developer to refere to ressources with readable names and avoid using the often automatically generated ressources ids, that can change from a version of the application to another.
The source #XML files, like `AndroidManifest.xml` and the ones in `res/`, are compiled to binary #AXML by `aapt`, which also generates the resource table `resources.arsc` and an `R.java` file that defines, for each resource, a variable named after the resource and set to the ID of that resource.
The `R.java` file allows the developer to refer to resources with readable names and avoid using the often automatically generated resource IDs, which can change from one version of the application to another.
The source code is then compile.
The most common programming langages used for Android application are Java and Kotlin.
Both are first compiled to java bytecode in `.class` files using the langage compiler.
To allow access to the Android #API, the `.class` are linked during the compilation to an `android.jar` file that contains classes with the same signatures as the one in the Android #API for the targeted SDK.
The source code is then compiled.
The most common programming languages used for Android applications are Java and Kotlin.
Both are first compiled to Java bytecode in `.class` files using the language compiler.
To allow access to the Android #API, the `.class` files are linked during the compilation to an `android.jar` file that contains classes with the same signatures as the ones in the Android #API for the targeted SDK.
The `.class` files are then converted into the #DEX format using `d8`.
During those steeps, both the original langage compiler and `d8` can perform optimizations on the classes, like code shrinking, inlining, etc.
During those steps, both the original language compiler and `d8` can perform optimisations on the classes, like code shrinking, inlining, etc.
If the application contains native code, the original C or C++ code is compile using tools Android from the #NDK to target the different possible architectures.
If the application contains native code, the original C or C++ code is compiled using Android tools from the #NDK to target the different possible architectures.
`aapt` is then used once again to package all the generated #AXML, #DEX, `.so` files, as well as the other ressources files, assets, `resources.arsc`, and any additionnal files deemed necessary to form the final #ZIP file.
`aapt` ensures that the generated #ZIP is compatible with the requirement from Android.
`aapt` is then used once again to package all the generated #AXML, #DEX, `.so` files, as well as the other resource files, assets, `resources.arsc`, and any additional files deemed necessary to form the final #ZIP file.
`aapt` ensures that the generated #ZIP is compatible with the requirements of Android.
For instance, the `resources.arsc` will be mapped directly in memory at runtime, so it must not be compressed inside the #ZIP file.
If necessary, the #ZIP file is then aligned using `zipalign`.
Again, this is to ensure compatibility with android optimizations: some files like `resources.arsc` need to be 4 bits alligned to be mapped in memory.
Again, this is to ensure compatibility with Android optimisations: some files like `resources.arsc` need to be aligned on 4-byte boundaries to be mapped in memory.
The last step is to sign the application using the `apksigner` utility.
Since 2021, Google requires that new applications in the Google Play app store to be uploaded in a new format called Android App Bundles.
Since 2021, Google has required that new applications in the Google Play app store be uploaded in a new format called Android App Bundles.
The main difference is that Google will perform the last packaging steps and generate (and sign) the application itself.
This allow Google to generate different applications for different target, and avoid including unnecessary files in the application like native code targetting the wrong architecture.
This allows Google to generate different applications for different targets and to avoid including unnecessary files in the application, like native code targeting the wrong architecture.
]
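The alignment constraint enforced by `zipalign` can be checked by computing where the data of each uncompressed entry starts in the archive. The following sketch assumes the standard #ZIP local file header (30 fixed bytes, followed by the file name and the extra field, whose lengths are stored at offsets 26 and 28); the function names are hypothetical, and the check is far less thorough than the real tool.

```python
import io
import struct
import zipfile


def stored_entry_offsets(apk: bytes) -> dict:
    """Map each uncompressed entry name to the offset of its data."""
    offsets = {}
    with zipfile.ZipFile(io.BytesIO(apk)) as archive:
        for info in archive.infolist():
            if info.compress_type != zipfile.ZIP_STORED:
                continue  # compressed entries cannot be mapped directly
            name_len, extra_len = struct.unpack_from(
                "<HH", apk, info.header_offset + 26
            )
            offsets[info.filename] = (
                info.header_offset + 30 + name_len + extra_len
            )
    return offsets


def misaligned_entries(apk: bytes, alignment: int = 4) -> list:
    """List the uncompressed entries whose data is not aligned."""
    return [
        name
        for name, offset in stored_entry_offsets(apk).items()
        if offset % alignment != 0
    ]
```

An empty list from `misaligned_entries` suggests that the archive already satisfies the 4-byte constraint for its uncompressed entries.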
==== Android Runtime <sec:bg-art>
Android runtime environement has many specificities that sets it appart from other platforms.
An heavy emphasis is put on isolating the applications from one another as well from the systems critical capabilities.
The Android runtime environment has many specificities that set it apart from other platforms.
A heavy emphasis is put on isolating the applications from one another as well as from the system's critical capabilities.
The code execution itself can be confusing at first.
Instead of the usual linear model with a single entry point, applications have many entrypoints that are called by the Android framework in accordance to external events.
Instead of the usual linear model with a single entry point, applications have many entry points that are called by the Android framework in accordance with external events.
#paragraph[*Application Architecture*][
Android application expose their componants to the Android Runtime (#ART) via classes inheriting specific classes from the Android #SDK.
Android applications expose their components to the Android Runtime (#ART) via classes inheriting specific classes from the Android #SDK.
Four classes represent application components that can be used as entry points:
- _Activities_: An activity represent a single screen with a user interface. This is the component used to interact with a user.
- _Services_: A service serves as en entrypoint to run the application in the background.
- _Broadcast receivers_: A broadcast receiver is an entry point used when a matching event is broadcasted by the system.
- _Content providers_: A content provider is a component that manage data accessible by other app through the content provider.
- _Activities_: An activity represents a single screen with a user interface. This is the component used to interact with a user.
- _Services_: A service serves as an entry point to run the application in the background.
- _Broadcast receivers_: A broadcast receiver is an entry point used when a matching event is broadcast by the system.
- _Content providers_: A content provider is a component that manages data accessible by other applications through the content provider.
Components must be listed in the `AndroidManifest.xml` of the application so that the system knows of them.
In the live cicle of a component, the system will call specific methods defined by the classes associated to each componant type.
Those methods are to be overridden by the classes defined in the application if they are specific action to be perfomed.
In the life cycle of a component, the system will call specific methods defined by the classes associated with each component type.
Those methods are to be overridden by the classes defined in the application if there are specific actions to be performed.
For instance, an activity might compute some values in `onCreate()`, called when the activity is created, save the values of those variables to the file system in `onStop()`, called when the activity stops being visible to the user, and recover the saved values in `onRestart()`, called when the user navigates back to the activity.
In addition to the componants declared in the manifest that act as entry points, the Android #API heavily relies on callbacks.
The most obvious cases are for the user interface, for example a button will call a callback method defined by the application when clicked.
Other part of the #API also rely on non-linear execution, for example when an application sends an intent (see next paragraph), the intent sent in responce is transmitted back to the application by calling another method.
In addition to the components declared in the manifest that act as entry points, the Android #API heavily relies on callbacks.
The most obvious cases are for the user interface; for example, a button will call a callback method defined by the application when clicked.
Other parts of the #API also rely on non-linear execution; for example, when an application sends an intent (see next paragraph), the intent sent in response is transmitted back to the application by calling another method.
]
#paragraph[*Application Isolation and Interprocess Communication*][
On Android, each application has its own storage folders and the application processes are isolated from each other and from the hardware interfaces.
On Android, each application has its own storage folders and the application processes are isolated from each other, and from the hardware interfaces.
This sandboxing is done using Linux security features like group and user permissions, SELinux, and seccomp.
The sandboxing is adjusted according to the permissions requested in the `AndroidManifest.xml` file of the applications.
In addition, most feature of the Android system can only be accessed through Binder, Android main interprocess communication channel.
In addition, most features of the Android system can only be accessed through Binder, Android's main interprocess communication channel.
Binder is a componant of tha Android framework, external to the application, that all applications can communicate with.
Binder is a component of the Android framework, external to the application, that all applications can communicate with.
Applications can send messages to Binder, called *intents*.
Binder will check if the application is allowed to send it, and then foward it to the appropriate componant.
Binder will check if the application is allowed to send it, and then forward it to the appropriate component.
This component can then respond with another intent.
Applications must declare intent filters to indicate which intent can be send to the application, and which classes receive the intents.
Applications must declare intent filters to indicate which intents can be sent to the application, and which classes receive the intents.
Intents are central to Android applications and are not just used to access Android capabilities.
For instance, the activities and services are started by receiving and intent, and it is not uncommon for application to self-send intents to switch between activities.
Intent can also be sent directly from Android to the application: when a user starts an application by tapping the app icons, Android will send an intent to the class of the application that defined the intent filter for the `android.intent.action.MAIN` intent.
One interesting feature of the Binder is that intent do not need to explicitly name the targetted application and class: intent can be implicit and request an action without knowing the exact application that will performed it.
An example of this behaviour is when an application want to open a file: an `android.intent.action.VIEW` intent is sent with the file location and type, and Binder will find and start an application capable of viewing this file.
For instance, activities and services are started by receiving intents, and it is not uncommon for an application to self-send intents to switch between activities.
Intents can also be sent directly from Android to the application: when a user starts an application by tapping the app icon, Android will send an intent to the class of the application that defined the intent filter for the `android.intent.action.MAIN` intent.
One interesting feature of the Binder is that intents do not need to explicitly name the targeted application and class: intents can be implicit and request an action without knowing the exact application that will perform it.
An example of this behaviour is when an application wants to open a file: an `android.intent.action.VIEW` intent is sent with the file location and type, and Binder will find and start an application capable of viewing this file.
]
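The difference between explicit and implicit intents can be illustrated with a toy model. The sketch below is written in Python and is not the Android #API: the `Intent` and `IntentFilter` classes, their fields and the `resolve` function are hypothetical simplifications, and real intent resolution also takes categories, data URIs and permissions into account.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Intent:
    action: str
    mime_type: Optional[str] = None
    target: Optional[str] = None  # set only for explicit intents


@dataclass(frozen=True)
class IntentFilter:
    component: str         # class that receives the matching intents
    action: str
    mime_prefix: str = ""  # an empty prefix accepts any type


def resolve(intent: Intent, filters: List[IntentFilter]) -> List[str]:
    """Return the components able to handle the intent."""
    if intent.target is not None:
        # Explicit intent: the sender names the component itself.
        return [intent.target]
    # Implicit intent: match the declared filters on action and type.
    return [
        f.component
        for f in filters
        if f.action == intent.action
        and (intent.mime_type or "").startswith(f.mime_prefix)
    ]
```

In this model, an intent with the action `android.intent.action.VIEW` and the type `application/pdf` is delivered to whichever component declared a matching filter, without the sender knowing its name.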
#paragraph[*Platform Classes*][
In addition to the classes they include, Android applications have access to classes provided by Android, stored on the phone.
Those classes are called _platform classes_.
They are devided between #SDK classes, and hidden #API.
They are divided between #SDK classes and hidden #API.
The #SDK classes can be seen as the Android standard library.
They are documented by Google, and have a certain stability from version to version.
In case of breaking changes, the changed are listed by Google as well.
The list of #SDK classes is available at compile time in the form of a `android.jar` file to link against.
They are documented by Google and have a certain stability from version to version.
In case of breaking changes, the changes are listed by Google as well.
The list of #SDK classes is available at compile time in the form of an `android.jar` file to link against.
On the opposite, hidden #API are undocumented methods used internally by the #ART.
In contrast, hidden #API are undocumented methods used internally by the #ART.
Still, they are loaded by the application and can be used by it.
]
#paragraph[*Class Loading and Reflection*][
Class loading is the mechanism used by Android to find and select the classes implementation when encontering a reference to a class.
Class loading is the mechanism used by Android to find and select the class implementation when encountering a reference to a class.
Android developers mainly use it to load bytecode dynamically from a source other than the application itself (#eg a file downloaded at runtime), using `ClassLoader` objects.
`Class` objects are the retrieved from those class loaders using their name in the form of strings to identify them.
Those `Class` can then be instanciated into object, and `Methods` objects can be used to call the mehtods of the instanciated object.
The process of manipulating `Class` and `Methods` object instead of using bytecode instructions is called reflection.
`Class` objects are retrieved from those class loaders using their name in the form of strings to identify them.
Those `Class` objects can then be instantiated into objects, and `Method` objects can be used to call the methods of the instantiated objects.
The process of manipulating `Class` and `Method` objects instead of using bytecode instructions is called reflection.
Reflection is not limited to bytecode that has been dynamically loaded: it can be used for any class or method available to the application.
Because the `ClassLoader` object are only used when loading bytecode dynamically or when using reflection, it is often forgotten that the #ART uses class loaders constantly behind the scene, allowing classes from the application and platform classes to cohabit seamlessly.
Because the `ClassLoader` objects are only used when loading bytecode dynamically or when using reflection, it is often forgotten that the #ART uses class loaders constantly behind the scenes, allowing classes from the application and platform classes to coexist seamlessly.
]
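The same pattern exists in most dynamic runtimes, which makes it easy to illustrate outside Android. The sketch below is a Python analogy and not Android code: code is loaded at runtime from a string, a class is looked up by its name, and a method is called by its name, playing the roles that `ClassLoader`, `Class` and `Method` objects play in the Java reflection #API. The names `PLUGIN_SOURCE`, `load_module` and `call_by_name` are hypothetical.

```python
import types

# Stands in for code obtained at runtime, e.g. a downloaded file.
PLUGIN_SOURCE = '''
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return "hello " + self.name
'''


def load_module(name: str, source: str) -> types.ModuleType:
    """Load code that is not part of the program, like a class loader."""
    module = types.ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    return module


def call_by_name(module, class_name: str, method_name: str, *ctor_args):
    """Instantiate a class and call a method, both identified by strings."""
    cls = getattr(module, class_name)        # similar to retrieving a Class
    instance = cls(*ctor_args)               # similar to instantiating it
    method = getattr(instance, method_name)  # similar to retrieving a Method
    return method()                          # similar to invoking it
```

Since the class and method names are plain strings that may themselves be computed or downloaded at runtime, a static analysis of the caller alone cannot tell which code will run, which is the difficulty discussed in this section.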
#v(2em)


@ -4,93 +4,93 @@
=== Reverse Engineering Tools <sec:bg-tools>
Due to the specificities of Android, reverse engineers need tools adapted to Android.
The developement tools provided by Google can be used for basic operations, but a reverse engineer will quickly need more specialized tool.
Usually, the first steep while while analysing an application is to look at its content.
The development tools provided by Google can be used for basic operations, but a reverse engineer will quickly need more specialised tools.
Usually, the first step while analysing an application is to look at its content.
Apktool and Jadx are common tools used to convert the content of an application into a readable format.
Analysing an application this way, without running it, is called static analysis.
For more advanced form of static analysis, Androguard and Soot can be used as librairy to automate analyses.
For more advanced forms of static analysis, Androguard and Soot can be used as libraries to automate analyses.
When static analysis becomes too complicated (#eg if the application uses obfuscation techniques), a reverse engineer might switch to dynamic analysis.
This time, the application is executed and the analyst will scrutinise the behaviour of the application.
Frida is a good option to help this dynamic analysis,
It is a toolkit that can be use to intercept method call and execute custom while an application is running.
This time, the application is executed, and the analyst will scrutinise the behaviour of the application.
Frida is a good option to help with this dynamic analysis.
It is a toolkit that can be used to intercept method calls and execute custom scripts while an application is running.
#paragraph[*Android Studio*][
The whole Android developement ecosystem is packaged by Google in the #IDE Android Studio#footnote[https://developer.android.com/studio].
In practice, Android Studio is a source-code editor that wrap arround the different tools of the android #SDK.
The whole Android development ecosystem is packaged by Google in the #IDE Android Studio#footnote[https://developer.android.com/studio].
In practice, Android Studio is a source-code editor that wraps around the different tools of the Android #SDK.
The #SDK tools and packages can be installed manually with the `sdkmanager` tool.
Among the notable tools in the #SDK, they are:
Among the notable tools in the #SDK are:
- `emulator`: an Android emulator.
This tools allow to run an emulated Android phone on a computer.
Although very useful, Android emulator has several limitation.
For once, it cannot emulate another achitecture.
This tool allows running an emulated Android phone on a computer.
Although very useful, the Android emulator has several limitations.
For one, it cannot emulate another architecture.
An x86_64 computer cannot emulate an ARM smartphone.
This can be an issue because a majority of smartphone run on ARM processor.
Also, for certain version of Android, the proprietary GooglePlay libraries are not available on rooted emulators.
This can be an issue because a majority of smartphones run on ARM processors.
Also, for certain versions of Android, the proprietary GooglePlay libraries are not available on rooted emulators.
Lastly, emulators are not designed to be stealthy and can easily be detected by an application.
Malware will avoid detection by not running their payload on emulators.
- #ADB: a tool to send commands to Android smartphone or emulator.
It can be used to install applications, send instructions, events, and generally perform debuging operations.
- Platform Packages: Those packages contains data associated to a version of android needed to compile an application.
Especially, they contains the so call `android.jar` files, that contains the list of #API for a version of Android.
- `d8`: The main use of `d8` is to convert java bytecode files (`.class`) to Android #DEX format.
It can also be used to perform different level of optimization of the bytecode generated.
- `aapt`/`aapt2` (Android Asset Packaging Tool): This tools is used to build the #APK file.
Malware will avoid detection by not running its payload on emulators.
- #ADB: a tool to send commands to an Android smartphone or emulator.
It can be used to install applications, send instructions, events, and generally perform debugging operations.
- Platform Packages: Those packages contain data associated with a version of Android needed to compile an application.
In particular, they contain the so-called `android.jar` files, which contain the list of #API for a version of Android.
- `d8`: The main use of `d8` is to convert Java bytecode files (`.class`) to Android #DEX format.
It can also be used to perform different levels of optimisation of the bytecode generated.
- `aapt`/`aapt2` (Android Asset Packaging Tool): This tool is used to build the #APK file.
It is commonly used by other tools that repackage applications like Apktool.
Behind the scene, it converts #XML to binary #AXML and ensure that each files have the right compression and alignment. (#eg some ressource files are mapped in memory by the #ART, and thus need to be aligned and not compressed).
Behind the scenes, it converts #XML to binary #AXML and ensures that each file has the right compression and alignment (#eg some resource files are mapped in memory by the #ART, and thus need to be aligned and not compressed).
- `apksigner`: the tool used to sign an #APK file.
When repackaging an application, for example with Apktool, the new application need to be signed.
When repackaging an application, for example, with Apktool, the new application needs to be signed.
]
#paragraph[*Apktool*][
Apktool#footnote[https://apktool.org/] is a _reengineering tool_ for Android #APK files.
It can be used to disassemble an application: it will extract the files from the #APK file, convert the binary #AXML to text #XML, and use smali/backsmali#footnote[https://github.com/JesusFreke/smali] to convert the #DEX files to smali, an assembler-like langage that match the Dalvik bytecode instructions.
The main strenght of Apktool is that after having disassemble an application, the content of the application can be edited and reassemble into a new #APK. #jfl-note[limites? ca marche toujours?]
It can be used to disassemble an application: it will extract the files from the #APK file, convert the binary #AXML to text #XML, and use smali/baksmali#footnote[https://github.com/JesusFreke/smali] to convert the #DEX files to smali, an assembler-like language that matches the Dalvik bytecode instructions.
The main strength of Apktool is that after disassembling an application, its content can be edited and reassembled into a new #APK. #jfl-note[limits? does it always work?]
]
#paragraph[*Androguard*][
Androguard#footnote[https://github.com/androguard/androguard]~@desnos:adnroguard:2011 is a python library for parsing and disassembling #APK files.
It can be used to automatically read Android manifests, ressources, and bytecode.
Contrary to Apktool wich generate text files, it can be used as a library to programatically to analyse the application.
Androguard#footnote[https://github.com/androguard/androguard]~@desnos:adnroguard:2011 is a Python library for parsing and disassembling #APK files.
It can be used to automatically read Android manifests, resources, and bytecode.
Contrary to Apktool, which generates text files, it can be used as a library to programmatically analyse the application.
However, contrary to Apktool, it cannot repackage a modified application.
In addition, it can perform additionnal analysis, like computing a call graph or control flow graph of the application.
We will explain what are those graphs later in @sec:bg-static.
It can also perform additional analysis, like computing a call graph or control flow graph of the application.
We will explain what those graphs are later in @sec:bg-static.
]
#paragraph[*Jadx*][
Jadx#footnote[https://github.com/skylot/jadx] is an application decompiler.
It convert #DEX files to Java source code.
It converts #DEX files to Java source code.
It is not always capable of decompiling all classes of an application, so it cannot be used to recompile a new application, but the code generated can be very helpful to reverse an application.
In addition to decompilling #DEX files, Jadx can also decode Android manifests and application ressources.
In addition to decompiling #DEX files, Jadx can also decode Android manifests and application resources.
]
#paragraph[*Soot*][
Soot#footnote[https://github.com/soot-oss/soot]~@Arzt2013 was originaly a Java optimization framework.
It could leaft java bytecode to other intermediate representations that can could be optimized, then converted back to bytecode.
Soot#footnote[https://github.com/soot-oss/soot]~@Arzt2013 was originally a Java optimisation framework.
It could lift Java bytecode to other intermediate representations that can be optimised, then converted back to bytecode.
Because Dalvik bytecode and Java bytecode are equivalent, support for Android was added to Soot, and Soot features are now leveraged to analyse and modify Android applications.
One of the best known example of Soot usage for Android analysis is Flowdroid~@Arzt2014a, a tool that computes data flow in an application.
One of the best-known examples of Soot usage for Android analysis is Flowdroid~@Arzt2014a, a tool that computes data flow in an application.
A new version of Soot, SootUp#footnote[https://github.com/soot-oss/SootUp], is currently beeing worked on.
Compared to Soot, it has a modernize interface and architecture, but it is not yet feature complete and some tools like Flowdroid are still using Soot.
A new version of Soot, SootUp#footnote[https://github.com/soot-oss/SootUp], is currently being worked on.
Compared to Soot, it has a modernised interface and architecture, but it is not yet feature-complete, and some tools like Flowdroid are still using Soot.
]
#paragraph[*Frida*][
Frida#footnote[https://frida.re/] is a dynamic intrumentation toolkit.
It allows the reverse engineer to inject and run javascript code inside a running application.
Frida#footnote[https://frida.re/] is a dynamic instrumentation toolkit.
It allows the reverse engineer to inject and run JavaScript code inside a running application.
To instrument an application, the frida server must be running as root on the phone, or the frida librairy must be injected inside the #APK file before installing it.
Frida defines a javascript wrapper arround the Java Native Interface (JNI) used by native code to interact with Java classes and the Android #API.
In addition to allowing interaction with Java objects from the application and the Android API, this wrapper provides the option to replace a method implementation by a javascript function (that itself can call the original method implementation if needed).
This make Frida a powerful tool capable of collecting runtime informations or modifying the behavior of an application as needed.
To instrument an application, the Frida server must be running as root on the phone, or the Frida library must be injected inside the #APK file before installing it.
Frida defines a JavaScript wrapper around the Java Native Interface (JNI) used by native code to interact with Java classes and the Android #API.
In addition to allowing interaction with Java objects from the application and the Android API, this wrapper provides the option to replace a method implementation with a JavaScript function (that itself can call the original method implementation if needed).
This makes Frida a powerful tool capable of collecting runtime information or modifying the behaviour of an application as needed.
The main drawback of using Frida is that it is a known tools easily detected by applications.
Malware might implement countermeasures that avoid running malicious payload in presence of Frida.
The main drawback of using Frida is that it is a known tool, easily detected by applications.
Malware might implement countermeasures that avoid running malicious payloads if Frida is detected.
]
#v(2em)
Those tools are quite useful for manual operations.
However, considering the complexity of modern Android applications, it might take a lot of work for a reverse engineer to analyse one application.
Different techniques have been developped to streamline the analysis.
Different techniques have been developed to streamline the analysis.
Next, we will see the most common of those techniques for static analysis.

@ -4,21 +4,21 @@
=== Static Analysis <sec:bg-static>
Static analysis program examine an #APK file without executing it to extract information from it.
A static analysis program examines an #APK file without executing it to extract information from it.
Basic static analysis can include extracting information from the `AndroidManifest.xml` file, disassembling the bytecode with Apktool, or decompiling it to Java code with Jadx.
Unfortunately, simply reading the bytecode does not scale.
To do so, a human analyst is needed, making it complicated to analyse a large number of applications, and even for single applications, the size and complexity of some applications can quickly overwhelm the reverse engineer.
Control flow analysis is often used to mitigate this issue.
The idea is to extract the behaviour, the flow, of the application from the bytecode, and to represent it as a graph.
A graph representation is easier to work with than a list of instructions, and can be used for further analysis.
A graph representation is easier to work with than a list of instructions and can be used for further analysis.
Depending on the level of precision required, different types of graphs can be computed.
The most basic of those graph is the call graph.
A call graph is a graph where the nodes represent the methods in the application, and the edges reprensent calls from one method to another.
The most basic of those graphs is the call graph.
A call graph is a graph where the nodes represent the methods in the application, and the edges represent calls from one method to another.
@fig:bg-fizzbuzz-cg-cfg b) shows the call graph of the code in @fig:bg-fizzbuzz-cg-cfg a).
A more advance control-flow analysis consist in building the control-flow graph.
A more advanced control-flow analysis consists of building the control-flow graph.
This time, instead of methods, the nodes represent instructions, and the edges indicate which instruction can follow which instruction.
@fig:bg-fizzbuzz-cg-cfg c) represents the control-flow graph of @fig:bg-fizzbuzz-cg-cfg a), with code statement instead of bytecode instructions.
@fig:bg-fizzbuzz-cg-cfg c) represents the control-flow graph of @fig:bg-fizzbuzz-cg-cfg a), with code statements instead of bytecode instructions.
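To make the two kinds of graphs concrete, here is a minimal sketch in Python. Both graphs are stored as adjacency lists and traversed with the same reachability function. The method names and statements are hypothetical stand-ins for a FizzBuzz-like program, not the exact content of the figure.

```python
# Toy call graph: nodes are methods, edges are calls.
# All method names are hypothetical, chosen for illustration only.
call_graph = {
    "Main.main": ["FizzBuzz.run"],
    "FizzBuzz.run": ["FizzBuzz.label", "PrintStream.println"],
    "FizzBuzz.label": ["String.valueOf"],
    "PrintStream.println": [],
    "String.valueOf": [],
    "FizzBuzz.unused": ["PrintStream.println"],
}

# Toy control-flow graph of a simple loop: nodes are statements,
# edges indicate which statement can follow which statement.
cfg = {
    "i = 1": ["i <= n"],
    "i <= n": ["print(label(i))", "return"],
    "print(label(i))": ["i = i + 1"],
    "i = i + 1": ["i <= n"],
    "return": [],
}


def reachable(graph, entry):
    """Return the set of nodes reachable from entry (depth-first traversal)."""
    seen = set()
    stack = [entry]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


methods = reachable(call_graph, "Main.main")
statements = reachable(cfg, "i = 1")
```

In this sketch, `methods` does not contain `FizzBuzz.unused`, because no path from the entry point calls it. Questions of this kind are much easier to answer on a graph than on a flat list of instructions.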
#todo[Add alt text for @fig:bg-fizzbuzz-cg and @fig:bg-fizzbuzz-cfg]
@ -111,26 +111,26 @@ This time, instead of methods, the nodes represent instructions, and the edges i
)<fig:bg-fizzbuzz-cg-cfg>
Once the control-flow graph is computed, it can be used to compute data-flows.
Data-flow analysis, also called taint-tracking, allows to follow the flow of information in the application.
Be defining a list of methods and fields that can generate critical information (taint sources) and a list of methods that can consume information (taint sink), taint-tracking allows to detect potential data leaks (if a data flow link a taint source and a taint sink).
For example, `TelephonyManager.getImei()` returns an unique, persistent, device identifier.
Data-flow analysis, also called taint-tracking, is used to follow the flow of information in the application.
By defining a list of methods and fields that can generate critical information (taint sources) and a list of methods that can consume information (taint sinks), taint-tracking detects potential data leaks (if a data flow links a taint source and a taint sink).
For example, `TelephonyManager.getImei()` returns a unique, persistent, device identifier.
This can be used to identify the user, and it cannot be changed if compromised.
This make `TelephonyManager.getImei()` a good candidate as a taint source.
On the other hand, `UrlRequest.start()` send a request to an external server, making it a taint sink.
If a data-flow is found linking `TelephonyManager.getImei()` to `UrlRequest.start()`, this means the application is potentially leaking a critical information to an external entity, a behavior that is probably not wanted by the user.
This makes `TelephonyManager.getImei()` a good candidate as a taint source.
On the other hand, `UrlRequest.start()` sends a request to an external server, making it a taint sink.
If a data-flow is found linking `TelephonyManager.getImei()` to `UrlRequest.start()`, this means the application is potentially leaking critical information to an external entity, a behaviour that is probably not wanted by the user.
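The idea can be sketched in a few lines of Python. The data-flow graph below is hand-written and hypothetical (a real tool would derive it from the control-flow graph), and the search is a plain breadth-first traversal, not the algorithm of any specific tool.

```python
from collections import deque

SOURCES = {"TelephonyManager.getImei()"}
SINKS = {"UrlRequest.start()"}

# Hypothetical data-flow edges: each key flows into the listed values.
data_flow = {
    "TelephonyManager.getImei()": ["imei"],
    "imei": ["url"],
    "url": ["UrlRequest.start()"],
    "Build.MODEL": ["Log.d()"],
}


def find_leaks(flows, sources, sinks):
    """Return one shortest source-to-sink path per reachable sink."""
    leaks = []
    for source in sorted(sources):
        parent = {source: None}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if node in sinks:
                # Rebuild the path by walking the parent links backwards.
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                leaks.append(path[::-1])
                continue
            for successor in flows.get(node, []):
                if successor not in parent:
                    parent[successor] = node
                    queue.append(successor)
    return leaks


leaks = find_leaks(data_flow, SOURCES, SINKS)
```

Here `leaks` contains a single path going from `TelephonyManager.getImei()` to `UrlRequest.start()`. The flow from `Build.MODEL` to `Log.d()` is ignored because neither end is declared as a source or a sink.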
Static analysis is powerful as it allows to detects unwanted behavior in an application even is the behavior does not manifest itself when running the application.
Hovewer, static analysis tools must overcom many challenges when analysing Android applications.
/ the Java object-oriented paradigm: A call to a method can in fact correspond to a call to any method overriding the original method in subclasses.
Static analysis is powerful as it can detect unwanted behaviour in an application, even if the behaviour does not manifest itself when running the application.
However, static analysis tools must overcome many challenges when analysing Android applications.
/ the Java object-oriented paradigm: A call to a method can, in fact, correspond to a call to any method overriding the original method in subclasses.
/ the multiplicity of entry points: Each component of an application can be an entry point for the application.
/ the event driven architecture: Methods of in the applications can be called when event occur, in unknown order.
/ the interleaving of native code and bytecode: Native code can be called from bytecode and vice versa, but tools often only handle one of those format.
/ the event-driven architecture: Methods in the applications can be called when events occur, in an unknown order.
/ the interleaving of native code and bytecode: Native code can be called from bytecode and vice versa, but tools often only handle one of those formats.
/ the potential dynamic code loading: An application can run code that was not originally in the application.
/ the use of reflection: Methods can be called from their name as a string object, which is difficult to identify statically.
/ the continual evolution of Android: each new version of Android brings new features that an analysis tools must be aware of.
/ the continual evolution of Android: each new version of Android brings new features that analysis tools must be aware of.
For instance, the multi-dex feature presented in @sec:bg-android-apk was introduced in Android #SDK 21.
Tools unaware of this feature only analyse the `classes.dex` file an will ignore all other `classes<n>.dex` files.
Tools unaware of this feature only analyse the `classes.dex` file and will ignore all other `classes<n>.dex` files.
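As an illustration of the multi-dex case, the following Python sketch lists every `classes<n>.dex` entry at the root of an archive using only the standard library. The archive is built in memory with dummy contents and only stands in for a real application. The sketch checks file names only and makes no claim about how the runtime itself selects the files to load.

```python
import io
import re
import zipfile

# Matches classes.dex, classes2.dex, classes3.dex, ... at the archive root.
DEX_NAME = re.compile(r"classes(\d*)\.dex")


def list_dex_files(apk):
    """Return the root-level classes<n>.dex entries, in numeric order."""
    with zipfile.ZipFile(apk) as archive:
        names = archive.namelist()
    found = []
    for name in names:
        match = DEX_NAME.fullmatch(name)
        if match:
            # classes.dex has no number: treat it as the first file.
            found.append((int(match.group(1) or 1), name))
    return [name for _, name in sorted(found)]


# Build a tiny in-memory archive standing in for an APK (dummy contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for entry in [
        "AndroidManifest.xml",
        "classes.dex",
        "classes10.dex",
        "classes2.dex",
        "assets/classes3.dex",
    ]:
        archive.writestr(entry, b"")

dex_files = list_dex_files(buffer)
```

A tool that only opens `classes.dex` would miss the two other bytecode files found here. The entry under `assets/` is not reported because it is not at the root of the archive.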
#todo[It would be good to emphasise dynamic code loading and reflection]

@ -2,8 +2,8 @@
== Android Background <sec:bg-android-bg>
We begin this chapter with background information about Android and reverse engineering of Android applications.
We start with a description of Android applications and their execution environement, then list some usefull basic tools for reverse engineering, and finish with the bases of static analysis for Android.
We begin this chapter with background information about Android and the reverse engineering of Android applications.
We start with a description of Android applications and their execution environment, then list some useful basic tools for reverse engineering, and finish with the basics of static analysis for Android.
#include("2_1_android.typ")
#include("2_2_tools.typ")

@ -3,36 +3,36 @@
== Problems of the Reverse Engineer <sec:bg-probl>
In this section, we will develop some issues encontered by reverse engineer, and link them to our problem statements.
In this section, we will elaborate on some issues encountered by reverse engineers and link them to our problem statements.
In the previous section, we listed some limitations to static analysis.
Some limitations have been known for some time now, and many contributions have been made to been made to overcome them.
Those contribution often introduce new tools that implements solutions to those different issues.
Depending on the situation, a reverse engineer might want to use those tools, or build another tool on top of one.
Some limitations have been known for some time now, and many contributions have been made to overcome them.
Those contributions often introduce new tools that implement solutions to those different issues.
Depending on the situation, a reverse engineer might want to use those tools or build another tool on top of one.
Unfortunately, they can be hard to use.
As we said previously, the fast evolution of Android can also be a significant obstacle.
The combinaison of those two point can lead a reverse engineer to spend a lot of time trying to use a tool without realising that tools does not work anymore.
The combination of those two points can lead a reverse engineer to spend a lot of time trying to use a tool without realising that the tool does not work anymore.
Our first problem statement #pb1 focuses on this issue: #pb1-text
Determining which tools are still usable today is a first step, but finding out what reasons make a tool stop working might help writing more resilient tools in the futur.
Determining which tools are still usable today is a first step, but finding out what reasons make a tool stop working might help write more resilient tools in the future.
We also presented dynamic code loading an obstacle for static analysis.
We also presented dynamic code loading as an obstacle for static analysis.
Code loading is achieved using class loader objects, causing class loaders to be generally associated with dynamic code loading.
However, class loading plays a much more important role in the #ART.
Class loading originate from the Java ecosystem, and was ported to Android so that developers could keep writting application in Java.
Despit that, Android made a lot of change to the original Java classes, and did not document those changes.
Between static analysis general oversight of class loading, relegating it to dynamic analysis, and the lake of documentation of the actual behaviour of the #ART, the question of the impact of the class loading algorithm on static analysis can be ask.
Class loading originates from the Java ecosystem and was ported to Android so that developers could keep writing applications in Java.
Despite that, Android made a lot of changes to the original Java classes and did not document those changes.
Between the general oversight of class loading by static analysis, which relegates it to dynamic analysis, and the lack of documentation of the actual behaviour of the #ART, the question of the impact of the class loading algorithm on static analysis can be asked.
Our second problem statement #pb2 aims to answer this question: #pb2-text
Circling back to known limitations of static analysis, dynamic code loading and reflection are often used to obfuscate applications.
Dynamic code loading allows to hide bytecode from static analysis with relativelly low effort.
The bytecode can downloaded at runtime, stored in the application encrypted, hidden inside other files, generated at runtime, etc.
In a way, reflection allows to do the same thing, but for specific method calls: instead of the actual call, static analysis will see a call to the generic `Method.invoke()` method.
By contrast, it is relatively easy to find those the name of the method called or to intercept dynamically loaded bytecode using dynamic tools like Frida.
The issue that arrise then is what to do with the collected data.
Dynamic code loading allows hiding bytecode from static analysis with relatively low effort.
The bytecode can be downloaded at runtime, stored encrypted in the application, hidden inside other files, generated at runtime, etc.
In a way, reflection can do the same thing, but for specific method calls: instead of the actual call, static analysis will see a call to the generic `Method.invoke()` method.
By contrast, it is relatively easy to find the name of the method called or to intercept dynamically loaded bytecode using dynamic tools like Frida.
The issue that arises then is what to do with the collected data.
Simply having it greatly helps a manual analysis, but it cannot be used directly by tools that perform static analyses.
There is no standard representation for runtime information, and there is simply no way to give a list of reflection sites and the associated method calls for most tools.
This means that in most cases, when a reverse engineer wants to improve static analysis with dynamic analysis, they need to modify the static tools to receive the additionnal runtime data.
This means that in most cases, when a reverse engineer wants to improve static analysis with dynamic analysis, they need to modify the static tools to receive the additional runtime data.
Doing so requires both time and knowledge of the internals of the tools used.
Our third problem statement, #pb3, explore an alternative aproach that modify the application instead of the tool: #pb3-text
Our third problem statement, #pb3, explores an alternative approach that modifies the application instead of the tool: #pb3-text
We will now explore the current state of the art for relevent contributions related to our problem statements.
We will now explore the current state of the art for relevant contributions related to our problem statements.

@ -6,45 +6,45 @@
#pb1-text
In the past fifteen years, the research community released many tools to detect or analyse malicious behaviors in applications.
The first steps to anwser this question is to list those previously published tools.
The number of publication related to static analysis can make it difficult to find the right tool for the right task.
In the past fifteen years, the research community has released many tools to detect or analyse malicious behaviours in applications.
The first step to answer this question is to list those previously published tools.
The number of publications related to static analysis can make it difficult to find the right tool for the right task.
Li #etal~@Li2017 published a systematic literature review of Android static analysis work published before May 2015.
They analysed 92 publications and classified them by goal, method used to solve the problem and underlying technical solution for handling the bytecode when performing the static analysis.
In particular, they listed 27 approaches with an open-source implementation available.
Interestingly, a lot of the tools listed rely on common tools to interact with Android applications/#DEX bytecode.
Recurring examples of such support tools are Apktool (#eg Amandroid~@weiAmandroidPreciseGeneral2014, Blueseal~@shenInformationFlowsPermission2014, SAAF~@hoffmannSlicingDroidsProgram2013), Androguard (#eg Adagio~@gasconStructuralDetectionAndroid2013, Apparecium~@titzeAppareciumRevealingData2015, Mallodroid~@fahlWhyEveMallory2012) or Soot (#eg Blueseal~@shenInformationFlowsPermission2014, DroidSafe~@DBLPconfndssGordonKPGNR15, Flowdroid~@Arzt2014a).
This strengthens our idea that behing able to reuse previous tools in important.
This strengthens our idea that being able to reuse previous tools is important.
Those tools are built incrementally, on top of each other.
Nevertheless, Li #etal did not perform experiments to evaluate the reusability of the software they identified.
#jfl-note[We believe that the effort of reviewing the literature for making a comprehensive overview of available approaches should be pushed further: an existing published approach with a software that cannot be used for technical reasons endanger both the reproducibility and reusability of research.][A mettre en avant?]
#jfl-note[We believe that the effort of reviewing the literature for making a comprehensive overview of available approaches should be pushed further: an existing published approach with a software that cannot be used for technical reasons endangers both the reproducibility and reusability of research.][Highlight this?]
//Data-flow analysis is the subject of many contribution~@weiAmandroidPreciseGeneral2014 @titzeAppareciumRevealingData2015 @bosuCollusiveDataLeak2017 @klieberAndroidTaintFlow2014 @DBLPconfndssGordonKPGNR15 @octeauCompositeConstantPropagation2015 @liIccTADetectingInterComponent2015, the most notable tool being Flowdroid~@Arzt2014a.
We will now explore this direction further by looking at the work that has been done to evaluate different analysis tools.
Works that perform benchmaks of tools follow a similar method.
Works that perform benchmarks of tools follow a similar method.
They start by selecting a set of tools with similar goals.
Usually, those contribusions are comparing existing tools to their own, but some contributions do not introduce a new tool and focus on surveying the state of the art for some technique.
They then selected a dataset of application to analyse.
We will see in @sec:bg-datasets that those dataset are often and crafted, even if some studdies select a few read-world application that they manually reverse engineer to get a ground truth to compare to the tools result.
Once the tools and test dataset are selected, the tools are run on the application dataset, and the results of the tools are compared to the ground truth to determine the accuracy of each tools.
Usually, those contributions are comparing existing tools to their own, but some contributions do not introduce a new tool and focus on surveying the state of the art for some technique.
They then select a dataset of applications to analyse.
We will see in @sec:bg-datasets that those datasets are often hand-crafted, except for some studies that select a few real-world applications and manually reverse engineer them to get a ground truth to compare against the tools' results.
Once the tools and test dataset are selected, the tools are run on the application dataset, and the results of the tools are compared to the ground truth to determine the accuracy of each tool.
Several factors can be considered to compare the results of the tools:
the number of false positives, false negatives, or even the time it took to finish the analysis.
Occasionally, the number of application a tool simply failled to analyse are also compared.
Occasionally, the number of applications a tool simply failed to analyse is also compared.
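The comparison step can be sketched as follows in Python. Each flow is modelled as a source and sink pair, and the two sets below are hypothetical examples, not results from any published benchmark. Precision and recall are added here as commonly derived metrics; they are our assumption and not a description of what each benchmark reports.

```python
def compare_to_ground_truth(reported, expected):
    """Classify the flows reported by a tool against the expected flows."""
    true_positives = reported & expected
    return {
        "true_positives": true_positives,
        "false_positives": reported - expected,
        "false_negatives": expected - reported,
        "precision": len(true_positives) / len(reported) if reported else 0.0,
        "recall": len(true_positives) / len(expected) if expected else 0.0,
    }


# Hypothetical ground truth for one application: (source, sink) pairs.
ground_truth = {
    ("TelephonyManager.getImei()", "UrlRequest.start()"),
    ("LocationManager.getLastKnownLocation()", "SmsManager.sendTextMessage()"),
}

# Hypothetical output of a tool for the same application.
tool_report = {
    ("TelephonyManager.getImei()", "UrlRequest.start()"),
    ("Build.getSerial()", "Log.d()"),
}

result = compare_to_ground_truth(tool_report, ground_truth)
```

In this example, the tool finds one of the two expected flows and reports one flow that is not in the ground truth, so both precision and recall are 0.5.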
In @sec:bg-datasets we will look at the dataset used in the community to compare analysis tools.
In @sec:bg-datasets, we will look at the datasets used in the community to compare analysis tools.
Then, in @sec:bg-bench, we will go through the contributions that benchmarked those tools #jm-note[to see if they can be used as an indication as to which tools can still be used today.][Highlight]
==== Application Datasets <sec:bg-datasets>
Research contributions often rely on existing datasets or provide new ones in order to evaluate the developed software.
Raw datasets such as Drebin@Arp2014 contain few information about the provided applications.
Raw datasets such as Drebin~@Arp2014 contain little information about the provided applications.
As a consequence, dataset suites have been developed to provide, in addition to the applications, meta information about the expected results.
For example, taint analysis datasets should provide the source and expected sink of a taint.
In some cases, the datasets are provided with additional software for automatizing part of the analysis.
One such dataset is DroidBench, that was released with the tool Flowdroid~@Arzt2014a.
In some cases, the datasets are provided with additional software for automating part of the analysis.
One such dataset is DroidBench, which was released with the tool Flowdroid~@Arzt2014a.
Later, the dataset ICC-Bench was introduced with the tool Amandroid~@weiAmandroidPreciseGeneral2014 to complement DroidBench by introducing applications using Inter-Component data flows.
These datasets contain carefully crafted applications containing flows that the tools should be able to detect.
These hand-crafted applications can also be used for testing purposes or to detect any regression when the software code evolves.
@ -52,43 +52,43 @@ The drawback to using hand-crafted applications is that these datasets are not r
Contrary to DroidBench and ICC-Bench, some approaches use real-world applications.
Bosu #etal~@bosuCollusiveDataLeak2017 use DIALDroid to perform a threat analysis of Inter-Application communication and published DIALDroid-Bench, an associated dataset.
Similarly, Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022 a real-world dataset and the associated recommendations to build such a dataset.
These datasets are useful for carefully spotting missing taint flows, but contain only a few dozen of applications.
Similarly, Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022, a real-world dataset, and the associated recommendations to build such a dataset.
These datasets are useful for carefully spotting missing taint flows, but contain only a few dozen applications.
In addition to those datasets, AndroZoo~@allixAndroZooCollectingMillions2016 collect applications from several application market places, including the Google Play store (the official Google application store), Anzhi and AppChina (two chinese stores), or FDroid (a store dedicated to free and open source applications).
Currently, Androzoo contains more than 25 millions applications, that can be downloaded by researchers from the SHA256 hash of the application.
Androzoo also provide additionnal information about the applications, like the date the application was detected for the first time by Androzoo or the number of antivirus from VirusTotal that flaged the application as malicious.
In addition to providing researchers with an easy access to real world applications, Androzoo make it a lot easier to share datasets for reproducibility: instead of sharing hundreds of #APK files, the list of SHA256 is enough.
In addition to those datasets, AndroZoo~@allixAndroZooCollectingMillions2016 collects applications from several application marketplaces, including the Google Play store (the official Google application store), Anzhi and AppChina (two Chinese stores), and FDroid (a store dedicated to free and open source applications).
Currently, AndroZoo contains more than 25 million applications, which researchers can download using the SHA256 hash of the application.
AndroZoo also provides additional information about the applications, like the date the application was first detected by AndroZoo or the number of antiviruses from VirusTotal that flagged the application as malicious.
In addition to providing researchers with easy access to real-world applications, AndroZoo makes it a lot easier to share datasets for reproducibility: instead of sharing hundreds of #APK files, the list of SHA256 hashes is enough.
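
To illustrate how such a list can be produced, the following minimal Java sketch computes the SHA256 identifier of each file given on the command line.
The class name `DatasetIndex` and its methods are hypothetical and only serve the example; no AndroZoo service is contacted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of AndroZoo: prints one SHA256 per APK file.
class DatasetIndex {
    // Returns the lowercase hexadecimal SHA256 digest of the given bytes.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is assumed to be the path of an APK file.
        for (String apk : args) {
            System.out.println(sha256Hex(Files.readAllBytes(Path.of(apk))));
        }
    }
}
```

The resulting text file, one hash per line, is small enough to be distributed alongside a paper.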
==== Benchmarking <sec:bg-bench>
The few datasets composed of real-world applications confirmed that some tools, such as Amandroid~@weiAmandroidPreciseGeneral2014 and Flowdroid~@Arzt2014a, are less efficient on real-world applications~@bosuCollusiveDataLeak2017 @luoTaintBenchAutomaticRealworld2022.
Unfortunately, those real-world application datasets are rather small, and a larger number of applications would be more suitable for our goal, #ie evaluating the reusability of a variety of static analysis tools.
Pauck #etal~@pauckAndroidTaintAnalysis2018 used DroidBench~@Arzt2014a, ICC-Bench~@weiAmandroidPreciseGeneral2014 and DIALDroid-Bench~@bosuCollusiveDataLeak2017 to compare Amandroid~@weiAmandroidPreciseGeneral2014, DIAL-Droid~@bosuCollusiveDataLeak2017, DidFail~@klieberAndroidTaintFlow2014, DroidSafe~@DBLPconfndssGordonKPGNR15, FlowDroid~@Arzt2014a and IccTA~@liIccTADetectingInterComponent2015. //-- all these tools will also be compared in this chapter.
To perform their comparison, they introduced the AQL (Android App Analysis Query Language) format.
AQL can be used as a common language to describe the computed taint flow as well as the expected result for the datasets.
It is interesting to notice that all the tested tools timed out at least once on real-world applications, and that Amandroid~@weiAmandroidPreciseGeneral2014, DidFail~@klieberAndroidTaintFlow2014, DroidSafe~@DBLPconfndssGordonKPGNR15, IccTA~@liIccTADetectingInterComponent2015 and ApkCombiner~@liApkCombinerCombiningMultiple2015 (a tool used to combine applications) all failed to run on applications built for Android API 26.
These results suggest that a more thorough study of the link between application characteristics (#eg date, size) and analysis failures should be conducted.
Luo #etal~@luoTaintBenchAutomaticRealworld2022 used the framework introduced by Pauck #etal to compare Amandroid~@weiAmandroidPreciseGeneral2014 and Flowdroid~@Arzt2014a on DroidBench and their own dataset TaintBench, composed of real-world Android malware.
They found out that those tools have a low recall on real-world malware, and are thus over-adapted to micro-datasets.
Unfortunately, because AQL is only focused on taint flows, we cannot use it to evaluate tools performing more generic analysis.
A first work about quantifying the reusability of static analysis tools was proposed by Reaves #etal~@reaves_droid_2016.
Seven Android analysis tools (Amandroid~@weiAmandroidPreciseGeneral2014, AppAudit~@xiaEffectiveRealTimeAndroid2015, DroidSafe~@DBLPconfndssGordonKPGNR15, Epicc~@octeau2013effective, FlowDroid~@Arzt2014a, MalloDroid~@fahlWhyEveMallory2012 and TaintDroid~@Enck2010) were selected to check if they were still readily usable.
For each tool, both the usability and results of the tool were evaluated by asking auditors to install and use it on DroidBench and 16 real-world applications.
The auditors reported that most of the tools require a significant amount of time to set up, often due to dependency issues and operating system incompatibilities.
Reaves #etal propose to solve these issues by distributing a Virtual Machine with a functional build of the tool in addition to the source code.
Regrettably, these Virtual Machines were not made available, preventing future researchers from taking advantage of the work done by the auditors.
Reaves #etal also report that real-world applications are more challenging to analyse, with tools having lower results, taking more time and memory to run, sometimes to the point of not being able to run the analysis.
Considering it was noticed on a dataset of only 16 real-world applications, this result is worrying.
A more diverse dataset would be needed to better assess the extent of the issue and give more insight into the factors impacting the performance of the tools.
//We will confirm and expand this result in @sec:rasta with a larger dataset than only 16 real-world applications.
Mauthe #etal present an interesting methodology to assess the robustness of Android decompilers~@mauthe_large-scale_2021.
They used 4 decompilers on a dataset of 40 000 applications.
The error messages of the decompilers were parsed to list the methods that failed to decompile, and this information was used to estimate the main causes of failure.
It was found that the failure rate is correlated to the size of the method, and that a significant number of failures come from third-party libraries rather than the core code of the application.
They also concluded that malware are easier to entirely decompile, but have a higher failure rate, meaning that the ones that are hard to decompile are substantially harder to decompile than goodware.
/*
@ -144,12 +144,12 @@ DroidBench@Arzt2014a
#v(2em)
To summarise, Li #etal made a systematic literature review of static analysis for Android that listed 27 open-sourced tools.
However, they did not test those tools.
Reaves #etal did so for some of them and analysed the difficulty of using them.
They raised two major concerns about the use of Android static analysis tools.
First, they can be quite difficult to set up, and second, they appear to have difficulties analysing real-world applications.
This is problematic for a reverse engineer: not only do they need to invest a significant amount of work to set up a tool properly, but they also have no guarantee that the tool will actually manage to analyse the application they are investigating.
In @sec:rasta, we will try to set up the tools listed by Li #etal and test them on a large number of real-world applications to see which can be used today.
We will also aim at identifying what characteristics of real-world applications make them harder to analyse.

@ -1,4 +1,4 @@
#import "../lib.typ": SDK, API, API, DEX, pb2, pb2-text, etal, APIs
#import "../lib.typ": todo
=== Android Class Loading <sec:bg-soa-cl>
@ -8,26 +8,26 @@
This subsection is mainly dedicated to class loading in Java and Android.
Because we focus on the _default_ class loading algorithm, we will not focus on dynamic code loading.
However, class loading is used to load classes other than the ones in the application, without dynamic code loading.
In the second part of this subsection, we will look at the work that has been done related to those classes, the platform classes.
==== Class Loading <sec:bg-cl>
Class loading is a fundamental element of Java; it defines which classes are loaded from where.
In Android, this is often associated with dynamic code loading, as the `ClassLoader` objects are used to load code at runtime.
However, class loading also intervenes to load platform classes or classes from the application itself, and thus requires some attention when analysing an application.
Class loading mechanisms have been studied in the general context of the Java language.
Gong~@gong_secure_1998 describes the JDK 1.2 class loading architecture and capabilities.
One of the main advantages of class loading is the type safety property that prevents type spoofing.
As explained by Liang and Bracha~@liang_dynamic_1998, by capturing events at runtime (new loaders, new class) and maintaining constraints on the multiple loaders and their delegation hierarchy, authors can avoid confusion when loading a spoofed class.
This behaviour is now implemented in modern Java virtual machines.
Later, Tozawa and Hagiya~@tozawa_formalization_2002 proposed a formalisation of the Java Virtual Machine supporting dynamic class loading in order to ensure type safety.
Those works ensure strong safety for the Java Virtual Machine, in particular when linking new classes at runtime.
Although Android has a similar mechanism, the implementation is not shared with the JVM of Oracle.
Additionally, our problem statement does not focus on spoofing classes at runtime, but on the confusion that can occur when a reverse engineer uses a static analyser to understand the code loading process offline.
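
To make the notion of delegation hierarchy concrete, the following toy model resolves a class name through a chain of loaders, assuming the parent-first delegation of the standard Java class loading model.
`ToyLoader` is a hypothetical class written for this illustration: it does not load any real code and makes no claim about the exact behaviour of Android or of any analysis tool.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of parent-first delegation; this is not java.lang.ClassLoader.
class ToyLoader {
    private final String name;
    private final ToyLoader parent; // null for the root of the hierarchy
    private final Map<String, String> definitions = new HashMap<>();

    ToyLoader(String name, ToyLoader parent) {
        this.name = name;
        this.parent = parent;
    }

    // Registers a class definition owned by this loader.
    ToyLoader define(String className, String implementation) {
        definitions.put(className, implementation);
        return this;
    }

    // Returns "<loader>:<implementation>" for the winning definition, or null.
    String resolve(String className) {
        if (parent != null) {
            // The parent is always asked first, so its definition wins.
            String fromParent = parent.resolve(className);
            if (fromParent != null) {
                return fromParent;
            }
        }
        String local = definitions.get(className);
        return local == null ? null : name + ":" + local;
    }

    public static void main(String[] args) {
        ToyLoader boot = new ToyLoader("boot", null)
            .define("java.lang.String", "platform");
        ToyLoader app = new ToyLoader("app", boot)
            .define("java.lang.String", "spoofed")
            .define("com.example.Main", "application");
        System.out.println(app.resolve("java.lang.String"));
        System.out.println(app.resolve("com.example.Main"));
    }
}
```

In this model, the definition of `java.lang.String` shipped by the application is never selected, because the parent loader already provides one.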
Contributions about Android class loading focus on using the capabilities of class loading to extend Android features or to prevent reverse engineering of Android applications.
For instance, Zhou #etal~@zhou_dynamic_2022 extend the class loading mechanism of Android to support regular Java bytecode, and Kriz and Maly~@kriz_provisioning_2015 propose a new class loader to automatically load modules of an application without user interaction.
Regarding reverse engineering, class loading mechanisms are frequently used by packers for hiding all or parts of the code of an application~@Duan2018.
For example, packers exploit the class loading capability of Android to load new code.
@ -35,27 +35,27 @@ They also combine the loading with code generation from ciphered assets or code
Because parts of the original code will be only available at runtime, deobfuscation approaches propose techniques that track #DEX structures when manipulated by the application~@zhang2015dexhunter @xue2017adaptive @wong2018tackling.
Those contributions interact with the class loading mechanism of Android to collect the #DEX structures at the right moment.
Some classes, however, are not loaded from the application, nor dynamically loaded by the application.
Those classes are platform classes, and apart from dynamically loaded code, they are the main reason class loading is needed by Android.
We will now look at the literature related to them.
==== Platform Classes <sec:bg-soa-platform>
Platform classes are divided between #SDK classes that are documented, and the other classes, often referred to as hidden #APIs.
#SDK classes are clearly listed and documented by Google, so they do not require as much attention as hidden #APIs.
As we said earlier, hidden #APIs are undocumented methods that can be used by an application, thus making them a potential blind spot when analysing an application.
However, not a lot of research has been done on the subject.
Li #etal did an empirical study of the usage and evolution of hidden #APIs~@li_accessing_2016.
They found that hidden #APIs are added and removed in every release of Android, and that they are used both by benign and malicious applications.
More recently, He #etal~@he_systematic_2023 did a systematic study of hidden service #APIs related to security.
They studied how the hidden #APIs can be used to bypass Android security restrictions and found that although Google's countermeasures are effective, they need to be implemented inside the system services and not the hidden #APIs due to the lack of in-app privilege isolation: the framework code is in the same process as the user code, meaning any restriction in the framework can be bypassed by the user.
Unfortunately, those two contributions do not explore further the consequences of the use of hidden #APIs for a reverse engineer.
#v(2em)
Class loading mechanisms have been studied carefully in the context of the Java language.
However, the same cannot be said about Android, whose implementation diverges significantly from classic Java Virtual Machines.
Most work done on Android focuses on extending Android capabilities using class loading, or on analysing dynamically the code loading operations of an application.
In @sec:cl, we will model the behaviour of Android when loading classes used by an application that does not use dynamic code loading, and check if this behaviour matches the behaviour of common analysis tools.
We will also take some time to check if the state of the art related to hidden #API is up to date with the current Android versions.

@ -5,88 +5,89 @@
#pb3-text
Dynamic analysis of Android applications has been researched for a long time.
Like static analysis, it has its own challenges, which we will explore in this subsection.
After that, we will also look at contributions that sought to encode results inside the #APK format or used instrumentation to improve analyses in some way.
==== Dynamic Analysis <sec:bg-dynamic>
Some situations, like reflection or dynamic code loading, are difficult to solve with static analysis and require a different approach: dynamic analysis.
With dynamic analysis, the application is actually executed, and the reverse engineer observes its behaviour.
Monitoring the behaviour can be achieved by various strategies: observing the filesystem, the display screen, the process memory, the kernel, ...
Depending on the chosen level of observation, it can be technically difficult.
A basic example of dynamic analysis is presented by Bernardi #etal~@bernardi_dynamic_2019: the logs generated by `strace` are used to list the system calls generated in response to an event to determine if an application is malicious or not.
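
As a rough illustration of the first step of such an approach, the sketch below extracts the system call names from a `strace` log and counts them.
It assumes the usual one-call-per-line output, optionally prefixed by a process identifier; the class `SyscallCounter` and the sample log are hypothetical, and the classification step of Bernardi #etal is not reproduced.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts the system calls that appear in a strace log.
class SyscallCounter {
    // Optional "[pid N] " or "N " prefix, then the call name followed by "(".
    private static final Pattern SYSCALL = Pattern.compile(
        "^(?:\\[pid\\s+\\d+\\]\\s+|\\d+\\s+)?([a-z_][a-z0-9_]*)\\(");

    static Map<String, Integer> count(String log) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : log.split("\\R")) {
            Matcher matcher = SYSCALL.matcher(line);
            // Signal and exit lines ("--- ... ---", "+++ ... +++") do not match.
            if (matcher.find()) {
                counts.merge(matcher.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
            "openat(AT_FDCWD, \"/data/app/base.apk\", O_RDONLY) = 3",
            "read(3, \"PK\", 4) = 2",
            "[pid  4242] read(3, \"\", 4) = 0",
            "--- SIGCHLD {si_signo=SIGCHLD} ---",
            "close(3) = 0");
        System.out.println(count(log));
    }
}
```

The resulting counts could then serve as features for a detection model, which is left out of this sketch.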
More advanced methods are more intrusive and require modifying either the #APK, the Android framework, runtime, or kernel.
TaintDroid~@Enck2010, for example, modifies the Dalvik Virtual Machine (the predecessor of the #ART) to track the data flow of an application at runtime, while AndroBlare~@Andriatsimandefitra2012 @andriatsimandefitra_detection_2015 tries to compute the taint flow by hooking system calls using a Linux Security Module.
DexHunter~@zhang2015dexhunter and AppSpear~@yang_appspear_2015 also patch the Dalvik Virtual Machine/#ART, this time to collect bytecode loaded dynamically.
Modifying the Android framework, runtime, or kernel is possible thanks to the Android project being open-source, but this is a delicate operation that requires revising a patch for each new version of Android.
Thus, a common issue faced by tools that took this approach is that they are stuck with a specific version of Android.
Some sandboxes limit this issue by using dynamic binary instrumentation, like DroidHook~@cui_droidhook_2023, based on the Xposed framework, or CamoDroid~@faghihi_camodroid_2022, based on Frida.
This approach is a lot less stealthy than patching Android, but it is generally easier to set up and is easier to port to new Android versions.
Another known challenge when analysing an application dynamically is the code coverage: if some part of the application is not executed, it cannot be analysed.
Considering that Android applications are meant to interact with a user, this can become problematic for automatic analysis.
The Monkey tool developed by Google is one of the most used solutions~@sutter_dynamic_2024.
It sends a random stream of events to the phone without tracking the state of the application.
More advanced tools statically analyse the application to model it in order to improve the exploration.
Sapienz~@mao_sapienz_2016 and Stoat~@su_guided_2017 use this technique to improve application testing.
GroddDroid~@abraham_grodddroid_2015 has the same approach but detects statically suspicious sections of code to target, and will interact with the application to target those code sections.
Unfortunately, exploring the application entirely is not always possible, as some applications will try to detect if they are in a sandbox environment (#eg if they are in an emulator, or if Frida is present in memory) and will refuse to run some sections of code if this is the case.
Ruggia #etal~@ruggia_unmasking_2024 make a list of evasion techniques.
They propose a new sandbox, DroidDungeon, that, contrary to other sandboxes like DroidScope@droidscope180237 or CopperDroid@Tam2015, strongly emphasises resilience against evasion mechanisms.
A common objective of dynamic analysis is to collect bytecode loaded dynamically and reflection information.
As we said earlier, DexHunter~@zhang2015dexhunter and AppSpear~@yang_appspear_2015 do that by instrumenting the Android Runtime.
Qu #etal~@qu_dydroid_2017 developed DyDroid, a hybrid framework using dynamic analysis to intercept dynamic code loading and static analysis to determine the nature of the loaded code.
They used DyDroid to make an audit of the use of dynamic code loading in applications from the Google Play store in 2016.
They found that dynamic code loading was mostly related to mobile advertisement, and that the code loading originated from a third-party library included in the application, rather than from the code of the application developer itself.
Similarly, StaDynA~@zhauniarovichStaDynAAddressingProblem2015 is a framework that generates a call graph statically, then uses dynamic analysis to analyse dynamic code loading and reflection calls to complete this call graph.
The issue with those approaches is that they are only compatible with their own subsequent analysis.
For instance, StaDynA only provides the call graph, and cannot be used as is to improve the capacity of Flowdroid.
This is unfortunate: the reverse engineer's next step will depend on the context.
Not being able to reuse the result of a previous analysis with any ad hoc tools greatly limits their options.
AppSpear has an interesting solution to this issue: the code it intercepts is repackaged inside a new #APK file that Android analysis tools should be able to analyse.
We will now explore further the contributions that take this approach of using actual applications to encode their results.
//#todo[RealDroid sandbox bases on modified ART?]
//#todo[force execution?]
==== Improving Analysis with Instrumentation <sec:bg-instrumentation>
Usually, instrumentation refers to the practice of modifying the behaviour of a program to collect information during its execution.
Frida is a good example of an instrumentation framework.
The term can also be used more generally to describe operations that modify the application code.
In this section, we will focus on the use of instrumentation that makes an application easier to analyse by other tools, instead of just collecting additional information at runtime.
In the previous section, we gave the example of AppSpear~@yang_appspear_2015, which reconstructs #DEX files intercepted at runtime and repackages the #APK with the new code in it.
DexLego~@dexlego has a similar but a lot more aggressive technique.
It targets heavily obfuscated packers that decrypt then re-encrypt the method's instructions just in time.
To get the bytecode, DexLego logs each instruction executed by the #ART, and reconstructs the methods, then the #DEX files, from this stream of instructions.
The main limitation of this technique is that it carries over the limitation of dynamic analysis to static analysis: the bytecode injected in the application is limited to the instructions executed during the dynamic analysis.
Nevertheless, it is an interesting way to encode the traces of a dynamic analysis in a way that can be used by any Android analysis tool.
The technique of IccTA~@liIccTADetectingInterComponent2015 is close to the idea of modifying the application to improve its analysis: it performs a first analysis to compute the potential inter-component communication of an application, then modifies the Jimple representation of this application before feeding it to Flowdroid to perform a taint analysis.
Jimple is the intermediate language used by Soot, so even if IccTA does not generate a new application, this modified representation can probably be used by any tool based on the Soot framework or recompiled into a new application without too much effort.
Samhi #etal~@samhi_jucify_2022 followed this direction to unify the analysis of bytecode and native code.
Their tool, JuCify, use Angr~@angrPeople to generate the call graph of the native code, and use euristics to encode this call graph into jimple that can then be added to the jimple generated by Soot from the bytecode of the application.
Their tool, JuCify, uses Angr~@angrPeople to generate the call graph of the native code, and uses heuristics to encode this call graph into Jimple that can then be added to the Jimple generated by Soot from the bytecode of the application.
Like IccTa, they use Flowdroid to analyse this new augmented representation of the application, but it should be usable by any analysis tool relying on Soot.
Finally, DroidRA~@li_droidra_2016 use the COAL~@octeauCompositeConstantPropagation2015 solver to statically compute the reflection informations.
Finally, DroidRA~@li_droidra_2016 uses the COAL~@octeauCompositeConstantPropagation2015 solver to statically compute the reflection information.
The reflection calls are transformed into direct calls inside the application using Soot.
Using COAL makes DroidRA quite good to solve the simpler cases, where name of classes and methods targeted by reflection are already present in the application.
Those cases are quite commons and beeing able to solve those without resorting to dynamic analysis is quite useful.
On the other hand, COAL will struggle to solve cases with complexe string manipulation and is simply not able to handle cases that rely on external data (#eg downloaded from the internet at runtime).
Likewise, this can only access code loaded dynamically if the code was present inside the application without any kind of obfuscation (#eg a #DEX file in the assets of the application can be analyse, but not if it is ciphered).
Using COAL makes DroidRA quite good at solving the simpler cases, where the names of classes and methods targeted by reflection are already present in the application.
Those cases are quite common; being able to solve those without resorting to dynamic analysis is quite useful.
On the other hand, COAL will struggle to solve cases with complex string manipulation and is simply not able to handle cases that rely on external data (#eg downloaded from the internet at runtime).
Likewise, this can only access code loaded dynamically if the code was present inside the application without any kind of obfuscation (#eg a #DEX file in the assets of the application can be analysed, but not if it is ciphered).
#v(2em)
Instrumenting applications to encode the result of an analysis as an unified representation has been explored before.
Instrumenting applications to encode the result of an analysis as a unified representation has been explored before.
It has been used by tools like AppSpear and DexLego to expose heavily obfuscated bytecode collected dynamically.
Similarly, DroidRA compute reflection information computed statically and inject the actual method calls inside the application it returns.
However, AppSpear and DexLego focus primarely on specific obfuscation techniques, making there implementation difficult to port to more rescent version of Android, and DroidRA suffers the limitation of static analysis.
We believe that instrumentation is a promising approach to encode those information.
Especially, we think that it could be used to provide dynamic information that are not available to static analysis tools like DroidRA.
Similarly, DroidRA computes reflection information statically and injects the actual method calls inside the application it returns.
However, AppSpear and DexLego focus primarily on specific obfuscation techniques, making their implementation difficult to port to more recent versions of Android, and DroidRA suffers from the limitation of static analysis.
We believe that instrumentation is a promising approach to encoding that information.
In particular, we think that it could be used to provide dynamic information that is not available to static analysis tools like DroidRA.
In @sec:th, we will try use instrumentation to combine dynamica analysis (to collect code loaded dynamically and reflection information) with static analysis, indifferently of the static analysis tool used.
In @sec:th, we will try to use instrumentation to combine dynamic analysis (to collect code loaded dynamically and reflection information) with static analysis, regardless of the static analysis tool used.


@ -1,7 +1,7 @@
== State of the Art <sec:bg-soa>
This section focus on the state of the art related to our three probleme statements: the reusability of Android static analysis tools, the class loading mechanism of Android, and the use of instrumentation to encode information collected dynamically.
This section focuses on the state of the art related to our three problem statements: the reusability of Android static analysis tools, the class loading mechanism of Android, and the use of instrumentation to encode information collected dynamically.
#include("4_1_rasta.typ")
#include("4_2_classloader.typ")


@ -2,16 +2,16 @@
== Conclusion <sec:bg-conclusion>
This chapter, presented the specificities of Android and the usual tools used as a basis for reverse engeenering applications.
Many contributions have been done to static analysis, and benchmarks have been proposed to compare the different tools that resulted from those contributions.
This chapter presented the specificities of Android and the usual tools used as a basis for reverse engineering applications.
Many contributions have been made to static analysis, and benchmarks have been proposed to compare the different tools that resulted from those contributions.
Those benchmarks raised questions about the reusability of those tools and their capacity to handle real-world applications.
We then looked at a platform classes and class loading, a commonly recognised limitation of static analysis.
We then looked at platform classes and class loading, a commonly recognised limitation of static analysis.
Because of that, the issue is generally relegated to dynamic analysis, leaving the details of the class loading mechanisms of Android unexplored.
To complement static analysis we continued by looking at dynamic analysis.
A variety of approaches have been proposed, balancing ease of use, maintanability and stealthyness.
The result of those analysis are often in an ad hoc format, making it difficult to reuse with other tools.
A few exception as well as some static analysis tools proposed an interesting solution to this issue:
instrumenting the analyse application to encode the results of the analysis in the form of a valide #APK, a format any Android analysis tools should be able read.
To complement static analysis, we continued by looking at dynamic analysis.
A variety of approaches have been proposed, balancing ease of use, maintainability and stealthiness.
The results of those analyses are often in an ad hoc format, making it difficult to reuse with other tools.
A few exceptions, as well as some static analysis tools, proposed an interesting solution to this issue:
instrumenting the analysed application to encode the results of the analysis in the form of a valid #APK, a format any Android analysis tool should be able to read.
We liked this solution and believe it should be studied further.
This process led us to explore three problem statements:
@ -19,4 +19,4 @@ This process led us to explore three problem statements:
/ #pb2: #pb2-text
/ #pb3: #pb3-text
In the next chapters, we will endeavor to contribute to the Android reverse reverse engineering field by anwsering them.
In the next chapters, we will endeavour to contribute to the Android reverse engineering field by answering them.