
#import "../lib.typ": eg, num, APK, JAR, AXML, ART, SDK, JNI, NDK, DEX, XML, API, ZIP, paragraph, midskip
#import "../lib.typ": todo, jfl-note

=== Android <sec:bg-android>

Android is the smartphone operating system developed by Google.
It is based on a Long Term Support Linux Kernel, to which patches developed by the Android community are added.
On top of the kernel, Android redeveloped many of the usual components used by Linux-based operating systems, like the init system or the standard C library, and added new ones, like the #ART that executes the applications.
Those changes make Android a unique operating system.

==== Android Applications <sec:bg-android-apk>

Applications in the Android ecosystem are distributed in the #APK format.
#APK files are #JAR files with additional features, which are themselves #ZIP files with additional features.

A minimal #APK file contains a file `AndroidManifest.xml`, the `META-INF/` folder containing the #JAR manifest and signature files, and an #APK Signing Block at the end of the #ZIP file.
The code of the application is stored as Dalvik bytecode, a custom format, as binary code in the ELF format, called native code in the Android ecosystem, or as both.
Dalvik bytecode is stored in the files `classes.dex`, `classes2.dex`, `classes3.dex`, and so on, while native code is stored in `lib/<arch>/*.so`.
The `res/` folder contains the resources required for the user interface.
When resources are present in `res/`, the file `resources.arsc` is also present at the root of the archive.
The `assets/` folder contains the files that are used directly by the application code.
Depending on the application and compilation process, any kind of other files and folders can be added to the application.
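
As a rough illustration of this layout, the following Python sketch sorts the entries of an archive into the categories described above.
It relies only on the standard `zipfile` module; the function name `classify_apk_entries` and the category labels are illustrative choices, not part of any Android tool.

```python
import re
import zipfile

# classes.dex and its numbered siblings (classes2.dex, classes3.dex, ...).
DEX_NAME = re.compile(r"^classes\d*\.dex$")
# lib/<arch>/<name>.so, capturing the architecture folder.
NATIVE_LIB = re.compile(r"^lib/([^/]+)/[^/]+\.so$")


def classify_apk_entries(apk):
    """Group the entry names of an APK (path or file object) by role."""
    groups = {"manifest": [], "dex": [], "native": {}, "resources": [],
              "assets": [], "meta_inf": [], "other": []}
    with zipfile.ZipFile(apk) as archive:
        for name in archive.namelist():
            native = NATIVE_LIB.match(name)
            if name == "AndroidManifest.xml":
                groups["manifest"].append(name)
            elif DEX_NAME.match(name):
                groups["dex"].append(name)
            elif native:
                # Native libraries are grouped per target architecture.
                groups["native"].setdefault(native.group(1), []).append(name)
            elif name == "resources.arsc" or name.startswith("res/"):
                groups["resources"].append(name)
            elif name.startswith("assets/"):
                groups["assets"].append(name)
            elif name.startswith("META-INF/"):
                groups["meta_inf"].append(name)
            else:
                groups["other"].append(name)
    return groups
```

Calling `classify_apk_entries("app.apk")` on a hypothetical file returns a dictionary whose `"native"` entry maps each architecture folder to the libraries it contains.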

#paragraph[*Signature*][
Android applications are cryptographically signed to prove their authorship.
Applications signed with the same key are considered developed by the same entity.
This allows applications to be updated securely, and lets an application declare security permissions that restrict access to some of its features to applications from the same author.

Android has several signature schemes coexisting:
- The v1 signature scheme is the #JAR signing scheme, where the signature data is stored in the `META-INF/` folder.
- The v2, v3 and v3.1 signature schemes are stored in the '#APK Signing Block' of the #APK.
  The v2 signature scheme was introduced in Android 7.0, and to keep backward compatibility with older versions, the v1 scheme is still used in addition to the #APK Signing Block.
  The Signing Block is an unindexed binary section added to the #ZIP file, between the #ZIP entries and the Central Directory.
  The signature was placed in an unindexed section of the #ZIP to avoid interfering with the v1 signature scheme, which signs the files inside the archive and not the archive itself.
- The v4 signature scheme is complementary to the v2/v3 signature scheme.
  Signature data is stored in an external `.apk.idsig` file.
]
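
The position of the #APK Signing Block can be checked with a few lines of Python.
The sketch below assumes the layout published for the v2 scheme: the block sits immediately before the Central Directory and ends with a 64-bit size field followed by the 16-byte magic `APK Sig Block 42`.
The function name `find_signing_block` is ours, and the sketch simplifies by assuming that the archive comment does not itself contain the End of Central Directory signature.

```python
import struct

EOCD_MAGIC = b"PK\x05\x06"              # End of Central Directory signature
SIG_BLOCK_MAGIC = b"APK Sig Block 42"   # last 16 bytes of the signing block


def find_signing_block(data):
    """Return (start, end) offsets of the APK Signing Block, or None."""
    eocd = data.rfind(EOCD_MAGIC)
    if eocd < 0:
        raise ValueError("not a ZIP file: no End of Central Directory record")
    # The Central Directory offset is stored 16 bytes into the EOCD record.
    (cd_offset,) = struct.unpack_from("<I", data, eocd + 16)
    # The block footer is a 64-bit size followed by the magic.
    if cd_offset < 24 or data[cd_offset - 16:cd_offset] != SIG_BLOCK_MAGIC:
        return None
    (size,) = struct.unpack_from("<Q", data, cd_offset - 24)
    # The size excludes the leading size field, hence the extra 8 bytes.
    start = cd_offset - size - 8
    if start < 0:
        return None
    (leading_size,) = struct.unpack_from("<Q", data, start)
    if leading_size != size:
        return None  # the two size fields must agree
    return start, cd_offset
```

An archive signed only with the v1 scheme has no such block, in which case the function returns `None`.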

#paragraph[*Android Manifest*][
The Android Manifest is stored in `AndroidManifest.xml`, encoded in the binary #AXML format.
The manifest declares important information about the application:
- Generic information like the application name, ID and icon.
- The Android compatibility of the application, in the form of three values: the Android `min-sdk`, `target-sdk` and `max-sdk`. Those are the minimum, targeted and maximum versions of the Android SDK supported by the application.
- The application components (Activity, Service, Receiver and Provider) and their associated classes.
- Intent filters to list the intents that can start or be sent to the application components.
- Security permissions required by the application.
]
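
To make the content of the manifest more concrete, the following sketch extracts the fields listed above from a manifest that has already been decoded from #AXML to text.
The manifest string, the package name `com.example.app` and the helper names are hypothetical; decoding the binary format itself is out of scope here.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")


def android_attr(element, name):
    """Read an attribute in the android: XML namespace, or None if absent."""
    return element.get(f"{{{ANDROID_NS}}}{name}")


def summarize_manifest(xml_text):
    """Extract SDK bounds, permissions and components from a text manifest."""
    root = ET.fromstring(xml_text)
    sdk = {}
    uses_sdk = root.find("uses-sdk")
    if uses_sdk is not None:
        for key in ("minSdkVersion", "targetSdkVersion", "maxSdkVersion"):
            sdk[key] = android_attr(uses_sdk, key)
    permissions = [android_attr(p, "name")
                   for p in root.findall("uses-permission")]
    components = {
        tag: [android_attr(c, "name")
              for c in root.findall(f"application/{tag}")]
        for tag in COMPONENT_TAGS
    }
    return {"package": root.get("package"), "sdk": sdk,
            "permissions": permissions, "components": components}


# Hypothetical manifest, already decoded from AXML to text.
EXAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <uses-sdk android:minSdkVersion="21" android:targetSdkVersion="34"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <application android:label="Example">
    <activity android:name=".MainActivity">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
        <category android:name="android.intent.category.LAUNCHER"/>
      </intent-filter>
    </activity>
    <service android:name=".SyncService"/>
  </application>
</manifest>"""
```

Here `summarize_manifest(EXAMPLE_MANIFEST)` reports one activity, one service, one permission, and no maximum SDK version.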

#paragraph[*Code*][
An application usually contains at least a `classes.dex` file containing Dalvik bytecode.
This is the format executed by the Android #ART.
It is common for an application to have more than one #DEX file when it needs to reference more methods than the format allows in one file
(each method referenced inside a #DEX file is associated with a 16-bit number, limiting their number to #num(65536)).
Support for multiple #DEX files was added in the #SDK 21 version of Android, and applications that have multiple #DEX files are sometimes referred to as 'multi-dex'.

In addition to #DEX files, and sometimes instead of #DEX files, applications can contain `.so` ELF (Executable and Linkable Format) files in the `lib/` folder.
In the Android ecosystem, binary code is called native code.
Because native code is compiled for a specific architecture, `.so` files are present in different versions, stored in different subfolders, depending on the targeted architecture.
For example, `lib/arm64-v8a/libexample.so` is the version of the `example` library compiled for the 64-bit ARM architecture.
Because smartphones mostly use ARM processors, it is not rare to see applications that only have the ARM version of their native code.
]
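
The method limit can be observed directly in the header of a #DEX file.
The sketch below reads the `method_ids_size` field, which the published #DEX format places at offset `0x58` of the `0x70`-byte header, and compares it to the 16-bit bound.
It assumes the usual little-endian encoding; the function names are ours.

```python
import struct

DEX_MAGIC_PREFIX = b"dex\n"     # start of the 8-byte DEX magic
DEX_HEADER_SIZE = 0x70
METHOD_IDS_SIZE_OFFSET = 0x58   # method_ids_size field of the header
MAX_METHOD_REFS = 1 << 16       # a 16-bit index allows 65536 references


def dex_method_ref_count(dex_bytes):
    """Number of method references declared in the header of a DEX file."""
    if (len(dex_bytes) < DEX_HEADER_SIZE
            or not dex_bytes.startswith(DEX_MAGIC_PREFIX)):
        raise ValueError("not a DEX file")
    (count,) = struct.unpack_from("<I", dex_bytes, METHOD_IDS_SIZE_OFFSET)
    return count


def remaining_method_refs(dex_bytes):
    """How many more method references fit before a new DEX file is needed."""
    return MAX_METHOD_REFS - dex_method_ref_count(dex_bytes)
```

When the remaining count reaches zero, the build has to spill further classes into `classes2.dex` and the following files.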

#paragraph[*Resources*][
Developing graphical interfaces for applications requires many kinds of specific assets, which are stored in `res/`.
Those resources include bitmap images, text, layout, etc.
Data like layout, colour or text are stored in binary #AXML.
An additional file, `resources.arsc`, in a custom binary format, contains a list of the resource names, ids, and their properties.
]

#paragraph[*Compilation Process*][
For the developer, the compilation process is handled by Android Studio and is mostly transparent.
Behind the scenes, Android Studio relies on Gradle to orchestrate the different compilation steps:

The source #XML files, like `AndroidManifest.xml` and those in `res/`, are compiled to binary #AXML by `aapt`, which also generates the resource table `resources.arsc` and an `R.java` file that defines, for each resource, a variable named after the resource and set to its ID.
The `R.java` file allows the developer to refer to resources with readable names and avoid using the often automatically generated resource IDs, which can change from one version of the application to another.

The source code is then compiled.
The most common programming languages used for Android applications are Java and Kotlin.
Both are first compiled to Java bytecode in `.class` files using the language compiler.
To allow access to the Android #API, the `.class` files are linked during compilation against an `android.jar` file that contains classes with the same signatures as the ones in the Android #API for the targeted SDK.
The `.class` files are then converted into the #DEX format using `d8`.
During those steps, both the original language compiler and `d8` can perform optimisations on the classes, like code shrinking, inlining, etc.

If the application contains native code, the original C or C++ code is compiled using Android tools from the #NDK to target the different possible architectures.

`aapt` is then used once again to package all the generated #AXML, #DEX, `.so` files, as well as the other resource files, assets, `resources.arsc`, and any additional files deemed necessary to form the final #ZIP file.
`aapt` ensures that the generated #ZIP is compatible with the requirements of Android.
For instance, the `resources.arsc` will be mapped directly in memory at runtime, so it must not be compressed inside the #ZIP file.

If necessary, the #ZIP file is then aligned using `zipalign`.
Again, this is to ensure compatibility with Android optimisations: some files like `resources.arsc` need to be aligned on 4-byte boundaries to be mapped in memory.
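
The alignment constraint can be verified without Android tooling.
The following sketch lists the uncompressed entries whose data does not start on a multiple of four bytes, by reading the name and extra-field lengths from each #ZIP local file header.
It is an illustration of the check under these assumptions, not a reimplementation of `zipalign`, and the function name is ours.

```python
import io
import struct
import zipfile

LOCAL_HEADER_FIXED_SIZE = 30  # fixed part of a ZIP local file header


def misaligned_stored_entries(apk_bytes, alignment=4):
    """Names of stored entries whose data offset is not a multiple of `alignment`."""
    misaligned = []
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        for info in archive.infolist():
            if info.compress_type != zipfile.ZIP_STORED:
                continue  # only uncompressed entries can be mapped as-is
            # Name and extra-field lengths sit at offsets 26 and 28
            # of the local file header.
            name_len, extra_len = struct.unpack_from(
                "<HH", apk_bytes, info.header_offset + 26)
            data_offset = (info.header_offset + LOCAL_HEADER_FIXED_SIZE
                           + name_len + extra_len)
            if data_offset % alignment:
                misaligned.append(info.filename)
    return misaligned
```

An empty result means that every stored entry already satisfies the constraint, so running `zipalign` would not need to add padding.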

The last step is to sign the application using the `apksigner` utility.

Since 2021, Google has required that new applications in the Google Play app store be uploaded in a new format called Android App Bundles.
The main difference is that Google will perform the last packaging steps and generate (and sign) the application itself.
This allows Google to generate different applications for different targets and to avoid including unnecessary files in the application, like native code targeting the wrong architecture.
]

==== Android Runtime <sec:bg-art>

The Android runtime environment has many specificities that set it apart from other platforms.
A heavy emphasis is put on isolating the applications from one another as well as from the system's critical capabilities.
The code execution itself can be confusing at first.
Instead of the usual linear model with a single entry point, applications have many entry points that are called by the Android framework in accordance with external events.

#paragraph[*Application Architecture*][
Android applications expose their components to the Android Runtime (#ART) via classes inheriting specific classes from the Android #SDK.
Four classes represent application components that can be used as entry points:

- _Activities_: An activity represents a single screen with a user interface. This is the component used to interact with a user.
- _Services_: A service serves as an entry point to run the application in the background.
- _Broadcast receivers_: A broadcast receiver is an entry point used when a matching event is broadcast by the system.
- _Content providers_: A content provider manages data that other applications can access through it.

Components must be listed in the `AndroidManifest.xml` of the application so that the system knows of them.
In the life cycle of a component, the system will call specific methods defined by the classes associated with each component type.
Those methods are to be overridden by the classes defined in the application if there are specific actions to be performed.
For instance, an activity might compute some values in `onCreate()`, called when the activity is created, save the values of those variables to the file system in `onStop()`, called when the activity stops being visible to the user, and recover the saved values in `onRestart()`, called when the user navigates back to the activity.

In addition to the components declared in the manifest that act as entry points, the Android #API heavily relies on callbacks.
The most obvious cases are for the user interface; for example, a button will call a callback method defined by the application when clicked.
Other parts of the #API also rely on non-linear execution; for example, when an application sends an intent (see next paragraph), the intent sent in response is transmitted back to the application by calling another method.
]
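
The inversion of control described above can be pictured with a toy model.
The Python sketch below is not Android code: the `Activity` base class, the hook names and the `run_lifecycle` driver are simplified stand-ins that we introduce for illustration, with a dictionary playing the role of the file system.

```python
class Activity:
    """Toy stand-in for the framework base class; hooks do nothing by default."""

    def on_create(self):
        pass

    def on_stop(self):
        pass

    def on_restart(self):
        pass


class CounterActivity(Activity):
    """Application class overriding only the hooks it cares about."""

    def __init__(self, storage):
        self.storage = storage  # dict standing in for the file system
        self.count = None

    def on_create(self):
        self.count = 0                      # compute initial values

    def on_stop(self):
        self.storage["count"] = self.count  # save state when hidden

    def on_restart(self):
        self.count = self.storage["count"]  # recover the saved state


def run_lifecycle(activity, events):
    """Play the role of the framework: dispatch each event to its hook."""
    hooks = {"create": activity.on_create,
             "stop": activity.on_stop,
             "restart": activity.on_restart}
    for event in events:
        hooks[event]()
```

The application never calls its own hooks: `run_lifecycle` decides when each one runs, which is why a static analysis cannot start from a single entry point.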

#paragraph[*Application Isolation and Interprocess Communication*][
On Android, each application has its own storage folders and the application processes are isolated from each other, and from the hardware interfaces.
This sandboxing is done using Linux security features like group and user permissions, SELinux, and seccomp.
The sandboxing is adjusted according to the permissions requested in the `AndroidManifest.xml` file of the applications.
In addition, most features of the Android system can only be accessed through Binder, Android's main interprocess communication channel.

Binder is a component of the Android framework, external to the application, that all applications can communicate with.
Applications can send messages to Binder, called *intents*.
Binder will check if the application is allowed to send it, and then forward it to the appropriate component.
This component can then respond with another intent.
Applications must declare intent filters to indicate which intents can be sent to the application, and which classes receive them.
Intents are central to Android applications and are not just used to access Android capabilities.
For instance, activities and services are started by receiving intents, and it is not uncommon for an application to self-send intents to switch between activities.
Intents can also be sent directly from Android to the application: when a user starts an application by tapping the app icon, Android will send an intent to the class of the application that defined the intent filter for the `android.intent.action.MAIN` intent.
One interesting feature of Binder is that intents do not need to explicitly name the targeted application and class: intents can be implicit and request an action without knowing the exact application that will perform it.
An example of this behaviour is when an application wants to open a file: an `android.intent.action.VIEW` intent is sent with the file location and type, and Binder will find and start an application capable of viewing this file.
]
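
The difference between explicit and implicit intents can be sketched with a toy resolver.
The model below is deliberately reduced: it matches only the action and an optional MIME type, and ignores categories, URIs, permissions and most of the real matching rules.
The classes `Intent` and `IntentFilter`, the function `resolve` and the component names are hypothetical and only mirror the vocabulary used above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntentFilter:
    component: str                       # class that receives matching intents
    actions: frozenset
    mime_types: frozenset = frozenset()


@dataclass(frozen=True)
class Intent:
    action: str
    mime_type: Optional[str] = None
    component: Optional[str] = None      # set only for explicit intents


def resolve(intent, filters):
    """Return the components able to receive the intent (toy model)."""
    if intent.component is not None:
        return [intent.component]        # explicit: the target is named
    matches = []
    for intent_filter in filters:
        if intent.action not in intent_filter.actions:
            continue
        if (intent.mime_type is not None
                and intent.mime_type not in intent_filter.mime_types):
            continue
        matches.append(intent_filter.component)
    return matches


FILTERS = [
    IntentFilter("com.example.app/.MainActivity",
                 frozenset({"android.intent.action.MAIN"})),
    IntentFilter("com.example.viewer/.PdfActivity",
                 frozenset({"android.intent.action.VIEW"}),
                 frozenset({"application/pdf"})),
]
```

With these filters, an `android.intent.action.VIEW` intent carrying the type `application/pdf` resolves to the viewer component even though the sender never named it.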

#paragraph[*Platform Classes*][
In addition to the classes they include, Android applications have access to classes provided by Android, stored on the phone.
Those classes are called _platform classes_.
They are divided between #SDK classes and hidden #API.
The #SDK classes can be seen as the Android standard library.
They are documented by Google and have a certain stability from version to version.
In case of breaking changes, the changes are listed by Google as well.
The list of #SDK classes is available at compile time in the form of an `android.jar` file to link against.

In contrast, hidden #API are undocumented methods used internally by the #ART.
Still, they are loaded by the application and can be used by it.
]

#paragraph[*Class Loading and Reflection*][
Class loading is the mechanism used by Android to find and select the class implementation when encountering a reference to a class.
Android developers mainly use it to load bytecode dynamically from a source other than the application itself (#eg a file downloaded at runtime), using `ClassLoader` objects.
`Class` objects are retrieved from those class loaders using their names, in the form of strings, to identify them.
Those `Class` objects can then be instantiated, and `Method` objects can be used to call the methods of the resulting objects.
The process of manipulating `Class` and `Method` objects instead of using bytecode instructions is called reflection.
Reflection is not limited to bytecode that has been dynamically loaded: it can be used for any class or method available to the application.

Because the `ClassLoader` objects are only used when loading bytecode dynamically or when using reflection, it is often forgotten that the #ART uses class loaders constantly behind the scenes, allowing classes from the application and platform classes to cohabit seamlessly.
]
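
The core idea of reflection, selecting code through strings rather than through references fixed at compile time, is not specific to Java.
As an analogy only, the sketch below uses Python's own introspection facilities: `importlib.import_module` plays the role of the class loader lookup, and `getattr` the role of retrieving `Class` and `Method` objects.
On Android, the corresponding calls would be `ClassLoader.loadClass`, `Class.getMethod` and `Method.invoke`; the function name `call_by_name` is ours.

```python
import importlib


def call_by_name(module_name, class_name, ctor_args, method_name, *args):
    """Instantiate a class and call one of its methods, all named by strings."""
    module = importlib.import_module(module_name)  # analogous to a loader lookup
    cls = getattr(module, class_name)              # analogous to a Class object
    instance = cls(*ctor_args)                     # instantiate the class
    method = getattr(instance, method_name)        # analogous to a Method object
    return method(*args)                           # analogous to Method.invoke


# Every name below is plain data: it could come from a file, the network,
# or a decrypted string, which is what makes reflection hard to analyse.
result = call_by_name("collections", "Counter", ("banana",), "most_common", 1)
```

Because none of the targets appear as direct references in the code, a static analysis that only follows call instructions sees `call_by_name` but not the method that is eventually executed.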

#midskip

In this subsection, we presented the most notable specificities of the Android ecosystem.
In the next section, we will continue with the various tools available for an Android reverse engineer.