Azure: implement sending memory metrics via diagnostic extension #2022

Merged: 10 commits into master from feature/azure-metrics, May 27, 2024

francescolavra
Member

This change set enhances the cloud_init klib by implementing an Azure VM agent (this fixes the "virtual machine agent status is not ready" warning that is currently displayed for Nanos instances in the Azure portal), and adds a new "azure" klib that implements an Azure extension similar to the Linux Diagnostic extension.

The current implementation supports sending 4 types of memory metrics (i.e. available and used memory, as both number of bytes and percentage of total memory). The azure klib is configured in the manifest options via an "azure" tuple; the diagnostic functionalities in this klib are enabled and configured by inserting a "diagnostics" tuple with the following attributes:

  • storage_account: the Azure storage account in which metrics data generated by the klib is stored; the storage account must be in the same region as the Azure instance
  • storage_account_sas: Shared Access Signature (SAS) token for accessing the storage account; this token must have permissions to create Azure storage tables and add table entities in the above storage account. SAS tokens for a given storage account can be generated, for example, via the Azure portal in the "Security + networking" section
  • metrics: tuple that enables sending memory metrics; it can contain 2 optional attributes:
    • sample_interval: interval expressed in seconds at which metrics data is collected (default: 15)
    • transfer_interval: interval expressed in seconds at which metrics data is aggregated and sent to the storage account (default: 60)

Example snippet of Ops configuration file:

```
"ManifestPassthrough": {
  "azure": {
    "diagnostics": {
      "storage_account": "mystorageaccount",
      "storage_account_sas": "sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupiytfx&se=2024-05-22T14:50:28Z&st=2024-05-12T06:50:28Z&spr=https&sig=xxyyzz",
      "metrics": {"sample_interval": "15","transfer_interval": "60"}
    }
  }
}
```

Aggregated memory metrics data consist of the number of samples, the minimum, maximum, last, and average values, and the sum of all values; these data are inserted in an Azure storage table (one entity per set of aggregated data). The table name has the format "WADMetricsxxxxP10DV2Syyyymmdd", where xxxx is the transfer interval in ISO 8601 format, and yyyymmdd identifies the 10-day interval to which the metrics refer (thus, a new table is created every 10 days). For example, a table named "WADMetricsPT1MP10DV2S20240503" contains metrics data aggregated every minute ("PT1M" is the ISO 8601 representation of a 1-minute period) generated for a 10-day period starting on May 3, 2024.
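For illustration, the aggregation and the table naming described above can be sketched as follows. This is a minimal sketch, not the klib code: the struct, the function names, and the rule used to turn seconds into an ISO 8601 period are assumptions, and the start date of the 10-day interval is taken as an input because the description does not say how it is computed.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical aggregate for one metric over one transfer interval */
struct metric_aggregate {
    uint64_t count;   /* number of samples */
    uint64_t min;
    uint64_t max;
    uint64_t last;
    uint64_t total;   /* sum of all values */
};

static void aggregate_reset(struct metric_aggregate *a)
{
    a->count = 0;
    a->min = UINT64_MAX;
    a->max = 0;
    a->last = 0;
    a->total = 0;
}

/* Called once per sample interval */
static void aggregate_add(struct metric_aggregate *a, uint64_t sample)
{
    if (sample < a->min)
        a->min = sample;
    if (sample > a->max)
        a->max = sample;
    a->last = sample;
    a->total += sample;
    a->count++;
}

/* Called once per transfer interval, before sending and resetting */
static uint64_t aggregate_average(const struct metric_aggregate *a)
{
    return a->count ? a->total / a->count : 0;
}

/* Builds "WADMetrics<period>P10DV2S<yyyymmdd>"; the seconds-to-period
 * mapping (whole hours, else whole minutes, else seconds) is an assumption.
 * Returns the name length, or -1 on invalid input or truncation. */
static int metrics_table_name(char *dest, size_t len, unsigned interval_secs,
                              unsigned year, unsigned month, unsigned day)
{
    char period[16];
    int n;

    if (interval_secs == 0)
        return -1;
    if (interval_secs % 3600 == 0)
        snprintf(period, sizeof(period), "PT%uH", interval_secs / 3600);
    else if (interval_secs % 60 == 0)
        snprintf(period, sizeof(period), "PT%uM", interval_secs / 60);
    else
        snprintf(period, sizeof(period), "PT%uS", interval_secs);
    n = snprintf(dest, len, "WADMetrics%sP10DV2S%04u%02u%02u",
                 period, year, month, day);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

With the default intervals (15 and 60 seconds), each aggregate would cover 4 samples, and calling `metrics_table_name()` with 60 seconds and the date 2024-05-03 yields the table name used in the example above.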

By default, the Azure portal does not display these metrics in its charts; in order for metrics to be available in the portal, the Linux Diagnostics Extension must be enabled and configured in a running instance (this can be done in the "Diagnostic settings" section in the portal) to match the settings in the Nanos manifest options. More specifically, the storage account and the metric aggregation interval specified in the Azure diagnostic settings must match those specified in the manifest options.

Note: the Azure VM agent implemented in the cloud_init klib responds to requests to enable and configure the diagnostic extension, but does not actually apply the extension settings specified in the requests; instead, it always applies the settings from the manifest.

Closes #2014

When a TLS handshake with a remote peer is complete, the TLS input
buffer handler invokes the application layer connection handler,
which returns the application layer input buffer handler for the
connection. Any error at this stage should be reported by the
application layer by returning INVALID_ADDRESS; this is consistent
with the behavior for non-encrypted connections (see
direct_receive_service() in net/direct.c), and allows applications
to not implement an input buffer handler (e.g. when they connect to
a remote peer to only send data and then close the connection), in
which case their connection handler can return 0.
This change modifies the TLS input buffer handler so that the check
for errors uses INVALID_ADDRESS instead of 0, and modifies the gcp
and cloudwatch code to align with this implementation.
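The return value convention described above can be sketched as follows. This is only an illustration with stand-in types: the `input_handler` struct, the handler signatures, the `tls_handshake_done()` helper, and the value chosen for `INVALID_ADDRESS` are assumptions and do not reproduce the kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel sentinel; the actual value is an assumption */
#define INVALID_ADDRESS ((void *)(uintptr_t)-1)

/* Hypothetical application layer input buffer handler */
struct input_handler {
    void (*fn)(const void *data, size_t len);
};

/* Application that only sends data: no input handler, and no error */
static struct input_handler *send_only_connected(void *conn)
{
    (void)conn;
    return NULL;
}

/* Application that fails to set up its connection state */
static struct input_handler *failing_connected(void *conn)
{
    (void)conn;
    return INVALID_ADDRESS;
}

/* What the TLS layer does once the handshake is complete: only
 * INVALID_ADDRESS is treated as an error, while 0 is a valid result.
 * Returns 0 on success, -1 if the application reported an error. */
static int tls_handshake_done(struct input_handler *(*app_connected)(void *),
                              void *conn, struct input_handler **ih_out)
{
    struct input_handler *ih = app_connected(conn);

    if (ih == INVALID_ADDRESS)
        return -1;
    *ih_out = ih;   /* may be NULL for send-only connections */
    return 0;
}
```

With the previous check against 0, the send-only handler in this sketch would have been treated as a failure.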
This function allows sending an arbitrary HTTP request and
receiving a response without having to implement a connection
handler and an input buffer handler. Callers can optionally
implement a value handler to receive the server response, which is
internally parsed by the utility code. The cloud_azure.c code has
been refactored to use this new function.

An Azure instance must report its "ready" status at least once
after being provisioned. In the current code, if for some reason
the cloud_init klib fails to report ready at the first boot, it
will never report ready even at subsequent boots, which prevents
the instance status from transitioning to the running state.
This change modifies the cloud_init klib so that cloud-specific
initialization is executed at every boot; besides fixing the above
potential issue, this will allow implementing an Azure VM agent.
The first_boot() function, being no longer used, is being removed
(the existing implementation had a flaw by which if the TFS log is
compacted at the first boot, the first_boot() function would return
true even at the next boot).

This change makes http_request() insert the Content-length HTTP
header in any request, regardless of the presence of a non-empty
request body. This is necessary in order to support some types of
requests which require this header (for example PUT requests to
the Azure blob storage service to create a blob).
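As an illustration of the resulting behavior, the sketch below formats a request that always carries the header, even with an empty body. The `format_request()` helper and its parameters are hypothetical and unrelated to the actual http_request() signature.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes an HTTP/1.1 request into dest and always
 * emits Content-Length, including "Content-Length: 0" when body is NULL
 * or empty. Returns the request length, or -1 on truncation. */
static int format_request(char *dest, size_t len, const char *method,
                          const char *path, const char *host,
                          const char *body)
{
    size_t body_len = body ? strlen(body) : 0;
    int n = snprintf(dest, len,
                     "%s %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Content-Length: %zu\r\n"
                     "\r\n"
                     "%s",
                     method, path, host, body_len, body ? body : "");

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

A body-less PUT formatted this way still contains a `Content-Length: 0` line, which is the case the commit message refers to.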
The kernel code that automatically loads klibs found in the /klib
folder has a flaw by which a klib that at a first attempt fails to
initialize due to missing dependencies is put in a state where it
cannot be initialized even after these dependencies are satisfied
by other klibs that are subsequently loaded. This is because the
`pending` variable cannot be safely used in a lock-free manner to
determine whether any other klibs are about to be initialized; for
example, in an SMP VM one core could set `pending` to 0 and another
core could put a klib in a failed state before the first core
initializes the just loaded klib.
This issue is causing sporadic CI test failures, such as
https://app.circleci.com/pipelines/github/nanovms/nanos/4623/workflows/f30b3b7f-0732-49d6-9e09-f9efbd5d6e21/jobs/16230.

This change fixes the above issue by introducing a
spinlock-protected klib_autoload structure that keeps track of
pending klibs and loaded klibs with missing dependencies. As a side
effect, the lock in this structure protects the `klib_loaded`
vector and the global kernel symbol table from concurrent
modifications.
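The idea can be sketched in user-space C as follows. This is not the kernel implementation: the struct layout, the single-dependency model, the fixed-size arrays, and the C11 `atomic_flag` spinlock are all assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define MAX_KLIBS 16    /* sketch only: callers must not exceed this */

struct klib {
    const char *name;
    const char *dep;    /* name of a required klib, or NULL */
    bool initialized;
};

struct klib_autoload {
    atomic_flag lock;                 /* stands in for the kernel spinlock */
    int pending;                      /* klibs not yet loaded */
    struct klib *loaded[MAX_KLIBS];   /* initialized klibs */
    int n_loaded;
    struct klib *retry[MAX_KLIBS];    /* loaded, but missing dependencies */
    int n_retry;
    int n_failed;
};

static void autoload_init(struct klib_autoload *al, int pending)
{
    atomic_flag_clear(&al->lock);
    al->pending = pending;
    al->n_loaded = al->n_retry = al->n_failed = 0;
}

/* Must be called with the lock held */
static bool try_init(struct klib_autoload *al, struct klib *k)
{
    if (k->dep) {
        bool found = false;
        for (int i = 0; i < al->n_loaded; i++)
            if (!strcmp(al->loaded[i]->name, k->dep))
                found = true;
        if (!found)
            return false;
    }
    k->initialized = true;
    al->loaded[al->n_loaded++] = k;
    return true;
}

/* Called when a klib has been loaded from the filesystem */
static void klib_load_complete(struct klib_autoload *al, struct klib *k)
{
    while (atomic_flag_test_and_set_explicit(&al->lock, memory_order_acquire))
        ;   /* spin */
    al->pending--;
    if (try_init(al, k)) {
        bool progress;
        do {    /* a new klib may satisfy previously missing dependencies */
            progress = false;
            for (int i = 0; i < al->n_retry; i++) {
                if (try_init(al, al->retry[i])) {
                    al->retry[i] = al->retry[--al->n_retry];
                    i--;
                    progress = true;
                }
            }
        } while (progress);
    } else {
        al->retry[al->n_retry++] = k;
    }
    if (al->pending == 0) {     /* nothing else can satisfy what remains */
        al->n_failed += al->n_retry;
        al->n_retry = 0;
    }
    atomic_flag_clear_explicit(&al->lock, memory_order_release);
}
```

Because the pending count and the retry list are only read and written under the same lock, a klib is declared failed only when no other klib can still be loaded, which is the property the lock-free `pending` check could not guarantee.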
This change makes the buffer_set_capacity() function work with
buffers without contents (i.e. buffer structs with a zero `length`
field). This allows buffers initialized via `init_buffer()` to be
used as dynamically allocated buffers without having to do an
initial allocation for buffer contents.
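A simplified sketch of this behavior is shown below. The `sbuf` struct and the `sbuf_*` functions are hypothetical stand-ins built on malloc(), not the kernel buffer API; only the role of the `length` field is taken from the description above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel buffer struct */
struct sbuf {
    unsigned char *contents;
    size_t start, end;  /* valid data is contents[start..end) */
    size_t length;      /* allocated capacity; 0 means no contents yet */
};

/* Initializes an empty buffer without allocating anything */
static void sbuf_init(struct sbuf *b)
{
    memset(b, 0, sizeof(*b));
}

/* Changes the capacity, preserving existing data; also works when the
 * buffer has no contents yet. Returns 0 on success, -1 on failure. */
static int sbuf_set_capacity(struct sbuf *b, size_t new_length)
{
    size_t used = b->end - b->start;
    unsigned char *n = NULL;

    if (new_length < used)
        return -1;      /* would drop valid data */
    if (new_length) {
        n = malloc(new_length);
        if (!n)
            return -1;
        if (used)       /* skip the copy for buffers without contents */
            memcpy(n, b->contents + b->start, used);
    }
    free(b->contents);  /* free(NULL) is a no-op */
    b->contents = n;
    b->start = 0;
    b->end = used;
    b->length = new_length;
    return 0;
}
```

In this sketch a buffer can go straight from `sbuf_init()` to `sbuf_set_capacity()`, with the first allocation happening only when capacity is actually needed.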
The `buffer_set_capacity()` function is being moved from a header
file to a source file because it can be computationally intensive
(due to memory allocation and deallocation operations, as well as
memory copying) and as such should not be called in hot code paths.
This decreases the kernel binary size by about 55 KB.

This is done in preparation for the next commit which will add
support for printing hexadecimal numbers with uppercase letters.

Besides adding a new functionality to printf-style functions, this
change makes the kernel compatible with third-party code (such as
lwIP and mbedtls) that uses this format for printing hexadecimal
numbers with uppercase letters.
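In standard C, uppercase hexadecimal output corresponds to the `%X` conversion. The sketch below shows how a formatter can support both cases by switching the digit table; `format_hex()` is a hypothetical helper and not the kernel's printf code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes value in hexadecimal into dest, using
 * uppercase letters if requested. Returns the number of digits written,
 * or -1 if dest is too small. */
static int format_hex(char *dest, size_t len, uint64_t value, int uppercase)
{
    const char *digits = uppercase ? "0123456789ABCDEF" : "0123456789abcdef";
    char tmp[16];   /* a 64-bit value has at most 16 hex digits */
    int n = 0;

    do {            /* least significant digit first */
        tmp[n++] = digits[value & 0xf];
        value >>= 4;
    } while (value);
    if ((size_t)n >= len)
        return -1;
    for (int i = 0; i < n; i++)
        dest[i] = tmp[n - 1 - i];
    dest[n] = '\0';
    return n;
}
```

For the value 0xdeadbeef, this helper produces "DEADBEEF" when `uppercase` is non-zero and "deadbeef" otherwise.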
This change enhances the cloud_init klib by implementing an Azure
VM agent. This fixes the "virtual machine agent status is not
ready" warning that is currently displayed for Nanos instances in
the Azure portal. In addition, it adds support for implementing
Azure extensions.
@francescolavra francescolavra merged commit a92295e into master May 27, 2024
5 checks passed
@francescolavra francescolavra deleted the feature/azure-metrics branch May 27, 2024 11:17
Successfully merging this pull request may close these issues.

Support Azure VM Agent