Getting Started

The GEOPM project consists of a two-tiered software structure: the GEOPM Service and the GEOPM Runtime. The GEOPM Service provides a secure userspace interface for access to hardware telemetry and configuration. The GEOPM Runtime builds on the Service to let end-users profile their applications for refined data analysis and, optionally, to apply active hardware configuration algorithms for improved energy efficiency.

For in-depth information see: User Guide for GEOPM Service or User Guide for GEOPM Runtime.


💻 Install GEOPM

There are two recommended ways to install the GEOPM software: install the pre-built packages provided for various Linux distributions, or build from source using Spack.

Packages for Linux Distributions

Pre-built binaries of the GEOPM Service and Runtime are available for download through the openSUSE Build Service for RPM-based Linux distributions and through Launchpad for Debian-based distributions. For information on how to configure those repositories with your system package manager, or to download the binaries directly, see: Installation.

Building with Spack

Note

Recipes for the v3.1 release of GEOPM are currently work-in-progress.

For users who leverage Spack to distribute software, recipes to build geopm-service and geopm-runtime are included in Spack's v0.22.0 release. These recipes currently allow building both the v3.0.1 release of GEOPM and our main development branch.

For deploying GEOPM's layers to a compute image in an HPC system context (i.e. PXE booted via warewulf or similar), a typical configuration is to bake a system install of the Service RPMs into the compute image and use Spack to install geopm-runtime. This is required because the GEOPM Service is launched via systemd, and thus must run against the system-installed Python runtime.

For GEOPM v3.0.1, install the following as system packages: geopm-service, geopm-service-devel, libgeopmd2, and python3-geopmdpy

For GEOPM v3.1, install the following as system packages: geopm-service, geopm-service-doc, geopm-service-devel, libgeopmd2, libgeopmd-doc, python3-geopmdpy, and python3-geopmdpy-doc

To build with Spack this way, geopm-service must be configured as an external package in ~/.spack/packages.yaml:

packages:
  geopm-service:
    externals:
    - spec: "geopm-service@3.0.1"
      prefix: /usr
    - spec: "geopm-service@3.1.0"
      prefix: /usr
    - spec: "geopm-service@develop"
      prefix: /usr
    version:
    - 3.0.1
    - 3.1.0
    - develop
    buildable: False

Afterwards, geopm-runtime can be installed normally with spack install geopm-runtime.

Admin Configuration

After the Service has been installed, it must be configured properly before non-root users will be able to leverage it.

To grant permissions to all non-root users to be able to use all of the features provided by the Service, execute the following commands:

$ sudo geopmaccess -a | sudo geopmaccess -w
$ sudo geopmaccess -a -c | sudo geopmaccess -w -c

These commands will create access lists in the system location that the Service will use to determine user privilege.

An administrator may use the --log (-l) option of geopmaccess to restrict an access list to the set of names that have been accessed since the last restart, by piping the output into geopmaccess -w:

$ sudo geopmaccess -l | sudo geopmaccess -w
$ sudo geopmaccess -l -c | sudo geopmaccess -w -c

More information on access list configuration can be found on the following pages: Service Administrators and geopmaccess(1) – Access management for the GEOPM Service.


๐Ÿ—ƒ๏ธ Platform Topology๏ƒ

Topology Encapsulation Diagram

We refer to the different hardware layers within a system as domains. GEOPM has support for the following domains:

  • Board

  • Package

  • Core (physical)

  • CPU (Linux logical)

  • Memory

  • Package Integrated Memory

  • NIC

  • Package Integrated NIC

  • GPU

  • Package Integrated GPU

  • GPU Chip

For more information on the domain types, see: Domain Types.

Code Examples

All of the code examples require linking against libgeopmd for C/C++. The Python examples require that your PYTHONPATH contains the geopmdpy module and that libgeopmd is available in your LD_LIBRARY_PATH.

The following examples leverage geopmread or geopmwrite for command-line usage, and the C, C++, and Python APIs of PlatformTopo for the platform topology.

# Print all domains:
$ geopmread --domain
# OR
$ geopmwrite --domain

board                       1
package                     2
core                        104
cpu                         208
memory                      2
package_integrated_memory   2
nic                         0
package_integrated_nic      0
gpu                         6
package_integrated_gpu      0
gpu_chip                    12
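
The same information is available programmatically through the PlatformTopo APIs. Below is a minimal Python sketch using the geopmdpy topo module (assuming geopmdpy is on your PYTHONPATH as noted above) that mirrors the output of geopmread --domain:

from geopmdpy import topo

# Domain names as reported by `geopmread --domain` (see: Domain Types)
domains = ['board', 'package', 'core', 'cpu', 'memory',
           'package_integrated_memory', 'nic', 'package_integrated_nic',
           'gpu', 'package_integrated_gpu', 'gpu_chip']

for name in domains:
    # num_domain() returns the number of instances of the domain on this system
    print(f'{name:<28}{topo.num_domain(name)}')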

🔬 Reading Telemetry

We refer to any bit of telemetry that can be read with the Service as a signal. Each signal has a native domain. For example, the native domain of the current operating frequency of the CPU (i.e. CPU_FREQUENCY_STATUS or MSR::PERF_STATUS:FREQ) is the CPU domain. Any signal can be aggregated to any domain that is more coarse than its native domain; in our example, CPU frequency can be aggregated to the package or board domains since they are more coarse than the CPU domain.

The following examples make use of geopmread for the command-line and the C, C++, and Python APIs for PlatformIO in their respective languages.

Listing All Available Signals

$ geopmread

Listing Signal Information

Note

Some telemetry fields have a "high level" alias that can be used in place of the "low level" name. In this case, CPU_FREQUENCY_STATUS is an alias for MSR::PERF_STATUS:FREQ. When using geopmread -i to query for information about a signal, the native domain and aggregation type are only listed for the "low level" name. For more information on names, see: Breaking Down Signal/Control Names.

$ geopmread -i CPU_FREQUENCY_STATUS

CPU_FREQUENCY_STATUS:
    description: The current operating frequency of the CPU.
    iogroup: MSR
    alias_for: MSR::PERF_STATUS:FREQ

$ geopmread -i MSR::PERF_STATUS:FREQ

MSR::PERF_STATUS:FREQ:
    description: The current operating frequency of the CPU.
    units: hertz
    aggregation: average
    domain: cpu
    iogroup: MSRIOGroup

Reading Signals

# Read the current CPU frequency for cpu 0

$ geopmread CPU_FREQUENCY_STATUS cpu 0
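
The same read can be issued from Python through the geopmdpy pio module; a minimal sketch (assuming read access to this signal has been granted as described in Admin Configuration):

from geopmdpy import pio

# Read the current operating frequency of Linux logical CPU 0, in hertz
freq_hz = pio.read_signal('CPU_FREQUENCY_STATUS', 'cpu', 0)
print(f'{freq_hz / 1e9:.3f} GHz')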

Understanding Aggregation

The telemetry that is output from geopmread or the APIs will automatically be aggregated based on the requested domain and the aggregation type.

Using CPU_FREQUENCY_STATUS as an example, the output in Listing Signal Information shows the native domain as cpu and the aggregation type as average. Notice the topology diagram shows that CPUs are contained within cores, cores within packages, and packages within the board.

When a CPU_FREQUENCY_STATUS request is made at the core domain, GEOPM reads and averages the frequencies of all CPUs linked to that core. If the request is at the package domain, it aggregates the frequencies of all CPUs across every core in that package and provides the average. This methodology escalates up to the broadest domain, the board domain. Thus, to obtain the average frequency spanning all packages, cores, and CPUs in the system, one would issue a geopmread at the board domain.

On the other hand, consider CPU_ENERGY, whose native domain is cpu and whose aggregation type is sum. When a CPU_ENERGY request is made at the core domain, GEOPM sums the energy consumed by all CPUs linked to that core. If the request is at the package domain, it sums the energy consumed by all CPUs across every core in that package and provides the total. This methodology escalates up to the broadest domain, the board domain. Thus, to obtain the total energy consumed by all packages, cores, and CPUs in the system, one would issue a geopmread at the board domain.

For more information about aggregation types, see: geopm::Agg(3) – data aggregation functions.
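
As an illustrative Python sketch of average aggregation (the values will differ slightly because the two reads are not taken at the same instant):

from geopmdpy import pio, topo

# Frequency aggregated directly at the board domain
board_avg = pio.read_signal('CPU_FREQUENCY_STATUS', 'board', 0)

# Mean of the per-CPU readings; close to board_avg because the
# aggregation type of CPU_FREQUENCY_STATUS is "average"
per_cpu = [pio.read_signal('CPU_FREQUENCY_STATUS', 'cpu', idx)
           for idx in range(topo.num_domain('cpu'))]
print(board_avg, sum(per_cpu) / len(per_cpu))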

Video Demo: Using geopmread


Reading Multiple Signals

To fetch platform telemetry and output it to the console or a file:

  • From the command-line: Use geopmsession. Its input arguments are similar to geopmread, but are taken from standard input rather than the command-line.

  • From code: Utilize the batch read API.

$ echo -e 'TIME board 0\nCPU_FREQUENCY_STATUS package 0' | geopmsession

For more information on geopmsession see: geopmsession(1) – Command line interface for the GEOPM service batch read features.

Capturing Telemetry Over Time

geopmsession can also capture data over time with the -p and -t options. This behavior is easily implemented in code along with the batch read interface.

# Read 2 signals for 10 seconds, sampling once a second:

$ echo -e 'TIME board 0\nCPU_FREQUENCY_STATUS package 0' | geopmsession -p 1.0 -t 10.0

Again, for more information on geopmsession see geopmsession(1) – Command line interface for the GEOPM service batch read features.
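
In code, the batch read interface provides the same behavior; a minimal Python sketch using the geopmdpy pio module (the signal names and one-second period mirror the geopmsession example above):

import time
from geopmdpy import pio

# Push the signals of interest once, then sample them repeatedly
time_idx = pio.push_signal('TIME', 'board', 0)
freq_idx = pio.push_signal('CPU_FREQUENCY_STATUS', 'package', 0)

for _ in range(10):       # ten samples, one second apart
    pio.read_batch()      # refresh all pushed signals in one operation
    print(f'{pio.sample(time_idx):.3f}, {pio.sample(freq_idx):.0f}')
    time.sleep(1.0)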

Video Demo: Using geopmsession


โš™๏ธ Enact Hardware-based Settings๏ƒ

We refer to any hardware setting that can be manipulated through the Service as a control. Like signals, each control has a native domain. Any control can be disaggregated from a coarse domain (e.g., board) to its native domain. See Understanding Disaggregation for more information.

The following examples make use of geopmwrite for the command-line and the C, C++, and Python APIs for PlatformIO to enact hardware controls in their respective languages.

Listing All Available Controls

$ geopmwrite

Listing Control Information

$ geopmwrite -i CPU_FREQUENCY_MAX_CONTROL

CPU_FREQUENCY_MAX_CONTROL:
Target operating frequency of the CPU based on the control register.

# To include the aggregation type, use geopmread:

$ geopmread -i CPU_FREQUENCY_MAX_CONTROL

CPU_FREQUENCY_MAX_CONTROL:
    description: Target operating frequency of the CPU based on the control register. Note: when querying at a higher domain, if NaN is returned, query at its native domain.
    alias_for: MSR::PERF_CTL:FREQ
    units: hertz
    aggregation: expect_same
    domain: core
    iogroup: MSRIOGroup

Writing Controls

# Set the target operating frequency of core 0 to 3.0 GHz

$ geopmwrite CPU_FREQUENCY_MAX_CONTROL core 0 3.0e9

Note

To determine the initial value of any control, use geopmread or the corresponding PlatformIO APIs at the desired domain. E.g.:

$ geopmread CPU_FREQUENCY_MAX_CONTROL core 0
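
The same write can be issued from Python through the geopmdpy pio module; a minimal sketch (assuming write access to this control has been granted as described in Admin Configuration):

from geopmdpy import pio

# Capture the initial setting so it can be restored later
initial_hz = pio.read_signal('CPU_FREQUENCY_MAX_CONTROL', 'core', 0)

# Set the target operating frequency of core 0 to 3.0 GHz
pio.write_control('CPU_FREQUENCY_MAX_CONTROL', 'core', 0, 3.0e9)

# ... run a workload ...

# Restore the original setting
pio.write_control('CPU_FREQUENCY_MAX_CONTROL', 'core', 0, initial_hz)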

Understanding Disaggregation

Just as signals can be aggregated to a more coarse domain from their native one, controls can be disaggregated from a coarse domain to their native domain. This happens automatically with geopmwrite and the corresponding APIs.

Using CPU_FREQUENCY_MAX_CONTROL as an example, the output in Listing Control Information shows the native domain as core. To write the same value to all the cores in a package, simply issue the request at the package domain, and the CPU_FREQUENCY_MAX_CONTROL of all cores in that package will be written. Likewise, to write the same value to all cores in all packages, issue the request at the board domain.

To understand the method of disaggregation for a specific control, you must examine its aggregation type.

For instance, CPU_FREQUENCY_MAX_CONTROL has an aggregation type labeled expect_same. When setting this control at a domain level coarser than its native domain, all native domains inherit the same value as the coarser domain. This consistent distribution applies to all aggregation types, with the exception of sum; controls that use sum aggregation will have the requested value distributed evenly across the native domain. Taking MSR::PKG_POWER_LIMIT:PL1_POWER_LIMIT as an example, it has the following information:

$ geopmread -i MSR::PKG_POWER_LIMIT:PL1_POWER_LIMIT

MSR::PKG_POWER_LIMIT:PL1_POWER_LIMIT:
    description: The average power usage limit over the time window specified in PL1_TIME_WINDOW.
    units: watts
    aggregation: sum
    domain: package
    iogroup: MSRIOGroup

Since the package domain is contained within the board domain, writing this control at the board domain will evenly distribute the requested value over all the packages in the system. This means that requesting a 200 W power limit at the board domain will result in each package receiving a limit of 100 W.
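
A short Python sketch of this behavior (assuming write access to this control; the 200 W request is only illustrative):

from geopmdpy import pio, topo

# Request 200 W at the board domain; because this control uses "sum"
# aggregation, the request is divided evenly across the packages
pio.write_control('MSR::PKG_POWER_LIMIT:PL1_POWER_LIMIT', 'board', 0, 200)

# Read back the per-package limits (100 W each on a two-package system)
for pkg in range(topo.num_domain('package')):
    print(pkg, pio.read_signal('MSR::PKG_POWER_LIMIT:PL1_POWER_LIMIT', 'package', pkg))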

Video Demo: Using geopmwrite


๐Ÿ“ Measure Performance๏ƒ

The GEOPM Runtime offers capabilities for collecting telemetry throughout an application's execution. If you want to measure a particular segment of an application, you can annotate the application code using GEOPM markup.

To integrate the Runtime with an application, you have two options:

  1. geopmlaunch: Ideal for MPI-enabled applications. Simply launch the application using this method.

  2. Manual Setup: This involves configuring the necessary environment settings and directly invoking geopmctl.

geopmlaunch will bring up the Runtime alongside your application using one of three launch methods: process, pthread, or application. The process launch method will attempt to launch the main entity of the Runtime, the Controller, as an extra rank in the MPI gang. The application launch method (default when unspecified) will launch the Controller as a separate application (useful for non-MPI applications). For more information, see the --geopm-ctl option description.

Using geopmlaunch with MPI Applications

# Run with 1 OpenMP thread per rank, and 2 ranks

# SLURM example

$ OMP_NUM_THREADS=1 geopmlaunch srun -N 1 -n 2 --geopm-preload -- ./mpi_application

# PALS example

$ OMP_NUM_THREADS=1 geopmlaunch pals -ppn 2 -n 2 --geopm-preload -- ./mpi_application

When the run has concluded, there will be an output file from the Runtime called geopm.report in the current working directory. This report file contains a summary of hardware telemetry over the course of the run. Time-series data is also available through the use of the --geopm-trace option to geopmlaunch. For more information about geopmlaunch see: geopmlaunch(1) – application launch wrapper. For more information about the reports, see: geopm_report(7) – GEOPM summary report file.

Profiling Applications without geopmlaunch

The geopmlaunch(1) command may not be best suited for your needs if you are running a non-MPI application, or if you are running an MPI application but the launch command is embedded in scripts that are difficult to modify. Instead of using geopmlaunch(1), the user may use the geopmctl(1) application in conjunction with environment variables that control the GEOPM Runtime behavior.

In this simple example we run the sleep(1) command for 10 seconds and monitor the system during its execution. Rather than using the geopmlaunch tool as in the above example, we will run the geopmctl command in the background while the application of interest is executing. The geopmctl MPI application should be launched with one process per compute node when executing the runtime on multiple nodes. There are five requirements to enable the GEOPM controller process to connect to the application process and generate a report:

  1. Both the geopmctl process and the application process must have the GEOPM_PROFILE environment variable set to the same value or both environments may leave this variable unset.

  2. The application process must have LD_PRELOAD=libgeopm.so.2 set in the environment or the application binary must be linked directly to libgeopm.so.2 at compile time.

  3. The GEOPM_REPORT environment variable must be set in the environment of the geopmctl process.

  4. The GEOPM_PROGRAM_FILTER environment variable is required and explicitly lists the program invocation names of any process to be profiled. All other programs will not be affected by the LD_PRELOAD of libgeopm.so. For this reason, a user will typically set LD_PRELOAD and GEOPM_PROGRAM_FILTER together. This is especially important when profiling programs invoked from within a bash script.

  5. The GEOPM_NUM_PROCESS variable must be set in the geopmctl environment if there is more than one process to be tracked on each compute node.

In addition to generating a report in YAML format, the example below showcases three optional features of the GEOPM Runtime:

  1. CSV Trace File: By setting the GEOPM_TRACE environment variable, you can generate a trace file in CSV format.

  2. Sampling Period Adjustment: The GEOPM_PERIOD environment variable allows you to modify the controller's sampling period. For instance, setting it to 200 milliseconds, up from the default 5 milliseconds, results in approximately 50 rows of samples in the trace file (calculated as five samples per second over ten seconds).

  3. Disable Network Use: The GEOPM_CTL_LOCAL environment variable may be set to disable all communication between the controllers running on each compute node, thereby generating a unique report file per host node over which the application processes are launched.

Together, these settings direct the GEOPM runtime controller (geopmctl) to profile the sleep utility and to generate a CSV trace file with approximately 50 rows of samples (five per second for ten seconds). In the provided example, the awk command extracts specific columns: time since application start (column 1), CPU energy (column 6), and CPU power (column 8).

$ GEOPM_PROFILE=sleep-ten \
  GEOPM_REPORT=sleep-ten.yaml \
  GEOPM_CTL_LOCAL=true \
  GEOPM_TRACE=sleep-ten.csv \
  GEOPM_PERIOD=0.2 \
  geopmctl &
$ GEOPM_PROFILE=sleep-ten \
  GEOPM_PROGRAM_FILTER=sleep \
  LD_PRELOAD=libgeopm.so.2 \
  sleep 10
$ cat sleep-ten.yaml-$(hostname)
$ awk -F\| '{print $1, $6, $8}' sleep-ten.csv-$(hostname) | less

For the full listing of the environment variables accepted by the GEOPM runtime, please refer to the GEOPM Environment Variables section of the GEOPM documentation.

Profiling Specific Parts of an Application

The Runtime supports the automatic profiling of various application regions through several methods:

  • Annotation with GEOPM Profiling APIs

  • MPI Autodetection via PMPI

  • OpenMP Autodetection via OMPT

  • OpenCL Autodetection (WIP)

The GEOPM Profiling API enables users to annotate specific sections of the target application for profiling. Each section that is annotated will show up as a separate Region in the report output files from the runtime. An example app could be annotated as follows:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <mpi.h>
#include <geopm_prof.h>
#include <geopm_hint.h>

int main(int argc, char** argv)
{

    MPI_Init(&argc, &argv);

    // Application setup...

    // Create a GEOPM region ID for later tracking
    uint64_t region_1, region_2;

    geopm_prof_region("interesting_kernel",
                      GEOPM_REGION_HINT_COMPUTE,
                      &region_1);

    geopm_prof_region("synchronize_results",
                      GEOPM_REGION_HINT_NETWORK,
                      &region_2);

    // Begin execution loop...
    int iterations = 100;  // illustrative iteration count (assumed for this example)
    for (int ii = 0; ii < iterations; ii++) {
        // Marker to capture behavior of all regions
        geopm_prof_epoch();

        geopm_prof_enter(region_1);
        call_interesting_kernel();
        geopm_prof_exit(region_1);

        geopm_prof_enter(region_2);
        call_synchronize_results();
        geopm_prof_exit(region_2);
    }

    MPI_Finalize();

    return 0;

}

For more examples on how to profile applications, see the tutorials section of our GitHub repository.


โš—๏ธ Advanced Topics๏ƒ

Breaking Down Signal/Control Names

Signal and control names in GEOPM are categorized into two types: low-level and high-level.

  • Low-Level Names: These are prefixed with the IOGroup name followed by two colons. For instance, MSR::PERF_CTL:FREQ is a low-level name.

  • High-Level Names (Aliases): These are user-friendly alternatives to commonly used or multi-IOGroup-supported names. For example:

    • Alias CPU_FREQUENCY_STATUS corresponds to MSR::PERF_STATUS:FREQ.

    • Alias CPU_FREQUENCY_MAX_CONTROL is linked to MSR::PERF_CTL:FREQ.

When using geopmread or geopmwrite to display available signals and controls, aliases are presented first. These command-line tools also help decipher what each alias represents. For instance:

$ geopmread -i CPU_FREQUENCY_STATUS

CPU_FREQUENCY_STATUS:
    description: The current operating frequency of the CPU.
    iogroup: MSR
    alias_for: MSR::PERF_STATUS:FREQ

For more information about the currently supported aliases and IOGroups, see: Aliasing Signals And Controls.

Using the Programmable Counters

The programmable counters available on various CPUs can be read with geopmread from the command-line, or through the Runtime by using the InitControl API.

First, determine the event code for your desired performance metric. E.g. for Skylake Server, the event names and corresponding codes are listed here. The following example programs the counter to track LONGEST_LAT_CACHE.MISS on CPU 0:

$ export EVENTCODE=0x2E
$ export UMASK=0x41

# Configure which event to monitor, and under which scope
$ geopmwrite MSR::IA32_PERFEVTSEL0:EVENT_SELECT cpu 0 ${EVENTCODE}
$ geopmwrite MSR::IA32_PERFEVTSEL0:UMASK cpu 0 ${UMASK}
$ geopmwrite MSR::IA32_PERFEVTSEL0:USR cpu 0 1   # Enable user scope for events
$ geopmwrite MSR::IA32_PERFEVTSEL0:OS cpu 0 1    # Enable OS scope for events

# Turn on the counter
$ geopmwrite MSR::IA32_PERFEVTSEL0:EN cpu 0 1
$ geopmwrite MSR::PERF_GLOBAL_CTRL:EN_PMC0 cpu 0 1

# Read the counter. Repeat this read operation after a test scenario.
$ geopmread MSR::IA32_PMC0:PERFCTR cpu 0
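
The same sequence can also be scripted from Python through the geopmdpy pio module; a minimal sketch (assuming write access to these MSR controls has been granted):

from geopmdpy import pio

EVENTCODE = 0x2E   # LONGEST_LAT_CACHE.MISS
UMASK = 0x41

# Configure which event to monitor, and under which scope, on CPU 0
pio.write_control('MSR::IA32_PERFEVTSEL0:EVENT_SELECT', 'cpu', 0, EVENTCODE)
pio.write_control('MSR::IA32_PERFEVTSEL0:UMASK', 'cpu', 0, UMASK)
pio.write_control('MSR::IA32_PERFEVTSEL0:USR', 'cpu', 0, 1)
pio.write_control('MSR::IA32_PERFEVTSEL0:OS', 'cpu', 0, 1)

# Turn on the counter
pio.write_control('MSR::IA32_PERFEVTSEL0:EN', 'cpu', 0, 1)
pio.write_control('MSR::PERF_GLOBAL_CTRL:EN_PMC0', 'cpu', 0, 1)

# Read the counter; repeat after a test scenario to compute a delta
print(pio.read_signal('MSR::IA32_PMC0:PERFCTR', 'cpu', 0))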

To accomplish this with the Runtime, leverage the geopm-init-control feature along with the geopm-report-signals and/or geopm-trace-signals options to geopmlaunch. First, create a file in your current working directory with the following contents:

# LONGEST_LAT_CACHE.MISS: EVENT_CODE = 0x2E | UMASK = 0x41
MSR::IA32_PERFEVTSEL0:EVENT_SELECT package 0 0x2E
MSR::IA32_PERFEVTSEL0:UMASK package 0 0x41
MSR::IA32_PERFEVTSEL0:USR package 0 1
MSR::IA32_PERFEVTSEL0:OS package 0 1
MSR::IA32_PERFEVTSEL0:EN package 0 1
MSR::PERF_GLOBAL_CTRL:EN_PMC0 package 0 1

Name the file accordingly (e.g. enable_cache_misses). This configuration will program and enable the counter on all of the CPUs of the first package. Use the file with geopmlaunch and add the desired counter to the reports and/or traces:

$ OMP_NUM_THREADS=1 geopmlaunch srun -N 1 -n 2 --geopm-preload \
                                     --geopm-init-control=enable_cache_misses \
                                     --geopm-report-signals=MSR::IA32_PMC0:PERFCTR@package \
                                     -- ./mpi_application

As configured above, the report data associated with each region will include the counter data summarized per package.