geopm_agent_gpu_activity(7) -- agent for selecting GPU frequency based on GPU compute activity
=================================================================================================

Description
-----------

The goal of **GPUActivityAgent** is to save GPU energy by scaling GPU frequency
based upon the compute activity of each GPU as provided by the
``GPU_CORE_ACTIVITY`` signal and modified by the ``GPU_UTILIZATION`` signal.

The agent scales frequency in the range of ``Fe`` to ``Fmax``, where ``Fmax``
is provided by the NVMLIOGroup or LevelZeroIOGroup and ``Fe`` is provided by
the ConstConfigIOGroup or LevelZeroIOGroup.

Low activity regions (compute activity of 0.0) run at the ``Fe`` frequency,
high activity regions (compute activity of 1.0) run at the ``Fmax`` frequency,
and regions in between the extremes run at a frequency (F) selected using
the equation:

``F = Fe + (Fmax - Fe) * GPU_CORE_ACTIVITY/GPU_UTILIZATION``

``GPU_UTILIZATION`` is used to scale the ``GPU_CORE_ACTIVITY`` in order to
scale frequency selection with the percentage of time a kernel is running on
the GPU. This tends to help with workloads that contain short but highly
scalable GPU phases.

``Fe`` is intended to be an energy efficient frequency that is selected via
system characterization. The recommended approach to selecting ``Fe`` is to
perform a frequency sweep on the GPUs of interest using a workload that scales
strongly with frequency. With this approach, ``Fe`` will be the frequency that
provides the lowest GPU energy consumption for the workload.

``Fmax`` is intended to be the maximum allowable frequency, and may be set as
the default GPU maximum frequency, or limited based upon user/admin preference.

The agent provides an optional input of ``phi`` that allows for biasing the
frequency range used by the agent. The default ``phi`` value of 0.5 provides
frequency selection in the full range from ``Fe`` to ``Fmax``.

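
As a rough illustration of the frequency selection equation given earlier in
this section, the following Python sketch computes ``F`` from the two signals.
It is not the agent's implementation. The function name
``select_gpu_frequency``, the handling of a zero ``GPU_UTILIZATION`` reading,
the clamping of the scaled activity to the range 0.0 to 1.0, and the 1.0 GHz
and 2.0 GHz example frequencies are all assumptions made for this example.

```python
def select_gpu_frequency(core_activity, utilization, f_efficient, f_max):
    """Sketch of F = Fe + (Fmax - Fe) * GPU_CORE_ACTIVITY / GPU_UTILIZATION.

    All frequencies are in hertz. This helper is hypothetical and is not
    part of the GEOPM API.
    """
    # Assumption: when no utilization is reported, fall back to the
    # unscaled core activity to avoid dividing by zero.
    if utilization > 0.0:
        scaled_activity = core_activity / utilization
    else:
        scaled_activity = core_activity
    # Assumption: clamp to [0, 1] so the result stays within [Fe, Fmax].
    scaled_activity = min(max(scaled_activity, 0.0), 1.0)
    return f_efficient + (f_max - f_efficient) * scaled_activity


# Hypothetical frequencies chosen only to keep the arithmetic easy to follow.
FE_HZ = 1.0e9
FMAX_HZ = 2.0e9

# A kernel is resident half of the time (utilization 0.5) and the measured
# core activity is 0.25, so the scaled activity is 0.5.
print(select_gpu_frequency(0.25, 0.5, FE_HZ, FMAX_HZ))  # 1500000000.0
```

In this sketch a compute activity of 0.0 returns ``Fe`` and a scaled activity
of 1.0 returns ``Fmax``, matching the two extremes described above.
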
A ``phi`` value less than 0.5 biases the agent towards higher frequencies by
increasing the ``Fe`` value. In the extreme case (``phi`` of 0) ``Fe`` will be
raised to ``Fmax``. A ``phi`` value greater than 0.5 biases the agent towards
lower frequencies by reducing the ``Fmax`` value. In the extreme case
(``phi`` of 1.0) ``Fmax`` will be lowered to ``Fe``.

For NVIDIA-based systems the agent should be used with DCGM settings of
``DCGM::FIELD_UPDATE_RATE`` = 100 ms, ``DCGM::MAX_STORAGE_TIME`` = 1 s, and
``DCGM::MAX_SAMPLES`` = 100. While the DCGM documentation indicates that users
should generally query no faster than 100 ms, the interface allows for setting
the polling rate in the microsecond range. If the agent is intended to be used
with workloads that exhibit extremely short phase behavior, a 1 ms polling rate
can be used. As the 1 ms polling rate is not officially recommended by the
DCGM API, the 100 ms setting should be used by default.

Agent Name
----------

The agent described in this manual is selected in many geopm
interfaces with the ``"gpu_activity"`` agent name. This name can be
passed to :doc:`geopmlaunch(1) <geopmlaunch.1>` as the argument to the
``--geopm-agent`` option, or the ``GEOPM_AGENT`` environment variable can be
set to this name (see :doc:`geopm(7) <geopm.7>`\ ). This name can also be
passed to :doc:`geopmagent(1) <geopmagent.1>` as the argument to the
``'-a'`` option.

Policy Parameters
-----------------

The ``Phi`` input is the only policy value.

``GPU_PHI``\ :
  The performance bias knob. The value must be between 0.0 and 1.0.
  If NAN is passed, the agent uses the default value of 0.5.

ConstConfigIOGroup Configuration File Generation
------------------------------------------------

This version of the agent uses ConstConfigIO to provide per-node ``Fe`` values.
The GPU compute activity ConstConfigIOGroup configuration file can be generated by running::

    integration/experiment/gpu_frequency_sweep/gen_gpu_activity_constconfig_recommendation.py --path <GPU_SWEEP_DIR>

Depending on the number of runs, system noise, and other factors there may be
more than one reasonable value for ``Fe``. In these cases a warning similar to
the following will be provided::

    'Warning: Found N possible alternate Fe value(s) within 5% energy consumption of Fe for <frequency>. Consider using the energy-margin options.\n'

If this occurs, the user may choose to use the provided configuration file or
rerun the recommendation script with the energy-margin option
``--gpu-energy-margin`` along with a value such as 0.05 (5%). This option
attempts to identify a lower ``Fe`` for the GPU domain that costs less than the
energy consumed at ``Fe`` plus the energy-margin percentage provided.

An example ConstConfigIOGroup configuration file is provided below::

    {
        "GPU_FREQUENCY_EFFICIENT_HIGH_INTENSITY": {
            "domain": "board",
            "description": "Defines the efficient compute frequency to use for GPUs. This value is based on a workload that scales strongly with the frequency domain.",
            "units": "hertz",
            "aggregation": "average",
            "values": [982000000.0]
        }
    }

Example Policy
--------------

An example policy is provided below::

    {"GPU_PHI": 0.5}

Report Extensions
-----------------

``GPU Frequency Requests``\ :
  The number of frequency requests made by the agent

``Resolved Max Frequency``\ :
  ``Fmax`` after ``phi`` has been taken into account

``Resolved Efficient Frequency``\ :
  ``Fe`` after ``phi`` has been taken into account

``Resolved Frequency Range``\ :
  The frequency selection range of the agent after ``phi`` has been taken
  into account

``GPU # Active Region Energy``\ :
  Per GPU energy reading during the Region of Interest (ROI) where ROI is
  determined as the first sample of GPU activity to the last sample of GPU
  activity.

``GPU # Active Region Time``\ :
  Per GPU time during the Region of Interest (ROI) where ROI is determined
  as the first sample of GPU activity to the last sample of GPU activity.

``GPU # Active Region Start Time``\ :
  Per GPU start time for the Region of Interest (ROI) where ROI is
  determined as the first sample of GPU activity to the last sample of GPU
  activity.

``GPU # Active Region Stop Time``\ :
  Per GPU stop time for the Region of Interest (ROI) where ROI is
  determined as the first sample of GPU activity to the last sample of GPU
  activity.

Control Loop Rate
-----------------

The agent gates the control loop to a cadence of 20 ms.

SEE ALSO
--------

:doc:`geopm(7) <geopm.7>`\ ,
:doc:`geopm_agent_monitor(7) <geopm_agent_monitor.7>`\ ,
:doc:`geopm::Agent(3) <geopm::Agent.3>`\ ,
:doc:`geopm_agent(3) <geopm_agent.3>`\ ,
:doc:`geopm_prof(3) <geopm_prof.3>`\ ,
:doc:`geopmagent(1) <geopmagent.1>`\ ,
:doc:`geopmlaunch(1) <geopmlaunch.1>`