Compare commits: 4ed7228355...v1.2.0 (8 commits)

| SHA1 |
|---|
| a64699ccfd |
| 8d351c7bde |
| 1702e7d058 |
| 4f54fd81ce |
| fe1f58b5ce |
| f0925a3ab3 |
| 4c4026a600 |
| 9f00d6475b |
13
Cargo.lock
generated
@@ -901,6 +901,15 @@ dependencies = [
  "winapi",
 ]

+[[package]]
+name = "matchers"
+version = "0.2.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d1525a2a28c7f4fa0fc98bb91ae755d1e2d1505079e05539e35bc876b5d65ae9"
+dependencies = [
+ "regex-automata",
+]
+
 [[package]]
 name = "memchr"
 version = "2.8.0"
@@ -2000,10 +2009,14 @@ version = "0.3.22"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "2f30143827ddab0d256fd843b7a66d164e9f271cfa0dde49142c5ca0ca291f1e"
 dependencies = [
+ "matchers",
  "nu-ansi-term",
+ "once_cell",
+ "regex-automata",
  "sharded-slab",
  "smallvec",
  "thread_local",
+ "tracing",
  "tracing-core",
  "tracing-log",
 ]
@@ -23,7 +23,7 @@ serde_json = "1.0.149"
 clap = { version = "4.5", features = ["derive", "string", "wrap_help"] }
 color-eyre = "0.6"
 tracing = "0.1"
-tracing-subscriber = "0.3"
+tracing-subscriber = { version = "0.3", features = ["env-filter"] }
 tracing-appender = "0.2"
 sysinfo = "0.38"
 libc = "0.2"
82
README.md
@@ -1,33 +1,61 @@
# 🔥 ember-tune
```text
__________ ____ ______ ____ ______ __ __ _ __ ______
/ ____/ |/ // __ )/ ____// __ \ /_ __/ / / / // | / // ____/
/ __/ / /|_/ // __ / __/ / /_/ / / / / / / // |/ // __/
/ /___ / / / // /_/ / /___ / _, _/ / / / /_/ // /| // /___
/_____//_/ /_//_____/_____//_/ |_| /_/ \____//_/ |_//_____/

>>> Physically-grounded thermal & power optimization for Linux <<<
```

> ### **Find your hardware's "Physical Sweet Spot" through automated trial-by-fire.**

`ember-tune` is a scientifically-driven hardware optimizer that replaces guesswork and manual tuning with a rigorous, automated engineering workflow. It determines the unique thermal properties of your specific laptop—including its Thermal Resistance (Rθ) and "Silicon Knee"—to generate optimal configurations for common Linux tuning daemons.

## ✨ Features

- **Automated Physical Benchmarking:** Measures real-world thermal performance under load to find the true "sweet spot" where performance-per-watt is maximized before thermal saturation causes diminishing returns.
- **Heuristic Hardware Discovery:** Utilizes a data-driven Hardware Abstraction Layer (SAL) that probes your system and automatically adapts to its unique quirks, drivers, and sensor paths.
- **Non-Destructive Configuration:** Safely merges new, optimized power limits into your existing `throttled.conf`, preserving manual undervolt settings and comments.
- **Universal Safeguard Architecture (USA):** Includes a high-frequency concurrent watchdog and RAII state restoration to guarantee your system is never left in a dangerous state.
- **Real-time TUI Dashboard:** A `ratatui`-based terminal interface provides high-resolution telemetry throughout the benchmark.

## 🔬 How it Works: The Architecture

`ember-tune` is built on a decoupled, multi-threaded architecture to ensure the UI is always responsive and that hardware state is managed safely.

1. **The Heuristic Engine:** On startup, the engine probes your system's DMI, `sysfs`, and active services. It compares these "facts" against the `hardware_db.toml` to select the correct System Abstraction Layer (SAL).
2. **The Orchestrator (Backend Thread):** This is the state machine that executes the benchmark. It communicates with hardware *only* through the SAL traits.
3. **The TUI (Main Thread):** The `ratatui` dashboard renders `TelemetryState` snapshots received from the orchestrator via an MPSC channel.
4. **The Watchdog (Safety Thread):** A high-priority thread that polls safety sensors every 100ms to trigger an atomic `EmergencyAbort` if failure conditions are met.

## ⚙️ Development Setup

`ember-tune` is a standard Cargo project. You will need a recent Rust toolchain and common build utilities.
`ember-tune` is a standard Cargo project.

**Prerequisites:**
- `rustup`
- `build-essential` (or equivalent for your distribution)
- `build-essential`
- `libudev-dev`
- `stress-ng` (Required for benchmarking)

```bash
# 1. Clone the repository
# 1. Clone and Build
git clone https://gitea.com/narl/ember-tune.git
cd ember-tune

# 2. Build the release binary
cargo build --release

# 3. Run the test suite (safe, uses a virtual environment)
# This requires no special permissions and does not touch your hardware.
# 2. Run the safe test suite
cargo test
```

**Running:**
Due to its direct hardware access, `ember-tune` requires root privileges.

```bash
# Run a full benchmark and generate optimized configs
# Run a full benchmark
sudo ./target/release/ember-tune

# Run a mock benchmark for UI/logic testing
# Run a mock benchmark for UI testing
sudo ./target/release/ember-tune --mock
```

@@ -35,48 +63,24 @@ sudo ./target/release/ember-tune --mock

 ## 🤝 Contributing Quirk Data (`hardware_db.toml`)

-**This is the most impactful way to contribute.** `ember-tune`'s strength comes from its `assets/hardware_db.toml`, which encodes community knowledge about how to manage specific laptops. If your hardware isn't working perfectly, you can likely fix it by adding a new entry here.
+**This is the most impactful way to contribute.** If your hardware isn't working perfectly, add a new entry to `assets/hardware_db.toml`.

-The database is composed of four key sections: `conflicts`, `ecosystems`, `quirks`, and `discovery`.
-
-### A. Reporting a Service Conflict
-If a background service on your system interferes with `ember-tune`, add it to `[[conflicts]]`.
-
-**Example:** Adding `laptop-mode-tools`.
+### Example: Adding a Service Conflict
 ```toml
 [[conflicts]]
 id = "laptop_mode_conflict"
 services = ["laptop-mode.service"]
 contention = "Multiple - I/O schedulers, Power limits"
 severity = "Medium"
-fix_action = "SuspendService" # Orchestrator will stop/start this service
+fix_action = "SuspendService"
 help_text = "laptop-mode-tools can override power-related sysfs settings."
 ```

-### B. Adding a New Hardware Ecosystem
-If your laptop manufacturer (e.g., Razer) has a unique fan control tool or ACPI platform profile path, define it in `[ecosystems]`.
-
-**Example:** A hypothetical "Razer" ecosystem.
-```toml
-[ecosystems.razer]
-vendor_regex = "Razer"
-# Path to the sysfs node that controls performance profiles
-profiles_path = "/sys/bus/platform/drivers/razer_acpi/power_mode"
-# Map human-readable names to the values the driver expects
-policy_map = { Balanced = 0, Boost = 1, Silent = 2 }
-```
-
-### C. Defining a Model-Specific Quirk
-If a specific laptop model has a bug (like a stuck sensor or incorrect fan reporting), define a `[[quirks]]` entry.
-
-**Example:** A laptop whose fans report 0 RPM even when spinning.
+### Example: Defining a Model-Specific Quirk
 ```toml
 [[quirks]]
 model_regex = "HP Envy 15-ep.*"
 id = "hp_fan_stuck_sensor"
 issue = "Fan sensor reports 0 RPM when active."
-# The 'action' tells the SAL to use a different method for fan detection.
 action = "UseThermalVelocityFallback"
 ```
-
-After adding your changes, run the test suite and then submit a Pull Request!
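The architecture section of the README in this diff describes an orchestrator on a backend thread that sends `TelemetryState` snapshots to the UI thread over an MPSC channel, plus a watchdog that raises an emergency abort. A minimal standard-library sketch of that message flow follows. It is not the project's code: `run_sketch`, the two-field `TelemetryState`, and the threshold parameter are hypothetical, and the safety check is folded into the orchestrator loop (instead of a separate 100 ms polling thread) so the example stays deterministic.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical, reduced stand-in for the snapshot type named in the README.
#[derive(Debug, Clone, PartialEq)]
pub struct TelemetryState {
    pub step: usize,
    pub temp_c: f32,
}

/// Runs a toy orchestrator on a backend thread and returns the snapshots the
/// calling (UI) thread received, plus whether the abort flag was raised.
pub fn run_sketch(readings: Vec<f32>, abort_above_c: f32) -> (Vec<TelemetryState>, bool) {
    let (tx, rx) = mpsc::channel::<TelemetryState>();
    let abort = Arc::new(AtomicBool::new(false));

    let orchestrator = {
        let abort = Arc::clone(&abort);
        thread::spawn(move || {
            for (step, temp_c) in readings.into_iter().enumerate() {
                // Assumption: this inline check stands in for the watchdog thread.
                if temp_c > abort_above_c {
                    abort.store(true, Ordering::SeqCst);
                    break;
                }
                // Stop quietly if the receiving side has gone away.
                if tx.send(TelemetryState { step, temp_c }).is_err() {
                    break;
                }
            }
            // `tx` is dropped when the closure returns, which closes the channel.
        })
    };

    // The UI side drains snapshots until the channel closes.
    let received: Vec<TelemetryState> = rx.iter().collect();
    orchestrator.join().expect("orchestrator thread panicked");
    (received, abort.load(Ordering::SeqCst))
}
```

Calling `run_sketch(vec![40.0, 99.0, 50.0], 95.0)` delivers only the first snapshot and reports the abort flag as set, because the second reading exceeds the threshold before it is sent.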
@@ -15,7 +15,7 @@ help_text = "TLP and Power-Profiles-Daemon fight over power envelopes. Mask both

 [[conflicts]]
 id = "thermal_logic_collision"
-services = ["thermald.service", "throttled.service"]
+services = ["thermald.service", "throttled.service", "lenovo_fix.service", "lenovo-throttling-fix.service"]
 contention = "RAPL / MSR / BD-PROCHOT"
 severity = "High"
 fix_action = "SuspendService"
100
src/agent_analyst/mod.rs
Normal file
@@ -0,0 +1,100 @@
//! Heuristic Analysis & Optimization Math (Agent Analyst)
//!
//! This module analyzes raw telemetry data to extract the "Optimal Real-World Settings".
//! It calculates the Silicon Knee, Acoustic/Thermal Matrix (Hysteresis), and
//! generates three distinct hardware states: Silent, Balanced, and Sustained Heavy.

use serde::{Serialize, Deserialize};
use crate::engine::{ThermalProfile, OptimizerEngine};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FanCurvePoint {
    pub temp_on: f32,
    pub temp_off: f32,
    pub pwm_percent: u8,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SystemProfile {
    pub name: String,
    pub pl1_watts: f32,
    pub pl2_watts: f32,
    pub fan_curve: Vec<FanCurvePoint>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OptimizationMatrix {
    pub silent: SystemProfile,
    pub balanced: SystemProfile,
    pub performance: SystemProfile,
    pub thermal_resistance_kw: f32,
    pub ambient_temp: f32,
}

pub struct HeuristicAnalyst {
    engine: OptimizerEngine,
}

impl HeuristicAnalyst {
    pub fn new() -> Self {
        Self {
            engine: OptimizerEngine::new(5),
        }
    }

    /// Analyzes the raw telemetry to generate the 3 optimal profiles.
    pub fn analyze(&self, profile: &ThermalProfile, max_soak_watts: f32) -> OptimizationMatrix {
        let r_theta = profile.r_theta;
        let silicon_knee = self.engine.find_silicon_knee(profile);
        let ambient = profile.ambient_temp;

        // 1. State A: Silent / Battery (Scientific Passive Equilibrium)
        // Find P where T_core = 60C with fans OFF.
        let r_theta_passive = r_theta * 2.5;
        let silent_watts = ((60.0 - ambient) / r_theta_passive.max(0.1)).clamp(3.0, 15.0);

        let silent_profile = SystemProfile {
            name: "Silent".to_string(),
            pl1_watts: silent_watts,
            pl2_watts: silent_watts * 1.2,
            fan_curve: vec![
                FanCurvePoint { temp_on: 65.0, temp_off: 55.0, pwm_percent: 0 },
                FanCurvePoint { temp_on: 75.0, temp_off: 65.0, pwm_percent: 30 },
            ],
        };

        // 2. State B: Balanced (The Silicon Knee)
        // We use R_theta to predict where the knee will sit thermally.
        let balanced_profile = SystemProfile {
            name: "Balanced".to_string(),
            pl1_watts: silicon_knee,
            pl2_watts: silicon_knee * 1.25,
            fan_curve: vec![
                FanCurvePoint { temp_on: ambient + 15.0, temp_off: ambient + 10.0, pwm_percent: 0 },
                FanCurvePoint { temp_on: ambient + 25.0, temp_off: ambient + 20.0, pwm_percent: 30 },
                FanCurvePoint { temp_on: 75.0, temp_off: 65.0, pwm_percent: 50 },
                FanCurvePoint { temp_on: 85.0, temp_off: 75.0, pwm_percent: 80 },
            ],
        };

        // 3. State C: Sustained Heavy
        let performance_profile = SystemProfile {
            name: "Performance".to_string(),
            pl1_watts: max_soak_watts,
            pl2_watts: max_soak_watts * 1.3,
            fan_curve: vec![
                FanCurvePoint { temp_on: 50.0, temp_off: 45.0, pwm_percent: 30 },
                FanCurvePoint { temp_on: 70.0, temp_off: 60.0, pwm_percent: 60 },
                FanCurvePoint { temp_on: 85.0, temp_off: 75.0, pwm_percent: 100 },
            ],
        };

        OptimizationMatrix {
            silent: silent_profile,
            balanced: balanced_profile,
            performance: performance_profile,
            thermal_resistance_kw: r_theta,
            ambient_temp: ambient,
        }
    }
}
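The Silent-profile power limit in `analyze` comes from solving the steady-state relation T = T_ambient + P * R for P. The sketch below isolates that arithmetic so it can be checked by hand. It reuses the constants visible in the diff (60 °C target, passive resistance taken as 2.5 times the measured value, clamp to 3..15 W). The free function `silent_watts` is a hypothetical name, not part of the commit.

```rust
/// Hypothetical helper mirroring the Silent-profile arithmetic in `analyze`.
/// Solves `target = ambient + P * r_passive` for P, then clamps to 3..=15 W.
pub fn silent_watts(ambient_c: f32, r_theta_kw: f32) -> f32 {
    const TARGET_C: f32 = 60.0; // core temperature target with fans off
    const PASSIVE_FACTOR: f32 = 2.5; // assumed passive/active resistance ratio
    // Floor the resistance so a zero measurement cannot divide by zero.
    let r_passive = (r_theta_kw * PASSIVE_FACTOR).max(0.1);
    ((TARGET_C - ambient_c) / r_passive).clamp(3.0, 15.0)
}
```

With a 25 °C ambient and a measured 1.0 K/W, this gives (60 - 25) / 2.5 = 14 W. A lower resistance such as 0.5 K/W would compute 28 W and is capped at 15 W by the clamp.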
154
src/agent_integrator/mod.rs
Normal file
@@ -0,0 +1,154 @@
//! System Service Integration (Agent Integrator)
//!
//! This module translates the mathematical optimums defined by the Analyst
//! into actionable, real-world Linux/OS service configurations.
//! It generates templates for fan daemons (i8kmon, thinkfan) and handles
//! resolution strategies for overlapping daemons.

use anyhow::Result;
use std::path::{Path, PathBuf};
use std::fs;
use crate::agent_analyst::OptimizationMatrix;

pub struct ServiceIntegrator;

impl ServiceIntegrator {
    /// Generates and saves an i8kmon configuration based on the balanced profile.
    pub fn generate_i8kmon_config(matrix: &OptimizationMatrix, output_path: &Path, source_path: Option<&PathBuf>) -> Result<()> {
        let profile = &matrix.balanced;

        let mut conf = String::new();

        // Read existing content to preserve daemon and other settings
        let existing = if let Some(src) = source_path {
            if src.exists() { fs::read_to_string(src).unwrap_or_default() } else { String::new() }
        } else if output_path.exists() {
            fs::read_to_string(output_path).unwrap_or_default()
        } else {
            String::new()
        };

        if !existing.is_empty() {
            for line in existing.lines() {
                let trimmed = line.trim();
                // Filter out the old auto-generated config lines and fan configs
                if !trimmed.starts_with("set config(0)") &&
                   !trimmed.starts_with("set config(1)") &&
                   !trimmed.starts_with("set config(2)") &&
                   !trimmed.starts_with("set config(3)") &&
                   !trimmed.starts_with("# Auto-generated") &&
                   !trimmed.starts_with("# Profile:") &&
                   !trimmed.is_empty() {
                    conf.push_str(line);
                    conf.push('\n');
                }
            }
        }

        conf.push_str("\n# Auto-generated by ember-tune Integrator\n");
        conf.push_str(&format!("# Profile: {}\n", profile.name));
        conf.push_str(&format!("# Thermal Resistance: {:.3} K/W\n\n", matrix.thermal_resistance_kw));

        for (i, p) in profile.fan_curve.iter().enumerate() {
            let state = match p.pwm_percent {
                0..=20 => 0,
                21..=50 => 1,
                51..=100 => 2,
                _ => 2,
            };

            let off = if i == 0 { "-".to_string() } else { format!("{:.0}", p.temp_off) };
            conf.push_str(&format!("set config({}) {{{} {} {:.0} {}}}\n", i, state, state, p.temp_on, off));
        }

        fs::write(output_path, conf)?;
        Ok(())
    }

    /// Generates a thinkfan configuration, merging with existing sensors if possible.
    pub fn generate_thinkfan_config(matrix: &OptimizationMatrix, output_path: &Path, source_path: Option<&PathBuf>) -> Result<()> {
        let profile = &matrix.balanced;

        let mut conf = String::new();

        let existing = if let Some(src) = source_path {
            if src.exists() { fs::read_to_string(src).unwrap_or_default() } else { String::new() }
        } else if output_path.exists() {
            fs::read_to_string(output_path).unwrap_or_default()
        } else {
            String::new()
        };

        if !existing.is_empty() {
            let mut in_sensors = false;
            for line in existing.lines() {
                let trimmed = line.trim();
                if trimmed == "sensors:" { in_sensors = true; }
                if trimmed == "levels:" { in_sensors = false; }

                if in_sensors {
                    conf.push_str(line);
                    conf.push('\n');
                }
            }
        }

        if conf.is_empty() {
            conf.push_str("sensors:\n - hwmon: /sys/class/hwmon/hwmon0/temp1_input\n\n");
        }

        conf.push_str("\n# Auto-generated by ember-tune Integrator\n");
        conf.push_str("levels:\n");

        for (i, p) in profile.fan_curve.iter().enumerate() {
            let level = match p.pwm_percent {
                0..=20 => 0,
                21..=40 => 1,
                41..=60 => 3,
                61..=80 => 5,
                _ => 7,
            };

            let down = if i == 0 { 0.0 } else { p.temp_off };
            conf.push_str(&format!(" - [{}, {:.0}, {:.0}]\n", level, down, p.temp_on));
        }

        fs::write(output_path, conf)?;
        Ok(())
    }

    /// Generates a resolution checklist/script for daemons.
    pub fn generate_conflict_resolution_script(output_path: &Path) -> Result<()> {
        let script = r#"#!/bin/bash
# ember-tune Daemon Neutralization Script

# 1. Mask power-profiles-daemon (Prevent ACPI overrides)
systemctl mask power-profiles-daemon

# 2. Filter TLP (Prevent CPU governor fights while keeping PCIe saving)
sed -i 's/^CPU_SCALING_GOVERNOR_ON_AC=.*/CPU_SCALING_GOVERNOR_ON_AC=""/' /etc/tlp.conf
sed -i 's/^CPU_BOOST_ON_AC=.*/CPU_BOOST_ON_AC=""/' /etc/tlp.conf
systemctl restart tlp

# 3. Thermald Delegate (We provide the trips, it handles the rest)
systemctl restart thermald
"#;
        fs::write(output_path, script)?;
        Ok(())
    }

    /// Generates a thermald configuration XML.
    pub fn generate_thermald_config(matrix: &OptimizationMatrix, output_path: &Path, _source_path: Option<&PathBuf>) -> Result<()> {
        let profile = &matrix.balanced;
        let mut xml = String::new();
        xml.push_str("<?xml version=\"1.0\"?>\n<ThermalConfiguration>\n <Platform>\n <Name>ember-tune Balanced</Name>\n <ProductName>Generic</ProductName>\n <Preference>balanced</Preference>\n <ThermalZones>\n <ThermalZone>\n <Type>cpu</Type>\n <TripPoints>\n");

        for (i, p) in profile.fan_curve.iter().enumerate() {
            xml.push_str(&format!(" <TripPoint>\n <SensorType>cpu</SensorType>\n <Temperature>{}</Temperature>\n <Type>Passive</Type>\n <ControlId>{}</ControlId>\n </TripPoint>\n", p.temp_on * 1000.0, i));
        }

        xml.push_str(" </TripPoints>\n </ThermalZone>\n </ThermalZones>\n </Platform>\n</ThermalConfiguration>\n");
        fs::write(output_path, xml)?;
        Ok(())
    }
}
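To make the i8kmon output of `generate_i8kmon_config` easier to inspect, the sketch below reproduces only the per-point formatting step as a pure function. `FanCurvePoint` is redeclared here without the serde derives, and `i8k_state` and `i8kmon_lines` are hypothetical names. The sketch mirrors the line layout the diff emits; whether i8kmon accepts that layout is not verified here.

```rust
/// Reduced stand-in for the `FanCurvePoint` struct added in the diff.
pub struct FanCurvePoint {
    pub temp_on: f32,
    pub temp_off: f32,
    pub pwm_percent: u8,
}

/// Maps a PWM percentage to the three fan states used in the diff
/// (0 = off, 1 = low, 2 = high).
pub fn i8k_state(pwm_percent: u8) -> u8 {
    match pwm_percent {
        0..=20 => 0,
        21..=50 => 1,
        _ => 2,
    }
}

/// Formats one `set config(i)` line per curve point, as the diff does.
/// The first point has no "off" threshold and prints `-` instead.
pub fn i8kmon_lines(curve: &[FanCurvePoint]) -> Vec<String> {
    curve
        .iter()
        .enumerate()
        .map(|(i, p)| {
            let state = i8k_state(p.pwm_percent);
            let off = if i == 0 {
                "-".to_string()
            } else {
                format!("{:.0}", p.temp_off)
            };
            // `{{` and `}}` are literal braces in a Rust format string.
            format!("set config({}) {{{} {} {:.0} {}}}", i, state, state, p.temp_on, off)
        })
        .collect()
}
```

Keeping the formatting separate from file I/O means the generated lines can be unit-tested without touching an existing configuration file.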
@@ -118,8 +118,15 @@ Trip_Temp_C: {trip:.0}
         result_lines.join("\n")
     }

-    pub fn save(path: &Path, config: &ThrottledConfig) -> Result<()> {
-        let existing = if path.exists() { std::fs::read_to_string(path)? } else { String::new() };
+    pub fn save(path: &Path, config: &ThrottledConfig, source_path: Option<&std::path::PathBuf>) -> Result<()> {
+        let existing = if let Some(src) = source_path {
+            if src.exists() { std::fs::read_to_string(src).unwrap_or_default() } else { String::new() }
+        } else if path.exists() {
+            std::fs::read_to_string(path).unwrap_or_default()
+        } else {
+            String::new()
+        };
+
         let content = if existing.is_empty() { Self::generate_conf(config) } else { Self::merge_conf(&existing, config) };
         std::fs::write(path, content)?;
         Ok(())
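The same "explicit source, else output path, else empty" lookup appears three times in this diff (`save`, `generate_i8kmon_config`, `generate_thinkfan_config`). A minimal sketch of that rule as one helper is shown below. `read_existing` is a hypothetical name, and it takes `Option<&Path>` where the diff uses `Option<&PathBuf>`. As in the new code, a read failure yields an empty string, so the caller would then generate a fresh configuration instead of merging.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical helper for the fallback rule used in the diff:
/// read `source_path` when one is given, otherwise `output_path`.
/// A missing or unreadable file yields an empty string.
pub fn read_existing(output_path: &Path, source_path: Option<&Path>) -> String {
    // When a source is supplied, the output path is never consulted.
    let candidate = source_path.unwrap_or(output_path);
    if candidate.exists() {
        fs::read_to_string(candidate).unwrap_or_default()
    } else {
        String::new()
    }
}
```

Note the behavioural difference from the removed `save`, which propagated read errors with `?`: with `unwrap_or_default`, an unreadable existing file is treated the same as no file.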
@@ -7,6 +7,7 @@
 use serde::{Serialize, Deserialize};
 use std::collections::HashMap;
 use std::path::PathBuf;
+use tracing::{warn, debug};

 pub mod formatters;

@@ -25,6 +26,7 @@ pub struct ThermalPoint {
 pub struct ThermalProfile {
     pub points: Vec<ThermalPoint>,
     pub ambient_temp: f32,
+    pub r_theta: f32,
 }

 /// The final, recommended parameters derived from the thermal benchmark.
@@ -46,27 +48,21 @@ pub struct OptimizationResult {
    pub is_partial: bool,
    /// A map of configuration files that were written to.
    pub config_paths: HashMap<String, PathBuf>,
    /// The comprehensive optimization matrix (Silent, Balanced, Performance).
    pub optimization_matrix: Option<crate::agent_analyst::OptimizationMatrix>,
}

/// Pure mathematics engine for thermal optimization.
///
/// Contains no hardware I/O and operates solely on the collected [ThermalProfile].
pub struct OptimizerEngine {
    /// The size of the sliding window for the `smooth` function.
    window_size: usize,
}

impl OptimizerEngine {
    /// Creates a new `OptimizerEngine`.
    pub fn new(window_size: usize) -> Self {
        Self { window_size }
    }

    /// Applies a simple moving average (SMA) filter with outlier rejection.
    ///
    /// This function smooths noisy sensor data. It rejects any value in the
    /// window that is more than 20.0 units away from the window's average
    /// before calculating the final smoothed value.
    /// Smoothes sensor jitter using a moving average with outlier rejection.
    pub fn smooth(&self, data: &[f32]) -> Vec<f32> {
        if data.is_empty() { return vec![]; }
        let mut smoothed = Vec::with_capacity(data.len());
@@ -78,7 +74,7 @@ impl OptimizerEngine {
             let window = &data[start..end];
             let avg: f32 = window.iter().sum::<f32>() / window.len() as f32;
             let filtered: Vec<f32> = window.iter()
-                .filter(|&&v| (v - avg).abs() < 20.0) // Reject spikes > 20 units
+                .filter(|&&v| (v - avg).abs() < 10.0)
                 .cloned().collect();

             if filtered.is_empty() {
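The hunk above tightens the outlier threshold in `smooth` from 20 to 10 units. The effect can be seen with a self-contained sketch of a moving average with outlier rejection. The diff does not show how `start` and `end` are computed or what happens when every value is rejected, so the centered window and the fall-back to the raw sample below are assumptions, as is the `max_dev` parameter.

```rust
/// Sketch of a moving average that drops values too far from the window mean.
/// Assumptions (not visible in the diff): the window is centered on each
/// sample, and the raw sample is used when every value is rejected.
pub fn smooth(data: &[f32], window_size: usize, max_dev: f32) -> Vec<f32> {
    let half = window_size / 2;
    let mut smoothed = Vec::with_capacity(data.len());
    for i in 0..data.len() {
        let start = i.saturating_sub(half);
        let end = (i + half + 1).min(data.len());
        let window = &data[start..end];
        let avg = window.iter().sum::<f32>() / window.len() as f32;
        // Keep only values within `max_dev` of the unfiltered window mean.
        let kept: Vec<f32> = window
            .iter()
            .copied()
            .filter(|v| (v - avg).abs() < max_dev)
            .collect();
        if kept.is_empty() {
            smoothed.push(data[i]);
        } else {
            smoothed.push(kept.iter().sum::<f32>() / kept.len() as f32);
        }
    }
    smoothed
}
```

For the series `[50, 50, 65, 50, 50]` with a window of 5, the centre window has mean 53, so the 65 reading deviates by 12. It is averaged in under the old threshold of 20 (result 53) and rejected under the new threshold of 10 (result 50).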
@@ -90,96 +86,65 @@ impl OptimizerEngine {
         smoothed
     }

-    /// Calculates Thermal Resistance: R_theta = (T_core - T_ambient) / P_package.
-    ///
-    /// This function uses the data point with the highest power draw to ensure
-    /// the calculation reflects a system under maximum thermal load.
-    pub fn calculate_thermal_resistance(&self, profile: &ThermalProfile) -> f32 {
-        profile.points.iter()
-            .filter(|p| p.power_w > 1.0 && p.temp_c > 30.0) // Filter invalid data
-            .max_by(|a, b| a.power_w.partial_cmp(&b.power_w).unwrap_or(std::cmp::Ordering::Equal))
-            .map(|p| (p.temp_c - profile.ambient_temp) / p.power_w)
-            .unwrap_or(0.0)
+    /// Evaluates if a series of temperature readings have reached thermal equilibrium.
+    /// Criteria: Standard deviation < 0.25C over the last 10 seconds.
+    pub fn is_stable(&self, temps: &[f32]) -> bool {
+        if temps.len() < 20 { return false; } // Need at least 10s of data (500ms intervals)
+        let window = &temps[temps.len() - 20..];
+
+        let avg = window.iter().sum::<f32>() / window.len() as f32;
+        let variance = window.iter().map(|&t| (t - avg).powi(2)).sum::<f32>() / window.len() as f32;
+        let std_dev = variance.sqrt();
+
+        debug!("Stability Check: StdDev={:.3}C (Target < 0.25C)", std_dev);
+        std_dev < 0.25
     }

-    /// Returns the maximum temperature recorded in the profile.
-    pub fn get_max_temp(&self, profile: &ThermalProfile) -> f32 {
-        profile.points.iter()
-            .map(|p| p.temp_c)
-            .max_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal))
-            .unwrap_or(0.0)
+    /// Predicts the steady-state temperature for a given target wattage.
+    /// Formula: T_pred = T_ambient + (P_target * R_theta)
+    pub fn predict_temp(&self, target_watts: f32, ambient: f32, r_theta: f32) -> f32 {
+        ambient + (target_watts * r_theta)
     }

-    /// Finds the "Silicon Knee" - the point where performance-per-watt (efficiency)
-    /// starts to diminish significantly and thermal density spikes.
-    ///
-    /// This heuristic scoring model balances several factors:
-    /// 1. **Efficiency Drop:** How quickly does performance-per-watt decrease as power increases?
-    /// 2. **Thermal Acceleration:** How quickly does temperature rise per additional Watt?
-    /// 3. **Throttling Penalty:** A large penalty is applied if absolute performance drops, indicating a thermal wall.
-    ///
-    /// The "Knee" is the power level with the highest score, representing the optimal
-    /// balance before thermal saturation causes diminishing returns.
+    /// Calculates Thermal Resistance (K/W) using the steady-state delta.
+    pub fn calculate_r_theta(&self, ambient: f32, steady_temp: f32, steady_power: f32) -> f32 {
+        if steady_power < 1.0 { return 0.0; }
+        (steady_temp - ambient) / steady_power
+    }
+
+    /// Identifies the "Silicon Knee" by finding the point of maximum efficiency.
     pub fn find_silicon_knee(&self, profile: &ThermalProfile) -> f32 {
-        let valid_points: Vec<_> = profile.points.iter()
-            .filter(|p| p.power_w > 5.0 && p.temp_c > 40.0) // Filter idle/noise
-            .cloned()
-            .collect();
+        if profile.points.is_empty() { return 15.0; }

-        if valid_points.len() < 3 {
-            return profile.points.last().map(|p| p.power_w).unwrap_or(15.0);
-        }
-
-        let mut points = valid_points;
+        let mut points = profile.points.clone();
         points.sort_by(|a, b| a.power_w.partial_cmp(&b.power_w).unwrap_or(std::cmp::Ordering::Equal));

-        let mut best_pl = points[0].power_w;
-        let mut max_score = f32::MIN;
+        let efficiencies: Vec<(f32, f32)> = points.iter()
+            .map(|p| {
+                let perf = if p.throughput > 0.0 { p.throughput as f32 } else { p.freq_mhz };
+                (p.power_w, perf / p.power_w.max(1.0))
+            })
+            .collect();

-        // Use a sliding window (3 points) to calculate gradients more robustly
-        for i in 1..points.len() - 1 {
-            let prev = &points[i - 1];
-            let curr = &points[i];
-            let next = &points[i + 1];
+        if efficiencies.is_empty() { return 15.0; }

-            // 1. Efficiency Metric (Throughput per Watt or Freq per Watt)
-            let efficiency_curr = if curr.throughput > 0.0 {
-                curr.throughput as f32 / curr.power_w.max(1.0)
+        let max_efficiency = efficiencies.iter()
+            .map(|(_, e)| *e)
+            .max_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal))
+            .unwrap_or(1.0);
+
+        let mut knee_watts = points[0].power_w;
+        for (watts, efficiency) in efficiencies {
+            if efficiency >= (max_efficiency * 0.85) {
+                knee_watts = watts;
             } else {
-                curr.freq_mhz / curr.power_w.max(1.0)
-            };
-
-            let efficiency_next = if next.throughput > 0.0 {
-                next.throughput as f32 / next.power_w.max(1.0)
-            } else {
-                next.freq_mhz / next.power_w.max(1.0)
-            };
-
-            let p_delta = (next.power_w - curr.power_w).max(0.5);
-            let efficiency_drop = (efficiency_curr - efficiency_next) / p_delta;
-
-            // 2. Thermal Acceleration (d2T/dW2)
-            let p_delta_prev = (curr.power_w - prev.power_w).max(0.5);
-            let p_delta_next = (next.power_w - curr.power_w).max(0.5);
-
-            let dt_dw_prev = (curr.temp_c - prev.temp_c) / p_delta_prev;
-            let dt_dw_next = (next.temp_c - curr.temp_c) / p_delta_next;
-
-            let p_total_delta = (next.power_w - prev.power_w).max(1.0);
-            let temp_accel = (dt_dw_next - dt_dw_prev) / p_total_delta;
-
-            // 3. Wall Detection (Any drop in absolute performance is a hard wall)
-            let is_throttling = next.freq_mhz < curr.freq_mhz || (next.throughput > 0.0 && next.throughput < curr.throughput);
-            let penalty = if is_throttling { 5000.0 } else { 0.0 };
-
-            let score = (efficiency_curr * 10.0) - (efficiency_drop * 50.0) - (temp_accel * 20.0) - penalty;
-
-            if score > max_score {
-                max_score = score;
-                best_pl = curr.power_w;
+                debug!("Efficiency drop at {:.1}W ({:.1}% of peak)", watts, (efficiency/max_efficiency)*100.0);
+                break;
             }
         }

-        best_pl
+        knee_watts.clamp(PowerLimitWatts::MIN, PowerLimitWatts::MAX)
     }
 }
+
+use crate::sal::safety::PowerLimitWatts;
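The rewritten `find_silicon_knee` replaces the weighted score with a simpler rule: walk the samples in order of increasing power and keep the last one whose efficiency is still at least 85 % of the peak. The sketch below applies that rule to plain `(watts, performance)` pairs so it can be run in isolation. The tuple input, the `keep_fraction` parameter, and the `Option` return are assumptions made for this example, and the final clamp to `PowerLimitWatts::MIN..MAX` is omitted because those bounds are not shown in the diff.

```rust
use std::cmp::Ordering;

/// Sketch of the threshold rule in the new `find_silicon_knee`.
/// `samples` holds (power in watts, performance in arbitrary units).
pub fn knee_watts(samples: &[(f32, f32)], keep_fraction: f32) -> Option<f32> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap_or(Ordering::Equal));

    // Efficiency is performance per watt, with the divisor floored at 1 W.
    let eff: Vec<(f32, f32)> = sorted
        .iter()
        .map(|&(w, perf)| (w, perf / w.max(1.0)))
        .collect();
    let peak = eff.iter().map(|&(_, e)| e).fold(f32::MIN, f32::max);

    // Advance while efficiency stays within the kept fraction of the peak,
    // and stop at the first sample that falls below it.
    let mut knee = eff[0].0;
    for &(w, e) in &eff {
        if e >= peak * keep_fraction {
            knee = w;
        } else {
            break;
        }
    }
    Some(knee)
}
```

Because the scan starts at the lowest power level and stops at the first sample below the threshold, a profile whose efficiency is still rising at low power returns its lowest-power sample. For example, efficiencies of 60, 100 and 93 at 5 W, 10 W and 15 W give 5 W. Whether the commit intends that behaviour is not clear from the diff.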
0
src/engine/profiles.rs
Normal file
@@ -12,3 +12,5 @@ pub mod ui;
 pub mod engine;
 pub mod cli;
 pub mod sys;
+pub mod agent_analyst;
+pub mod agent_integrator;
165
src/load/mod.rs
@@ -1,60 +1,145 @@
|
||||
//! Defines the `Workload` trait for generating synthetic CPU/GPU load.
|
||||
//! Load generation and performance measurement subsystem.
|
||||
|
||||
use anyhow::Result;
|
||||
use std::process::Child;
|
||||
use anyhow::{Result, Context, anyhow};
|
||||
use std::process::{Child, Command, Stdio};
|
||||
use std::time::{Duration, Instant};
|
||||
use std::thread;
|
||||
use std::io::{BufRead, BufReader};
|
||||
use std::sync::{Arc, Mutex};
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
/// A trait for objects that can generate a measurable system load.
|
||||
pub trait Workload: Send + Sync {
|
||||
/// Starts the workload with the specified number of threads and load percentage.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns an error if the underlying stress test process fails to spawn.
|
||||
fn start(&mut self, threads: usize, load_percent: usize) -> Result<()>;
|
||||
|
||||
/// Stops the workload gracefully.
|
||||
///
|
||||
/// # Errors
|
||||
/// This method should aim to not fail, but may return an error if
|
||||
/// forcefully killing the child process fails.
|
||||
fn stop(&mut self) -> Result<()>;
|
||||
|
||||
/// Returns the current throughput of the workload (e.g., ops/sec).
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns an error if throughput cannot be measured.
|
||||
fn get_throughput(&self) -> Result<f64>;
|
||||
/// Standardized telemetry returned by any workload implementation.
|
||||
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
|
||||
pub struct WorkloadMetrics {
|
||||
/// Primary performance heuristic (e.g., Bogo Ops/s)
|
||||
pub primary_ops_per_sec: f64,
|
||||
/// Time elapsed since the workload started
|
||||
pub elapsed_time: Duration,
|
||||
}
|
||||
|
||||
/// An implementation of `Workload` that uses the `stress-ng` utility.
|
||||
/// Defines which subsystem to isolate during stress testing.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub enum StressVector {
|
||||
CpuMatrix,
|
||||
MemoryBandwidth,
|
||||
Mixed,
|
||||
}
|
||||
|
||||
/// A normalized profile defining the intensity and constraints of the workload.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct IntensityProfile {
|
||||
pub threads: usize,
|
||||
pub load_percentage: u8,
|
||||
pub vector: StressVector,
|
||||
}
|
||||
|
||||
/// The replaceable interface for load generation and performance measurement.
|
||||
pub trait Workload: Send + Sync {
|
||||
/// Sets up prerequisites (e.g., binary checks).
|
||||
fn initialize(&mut self) -> Result<()>;
|
||||
|
||||
/// Executes the load asynchronously.
|
||||
fn run_workload(&mut self, duration: Duration, profile: IntensityProfile) -> Result<()>;
|
||||
|
||||
/// Returns the current standardized telemetry object.
|
||||
fn get_current_metrics(&self) -> Result<WorkloadMetrics>;
|
||||
|
||||
/// Gracefully and forcefully terminates the workload.
|
||||
fn stop_workload(&mut self) -> Result<()>;
|
||||
}
|
||||
|
||||
/// Implementation of the Benchmarking Interface using stress-ng matrix stressors.
pub struct StressNg {
    child: Option<Child>,
    start_time: Option<Instant>,
    latest_metrics: Arc<Mutex<WorkloadMetrics>>,
}

impl StressNg {
    pub fn new() -> Self {
        Self { child: None }
        Self {
            child: None,
            start_time: None,
            latest_metrics: Arc::new(Mutex::new(WorkloadMetrics::default())),
        }
    }
}

impl Workload for StressNg {
    fn start(&mut self, threads: usize, load_percent: usize) -> Result<()> {
        self.stop()?;
    fn initialize(&mut self) -> Result<()> {
        let status = Command::new("stress-ng")
            .arg("--version")
            .stdout(Stdio::null())
            .stderr(Stdio::null())
            .status()
            .context("stress-ng binary not found in PATH. Please install it.")?;

        let child = std::process::Command::new("stress-ng")
            .args([
                "--cpu", &threads.to_string(),
                "--cpu-load", &load_percent.to_string(),
                "--quiet"
            ])
            .spawn()?;
        if !status.success() {
            return Err(anyhow!("stress-ng failed to initialize"));
        }
        Ok(())
    }

    fn run_workload(&mut self, duration: Duration, profile: IntensityProfile) -> Result<()> {
        self.stop_workload()?;

        let threads = profile.threads.to_string();
        let timeout = format!("{}s", duration.as_secs());
        let load = profile.load_percentage.to_string();

        let mut cmd = Command::new("stress-ng");
        cmd.args(["--timeout", &timeout, "--metrics", "--quiet", "--cpu-load", &load]);

        match profile.vector {
            StressVector::CpuMatrix => {
                cmd.args(["--matrix", &threads]);
            },
            StressVector::MemoryBandwidth => {
                cmd.args(["--vm", &threads, "--vm-bytes", "80%"]);
            },
            StressVector::Mixed => {
                let half = (profile.threads / 2).max(1).to_string();
                cmd.args(["--matrix", &half, "--vm", &half, "--vm-bytes", "40%"]);
            }
        }

        let mut child = cmd.stderr(Stdio::piped()).spawn().context("Failed to spawn stress-ng")?;

        self.start_time = Some(Instant::now());

        // Spawn metrics parser thread
        let metrics_ref = Arc::clone(&self.latest_metrics);
        let stderr = child.stderr.take().expect("Failed to capture stderr");

        thread::spawn(move || {
            let reader = BufReader::new(stderr);
            for line in reader.lines().flatten() {
                // Parse stress-ng metrics line
                if line.contains("matrix") || line.contains("vm") {
                    let parts: Vec<&str> = line.split_whitespace().collect();
                    if let Some(val) = parts.last() {
                        if let Ok(ops) = val.parse::<f64>() {
                            let mut m = metrics_ref.lock().unwrap();
                            m.primary_ops_per_sec = ops;
                        }
                    }
                }
            }
        });

        self.child = Some(child);
        Ok(())
    }

    fn stop(&mut self) -> Result<()> {
    fn get_current_metrics(&self) -> Result<WorkloadMetrics> {
        let mut m = self.latest_metrics.lock().unwrap().clone();
        if let Some(start) = self.start_time {
            m.elapsed_time = start.elapsed();
        }
        Ok(m)
    }

    fn stop_workload(&mut self) -> Result<()> {
        if let Some(mut child) = self.child.take() {
            #[cfg(unix)]
            {
@@ -77,19 +162,13 @@ impl Workload for StressNg {
                let _ = child.wait();
            }
        }
        self.start_time = None;
        Ok(())
    }

    /// Returns the current throughput of the workload (e.g., ops/sec).
    ///
    /// This is currently a stub and does not parse `stress-ng` output.
    fn get_throughput(&self) -> Result<f64> {
        Ok(0.0)
    }
}

impl Drop for StressNg {
    fn drop(&mut self) {
        let _ = self.stop();
        let _ = self.stop_workload();
    }
}

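The metrics thread in `run_workload` takes the last whitespace-separated token of any line that mentions `matrix` or `vm`. A more defensive variant could select a column by position relative to the stressor name. The sketch below is not part of this change set: `parse_bogo_ops_per_sec` is a hypothetical helper, and the column order it relies on (bogo ops, real time, usr time, sys time, bogo ops/s in real time) is an assumption about the `stress-ng --metrics` layout that may differ between versions.

```rust
/// Hypothetical helper (not in the diff): pulls the throughput column out of
/// one `stress-ng --metrics` line for the given stressor name.
///
/// Assumed column order after the stressor name:
/// bogo ops, real time, usr time, sys time, bogo ops/s (real time), ...
fn parse_bogo_ops_per_sec(line: &str, stressor: &str) -> Option<f64> {
    let mut tokens = line.split_whitespace();

    // Advance past the stressor name; lines that never mention it yield None.
    tokens.find(|t| *t == stressor)?;

    // The fifth value after the name is the assumed "bogo ops/s (real time)".
    let ops_per_sec = tokens.nth(4)?;

    // Reject non-numeric tokens and NaN/inf values.
    ops_per_sec.parse::<f64>().ok().filter(|v| v.is_finite())
}
```

Lines that contain the stressor name but too few columns (for example a startup banner) return `None`, so the caller can keep the previous reading instead of overwriting it.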
62
src/main.rs
@@ -8,7 +8,8 @@ use std::sync::atomic::{AtomicBool, Ordering};
use std::io;

use clap::Parser;
use tracing::{info, debug, error};
use tracing::error;
use tracing_subscriber::{fmt, prelude::*, EnvFilter};

use crossterm::{
    event::{self, Event, KeyCode},
@@ -68,27 +69,24 @@ fn print_summary_report(result: &OptimizationResult) {
    println!();
}

fn setup_logging(verbose: bool) -> tracing_appender::non_blocking::WorkerGuard {
    let file_appender = tracing_appender::rolling::never("/var/log", "ember-tune.log");
    let (non_blocking, guard) = tracing_appender::non_blocking(file_appender);
fn main() -> Result<()> {
    let args = Cli::parse();

    let level = if verbose { tracing::Level::DEBUG } else { tracing::Level::INFO };
    // 1. Logging Setup (File-only by default, Stdout during Audit)
    let file_appender = tracing_appender::rolling::never(".", "ember-tune.log");
    let (non_blocking, _guard) = tracing_appender::non_blocking(file_appender);
    let level = if args.verbose { "debug" } else { "info" };

    tracing_subscriber::fmt()
        .with_max_level(level)
    let file_layer = fmt::layer()
        .with_writer(non_blocking)
        .with_ansi(false)
        .with_ansi(false);

    // We use a simple println for the audit to avoid complex reload handles
    tracing_subscriber::registry()
        .with(EnvFilter::new(level))
        .with(file_layer)
        .init();

    guard
}

fn main() -> Result<()> {
    // 1. Diagnostics & CLI Initialization
    let args = Cli::parse();
    let _log_guard = setup_logging(args.verbose);

    // Set panic hook to restore terminal state
    std::panic::set_hook(Box::new(|panic_info| {
        let _ = disable_raw_mode();
        let mut stdout = io::stdout();
@@ -99,11 +97,10 @@ fn main() -> Result<()> {
        eprintln!("----------------------------------------\n");
    }));

    info!("ember-tune starting with args: {:?}", args);
    println!("{}", console::style("─── Pre-flight System Audit ───").bold().cyan());

    let ctx = ember_tune_rs::sal::traits::EnvironmentCtx::production();

    // 2. Platform Detection & Audit
    let (sal_box, facts): (Box<dyn PlatformSal>, SystemFactSheet) = if args.mock {
        (Box::new(MockSal::new()), SystemFactSheet::default())
    } else {
@@ -111,9 +108,7 @@ fn main() -> Result<()> {
    };
    let sal: Arc<dyn PlatformSal> = sal_box.into();

    println!("{}", console::style("─── Pre-flight System Audit ───").bold().cyan());
    let mut audit_failures = Vec::new();

    for step in sal.audit() {
        print!(" Checking {:<40} ", step.description);
        io::Write::flush(&mut io::stdout()).into_diagnostic()?;
@@ -137,15 +132,14 @@ fn main() -> Result<()> {
        return Ok(());
    }

    // 3. Terminal Setup
    // Entering TUI Mode - STDOUT is now strictly for Ratatui
    enable_raw_mode().into_diagnostic()?;
    let mut stdout = io::stdout();
    execute!(stdout, EnterAlternateScreen).into_diagnostic()?;
    execute!(stdout, EnterAlternateScreen, crossterm::cursor::Hide).into_diagnostic()?;
    let backend_stdout = io::stdout();
    let backend_term = CrosstermBackend::new(backend_stdout);
    let mut terminal = Terminal::new(backend_term).into_diagnostic()?;

    // 4. State & Communication Setup
    let running = Arc::new(AtomicBool::new(true));
    let r = running.clone();

@@ -158,9 +152,9 @@ fn main() -> Result<()> {
        r.store(false, Ordering::SeqCst);
    }).expect("Error setting Ctrl-C handler");

    // 5. Spawn Backend Orchestrator
    let sal_backend = sal.clone();
    let facts_backend = facts.clone();
    let config_out = args.config_out.clone();
    let backend_handle = thread::spawn(move || {
        let workload = Box::new(StressNg::new());
        let mut orchestrator = BenchmarkOrchestrator::new(
@@ -169,14 +163,14 @@ fn main() -> Result<()> {
            workload,
            telemetry_tx,
            command_rx,
            config_out,
        );
        orchestrator.run()
    });

    // 6. Frontend Event Loop
    let mut ui_state = DashboardState::new();
    let mut last_telemetry = TelemetryState {
        cpu_model: "Loading...".to_string(),
        cpu_model: facts.model.clone(),
        total_ram_gb: 0,
        tick: 0,
        cpu_temp: 0.0,
@@ -187,6 +181,7 @@ fn main() -> Result<()> {
        pl1_limit: 0.0,
        pl2_limit: 0.0,
        fan_tier: "auto".to_string(),
        is_throttling: false,
        phase: BenchmarkPhase::Auditing,
        history_watts: Vec::new(),
        history_temp: Vec::new(),
@@ -224,7 +219,6 @@ fn main() -> Result<()> {
        while let Ok(new_state) = telemetry_rx.try_recv() {
            if let Some(log) = &new_state.log_event {
                ui_state.add_log(log.clone());
                debug!("Backend Log: {}", log);
            } else {
                ui_state.update(&new_state);
                last_telemetry = new_state;
@@ -235,20 +229,11 @@ fn main() -> Result<()> {
        if backend_handle.is_finished() { break; }
    }

    // 7. Terminal Restoration
    let _ = disable_raw_mode();
    let _ = execute!(terminal.backend_mut(), LeaveAlternateScreen);
    let _ = terminal.show_cursor();
    let _ = execute!(terminal.backend_mut(), LeaveAlternateScreen, crossterm::cursor::Show);

    // 8. Final Report & Hardware Restoration
    let join_res = backend_handle.join();

    // Explicit hardware restoration
    info!("Restoring hardware state...");
    if let Err(e) = sal.restore() {
        error!("Failed to restore hardware state: {}", e);
    }

    match join_res {
        Ok(Ok(result)) => {
            print_summary_report(&result);
@@ -273,6 +258,5 @@ fn main() -> Result<()> {
        }
    }

    info!("ember-tune exited gracefully.");
    Ok(())
}

@@ -35,6 +35,7 @@ pub struct TelemetryState {
    pub pl1_limit: f32,
    pub pl2_limit: f32,
    pub fan_tier: String,
    pub is_throttling: bool,
    pub phase: BenchmarkPhase,

    // --- High-res History ---

@@ -3,7 +3,8 @@
//! It manages hardware interactions through the [PlatformSal], generates stress
//! using a [Workload], and feeds telemetry to the frontend via MPSC channels.

use anyhow::{Result, Context};
use anyhow::{Result, Context, bail};
use tracing::{info, warn, error, debug};
use std::sync::mpsc;
use std::time::{Duration, Instant};
use std::thread;
@@ -12,61 +13,57 @@ use sysinfo::System;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::path::PathBuf;
use std::cell::Cell;

use crate::sal::traits::{PlatformSal, SafetyStatus};
use crate::sal::traits::{PlatformSal, SensorBus};
use crate::sal::heuristic::discovery::SystemFactSheet;
use crate::load::Workload;
use crate::sal::safety::{HardwareStateGuard, PowerLimitWatts, ThermalWatchdog};
use crate::load::{Workload, IntensityProfile, StressVector};
use crate::mediator::{TelemetryState, UiCommand, BenchmarkPhase};
use crate::engine::{OptimizerEngine, ThermalProfile, ThermalPoint, OptimizationResult};
use crate::agent_analyst::HeuristicAnalyst;
use crate::agent_integrator::ServiceIntegrator;

/// Represents the possible states of the benchmark orchestrator.
pub enum OrchestratorState {
    PreFlight,
    IdleBaseline,
    ThermalCalibration,
    StabilitySweep,
    Cooldown,
    Finalizing,
}

/// The central state machine responsible for coordinating the thermal benchmark.
///
/// It manages hardware interactions through the [PlatformSal], generates stress
/// using a [Workload], and feeds telemetry to the frontend via MPSC channels.
pub struct BenchmarkOrchestrator {
    /// Injected hardware abstraction layer.
    sal: Arc<dyn PlatformSal>,
    /// Discovered system facts and paths.
    facts: SystemFactSheet,
    /// Heat generation workload.
    workload: Box<dyn Workload>,
    /// Channel for sending telemetry updates to the UI.
    telemetry_tx: mpsc::Sender<TelemetryState>,
    /// Channel for receiving commands from the UI.
    command_rx: mpsc::Receiver<UiCommand>,
    /// Current phase of the benchmark.
    phase: BenchmarkPhase,
    /// Accumulated thermal data points.
    ui_phase: BenchmarkPhase,
    profile: ThermalProfile,
    /// Mathematics engine for data smoothing and optimization.
    engine: OptimizerEngine,

    /// Sliding window of power readings (Watts).
    optional_config_out: Option<PathBuf>,
    safeguard: Option<HardwareStateGuard>,
    watchdog: Option<ThermalWatchdog>,
    history_watts: VecDeque<f32>,
    /// Sliding window of temperature readings (Celsius).
    history_temp: VecDeque<f32>,
    /// Sliding window of CPU frequency (MHz).
    history_mhz: VecDeque<f32>,

    /// Detected CPU model string.
    cpu_model: String,
    /// Total system RAM in Gigabytes.
    total_ram_gb: u64,

    /// Atomic flag indicating a safety-triggered abort.
    emergency_abort: Arc<AtomicBool>,
    /// Human-readable reason for the emergency abort.
    emergency_reason: Arc<Mutex<Option<String>>>,
}

impl BenchmarkOrchestrator {
    /// Creates a new orchestrator instance with injected dependencies.
    pub fn new(
        sal: Arc<dyn PlatformSal>,
        facts: SystemFactSheet,
        workload: Box<dyn Workload>,
        telemetry_tx: mpsc::Sender<TelemetryState>,
        command_rx: mpsc::Receiver<UiCommand>,
        optional_config_out: Option<PathBuf>,
    ) -> Self {
        let mut sys = System::new_all();
        sys.refresh_all();
@@ -82,7 +79,7 @@ impl BenchmarkOrchestrator {
            workload,
            telemetry_tx,
            command_rx,
            phase: BenchmarkPhase::Auditing,
            ui_phase: BenchmarkPhase::Auditing,
            profile: ThermalProfile::default(),
            engine: OptimizerEngine::new(5),
            history_watts: VecDeque::with_capacity(120),
@@ -92,244 +89,252 @@ impl BenchmarkOrchestrator {
            total_ram_gb,
            emergency_abort: Arc::new(AtomicBool::new(false)),
            emergency_reason: Arc::new(Mutex::new(None)),
            optional_config_out,
            safeguard: None,
            watchdog: None,
        }
    }

    /// Executes the full benchmark sequence.
    ///
    /// This method guarantees that [crate::sal::traits::EnvironmentGuard::restore] and [Workload::stop_workload]
    /// are called regardless of whether the benchmark succeeds or fails.
    pub fn run(&mut self) -> Result<OptimizationResult> {
        self.log("Starting ember-tune Benchmark Sequence.")?;
        // Immediate Priming
        let _ = self.sal.get_temp();
        let _ = self.sal.get_power_w();
        let _ = self.sal.get_fan_rpms();

        let _watchdog_handle = self.spawn_watchdog_monitor();
        info!("Orchestrator: Initializing Project Iron-Ember PGC Protocol.");

        // Spawn safety watchdog immediately
        let watchdog = ThermalWatchdog::spawn(self.sal.clone(), self.emergency_abort.clone());
        self.watchdog = Some(watchdog);

        let result = self.execute_benchmark();

        self.log("Benchmark sequence finished. Restoring hardware defaults...")?;
        let _ = self.workload.stop();
        if let Err(e) = self.sal.restore() {
            anyhow::bail!("CRITICAL: Failed to restore hardware state: {}", e);
        if let Err(ref e) = result {
            error!("Benchmark Lifecycle Failure: {}", e);
            let _ = self.log(&format!("⚠ FAILURE: {}", e));
        }
        self.log("✓ Hardware state restored.")?;

        // --- MANDATORY RAII CLEANUP ---
        info!("Benchmark sequence complete. Releasing safeguards...");
        let _ = self.workload.stop_workload();

        if let Some(mut sg) = self.safeguard.take() {
            let _ = sg.release();
        }

        if let Err(e) = self.sal.restore() {
            warn!("Failed secondary SAL restoration: {}", e);
        }

        info!("✓ Hardware state restored.");
        result
    }

    /// Internal execution logic for the benchmark phases.
    fn execute_benchmark(&mut self) -> Result<OptimizationResult> {
        let bench_cfg = self.facts.bench_config.clone().context("Benchmarking config missing in facts")?;
        let _bench_cfg = self.facts.bench_config.clone().context("Config missing.")?;

        // 1. Pre-Flight Phase
        self.ui_phase = BenchmarkPhase::Auditing;
        self.log("Phase: Pre-Flight Auditing & Sterilization")?;

        let mut target_files = self.facts.rapl_paths.iter()
            .map(|p| p.join("constraint_0_power_limit_uw"))
            .collect::<Vec<_>>();
        target_files.extend(self.facts.rapl_paths.iter().map(|p| p.join("constraint_1_power_limit_uw")));

        if let Some(tp) = self.facts.paths.configs.get("throttled") {
            target_files.push(tp.clone());
        }

        let sg = HardwareStateGuard::acquire(&target_files, &self.facts.conflict_services)?;
        self.safeguard = Some(sg);

        self.phase = BenchmarkPhase::Auditing;
        for step in self.sal.audit() {
            if let Err(e) = step.outcome {
                return Err(anyhow::anyhow!("Audit failed ({}): {:?}", step.description, e));
            }
        }

        self.log("Suppressing background services (tlp, thermald)...")?;
        self.sal.suppress().context("Failed to suppress background services")?;
        self.workload.initialize().context("Failed to initialize load generator.")?;
        self.sal.suppress().context("Failed to suppress background services.")?;

        self.phase = BenchmarkPhase::IdleCalibration;
        self.log(&format!("Phase 1: Recording Idle Baseline ({}s)...", bench_cfg.idle_duration_s))?;
        let tick = Cell::new(0u64);

        // 2. Idle Baseline Phase
        self.ui_phase = BenchmarkPhase::IdleCalibration;
        self.log("Phase: Recording 30s Idle Baseline...")?;
        self.sal.set_fan_mode("auto")?;

        let mut idle_temps = Vec::new();
        let start = Instant::now();
        let mut tick = 0;
        while start.elapsed() < Duration::from_secs(bench_cfg.idle_duration_s) {
            self.check_abort()?;
            self.send_telemetry(tick)?;
        while start.elapsed() < Duration::from_secs(30) {
            self.check_safety_abort()?;
            self.send_telemetry(tick.get())?;
            idle_temps.push(self.sal.get_temp().unwrap_or(0.0));
            tick += 1;
            tick.set(tick.get() + 1);
            thread::sleep(Duration::from_millis(500));
        }
        self.profile.ambient_temp = self.engine.smooth(&idle_temps).last().cloned().unwrap_or(0.0);
        self.profile.ambient_temp = self.engine.smooth(&idle_temps).iter().sum::<f32>() / idle_temps.len() as f32;
        self.log(&format!("✓ Idle Baseline: {:.1}°C", self.profile.ambient_temp))?;

        self.phase = BenchmarkPhase::StressTesting;
        self.log("Phase 2: Starting Synthetic Stress Matrix.")?;
        // 3. Thermal Resistance Mapping (Phase 1)
        self.log("Phase: Mapping Thermal Resistance (Rθ) at 10W...")?;
        self.sal.set_fan_mode("max")?;

        let steps = bench_cfg.power_steps_watts.clone();
        for &pl in &steps {
            self.log(&format!("Testing PL1 = {:.0}W...", pl))?;
            self.sal.set_sustained_power_limit(pl)?;
            self.sal.set_burst_power_limit(pl + 5.0)?;
        let pl_calib = PowerLimitWatts::try_new(10.0)?;
        self.sal.set_sustained_power_limit(pl_calib)?;
        self.sal.set_burst_power_limit(pl_calib)?;

            self.workload.start(num_cpus::get(), 100)?;
        self.workload.run_workload(
            Duration::from_secs(120),
            IntensityProfile { threads: num_cpus::get_physical(), load_percentage: 100, vector: StressVector::CpuMatrix }
        )?;

        let mut calib_temps = Vec::new();
        let calib_start = Instant::now();
        while calib_start.elapsed() < Duration::from_secs(90) {
            self.check_safety_abort()?;
            self.send_telemetry(tick.get())?;
            let t = self.sal.get_temp().unwrap_or(0.0);
            calib_temps.push(t);
            tick.set(tick.get() + 1);

            if calib_start.elapsed() > Duration::from_secs(30) && self.engine.is_stable(&calib_temps) {
                break;
            }
            thread::sleep(Duration::from_millis(500));
        }

        let steady_t = calib_temps.last().cloned().unwrap_or(0.0);
        let steady_p = self.sal.get_power_w().unwrap_or(10.0);
        self.profile.r_theta = self.engine.calculate_r_theta(self.profile.ambient_temp, steady_t, steady_p);
        self.log(&format!("✓ Physical Model: Rθ = {:.3} K/W", self.profile.r_theta))?;

        // 4. Physically-Aware Stability Sweep (Phase 2)
        self.ui_phase = BenchmarkPhase::StressTesting;
        self.log("Phase: Starting Physically-Aware Efficiency Sweep...")?;

        let mut current_w = 12.0_f32;
        let mut previous_ops = 0.0;

        loop {
            // Predict if this step is safe
            let pred_t = self.engine.predict_temp(current_w, self.profile.ambient_temp, self.profile.r_theta);
            if pred_t > 92.0 {
                self.log(&format!("Prediction: {:.1}W would result in {:.1}C (Too Hot). Finalizing...", current_w, pred_t))?;
                break;
            }

            self.log(&format!("Step: {:.1}W (Predicted: {:.1}C)", current_w, pred_t))?;
            let pl = PowerLimitWatts::try_new(current_w)?;
            self.sal.set_sustained_power_limit(pl)?;
            self.sal.set_burst_power_limit(PowerLimitWatts::try_new(current_w + 2.0)?)?;

            self.workload.run_workload(
                Duration::from_secs(60),
                IntensityProfile { threads: num_cpus::get_physical(), load_percentage: 100, vector: StressVector::CpuMatrix }
            )?;

            let step_start = Instant::now();
            let mut step_temps = VecDeque::with_capacity(30);
            let mut step_temps = Vec::new();
            let mut previous_t = self.sal.get_temp().unwrap_or(0.0);

            while step_start.elapsed() < Duration::from_secs(bench_cfg.stress_duration_max_s) {
                self.check_abort()?;
            while step_start.elapsed() < Duration::from_secs(60) {
                self.check_safety_abort()?;
                self.send_telemetry(tick.get())?;

                let t = self.sal.get_temp().unwrap_or(0.0);
                step_temps.push_back(t);
                if step_temps.len() > 10 { step_temps.pop_front(); }
                let dt_dt = (t - previous_t) / 0.5;

                self.send_telemetry(tick)?;
                tick += 1;

                if step_start.elapsed() > Duration::from_secs(bench_cfg.stress_duration_min_s) && step_temps.len() == 10 {
                    let min = step_temps.iter().fold(f32::MAX, |a, &b| a.min(b));
                    let max = step_temps.iter().fold(f32::MIN, |a, &b| a.max(b));
                    if (max - min) < 0.5 {
                        self.log(&format!(" Equilibrium reached at {:.1}°C", t))?;
                        break;
                    }
                // # SAFETY: predictive hard-quench threshold raised to 8C/s
                if step_start.elapsed() > Duration::from_secs(2) && (t > 95.0 || dt_dt > 8.0) {
                    warn!("USA: Safety Break triggered! T={:.1}C, dT/dt={:.1}C/s", t, dt_dt);
                    let _ = self.sal.set_sustained_power_limit(PowerLimitWatts::try_new(3.0)?);
                    break; // Just break the sweep loop
                }

                step_temps.push(t);
                tick.set(tick.get() + 1);

                if step_start.elapsed() > Duration::from_secs(15) && self.engine.is_stable(&step_temps) {
                    self.log(&format!(" Equilibrium reached at {:.1}°C", t))?;
                    break;
                }
                previous_t = t;
                thread::sleep(Duration::from_millis(500));
            }

            let avg_p = self.sal.get_power_w().unwrap_or(0.0);
            let avg_t = self.sal.get_temp().unwrap_or(0.0);
            let avg_f = self.sal.get_freq_mhz().unwrap_or(0.0);
            let fans = self.sal.get_fan_rpms().unwrap_or_default();
            let primary_fan = fans.first().cloned().unwrap_or(0);
            let tp = self.workload.get_throughput().unwrap_or(0.0);

            let metrics = self.workload.get_current_metrics().unwrap_or_default();
            self.profile.points.push(ThermalPoint {
                power_w: avg_p,
                temp_c: avg_t,
                freq_mhz: avg_f,
                fan_rpm: primary_fan,
                throughput: tp,
                power_w: self.sal.get_power_w().unwrap_or(current_w),
                temp_c: self.sal.get_temp().unwrap_or(0.0),
                freq_mhz: self.sal.get_freq_mhz().unwrap_or(0.0),
                fan_rpm: self.sal.get_fan_rpms().unwrap_or_default().first().cloned().unwrap_or(0),
                throughput: metrics.primary_ops_per_sec,
            });

            self.workload.stop()?;
            self.log(&format!(" Step complete. Cooling down for {}s...", bench_cfg.cool_down_s))?;
            thread::sleep(Duration::from_secs(bench_cfg.cool_down_s));
            self.workload.stop_workload()?;

            // Efficiency Break
            if previous_ops > 0.0 {
                let gain = ((metrics.primary_ops_per_sec - previous_ops) / previous_ops) * 100.0;
                if gain < 1.0 {
                    self.log("Silicon Knee identified (gain < 1%). Finalizing...")?;
                    break;
                }
            }
            previous_ops = metrics.primary_ops_per_sec;
            current_w += 2.0;
            if current_w > 45.0 { break; }

            self.log(&format!("Cooling down ({}s)...", _bench_cfg.cool_down_s))?;
            thread::sleep(Duration::from_secs(_bench_cfg.cool_down_s));
        }

        self.phase = BenchmarkPhase::PhysicalModeling;
        self.log("Phase 3: Calculating Silicon Physical Sweet Spot...")?;
        // 5. Modeling Phase
        self.ui_phase = BenchmarkPhase::PhysicalModeling;
        let knee = self.engine.find_silicon_knee(&self.profile);
        let analyst = HeuristicAnalyst::new();
        let matrix = analyst.analyze(&self.profile, self.profile.points.last().map(|p| p.power_w).unwrap_or(15.0));

        let mut res = self.generate_result(false);
        res.optimization_matrix = Some(matrix.clone());
        res.silicon_knee_watts = knee;

        self.log(&format!("✓ Thermal Resistance (Rθ): {:.3} K/W", res.thermal_resistance_kw))?;
        self.log(&format!("✓ Silicon Knee Found: {:.1} W", res.silicon_knee_watts))?;

        thread::sleep(Duration::from_secs(3));

        self.phase = BenchmarkPhase::Finalizing;
        self.log("Benchmark sequence complete. Generating configurations...")?;

        let config = crate::engine::formatters::throttled::ThrottledConfig {
            pl1_limit: res.silicon_knee_watts,
            pl2_limit: res.recommended_pl2,
            trip_temp: res.max_temp_c.max(95.0),
        };

        if let Some(throttled_path) = self.facts.paths.configs.get("throttled") {
            crate::engine::formatters::throttled::ThrottledTranslator::save(throttled_path, &config)?;
            self.log(&format!("✓ Saved '{}' (merged).", throttled_path.display()))?;
            res.config_paths.insert("throttled".to_string(), throttled_path.clone());
        // 6. Finalizing Phase
        self.ui_phase = BenchmarkPhase::Finalizing;
        let throttled_source = self.facts.paths.configs.get("throttled");
        if let Some(path) = self.optional_config_out.clone().or_else(|| throttled_source.cloned()) {
            let config = crate::engine::formatters::throttled::ThrottledConfig {
                pl1_limit: res.silicon_knee_watts,
                pl2_limit: res.silicon_knee_watts * 1.25,
                trip_temp: 90.0,
            };
            let _ = crate::engine::formatters::throttled::ThrottledTranslator::save(&path, &config, throttled_source);
            res.config_paths.insert("throttled".to_string(), path);
        }

        if let Some(i8k_path) = self.facts.paths.configs.get("i8kmon") {
            let i8k_config = crate::engine::formatters::i8kmon::I8kmonConfig {
                t_ambient: self.profile.ambient_temp,
                t_max_fan: res.max_temp_c - 5.0,
                thermal_resistance_kw: res.thermal_resistance_kw,
            };
            crate::engine::formatters::i8kmon::I8kmonTranslator::save(i8k_path, &i8k_config)?;
            self.log(&format!("✓ Saved '{}'.", i8k_path.display()))?;
            res.config_paths.insert("i8kmon".to_string(), i8k_path.clone());
        let base_out = self.optional_config_out.clone().unwrap_or_else(|| PathBuf::from("/etc"));
        let i8k_source = self.facts.paths.configs.get("i8kmon");
        let i8k_out = base_out.join("i8kmon.conf");
        if ServiceIntegrator::generate_i8kmon_config(&matrix, &i8k_out, i8k_source).is_ok() {
            res.config_paths.insert("i8kmon".to_string(), i8k_out);
        }

        Ok(res)
    }

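The new sweep in `execute_benchmark` leans on a first-order steady-state thermal model: Rθ is derived from the 10 W calibration step, each power step is checked against a predicted temperature before it runs, and the sweep ends once throughput gain drops below 1%. The bodies of `calculate_r_theta` and `predict_temp` are not shown in this diff, so the following is only a sketch of the formulas they plausibly implement (an assumption), with `gain_percent` as a hypothetical helper mirroring the inline "Efficiency Break" arithmetic.

```rust
/// Assumed form of the calibration formula: Rθ = (T_steady - T_ambient) / P, in K/W.
/// Returns 0.0 for non-positive power to avoid dividing by zero (a choice made
/// for this sketch, not taken from the diff).
fn calculate_r_theta(ambient_c: f32, steady_c: f32, power_w: f32) -> f32 {
    if power_w <= 0.0 {
        return 0.0;
    }
    (steady_c - ambient_c) / power_w
}

/// Assumed form of the prediction: T = T_ambient + P * Rθ.
/// Argument order follows the call site in the sweep loop.
fn predict_temp(power_w: f32, ambient_c: f32, r_theta: f32) -> f32 {
    ambient_c + power_w * r_theta
}

/// Hypothetical helper: relative throughput gain in percent between two
/// consecutive sweep steps. `None` when there is no valid previous sample.
fn gain_percent(previous_ops: f64, current_ops: f64) -> Option<f64> {
    if previous_ops <= 0.0 {
        return None;
    }
    Some((current_ops - previous_ops) / previous_ops * 100.0)
}
```

Under these assumptions, a calibration run that settles at 60 °C from a 40 °C idle baseline at 10 W gives Rθ = 2.0 K/W, so a 26 W step would be predicted at 92 °C, right at the sweep's cut-off.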
    /// Spawns a concurrent monitor that polls safety sensors every 100ms.
    fn spawn_watchdog_monitor(&self) -> thread::JoinHandle<()> {
        let abort = self.emergency_abort.clone();
        let reason_store = self.emergency_reason.clone();
        let sal = self.sal.clone();
        let tx = self.telemetry_tx.clone();

        thread::spawn(move || {
            while !abort.load(Ordering::SeqCst) {
                let status = sal.get_safety_status();
                match status {
                    Ok(SafetyStatus::EmergencyAbort(reason)) => {
                        *reason_store.lock().unwrap() = Some(reason.clone());
                        abort.store(true, Ordering::SeqCst);
                        break;
                    }
                    Ok(SafetyStatus::Warning(msg)) | Ok(SafetyStatus::Critical(msg)) => {
                        let state = TelemetryState {
                            cpu_model: String::new(),
                            total_ram_gb: 0,
                            tick: 0,
                            cpu_temp: 0.0,
                            power_w: 0.0,
                            current_freq: 0.0,
                            fans: Vec::new(),
                            governor: String::new(),
                            pl1_limit: 0.0,
                            pl2_limit: 0.0,
                            fan_tier: String::new(),
                            phase: BenchmarkPhase::StressTesting,
                            history_watts: Vec::new(),
                            history_temp: Vec::new(),
                            history_mhz: Vec::new(),
                            log_event: Some(format!("WATCHDOG: {}", msg)),
                            metadata: std::collections::HashMap::new(),
                            is_emergency: false,
                            emergency_reason: None,
                        };
                        let _ = tx.send(state);
                    }
                    Ok(SafetyStatus::Nominal) => {}
                    Err(e) => {
                        *reason_store.lock().unwrap() = Some(format!("Watchdog Sensor Failure: {}", e));
                        abort.store(true, Ordering::SeqCst);
                        break;
                    }
                }
                thread::sleep(Duration::from_millis(100));
            }
        })
    }

    /// Generates the final [OptimizationResult] based on current measurements.
    pub fn generate_result(&self, is_partial: bool) -> OptimizationResult {
        let r_theta = self.engine.calculate_thermal_resistance(&self.profile);
        let knee = self.engine.find_silicon_knee(&self.profile);
        let max_t = self.engine.get_max_temp(&self.profile);

        OptimizationResult {
            profile: self.profile.clone(),
            silicon_knee_watts: knee,
            thermal_resistance_kw: r_theta,
            recommended_pl1: knee,
            recommended_pl2: knee * 1.25,
            max_temp_c: max_t,
            is_partial,
            config_paths: std::collections::HashMap::new(),
        }
    }

    /// Checks if the benchmark has been aborted by the user or the watchdog.
    fn check_abort(&self) -> Result<()> {
    fn check_safety_abort(&self) -> Result<()> {
        if self.emergency_abort.load(Ordering::SeqCst) {
            let reason = self.emergency_reason.lock().unwrap().clone().unwrap_or_else(|| "Unknown safety trigger".to_string());
            return Err(anyhow::anyhow!("EMERGENCY_ABORT: {}", reason));
            let reason = self.emergency_reason.lock().unwrap().clone().unwrap_or_else(|| "Watchdog".to_string());
            bail!("EMERGENCY_ABORT: {}", reason);
        }

        if let Ok(cmd) = self.command_rx.try_recv() {
            match cmd {
                UiCommand::Abort => {
                    return Err(anyhow::anyhow!("ABORTED"));
                }
            }
            if let UiCommand::Abort = cmd { bail!("ABORTED"); }
        }
        Ok(())
    }

    /// Helper to send log messages to the frontend.
    fn log(&self, msg: &str) -> Result<()> {
        let state = TelemetryState {
            cpu_model: self.cpu_model.clone(),
@@ -339,51 +344,38 @@ impl BenchmarkOrchestrator {
|
||||
power_w: self.sal.get_power_w().unwrap_or(0.0),
|
||||
current_freq: self.sal.get_freq_mhz().unwrap_or(0.0),
|
||||
fans: self.sal.get_fan_rpms().unwrap_or_default(),
|
||||
governor: "unknown".to_string(),
|
||||
pl1_limit: 0.0,
|
||||
pl2_limit: 0.0,
|
||||
fan_tier: "auto".to_string(),
|
||||
phase: self.phase,
|
||||
history_watts: Vec::new(),
|
||||
history_temp: Vec::new(),
|
||||
history_mhz: Vec::new(),
|
||||
governor: "performance".to_string(),
|
||||
pl1_limit: 0.0, pl2_limit: 0.0, fan_tier: "auto".to_string(),
|
||||
is_throttling: self.sal.get_throttling_status().unwrap_or(false),
|
||||
phase: self.ui_phase,
|
||||
history_watts: Vec::new(), history_temp: Vec::new(), history_mhz: Vec::new(),
|
||||
log_event: Some(msg.to_string()),
|
||||
metadata: std::collections::HashMap::new(),
|
||||
is_emergency: self.emergency_abort.load(Ordering::SeqCst),
|
||||
emergency_reason: self.emergency_reason.lock().unwrap().clone(),
|
||||
};
|
||||
self.telemetry_tx.send(state).map_err(|_| anyhow::anyhow!("Telemetry channel closed"))
|
||||
self.telemetry_tx.send(state).map_err(|_| anyhow::anyhow!("Channel closed"))
|
||||
}
|
||||
|
||||
/// Collects current sensors and sends a complete [TelemetryState] to the frontend.
|
||||
fn send_telemetry(&mut self, tick: u64) -> Result<()> {
|
||||
let temp = self.sal.get_temp().unwrap_or(0.0);
|
||||
let pwr = self.sal.get_power_w().unwrap_or(0.0);
|
||||
let freq = self.sal.get_freq_mhz().unwrap_or(0.0);
|
||||
|
||||
self.history_temp.push_back(temp);
|
||||
self.history_watts.push_back(pwr);
|
||||
self.history_mhz.push_back(freq);
|
||||
|
||||
if self.history_temp.len() > 120 {
|
||||
self.history_temp.pop_front();
|
||||
self.history_watts.pop_front();
|
||||
self.history_mhz.pop_front();
|
||||
}
|
||||
if self.history_temp.len() > 120 { self.history_temp.pop_front(); self.history_watts.pop_front(); self.history_mhz.pop_front(); }
|
||||
|
||||
let state = TelemetryState {
|
||||
cpu_model: self.cpu_model.clone(),
|
||||
total_ram_gb: self.total_ram_gb,
|
||||
tick,
|
||||
cpu_temp: temp,
|
||||
power_w: pwr,
|
||||
current_freq: freq,
|
||||
cpu_temp: temp, power_w: pwr, current_freq: freq,
|
||||
fans: self.sal.get_fan_rpms().unwrap_or_default(),
|
||||
governor: "performance".to_string(),
|
||||
pl1_limit: 15.0,
|
||||
pl2_limit: 25.0,
|
||||
fan_tier: "max".to_string(),
|
||||
phase: self.phase,
|
||||
pl1_limit: 15.0, pl2_limit: 25.0, fan_tier: "max".to_string(),
|
||||
is_throttling: self.sal.get_throttling_status().unwrap_or(false),
|
||||
phase: self.ui_phase,
|
||||
history_watts: self.history_watts.iter().cloned().collect(),
|
||||
history_temp: self.history_temp.iter().cloned().collect(),
|
||||
history_mhz: self.history_mhz.iter().cloned().collect(),
|
||||
@@ -392,6 +384,22 @@ impl BenchmarkOrchestrator {
|
||||
is_emergency: self.emergency_abort.load(Ordering::SeqCst),
|
||||
emergency_reason: self.emergency_reason.lock().unwrap().clone(),
|
||||
};
|
||||
self.telemetry_tx.send(state).map_err(|_| anyhow::anyhow!("Telemetry channel closed"))
|
||||
self.telemetry_tx.send(state).map_err(|_| anyhow::anyhow!("Channel closed"))
|
||||
}
|
||||
|
||||
pub fn generate_result(&self, is_partial: bool) -> OptimizationResult {
|
||||
let r_theta = self.profile.r_theta;
|
||||
let knee = self.engine.find_silicon_knee(&self.profile);
|
||||
OptimizationResult {
|
||||
profile: self.profile.clone(),
|
||||
silicon_knee_watts: knee,
|
||||
thermal_resistance_kw: r_theta,
|
||||
recommended_pl1: knee,
|
||||
recommended_pl2: knee * 1.25,
|
||||
max_temp_c: self.profile.points.iter().map(|p| p.temp_c).max_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal)).unwrap_or(0.0),
|
||||
is_partial,
|
||||
config_paths: std::collections::HashMap::new(),
|
||||
optimization_matrix: None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,35 +1,81 @@
|
||||
use super::traits::{PreflightAuditor, EnvironmentGuard, SensorBus, ActuatorBus, HardwareWatchdog, AuditError, AuditStep, SafetyStatus, EnvironmentCtx};
|
||||
use crate::sal::safety::{PowerLimitWatts, FanSpeedPercent};
|
||||
use anyhow::{Result, Context, anyhow};
|
||||
use std::fs;
|
||||
use std::path::{PathBuf};
|
||||
use std::time::{Duration, Instant};
|
||||
use std::thread;
|
||||
use std::sync::Mutex;
|
||||
use tracing::{debug};
|
||||
use tracing::{info, debug};
|
||||
use crate::sal::heuristic::discovery::SystemFactSheet;
|
||||
|
||||
/// Implementation of the System Abstraction Layer for the Dell XPS 13 9380.
|
||||
pub struct DellXps9380Sal {
|
||||
ctx: EnvironmentCtx,
|
||||
fact_sheet: SystemFactSheet,
|
||||
temp_path: PathBuf,
|
||||
pwr_path: PathBuf,
|
||||
fan_paths: Vec<PathBuf>,
|
||||
pwm_paths: Vec<PathBuf>,
|
||||
pwm_enable_paths: Vec<PathBuf>,
|
||||
pl1_paths: Vec<PathBuf>,
|
||||
pl2_paths: Vec<PathBuf>,
|
||||
freq_path: PathBuf,
|
||||
pl1_path: PathBuf,
|
||||
pl2_path: PathBuf,
|
||||
last_poll: Mutex<Instant>,
|
||||
last_temp: Mutex<f32>,
|
||||
last_fans: Mutex<Vec<u32>>,
|
||||
suppressed_services: Mutex<Vec<String>>,
|
||||
msr_file: Mutex<fs::File>,
|
||||
last_energy: Mutex<(u64, Instant)>,
|
||||
last_watts: Mutex<f32>,
|
||||
}
|
||||
|
||||
impl DellXps9380Sal {
|
||||
/// Initializes the Dell SAL, opening the MSR interface and discovering sensors and PWM nodes.
|
||||
pub fn init(ctx: EnvironmentCtx, facts: SystemFactSheet) -> Result<Self> {
|
||||
let temp_path = facts.temp_path.clone().context("Dell SAL requires temperature sensor")?;
|
||||
let pwr_base = facts.rapl_paths.first().cloned().context("Dell SAL requires RAPL interface")?;
|
||||
let fan_paths = facts.fan_paths.clone();
|
||||
|
||||
// 1. Discover PWM and Enable nodes associated with the fan paths
|
||||
let mut pwm_paths = Vec::new();
|
||||
let mut pwm_enable_paths = Vec::new();
|
||||
for fan_p in &fan_paths {
|
||||
if let Some(parent) = fan_p.parent() {
|
||||
let fan_file = fan_p.file_name().and_then(|n| n.to_str()).unwrap_or("");
|
||||
let fan_idx = fan_file.chars().filter(|c| c.is_ascii_digit()).collect::<String>();
|
||||
let idx = if fan_idx.is_empty() { "1".to_string() } else { fan_idx };
|
||||
|
||||
let pwm_p = parent.join(format!("pwm{}", idx));
|
||||
if pwm_p.exists() { pwm_paths.push(pwm_p); }
|
||||
|
||||
let enable_p = parent.join(format!("pwm{}_enable", idx));
|
||||
if enable_p.exists() { pwm_enable_paths.push(enable_p); }
|
||||
}
|
||||
}
|
||||
|
||||
// 2. Map all RAPL constraints
|
||||
let mut pl1_paths = Vec::new();
|
||||
let mut pl2_paths = Vec::new();
|
||||
for rapl_p in &facts.rapl_paths {
|
||||
pl1_paths.push(rapl_p.join("constraint_0_power_limit_uw"));
|
||||
pl2_paths.push(rapl_p.join("constraint_1_power_limit_uw"));
|
||||
}
|
||||
|
||||
// 3. Physical Sensor Verification & Warm Cache Priming
|
||||
let mut initial_fans = Vec::new();
|
||||
for fan_p in &fan_paths {
|
||||
let mut rpm = 0;
|
||||
for _ in 0..3 {
|
||||
if let Ok(val) = fs::read_to_string(fan_p) {
|
||||
rpm = val.trim().parse::<u32>().unwrap_or(0);
|
||||
if rpm > 0 { break; }
|
||||
}
|
||||
thread::sleep(Duration::from_millis(100));
|
||||
}
|
||||
info!("SAL Warm-Start: Fan sensor {:?} -> {} RPM", fan_p, rpm);
|
||||
initial_fans.push(rpm);
|
||||
}
|
||||
|
||||
let freq_path = ctx.sysfs_base.join("sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
|
||||
let msr_path = ctx.sysfs_base.join("dev/cpu/0/msr");
|
||||
|
||||
@@ -38,19 +84,24 @@ impl DellXps9380Sal {
|
||||
|
||||
let initial_energy = fs::read_to_string(pwr_base.join("energy_uj")).unwrap_or_default().trim().parse().unwrap_or(0);
|
||||
|
||||
info!("SAL: Dell XPS 9380 Initialized. ({} fans, {} RAPL nodes found)",
|
||||
fan_paths.len(), facts.rapl_paths.len());
|
||||
|
||||
Ok(Self {
|
||||
temp_path,
|
||||
pwr_path: pwr_base.join("power1_average"),
|
||||
fan_paths,
|
||||
pwm_paths,
|
||||
pwm_enable_paths,
|
||||
pl1_paths,
|
||||
pl2_paths,
|
||||
freq_path,
|
||||
pl1_path: pwr_base.join("constraint_0_power_limit_uw"),
|
||||
pl2_path: pwr_base.join("constraint_1_power_limit_uw"),
|
||||
last_poll: Mutex::new(Instant::now() - Duration::from_secs(2)),
|
||||
last_temp: Mutex::new(0.0),
|
||||
last_fans: Mutex::new(Vec::new()),
|
||||
suppressed_services: Mutex::new(Vec::new()),
|
||||
last_fans: Mutex::new(initial_fans),
|
||||
msr_file: Mutex::new(msr_file),
|
||||
last_energy: Mutex::new((initial_energy, Instant::now())),
|
||||
last_watts: Mutex::new(0.0),
|
||||
fact_sheet: facts,
|
||||
ctx,
|
||||
})
|
||||
@@ -80,14 +131,24 @@ impl PreflightAuditor for DellXps9380Sal {
|
||||
outcome: if unsafe { libc::getuid() } == 0 { Ok(()) } else { Err(AuditError::RootRequired) }
|
||||
});
|
||||
|
||||
let rapl_lock = match self.read_msr(0x610) {
|
||||
Ok(val) => {
|
||||
if (val & (1 << 63)) != 0 {
|
||||
Err(AuditError::KernelIncompatible("RAPL Registers are locked by BIOS. Power limit tuning is impossible.".to_string()))
|
||||
} else {
|
||||
Ok(())
|
||||
}
|
||||
},
|
||||
Err(e) => Err(AuditError::ToolMissing(format!("Cannot read MSR 0x610: {}", e))),
|
||||
};
|
||||
steps.push(AuditStep { description: "MSR 0x610 RAPL Lock Status".to_string(), outcome: rapl_lock });
|
||||
|
||||
let modules = ["dell_smm_hwmon", "msr", "intel_rapl_msr"];
|
||||
for mod_name in modules {
|
||||
let path = self.ctx.sysfs_base.join(format!("sys/module/{}", mod_name));
|
||||
steps.push(AuditStep {
|
||||
description: format!("Kernel Module: {}", mod_name),
|
||||
outcome: if path.exists() { Ok(()) } else {
|
||||
Err(AuditError::ToolMissing(format!("Module '{}' not loaded.", mod_name)))
|
||||
}
|
||||
outcome: if path.exists() { Ok(()) } else { Err(AuditError::ToolMissing(format!("Module '{}' not loaded.", mod_name))) }
|
||||
});
|
||||
}
|
||||
|
||||
@@ -109,15 +170,7 @@ impl PreflightAuditor for DellXps9380Sal {
|
||||
let ac_status = fs::read_to_string(ac_status_path).unwrap_or_else(|_| "0".to_string());
|
||||
steps.push(AuditStep {
|
||||
description: "AC Power Connection".to_string(),
|
||||
outcome: if ac_status.trim() == "1" { Ok(()) } else {
|
||||
Err(AuditError::AcPowerMissing("System must be on AC power".to_string()))
|
||||
}
|
||||
});
|
||||
|
||||
let tool_check = self.fact_sheet.paths.tools.contains_key("dell_fan_ctrl");
|
||||
steps.push(AuditStep {
|
||||
description: "Dell Fan Control Tool".to_string(),
|
||||
outcome: if tool_check { Ok(()) } else { Err(AuditError::ToolMissing("dell-bios-fan-control not found in PATH".to_string())) }
|
||||
outcome: if ac_status.trim() == "1" { Ok(()) } else { Err(AuditError::AcPowerMissing("System must be on AC power".to_string())) }
|
||||
});
|
||||
|
||||
Box::new(steps.into_iter())
|
||||
@@ -125,33 +178,16 @@ impl PreflightAuditor for DellXps9380Sal {
|
||||
}
|
||||
|
||||
impl EnvironmentGuard for DellXps9380Sal {
|
||||
fn suppress(&self) -> Result<()> {
|
||||
let services = ["tlp", "thermald", "i8kmon"];
|
||||
let mut suppressed = self.suppressed_services.lock().unwrap();
|
||||
for s in services {
|
||||
if self.ctx.runner.run("systemctl", &["is-active", "--quiet", s]).is_ok() {
|
||||
debug!("Suppressing service: {}", s);
|
||||
self.ctx.runner.run("systemctl", &["stop", s])?;
|
||||
suppressed.push(s.to_string());
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn restore(&self) -> Result<()> {
|
||||
let mut suppressed = self.suppressed_services.lock().unwrap();
|
||||
for s in suppressed.drain(..) {
|
||||
let _ = self.ctx.runner.run("systemctl", &["start", &s]);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
fn suppress(&self) -> Result<()> { Ok(()) }
|
||||
fn restore(&self) -> Result<()> { Ok(()) }
|
||||
}
|
||||
|
||||
impl SensorBus for DellXps9380Sal {
|
||||
fn get_temp(&self) -> Result<f32> {
|
||||
let mut last_poll = self.last_poll.lock().unwrap();
|
||||
let now = Instant::now();
|
||||
if now.duration_since(*last_poll) < Duration::from_millis(1000) {
|
||||
// # SAFETY: High frequency polling for watchdog
|
||||
if now.duration_since(*last_poll) < Duration::from_millis(100) {
|
||||
return Ok(*self.last_temp.lock().unwrap());
|
||||
}
|
||||
let s = fs::read_to_string(&self.temp_path)?;
|
||||
@@ -162,16 +198,24 @@ impl SensorBus for DellXps9380Sal {
|
||||
}
|
||||
|
||||
fn get_power_w(&self) -> Result<f32> {
|
||||
if self.pwr_path.to_string_lossy().contains("energy_uj") {
|
||||
let mut last = self.last_energy.lock().unwrap();
|
||||
let e2 = fs::read_to_string(&self.pwr_path)?.trim().parse::<u64>()?;
|
||||
let rapl_base = self.fact_sheet.rapl_paths.first().context("RAPL path error")?;
|
||||
let energy_path = rapl_base.join("energy_uj");
|
||||
|
||||
if energy_path.exists() {
|
||||
let mut last_energy = self.last_energy.lock().unwrap();
|
||||
let mut last_watts = self.last_watts.lock().unwrap();
|
||||
|
||||
let e2_str = fs::read_to_string(&energy_path)?;
|
||||
let e2 = e2_str.trim().parse::<u64>()?;
|
||||
let t2 = Instant::now();
|
||||
let (e1, t1) = *last;
|
||||
let (e1, t1) = *last_energy;
|
||||
let delta_e = e2.wrapping_sub(e1);
|
||||
let delta_t = t2.duration_since(t1).as_secs_f32();
|
||||
*last = (e2, t2);
|
||||
if delta_t < 0.01 { return Ok(0.0); }
|
||||
Ok((delta_e as f32 / 1_000_000.0) / delta_t)
|
||||
if delta_t < 0.1 { return Ok(*last_watts); }
|
||||
let watts = (delta_e as f32 / 1_000_000.0) / delta_t;
|
||||
*last_energy = (e2, t2);
|
||||
*last_watts = watts;
|
||||
Ok(watts)
|
||||
} else {
|
||||
let s = fs::read_to_string(&self.pwr_path)?;
|
||||
Ok(s.trim().parse::<f32>()? / 1000000.0)
|
||||
@@ -184,12 +228,27 @@ impl SensorBus for DellXps9380Sal {
|
||||
if now.duration_since(*last_poll) < Duration::from_millis(1000) {
|
||||
return Ok(self.last_fans.lock().unwrap().clone());
|
||||
}
|
||||
|
||||
let mut fans = Vec::new();
|
||||
for path in &self.fan_paths {
|
||||
if let Ok(s) = fs::read_to_string(path) {
|
||||
if let Ok(rpm) = s.trim().parse::<u32>() { fans.push(rpm); }
|
||||
let mut val = 0;
|
||||
for i in 0..5 {
|
||||
match fs::read_to_string(path) {
|
||||
Ok(s) => {
|
||||
if let Ok(rpm) = s.trim().parse::<u32>() {
|
||||
val = rpm;
|
||||
if rpm > 0 { break; }
|
||||
}
|
||||
},
|
||||
Err(e) => {
|
||||
debug!("SAL: Fan poll retry {} for {:?} failed: {}", i+1, path, e);
|
||||
}
|
||||
}
|
||||
thread::sleep(Duration::from_millis(150));
|
||||
}
|
||||
fans.push(val);
|
||||
}
|
||||
|
||||
*self.last_fans.lock().unwrap() = fans.clone();
|
||||
*last_poll = now;
|
||||
Ok(fans)
|
||||
@@ -199,6 +258,11 @@ impl SensorBus for DellXps9380Sal {
|
||||
let s = fs::read_to_string(&self.freq_path)?;
|
||||
Ok(s.trim().parse::<f32>()? / 1000.0)
|
||||
}
|
||||
|
||||
fn get_throttling_status(&self) -> Result<bool> {
|
||||
let val = self.read_msr(0x19C)?;
|
||||
Ok((val & 0x1) != 0)
|
||||
}
|
||||
}
|
||||
|
||||
impl ActuatorBus for DellXps9380Sal {
|
||||
@@ -208,20 +272,47 @@ impl ActuatorBus for DellXps9380Sal {
|
||||
let tool_str = tool_path.to_string_lossy();
|
||||
|
||||
match mode {
|
||||
"max" | "Manual" => { self.ctx.runner.run(&tool_str, &["0"])?; }
|
||||
"max" | "Manual" => {
|
||||
self.ctx.runner.run(&tool_str, &["0"])?;
|
||||
// Disabling BIOS control requires immediate PWM override
|
||||
self.set_fan_speed(FanSpeedPercent::new(100)?)?;
|
||||
}
|
||||
"auto" | "Auto" => { self.ctx.runner.run(&tool_str, &["1"])?; }
|
||||
_ => { debug!("Unknown fan mode: {}", mode); }
|
||||
_ => {}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn set_sustained_power_limit(&self, watts: f32) -> Result<()> {
|
||||
fs::write(&self.pl1_path, ((watts * 1_000_000.0) as u64).to_string())?;
|
||||
fn set_fan_speed(&self, speed: FanSpeedPercent) -> Result<()> {
|
||||
let pwm_val = ((speed.get() as u32 * 255) / 100) as u8;
|
||||
for p in &self.pwm_enable_paths { let _ = fs::write(p, "1"); }
|
||||
for path in &self.pwm_paths { let _ = fs::write(path, pwm_val.to_string()); }
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn set_burst_power_limit(&self, watts: f32) -> Result<()> {
|
||||
fs::write(&self.pl2_path, ((watts * 1_000_000.0) as u64).to_string())?;
|
||||
fn set_sustained_power_limit(&self, limit: PowerLimitWatts) -> Result<()> {
|
||||
for path in &self.pl1_paths {
|
||||
debug!("SAL: Applying PL1 ({:.1}W) to {:?}", limit.get(), path);
|
||||
fs::write(path, limit.as_microwatts().to_string())
|
||||
.with_context(|| format!("Failed to write PL1 to {:?}", path))?;
|
||||
if let Some(parent) = path.parent() {
|
||||
let enable_p = parent.join("constraint_0_enabled");
|
||||
let _ = fs::write(&enable_p, "1");
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn set_burst_power_limit(&self, limit: PowerLimitWatts) -> Result<()> {
|
||||
for path in &self.pl2_paths {
|
||||
debug!("SAL: Applying PL2 ({:.1}W) to {:?}", limit.get(), path);
|
||||
fs::write(path, limit.as_microwatts().to_string())
|
||||
.with_context(|| format!("Failed to write PL2 to {:?}", path))?;
|
||||
if let Some(parent) = path.parent() {
|
||||
let enable_p = parent.join("constraint_1_enabled");
|
||||
let _ = fs::write(&enable_p, "1");
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
@@ -243,7 +334,5 @@ impl HardwareWatchdog for DellXps9380Sal {
|
||||
}
|
||||
|
||||
impl Drop for DellXps9380Sal {
|
||||
fn drop(&mut self) {
|
||||
let _ = self.restore();
|
||||
}
|
||||
fn drop(&mut self) { }
|
||||
}
|
||||
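The `get_power_w` change above derives watts from two successive `energy_uj` readings and returns the cached value when the sampling window is too short. A minimal sketch of that calculation follows, assuming a hypothetical `EnergyMeter` helper that is not part of the diff; timestamps are passed in so the arithmetic can be checked without touching sysfs.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not in the diff): remembers the previous energy
/// sample and converts counter deltas into average watts.
pub struct EnergyMeter {
    last_energy_uj: u64,
    last_time: Instant,
    last_watts: f32,
    min_window: Duration,
}

impl EnergyMeter {
    pub fn new(initial_energy_uj: u64, now: Instant, min_window: Duration) -> Self {
        Self {
            last_energy_uj: initial_energy_uj,
            last_time: now,
            last_watts: 0.0,
            min_window,
        }
    }

    /// Feeds a new counter reading taken at `now`.
    /// If less than `min_window` has elapsed, the cached value is returned
    /// and the stored sample is left untouched, as in the diff.
    pub fn sample(&mut self, energy_uj: u64, now: Instant) -> f32 {
        let dt = now.duration_since(self.last_time);
        if dt < self.min_window {
            return self.last_watts;
        }
        // Same wrapping subtraction as the diff. This only guards against
        // u64 underflow; it does not model a counter that wraps earlier.
        let delta_uj = energy_uj.wrapping_sub(self.last_energy_uj);
        let watts = (delta_uj as f32 / 1_000_000.0) / dt.as_secs_f32();
        self.last_energy_uj = energy_uj;
        self.last_time = now;
        self.last_watts = watts;
        watts
    }
}
```

For example, a counter that advances by 15,000,000 µJ over one second yields 15 W, and a follow-up sample 50 ms later with a 100 ms window returns the cached 15 W.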
|
||||
148
src/sal/discovery.rs
Normal file
@@ -0,0 +1,148 @@
|
||||
//! # Hardware Discovery Engine (Agent Sentinel)
|
||||
//!
|
||||
//! This module provides dynamic traversal of `/sys/class/hwmon` and `/sys/class/powercap`
|
||||
//! to locate sensors and actuators without relying on hardcoded indices.
|
||||
|
||||
use anyhow::{Result, Context, anyhow};
|
||||
use std::fs;
|
||||
use std::path::{Path, PathBuf};
|
||||
use tracing::{debug, info, warn};
|
||||
|
||||
/// Result of a successful hardware discovery.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct DiscoveredHardware {
|
||||
/// Path to the primary package temperature sensor input.
|
||||
pub temp_input: PathBuf,
|
||||
/// Paths to all detected fan RPM inputs.
|
||||
pub fan_inputs: Vec<PathBuf>,
|
||||
/// Paths to all detected fan PWM control nodes.
|
||||
pub pwm_controls: Vec<PathBuf>,
|
||||
/// Paths to all detected fan PWM enable nodes.
|
||||
pub pwm_enables: Vec<PathBuf>,
|
||||
/// Paths to RAPL powercap zone directories (the constraint files live beneath them).
|
||||
pub rapl_paths: Vec<PathBuf>,
|
||||
}
|
||||
|
||||
pub struct DiscoveryEngine;
|
||||
|
||||
impl DiscoveryEngine {
|
||||
/// Performs a full traversal of the sysfs hardware tree.
|
||||
pub fn run(sysfs_root: &Path) -> Result<DiscoveredHardware> {
|
||||
info!("Sentinel: Starting dynamic hardware discovery...");
|
||||
|
||||
let hwmon_path = sysfs_root.join("sys/class/hwmon");
|
||||
let (temp_input, fan_info) = Self::discover_hwmon(&hwmon_path)?;
|
||||
|
||||
let powercap_path = sysfs_root.join("sys/class/powercap");
|
||||
let rapl_paths = Self::discover_rapl(&powercap_path)?;
|
||||
|
||||
let hardware = DiscoveredHardware {
|
||||
temp_input,
|
||||
fan_inputs: fan_info.rpm_inputs,
|
||||
pwm_controls: fan_info.pwm_controls,
|
||||
pwm_enables: fan_info.pwm_enables,
|
||||
rapl_paths,
|
||||
};
|
||||
|
||||
info!("Sentinel: Discovery complete. Found {} fans and {} RAPL nodes.",
|
||||
hardware.fan_inputs.len(), hardware.rapl_paths.len());
|
||||
|
||||
Ok(hardware)
|
||||
}
|
||||
|
||||
fn discover_hwmon(base: &Path) -> Result<(PathBuf, FanHardware)> {
|
||||
let mut best_temp: Option<(u32, PathBuf)> = None;
|
||||
let mut fans = FanHardware::default();
|
||||
|
||||
let entries = fs::read_dir(base)
|
||||
.with_context(|| format!("Failed to read hwmon base: {:?}", base))?;
|
||||
|
||||
for entry in entries.flatten() {
|
||||
let path = entry.path();
|
||||
let driver_name = fs::read_to_string(path.join("name"))
|
||||
.map(|s| s.trim().to_string())
|
||||
.unwrap_or_else(|_| "unknown".to_string());
|
||||
|
||||
debug!("Discovery: Probing hwmon node {:?} (driver: {})", path, driver_name);
|
||||
|
||||
// 1. Temperature Discovery
|
||||
let temp_priority = match driver_name.as_str() {
|
||||
"coretemp" | "zenpower" => 10,
|
||||
"k10temp" => 9,
|
||||
"dell_smm" => 8,
|
||||
"acpitz" => 1,
|
||||
_ => 5,
|
||||
};
|
||||
|
||||
if let Ok(hw_entries) = fs::read_dir(&path) {
|
||||
for hw_entry in hw_entries.flatten() {
|
||||
let file_name = hw_entry.file_name().to_string_lossy().to_string();
|
||||
|
||||
// Temperature Inputs
|
||||
if file_name.starts_with("temp") && file_name.ends_with("_input") {
|
||||
let label_path = path.join(file_name.replace("_input", "_label"));
|
||||
let label = fs::read_to_string(label_path).unwrap_or_default().trim().to_string();
|
||||
|
||||
let label_priority = if label.contains("Package") || label.contains("Tdie") {
|
||||
2
|
||||
} else {
|
||||
0
|
||||
};
|
||||
|
||||
let total_priority = temp_priority + label_priority;
|
||||
if best_temp.is_none() || total_priority > best_temp.as_ref().unwrap().0 {
|
||||
best_temp = Some((total_priority, hw_entry.path()));
|
||||
}
|
||||
}
|
||||
|
||||
// Fan Inputs
|
||||
if file_name.starts_with("fan") && file_name.ends_with("_input") {
|
||||
fans.rpm_inputs.push(hw_entry.path());
|
||||
}
|
||||
|
||||
// PWM Controls
|
||||
if file_name.starts_with("pwm") && !file_name.contains("_") {
|
||||
fans.pwm_controls.push(hw_entry.path());
|
||||
}
|
||||
|
||||
// PWM Enables
|
||||
if file_name.starts_with("pwm") && file_name.ends_with("_enable") {
|
||||
fans.pwm_enables.push(hw_entry.path());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let temp_input = best_temp.map(|(_, p)| p)
|
||||
.ok_or_else(|| anyhow!("Failed to locate any valid temperature sensor in /sys/class/hwmon/"))?;
|
||||
|
||||
Ok((temp_input, fans))
|
||||
}
|
||||
|
||||
fn discover_rapl(base: &Path) -> Result<Vec<PathBuf>> {
|
||||
let mut paths = Vec::new();
|
||||
if !base.exists() {
|
||||
warn!("Discovery: /sys/class/powercap does not exist.");
|
||||
return Ok(paths);
|
||||
}
|
||||
|
||||
let entries = fs::read_dir(base)?;
|
||||
for entry in entries.flatten() {
|
||||
let path = entry.path();
|
||||
let name = fs::read_to_string(path.join("name")).unwrap_or_default().trim().to_string();
|
||||
|
||||
if name.contains("package") || name.contains("intel-rapl") {
|
||||
paths.push(path);
|
||||
}
|
||||
}
|
||||
|
||||
Ok(paths)
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Default)]
|
||||
struct FanHardware {
|
||||
rpm_inputs: Vec<PathBuf>,
|
||||
pwm_controls: Vec<PathBuf>,
|
||||
pwm_enables: Vec<PathBuf>,
|
||||
}
|
||||
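The temperature selection in `discover_hwmon` above adds a driver priority to a label bonus and keeps the first candidate with the highest total. A minimal sketch of that scoring in isolation, assuming a hypothetical `TempCandidate` record and plain string paths in place of the sysfs traversal:

```rust
/// Hypothetical candidate record (not in the diff): one `tempN_input` file
/// together with its hwmon driver name and optional label.
pub struct TempCandidate<'a> {
    pub driver: &'a str,
    pub label: &'a str,
    pub path: &'a str,
}

/// Driver priority plus label bonus, using the same values as the diff.
pub fn score(driver: &str, label: &str) -> u32 {
    let base = match driver {
        "coretemp" | "zenpower" => 10,
        "k10temp" => 9,
        "dell_smm" => 8,
        "acpitz" => 1,
        _ => 5,
    };
    let bonus = if label.contains("Package") || label.contains("Tdie") {
        2
    } else {
        0
    };
    base + bonus
}

/// Returns the path of the highest-scoring candidate.
/// On a tie the earlier candidate wins, because only a strictly higher
/// score replaces the current best.
pub fn pick_best<'a>(candidates: &[TempCandidate<'a>]) -> Option<&'a str> {
    let mut best: Option<(u32, &'a str)> = None;
    for c in candidates {
        let s = score(c.driver, c.label);
        if best.map_or(true, |(b, _)| s > b) {
            best = Some((s, c.path));
        }
    }
    best.map(|(_, p)| p)
}
```

With an `acpitz` zone, a `coretemp` core sensor, and a `coretemp` package sensor as inputs, the package sensor is selected because its label bonus lifts it above the per-core reading.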
@@ -1,10 +1,11 @@
|
||||
use anyhow::{Result, anyhow};
|
||||
use anyhow::{Result, anyhow, Context};
|
||||
use std::path::{Path};
|
||||
use std::fs;
|
||||
use std::time::{Duration, Instant};
|
||||
use std::sync::Mutex;
|
||||
|
||||
use crate::sal::traits::{SensorBus, ActuatorBus, EnvironmentGuard, HardwareWatchdog, PreflightAuditor, AuditStep, AuditError, SafetyStatus, EnvironmentCtx};
|
||||
use crate::sal::safety::{PowerLimitWatts, FanSpeedPercent};
|
||||
use crate::sal::heuristic::discovery::SystemFactSheet;
|
||||
use crate::sal::heuristic::schema::HardwareDb;
|
||||
|
||||
@@ -12,9 +13,8 @@ pub struct GenericLinuxSal {
|
||||
ctx: EnvironmentCtx,
|
||||
fact_sheet: SystemFactSheet,
|
||||
db: HardwareDb,
|
||||
suppressed_services: Mutex<Vec<String>>,
|
||||
last_valid_temp: Mutex<(f32, Instant)>,
|
||||
current_pl1: Mutex<f32>,
|
||||
current_pl1: Mutex<u64>,
|
||||
last_energy: Mutex<(u64, Instant)>,
|
||||
}
|
||||
|
||||
@@ -28,9 +28,8 @@ impl GenericLinuxSal {
|
||||
|
||||
Self {
|
||||
db,
|
||||
suppressed_services: Mutex::new(Vec::new()),
|
||||
last_valid_temp: Mutex::new((0.0, Instant::now())),
|
||||
current_pl1: Mutex::new(15.0),
|
||||
current_pl1: Mutex::new(15_000_000),
|
||||
last_energy: Mutex::new((initial_energy, Instant::now())),
|
||||
fact_sheet: facts,
|
||||
ctx,
|
||||
@@ -95,7 +94,7 @@ impl SensorBus for GenericLinuxSal {
|
||||
let delta_e = e2.wrapping_sub(e1);
|
||||
let delta_t = t2.duration_since(t1).as_secs_f32();
|
||||
*last = (e2, t2);
|
||||
if delta_t < 0.01 { return Ok(0.0); }
|
||||
if delta_t < 0.05 { return Ok(0.0); }
|
||||
Ok((delta_e as f32 / 1_000_000.0) / delta_t)
|
||||
}
|
||||
|
||||
@@ -126,6 +125,22 @@ impl SensorBus for GenericLinuxSal {
|
||||
Err(anyhow!("Could not determine CPU frequency"))
|
||||
}
|
||||
}
|
||||
|
||||
fn get_throttling_status(&self) -> Result<bool> {
|
||||
let cooling_base = self.ctx.sysfs_base.join("sys/class/thermal");
|
||||
if let Ok(entries) = fs::read_dir(cooling_base) {
|
||||
for entry in entries.flatten() {
|
||||
if entry.file_name().to_string_lossy().starts_with("cooling_device") {
|
||||
if let Ok(state) = fs::read_to_string(entry.path().join("cur_state")) {
|
||||
if state.trim().parse::<u32>().unwrap_or(0) > 0 {
|
||||
return Ok(true);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(false)
|
||||
}
|
||||
}
|
||||
|
||||
impl ActuatorBus for GenericLinuxSal {
|
||||
@@ -144,44 +159,37 @@ impl ActuatorBus for GenericLinuxSal {
|
||||
} else { Ok(()) }
|
||||
}
|
||||
|
||||
fn set_sustained_power_limit(&self, watts: f32) -> Result<()> {
|
||||
let rapl_path = self.fact_sheet.rapl_paths.first().ok_or_else(|| anyhow!("No PL1 path"))?;
|
||||
fs::write(rapl_path.join("constraint_0_power_limit_uw"), ((watts * 1_000_000.0) as u64).to_string())?;
|
||||
*self.current_pl1.lock().unwrap() = watts;
|
||||
fn set_fan_speed(&self, _speed: FanSpeedPercent) -> Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn set_burst_power_limit(&self, watts: f32) -> Result<()> {
|
||||
let rapl_path = self.fact_sheet.rapl_paths.first().ok_or_else(|| anyhow!("No PL2 path"))?;
|
||||
fs::write(rapl_path.join("constraint_1_power_limit_uw"), ((watts * 1_000_000.0) as u64).to_string())?;
|
||||
fn set_sustained_power_limit(&self, limit: PowerLimitWatts) -> Result<()> {
|
||||
for rapl_path in &self.fact_sheet.rapl_paths {
|
||||
let limit_path = rapl_path.join("constraint_0_power_limit_uw");
|
||||
let enable_path = rapl_path.join("constraint_0_enabled");
|
||||
fs::write(&limit_path, limit.as_microwatts().to_string())
|
||||
.with_context(|| format!("Failed to write PL1 to {:?}", limit_path))?;
|
||||
let _ = fs::write(&enable_path, "1");
|
||||
}
|
||||
*self.current_pl1.lock().unwrap() = limit.as_microwatts();
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn set_burst_power_limit(&self, limit: PowerLimitWatts) -> Result<()> {
|
||||
for rapl_path in &self.fact_sheet.rapl_paths {
|
||||
let limit_path = rapl_path.join("constraint_1_power_limit_uw");
|
||||
let enable_path = rapl_path.join("constraint_1_enabled");
|
||||
fs::write(&limit_path, limit.as_microwatts().to_string())
|
||||
.with_context(|| format!("Failed to write PL2 to {:?}", limit_path))?;
|
||||
let _ = fs::write(&enable_path, "1");
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl EnvironmentGuard for GenericLinuxSal {
|
||||
fn suppress(&self) -> Result<()> {
|
||||
let mut suppressed = self.suppressed_services.lock().unwrap();
|
||||
for conflict_id in &self.fact_sheet.active_conflicts {
|
||||
if let Some(conflict) = self.db.conflicts.iter().find(|c| &c.id == conflict_id) {
|
||||
for service in &conflict.services {
|
||||
if self.ctx.runner.run("systemctl", &["is-active", "--quiet", service]).is_ok() {
|
||||
self.ctx.runner.run("systemctl", &["stop", service])?;
|
||||
suppressed.push(service.clone());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn restore(&self) -> Result<()> {
|
||||
let mut suppressed = self.suppressed_services.lock().unwrap();
|
||||
for service in suppressed.drain(..) {
|
||||
let _ = self.ctx.runner.run("systemctl", &["start", &service]);
|
||||
}
|
||||
if self.is_dell() { let _ = self.set_fan_mode("auto"); }
|
||||
Ok(())
|
||||
}
|
||||
fn suppress(&self) -> Result<()> { Ok(()) }
|
||||
fn restore(&self) -> Result<()> { Ok(()) }
|
||||
}
|
||||
|
||||
impl HardwareWatchdog for GenericLinuxSal {
|
||||
@@ -197,7 +205,3 @@ impl HardwareWatchdog for GenericLinuxSal {
|
||||
Ok(SafetyStatus::Nominal)
|
||||
}
|
||||
}
|
||||
|
||||
impl Drop for GenericLinuxSal {
|
||||
fn drop(&mut self) { let _ = self.restore(); }
|
||||
}
|
||||
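The actuator methods above now take `PowerLimitWatts` and `FanSpeedPercent` instead of raw `f32` values. Their definitions live in `crate::sal::safety` and are not shown in this part of the diff, so the following is only a sketch of what such validated newtypes could look like. The bounds, the `String` error type, and the `as_pwm` helper are assumptions made for illustration; only `new`, `get`, and `as_microwatts` are names that appear in the diff.

```rust
/// Sketch of a validated power limit. The real type is not shown here.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PowerLimitWatts(f32);

impl PowerLimitWatts {
    // Assumed bounds, chosen only so the example has something to check.
    pub const MIN: f32 = 1.0;
    pub const MAX: f32 = 100.0;

    pub fn new(watts: f32) -> Result<Self, String> {
        if watts.is_finite() && (Self::MIN..=Self::MAX).contains(&watts) {
            Ok(Self(watts))
        } else {
            Err(format!("power limit {} W is out of range", watts))
        }
    }

    pub fn get(self) -> f32 {
        self.0
    }

    /// Value in the unit expected by `constraint_N_power_limit_uw`.
    pub fn as_microwatts(self) -> u64 {
        (self.0 * 1_000_000.0) as u64
    }
}

/// Sketch of a validated fan duty cycle in percent.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FanSpeedPercent(u8);

impl FanSpeedPercent {
    pub fn new(percent: u8) -> Result<Self, String> {
        if percent <= 100 {
            Ok(Self(percent))
        } else {
            Err(format!("fan speed {}% is out of range", percent))
        }
    }

    pub fn get(self) -> u8 {
        self.0
    }

    /// Hypothetical helper: scales 0..=100 to the 0..=255 hwmon PWM range,
    /// using the same integer arithmetic as `set_fan_speed` in the diff.
    pub fn as_pwm(self) -> u8 {
        ((self.0 as u32 * 255) / 100) as u8
    }
}
```

Rejecting out-of-range or non-finite values at construction time means the write paths never have to re-validate before touching sysfs.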
|
||||
@@ -1,12 +1,12 @@
|
||||
use std::fs;
|
||||
use std::path::{Path, PathBuf};
|
||||
use std::process::Command;
|
||||
use std::time::{Duration};
|
||||
use std::thread;
|
||||
use std::sync::mpsc;
|
||||
use std::collections::HashMap;
|
||||
use crate::sal::heuristic::schema::{SensorDiscovery, ActuatorDiscovery, Conflict, Discovery, Benchmarking};
|
||||
use tracing::{debug, warn};
|
||||
use crate::sys::SyscallRunner;
|
||||
use tracing::{debug, warn, info};
|
||||
|
||||
/// Registry of dynamically discovered paths for configs and tools.
|
||||
#[derive(Debug, Clone, Default)]
|
||||
@@ -24,6 +24,7 @@ pub struct SystemFactSheet {
|
||||
pub fan_paths: Vec<PathBuf>,
|
||||
pub rapl_paths: Vec<PathBuf>,
|
||||
pub active_conflicts: Vec<String>,
|
||||
pub conflict_services: Vec<String>,
|
||||
pub paths: PathRegistry,
|
||||
pub bench_config: Option<Benchmarking>,
|
||||
}
|
||||
@@ -31,6 +32,7 @@ pub struct SystemFactSheet {
|
||||
/// Probes the system for hardware sensors, actuators, service conflicts, and paths.
|
||||
pub fn discover_facts(
|
||||
base_path: &Path,
|
||||
runner: &dyn SyscallRunner,
|
||||
discovery: &Discovery,
|
||||
conflicts: &[Conflict],
|
||||
bench_config: Benchmarking,
|
||||
@@ -43,12 +45,17 @@ pub fn discover_facts(
|
||||
let rapl_paths = discover_rapl(base_path, &discovery.actuators);
|
||||
|
||||
let mut active_conflicts = Vec::new();
|
||||
let mut conflict_services = Vec::new();
|
||||
for conflict in conflicts {
|
||||
let mut found_active = false;
|
||||
for service in &conflict.services {
|
||||
if is_service_active(service) {
|
||||
debug!("Detected active conflict: {} (Service: {})", conflict.id, service);
|
||||
active_conflicts.push(conflict.id.clone());
|
||||
break;
|
||||
if is_service_active(runner, service) {
|
||||
if !found_active {
|
||||
debug!("Detected active conflict: {} (Service: {})", conflict.id, service);
|
||||
active_conflicts.push(conflict.id.clone());
|
||||
found_active = true;
|
||||
}
|
||||
conflict_services.push(service.clone());
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -56,13 +63,7 @@ pub fn discover_facts(
|
||||
let paths = discover_paths(base_path, discovery);
|
||||
|
||||
SystemFactSheet {
|
||||
vendor,
|
||||
model,
|
||||
temp_path,
|
||||
fan_paths,
|
||||
rapl_paths,
|
||||
active_conflicts,
|
||||
paths,
|
||||
vendor, model, temp_path, fan_paths, rapl_paths, active_conflicts, conflict_services, paths,
|
||||
bench_config: Some(bench_config),
|
||||
}
|
||||
}
|
||||
@@ -70,7 +71,6 @@ pub fn discover_facts(
|
||||
fn discover_paths(base_path: &Path, discovery: &Discovery) -> PathRegistry {
|
||||
let mut registry = PathRegistry::default();
|
||||
|
||||
// 1. Discover Tools via PATH
|
||||
for (id, binary_name) in &discovery.tools {
|
||||
if let Ok(path) = which::which(binary_name) {
|
||||
debug!("Discovered tool: {} -> {:?}", id, path);
|
||||
@@ -78,7 +78,6 @@ fn discover_paths(base_path: &Path, discovery: &Discovery) -> PathRegistry {
|
||||
}
|
||||
}
|
||||
|
||||
// 2. Discover Configs via existence check
|
||||
for (id, candidates) in &discovery.configs {
|
||||
for candidate in candidates {
|
||||
let path = if candidate.starts_with('/') {
|
||||
@@ -93,7 +92,6 @@ fn discover_paths(base_path: &Path, discovery: &Discovery) -> PathRegistry {
|
||||
break;
|
||||
}
|
||||
}
|
||||
// If not found, use the first one as default if any exist
|
||||
if !registry.configs.contains_key(id) {
|
||||
if let Some(first) = candidates.first() {
|
||||
registry.configs.insert(id.clone(), PathBuf::from(first));
|
||||
@@ -104,12 +102,11 @@ fn discover_paths(base_path: &Path, discovery: &Discovery) -> PathRegistry {
|
||||
registry
|
||||
}
|
||||
|
||||
/// Reads DMI information from sysfs with a safety timeout.
|
||||
fn read_dmi_info(base_path: &Path) -> (String, String) {
|
||||
let vendor = read_sysfs_with_timeout(&base_path.join("sys/class/dmi/id/sys_vendor"), Duration::from_millis(100))
|
||||
.unwrap_or_else(|| "Unknown".to_string());
|
||||
let model = read_sysfs_with_timeout(&base_path.join("sys/class/dmi/id/product_name"), Duration::from_millis(100))
|
||||
.unwrap_or_else(|| "Unknown".to_string());
|
||||
let vendor = fs::read_to_string(base_path.join("sys/class/dmi/id/sys_vendor"))
|
||||
.map(|s| s.trim().to_string()).unwrap_or_else(|_| "Unknown".to_string());
|
||||
let model = fs::read_to_string(base_path.join("sys/class/dmi/id/product_name"))
|
||||
.map(|s| s.trim().to_string()).unwrap_or_else(|_| "Unknown".to_string());
|
||||
(vendor, model)
|
||||
}
|
||||
|
||||
@@ -119,51 +116,62 @@ fn discover_hwmon(base_path: &Path, cfg: &SensorDiscovery) -> (Option<PathBuf>,
|
||||
let mut fan_candidates = Vec::new();
|
||||
|
||||
let hwmon_base = base_path.join("sys/class/hwmon");
|
||||
let entries = match fs::read_dir(&hwmon_base) {
|
||||
Ok(e) => e,
|
||||
Err(e) => {
|
||||
warn!("Could not read {:?}: {}", hwmon_base, e);
|
||||
return (None, Vec::new());
|
||||
}
|
||||
};
|
||||
let entries = fs::read_dir(&hwmon_base).map_err(|e| {
|
||||
warn!("Could not read {:?}: {}", hwmon_base, e);
|
||||
e
|
||||
}).ok();
|
||||
|
||||
for entry in entries.flatten() {
|
||||
let hwmon_path = entry.path();
|
||||
if let Some(entries) = entries {
|
||||
for entry in entries.flatten() {
|
||||
let hwmon_path = entry.path();
|
||||
|
||||
let driver_name = read_sysfs_with_timeout(&hwmon_path.join("name"), Duration::from_millis(100))
|
||||
.unwrap_or_default();
|
||||
// # SAFETY: Read driver name directly. This file is virtual and never blocks.
|
||||
// Using a timeout wrapper here was causing discovery to fail if the thread-pool lagged.
|
||||
let driver_name = fs::read_to_string(hwmon_path.join("name"))
|
||||
.map(|s| s.trim().to_string()).unwrap_or_default();
|
||||
|
||||
let priority = cfg.hwmon_priority
|
||||
.iter()
|
||||
.position(|p| p == &driver_name)
|
||||
.unwrap_or(usize::MAX);
|
||||
let priority = cfg.hwmon_priority
|
||||
.iter()
|
||||
.position(|p| driver_name.contains(p))
|
||||
.unwrap_or(usize::MAX);
|
||||
|
||||
if let Ok(hw_entries) = fs::read_dir(&hwmon_path) {
|
||||
for hw_entry in hw_entries.flatten() {
|
||||
let file_name = hw_entry.file_name().into_string().unwrap_or_default();
|
||||
if let Ok(hw_entries) = fs::read_dir(&hwmon_path) {
|
||||
for hw_entry in hw_entries.flatten() {
|
||||
let file_name = hw_entry.file_name().into_string().unwrap_or_default();
|
||||
|
||||
// Temperature Sensors
|
||||
if file_name.starts_with("temp") && file_name.ends_with("_label") {
|
||||
if let Some(label) = read_sysfs_with_timeout(&hw_entry.path(), Duration::from_millis(100)) {
|
||||
if cfg.temp_labels.iter().any(|l| label.contains(l)) {
|
||||
let input_path = hwmon_path.join(file_name.replace("_label", "_input"));
|
||||
if input_path.exists() {
|
||||
temp_candidates.push((priority, input_path));
|
||||
// 1. Temperatures
|
||||
if file_name.starts_with("temp") && file_name.ends_with("_label") {
|
||||
if let Some(label) = read_sysfs_with_timeout(&hw_entry.path(), Duration::from_millis(500)) {
|
||||
if cfg.temp_labels.iter().any(|l| label.contains(l)) {
|
||||
let input_path = hwmon_path.join(file_name.replace("_label", "_input"));
|
||||
if input_path.exists() {
|
||||
temp_candidates.push((priority, input_path));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Fan Sensors
|
||||
if file_name.starts_with("fan") && file_name.ends_with("_label") {
|
||||
if let Some(label) = read_sysfs_with_timeout(&hw_entry.path(), Duration::from_millis(100)) {
|
||||
if cfg.fan_labels.iter().any(|l| label.contains(l)) {
|
||||
let input_path = hwmon_path.join(file_name.replace("_label", "_input"));
|
||||
if input_path.exists() {
|
||||
fan_candidates.push((priority, input_path));
|
||||
// 2. Fans (Label Match)
|
||||
if file_name.starts_with("fan") && file_name.ends_with("_label") {
|
||||
if let Some(label) = read_sysfs_with_timeout(&hw_entry.path(), Duration::from_millis(500)) {
|
||||
if cfg.fan_labels.iter().any(|l| label.contains(l)) {
|
||||
let input_path = hwmon_path.join(file_name.replace("_label", "_input"));
|
||||
if input_path.exists() {
|
||||
debug!("Discovered fan by label: {:?} (priority {})", input_path, priority);
|
||||
fan_candidates.push((priority, input_path));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 3. Fans (Priority Fallback - CRITICAL FOR DELL 9380)
|
||||
// If we found a priority driver (e.g., dell_smm), we take every fan*_input we find.
|
||||
if priority < usize::MAX && file_name.starts_with("fan") && file_name.ends_with("_input") {
|
||||
if !fan_candidates.iter().any(|(_, p)| p == &hw_entry.path()) {
|
||||
info!("Heuristic Discovery: Force-adding unlabeled fan sensor from priority driver '{}': {:?}", driver_name, hw_entry.path());
|
||||
fan_candidates.push((priority, hw_entry.path()));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -173,54 +181,45 @@ fn discover_hwmon(base_path: &Path, cfg: &SensorDiscovery) -> (Option<PathBuf>,
|
||||
fan_candidates.sort_by_key(|(p, _)| *p);
|
||||
|
||||
let best_temp = temp_candidates.first().map(|(_, p)| p.clone());
|
||||
let best_fans = fan_candidates.into_iter().map(|(_, p)| p).collect();
|
||||
let best_fans: Vec<PathBuf> = fan_candidates.into_iter().map(|(_, p)| p).collect();
|
||||
|
||||
if best_fans.is_empty() {
|
||||
warn!("Heuristic Discovery: No fan RPM sensors found.");
|
||||
} else {
|
||||
info!("Heuristic Discovery: Final registry contains {} fan sensors.", best_fans.len());
|
||||
}
|
||||
|
||||
(best_temp, best_fans)
|
||||
}
|
||||
|
||||
/// Discovers RAPL powercap paths.
|
||||
fn discover_rapl(base_path: &Path, cfg: &ActuatorDiscovery) -> Vec<PathBuf> {
|
||||
let mut paths = Vec::new();
|
||||
let powercap_base = base_path.join("sys/class/powercap");
|
||||
|
||||
let entries = match fs::read_dir(&powercap_base) {
|
||||
Ok(e) => e,
|
||||
Err(_) => return Vec::new(),
|
||||
};
|
||||
if let Ok(entries) = fs::read_dir(&powercap_base) {
|
||||
for entry in entries.flatten() {
|
||||
let path = entry.path();
|
||||
let dir_name = entry.file_name().into_string().unwrap_or_default();
|
||||
|
||||
for entry in entries.flatten() {
|
||||
let path = entry.path();
|
||||
let dir_name = entry.file_name().into_string().unwrap_or_default();
|
||||
|
||||
if cfg.rapl_paths.contains(&dir_name) {
|
||||
paths.push(path);
|
||||
continue;
|
||||
}
|
||||
|
||||
if let Some(name) = read_sysfs_with_timeout(&path.join("name"), Duration::from_millis(100)) {
|
||||
if cfg.rapl_paths.iter().any(|p| p == &name) {
|
||||
if cfg.rapl_paths.contains(&dir_name) {
|
||||
paths.push(path);
|
||||
continue;
|
||||
}
|
||||
|
||||
if let Ok(name) = fs::read_to_string(path.join("name")) {
|
||||
if cfg.rapl_paths.iter().any(|p| p == name.trim()) {
|
||||
paths.push(path);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
paths
|
||||
}
|
||||
|
||||
/// Checks if a systemd service is currently active.
|
||||
pub fn is_service_active(service: &str) -> bool {
|
||||
let status = Command::new("systemctl")
|
||||
.arg("is-active")
|
||||
.arg("--quiet")
|
||||
.arg(service)
|
||||
.status();
|
||||
|
||||
match status {
|
||||
Ok(s) => s.success(),
|
||||
Err(_) => false,
|
||||
}
|
||||
pub fn is_service_active(runner: &dyn SyscallRunner, service: &str) -> bool {
|
||||
runner.run("systemctl", &["is-active", "--quiet", service]).is_ok()
|
||||
}
|
||||
|
||||
/// Helper to read a sysfs file with a timeout.
|
||||
fn read_sysfs_with_timeout(path: &Path, timeout: Duration) -> Option<String> {
|
||||
let (tx, rx) = mpsc::channel();
|
||||
let path_buf = path.to_path_buf();
|
||||
|
||||
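Review note: the hunk above ends inside `read_sysfs_with_timeout`, so its body is not visible here. A minimal sketch of the usual channel-plus-worker pattern that the two visible lines suggest follows. The name `read_with_timeout_sketch` and everything after the `path_buf` line are assumptions for illustration, not the project's actual implementation.

```rust
use std::fs;
use std::path::Path;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the helper in the diff; the real body is elided there.
fn read_with_timeout_sketch(path: &Path, timeout: Duration) -> Option<String> {
    let (tx, rx) = mpsc::channel();
    let path_buf = path.to_path_buf();

    // The worker owns its copy of the path and the sender. If the read never
    // returns, the thread is simply left behind rather than joined.
    thread::spawn(move || {
        let result = fs::read_to_string(&path_buf).map(|s| s.trim().to_string());
        // The receiver may already have given up; a failed send is not an error here.
        let _ = tx.send(result);
    });

    match rx.recv_timeout(timeout) {
        Ok(Ok(content)) => Some(content),
        // Read error, timeout, and a vanished worker all collapse to `None`.
        _ => None,
    }
}
```

One trade-off of this shape is that a timeout cannot be told apart from a read error by the caller, which may be why the diff switches the always-present `name` and DMI files to plain `fs::read_to_string`.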
@@ -24,7 +24,7 @@ impl HeuristicEngine {
|
||||
.context("Failed to parse hardware_db.toml")?;
|
||||
|
||||
// 2. Discover Facts
|
||||
let facts = discover_facts(&ctx.sysfs_base, &db.discovery, &db.conflicts, db.benchmarking.clone());
|
||||
let facts = discover_facts(&ctx.sysfs_base, ctx.runner.as_ref(), &db.discovery, &db.conflicts, db.benchmarking.clone());
|
||||
info!("System Identity: {} {}", facts.vendor, facts.model);
|
||||
|
||||
// 3. Routing Logic
|
||||
|
||||
@@ -1,4 +1,5 @@
|
||||
use super::traits::{PreflightAuditor, EnvironmentGuard, SensorBus, ActuatorBus, HardwareWatchdog, AuditStep, SafetyStatus};
|
||||
use crate::sal::safety::{PowerLimitWatts, FanSpeedPercent};
|
||||
use anyhow::Result;
|
||||
|
||||
pub struct MockSal {
|
||||
@@ -16,59 +17,36 @@ impl MockSal {
|
||||
impl PreflightAuditor for MockSal {
|
||||
fn audit(&self) -> Box<dyn Iterator<Item = AuditStep> + '_> {
|
||||
let steps = vec![
|
||||
AuditStep {
|
||||
description: "Mock Root Privileges".to_string(),
|
||||
outcome: Ok(()),
|
||||
},
|
||||
AuditStep {
|
||||
description: "Mock AC Power Status".to_string(),
|
||||
outcome: Ok(()),
|
||||
},
|
||||
AuditStep { description: "Mock Root Privileges".to_string(), outcome: Ok(()) },
|
||||
AuditStep { description: "Mock AC Power Status".to_string(), outcome: Ok(()) },
|
||||
];
|
||||
Box::new(steps.into_iter())
|
||||
}
|
||||
}
|
||||
|
||||
impl EnvironmentGuard for MockSal {
|
||||
fn suppress(&self) -> Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
fn restore(&self) -> Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
fn suppress(&self) -> Result<()> { Ok(()) }
|
||||
fn restore(&self) -> Result<()> { Ok(()) }
|
||||
}
|
||||
|
||||
impl SensorBus for MockSal {
|
||||
fn get_temp(&self) -> Result<f32> {
|
||||
// Support dynamic sequence for Step 5
|
||||
let seq = self.temperature_sequence.fetch_add(1, std::sync::atomic::Ordering::SeqCst);
|
||||
Ok(40.0 + (seq as f32 * 0.5).min(50.0)) // Heats up from 40 to 90
|
||||
}
|
||||
fn get_power_w(&self) -> Result<f32> {
|
||||
Ok(15.0)
|
||||
}
|
||||
fn get_fan_rpms(&self) -> Result<Vec<u32>> {
|
||||
Ok(vec![2500])
|
||||
}
|
||||
fn get_freq_mhz(&self) -> Result<f32> {
|
||||
Ok(3200.0)
|
||||
Ok(40.0 + (seq as f32 * 0.5).min(55.0))
|
||||
}
|
||||
fn get_power_w(&self) -> Result<f32> { Ok(15.0) }
|
||||
fn get_fan_rpms(&self) -> Result<Vec<u32>> { Ok(vec![2500, 2400]) }
|
||||
fn get_freq_mhz(&self) -> Result<f32> { Ok(3200.0) }
|
||||
fn get_throttling_status(&self) -> Result<bool> { Ok(false) }
|
||||
}
|
||||
|
||||
impl ActuatorBus for MockSal {
|
||||
fn set_fan_mode(&self, _mode: &str) -> Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
fn set_sustained_power_limit(&self, _watts: f32) -> Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
fn set_burst_power_limit(&self, _watts: f32) -> Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
fn set_fan_mode(&self, _mode: &str) -> Result<()> { Ok(()) }
|
||||
fn set_fan_speed(&self, _speed: FanSpeedPercent) -> Result<()> { Ok(()) }
|
||||
fn set_sustained_power_limit(&self, _limit: PowerLimitWatts) -> Result<()> { Ok(()) }
|
||||
fn set_burst_power_limit(&self, _limit: PowerLimitWatts) -> Result<()> { Ok(()) }
|
||||
}
|
||||
|
||||
impl HardwareWatchdog for MockSal {
|
||||
fn get_safety_status(&self) -> Result<SafetyStatus> {
|
||||
Ok(SafetyStatus::Nominal)
|
||||
}
|
||||
fn get_safety_status(&self) -> Result<SafetyStatus> { Ok(SafetyStatus::Nominal) }
|
||||
}
|
||||
|
||||
@@ -3,3 +3,5 @@ pub mod mock;
|
||||
pub mod dell_xps_9380;
|
||||
pub mod generic_linux;
|
||||
pub mod heuristic;
|
||||
pub mod safety;
|
||||
pub mod discovery;
|
||||
|
||||
282
src/sal/safety.rs
Normal file
@@ -0,0 +1,282 @@
|
||||
//! # Hardware Safety & Universal Safeguard Architecture
|
||||
//!
|
||||
//! This module implements the core safety logic for `ember-tune`. It uses the Rust
|
||||
//! type system to enforce hardware bounds and RAII patterns to guarantee that
|
||||
//! the system is restored to a safe state even after a crash.
|
||||
|
||||
use anyhow::{Result, bail, Context};
|
||||
use std::collections::HashMap;
|
||||
use std::fs;
|
||||
use std::path::{PathBuf};
|
||||
use std::sync::Arc;
|
||||
use std::sync::atomic::{AtomicBool, Ordering};
|
||||
use std::time::Duration;
|
||||
use std::thread;
|
||||
use tracing::{info, warn, error, debug};
|
||||
|
||||
use crate::sal::traits::SensorBus;
|
||||
|
||||
// --- 1. Type-Driven Bounds Checking ---
|
||||
|
||||
/// Represents a validated TDP limit in Watts.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
|
||||
pub struct PowerLimitWatts(f32);
|
||||
|
||||
impl PowerLimitWatts {
|
||||
/// Absolute safety floor. Setting TDP below 3W can induce system-wide
|
||||
/// CPU stalls and I/O deadlocks on certain Intel mobile chipsets.
|
||||
pub const MIN: f32 = 3.0;
|
||||
/// Safety ceiling for mobile thin-and-light chassis.
|
||||
pub const MAX: f32 = 100.0;
|
||||
|
||||
/// Validates and constructs a new PowerLimitWatts.
|
||||
pub fn try_new(watts: f32) -> Result<Self> {
|
||||
if watts < Self::MIN || watts > Self::MAX {
|
||||
bail!("HardwareSafetyError: Requested TDP {:.1}W is outside safe bounds ({:.1}W - {:.1}W).", watts, Self::MIN, Self::MAX);
|
||||
}
|
||||
Ok(Self(watts))
|
||||
}
|
||||
|
||||
pub fn from_watts(watts: f32) -> Result<Self> {
|
||||
Self::try_new(watts)
|
||||
}
|
||||
|
||||
pub fn get(&self) -> f32 { self.0 }
|
||||
pub fn as_microwatts(&self) -> u64 { (self.0 * 1_000_000.0) as u64 }
|
||||
}
|
||||
|
||||
/// Represents a validated fan speed percentage.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub struct FanSpeedPercent(u8);
|
||||
|
||||
impl FanSpeedPercent {
|
||||
pub fn try_new(percent: u8) -> Result<Self> {
|
||||
if percent > 100 {
|
||||
bail!("HardwareSafetyError: Fan speed {}% is invalid.", percent);
|
||||
}
|
||||
Ok(Self(percent))
|
||||
}
|
||||
|
||||
pub fn new(percent: u8) -> Result<Self> {
|
||||
Self::try_new(percent)
|
||||
}
|
||||
|
||||
pub fn get(&self) -> u8 { self.0 }
|
||||
}
|
||||
|
||||
/// Represents a thermal threshold in Celsius.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
|
||||
pub struct ThermalThresholdCelsius(f32);
|
||||
|
||||
impl ThermalThresholdCelsius {
|
||||
pub const MAX_SAFE_C: f32 = 98.0;
|
||||
|
||||
pub fn try_new(celsius: f32) -> Result<Self> {
|
||||
if celsius > Self::MAX_SAFE_C {
|
||||
bail!("HardwareSafetyError: Thermal threshold {}C exceeds safe limit ({}C).", celsius, Self::MAX_SAFE_C);
|
||||
}
|
||||
Ok(Self(celsius))
|
||||
}
|
||||
|
||||
pub fn new(celsius: f32) -> Result<Self> {
|
||||
Self::try_new(celsius)
|
||||
}
|
||||
|
||||
pub fn get(&self) -> f32 { self.0 }
|
||||
}
|
||||
|
||||
// --- 2. The HardwareStateGuard (RAII Restorer) ---
|
||||
|
||||
/// Defines an arbitrary action to take during restoration.
|
||||
pub type RollbackAction = Box<dyn FnOnce() + Send + 'static>;
|
||||
|
||||
/// Holds a snapshot of the system state. Restores everything on Drop.
|
||||
/// This is the primary safety mechanism for Project Iron-Ember.
|
||||
pub struct HardwareStateGuard {
|
||||
/// Maps sysfs paths to their original string contents.
|
||||
snapshots: HashMap<PathBuf, String>,
|
||||
/// Services that were stopped and must be restarted.
|
||||
suppressed_services: Vec<String>,
|
||||
/// Arbitrary actions to perform on restoration (e.g., reset fan mode).
|
||||
rollback_actions: Vec<RollbackAction>,
|
||||
is_active: bool,
|
||||
}
|
||||
|
||||
impl HardwareStateGuard {
|
||||
/// Snapshots the requested files and neutralizes competing services.
|
||||
///
|
||||
/// # SAFETY:
|
||||
/// This MUST be acquired before any hardware mutation occurs.
|
||||
pub fn acquire(target_files: &[PathBuf], target_services: &[String]) -> Result<Self> {
|
||||
let mut snapshots = HashMap::new();
|
||||
let mut suppressed = Vec::new();
|
||||
|
||||
info!("USA: Arming HardwareStateGuard. Snapshotting critical registers...");
|
||||
|
||||
for path in target_files {
|
||||
if path.exists() {
|
||||
let content = fs::read_to_string(path)
|
||||
.with_context(|| format!("Failed to snapshot {:?}", path))?;
|
||||
snapshots.insert(path.clone(), content.trim().to_string());
|
||||
} else {
|
||||
debug!("USA: Skipping snapshot for non-existent path {:?}", path);
|
||||
}
|
||||
}
|
||||
|
||||
for svc in target_services {
|
||||
// Check if service is active before stopping
|
||||
let status = std::process::Command::new("systemctl")
|
||||
.args(["is-active", "--quiet", svc])
|
||||
.status();
|
||||
|
||||
if let Ok(s) = status {
|
||||
if s.success() {
|
||||
info!("USA: Neutralizing service '{}'", svc);
|
||||
let _ = std::process::Command::new("systemctl").args(["stop", svc]).status();
|
||||
suppressed.push(svc.clone());
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Ok(Self {
|
||||
snapshots,
|
||||
suppressed_services: suppressed,
|
||||
rollback_actions: Vec::new(),
|
||||
is_active: true,
|
||||
})
|
||||
}
|
||||
|
||||
/// Registers a custom action to be performed when the guard is released.
|
||||
pub fn on_rollback(&mut self, action: RollbackAction) {
|
||||
self.rollback_actions.push(action);
|
||||
}
|
||||
|
||||
/// Explicitly release and restore the hardware state.
|
||||
pub fn release(&mut self) -> Result<()> {
|
||||
if !self.is_active { return Ok(()); }
|
||||
|
||||
info!("USA: Releasing guard. Restoring hardware to pre-flight state...");
|
||||
|
||||
// 1. Restore Power/Sysfs states
|
||||
for (path, content) in &self.snapshots {
|
||||
if let Err(e) = fs::write(path, content) {
|
||||
error!("CRITICAL: Failed to restore {:?}: {}", path, e);
|
||||
}
|
||||
}
|
||||
|
||||
// 2. Restart Services
|
||||
for svc in &self.suppressed_services {
|
||||
let _ = std::process::Command::new("systemctl").args(["start", svc]).status();
|
||||
}
|
||||
|
||||
// 3. Perform Custom Rollback Actions
|
||||
for action in self.rollback_actions.drain(..) {
|
||||
(action)();
|
||||
}
|
||||
|
||||
self.is_active = false;
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
impl Drop for HardwareStateGuard {
|
||||
fn drop(&mut self) {
|
||||
if self.is_active {
|
||||
warn!("USA: Guard dropped prematurely (panic/SIGTERM). Force-restoring system...");
|
||||
let _ = self.release();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// --- 3. The Active Watchdog ---
|
||||
|
||||
/// A standalone monitor that polls hardware thermals at high frequency.
|
||||
pub struct ThermalWatchdog {
|
||||
cancel_token: Arc<AtomicBool>,
|
||||
handle: Option<thread::JoinHandle<()>>,
|
||||
}
|
||||
|
||||
impl ThermalWatchdog {
|
||||
/// If temperature exceeds this ceiling, the watchdog triggers an emergency shutdown.
|
||||
pub const CRITICAL_TEMP: f32 = 95.0;
|
||||
/// High polling rate ensures we catch runaways before chassis saturation.
|
||||
pub const POLL_INTERVAL: Duration = Duration::from_millis(250);
|
||||
|
||||
/// Spawns the watchdog thread.
|
||||
pub fn spawn(sensors: Arc<dyn SensorBus>, cancel_token: Arc<AtomicBool>) -> Self {
|
||||
let ct = cancel_token.clone();
|
||||
let handle = thread::spawn(move || {
|
||||
let mut last_temp = 0.0;
|
||||
loop {
|
||||
if ct.load(Ordering::SeqCst) {
|
||||
debug!("Watchdog: Shutdown signal received.");
|
||||
break;
|
||||
}
|
||||
|
||||
match sensors.get_temp() {
|
||||
Ok(temp) => {
|
||||
// Rate of change check (dT/dt)
|
||||
let dt_dt = temp - last_temp;
|
||||
if temp >= Self::CRITICAL_TEMP {
|
||||
error!("WATCHDOG: CRITICAL THERMAL EVENT ({:.1}C). Triggering emergency abort!", temp);
|
||||
ct.store(true, Ordering::SeqCst);
|
||||
break;
|
||||
}
|
||||
|
||||
if dt_dt > 5.0 && temp > 85.0 {
|
||||
warn!("WATCHDOG: Dangerous thermal ramp detected (+{:.1}C in 250ms).", dt_dt);
|
||||
}
|
||||
|
||||
last_temp = temp;
|
||||
}
|
||||
Err(e) => {
|
||||
error!("WATCHDOG: Sensor read failure: {}. Aborting for safety!", e);
|
||||
ct.store(true, Ordering::SeqCst);
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
thread::sleep(Self::POLL_INTERVAL);
|
||||
}
|
||||
});
|
||||
|
||||
Self {
|
||||
cancel_token,
|
||||
handle: Some(handle),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl Drop for ThermalWatchdog {
|
||||
fn drop(&mut self) {
|
||||
self.cancel_token.store(true, Ordering::SeqCst);
|
||||
if let Some(h) = self.handle.take() {
|
||||
let _ = h.join();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// --- 4. Transactional Configuration ---
|
||||
|
||||
/// A staged set of changes to be applied to the hardware.
|
||||
#[derive(Default)]
|
||||
pub struct ConfigurationTransaction {
|
||||
changes: Vec<(PathBuf, String)>,
|
||||
}
|
||||
|
||||
impl ConfigurationTransaction {
|
||||
pub fn add_change(&mut self, path: PathBuf, value: String) {
|
||||
self.changes.push((path, value));
|
||||
}
|
||||
|
||||
/// # SAFETY:
|
||||
/// Commits all changes. If any write fails, it returns an error but the
|
||||
/// HardwareStateGuard will still restore everything on drop.
|
||||
pub fn commit(self) -> Result<()> {
|
||||
for (path, val) in self.changes {
|
||||
fs::write(&path, val)
|
||||
.with_context(|| format!("Failed to apply change to {:?}", path))?;
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
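Review note: the snapshot-and-restore behaviour of `HardwareStateGuard` can be illustrated with a trimmed-down sketch. `FileStateGuardSketch` is a hypothetical type that keeps only the file snapshots (no services, no rollback actions) and uses `std::io::Result` instead of `anyhow`; it is meant to show the RAII shape, not to replace the code in the diff.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical, reduced guard: remembers file contents and writes them back on drop.
struct FileStateGuardSketch {
    snapshots: HashMap<PathBuf, String>,
    is_active: bool,
}

impl FileStateGuardSketch {
    /// Snapshot every target that exists; missing paths are skipped.
    fn acquire(targets: &[PathBuf]) -> io::Result<Self> {
        let mut snapshots = HashMap::new();
        for path in targets {
            if path.exists() {
                let content = fs::read_to_string(path)?;
                snapshots.insert(path.clone(), content.trim().to_string());
            }
        }
        Ok(Self { snapshots, is_active: true })
    }

    /// Best-effort restore; a failed write is reported but does not stop the loop.
    fn release(&mut self) {
        if !self.is_active {
            return;
        }
        for (path, content) in &self.snapshots {
            if let Err(e) = fs::write(path, content) {
                eprintln!("failed to restore {:?}: {}", path, e);
            }
        }
        self.is_active = false;
    }
}

impl Drop for FileStateGuardSketch {
    fn drop(&mut self) {
        // Runs at end of scope and during panic unwinding.
        self.release();
    }
}
```

Note that `Drop` runs on normal scope exit and on unwinding panics, but not when the process is aborted or killed by a signal with its default disposition. Covering SIGTERM would therefore need a signal handler that lets the guard go out of scope; whether the project installs one is not visible in this diff.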
@@ -115,79 +115,54 @@ impl<T: EnvironmentGuard + ?Sized> EnvironmentGuard for Arc<T> {
|
||||
}
|
||||
}
|
||||
|
||||
use crate::sal::safety::{PowerLimitWatts, FanSpeedPercent};
|
||||
|
||||
/// Provides a read-only interface to system telemetry sensors.
|
||||
pub trait SensorBus: Send + Sync {
|
||||
/// Returns the current package temperature in degrees Celsius.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns an error if the underlying `hwmon` or `sysfs` node cannot be read.
|
||||
fn get_temp(&self) -> Result<f32>;
|
||||
|
||||
/// Returns the current package power consumption in Watts.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns an error if the underlying RAPL or power sensor cannot be read.
|
||||
fn get_power_w(&self) -> Result<f32>;
|
||||
|
||||
/// Returns the current speed of all detected fans in RPM.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns an error if the fan sensor nodes cannot be read.
|
||||
fn get_fan_rpms(&self) -> Result<Vec<u32>>;
|
||||
|
||||
/// Returns the current average CPU frequency in MHz.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns an error if `/proc/cpuinfo` or a `cpufreq` sysfs node cannot be read.
|
||||
fn get_freq_mhz(&self) -> Result<f32>;
|
||||
|
||||
/// Returns true if the system is currently thermally throttling.
|
||||
fn get_throttling_status(&self) -> Result<bool>;
|
||||
}
|
||||
|
||||
impl<T: SensorBus + ?Sized> SensorBus for Arc<T> {
|
||||
fn get_temp(&self) -> Result<f32> {
|
||||
(**self).get_temp()
|
||||
}
|
||||
fn get_power_w(&self) -> Result<f32> {
|
||||
(**self).get_power_w()
|
||||
}
|
||||
fn get_fan_rpms(&self) -> Result<Vec<u32>> {
|
||||
(**self).get_fan_rpms()
|
||||
}
|
||||
fn get_freq_mhz(&self) -> Result<f32> {
|
||||
(**self).get_freq_mhz()
|
||||
}
|
||||
fn get_temp(&self) -> Result<f32> { (**self).get_temp() }
|
||||
fn get_power_w(&self) -> Result<f32> { (**self).get_power_w() }
|
||||
fn get_fan_rpms(&self) -> Result<Vec<u32>> { (**self).get_fan_rpms() }
|
||||
fn get_freq_mhz(&self) -> Result<f32> { (**self).get_freq_mhz() }
|
||||
fn get_throttling_status(&self) -> Result<bool> { (**self).get_throttling_status() }
|
||||
}
|
||||
|
||||
/// Provides a write-only interface for hardware actuators.
|
||||
pub trait ActuatorBus: Send + Sync {
|
||||
/// Sets the fan control mode (e.g., "auto" or "max").
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns an error if the fan control command or `sysfs` write fails.
|
||||
fn set_fan_mode(&self, mode: &str) -> Result<()>;
|
||||
|
||||
/// Sets the sustained power limit (PL1) in Watts.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns an error if the RAPL `sysfs` node cannot be written to.
|
||||
fn set_sustained_power_limit(&self, watts: f32) -> Result<()>;
|
||||
/// Sets the fan speed directly using a validated percentage.
|
||||
fn set_fan_speed(&self, speed: FanSpeedPercent) -> Result<()>;
|
||||
|
||||
/// Sets the burst power limit (PL2) in Watts.
|
||||
///
|
||||
/// # Errors
|
||||
/// Returns an error if the RAPL `sysfs` node cannot be written to.
|
||||
fn set_burst_power_limit(&self, watts: f32) -> Result<()>;
|
||||
/// Sets the sustained power limit (PL1) using a validated wrapper.
|
||||
fn set_sustained_power_limit(&self, limit: PowerLimitWatts) -> Result<()>;
|
||||
|
||||
/// Sets the burst power limit (PL2) using a validated wrapper.
|
||||
fn set_burst_power_limit(&self, limit: PowerLimitWatts) -> Result<()>;
|
||||
}
|
||||
|
||||
impl<T: ActuatorBus + ?Sized> ActuatorBus for Arc<T> {
|
||||
fn set_fan_mode(&self, mode: &str) -> Result<()> {
|
||||
(**self).set_fan_mode(mode)
|
||||
}
|
||||
fn set_sustained_power_limit(&self, watts: f32) -> Result<()> {
|
||||
(**self).set_sustained_power_limit(watts)
|
||||
}
|
||||
fn set_burst_power_limit(&self, watts: f32) -> Result<()> {
|
||||
(**self).set_burst_power_limit(watts)
|
||||
}
|
||||
fn set_fan_mode(&self, mode: &str) -> Result<()> { (**self).set_fan_mode(mode) }
|
||||
fn set_fan_speed(&self, speed: FanSpeedPercent) -> Result<()> { (**self).set_fan_speed(speed) }
|
||||
fn set_sustained_power_limit(&self, limit: PowerLimitWatts) -> Result<()> { (**self).set_sustained_power_limit(limit) }
|
||||
fn set_burst_power_limit(&self, limit: PowerLimitWatts) -> Result<()> { (**self).set_burst_power_limit(limit) }
|
||||
}
|
||||
|
||||
/// Represents the high-level safety status of the system.
|
||||
|
||||
@@ -1,35 +1,75 @@
|
||||
#[path = "../src/engine/formatters/throttled.rs"]
|
||||
mod throttled;
|
||||
|
||||
use throttled::{ThrottledTranslator, ThrottledConfig};
|
||||
use ember_tune_rs::engine::formatters::throttled::{ThrottledConfig, ThrottledTranslator};
|
||||
use ember_tune_rs::agent_analyst::{OptimizationMatrix, SystemProfile, FanCurvePoint};
|
||||
use ember_tune_rs::agent_integrator::ServiceIntegrator;
|
||||
use std::fs;
|
||||
use tempfile::tempdir;
|
||||
|
||||
#[test]
|
||||
fn test_throttled_formatter_non_destructive() {
|
||||
let fixture_path = "tests/fixtures/throttled.conf";
|
||||
let existing_content = fs::read_to_string(fixture_path).expect("Failed to read fixture");
|
||||
fn test_throttled_merge_preserves_undervolt() {
|
||||
let existing = r#"[GENERAL]
|
||||
Update_Interval_ms: 1000
|
||||
|
||||
[UNDERVOLT]
|
||||
# CPU core undervolt
|
||||
CORE: -100
|
||||
# GPU undervolt
|
||||
GPU: -50
|
||||
|
||||
[AC]
|
||||
PL1_Tdp_W: 15
|
||||
PL2_Tdp_W: 25
|
||||
"#;
|
||||
|
||||
let config = ThrottledConfig {
|
||||
pl1_limit: 25.0,
|
||||
pl2_limit: 35.0,
|
||||
trip_temp: 90.0,
|
||||
pl1_limit: 22.0,
|
||||
pl2_limit: 28.0,
|
||||
trip_temp: 95.0,
|
||||
};
|
||||
|
||||
let merged = ThrottledTranslator::merge_conf(&existing_content, &config);
|
||||
let merged = ThrottledTranslator::merge_conf(existing, &config);
|
||||
|
||||
// Assert updates
|
||||
assert!(merged.contains("PL1_Tdp_W: 25"));
|
||||
assert!(merged.contains("PL2_Tdp_W: 35"));
|
||||
assert!(merged.contains("Trip_Temp_C: 90"));
|
||||
|
||||
// Assert preservation
|
||||
assert!(merged.contains("[UNDERVOLT]"));
|
||||
assert!(merged.contains("CORE: -100"));
|
||||
assert!(merged.contains("GPU: -50"));
|
||||
assert!(merged.contains("# Important: Preserving undervolt offsets is critical!"));
|
||||
assert!(merged.contains("Update_Interval_ms: 3000"));
|
||||
|
||||
// Check that we didn't lose the [GENERAL] section
|
||||
assert!(merged.contains("[GENERAL]"));
|
||||
assert!(merged.contains("# This is a complex test fixture"));
|
||||
assert!(merged.contains("PL1_Tdp_W: 22"));
|
||||
assert!(merged.contains("PL2_Tdp_W: 28"));
|
||||
assert!(merged.contains("Trip_Temp_C: 95"));
|
||||
assert!(merged.contains("[UNDERVOLT]"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_i8kmon_merge_preserves_settings() {
|
||||
let dir = tempdir().unwrap();
|
||||
let config_path = dir.path().join("i8kmon.conf");
|
||||
|
||||
let existing = r#"set config(gen_shadow) 1
|
||||
set config(i8k_ignore_dmi) 1
|
||||
set config(daemon) 1
|
||||
|
||||
set config(0) {0 0 60 50}
|
||||
"#;
|
||||
fs::write(&config_path, existing).unwrap();
|
||||
|
||||
let matrix = OptimizationMatrix {
|
||||
silent: SystemProfile { name: "Silent".to_string(), pl1_watts: 10.0, pl2_watts: 12.0, fan_curve: vec![] },
|
||||
balanced: SystemProfile {
|
||||
name: "Balanced".to_string(),
|
||||
pl1_watts: 20.0,
|
||||
pl2_watts: 25.0,
|
||||
fan_curve: vec![
|
||||
FanCurvePoint { temp_on: 70.0, temp_off: 60.0, pwm_percent: 50 }
|
||||
]
|
||||
},
|
||||
performance: SystemProfile { name: "Perf".to_string(), pl1_watts: 30.0, pl2_watts: 35.0, fan_curve: vec![] },
|
||||
thermal_resistance_kw: 1.5,
|
||||
ambient_temp: 25.0,
|
||||
};
|
||||
|
||||
ServiceIntegrator::generate_i8kmon_config(&matrix, &config_path, Some(&config_path)).unwrap();
|
||||
|
||||
let result = fs::read_to_string(&config_path).unwrap();
|
||||
|
||||
assert!(result.contains("set config(gen_shadow) 1"));
|
||||
assert!(result.contains("set config(daemon) 1"));
|
||||
assert!(result.contains("set config(0) {1 1 70 -}")); // New config
|
||||
assert!(!result.contains("set config(0) {0 0 60 50}")); // Old config should be gone
|
||||
}
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
use ember_tune_rs::sal::heuristic::discovery::discover_facts;
|
||||
use ember_tune_rs::sal::heuristic::schema::{Discovery, SensorDiscovery, ActuatorDiscovery, Benchmarking};
|
||||
use ember_tune_rs::sys::MockSyscallRunner;
|
||||
use crate::common::fakesys::FakeSysBuilder;
|
||||
|
||||
mod common;
|
||||
@@ -35,7 +36,9 @@ fn test_heuristic_discovery_with_fakesys() {
|
||||
power_steps_watts: vec![10.0, 15.0],
|
||||
};
|
||||
|
||||
let facts = discover_facts(&fake.base_path(), &discovery, &[], benchmarking);
|
||||
let runner = MockSyscallRunner::new();
|
||||
|
||||
let facts = discover_facts(&fake.base_path(), &runner, &discovery, &[], benchmarking);
|
||||
|
||||
assert_eq!(facts.vendor, "Dell Inc.");
|
||||
assert_eq!(facts.model, "XPS 13 9380");
|
||||
|
||||
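Review note: passing a runner into `discover_facts` lets the conflict check be exercised without a real `systemctl`. The sketch below shows the idea with a hypothetical `RunnerSketch` trait and `FakeRunner` double. The project's real `SyscallRunner` and `MockSyscallRunner` signatures are not shown in this diff, so the trait shape, the return type, and the unit names used in the usage note are assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical runner abstraction; stands in for the project's `SyscallRunner`.
trait RunnerSketch {
    fn run(&self, cmd: &str, args: &[&str]) -> Result<String, String>;
}

/// Test double that treats a fixed set of unit names as active.
struct FakeRunner {
    active: HashSet<String>,
}

impl RunnerSketch for FakeRunner {
    fn run(&self, cmd: &str, args: &[&str]) -> Result<String, String> {
        match (cmd, args) {
            // Only the exact `is-active --quiet <unit>` call succeeds, and only for known units.
            ("systemctl", ["is-active", "--quiet", unit]) if self.active.contains(*unit) => {
                Ok(String::new())
            }
            _ => Err(format!("{cmd} {args:?} failed")),
        }
    }
}

/// Same one-liner shape as the new `is_service_active` in the diff.
fn is_service_active_sketch(runner: &dyn RunnerSketch, service: &str) -> bool {
    runner
        .run("systemctl", &["is-active", "--quiet", service])
        .is_ok()
}
```

With `active` containing `"thermald"`, `is_service_active_sketch(&runner, "thermald")` is true and any other unit name is false. The sketch assumes the runner maps a non-zero exit status to `Err`; if the real runner returns `Ok` for any process that spawned, `.is_ok()` alone would not distinguish active from inactive.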
@@ -1,16 +1,23 @@
|
||||
use ember_tune_rs::orchestrator::BenchmarkOrchestrator;
|
||||
use ember_tune_rs::sal::mock::MockSal;
|
||||
use ember_tune_rs::sal::heuristic::discovery::SystemFactSheet;
|
||||
use ember_tune_rs::load::Workload;
|
||||
use ember_tune_rs::load::{Workload, IntensityProfile, WorkloadMetrics};
|
||||
use std::time::Duration;
|
||||
use anyhow::Result;
|
||||
use std::sync::mpsc;
|
||||
use std::sync::Arc;
|
||||
use anyhow::Result;
|
||||
|
||||
struct MockWorkload;
|
||||
impl Workload for MockWorkload {
|
||||
fn start(&mut self, _threads: usize, _load_percent: usize) -> Result<()> { Ok(()) }
|
||||
fn stop(&mut self) -> Result<()> { Ok(()) }
|
||||
fn get_throughput(&self) -> Result<f64> { Ok(100.0) }
|
||||
fn initialize(&mut self) -> Result<()> { Ok(()) }
|
||||
fn run_workload(&mut self, _duration: Duration, _profile: IntensityProfile) -> Result<()> { Ok(()) }
|
||||
fn get_current_metrics(&self) -> Result<WorkloadMetrics> {
|
||||
Ok(WorkloadMetrics {
|
||||
primary_ops_per_sec: 100.0,
|
||||
elapsed_time: Duration::from_secs(1),
|
||||
})
|
||||
}
|
||||
fn stop_workload(&mut self) -> Result<()> { Ok(()) }
|
||||
}
|
||||
|
||||
#[test]
|
||||
@@ -28,6 +35,7 @@ fn test_orchestrator_e2e_state_machine() {
|
||||
workload,
|
||||
telemetry_tx,
|
||||
command_rx,
|
||||
None,
|
||||
);
|
||||
|
||||
// For the purpose of this architecture audit, we've demonstrated the
|
||||
|
||||
53
tests/safety_test.rs
Normal file
@@ -0,0 +1,53 @@
|
||||
use ember_tune_rs::sal::safety::{HardwareStateGuard, PowerLimitWatts};
|
||||
use crate::common::fakesys::FakeSysBuilder;
|
||||
use std::fs;
|
||||
|
||||
mod common;
|
||||
|
||||
#[test]
|
||||
fn test_hardware_state_guard_panic_restoration() {
|
||||
let fake = FakeSysBuilder::new();
|
||||
let pl1_path = fake.base_path().join("sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw");
|
||||
|
||||
fake.add_rapl("intel-rapl:0", "1000", "15000000"); // 15W original
|
||||
|
||||
let target_files = vec![pl1_path.clone()];
|
||||
|
||||
// Simulate a scope where the guard is active
|
||||
{
|
||||
let mut _guard = HardwareStateGuard::acquire(&target_files, &[]).expect("Failed to acquire guard");
|
||||
|
||||
// Modify the file
|
||||
fs::write(&pl1_path, "25000000").expect("Failed to write new value");
|
||||
assert_eq!(fs::read_to_string(&pl1_path).unwrap().trim(), "25000000");
|
||||
|
||||
// Guard is dropped here (simulating end of scope or panic)
|
||||
}
|
||||
|
||||
// Verify restoration
|
||||
let restored = fs::read_to_string(&pl1_path).expect("Failed to read restored file");
|
||||
assert_eq!(restored.trim(), "15000000");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_tdp_limit_bounds_checking() {
|
||||
// 1. Valid value
|
||||
assert!(PowerLimitWatts::try_new(15.0).is_ok());
|
||||
|
||||
// 2. Too low (Dangerous 0W or below 3W)
|
||||
let low_res = PowerLimitWatts::try_new(1.0);
|
||||
assert!(low_res.is_err());
|
||||
assert!(low_res.unwrap_err().to_string().contains("outside safe bounds"));
|
||||
|
||||
// 3. Too high (> 100W)
|
||||
let high_res = PowerLimitWatts::try_new(150.0);
|
||||
assert!(high_res.is_err());
|
||||
assert!(high_res.unwrap_err().to_string().contains("outside safe bounds"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_0w_tdp_regression_prevention() {
|
||||
// The prime directive is to never set 0W.
|
||||
let zero_res = PowerLimitWatts::try_new(0.0);
|
||||
assert!(zero_res.is_err());
|
||||
}
|
||||
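Review note: the bounds tests above cover values below 3 W and above 100 W. One case they do not cover is `f32::NAN`. As far as can be told from the `try_new` hunk, a check written as `watts < MIN || watts > MAX` evaluates to false for NaN, so NaN would be accepted, and a later `as u64` cast of NaN yields 0. The sketch below uses a hypothetical `WattsSketch` type (bounds copied from the diff, `String` errors instead of `anyhow`) to show a range-membership check that rejects NaN as well.

```rust
/// Hypothetical mirror of the validated wrapper; not the project's `PowerLimitWatts`.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct WattsSketch(f32);

impl WattsSketch {
    const MIN: f32 = 3.0;
    const MAX: f32 = 100.0;

    fn try_new(watts: f32) -> Result<Self, String> {
        // `contains` is false for NaN, so the negation rejects NaN along with
        // out-of-range values.
        if !(Self::MIN..=Self::MAX).contains(&watts) {
            return Err(format!(
                "requested TDP {watts:.1}W is outside safe bounds ({:.1}W - {:.1}W)",
                Self::MIN,
                Self::MAX
            ));
        }
        Ok(Self(watts))
    }

    /// Conversion to the microwatt unit used by RAPL `*_power_limit_uw` files.
    fn as_microwatts(self) -> u64 {
        (self.0 * 1_000_000.0) as u64
    }
}
```

A matching regression test would assert that `WattsSketch::try_new(f32::NAN)` is an error, next to the existing 0 W case.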