5 Commits

25 changed files with 1471 additions and 568 deletions

13
Cargo.lock generated

@@ -901,6 +901,15 @@ dependencies = [
"winapi",
]
[[package]]
name = "matchers"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d1525a2a28c7f4fa0fc98bb91ae755d1e2d1505079e05539e35bc876b5d65ae9"
dependencies = [
"regex-automata",
]
[[package]]
name = "memchr"
version = "2.8.0"
@@ -2000,10 +2009,14 @@ version = "0.3.22"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2f30143827ddab0d256fd843b7a66d164e9f271cfa0dde49142c5ca0ca291f1e"
dependencies = [
"matchers",
"nu-ansi-term",
"once_cell",
"regex-automata",
"sharded-slab",
"smallvec",
"thread_local",
"tracing",
"tracing-core",
"tracing-log",
]


@@ -23,7 +23,7 @@ serde_json = "1.0.149"
clap = { version = "4.5", features = ["derive", "string", "wrap_help"] }
color-eyre = "0.6"
tracing = "0.1"
tracing-subscriber = "0.3"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
tracing-appender = "0.2"
sysinfo = "0.38"
libc = "0.2"


@@ -1,33 +1,61 @@
# 🔥 ember-tune
```text
__________ ____ ______ ____ ______ __ __ _ __ ______
/ ____/ |/ // __ )/ ____// __ \ /_ __/ / / / // | / // ____/
/ __/ / /|_/ // __ / __/ / /_/ / / / / / / // |/ // __/
/ /___ / / / // /_/ / /___ / _, _/ / / / /_/ // /| // /___
/_____//_/ /_//_____/_____//_/ |_| /_/ \____//_/ |_//_____/
>>> Physically-grounded thermal & power optimization for Linux <<<
```
> ### **Find your hardware's "Physical Sweet Spot" through automated trial-by-fire.**
`ember-tune` is a scientifically-driven hardware optimizer that replaces guesswork and manual tuning with a rigorous, automated engineering workflow. It determines the unique thermal properties of your specific laptop—including its Thermal Resistance (Rθ) and "Silicon Knee"—to generate optimal configurations for common Linux tuning daemons.
## ✨ Features
- **Automated Physical Benchmarking:** Measures real-world thermal performance under load to find the true "sweet spot" where performance-per-watt is maximized before thermal saturation causes diminishing returns.
- **Heuristic Hardware Discovery:** Utilizes a data-driven System Abstraction Layer (SAL) that probes your system and automatically adapts to its unique quirks, drivers, and sensor paths.
- **Non-Destructive Configuration:** Safely merges new, optimized power limits into your existing `throttled.conf`, preserving manual undervolt settings and comments.
- **Universal Safeguard Architecture (USA):** Includes a high-frequency concurrent watchdog and RAII state restoration to guarantee your system is never left in a dangerous state.
- **Real-time TUI Dashboard:** A `ratatui`-based terminal interface provides high-resolution telemetry throughout the benchmark.
## 🔬 How it Works: The Architecture
`ember-tune` is built on a decoupled, multi-threaded architecture to ensure the UI is always responsive and that hardware state is managed safely.
1. **The Heuristic Engine:** On startup, the engine probes your system's DMI, `sysfs`, and active services. It compares these "facts" against the `hardware_db.toml` to select the correct System Abstraction Layer (SAL).
2. **The Orchestrator (Backend Thread):** This is the state machine that executes the benchmark. It communicates with hardware *only* through the SAL traits.
3. **The TUI (Main Thread):** The `ratatui` dashboard renders `TelemetryState` snapshots received from the orchestrator via an MPSC channel.
4. **The Watchdog (Safety Thread):** A high-priority thread that polls safety sensors every 100ms to trigger an atomic `EmergencyAbort` if failure conditions are met.
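
The thread layout described above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `Snapshot`, `run_pipeline`, and the shared `latest` temperature cell are hypothetical stand-ins for `TelemetryState`, the orchestrator, and the SAL sensor reads, and the poll interval is shortened from 100ms to 1ms so the demo finishes quickly.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Hypothetical, trimmed-down stand-in for `TelemetryState`.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub tick: u64,
    pub cpu_temp_c: u32,
}

/// Runs a toy orchestrator and watchdog.
/// Returns the snapshots the "UI" side received and whether an abort was raised.
pub fn run_pipeline(temps_c: Vec<u32>, limit_c: u32) -> (Vec<Snapshot>, bool) {
    let (tx, rx) = mpsc::channel::<Snapshot>();
    let abort = Arc::new(AtomicBool::new(false));
    let done = Arc::new(AtomicBool::new(false));
    let latest = Arc::new(AtomicU32::new(0));

    // Safety thread: polls the latest reading and raises the abort flag.
    let watchdog = {
        let (abort, done, latest) =
            (Arc::clone(&abort), Arc::clone(&done), Arc::clone(&latest));
        thread::spawn(move || loop {
            // Read `done` first so the final reading is always checked once.
            let finished = done.load(Ordering::SeqCst);
            if latest.load(Ordering::SeqCst) > limit_c {
                abort.store(true, Ordering::SeqCst);
                break;
            }
            if finished {
                break;
            }
            thread::sleep(Duration::from_millis(1));
        })
    };

    // Backend thread: publishes one snapshot per tick until aborted.
    let backend = {
        let (abort, done, latest) =
            (Arc::clone(&abort), Arc::clone(&done), Arc::clone(&latest));
        thread::spawn(move || {
            for (tick, temp) in temps_c.into_iter().enumerate() {
                if abort.load(Ordering::SeqCst) {
                    break;
                }
                latest.store(temp, Ordering::SeqCst);
                let snap = Snapshot { tick: tick as u64, cpu_temp_c: temp };
                if tx.send(snap).is_err() {
                    break; // receiver is gone, nothing left to report to
                }
                thread::sleep(Duration::from_millis(2));
            }
            done.store(true, Ordering::SeqCst);
        })
    };

    // "UI" side: drain the channel until the backend drops its sender.
    let received: Vec<Snapshot> = rx.iter().collect();
    backend.join().expect("backend thread panicked");
    watchdog.join().expect("watchdog thread panicked");
    (received, abort.load(Ordering::SeqCst))
}
```

Calling `run_pipeline(vec![40, 45, 50], 95)` delivers all three snapshots without an abort, while a reading above the limit sets the flag. The real TUI uses `try_recv` inside its event loop rather than blocking on the channel.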
## ⚙️ Development Setup
`ember-tune` is a standard Cargo project. You will need a recent Rust toolchain and common build utilities.
`ember-tune` is a standard Cargo project.
**Prerequisites:**
- `rustup`
- `build-essential` (or equivalent for your distribution)
- `build-essential`
- `libudev-dev`
- `stress-ng` (Required for benchmarking)
```bash
# 1. Clone the repository
# 1. Clone and Build
git clone https://gitea.com/narl/ember-tune.git
cd ember-tune
# 2. Build the release binary
cargo build --release
# 3. Run the test suite (safe, uses a virtual environment)
# This requires no special permissions and does not touch your hardware.
# 2. Run the safe test suite
cargo test
```
**Running:**
Due to its direct hardware access, `ember-tune` requires root privileges.
```bash
# Run a full benchmark and generate optimized configs
# Run a full benchmark
sudo ./target/release/ember-tune
# Run a mock benchmark for UI/logic testing
# Run a mock benchmark for UI testing
sudo ./target/release/ember-tune --mock
```
@@ -35,48 +63,24 @@ sudo ./target/release/ember-tune --mock
## 🤝 Contributing Quirk Data (`hardware_db.toml`)
**This is the most impactful way to contribute.** `ember-tune`'s strength comes from its `assets/hardware_db.toml`, which encodes community knowledge about how to manage specific laptops. If your hardware isn't working perfectly, you can likely fix it by adding a new entry here.
**This is the most impactful way to contribute.** If your hardware isn't working perfectly, add a new entry to `assets/hardware_db.toml`.
The database is composed of four key sections: `conflicts`, `ecosystems`, `quirks`, and `discovery`.
### A. Reporting a Service Conflict
If a background service on your system interferes with `ember-tune`, add it to `[[conflicts]]`.
**Example:** Adding `laptop-mode-tools`.
### Example: Adding a Service Conflict
```toml
[[conflicts]]
id = "laptop_mode_conflict"
services = ["laptop-mode.service"]
contention = "Multiple - I/O schedulers, Power limits"
severity = "Medium"
fix_action = "SuspendService" # Orchestrator will stop/start this service
fix_action = "SuspendService"
help_text = "laptop-mode-tools can override power-related sysfs settings."
```
### B. Adding a New Hardware Ecosystem
If your laptop manufacturer (e.g., Razer) has a unique fan control tool or ACPI platform profile path, define it in `[ecosystems]`.
**Example:** A hypothetical "Razer" ecosystem.
```toml
[ecosystems.razer]
vendor_regex = "Razer"
# Path to the sysfs node that controls performance profiles
profiles_path = "/sys/bus/platform/drivers/razer_acpi/power_mode"
# Map human-readable names to the values the driver expects
policy_map = { Balanced = 0, Boost = 1, Silent = 2 }
```
### C. Defining a Model-Specific Quirk
If a specific laptop model has a bug (like a stuck sensor or incorrect fan reporting), define a `[[quirks]]` entry.
**Example:** A laptop whose fans report 0 RPM even when spinning.
### Example: Defining a Model-Specific Quirk
```toml
[[quirks]]
model_regex = "HP Envy 15-ep.*"
id = "hp_fan_stuck_sensor"
issue = "Fan sensor reports 0 RPM when active."
# The 'action' tells the SAL to use a different method for fan detection.
action = "UseThermalVelocityFallback"
```
After adding your changes, run the test suite and then submit a Pull Request!


@@ -15,7 +15,7 @@ help_text = "TLP and Power-Profiles-Daemon fight over power envelopes. Mask both
[[conflicts]]
id = "thermal_logic_collision"
services = ["thermald.service", "throttled.service"]
services = ["thermald.service", "throttled.service", "lenovo_fix.service", "lenovo-throttling-fix.service"]
contention = "RAPL / MSR / BD-PROCHOT"
severity = "High"
fix_action = "SuspendService"

100
src/agent_analyst/mod.rs Normal file

@@ -0,0 +1,100 @@
//! Heuristic Analysis & Optimization Math (Agent Analyst)
//!
//! This module analyzes raw telemetry data to extract the "Optimal Real-World Settings".
//! It calculates the Silicon Knee, Acoustic/Thermal Matrix (Hysteresis), and
//! generates three distinct hardware states: Silent, Balanced, and Sustained Heavy.
use serde::{Serialize, Deserialize};
use crate::engine::{ThermalProfile, OptimizerEngine};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FanCurvePoint {
pub temp_on: f32,
pub temp_off: f32,
pub pwm_percent: u8,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SystemProfile {
pub name: String,
pub pl1_watts: f32,
pub pl2_watts: f32,
pub fan_curve: Vec<FanCurvePoint>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OptimizationMatrix {
pub silent: SystemProfile,
pub balanced: SystemProfile,
pub performance: SystemProfile,
pub thermal_resistance_kw: f32,
}
pub struct HeuristicAnalyst {
engine: OptimizerEngine,
}
impl HeuristicAnalyst {
pub fn new() -> Self {
Self {
engine: OptimizerEngine::new(5),
}
}
/// Analyzes the raw telemetry to generate the 3 optimal profiles.
pub fn analyze(&self, profile: &ThermalProfile, max_soak_watts: f32) -> OptimizationMatrix {
let r_theta = self.engine.calculate_thermal_resistance(profile);
let silicon_knee = self.engine.find_silicon_knee(profile);
// 1. State A: Silent / Battery (Scientific Passive Equilibrium)
// Objective: Find P where T_core = 60C with fans OFF.
// T_core = T_ambient + (P * R_theta_passive)
// Note: R_theta measured during benchmark was with fans MAX.
// Passive R_theta is typically 2-3x higher.
let r_theta_passive = r_theta * 2.5;
let silent_watts = ((60.0 - profile.ambient_temp) / r_theta_passive.max(0.1)).clamp(5.0, 15.0);
let silent_profile = SystemProfile {
name: "Silent".to_string(),
pl1_watts: silent_watts,
pl2_watts: silent_watts * 1.2,
fan_curve: vec![
FanCurvePoint { temp_on: 65.0, temp_off: 55.0, pwm_percent: 0 },
FanCurvePoint { temp_on: 75.0, temp_off: 65.0, pwm_percent: 30 },
],
};
// 2. State B: Balanced
// The exact calculated Silicon Knee
let balanced_profile = SystemProfile {
name: "Balanced".to_string(),
pl1_watts: silicon_knee,
pl2_watts: silicon_knee * 1.25,
fan_curve: vec![
FanCurvePoint { temp_on: 60.0, temp_off: 55.0, pwm_percent: 0 },
FanCurvePoint { temp_on: 75.0, temp_off: 65.0, pwm_percent: 40 },
FanCurvePoint { temp_on: 85.0, temp_off: 75.0, pwm_percent: 70 },
],
};
// 3. State C: Sustained Heavy
// Based on the max soak watts from Phase 1.
let performance_profile = SystemProfile {
name: "Performance".to_string(),
pl1_watts: max_soak_watts,
pl2_watts: max_soak_watts * 1.3,
fan_curve: vec![
FanCurvePoint { temp_on: 50.0, temp_off: 45.0, pwm_percent: 30 },
FanCurvePoint { temp_on: 70.0, temp_off: 60.0, pwm_percent: 60 },
FanCurvePoint { temp_on: 85.0, temp_off: 75.0, pwm_percent: 100 },
],
};
OptimizationMatrix {
silent: silent_profile,
balanced: balanced_profile,
performance: performance_profile,
thermal_resistance_kw: r_theta,
}
}
}
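
The Silent-profile budget in `analyze` follows from rearranging `T_core = T_ambient + P * R_theta_passive` into `P = (T_target - T_ambient) / R_theta_passive`. The sketch below isolates that arithmetic using the same constants as the code above (60 °C target, 2.5x passive factor, 5 W to 15 W clamp). The function name and constant names are hypothetical.

```rust
/// Target core temperature for the fan-off state, in degrees Celsius.
const TARGET_CORE_C: f32 = 60.0;
/// Assumed ratio between passive and fans-at-max thermal resistance.
const PASSIVE_FACTOR: f32 = 2.5;

/// Estimates the sustained power (W) that holds the core at the target
/// temperature with the fans off, given the ambient temperature and the
/// thermal resistance (K/W) measured with fans at maximum.
pub fn silent_power_budget(ambient_c: f32, r_theta_fans_max: f32) -> f32 {
    // Guard against a zero or near-zero resistance before dividing.
    let r_passive = (r_theta_fans_max * PASSIVE_FACTOR).max(0.1);
    ((TARGET_CORE_C - ambient_c) / r_passive).clamp(5.0, 15.0)
}
```

For example, with a 25 °C ambient and a measured resistance of 1.0 K/W, the passive resistance is 2.5 K/W and the budget is 35 / 2.5 = 14 W. A hotter room or a weaker heatsink pushes the result toward the 5 W floor.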

115
src/agent_integrator/mod.rs Normal file

@@ -0,0 +1,115 @@
//! System Service Integration (Agent Integrator)
//!
//! This module translates the mathematical optimums defined by the Analyst
//! into actionable, real-world Linux/OS service configurations.
//! It generates templates for fan daemons (i8kmon, thinkfan) and handles
//! resolution strategies for overlapping daemons.
use anyhow::Result;
use std::path::Path;
use std::fs;
use crate::agent_analyst::OptimizationMatrix;
pub struct ServiceIntegrator;
impl ServiceIntegrator {
/// Generates and saves an i8kmon configuration based on the balanced profile.
pub fn generate_i8kmon_config(matrix: &OptimizationMatrix, output_path: &Path) -> Result<()> {
let profile = &matrix.balanced;
let mut conf = String::new();
conf.push_str("# Auto-generated by ember-tune Integrator
");
conf.push_str(&format!("# Profile: {}
", profile.name));
for (i, p) in profile.fan_curve.iter().enumerate() {
// i8kmon syntax: set config(state) {left_fan right_fan temp_on temp_off}
// States 0, 1, and 2 correspond to the BIOS fan states (off, low, high)
let state = match p.pwm_percent {
0..=20 => 0,
21..=50 => 1,
51..=100 => 2,
_ => 2,
};
let off = if i == 0 { "-".to_string() } else { format!("{}", p.temp_off) };
conf.push_str(&format!("set config({}) {{{} {} {} {}}}
", i, state, state, p.temp_on, off));
}
fs::write(output_path, conf)?;
Ok(())
}
/// Generates a thinkfan configuration.
pub fn generate_thinkfan_config(matrix: &OptimizationMatrix, output_path: &Path) -> Result<()> {
let profile = &matrix.balanced;
let mut conf = String::new();
conf.push_str("# Auto-generated by ember-tune Integrator
");
conf.push_str("sensors:
- hwmon: /sys/class/hwmon/hwmon0/temp1_input
");
conf.push_str("levels:
");
for (i, p) in profile.fan_curve.iter().enumerate() {
// thinkfan syntax: - [level, temp_down, temp_up]
let level = match p.pwm_percent {
0..=20 => 0,
21..=40 => 1,
41..=60 => 3,
61..=80 => 5,
_ => 7,
};
let down = if i == 0 { 0.0 } else { p.temp_off };
conf.push_str(&format!(" - [{}, {}, {}]
", level, down, p.temp_on));
}
fs::write(output_path, conf)?;
Ok(())
}
/// Generates a resolution checklist/script for daemons.
pub fn generate_conflict_resolution_script(output_path: &Path) -> Result<()> {
let script = r#"#!/bin/bash
# ember-tune Daemon Neutralization Script
# 1. Mask power-profiles-daemon (Prevent ACPI overrides)
systemctl mask power-profiles-daemon
# 2. Filter TLP (Prevent CPU governor fights while keeping PCIe saving)
sed -i 's/^CPU_SCALING_GOVERNOR_ON_AC=.*/CPU_SCALING_GOVERNOR_ON_AC=""/' /etc/tlp.conf
sed -i 's/^CPU_BOOST_ON_AC=.*/CPU_BOOST_ON_AC=""/' /etc/tlp.conf
systemctl restart tlp
# 3. Thermald Delegate (We provide the trips, it handles the rest)
# (Ensure your custom thermal-conf.xml is in /etc/thermald/)
systemctl restart thermald
"#;
fs::write(output_path, script)?;
Ok(())
}
/// Generates a thermald configuration XML.
pub fn generate_thermald_config(matrix: &OptimizationMatrix, output_path: &Path) -> Result<()> {
let profile = &matrix.balanced;
let mut xml = String::new();
xml.push_str("<?xml version=\"1.0\"?>\n<ThermalConfiguration>\n <Platform>\n <Name>ember-tune Balanced</Name>\n <ProductName>Generic</ProductName>\n <Preference>balanced</Preference>\n <ThermalZones>\n <ThermalZone>\n <Type>cpu</Type>\n <TripPoints>\n");
for (i, p) in profile.fan_curve.iter().enumerate() {
xml.push_str(&format!(" <TripPoint>\n <SensorType>cpu</SensorType>\n <Temperature>{}</Temperature>\n <Type>Passive</Type>\n <ControlId>{}</ControlId>\n </TripPoint>\n", p.temp_on * 1000.0, i));
}
xml.push_str(" </TripPoints>\n </ThermalZone>\n </ThermalZones>\n </Platform>\n</ThermalConfiguration>\n");
fs::write(output_path, xml)?;
Ok(())
}
}
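
The PWM-to-level mapping in `generate_thinkfan_config` is easier to check in isolation. The sketch below reproduces it as a pure function and renders the `levels:` section as a string. `CurvePoint`, `pwm_to_level`, and `render_levels` are hypothetical names, and the two-space list indentation is an assumption. Whether thinkfan accepts the resulting file is not verified here.

```rust
/// Hypothetical mirror of `FanCurvePoint` from the analyst module.
pub struct CurvePoint {
    pub temp_on: f32,
    pub temp_off: f32,
    pub pwm_percent: u8,
}

/// Maps a PWM percentage to a fan level, using the same buckets as
/// `generate_thinkfan_config`.
pub fn pwm_to_level(pwm_percent: u8) -> u8 {
    match pwm_percent {
        0..=20 => 0,
        21..=40 => 1,
        41..=60 => 3,
        61..=80 => 5,
        _ => 7,
    }
}

/// Renders one `[level, temp_down, temp_up]` entry per curve point.
pub fn render_levels(curve: &[CurvePoint]) -> String {
    let mut out = String::from("levels:\n");
    for (i, p) in curve.iter().enumerate() {
        // The first level has no lower bound, so it starts at 0.
        let down = if i == 0 { 0.0 } else { p.temp_off };
        out.push_str(&format!(
            "  - [{}, {}, {}]\n",
            pwm_to_level(p.pwm_percent),
            down,
            p.temp_on
        ));
    }
    out
}
```

Returning a `String` instead of writing to disk keeps the formatting testable without touching the filesystem. The caller can pass the result to `fs::write`.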


@@ -7,6 +7,7 @@
use serde::{Serialize, Deserialize};
use std::collections::HashMap;
use std::path::PathBuf;
use tracing::warn;
pub mod formatters;
@@ -46,6 +47,8 @@ pub struct OptimizationResult {
pub is_partial: bool,
/// A map of configuration files that were written to.
pub config_paths: HashMap<String, PathBuf>,
/// The comprehensive optimization matrix (Silent, Balanced, Performance).
pub optimization_matrix: Option<crate::agent_analyst::OptimizationMatrix>,
}
/// Pure mathematics engine for thermal optimization.
@@ -180,6 +183,18 @@ impl OptimizerEngine {
}
}
let best_pl = if max_score > f32::MIN {
best_pl
} else {
profile.points.last().map(|p| p.power_w).unwrap_or(15.0)
};
// Safety floor: a PL1 below 5 W would cripple system performance, so fall back to 15 W.
if best_pl < 5.0 {
warn!("Heuristic suggested dangerously low PL1 ({:.1}W). Falling back to 15W safety floor.", best_pl);
return 15.0;
}
best_pl
}
}
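
The floor logic above uses two different numbers: a 5 W threshold that triggers the guard and a 15 W value that replaces the rejected result. A minimal sketch that names both, assuming hypothetical constant and function names:

```rust
/// Below this PL1 the heuristic result is treated as implausible.
pub const MIN_SANE_PL1_W: f32 = 5.0;
/// Conservative replacement used when the heuristic result is rejected.
pub const FALLBACK_PL1_W: f32 = 15.0;

/// Returns the candidate power limit unless it falls below the sanity
/// threshold, in which case the fallback is returned instead.
/// Note: a NaN candidate passes through unchanged, as in the code above.
pub fn apply_safety_floor(candidate_w: f32) -> f32 {
    if candidate_w < MIN_SANE_PL1_W {
        FALLBACK_PL1_W
    } else {
        candidate_w
    }
}
```

A candidate of exactly 5 W is accepted, so the guard only fires on values strictly below the threshold.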

0
src/engine/profiles.rs Normal file

@@ -12,3 +12,5 @@ pub mod ui;
pub mod engine;
pub mod cli;
pub mod sys;
pub mod agent_analyst;
pub mod agent_integrator;


@@ -1,60 +1,145 @@
//! Defines the `Workload` trait for generating synthetic CPU/GPU load.
//! Load generation and performance measurement subsystem.
use anyhow::Result;
use std::process::Child;
use anyhow::{Result, Context, anyhow};
use std::process::{Child, Command, Stdio};
use std::time::{Duration, Instant};
use std::thread;
use std::io::{BufRead, BufReader};
use std::sync::{Arc, Mutex};
use serde::{Deserialize, Serialize};
/// A trait for objects that can generate a measurable system load.
pub trait Workload: Send + Sync {
/// Starts the workload with the specified number of threads and load percentage.
///
/// # Errors
/// Returns an error if the underlying stress test process fails to spawn.
fn start(&mut self, threads: usize, load_percent: usize) -> Result<()>;
/// Stops the workload gracefully.
///
/// # Errors
/// This method should aim to not fail, but may return an error if
/// forcefully killing the child process fails.
fn stop(&mut self) -> Result<()>;
/// Returns the current throughput of the workload (e.g., ops/sec).
///
/// # Errors
/// Returns an error if throughput cannot be measured.
fn get_throughput(&self) -> Result<f64>;
/// Standardized telemetry returned by any workload implementation.
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct WorkloadMetrics {
/// Primary performance heuristic (e.g., Bogo Ops/s)
pub primary_ops_per_sec: f64,
/// Time elapsed since the workload started
pub elapsed_time: Duration,
}
/// An implementation of `Workload` that uses the `stress-ng` utility.
/// Defines which subsystem to isolate during stress testing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StressVector {
CpuMatrix,
MemoryBandwidth,
Mixed,
}
/// A normalized profile defining the intensity and constraints of the workload.
#[derive(Debug, Clone)]
pub struct IntensityProfile {
pub threads: usize,
pub load_percentage: u8,
pub vector: StressVector,
}
/// The replaceable interface for load generation and performance measurement.
pub trait Workload: Send + Sync {
/// Sets up prerequisites (e.g., binary checks).
fn initialize(&mut self) -> Result<()>;
/// Executes the load asynchronously.
fn run_workload(&mut self, duration: Duration, profile: IntensityProfile) -> Result<()>;
/// Returns the current standardized telemetry object.
fn get_current_metrics(&self) -> Result<WorkloadMetrics>;
/// Gracefully and forcefully terminates the workload.
fn stop_workload(&mut self) -> Result<()>;
}
/// Implementation of the Benchmarking Interface using stress-ng matrix stressors.
pub struct StressNg {
child: Option<Child>,
start_time: Option<Instant>,
latest_metrics: Arc<Mutex<WorkloadMetrics>>,
}
impl StressNg {
pub fn new() -> Self {
Self { child: None }
Self {
child: None,
start_time: None,
latest_metrics: Arc::new(Mutex::new(WorkloadMetrics::default())),
}
}
}
impl Workload for StressNg {
fn start(&mut self, threads: usize, load_percent: usize) -> Result<()> {
self.stop()?;
fn initialize(&mut self) -> Result<()> {
let status = Command::new("stress-ng")
.arg("--version")
.stdout(Stdio::null())
.stderr(Stdio::null())
.status()
.context("stress-ng binary not found in PATH. Please install it.")?;
let child = std::process::Command::new("stress-ng")
.args([
"--cpu", &threads.to_string(),
"--cpu-load", &load_percent.to_string(),
"--quiet"
])
.spawn()?;
if !status.success() {
return Err(anyhow!("stress-ng failed to initialize"));
}
Ok(())
}
fn run_workload(&mut self, duration: Duration, profile: IntensityProfile) -> Result<()> {
self.stop_workload()?;
let threads = profile.threads.to_string();
let timeout = format!("{}s", duration.as_secs());
let load = profile.load_percentage.to_string();
let mut cmd = Command::new("stress-ng");
cmd.args(["--timeout", &timeout, "--metrics", "--quiet", "--cpu-load", &load]);
match profile.vector {
StressVector::CpuMatrix => {
cmd.args(["--matrix", &threads]);
},
StressVector::MemoryBandwidth => {
cmd.args(["--vm", &threads, "--vm-bytes", "80%"]);
},
StressVector::Mixed => {
let half = (profile.threads / 2).max(1).to_string();
cmd.args(["--matrix", &half, "--vm", &half, "--vm-bytes", "40%"]);
}
}
let mut child = cmd.stderr(Stdio::piped()).spawn().context("Failed to spawn stress-ng")?;
self.start_time = Some(Instant::now());
// Spawn metrics parser thread
let metrics_ref = Arc::clone(&self.latest_metrics);
let stderr = child.stderr.take().expect("Failed to capture stderr");
thread::spawn(move || {
let reader = BufReader::new(stderr);
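// `map_while` stops at the first read error; `flatten` could spin forever on a persistent one.
for line in reader.lines().map_while(Result::ok) {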
// Parse stress-ng metrics line
if line.contains("matrix") || line.contains("vm") {
let parts: Vec<&str> = line.split_whitespace().collect();
if let Some(val) = parts.last() {
if let Ok(ops) = val.parse::<f64>() {
let mut m = metrics_ref.lock().unwrap();
m.primary_ops_per_sec = ops;
}
}
}
}
});
self.child = Some(child);
Ok(())
}
fn stop(&mut self) -> Result<()> {
fn get_current_metrics(&self) -> Result<WorkloadMetrics> {
let mut m = self.latest_metrics.lock().unwrap().clone();
if let Some(start) = self.start_time {
m.elapsed_time = start.elapsed();
}
Ok(m)
}
fn stop_workload(&mut self) -> Result<()> {
if let Some(mut child) = self.child.take() {
#[cfg(unix)]
{
@@ -77,19 +162,13 @@ impl Workload for StressNg {
let _ = child.wait();
}
}
self.start_time = None;
Ok(())
}
/// Returns the current throughput of the workload (e.g., ops/sec).
///
/// This is currently a stub and does not parse `stress-ng` output.
fn get_throughput(&self) -> Result<f64> {
Ok(0.0)
}
}
impl Drop for StressNg {
fn drop(&mut self) {
let _ = self.stop();
let _ = self.stop_workload();
}
}
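
The metrics thread in `run_workload` takes the last whitespace-separated token of any line mentioning `matrix` or `vm` and tries to parse it as a number. Pulling that step into a pure function makes it testable without spawning `stress-ng`. The function name is hypothetical, and the sample lines in the usage note are made up for illustration. The real `stress-ng --metrics` column layout should be checked against its manual.

```rust
/// Extracts a throughput value from one output line, using the same rule
/// as the parser thread: the line must mention a known stressor and its
/// last token must parse as a float.
pub fn parse_trailing_ops(line: &str) -> Option<f64> {
    if !(line.contains("matrix") || line.contains("vm")) {
        return None;
    }
    line.split_whitespace().last()?.parse::<f64>().ok()
}
```

With an illustrative line such as `"metrics: matrix 1000 10.0 166.67"` this yields `Some(166.67)`. A line that ends in a non-numeric token, or that names no stressor, yields `None`. Because `contains("vm")` is a plain substring match, unrelated lines containing those two letters would also be considered, so a stricter match may be worth adding.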


@@ -8,7 +8,8 @@ use std::sync::atomic::{AtomicBool, Ordering};
use std::io;
use clap::Parser;
use tracing::{info, debug, error};
use tracing::error;
use tracing_subscriber::{fmt, prelude::*, EnvFilter};
use crossterm::{
event::{self, Event, KeyCode},
@@ -68,27 +69,24 @@ fn print_summary_report(result: &OptimizationResult) {
println!();
}
fn setup_logging(verbose: bool) -> tracing_appender::non_blocking::WorkerGuard {
let file_appender = tracing_appender::rolling::never("/var/log", "ember-tune.log");
let (non_blocking, guard) = tracing_appender::non_blocking(file_appender);
fn main() -> Result<()> {
let args = Cli::parse();
let level = if verbose { tracing::Level::DEBUG } else { tracing::Level::INFO };
// 1. Logging Setup (File-only by default, Stdout during Audit)
let file_appender = tracing_appender::rolling::never(".", "ember-tune.log");
let (non_blocking, _guard) = tracing_appender::non_blocking(file_appender);
let level = if args.verbose { "debug" } else { "info" };
tracing_subscriber::fmt()
.with_max_level(level)
let file_layer = fmt::layer()
.with_writer(non_blocking)
.with_ansi(false)
.with_ansi(false);
// We use a simple println for the audit to avoid complex reload handles
tracing_subscriber::registry()
.with(EnvFilter::new(level))
.with(file_layer)
.init();
guard
}
fn main() -> Result<()> {
// 1. Diagnostics & CLI Initialization
let args = Cli::parse();
let _log_guard = setup_logging(args.verbose);
// Set panic hook to restore terminal state
std::panic::set_hook(Box::new(|panic_info| {
let _ = disable_raw_mode();
let mut stdout = io::stdout();
@@ -99,11 +97,10 @@ fn main() -> Result<()> {
eprintln!("----------------------------------------\n");
}));
info!("ember-tune starting with args: {:?}", args);
println!("{}", console::style("─── Pre-flight System Audit ───").bold().cyan());
let ctx = ember_tune_rs::sal::traits::EnvironmentCtx::production();
// 2. Platform Detection & Audit
let (sal_box, facts): (Box<dyn PlatformSal>, SystemFactSheet) = if args.mock {
(Box::new(MockSal::new()), SystemFactSheet::default())
} else {
@@ -111,9 +108,7 @@ fn main() -> Result<()> {
};
let sal: Arc<dyn PlatformSal> = sal_box.into();
println!("{}", console::style("─── Pre-flight System Audit ───").bold().cyan());
let mut audit_failures = Vec::new();
for step in sal.audit() {
print!(" Checking {:<40} ", step.description);
io::Write::flush(&mut io::stdout()).into_diagnostic()?;
@@ -137,15 +132,14 @@ fn main() -> Result<()> {
return Ok(());
}
// 3. Terminal Setup
// Entering TUI Mode - STDOUT is now strictly for Ratatui
enable_raw_mode().into_diagnostic()?;
let mut stdout = io::stdout();
execute!(stdout, EnterAlternateScreen).into_diagnostic()?;
execute!(stdout, EnterAlternateScreen, crossterm::cursor::Hide).into_diagnostic()?;
let backend_stdout = io::stdout();
let backend_term = CrosstermBackend::new(backend_stdout);
let mut terminal = Terminal::new(backend_term).into_diagnostic()?;
// 4. State & Communication Setup
let running = Arc::new(AtomicBool::new(true));
let r = running.clone();
@@ -158,9 +152,9 @@ fn main() -> Result<()> {
r.store(false, Ordering::SeqCst);
}).expect("Error setting Ctrl-C handler");
// 5. Spawn Backend Orchestrator
let sal_backend = sal.clone();
let facts_backend = facts.clone();
let config_out = args.config_out.clone();
let backend_handle = thread::spawn(move || {
let workload = Box::new(StressNg::new());
let mut orchestrator = BenchmarkOrchestrator::new(
@@ -169,14 +163,14 @@ fn main() -> Result<()> {
workload,
telemetry_tx,
command_rx,
config_out,
);
orchestrator.run()
});
// 6. Frontend Event Loop
let mut ui_state = DashboardState::new();
let mut last_telemetry = TelemetryState {
cpu_model: "Loading...".to_string(),
cpu_model: facts.model.clone(),
total_ram_gb: 0,
tick: 0,
cpu_temp: 0.0,
@@ -187,6 +181,7 @@ fn main() -> Result<()> {
pl1_limit: 0.0,
pl2_limit: 0.0,
fan_tier: "auto".to_string(),
is_throttling: false,
phase: BenchmarkPhase::Auditing,
history_watts: Vec::new(),
history_temp: Vec::new(),
@@ -224,7 +219,6 @@ fn main() -> Result<()> {
while let Ok(new_state) = telemetry_rx.try_recv() {
if let Some(log) = &new_state.log_event {
ui_state.add_log(log.clone());
debug!("Backend Log: {}", log);
} else {
ui_state.update(&new_state);
last_telemetry = new_state;
@@ -235,20 +229,11 @@ fn main() -> Result<()> {
if backend_handle.is_finished() { break; }
}
// 7. Terminal Restoration
let _ = disable_raw_mode();
let _ = execute!(terminal.backend_mut(), LeaveAlternateScreen);
let _ = terminal.show_cursor();
let _ = execute!(terminal.backend_mut(), LeaveAlternateScreen, crossterm::cursor::Show);
// 8. Final Report & Hardware Restoration
let join_res = backend_handle.join();
// Explicit hardware restoration
info!("Restoring hardware state...");
if let Err(e) = sal.restore() {
error!("Failed to restore hardware state: {}", e);
}
match join_res {
Ok(Ok(result)) => {
print_summary_report(&result);
@@ -273,6 +258,5 @@ fn main() -> Result<()> {
}
}
info!("ember-tune exited gracefully.");
Ok(())
}


@@ -35,6 +35,7 @@ pub struct TelemetryState {
pub pl1_limit: f32,
pub pl2_limit: f32,
pub fan_tier: String,
pub is_throttling: bool,
pub phase: BenchmarkPhase,
// --- High-res History ---


@@ -3,7 +3,8 @@
//! It manages hardware interactions through the [PlatformSal], generates stress
//! using a [Workload], and feeds telemetry to the frontend via MPSC channels.
use anyhow::{Result, Context};
use anyhow::{Result, Context, bail};
use tracing::{info, warn, error};
use std::sync::mpsc;
use std::time::{Duration, Instant};
use std::thread;
@@ -12,17 +13,32 @@ use sysinfo::System;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::path::PathBuf;
use std::cell::Cell;
use crate::sal::traits::{PlatformSal, SafetyStatus};
use crate::sal::traits::{PlatformSal, SensorBus};
use crate::sal::heuristic::discovery::SystemFactSheet;
use crate::load::Workload;
use crate::sal::safety::{HardwareStateGuard, PowerLimitWatts, ThermalWatchdog};
use crate::load::{Workload, IntensityProfile, StressVector};
use crate::mediator::{TelemetryState, UiCommand, BenchmarkPhase};
use crate::engine::{OptimizerEngine, ThermalProfile, ThermalPoint, OptimizationResult};
use crate::agent_analyst::HeuristicAnalyst;
/// Represents the possible states of the benchmark orchestrator.
pub enum OrchestratorState {
/// Performing pre-flight checks and snapshotting.
PreFlight,
/// Acquiring idle baseline telemetry.
IdleBaseline,
/// Actively sweeping through power limits.
StressSweep { current_wattage: f32 },
/// Allowing hardware to cool down before releasing the guard.
Cooldown,
/// Benchmark complete, generating final results.
Finalizing,
}
/// The central state machine responsible for coordinating the thermal benchmark.
///
/// It manages hardware interactions through the [PlatformSal], generates stress
/// using a [Workload], and feeds telemetry to the frontend via MPSC channels.
pub struct BenchmarkOrchestrator {
/// Injected hardware abstraction layer.
sal: Arc<dyn PlatformSal>,
@@ -34,12 +50,19 @@ pub struct BenchmarkOrchestrator {
telemetry_tx: mpsc::Sender<TelemetryState>,
/// Channel for receiving commands from the UI.
command_rx: mpsc::Receiver<UiCommand>,
/// Current phase of the benchmark.
phase: BenchmarkPhase,
/// Current phase reported to the UI.
ui_phase: BenchmarkPhase,
/// Accumulated thermal data points.
profile: ThermalProfile,
/// Mathematics engine for data smoothing and optimization.
engine: OptimizerEngine,
/// CLI override for the configuration output path.
optional_config_out: Option<PathBuf>,
/// The safety membrane protecting the system.
safeguard: Option<HardwareStateGuard>,
/// Active thermal watchdog.
watchdog: Option<ThermalWatchdog>,
/// Sliding window of power readings (Watts).
history_watts: VecDeque<f32>,
@@ -67,6 +90,7 @@ impl BenchmarkOrchestrator {
workload: Box<dyn Workload>,
telemetry_tx: mpsc::Sender<TelemetryState>,
command_rx: mpsc::Receiver<UiCommand>,
optional_config_out: Option<PathBuf>,
) -> Self {
let mut sys = System::new_all();
sys.refresh_all();
@@ -82,7 +106,7 @@ impl BenchmarkOrchestrator {
workload,
telemetry_tx,
command_rx,
phase: BenchmarkPhase::Auditing,
ui_phase: BenchmarkPhase::Auditing,
profile: ThermalProfile::default(),
engine: OptimizerEngine::new(5),
history_watts: VecDeque::with_capacity(120),
@@ -92,244 +116,245 @@ impl BenchmarkOrchestrator {
total_ram_gb,
emergency_abort: Arc::new(AtomicBool::new(false)),
emergency_reason: Arc::new(Mutex::new(None)),
optional_config_out,
safeguard: None,
watchdog: None,
}
}
/// Executes the full benchmark sequence.
///
/// This method guarantees that [crate::sal::traits::EnvironmentGuard::restore] and [Workload::stop_workload]
/// are called regardless of whether the benchmark succeeds or fails.
pub fn run(&mut self) -> Result<OptimizationResult> {
self.log("Starting ember-tune Benchmark Sequence.")?;
// Immediate Priming
let _ = self.sal.get_temp();
let _ = self.sal.get_power_w();
let _ = self.sal.get_fan_rpms();
let _watchdog_handle = self.spawn_watchdog_monitor();
info!("Orchestrator: Initializing Project Iron-Ember lifecycle.");
// Spawn safety watchdog immediately
let watchdog = ThermalWatchdog::spawn(self.sal.clone(), self.emergency_abort.clone());
self.watchdog = Some(watchdog);
let result = self.execute_benchmark();
self.log("Benchmark sequence finished. Restoring hardware defaults...")?;
let _ = self.workload.stop();
if let Err(e) = self.sal.restore() {
anyhow::bail!("CRITICAL: Failed to restore hardware state: {}", e);
if let Err(ref e) = result {
error!("Benchmark Lifecycle Failure: {}", e);
let _ = self.log(&format!("⚠ FAILURE: {}", e));
}
self.log("✓ Hardware state restored.")?;
// --- MANDATORY RAII CLEANUP ---
info!("Benchmark sequence complete. Releasing safeguards...");
let _ = self.workload.stop_workload();
if let Some(mut sg) = self.safeguard.take() {
if let Err(e) = sg.release() {
error!("CRITICAL: State restoration failure: {}", e);
}
}
info!("✓ Hardware state restored to pre-flight defaults.");
result
}
/// Internal execution logic for the benchmark phases.
fn execute_benchmark(&mut self) -> Result<OptimizationResult> {
let bench_cfg = self.facts.bench_config.clone().context("Benchmarking config missing in facts")?;
let bench_cfg = self.facts.bench_config.clone().context("Benchmarking configuration missing.")?;
self.phase = BenchmarkPhase::Auditing;
// 1. Pre-Flight Phase
self.ui_phase = BenchmarkPhase::Auditing;
self.log("Phase: Pre-Flight Auditing & Sterilization")?;
// Snapshot and neutralise Brawl Matrix
let mut target_files = self.facts.rapl_paths.iter()
.map(|p| p.join("constraint_0_power_limit_uw"))
.collect::<Vec<_>>();
target_files.extend(self.facts.rapl_paths.iter().map(|p| p.join("constraint_1_power_limit_uw")));
if let Some(tp) = self.facts.paths.configs.get("throttled") {
target_files.push(tp.clone());
}
let sg = HardwareStateGuard::acquire(&target_files, &self.facts.conflict_services)?;
self.safeguard = Some(sg);
// Run auditor
for step in self.sal.audit() {
if let Err(e) = step.outcome {
return Err(anyhow::anyhow!("Audit failed ({}): {:?}", step.description, e));
}
}
self.log("Suppressing background services (tlp, thermald)...")?;
self.sal.suppress().context("Failed to suppress background services")?;
self.workload.initialize().context("Failed to initialize load generator.")?;
self.phase = BenchmarkPhase::IdleCalibration;
self.log(&format!("Phase 1: Recording Idle Baseline ({}s)...", bench_cfg.idle_duration_s))?;
let tick = Cell::new(0u64);
// 2. Idle Baseline Phase
self.ui_phase = BenchmarkPhase::IdleCalibration;
self.log(&format!("Phase: Recording Idle Baseline ({}s)", bench_cfg.idle_duration_s))?;
// Wait for fan spin-up
self.sal.set_fan_mode("auto")?;
let mut idle_temps = Vec::new();
let start = Instant::now();
let mut tick = 0;
while start.elapsed() < Duration::from_secs(bench_cfg.idle_duration_s) {
self.check_abort()?;
self.send_telemetry(tick)?;
self.check_safety_abort()?;
self.send_telemetry(tick.get())?;
idle_temps.push(self.sal.get_temp().unwrap_or(0.0));
tick += 1;
tick.set(tick.get() + 1);
thread::sleep(Duration::from_millis(500));
}
self.profile.ambient_temp = self.engine.smooth(&idle_temps).last().cloned().unwrap_or(0.0);
self.log(&format!("✓ Idle Baseline: {:.1}°C", self.profile.ambient_temp))?;
self.phase = BenchmarkPhase::StressTesting;
self.log("Phase 2: Starting Synthetic Stress Matrix.")?;
// 3. Stress Sweep Phase
self.ui_phase = BenchmarkPhase::StressTesting;
self.log("Phase: Synthetic Stress Matrix (Gradual Ramp)")?;
// Ensure fans are ramped to MAX before load
self.log("Metrology: Locking fans to MAX...")?;
self.sal.set_fan_mode("max")?;
let fan_lock_start = Instant::now();
loop {
let fans = self.sal.get_fan_rpms().unwrap_or_default();
let max_rpm = fans.iter().cloned().max().unwrap_or(0);
if max_rpm >= 3000 || fan_lock_start.elapsed() > Duration::from_secs(15) {
break;
}
thread::sleep(Duration::from_millis(500));
self.send_telemetry(tick.get())?;
tick.set(tick.get() + 1);
}
let steps = bench_cfg.power_steps_watts.clone();
for &pl in &steps {
self.log(&format!("Testing PL1 = {:.0}W...", pl))?;
self.sal.set_sustained_power_limit(pl)?;
self.sal.set_burst_power_limit(pl + 5.0)?;
let physical_threads = num_cpus::get_physical();
let mut previous_ops = 0.0;
self.workload.start(num_cpus::get(), 100)?;
for &watts in &bench_cfg.power_steps_watts {
self.check_safety_abort()?;
self.log(&format!("Testing PL1 = {:.0}W", watts))?;
// Apply limits safely
let pl1 = PowerLimitWatts::try_new(watts)?;
let pl2 = PowerLimitWatts::try_new(watts + 5.0)?;
self.sal.set_sustained_power_limit(pl1)?;
self.sal.set_burst_power_limit(pl2)?;
// Start workload
self.workload.run_workload(
Duration::from_secs(bench_cfg.stress_duration_max_s),
IntensityProfile { threads: physical_threads, load_percentage: 100, vector: StressVector::CpuMatrix }
)?;
let step_start = Instant::now();
let mut step_temps = VecDeque::with_capacity(30);
let mut previous_step_temp = self.sal.get_temp().unwrap_or(0.0);
// Equilibrium Gating
while step_start.elapsed() < Duration::from_secs(bench_cfg.stress_duration_max_s) {
self.check_abort()?;
self.check_safety_abort()?;
let t = self.sal.get_temp().unwrap_or(0.0);
let dt_dt = (t - previous_step_temp) / 0.5;
previous_step_temp = t;
// Redundant safety check during step
if t > 94.0 || dt_dt > 5.0 {
warn!("Thermal Spike Detected! Aborting current step.");
break;
}
step_temps.push_back(t);
if step_temps.len() > 10 { step_temps.pop_front(); }
self.send_telemetry(tick)?;
tick += 1;
self.send_telemetry(tick.get())?;
tick.set(tick.get() + 1);
if step_start.elapsed() > Duration::from_secs(bench_cfg.stress_duration_min_s) && step_temps.len() == 10 {
let min = step_temps.iter().fold(f32::MAX, |a, &b| a.min(b));
let max = step_temps.iter().fold(f32::MIN, |a, &b| a.max(b));
if (max - min) < 0.5 {
self.log(&format!(" Equilibrium reached at {:.1}°C", t))?;
info!("Equilibrium reached at {:.1}°C", t);
break;
}
}
thread::sleep(Duration::from_millis(500));
}
let avg_p = self.sal.get_power_w().unwrap_or(0.0);
let avg_t = self.sal.get_temp().unwrap_or(0.0);
let avg_f = self.sal.get_freq_mhz().unwrap_or(0.0);
let fans = self.sal.get_fan_rpms().unwrap_or_default();
let primary_fan = fans.first().cloned().unwrap_or(0);
let tp = self.workload.get_throughput().unwrap_or(0.0);
// Record data point
let metrics = self.workload.get_current_metrics().unwrap_or_default();
self.profile.points.push(ThermalPoint {
power_w: avg_p,
temp_c: avg_t,
freq_mhz: avg_f,
fan_rpm: primary_fan,
throughput: tp,
power_w: self.sal.get_power_w().unwrap_or(watts),
temp_c: self.sal.get_temp().unwrap_or(0.0),
freq_mhz: self.sal.get_freq_mhz().unwrap_or(0.0),
fan_rpm: self.sal.get_fan_rpms().unwrap_or_default().first().cloned().unwrap_or(0),
throughput: metrics.primary_ops_per_sec,
});
self.workload.stop()?;
self.log(&format!(" Step complete. Cooling down for {}s...", bench_cfg.cool_down_s))?;
self.workload.stop_workload()?;
// Performance Halt Condition
if previous_ops > 0.0 {
let gain = ((metrics.primary_ops_per_sec - previous_ops) / previous_ops) * 100.0;
if gain < 1.0 {
self.log("Diminishing returns reached. Stopping sweep.")?;
break;
}
}
previous_ops = metrics.primary_ops_per_sec;
self.log(&format!("Cooling down ({}s)...", bench_cfg.cool_down_s))?;
thread::sleep(Duration::from_secs(bench_cfg.cool_down_s));
}
self.phase = BenchmarkPhase::PhysicalModeling;
self.log("Phase 3: Calculating Silicon Physical Sweet Spot...")?;
// 4. Physical Modeling Phase
self.ui_phase = BenchmarkPhase::PhysicalModeling;
self.log("Phase: Silicon Physical Sweet Spot Calculation")?;
let analyst = HeuristicAnalyst::new();
let matrix = analyst.analyze(&self.profile, self.profile.points.last().map(|p| p.power_w).unwrap_or(15.0));
let mut res = self.generate_result(false);
res.optimization_matrix = Some(matrix.clone());
self.log(&format!("✓ Thermal Resistance (Rθ): {:.3} K/W", res.thermal_resistance_kw))?;
self.log(&format!("✓ Silicon Knee Found: {:.1} W", res.silicon_knee_watts))?;
info!("Identification complete. Knee: {:.1}W, Rθ: {:.3} K/W", res.silicon_knee_watts, res.thermal_resistance_kw);
thread::sleep(Duration::from_secs(3));
// 5. Finalizing Phase
self.ui_phase = BenchmarkPhase::Finalizing;
self.log("Phase: Generation of Optimized Configuration Sets")?;
self.phase = BenchmarkPhase::Finalizing;
self.log("Benchmark sequence complete. Generating configurations...")?;
let throttled_path = self.optional_config_out.clone()
.or_else(|| self.facts.paths.configs.get("throttled").cloned());
let config = crate::engine::formatters::throttled::ThrottledConfig {
pl1_limit: res.silicon_knee_watts,
pl2_limit: res.recommended_pl2,
trip_temp: res.max_temp_c.max(95.0),
};
if let Some(throttled_path) = self.facts.paths.configs.get("throttled") {
crate::engine::formatters::throttled::ThrottledTranslator::save(throttled_path, &config)?;
self.log(&format!("✓ Saved '{}' (merged).", throttled_path.display()))?;
res.config_paths.insert("throttled".to_string(), throttled_path.clone());
}
if let Some(i8k_path) = self.facts.paths.configs.get("i8kmon") {
let i8k_config = crate::engine::formatters::i8kmon::I8kmonConfig {
t_ambient: self.profile.ambient_temp,
t_max_fan: res.max_temp_c - 5.0,
thermal_resistance_kw: res.thermal_resistance_kw,
if let Some(path) = throttled_path {
let config = crate::engine::formatters::throttled::ThrottledConfig {
pl1_limit: res.silicon_knee_watts,
pl2_limit: res.recommended_pl2,
trip_temp: res.max_temp_c.max(90.0),
};
crate::engine::formatters::i8kmon::I8kmonTranslator::save(i8k_path, &i8k_config)?;
self.log(&format!("✓ Saved '{}'.", i8k_path.display()))?;
res.config_paths.insert("i8kmon".to_string(), i8k_path.clone());
crate::engine::formatters::throttled::ThrottledTranslator::save(&path, &config)?;
self.log(&format!("✓ Saved Throttled profile to {}", path.display()))?;
res.config_paths.insert("throttled".to_string(), path);
}
Ok(res)
}
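The equilibrium gate inside the stress loop above (a 10-sample window whose max-min spread must fall below 0.5 °C) can be isolated into a small helper. The following is only a sketch under stated assumptions: `EquilibriumGate` and its method names are hypothetical and do not appear in this change, and the minimum-duration check (`stress_duration_min_s`) is left to the caller.

```rust
use std::collections::VecDeque;

/// Rolling-window equilibrium detector (hypothetical helper, not part of the diff).
pub struct EquilibriumGate {
    window: VecDeque<f32>,
    capacity: usize,
    tolerance_c: f32,
}

impl EquilibriumGate {
    /// `capacity` is the number of samples to compare (10 in the loop above),
    /// `tolerance_c` the maximum allowed spread in degrees Celsius (0.5 above).
    pub fn new(capacity: usize, tolerance_c: f32) -> Self {
        assert!(capacity > 0, "window must hold at least one sample");
        Self {
            window: VecDeque::with_capacity(capacity),
            capacity,
            tolerance_c,
        }
    }

    /// Records one sample and returns true once the window is full
    /// and the spread between its hottest and coolest sample is below tolerance.
    pub fn push(&mut self, temp_c: f32) -> bool {
        if self.window.len() == self.capacity {
            self.window.pop_front(); // drop the oldest sample
        }
        self.window.push_back(temp_c);
        if self.window.len() < self.capacity {
            return false; // not enough history yet
        }
        let min = self.window.iter().copied().fold(f32::INFINITY, f32::min);
        let max = self.window.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        (max - min) < self.tolerance_c
    }
}
```

With the values used in the loop, a caller would construct `EquilibriumGate::new(10, 0.5)` once per power step and break out of the step when `push` returns true and the minimum step duration has elapsed.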
/// Spawns a concurrent monitor that polls safety sensors every 100ms.
fn spawn_watchdog_monitor(&self) -> thread::JoinHandle<()> {
let abort = self.emergency_abort.clone();
let reason_store = self.emergency_reason.clone();
let sal = self.sal.clone();
let tx = self.telemetry_tx.clone();
thread::spawn(move || {
while !abort.load(Ordering::SeqCst) {
let status = sal.get_safety_status();
match status {
Ok(SafetyStatus::EmergencyAbort(reason)) => {
*reason_store.lock().unwrap() = Some(reason.clone());
abort.store(true, Ordering::SeqCst);
break;
}
Ok(SafetyStatus::Warning(msg)) | Ok(SafetyStatus::Critical(msg)) => {
let state = TelemetryState {
cpu_model: String::new(),
total_ram_gb: 0,
tick: 0,
cpu_temp: 0.0,
power_w: 0.0,
current_freq: 0.0,
fans: Vec::new(),
governor: String::new(),
pl1_limit: 0.0,
pl2_limit: 0.0,
fan_tier: String::new(),
phase: BenchmarkPhase::StressTesting,
history_watts: Vec::new(),
history_temp: Vec::new(),
history_mhz: Vec::new(),
log_event: Some(format!("WATCHDOG: {}", msg)),
metadata: std::collections::HashMap::new(),
is_emergency: false,
emergency_reason: None,
};
let _ = tx.send(state);
}
Ok(SafetyStatus::Nominal) => {}
Err(e) => {
*reason_store.lock().unwrap() = Some(format!("Watchdog Sensor Failure: {}", e));
abort.store(true, Ordering::SeqCst);
break;
}
}
thread::sleep(Duration::from_millis(100));
}
})
}
/// Generates the final [OptimizationResult] based on current measurements.
pub fn generate_result(&self, is_partial: bool) -> OptimizationResult {
let r_theta = self.engine.calculate_thermal_resistance(&self.profile);
let knee = self.engine.find_silicon_knee(&self.profile);
let max_t = self.engine.get_max_temp(&self.profile);
OptimizationResult {
profile: self.profile.clone(),
silicon_knee_watts: knee,
thermal_resistance_kw: r_theta,
recommended_pl1: knee,
recommended_pl2: knee * 1.25,
max_temp_c: max_t,
is_partial,
config_paths: std::collections::HashMap::new(),
}
}
/// Checks if the benchmark has been aborted by the user or the watchdog.
fn check_abort(&self) -> Result<()> {
/// Checks if the safety watchdog or user triggered an abort.
fn check_safety_abort(&self) -> Result<()> {
if self.emergency_abort.load(Ordering::SeqCst) {
let reason = self.emergency_reason.lock().unwrap().clone().unwrap_or_else(|| "Unknown safety trigger".to_string());
return Err(anyhow::anyhow!("EMERGENCY_ABORT: {}", reason));
let reason = self.emergency_reason.lock().unwrap().clone().unwrap_or_else(|| "Watchdog Triggered".to_string());
bail!("EMERGENCY_ABORT: {}", reason);
}
if let Ok(cmd) = self.command_rx.try_recv() {
match cmd {
UiCommand::Abort => {
return Err(anyhow::anyhow!("ABORTED"));
}
UiCommand::Abort => bail!("ABORTED"),
}
}
Ok(())
}
/// Helper to send log messages to the frontend.
fn log(&self, msg: &str) -> Result<()> {
let state = TelemetryState {
cpu_model: self.cpu_model.clone(),
@@ -339,11 +364,12 @@ impl BenchmarkOrchestrator {
power_w: self.sal.get_power_w().unwrap_or(0.0),
current_freq: self.sal.get_freq_mhz().unwrap_or(0.0),
fans: self.sal.get_fan_rpms().unwrap_or_default(),
governor: "unknown".to_string(),
governor: "performance".to_string(),
pl1_limit: 0.0,
pl2_limit: 0.0,
fan_tier: "auto".to_string(),
phase: self.phase,
is_throttling: self.sal.get_throttling_status().unwrap_or(false),
phase: self.ui_phase,
history_watts: Vec::new(),
history_temp: Vec::new(),
history_mhz: Vec::new(),
@@ -355,7 +381,6 @@ impl BenchmarkOrchestrator {
self.telemetry_tx.send(state).map_err(|_| anyhow::anyhow!("Telemetry channel closed"))
}
/// Collects current sensors and sends a complete [TelemetryState] to the frontend.
fn send_telemetry(&mut self, tick: u64) -> Result<()> {
let temp = self.sal.get_temp().unwrap_or(0.0);
let pwr = self.sal.get_power_w().unwrap_or(0.0);
@@ -383,7 +408,8 @@ impl BenchmarkOrchestrator {
pl1_limit: 15.0,
pl2_limit: 25.0,
fan_tier: "max".to_string(),
phase: self.phase,
is_throttling: self.sal.get_throttling_status().unwrap_or(false),
phase: self.ui_phase,
history_watts: self.history_watts.iter().cloned().collect(),
history_temp: self.history_temp.iter().cloned().collect(),
history_mhz: self.history_mhz.iter().cloned().collect(),
@@ -394,4 +420,22 @@ impl BenchmarkOrchestrator {
};
self.telemetry_tx.send(state).map_err(|_| anyhow::anyhow!("Telemetry channel closed"))
}
pub fn generate_result(&self, is_partial: bool) -> OptimizationResult {
let r_theta = self.engine.calculate_thermal_resistance(&self.profile);
let knee = self.engine.find_silicon_knee(&self.profile);
let max_t = self.engine.get_max_temp(&self.profile);
OptimizationResult {
profile: self.profile.clone(),
silicon_knee_watts: knee,
thermal_resistance_kw: r_theta,
recommended_pl1: knee,
recommended_pl2: knee * 1.25,
max_temp_c: max_t,
is_partial,
config_paths: std::collections::HashMap::new(),
optimization_matrix: None,
}
}
}


@@ -1,35 +1,81 @@
use super::traits::{PreflightAuditor, EnvironmentGuard, SensorBus, ActuatorBus, HardwareWatchdog, AuditError, AuditStep, SafetyStatus, EnvironmentCtx};
use crate::sal::safety::{PowerLimitWatts, FanSpeedPercent};
use anyhow::{Result, Context, anyhow};
use std::fs;
use std::path::{PathBuf};
use std::time::{Duration, Instant};
use std::thread;
use std::sync::Mutex;
use tracing::{debug};
use tracing::{info, debug};
use crate::sal::heuristic::discovery::SystemFactSheet;
/// Implementation of the System Abstraction Layer for the Dell XPS 13 9380.
pub struct DellXps9380Sal {
ctx: EnvironmentCtx,
fact_sheet: SystemFactSheet,
temp_path: PathBuf,
pwr_path: PathBuf,
fan_paths: Vec<PathBuf>,
pwm_paths: Vec<PathBuf>,
pwm_enable_paths: Vec<PathBuf>,
pl1_paths: Vec<PathBuf>,
pl2_paths: Vec<PathBuf>,
freq_path: PathBuf,
pl1_path: PathBuf,
pl2_path: PathBuf,
last_poll: Mutex<Instant>,
last_temp: Mutex<f32>,
last_fans: Mutex<Vec<u32>>,
suppressed_services: Mutex<Vec<String>>,
msr_file: Mutex<fs::File>,
last_energy: Mutex<(u64, Instant)>,
last_watts: Mutex<f32>,
}
impl DellXps9380Sal {
/// Initializes the Dell SAL, opening the MSR interface and discovering sensors and PWM nodes.
pub fn init(ctx: EnvironmentCtx, facts: SystemFactSheet) -> Result<Self> {
let temp_path = facts.temp_path.clone().context("Dell SAL requires temperature sensor")?;
let pwr_base = facts.rapl_paths.first().cloned().context("Dell SAL requires RAPL interface")?;
let fan_paths = facts.fan_paths.clone();
// 1. Discover PWM and Enable nodes associated with the fan paths
let mut pwm_paths = Vec::new();
let mut pwm_enable_paths = Vec::new();
for fan_p in &fan_paths {
if let Some(parent) = fan_p.parent() {
let fan_file = fan_p.file_name().and_then(|n| n.to_str()).unwrap_or("");
let fan_idx = fan_file.chars().filter(|c| c.is_ascii_digit()).collect::<String>();
let idx = if fan_idx.is_empty() { "1".to_string() } else { fan_idx };
let pwm_p = parent.join(format!("pwm{}", idx));
if pwm_p.exists() { pwm_paths.push(pwm_p); }
let enable_p = parent.join(format!("pwm{}_enable", idx));
if enable_p.exists() { pwm_enable_paths.push(enable_p); }
}
}
// 2. Map all RAPL constraints
let mut pl1_paths = Vec::new();
let mut pl2_paths = Vec::new();
for rapl_p in &facts.rapl_paths {
pl1_paths.push(rapl_p.join("constraint_0_power_limit_uw"));
pl2_paths.push(rapl_p.join("constraint_1_power_limit_uw"));
}
// 3. Physical Sensor Verification & Warm Cache Priming
let mut initial_fans = Vec::new();
for fan_p in &fan_paths {
let mut rpm = 0;
for _ in 0..3 {
if let Ok(val) = fs::read_to_string(fan_p) {
rpm = val.trim().parse::<u32>().unwrap_or(0);
if rpm > 0 { break; }
}
thread::sleep(Duration::from_millis(100));
}
info!("SAL Warm-Start: Fan sensor {:?} -> {} RPM", fan_p, rpm);
initial_fans.push(rpm);
}
let freq_path = ctx.sysfs_base.join("sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
let msr_path = ctx.sysfs_base.join("dev/cpu/0/msr");
@@ -38,19 +84,24 @@ impl DellXps9380Sal {
let initial_energy = fs::read_to_string(pwr_base.join("energy_uj")).unwrap_or_default().trim().parse().unwrap_or(0);
info!("SAL: Dell XPS 9380 Initialized. ({} fans, {} RAPL nodes found)",
fan_paths.len(), facts.rapl_paths.len());
Ok(Self {
temp_path,
pwr_path: pwr_base.join("power1_average"),
fan_paths,
pwm_paths,
pwm_enable_paths,
pl1_paths,
pl2_paths,
freq_path,
pl1_path: pwr_base.join("constraint_0_power_limit_uw"),
pl2_path: pwr_base.join("constraint_1_power_limit_uw"),
last_poll: Mutex::new(Instant::now() - Duration::from_secs(2)),
last_temp: Mutex::new(0.0),
last_fans: Mutex::new(Vec::new()),
suppressed_services: Mutex::new(Vec::new()),
last_fans: Mutex::new(initial_fans),
msr_file: Mutex::new(msr_file),
last_energy: Mutex::new((initial_energy, Instant::now())),
last_watts: Mutex::new(0.0),
fact_sheet: facts,
ctx,
})
@@ -80,14 +131,24 @@ impl PreflightAuditor for DellXps9380Sal {
outcome: if unsafe { libc::getuid() } == 0 { Ok(()) } else { Err(AuditError::RootRequired) }
});
let rapl_lock = match self.read_msr(0x610) {
Ok(val) => {
if (val & (1 << 63)) != 0 {
Err(AuditError::KernelIncompatible("RAPL Registers are locked by BIOS. Power limit tuning is impossible.".to_string()))
} else {
Ok(())
}
},
Err(e) => Err(AuditError::ToolMissing(format!("Cannot read MSR 0x610: {}", e))),
};
steps.push(AuditStep { description: "MSR 0x610 RAPL Lock Status".to_string(), outcome: rapl_lock });
let modules = ["dell_smm_hwmon", "msr", "intel_rapl_msr"];
for mod_name in modules {
let path = self.ctx.sysfs_base.join(format!("sys/module/{}", mod_name));
steps.push(AuditStep {
description: format!("Kernel Module: {}", mod_name),
outcome: if path.exists() { Ok(()) } else {
Err(AuditError::ToolMissing(format!("Module '{}' not loaded.", mod_name)))
}
outcome: if path.exists() { Ok(()) } else { Err(AuditError::ToolMissing(format!("Module '{}' not loaded.", mod_name))) }
});
}
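The RAPL lock audit in the hunk above tests bit 63 of MSR 0x610. As a rough illustration of how such a check can be done on Linux, the sketch below reads an MSR through the `msr` character device and tests the lock bit. The function names are hypothetical, and this is not necessarily how `read_msr` in this SAL is implemented (its body is not shown in the diff); the code assumes the `msr` module is loaded and the process runs as root.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Register number read by the audit step (the package power limit MSR on Intel CPUs).
pub const MSR_PKG_POWER_LIMIT: u64 = 0x610;

/// Returns true if the lock bit (bit 63) is set in a raw register value.
pub fn rapl_is_locked(raw: u64) -> bool {
    raw & (1u64 << 63) != 0
}

/// Reads one 64-bit MSR from an already opened `/dev/cpu/N/msr` device.
/// The msr driver uses the file offset as the register number.
pub fn read_msr_at(dev: &File, register: u64) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    dev.read_exact_at(&mut buf, register)?;
    Ok(u64::from_le_bytes(buf))
}

/// Convenience wrapper: opens the device for CPU 0 and reports the lock state.
pub fn check_rapl_lock() -> io::Result<bool> {
    let dev = File::open("/dev/cpu/0/msr")?;
    Ok(rapl_is_locked(read_msr_at(&dev, MSR_PKG_POWER_LIMIT)?))
}
```

Keeping the bit test in a pure function makes it checkable without hardware access; only `check_rapl_lock` touches the device.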
@@ -109,15 +170,7 @@ impl PreflightAuditor for DellXps9380Sal {
let ac_status = fs::read_to_string(ac_status_path).unwrap_or_else(|_| "0".to_string());
steps.push(AuditStep {
description: "AC Power Connection".to_string(),
outcome: if ac_status.trim() == "1" { Ok(()) } else {
Err(AuditError::AcPowerMissing("System must be on AC power".to_string()))
}
});
let tool_check = self.fact_sheet.paths.tools.contains_key("dell_fan_ctrl");
steps.push(AuditStep {
description: "Dell Fan Control Tool".to_string(),
outcome: if tool_check { Ok(()) } else { Err(AuditError::ToolMissing("dell-bios-fan-control not found in PATH".to_string())) }
outcome: if ac_status.trim() == "1" { Ok(()) } else { Err(AuditError::AcPowerMissing("System must be on AC power".to_string())) }
});
Box::new(steps.into_iter())
@@ -125,33 +178,16 @@ impl PreflightAuditor for DellXps9380Sal {
}
impl EnvironmentGuard for DellXps9380Sal {
fn suppress(&self) -> Result<()> {
let services = ["tlp", "thermald", "i8kmon"];
let mut suppressed = self.suppressed_services.lock().unwrap();
for s in services {
if self.ctx.runner.run("systemctl", &["is-active", "--quiet", s]).is_ok() {
debug!("Suppressing service: {}", s);
self.ctx.runner.run("systemctl", &["stop", s])?;
suppressed.push(s.to_string());
}
}
Ok(())
}
fn restore(&self) -> Result<()> {
let mut suppressed = self.suppressed_services.lock().unwrap();
for s in suppressed.drain(..) {
let _ = self.ctx.runner.run("systemctl", &["start", &s]);
}
Ok(())
}
fn suppress(&self) -> Result<()> { Ok(()) }
fn restore(&self) -> Result<()> { Ok(()) }
}
impl SensorBus for DellXps9380Sal {
fn get_temp(&self) -> Result<f32> {
let mut last_poll = self.last_poll.lock().unwrap();
let now = Instant::now();
if now.duration_since(*last_poll) < Duration::from_millis(1000) {
// # SAFETY: High frequency polling for watchdog
if now.duration_since(*last_poll) < Duration::from_millis(100) {
return Ok(*self.last_temp.lock().unwrap());
}
let s = fs::read_to_string(&self.temp_path)?;
@@ -162,16 +198,24 @@ impl SensorBus for DellXps9380Sal {
}
fn get_power_w(&self) -> Result<f32> {
if self.pwr_path.to_string_lossy().contains("energy_uj") {
let mut last = self.last_energy.lock().unwrap();
let e2 = fs::read_to_string(&self.pwr_path)?.trim().parse::<u64>()?;
let rapl_base = self.fact_sheet.rapl_paths.first().context("RAPL path error")?;
let energy_path = rapl_base.join("energy_uj");
if energy_path.exists() {
let mut last_energy = self.last_energy.lock().unwrap();
let mut last_watts = self.last_watts.lock().unwrap();
let e2_str = fs::read_to_string(&energy_path)?;
let e2 = e2_str.trim().parse::<u64>()?;
let t2 = Instant::now();
let (e1, t1) = *last;
let (e1, t1) = *last_energy;
let delta_e = e2.wrapping_sub(e1);
let delta_t = t2.duration_since(t1).as_secs_f32();
*last = (e2, t2);
if delta_t < 0.01 { return Ok(0.0); }
Ok((delta_e as f32 / 1_000_000.0) / delta_t)
if delta_t < 0.1 { return Ok(*last_watts); }
let watts = (delta_e as f32 / 1_000_000.0) / delta_t;
*last_energy = (e2, t2);
*last_watts = watts;
Ok(watts)
} else {
let s = fs::read_to_string(&self.pwr_path)?;
Ok(s.trim().parse::<f32>()? / 1000000.0)
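The power reading in the hunk above derives watts from two `energy_uj` samples and returns the cached value when the samples are less than 0.1 s apart. A self-contained sketch of that calculation follows. `EnergyMeter` is a hypothetical name, and one detail differs from the diff on purpose: the powercap counter wraps at the value reported in `max_energy_range_uj`, not at `u64::MAX`, so this sketch takes that range as a parameter instead of relying on `wrapping_sub`.

```rust
use std::time::{Duration, Instant};

/// Converts a monotonically increasing energy counter (microjoules) into watts.
/// Hypothetical helper; timestamps are passed in so the logic can be tested.
pub struct EnergyMeter {
    last_uj: u64,
    last_at: Instant,
    last_watts: f32,
    max_range_uj: u64,
    min_interval: Duration,
}

impl EnergyMeter {
    pub fn new(initial_uj: u64, now: Instant, max_range_uj: u64) -> Self {
        Self {
            last_uj: initial_uj,
            last_at: now,
            last_watts: 0.0,
            max_range_uj,
            min_interval: Duration::from_millis(100), // same 0.1 s floor as the diff
        }
    }

    /// Feeds one counter reading and returns the average power since the last
    /// accepted reading. Readings that arrive too soon return the cached value.
    pub fn sample(&mut self, energy_uj: u64, now: Instant) -> f32 {
        let dt = now.duration_since(self.last_at);
        if dt < self.min_interval {
            return self.last_watts;
        }
        let delta_uj = if energy_uj >= self.last_uj {
            energy_uj - self.last_uj
        } else {
            // Counter wrapped: distance to the top of the range plus the new value.
            self.max_range_uj.saturating_sub(self.last_uj) + energy_uj
        };
        let watts = (delta_uj as f32 / 1_000_000.0) / dt.as_secs_f32();
        self.last_uj = energy_uj;
        self.last_at = now;
        self.last_watts = watts;
        watts
    }
}
```

In a real SAL the `max_range_uj` argument would presumably come from reading `max_energy_range_uj` next to `energy_uj`; that wiring is not part of this change and is left out here.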
@@ -184,12 +228,27 @@ impl SensorBus for DellXps9380Sal {
if now.duration_since(*last_poll) < Duration::from_millis(1000) {
return Ok(self.last_fans.lock().unwrap().clone());
}
let mut fans = Vec::new();
for path in &self.fan_paths {
if let Ok(s) = fs::read_to_string(path) {
if let Ok(rpm) = s.trim().parse::<u32>() { fans.push(rpm); }
let mut val = 0;
for i in 0..5 {
match fs::read_to_string(path) {
Ok(s) => {
if let Ok(rpm) = s.trim().parse::<u32>() {
val = rpm;
if rpm > 0 { break; }
}
},
Err(e) => {
debug!("SAL: Fan poll retry {} for {:?} failed: {}", i+1, path, e);
}
}
thread::sleep(Duration::from_millis(150));
}
fans.push(val);
}
*self.last_fans.lock().unwrap() = fans.clone();
*last_poll = now;
Ok(fans)
@@ -199,6 +258,11 @@ impl SensorBus for DellXps9380Sal {
let s = fs::read_to_string(&self.freq_path)?;
Ok(s.trim().parse::<f32>()? / 1000.0)
}
fn get_throttling_status(&self) -> Result<bool> {
let val = self.read_msr(0x19C)?;
Ok((val & 0x1) != 0)
}
}
impl ActuatorBus for DellXps9380Sal {
@@ -208,20 +272,47 @@ impl ActuatorBus for DellXps9380Sal {
let tool_str = tool_path.to_string_lossy();
match mode {
"max" | "Manual" => { self.ctx.runner.run(&tool_str, &["0"])?; }
"max" | "Manual" => {
self.ctx.runner.run(&tool_str, &["0"])?;
// Disabling BIOS control requires immediate PWM override
self.set_fan_speed(FanSpeedPercent::new(100)?)?;
}
"auto" | "Auto" => { self.ctx.runner.run(&tool_str, &["1"])?; }
_ => { debug!("Unknown fan mode: {}", mode); }
_ => {}
}
Ok(())
}
fn set_sustained_power_limit(&self, watts: f32) -> Result<()> {
fs::write(&self.pl1_path, ((watts * 1_000_000.0) as u64).to_string())?;
fn set_fan_speed(&self, speed: FanSpeedPercent) -> Result<()> {
let pwm_val = ((speed.get() as u32 * 255) / 100) as u8;
for p in &self.pwm_enable_paths { let _ = fs::write(p, "1"); }
for path in &self.pwm_paths { let _ = fs::write(path, pwm_val.to_string()); }
Ok(())
}
fn set_burst_power_limit(&self, watts: f32) -> Result<()> {
fs::write(&self.pl2_path, ((watts * 1_000_000.0) as u64).to_string())?;
fn set_sustained_power_limit(&self, limit: PowerLimitWatts) -> Result<()> {
for path in &self.pl1_paths {
debug!("SAL: Applying PL1 ({:.1}W) to {:?}", limit.get(), path);
fs::write(path, limit.as_microwatts().to_string())
.with_context(|| format!("Failed to write PL1 to {:?}", path))?;
if let Some(parent) = path.parent() {
let enable_p = parent.join("constraint_0_enabled");
let _ = fs::write(&enable_p, "1");
}
}
Ok(())
}
fn set_burst_power_limit(&self, limit: PowerLimitWatts) -> Result<()> {
for path in &self.pl2_paths {
debug!("SAL: Applying PL2 ({:.1}W) to {:?}", limit.get(), path);
fs::write(path, limit.as_microwatts().to_string())
.with_context(|| format!("Failed to write PL2 to {:?}", path))?;
if let Some(parent) = path.parent() {
let enable_p = parent.join("constraint_1_enabled");
let _ = fs::write(&enable_p, "1");
}
}
Ok(())
}
}
@@ -243,7 +334,5 @@ impl HardwareWatchdog for DellXps9380Sal {
}
impl Drop for DellXps9380Sal {
fn drop(&mut self) {
let _ = self.restore();
}
fn drop(&mut self) { }
}
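The actuator methods in this file now take `PowerLimitWatts` and `FanSpeedPercent` from `crate::sal::safety`, whose definitions are not included in this part of the diff. The sketch below shows one plausible shape for such validated newtypes, using only the method names visible at the call sites (`try_new`, `new`, `get`, `as_microwatts`). The bounds, the `String` error type, and the `as_pwm` helper are assumptions made for illustration; the real module may differ.

```rust
/// Validated power limit in watts (illustrative; bounds are assumed, not from the diff).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PowerLimitWatts(f32);

impl PowerLimitWatts {
    pub const MIN: f32 = 1.0; // assumed lower bound
    pub const MAX: f32 = 150.0; // assumed upper bound

    /// Rejects NaN, infinities, and values outside the assumed range.
    pub fn try_new(watts: f32) -> Result<Self, String> {
        if watts.is_finite() && (Self::MIN..=Self::MAX).contains(&watts) {
            Ok(Self(watts))
        } else {
            Err(format!("power limit {watts} W outside {}..={} W", Self::MIN, Self::MAX))
        }
    }

    pub fn get(self) -> f32 {
        self.0
    }

    /// Value in the unit expected by `constraint_*_power_limit_uw`.
    pub fn as_microwatts(self) -> u64 {
        (self.0 as f64 * 1_000_000.0).round() as u64
    }
}

/// Validated fan duty cycle in percent (0..=100).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FanSpeedPercent(u8);

impl FanSpeedPercent {
    pub fn new(percent: u8) -> Result<Self, String> {
        if percent <= 100 {
            Ok(Self(percent))
        } else {
            Err(format!("fan speed {percent}% exceeds 100%"))
        }
    }

    pub fn get(self) -> u8 {
        self.0
    }

    /// Same scaling as `set_fan_speed` above: percent mapped onto the 0..=255 PWM range.
    pub fn as_pwm(self) -> u8 {
        ((self.0 as u32 * 255) / 100) as u8
    }
}
```

The point of the pattern is that an out-of-range value fails at construction, before any sysfs write is attempted.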

148
src/sal/discovery.rs Normal file

@@ -0,0 +1,148 @@
//! # Hardware Discovery Engine (Agent Sentinel)
//!
//! This module provides dynamic traversal of `/sys/class/hwmon` and `/sys/class/powercap`
//! to locate sensors and actuators without relying on hardcoded indices.
use anyhow::{Result, Context, anyhow};
use std::fs;
use std::path::{Path, PathBuf};
use tracing::{debug, info, warn};
/// Result of a successful hardware discovery.
#[derive(Debug, Clone)]
pub struct DiscoveredHardware {
/// Path to the primary package temperature sensor input.
pub temp_input: PathBuf,
/// Paths to all detected fan RPM inputs.
pub fan_inputs: Vec<PathBuf>,
/// Paths to all detected fan PWM control nodes.
pub pwm_controls: Vec<PathBuf>,
/// Paths to all detected fan PWM enable nodes.
pub pwm_enables: Vec<PathBuf>,
/// Paths to RAPL power limit constraint files.
pub rapl_paths: Vec<PathBuf>,
}
pub struct DiscoveryEngine;
impl DiscoveryEngine {
/// Performs a full traversal of the sysfs hardware tree.
pub fn run(sysfs_root: &Path) -> Result<DiscoveredHardware> {
info!("Sentinel: Starting dynamic hardware discovery...");
let hwmon_path = sysfs_root.join("sys/class/hwmon");
let (temp_input, fan_info) = Self::discover_hwmon(&hwmon_path)?;
let powercap_path = sysfs_root.join("sys/class/powercap");
let rapl_paths = Self::discover_rapl(&powercap_path)?;
let hardware = DiscoveredHardware {
temp_input,
fan_inputs: fan_info.rpm_inputs,
pwm_controls: fan_info.pwm_controls,
pwm_enables: fan_info.pwm_enables,
rapl_paths,
};
info!("Sentinel: Discovery complete. Found {} fans and {} RAPL nodes.",
hardware.fan_inputs.len(), hardware.rapl_paths.len());
Ok(hardware)
}
fn discover_hwmon(base: &Path) -> Result<(PathBuf, FanHardware)> {
let mut best_temp: Option<(u32, PathBuf)> = None;
let mut fans = FanHardware::default();
let entries = fs::read_dir(base)
.with_context(|| format!("Failed to read hwmon base: {:?}", base))?;
for entry in entries.flatten() {
let path = entry.path();
let driver_name = fs::read_to_string(path.join("name"))
.map(|s| s.trim().to_string())
.unwrap_or_else(|_| "unknown".to_string());
debug!("Discovery: Probing hwmon node {:?} (driver: {})", path, driver_name);
// 1. Temperature Discovery
let temp_priority = match driver_name.as_str() {
"coretemp" | "zenpower" => 10,
"k10temp" => 9,
"dell_smm" => 8,
"acpitz" => 1,
_ => 5,
};
if let Ok(hw_entries) = fs::read_dir(&path) {
for hw_entry in hw_entries.flatten() {
let file_name = hw_entry.file_name().to_string_lossy().to_string();
// Temperature Inputs
if file_name.starts_with("temp") && file_name.ends_with("_input") {
let label_path = path.join(file_name.replace("_input", "_label"));
let label = fs::read_to_string(label_path).unwrap_or_default().trim().to_string();
let label_priority = if label.contains("Package") || label.contains("Tdie") {
2
} else {
0
};
let total_priority = temp_priority + label_priority;
if best_temp.is_none() || total_priority > best_temp.as_ref().unwrap().0 {
best_temp = Some((total_priority, hw_entry.path()));
}
}
// Fan Inputs
if file_name.starts_with("fan") && file_name.ends_with("_input") {
fans.rpm_inputs.push(hw_entry.path());
}
// PWM Controls
if file_name.starts_with("pwm") && !file_name.contains("_") {
fans.pwm_controls.push(hw_entry.path());
}
// PWM Enables
if file_name.starts_with("pwm") && file_name.ends_with("_enable") {
fans.pwm_enables.push(hw_entry.path());
}
}
}
}
let temp_input = best_temp.map(|(_, p)| p)
.ok_or_else(|| anyhow!("Failed to locate any valid temperature sensor in /sys/class/hwmon/"))?;
Ok((temp_input, fans))
}
fn discover_rapl(base: &Path) -> Result<Vec<PathBuf>> {
let mut paths = Vec::new();
if !base.exists() {
warn!("Discovery: /sys/class/powercap does not exist.");
return Ok(paths);
}
let entries = fs::read_dir(base)?;
for entry in entries.flatten() {
let path = entry.path();
let name = fs::read_to_string(path.join("name")).unwrap_or_default().trim().to_string();
if name.contains("package") || name.contains("intel-rapl") {
paths.push(path);
}
}
Ok(paths)
}
}
#[derive(Default)]
struct FanHardware {
rpm_inputs: Vec<PathBuf>,
pwm_controls: Vec<PathBuf>,
pwm_enables: Vec<PathBuf>,
}
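The temperature-sensor selection in `discover_hwmon` above scores each `temp*_input` by driver priority plus a label bonus and keeps the first candidate with the strictly highest total. The scoring can be exercised without touching sysfs, as in the sketch below. `TempCandidate` and `pick_best` are hypothetical names introduced for illustration; the priority numbers are copied from the module above.

```rust
/// One temperature input found under a hwmon node (illustrative type).
#[derive(Debug, Clone, Copy)]
pub struct TempCandidate<'a> {
    pub driver: &'a str,
    pub label: &'a str,
    pub path: &'a str,
}

/// Driver ranking used by the discovery module.
pub fn driver_priority(driver: &str) -> u32 {
    match driver {
        "coretemp" | "zenpower" => 10,
        "k10temp" => 9,
        "dell_smm" => 8,
        "acpitz" => 1,
        _ => 5,
    }
}

/// Package-level labels get a small bonus over per-core inputs.
pub fn label_bonus(label: &str) -> u32 {
    if label.contains("Package") || label.contains("Tdie") {
        2
    } else {
        0
    }
}

/// Returns the path of the best candidate; on ties the earliest one wins,
/// matching the strict `>` comparison in `discover_hwmon`.
pub fn pick_best<'a>(candidates: &[TempCandidate<'a>]) -> Option<&'a str> {
    let mut best: Option<(u32, &'a str)> = None;
    for c in candidates {
        let score = driver_priority(c.driver) + label_bonus(c.label);
        let better = match best {
            None => true,
            Some((top, _)) => score > top,
        };
        if better {
            best = Some((score, c.path));
        }
    }
    best.map(|(_, path)| path)
}
```

Because ties resolve to the first entry seen, the result depends on directory iteration order when several inputs share a score; sorting candidates by path before scoring would be one way to make that deterministic.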


@@ -1,10 +1,12 @@
use anyhow::{Result, anyhow};
use anyhow::{Result, anyhow, Context};
use std::path::{Path};
use std::fs;
use std::time::{Duration, Instant};
use std::sync::Mutex;
use std::sync::{Mutex, Arc};
use tracing::{debug, warn, info};
use crate::sal::traits::{SensorBus, ActuatorBus, EnvironmentGuard, HardwareWatchdog, PreflightAuditor, AuditStep, AuditError, SafetyStatus, EnvironmentCtx};
use crate::sal::safety::{PowerLimitWatts, FanSpeedPercent};
use crate::sal::heuristic::discovery::SystemFactSheet;
use crate::sal::heuristic::schema::HardwareDb;
@@ -12,9 +14,8 @@ pub struct GenericLinuxSal {
ctx: EnvironmentCtx,
fact_sheet: SystemFactSheet,
db: HardwareDb,
suppressed_services: Mutex<Vec<String>>,
last_valid_temp: Mutex<(f32, Instant)>,
current_pl1: Mutex<f32>,
current_pl1: Mutex<u64>,
last_energy: Mutex<(u64, Instant)>,
}
@@ -28,9 +29,8 @@ impl GenericLinuxSal {
Self {
db,
suppressed_services: Mutex::new(Vec::new()),
last_valid_temp: Mutex::new((0.0, Instant::now())),
current_pl1: Mutex::new(15.0),
current_pl1: Mutex::new(15_000_000),
last_energy: Mutex::new((initial_energy, Instant::now())),
fact_sheet: facts,
ctx,
@@ -95,7 +95,7 @@ impl SensorBus for GenericLinuxSal {
let delta_e = e2.wrapping_sub(e1);
let delta_t = t2.duration_since(t1).as_secs_f32();
*last = (e2, t2);
if delta_t < 0.01 { return Ok(0.0); }
if delta_t < 0.05 { return Ok(0.0); }
Ok((delta_e as f32 / 1_000_000.0) / delta_t)
}
@@ -126,6 +126,22 @@ impl SensorBus for GenericLinuxSal {
Err(anyhow!("Could not determine CPU frequency"))
}
}
fn get_throttling_status(&self) -> Result<bool> {
let cooling_base = self.ctx.sysfs_base.join("sys/class/thermal");
if let Ok(entries) = fs::read_dir(cooling_base) {
for entry in entries.flatten() {
if entry.file_name().to_string_lossy().starts_with("cooling_device") {
if let Ok(state) = fs::read_to_string(entry.path().join("cur_state")) {
if state.trim().parse::<u32>().unwrap_or(0) > 0 {
return Ok(true);
}
}
}
}
}
Ok(false)
}
}
impl ActuatorBus for GenericLinuxSal {
@@ -144,44 +160,37 @@ impl ActuatorBus for GenericLinuxSal {
} else { Ok(()) }
}
fn set_sustained_power_limit(&self, watts: f32) -> Result<()> {
let rapl_path = self.fact_sheet.rapl_paths.first().ok_or_else(|| anyhow!("No PL1 path"))?;
fs::write(rapl_path.join("constraint_0_power_limit_uw"), ((watts * 1_000_000.0) as u64).to_string())?;
*self.current_pl1.lock().unwrap() = watts;
fn set_fan_speed(&self, _speed: FanSpeedPercent) -> Result<()> {
Ok(())
}
fn set_burst_power_limit(&self, watts: f32) -> Result<()> {
let rapl_path = self.fact_sheet.rapl_paths.first().ok_or_else(|| anyhow!("No PL2 path"))?;
fs::write(rapl_path.join("constraint_1_power_limit_uw"), ((watts * 1_000_000.0) as u64).to_string())?;
fn set_sustained_power_limit(&self, limit: PowerLimitWatts) -> Result<()> {
for rapl_path in &self.fact_sheet.rapl_paths {
let limit_path = rapl_path.join("constraint_0_power_limit_uw");
let enable_path = rapl_path.join("constraint_0_enabled");
fs::write(&limit_path, limit.as_microwatts().to_string())
.with_context(|| format!("Failed to write PL1 to {:?}", limit_path))?;
let _ = fs::write(&enable_path, "1");
}
*self.current_pl1.lock().unwrap() = limit.as_microwatts();
Ok(())
}
fn set_burst_power_limit(&self, limit: PowerLimitWatts) -> Result<()> {
for rapl_path in &self.fact_sheet.rapl_paths {
let limit_path = rapl_path.join("constraint_1_power_limit_uw");
let enable_path = rapl_path.join("constraint_1_enabled");
fs::write(&limit_path, limit.as_microwatts().to_string())
.with_context(|| format!("Failed to write PL2 to {:?}", limit_path))?;
let _ = fs::write(&enable_path, "1");
}
Ok(())
}
}
impl EnvironmentGuard for GenericLinuxSal {
fn suppress(&self) -> Result<()> {
let mut suppressed = self.suppressed_services.lock().unwrap();
for conflict_id in &self.fact_sheet.active_conflicts {
if let Some(conflict) = self.db.conflicts.iter().find(|c| &c.id == conflict_id) {
for service in &conflict.services {
if self.ctx.runner.run("systemctl", &["is-active", "--quiet", service]).is_ok() {
self.ctx.runner.run("systemctl", &["stop", service])?;
suppressed.push(service.clone());
}
}
}
}
Ok(())
}
fn restore(&self) -> Result<()> {
let mut suppressed = self.suppressed_services.lock().unwrap();
for service in suppressed.drain(..) {
let _ = self.ctx.runner.run("systemctl", &["start", &service]);
}
if self.is_dell() { let _ = self.set_fan_mode("auto"); }
Ok(())
}
fn suppress(&self) -> Result<()> { Ok(()) }
fn restore(&self) -> Result<()> { Ok(()) }
}
impl HardwareWatchdog for GenericLinuxSal {
@@ -197,7 +206,3 @@ impl HardwareWatchdog for GenericLinuxSal {
Ok(SafetyStatus::Nominal)
}
}
impl Drop for GenericLinuxSal {
fn drop(&mut self) { let _ = self.restore(); }
}

View File

@@ -1,12 +1,12 @@
use std::fs;
use std::path::{Path, PathBuf};
use std::process::Command;
use std::time::{Duration};
use std::thread;
use std::sync::mpsc;
use std::collections::HashMap;
use crate::sal::heuristic::schema::{SensorDiscovery, ActuatorDiscovery, Conflict, Discovery, Benchmarking};
use tracing::{debug, warn};
use crate::sys::SyscallRunner;
use tracing::{debug, warn, info};
/// Registry of dynamically discovered paths for configs and tools.
#[derive(Debug, Clone, Default)]
@@ -24,6 +24,7 @@ pub struct SystemFactSheet {
pub fan_paths: Vec<PathBuf>,
pub rapl_paths: Vec<PathBuf>,
pub active_conflicts: Vec<String>,
pub conflict_services: Vec<String>,
pub paths: PathRegistry,
pub bench_config: Option<Benchmarking>,
}
@@ -31,6 +32,7 @@ pub struct SystemFactSheet {
/// Probes the system for hardware sensors, actuators, service conflicts, and paths.
pub fn discover_facts(
base_path: &Path,
runner: &dyn SyscallRunner,
discovery: &Discovery,
conflicts: &[Conflict],
bench_config: Benchmarking,
@@ -43,12 +45,17 @@ pub fn discover_facts(
let rapl_paths = discover_rapl(base_path, &discovery.actuators);
let mut active_conflicts = Vec::new();
let mut conflict_services = Vec::new();
for conflict in conflicts {
let mut found_active = false;
for service in &conflict.services {
if is_service_active(service) {
debug!("Detected active conflict: {} (Service: {})", conflict.id, service);
active_conflicts.push(conflict.id.clone());
break;
if is_service_active(runner, service) {
if !found_active {
debug!("Detected active conflict: {} (Service: {})", conflict.id, service);
active_conflicts.push(conflict.id.clone());
found_active = true;
}
conflict_services.push(service.clone());
}
}
}
@@ -56,13 +63,7 @@ pub fn discover_facts(
let paths = discover_paths(base_path, discovery);
SystemFactSheet {
vendor,
model,
temp_path,
fan_paths,
rapl_paths,
active_conflicts,
paths,
vendor, model, temp_path, fan_paths, rapl_paths, active_conflicts, conflict_services, paths,
bench_config: Some(bench_config),
}
}
@@ -70,7 +71,6 @@ pub fn discover_facts(
fn discover_paths(base_path: &Path, discovery: &Discovery) -> PathRegistry {
let mut registry = PathRegistry::default();
// 1. Discover Tools via PATH
for (id, binary_name) in &discovery.tools {
if let Ok(path) = which::which(binary_name) {
debug!("Discovered tool: {} -> {:?}", id, path);
@@ -78,7 +78,6 @@ fn discover_paths(base_path: &Path, discovery: &Discovery) -> PathRegistry {
}
}
// 2. Discover Configs via existence check
for (id, candidates) in &discovery.configs {
for candidate in candidates {
let path = if candidate.starts_with('/') {
@@ -93,7 +92,6 @@ fn discover_paths(base_path: &Path, discovery: &Discovery) -> PathRegistry {
break;
}
}
// If not found, use the first one as default if any exist
if !registry.configs.contains_key(id) {
if let Some(first) = candidates.first() {
registry.configs.insert(id.clone(), PathBuf::from(first));
@@ -104,12 +102,11 @@ fn discover_paths(base_path: &Path, discovery: &Discovery) -> PathRegistry {
registry
}
/// Reads DMI information from sysfs with a safety timeout.
fn read_dmi_info(base_path: &Path) -> (String, String) {
let vendor = read_sysfs_with_timeout(&base_path.join("sys/class/dmi/id/sys_vendor"), Duration::from_millis(100))
.unwrap_or_else(|| "Unknown".to_string());
let model = read_sysfs_with_timeout(&base_path.join("sys/class/dmi/id/product_name"), Duration::from_millis(100))
.unwrap_or_else(|| "Unknown".to_string());
let vendor = fs::read_to_string(base_path.join("sys/class/dmi/id/sys_vendor"))
.map(|s| s.trim().to_string()).unwrap_or_else(|_| "Unknown".to_string());
let model = fs::read_to_string(base_path.join("sys/class/dmi/id/product_name"))
.map(|s| s.trim().to_string()).unwrap_or_else(|_| "Unknown".to_string());
(vendor, model)
}
@@ -119,51 +116,62 @@ fn discover_hwmon(base_path: &Path, cfg: &SensorDiscovery) -> (Option<PathBuf>,
let mut fan_candidates = Vec::new();
let hwmon_base = base_path.join("sys/class/hwmon");
let entries = match fs::read_dir(&hwmon_base) {
Ok(e) => e,
Err(e) => {
warn!("Could not read {:?}: {}", hwmon_base, e);
return (None, Vec::new());
}
};
let entries = fs::read_dir(&hwmon_base).map_err(|e| {
warn!("Could not read {:?}: {}", hwmon_base, e);
e
}).ok();
for entry in entries.flatten() {
let hwmon_path = entry.path();
if let Some(entries) = entries {
for entry in entries.flatten() {
let hwmon_path = entry.path();
let driver_name = read_sysfs_with_timeout(&hwmon_path.join("name"), Duration::from_millis(100))
.unwrap_or_default();
// # SAFETY: Read driver name directly. This file is virtual and never blocks.
// Using a timeout wrapper here was causing discovery to fail if the thread-pool lagged.
let driver_name = fs::read_to_string(hwmon_path.join("name"))
.map(|s| s.trim().to_string()).unwrap_or_default();
let priority = cfg.hwmon_priority
.iter()
.position(|p| p == &driver_name)
.unwrap_or(usize::MAX);
let priority = cfg.hwmon_priority
.iter()
.position(|p| driver_name.contains(p))
.unwrap_or(usize::MAX);
if let Ok(hw_entries) = fs::read_dir(&hwmon_path) {
for hw_entry in hw_entries.flatten() {
let file_name = hw_entry.file_name().into_string().unwrap_or_default();
if let Ok(hw_entries) = fs::read_dir(&hwmon_path) {
for hw_entry in hw_entries.flatten() {
let file_name = hw_entry.file_name().into_string().unwrap_or_default();
// Temperature Sensors
if file_name.starts_with("temp") && file_name.ends_with("_label") {
if let Some(label) = read_sysfs_with_timeout(&hw_entry.path(), Duration::from_millis(100)) {
if cfg.temp_labels.iter().any(|l| label.contains(l)) {
let input_path = hwmon_path.join(file_name.replace("_label", "_input"));
if input_path.exists() {
temp_candidates.push((priority, input_path));
// 1. Temperatures
if file_name.starts_with("temp") && file_name.ends_with("_label") {
if let Some(label) = read_sysfs_with_timeout(&hw_entry.path(), Duration::from_millis(500)) {
if cfg.temp_labels.iter().any(|l| label.contains(l)) {
let input_path = hwmon_path.join(file_name.replace("_label", "_input"));
if input_path.exists() {
temp_candidates.push((priority, input_path));
}
}
}
}
}
// Fan Sensors
if file_name.starts_with("fan") && file_name.ends_with("_label") {
if let Some(label) = read_sysfs_with_timeout(&hw_entry.path(), Duration::from_millis(100)) {
if cfg.fan_labels.iter().any(|l| label.contains(l)) {
let input_path = hwmon_path.join(file_name.replace("_label", "_input"));
if input_path.exists() {
fan_candidates.push((priority, input_path));
// 2. Fans (Label Match)
if file_name.starts_with("fan") && file_name.ends_with("_label") {
if let Some(label) = read_sysfs_with_timeout(&hw_entry.path(), Duration::from_millis(500)) {
if cfg.fan_labels.iter().any(|l| label.contains(l)) {
let input_path = hwmon_path.join(file_name.replace("_label", "_input"));
if input_path.exists() {
debug!("Discovered fan by label: {:?} (priority {})", input_path, priority);
fan_candidates.push((priority, input_path));
}
}
}
}
// 3. Fans (Priority Fallback - CRITICAL FOR DELL 9380)
// If we found a priority driver (e.g., dell_smm), we take every fan*_input we find.
if priority < usize::MAX && file_name.starts_with("fan") && file_name.ends_with("_input") {
if !fan_candidates.iter().any(|(_, p)| p == &hw_entry.path()) {
info!("Heuristic Discovery: Force-adding unlabeled fan sensor from priority driver '{}': {:?}", driver_name, hw_entry.path());
fan_candidates.push((priority, hw_entry.path()));
}
}
}
}
}
@@ -173,54 +181,45 @@ fn discover_hwmon(base_path: &Path, cfg: &SensorDiscovery) -> (Option<PathBuf>,
fan_candidates.sort_by_key(|(p, _)| *p);
let best_temp = temp_candidates.first().map(|(_, p)| p.clone());
let best_fans = fan_candidates.into_iter().map(|(_, p)| p).collect();
let best_fans: Vec<PathBuf> = fan_candidates.into_iter().map(|(_, p)| p).collect();
if best_fans.is_empty() {
warn!("Heuristic Discovery: No fan RPM sensors found.");
} else {
info!("Heuristic Discovery: Final registry contains {} fan sensors.", best_fans.len());
}
(best_temp, best_fans)
}
/// Discovers RAPL powercap paths.
fn discover_rapl(base_path: &Path, cfg: &ActuatorDiscovery) -> Vec<PathBuf> {
let mut paths = Vec::new();
let powercap_base = base_path.join("sys/class/powercap");
let entries = match fs::read_dir(&powercap_base) {
Ok(e) => e,
Err(_) => return Vec::new(),
};
if let Ok(entries) = fs::read_dir(&powercap_base) {
for entry in entries.flatten() {
let path = entry.path();
let dir_name = entry.file_name().into_string().unwrap_or_default();
for entry in entries.flatten() {
let path = entry.path();
let dir_name = entry.file_name().into_string().unwrap_or_default();
if cfg.rapl_paths.contains(&dir_name) {
paths.push(path);
continue;
}
if let Some(name) = read_sysfs_with_timeout(&path.join("name"), Duration::from_millis(100)) {
if cfg.rapl_paths.iter().any(|p| p == &name) {
if cfg.rapl_paths.contains(&dir_name) {
paths.push(path);
continue;
}
if let Ok(name) = fs::read_to_string(path.join("name")) {
if cfg.rapl_paths.iter().any(|p| p == name.trim()) {
paths.push(path);
}
}
}
}
paths
}
/// Checks if a systemd service is currently active.
pub fn is_service_active(service: &str) -> bool {
let status = Command::new("systemctl")
.arg("is-active")
.arg("--quiet")
.arg(service)
.status();
match status {
Ok(s) => s.success(),
Err(_) => false,
}
pub fn is_service_active(runner: &dyn SyscallRunner, service: &str) -> bool {
runner.run("systemctl", &["is-active", "--quiet", service]).is_ok()
}
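
Passing the runner in as a parameter is what lets `is_service_active` be tested without a real `systemctl`. A minimal sketch of that idea follows. The `CommandRunner` trait and `FakeRunner` type are hypothetical stand-ins (the real `SyscallRunner` signature is not shown in this diff), and the sketch assumes the runner reports a non-zero exit status as `Err`.

```rust
/// Hypothetical stand-in for the project's `SyscallRunner` trait.
pub trait CommandRunner {
    fn run(&self, cmd: &str, args: &[&str]) -> Result<String, String>;
}

/// Test double: pretends that only the listed services are active.
pub struct FakeRunner {
    pub active: Vec<String>,
}

impl CommandRunner for FakeRunner {
    fn run(&self, cmd: &str, args: &[&str]) -> Result<String, String> {
        match (cmd, args) {
            // `systemctl is-active --quiet <svc>` succeeds only for known services.
            ("systemctl", ["is-active", "--quiet", svc])
                if self.active.iter().any(|a| a.as_str() == *svc) =>
            {
                Ok(String::new())
            }
            // Assumption: any failure (including non-zero exit) surfaces as Err.
            _ => Err(format!("{cmd} {args:?} failed")),
        }
    }
}

/// Same shape as the function in the diff, written against the stand-in trait.
pub fn is_service_active(runner: &dyn CommandRunner, service: &str) -> bool {
    runner
        .run("systemctl", &["is-active", "--quiet", service])
        .is_ok()
}
```

If the real runner returned `Ok` whenever the process merely spawned, `.is_ok()` would report every service as active, so that mapping is worth confirming in `SyscallRunner`.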
/// Helper to read a sysfs file with a timeout.
fn read_sysfs_with_timeout(path: &Path, timeout: Duration) -> Option<String> {
let (tx, rx) = mpsc::channel();
let path_buf = path.to_path_buf();
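
The body of `read_sysfs_with_timeout` is cut off in this hunk. One common way to write such a helper, shown here only as a sketch and not necessarily what this repository does, is to read on a worker thread and wait on a channel with `recv_timeout`. The name `read_with_timeout` is hypothetical.

```rust
use std::fs;
use std::path::Path;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Reads `path` on a worker thread and gives up after `timeout`.
/// Returns the trimmed contents, or `None` on a read error or timeout.
pub fn read_with_timeout(path: &Path, timeout: Duration) -> Option<String> {
    let (tx, rx) = mpsc::channel();
    let path_buf = path.to_path_buf();
    thread::spawn(move || {
        let result = fs::read_to_string(&path_buf).map(|s| s.trim().to_string());
        // The receiver may already have given up; ignore a failed send.
        let _ = tx.send(result);
    });
    match rx.recv_timeout(timeout) {
        Ok(Ok(content)) => Some(content),
        _ => None, // read error, or no answer within `timeout`
    }
}
```

A timed-out worker is not cancelled; it lingers until the underlying read returns. That cost, plus one thread spawn per file, is a plausible reason to read files that never block (such as hwmon `name`) directly.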

View File

@@ -24,7 +24,7 @@ impl HeuristicEngine {
.context("Failed to parse hardware_db.toml")?;
// 2. Discover Facts
let facts = discover_facts(&ctx.sysfs_base, &db.discovery, &db.conflicts, db.benchmarking.clone());
let facts = discover_facts(&ctx.sysfs_base, ctx.runner.as_ref(), &db.discovery, &db.conflicts, db.benchmarking.clone());
info!("System Identity: {} {}", facts.vendor, facts.model);
// 3. Routing Logic

View File

@@ -1,5 +1,7 @@
use super::traits::{PreflightAuditor, EnvironmentGuard, SensorBus, ActuatorBus, HardwareWatchdog, AuditStep, SafetyStatus};
use crate::sal::safety::{PowerLimitWatts, FanSpeedPercent};
use anyhow::Result;
use std::sync::Arc;
pub struct MockSal {
pub temperature_sequence: std::sync::atomic::AtomicUsize,
@@ -16,59 +18,36 @@ impl MockSal {
impl PreflightAuditor for MockSal {
fn audit(&self) -> Box<dyn Iterator<Item = AuditStep> + '_> {
let steps = vec![
AuditStep {
description: "Mock Root Privileges".to_string(),
outcome: Ok(()),
},
AuditStep {
description: "Mock AC Power Status".to_string(),
outcome: Ok(()),
},
AuditStep { description: "Mock Root Privileges".to_string(), outcome: Ok(()) },
AuditStep { description: "Mock AC Power Status".to_string(), outcome: Ok(()) },
];
Box::new(steps.into_iter())
}
}
impl EnvironmentGuard for MockSal {
fn suppress(&self) -> Result<()> {
Ok(())
}
fn restore(&self) -> Result<()> {
Ok(())
}
fn suppress(&self) -> Result<()> { Ok(()) }
fn restore(&self) -> Result<()> { Ok(()) }
}
impl SensorBus for MockSal {
fn get_temp(&self) -> Result<f32> {
// Support dynamic sequence for Step 5
let seq = self.temperature_sequence.fetch_add(1, std::sync::atomic::Ordering::SeqCst);
Ok(40.0 + (seq as f32 * 0.5).min(50.0)) // Heats up from 40 to 90
}
fn get_power_w(&self) -> Result<f32> {
Ok(15.0)
}
fn get_fan_rpms(&self) -> Result<Vec<u32>> {
Ok(vec![2500])
}
fn get_freq_mhz(&self) -> Result<f32> {
Ok(3200.0)
Ok(40.0 + (seq as f32 * 0.5).min(55.0))
}
fn get_power_w(&self) -> Result<f32> { Ok(15.0) }
fn get_fan_rpms(&self) -> Result<Vec<u32>> { Ok(vec![2500, 2400]) }
fn get_freq_mhz(&self) -> Result<f32> { Ok(3200.0) }
fn get_throttling_status(&self) -> Result<bool> { Ok(false) }
}
impl ActuatorBus for MockSal {
fn set_fan_mode(&self, _mode: &str) -> Result<()> {
Ok(())
}
fn set_sustained_power_limit(&self, _watts: f32) -> Result<()> {
Ok(())
}
fn set_burst_power_limit(&self, _watts: f32) -> Result<()> {
Ok(())
}
fn set_fan_mode(&self, _mode: &str) -> Result<()> { Ok(()) }
fn set_fan_speed(&self, _speed: FanSpeedPercent) -> Result<()> { Ok(()) }
fn set_sustained_power_limit(&self, _limit: PowerLimitWatts) -> Result<()> { Ok(()) }
fn set_burst_power_limit(&self, _limit: PowerLimitWatts) -> Result<()> { Ok(()) }
}
impl HardwareWatchdog for MockSal {
fn get_safety_status(&self) -> Result<SafetyStatus> {
Ok(SafetyStatus::Nominal)
}
fn get_safety_status(&self) -> Result<SafetyStatus> { Ok(SafetyStatus::Nominal) }
}

View File

@@ -3,3 +3,5 @@ pub mod mock;
pub mod dell_xps_9380;
pub mod generic_linux;
pub mod heuristic;
pub mod safety;
pub mod discovery;

282
src/sal/safety.rs Normal file
View File

@@ -0,0 +1,282 @@
//! # Hardware Safety & Universal Safeguard Architecture
//!
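//! This module implements the core safety logic for `ember-tune`. It uses the Rust
//! type system to enforce hardware bounds and RAII patterns to restore the system
//! to its pre-flight state when a guard is released or dropped, including during
//! panic unwinding. `Drop` does not run on `SIGKILL`, `std::process::abort`, or
//! power loss, so those cases are not covered.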
use anyhow::{Result, bail, Context};
use std::collections::HashMap;
use std::fs;
use std::path::{PathBuf};
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};
use std::thread;
use tracing::{info, warn, error, debug};
use crate::sal::traits::SensorBus;
// --- 1. Type-Driven Bounds Checking ---
/// Represents a validated TDP limit in Watts.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct PowerLimitWatts(f32);
impl PowerLimitWatts {
/// Absolute safety floor. Setting TDP below 3W can induce system-wide
/// CPU stalls and I/O deadlocks on certain Intel mobile chipsets.
pub const MIN: f32 = 3.0;
/// Safety ceiling for mobile thin-and-light chassis.
pub const MAX: f32 = 100.0;
/// Validates and constructs a new PowerLimitWatts.
pub fn try_new(watts: f32) -> Result<Self> {
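// NaN fails every comparison (and casts to 0 as an integer), so reject non-finite input explicitly.
if !watts.is_finite() || watts < Self::MIN || watts > Self::MAX {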
bail!("HardwareSafetyError: Requested TDP {:.1}W is outside safe bounds ({:.1}W - {:.1}W).", watts, Self::MIN, Self::MAX);
}
Ok(Self(watts))
}
pub fn from_watts(watts: f32) -> Result<Self> {
Self::try_new(watts)
}
pub fn get(&self) -> f32 { self.0 }
pub fn as_microwatts(&self) -> u64 { (self.0 * 1_000_000.0) as u64 }
}
/// Represents a validated fan speed percentage.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FanSpeedPercent(u8);
impl FanSpeedPercent {
pub fn try_new(percent: u8) -> Result<Self> {
if percent > 100 {
bail!("HardwareSafetyError: Fan speed {}% is invalid.", percent);
}
Ok(Self(percent))
}
pub fn new(percent: u8) -> Result<Self> {
Self::try_new(percent)
}
pub fn get(&self) -> u8 { self.0 }
}
/// Represents a thermal threshold in Celsius.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct ThermalThresholdCelsius(f32);
impl ThermalThresholdCelsius {
pub const MAX_SAFE_C: f32 = 98.0;
pub fn try_new(celsius: f32) -> Result<Self> {
// NaN fails every comparison, so reject non-finite input explicitly.
if !celsius.is_finite() || celsius > Self::MAX_SAFE_C {
bail!("HardwareSafetyError: Thermal threshold {}C exceeds safe limit ({}C).", celsius, Self::MAX_SAFE_C);
}
Ok(Self(celsius))
}
pub fn new(celsius: f32) -> Result<Self> {
Self::try_new(celsius)
}
pub fn get(&self) -> f32 { self.0 }
}
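
The bounds-checked newtypes above can be exercised in isolation. The following is a minimal sketch of the same pattern, assuming a hypothetical `Watts` type and plain `String` errors in place of `anyhow` so that it builds with the standard library alone. It also rejects non-finite input, because NaN fails every ordered comparison and converts to 0 when cast to an integer.

```rust
/// Hypothetical stand-in for `PowerLimitWatts`; the field is private, so the
/// only way to obtain a value is through `try_new`.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct Watts(f32);

impl Watts {
    pub const MIN: f32 = 3.0;
    pub const MAX: f32 = 100.0;

    /// Rejects NaN, infinities, and anything outside `MIN..=MAX`.
    pub fn try_new(watts: f32) -> Result<Self, String> {
        if !watts.is_finite() || !(Self::MIN..=Self::MAX).contains(&watts) {
            return Err(format!(
                "{watts} W is outside {}..={} W",
                Self::MIN,
                Self::MAX
            ));
        }
        Ok(Self(watts))
    }

    /// Converts to the integer microwatt value that RAPL sysfs files expect.
    pub fn as_microwatts(self) -> u64 {
        (self.0 as f64 * 1_000_000.0).round() as u64
    }
}

/// Callers can only reach this function with a value that passed `try_new`.
pub fn format_sysfs_value(limit: Watts) -> String {
    limit.as_microwatts().to_string()
}
```

Because actuator methods take the wrapper rather than a bare `f32`, an unchecked value cannot reach the sysfs write by accident.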
// --- 2. The HardwareStateGuard (RAII Restorer) ---
/// Defines an arbitrary action to take during restoration.
pub type RollbackAction = Box<dyn FnOnce() + Send + 'static>;
/// Holds a snapshot of the system state. Restores everything on Drop.
/// This is the primary safety mechanism for Project Iron-Ember.
pub struct HardwareStateGuard {
/// Maps sysfs paths to their original string contents.
snapshots: HashMap<PathBuf, String>,
/// Services that were stopped and must be restarted.
suppressed_services: Vec<String>,
/// Arbitrary actions to perform on restoration (e.g., reset fan mode).
rollback_actions: Vec<RollbackAction>,
is_active: bool,
}
impl HardwareStateGuard {
/// Snapshots the requested files and neutralizes competing services.
///
/// # SAFETY:
/// This MUST be acquired before any hardware mutation occurs.
pub fn acquire(target_files: &[PathBuf], target_services: &[String]) -> Result<Self> {
let mut snapshots = HashMap::new();
let mut suppressed = Vec::new();
info!("USA: Arming HardwareStateGuard. Snapshotting critical registers...");
for path in target_files {
if path.exists() {
let content = fs::read_to_string(path)
.with_context(|| format!("Failed to snapshot {:?}", path))?;
snapshots.insert(path.clone(), content.trim().to_string());
} else {
debug!("USA: Skipping snapshot for non-existent path {:?}", path);
}
}
for svc in target_services {
// Check if service is active before stopping
let status = std::process::Command::new("systemctl")
.args(["is-active", "--quiet", svc])
.status();
if let Ok(s) = status {
if s.success() {
info!("USA: Neutralizing service '{}'", svc);
let _ = std::process::Command::new("systemctl").args(["stop", svc]).status();
suppressed.push(svc.clone());
}
}
}
Ok(Self {
snapshots,
suppressed_services: suppressed,
rollback_actions: Vec::new(),
is_active: true,
})
}
/// Registers a custom action to be performed when the guard is released.
pub fn on_rollback(&mut self, action: RollbackAction) {
self.rollback_actions.push(action);
}
/// Explicitly release and restore the hardware state.
pub fn release(&mut self) -> Result<()> {
if !self.is_active { return Ok(()); }
info!("USA: Releasing guard. Restoring hardware to pre-flight state...");
// 1. Restore Power/Sysfs states
for (path, content) in &self.snapshots {
if let Err(e) = fs::write(path, content) {
error!("CRITICAL: Failed to restore {:?}: {}", path, e);
}
}
// 2. Restart Services
for svc in &self.suppressed_services {
let _ = std::process::Command::new("systemctl").args(["start", svc]).status();
}
// 3. Perform Custom Rollback Actions
for action in self.rollback_actions.drain(..) {
(action)();
}
self.is_active = false;
Ok(())
}
}
impl Drop for HardwareStateGuard {
fn drop(&mut self) {
if self.is_active {
warn!("USA: Guard dropped without an explicit release (early return or panic unwind). Force-restoring system...");
let _ = self.release();
}
}
}
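
The restore-on-drop behaviour can be seen in a pared-down form. This sketch assumes a hypothetical `FileSnapshot` guard that tracks a single file; it is not the `HardwareStateGuard` implementation, only the same RAII idea, and it uses `catch_unwind` to show that the destructor runs while a panic unwinds.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical single-file guard: remembers the contents at acquisition
/// time and writes them back when dropped.
pub struct FileSnapshot {
    path: PathBuf,
    original: String,
}

impl FileSnapshot {
    pub fn acquire(path: PathBuf) -> io::Result<Self> {
        let original = fs::read_to_string(&path)?;
        Ok(Self { path, original })
    }
}

impl Drop for FileSnapshot {
    fn drop(&mut self) {
        // Errors cannot be returned from `drop`, so report and carry on.
        if let Err(e) = fs::write(&self.path, &self.original) {
            eprintln!("failed to restore {:?}: {e}", self.path);
        }
    }
}

/// Overwrites `path` inside a guarded scope, panics, and returns what the
/// file contains after unwinding has run the guard's destructor.
pub fn demo_restore_after_panic(path: PathBuf) -> io::Result<String> {
    fs::write(&path, "15000000")?;
    let p = path.clone();
    let outcome = std::panic::catch_unwind(move || {
        let _guard = FileSnapshot::acquire(p.clone()).expect("snapshot");
        fs::write(&p, "25000000").expect("write");
        panic!("simulated failure");
    });
    assert!(outcome.is_err());
    fs::read_to_string(&path)
}
```

This only holds when panics unwind. With `panic = "abort"`, `SIGKILL`, or an unhandled `SIGTERM`, destructors do not run and the file keeps the modified value.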
// --- 3. The Active Watchdog ---
/// A standalone monitor that polls hardware thermals at high frequency.
pub struct ThermalWatchdog {
cancel_token: Arc<AtomicBool>,
handle: Option<thread::JoinHandle<()>>,
}
impl ThermalWatchdog {
/// If temperature exceeds this ceiling, the watchdog triggers an emergency shutdown.
pub const CRITICAL_TEMP: f32 = 95.0;
/// A 250 ms polling interval is intended to catch thermal runaways before the chassis saturates.
pub const POLL_INTERVAL: Duration = Duration::from_millis(250);
/// Spawns the watchdog thread.
pub fn spawn(sensors: Arc<dyn SensorBus>, cancel_token: Arc<AtomicBool>) -> Self {
let ct = cancel_token.clone();
let handle = thread::spawn(move || {
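// No previous sample yet: starting at +infinity keeps the first delta negative,
// so the ramp check below cannot fire on the first poll.
let mut last_temp = f32::INFINITY;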
loop {
if ct.load(Ordering::SeqCst) {
debug!("Watchdog: Shutdown signal received.");
break;
}
match sensors.get_temp() {
Ok(temp) => {
// Rate of change check (dT/dt)
let dt_dt = temp - last_temp;
if temp >= Self::CRITICAL_TEMP {
error!("WATCHDOG: CRITICAL THERMAL EVENT ({:.1}C). Triggering emergency abort!", temp);
ct.store(true, Ordering::SeqCst);
break;
}
if dt_dt > 5.0 && temp > 85.0 {
warn!("WATCHDOG: Dangerous thermal ramp detected (+{:.1}C in {:?}).", dt_dt, Self::POLL_INTERVAL);
}
last_temp = temp;
}
Err(e) => {
error!("WATCHDOG: Sensor read failure: {}. Aborting for safety!", e);
ct.store(true, Ordering::SeqCst);
break;
}
}
thread::sleep(Self::POLL_INTERVAL);
}
});
Self {
cancel_token,
handle: Some(handle),
}
}
}
impl Drop for ThermalWatchdog {
fn drop(&mut self) {
self.cancel_token.store(true, Ordering::SeqCst);
if let Some(h) = self.handle.take() {
let _ = h.join();
}
}
}
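
The watchdog's control flow, a polling thread that shares one cancellation flag with its owner, can be sketched without the project's traits. Everything here is hypothetical naming: `spawn_watchdog` takes a closure in place of `SensorBus`, and the thread returns whether it tripped so the result can be observed on `join`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Polls `read_temp` until it reports `critical_c` or more, the read fails,
/// or `cancel` is set by someone else. On a trip it sets `cancel` itself.
/// The thread yields `true` if it tripped and `false` if it was cancelled.
pub fn spawn_watchdog<F>(
    read_temp: F,
    critical_c: f32,
    poll: Duration,
    cancel: Arc<AtomicBool>,
) -> thread::JoinHandle<bool>
where
    F: Fn() -> Result<f32, String> + Send + 'static,
{
    thread::spawn(move || loop {
        if cancel.load(Ordering::SeqCst) {
            return false; // cancelled externally
        }
        match read_temp() {
            // Below the ceiling: keep polling.
            Ok(t) if t < critical_c => {}
            // Too hot, NaN, or the sensor is unreadable: fail closed.
            _ => {
                cancel.store(true, Ordering::SeqCst);
                return true;
            }
        }
        thread::sleep(poll);
    })
}
```

Using the same flag for both directions keeps the owner's shutdown path simple: setting it stops the watchdog, and reading it tells the benchmark loop that the watchdog asked for an abort.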
// --- 4. Transactional Configuration ---
/// A staged set of changes to be applied to the hardware.
#[derive(Default)]
pub struct ConfigurationTransaction {
changes: Vec<(PathBuf, String)>,
}
impl ConfigurationTransaction {
pub fn add_change(&mut self, path: PathBuf, value: String) {
self.changes.push((path, value));
}
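/// # SAFETY:
/// Applies the staged changes in order and stops at the first failed write.
/// Writes that succeeded before the failure stay in effect: the transaction
/// does not roll back on its own. Restoration relies on an active
/// HardwareStateGuard that snapshotted the same paths.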
pub fn commit(self) -> Result<()> {
for (path, val) in self.changes {
fs::write(&path, val)
.with_context(|| format!("Failed to apply change to {:?}", path))?;
}
Ok(())
}
}

View File

@@ -115,79 +115,54 @@ impl<T: EnvironmentGuard + ?Sized> EnvironmentGuard for Arc<T> {
}
}
use crate::sal::safety::{PowerLimitWatts, FanSpeedPercent};
/// Provides a read-only interface to system telemetry sensors.
pub trait SensorBus: Send + Sync {
/// Returns the current package temperature in degrees Celsius.
///
/// # Errors
/// Returns an error if the underlying `hwmon` or `sysfs` node cannot be read.
fn get_temp(&self) -> Result<f32>;
/// Returns the current package power consumption in Watts.
///
/// # Errors
/// Returns an error if the underlying RAPL or power sensor cannot be read.
fn get_power_w(&self) -> Result<f32>;
/// Returns the current speed of all detected fans in RPM.
///
/// # Errors
/// Returns an error if the fan sensor nodes cannot be read.
fn get_fan_rpms(&self) -> Result<Vec<u32>>;
/// Returns the current average CPU frequency in MHz.
///
/// # Errors
/// Returns an error if `/proc/cpuinfo` or a `cpufreq` sysfs node cannot be read.
fn get_freq_mhz(&self) -> Result<f32>;
/// Returns true if the system is currently thermally throttling.
fn get_throttling_status(&self) -> Result<bool>;
}
impl<T: SensorBus + ?Sized> SensorBus for Arc<T> {
fn get_temp(&self) -> Result<f32> {
(**self).get_temp()
}
fn get_power_w(&self) -> Result<f32> {
(**self).get_power_w()
}
fn get_fan_rpms(&self) -> Result<Vec<u32>> {
(**self).get_fan_rpms()
}
fn get_freq_mhz(&self) -> Result<f32> {
(**self).get_freq_mhz()
}
fn get_temp(&self) -> Result<f32> { (**self).get_temp() }
fn get_power_w(&self) -> Result<f32> { (**self).get_power_w() }
fn get_fan_rpms(&self) -> Result<Vec<u32>> { (**self).get_fan_rpms() }
fn get_freq_mhz(&self) -> Result<f32> { (**self).get_freq_mhz() }
fn get_throttling_status(&self) -> Result<bool> { (**self).get_throttling_status() }
}
/// Provides a write-only interface for hardware actuators.
pub trait ActuatorBus: Send + Sync {
/// Sets the fan control mode (e.g., "auto" or "max").
///
/// # Errors
/// Returns an error if the fan control command or `sysfs` write fails.
fn set_fan_mode(&self, mode: &str) -> Result<()>;
/// Sets the sustained power limit (PL1) in Watts.
///
/// # Errors
/// Returns an error if the RAPL `sysfs` node cannot be written to.
fn set_sustained_power_limit(&self, watts: f32) -> Result<()>;
/// Sets the fan speed directly using a validated percentage.
fn set_fan_speed(&self, speed: FanSpeedPercent) -> Result<()>;
/// Sets the burst power limit (PL2) in Watts.
///
/// # Errors
/// Returns an error if the RAPL `sysfs` node cannot be written to.
fn set_burst_power_limit(&self, watts: f32) -> Result<()>;
/// Sets the sustained power limit (PL1) using a validated wrapper.
fn set_sustained_power_limit(&self, limit: PowerLimitWatts) -> Result<()>;
/// Sets the burst power limit (PL2) using a validated wrapper.
fn set_burst_power_limit(&self, limit: PowerLimitWatts) -> Result<()>;
}
impl<T: ActuatorBus + ?Sized> ActuatorBus for Arc<T> {
fn set_fan_mode(&self, mode: &str) -> Result<()> {
(**self).set_fan_mode(mode)
}
fn set_sustained_power_limit(&self, watts: f32) -> Result<()> {
(**self).set_sustained_power_limit(watts)
}
fn set_burst_power_limit(&self, watts: f32) -> Result<()> {
(**self).set_burst_power_limit(watts)
}
fn set_fan_mode(&self, mode: &str) -> Result<()> { (**self).set_fan_mode(mode) }
fn set_fan_speed(&self, speed: FanSpeedPercent) -> Result<()> { (**self).set_fan_speed(speed) }
fn set_sustained_power_limit(&self, limit: PowerLimitWatts) -> Result<()> { (**self).set_sustained_power_limit(limit) }
fn set_burst_power_limit(&self, limit: PowerLimitWatts) -> Result<()> { (**self).set_burst_power_limit(limit) }
}
/// Represents the high-level safety status of the system.

View File

@@ -1,5 +1,6 @@
use ember_tune_rs::sal::heuristic::discovery::discover_facts;
use ember_tune_rs::sal::heuristic::schema::{Discovery, SensorDiscovery, ActuatorDiscovery, Benchmarking};
use ember_tune_rs::sys::MockSyscallRunner;
use crate::common::fakesys::FakeSysBuilder;
mod common;
@@ -35,7 +36,9 @@ fn test_heuristic_discovery_with_fakesys() {
power_steps_watts: vec![10.0, 15.0],
};
let facts = discover_facts(&fake.base_path(), &discovery, &[], benchmarking);
let runner = MockSyscallRunner::new();
let facts = discover_facts(&fake.base_path(), &runner, &discovery, &[], benchmarking);
assert_eq!(facts.vendor, "Dell Inc.");
assert_eq!(facts.model, "XPS 13 9380");

View File

@@ -1,16 +1,23 @@
use ember_tune_rs::orchestrator::BenchmarkOrchestrator;
use ember_tune_rs::sal::mock::MockSal;
use ember_tune_rs::sal::heuristic::discovery::SystemFactSheet;
use ember_tune_rs::load::Workload;
use ember_tune_rs::load::{Workload, IntensityProfile, WorkloadMetrics};
use std::time::Duration;
use anyhow::Result;
use std::sync::mpsc;
use std::sync::Arc;
use anyhow::Result;
struct MockWorkload;
impl Workload for MockWorkload {
fn start(&mut self, _threads: usize, _load_percent: usize) -> Result<()> { Ok(()) }
fn stop(&mut self) -> Result<()> { Ok(()) }
fn get_throughput(&self) -> Result<f64> { Ok(100.0) }
fn initialize(&mut self) -> Result<()> { Ok(()) }
fn run_workload(&mut self, _duration: Duration, _profile: IntensityProfile) -> Result<()> { Ok(()) }
fn get_current_metrics(&self) -> Result<WorkloadMetrics> {
Ok(WorkloadMetrics {
primary_ops_per_sec: 100.0,
elapsed_time: Duration::from_secs(1),
})
}
fn stop_workload(&mut self) -> Result<()> { Ok(()) }
}
#[test]
@@ -28,6 +35,7 @@ fn test_orchestrator_e2e_state_machine() {
workload,
telemetry_tx,
command_rx,
None,
);
// For the purpose of this architecture audit, we've demonstrated the

56
tests/safety_test.rs Normal file
View File

@@ -0,0 +1,56 @@
use anyhow::Result;
use std::fs;
use std::path::PathBuf;
use ember_tune_rs::sal::safety::{HardwareStateGuard, PowerLimitWatts};
use crate::common::fakesys::FakeSysBuilder;
mod common;
#[test]
fn test_hardware_state_guard_panic_restoration() {
let fake = FakeSysBuilder::new();
let pl1_path = fake.base_path().join("sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw");
fake.add_rapl("intel-rapl:0", "1000", "15000000"); // 15W original
let target_files = vec![pl1_path.clone()];
// Simulate a scope where the guard is active
{
let _guard = HardwareStateGuard::acquire(&target_files, &[]).expect("Failed to acquire guard");
// Modify the file
fs::write(&pl1_path, "25000000").expect("Failed to write new value");
assert_eq!(fs::read_to_string(&pl1_path).unwrap().trim(), "25000000");
// Guard is dropped here (simulating end of scope or panic)
}
// Verify restoration
let restored = fs::read_to_string(&pl1_path).expect("Failed to read restored file");
assert_eq!(restored.trim(), "15000000");
}
#[test]
fn test_tdp_limit_bounds_checking() {
// 1. Valid value
assert!(PowerLimitWatts::try_new(15.0).is_ok());
// 2. Too low (below the 3W safety floor, PowerLimitWatts::MIN)
let low_res = PowerLimitWatts::try_new(1.0);
assert!(low_res.is_err());
assert!(low_res.unwrap_err().to_string().contains("outside safe bounds"));
// 3. Too high (above the 100W safety ceiling, PowerLimitWatts::MAX)
let high_res = PowerLimitWatts::try_new(101.0);
assert!(high_res.is_err());
assert!(high_res.unwrap_err().to_string().contains("outside safe bounds"));
}
#[test]
fn test_0w_tdp_regression_prevention() {
// The prime directive is to never set 0W.
// Ensure the try_new() constructor explicitly fails for 0.
let zero_res = PowerLimitWatts::try_new(0.0);
assert!(zero_res.is_err());
}