1.2 Core Features
- Plugin architecture based on the Modular Component Architecture (MCA)
  - Sophisticated auto-select algorithms based on system size and available resources (see the sketch below)
  - Binary proprietary plugin support
  - On-the-fly updates for maintenance*
  - Addition of new plugin capabilities/features without requiring a system-wide restart, provided compatibility requirements are met
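To make the component pattern concrete, the sketch below shows, in C, the general shape of an MCA-style plugin: a struct of function pointers plus priority-based auto-selection. All names here (`component_t`, `select_component`, and so on) are hypothetical illustrations, not the actual Sensys headers.

```c
/* Minimal sketch of an MCA-style plugin component. Each plugin
 * exports a struct of function pointers; the framework picks among
 * the available components using the priority returned by query().
 * All names are hypothetical, not the actual Sensys API. */
#include <stddef.h>

typedef struct component {
    const char *name;
    int (*open)(void);           /* load-time initialization   */
    int (*close)(void);          /* unload/cleanup             */
    int (*query)(int *priority); /* usable here? how suitable? */
} component_t;

/* Auto-select: ask each available component for its priority and
 * keep the highest; a simple form of the auto-select algorithms
 * described above. */
static component_t *select_component(component_t **avail, size_t n)
{
    component_t *best = NULL;
    int best_prio = -1;

    for (size_t i = 0; i < n; i++) {
        int prio = -1;
        if (avail[i]->query(&prio) == 0 && prio > best_prio) {
            best_prio = prio;
            best = avail[i];
        }
    }
    return best; /* NULL if no component is usable on this system */
}
```

Because each plugin is self-describing, a newly installed component can be considered at the next selection pass, which mirrors the restart-free update property described above.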
- Hardware discovery support
  - Automatic reporting of hardware inventory on startup
  - Automatic updating upon node removal and replacement
- Scalable overlay network*
  - Supports multiple topologies, including both tree and mesh plugins
  - Automatic route failover and restoration; messages are cached pending communication recovery (sketched below)
  - Both in-band and out-of-band transports, with auto-failover between them
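The caching-pending-recovery behavior can be sketched as a FIFO of pending messages per peer, flushed when any route is restored. `peer_t`, `transport_send`, and the rest are illustrative stand-ins, not the actual Sensys transport code.

```c
/* Sketch: while no route to a peer is up, outbound messages are
 * queued; when a route (in-band or out-of-band) is restored, the
 * queue is flushed in order. All names are illustrative stand-ins. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct msg {
    struct msg *next;
    size_t      len;
    char        data[];
} msg_t;

typedef struct peer {
    bool   route_up;     /* at least one transport currently usable */
    msg_t *pending_head; /* FIFO of messages cached during outage   */
    msg_t *pending_tail;
} peer_t;

/* Stand-in for the real transport layer. */
static int transport_send(peer_t *p, const char *data, size_t len)
{
    (void)p;
    printf("sent %zu bytes: %.*s\n", len, (int)len, data);
    return 0;
}

static int peer_send(peer_t *p, const char *data, size_t len)
{
    if (p->route_up)
        return transport_send(p, data, len);

    /* Route down: cache the message until recovery. */
    msg_t *m = malloc(sizeof(*m) + len);
    if (m == NULL)
        return -1;
    m->next = NULL;
    m->len  = len;
    memcpy(m->data, data, len);
    if (p->pending_tail != NULL)
        p->pending_tail->next = m;
    else
        p->pending_head = m;
    p->pending_tail = m;
    return 0;
}

/* Called when any transport to the peer comes back up. */
static void peer_route_restored(peer_t *p)
{
    p->route_up = true;
    while (p->pending_head != NULL) {
        msg_t *m = p->pending_head;
        p->pending_head = m->next;
        transport_send(p, m->data, m->len);
        free(m);
    }
    p->pending_tail = NULL;
}

int main(void)
{
    peer_t p = { .route_up = false };
    peer_send(&p, "reading-1", 9); /* cached: route is down   */
    peer_send(&p, "reading-2", 9); /* cached behind the first */
    peer_route_restored(&p);       /* both flushed, in order  */
    return 0;
}
```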
- Sensors
  - Both push and pull models supported
  - Sensors can be read as a group at a specified rate, individual sensors can be read at their own regular intervals, or any combination of sensors can be polled individually on request (see the sampling sketch after this list)
  - Polling requests can return information directly to the caller, or can include the reading in the next database update, as specified by the caller
  - Data are collected locally and sent to an aggregator for recording into a database at specified time intervals
  - Environment sensors
    - Processor temperature - on-board sensor for reading processor temperatures when the coretemp kernel module is loaded
    - Processor frequency - on-board sensor for reading processor frequencies; requires read access to the /sys/devices/system/cpu directory
    - IPMI readings of AC power, cabinet temperature, water and air temperatures, etc.
    - Processor power - reads processor power from the on-board MSR; only supported for Intel SandyBridge, IvyBridge, and Haswell processors
  - Resource utilization
    - Wide range of process- and node-level resource utilization metrics, including memory, CPU, network I/O, and disk I/O
  - Process failure
    - Monitors specified file(s) for programmatic access within specified time intervals and/or file size restrictions
  - Heartbeat monitoring
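The sampling model above can be sketched as a per-sensor schedule: a shared interval gives group sampling at one rate, distinct intervals give per-sensor rates, and a direct read gives on-demand polling. `sensor_t`, `push_to_aggregator`, and the other names are hypothetical, not the actual Sensys sensor framework.

```c
/* Sketch of the sampling model: sensors read as a group at one rate
 * (all sharing an interval), at their own per-sensor intervals, or
 * polled on demand. Readings are pushed to an aggregator, which
 * records them into a database. Names are hypothetical. */
#include <stddef.h>
#include <stdio.h>
#include <time.h>

typedef struct sensor {
    const char *name;
    double    (*read)(void); /* pull one reading from the device */
    time_t      interval;    /* sampling period in seconds       */
    time_t      next_due;    /* next time this sensor is sampled */
} sensor_t;

/* Stand-in for forwarding a reading to the aggregator (push model). */
static void push_to_aggregator(const char *name, double value)
{
    printf("%s = %f\n", name, value);
}

/* One pass of the sampling loop: read every sensor whose interval
 * has elapsed. Group sampling is the special case where all sensors
 * share the same interval. */
static void sample_due_sensors(sensor_t *sensors, size_t n, time_t now)
{
    for (size_t i = 0; i < n; i++) {
        if (now >= sensors[i].next_due) {
            push_to_aggregator(sensors[i].name, sensors[i].read());
            sensors[i].next_due = now + sensors[i].interval;
        }
    }
}

/* On-demand poll (pull model): the reading goes straight back to the
 * caller rather than waiting for the next scheduled update. */
static double poll_sensor(sensor_t *s)
{
    return s->read();
}
```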
- Analytics*
  - Supports in-flight reduction of sensor data
  - User-defined workflows for data analysis using available plugins connected in user-defined chains (see the sketch after this list)
  - Analysis chain outputs can be included in database reporting or sent to the requestor, at user direction
  - Plugins can support event and alert generation
  - A "tap" plugin directs a copied stream of selected data from a specified point to a remote requestor
  - Chains can be defined at any aggregation point in the system
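One way to picture a user-defined chain is as a linked list of processing stages, each a plugin callback. The sketch below, with hypothetical names (`stage_t`, `run_chain`), shows an in-flight reduction stage, a running average, as an example plugin.

```c
/* Sketch of a user-defined analysis chain: plugins connected in a
 * linear pipeline, each transforming a reading and passing it on.
 * The final value can go into the database report or back to the
 * requestor, per user direction. Names are hypothetical. */
#include <stddef.h>

typedef struct stage {
    double      (*process)(double in, void *ctx); /* one plugin step */
    void         *ctx;                            /* plugin state    */
    struct stage *next;                           /* next in chain   */
} stage_t;

/* Run a reading through the whole chain at an aggregation point. */
static double run_chain(stage_t *head, double reading)
{
    for (stage_t *s = head; s != NULL; s = s->next)
        reading = s->process(reading, s->ctx);
    return reading;
}

/* Example stage: a running average, a simple in-flight reduction. */
typedef struct { double sum; size_t count; } avg_ctx_t;

static double running_average(double in, void *ctx)
{
    avg_ctx_t *a = ctx;
    a->sum += in;
    a->count += 1;
    return a->sum / (double)a->count;
}
```

In this picture, a "tap" would simply be a stage that forwards a copy of its input to a remote requestor and returns the value unchanged.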
- Diagnostics*
  - Supports system diagnostic capabilities, including memory, processor, and Ethernet diagnostics
  - Diagnostics can be launched from octl on a specific node or a list of nodes
- Database
  - Both raw and processed data can be stored in one or more databases
  - Supports both SQL and non-SQL* databases
  - Multiple instances of either type can be used in parallel*
  - The target database for different data sources and/or types can be specified using MCA parameters during configure and startup, and can be altered by command during operation* (see the sketch below)
  - Cross-data correlation maintained
    - Relationships between job, sensor, and performance data are tracked and linked for easy retrieval*
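A rough sketch of per-type database targeting: each configured backend declares which data type it records, standing in here for the MCA-parameter selection described above. `db_backend_t` and `db_store` are illustrative names, not the actual Sensys database framework.

```c
/* Sketch: route each reading to every backend configured for its
 * data type; SQL and non-SQL backends can coexist, and multiple
 * instances naturally run in parallel. Names are illustrative. */
#include <stddef.h>
#include <string.h>

typedef struct db_backend {
    const char *name;    /* backend/plugin identifier              */
    const char *accepts; /* data source/type this backend records  */
    int       (*store)(const char *key, double value);
} db_backend_t;

static void db_store(db_backend_t *backends, size_t n,
                     const char *type, const char *key, double value)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(backends[i].accepts, type) == 0)
            backends[i].store(key, value);
    }
}
```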
- Fault Tolerance
  - Self-healing communication system (see above)
  - Non-heartbeat detection of node failures
  - Automatic state recovery based on retrieval of state information from peers*
*indicates areas of development
The following are the Sensys software tool components:
- orcmd – A Sensys daemon that runs on all compute nodes with root privileges. It collects in-band RAS monitoring (sensor) data from the compute nodes.
- orcmd aggregator – The same orcmd daemon with the aggregator role assigned in the Sensys configuration file (orcm-site.xml). It runs on a service node (rack controller, row controller, or system management server) and collects out-of-band (OOB) and in-band RAS monitoring data from the compute nodes. OOB data collection requires IPMI access to the BMC on each compute node; in-band data collection goes through the management network.
- orcmsched – A scheduler daemon that runs on the SMS or head node. It manages the compute resources in a cluster.
- octl – An administrative command-line interface.