
1.2 Core Features

  • Plugin architecture based on the Modular Component Architecture (MCA)

    • Sophisticated auto-select algorithms based on system size and available resources
    • Support for proprietary binary plugins
    • On-the-fly updates for maintenance*
    • Addition of new plugin capabilities/features without requiring a system-wide restart, provided compatibility requirements are met
  • Hardware discovery support

    • Automatic reporting of hardware inventory on startup
    • Automatic updating upon node removal and replacement
  • Scalable overlay network*

    • Supports multiple topologies, including both tree and mesh plugins
    • Automatic route failover and restoration, with messages cached pending communication recovery
    • Both in-band and out-of-band transports with auto-failover between them
  • Sensors

    • Both push and pull models supported
    • Sensors can be read as a group at a specified rate, individual sensors can be read at their own regular intervals, or any combination of sensors can be polled on request (a sampling-model sketch follows this feature list)
    • Polling requests can return readings directly to the caller or include them in the next database update, as specified by the caller
    • Data collected locally and sent to an aggregator for recording into a database at specified time intervals
    • Environment sensors
      • Processor temperature - on-board sensor for reading processor temperatures when the coretemp kernel module is loaded
      • Processor frequency - on-board sensor for reading processor frequencies; requires read access to the /sys/devices/system/cpu directory (a sysfs-reading sketch follows this feature list)
      • IPMI readings of AC power, cabinet temperature, water and air temperatures, etc.
      • Processor power - reads processor power from the on-board MSRs; supported only on Intel Sandy Bridge, Ivy Bridge, and Haswell processors
    • Resource utilization
      • Wide range of process- and node-level resource utilization metrics, including memory, CPU, network I/O, and disk I/O
    • Process failure
      • Monitors specified file(s) for programmatic access within specified time intervals and/or against file size restrictions
    • Heartbeat monitoring
  • Analytics*

    • Supports in-flight reduction of sensor data
    • User-defined workflows for data analysis, built from available plugins connected in user-defined chains (a conceptual chain sketch follows this feature list)
    • Analysis chain outputs can be included in database reporting or sent to the requestor, at the user's direction
    • Plugins can support event and alert generation
    • A "tap" plugin directs a copy of a selected data stream from a specified point to a remote requestor
    • Chains can be defined at any aggregation point in the system
  • Diagnostics*

    • Supports system diagnostics, including memory, processor, and Ethernet diagnostics
    • Diagnostics can be launched from octl on a specific node or a list of nodes
  • Database

    • Both raw and processed data can be stored in one or more databases
    • Supports both SQL and non-SQL* databases
      • Multiple instances of either type can be used in parallel*
      • The target database for different data sources and/or data types can be specified using MCA parameters during configure and startup, and can be altered by command during operation*
    • Cross-data correlation maintained
      • Relationship between job, sensor, and performance data tracked and linked for easy retrieval*
  • Fault Tolerance

    • Self-healing communication system (see above)
    • Non-heartbeat detection of node failures
    • Automatic state recovery based on retrieval of state information from peers*

* indicates areas under development
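
The sampling model described under Sensors (group reads at a configured rate, per-sensor intervals, and on-demand polls whose results go either to the caller or into the next database update) can be pictured with a short sketch. The Python below is purely conceptual and is not Sensys code or its plugin API; names such as SensorGroup, poll, and run are hypothetical.

```python
"""Conceptual sketch of the sampling model described above; not the Sensys API.
Sensors are read as a group at a fixed rate and pushed to an aggregator, and any
subset can be polled on demand, returning to the caller or joining the next push."""
import time

class SensorGroup:
    def __init__(self, sensors, aggregator, rate_sec):
        self.sensors = sensors        # {name: zero-argument callable returning a reading}
        self.aggregator = aggregator  # callable receiving {name: reading}
        self.rate_sec = rate_sec      # group sampling rate in seconds
        self.pending = {}             # poll results deferred to the next push

    def poll(self, names, to_caller=True):
        """Read selected sensors now; return them or defer them to the next push."""
        readings = {name: self.sensors[name]() for name in names}
        if to_caller:
            return readings
        self.pending.update(readings)
        return None

    def run(self, cycles):
        """Read the whole group at the configured rate and push to the aggregator."""
        for _ in range(cycles):
            readings = {name: read() for name, read in self.sensors.items()}
            readings.update(self.pending)   # include deferred poll results
            self.pending.clear()
            self.aggregator(readings)
            time.sleep(self.rate_sec)

if __name__ == "__main__":
    group = SensorGroup(
        sensors={"load1": lambda: float(open("/proc/loadavg").read().split()[0])},
        aggregator=lambda r: print("push to aggregator:", r),
        rate_sec=1.0,
    )
    print("on-demand poll:", group.poll(["load1"]))  # returned directly to the caller
    group.run(cycles=2)                              # periodic group reads
```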
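
The processor temperature and frequency sensors read standard Linux interfaces: hwmon entries exposed by the coretemp kernel module and the cpufreq files under /sys/devices/system/cpu. The Python sketch below reads those same sources; it is illustrative only and is not part of Sensys.

```python
#!/usr/bin/env python3
"""Illustrative only (not Sensys code): read processor temperatures from the
coretemp hwmon interface and current CPU frequencies from cpufreq sysfs files."""
import glob
import os

def read_coretemp_celsius():
    """Return {label: temperature in C} from hwmon devices named 'coretemp'."""
    temps = {}
    for name_file in glob.glob("/sys/class/hwmon/hwmon*/name"):
        with open(name_file) as f:
            if f.read().strip() != "coretemp":
                continue
        hwmon_dir = os.path.dirname(name_file)
        for temp_input in glob.glob(os.path.join(hwmon_dir, "temp*_input")):
            label_file = temp_input.replace("_input", "_label")
            if os.path.exists(label_file):
                with open(label_file) as lf:
                    label = lf.read().strip()
            else:
                label = os.path.basename(temp_input)
            with open(temp_input) as tf:
                temps[label] = int(tf.read().strip()) / 1000.0  # reported in millidegrees
    return temps

def read_cpu_freq_khz():
    """Return {cpuN: current frequency in kHz} from /sys/devices/system/cpu."""
    freqs = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"
    for path in glob.glob(pattern):
        cpu = path.split("/")[5]           # e.g. 'cpu0'
        with open(path) as f:
            freqs[cpu] = int(f.read().strip())
    return freqs

if __name__ == "__main__":
    print("temperatures (C):", read_coretemp_celsius())
    print("frequencies (kHz):", read_cpu_freq_khz())
```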
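
An analytics chain can be pictured as plugins connected in sequence, each reducing sensor data in flight and optionally raising alerts, with the chain's output reported to the database or to a requestor. The following Python sketch shows the idea only; Average, Threshold, and run_chain are hypothetical names and do not reflect the actual Sensys workflow or plugin interfaces.

```python
"""Conceptual sketch of a user-defined analysis chain; not the Sensys analytics API."""

class Average:
    """Reduce a window of samples to their mean (in-flight data reduction)."""
    def __init__(self, window):
        self.window, self.buf = window, []
    def process(self, sample):
        self.buf.append(sample)
        if len(self.buf) < self.window:
            return None                      # nothing to pass downstream yet
        mean = sum(self.buf) / len(self.buf)
        self.buf.clear()
        return mean

class Threshold:
    """Pass data through unchanged, raising an alert when a limit is exceeded."""
    def __init__(self, limit, alert):
        self.limit, self.alert = limit, alert
    def process(self, sample):
        if sample is not None and sample > self.limit:
            self.alert(f"alert: value {sample:.1f} exceeded limit {self.limit}")
        return sample

def run_chain(chain, samples):
    """Feed samples through each plugin in order; values reduced away are dropped."""
    outputs = []
    for s in samples:
        for plugin in chain:
            s = plugin.process(s)
            if s is None:
                break
        else:
            outputs.append(s)
    return outputs

if __name__ == "__main__":
    chain = [Average(window=3), Threshold(limit=80.0, alert=print)]
    temps = [70.0, 72.0, 75.0, 85.0, 90.0, 88.0]
    print("reduced chain output:", run_chain(chain, temps))
```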

The following are the Sensys software components:

  1. orcmd – A Sensys daemon that runs on all compute nodes with root privileges. It collects in-band RAS monitoring (sensor) data on the compute nodes.
  2. orcmd aggregator – Another instance of the Sensys daemon (orcmd) that runs on a service node (rack controller, row controller, or system management server) and collects out-of-band (OOB) and in-band RAS monitoring data from the compute nodes. OOB data collection requires IPMI access to the BMC on each compute node; in-band data collection goes through the management network. The binary is the same orcmd as above, with the aggregator role assigned in the Sensys configuration file (orcm-site.xml).
  3. orcmsched – A scheduler daemon that runs on the SMS (system management server) node or head node. It manages the compute resources in the cluster.
  4. octl – An administrative command-line interface for Sensys.