From 1f6307fec6ef701489131d7b1a6f6fa466edb6d0 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 11:57:08 -0500 Subject: [PATCH 01/27] ht/welcome massive restructure --- h2o-docs/src/product/welcome.rst | 1125 +++--------------------------- 1 file changed, 94 insertions(+), 1031 deletions(-) diff --git a/h2o-docs/src/product/welcome.rst b/h2o-docs/src/product/welcome.rst index 391b0cdbb6bf..4ff268ee03ad 100644 --- a/h2o-docs/src/product/welcome.rst +++ b/h2o-docs/src/product/welcome.rst @@ -1,1078 +1,141 @@ -Welcome to H2O 3 +Welcome to H2O-3 ================ -H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment. +H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform. It lets you build machine learning models on big data and provides easy productionalization of those models in an enterprise environment. -H2O's core code is written in Java. Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines. The algorithms are implemented on top of H2O's distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading. The data is read in parallel and is distributed across the cluster and stored in memory in a columnar format in a compressed way. H2O’s data parser has built-in intelligence to guess the schema of the incoming dataset and supports data ingest from multiple sources in various formats. +Basic framework +--------------- -H2O’s REST API allows access to all the capabilities of H2O from an external program or script via JSON over HTTP. The Rest API is used by H2O’s web interface (Flow UI), R binding (H2O-R), and Python binding (H2O-Python). +H2O's core code is written in Java. 
A distributed key-value store is used to access and reference data, models, objects, etc. across all nodes and machines. The algorithms are implemented on top of H2O's distributed map-reduce framework and utilize the Java fork/join framework for multi-threading. The data is read in parallel and is distributed across the cluster. It is stored in-memory in a columnar format in a compressed way. H2O's data parser has built-in intelligence to guess the schema of the incoming dataset and supports data ingest from multiple sources in various formats. -The speed, quality, ease-of-use, and model-deployment for the various cutting edge Supervised and Unsupervised algorithms like Deep Learning, Tree Ensembles, and GLRM make H2O a highly sought after API for big data data science. +REST API +~~~~~~~~ -H2O is licensed under the `Apache License, Version 2.0 `_. +H2O's REST API allows access to all the capabilities of H2O from an external program or script through JSON over HTTP. The REST API is used by H2O's web interface (Flow UI), R binding (H2O-R), and Python binding (H2O-Python). -Requirements ------------- - -At a minimum, we recommend the following for compatibility with H2O: - -- **Operating Systems**: - - - Windows 7 or later - - OS X 10.9 or later - - Ubuntu 12.04 - - RHEL/CentOS 6 or later - -- **Languages**: R and Python are not required to use H2O unless you want to use H2O in those environments, but Java is always required (see `below `__). - - - R version 3 or later - - Python 3.6.x, 3.7.x, 3.8.x, 3.9.x, 3.10.x, 3.11.x - -- **Browser**: An internet browser is required to use H2O's web UI, Flow. Supported versions include the latest version of Chrome, Firefox, Safari, or Internet Explorer. - -Java Requirements -~~~~~~~~~~~~~~~~~ - -H2O runs on Java. To build H2O or run H2O tests, the 64-bit JDK is required. To run the H2O binary using either the command line, R, or Python packages, only 64-bit JRE is required. 
+The speed, quality, ease-of-use, and model-deployment for our various supervised and unsupervised algorithms (such as Deep Learning, GLRM, or our tree ensembles) make H2O a highly sought after API for big data data science. -H2O supports the following versions of Java: +H2O is licensed under the `Apache License, Version 2.0 `__. -- Java SE 17, 16, 15, 14, 13, 12, 11, 10, 9, 8 - -Click `here `__ to download the latest supported version. - -Using Unsupported Java Versions -''''''''''''''''''''''''''''''' - -We recommend that only power users force an unsupported Java version. Unsupported Java versions can only be used for experiments. For production versions, we only guarantee the Java versions from the supported list. +Requirements +------------ -To force an unsupported Java version: +We recommend the following at minimum for compatibility with H2O: -:: +- **Operating systems**: - java -jar -Dsys.ai.h2o.debug.allowJavaVersions=19 h2o.jar - -Running H2O on Hadoop -''''''''''''''''''''' - -Java support is different between H2O and Hadoop. Hadoop only supports `Java 8 and Java 11 `__. When running H2O on Hadoop, we recommend only running H2O on Java 8 or Java 11. - -Additional Requirements -~~~~~~~~~~~~~~~~~~~~~~~ - -- **Hadoop**: Hadoop is not required to run H2O unless you want to deploy H2O on a Hadoop cluster. Supported versions are listed on the `Download page `_ (when you select the Install on Hadoop tab) and include: + - Windows 7+ + - Mac OS 10.9+ + - Ubuntu 12.04 + - RHEL/CentOS 6+ - - Cloudera CDH 5.4 or later - - Hortonworks HDP 2.2 or later - - MapR 4.0 or later - - IBM Open Platform 4.2 - - Refer to the :ref:`on-hadoop` section for detailed information. - -- **Conda 3.6+ repo**: Conda is not required to run H2O unless you want to run H2O on the Anaconda Cloud. Refer to the :ref:`anaconda` section for more information. - -- **Spark**: Version 3.4, 3.3, 3.2, 3.1, 3.0, 2.4, or 2.3. Spark is only required if you want to run `Sparkling Water `__. 
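As a quick pre-flight check before launching, you can compare the local Java major version against the supported list above (Java SE 8 through 17). A minimal sketch; the version-string parsing rules below are simplified assumptions, not H2O's actual startup logic:

```python
# Sketch: check a Java version string against H2O's supported majors (8-17).
# The parsing is a simplified assumption, not H2O's actual validation code.
SUPPORTED_MAJORS = set(range(8, 18))  # Java SE 8 through 17

def java_major(version: str) -> int:
    """Extract the major version from a `java -version` style string.

    Handles both the legacy "1.8.0_292" scheme and modern "11.0.2" / "17".
    """
    parts = version.split(".")
    if parts[0] == "1" and len(parts) > 1:   # legacy 1.x scheme -> 1.8 is Java 8
        return int(parts[1])
    return int(parts[0].split("-")[0])       # e.g. "17-ea" -> 17

def is_supported(version: str) -> bool:
    return java_major(version) in SUPPORTED_MAJORS

print(is_supported("1.8.0_292"))  # True  (Java 8)
print(is_supported("17.0.1"))     # True
print(is_supported("19"))         # False (would need the override flag)
```

A version that fails this check can still be forced for experiments with the ``-Dsys.ai.h2o.debug.allowJavaVersions`` flag shown above, but is not guaranteed for production.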
- - -New Users ---------- - -If you're just getting started with H2O, here are some links to help you -learn more: - -- `Downloads page `_: First things first - download a copy of H2O here by selecting a build under "Download H2O" (the "Bleeding Edge" build contains the latest changes, while the latest alpha release is a more stable build), then use the installation instruction tabs to install H2O on your client of choice (standalone, R, Python, Hadoop, or Maven). - - For first-time users, we recommend downloading the latest alpha release and the default standalone option (the first tab) as the installation method. Make sure to install Java if it is not already installed. - -- **Tutorials**: To see a step-by-step example of our algorithms in action, select a model type from the following list: - - - `Deep Learning `_ - - `Gradient Boosting Machine (GBM) `_ - - `Generalized Linear Model (GLM) `_ - - `Kmeans `_ - - `Distributed Random Forest (DRF) `_ - -- :ref:`using-flow`: This section describes our new intuitive web interface, Flow. This interface is similar to IPython notebooks, and allows you to create a visual workflow to share with others. - -- `Launch from the command line `_: This document describes some of the additional options that you can configure when launching H2O (for example, to specify a different directory for saved Flow data, to allocate more memory, or to use a flatfile for quick configuration of a cluster). - -- :ref:`Data_Science`: This section describes the science behind our algorithms and provides a detailed, per-algo view of each model type. - -- `GitHub Help `_: The GitHub Help system is a useful resource for becoming familiar with Git. - -.. note:: +- **Languages**: R and Python are not required to use H2O (unless you want to use H2O in those environments), but Java is always required (see `Java requirements ,http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html#java-requirements>`__). - By default, this setup is open. 
Follow `security guidelines `__ if you want to secure your installation. - -Use Cases -~~~~~~~~~ - -H2O can handle a wide range of practical use cases due to its robust catalogue of supported algorithms, wrappers, and machine learning tools. Several example problems H2O can handle are: - -- determining outliers in housing price based on number of bedrooms, number of bathrooms, access to waterfront, etc. through `anomaly detection `__ -- revealing natural customer `segments `__ in retail data to determine which groups are purchasing which products -- linking multiple records to the same person with `probabilistic matching `__ -- upsampling the minority class for credit card fraud data to handle `imbalanced data `__ -- `detecting drift `__ on avocado sales pre-2018 and 2018+ to determine if a model is still relevant for new data - -To further explore the capabilities of H2O, check out some of our best practice `tutorials `__. - - -New User Quick Start -~~~~~~~~~~~~~~~~~~~~ - -New users can follow the steps below to quickly get up and running with H2O directly from the **h2o-3** repository. These steps guide you through cloning the repository, starting H2O, and importing a dataset. Once you're up and running, you'll be better able to follow examples included within this user guide. - -1. In a terminal window, create a folder for the H2O repository. The example below creates a folder called "repos" on the desktop. - - .. code-block:: bash - - user$ mkdir ~/Desktop/repos - -2. Change directories to that new folder, and then clone the repository. Notice that the prompt changes when you change directories. - - .. code-block:: bash - - user$ cd ~/Desktop/repos - repos user$ git clone https://github.com/h2oai/h2o-3.git - -3. After the repo is cloned, change directories to the **h2o** folder. - - .. code-block:: bash - - repos user$ cd h2o-3 - h2o-3 user$ - -4. Run the following command to retrieve sample datasets. 
These datasets are used throughout this User Guide and within the `Booklets `_. - - .. code-block:: bash - - h2o-3 user$ ./gradlew syncSmalldata - -At this point, determine whether you want to complete this quick start in either R or Python, and run the corresponding commands below from either the R or Python tab. - -.. tabs:: - .. code-tab:: r R - - # Download and install R: - # 1. Go to http://cran.r-project.org/mirrors.html. - # 2. Select your closest local mirror. - # 3. Select your operating system (Linux, OS X, or Windows). - # 4. Depending on your OS, download the appropriate file, along with any required packages. - # 5. When the download is complete, unzip the file and install. - - # Start R - h2o-3 user$ r - ... - Type 'demo()' for some demos, 'help()' for on-line help, or - 'help.start()' for an HTML browser interface to help. - Type 'q()' to quit R. - > - - # By default, this setup is open. - # Follow our security guidelines (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html) - # if you want to secure your installation. - - # Copy and paste the following commands in R to download dependency packages. - > pkgs <- c("methods", "statmod", "stats", "graphics", "RCurl", "jsonlite", "tools", "utils") - > for (pkg in pkgs) {if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }} - - # Run the following command to load the H2O: - > library(h2o) - - # Run the following command to initialize H2O on your local machine (single-node cluster) using all available CPUs. - > h2o.init() - - # Import the Iris (with headers) dataset. - > path <- "smalldata/iris/iris_wheader.csv" - > iris <- h2o.importFile(path) - - # View a summary of the imported dataset. 
- > print(iris) - - sepal_len sepal_wid petal_len petal_wid class - ----------- ----------- ----------- ----------- ----------- - 5.1 3.5 1.4 0.2 Iris-setosa - 4.9 3 1.4 0.2 Iris-setosa - 4.7 3.2 1.3 0.2 Iris-setosa - 4.6 3.1 1.5 0.2 Iris-setosa - 5 3.6 1.4 0.2 Iris-setosa - 5.4 3.9 1.7 0.4 Iris-setosa - 4.6 3.4 1.4 0.3 Iris-setosa - 5 3.4 1.5 0.2 Iris-setosa - 4.4 2.9 1.4 0.2 Iris-setosa - 4.9 3.1 1.5 0.1 Iris-setosa - [150 rows x 5 columns] - > - - .. code-tab:: python - - # By default, this setup is open. - # Follow our security guidelines (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html) - # if you want to secure your installation. - - # Before starting Python, run the following commands to install dependencies. - # Prepend these commands with `sudo` only if necessary: - # h2o-3 user$ [sudo] pip install -U requests - # h2o-3 user$ [sudo] pip install -U tabulate - - # Start python: - # h2o-3 user$ python - - # Run the following commands to import the H2O module: - >>> import h2o - - # Run the following command to initialize H2O on your local machine (single-node cluster): - >>> h2o.init() - - # If desired, run the GLM, GBM, or Deep Learning demo(s): - >>> h2o.demo("glm") - >>> h2o.demo("gbm") - >>> h2o.demo("deeplearning") - - # Import the Iris (with headers) dataset: - >>> path = "smalldata/iris/iris_wheader.csv" - >>> iris = h2o.import_file(path=path) - - # View a summary of the imported dataset: - >>> iris.summary - # sepal_len sepal_wid petal_len petal_wid class - # 5.1 3.5 1.4 0.2 Iris-setosa - # 4.9 3 1.4 0.2 Iris-setosa - # 4.7 3.2 1.3 0.2 Iris-setosa - # 4.6 3.1 1.5 0.2 Iris-setosa - # 5 3.6 1.4 0.2 Iris-setosa - # 5.4 3.9 1.7 0.4 Iris-setosa - # 4.6 3.4 1.4 0.3 Iris-setosa - # 5 3.4 1.5 0.2 Iris-setosa - # 4.4 2.9 1.4 0.2 Iris-setosa - # 4.9 3.1 1.5 0.1 Iris-setosa - # - # [150 rows x 5 columns] - # - - - -Experienced Users ------------------ - -If you've used previous versions of H2O, the following links will help guide you through the 
process of upgrading to H2O-3. - -- `Recent Changes `_: This document describes the most recent changes in the latest build of H2O. It lists new features, enhancements (including changed parameter default values), and bug fixes for each release, organized by sub-categories such as Python, R, and Web UI. - -- `API Related Changes `__: This section describes changes made in H2O-3 that can affect backwards compatibility. - -- `Contributing code `_: If you're interested in contributing code to H2O, we appreciate your assistance! This document describes how to access our list of Jiras that are suggested tasks for contributors and how to contact us. - -Flow Users ----------- - -H2O Flow is a notebook-style open-source user interface for H2O. It is a web-based interactive environment that allows you to combine code execution, text, mathematics, plots, and rich media in a single document, similar to iPython Notebooks. An entire section dedicated to starting and using the features available in Flow is available `later in this document `__. - -Sparkling Water Users ---------------------- - -Sparkling Water is a gradle project with the following submodules: - -- Core: Implementation of H2OContext, H2ORDD, and all technical - integration code -- Examples: Application, demos, examples -- ML: Implementation of MLlib pipelines for H2O algorithms -- Assembly: Creates "fatJar" composed of all other modules -- py: Implementation of (h2o) Python binding to Sparkling Water - -The best way to get started is to modify the core module or create a new module, which extends a project. - -Users of our Spark-compatible solution, Sparkling Water, should be aware that Sparkling Water is only supported with the latest version of H2O. For more information about Sparkling Water, refer to the following links. - -Sparkling Water is versioned according to the Spark versioning, so make sure to use the Sparkling Water version that corresponds to the installed version of Spark. 
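Because Sparkling Water tracks Spark's own versioning, picking the matching release line is mechanical: take the first two components of the installed Spark version. A minimal sketch of that pairing, using only the release lines documented here (the helper name is illustrative):

```python
# Sketch: derive the Sparkling Water release line from an installed Spark
# version, per the pairing rule above. The set below lists the lines
# documented in this guide (2.3 through 3.3).
SPARKLING_WATER_LINES = {"2.3", "2.4", "3.0", "3.1", "3.2", "3.3"}

def sparkling_line_for(spark_version: str) -> str:
    """Map a full Spark version ("3.3.2") to its Sparkling Water line ("3.3")."""
    line = ".".join(spark_version.split(".")[:2])
    if line not in SPARKLING_WATER_LINES:
        raise ValueError(f"No documented Sparkling Water line for Spark {spark_version}")
    return line

print(sparkling_line_for("3.3.2"))  # 3.3
print(sparkling_line_for("2.4.8"))  # 2.4
```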
- -Getting Started with Sparkling Water -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -- `Download Sparkling Water `_: Go here to download Sparkling Water. - -- Sparkling Water Documentation for: `3.3 `__, `3.2 `__, `3.1 `__, `3.0 `__, `2.4 `__, or `2.3 `__. Read this documentation first to get started with Sparkling Water. - -- Launch on Hadoop and Import from HDFS (`3.3 `__, `3.2 `__, `3.1 `__, `3.0 `__, `2.4 `__, or `2.3 `__): Go here to learn how to start Sparkling Water on Hadoop. - -- `Sparkling Water Tutorials `_: Go here for demos and examples. - - - `Sparkling Water K-means Tutorial `_: Go here to view a demo that uses Scala to create a K-means model. - - - `Sparkling Water GBM Tutorial `_: Go here to view a demo that uses Scala to create a GBM model. - - - `Sparkling Water on YARN `_: Follow these instructions to run Sparkling Water on a YARN cluster. - -- `Building Machine Learning Applications with Sparkling Water `_: This short tutorial describes project building and demonstrates the capabilities of Sparkling Water using Spark Shell to build a Deep Learning model. - -- Sparkling Water FAQ for `3.3 `__, `3.2 `__, `3.1 `__, `3.0 `__, `2.4 `__, or `2.3 `__. This FAQ provides answers to many common questions about Sparkling Water. - -- `Connecting RStudio to Sparkling Water `_: This illustrated tutorial describes how to use RStudio to connect to Sparkling Water. - -Sparkling Water Blog Posts -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -- `How Sparkling Water Brings H2O to Spark `_ - -- `H2O - The Killer App on Spark `_ - -- `In-memory Big Data: Spark + H2O `_ - -Sparkling Water Meetup Slide Decks -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -- `Sparkling Water Meetups `_ - -- `Interactive Session on Sparkling Water `_ - -- `Sparkling Water Hands-On `_ - -PySparkling -~~~~~~~~~~~~ - -PySparkling documentation is available for `3.3 `__, `3.2 `__, `3.1 `__, `3.0 `__, `2.4 `__, and `2.3 `__. - -**Note**: PySparkling requires Sparkling Water 2.3 or later. 
We recommend Sparkling Water 3.3. - -PySparkling can be installed by downloading and running the PySparkling shell or using ``pip``. PySparkling can also be installed from the PyPI repository. Follow the instructions on the `Download page `__ for Sparkling Water. - -RSparkling -~~~~~~~~~~ - -The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water package from H2O. This provides an interface to H2O’s high-performance, distributed machine learning algorithms on Spark using R. - -This package implements basic functionality (creating an H2OContext, showing the H2O Flow interface, and converting between Spark DataFrames and H2O Frames). The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms. - -The rsparkling package uses sparklyr for Spark job deployment and initialization of Sparkling Water. After that, users can use the regular H2O R package for modeling. - -Refer to the `Sparkling Water User Guide `__ for more information. - -Python Users -------------- - -Pythonistas will be glad to know that H2O now provides support for this popular programming language. Python users can also use H2O with IPython notebooks. For more information, refer to the following links. - -- Instructions for using H2O with Python are available in the `Downloading and Installing H2O `__ section and on the `H2O Download page `__. Select the version you want to install (latest stable release or nightly build), then click the **Install in Python** tab. - -- `Python docs <../h2o-py/docs/index.html>`_: This document represents the definitive guide to using Python with H2O. - -- `Grid Search in Python `_: This notebook demonstrates the use of grid search in Python. - -.. _anaconda: - -Anaconda Cloud Users -~~~~~~~~~~~~~~~~~~~~ - -You can run H2O in an Anaconda Cloud environment. Conda 2.7, 3.5, and 3.6 repos are supported as are a number of H2O versions. 
Refer to `https://anaconda.org/h2oai/h2o/files `__ to view a list of available H2O versions. Anaconda users can refer to the `Install on Anaconda Cloud `__ section for information about installing H2O in an Anaconda Cloud. - -R Users ------- - -Currently, the only version of R that is known to be incompatible with H2O is R version 3.1.0 (codename "Spring Dance"). If you are using that version, we recommend upgrading the R version before using H2O. - -To check which version of H2O is installed in R, use ``versions::installed.versions("h2o")``. - -- `R User HTML <../h2o-r/docs/index.html>`__ and `R User PDF <../h2o-r/h2o_package.pdf>`__ Documentation: This document contains all commands in the H2O package for R, including examples and arguments. It represents the definitive guide to using H2O in R. - -- `Connecting RStudio to Sparkling Water `_: This illustrated tutorial describes how to use RStudio to connect to Sparkling Water. - -- `RStudio Cheat Sheet `__: Download this PDF to keep as a quick reference when using H2O in R. - -**Note**: If you are running R on Linux, then you must install ``libcurl``, which allows H2O to communicate with R. We also recommend disabling SELinux and any firewalls, at least initially until you have confirmed H2O can initialize. - -- On Ubuntu, run: ``apt-get install libcurl4-openssl-dev`` -- On CentOS, run: ``yum install libcurl-devel`` - -API Users --------- - -API users will be happy to know that the APIs have been more thoroughly documented in the latest release of H2O and additional capabilities (such as exporting weights and biases for Deep Learning models) have been added. - -REST APIs are generated immediately out of the code, allowing users to implement machine learning in many ways. For example, REST APIs could be used to call a model created by sensor data and to set up auto-alerts if the sensor data falls below a specified threshold. 
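The sensor-threshold alerting pattern described above can be sketched without any H2O specifics. In this sketch, ``score`` is a hypothetical stand-in for a call to a deployed model's REST endpoint, not a real H2O API:

```python
# Sketch of the auto-alert idea: score incoming sensor readings and flag any
# whose score falls below a threshold. `score` is a hypothetical placeholder
# for a REST call to a deployed model; it is not an H2O function.
from typing import Callable, Iterable, List

def alerts(readings: Iterable[float],
           score: Callable[[float], float],
           threshold: float) -> List[float]:
    """Return the readings whose model score drops below `threshold`."""
    return [r for r in readings if score(r) < threshold]

# Stand-in model: identity scoring, so readings below 0.5 trigger alerts.
flagged = alerts([0.9, 0.4, 0.7, 0.1], score=lambda r: r, threshold=0.5)
print(flagged)  # [0.4, 0.1]
```

In a real deployment, ``score`` would issue the HTTP request to the model's scoring endpoint and the flagged readings would feed whatever notification channel you use.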
- -- `H2O 3 REST API Overview `_: This document describes how the REST API commands are used in H2O, versioning, experimental APIs, verbs, status codes, formats, schemas, payloads, metadata, and examples. - -- `REST API Reference `_: This document represents the definitive guide to the H2O REST API. - -- `REST API Schema Reference `_: This document represents the definitive guide to the H2O REST API schemas. - -Java Users -------------- - -Refer to H2O's `Java Requirements <#java-requirements>`__ for more information. For Java developers, the following resources will help you create your own custom app that uses H2O. - -- `H2O Core Java Developer Documentation <../h2o-core/javadoc/index.html>`_: The definitive Java API guide - for the core components of H2O. - -- `H2O Algos Java Developer Documentation <../h2o-algos/javadoc/index.html>`_: The definitive Java API guide - for the algorithms used by H2O. - -- `h2o-genmodel (POJO/MOJO) Javadoc <../h2o-genmodel/javadoc/index.html>`_: Provides a step-by-step guide to creating and implementing POJOs or MOJOs in a Java application. - -Developers ---------- - -If you're looking to use H2O to help you develop your own apps, the following links will provide helpful references. - -H2O's build is completely managed by Gradle. Any IDE with Gradle support is sufficient for H2O-3 development. The latest versions of IntelliJ IDEA have been thoroughly tested and are proven to work well. -Just open the folder with H2O-3 in IntelliJ IDEA, and it will automatically recognize that Gradle is required and will import the project. The Gradle wrapper present in the repository itself may be used manually/directly to build and test if required. - -For JUnit tests to pass, you may need multiple H2O nodes. Create a "Run/Debug" configuration with the following parameters: - -.. 
code-block:: bash - - Type: Application - Main class: H2OApp - Use class path of module: h2o-app - -After starting multiple "worker" node processes in addition to the JUnit test process, they will cloud up and run the multi-node JUnit tests. - -- `Developer Documentation `_: Detailed instructions on how to build and - launch H2O, including how to clone the repository, how to pull from the repository, and how to install required dependencies. - -- You can view instructions for using H2O with Maven on the `Download page `__. Select the version of H2O you want to install (latest stable release or nightly build), then click the **Use from Maven** tab. - -- `Maven install `_: This page provides information on how to build a version of H2O that generates the correct IDE files. - -- `H2O Community `__: Join our community support and outreach by accessing self-paced courses, scoping out meetups, and interacting with other users and our team. - -- `H2O Droplet Project Templates `_: This page provides template info for projects created in Java, Scala, or Sparkling Water. - -- `Hacking Algos `_: This blog post by Cliff walks you through building a new algorithm, using K-Means, Quantiles, and Grep as examples. - -- `KV Store Guide `_: Learn more about performance characteristics when implementing new algorithms. - -- `Contributing code `_: If you're interested in contributing code to H2O, we appreciate your assistance! - -.. _on-hadoop: - -Hadoop Users ------------- - -This section describes how to use H2O on Hadoop. 
- -Supported Versions -~~~~~~~~~~~~~~~~~~ - -- CDH 5.4 -- CDH 5.5 -- CDH 5.6 -- CDH 5.7 -- CDH 5.8 -- CDH 5.9 -- CDH 5.10 -- CDH 5.13 -- CDH 5.14 -- CDH 5.15 -- CDH 5.16 -- CDH 6.0 -- CDH 6.1 -- CDH 6.2 -- CDH 6.3 -- CDP 7.0 -- CDP 7.1 -- CDP 7.2 -- HDP 2.2 -- HDP 2.3 -- HDP 2.4 -- HDP 2.5 -- HDP 2.6 -- HDP 3.0 -- HDP 3.1 -- MapR 4.0 -- MapR 5.0 -- MapR 5.1 -- MapR 5.2 -- MapR 6.0 -- MapR 6.1 -- IOP 4.2 -- EMR 6.10 - -**Important Points to Remember**: - -- The command used to launch H2O differs from previous versions. (Refer to the `Walkthrough`_ section.) -- Launching H2O on Hadoop requires at least 6 GB of memory -- Each H2O node runs as a mapper -- Run only one mapper per host -- There are no combiners or reducers -- Each H2O cluster must have a unique job name -- ``-mapperXmx``, ``-nodes``, and ``-output`` are required -- Root permissions are not required - just unzip the H2O .zip file on any single node - - -Prerequisite: Open Communication Paths -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -H2O communicates using two communication paths. Verify these are open and available for use by H2O. - -**Path 1: mapper to driver** - -Optionally specify this port using the ``-driverport`` option in the ``hadoop jar`` command (see "Hadoop Launch Parameters" below). This port is opened on the driver host (the host where you entered the ``hadoop jar`` command). By default, this port is chosen randomly by the operating system. If you don't want to specify an exact port but you still want to restrict the port to a certain range of ports, you can use the option ``-driverportrange``. - -**Path 2: mapper to mapper** + - R version 3+ + - Python 3.6.x, 3.7.x, 3.8.x, 3.9.x, 3.10.x, 3.11.x -Optionally specify this port using the ``-baseport`` option in the ``hadoop jar`` command (refer to `Hadoop Launch Parameters`_ below. This port and the next subsequent port are opened on the mapper hosts (the Hadoop worker nodes) where the H2O mapper nodes are placed by the Resource Manager. 
By default, ports 54321 and 54322 are used. - -The mapper port is adaptive: if 54321 and 54322 are not available, H2O will try 54323 and 54324 and so on. The mapper port is designed to be adaptive because sometimes if the YARN cluster is low on resources, YARN will place two H2O mappers for the same H2O cluster request on the same physical host. For this reason, we recommend opening a range of more than two ports (20 ports should be sufficient). - ------------------------ - -.. _Walkthrough: - -Walkthrough -~~~~~~~~~~~ - -The following steps show you how to download or build H2O with Hadoop and the parameters involved in launching H2O from the command line. - -1. Download the latest H2O release for your version of Hadoop. Refer to the `H2O on Hadoop `__ tab of the download page for either the latest stable release or the nightly bleeding edge release. - -2. Prepare the job input on the Hadoop Node by unzipping the build file and changing to the directory with the Hadoop and H2O's driver jar files. - - .. code-block:: bash - - unzip h2o-{{project_version}}-*.zip - cd h2o-{{project_version}}-* - -3. To launch H2O nodes and form a cluster on the Hadoop cluster, run: - - .. code-block:: bash - - hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g - - The above command launches a 6g node of H2O. We recommend you launch the cluster with at least four times the memory of your data file size. - - - *mapperXmx* is the mapper size or the amount of memory allocated to each node. Specify at least 6 GB. - - - *nodes* is the number of nodes requested to form the cluster. - - - *output* is the name of the directory created each time a H2O cluster is created so it is necessary for the name to be unique each time it is launched. - -4. To monitor your job, direct your web browser to your standard job tracker Web UI. To access H2O's Web UI, direct your web browser to one of the launched instances. 
If you are unsure where your JVM is launched, review the output from your command after the nodes have clouded up and formed a cluster. Any of the nodes' IP addresses will work as there is no master node. - - .. code-block:: bash - - Determining driver host interface for mapper->driver callback... - [Possible callback IP address: 172.16.2.181] - [Possible callback IP address: 127.0.0.1] - ... - Waiting for H2O cluster to come up... - H2O node 172.16.2.184:54321 requested flatfile - Sending flatfiles to nodes... - [Sending flatfile to node 172.16.2.184:54321] - H2O node 172.16.2.184:54321 reports H2O cluster size 1 - H2O cluster (1 nodes) is up - Blocking until the H2O cluster shuts down... - -.. _Hadoop Launch Parameters: - -Hadoop Launch Parameters -~~~~~~~~~~~~~~~~~~~~~~~~ - -- ``-h | -help``: Display help -- ``-jobname <JobName>``: Specify a job name for the Jobtracker to use; the default is ``H2O_nnnnn`` (where n is chosen randomly) -- ``-principal <kerberos principal> -keytab <keytab path> | -run_as_user <hadoop username>``: Optionally specify a Kerberos principal and keytab or specify the ``run_as_user`` parameter to start clusters on behalf of the user/principal. Note that using ``run_as_user`` implies that the Hadoop cluster does not have Kerberos. -- ``-driverif <IP address of driver callback interface>``: Specify the IP address for callback messages from the mapper to the driver. -- ``-driverport <port of driver callback interface>``: Specify the port number for callback messages from the mapper to the driver. -- ``-driverportrange <port range of driver callback interface>``: Specify the allowed port range of the driver callback interface, e.g. 50000-55000. -- ``-network <IPv4network>[,<IPv4network2>]``: Specify the IPv4 network(s) to bind to the H2O nodes; multiple networks can be specified to force H2O to use the specified host in the Hadoop cluster. ``10.1.2.0/24`` allows 256 possibilities. -- ``-timeout <seconds>``: Specify the timeout duration (in seconds) to wait for the cluster to form before failing. 
**Note**: The default value is 120 seconds; if your cluster is very busy, this may not provide enough time for the nodes to launch. If H2O does not launch, try increasing this value (for example, ``-timeout 600``). -- ``-disown``: Exit the driver after the cluster forms. - - **Note**: For Qubole users who include the ``-disown`` flag, if your cluster is dying right after launch, add ``-Dmapred.jobclient.killjob.onexit=false`` as a launch parameter. - -- ``-notify <notification file name>``: Specify a file to write when the cluster is up. The file contains the IP and port of the embedded web server for one of the nodes in the cluster. All mappers must start before the H2O cluster is considered "up". -- ``-mapperXmx <per-mapper Java heap size>``: Specify the amount of memory to allocate to H2O (at least 6g). -- ``-extramempercent``: Specify the extra memory for internal JVM use outside of the Java heap. This is a percentage of ``mapperXmx``. **Recommendation**: Set this to a high value when running XGBoost, for example, 120. -- ``-n | -nodes <number of H2O nodes>``: Specify the number of nodes. -- ``-nthreads <maximum number of threads>``: Specify the maximum number of parallel threads of execution. This is usually capped by the max number of vcores. -- ``-baseport <initialization port>``: Specify the initialization port for the H2O nodes. The default is ``54321``. -- ``-license <license file path>``: Specify the local filesystem directory and the license file name. -- ``-o | -output <HDFS output directory>``: Specify the HDFS directory for the output. -- ``-flow_dir <directory for saved flows>``: Specify the directory for saved flows. By default, H2O will try to find the HDFS home directory to use as the directory for flows. If the HDFS home directory is not found, flows cannot be saved unless a directory is specified using ``-flow_dir``. -- ``-port_offset <offset>``: This parameter allows you to specify the relationship of the API port ("web port") and the internal communication port. The h2o port and API port are derived from each other, and we cannot fully decouple them. 
Instead, we allow you to specify an offset such that H2O port = API port + offset. This lets you move the communication port to a specific range that can be firewalled.
-- ``-proxy``: Enables Proxy mode.
-- ``-report_hostname``: This flag allows the user to specify the machine hostname instead of the IP address when launching H2O Flow. This option can only be used when H2O on Hadoop is started in Proxy mode (with ``-proxy``).
-
-**JVM arguments**
-
-  - ``-ea``: Enable assertions to verify boolean expressions for error detection.
-  - ``-verbose:gc``: Include heap and garbage collection information in the logs. Deprecated in Java 9, removed in Java 10.
-  - ``-XX:+PrintGCDetails``: Include a short message after each garbage collection. Deprecated in Java 9, removed in Java 10.
-  - ``-Xlog:gc=info``: Print garbage collection information in the logs. Introduced in Java 9 and enforced since Java 10; a replacement for the ``-verbose:gc`` and ``-XX:+PrintGCDetails`` tags, which are deprecated in Java 9 and removed in Java 10.
-
-Configuring HDFS
-~~~~~~~~~~~~~~~~
-
-When running H2O-3 on Hadoop, you do not need to worry about configuring HDFS. The ``-hdfs_config`` flag is used to configure access to HDFS from a standalone cluster. However, it is also used for anything that requires Hadoop (such as Hive).
-
-If you are accessing HDFS/Hive without Kerberos, you need to pass ``-hdfs_config`` and the path to the ``core-site.xml`` that you got from your Hadoop edge node. If you are accessing Kerberized Hadoop, you also need to pass ``hdfs-site.xml``.
-
-Accessing S3 Data from Hadoop
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-H2O launched on Hadoop can access S3 data in addition to HDFS. To enable access, follow the instructions below.
-
-Edit Hadoop's ``core-site.xml``, then set the ``HADOOP_CONF_DIR`` environment property to the directory containing the ``core-site.xml`` file. For an example ``core-site.xml`` file, refer to :ref:`Core-site.xml`.
Typically, the configuration directory for most Hadoop distributions is ``/etc/hadoop/conf``.
-
-You can also pass the S3 credentials when launching H2O with the Hadoop jar command. Use the ``-D`` flag to pass the credentials:
-
-.. code-block:: bash
-
-    hadoop jar h2odriver.jar -Dfs.s3.awsAccessKeyId="${AWS_ACCESS_KEY}" -Dfs.s3n.awsSecretAccessKey="${AWS_SECRET_KEY}" -n 3 -mapperXmx 10g -output outputDirectory
-
-where ``AWS_ACCESS_KEY`` represents your AWS access key ID and ``AWS_SECRET_KEY`` represents your AWS secret access key.
-
-Then import the data with the S3 URL path:
-
-- To import the data from the Flow API:
-
-  .. code-block:: bash
-
-      importFiles [ "s3://path/to/bucket/file/file.tab.gz" ]
-
-- To import the data from the R API:
-
-  .. code-block:: r
-
-      h2o.importFile(path = "s3://bucket/path/to/file.csv")
-
-- To import the data from the Python API:
-
-  .. code-block:: python
-
-      h2o.import_file(path = "s3://bucket/path/to/file.csv")
-
-YARN Best Practices
-~~~~~~~~~~~~~~~~~~~
-
-YARN (Yet Another Resource Negotiator) is a resource management framework. H2O can be launched as an application on YARN. If you want to run H2O on Hadoop, essentially, you are running H2O on YARN. If you are not currently using YARN to manage your cluster resources, we strongly recommend it.
-
-Using H2O with YARN
-'''''''''''''''''''
-
-When you launch H2O on Hadoop using the ``hadoop jar`` command, YARN allocates the necessary resources to launch the requested number of nodes. H2O launches as a MapReduce (V2) task, where each mapper is an H2O node of the specified size.
-
-``hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName``
-
-Occasionally, YARN may reject a job request. This usually occurs because either there is not enough memory to launch the job or because of an incorrect configuration.
-
-If YARN rejects the job request, try launching the job with less memory to see if that is the cause of the failure.
Specify smaller values for ``-mapperXmx`` (we recommend a minimum of ``2g``) and ``-nodes`` (start with ``1``) to confirm that H2O can launch successfully. - -To resolve configuration issues, adjust the maximum memory that YARN will allow when launching each mapper. If the cluster manager settings are configured for the default maximum memory size but the memory required for the request exceeds that amount, YARN will not launch and H2O will time out. If you are using the default configuration, change the configuration settings in your cluster manager to specify memory allocation when launching mapper tasks. To calculate the amount of memory required for a successful launch, use the following formula: - - YARN container size (``mapreduce.map.memory.mb``) = ``-mapperXmx`` value + (``-mapperXmx`` \* ``-extramempercent`` [default is 10%]) - -The ``mapreduce.map.memory.mb`` value must be less than the YARN memory configuration values for the launch to succeed. - -Configuring YARN -'''''''''''''''' - -**For Cloudera, configure the settings in Cloudera Manager. Depending on how the cluster is configured, you may need to change the settings for more than one role group.** - -1. Click **Configuration** and enter the following search term in quotes: **yarn.nodemanager.resource.memory-mb**. - -2. Enter the amount of memory (in GB) to allocate in the **Value** field. If more than one group is listed, change the values for all listed groups. - - .. figure:: images/TroubleshootingHadoopClouderayarnnodemgr.png - :alt: Cloudera Configuration - -3. Click the **Save Changes** button in the upper-right corner. - -4. Enter the following search term in quotes: **yarn.scheduler.maximum-allocation-mb** - -5. Change the value, click the **Save Changes** button in the upper-right corner, and redeploy. - - .. figure:: images/TroubleshootingHadoopClouderayarnscheduler.png - :alt: Cloudera Configuration - -**For Hortonworks,** -`configure `__ **the settings in Ambari.** - -1. 
Select **YARN**, then click the **Configs** tab. - -2. Select the group. - -3. In the **Node Manager** section, enter the amount of memory (in MB) to allocate in the **yarn.nodemanager.resource.memory-mb** entry field. - - .. figure:: images/TroubleshootingHadoopAmbariNodeMgr.png - :alt: Ambari Configuration - -4. In the **Scheduler** section, enter the amount of memory (in MB) to allocate in the **yarn.scheduler.maximum-allocation-mb** entry field. - - .. figure:: images/TroubleshootingHadoopAmbariyarnscheduler.png - :alt: Ambari Configuration - -5. Click the **Save** button at the bottom of the page and redeploy the cluster. - -**For MapR:** - -1. Edit the **yarn-site.xml** file for the node running the ResourceManager. - -2. Change the values for the ``yarn.nodemanager.resource.memory-mb`` and ``yarn.scheduler.maximum-allocation-mb`` properties. - -3. Restart the ResourceManager and redeploy the cluster. - -To verify the values were changed, check the values for the following properties: - -.. code-block:: bash - - - yarn.nodemanager.resource.memory-mb - - yarn.scheduler.maximum-allocation-mb - -Limiting CPU Usage -'''''''''''''''''' - -To limit the number of CPUs used by H2O, use the ``-nthreads`` option and specify the maximum number of CPUs for a single container to use. The following example limits the number of CPUs to four: - -``hadoop jar h2odriver.jar -nthreads 4 -nodes 1 -mapperXmx 6g -output hdfsOutputDirName`` - -**Note**: The default is 4\*the number of CPUs. You must specify at least four CPUs; otherwise, the following error message displays: ``ERROR: nthreads invalid (must be >= 4)`` - -Specifying Queues -''''''''''''''''' - -If you do not specify a queue when launching H2O, H2O jobs are submitted to the default queue. Jobs submitted to the default queue have a lower priority than jobs submitted to a specific queue. +- **Browser**: An internet browser is required to use H2O's web UI, Flow. 
+
+  - Google Chrome
+  - Firefox
+  - Safari
+  - Microsoft Edge
-To specify a queue with Hadoop, enter ``-Dmapreduce.job.queuename=<my-h2o-queue>`` (where ``<my-h2o-queue>`` is the name of the queue) when launching Hadoop.
+
+Java requirements
+~~~~~~~~~~~~~~~~~
-For example,
+
+H2O runs on Java. The 64-bit JDK is required to build H2O or run H2O tests. Only the 64-bit JRE is required to run the H2O binary using either the command line, R, or Python packages.
-.. code-block:: bash
+
+Java support
+''''''''''''
-    hadoop jar h2odriver.jar -Dmapreduce.job.queuename=<my-h2o-queue> -nodes <num-nodes> -mapperXmx 6g -output hdfsOutputDirName
+
+H2O supports the following versions of Java:
-Specifying Output Directories
-'''''''''''''''''''''''''''''
+
+- Java SE 17
+- Java SE 16
+- Java SE 15
+- Java SE 14
+- Java SE 13
+- Java SE 12
+- Java SE 11
+- Java SE 10
+- Java SE 9
+- Java SE 8
-To prevent overwriting multiple users' files, each job must have a unique output directory name. Change the ``-output hdfsOutputDir`` argument (where ``hdfsOutputDir`` is the name of the directory).
+
+`Download the latest supported version of Java `__.
-Alternatively, you can delete the directory (manually or by using a script) instead of creating a unique directory each time you launch H2O.
+
+Unsupported Java versions
+'''''''''''''''''''''''''
-Customizing YARN
-''''''''''''''''
+
+We recommend that only power users force an unsupported Java version. Unsupported Java versions can only be used for experiments. For production, we only guarantee the Java versions from the supported list.
-Most of the configurable YARN variables are stored in ``yarn-site.xml``. To prevent settings from being overridden, you can mark a config as "final." If you change any values in ``yarn-site.xml``, you must restart YARN to confirm the changes.
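The YARN container-size formula given above (container size = ``mapperXmx`` plus the ``-extramempercent`` share of ``mapperXmx``) can be sketched as a small Python helper. This is an illustrative calculation only; the function name is ours and is not part of H2O or Hadoop:

```python
def yarn_container_mb(mapper_xmx_mb, extramempercent=10):
    """Estimate the YARN container size (mapreduce.map.memory.mb) requested
    for each H2O mapper: mapperXmx plus the -extramempercent overhead
    (default 10%), both expressed in megabytes."""
    return mapper_xmx_mb + mapper_xmx_mb * extramempercent // 100

# A 6g (6144 MB) mapperXmx with the default 10% overhead:
print(yarn_container_mb(6144))        # -> 6758
# An XGBoost-style launch with -extramempercent 120:
print(yarn_container_mb(10240, 120))  # -> 22528
```

The printed value is what ``yarn.scheduler.maximum-allocation-mb`` (and the node manager's ``yarn.nodemanager.resource.memory-mb``) must be able to accommodate for the mappers to start.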
+
+How to force an unsupported Java version
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Accessing Logs
-''''''''''''''
+
+::
+
+	java -jar -Dsys.ai.h2o.debug.allowJavaVersions=19 h2o.jar
-Access logs for a YARN job with the ``yarn logs -applicationId <application id>`` command from a terminal. Note that this command must be run by the same user ID as the job owner, and only after the job has finished.
+
+Java support with H2O and Hadoop
+''''''''''''''''''''''''''''''''
-How H2O runs on YARN
-~~~~~~~~~~~~~~~~~~~~
+
+Java support is different between H2O and Hadoop. Hadoop only supports `Java 8 and Java 11 `__. Therefore, when running H2O on Hadoop, we recommend only running H2O on Java 8 or Java 11.
-Let's say that you have a Hadoop cluster with six worker nodes and six HDFS nodes.
-For architectural diagramming purposes, the worker nodes and HDFS nodes are shown as separate blocks in the block diagram,
-but they may actually be running on the same physical machines.
-The ``hadoop jar`` command that you run on the edge node talks to the YARN Resource Manager to launch an H2O MRv2 (MapReduce v2) job.
-The Resource Manager places the requested number of H2O nodes (aka MRv2 mappers, aka YARN containers) -- three in this example -- on worker nodes.
-See the picture below:
+
+Optional requirements
+~~~~~~~~~~~~~~~~~~~~~
-
- .. figure:: images/h2o-on-yarn-1.png
+
+This section outlines requirements for optional ways you can run H2O.
-
-Once the H2O job's nodes all start, they find each other and create an H2O cluster (as shown by the dark blue line encircling the three H2O nodes).
-The three H2O nodes work together to perform distributed Machine Learning functions as a group, as shown below.
+
+Optional Hadoop requirements
+''''''''''''''''''''''''''''
-
-Note how the three worker nodes that are not part of the H2O job have been removed from the picture below for explanatory purposes.
-They aren't part of the compute and memory resources used by the H2O job.
-The full complement of HDFS is still available, however:
+
+Hadoop is only required if you want to deploy H2O on a Hadoop cluster. Supported versions are listed on the `Downloads `__ page (when you select the Install on Hadoop tab) and include:
-
- .. figure:: images/h2o-on-yarn-2.png
+
+- Cloudera CDH 5.4+
+- Hortonworks HDP 2.2+
+- MapR 4.0+
+- IBM Open Platform 4.2
-
-Data is then read in from HDFS *once* (as shown by the red lines), and stored as distributed H2O Frames in H2O's in-memory column-compressed Distributed Key/Value (DKV) store. See the picture below:
+
+See the `Hadoop users `__ section for more details.
-
- .. figure:: images/h2o-on-yarn-3.png
+
+Optional Conda requirements
+'''''''''''''''''''''''''''
-
-Machine Learning algorithms can then run very fast in a parallel and distributed way (as shown by the light blue lines).
-They iteratively sweep over the data over and over again to build models, which is why the in-memory storage makes H2O fast.
+
+Conda is only required if you want to run H2O on the Anaconda cloud:
-
-Note how the HDFS nodes have been removed from the picture below for explanatory purposes, to emphasize that the data lives in memory during the model training process:
+
+- Conda 3.6+ repository
-
- .. figure:: images/h2o-on-yarn-4.png
+
+Optional Spark requirements
+'''''''''''''''''''''''''''
-
-Hadoop and AWS
-~~~~~~~~~~~~~~
+
+Spark is only required if you want to run Sparkling Water. Supported Spark versions:
-
-AWS access credential configuration is provided to H2O by the Hadoop environment itself. There are a number of Hadoop distributions, and each distribution supports different means/providers to configure access to AWS. It is considered best practice to follow your Hadoop provider's guide.
+
+- Spark 3.4
+- Spark 3.3
+- Spark 3.2
+- Spark 3.1
+- Spark 3.0
+- Spark 2.4
+- Spark 2.3
-
-Since Apache Hadoop 2.8, accessing multiple buckets with distinct credentials by means of the S3A protocol is possible.
Please refer to the `Hadoop documentation `__ for more information. Users of derived distributions are advised to follow the respective documentation of their distribution and the specific version they use. -Docker Users +User support ------------ -This section describes how to use H2O on Docker and walks you through the followings steps: - -- Installing Docker on Mac or Linux OS -- Creating and modifying the Dockerfile -- Building a Docker image from the Dockerfile -- Running the Docker build -- Launching H2O -- Accessing H2O from the web browser or from R/Python - -Prerequisites -~~~~~~~~~~~~~ - -- Linux kernel version 3.8+ or Mac OS X 10.6+ -- VirtualBox -- Latest version of Docker is installed and configured -- Docker daemon is running - enter all commands below in the Docker - daemon window -- Using ``User`` directory (not ``root``) - -**Notes**: - -- Older Linux kernel versions are known to cause kernel panics that break Docker. There are ways around it, but these should be attempted at your own risk. To check the version of your kernel, run ``uname -r`` at the command prompt. The walkthrough that follows has been tested on a Mac OS X 10.10.1. -- The Dockerfile always pulls the latest H2O release. -- The Docker image only needs to be built once. - -Walkthrough -~~~~~~~~~~~ - -**Step 1 - Install and Launch Docker** - -Depending on your OS, select the appropriate installation method: - -- `Mac - Installation `__. **Note**: By default, Docker allocates 2GB of memory for Mac installations. Be sure to increase this value. We normally suggest 3-4 times the size of the dataset for the amount of memory required. -- `Ubuntu - Installation `__ -- `Other OS Installations `__ - -**Step 2 - Create or Download Dockerfile** - -**Note**: If the following commands do not work, prepend them with ``sudo``. - -1. Create a folder on the Host OS to host your Dockerfile by running: - -.. 
todo:: figure out if branch_name is getting replaced with the actual branch_name or how to set that up - - .. code-block:: bash - - mkdir -p /data/h2o-{{branch_name}} - -2. Next, either download or create a Dockerfile, which is a build recipe that builds the container. - - Download and use our `Dockerfile template `__ by running: - - .. code-block:: bash - - cd /data/h2o-{{branch_name}} - wget https://raw.githubusercontent.com/h2oai/h2o-3/master/Dockerfile - - The Dockerfile: - - - obtains and updates the base image (Ubuntu 14.04) - - installs Java 8 - - obtains and downloads the H2O build from H2O's S3 repository - - exposes ports 54321 and 54322 in preparation for launching H2O on those ports - -**Step 3 - Build Docker image from Dockerfile** - -From the **/data/h2o-{{branch\_name}}** directory, run the following. Note below that ``v5`` represents the current version number. - - .. code-block:: bash - - docker build -t "h2o.ai/{{branch_name}}:v5" . - -Because it assembles all the necessary parts for the image, this process can take a few minutes. - -**Step 4 - Run Docker Build** - -On a Mac, use the argument ``-p 54321:54321`` to expressly map the port 54321. This is not necessary on Linux. Note below that ``v5`` represents the version number. - - .. code-block:: bash - - docker run -ti -p 54321:54321 h2o.ai/{{branch_name}}:v5 /bin/bash - -**Step 5 - Launch H2O** - -Navigate to the ``/opt`` directory and launch H2O. Change the value of ``-Xmx`` to the amount of memory you want to allocate to the H2O instance. By default, H2O launches on port 54321. - - .. code-block:: bash - - cd /opt - java -Xmx1g -jar h2o.jar - -**Step 6 - Access H2O from the web browser or R** - -- **On Linux**: After H2O launches, copy and paste the IP address and port of the H2O instance into the address bar of your browser. In the following example, the IP is ``172.17.0.5:54321``. - - .. 
code-block:: bash - - 03:58:25.963 main INFO WATER: Cloud of size 1 formed [/172.17.0.5:54321 (00:00:00.000)] - -- **On OSX**: Locate the IP address of the Docker's network (``192.168.59.103`` in the following examples) that bridges to your Host OS by opening a new Terminal window (not a bash for your container) and running ``boot2docker ip``. - - .. code-block:: bash - - $ boot2docker ip - 192.168.59.103 - -You can also view the IP address (``192.168.99.100`` in the example below) by scrolling to the top of the Docker daemon window: - -:: - - - ## . - ## ## ## == - ## ## ## ## ## === - /"""""""""""""""""\___/ === - ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ / ===- ~~~ - \______ o __/ - \ \ __/ - \____\_______/ - - - docker is configured to use the default machine with IP 192.168.99.100 - For help getting started, check out the docs at https://docs.docker.com - -After obtaining the IP address, point your browser to the specified ip address and port to open Flow. In R and Python, you can access the instance by installing the latest version of the H2O R or Python package and then initializing H2O: - - -.. tabs:: - .. code-tab:: r R - - # Initialize H2O - library(h2o) - dockerH2O <- h2o.init(ip = "192.168.59.103", port = 54321) - - .. code-tab:: python - - # Initialize H2O - import h2o - docker_h2o = h2o.init(ip = "192.168.59.103", port = 54321) - - -Kubernetes Integration ----------------------- - -H2O nodes must be treated as stateful by the Kubernetes environment because H2O is a stateful application. H2O nodes are, therefore, spawned together and deallocated together as a single unit. Subsequently, Kubernetes tooling for stateless applications is not applicable to H2O. In Kubernetes, a set of pods sharing a common state is named as a StatefulSet. - -H2O Pods deployed on Kubernetes cluster require a `headless service `__ for H2O Node discovery. 
The headless service, instead of load-balancing incoming requests to the underlying H2O pods, returns a set of addresses of all the underlying pods.
-
-.. figure:: images/h2o-k8s-clustering.png
-
-Requirements
-~~~~~~~~~~~~
-
-To spawn an H2O cluster inside of a Kubernetes cluster, the following are needed:
-
-- A Kubernetes cluster: either local development (e.g. `k3s `__) or easy start (e.g. `OpenShift `__ by RedHat).
-- A Docker image with H2O inside.
-- A Kubernetes deployment definition with a `StatefulSet `__ of H2O pods and a headless service.
-
-Creating the Docker Image
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-A simple Docker container with H2O running on startup is enough:
-
-.. code:: dockerfile
-
-    FROM ubuntu:latest
-    ARG H2O_VERSION
-    RUN apt-get update \
-        && apt-get install default-jdk unzip wget -y
-    RUN wget http://h2o-release.s3.amazonaws.com/h2o/rel-zahradnik/1/h2o-${H2O_VERSION}.zip \
-        && unzip h2o-${H2O_VERSION}.zip
-    ENV H2O_VERSION ${H2O_VERSION}
-    CMD java -jar h2o-${H2O_VERSION}/h2o.jar
-
-To build the Docker image, use ``docker build . -t {image-name} --build-arg H2O_VERSION=3.30.0.1``. Make sure to replace ``{image-name}`` with a meaningful H2O deployment name. **Note:** For the rest of this example, the Docker image will be named ``h2o-k8s``.
-
-Creating the Headless Service
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-First, a headless service must be created on Kubernetes:
-
-.. code:: yaml
-
-    apiVersion: v1
-    kind: Service
-    metadata:
-      name: h2o-service
-      namespace: default
-    spec:
-      type: ClusterIP
-      clusterIP: None
-      selector:
-        app: h2o-k8s
-      ports:
-      - protocol: TCP
-        port: 54321
-
-The ``clusterIP: None`` setting defines the service as headless.
-
-The ``port: 54321`` setting is the default H2O port. Users and client libraries use this port to talk to the H2O cluster.
-
-The ``app: h2o-k8s`` setting is of **great importance** because it is the name of the application with H2O pods inside.
While the name is arbitrarily chosen for this example, it **must** correspond to the chosen H2O deployment name.
-
-Creating the H2O Deployment
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-We strongly recommend running H2O as a `StatefulSet `__ on a Kubernetes cluster. Treating H2O nodes as stateful ensures that:
-
-- H2O nodes are treated as a single unit. They will be brought up and down gracefully and together.
-- No attempts will be made by a K8S healthcheck to restart individual H2O nodes in case of an error.
-- The cluster will be restarted as a whole (if required).
-- Persistent storages and volumes associated with the StatefulSet of H2O nodes will not be deleted once the cluster is brought down.
-
-.. code:: yaml
-
-    apiVersion: apps/v1
-    kind: StatefulSet
-    metadata:
-      name: h2o-stateful-set
-      namespace: default
-    spec:
-      serviceName: h2o-service
-      podManagementPolicy: "Parallel"
-      replicas: 3
-      selector:
-        matchLabels:
-          app: h2o-k8s
-      template:
-        metadata:
-          labels:
-            app: h2o-k8s
-        spec:
-          terminationGracePeriodSeconds: 10
-          containers:
-            - name: h2o-k8s
-              image: 'h2oai/h2o-open-source-k8s:latest'
-              resources:
-                requests:
-                  memory: "4Gi"
-              ports:
-                - containerPort: 54321
-                  protocol: TCP
-              env:
-              - name: H2O_KUBERNETES_SERVICE_DNS
-                value: h2o-service.default.svc.cluster.local
-              - name: H2O_NODE_LOOKUP_TIMEOUT
-                value: '180'
-              - name: H2O_NODE_EXPECTED_COUNT
-                value: '3'
-
-The environment variables used are described below:
-
-- ``H2O_KUBERNETES_SERVICE_DNS`` - **[MANDATORY]** Crucial for the clustering to work. The format usually follows the ``<service-name>.<namespace>.svc.cluster.local`` pattern. This setting enables H2O node discovery via DNS. It must be modified to match the name of the headless service created. Also, pay attention to the rest of the address: it must match the specifics of your Kubernetes implementation.
-- ``H2O_NODE_LOOKUP_TIMEOUT`` - **[OPTIONAL]** Node lookup constraint. Specify the time (in seconds) before the node lookup times out.
-- ``H2O_NODE_EXPECTED_COUNT`` - **[OPTIONAL]** Node lookup constraint. This is the expected number of H2O pods to be discovered. -- ``H2O_KUBERNETES_API_PORT`` - **[OPTIONAL]** Port for Kubernetes API checks to listen on. Defaults to 8080. - -If none of the optional lookup constraints are specified, a sensible default node lookup timeout will be set - currently -defaults to 3 minutes. If any of the lookup constraints are defined, the H2O node lookup is terminated on whichever -condition is met first. - -In the above example, ``'h2oai/h2o-open-source-k8s:latest'`` retrieves the latest build of the H2O Docker image. Replace ``latest`` with ``nightly`` to get the bleeding-edge Docker image with H2O inside. +H2O supports many different types of users. -The documentation for the official H2O Docker images is available at the official `H2O Docker Hub page `__. +.. toctree:: + :maxdepth: 1 -Exposing the H2O Cluster -~~~~~~~~~~~~~~~~~~~~~~~~ + getting-started/getting-started + getting-started/flow-users + getting-started/python-users + getting-started/r-users + getting-started/sparkling-users + getting-started/api-users + getting-started/java-users + getting-started/hadoop-users + getting-started/docker-users + getting-started/kubernetes-users + getting-started/experienced-users -Exposing the H2O cluster is the responsibility of the Kubernetes administrator. By default, an `Ingress `__ can be created. Different platforms offer different capabilities (e.g. OpenShift offers `Routes `__). -For more information on running an H2O cluster on a Kubernetes cluster, refer to this `link `__. 
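The ``H2O_KUBERNETES_SERVICE_DNS`` value described above normally follows a fixed pattern built from the headless service name and its namespace. A small Python sketch makes that relationship explicit (the helper name is ours, purely illustrative; it is not an H2O or Kubernetes API):

```python
def service_dns(service_name, namespace="default", cluster_domain="cluster.local"):
    """Build the DNS name of a Kubernetes headless service. H2O pods use this
    value (via H2O_KUBERNETES_SERVICE_DNS) to discover each other over DNS."""
    return f"{service_name}.{namespace}.svc.{cluster_domain}"

# Matches the value used in the StatefulSet example above:
print(service_dns("h2o-service"))  # -> h2o-service.default.svc.cluster.local
```

Note that the ``svc`` segment and the cluster domain suffix can differ between Kubernetes implementations, which is why the docs say to check the rest of the address against your cluster's specifics.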
From 3702bca91d4edcda1643689649ca1b777b6f261b Mon Sep 17 00:00:00 2001
From: Hannah Tillman
Date: Fri, 26 Apr 2024 11:57:41 -0500
Subject: [PATCH 02/27] ht/added getting started

---
 .../getting-started/getting-started.rst       | 194 ++++++++++++++++++
 1 file changed, 194 insertions(+)
 create mode 100644 h2o-docs/src/product/getting-started/getting-started.rst

diff --git a/h2o-docs/src/product/getting-started/getting-started.rst b/h2o-docs/src/product/getting-started/getting-started.rst
new file mode 100644
index 000000000000..9934c58819c7
--- /dev/null
+++ b/h2o-docs/src/product/getting-started/getting-started.rst
@@ -0,0 +1,194 @@
+Getting started
+===============
+
+Here are some helpful links to help you get started learning H2O.
+
+Downloads page
+--------------
+
+To begin, download a copy of H2O from the `Downloads page `__.
+
+1. Click H2O Open Source Platform or scroll down to the H2O section. Here you have access to the different ways to download H2O:
+
+- Latest stable: this version is the most recent stable release of H2O.
+- Nightly bleeding edge: this version contains all the latest changes to H2O that haven't been released officially yet.
+- Prior releases: this houses all previously released versions of H2O.
+
+For first-time users, we recommend downloading the latest stable release and the default standalone option (the DOWNLOAD AND RUN tab) as the installation method. Make sure to install Java if it is not already installed.
+
+.. note::
+	By default, this setup is open. Follow `security guidelines `__ if you want to secure your installation.
+
+Using Flow - H2O's web UI
+-------------------------
+
+`This section describes our web interface, Flow `__. Flow is similar to IPython notebooks and allows you to create a visual workflow to share with others.
+
+
+Tutorials of Flow
+~~~~~~~~~~~~~~~~~
+
+The following examples use H2O Flow.
To see a step-by-step example of one of our algorithms in action, select a model type from the following list:
+
+- `Deep Learning `__
+- `Distributed Random Forest (DRF) `__
+- `Generalized Linear Model (GLM) `__
+- `Gradient Boosting Machine (GBM) `__
+- `K-Means `__
+
+Launch from the command line
+----------------------------
+
+You can configure H2O when you launch it from the command line. For example, you can specify a different directory for saved Flow data, allocate more memory, or use a flatfile for a quick configuration of your cluster. See more details about `configuring the additional options when you launch H2O `__.
+
+
+Algorithms
+----------
+
+`This section describes the science behind our algorithms `__ and provides a detailed, per-algorithm view of each model type.
+
+Use cases
+---------
+
+H2O can handle a wide variety of practical use cases due to its robust catalogue of supported algorithms, wrappers, and machine learning tools. The following are some example problems H2O can handle:
+
+- Determining outliers in housing prices based on number of bedrooms, number of bathrooms, access to waterfront, etc. through `anomaly detection `__.
+- Revealing natural customer `segments `__ in retail data to determine which groups are purchasing which products.
+- Linking multiple records to the same person with `probabilistic matching `__.
+- Upsampling the minority class for credit card fraud data to handle `imbalanced data `__.
+- `Detecting drift `__ on avocado sales pre-2018 and 2018+ to determine if a model is still relevant for new data.
+
+See our `best practice tutorials `__ to further explore the capabilities of H2O.
+
+New user quickstart
+-------------------
+
+You can follow these steps to quickly get up and running with H2O directly from the `H2O-3 repository `__. These steps will guide you through cloning the repository, starting H2O, and importing a dataset.
Once you're up and running, you'll be better able to follow examples included within this user guide. + +1. In a terminal window, create a folder for the H2O repository: + +.. code-block:: bash + + user$ mkdir ~/Desktop/repos + +2. Change directories to that new folder, and then clone the repository. Notice that the prompt changes when you change directories: + +.. code-block:: bash + + user$ cd ~/Desktop/repos + repos user$ git clone https://github.com/h2oai/h2o-3.git + +3. After the repository is cloned, change directories to the ``h2o-3`` folder: + + .. code-block:: bash + + repos user$ cd h2o-3 + h2o-3 user$ + +4. Run the following command to retrieve sample datasets. These datasets are used throughout the user guide and within the `booklets `__. + +.. code-block:: bash + + h2o-3 user$ ./gradlew syncSmalldata + +At this point, choose whether you want to complete this quickstart in Python or R. Then, run the following corresponding commands from either the Python or R tab: + +.. tabs:: + .. code-tab:: python + + # By default, this setup is open. + # Follow our security guidelines (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html) + # if you want to secure your installation. + + # Before starting Python, run the following commands to install dependencies. 
+      # Prepend these commands with `sudo` only if necessary:
+      # h2o-3 user$ [sudo] pip install -U requests
+      # h2o-3 user$ [sudo] pip install -U tabulate
+
+      # Start python:
+      # h2o-3 user$ python
+
+      # Run the following command to import the H2O module:
+      >>> import h2o
+
+      # Run the following command to initialize H2O on your local machine (single-node cluster):
+      >>> h2o.init()
+
+      # If desired, run the GLM, GBM, or Deep Learning demo(s):
+      >>> h2o.demo("glm")
+      >>> h2o.demo("gbm")
+      >>> h2o.demo("deeplearning")
+
+      # Import the Iris (with headers) dataset:
+      >>> path = "smalldata/iris/iris_wheader.csv"
+      >>> iris = h2o.import_file(path=path)
+
+      # View a summary of the imported dataset:
+      >>> iris.summary()
+      # sepal_len    sepal_wid    petal_len    petal_wid    class
+      # 5.1          3.5          1.4          0.2          Iris-setosa
+      # 4.9          3            1.4          0.2          Iris-setosa
+      # 4.7          3.2          1.3          0.2          Iris-setosa
+      # 4.6          3.1          1.5          0.2          Iris-setosa
+      # 5            3.6          1.4          0.2          Iris-setosa
+      # 5.4          3.9          1.7          0.4          Iris-setosa
+      # 4.6          3.4          1.4          0.3          Iris-setosa
+      # 5            3.4          1.5          0.2          Iris-setosa
+      # 4.4          2.9          1.4          0.2          Iris-setosa
+      # 4.9          3.1          1.5          0.1          Iris-setosa
+      #
+      # [150 rows x 5 columns]
+      #
+
+   .. code-tab:: r R
+
+      # Download and install R:
+      # 1. Go to http://cran.r-project.org/mirrors.html.
+      # 2. Select your closest local mirror.
+      # 3. Select your operating system (Linux, OS X, or Windows).
+      # 4. Depending on your OS, download the appropriate file, along with any required packages.
+      # 5. When the download is complete, unzip the file and install.
+
+      # Start R:
+      h2o-3 user$ R
+      ...
+      Type 'demo()' for some demos, 'help()' for on-line help, or
+      'help.start()' for an HTML browser interface to help.
+      Type 'q()' to quit R.
+      >
+
+      # By default, this setup is open.
+      # Follow our security guidelines (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html)
+      # if you want to secure your installation.
+
+      # Copy and paste the following commands in R to download dependency packages.
+ > pkgs <- c("methods", "statmod", "stats", "graphics", "RCurl", "jsonlite", "tools", "utils") + > for (pkg in pkgs) {if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }} + + # Run the following command to load the H2O: + > library(h2o) + + # Run the following command to initialize H2O on your local machine (single-node cluster) using all available CPUs. + > h2o.init() + + # Import the Iris (with headers) dataset. + > path <- "smalldata/iris/iris_wheader.csv" + > iris <- h2o.importFile(path) + + # View a summary of the imported dataset. + > print(iris) + + sepal_len sepal_wid petal_len petal_wid class + ----------- ----------- ----------- ----------- ----------- + 5.1 3.5 1.4 0.2 Iris-setosa + 4.9 3 1.4 0.2 Iris-setosa + 4.7 3.2 1.3 0.2 Iris-setosa + 4.6 3.1 1.5 0.2 Iris-setosa + 5 3.6 1.4 0.2 Iris-setosa + 5.4 3.9 1.7 0.4 Iris-setosa + 4.6 3.4 1.4 0.3 Iris-setosa + 5 3.4 1.5 0.2 Iris-setosa + 4.4 2.9 1.4 0.2 Iris-setosa + 4.9 3.1 1.5 0.1 Iris-setosa + [150 rows x 5 columns] + > From fa5ce3423a7c08dcff548b44bef60bda4908a074 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 11:58:02 -0500 Subject: [PATCH 03/27] ht/added flow users --- h2o-docs/src/product/getting-started/flow-users.rst | 6 ++++++ 1 file changed, 6 insertions(+) create mode 100644 h2o-docs/src/product/getting-started/flow-users.rst diff --git a/h2o-docs/src/product/getting-started/flow-users.rst b/h2o-docs/src/product/getting-started/flow-users.rst new file mode 100644 index 000000000000..08a503c1df3e --- /dev/null +++ b/h2o-docs/src/product/getting-started/flow-users.rst @@ -0,0 +1,6 @@ +Flow users +========== + +H2O Flow is a notebook-style open source UI for H2O. It's a web-based interactive environment that lets you combine code execution, text, mathematics, plots, and rich media in a single document (similar to iPython Notebooks). + +See more about `H2O Flow `__. 
\ No newline at end of file From fb572ec23fe645c944a1f7b360a0321309038c00 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 11:58:17 -0500 Subject: [PATCH 04/27] ht/added python users --- .../product/getting-started/python-users.rst | 35 +++++++++++++++++++ 1 file changed, 35 insertions(+) create mode 100644 h2o-docs/src/product/getting-started/python-users.rst diff --git a/h2o-docs/src/product/getting-started/python-users.rst b/h2o-docs/src/product/getting-started/python-users.rst new file mode 100644 index 000000000000..50c3e7cc9a34 --- /dev/null +++ b/h2o-docs/src/product/getting-started/python-users.rst @@ -0,0 +1,35 @@ +Python users +============ + +Pythonistas can rest easy knowing that H2O provides support for this popular programming language. You can also use H2O with IPython notebooks. + +Getting started with Python +--------------------------- + +The following sections will help you begin using Python for H2O. + +Installing H2O with Python +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can find instructions for using H2O with Python in the `Downloading and installing H2O `__ section and on the `Downloads page `__. + +From the Downloads page: + +1. Select the version of H2O you want. +2. Click the Install in Python tab. +3. Follow the on-page instructions. + +Python documentation +~~~~~~~~~~~~~~~~~~~~ + +See our `Python-specific documentation `__. + +Grid search in Python +~~~~~~~~~~~~~~~~~~~~~ + +See a notebook demonstration for how to use grid search in Python. + +Anaconda Cloud users +-------------------- + +You can run H2O in an Anaconda Cloud environment. Conda 2.7, 3.5, and 3.6 repositories are supported (as are a number of H2O versions). See Anaconda's `official H2O package `__ to view a list of all available H2O versions. You can refer to the `Install on Anaconda Cloud `__ section for information about installing H2O in an Anaconda Cloud. 
\ No newline at end of file

From 263d3511cf7cb2374c7df0a922701177a84ae20b Mon Sep 17 00:00:00 2001
From: Hannah Tillman
Date: Fri, 26 Apr 2024 11:58:34 -0500
Subject: [PATCH 05/27] ht/added r users

---
 .../src/product/getting-started/r-users.rst   | 53 +++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 h2o-docs/src/product/getting-started/r-users.rst

diff --git a/h2o-docs/src/product/getting-started/r-users.rst b/h2o-docs/src/product/getting-started/r-users.rst
new file mode 100644
index 000000000000..6a30a2d09d11
--- /dev/null
+++ b/h2o-docs/src/product/getting-started/r-users.rst
@@ -0,0 +1,53 @@
+R users
+=======
+
+R users rejoice: H2O supports your chosen programming language!
+
+
+Getting started with R
+----------------------
+
+The following sections will help you begin using R for H2O.
+
+See `this cheatsheet on H2O in R `__ for a quick start.
+
+.. note::
+
+   If you are running R on Linux, you must install ``libcurl``, which allows H2O to communicate with R. We also recommend disabling SELinux and any firewalls (at least initially, until you've confirmed H2O can initialize).
+
+   - On Ubuntu, run: ``apt-get install libcurl4-openssl-dev``
+   - On CentOS, run: ``yum install libcurl-devel``
+
+Installing H2O with R
+~~~~~~~~~~~~~~~~~~~~~
+
+You can find instructions for using H2O with R in the `Downloading and installing H2O `__ section and on the `Downloads page `__.
+
+From the Downloads page:
+
+1. Select the version of H2O you want.
+2. Click the Install in R tab.
+3. Follow the on-page instructions.
+
+Checking your R version for H2O
+'''''''''''''''''''''''''''''''
+
+To check which version of H2O is installed in R, run the following:
+
+::
+
+   versions::installed.versions("h2o")
+
+.. note::
+
+   R version 3.1.0 ("Spring Dance") is incompatible with H2O. If you are using that version, we recommend upgrading your R version before using H2O.
+
+
+R documentation
+~~~~~~~~~~~~~~~
+
+See our `R-specific documentation `__.
This documentation also exists as a PDF: `R user PDF `__.
+
+Connecting RStudio to Sparkling Water
+-------------------------------------
+
+See our `illustrated tutorial on how to use RStudio to connect to Sparkling Water `__.
\ No newline at end of file

From da953e5aa3553f9d3d3584fe81c0e94991a545bd Mon Sep 17 00:00:00 2001
From: Hannah Tillman
Date: Fri, 26 Apr 2024 11:58:51 -0500
Subject: [PATCH 06/27] ht/added sparkling users

---
 .../getting-started/sparkling-users.rst       | 120 ++++++++++++++++++
 1 file changed, 120 insertions(+)
 create mode 100644 h2o-docs/src/product/getting-started/sparkling-users.rst

diff --git a/h2o-docs/src/product/getting-started/sparkling-users.rst b/h2o-docs/src/product/getting-started/sparkling-users.rst
new file mode 100644
index 000000000000..40258b16b66d
--- /dev/null
+++ b/h2o-docs/src/product/getting-started/sparkling-users.rst
@@ -0,0 +1,120 @@
+Sparkling Water users
+=====================
+
+Sparkling Water is a Gradle project with the following submodules:
+
+- **Core**: Implementation of H2OContext, H2ORDD, and all technical integration code.
+- **Examples**: Applications, demos, and examples.
+- **ML**: Implementation of `MLlib `__ pipelines for H2O algorithms.
+- **Assembly**: Creates the "fatJar" (composed of all other modules).
+- **py**: Implementation of the (H2O) Python binding to Sparkling Water.
+
+The best way to get started is to modify the core module or create a new module (which extends the project).
+
+.. note::
+
+   Sparkling Water is only supported with the latest version of H2O.
+
+   Sparkling Water is versioned according to the Spark versioning, so make sure to use the Sparkling Water version that corresponds to your installed version of Spark.
+
+Getting started with Sparkling Water
+------------------------------------
+
+This section contains links that will help you get started using Sparkling Water.
+
+Download Sparkling Water
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Navigate to the `Downloads page `__.
+2. 
Click Sparkling Water or scroll down to the Sparkling Water section. +3. Select the version of Spark you have to download the corresponding version of Sparkling Water. + +Sparkling Water documentation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The documentation for Sparkling Water is separate from the H2O user guide. Read this documentation to get started with Sparkling Water. + +- `Sparkling Water for Spark 3.5 `__. + +- `Sparkling Water K-Means tutorial `__: This tutorial uses Scala to create a K-Means model. +- `Sparkling Water GBM tutorial `__: This tutorial uses Scala to create a GBM model. +- `Sparkling Water on YARN `__: This tutorial walks you through how to run Sparkling Water on a YARN cluster. +- `Building machine learning applications with Sparkling Water `__: This tutorial describes project building and demonstrates the capabilities of Sparkling Water using Spark Shell to build a Deep Learning model. +- `Connecting RStudio to Sparkling Water `__: This illustrated tutorial describes how to use RStudio to connect to Sparkling Water. + +Sparkling Water FAQ +~~~~~~~~~~~~~~~~~~~ + +The frequently asked questions provide answers to many common questions about Sparkling Water. + +- Sparkling Water FAQ for 3.5 `__ +- Sparkling Water FAQ for 3.4 `__ +- Sparkling Water FAQ for 3.3 `__ +- Sparkling Water FAQ for 3.2 `__ +- Sparkling Water FAQ for 3.1 `__ +- `Sparkling Water FAQ for 3.0 `__ +- `Sparkling Water FAQ for 2.4 `__ +- `Sparkling Water FAQ for 2.3 `__ + +Sparkling Water blog posts +-------------------------- + +- `How Sparkling Water Brings H2O to Spark `_ +- `H2O - The Killer App on Spark `_ +- `In-memory Big Data: Spark + H2O `_ + +PySparkling +----------- + +PySparkling can be installed by downloading and running the PySparkling shell or by using ``pip``. PySparkling can also be installed from the `PyPI `__ repository. Follow the instructions for how to install PySparkling on the `Download page `__ for Sparkling Water. 
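Because Sparkling Water (and therefore PySparkling) is versioned against Spark, the pip package you install depends on your Spark version. A minimal sketch of that mapping, assuming the ``h2o_pysparkling_<major>.<minor>`` package-naming scheme used on PyPI:

```python
def pysparkling_package(spark_version):
    """Map a full Spark version string to the matching PySparkling pip package.

    Assumes the ``h2o_pysparkling_<major>.<minor>`` naming scheme used on PyPI.
    """
    major, minor = spark_version.split(".")[:2]
    return "h2o_pysparkling_{}.{}".format(major, minor)

# For example, a Spark 3.5.1 installation would pair with:
print(pysparkling_package("3.5.1"))  # h2o_pysparkling_3.5
```

In other words, you would run ``pip install`` with the package name returned for your Spark version; check the Download page for the exact instructions.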
+ +PySparkling documentation +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Documentation for PySparkling is available for the following versions: + +- `PySparkling 3.5 `__ +- `PySparkling 3.4 `__ +- `PySparkling 3.3 `__ +- `PySparkling 3.2 `__ +- `PySparkling 3.1 `__ +- `PySparkling 3.0 `__ +- `PySparkling 2.4 `__ +- `PySparkling 2.3 `__ + +RSparkling +---------- + +The RSparkling R package is an extension package for `sparklyr `__ that creates an R front-end for the Sparkling Water package from H2O. This provides an interface to H2O's high performance, distributed machine learning algorithms on Spark using R. + +This package implements basic functionality by creating an H2OContext, showing the H2O Flow interface, and converting between Spark DataFrames. The main purpose of this package is to provide a connector between sparklyr and H2O's machine learning algorithms. + +The RSparkling package uses sparklyr for Spark job deployment and initialization of Sparkling Water. After that, you can use the regular H2O R package for modeling. 
+ +RSparkling documentation +~~~~~~~~~~~~~~~~~~~~~~~~ + +Documentation for RSparkling is available for the following versions: + +- `RSparkling 3.5 `__ +- `RSparkling 3.4 `__ +- `RSparkling 3.3 `__ +- `RSparkling 3.2 `__ +- `RSparkling 3.1 `__ +- `RSparkling 3.0 `__ +- `RSparkling 2.4 `__ +- `RSparkling 2.3 `__ + + From 78c1761bca6210beba11604485583343e3bceeeb Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 11:59:07 -0500 Subject: [PATCH 07/27] ht/added api users --- .../src/product/getting-started/api-users.rst | 20 +++++++++++++++++++ 1 file changed, 20 insertions(+) create mode 100644 h2o-docs/src/product/getting-started/api-users.rst diff --git a/h2o-docs/src/product/getting-started/api-users.rst b/h2o-docs/src/product/getting-started/api-users.rst new file mode 100644 index 000000000000..99aa1f1cb2a0 --- /dev/null +++ b/h2o-docs/src/product/getting-started/api-users.rst @@ -0,0 +1,20 @@ +API users +========= + +Our REST APIs are generated immediately out of the code, allowing you to implement machine learning in many ways. For example, REST APIs can be used to call a model created by sensor data and to set up auto-alerts if the sensor data falls below a specified threshold. + +REST API references +------------------- + +See the definitive `guide to H2O's REST API `__. + +Schemas +~~~~~~~ + +See the definitive `guide to H2O's REST API schemas `__. + + +REST API example +~~~~~~~~~~~~~~~~ + +See an `in-depth explanation of how H2O REST API commands are used `__. This explanation includes versioning, experimental APIs, verbs, status codes, formats, schemas, payloads, metadata, and examples. 
\ No newline at end of file

From 82271ec71d11ac444e621bb33ac6fd257dbe8441 Mon Sep 17 00:00:00 2001
From: Hannah Tillman
Date: Fri, 26 Apr 2024 11:59:26 -0500
Subject: [PATCH 08/27] ht/added java users

---
 .../product/getting-started/java-users.rst    | 22 +++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 h2o-docs/src/product/getting-started/java-users.rst

diff --git a/h2o-docs/src/product/getting-started/java-users.rst b/h2o-docs/src/product/getting-started/java-users.rst
new file mode 100644
index 000000000000..d555c2595dac
--- /dev/null
+++ b/h2o-docs/src/product/getting-started/java-users.rst
@@ -0,0 +1,22 @@
+Java users
+==========
+
+The following resources will help you create your own custom app that uses H2O. See `H2O's Java requirements `__ for more information.
+
+Java developer documentation
+----------------------------
+
+Core components
+~~~~~~~~~~~~~~~
+
+The definitive `Java API guide for the core components of H2O `__.
+
+Algorithms
+~~~~~~~~~~
+
+The definitive `Java API guide for the algorithms used by H2O `__.
+
+Example
+-------
+
+`This Javadoc provides a step-by-step guide to creating and implementing POJOs or MOJOs `__ in a Java application.
\ No newline at end of file

From 9a30a7061cf2852350f4e2c66109ad23d7d8f591 Mon Sep 17 00:00:00 2001
From: Hannah Tillman
Date: Fri, 26 Apr 2024 11:59:56 -0500
Subject: [PATCH 09/27] ht/added hadoop users

---
 .../product/getting-started/hadoop-users.rst  | 383 ++++++++++++++++++
 1 file changed, 383 insertions(+)
 create mode 100644 h2o-docs/src/product/getting-started/hadoop-users.rst

diff --git a/h2o-docs/src/product/getting-started/hadoop-users.rst b/h2o-docs/src/product/getting-started/hadoop-users.rst
new file mode 100644
index 000000000000..51b80b268b51
--- /dev/null
+++ b/h2o-docs/src/product/getting-started/hadoop-users.rst
@@ -0,0 +1,383 @@
+Hadoop users
+============
+
+This section describes how to use H2O on Hadoop.
+
+Supported Versions
+------------------
+
+- CDH 5.4
+- CDH 5.5
+- CDH 5.6
+- CDH 5.7
+- CDH 5.8
+- CDH 5.9
+- CDH 5.10
+- CDH 5.13
+- CDH 5.14
+- CDH 5.15
+- CDH 5.16
+- CDH 6.0
+- CDH 6.1
+- CDH 6.2
+- CDH 6.3
+- CDP 7.0
+- CDP 7.1
+- CDP 7.2
+- HDP 2.2
+- HDP 2.3
+- HDP 2.4
+- HDP 2.5
+- HDP 2.6
+- HDP 3.0
+- HDP 3.1
+- MapR 4.0
+- MapR 5.0
+- MapR 5.1
+- MapR 5.2
+- MapR 6.0
+- MapR 6.1
+- IOP 4.2
+- EMR 6.10
+
+.. note::
+
+   Important points to remember:
+
+   - The command used to launch H2O differs from previous versions (see the `Walkthrough `__ section).
+   - Launching H2O on Hadoop requires at least 6GB of memory.
+   - Each H2O node runs as a mapper (run only one mapper per host).
+   - There are no combiners or reducers.
+   - Each H2O cluster needs a unique job name.
+   - ``-mapperXmx``, ``-nodes``, and ``-output`` are required.
+   - Root permissions are not required (just unzip the H2O ZIP file on any single node).
+
+Prerequisite: Open communication paths
+--------------------------------------
+
+H2O communicates using two communication paths. Verify these paths are open and available for use by H2O.
+
+Path 1: Mapper to driver
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Optionally specify this port using the ``-driverport`` option in the ``hadoop jar`` command (see `Hadoop launch parameters `__). This port is opened on the driver host (the host where you entered the ``hadoop jar`` command). By default, this port is chosen randomly by the operating system. If you don't want to specify an exact port but still want to restrict the port to a certain range of ports, you can use the option ``-driverportrange``.
+
+Path 2: Mapper to mapper
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Optionally specify this port using the ``-baseport`` option in the ``hadoop jar`` command (see `Hadoop launch parameters `__). This port and the next subsequent port are opened on the mapper hosts (i.e. the Hadoop worker nodes) where the H2O mapper nodes are placed by the Resource Manager. 
By default, ports ``54321`` and ``54322`` are used.
+
+The mapper port is adaptive: if ``54321`` and ``54322`` are not available, H2O will try ``54323`` and ``54324`` and so on. The mapper port is designed to be adaptive because sometimes, if the YARN cluster is low on resources, YARN will place two H2O mappers for the same H2O cluster request on the same physical host. For this reason, we recommend opening a range of more than two ports: 20 ports should be sufficient.
+
+Walkthrough
+-----------
+
+The following steps show you how to download or build H2O with Hadoop and the parameters involved in launching H2O from the command line.
+
+1. Download the latest H2O release for your version of Hadoop from the `Downloads page `__. Refer to the H2O on Hadoop tab of the H2O download page for the latest stable release or the nightly bleeding edge release.
+2. Prepare the job input on the Hadoop node by unzipping the build file and changing to the directory with the Hadoop and H2O's driver jar files:
+
+   ::
+
+      unzip h2o-{{project_version}}-*.zip
+      cd h2o-{{project_version}}-*
+
+3. Launch H2O nodes and form a cluster on the Hadoop cluster by running:
+
+   ::
+
+      hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
+
+   The above command launches a 6g node of H2O. We recommend you launch the cluster with at least four times the memory of your data file size.
+
+   - *mapperXmx* is the mapper size or the amount of memory allocated to each node. Specify at least 6 GB.
+
+   - *nodes* is the number of nodes requested to form the cluster.
+
+   - *output* is the name of the directory created each time an H2O cluster is created, so the name must be unique each time it is launched.
+
+4. Monitor your job by directing your web browser to your standard job tracker web UI. To access H2O's web UI, direct your web browser to one of the launched instances. If you are unsure where your JVM is launched, review the output from your command after the nodes have clouded and formed a cluster. 
Any node's IP address will work, as there is no master node:
+
+   ::
+
+      Determining driver host interface for mapper->driver callback...
+      [Possible callback IP address: 172.16.2.181]
+      [Possible callback IP address: 127.0.0.1]
+      ...
+      Waiting for H2O cluster to come up...
+      H2O node 172.16.2.184:54321 requested flatfile
+      Sending flatfiles to nodes...
+      [Sending flatfile to node 172.16.2.184:54321]
+      H2O node 172.16.2.184:54321 reports H2O cluster size 1
+      H2O cluster (1 nodes) is up
+      Blocking until the H2O cluster shuts down...
+
+Hadoop launch parameters
+------------------------
+
+- ``-h | -help``: Display help.
+- ``-jobname ``: Specify a job name for the Jobtracker to use; the default is ``H2O_nnnnn`` (where n is chosen randomly).
+- ``-principal -keytab | -run_as_user ``: Optionally specify a Kerberos principal and keytab or specify the ``run_as_user`` parameter to start clusters on behalf of the user/principal. Note that using ``run_as_user`` implies that the Hadoop cluster does not have Kerberos.
+- ``-driverif <ip address of driver callback interface>``: Specify the IP address for callback messages from the mapper to the driver.
+- ``-driverport <port of driver callback interface>``: Specify the port number for callback messages from the mapper to the driver.
+- ``-driverportrange <port range of driver callback interface>``: Specify the allowed port range of the driver callback interface, e.g. 50000-55000.
+- ``-network [,]``: Specify the IPv4 network(s) to bind to the H2O nodes; multiple networks can be specified to force H2O to use the specified host in the Hadoop cluster. ``10.1.2.0/24`` allows 256 possibilities.
+- ``-timeout ``: Specify the timeout duration (in seconds) to wait for the cluster to form before failing.
+
+  **Note**: The default value is 120 seconds; if your cluster is very busy, this may not provide enough time for the nodes to launch. If H2O does not launch, try increasing this value (for example, ``-timeout 600``).
+
+- ``-disown``: Exit the driver after the cluster forms.
+
+  **Note**: For Qubole users who include the ``-disown`` flag, if your cluster is dying right after launch, add ``-Dmapred.jobclient.killjob.onexit=false`` as a launch parameter.
+
+- ``-notify ``: Specify a file to write when the cluster is up. The file contains the IP and port of the embedded web server for one of the nodes in the cluster. All mappers must start before the H2O cluster is considered "up".
+- ``-mapperXmx ``: Specify the amount of memory to allocate to H2O (at least 6g).
+- ``-extramempercent``: Specify the extra memory for internal JVM use outside of the Java heap. This is a percentage of ``mapperXmx``.
+
+  **Recommendation**: Set this to a high value when running XGBoost (for example, 120).
+
+- ``-n | -nodes ``: Specify the number of nodes.
+- ``-nthreads ``: Specify the maximum number of parallel threads of execution. This is usually capped by the max number of vcores.
+- ``-baseport ``: Specify the initialization port for the H2O nodes. The default is ``54321``.
+- ``-license ``: Specify the local filesystem directory and the license file name.
+- ``-o | -output ``: Specify the HDFS directory for the output.
+- ``-flow_dir ``: Specify the directory for saved flows. By default, H2O will try to find the HDFS home directory to use as the directory for flows. If the HDFS home directory is not found, flows cannot be saved unless a directory is specified using ``-flow_dir``.
+- ``-port_offset ``: This parameter allows you to specify the relationship of the API port ("web port") and the internal communication port. The H2O port and API port are derived from each other, and we cannot fully decouple them. Instead, we allow you to specify an offset such that H2O port = API port + offset. This allows you to move the communication port to a specific range that can be firewalled.
+- ``-proxy``: Enables Proxy mode.
+- ``-report_hostname``: This flag allows the user to specify the machine hostname instead of the IP address when launching H2O Flow. This option can only be used when H2O on Hadoop is started in Proxy mode (with ``-proxy``).
+
+JVM arguments
+~~~~~~~~~~~~~
+
+- ``-ea``: Enable assertions to verify boolean expressions for error detection.
+- ``-verbose:gc``: Include heap and garbage collection information in the logs. Deprecated in Java 9, removed in Java 10.
+- ``-XX:+PrintGCDetails``: Include a short message after each garbage collection. Deprecated in Java 9, removed in Java 10.
+- ``-Xlog:gc=info``: Prints garbage collection information into the logs. Introduced in Java 9. Usage enforced since Java 10. A replacement for the ``-verbose:gc`` and ``-XX:+PrintGCDetails`` tags, which are deprecated in Java 9 and removed in Java 10.
+
+Configure HDFS
+--------------
+
+When running H2O on Hadoop, you do not need to worry about configuring HDFS. The ``-hdfs_config`` flag is used to configure access to HDFS from a standalone cluster. However, it's also used for anything that requires Hadoop (such as Hive).
+
+If you are accessing HDFS/Hive without Kerberos, then you will need to pass ``-hdfs_config`` and the path to the ``core-site.xml`` that you got from your Hadoop edge node. If you are accessing Kerberized Hadoop, you will also need to pass ``hdfs-site.xml``.
+
+Access S3 data from Hadoop
+--------------------------
+
+H2O launched on Hadoop can access S3 data in addition to HDFS. To enable access, follow these instructions:
+
+1. Edit Hadoop's ``core-site.xml``.
+2. Set the ``HADOOP_CONF_DIR`` environment property to the directory containing the ``core-site.xml``. See the `core-site.xml example `__ for more information.
+
+.. note::
+
+   Typically the configuration directory for most Hadoop distributions is ``/etc/hadoop/conf``.
+
+You can also pass the S3 credentials when launching H2O with the Hadoop jar command. Use the ``-D`` flag to pass the credentials:
+
+..
code-block:: bash
+
+   hadoop jar h2odriver.jar -Dfs.s3.awsAccessKeyId="${AWS_ACCESS_KEY}" -Dfs.s3n.awsSecretAccessKey="${AWS_SECRET_KEY}" -n 3 -mapperXmx 10g -output outputDirectory
+
+where:
+
+- ``AWS_ACCESS_KEY`` represents your username.
+- ``AWS_SECRET_KEY`` represents your password.
+
+3. Import the data with the S3 URL path:
+
+.. tabs::
+
+   .. code-tab:: r R
+
+      h2o.importFile(path = "s3://bucket/path/to/file.csv")
+
+   .. code-tab:: python
+
+      h2o.import_file(path = "s3://bucket/path/to/file.csv")
+
+   .. code-tab:: Flow
+
+      importFiles [ "s3:/path/to/bucket/file/file.tab.gz" ]
+
+YARN best practices
+-------------------
+
+YARN (Yet Another Resource Negotiator) is a resource management framework. H2O can be launched as an application on YARN. If you want to run H2O on Hadoop, you are essentially running H2O on YARN. We strongly recommend using YARN to manage your cluster resources.
+
+H2O with YARN
+~~~~~~~~~~~~~
+
+When you launch H2O on Hadoop using the ``hadoop jar`` command, YARN allocates the necessary resources to launch the requested number of nodes. H2O launches as a map-reduce (V2) task where each mapper is an H2O node of the specified size:
+
+.. code-block:: bash
+
+   hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
+
+Troubleshoot YARN
+'''''''''''''''''
+
+Occasionally, YARN may reject a job request. This usually occurs because there is either not enough memory to launch the job or because of an incorrect configuration.
+
+Failure with too little memory
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If YARN rejects the job request, try re-launching the job with less memory first to see if that is the cause of the failure. Specify smaller values for ``-mapperXmx`` (we recommend a minimum of ``2g``) and ``-nodes`` (start with ``1``) to confirm that H2O can launch successfully.
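When sizing ``-mapperXmx``, keep in mind that the YARN container must hold the Java heap plus the ``-extramempercent`` overhead described above. A quick sketch of that arithmetic (the helper function and its integer rounding are our own illustration, not part of H2O):

```python
import math

def yarn_container_mb(mapper_xmx_mb, extramempercent=10):
    """Approximate YARN container size (in MB) needed for one H2O mapper:
    the -mapperXmx heap plus the -extramempercent overhead (default 10%)."""
    return math.ceil(mapper_xmx_mb * (100 + extramempercent) / 100)

# A 6 GB (6144 MB) heap with the default 10% overhead needs a ~6.6 GB container:
print(yarn_container_mb(6144))  # 6759
```

The resulting value must fit within YARN's per-container memory limits, or the request will be rejected.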
+
+Failure due to configuration issues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To resolve configuration issues, adjust the maximum memory that YARN will allow when launching each mapper. If the cluster manager settings are configured for the default maximum memory size but the memory required for the request exceeds that amount, YARN will not launch and H2O will time out.
+
+If you are using the default configuration, change the configuration settings in your cluster manager to specify memory allocation when launching mapper tasks. To calculate the amount of memory required for a successful launch, use the following formula:
+
+   YARN container size (``mapreduce.map.memory.mb``) = ``-mapperXmx`` value + (``-mapperXmx`` :math:`\times` ``-extramempercent`` [default is 10%])
+
+The ``mapreduce.map.memory.mb`` value must be less than the YARN memory configuration values for the launch to succeed.
+
+Configure YARN
+~~~~~~~~~~~~~~
+
+Cloudera
+''''''''
+
+For Cloudera, configure the settings in Cloudera Manager. Depending on how the cluster is configured, you may need to change the settings for more than one role group.
+
+1. Click **Configuration** and enter the following search term in quotes: "yarn.nodemanager.resource.memory-mb".
+2. Enter the amount of memory (in GB) to allocate in the **Value** field. If more than one group is listed, change the value for all listed groups.
+
+   .. figure:: ../images/TroubleshootingHadoopClouderayarnnodemgr.png
+      :alt: Cloudera configuration page with the value setting highlighted in red.
+
+3. Click **Save Changes**.
+4. Enter the following search term in quotes: "yarn.scheduler.maximum-allocation-mb".
+5. Change the value, click **Save Changes**, and redeploy.
+
+   .. figure:: ../images/TroubleshootingHadoopClouderayarnscheduler.png
+      :alt: Cloudera configuration page with the value setting highlighted in red.
+
+Hortonworks
+'''''''''''
+
+For Hortonworks, configure the settings in Ambari.
See more on `Hortonworks configuration `__. + +1. Select **YARN**, then click the **Configs** tab. +2. Select the group. +3. Go to **Node Manager** section. Enter the amount of memory (in MB) to allocate in the **yarn.nodemanager.resource.memory-mb** entry field. + + .. figure:: ../images/TroubleshootingHadoopAmbariNodeMgr.png + :alt: Ambari configuration node manager section with the yarn.nodemanager.resource.memory-mb section highlighted in red. + +4. In the **Scheduler** section, enter the amount of memory (in MB) to allocate in the **yarn.scheduler.maximum-allocation-mb** entry field. + +.. figure:: ../images/TroubleshootingHadoopAmbariyarnscheduler.png + :alt: Ambari configuration scheduler section with the yarn.scheduler.maximum-allocation-mb section highlighted in red. + +5. Click **Save** and redeploy the cluster. + +MapR +'''' + +1. Edit the **yarn-site.xml** file for the node running the ResourceManager. +2. Change the values for the ``yarn.nodemanager.resource.memory-mb`` and ``yarn.scheduler.maximum-allocation-mb`` properties. +3. Restart the ResourceManager and redeploy the cluster. + +To verify the values were changes, check the values for the following properties: + +.. code-block:: bash + + - yarn.nodemanager.resource.memory-mb + - yarn.scheduler.maximum-allocation-mb + +Limit CPU usage +~~~~~~~~~~~~~~~ + +To limit the number of CPUs used by H2O, use the ``-nthreads`` option and specify the maximum number of CPUs for a single container to use. The following example limits the number of CPUs to four: + +.. code-block:: bash + + hadoop jar h2odriver.jar -nthreads 4 -nodes 1 -mapperXmx 6g -output hdfsOutputDirName + +.. note:: + + The default is 4 :math:`\times` the number of CPUs. You need to specify at least 4 CPUs or the following error message displays: + + ``ERROR: nthreads invalid (must be >= 4)`` + +Specify a queue +~~~~~~~~~~~~~~~ + +If you do not specify a queue when launching H2O, H2O jobs are submitted to the default queue. 
Jobs submitted to the default queue have a lower priority than jobs submitted to a specific queue. + +To specify a queue with Hadoop, enter ``-Dmapreduce.job.queuename=`` (where ```` is the name of the queue) when launching Hadoop. + +Queue example +''''''''''''' + +The following is an example of specifying a queue: + +.. code-block:: bash + + hadoop jar h2odriver.jar -Dmapreduce.job.queuename= -nodes -mapperXmx 6g -output hdfsOutputDirName + +Specify an output directory +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To prevent overwriting multiple users' files, each job must have a unique output directory name. Change the ``-output hdfsOutputDir`` argument (where ``hdfsOutputDir`` is the name of the directory). + +Alternatively, you can delete the directory (manually or by using a script) instead of creating a unique directory each time you launch H2O. + +YARN Customization +~~~~~~~~~~~~~~~~~~ + +Most of the configurable YARN variables are stored in ``yarn-site.xml``. To prevent settings from being overridden, you can mark a config as "final." If you change any values in ``yarn-site.xml``, you must restart YARN to confirm the changes. + +Access your logs +~~~~~~~~~~~~~~~~ + +Access logs for a YARN job with the ``yarn logs -applicationId `` command from a terminal. + +.. note:: + + This command must be run by the same userID as the job owner and can only be run after the job has finished. + +How H2O runs on YARN +~~~~~~~~~~~~~~~~~~~~ + +Let's say that you have a Hadoop cluster with six worker nodes and six HDFS nodes. For architectural diagramming purposes, the worker nodes and HDFS nodes are shown as separate blocks in the following diagrams, but they may be running on the same physical machines. + +The ``hadoop jar`` command that you run on the edge node talks to the YARN Resource Manager to launch an H2O MRv2 (map-reduce V2) job. The Resource Manager then places the requested number of H2O nodes (i.e. MRv2 mappers and YARN mappers), three in this example, on worker nodes. 
+
+   .. figure:: ../images/h2o-on-yarn-1.png
+      :alt: Hadoop cluster showing YARN resource manager placing requested number of H2O nodes on worker nodes.
+
+Once the H2O job's nodes all start, they find each other and create an H2O cluster (as shown by the dark blue line encircling the three H2O nodes in the following figure). The three H2O nodes work together to perform distributed Machine Learning functions as a group.
+
+.. note::
+
+   The three worker nodes that are not part of the H2O job have been removed from the following picture for explanatory purposes. They aren't part of the compute or memory resources used by the H2O job. The full complement of HDFS is still available, though.
+
+   .. figure:: ../images/h2o-on-yarn-2.png
+      :alt: Hadoop cluster showing H2O nodes forming a cluster to perform distributed machine learning functions as a group.
+
+Data is then read in from HDFS once (shown by the red lines in the following figure) and stored as distributed H2O frames in H2O's in-memory, column-compressed, distributed key-value (DKV) store.
+
+   .. figure:: ../images/h2o-on-yarn-3.png
+      :alt: Hadoop cluster showing data read from HDFS and stored as distributed H2O frames.
+
+Machine Learning algorithms then run very fast in a parallel and distributed way (as shown by the light blue lines in the following image). They iteratively sweep the data over and over again to build models. This is why the in-memory storage makes H2O fast.
+
+.. note::
+
+   The HDFS nodes have been removed from the following figure for explanatory purposes to emphasize that the data lives in-memory during the model training process.
+
+   .. figure:: ../images/h2o-on-yarn-4.png
+      :alt: Hadoop cluster showing algorithms running in parallel, iteratively sweeping data to build models.
+
+Hadoop and AWS
+--------------
+
+AWS access credential configuration is provided to H2O by the Hadoop environment itself.
There are a number of Hadoop distributions, and each distribution supports different means/providers to configure access to AWS. It's considered best practice to follow your Hadoop provider's guide.
+
+You can access multiple buckets with distinct credentials by means of the S3A protocol. See the `Hadoop documentation `__ for more information. If you use derived distributions, we advise you to follow the respective documentation of your distribution and the specific version you are using.

From cd74e4a55bf177024a6560ebbd6205af7e8add5f Mon Sep 17 00:00:00 2001
From: Hannah Tillman
Date: Fri, 26 Apr 2024 12:00:09 -0500
Subject: [PATCH 10/27] ht/added docker users

---
 .../product/getting-started/docker-users.rst  | 165 ++++++++++++++++++
 1 file changed, 165 insertions(+)
 create mode 100644 h2o-docs/src/product/getting-started/docker-users.rst

diff --git a/h2o-docs/src/product/getting-started/docker-users.rst b/h2o-docs/src/product/getting-started/docker-users.rst
new file mode 100644
index 000000000000..974bb51c3a3e
--- /dev/null
+++ b/h2o-docs/src/product/getting-started/docker-users.rst
@@ -0,0 +1,165 @@
+Docker users
+============
+
+This section describes how to use H2O on Docker. It walks you through the following steps:
+
+1. Installing Docker on Mac or Linux OS.
+2. Creating and modifying your Dockerfile.
+3. Building a Docker image from the Dockerfile.
+4. Running the Docker build.
+5. Launching H2O.
+6. Accessing H2O from the web browser or from Python/R.
+
+Prerequisites
+-------------
+
+- Linux kernel version 3.8+ or Mac OS 10.6+
+- VirtualBox
+- Latest version of Docker installed and configured
+- Docker daemon running (enter all following commands in the Docker daemon window)
+- In ``User`` directory (not ``root``)
+
+.. note::
+
+   - Older Linux kernel versions can cause kernel panics that break Docker. There are ways around it, but attempt these at your own risk. Check the version of your kernel by running ``uname -r``.
+   - The Dockerfile always pulls the latest H2O release.
+   - The Docker image only needs to be built once.
+
+Walkthrough
+-----------
+
+The following steps walk you through how to use H2O on Docker.
+
+.. note::
+
+   If the following commands don't work, prepend them with ``sudo``.
+
+Step 1: Install and launch Docker
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Depending on your operating system, select the appropriate installation method:
+
+- `Mac installation `__
+- `Ubuntu installation `__
+- `Other OS installations `__
+
+.. note::
+
+   By default, Docker allocates 2GB of memory for Mac installations. Be sure to increase this value. We suggest 3-4 times the size of the dataset for the amount of memory required.
+
+Step 2: Create or download Dockerfile
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Create a folder on the Host OS to host your Dockerfile:
+
+.. code-block:: bash
+
+   mkdir -p /data/h2o-{{branch_name}}
+
+2. Download or create a Dockerfile, which is a build recipe that builds the container. Download and use our `Dockerfile template `__:
+
+.. code-block:: bash
+
+   cd /data/h2o-
+   wget https://raw.githubusercontent.com/h2oai/h2o-3/master/Dockerfile
+
+This Dockerfile will do the following:

+
+- Obtain and update the base image (Ubuntu 14.04).
+- Install Java 8.
+- Obtain and download the H2O build from H2O's S3 repository.
+- Expose ports ``54321`` and ``54322`` in preparation for launching H2O on those ports.
+
+Step 3: Build a Docker image from the Dockerfile
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+From the ``/data/h2o-`` directory, run the following (note that ``v5`` represents the current version number, and the trailing ``.`` tells Docker to use the current directory as the build context):
+
+.. code-block:: bash
+
+   docker build -t "h2o.ai/{{branch_name}}:v5" .
+
+.. note::
+
+   This process can take a few minutes because it assembles all the necessary parts for the image.
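As an aside, the value passed to ``-t`` follows Docker's ``repository:tag`` naming convention. The helper below is purely our own illustrative sketch (not part of Docker or H2O tooling) showing how such a reference splits into repository and tag:

```python
def split_image_tag(image):
    """Split a Docker image reference into (repository, tag); the tag defaults to 'latest'."""
    repo, sep, tag = image.rpartition(":")
    # No ':' at all, or the ':' belonged to a registry port (e.g. localhost:5000/h2o):
    if not sep or "/" in tag:
        return image, "latest"
    return repo, tag

print(split_image_tag("h2o.ai/master:v5"))  # → ('h2o.ai/master', 'v5')
```

The same rule is why ``h2o.ai/{{branch_name}}:v5`` tags the image ``v5`` under the ``h2o.ai/{{branch_name}}`` repository.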
+
+Step 4: Run the Docker build
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On a Mac, use the argument ``-p 54321:54321`` to explicitly map port ``54321`` (this is not necessary on Linux).
+
+.. code-block:: bash
+
+   docker run -ti -p 54321:54321 h2o.ai/{{branch_name}}:v5 /bin/bash
+
+Step 5: Launch H2O
+~~~~~~~~~~~~~~~~~~
+
+Navigate to the ``/opt`` directory and launch H2O. Update the value of ``-Xmx`` to the amount of memory you want to allocate to the H2O instance. By default, H2O will launch on port ``54321``.
+
+.. code-block:: bash
+
+   cd /opt
+   java -Xmx1g -jar h2o.jar
+
+Step 6: Access H2O from the web browser or Python/R
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. tabs::
+   .. tab:: On Linux
+
+      After H2O launches, copy and paste the IP address and port of the H2O instance into the address bar of your browser. In the following example, the IP is ``172.17.0.5:54321``.
+
+      .. code-block:: bash
+
+         03:58:25.963 main INFO WATER: Cloud of size 1 formed [/172.17.0.5:54321 (00:00:00.000)]
+
+   .. tab:: On MacOS
+
+      Locate the IP address of Docker's network (``192.168.59.103`` in the following example) that bridges to your Host OS by opening a new terminal window (not a bash session for your container) and running ``boot2docker ip``.
+
+      .. code-block:: bash
+
+         $ boot2docker ip
+         192.168.59.103
+
+
+You can also view the IP address (``192.168.99.100`` in the following example) by scrolling to the top of the Docker daemon window:
+
+::
+
+
+                  ##         .
+            ## ## ##        ==
+         ## ## ## ## ##    ===
+     /"""""""""""""""""\___/ ===
+    ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
+     \______ o          __/
+      \    \           __/
+       \____\_______/
+
+
+   docker is configured to use the default machine with IP 192.168.99.100
+   For help getting started, check out the docs at https://docs.docker.com
+
+Access Flow
+'''''''''''
+
+After obtaining the IP address, point your browser to the specified IP address and port to open Flow.
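The Flow address is simply the instance's IP and port joined into a URL — a minimal sketch (the helper name is ours, not an H2O API):

```python
def flow_url(ip, port=54321):
    """Build the Flow UI address from an H2O instance's IP and its (default) port."""
    return f"http://{ip}:{port}"

print(flow_url("192.168.59.103"))  # → http://192.168.59.103:54321
```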
In R and Python, you can access the instance by installing the latest version of the H2O R or Python package and then initializing H2O:
+
+.. tabs::
+   .. code-tab:: python
+
+      # Initialize H2O
+      import h2o
+      docker_h2o = h2o.init(ip = "192.168.59.103", port = 54321)
+
+   .. code-tab:: r R
+
+      # Initialize H2O
+      library(h2o)
+      dockerH2O <- h2o.init(ip = "192.168.59.103", port = 54321)
+
+
+

From b45b08356c3d9e740672208daf20ad75da141109 Mon Sep 17 00:00:00 2001
From: Hannah Tillman
Date: Fri, 26 Apr 2024 12:00:22 -0500
Subject: [PATCH 11/27] ht/added kubernetes users

---
 .../getting-started/kubernetes-users.rst      | 145 ++++++++++++++++++
 1 file changed, 145 insertions(+)
 create mode 100644 h2o-docs/src/product/getting-started/kubernetes-users.rst

diff --git a/h2o-docs/src/product/getting-started/kubernetes-users.rst b/h2o-docs/src/product/getting-started/kubernetes-users.rst
new file mode 100644
index 000000000000..21a12dfd4e50
--- /dev/null
+++ b/h2o-docs/src/product/getting-started/kubernetes-users.rst
@@ -0,0 +1,145 @@
+Kubernetes users
+================
+
+H2O nodes must be treated as stateful by the Kubernetes environment because H2O is a stateful application. H2O nodes are, therefore, spawned together and deallocated together as a single unit. Consequently, Kubernetes tooling for stateless applications is not applicable to H2O. In Kubernetes, a set of pods sharing a common state is named a `StatefulSet `__.
+
+H2O pods deployed on a Kubernetes cluster require a `headless service `__ for H2O node discovery. The headless service returns a set of addresses to all the underlying pods instead of load-balancing incoming requests to the underlying H2O pods.
+
+.. figure:: images/h2o-k8s-clustering.png
+:alt: Kubernetes headless service enclosing an underlying H2O cluster made of a StatefulSet.
+
+Kubernetes integration
+----------------------
+
+This section outlines how to integrate H2O and Kubernetes.
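To illustrate why the service must be headless: a regular ClusterIP service resolves to a single virtual IP, while a headless service's DNS record returns every pod address, which is what H2O's node discovery needs. A toy model of that difference (purely illustrative — not Kubernetes API code, and all IPs are made up):

```python
def resolve(service, endpoints, headless=True):
    """Toy DNS model: a headless service returns all pod IPs; a normal service returns one virtual IP."""
    if headless:
        return sorted(endpoints[service])  # every H2O pod address is visible to the caller
    return ["10.96.0.7"]                   # a single load-balanced virtual IP (made up)

pods = {"h2o-service": ["10.1.0.4", "10.1.0.5", "10.1.0.6"]}
print(resolve("h2o-service", pods))  # → ['10.1.0.4', '10.1.0.5', '10.1.0.6']
```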
+
+Requirements
+~~~~~~~~~~~~
+
+To spawn an H2O cluster inside of a Kubernetes cluster, you need the following:
+
+- A Kubernetes cluster: either local development (e.g. `k3s `__) or easy start (e.g. `OpenShift `__ by Red Hat)
+- A Docker image with H2O inside.
+- A Kubernetes deployment definition with a StatefulSet of H2O pods and a headless service.
+
+Create the Docker image
+~~~~~~~~~~~~~~~~~~~~~~~
+
+A simple Docker container with H2O running on startup is enough:
+
+.. code:: dockerfile
+
+   FROM ubuntu:latest
+   ARG H2O_VERSION
+   RUN apt-get update \
+       && apt-get install default-jdk unzip wget -y
+   RUN wget http://h2o-release.s3.amazonaws.com/h2o/rel-zahradnik/1/h2o-${H2O_VERSION}.zip \
+       && unzip h2o-${H2O_VERSION}.zip
+   ENV H2O_VERSION ${H2O_VERSION}
+   CMD java -jar h2o-${H2O_VERSION}/h2o.jar
+
+To build the Docker image, use ``docker build . -t {image-name} --build-arg H2O_VERSION=``. Make sure to replace ``{image-name}`` with a meaningful H2O deployment name and ```` with your H2O version.
+
+.. note::
+
+   For the rest of this example, the Docker image will be named ``h2o-k8s``.
+
+Create the headless service
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First, create a headless service on Kubernetes:
+
+.. code:: yaml
+
+   apiVersion: v1
+   kind: Service
+   metadata:
+     name: h2o-service
+     namespace: default
+   spec:
+     type: ClusterIP
+     clusterIP: None
+     selector:
+       app: h2o-k8s
+     ports:
+     - protocol: TCP
+       port: 54321
+
+Where:
+
+- ``clusterIP: None``: This setting defines the service as headless.
+- ``port: 54321``: This setting is the default H2O port. Users and client libraries use this port to talk to the H2O cluster.
+- ``app: h2o-k8s``: This setting is of great importance because it is the name of the application with the H2O pods inside. While the name is arbitrarily chosen in this example, it must correspond to the chosen H2O deployment name.
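The ``selector`` picks pods by exact label matching: a pod backs the service only when every key/value pair in the selector appears in the pod's labels. A hypothetical sketch of that rule (our own illustration, not actual Kubernetes code):

```python
def selector_matches(selector, pod_labels):
    """A service selects a pod only if every selector key/value pair appears in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())

print(selector_matches({"app": "h2o-k8s"}, {"app": "h2o-k8s", "role": "worker"}))  # → True
print(selector_matches({"app": "h2o-k8s"}, {"app": "other-app"}))                  # → False
```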
+
+Create the H2O deployment
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We strongly recommend you run H2O as a StatefulSet on your Kubernetes cluster. Treating H2O nodes as stateful ensures the following:
+
+- H2O nodes will be treated as a single unit and will be brought up and down gracefully and together.
+- No attempts will be made by a Kubernetes healthcheck to restart individual H2O nodes in case of an error.
+- The cluster will be restarted as a whole, if required.
+- Persistent storage and volumes associated with the StatefulSet of H2O nodes will not be deleted once the cluster is brought down.
+
+.. code:: yaml
+
+   apiVersion: apps/v1
+   kind: StatefulSet
+   metadata:
+     name: h2o-stateful-set
+     namespace: default
+   spec:
+     serviceName: h2o-service
+     podManagementPolicy: "Parallel"
+     replicas: 3
+     selector:
+       matchLabels:
+         app: h2o-k8s
+     template:
+       metadata:
+         labels:
+           app: h2o-k8s
+       spec:
+         terminationGracePeriodSeconds: 10
+         containers:
+           - name: h2o-k8s
+             image: 'h2oai/h2o-open-source-k8s:latest'
+             resources:
+               requests:
+                 memory: "4Gi"
+             ports:
+               - containerPort: 54321
+                 protocol: TCP
+             env:
+             - name: H2O_KUBERNETES_SERVICE_DNS
+               value: h2o-service.default.svc.cluster.local
+             - name: H2O_NODE_LOOKUP_TIMEOUT
+               value: '180'
+             - name: H2O_NODE_EXPECTED_COUNT
+               value: '3'
+
+Where:
+
+- ``H2O_KUBERNETES_SERVICE_DNS``: *Required.* Crucial for clustering to work. This format usually follows the ``..svc.cluster.local`` pattern. This setting enables H2O node discovery through DNS. It must be modified to match the name of the headless service you created. Be sure you also pay attention to the rest of the address: it needs to match the specifics of your Kubernetes implementation.
+- ``H2O_NODE_LOOKUP_TIMEOUT``: Node lookup constraint. Specify the time before the node lookup times out.
+- ``H2O_NODE_EXPECTED_COUNT``: Node lookup constraint. Specify the expected number of H2O pods to be discovered.
+- ``H2O_KUBERNETES_API_PORT``: Port for Kubernetes API checks to listen on (defaults to ``8080``). + +If none of these optional lookup constraints are specified, a sensible default node lookup timeout will be set (defaults to three minutes). If any of the lookup constraints are defined, the H2O node lookup is terminated on whichever condition is met first. + +In the above example, ``'h2oai/h2o-open-source-k8s:latest'`` retrieves the latest build of the H2O Docker image. Replace ``latest`` with ``nightly`` to get the bleeding-edge Docker image with H2O inside. + +Documentation +''''''''''''' + +The documentation for the official H2O Docker images is available at the official `H2O Docker Hub page `__. + +Expose the H2O cluster +~~~~~~~~~~~~~~~~~~~~~~ + +Exposing the H2O cluster is the responsibility of the Kubernetes administrator. By default, an `Ingress `__ can be created. Different platforms offer different capabilities (e.g. OpenShift offers `Routes `__). + +See more information on `running an H2O cluster on a Kubernetes cluster `__. + + + From dc52ce30c4233c861f27e84bf540fb70368b7ce8 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 12:00:33 -0500 Subject: [PATCH 12/27] ht/added experienced users --- .../getting-started/experienced-users.rst | 79 +++++++++++++++++++ 1 file changed, 79 insertions(+) create mode 100644 h2o-docs/src/product/getting-started/experienced-users.rst diff --git a/h2o-docs/src/product/getting-started/experienced-users.rst b/h2o-docs/src/product/getting-started/experienced-users.rst new file mode 100644 index 000000000000..f354c74c4389 --- /dev/null +++ b/h2o-docs/src/product/getting-started/experienced-users.rst @@ -0,0 +1,79 @@ +Experienced users +================= + +If you've used previous versions of H2O, the following links will help guide you through the process of upgrading H2O. + +Changes +------- + +Change log +~~~~~~~~~~ + +`This page houses the most recent changes in the latest build of H2O `__. 
It lists new features, improvements, security updates, documentation improvements, and bug fixes for each release.
+
+API-related changes
+~~~~~~~~~~~~~~~~~~~
+
+The `API-related changes `__ section describes changes made to H2O that can affect backward compatibility.
+
+Developers
+----------
+
+If you're looking to use H2O to help you develop your own apps, the following links will provide helpful references.
+
+Gradle
+~~~~~~
+
+H2O's build is completely managed by Gradle. Any IDE with Gradle support is sufficient for H2O-3 development. The latest versions of IntelliJ IDEA are thoroughly tested and proven to work well.
+
+Open the folder with H2O-3 in IntelliJ IDEA and it will automatically recognize that Gradle is required and will import the project. The Gradle wrapper present in the repository itself may be used manually/directly to build and test if required.
+
+For JUnit tests to pass, you may need multiple H2O nodes. Create a "Run/Debug" configuration:
+
+::
+   Type: Application
+   Main class: H2OApp
+   Use class path of module: h2o-app
+
+After starting multiple "worker" node processes in addition to the JUnit test process, they will cloud up and run the multi-node JUnit tests.
+
+Maven install
+~~~~~~~~~~~~~
+
+You can view instructions for using H2O with Maven on the `Downloads page `__.
+
+1. Select H2O Open Source Platform or scroll down to H2O.
+2. Select the version of H2O you want to install (latest stable or nightly build).
+3. Click the Use from Maven tab.
+
+`This page provides information on how to build a version of H2O that generates the correct IDE files `__ for your Maven installation.
+
+Developer resources
+~~~~~~~~~~~~~~~~~~~
+
+Documentation
+'''''''''''''
+
+See the detailed `instructions on how to build and launch H2O `__, including how to clone the repository, how to pull from the repository, and how to install required dependencies.
+ +Droplet project templates +^^^^^^^^^^^^^^^^^^^^^^^^^ + +`This page provides template information `__ for projects created in Java, Scala, or Sparkling Water. + +Blogs +''''' + +Learn more about performance characteristics when implementing new algorithms in this `KV Store guide blog `__. + +This `blog post by Cliff `__ walks you through building a new algorithm, using K-Means, Quantiles, and Grep as examples. + +Join the H2O community +---------------------- + +`Join our community support and outreach `__ by accessing self-paced courses, scoping out meetups, and interacting with other users and our team. + +Contributing code +~~~~~~~~~~~~~~~~~ + +If you're interested in contributing code to H2O, we appreciate your assistance! See `how to contribute to H2O `__. This document describes how to access our list of issues, or suggested tasks for contributors, and how to contact us. \ No newline at end of file From 3f442b9d01b5794cfad87c5d50c0e50a523a1b5f Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 12:10:41 -0500 Subject: [PATCH 13/27] ht/welcome & docker page fixes --- .../product/getting-started/docker-users.rst | 14 ++++------- h2o-docs/src/product/welcome.rst | 25 +++++++++++-------- 2 files changed, 19 insertions(+), 20 deletions(-) diff --git a/h2o-docs/src/product/getting-started/docker-users.rst b/h2o-docs/src/product/getting-started/docker-users.rst index 974bb51c3a3e..d53aa2aa1d83 100644 --- a/h2o-docs/src/product/getting-started/docker-users.rst +++ b/h2o-docs/src/product/getting-started/docker-users.rst @@ -121,7 +121,7 @@ Step 6: Access H2O from the web browser or Python/R .. code-block:: bash $ boot2docker ip - 192.168.59.103 + 192.168.59.103 You can also view the IP address (``192.168.99.100`` in the following example) by scrolling to the top of the Docker daemon window: @@ -151,15 +151,11 @@ After obtaining the IP address, point your browser to the specified IP address a .. 
code-tab:: python # Initialize H2O - import h2o - docker_h2o = h2o.init(ip = "192.168.59.103", port = 54321) + import h2o + docker_h2o = h2o.init(ip = "192.168.59.103", port = 54321) .. code-tab:: r R # Initialize H2O - library(h2o) - dockerH2O <- h2o.init(ip = "192.168.59.103", port = 54321) - - - - + library(h2o) + dockerH2O <- h2o.init(ip = "192.168.59.103", port = 54321) diff --git a/h2o-docs/src/product/welcome.rst b/h2o-docs/src/product/welcome.rst index 4ff268ee03ad..59871df66441 100644 --- a/h2o-docs/src/product/welcome.rst +++ b/h2o-docs/src/product/welcome.rst @@ -51,15 +51,15 @@ Java support H2O supports the following versions of Java: -- Java SE 17, -- Java SE 16, -- Java SE 15, -- Java SE 14, -- Java SE 13, -- Java SE 12, -- Java SE 11, -- Java SE 10, -- Java SE 9, +- Java SE 17 +- Java SE 16 +- Java SE 15 +- Java SE 14 +- Java SE 13 +- Java SE 12 +- Java SE 11 +- Java SE 10 +- Java SE 9 - Java SE 8 `Download the latest supported version of Java `__. @@ -69,10 +69,13 @@ Unsupported Java versions We recommend that only power users force an unsupported Java version. Unsupported Java versions can only be used for experiments. For production versions, we only guarantee the Java versions from the supported list. -How to force an unsupported java version +How to force an unsupported Java version ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -:: +The following code forces an unsupported Java version: + +.. 
code-block:: bash + java -jar -Dsys.ai.h2o.debug.allowJavaVersions=19 h2o.jar Java support with H2O and Hadoop From 3bccf9388615ff19a3b913c53648874925d3367e Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 12:19:11 -0500 Subject: [PATCH 14/27] ht/hadoop page fixes --- .../product/getting-started/hadoop-users.rst | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/h2o-docs/src/product/getting-started/hadoop-users.rst b/h2o-docs/src/product/getting-started/hadoop-users.rst index 51b80b268b51..f73713c787b2 100644 --- a/h2o-docs/src/product/getting-started/hadoop-users.rst +++ b/h2o-docs/src/product/getting-started/hadoop-users.rst @@ -78,12 +78,14 @@ The following steps show you how to download or build H2O with Hadoop and the pa 2. Prepare the job input on the Hadoop node by unzipping the build file and changing to the directory with the Hadoop and H2O's driver jar files: :: + unzip h2o-{{project_version}}-*.zip cd h2o-{{project_version}}-* 3. Launch H2O nodes and form a cluster on the Hadoop cluster by running: :: + hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g The above command launches a 6g node of H2O. We recommend you launch the cluster with at least four times the memory of your data file size. @@ -97,6 +99,7 @@ The following steps show you how to download or build H2O with Hadoop and the pa 4. Monitor your job by directing your web browser to your standard job tracker web UI. To access H2O's web UI, direct your web browser to one of the launched instances. If you are unsure where your JVM is launched, review the output from your command after the nodes have clouded and formed a cluster. Any nodes' IP addresses will work as there is no master node: :: + Determining driver host interface for mapper->driver callback... [Possible callback IP address: 172.16.2.181] [Possible callback IP address: 127.0.0.1] @@ -183,15 +186,15 @@ where: 3. Import the data with the S3 URL path: .. tabs:: - code-tab:: r R + .. 
code-tab:: r R h2o.importFile(path = "s3://bucket/path/to/file.csv") - code-tab:: python + .. code-tab:: python h2o.import_frame(path = "s3://bucket/path/to/file.csv") - code-tab:: Flow + .. code-tab:: bash Flow importFiles [ "s3:/path/to/bucket/file/file.tab.gz" ] @@ -345,7 +348,7 @@ Let's say that you have a Hadoop cluster with six worker nodes and six HDFS node The ``hadoop jar`` command that you run on the edge node talks to the YARN Resource Manager to launch an H2O MRv2 (map-reduce V2) job. The Resource Manager then places the requested number of H2O nodes (i.e. MRv2 mappers and YARN mappers), three in this example, on worker nodes. - .. figure:: ../images/h2o-on-yarn-1.png +.. figure:: ../images/h2o-on-yarn-1.png :alt: Hadoop cluster showing YARN resource manager placing requested number of H2O nodes on worker nodes. Once the H2O job's nodes all start, they find each other and create an H2O cluster (as shown by the dark blue line encircling the three H2O nodes in the following figure). The three H2O nodes work together to perform distributed Machine Learning functions as a group. @@ -355,13 +358,13 @@ Once the H2O job's nodes all start, they find each other and create an H2O clust The three worker nodes that are not part of the H2O job have been removed from the following picture for explanatory purposes. They aren't part of the compute or memory resources used by the H2O job, The full complement of HDFS is still available, though. - .. figure:: ../images/h2o-on-yarn-2.png +.. figure:: ../images/h2o-on-yarn-2.png :alt: Hadoop cluster showing H2O nodes forming a cluster to perform distributed machine learning functions as a group. Data is then read in from HDFS once (seen by the red lines in the following figure) and stored as distributed H2O frames in H2O's in-memory column-compressed distributed key-value (DKV) store. - .. figure:: ../images/h2o-on-yarn-3.png +.. 
figure:: ../images/h2o-on-yarn-3.png :alt: Hadoop cluster showing data read from HDFS and stored as distributed H2O frames. Machine Learning algorithms then run very fast in a parallel and distributed way (as shown by the light blue lines in the following image). They iteratively sweep the data over and over again to build models. This is why the in-memory storage makes H2O fast. @@ -370,7 +373,7 @@ Machine Learning algorithms then run very fast in a parallel and distributed way The HDFS nodes have been removed from the following figure for explanatory purposes to emphasize that the data lives in-memory during the model training process. - .. figure:: ../images/h2o-on-yarn-4.png +.. figure:: ../images/h2o-on-yarn-4.png :alt: Hadoop cluster showing algorithms running in parallel, iteratively sweeping data to build models. Hadoop and AWS From 6a747110cd38721b2f38514f03827724039959e3 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 12:23:42 -0500 Subject: [PATCH 15/27] ht/getting started page fixes --- .../getting-started/getting-started.rst | 94 +++++++++---------- 1 file changed, 47 insertions(+), 47 deletions(-) diff --git a/h2o-docs/src/product/getting-started/getting-started.rst b/h2o-docs/src/product/getting-started/getting-started.rst index 9934c58819c7..cba3774684f5 100644 --- a/h2o-docs/src/product/getting-started/getting-started.rst +++ b/h2o-docs/src/product/getting-started/getting-started.rst @@ -14,7 +14,7 @@ To begin, download a copy of H2O from the `Downloads page `__ if you want to secure your installation. @@ -80,7 +80,7 @@ You can follow these steps to quickly get up and running with H2O directly from 3. After the repository is cloned, change directories to the ``h2o-3`` folder: - .. code-block:: bash +.. 
code-block:: bash repos user$ cd h2o-3 h2o-3 user$ @@ -94,51 +94,51 @@ You can follow these steps to quickly get up and running with H2O directly from At this point, choose whether you want to complete this quickstart in Python or R. Then, run the following corresponding commands from either the Python or R tab: .. tabs:: - .. code-tab:: python - - # By default, this setup is open. - # Follow our security guidelines (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html) - # if you want to secure your installation. - - # Before starting Python, run the following commands to install dependencies. - # Prepend these commands with `sudo` only if necessary: - # h2o-3 user$ [sudo] pip install -U requests - # h2o-3 user$ [sudo] pip install -U tabulate - - # Start python: - # h2o-3 user$ python - - # Run the following commands to import the H2O module: - >>> import h2o - - # Run the following command to initialize H2O on your local machine (single-node cluster): - >>> h2o.init() - - # If desired, run the GLM, GBM, or Deep Learning demo(s): - >>> h2o.demo("glm") - >>> h2o.demo("gbm") - >>> h2o.demo("deeplearning") - - # Import the Iris (with headers) dataset: - >>> path = "smalldata/iris/iris_wheader.csv" - >>> iris = h2o.import_file(path=path) - - # View a summary of the imported dataset: - >>> iris.summary - # sepal_len sepal_wid petal_len petal_wid class - # 5.1 3.5 1.4 0.2 Iris-setosa - # 4.9 3 1.4 0.2 Iris-setosa - # 4.7 3.2 1.3 0.2 Iris-setosa - # 4.6 3.1 1.5 0.2 Iris-setosa - # 5 3.6 1.4 0.2 Iris-setosa - # 5.4 3.9 1.7 0.4 Iris-setosa - # 4.6 3.4 1.4 0.3 Iris-setosa - # 5 3.4 1.5 0.2 Iris-setosa - # 4.4 2.9 1.4 0.2 Iris-setosa - # 4.9 3.1 1.5 0.1 Iris-setosa - # - # [150 rows x 5 columns] - # + .. code-tab:: python + + # By default, this setup is open. + # Follow our security guidelines (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html) + # if you want to secure your installation. 
+ + # Before starting Python, run the following commands to install dependencies. + # Prepend these commands with `sudo` only if necessary: + # h2o-3 user$ [sudo] pip install -U requests + # h2o-3 user$ [sudo] pip install -U tabulate + + # Start python: + # h2o-3 user$ python + + # Run the following commands to import the H2O module: + >>> import h2o + + # Run the following command to initialize H2O on your local machine (single-node cluster): + >>> h2o.init() + + # If desired, run the GLM, GBM, or Deep Learning demo(s): + >>> h2o.demo("glm") + >>> h2o.demo("gbm") + >>> h2o.demo("deeplearning") + + # Import the Iris (with headers) dataset: + >>> path = "smalldata/iris/iris_wheader.csv" + >>> iris = h2o.import_file(path=path) + + # View a summary of the imported dataset: + >>> iris.summary + # sepal_len sepal_wid petal_len petal_wid class + # 5.1 3.5 1.4 0.2 Iris-setosa + # 4.9 3 1.4 0.2 Iris-setosa + # 4.7 3.2 1.3 0.2 Iris-setosa + # 4.6 3.1 1.5 0.2 Iris-setosa + # 5 3.6 1.4 0.2 Iris-setosa + # 5.4 3.9 1.7 0.4 Iris-setosa + # 4.6 3.4 1.4 0.3 Iris-setosa + # 5 3.4 1.5 0.2 Iris-setosa + # 4.4 2.9 1.4 0.2 Iris-setosa + # 4.9 3.1 1.5 0.1 Iris-setosa + # + # [150 rows x 5 columns] + # .. code-tab:: r R From 7d73d9112581d065393d76016e9ab183183c8f99 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 12:28:10 -0500 Subject: [PATCH 16/27] ht/link fixes --- h2o-docs/src/product/getting-started/flow-users.rst | 2 +- h2o-docs/src/product/getting-started/getting-started.rst | 6 +++--- h2o-docs/src/product/getting-started/python-users.rst | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/h2o-docs/src/product/getting-started/flow-users.rst b/h2o-docs/src/product/getting-started/flow-users.rst index 08a503c1df3e..eeecf05a7a52 100644 --- a/h2o-docs/src/product/getting-started/flow-users.rst +++ b/h2o-docs/src/product/getting-started/flow-users.rst @@ -3,4 +3,4 @@ Flow users H2O Flow is a notebook-style open source UI for H2O. 
It's a web-based interactive environment that lets you combine code execution, text, mathematics, plots, and rich media in a single document (similar to iPython Notebooks). -See more about `H2O Flow `__. \ No newline at end of file +See more about `H2O Flow <../flow.html>`__. \ No newline at end of file diff --git a/h2o-docs/src/product/getting-started/getting-started.rst b/h2o-docs/src/product/getting-started/getting-started.rst index cba3774684f5..c4b7ff3b6c53 100644 --- a/h2o-docs/src/product/getting-started/getting-started.rst +++ b/h2o-docs/src/product/getting-started/getting-started.rst @@ -17,7 +17,7 @@ To begin, download a copy of H2O from the `Downloads page `__ if you want to secure your installation. + By default, this setup is open. Follow `security guidelines <../security.html>`__ if you want to secure your installation. Using Flow - H2O's web UI ------------------------- @@ -45,7 +45,7 @@ You can configure H2O when you launch it from the command line. For example, you Algorithms ---------- -`This section describes the science behind our algorithms `__ and provides a detailed, per-algorithm view of each model type. +`This section describes the science behind our algorithms <../data-science.html#data-science>`__ and provides a detailed, per-algorithm view of each model type. Use cases --------- @@ -85,7 +85,7 @@ You can follow these steps to quickly get up and running with H2O directly from repos user$ cd h2o-3 h2o-3 user$ -4. Run the following command to retrieve sample datasets. These datasets are used throughout the user guide and within the `booklets `__. +4. Run the following command to retrieve sample datasets. These datasets are used throughout the user guide and within the `booklets <../additional-resources.html#algorithms>`__. .. 
code-block:: bash diff --git a/h2o-docs/src/product/getting-started/python-users.rst b/h2o-docs/src/product/getting-started/python-users.rst index 50c3e7cc9a34..5133229fe850 100644 --- a/h2o-docs/src/product/getting-started/python-users.rst +++ b/h2o-docs/src/product/getting-started/python-users.rst @@ -32,4 +32,4 @@ See a notebook demonstration for how to use grid search in Python. Anaconda Cloud users -------------------- -You can run H2O in an Anaconda Cloud environment. Conda 2.7, 3.5, and 3.6 repositories are supported (as are a number of H2O versions). See Anaconda's `official H2O package `__ to view a list of all available H2O versions. You can refer to the `Install on Anaconda Cloud `__ section for information about installing H2O in an Anaconda Cloud. \ No newline at end of file +You can run H2O in an Anaconda Cloud environment. Conda 2.7, 3.5, and 3.6 repositories are supported (as are a number of H2O versions). See Anaconda's `official H2O package `__ to view a list of all available H2O versions. You can refer to the `Install on Anaconda Cloud <../downloading.html#install-on-anaconda-cloud>`__ section for information about installing H2O in an Anaconda Cloud. \ No newline at end of file From 1ab5c90e966a8dd7897842b22b200f714a476d23 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 12:33:30 -0500 Subject: [PATCH 17/27] ht/link fixes --- .../product/getting-started/python-users.rst | 2 +- .../src/product/getting-started/r-users.rst | 3 ++- .../getting-started/sparkling-users.rst | 26 +++++++++---------- 3 files changed, 16 insertions(+), 15 deletions(-) diff --git a/h2o-docs/src/product/getting-started/python-users.rst b/h2o-docs/src/product/getting-started/python-users.rst index 5133229fe850..ceabd58c4e64 100644 --- a/h2o-docs/src/product/getting-started/python-users.rst +++ b/h2o-docs/src/product/getting-started/python-users.rst @@ -11,7 +11,7 @@ The following sections will help you begin using Python for H2O. 
Installing H2O with Python ~~~~~~~~~~~~~~~~~~~~~~~~~~ -You can find instructions for using H2O with Python in the `Downloading and installing H2O `__ section and on the `Downloads page `__. +You can find instructions for using H2O with Python in the `Downloading and installing H2O <../downloading.html#install-in-python>`__ section and on the `Downloads page `__. From the Downloads page: diff --git a/h2o-docs/src/product/getting-started/r-users.rst b/h2o-docs/src/product/getting-started/r-users.rst index 6a30a2d09d11..36730df6faef 100644 --- a/h2o-docs/src/product/getting-started/r-users.rst +++ b/h2o-docs/src/product/getting-started/r-users.rst @@ -21,7 +21,7 @@ See `this cheatsheet on H2O in R `__ section and on the `Downloads page `__. +You can find instructions for using H2O with Python in the `Downloading and installing H2O <../downloading.html#install-in-r>`__ section and on the `Downloads page `__. From the Downloads page: @@ -35,6 +35,7 @@ Checking your R version for H2O To check which version of H2O is installed in R, run the following: :: + versions::installed.versions("h2o") .. note:: diff --git a/h2o-docs/src/product/getting-started/sparkling-users.rst b/h2o-docs/src/product/getting-started/sparkling-users.rst index 40258b16b66d..b837dafd353c 100644 --- a/h2o-docs/src/product/getting-started/sparkling-users.rst +++ b/h2o-docs/src/product/getting-started/sparkling-users.rst @@ -34,14 +34,14 @@ Sparkling Water documentation The documentation for Sparkling Water is separate from the H2O user guide. Read this documentation to get started with Sparkling Water. 
-- `Sparkling Water for Spark 3.5 `__ +- `Sparkling Water for Spark 3.4 `__ +- `Sparkling Water for Spark 3.3 `__ +- `Sparkling Water for Spark 3.2 `__ +- `Sparkling Water for Spark 3.1 `__ +- `Sparkling Water for Spark 3.0 `__ +- `Sparkling Water for Spark 2.4 `__ +- `Sparkling Water for Spark 2.3 `__ Sparkling Water tutorials ~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -59,11 +59,11 @@ Sparkling Water FAQ The frequently asked questions provide answers to many common questions about Sparkling Water. -- Sparkling Water FAQ for 3.5 `__ -- Sparkling Water FAQ for 3.4 `__ -- Sparkling Water FAQ for 3.3 `__ -- Sparkling Water FAQ for 3.2 `__ -- Sparkling Water FAQ for 3.1 `__ +- `Sparkling Water FAQ for 3.5 `__ +- `Sparkling Water FAQ for 3.4 `__ +- `Sparkling Water FAQ for 3.3 `__ +- `Sparkling Water FAQ for 3.2 `__ +- `Sparkling Water FAQ for 3.1 `__ - `Sparkling Water FAQ for 3.0 `__ - `Sparkling Water FAQ for 2.4 `__ - `Sparkling Water FAQ for 2.3 `__ From d456f7d36b864c543627b6e14465701f6f13c302 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 12:42:42 -0500 Subject: [PATCH 18/27] ht/link & spacing fixes --- h2o-docs/src/product/getting-started/hadoop-users.rst | 4 ++-- h2o-docs/src/product/getting-started/java-users.rst | 2 +- h2o-docs/src/product/getting-started/kubernetes-users.rst | 4 ++-- h2o-docs/src/product/welcome.rst | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/h2o-docs/src/product/getting-started/hadoop-users.rst b/h2o-docs/src/product/getting-started/hadoop-users.rst index f73713c787b2..a2f76f15efe0 100644 --- a/h2o-docs/src/product/getting-started/hadoop-users.rst +++ b/h2o-docs/src/product/getting-started/hadoop-users.rst @@ -269,8 +269,8 @@ For Hortonworks, configure the settings in Ambari. See more on `Hortonworks conf 4. In the **Scheduler** section, enter the amount of memory (in MB) to allocate in the **yarn.scheduler.maximum-allocation-mb** entry field. -.. 
figure:: ../images/TroubleshootingHadoopAmbariyarnscheduler.png - :alt: Ambari configuration scheduler section with the yarn.scheduler.maximum-allocation-mb section highlighted in red. + .. figure:: ../images/TroubleshootingHadoopAmbariyarnscheduler.png + :alt: Ambari configuration scheduler section with the yarn.scheduler.maximum-allocation-mb section highlighted in red. 5. Click **Save** and redeploy the cluster. diff --git a/h2o-docs/src/product/getting-started/java-users.rst b/h2o-docs/src/product/getting-started/java-users.rst index d555c2595dac..75fab287a3cc 100644 --- a/h2o-docs/src/product/getting-started/java-users.rst +++ b/h2o-docs/src/product/getting-started/java-users.rst @@ -1,7 +1,7 @@ Java users ========== -The following resources will help you create your own custom app that uses H2O. See `H2O's Java requirements `__for more information. +The following resources will help you create your own custom app that uses H2O. See `H2O's Java requirements `__ for more information. Java developer documentation ---------------------------- diff --git a/h2o-docs/src/product/getting-started/kubernetes-users.rst b/h2o-docs/src/product/getting-started/kubernetes-users.rst index 21a12dfd4e50..7e3dc7369437 100644 --- a/h2o-docs/src/product/getting-started/kubernetes-users.rst +++ b/h2o-docs/src/product/getting-started/kubernetes-users.rst @@ -5,8 +5,8 @@ H2O nodes must be treated as stateful by the Kubernetes environment because H2O H2O pods deployed on a Kubernetes cluster require a `headless service `__ for H2O node discovery. The headless service returns a set of addresses to all the underlying pods instead of load-balancing incoming requests to the underlying H2O pods. -.. figure:: images/h2o-k8s-clustering.png -:alt: Kubernetes headless service enclosing an underlying H2O cluster made of a StatefulSet. +.. figure:: ../images/h2o-k8s-clustering.png + :alt: Kubernetes headless service enclosing an underlying H2O cluster made of a StatefulSet. 
Kubernetes integration ---------------------- diff --git a/h2o-docs/src/product/welcome.rst b/h2o-docs/src/product/welcome.rst index 59871df66441..ba72c832c972 100644 --- a/h2o-docs/src/product/welcome.rst +++ b/h2o-docs/src/product/welcome.rst @@ -29,7 +29,7 @@ We recommend the following at minimum for compatibility with H2O: - Ubuntu 12.04 - RHEL/CentOS 6+ -- **Languages**: R and Python are not required to use H2O (unless you want to use H2O in those environments), but Java is always required (see `Java requirements ,http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html#java-requirements>`__). +- **Languages**: R and Python are not required to use H2O (unless you want to use H2O in those environments), but Java is always required (see `Java requirements `__). - R version 3+ - Python 3.6.x, 3.7.x, 3.8.x, 3.9.x, 3.10.x, 3.11.x From cdd6bb34bcf97c8c4aecb5a33f304deeb95f2c49 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Fri, 26 Apr 2024 13:03:15 -0500 Subject: [PATCH 19/27] ht/minor fix --- h2o-docs/src/product/getting-started/experienced-users.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/h2o-docs/src/product/getting-started/experienced-users.rst b/h2o-docs/src/product/getting-started/experienced-users.rst index f354c74c4389..d3cdf78bc9b8 100644 --- a/h2o-docs/src/product/getting-started/experienced-users.rst +++ b/h2o-docs/src/product/getting-started/experienced-users.rst @@ -31,6 +31,7 @@ Open the folder with H2O-3 in IntelliJ IDEA and it will automatically recognize For JUnit tests to pass, you may need multiple H2O nodes. 
Create a "Run/Debug" configuration: :: + Type: Application Main class: H2OApp Use class path of module: h2o-app From aa0b7fa2ff80f81dc0dbccb3421cee36af18847a Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Thu, 2 May 2024 09:25:06 -0500 Subject: [PATCH 20/27] ht/direct link for several sections --- h2o-docs/src/product/getting-started/hadoop-users.rst | 2 +- h2o-docs/src/product/getting-started/python-users.rst | 7 +++---- h2o-docs/src/product/getting-started/r-users.rst | 7 +++---- h2o-docs/src/product/welcome.rst | 2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/h2o-docs/src/product/getting-started/hadoop-users.rst b/h2o-docs/src/product/getting-started/hadoop-users.rst index a2f76f15efe0..7f2914657402 100644 --- a/h2o-docs/src/product/getting-started/hadoop-users.rst +++ b/h2o-docs/src/product/getting-started/hadoop-users.rst @@ -74,7 +74,7 @@ Walkthrough The following steps show you how to download or build H2O with Hadoop and the parameters involved in launching H2O from the command line. -1. Download the latest H2O release for your version of Hadoop from the `Downloads page `__. Refer to the H2O on Hadoop tab of the H2O download page for the latest stable release or the nightly bleeding edge release. +1. Download the latest H2O release for your version of Hadoop from the `Downloads page `__. Refer to the H2O on Hadoop tab of the H2O download page for the latest stable release or the nightly bleeding edge release. 2. Prepare the job input on the Hadoop node by unzipping the build file and changing to the directory with the Hadoop and H2O's driver jar files: :: diff --git a/h2o-docs/src/product/getting-started/python-users.rst b/h2o-docs/src/product/getting-started/python-users.rst index ceabd58c4e64..b1f9d183d9e7 100644 --- a/h2o-docs/src/product/getting-started/python-users.rst +++ b/h2o-docs/src/product/getting-started/python-users.rst @@ -11,13 +11,12 @@ The following sections will help you begin using Python for H2O. 
Installing H2O with Python ~~~~~~~~~~~~~~~~~~~~~~~~~~ -You can find instructions for using H2O with Python in the `Downloading and installing H2O <../downloading.html#install-in-python>`__ section and on the `Downloads page `__. +You can find instructions for using H2O with Python in the `Downloading and installing H2O <../downloading.html#install-in-python>`__ section and on the `Downloads page `__. From the Downloads page: -1. Select the version of H2O you want. -2. Click the Install in Python tab. -3. Follow the on-page instructions. +1. Click the Install in Python tab. +2. Follow the on-page instructions. Python documentation ~~~~~~~~~~~~~~~~~~~~ diff --git a/h2o-docs/src/product/getting-started/r-users.rst b/h2o-docs/src/product/getting-started/r-users.rst index 36730df6faef..b0b9a38e4d40 100644 --- a/h2o-docs/src/product/getting-started/r-users.rst +++ b/h2o-docs/src/product/getting-started/r-users.rst @@ -21,13 +21,12 @@ See `this cheatsheet on H2O in R `__ section and on the `Downloads page `__. +You can find instructions for using H2O with R in the `Downloading and installing H2O <../downloading.html#install-in-r>`__ section and on the `Downloads page `__. From the Downloads page: -1. Select the version of H2O you want. -2. Click the Install in R tab. -3. Follow the on-page instructions. +1. Click the Install in R tab. +2. Follow the on-page instructions. Checking your R version for H2O ''''''''''''''''''''''''''''''' diff --git a/h2o-docs/src/product/welcome.rst b/h2o-docs/src/product/welcome.rst index ba72c832c972..83d34083f31d 100644 --- a/h2o-docs/src/product/welcome.rst +++ b/h2o-docs/src/product/welcome.rst @@ -91,7 +91,7 @@ This section outlines requirements for optional ways you can run H2O. Optional Hadoop requirements '''''''''''''''''''''''''''' -Hadoop is only required if you want to deploy H2O on a Hadoop cluster. 
Supported versions are listed on the `Downloads `__ page (when you select the Install on Hadoop tab) and include: +Hadoop is only required if you want to deploy H2O on a Hadoop cluster. Supported versions are listed on the `Downloads `__ page (when you select the Install on Hadoop tab) and include: - Cloudera CDH 5.4+ - Hortonworks HDP 2.2+ From 66f6f4b183d9bc3c27b860f4fc075b52ea846931 Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Wed, 5 Jun 2024 08:36:55 -0500 Subject: [PATCH 21/27] ht/h2o > h2o-3 + others several frm >from note/admonition fixes reorder hadoop example several title lengthens --- .../src/product/getting-started/api-users.rst | 6 +- .../product/getting-started/docker-users.rst | 28 ++--- .../getting-started/experienced-users.rst | 22 ++-- .../product/getting-started/flow-users.rst | 2 +- .../getting-started/getting-started.rst | 26 ++--- .../product/getting-started/hadoop-users.rst | 107 ++++++++++-------- .../product/getting-started/java-users.rst | 6 +- .../getting-started/kubernetes-users.rst | 50 ++++---- .../product/getting-started/python-users.rst | 12 +- .../src/product/getting-started/r-users.rst | 18 +-- .../getting-started/sparkling-users.rst | 14 +-- h2o-docs/src/product/welcome.rst | 32 +++--- 12 files changed, 165 insertions(+), 158 deletions(-) diff --git a/h2o-docs/src/product/getting-started/api-users.rst b/h2o-docs/src/product/getting-started/api-users.rst index 99aa1f1cb2a0..5467442354e6 100644 --- a/h2o-docs/src/product/getting-started/api-users.rst +++ b/h2o-docs/src/product/getting-started/api-users.rst @@ -6,15 +6,15 @@ Our REST APIs are generated immediately out of the code, allowing you to impleme REST API references ------------------- -See the definitive `guide to H2O's REST API `__. +See the definitive `guide to H2O-3's REST API `__. Schemas ~~~~~~~ -See the definitive `guide to H2O's REST API schemas `__. +See the definitive `guide to H2O-3's REST API schemas `__. 
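As a sketch alongside the REST API passages above: H2O-3's REST endpoints are versioned paths served over plain HTTP, so they can be reached with nothing but the standard library. The host, port, and choice of the ``Frames`` route below are illustrative assumptions (a local cluster on the default port), not a definitive client.

```python
# Minimal sketch of composing an H2O-3 REST API call with the Python
# standard library. Assumes a local cluster on the default port 54321;
# /3/Frames lists the frames held by the cluster.
import urllib.request

def h2o_route(host, port, route, version=3):
    """Build a versioned H2O-3 REST API URL, e.g. http://host:port/3/Frames."""
    return f"http://{host}:{port}/{version}/{route}"

url = h2o_route("localhost", 54321, "Frames")
print(url)

# With a cluster actually running, the JSON response could be fetched:
# with urllib.request.urlopen(url) as resp:
#     body = resp.read()
```

The same URL-building convention applies to the other v3 routes; only the trailing route name changes.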
REST API example ~~~~~~~~~~~~~~~~ -See an `in-depth explanation of how H2O REST API commands are used `__. This explanation includes versioning, experimental APIs, verbs, status codes, formats, schemas, payloads, metadata, and examples. \ No newline at end of file +See an `in-depth explanation of how H2O-3 REST API commands are used `__. This explanation includes versioning, experimental APIs, verbs, status codes, formats, schemas, payloads, metadata, and examples. \ No newline at end of file diff --git a/h2o-docs/src/product/getting-started/docker-users.rst b/h2o-docs/src/product/getting-started/docker-users.rst index d53aa2aa1d83..58a0e0e831b4 100644 --- a/h2o-docs/src/product/getting-started/docker-users.rst +++ b/h2o-docs/src/product/getting-started/docker-users.rst @@ -1,14 +1,14 @@ Docker users ============ -This section describes how to use H2O on Docker. It walks you through the following steps: +This section describes how to use H2O-3 on Docker. It walks you through the following steps: 1. Installing Docker on Mac or Linux OS. 2. Creating and modifying your Dockerfile. 3. Building a Docker image from the Dockerfile. 4. Running the Docker build. -5. Launching H2O. -6. Accessing H2O frm the web browser or from Python/R. +5. Launching H2O-3. +6. Accessing H2O-3 from the web browser or from Python/R. Prerequisites ------------- @@ -22,13 +22,13 @@ Prerequisites .. note:: - Older Linux kernal versions can cause kernal panics that break Docker. There are ways around it, but attempt these at your own risk. Check the version of your kernel by running ``uname -r``. - - The Dockerfile always pulls the latest H2O release. + - The Dockerfile always pulls the latest H2O-3 release. - The Docker image only needs to be built once. Walkthrough ----------- -The following steps walk you through how to use H2O on Docker. +The following steps walk you through how to use H2O-3 on Docker. .. 
note:: @@ -67,8 +67,8 @@ This Dockerfile will do the following: - Obtain and update the base image (Ubuntu 14.0.4). - Install Java 8. -- Obtain and download the H2O build from H2O's S3 repository. -- Expose ports ``54321`` and ``54322`` in preparation for launching H2O on those ports. +- Obtain and download the H2O-3 build from H2O-3's S3 repository. +- Expose ports ``54321`` and ``54322`` in preparation for launching H2O-3 on those ports. Step 3: Build a Docker image from the Dockerfile ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -92,23 +92,23 @@ On a mac, use the argument ``-p 54321:54321`` to expressly map the port ``54321` docker run -ti -p 54321:54321 h2o.ai/{{branch_name}}:v5 /bin/bash -Step 5: Launch H2O -~~~~~~~~~~~~~~~~~~ +Step 5: Launch H2O-3 +~~~~~~~~~~~~~~~~~~~~ -Navigate to the ``/opt`` directory and launch H2O. Update the value of ``-Xmx`` to the amount of memory you want ot allocate to the H2O instance. By default, H2O will launch on port ``54321``. +Navigate to the ``/opt`` directory and launch H2O-3. Update the value of ``-Xmx`` to the amount of memory you want to allocate to the H2O-3 instance. By default, H2O-3 will launch on port ``54321``. .. code-block:: bash cd /opt java -Xmx1g -jar h2o.jar -Step 6: Access H2O from the web browser or Python/R -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Step 6: Access H2O-3 from the web browser or Python/R +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. tabs:: .. tab:: On Linux - After H2O launches, copy and paste the IP address and port of the H2O instance into the address bar of your browser. In the following example, the IP is ``172.17.0.5:54321``. ..
code-block:: bash @@ -145,7 +145,7 @@ You can also view the IP address (``192.168.99.100`` in the following example) b Access Flow ''''''''''' -After obtaining the IP address, point your browser to the specified IP address and port to open Flow. In R and Python, you can access the instance by installing the latest version of the H2O R or Python package and then initializing H2O: +After obtaining the IP address, point your browser to the specified IP address and port to open Flow. In R and Python, you can access the instance by installing the latest version of the H2O R or Python package and then initializing H2O-3: .. tabs:: .. code-tab:: python diff --git a/h2o-docs/src/product/getting-started/experienced-users.rst b/h2o-docs/src/product/getting-started/experienced-users.rst index d3cdf78bc9b8..c43086edff87 100644 --- a/h2o-docs/src/product/getting-started/experienced-users.rst +++ b/h2o-docs/src/product/getting-started/experienced-users.rst @@ -1,7 +1,7 @@ Experienced users ================= -If you've used previous versions of H2O, the following links will help guide you through the process of upgrading H2O. +If you've used previous versions of H2O-3, the following links will help guide you through the process of upgrading H2O-3. Changes ------- @@ -9,26 +9,26 @@ Changes Change log ~~~~~~~~~~ -`This page houses the most recent changes in the latest build of H2O `__. It lists new features, improvements, security updates, documentation improvements, and bug fixes for each release. +`This page houses the most recent changes in the latest build of H2O-3 `__. It lists new features, improvements, security updates, documentation improvements, and bug fixes for each release. API-related changes ~~~~~~~~~~~~~~~~~~~ -The `API-related changes `__ section describes changes made to H2O that can affect backward compatibility. +The `API-related changes `__ section describes changes made to H2O-3 that can affect backward compatibility. 
Developers ---------- -If you're looking to use H2O to help you develop your own apps, the following links will provide helpful references. +If you're looking to use H2O-3 to help you develop your own apps, the following links will provide helpful references. Gradle ~~~~~~ -H2O's build is completely managed by Gradle. Any IDEA with Gradle support is sufficient for H2O-3 development. The latest versions of IntelliJ IDEA are thoroughly tested and proven to work well. +H2O-3's build is completely managed by Gradle. Any IDE with Gradle support is sufficient for H2O-3 development. The latest versions of IntelliJ IDEA are thoroughly tested and proven to work well. Open the folder with H2O-3 in IntelliJ IDEA and it will automatically recognize that Gradle is requried and will import the project. The Gradle wrapper present in the repository itself may be used manually/directly to build and test if required. -For JUnit tests to pass, you may need multiple H2O nodes. Create a "Run/Debug" configuration: +For JUnit tests to pass, you may need multiple H2O-3 nodes. Create a "Run/Debug" configuration: :: @@ -41,13 +41,13 @@ After starting multiple "worker" node processes in addition to the JUnit test pr Maven install ~~~~~~~~~~~~~ -You can view instructions for using H2O with Maven on the `Downloads page `__. +You can view instructions for using H2O-3 with Maven on the `Downloads page `__. 1. Select H2O Open Source Platform or scroll down to H2O. -2. Select the version of H2O you want to install (latest stable or nightly build). +2. Select the version of H2O-3 you want to install (latest stable or nightly build). 3. Click the Use from Maven tab. -`This page provides information on how to build a version of H2O that generates the correct IDE files `__ for your Maven installation. +`This page provides information on how to build a version of H2O-3 that generates the correct IDE files `__ for your Maven installation.
Developer resources ~~~~~~~~~~~~~~~~~~~ @@ -55,7 +55,7 @@ Developer resources Documentation ''''''''''''' -See the detailed `instructions on how to build and launch H2O `__, including how to clone the repository, how to pull from the repository, and how to install required dependencies. +See the detailed `instructions on how to build and launch H2O-3 `__, including how to clone the repository, how to pull from the repository, and how to install required dependencies. Droplet project templates ^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -77,4 +77,4 @@ Join the H2O community Contributing code ~~~~~~~~~~~~~~~~~ -If you're interested in contributing code to H2O, we appreciate your assistance! See `how to contribute to H2O `__. This document describes how to access our list of issues, or suggested tasks for contributors, and how to contact us. \ No newline at end of file +If you're interested in contributing code to H2O-3, we appreciate your assistance! See `how to contribute to H2O-3 `__. This document describes how to access our list of issues, or suggested tasks for contributors, and how to contact us. \ No newline at end of file diff --git a/h2o-docs/src/product/getting-started/flow-users.rst b/h2o-docs/src/product/getting-started/flow-users.rst index eeecf05a7a52..30e8d1765cd7 100644 --- a/h2o-docs/src/product/getting-started/flow-users.rst +++ b/h2o-docs/src/product/getting-started/flow-users.rst @@ -1,6 +1,6 @@ Flow users ========== -H2O Flow is a notebook-style open source UI for H2O. It's a web-based interactive environment that lets you combine code execution, text, mathematics, plots, and rich media in a single document (similar to iPython Notebooks). +H2O Flow is a notebook-style open source UI for H2O-3. It's a web-based interactive environment that lets you combine code execution, text, mathematics, plots, and rich media in a single document (similar to iPython Notebooks). See more about `H2O Flow <../flow.html>`__. 
\ No newline at end of file diff --git a/h2o-docs/src/product/getting-started/getting-started.rst b/h2o-docs/src/product/getting-started/getting-started.rst index c4b7ff3b6c53..79347dcb271c 100644 --- a/h2o-docs/src/product/getting-started/getting-started.rst +++ b/h2o-docs/src/product/getting-started/getting-started.rst @@ -1,26 +1,26 @@ Getting started =============== -Here are some helpful links to help you get started learning H2O. +Here are some helpful links to help you get started learning H2O-3. Downloads page -------------- -To begin, download a copy of H2O from the `Downloads page `__. +To begin, download a copy of H2O-3 from the `Downloads page `__. -1. Click H2O Open Source Platform or scroll down to the H2O section. Here you have access to the different ways to download H2O: +1. Click H2O Open Source Platform or scroll down to the H2O section. Here you have access to the different ways to download H2O-3: -- Latest stable: this version is the most recentl alpha release version of H2O. -- Nightly bleeding edge: this version contains all the latest changes to H2O that haven't been released officially yet. -- Prior releases: this houses all previously released versions of H2O. +- Latest stable: this version is the most recent stable release version of H2O-3. +- Nightly bleeding edge: this version contains all the latest changes to H2O-3 that haven't been released officially yet. +- Prior releases: this houses all previously released versions of H2O-3. For first-time users, we recomment downloading the latest alpha release and the default standalone option (the Download and Run tab) as the installation method. Make sure to install Java if it is not already installed. .. note:: By default, this setup is open. Follow `security guidelines <../security.html>`__ if you want to secure your installation.
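Since the getting-started text above asks first-time users to install Java before launching the standalone download, a quick pre-flight check can be scripted. This is a hedged sketch, not part of H2O-3 itself; the message wording is illustrative.

```python
# Sketch: verify the Java prerequisite before running `java -jar h2o.jar`.
import shutil

java_path = shutil.which("java")
if java_path is None:
    print("Java not found on PATH -- install a 64-bit JRE before launching H2O-3.")
else:
    print(f"Java found at {java_path}; ready to run `java -jar h2o.jar`.")
```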
-Using Flow - H2O's web UI -------------------------- +Using Flow - H2O-3's web UI +--------------------------- `This section describes our web interface, Flow `__. Flow is similar to IPython notebooks and allows you to create a visual workflow to share with others. @@ -39,7 +39,7 @@ The following examples use H2O Flow. To see a step-by-step example of one of our Launch from the command line ---------------------------- -You can configure H2O when you launch it from the command line. For example, you can specify a different directory for saved Flow data, you could allocate more memory, or you could use a flatfile for a quick configuration of your cluster. See more details about `configuring the additional options when you launch H2O `__. +You can configure H2O-3 when you launch it from the command line. For example, you can specify a different directory for saved Flow data, you could allocate more memory, or you could use a flatfile for a quick configuration of your cluster. See more details about `configuring the additional options when you launch H2O-3 `__. Algorithms @@ -50,7 +50,7 @@ Algorithms Use cases --------- -H2O can handle a wide variety of practical use cases due to its robust catalogue of supported algorithms, wrappers, and machine learning tools. The following are some example problems H2O can handle: +H2O-3 can handle a wide variety of practical use cases due to its robust catalogue of supported algorithms, wrappers, and machine learning tools. The following are some example problems H2O-3 can handle: - Determining outliers in housing prices based on number of bedrooms, number of bathrooms, access to waterfront, etc. through `anomaly detection `__. - Revealing natural customer `segments `__ in retail data to determine which groups are purchasing which products. @@ -58,14 +58,14 @@ H2O can handle a wide variety of practical use cases due to its robust catalogue - Unsampling the minority class for credit card fraud data to handle `imbalanced data `__. 
- `Detecting drift `__ on avocado sales pre-2018 and 2018+ to determine if a model is still relevant for new data. -See our `best practice tutorials `__ to further explore the capabilities of H2O. +See our `best practice tutorials `__ to further explore the capabilities of H2O-3. New user quickstart ------------------- -You can follow these steps to quickly get up and running with H2O directly from the `H2O-3 repository `__. These steps will guide you through cloning the repository, starting H2O, and importing a dataset. Once you're up and running, you'll be better able to follow examples included within this user guide. +You can follow these steps to quickly get up and running with H2O-3 directly from the `H2O-3 repository `__. These steps will guide you through cloning the repository, starting H2O-3, and importing a dataset. Once you're up and running, you'll be better able to follow examples included within this user guide. -1. In a terminal window, create a folder for the H2O repository: +1. In a terminal window, create a folder for the H2O-3 repository: .. code-block:: bash diff --git a/h2o-docs/src/product/getting-started/hadoop-users.rst b/h2o-docs/src/product/getting-started/hadoop-users.rst index 7f2914657402..8cc37ed5e431 100644 --- a/h2o-docs/src/product/getting-started/hadoop-users.rst +++ b/h2o-docs/src/product/getting-started/hadoop-users.rst @@ -1,7 +1,7 @@ Hadoop users ============ -This section describes how to use H2O on Hadoop. +This section describes how to use H2O-3 on Hadoop. Supported Versions ------------------ @@ -44,18 +44,18 @@ Supported Versions Important points to remember: - - The command used to launch H2O differs from previous versions (see the `Walkthrough `__ section). - - Launching H2O on Hadoop requires at least 6GB of memory. - - Each H2O node runs as a mapper (run only one mapper per host). + - The command used to launch H2O-3 differs from previous versions (see the `Walkthrough `__ section). 
+ - Launching H2O-3 on Hadoop requires at least 6GB of memory. + - Each H2O-3 node runs as a mapper (run only one mapper per host). - There are no combiners or reducers. - - Each H2O cluster needs a unique job name. + - Each H2O-3 cluster needs a unique job name. - ``-mapperXmx``, ``-nodes``, and ``-output`` are required. - - Root permissions are not required (just unzip the H2O ZIP file on any single node). + - Root permissions are not required (just unzip the H2O-3 ZIP file on any single node). Prerequisite: Open communication paths -------------------------------------- -H2O communicates using two communication paths. Verify these paths are open and available for use by H2O. +H2O-3 communicates using two communication paths. Verify these paths are open and available for use by H2O-3. Path 1: Mapper to driver ~~~~~~~~~~~~~~~~~~~~~~~~ @@ -65,38 +65,38 @@ Optionally specify this port using the ``-driverport`` option in the ``hadoop ja Path 2: Mapper to mapper ~~~~~~~~~~~~~~~~~~~~~~~~ -Optionally specify this port using the ``-baseport`` option in the ``hadoop jar`` command see `Hadoop launch parameters `__). This port and the next subsequent port are opened on the mapper hosts (i.e. the Hadoop worker nodes) where the H2O mapper nodes are placed by the Resource Manager. By default, ports ``54321`` and ``54322`` are used. +Optionally specify this port using the ``-baseport`` option in the ``hadoop jar`` command see `Hadoop launch parameters `__). This port and the next subsequent port are opened on the mapper hosts (i.e. the Hadoop worker nodes) where the H2O-3 mapper nodes are placed by the Resource Manager. By default, ports ``54321`` and ``54322`` are used. -The mapper port is adaptive: if ``54321`` and ``54322`` are not available, h2O will try ``54323`` and ``54324`` and so on. 
The mapper port is designed to be adaptive because sometimes if the YARN cluster is low on resources, YARN will place two H2O mappers for the same H2O cluster request on the same physical host. For this reason, we recommend opening a range of more than two ports: 20 ports should be sufficient. +The mapper port is adaptive: if ``54321`` and ``54322`` are not available, H2O-3 will try ``54323`` and ``54324`` and so on. The mapper port is designed to be adaptive because sometimes if the YARN cluster is low on resources, YARN will place two H2O-3 mappers for the same H2O-3 cluster request on the same physical host. For this reason, we recommend opening a range of more than two ports: 20 ports should be sufficient. Walkthrough ----------- -The following steps show you how to download or build H2O with Hadoop and the parameters involved in launching H2O from the command line. +The following steps show you how to download or build H2O-3 with Hadoop and the parameters involved in launching H2O-3 from the command line. -1. Download the latest H2O release for your version of Hadoop from the `Downloads page `__. Refer to the H2O on Hadoop tab of the H2O download page for the latest stable release or the nightly bleeding edge release. -2. Prepare the job input on the Hadoop node by unzipping the build file and changing to the directory with the Hadoop and H2O's driver jar files: +1. Download the latest H2O-3 release for your version of Hadoop from the `Downloads page `__. Refer to the H2O-3 on Hadoop tab of the H2O-3 download page for the latest stable release or the nightly bleeding edge release. +2. Prepare the job input on the Hadoop node by unzipping the build file and changing to the directory with the Hadoop and H2O-3's driver jar files: :: unzip h2o-{{project_version}}-*.zip cd h2o-{{project_version}}-* -3. Launch H2O nodes and form a cluster on the Hadoop cluster by running: +3. 
Launch H2O-3 nodes and form a cluster on the Hadoop cluster by running: :: hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g - The above command launches a 6g node of H2O. We recommend you launch the cluster with at least four times the memory of your data file size. + The above command launches a 6g node of H2O-3. We recommend you launch the cluster with at least four times the memory of your data file size. - *mapperXmx* is the mapper size or the amount of memory allocated to each node. Specify at least 6 GB. - *nodes* is the number of nodes requested to form the cluster. - - *output* is the name of the directory created each time a H2O cluster is created so it is necessary for the name to be unique each time it is launched. + - *output* is the name of the directory created each time a H2O-3 cluster is created so it is necessary for the name to be unique each time it is launched. -4. Monitor your job by directing your web browser to your standard job tracker web UI. To access H2O's web UI, direct your web browser to one of the launched instances. If you are unsure where your JVM is launched, review the output from your command after the nodes have clouded and formed a cluster. Any nodes' IP addresses will work as there is no master node: +4. Monitor your job by directing your web browser to your standard job tracker web UI. To access H2O-3's web UI, direct your web browser to one of the launched instances. If you are unsure where your JVM is launched, review the output from your command after the nodes have clouded and formed a cluster. Any nodes' IP addresses will work as there is no master node: :: @@ -121,30 +121,36 @@ Hadoop launch parameters - ``-driverif driver callback interface>``: Specify the IP address for callback messages from the mapper to the driver. - ``-driverport callback interface>``: Specify the port number for callback messages from the mapper to the driver. 
- ``-driverportrange <port range of driver callback interface>``: Specify the allowed port range of the driver callback interface, e.g. 50000-55000.
-- ``-network <IPv4network>[,<IPv4network2>]``: Specify the IPv4 network(s) to bind to the H2O nodes; multiple networks can be specified to force H2O to use the specified host in the Hadoop cluster. ``10.1.2.0/24`` allows 256 possibilities.
+- ``-network <IPv4network>[,<IPv4network2>]``: Specify the IPv4 network(s) to bind to the H2O-3 nodes; multiple networks can be specified to force H2O-3 to use the specified host in the Hadoop cluster. ``10.1.2.0/24`` allows 256 possibilities.
- ``-timeout <seconds>``: Specify the timeout duration (in seconds) to wait for the cluster to form before failing.
-  **Note**: The default value is 120 seconds; if your cluster is very busy, this may not provide enough time for the nodes to launch. If H2O does not launch, try increasing this value (for example, ``-timeout 600``).
+  .. note::
+
+     The default value is 120 seconds; if your cluster is very busy, this may not provide enough time for the nodes to launch. If H2O-3 does not launch, try increasing this value (for example, ``-timeout 600``).
- ``-disown``: Exit the driver after the cluster forms.
-  **Note**: For Qubole users who include the ``-disown`` flag, if your cluster is dying right after launch, add ``-Dmapred.jobclient.killjob.onexit=false`` as a launch parameter.
+  .. note::
+
+     For Qubole users who include the ``-disown`` flag, if your cluster is dying right after launch, add ``-Dmapred.jobclient.killjob.onexit=false`` as a launch parameter.
- ``-notify <notification file name>``: Specify a file to write when the cluster is up. The file contains the IP and port of the embedded web server for one of the nodes in the cluster. All mappers must start before the H2O cluster is considered "up".
- ``-mapperXmx <per-mapper memory>``: Specify the amount of memory to allocate to H2O (at least 6g).
- ``-extramempercent``: Specify the extra memory for internal JVM use outside of the Java heap. This is a percentage of ``mapperXmx``.
-  **Recommendation**: Set this to a high value when running XGBoost (for example, 120).
+  .. admonition:: Recommendation
+
+     Set this to a high value when running XGBoost (for example, 120).
- ``-n | -nodes <number of nodes>``: Specify the number of nodes.
- ``-nthreads <maximum number of threads>``: Specify the maximum number of parallel threads of execution. This is usually capped by the max number of vcores.
-- ``-baseport <initialization port>``: Specify the initialization port for the H2O nodes. The default is ``54321``.
+- ``-baseport <initialization port>``: Specify the initialization port for the H2O-3 nodes. The default is ``54321``.
- ``-license <license file path>``: Specify the local filesystem directory and the license file name.
- ``-o | -output <HDFS output directory>``: Specify the HDFS directory for the output.
-- ``-flow_dir <flow directory>``: Specify the directory for saved flows. By default, H2O will try to find the HDFS home directory to use as the directory for flows. If the HDFS home directory is not found, flows cannot be saved unless a directory is specified using ``-flow_dir``.
+- ``-flow_dir <flow directory>``: Specify the directory for saved flows. By default, H2O-3 will try to find the HDFS home directory to use as the directory for flows. If the HDFS home directory is not found, flows cannot be saved unless a directory is specified using ``-flow_dir``.
- ``-port_offset <offset>``: This parameter allows you to specify the relationship of the API port ("web port") and the internal communication port. The h2o port and API port are derived from each other, and we cannot fully decouple them. Instead, we allow you to specify an offset such that h2o port = API port + offset. This allows you to move the communication port to a specific range that can be firewalled.
- ``-proxy``: Enables Proxy mode.
-- ``-report_hostname``: This flag allows the user to specify the machine hostname instead of the IP address when launching H2O Flow. This option can only be used when H2O on Hadoop is started in Proxy mode (with ``-proxy``).
+- ``-report_hostname``: This flag allows the user to specify the machine hostname instead of the IP address when launching H2O Flow. This option can only be used when H2O-3 on Hadoop is started in Proxy mode (with ``-proxy``).

JVM arguments
~~~~~~~~~~~~~

@@ -156,14 +162,14 @@ JVM arguments

Configure HDFS
--------------

-When running H2O on Hadoop, you do not need to worry about configuring HDFS. The ``-hdfs_config`` flag is used to configure access to HDFS from a standalone cluster. However, it's also used for anything that requires Hadoop (such as Hive).
+When running H2O-3 on Hadoop, you do not need to worry about configuring HDFS. The ``-hdfs_config`` flag is used to configure access to HDFS from a standalone cluster. However, it's also used for anything that requires Hadoop (such as Hive).

If you are accessing HDFS/Hive without Kerberos, then you will need to pass ``-hdfs_config`` and the path to the ``core-site.xml`` that you got from your Hadoop edge node. If you are accessing Kerberized Hadoop, you will also need to pass ``hdfs-site.xml``.

Access S3 data from Hadoop
--------------------------

-H2O launched on Hadoop can access S3 data in addition to HDFS. To enable access, follow these instructions:
+H2O-3 launched on Hadoop can access S3 data in addition to HDFS. To enable access, follow these instructions:

1. Edit Hadoop's ``core-site.xml``.
2. Set the ``HADOOP_CONF_DIR`` environment property to the directory containing the ``core-site.xml``. See the `core-site.xml example `__ for more information.

@@ -172,7 +178,7 @@

Typically the configuration directory for most Hadoop distributions is ``/etc/hadoop/conf``.

-You can also pass the S3 credentials when launching H2O with the Hadoop jar command. use the ``-D`` flag to pass the credentials:
+You can also pass the S3 credentials when launching H2O-3 with the Hadoop jar command. Use the ``-D`` flag to pass the credentials:

.. code-block:: bash

@@ -186,27 +192,28 @@ where:

3. Import the data with the S3 URL path:

.. tabs::
-   .. code-tab:: r R
+   .. code-tab:: bash Flow

-      h2o.importFile(path = "s3://bucket/path/to/file.csv")
+      importFiles [ "s3:/path/to/bucket/file/file.tab.gz" ]

    .. code-tab:: python

       h2o.import_frame(path = "s3://bucket/path/to/file.csv")

-   .. code-tab:: bash Flow
+   .. code-tab:: r R
+
+      h2o.importFile(path = "s3://bucket/path/to/file.csv")

-      importFiles [ "s3:/path/to/bucket/file/file.tab.gz" ]

YARN best practices
-------------------

-YARN (Yet Another Resource Negotiator) is a resource management framework. H2O can be launched as an application on YARN. If you want to run H2O on Hadoop, you are essentially running H2O on YARN. We strongly recommend using YARN to manage your cluster resources.
+YARN (Yet Another Resource Negotiator) is a resource management framework. H2O-3 can be launched as an application on YARN. If you want to run H2O-3 on Hadoop, you are essentially running H2O-3 on YARN. We strongly recommend using YARN to manage your cluster resources.

-H2O with YARN
-~~~~~~~~~~~~~
+H2O-3 with YARN
+~~~~~~~~~~~~~~~

-When you launch H2O on Hadoop using the ``hadoop jar`` command, YARN allocates the necessary resources to launch the requested number of nodes. H2O launches as a map-reduce (V2) task where each mapper is an H2O node of the specified size:
+When you launch H2O-3 on Hadoop using the ``hadoop jar`` command, YARN allocates the necessary resources to launch the requested number of nodes. H2O-3 launches as a map-reduce (V2) task where each mapper is an H2O-3 node of the specified size:

.. code-block:: bash

@@ -220,12 +227,12 @@ Occasionally, YARN may reject a job request. This usually occurs because there

Failure with too little memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-If YARN rejects the job request, try re-launching the job with less memory first to see if that is the cause of the failure. Specify smaller values for ``-mapperXmx`` (we recommend a minimum of ``2g``) and ``-nodes`` (start with ``1``) to confirm that H2O can launch successfully.
+If YARN rejects the job request, try re-launching the job with less memory first to see if that is the cause of the failure. Specify smaller values for ``-mapperXmx`` (we recommend a minimum of ``2g``) and ``-nodes`` (start with ``1``) to confirm that H2O-3 can launch successfully.

Failure due to configuration issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-To resolve configuration issues, adjust the maximum memory that YARN will allow when launching each mapper. If the cluster manager settings are configured for the default maximum memory size but the memory required for the request exceeds that amount, YARN will not launch and H2O will time out.
+To resolve configuration issues, adjust the maximum memory that YARN will allow when launching each mapper. If the cluster manager settings are configured for the default maximum memory size but the memory required for the request exceeds that amount, YARN will not launch and H2O-3 will time out.

If you are using the default configuration, change the configuration settings in your cluster manager to specify memory allocation when launching mapper tasks. To calculate the amount of memory required for a successful launch, use the following formula:

@@ -291,7 +298,7 @@ To verify the values were changed, check the values for the following properties

Limit CPU usage
~~~~~~~~~~~~~~~

-To limit the number of CPUs used by H2O, use the ``-nthreads`` option and specify the maximum number of CPUs for a single container to use. The following example limits the number of CPUs to four:
+To limit the number of CPUs used by H2O-3, use the ``-nthreads`` option and specify the maximum number of CPUs for a single container to use. The following example limits the number of CPUs to four:

.. code-block:: bash

@@ -306,7 +313,7 @@ To limit the number of CPUs used by H2O, use the ``-nthreads`` option and specif

Specify a queue
~~~~~~~~~~~~~~~

-If you do not specify a queue when launching H2O, H2O jobs are submitted to the default queue. Jobs submitted to the default queue have a lower priority than jobs submitted to a specific queue.
+If you do not specify a queue when launching H2O-3, H2O-3 jobs are submitted to the default queue. Jobs submitted to the default queue have a lower priority than jobs submitted to a specific queue.

To specify a queue with Hadoop, enter ``-Dmapreduce.job.queuename=<queue name>`` (where ``<queue name>`` is the name of the queue) when launching Hadoop.

@@ -324,7 +331,7 @@ Specify an output directory

To prevent overwriting multiple users' files, each job must have a unique output directory name. Change the ``-output hdfsOutputDir`` argument (where ``hdfsOutputDir`` is the name of the directory).

-Alternatively, you can delete the directory (manually or by using a script) instead of creating a unique directory each time you launch H2O.
+Alternatively, you can delete the directory (manually or by using a script) instead of creating a unique directory each time you launch H2O-3.

YARN Customization
~~~~~~~~~~~~~~~~~~

@@ -340,34 +347,34 @@ Access logs for a YARN job with the ``yarn logs -applicationId `

This command must be run by the same userID as the job owner and can only be run after the job has finished.

-How H2O runs on YARN
-~~~~~~~~~~~~~~~~~~~~
+How H2O-3 runs on YARN
+~~~~~~~~~~~~~~~~~~~~~~

Let's say that you have a Hadoop cluster with six worker nodes and six HDFS nodes. For architectural diagramming purposes, the worker nodes and HDFS nodes are shown as separate blocks in the following diagrams, but they may be running on the same physical machines.

-The ``hadoop jar`` command that you run on the edge node talks to the YARN Resource Manager to launch an H2O MRv2 (map-reduce V2) job. The Resource Manager then places the requested number of H2O nodes (i.e. MRv2 mappers and YARN mappers), three in this example, on worker nodes.
+The ``hadoop jar`` command that you run on the edge node talks to the YARN Resource Manager to launch an H2O-3 MRv2 (map-reduce V2) job. The Resource Manager then places the requested number of H2O-3 nodes (i.e. MRv2 mappers and YARN mappers), three in this example, on worker nodes.

.. figure:: ../images/h2o-on-yarn-1.png
-   :alt: Hadoop cluster showing YARN resource manager placing requested number of H2O nodes on worker nodes.
+   :alt: Hadoop cluster showing YARN resource manager placing requested number of H2O-3 nodes on worker nodes.

-Once the H2O job's nodes all start, they find each other and create an H2O cluster (as shown by the dark blue line encircling the three H2O nodes in the following figure). The three H2O nodes work together to perform distributed Machine Learning functions as a group.
+Once the H2O-3 job's nodes all start, they find each other and create an H2O-3 cluster (as shown by the dark blue line encircling the three H2O-3 nodes in the following figure). The three H2O-3 nodes work together to perform distributed Machine Learning functions as a group.

.. note::

-   The three worker nodes that are not part of the H2O job have been removed from the following picture for explanatory purposes. They aren't part of the compute or memory resources used by the H2O job, The full complement of HDFS is still available, though.
+   The three worker nodes that are not part of the H2O-3 job have been removed from the following picture for explanatory purposes. They aren't part of the compute or memory resources used by the H2O-3 job. The full complement of HDFS is still available, though.

.. figure:: ../images/h2o-on-yarn-2.png
-   :alt: Hadoop cluster showing H2O nodes forming a cluster to perform distributed machine learning functions as a group.
+   :alt: Hadoop cluster showing H2O-3 nodes forming a cluster to perform distributed machine learning functions as a group.

-Data is then read in from HDFS once (seen by the red lines in the following figure) and stored as distributed H2O frames in H2O's in-memory column-compressed distributed key-value (DKV) store.
+Data is then read in from HDFS once (as shown by the red lines in the following figure) and stored as distributed H2O-3 frames in H2O-3's in-memory column-compressed distributed key-value (DKV) store.

.. figure:: ../images/h2o-on-yarn-3.png
-   :alt: Hadoop cluster showing data read from HDFS and stored as distributed H2O frames.
+   :alt: Hadoop cluster showing data read from HDFS and stored as distributed H2O-3 frames.

-Machine Learning algorithms then run very fast in a parallel and distributed way (as shown by the light blue lines in the following image). They iteratively sweep the data over and over again to build models. This is why the in-memory storage makes H2O fast.
+Machine Learning algorithms then run very fast in a parallel and distributed way (as shown by the light blue lines in the following image). They iteratively sweep the data over and over again to build models. This is why the in-memory storage makes H2O-3 fast.

.. note::

@@ -379,7 +386,7 @@ Machine Learning algorithms then run very fast in a parallel and distributed way

Hadoop and AWS
--------------

-AWS access credential configuration is provided to H2O by the Hadoop environment itself. There are a number of Hadoop distributions, and each distribution supports different means/providers to configure access to AWS. It's considered best practice to follow your Hadoop provider's guide.
You can access multiple buckets with distinct credentials by means of the S3A protocol. See the `Hadoop documentation `__ for more information. If you use derived distributions, we advise you to follow the respective documentation of your distribution and the specific version you are using. diff --git a/h2o-docs/src/product/getting-started/java-users.rst b/h2o-docs/src/product/getting-started/java-users.rst index 75fab287a3cc..0ef506d1ab3b 100644 --- a/h2o-docs/src/product/getting-started/java-users.rst +++ b/h2o-docs/src/product/getting-started/java-users.rst @@ -1,7 +1,7 @@ Java users ========== -The following resources will help you create your own custom app that uses H2O. See `H2O's Java requirements `__ for more information. +The following resources will help you create your own custom app that uses H2O-3. See `H2O-3's Java requirements `__ for more information. Java developer documentation ---------------------------- @@ -9,12 +9,12 @@ Java developer documentation Core components ~~~~~~~~~~~~~~~ -The definitive `Java API guide for the core components of H2O `__. +The definitive `Java API guide for the core components of H2O-3 `__. Algorithms ~~~~~~~~~~ -The definitive `Java API guide for the algorithms used by H2O `__. +The definitive `Java API guide for the algorithms used by H2O-3 `__. Example ------- diff --git a/h2o-docs/src/product/getting-started/kubernetes-users.rst b/h2o-docs/src/product/getting-started/kubernetes-users.rst index 7e3dc7369437..e391f7e45a38 100644 --- a/h2o-docs/src/product/getting-started/kubernetes-users.rst +++ b/h2o-docs/src/product/getting-started/kubernetes-users.rst @@ -1,31 +1,31 @@ Kubernetes users ================ -H2O nodes must be treated as stateful by the Kubernetes environment because H2O is a stateful application. H2O nodes are, therefore, spawned together and deallocated together as a single unit. Subsequently, Kubernetes tooling for stateless applications is not applicable to H2O. 
In Kubernetes, a set of pods sharing a common state is named a `StatefulSet `__. +H2O-3 nodes must be treated as stateful by the Kubernetes environment because H2O-3 is a stateful application. H2O-3 nodes are, therefore, spawned together and deallocated together as a single unit. Subsequently, Kubernetes tooling for stateless applications is not applicable to H2O-3. In Kubernetes, a set of pods sharing a common state is named a `StatefulSet `__. -H2O pods deployed on a Kubernetes cluster require a `headless service `__ for H2O node discovery. The headless service returns a set of addresses to all the underlying pods instead of load-balancing incoming requests to the underlying H2O pods. +H2O-3 pods deployed on a Kubernetes cluster require a `headless service `__ for H2O-3 node discovery. The headless service returns a set of addresses to all the underlying pods instead of load-balancing incoming requests to the underlying H2O-3 pods. .. figure:: ../images/h2o-k8s-clustering.png - :alt: Kubernetes headless service enclosing an underlying H2O cluster made of a StatefulSet. + :alt: Kubernetes headless service enclosing an underlying H2O-3 cluster made of a StatefulSet. Kubernetes integration ---------------------- -This section outlines how to integrate H2O and Kubernetes. +This section outlines how to integrate H2O-3 and Kubernetes. Requirements ~~~~~~~~~~~~ -To spawn an H2O cluster inside of a Kubernetes cluster, you need the following: +To spawn an H2O-3 cluster inside of a Kubernetes cluster, you need the following: - A Kubernetes cluster: either local development (e.g. `ks3 `__) or easy start (e.g. `OpenShift `__ by RedHat) -- A Docker image with H2O inside. -- A Kubernetes deployment definition with a StatefulSet of H2O pods and a headless service. +- A Docker image with H2O-3 inside. +- A Kubernetes deployment definition with a StatefulSet of H2O-3 pods and a headless service. 
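To make the headless-service requirement concrete, a minimal manifest might look like the following sketch. The ``h2o-service`` name is illustrative; ``clusterIP: None``, port ``54321``, and the ``h2o-k8s`` selector follow the values explained in this page's service discussion:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: h2o-service          # illustrative name; use your own service name
spec:
  type: ClusterIP
  clusterIP: None            # None makes the service headless
  selector:
    app: h2o-k8s             # must match the label on the H2O-3 pods
  ports:
    - protocol: TCP
      port: 54321            # default H2O-3 port
```

Because the service is headless, DNS lookups against it return the addresses of all matching pods, which is what the H2O-3 nodes need for discovery.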
Create the Docker image
~~~~~~~~~~~~~~~~~~~~~~~

-A simple Docker container with H2O running on startup is enough:
+A simple Docker container with H2O-3 running on startup is enough:

.. code:: bash

@@ -68,18 +68,18 @@ First, create a headless service on Kubernetes:

Where:

- ``clusterIP: None``: This setting defines the service as headless.
-- ``port: 54321``: This setting is the default H2O port. Users and client libraries use this port to talk to the H2O cluster.
-- ``app: h2o-k8s``: This setting is of great importance because it is the name of the application with the H2O pods inside. While the name is arbitrarily chosen in this example, it must correspond to the chosen H2O deployment name.
+- ``port: 54321``: This setting is the default H2O-3 port. Users and client libraries use this port to talk to the H2O-3 cluster.
+- ``app: h2o-k8s``: This setting is of great importance because it is the name of the application with the H2O-3 pods inside. While the name is arbitrarily chosen in this example, it must correspond to the chosen H2O-3 deployment name.

-Create the H2O deployment
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Create the H2O-3 deployment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

-We strongly recomming you run H2O as a StatefulSet on your Kubernetes cluster. Treating H2O nodes as stateful ensures the following:
+We strongly recommend you run H2O-3 as a StatefulSet on your Kubernetes cluster. Treating H2O-3 nodes as stateful ensures the following:

-- H2O nodes will be treated as a single unit and will be brought up and down gracefully and together.
-- No attempts will be made by a Kubernetes healthcheck to restart individual H2O nodes in case of an error.
+- H2O-3 nodes will be treated as a single unit and will be brought up and down gracefully and together.
+- No attempts will be made by a Kubernetes healthcheck to restart individual H2O-3 nodes in case of an error.
- The cluster will be restarted as a whole, if required.
-- Persistent storages and volumes associated with the StatefulSet of H2O nodes will not be deleted once the cluster is brought down.
+- Persistent storage and volumes associated with the StatefulSet of H2O-3 nodes will not be deleted once the cluster is brought down.

.. code:: bash

@@ -120,26 +120,26 @@ We strongly recomming you run H2O as a StatefulSet on your Kubernetes cluster. T

Where:

-- ``H2O_KUBERNETES_SERVICE_DNS``: *Required* Crucial for clustering to work. This format usually follows the ``..svc.cluster.local`` pattern. This setting enables H2O node discovery through DNS. It must be modified to match the name of the headless service you created. Be sure you also pay attention to the rest of the address: it needs to match the specifics of your Kubernetes implementation.
+- ``H2O_KUBERNETES_SERVICE_DNS``: *Required.* Crucial for clustering to work. This format usually follows the ``<service>.<namespace>.svc.cluster.local`` pattern. This setting enables H2O-3 node discovery through DNS. It must be modified to match the name of the headless service you created. Be sure you also pay attention to the rest of the address: it needs to match the specifics of your Kubernetes implementation.
- ``H2O_NODE_LOOKUP_TIMEOUT``: Node lookup constraint. Specify the time before the node lookup times out.
-- ``H2O_NODE_EXPECTED_COUNT``: Node lookup constraint. Specofu the expected number of H2O pods to be discovered.
+- ``H2O_NODE_EXPECTED_COUNT``: Node lookup constraint. Specify the expected number of H2O-3 pods to be discovered.
- ``H2O_KUBERNETES_API_PORT``: Port for Kubernetes API checks to listen on (defaults to ``8080``).

-If none of these optional lookup constraints are specified, a sensible default node lookup timeout will be set (defaults to three minutes). If any of the lookup constraints are defined, the H2O node lookup is terminated on whichever condition is met first.
+If none of these optional lookup constraints are specified, a sensible default node lookup timeout will be set (defaults to three minutes). If any of the lookup constraints are defined, the H2O-3 node lookup is terminated on whichever condition is met first. -In the above example, ``'h2oai/h2o-open-source-k8s:latest'`` retrieves the latest build of the H2O Docker image. Replace ``latest`` with ``nightly`` to get the bleeding-edge Docker image with H2O inside. +In the above example, ``'h2oai/h2o-open-source-k8s:latest'`` retrieves the latest build of the H2O-3 Docker image. Replace ``latest`` with ``nightly`` to get the bleeding-edge Docker image with H2O-3 inside. Documentation ''''''''''''' -The documentation for the official H2O Docker images is available at the official `H2O Docker Hub page `__. +The documentation for the official H2O-3 Docker images is available at the official `H2O-3 Docker Hub page `__. -Expose the H2O cluster -~~~~~~~~~~~~~~~~~~~~~~ +Expose the H2O-3 cluster +~~~~~~~~~~~~~~~~~~~~~~~~ -Exposing the H2O cluster is the responsibility of the Kubernetes administrator. By default, an `Ingress `__ can be created. Different platforms offer different capabilities (e.g. OpenShift offers `Routes `__). +Exposing the H2O-3 cluster is the responsibility of the Kubernetes administrator. By default, an `Ingress `__ can be created. Different platforms offer different capabilities (e.g. OpenShift offers `Routes `__). -See more information on `running an H2O cluster on a Kubernetes cluster `__. +See more information on `running an H2O-3 cluster on a Kubernetes cluster `__. 
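Where an Ingress is the chosen exposure mechanism, a minimal sketch might look as follows. The ``h2o-ingress`` name is illustrative, and the backend assumes a service named ``h2o-service`` in front of the H2O-3 pods; port ``54321`` is the default H2O-3 port documented on this page. Whether this routes correctly depends on the Ingress controller installed in your cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: h2o-ingress              # illustrative name
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: h2o-service   # assumed service fronting the H2O-3 pods
                port:
                  number: 54321     # default H2O-3 port
```

Applying this with ``kubectl apply -f`` makes the Flow UI reachable through the Ingress controller's external address.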
diff --git a/h2o-docs/src/product/getting-started/python-users.rst b/h2o-docs/src/product/getting-started/python-users.rst
index b1f9d183d9e7..386161d5d161 100644
--- a/h2o-docs/src/product/getting-started/python-users.rst
+++ b/h2o-docs/src/product/getting-started/python-users.rst
@@ -1,17 +1,17 @@
Python users
============

-Pythonistas can rest easy knowing that H2O provides support for this popular programming language. You can also use H2O with IPython notebooks.
+Pythonistas can rest easy knowing that H2O-3 provides support for this popular programming language. You can also use H2O-3 with IPython notebooks.

Getting started with Python
---------------------------

-The following sections will help you begin using Python for H2O.
+The following sections will help you begin using Python for H2O-3.

-Installing H2O with Python
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+Installing H2O-3 with Python
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-You can find instructions for using H2O with Python in the `Downloading and installing H2O <../downloading.html#install-in-python>`__ section and on the `Downloads page `__.
+You can find instructions for using H2O-3 with Python in the `Downloading and installing H2O-3 <../downloading.html#install-in-python>`__ section and on the `Downloads page `__.

From the Downloads page:

@@ -31,4 +31,4 @@ See a notebook demonstration for how to use grid search in Python.

Anaconda Cloud users
--------------------

-You can run H2O in an Anaconda Cloud environment. Conda 2.7, 3.5, and 3.6 repositories are supported (as are a number of H2O versions). See Anaconda's `official H2O package `__ to view a list of all available H2O versions. You can refer to the `Install on Anaconda Cloud <../downloading.html#install-on-anaconda-cloud>`__ section for information about installing H2O in an Anaconda Cloud.
\ No newline at end of file
+You can run H2O-3 in an Anaconda Cloud environment. Conda 2.7, 3.5, and 3.6 repositories are supported (as are a number of H2O-3 versions). See Anaconda's `official H2O-3 package `__ to view a list of all available H2O-3 versions. You can refer to the `Install on Anaconda Cloud <../downloading.html#install-on-anaconda-cloud>`__ section for information about installing H2O-3 in an Anaconda Cloud environment.
\ No newline at end of file

diff --git a/h2o-docs/src/product/getting-started/r-users.rst b/h2o-docs/src/product/getting-started/r-users.rst
index b0b9a38e4d40..a7c2c0de391f 100644
--- a/h2o-docs/src/product/getting-started/r-users.rst
+++ b/h2o-docs/src/product/getting-started/r-users.rst
@@ -1,19 +1,19 @@
R users
=======

-R users rejoice: H2O supports your chosen programming language!
+R users rejoice: H2O-3 supports your chosen programming language!

Getting started with R
----------------------

-The following sections will help you begin using R for H2O.
+The following sections will help you begin using R for H2O-3.

-See `this cheatsheet on H2O in R `__ for a quick start.
+See `this cheatsheet on H2O-3 in R `__ for a quick start.

.. note::

-   If you are running R on Linus, then you must install ``libcurl`` which allows H2O to communicate with R. We also recommend disabling SElinux and any firewalls (at least initially until you confirmed H2O can initialize).
+   If you are running R on Linux, then you must install ``libcurl``, which allows H2O-3 to communicate with R. We also recommend disabling SELinux and any firewalls (at least initially, until you have confirmed H2O-3 can initialize).

   - On Ubuntu, run: ``apt-get install libcurl4-openssl-dev``
   - On CentOS, run: ``yum install libcurl-devel``

@@ -21,17 +21,17 @@ See `this cheatsheet on H2O in R

-You can find instructions for using H2O with Python in the `Downloading and installing H2O <../downloading.html#install-in-r>`__ section and on the `Downloads page `__.
+You can find instructions for using H2O-3 with R in the `Downloading and installing H2O-3 <../downloading.html#install-in-r>`__ section and on the `Downloads page `__.

From the Downloads page:

1. Click the Install in R tab.
2. Follow the on-page instructions.
-Checking your R version for H2O
-'''''''''''''''''''''''''''''''
+Checking your R version for H2O-3
+'''''''''''''''''''''''''''''''''

-To check which version of H2O is installed in R, run the following:
+To check which version of H2O-3 is installed in R, run the following:

::

@@ -39,7 +39,7 @@ To check which version of H2O is installed in R, run the following:

.. note::

-   R version 3.1.0 ("Spring Dance") is incompatible with H2O. If you are using that version, we recommend upgrading your R version before using H2O.
+   R version 3.1.0 ("Spring Dance") is incompatible with H2O-3. If you are using that version, we recommend upgrading your R version before using H2O-3.

R documentation

diff --git a/h2o-docs/src/product/getting-started/sparkling-users.rst b/h2o-docs/src/product/getting-started/sparkling-users.rst
index b837dafd353c..455a71d025bb 100644
--- a/h2o-docs/src/product/getting-started/sparkling-users.rst
+++ b/h2o-docs/src/product/getting-started/sparkling-users.rst
@@ -5,15 +5,15 @@ Sparkling Water is a gradle project with the following submodules:

- **Core**: Implementation of H2OContext, H2ORDD, and all technical integration code.
- **Examples**: Application, demos, and examples.
-- **ML**: Implementation of `MLlib `__ pipelines for H2O algorithms.
+- **ML**: Implementation of `MLlib `__ pipelines for H2O-3 algorithms.
- **Assembly**: This creates "fatJar" (composed of all other modules).
-- **py**: Implementation of (H2O) Python binding to Sparkling Water.
+- **py**: Implementation of (H2O-3) Python binding to Sparkling Water.

The best way to get started is to modify the core module or create a new module (which extends the project).

.. note::

-   Sparkling Water is only supported with the latest version of H2O.
+   Sparkling Water is only supported with the latest version of H2O-3.

   Sparkling Water is versioned according to the Spark versioning, so make sure to use the Sparkling Water version that corresponds to your installed version of Spark.
@@ -32,7 +32,7 @@ Download Sparkling Water

Sparkling Water documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The documentation for Sparkling Water is separate from the H2O user guide. Read this documentation to get started with Sparkling Water.
+The documentation for Sparkling Water is separate from the H2O-3 user guide. Read this documentation to get started with Sparkling Water.

- `Sparkling Water for Spark 3.5 `__
- `Sparkling Water for Spark 3.4 `__

@@ -71,7 +71,7 @@ The frequently asked questions provide answers to many common questions about Sp

Sparkling Water blog posts
--------------------------

-- `How Sparkling Water Brings H2O to Spark `_
+- `How Sparkling Water Brings H2O-3 to Spark `_
- `H2O - The Killer App on Spark `_
- `In-memory Big Data: Spark + H2O `_

@@ -97,9 +97,9 @@ Documentation for PySparkling is available for the following versions:

RSparkling
----------

-The RSparkling R package is an extension package for `sparklyr `__ that creates an R front-end for the Sparkling Water package from H2O. This provides an interface to H2O's high performance, distributed machine learning algorithms on Spark using R.
+The RSparkling R package is an extension package for `sparklyr `__ that creates an R front-end for the Sparkling Water package from H2O-3. This provides an interface to H2O-3's high-performance, distributed machine learning algorithms on Spark using R.

-This package implements basic functionality by creating an H2OContext, showing the H2O Flow interface, and converting between Spark DataFrames. The main purpose of this package is to provide a connector between sparklyr and H2O's machine learning algorithms.
+This package implements basic functionality by creating an H2OContext, showing the H2O Flow interface, and converting between Spark DataFrames and H2O Frames. The main purpose of this package is to provide a connector between sparklyr and H2O-3's machine learning algorithms.
The RSparkling package uses sparklyr for Spark job deployment and initialization of Sparkling Water. After that, you can use the regular H2O R package for modeling. diff --git a/h2o-docs/src/product/welcome.rst b/h2o-docs/src/product/welcome.rst index 83d34083f31d..32c0866f5c4c 100644 --- a/h2o-docs/src/product/welcome.rst +++ b/h2o-docs/src/product/welcome.rst @@ -1,26 +1,26 @@ Welcome to H2O-3 ================ -H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform. It lets you build machine learning models on big data and provides easy productionalization of those models in an enterprise environment. +H2O-3 is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform. It lets you build machine learning models on big data and provides easy productionalization of those models in an enterprise environment. Basic framework --------------- -H2O's core code is written in Java. A distributed key-value store is used to access and reference data, models, objects, etc. across all nodes and machines. The algorithms are implemented on top of H2O's distributed map-reduce framework and utilize the Java fork/join framework for multi-threading. The data is read in parallel and is distributed across the cluster. It is stored in-memory in a columnar format in a compressed way. H2O's data parser has built-in intelligence to guess the schema of the incoming dataset and supports data ingest from multiple sources in various formats. +H2O-3's core code is written in Java. A distributed key-value store is used to access and reference data, models, objects, etc. across all nodes and machines. The algorithms are implemented on top of H2O-3's distributed map-reduce framework and utilize the Java fork/join framework for multi-threading. The data is read in parallel and is distributed across the cluster. It is stored in-memory in a columnar format in a compressed way. 
H2O-3's data parser has built-in intelligence to guess the schema of the incoming dataset and supports data ingest from multiple sources in various formats. REST API ~~~~~~~~ -H2O's REST API allow access to all the capabilities of H2O frm an external program or script through JSON over HTTP. The REST API is used by H2O's web interface (Flow UI), R binding (H2O-R), and Python binding (H2O-Python). +H2O-3's REST API allows access to all the capabilities of H2O-3 from an external program or script through JSON over HTTP. The REST API is used by H2O-3's web interface (Flow UI), R binding (H2O-R), and Python binding (H2O-Python). -The speed, quality, ease-of-use, and model-deployment for our various supervised and unsupervised algorithms (such as Deep Learning, GLRM, or our tree ensembles) make H2O a highly sought after API for big data data science. +The speed, quality, ease-of-use, and model-deployment for our various supervised and unsupervised algorithms (such as Deep Learning, GLRM, or our tree ensembles) make H2O-3 a highly sought after API for big data data science. H2O is licensed under the `Apache License, Version 2.0 `__. Requirements ------------ -We recommend the following at minimum for compatibility with H2O: +We recommend the following at minimum for compatibility with H2O-3: @@ -29,12 +29,12 @@ We recommend the following at minimum for compatibility with H2O: - Ubuntu 12.04 - RHEL/CentOS 6+ -- **Languages**: R and Python are not required to use H2O (unless you want to use H2O in those environments), but Java is always required (see `Java requirements `__). +- **Languages**: R and Python are not required to use H2O-3 (unless you want to use H2O-3 in those environments), but Java is always required (see `Java requirements `__). - R version 3+ - Python 3.6.x, 3.7.x, 3.8.x, 3.9.x, 3.10.x, 3.11.x -- **Browser**: An internet browser is required to use H2O's web UI, Flow.
+- **Browser**: An internet browser is required to use H2O-3's web UI, Flow. - Google Chrome - Firefox @@ -44,12 +44,12 @@ We recommend the following at minimum for compatibility with H2O: Java requirements ~~~~~~~~~~~~~~~~~ -H2O runs on Java. The 64-bit JDK is required to build H2O or run H2O tests. Only the 64-bit JRE is required to run the H2O binary using either the command line, R, or Python packages. +H2O-3 runs on Java. The 64-bit JDK is required to build H2O-3 or run H2O-3 tests. Only the 64-bit JRE is required to run the H2O-3 binary using either the command line, R, or Python packages. Java support '''''''''''' -H2O supports the following versions of Java: +H2O-3 supports the following versions of Java: - Java SE 17 - Java SE 16 @@ -78,20 +78,20 @@ The following code forces an unsupported Java version: java -jar -Dsys.ai.h2o.debug.allowJavaVersions=19 h2o.jar -Java support with H2O and Hadoop '''''''''''''''''''''''''''''''' +Java support with H2O-3 and Hadoop '''''''''''''''''''''''''''''''''' -Java support is different between H2O and Hadoop. Hadoop only supports `Java 8 and Java 11 `__. Therefore, when running H2O on Hadoop, we recommend only running H2O on Java 8 or Java 11. +Java support is different between H2O-3 and Hadoop. Hadoop only supports `Java 8 and Java 11 `__. Therefore, when running H2O-3 on Hadoop, we recommend only running H2O-3 on Java 8 or Java 11. Optional requirements ~~~~~~~~~~~~~~~~~~~~~ -This section outlines requirements for optional ways you can run H2O. +This section outlines requirements for optional ways you can run H2O-3. Optional Hadoop requirements '''''''''''''''''''''''''''' -Hadoop is only required if you want to deploy H2O on a Hadoop cluster. Supported versions are listed on the `Downloads `__ page (when you select the Install on Hadoop tab) and include: +Hadoop is only required if you want to deploy H2O-3 on a Hadoop cluster.
Supported versions are listed on the `Downloads `__ page (when you select the Install on Hadoop tab) and include: - Cloudera CDH 5.4+ - Hortonworks HDP 2.2+ @@ -103,7 +103,7 @@ See the `Hadoop users From: Hannah Tillman Date: Wed, 5 Jun 2024 08:59:54 -0500 Subject: [PATCH 22/27] ht/flow algo section w/ links to algo pages --- .../getting-started/getting-started.rst | 28 +++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/h2o-docs/src/product/getting-started/getting-started.rst b/h2o-docs/src/product/getting-started/getting-started.rst index 79347dcb271c..a671bec9e5bb 100644 --- a/h2o-docs/src/product/getting-started/getting-started.rst +++ b/h2o-docs/src/product/getting-started/getting-started.rst @@ -24,6 +24,34 @@ Using Flow - H2O-3's web UI `This section describes our web interface, Flow `__. Flow is similar to IPython notebooks and allows you to create a visual workflow to share with others. +Algorithm support +~~~~~~~~~~~~~~~~~ + +H2O Flow supports the following H2O-3 algorithms: + +- `AdaBoost <../data-science/adaboost.html>`__ +- `Aggregator <../data-science/aggregator.html>`__ +- `ANOVA GLM <../data-science/anova_glm.html>`__ +- `Cox Proportional Hazards (CoxPH) <../data-science/coxph.html>`__ +- `Deep Learning <../data-science/deep-learning.html>`__ +- `Distributed Random Forest (DRF) <../data-science/drf.html>`__ +- `Distributed Uplift Random Forest (Uplift DRF) <../data-science/upliftdrf.html>`__ +- `Extended Isolation Forest <../data-science/eif.html>`__ +- `Generalized Additive Models (GAM) <../data-science/gam.html>`__ +- `Generalized Linear Model (GLM) <../data-science/glm.html>`__ +- `Generalized Low Rank Models (GLRM) <../data-science/glrm.html>`__ +- `Gradient Boosting Machine (GBM) <../data-science/gbm.html>`__ +- `Isolation Forest <../data-science/if.html>`__ +- `Isotonic Regression <../data-science/isotonic-regression.html>`__ +- `K-Means Clustering <../data-science/k-means.html>`__ +- `ModelSelection <../data-science/model_selection.html>`__ +-
`Naïve Bayes Classifier <../data-science/naive-bayes.html>`__ +- `Principal Component Analysis (PCA) <../data-science/pca.html>`__ +- `RuleFit <../data-science/rulefit.html>`__ +- `Stacked Ensemble <../data-science/stacked-ensembles.html>`__ +- `Support Vector Machine (PSVM) <../data-science/svm.html>`__ +- `Word2Vec <../data-science/word2vec.html>`__ + Tutorials of Flow ~~~~~~~~~~~~~~~~~ From 93bb209adcf9fb21519326afe4b13456fd0d236b Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Wed, 5 Jun 2024 09:39:10 -0500 Subject: [PATCH 23/27] ht/added available algos + linux fix --- .../src/product/getting-started/r-users.rst | 2 +- h2o-docs/src/product/welcome.rst | 32 +++++++++++++++++++ 2 files changed, 33 insertions(+), 1 deletion(-) diff --git a/h2o-docs/src/product/getting-started/r-users.rst b/h2o-docs/src/product/getting-started/r-users.rst index a7c2c0de391f..fea83bac6e09 100644 --- a/h2o-docs/src/product/getting-started/r-users.rst +++ b/h2o-docs/src/product/getting-started/r-users.rst @@ -13,7 +13,7 @@ See `this cheatsheet on H2O-3 in R `__.
+Available algorithms +'''''''''''''''''''' + +H2O-3 supports the following `algorithms `__: + +- `AdaBoost `__ +- `Aggregator `__ +- `ANOVA GLM `__ +- `AutoML `__ +- `Cox Proportional Hazards (CoxPH) `__ +- `Decision Tree `__ +- `Deep Learning `__ +- `Distributed Random Forest (DRF) `__ +- `Distributed Uplift Random Forest (Uplift DRF) `__ +- `Extended Isolation Forest `__ +- `Generalized Additive Models (GAM) `__ +- `Generalized Linear Model (GLM) `__ +- `Generalized Low Rank Models (GLRM) `__ +- `Gradient Boosting Machine (GBM) `__ +- `Isolation Forest `__ +- `Isotonic Regression `__ +- `K-Means Clustering `__ +- `ModelSelection `__ +- `Naïve Bayes Classifier `__ +- `Principal Component Analysis (PCA) `__ +- `RuleFit `__ +- `Stacked Ensemble `__ +- `Support Vector Machine (PSVM) `__ +- `Target Encoding `__ +- `Word2Vec `__ +- `XGBoost `__ + Requirements ------------ From 34b455d414a694bd435d8ccb26d6a78668c66744 Mon Sep 17 00:00:00 2001 From: Hannah <52463461+hannah-tillman@users.noreply.github.com> Date: Thu, 27 Jun 2024 12:05:10 -0500 Subject: [PATCH 24/27] Update h2o-docs/src/product/getting-started/docker-users.rst Co-authored-by: Adam Valenta --- h2o-docs/src/product/getting-started/docker-users.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/h2o-docs/src/product/getting-started/docker-users.rst b/h2o-docs/src/product/getting-started/docker-users.rst index 58a0e0e831b4..28bbef942ff4 100644 --- a/h2o-docs/src/product/getting-started/docker-users.rst +++ b/h2o-docs/src/product/getting-started/docker-users.rst @@ -21,7 +21,7 @@ Prerequisites .. note:: - - Older Linux kernal versions can cause kernal panics that break Docker. There are ways around it, but attempt these at your own risk. Check the version of your kernel by running ``uname -r``. + - Older Linux kernel versions can cause kernel panics that break Docker. There are ways around it, but attempt these at your own risk. Check the version of your kernel by running ``uname -r``.
- The Dockerfile always pulls the latest H2O-3 release. - The Docker image only needs to be built once. From 323e6f5ad66991b52ed8da231fb73100c3e411a3 Mon Sep 17 00:00:00 2001 From: Hannah <52463461+hannah-tillman@users.noreply.github.com> Date: Thu, 27 Jun 2024 12:05:18 -0500 Subject: [PATCH 25/27] Update h2o-docs/src/product/getting-started/docker-users.rst Co-authored-by: Adam Valenta --- h2o-docs/src/product/getting-started/docker-users.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/h2o-docs/src/product/getting-started/docker-users.rst b/h2o-docs/src/product/getting-started/docker-users.rst index 28bbef942ff4..4590bab615b8 100644 --- a/h2o-docs/src/product/getting-started/docker-users.rst +++ b/h2o-docs/src/product/getting-started/docker-users.rst @@ -13,7 +13,7 @@ This section describes how to use H2O-3 on Docker. It walks you through the foll Prerequisites ------------- -- Linux kernal verison 3.8+ or Mac OS 10.6+ +- Linux kernel version 3.8+ or Mac OS 10.6+ - VirtualBox - Latest version of Docker installed and configured - Docker daemon running (enter all following commands in the Docker daemon window) From cc4cfa08bbb9c2cbeeae0b4b07ad2514a7af477d Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Thu, 27 Jun 2024 12:27:40 -0500 Subject: [PATCH 26/27] ht/requested updates; updated flow list to reflect app --- .../src/product/getting-started/getting-started.rst | 13 ++++++------- .../src/product/getting-started/hadoop-users.rst | 4 ++-- .../product/getting-started/kubernetes-users.rst | 2 +- 3 files changed, 9 insertions(+), 10 deletions(-) diff --git a/h2o-docs/src/product/getting-started/getting-started.rst b/h2o-docs/src/product/getting-started/getting-started.rst index a671bec9e5bb..fa904419b698 100644 --- a/h2o-docs/src/product/getting-started/getting-started.rst +++ b/h2o-docs/src/product/getting-started/getting-started.rst @@ -29,29 +29,28 @@ Algorithm support H2O Flow supports the following H2O-3 algorithms: -- `AdaBoost
<../data-science/adaboost.html>`__ - `Aggregator <../data-science/aggregator.html>`__ - `ANOVA GLM <../data-science/anova_glm.html>`__ +- `AutoML <../automl.html>`__ - `Cox Proportional Hazards (CoxPH) <../data-science/coxph.html>`__ - `Deep Learning <../data-science/deep-learning.html>`__ - `Distributed Random Forest (DRF) <../data-science/drf.html>`__ - `Distributed Uplift Random Forest (Uplift DRF) <../data-science/upliftdrf.html>`__ - `Extended Isolation Forest <../data-science/eif.html>`__ -- `Generalized Additive Models (GAM) <../data-science/gam.html>`__ - `Generalized Linear Model (GLM) <../data-science/glm.html>`__ - `Generalized Low Rank Models (GLRM) <../data-science/glrm.html>`__ - `Gradient Boosting Machine (GBM) <../data-science/gbm.html>`__ +- `Information Diagram (Infogram) <../admissible.html>`__ - `Isolation Forest <../data-science/if.html>`__ -- `Isotonic Regression <../data-science/isotonic-regression.html>`__ - `K-Means Clustering <../data-science/k-means.html>`__ - `ModelSelection <../data-science/model_selection.html>`__ - `Naïve Bayes Classifier <../data-science/naive-bayes.html>`__ - `Principal Component Analysis (PCA) <../data-science/pca.html>`__ - `RuleFit <../data-science/rulefit.html>`__ - `Stacked Ensemble <../data-science/stacked-ensembles.html>`__ -- `Support Vector Machine (PSVM) <../data-science/svm.html>`__ +- `Target Encoding <../data-science/target-encoding.html>`__ - `Word2Vec <../data-science/word2vec.html>`__ - +- `XGBoost <../data-science/xgboost.html>`__ Tutorials of Flow ~~~~~~~~~~~~~~~~~ @@ -148,7 +147,7 @@ At this point, choose whether you want to complete this quickstart in Python or >>> h2o.demo("deeplearning") # Import the Iris (with headers) dataset: - >>> path = "smalldata/iris/iris_wheader.csv" + >>> path = "https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_wheader.csv" >>> iris = h2o.import_file(path=path) # View a summary of the imported dataset: @@ -200,7 +199,7 @@ At this point, choose whether
you want to complete this quickstart in Python or > h2o.init() # Import the Iris (with headers) dataset. - > path <- "smalldata/iris/iris_wheader.csv" + > path <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_wheader.csv" > iris <- h2o.importFile(path) # View a summary of the imported dataset. diff --git a/h2o-docs/src/product/getting-started/hadoop-users.rst b/h2o-docs/src/product/getting-started/hadoop-users.rst index 8cc37ed5e431..0a173fa036d3 100644 --- a/h2o-docs/src/product/getting-started/hadoop-users.rst +++ b/h2o-docs/src/product/getting-started/hadoop-users.rst @@ -60,7 +60,7 @@ H2O-3 communicates using two communication paths. Verify these paths are open an Path 1: Mapper to driver ~~~~~~~~~~~~~~~~~~~~~~~~ -Optionally specify this port using the ``-driverport`` option in the ``hadoop jar`` command (see `Hadoop launch parameters `__). This port is opened on the driver host (the host where you entered the ``hadoop jar`` command). By default, this port is chosen randomly by the operating system. If you don't want to spcify an exact port but still want to restrict the port to a certain range of pors, you can use the option ``-driverportrange``. +Optionally specify this port using the ``-driverport`` option in the ``hadoop jar`` command (see `Hadoop launch parameters `__). This port is opened on the driver host (the host where you entered the ``hadoop jar`` command). By default, this port is chosen randomly by the operating system. If you don't want to specify an exact port but still want to restrict the port to a certain range of ports, you can use the option ``-driverportrange``. Path 2: Mapper to mapper ~~~~~~~~~~~~~~~~~~~~~~~~ @@ -118,7 +118,7 @@ Hadoop launch parameters - ``-h | -help``: Display help. - ``-jobname ``: Specify a job name for the Jobtracker to use; the default is ``H2O_nnnnn`` (where n is chosen randomly).
- ``-principal -keytab | -run_as_user ``: Optionally specify a Kerberos principal and keytab or specify the ``run_as_user`` parameter to start clusters on behalf of the user/principal. Note that using ``run_as_user`` implies that the Hadoop cluster does not have Kerberos. -- ``-driverif driver callback interface>``: Specify the IP address for callback messages from the mapper to the driver. +- ``-driverip driver callback interface>``: Specify the IP address for callback messages from the mapper to the driver. - ``-driverport callback interface>``: Specify the port number for callback messages from the mapper to the driver. - ``-driverportrange callback interface>``: Specify the allowed port range of the driver callback interface, eg. 50000-55000. - ``-network [,]``: Specify the IPv4 network(s) to bind to the H2O-3 nodes; multiple networks can be specified to force H2O-3 to use the specified host in the Hadoop cluster. ``10.1.2.0/24`` allows 256 possibilities. diff --git a/h2o-docs/src/product/getting-started/kubernetes-users.rst b/h2o-docs/src/product/getting-started/kubernetes-users.rst index e391f7e45a38..c66f26dddefd 100644 --- a/h2o-docs/src/product/getting-started/kubernetes-users.rst +++ b/h2o-docs/src/product/getting-started/kubernetes-users.rst @@ -122,7 +122,7 @@ Where: - ``H2O_KUBERNETES_SERVICE_DNS``: *Required* Crucial for clustering to work. This format usually follows the ``..svc.cluster.local`` pattern. This setting enables H2O-3 node discovery through DNS. It must be modified to match the name of the headless service you created. Be sure you also pay attention to the rest of the address: it needs to match the specifics of your Kubernetes implementation. - ``H2O_NODE_LOOKUP_TIMEOUT``: Node lookup constraint. Specify the time before the node lookup times out. -- ``H2O_NODE_EXPECTED_COUNT``: Node lookup constraint. Specofu the expected number of H2O-3 pods to be discovered. +- ``H2O_NODE_EXPECTED_COUNT``: Node lookup constraint.
Specify the expected number of H2O-3 pods to be discovered. - ``H2O_KUBERNETES_API_PORT``: Port for Kubernetes API checks to listen on (defaults to ``8080``). If none of these optional lookup constraints are specified, a sensible default node lookup timeout will be set (defaults to three minutes). If any of the lookup constraints are defined, the H2O-3 node lookup is terminated on whichever condition is met first. From bed77609cd1ade936763cc42fd09acdf89fafa8c Mon Sep 17 00:00:00 2001 From: Hannah Tillman Date: Thu, 27 Jun 2024 12:34:00 -0500 Subject: [PATCH 27/27] ht/added infogram to welcome page --- h2o-docs/src/product/welcome.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/h2o-docs/src/product/welcome.rst b/h2o-docs/src/product/welcome.rst index 11637c7a3803..8af8b1aef1af 100644 --- a/h2o-docs/src/product/welcome.rst +++ b/h2o-docs/src/product/welcome.rst @@ -36,6 +36,7 @@ H2O-3 supports the following `algorithms `__: - `Generalized Linear Model (GLM) `__ - `Generalized Low Rank Models (GLRM) `__ - `Gradient Boosting Machine (GBM) `__ +- `Information Diagram (Infogram) `__ - `Isolation Forest `__ - `Isotonic Regression `__ - `K-Means Clustering `__