From 048a26be595bb7fdb9d619c12422542c6bc5f2c4 Mon Sep 17 00:00:00 2001 From: Pierre Brunelle <70675979+pierrebrunelle@users.noreply.github.com> Date: Sat, 18 May 2024 09:08:29 -0700 Subject: [PATCH] Working with Document How To Simple and Easy focus on uploading and splitting docs, and redirect to other more advanced tutorials for UDF and Rag --- .../howto/Working_with_Documents.ipynb | 670 ++++++++++++++++++ 1 file changed, 670 insertions(+) create mode 100644 docs/release/howto/Working_with_Documents.ipynb diff --git a/docs/release/howto/Working_with_Documents.ipynb b/docs/release/howto/Working_with_Documents.ipynb new file mode 100644 index 00000000..931baa33 --- /dev/null +++ b/docs/release/howto/Working_with_Documents.ipynb @@ -0,0 +1,670 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Working with Documents in Pixeltable\n", + "\n", + "Pixeltable simplifies the processing and analysis of documents within your ML workloads. This guide demonstrates how to ingest, split, and interact with document data using the DocumentSplitter iterator." 
+ ], + "metadata": { + "id": "aRG3MzczWgNo" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Setting Up Document Processing\n", + "\n", + "Import Necessary Modules: Begin by importing pixeltable, pathlib, and the DocumentSplitter class:" + ], + "metadata": { + "id": "RVTmACdQWqGY" + } + }, + { + "cell_type": "code", + "source": [ + "%pip install -q pixeltable" + ], + "metadata": { + "id": "2t-jwNQXWwg3" + }, + "execution_count": 1, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "loqc8sTcWd4S" + }, + "outputs": [], + "source": [ + "import pixeltable as pxt\n", + "import pathlib\n", + "from pixeltable.iterators.document import DocumentSplitter" + ] + }, + { + "cell_type": "code", + "source": [ + "# Create the Pixeltable workspace\n", + "pxt.create_dir('document_example', ignore_errors=True)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "gQUKXvDSY8xD", + "outputId": "23021a2a-39a9-45dd-9979-a83621091453" + }, + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/root/.pixeltable/pgdata\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Prepare Document Paths\n", + "\n", + "Define a list of file paths to the documents you want to process. 
In this example, we're using three sample PDF files stored under /content/sample_data:" + ], + "metadata": { + "id": "kbzmlXXcW0K3" + } + }, + { + "cell_type": "code", + "source": [ + "# Replace these with the paths to your own documents\n", + "doc_paths = [\n", + " pathlib.Path('/content/sample_data/37-Million-Compilations.pdf'),\n", + " pathlib.Path('/content/sample_data/2018-CppCon-Unwinding.pdf'),\n", + " pathlib.Path('/content/sample_data/100G-Networking-Technology-Overview.pdf'),\n", + "]" + ], + "metadata": { + "id": "MYgQ7vNnWzb_" + }, + "execution_count": 4, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Create Tables\n", + "\n", + "- doc_table: A base table with a single document column that holds the source documents.\n", + "- doc_paragraphs: A view over doc_table, created in the next section, that stores the extracted paragraph-level data." + ], + "metadata": { + "id": "paj62VKJXBtP" + } + }, + { + "cell_type": "code", + "source": [ + "pxt.drop_table('doc_paragraphs', ignore_errors=True) # Ensure table doesn't exist\n", + "pxt.drop_table('doc_table', ignore_errors=True)\n", + "\n", + "doc_table = pxt.create_table('doc_table', {'document': pxt.DocumentType()})\n", + "doc_table.insert({'document': str(doc_path)} for doc_path in doc_paths)\n", + "\n", + "doc_table.show() # Display the table's contents" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 688 + }, + "id": "CAlFU3yrW8J-", + "outputId": "75f60cf9-d5d8-4b83-c1b2-4611166fe06a" + }, + "execution_count": 5, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Created table `doc_table`.\n", + "Inserting rows into `doc_table`: 3 rows [00:00, 576.22 rows/s]\n", + "Inserted 3 rows with 0 errors.\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " document\n", + "0 
/content/sample_data/37-Million-Compilations.pdf\n", + "1 /content/sample_data/2018-CppCon-Unwinding.pdf\n", + "2 /content/sample_data/100G-Networking-Technolog..." + ], + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
document
" + ] + }, + "metadata": {}, + "execution_count": 5 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Using the DocumentSplitter\n", + "\n", + "Create a View with DocumentSplitter: The core of document processing involves creating a Pixeltable view that utilizes the DocumentSplitter iterator. This iterator breaks down the documents into paragraphs and extracts metadata:" + ], + "metadata": { + "id": "QlIISefQXE33" + } + }, + { + "cell_type": "code", + "source": [ + "paragraph_table = pxt.create_view(\n", + " 'doc_paragraphs',\n", + " doc_table,\n", + " iterator=DocumentSplitter.create(\n", + " document=doc_table.document,\n", + " separators='paragraph', # Split by paragraphs\n", + " metadata='page,bounding_box,heading' # Extract metadata\n", + " )\n", + ")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "LLsQAgJFXGhn", + "outputId": "9ccee162-9907-41f1-bd01-98f32267d4df" + }, + "execution_count": 6, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Inserting rows into `doc_paragraphs`: 2596 rows [00:02, 1026.41 rows/s]\n", + "Created view `doc_paragraphs` with 2596 rows, 0 exceptions.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Explore and Analyze:\n", + "\n", + "Display Paragraphs: The paragraph_table now contains each paragraph as a separate row along with its metadata.\n", + "Filter and Query: Utilize Pixeltable's powerful filtering and querying capabilities to explore specific paragraphs, search within the text, or perform other analyses." 
+ ], + "metadata": { + "id": "KbRO8ur8XIPA" + } + }, + { + "cell_type": "code", + "source": [ + "paragraph_table.show() # Display the extracted paragraph data" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "8BJisxmFXKHY", + "outputId": "4268efa4-2202-4daa-f331-2c34f2e353fc" + }, + "execution_count": 7, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " pos text page \\\n", + "0 0 37 Million Compilations:\\nInvestigating Novice... 0 \n", + "1 1 Amjad Altadmri\\nSchool of Computing\\nUniversit... 0 \n", + "2 2 Neil C. C. Brown\\nSchool of Computing\\nUnivers... 0 \n", + "3 3 ABSTRACT\\n 0 \n", + "4 4 Previous investigations of student errors have... 0 \n", + "5 5 Categories and Subject Descriptors\\n 0 \n", + "6 6 K.3.2 [Computers And Education]: Computer and ... 0 \n", + "7 7 General Terms\\n 0 \n", + "8 8 Experimentation\\n 0 \n", + "9 9 Keywords\\n 0 \n", + "10 10 Programming Mistakes; Blackbox\\n 0 \n", + "11 11 1.\\nINTRODUCTION\\nKnowledge about students' mi... 0 \n", + "12 12 Permission to make digital or hard copies of a... 0 \n", + "13 13 the recently launched Blackbox data collection... 0 \n", + "14 14 • What are the most frequent mistakes in a lar... 0 \n", + "15 15 • What are the most common errors, and common ... 0 \n", + "16 16 • Which errors take the shortest or longest ti... 0 \n", + "17 17 • How do these errors evolve during the academ... 0 \n", + "18 39 • B: Use of == instead of .equals to compare s... 1 \n", + "19 40 • M: Trying to invoke a non-static method as i... 1 \n", + "\n", + " bounding_box heading \\\n", + "0 {'x1': 53.798004150390625, 'x2': 555.916503906... None \n", + "1 {'x1': 142.07000732421875, 'x2': 241.692718505... None \n", + "2 {'x1': 371.1390380859375, 'x2': 464.5284423828... None \n", + "3 {'x1': 53.79803466796875, 'x2': 118.2365646362... None \n", + "4 {'x1': 53.79803466796875, 'x2': 292.9497985839... 
None \n", + "5 {'x1': 53.79803466796875, 'x2': 234.0943908691... None \n", + "6 {'x1': 53.79803466796875, 'x2': 292.9109191894... None \n", + "7 {'x1': 53.79803466796875, 'x2': 130.0602569580... None \n", + "8 {'x1': 53.79803466796875, 'x2': 121.2074508666... None \n", + "9 {'x1': 53.79803466796875, 'x2': 105.1814880371... None \n", + "10 {'x1': 53.79803466796875, 'x2': 189.4776000976... None \n", + "11 {'x1': 53.79803466796875, 'x2': 292.9588012695... None \n", + "12 {'x1': 53.79802703857422, 'x2': 292.9018249511... None \n", + "13 {'x1': 316.81201171875, 'x2': 555.963623046875... None \n", + "14 {'x1': 330.13702392578125, 'x2': 555.936828613... None \n", + "15 {'x1': 330.13702392578125, 'x2': 557.3984375, ... None \n", + "16 {'x1': 330.13702392578125, 'x2': 554.340698242... None \n", + "17 {'x1': 330.13702392578125, 'x2': 555.927917480... None \n", + "18 {'x1': 330.136962890625, 'x2': 549.27111816406... None \n", + "19 {'x1': 330.1369934082031, 'x2': 555.9454956054... None \n", + "\n", + " document \n", + "0 /content/sample_data/37-Million-Compilations.pdf \n", + "1 /content/sample_data/37-Million-Compilations.pdf \n", + "2 /content/sample_data/37-Million-Compilations.pdf \n", + "3 /content/sample_data/37-Million-Compilations.pdf \n", + "4 /content/sample_data/37-Million-Compilations.pdf \n", + "5 /content/sample_data/37-Million-Compilations.pdf \n", + "6 /content/sample_data/37-Million-Compilations.pdf \n", + "7 /content/sample_data/37-Million-Compilations.pdf \n", + "8 /content/sample_data/37-Million-Compilations.pdf \n", + "9 /content/sample_data/37-Million-Compilations.pdf \n", + "10 /content/sample_data/37-Million-Compilations.pdf \n", + "11 /content/sample_data/37-Million-Compilations.pdf \n", + "12 /content/sample_data/37-Million-Compilations.pdf \n", + "13 /content/sample_data/37-Million-Compilations.pdf \n", + "14 /content/sample_data/37-Million-Compilations.pdf \n", + "15 /content/sample_data/37-Million-Compilations.pdf \n", + "16 
/content/sample_data/37-Million-Compilations.pdf \n", + "17 /content/sample_data/37-Million-Compilations.pdf \n", + "18 /content/sample_data/37-Million-Compilations.pdf \n", + "19 /content/sample_data/37-Million-Compilations.pdf " + ], + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
<table><tr><th>pos</th><th>text</th><th>page</th><th>bounding_box</th><th>heading</th><th>document</th></tr>
<tr><td>0</td><td>37 Million Compilations:\\nInvestigating Novice Programming Mistakes in Large-Scale\\nStudent Data\\n</td><td>0</td><td>{'x1': 53.798004150390625, 'x2': 555.91650390625, 'y1': 68.27877044677734, 'y2': 130.65234375}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>1</td><td>Amjad Altadmri\\nSchool of Computing\\nUniversity of Kent\\nCanterbury, Kent, UK\\naa803@kent.ac.uk\\n</td><td>0</td><td>{'x1': 142.07000732421875, 'x2': 241.69271850585938, 'y1': 155.01763916015625, 'y2': 213.086181640625}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>2</td><td>Neil C. C. Brown\\nSchool of Computing\\nUniversity of Kent\\nCanterbury, Kent, UK\\nnccb@kent.ac.uk\\n</td><td>0</td><td>{'x1': 371.1390380859375, 'x2': 464.5284423828125, 'y1': 155.01763916015625, 'y2': 213.086181640625}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>3</td><td>ABSTRACT\\n</td><td>0</td><td>{'x1': 53.79803466796875, 'x2': 118.23656463623047, 'y1': 225.115966796875, 'y2': 240.669677734375}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>4</td><td>Previous investigations of student errors have typically fo-\\ncused on samples of hundreds of students at individual in-\\nstitutions.\\nThis work uses a year's worth of compilation\\nevents from over 250,000 students all over the world, taken\\nfrom the large Blackbox data set. We analyze the frequency,\\ntime-to-fix, and spread of errors among users, showing how\\nthese factors inter-relate, in addition to their development\\nover the course of the year. These results can inform the de-\\nsign of courses, textbooks and also tools to target the most\\nfrequent (or hardest to fix) errors.\\n</td><td>0</td><td>{'x1': 53.79803466796875, 'x2': 292.9497985839844, 'y1': 244.31414794921875, 'y2': 347.42755126953125}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>5</td><td>Categories and Subject Descriptors\\n</td><td>0</td><td>{'x1': 53.79803466796875, 'x2': 234.09439086914062, 'y1': 358.51995849609375, 'y2': 374.07366943359375}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>6</td><td>K.3.2 [Computers And Education]: Computer and In-\\nformation Science Education\\n</td><td>0</td><td>{'x1': 53.79803466796875, 'x2': 292.9109191894531, 'y1': 377.7181701660156, 'y2': 397.14556884765625}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>7</td><td>General Terms\\n</td><td>0</td><td>{'x1': 53.79803466796875, 'x2': 130.0602569580078, 'y1': 408.23797607421875, 'y2': 423.79168701171875}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>8</td><td>Experimentation\\n</td><td>0</td><td>{'x1': 53.79803466796875, 'x2': 121.20745086669922, 'y1': 427.4361877441406, 'y2': 436.402587890625}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>9</td><td>Keywords\\n</td><td>0</td><td>{'x1': 53.79803466796875, 'x2': 105.18148803710938, 'y1': 447.4949951171875, 'y2': 463.0487060546875}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>10</td><td>Programming Mistakes; Blackbox\\n</td><td>0</td><td>{'x1': 53.79803466796875, 'x2': 189.47760009765625, 'y1': 466.6932067871094, 'y2': 475.65960693359375}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>11</td><td>1.\\nINTRODUCTION\\nKnowledge about students' mistakes and the time taken\\nto fix errors is useful for many reasons. For example, Sadler\\net al [10] suggest that understanding student misconceptions\\nis important to educator efficacy. Knowing which mistakes\\nnovices are likely to make or finding challenging informs the\\nwriting of instructional materials, such as textbooks, and\\ncan help improve the design and impact of beginner's IDEs\\nor other educatoinal programming tools.\\nPrevious studies that have investigated student errors dur-\\ning [Java] programming have focused on cohorts of up to 600\\nstudents at a single institution [1, 4, 5, 7, 8, 13]. However,\\n</td><td>0</td><td>{'x1': 53.79803466796875, 'x2': 292.95880126953125, 'y1': 486.75201416015625, 'y2': 618.0305786132812}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>12</td><td>Permission to make digital or hard copies of all or part of this work for personal or\\nclassroom use is granted without fee provided that copies are not made or distributed\\nfor profit or commercial advantage and that copies bear this notice and the full cita-\\ntion on the first page. Copyrights for components of this work owned by others than\\nACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-\\npublish, to post on servers or to redistribute to lists, requires prior specific permission\\nand/or a fee. Request permissions from permissions@acm.org.\\nSIGCSE'15, March 4–7, 2015, Kansas City, MO, USA.\\nCopyright c\\n⃝2015 ACM 978-1-4503-2966-8/15/03 ...$15.00.\\nhttp://dx.doi.org/10.1145/2676723.2677258.\\n</td><td>0</td><td>{'x1': 53.79802703857422, 'x2': 292.9018249511719, 'y1': 632.3612060546875, 'y2': 721.7460327148438}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>13</td><td>the recently launched Blackbox data collection project [3]\\naffords an opportunity to observe the mistakes of a large\\nnumber of students across many institutions – for example,\\nin one year of data, the project collected error messages and\\nJava code from around 265,000 users worldwide. A previ-\\nous study by the authors utilized four months of data from\\nBlackbox to study educators opinions against the frequency\\nof mistakes [2]. The contribution in our proposed paper is\\nto go further, and provide a more detailed investigation into\\ncharacteristics of the mistakes, trying to answer the follow-\\ning research questions:\\n</td><td>0</td><td>{'x1': 316.81201171875, 'x2': 555.963623046875, 'y1': 229.86822509765625, 'y2': 343.442626953125}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>14</td><td>• What are the most frequent mistakes in a large-scale\\nmulti-institution data set?\\n</td><td>0</td><td>{'x1': 330.13702392578125, 'x2': 555.9368286132812, 'y1': 354.6811218261719, 'y2': 374.3506164550781}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>15</td><td>• What are the most common errors, and common classes\\nof errors?\\n</td><td>0</td><td>{'x1': 330.13702392578125, 'x2': 557.3984375, 'y1': 383.8711242675781, 'y2': 403.53961181640625}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>16</td><td>• Which errors take the shortest or longest time to fix?\\n</td><td>0</td><td>{'x1': 330.13702392578125, 'x2': 554.3406982421875, 'y1': 413.06011962890625, 'y2': 428.6168212890625}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>17</td><td>• How do these errors evolve during the academic terms\\nand academic year?\\n</td><td>0</td><td>{'x1': 330.13702392578125, 'x2': 555.9279174804688, 'y1': 431.78912353515625, 'y2': 451.4586181640625}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>39</td><td>• B: Use of == instead of .equals to compare strings.\\nFor example: if (a == \"start\") ...\\n</td><td>1</td><td>{'x1': 330.136962890625, 'x2': 549.2711181640625, 'y1': 245.98818969726562, 'y2': 265.66009521484375}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr>
<tr><td>40</td><td>• M: Trying to invoke a non-static method as if it was\\nstatic.\\nFor example: MyClass.toString();\\n</td><td>1</td><td>{'x1': 330.1369934082031, 'x2': 555.9454956054688, 'y1': 273.8652038574219, 'y2': 303.9971008300781}</td><td>None</td><td>/content/sample_data/37-Million-Compilations.pdf</td></tr></table>
" + ] + }, + "metadata": {}, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Key Considerations\n", + "\n", + "- Supported File Types: Pixeltable currently supports PDF and HTML documents.\n", + "- Customizable Splitting: You can specify different separators in DocumentSplitter (e.g., \"sentence\", \"heading\") based on your needs.\n", + "\n", + "## Advanced Features\n", + "\n", + "Explore Pixeltable's documentation (e.g. [UDFs](https://pixeltable.readme.io/docs/user-defined-functions-udfs)) to incorporate your own, more advanced document processing and chunking strategies.\n", + "\n", + "Learn about RAG with Pixeltable: https://pixeltable.readme.io/docs/rag-operations" + ], + "metadata": { + "id": "3DoRNE6JXLa4" + } + } + ] +} \ No newline at end of file