diff --git a/docs/release/howto/Working_with_Documents.ipynb b/docs/release/howto/Working_with_Documents.ipynb new file mode 100644 index 00000000..931baa33 --- /dev/null +++ b/docs/release/howto/Working_with_Documents.ipynb @@ -0,0 +1,670 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Working with Documents in Pixeltable\n", + "\n", + "Pixeltable simplifies the processing and analysis of documents within your ML workloads. This guide demonstrates how to ingest, split, and interact with document data using the DocumentSplitter iterator." + ], + "metadata": { + "id": "aRG3MzczWgNo" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Setting Up Document Processing\n", + "\n", + "Install Pixeltable, then import pixeltable, pathlib, and the DocumentSplitter class:" + ], + "metadata": { + "id": "RVTmACdQWqGY" + } + }, + { + "cell_type": "code", + "source": [ + "%pip install -q pixeltable" + ], + "metadata": { + "id": "2t-jwNQXWwg3" + }, + "execution_count": 1, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "loqc8sTcWd4S" + }, + "outputs": [], + "source": [ + "import pixeltable as pxt\n", + "import pathlib\n", + "from pixeltable.iterators.document import DocumentSplitter" + ] + }, + { + "cell_type": "code", + "source": [ + "# Create a Pixeltable directory for this example\n", + "pxt.create_dir('document_example', ignore_errors=True)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "gQUKXvDSY8xD", + "outputId": "23021a2a-39a9-45dd-9979-a83621091453" + }, + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Connected to Pixeltable database at: 
postgresql://postgres:@/pixeltable?host=/root/.pixeltable/pgdata\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Prepare Document Paths\n", + "\n", + "Define a list of file paths to the documents you want to process. In this example, we're using sample PDF files stored under /content/sample_data:" + ], + "metadata": { + "id": "kbzmlXXcW0K3" + } + }, + { + "cell_type": "code", + "source": [ + "# Replace these with paths to your own documents.\n", + "# The paths are absolute, so no base directory is prepended.\n", + "doc_paths = [\n", + " pathlib.Path('/content/sample_data/37-Million-Compilations.pdf'),\n", + " pathlib.Path('/content/sample_data/2018-CppCon-Unwinding.pdf'),\n", + " pathlib.Path('/content/sample_data/100G-Networking-Technology-Overview.pdf'),\n", + "]" + ], + "metadata": { + "id": "MYgQ7vNnWzb_" + }, + "execution_count": 4, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Create Tables\n", + "\n", + "- doc_table: A table you create to hold the documents themselves, along with any metadata you choose to add, such as file names or other relevant attributes.\n", + "- doc_paragraphs: Pixeltable will create this table to store the extracted paragraph-level data." 
+ ], + "metadata": { + "id": "paj62VKJXBtP" + } + }, + { + "cell_type": "code", + "source": [ + "pxt.drop_table('doc_paragraphs', ignore_errors=True) # Ensure table doesn't exist\n", + "pxt.drop_table('doc_table', ignore_errors=True)\n", + "\n", + "doc_table = pxt.create_table('doc_table', {'document': pxt.DocumentType()})\n", + "doc_table.insert({'document': str(doc_path)} for doc_path in doc_paths)\n", + "\n", + "doc_table.show() # Display the table's contents" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 688 + }, + "id": "CAlFU3yrW8J-", + "outputId": "75f60cf9-d5d8-4b83-c1b2-4611166fe06a" + }, + "execution_count": 5, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Created table `doc_table`.\n", + "Inserting rows into `doc_table`: 3 rows [00:00, 576.22 rows/s]\n", + "Inserted 3 rows with 0 errors.\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " document\n", + "0 /content/sample_data/37-Million-Compilations.pdf\n", + "1 /content/sample_data/2018-CppCon-Unwinding.pdf\n", + "2 /content/sample_data/100G-Networking-Technolog..." + ], + "text/html": [ + "
document | \n", + "
---|
/content/sample_data/37-Million-Compilations.pdf | \n",
+ "/content/sample_data/2018-CppCon-Unwinding.pdf | \n",
+ "/content/sample_data/100G-Networking-Technology-Overview.pdf | \n",
+ "
pos | \n", + "text | \n", + "page | \n", + "bounding_box | \n", + "heading | \n", + "document | \n", + "
---|---|---|---|---|---|
0 | \n", + "37 Million Compilations:\\nInvestigating Novice Programming Mistakes in Large-Scale\\nStudent Data\\n | \n", + "0 | \n", + "{'x1': 53.798004150390625, 'x2': 555.91650390625, 'y1': 68.27877044677734, 'y2': 130.65234375} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
1 | \n", + "Amjad Altadmri\\nSchool of Computing\\nUniversity of Kent\\nCanterbury, Kent, UK\\naa803@kent.ac.uk\\n | \n", + "0 | \n", + "{'x1': 142.07000732421875, 'x2': 241.69271850585938, 'y1': 155.01763916015625, 'y2': 213.086181640625} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
2 | \n", + "Neil C. C. Brown\\nSchool of Computing\\nUniversity of Kent\\nCanterbury, Kent, UK\\nnccb@kent.ac.uk\\n | \n", + "0 | \n", + "{'x1': 371.1390380859375, 'x2': 464.5284423828125, 'y1': 155.01763916015625, 'y2': 213.086181640625} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
3 | \n", + "ABSTRACT\\n | \n", + "0 | \n", + "{'x1': 53.79803466796875, 'x2': 118.23656463623047, 'y1': 225.115966796875, 'y2': 240.669677734375} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
4 | \n", + "Previous investigations of student errors have typically fo-\\ncused on samples of hundreds of students at individual in-\\nstitutions.\\nThis work uses a year's worth of compilation\\nevents from over 250,000 students all over the world, taken\\nfrom the large Blackbox data set. We analyze the frequency,\\ntime-to-fix, and spread of errors among users, showing how\\nthese factors inter-relate, in addition to their development\\nover the course of the year. These results can inform the de-\\nsign of courses, textbooks and also tools to target the most\\nfrequent (or hardest to fix) errors.\\n | \n", + "0 | \n", + "{'x1': 53.79803466796875, 'x2': 292.9497985839844, 'y1': 244.31414794921875, 'y2': 347.42755126953125} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
5 | \n", + "Categories and Subject Descriptors\\n | \n", + "0 | \n", + "{'x1': 53.79803466796875, 'x2': 234.09439086914062, 'y1': 358.51995849609375, 'y2': 374.07366943359375} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
6 | \n", + "K.3.2 [Computers And Education]: Computer and In-\\nformation Science Education\\n | \n", + "0 | \n", + "{'x1': 53.79803466796875, 'x2': 292.9109191894531, 'y1': 377.7181701660156, 'y2': 397.14556884765625} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
7 | \n", + "General Terms\\n | \n", + "0 | \n", + "{'x1': 53.79803466796875, 'x2': 130.0602569580078, 'y1': 408.23797607421875, 'y2': 423.79168701171875} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
8 | \n", + "Experimentation\\n | \n", + "0 | \n", + "{'x1': 53.79803466796875, 'x2': 121.20745086669922, 'y1': 427.4361877441406, 'y2': 436.402587890625} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
9 | \n", + "Keywords\\n | \n", + "0 | \n", + "{'x1': 53.79803466796875, 'x2': 105.18148803710938, 'y1': 447.4949951171875, 'y2': 463.0487060546875} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
10 | \n", + "Programming Mistakes; Blackbox\\n | \n", + "0 | \n", + "{'x1': 53.79803466796875, 'x2': 189.47760009765625, 'y1': 466.6932067871094, 'y2': 475.65960693359375} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
11 | \n", + "1.\\nINTRODUCTION\\nKnowledge about students' mistakes and the time taken\\nto fix errors is useful for many reasons. For example, Sadler\\net al [10] suggest that understanding student misconceptions\\nis important to educator efficacy. Knowing which mistakes\\nnovices are likely to make or finding challenging informs the\\nwriting of instructional materials, such as textbooks, and\\ncan help improve the design and impact of beginner's IDEs\\nor other educatoinal programming tools.\\nPrevious studies that have investigated student errors dur-\\ning [Java] programming have focused on cohorts of up to 600\\nstudents at a single institution [1, 4, 5, 7, 8, 13]. However,\\n | \n", + "0 | \n", + "{'x1': 53.79803466796875, 'x2': 292.95880126953125, 'y1': 486.75201416015625, 'y2': 618.0305786132812} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
12 | \n", + "Permission to make digital or hard copies of all or part of this work for personal or\\nclassroom use is granted without fee provided that copies are not made or distributed\\nfor profit or commercial advantage and that copies bear this notice and the full cita-\\ntion on the first page. Copyrights for components of this work owned by others than\\nACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-\\npublish, to post on servers or to redistribute to lists, requires prior specific permission\\nand/or a fee. Request permissions from permissions@acm.org.\\nSIGCSE'15, March 4–7, 2015, Kansas City, MO, USA.\\nCopyright c\\n⃝2015 ACM 978-1-4503-2966-8/15/03 ...$15.00.\\nhttp://dx.doi.org/10.1145/2676723.2677258.\\n | \n", + "0 | \n", + "{'x1': 53.79802703857422, 'x2': 292.9018249511719, 'y1': 632.3612060546875, 'y2': 721.7460327148438} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
13 | \n", + "the recently launched Blackbox data collection project [3]\\naffords an opportunity to observe the mistakes of a large\\nnumber of students across many institutions – for example,\\nin one year of data, the project collected error messages and\\nJava code from around 265,000 users worldwide. A previ-\\nous study by the authors utilized four months of data from\\nBlackbox to study educators opinions against the frequency\\nof mistakes [2]. The contribution in our proposed paper is\\nto go further, and provide a more detailed investigation into\\ncharacteristics of the mistakes, trying to answer the follow-\\ning research questions:\\n | \n", + "0 | \n", + "{'x1': 316.81201171875, 'x2': 555.963623046875, 'y1': 229.86822509765625, 'y2': 343.442626953125} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
14 | \n", + "• What are the most frequent mistakes in a large-scale\\nmulti-institution data set?\\n | \n", + "0 | \n", + "{'x1': 330.13702392578125, 'x2': 555.9368286132812, 'y1': 354.6811218261719, 'y2': 374.3506164550781} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
15 | \n", + "• What are the most common errors, and common classes\\nof errors?\\n | \n", + "0 | \n", + "{'x1': 330.13702392578125, 'x2': 557.3984375, 'y1': 383.8711242675781, 'y2': 403.53961181640625} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
16 | \n", + "• Which errors take the shortest or longest time to fix?\\n | \n", + "0 | \n", + "{'x1': 330.13702392578125, 'x2': 554.3406982421875, 'y1': 413.06011962890625, 'y2': 428.6168212890625} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
17 | \n", + "• How do these errors evolve during the academic terms\\nand academic year?\\n | \n", + "0 | \n", + "{'x1': 330.13702392578125, 'x2': 555.9279174804688, 'y1': 431.78912353515625, 'y2': 451.4586181640625} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
39 | \n", + "• B: Use of == instead of .equals to compare strings.\\nFor example: if (a == \"start\") ...\\n | \n", + "1 | \n", + "{'x1': 330.136962890625, 'x2': 549.2711181640625, 'y1': 245.98818969726562, 'y2': 265.66009521484375} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "
40 | \n", + "• M: Trying to invoke a non-static method as if it was\\nstatic.\\nFor example: MyClass.toString();\\n | \n", + "1 | \n", + "{'x1': 330.1369934082031, 'x2': 555.9454956054688, 'y1': 273.8652038574219, 'y2': 303.9971008300781} | \n", + "None | \n", + "\n",
+ " \n",
+ " \n",
+ " | \n",
+ "