feat: add epub file reader support #236

Open
wants to merge 9 commits into
base: main
Changes from 5 commits
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -40,3 +40,6 @@ dist/

# vs code
.vscode/launch.json

# jetbrains ide
.idea/
2 changes: 1 addition & 1 deletion README.md
@@ -97,7 +97,7 @@ export const runtime = "nodejs"; // default
/** @type {import('next').NextConfig} */
const nextConfig = {
experimental: {
- serverComponentsExternalPackages: ["pdf-parse"], // Puts pdf-parse in actual NodeJS mode with NextJS App Router
+ serverComponentsExternalPackages: ["pdf-parse", "@gxl/epub-parser"], // Puts pdf-parse and @gxl/epub-parser in actual NodeJS mode with NextJS App Router
m1911star marked this conversation as resolved.
},
};

Binary file added examples/data/wells.epub
Collaborator

Where is this example taken from, and under what license?

Member

Could we download it from somewhere instead of leaving it here?

Author

Yep, will remove it.

Binary file not shown.
18 changes: 18 additions & 0 deletions examples/epub.ts
@@ -0,0 +1,18 @@
import { EpubReader, VectorStoreIndex } from "llamaindex";

async function main() {
// Load EPUB
const reader = new EpubReader();
const documents = await reader.loadData("data/wells.epub");
Collaborator

@m1911star you can load the EPUB from a URL here instead

// Split text and create embeddings. Store them in a VectorStoreIndex
const index = await VectorStoreIndex.fromDocuments(documents);

// Query the index
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query("What is this about?");

// Output response
console.log(response.toString());
}

main().catch(console.error);
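The review suggestion to load the EPUB from a URL, rather than committing a binary to the repository, could be sketched as follows. This is a minimal illustration and not part of the PR: `downloadToTempFile` and `FetchLike` are hypothetical names, and the sketch assumes `EpubReader.loadData` keeps accepting a local file path as shown in the diff.

```typescript
import { mkdtemp, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Minimal response shape this sketch relies on (hypothetical type);
// the global fetch available in Node 18+ satisfies it.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}>;

// Download `url` into a fresh temporary directory and return the local path.
async function downloadToTempFile(
  url: string,
  fetchFn: FetchLike,
): Promise<string> {
  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Download failed with HTTP status ${response.status}`);
  }
  const dir = await mkdtemp(join(tmpdir(), "epub-"));
  const target = join(dir, "book.epub");
  await writeFile(target, new Uint8Array(await response.arrayBuffer()));
  return target;
}
```

In the example script, the returned path could then be passed to `reader.loadData(...)` in place of `"data/wells.epub"`, with the global `fetch` supplied as `fetchFn`. The source URL is deliberately left out here because the thread does not name one.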
1 change: 1 addition & 0 deletions packages/core/package.json
@@ -4,6 +4,7 @@
"license": "MIT",
"dependencies": {
"@anthropic-ai/sdk": "^0.9.1",
"@gxl/epub-parser": "2.0.4",
"@notionhq/client": "^2.2.13",
"@xenova/transformers": "^2.8.0",
"crypto-js": "^4.2.0",
1 change: 1 addition & 0 deletions packages/core/src/index.ts
@@ -20,6 +20,7 @@ export * from "./embeddings";
export * from "./indices";
export * from "./llm/LLM";
export * from "./readers/CSVReader";
export * from "./readers/EpubReader";
export * from "./readers/HTMLReader";
export * from "./readers/MarkdownReader";
export * from "./readers/NotionReader";
63 changes: 63 additions & 0 deletions packages/core/src/readers/EpubReader.ts
@@ -0,0 +1,63 @@
import { parseEpub } from "@gxl/epub-parser";
import { Document } from "../Node";
import { GenericFileSystem } from "../storage/FileSystem";
import { DEFAULT_FS } from "../storage/constants";
import { BaseReader } from "./base";
/**
* Read the text of an EPUB file
*/
export class EpubReader implements BaseReader {
async loadData(
file: string,
fs: GenericFileSystem = DEFAULT_FS,
): Promise<Document[]> {
const dataBuffer = (await fs.readFile(file)) as any;
const book = await parseEpub(dataBuffer, {
type: "buffer",
expand: true,
});
const sections = book.sections ?? [];
const header = `${book.info?.author}\n${book.info?.publisher}\n${book.info?.title}`;
const options = this.getOptions();
const results = await Promise.all(
sections.map((section) => {
return new Promise(async (resolve) => {
const parsed = await this.parseContent(section.htmlString, options);
resolve(parsed);
});
}),
m1911star marked this conversation as resolved.
);
return [
new Document({ text: `${header}\n${results.join("\n")}`, id_: file }),
];
}

/**
* Wrapper for string-strip-html usage.
* @param html Raw HTML content to be parsed.
* @param options An object of options for the underlying library
* @see getOptions
* @returns The HTML content, stripped of unwanted tags and attributes
*/
async parseContent(html: string, options: any = {}): Promise<string> {
const { stripHtml } = await import("string-strip-html"); // ESM only
return stripHtml(html, options).result; // pass the options through so getOptions() takes effect
}

/**
* Wrapper for our configuration options passed to string-strip-html library
* @see https://codsen.com/os/string-strip-html/examples
* @returns An object of options for the underlying library
*/
getOptions() {
return {
skipHtmlDecoding: true,
stripTogetherWithTheirContents: [
"script", // default
"style", // default
"xml", // default
"head", // <-- custom-added
],
};
marcusschiesser marked this conversation as resolved.
}
}
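Two details of `loadData` above may be worth a second look: the `new Promise(async (resolve) => ...)` wrapper around a call that already returns a promise, and the header template, which renders a missing metadata field as the literal text `undefined`. The sketch below shows one way both could be handled. It is an illustration under stated assumptions, not the PR's code: `BookInfo`, `Section`, `Strip`, `buildHeader`, and `sectionsToText` are hypothetical names, and `strip` stands in for `parseContent`.

```typescript
// Hypothetical shapes for illustration; the real ones come from @gxl/epub-parser.
type BookInfo = { author?: string; publisher?: string; title?: string };
type Section = { htmlString: string };
type Strip = (html: string) => Promise<string>;

// Build the header from whichever metadata fields are present, so a
// missing field does not show up as the string "undefined".
function buildHeader(info: BookInfo | undefined): string {
  return [info?.author, info?.publisher, info?.title]
    .filter((field): field is string => typeof field === "string" && field.length > 0)
    .join("\n");
}

// Strip all sections concurrently. `strip` already returns a promise,
// so it can be handed to Promise.all directly without an extra wrapper.
async function sectionsToText(
  info: BookInfo | undefined,
  sections: Section[],
  strip: Strip,
): Promise<string> {
  const bodies = await Promise.all(
    sections.map((section) => strip(section.htmlString)),
  );
  const header = buildHeader(info);
  const body = bodies.join("\n");
  return header ? `${header}\n${body}` : body;
}
```

`Promise.all` preserves input order, so the sections come back in reading order even though they are processed concurrently.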
2 changes: 2 additions & 0 deletions packages/core/src/readers/SimpleDirectoryReader.ts
@@ -4,6 +4,7 @@ import { CompleteFileSystem, walk } from "../storage/FileSystem";
import { DEFAULT_FS } from "../storage/constants";
import { PapaCSVReader } from "./CSVReader";
import { DocxReader } from "./DocxReader";
import { EpubReader } from "./EpubReader";
import { HTMLReader } from "./HTMLReader";
import { MarkdownReader } from "./MarkdownReader";
import { PDFReader } from "./PDFReader";
@@ -42,6 +43,7 @@ export const FILE_EXT_TO_READER: Record<string, BaseReader> = {
docx: new DocxReader(),
htm: new HTMLReader(),
html: new HTMLReader(),
epub: new EpubReader(),
};

export type SimpleDirectoryReaderLoadDataProps = {
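For context, an extension table such as `FILE_EXT_TO_READER` is typically consulted by extracting a file's extension and looking it up. The sketch below illustrates that idea only; it is not `SimpleDirectoryReader`'s actual code. `Reader` and `pickReader` are hypothetical names, and lowercasing the extension is an assumption made for this example.

```typescript
// Hypothetical stand-in for the BaseReader interface referenced in the diff.
interface Reader {
  loadData(file: string): Promise<unknown[]>;
}

// Return the reader registered for the file's extension, or undefined
// when the file has no extension or none is registered for it.
function pickReader(
  file: string,
  table: Record<string, Reader>,
): Reader | undefined {
  const dot = file.lastIndexOf(".");
  if (dot < 0 || dot === file.length - 1) return undefined;
  const ext = file.slice(dot + 1).toLowerCase(); // assumption: case-insensitive match
  // Own-property check avoids matching inherited keys such as "toString".
  return Object.prototype.hasOwnProperty.call(table, ext) ? table[ext] : undefined;
}
```

With a table that maps `epub` to a reader instance, `pickReader("wells.epub", table)` returns that instance, while a file with an unregistered extension yields `undefined` and can be skipped or reported by the caller.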
791 changes: 721 additions & 70 deletions pnpm-lock.yaml

Large diffs are not rendered by default.
