Draft of the Storage Scheme for the Email Spider #43

Open
alexsunxl opened this issue Aug 31, 2023 · 4 comments

Comments

@alexsunxl
Contributor

Configuration File Path

The configuration file for the email scraping program is located at rootfs/email/config.toml.

The configuration file includes the following fields:

  • EMAIL_IMAP_SERVER: This field is for the IMAP server of your email service. For example, "imap.gmail.com".
  • EMAIL_ADDRESS: This field is for the email address that you want to scrape. Please replace with your own email address.
  • EMAIL_PASSWORD: This field is for the password of your email account. Please replace with your own password.
  • EMAIL_IMAP_PORT: This field is for the port number of your IMAP server. For Gmail, this is typically 993.
  • LOCAL_DIR: This field is for the local directory where you want to store the scraped emails. For example, 'rootfs/data'.

Please note that you should keep your email address and password confidential and ensure they are securely stored.
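The fields listed above could be validated right after loading, so that a missing entry fails early. The following is only a sketch: the `SpiderConfig` class and `load_config` helper are hypothetical names, and it assumes the five keys sit at the top level of config.toml, which the draft does not state.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping

# Field names taken from the list above.
REQUIRED_FIELDS = (
    "EMAIL_IMAP_SERVER",
    "EMAIL_ADDRESS",
    "EMAIL_PASSWORD",
    "EMAIL_IMAP_PORT",
    "LOCAL_DIR",
)


@dataclass(frozen=True)
class SpiderConfig:
    """Hypothetical typed view of the spider configuration."""

    imap_server: str
    address: str
    password: str
    imap_port: int
    local_dir: Path

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "SpiderConfig":
        # Report every missing key at once instead of failing on the first.
        missing = [key for key in REQUIRED_FIELDS if key not in raw]
        if missing:
            raise ValueError(f"missing config fields: {', '.join(missing)}")
        return cls(
            imap_server=str(raw["EMAIL_IMAP_SERVER"]),
            address=str(raw["EMAIL_ADDRESS"]),
            password=str(raw["EMAIL_PASSWORD"]),
            imap_port=int(raw["EMAIL_IMAP_PORT"]),
            local_dir=Path(raw["LOCAL_DIR"]),
        )


def load_config(path: str = "rootfs/email/config.toml") -> SpiderConfig:
    # tomllib is in the standard library from Python 3.11 onwards.
    import tomllib

    with open(path, "rb") as handle:
        return SpiderConfig.from_mapping(tomllib.load(handle))
```

Keeping the password in a file that is excluded from version control, or reading it from an environment variable instead, would fit the confidentiality note above.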


Title: File Organization and Storage Scheme for Email Scraping Program

File Storage Path

The scraped email files are stored in the directory rootfs/data/[email protected]/. The base directory can be changed through the LOCAL_DIR field.

Creation of Email Folders

An MD5 hash is computed from each email's name and time. This hash is then used as the name of a dedicated folder that stores the corresponding email content.
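As a rough illustration of the folder naming, the sketch below hashes the name and time with `hashlib.md5`. The `email_folder` function is a hypothetical helper, and the way the two values are combined (joined with a newline, UTF-8 encoded) is an assumption; the draft does not specify it.

```python
import hashlib
from pathlib import Path


def email_folder(local_dir: str, address: str, name: str, time: str) -> Path:
    """Return <local_dir>/<address>/<md5 hex digest> for one email."""
    # Assumption: name and time are joined with a newline before hashing.
    key = f"{name}\n{time}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return Path(local_dir) / address / digest


# Example with placeholder values:
folder = email_folder(
    "rootfs/data", "user@example.com", "Weekly report", "2023-08-31T10:00:00"
)
print(folder.name)  # a 32-character hexadecimal folder name
```

Two emails with the same name and the same time would map to the same folder, so a real implementation might also mix in the Message-ID header.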

Email Content Storage

Within each email's folder, we create two files to store the main information of the email:

  • email.txt: This file stores the body content of the email.
  • meta.json: This file stores the header information of the email.

In addition, this folder can also be used to store attachments, images, and other files related to the email.
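One possible way to produce the two files uses only the standard `email` and `json` modules. This is a sketch under assumptions: `store_email` is a hypothetical helper, and the layout of meta.json (a flat header-name to value object) is a guess, since the draft does not define a schema.

```python
import json
from email import message_from_bytes, policy
from pathlib import Path


def store_email(raw: bytes, folder: Path) -> None:
    """Write email.txt (body) and meta.json (headers) into folder."""
    msg = message_from_bytes(raw, policy=policy.default)

    # Prefer the plain-text body, fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    # Assumption: a flat mapping is enough. Headers that repeat
    # (for example Received) keep only their last value here.
    headers = {name: str(value) for name, value in msg.items()}

    folder.mkdir(parents=True, exist_ok=True)
    (folder / "email.txt").write_text(body, encoding="utf-8")
    (folder / "meta.json").write_text(
        json.dumps(headers, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Because the file names are fixed, any later processing step can open email.txt and meta.json without inspecting the folder first.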

The above is the file organization and storage scheme for our email scraping program. We welcome your feedback and suggestions so that we can continuously optimize and improve this scheme.

@alexsunxl
Contributor Author

Maybe look like this:

├── data
│   └── [email protected]
│       └── 5de3e52f3a6b90cabe6cbdd4ae3a5c5b
│           ├── email.txt
│           └── meta.json

alexsunxl added a commit to alexsunxl/OpenDAN-Personal-AI-OS that referenced this issue Aug 31, 2023
waterflier added a commit that referenced this issue Sep 1, 2023
Add a service: email spider  #43
@lurenpluto
Contributor

lurenpluto commented Sep 11, 2023

Each email is stored in its own directory. The names of the files inside need to be fixed so that we can use a fixed builder to process every email. In addition, images, video, voice, and other media in the mail should be stored in separate directories for easy parsing.

A complete directory structure might look like the one shown below:

├── email.txt
├── meta.json
├── image
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
├── video
│   ├── video1.mp4
│   ├── video2.mv
│   └── ...
└── audio
    ├── audio1.m4a
    ├── audio2.flac
    └── ...
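A layout like this could be filled by routing each attachment on its MIME main type. The sketch below is one way to do that with the standard `email` package; `save_media` and `MEDIA_DIRS` are hypothetical names, and skipping parts that have no filename is an assumption.

```python
from email.message import EmailMessage
from pathlib import Path

# Subdirectory names from the structure above.
MEDIA_DIRS = {"image", "video", "audio"}


def save_media(msg: EmailMessage, folder: Path) -> list:
    """Save image/video/audio attachments into per-type subdirectories."""
    saved = []
    for part in msg.iter_attachments():
        maintype = part.get_content_maintype()
        filename = part.get_filename()
        if maintype not in MEDIA_DIRS or not filename:
            continue  # assumption: other parts are handled elsewhere
        target_dir = folder / maintype
        target_dir.mkdir(parents=True, exist_ok=True)
        # Use only the final path component so a crafted filename
        # cannot escape the target directory.
        target = target_dir / Path(filename).name
        target.write_bytes(part.get_payload(decode=True))
        saved.append(target)
    return saved
```

Routing on the declared content type rather than the file extension avoids depending on the platform's extension table, at the cost of trusting the sender's headers.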

@alexsunxl
Contributor Author

It might be better to distinguish between images in email attachments and images in the body by placing them in different folders.

What do you think?
@waterflier @lurenpluto
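For reference, body images and attachment images can usually be told apart by the part's Content-Disposition and Content-ID headers. A minimal sketch, where `image_bucket` and the two folder names are hypothetical:

```python
from email.message import EmailMessage
from typing import Optional


def image_bucket(part: EmailMessage) -> Optional[str]:
    """Return a folder name for an image part, or None for non-images."""
    if part.get_content_maintype() != "image":
        return None
    # Images embedded in the body are normally marked inline and/or
    # carry a Content-ID that the HTML body references through cid: URLs.
    if part.get_content_disposition() == "inline" or part.get("Content-ID"):
        return "body_images"
    return "attachment_images"
```

Some mail clients set these headers inconsistently, so this heuristic is a starting point rather than a guarantee.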

@waterflier
Collaborator

To align with mental models, I suggest that we adopt a structure where each directory corresponds to a single email. As for attachments, I believe there is no need to store them in separate directories. Typically, the number of attachments for a single email isn't excessive, so a separate directory may not be necessary.

From the perspective of Named Data Networking (NDN), we can store all videos and images by their respective hashes. We can then reference these existing files in the email directory using soft links. This approach should provide an efficient and intuitive way to manage our data.
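The hash-plus-soft-link idea might look roughly like the sketch below. `link_attachment` is a hypothetical helper, and SHA-256 is an assumed choice because the comment does not name a hash function.

```python
import hashlib
from pathlib import Path


def link_attachment(
    data: bytes, store_dir: Path, email_dir: Path, filename: str
) -> Path:
    """Store data once under its hash and symlink it into email_dir."""
    # Assumption: SHA-256 as the content hash.
    digest = hashlib.sha256(data).hexdigest()

    store_dir.mkdir(parents=True, exist_ok=True)
    blob = store_dir / digest
    if not blob.exists():
        blob.write_bytes(data)  # identical content is written only once

    email_dir.mkdir(parents=True, exist_ok=True)
    link = email_dir / filename
    if link.is_symlink() or link.exists():
        link.unlink()  # replace a stale entry from an earlier run
    link.symlink_to(blob.resolve())
    return link
```

With this approach, the same file attached to several emails occupies disk space once, while each email directory still shows it under its original name. Creating symbolic links may need extra privileges on Windows.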

photosssa pushed a commit to photosssa/OpenDAN-Personal-AI-OS that referenced this issue Sep 19, 2023
photosssa pushed a commit to photosssa/OpenDAN-Personal-AI-OS that referenced this issue Sep 20, 2023
photosssa pushed a commit to photosssa/OpenDAN-Personal-AI-OS that referenced this issue Sep 21, 2023