Repository Name Finder

This script extracts repository names from a text using natural language processing (NLP) techniques. The script first loads a specific NLP model (specified by your_NLP_model) using the 'spacy' library. Then, it processes a given text (e.g., 'I like to save my data in Odum Institute Archive Dataverse') and searches for repository names in it.

The script has three different methods to search for repository names in the text:

1. Token-based search: This method looks for specific tokens (e.g., 'SURFsara', 'DRYAD') in the text. If a matching token is found, the method returns 'True'. Otherwise, it returns 'False'.

2. Pattern-based search: This method uses a pre-defined pattern (e.g., [{'ORTH': 'Odum Institute Archive Dataverse'}]) to search for repository names in the text. It returns a list of matches.

3. PDF-based search: This method reads a series of PDF files from a specified directory ('files_path') and extracts text from them using the 'PyMuPDF' library ('fitz'). Then, it processes the extracted text using the previously loaded NLP model and extracts entities (i.e., named entities) from it. The method stores the extracted entities and their labels (i.e., entity types) in a Pandas dataframe, and then creates a pivot table to summarize the data. Finally, it exports the pivot table as an Excel file ('DMP_info.xlsx').
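
The first two methods can be sketched as follows. This is a minimal illustration, not the script itself: it assumes spaCy v3, uses a blank English pipeline (tokenizer only) in place of your_NLP_model, and the names has_repo_token and REPOSITORY are made up for the example. Note that a spaCy Matcher pattern describes one token per dictionary, so the sketch splits the multi-word repository name into one 'ORTH' entry per word.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline stands in for your_NLP_model.
nlp = spacy.blank("en")
doc = nlp("I like to save my data in Odum Institute Archive Dataverse")


def has_repo_token(doc, names=("SURFsara", "DRYAD")):
    """Token-based search: True if any token is one of the given names."""
    return any(token.text in names for token in doc)


# Pattern-based search: one {'ORTH': ...} dictionary per token of the name.
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH": word} for word in "Odum Institute Archive Dataverse".split()]
matcher.add("REPOSITORY", [pattern])  # "REPOSITORY" is a hypothetical rule name

# Each match is (match_id, start, end); slice the doc to recover the text.
matches = [doc[start:end].text for _, start, end in matcher(doc)]

print(has_repo_token(doc))
print(matches)
```

For the example sentence, the token-based check finds neither 'SURFsara' nor 'DRYAD', while the pattern-based search finds the repository name once.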

Dependencies

'spacy'

'pandas'

'PyMuPDF' ('fitz')

Usage

1. Install the dependencies.
2. Replace 'your_NLP_model' with the desired NLP model.
3. Specify the path to the PDF files in files_path.
4. Run the script.

The pivot table will be saved as an Excel file in the current working directory.
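
The summarizing step can be illustrated with pandas alone. The sketch below is one plausible layout, not necessarily the one the script produces: the entity/label pairs are hand-written stand-ins for the output of the NLP model, and the 'Count' column is an assumption added for the example.

```python
import pandas as pd

# Hypothetical (entity, label) pairs standing in for the model's entities.
records = [
    ("Odum Institute Archive Dataverse", "REPOSITORY"),
    ("DRYAD", "REPOSITORY"),
    ("SURFsara", "ORG"),
    ("DRYAD", "REPOSITORY"),  # a repeated mention
]
df = pd.DataFrame(records, columns=["Entities", "Labels"])
df["Count"] = 1  # assumption: one row per mention

# Summarize how often each entity appears under each label.
pivot = df.pivot_table(index=["Labels", "Entities"], values="Count", aggfunc="sum")
print(pivot)

# Exporting needs an Excel writer such as openpyxl to be installed:
# pivot.to_excel("DMP_info.xlsx")
```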

ChatGPT's correction of the code

It seems like this code is trying to extract repository names (such as 'Odum Institute Archive Dataverse') from a text using spaCy and a combination of tokenization, pattern matching, and PDF processing.

There are a few issues with this code:

The first line where it loads the spaCy model should specify the model name. For example: nlp = spacy.load('en_core_web_sm')

The function has_SURFsara_token() checks for the token 'SURFsara' or 'DRYAD', but it is not used in the code.

The variable files_path is not defined. You should provide the path of the folder containing the PDF files.

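The line doc=fitz.open(files) should be doc=fitz.open(files).pages(). In PyMuPDF, pages is a method and must be called; the Document returned by fitz.open can also be iterated over directly to visit its pages.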

The variable repo_model is not defined. You should replace it with nlp.

The line df.drop_duplicates(subset=['Entities','Labels'], keep="first", inplace=True) keeps only the first occurrence of each (Entities, Labels) pair and removes every later repeat. Check that this is the behavior you want, because information about repeated mentions is lost.
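
A small example makes the effect of keep="first" concrete. The rows below are invented for illustration; only the column names 'Entities' and 'Labels' come from the script.

```python
import pandas as pd

# Invented rows: 'DRYAD' appears twice as REPOSITORY and once as ORG.
df = pd.DataFrame(
    {
        "Entities": ["DRYAD", "DRYAD", "DRYAD", "SURFsara"],
        "Labels": ["REPOSITORY", "REPOSITORY", "ORG", "ORG"],
    }
)

# Rows count as duplicates only if BOTH columns match an earlier row.
deduped = df.drop_duplicates(subset=["Entities", "Labels"], keep="first")
print(deduped)
```

Only the second ('DRYAD', 'REPOSITORY') row is dropped. The ('DRYAD', 'ORG') row stays because its label differs.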

The library glob is not imported. You should add import glob at the top.
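
As a sketch of how glob could be used to collect the PDF files, assuming files_path names a folder (the helper list_pdf_files and the file names are hypothetical):

```python
import glob
import os
import tempfile


def list_pdf_files(files_path):
    """Return the sorted paths of all .pdf files directly inside files_path."""
    return sorted(glob.glob(os.path.join(files_path, "*.pdf")))


# Demonstration with a temporary folder holding two PDFs and one other file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("dmp_b.pdf", "dmp_a.pdf", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(path) for path in list_pdf_files(folder)]

print(found)
```

Each returned path can then be passed to fitz.open in the PDF-based search.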


RaThorat/repository-name-finder
