Speech-To-GCode

This project was developed during our master's studies at the Kempten University of Applied Sciences in cooperation with the Institute for Data-optimised Manufacturing (IDF).

Project members: Linus Göhl, Quirin Sandt, Benjamin Schober

Introduction

The goal of the project was to create a pipeline that converts spoken language to GCode (e.g. for a CNC milling machine). This requires several components.

In short, the pipeline works as follows:

1. The audio is transcribed to text, which yields the prompt.
2. The prompt is used to generate images with Stable Diffusion.
3. Each image is rated by its aesthetic quality and checked with object detection.
4. The selected image is preprocessed and converted to GCode.
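
The four steps above can be sketched as a chain of stage functions. This is a minimal illustration, not the project's code: `run_pipeline` and the stage callables are hypothetical names, and the placeholder stages in the usage part only return dummy values instead of calling any model.

```python
from typing import Callable, List, Sequence


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_images: Callable[[str], Sequence[str]],
    rate_image: Callable[[str, str], float],
    to_gcode: Callable[[str], List[str]],
) -> List[str]:
    """Chain the four stages: audio -> prompt -> images -> best image -> GCode."""
    prompt = transcribe(audio)          # step 1: speech to text
    images = generate_images(prompt)    # step 2: text to candidate images
    if not images:
        raise ValueError("image generation returned no candidates")
    # step 3: keep the candidate with the highest rating
    best = max(images, key=lambda img: rate_image(img, prompt))
    return to_gcode(best)               # step 4: image to GCode


# Usage with placeholder stages (real stages would call the models).
gcode = run_pipeline(
    b"...",
    transcribe=lambda audio: "a cat",
    generate_images=lambda prompt: ["img_a", "img_b"],
    rate_image=lambda img, prompt: 1.0 if img == "img_b" else 0.5,
    to_gcode=lambda img: [f"; source: {img}", "G21", "G90"],
)
```

Passing the stages in as callables keeps the orchestration independent of where each stage runs (GPU cluster or local machine).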

Below is more detailed information about the specific pipeline parts, models and technologies used.

Pipelines

Most of the pipeline components are deployed within a Docker container running on a GPU cluster. The pipelines are accessed through a REST API.
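
A client call against the REST API might look like the sketch below. The base URL, the JSON field name `text` and the helper `build_translate_request` are assumptions for illustration; only the `/api/translate` path is taken from this README. The request is built but not sent, so the payload should be adapted to what the server actually expects.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: host and port of the deployed container


def build_translate_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /api/translate (the payload shape is assumed)."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/translate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_translate_request("Eine Katze auf einem Stuhl")
# To send it: urllib.request.urlopen(request).read()
```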

Text processing

Models and technologies used:

| Model/Technology | Description | Links |
| --- | --- | --- |
| `openai/whisper-large-v2` | Speech recognition model (ASR) | OpenAI GitHub, HuggingFace Model, Paper |
| `Helsinki-NLP/opus-mt-de-en` | Translation model (German to English) | Helsinki-NLP GitHub, HuggingFace Model |
| `NLTK` | Natural Language Toolkit, used for keyword/noun extraction | NLTK GitHub, NLTK Website |

Because the pipeline is accessed through a REST API, all functional parts are implemented in the class `TextPipeline`. When the pipeline is deployed, one instance of the class is created and the models are loaded into VRAM. The pipeline consists of multiple models and parts, which are exposed through the following endpoints and functions:

| Endpoint | Description |
| --- | --- |
| `/api/transcribe` | Transcribes the audio file to text (executes the `transcribe`, `translate` and `extract_nouns` functions). |
| `/api/translate` | Translates the text to English (executes the `translate` and `extract_nouns` functions). |
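
The relationship between endpoints and functions in the table can be sketched as follows. The method names `transcribe`, `translate` and `extract_nouns` come from the table; the constructor arguments, the `handle_*` method names and the returned dictionary are assumptions. The models are replaced by injected callables so that the sketch runs without any model weights.

```python
from typing import Callable, Dict, List


class TextPipeline:
    """Sketch: one instance is created at startup and reused by all requests."""

    def __init__(
        self,
        asr: Callable[[bytes], str],
        translator: Callable[[str], str],
        noun_extractor: Callable[[str], List[str]],
    ) -> None:
        # In a real deployment these would be models loaded into VRAM once.
        self._asr = asr
        self._translator = translator
        self._noun_extractor = noun_extractor

    def transcribe(self, audio: bytes) -> str:
        return self._asr(audio)

    def translate(self, text: str) -> str:
        return self._translator(text)

    def extract_nouns(self, text: str) -> List[str]:
        return self._noun_extractor(text)

    def handle_translate(self, text: str) -> Dict[str, object]:
        """/api/translate: translate, then extract nouns."""
        english = self.translate(text)
        return {"prompt": english, "nouns": self.extract_nouns(english)}

    def handle_transcribe(self, audio: bytes) -> Dict[str, object]:
        """/api/transcribe: transcribe, then reuse the translate path."""
        return self.handle_translate(self.transcribe(audio))


# Placeholder callables stand in for Whisper, opus-mt-de-en and NLTK.
pipeline = TextPipeline(
    asr=lambda audio: "eine Katze",
    translator=lambda text: "a cat",
    noun_extractor=lambda text: [w for w in text.split() if w == "cat"],
)
result = pipeline.handle_transcribe(b"")
```

Routing `/api/transcribe` through the same path as `/api/translate` mirrors the table: the first endpoint runs all three functions, the second only the last two.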

Image creation and rating

| Model/Technology | Description | Links |
| --- | --- | --- |
| `stabilityai/stable-diffusion-2-1-base` | Image generation model | HuggingFace Model, Paper |
| `LAION-Aesthetics_Predictor V1` | Image rating model | GitHub, Paper |
| `Grounding DINO` | Object detection model | GitHub |
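
One way to combine the two rating signals when picking an image is sketched below. How the project weights the aesthetic score against the detection result is not stated here, so the rule used (prefer images in which every prompt noun was detected, then take the highest aesthetic score) and all names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Candidate:
    name: str
    aesthetic_score: float      # e.g. output of an aesthetics predictor
    detected_labels: Set[str]   # e.g. labels returned by an object detector


def select_image(candidates: List[Candidate], nouns: Set[str]) -> Optional[Candidate]:
    """Prefer images that contain every requested noun, then maximise the score."""
    matching = [c for c in candidates if nouns <= c.detected_labels]
    pool = matching or candidates  # fall back to all candidates if none match
    return max(pool, key=lambda c: c.aesthetic_score, default=None)


candidates = [
    Candidate("img_0", 6.1, {"dog"}),
    Candidate("img_1", 5.4, {"cat", "chair"}),
    Candidate("img_2", 5.9, {"cat"}),
]
best = select_image(candidates, {"cat"})
```

In this example `img_0` has the highest aesthetic score but does not show the requested object, so it is only considered if no candidate matches.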

Image preprocessing and GCode generation

Note: This pipeline component is not deployed within a Docker container; it runs on the local machine.
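
As an illustration of the final step, the sketch below turns contours (lists of XY points in millimetres) into GCode moves. Extracting contours from the image is omitted, and the function name as well as the cut depth, feed rate and safe height are assumed values, not the project's settings.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def contours_to_gcode(
    contours: Sequence[Sequence[Point]],
    cut_depth: float = -1.0,   # assumed Z depth in mm
    safe_z: float = 5.0,       # assumed retract height in mm
    feed: float = 300.0,       # assumed feed rate in mm/min
) -> List[str]:
    """Emit one rapid move to each contour start, then linear cuts along it."""
    lines = ["G21", "G90", f"G0 Z{safe_z:.3f}"]  # millimetres, absolute coordinates
    for contour in contours:
        if len(contour) < 2:
            continue  # a single point cannot be cut as a path
        x0, y0 = contour[0]
        lines.append(f"G0 X{x0:.3f} Y{y0:.3f}")           # rapid move to start
        lines.append(f"G1 Z{cut_depth:.3f} F{feed:.0f}")  # plunge into material
        for x, y in contour[1:]:
            lines.append(f"G1 X{x:.3f} Y{y:.3f} F{feed:.0f}")
        lines.append(f"G0 Z{safe_z:.3f}")                 # retract
    lines.append("M2")  # end of program
    return lines


square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (0.0, 0.0)]
program = contours_to_gcode([square])
```

Retracting to a safe height between contours avoids dragging the tool through the material while travelling to the next path.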
