Extensible Markup Language (XML) is a markup language for storing and exchanging structured data. XML files are a common primary source for an ETL pipeline. ETL stands for extract, transform, load: a three-phase process in which data is extracted from a source, transformed (cleaned, sanitized, scrubbed), and loaded into a new data container (e.g. a database).
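To make the three phases concrete, here is a minimal sketch that uses only the Python standard library rather than Spark. The sample XML, the `people` table, and the function names `extract`, `transform`, and `load` are all assumptions made for illustration; a real pipeline would read from files and write to a persistent store.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample input; a real pipeline would read this from a file.
XML_SOURCE = """
<people>
  <person><name>  Ada Lovelace </name><age>36</age></person>
  <person><name>Alan Turing</name><age>41</age></person>
</people>
"""


def extract(xml_text):
    """Extract: parse the XML and yield one raw dict per <person> element."""
    root = ET.fromstring(xml_text.strip())
    for person in root.findall("person"):
        yield {"name": person.findtext("name"), "age": person.findtext("age")}


def transform(records):
    """Transform: trim stray whitespace and cast the age to an integer."""
    for record in records:
        yield (record["name"].strip(), int(record["age"]))


def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", records)
    conn.commit()


# An in-memory SQLite database stands in for the target data container.
conn = sqlite3.connect(":memory:")
load(transform(extract(XML_SOURCE)), conn)
rows = conn.execute("SELECT name, age FROM people ORDER BY name").fetchall()
print(rows)
```

Each phase is a separate function, so the source format or the target store can be swapped without touching the other two steps.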
PySpark is the Python API for Apache Spark, an open-source distributed computing framework and set of libraries for real-time, large-scale data processing.