Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48231][BUILD] Remove unused CodeHaus Jackson dependencies #46521

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

pan3793
Copy link
Member

@pan3793 pan3793 commented May 10, 2024

What changes were proposed in this pull request?

CodeHaus Jackson dependencies were pulled from Hive, while in apache/hive#4564 (Hive 2.3.10), it migrated to Jackson 2.x, so we can remove them from Spark now.

Why are the changes needed?

Remove unused and vulnerable dependencies.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GA.

Was this patch authored or co-authored using generative AI tooling?

No.

@pan3793
Copy link
Member Author

pan3793 commented May 10, 2024

cc @dongjoon-hyun and @wangyum

<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>${codehaus.jackson.version}</version>
<scope>${hive.jackson.scope}</scope>
Copy link
Member

@wangyum wangyum May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also remove <hive.jackson.scope>compile</hive.jackson.scope>?

spark/pom.xml

Line 270 in 44f00cc

<hive.jackson.scope>compile</hive.jackson.scope>

<hive.jackson.scope>provided</hive.jackson.scope>

https://github.com/apache/spark/blob/master/assembly/pom.xml#L272-L277

Copy link
Member Author

@pan3793 pan3793 May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we identify some issues on hive 2.3.10 before 4.0.0 release, we may need to revert this patch and fallback to SPARK-47119 approach to mitigate CodeHaus Jackson dependencies vulnerabilities, see comemnts at
#45201 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, sorry for making things difficult, @pan3793 and @wangyum .

If we are sure, we can clean up later more easily definitely.

dongjoon-hyun
dongjoon-hyun previously approved these changes May 10, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs)

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 10, 2024

I have one separate question (or request), @wangyum and @pan3793 .

When we use Isolated Class Loader to use old Hive client and old version HMS server and Hive UDF (requires Jackson).

Will everything work without any issue? Do you think we can have a test coverage for that?

@pan3793
Copy link
Member Author

pan3793 commented May 10, 2024

I think the case is already covered by CI.

When IsolatedClassLoader is enabled, the Hive.get should pull and load CodeHaus Jackson classes from IsolatedClassLoader.

  /**
   * Initialize Hive through Configuration.
   * First try to use getWithoutRegisterFns to initialize to avoid loading all functions,
   * if there is no such method, fallback to Hive.get.
   */
  def getHive(conf: Configuration): Hive = {
    val hiveConf = conf match {
      case hiveConf: HiveConf =>
        hiveConf
      case _ =>
        new HiveConf(conf, classOf[HiveConf])
    }
    try {
      Hive.getWithoutRegisterFns(hiveConf)
    } catch {
      // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but
      // 2.3.8 don't), therefore here we fallback when encountering the exception.
      case _: NoSuchMethodError =>
        Hive.get(hiveConf)
    }

@dongjoon-hyun
Copy link
Member

Ya, I know that part, but do we have an end-to-end Hive UDF registration and invocation test case?

@pan3793
Copy link
Member Author

pan3793 commented May 10, 2024

@dongjoon-hyun AFAIK, the "Hive UDF execution" always uses built-in Hive jars without IsolatedClassLoader. While "Hive UDF registration" will happen during Hive.get(hiveConf) with IsolatedClassLoader on constructing HMS client.

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun AFAIK, the "Hive UDF execution" always uses built-in Hive jars without IsolatedClassLoader. While "Hive UDF registration" will happen during Hive.get(hiveConf) with IsolatedClassLoader on constructing HMS client.

It sounds like that we could have a corner case. That's the reason why we need an actual test case to cover it, isn't it?

@dongjoon-hyun dongjoon-hyun dismissed their stale review May 10, 2024 08:12

Stale review.

@dongjoon-hyun
Copy link
Member

For this one PR, I believe we need a verification for different HMS versions to make it sure.

@pan3793
Copy link
Member Author

pan3793 commented May 10, 2024

Hmm, let me clear my view.

In short, I think the current CI is sufficient.

Spark uses Hive in two cases:

  1. As an HMS client. To support different HMS versions, it allows to use IsolatedClassLoder to load a different Hive class. It calls Hive.getWithoutRegisterFns(hiveConf) or Hive.get(hiveConf) to create the Hive instance, and there is a chance to trigger the Hive built-in UDF registration, for older Hive, e.g. 2.1.1, some built-in Hive UDF may trigger CodeHaus Jackson classes loading.

  2. As an execution library. Spark always used the built-in Hive jars to read/write Hive tables, execute Hive UDFs.

For case 1, the CI already covers that(any older HMS client initialization triggers built-in UDF registration). For case 2, there is no chance to invoke CodeHaus Jackson classes since Hive 2.3.10 totally removed it in the codebase.

@pan3793
Copy link
Member Author

pan3793 commented May 10, 2024

also cc @wangyum @yaooqinn @AngersZhuuuu @cloud-fan

@pan3793
Copy link
Member Author

pan3793 commented May 10, 2024

For this one PR, I believe we need a verification for different HMS versions to make it sure.

that's a valid concern, since Spark CI only covers embedded HMS client case, let me test it with the real setup.

@dongjoon-hyun
Copy link
Member

Thank you. Please attach the test results to the PR description.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please hold on all Hive related dependency change until we recover Maven CIs.

#46468 (review)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants