Data source: data/geo.csv
Code:
import org.apache.sedona.sql.utils.SedonaSQLRegistrator;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Create the SparkConf object
SparkConf conf = new SparkConf().setAppName("GeoSparkExample").setMaster("local[*]");
// Create the SparkSession
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
// Register the SedonaSQL functions
SedonaSQLRegistrator.registerAll(spark);
Dataset<Row> rawDf = spark.read().format("csv")
        .option("header", "true")      // use the first row as column names
        .option("inferSchema", "true") // infer column data types
        .option("delimiter", ",")      // column delimiter; comma is the default
        // NOTE: the polygon in the last row is not closed
        .load("data/geo.csv");         // file location
rawDf.createOrReplaceTempView("rawdf");
rawDf.show();
System.out.println("======================================");
spark.sql("DESCRIBE rawdf").show();
spark.sql("DESCRIBE rawdf polygon").show();
// Convert the string column to Sedona's Geometry type
Dataset<Row> result = spark.sql("SELECT ST_GeomFromWKT(regexp_replace(polygon, '\"', '')) AS geo FROM rawdf");
result.createOrReplaceTempView("test");
// Show the query result
result.show();
Log output:
23/07/21 15:57:02 ERROR FormatUtils: [Sedona] Points of LinearRing do not form a closed linestring
23/07/21 15:57:02 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 4)
java.lang.NullPointerException
at org.apache.sedona.sql.utils.GeometrySerializer$.getDimension(GeometrySerializer.scala:53)
at org.apache.sedona.sql.utils.GeometrySerializer$.serialize(GeometrySerializer.scala:37)
at org.apache.spark.sql.sedona_sql.expressions.implicits$GeometryEnhancer.toGenericArrayData(implicits.scala:79)
at org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT.eval(Constructors.scala:182)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:256)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:858)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:858)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
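For context, the rule behind the `Points of LinearRing do not form a closed linestring` error can be sketched in plain Java: a polygon ring is closed only when its first and last coordinates coincide. This is a simplified stand-in for the check JTS performs during WKT parsing, not Sedona's actual code; `isRingClosed` is a hypothetical helper:

```java
import java.util.Arrays;
import java.util.List;

public class RingClosureCheck {
    // A ring is closed when its first and last coordinates are identical.
    static boolean isRingClosed(List<double[]> ring) {
        if (ring.size() < 4) return false; // a valid ring needs at least 4 points
        double[] first = ring.get(0);
        double[] last = ring.get(ring.size() - 1);
        return first[0] == last[0] && first[1] == last[1];
    }

    public static void main(String[] args) {
        // Closed ring: the last point repeats the first
        List<double[]> closed = Arrays.asList(
                new double[]{0, 0}, new double[]{1, 0},
                new double[]{1, 1}, new double[]{0, 0});
        // Open ring: the last point differs -> this is what triggers the error
        List<double[]> open = Arrays.asList(
                new double[]{0, 0}, new double[]{1, 0},
                new double[]{1, 1}, new double[]{0, 1});
        System.out.println(isRingClosed(closed)); // true
        System.out.println(isRingClosed(open));   // false
    }
}
```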
Based on my experience, the analysis is: the exception is caused by the original polygon's ring not being closed (the first and last points differ). How can a function such as ST_IsValid be used to filter out this kind of data so the program does not throw an error?
Answer: filter with ST_IsClosed(ST_GeomFromText(t.pg1)) in the WHERE clause; ST_IsClosed returns true when a geometry's start and end points coincide. Reference:
https://sedona.apache.org/1.3.1-incubating/api/flink/Function/#st_isclosed
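Applying the answer to the example above, the filter would go into a WHERE clause so that only rows with closed, parseable rings reach the result. A minimal sketch, assuming the same `rawdf` view and quoted `polygon` column from the code above (whether ST_IsClosed is available in your Sedona Spark SQL version should be checked against its docs; `buildQuery` is a hypothetical helper used here only to show the SQL):

```java
public class ClosedRingFilterQuery {
    // Builds the filtered query: rows whose polygon fails parsing or whose
    // ring is not closed are dropped instead of crashing the job.
    static String buildQuery() {
        return "SELECT ST_GeomFromWKT(regexp_replace(polygon, '\"', '')) AS geo "
             + "FROM rawdf "
             + "WHERE ST_IsClosed(ST_GeomFromWKT(regexp_replace(polygon, '\"', '')))";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery());
    }
}
```

This would be run as `spark.sql(ClosedRingFilterQuery.buildQuery())`. Note that the sketch parses each polygon twice; wrapping the ST_GeomFromWKT call in a subquery so it is evaluated once is a common refinement.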