Apache Beam Technical Documentation

2024-12-23 02:49:19 Author: 沈韬淼Beryl

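1. Installation Guide

Java SDK Installation

Manage the dependency with Maven (replace {{latest_version}} with the Beam release you want to use):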

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>{{latest_version}}</version>
</dependency>

Manage the dependency with Gradle:

dependencies {
  implementation 'org.apache.beam:beam-sdks-java-core:{{latest_version}}'
}

Python SDK Installation

Install with pip:

pip install apache-beam

Go SDK Installation

Install with go get (the core package lives under go/pkg/beam in the SDK module):

go get -u github.com/apache/beam/sdks/v2/go/pkg/beam

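2. Usage

Apache Beam is a unified model for defining batch and streaming data-parallel processing pipelines. It also provides a set of language-specific SDKs for building pipelines, and Runners that execute them on distributed processing backends.

Creating a Pipeline

In Java, a simple word-count pipeline looks like this: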

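import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.values.*;

public class MyPipeline {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Read lines, split them on non-word characters, and count each distinct word.
    PCollection<String> lines = p.apply(TextIO.read().from("input.txt"));
    PCollection<String> words = lines.apply(Regex.split("\\W+"));
    PCollection<KV<String, Long>> wordCounts = words.apply(Count.perElement());

    // Format each pair as "word: count". "output.txt" is a filename prefix;
    // the results are written to one or more sharded files.
    wordCounts
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply(TextIO.write().to("output.txt"));

    p.run().waitUntilFinish();
  }
}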

In Python, the same pipeline looks like this:

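import apache_beam as beam

def split_words(text):
    # Return every whitespace-separated word in one line of text.
    return text.split()

def format_count(word_count):
    # Turn a (word, count) pair into one line of output text.
    word, count = word_count
    return f'{word}: {count}'

with beam.Pipeline() as p:
    lines = p | 'ReadLines' >> beam.io.ReadFromText('input.txt')
    # FlatMap emits one element per word rather than one list per line.
    words = lines | 'SplitWords' >> beam.FlatMap(split_words)
    word_counts = words | 'CountWords' >> beam.combiners.Count.PerElement()
    # 'output.txt' is a filename prefix; results go to sharded files.
    (word_counts
     | 'FormatCounts' >> beam.Map(format_count)
     | 'WriteCounts' >> beam.io.WriteToText('output.txt'))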
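To check what the word-count pipelines above are meant to compute, the following plain-Python sketch reproduces the split, count, and format steps on a small in-memory input. It deliberately does not use the Beam API; the function names and the sample lines are illustrative assumptions, and the sorting is added only to make the result stable (Beam itself does not guarantee output order).

```python
import re
from collections import Counter


def split_words(line):
    """Split on runs of non-word characters and drop empty strings."""
    return [word for word in re.split(r"\W+", line) if word]


def word_count(lines):
    """Flatten all lines into words, then count each distinct word."""
    return Counter(word for line in lines for word in split_words(line))


def format_counts(counts):
    """Render one 'word: count' line per word, sorted for stable output."""
    return [f"{word}: {count}" for word, count in sorted(counts.items())]


# Hypothetical sample input standing in for the contents of input.txt.
sample_lines = ["to be, or not to be", "that is the question"]
formatted = format_counts(word_count(sample_lines))
```

Here `formatted` begins with `"be: 2"` and ends with `"to: 2"`; comparing such a hand-computed result with a pipeline's output files is a quick sanity check on a small input.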

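Running the Pipeline

Run the pipeline on the local machine with the DirectRunner. The commands below assume the build produces an executable JAR that bundles the pipeline's dependencies, including the beam-runners-direct-java artifact:

mvn package
java -jar target/MyPipeline-1.0-SNAPSHOT.jar --runner=DirectRunner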

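Run the pipeline on Google Cloud Dataflow with the DataflowRunner. This requires the beam-runners-google-cloud-dataflow-java dependency, and Dataflow jobs typically need further options such as --project, --region, and --tempLocation:

mvn package
java -jar target/MyPipeline-1.0-SNAPSHOT.jar --runner=DataflowRunner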

3. API Reference

Apache Beam provides a rich API for building and running pipelines. The most commonly used abstractions are listed below:

PCollection

A PCollection represents a dataset, which may be bounded (finite) or unbounded (infinite).

PCollection<String> lines = p.apply(TextIO.read().from("input.txt"));
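One way to picture the bounded/unbounded distinction is the plain-Python sketch below. It is not Beam code: a list stands in for a bounded collection and an endless iterator for an unbounded one, and the count-based "windows" are a simplifying assumption (Beam's own windowing is usually based on timestamps). All function names are hypothetical.

```python
import itertools
from collections import Counter


def bounded_source():
    """A finite dataset: every element is available up front."""
    return ["a", "b", "a"]


def unbounded_source():
    """An endless stream: it cycles forever, so it can never be fully read."""
    return itertools.cycle(["a", "b", "a"])


def count_in_fixed_windows(stream, window_size, num_windows):
    """Count elements in consecutive windows of `window_size` elements."""
    results = []
    for _ in range(num_windows):
        window = list(itertools.islice(stream, window_size))
        results.append(dict(Counter(window)))
    return results


# A bounded input can be aggregated as a whole.
bounded_counts = dict(Counter(bounded_source()))

# An unbounded input has to be cut into finite pieces before aggregating.
windowed_counts = count_in_fixed_windows(
    unbounded_source(), window_size=4, num_windows=2
)
```

The point of the sketch is only that a global aggregate is well defined for the bounded source, while the unbounded source yields one result per window.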

PTransform

A PTransform represents a computation that turns input PCollections into output PCollections.

PCollection<String> words = lines.apply(Regex.split("\\W+"));
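The idea that a transform takes one collection and produces a new one can be modeled in a few lines of plain Python. The sketch below is a toy, not how Beam is implemented: `ToyCollection`, `flat_map`, and `map_elements` are hypothetical names, and the `|` operator merely imitates the pipe syntax of the Python SDK. It also shows why splitting lines into words calls for a flat-map rather than a one-to-one map.

```python
class ToyCollection:
    """Hypothetical stand-in for a PCollection: an immutable bag of elements."""

    def __init__(self, elements):
        self.elements = tuple(elements)

    def __or__(self, transform):
        # `collection | transform` applies the transform to the elements
        # and wraps the result in a new collection; the input is unchanged.
        return ToyCollection(transform(self.elements))


def flat_map(fn):
    """One input element may produce zero or more output elements."""
    return lambda elements: (out for e in elements for out in fn(e))


def map_elements(fn):
    """Each input element produces exactly one output element."""
    return lambda elements: (fn(e) for e in elements)


lines = ToyCollection(["hello beam", "hello world"])
words = lines | flat_map(str.split)           # one element per word
per_line = lines | map_elements(str.split)    # one list per line
shouted = words | map_elements(str.upper)
```

With `map_elements(str.split)` the result holds two lists, one per line, which is why a per-word count needs the flat-map form.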

Pipeline

A Pipeline manages a directed acyclic graph (DAG) of PTransforms and PCollections that is ready for execution.

Pipeline p = Pipeline.create(options);
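The graph a Pipeline manages can be pictured with a small plain-Python sketch. It is not Beam code: the step names are borrowed from the Python example earlier in this document, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever scheduling a real runner performs.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it consumes.
steps = {
    "ReadLines": set(),
    "SplitWords": {"ReadLines"},
    "CountWords": {"SplitWords"},
    "WriteCounts": {"CountWords"},
}

# Declaring the graph runs nothing; an order is derived only when asked for.
# Because the graph is acyclic, a valid execution order always exists.
execution_order = list(TopologicalSorter(steps).static_order())
```

For this linear chain the only valid order is read, split, count, write; a graph with a cycle would make `TopologicalSorter` raise `graphlib.CycleError`, which is one reason the acyclic property matters.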

PipelineRunner

A PipelineRunner specifies where and how the pipeline should execute.

p.run().waitUntilFinish();
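The separation between what a pipeline computes and how a runner executes it can be illustrated with a toy model in plain Python. Neither function below is a Beam runner; `run_locally` and `run_in_parallel` are hypothetical names, and a thread pool over input shards is only an assumed stand-in for a distributed backend.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    """The 'pipeline logic': count every whitespace-separated word."""
    return Counter(word for line in lines for word in line.split())


def run_locally(lines):
    """Toy single-process runner: handle the whole input in one call."""
    return count_words(lines)


def run_in_parallel(lines, workers=2):
    """Toy parallel runner: shard the input, count each shard, merge results."""
    shards = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["hello beam", "hello world", "beam beam"]
local_result = run_locally(lines)
parallel_result = run_in_parallel(lines)
```

Both strategies produce the same counts, which mirrors the promise of the runner abstraction: switching from a local to a distributed runner should change how the work is carried out, not the result.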

4. Installation Method

Refer to the installation guide above and install the SDK language and version that suits your project.

beam

Apache Beam is a unified programming model for Batch and Streaming data processing.

Project repository: https://gitcode.com/gh_mirrors/beam4/beam
