Presto 0.100 Documentation

4.1. Cassandra连接器

4.1. Cassandra连接器

可以使用 Cassandra connector 查询存储在Cassandra中的数据。

配置

创建一个文件 etc/catalog/cassandra.properties ,该文件的内容如下

connector.name=cassandra
cassandra.contact-points=host1,host2

以上配置项将会在cassandra catalog 上挂载一个cassandra connector。根据实际环境修改host1,host2 。cassandra.contact-points 的内容必须是以逗号分隔的Cassandra 节点,这些节点用来发现集群的逻辑结构。如果cassandra不使用默认的9142端口,你还要设置cassandra.native-protocol-port的值。

Multiple Cassandra Clusters

You can have as many catalogs as you need, so if you have additional Cassandra clusters, simply add another properties file to etc/catalog with a different name (making sure it ends in .properties). For example, if you name the property file sales.properties, Presto will create a catalog named sales using the configured connector.

Configuration Properties

The following configuration properties are available:

Property Name Description
cassandra.contact-points Comma-separated list of hosts in a Cassandra cluster. The Cassandra driver will use these contact points to discover cluster topology. At least one Cassandra host is required.
cassandra.native-protocol-port The Cassandra server port running the native client protocol (defaults to 9042).
cassandra.thrift-port The Cassandra server port running the Thrift client protocol (defaults to 9160).
cassandra.limit-for-partition-key-select Limit of rows to read for finding all partition keys. If a Cassandra table has more rows than this value, splits based on token ranges are used instead. Note that for larger values you may need to adjust read timeout for Cassandra.
cassandra.max-schema-refresh-threads Maximum number of schema cache refresh threads. This property corresponds to the maximum number of parallel requests.
cassandra.schema-cache-ttl Maximum time that information about a schema will be cached (defaults to 1h).
cassandra.schema-refresh-interval The schema information cache will be refreshed in the background when accessed if the cached data is at least this old (defaults to 2m).
cassandra.consistency-level Consistency levels in Cassandra refer to the level of consistency to be used for both read and write operations. More information about consistency levels can be found in the Cassandra consistency documentation. This property defaults to a consistency level of ONE. Possible values include ALL, EACH_QUORUM, QUORUM, LOCAL_QUORUM, ONE, TWO, THREE, LOCAL_ONE, ANY, SERIAL, LOCAL_SERIAL.
cassandra.allow-drop-table Set to true to allow dropping Cassandra tables from Presto via 删表 (defaults to false).
cassandra.username Username used for authentication to the Cassandra cluster. This is a global setting used for all connections, regardless of the user who is connected to Presto.
cassandra.password Password used for authentication to the Cassandra cluster. This is a global setting used for all connections, regardless of the user who is connected to Presto.

The following advanced configuration properties are available:

Property Name Description
cassandra.fetch-size Number of rows fetched at a time in a Cassandra query.
cassandra.fetch-size-for-partition-key-select Number of rows fetched at a time in a Cassandra query that selects partition keys.
cassandra.partition-size-for-batch-select Number of partitions batched together into a single select for a single partion key column table.
cassandra.split-size Number of keys per split when querying Cassandra.
cassandra.partitioner Partitioner to use for hashing and data distribution. This property defaults to Murmur3Partitioner. The other supported values are RandomPartitioner and ByteOrderedPartitioner.
cassandra.thrift-connection-factory-class Allows for the specification of a custom implementation of org.apache.cassandra.thrift.ITransportFactory to be used to connect to Cassandra using the Thrift protocol.
cassandra.transport-factory-options Allows for the specification of arbitrary options to be passed to the Thrift connection factory.
cassandra.client.read-timeout Maximum time the Cassandra driver will wait for an answer to a query from one Cassandra node. Note that the underlying Cassandra driver may retry a query against more than one node in the event of a read timeout. Increasing this may help with queries that use an index.
cassandra.client.connect-timeout Maximum time the Cassandra driver will wait to establish a connection to a Cassandra node. Increasing this may help with heavily loaded Cassandra clusters.
cassandra.client.so-linger Number of seconds to linger on close if unsent data is queued. If set to zero, the socket will be closed immediately. When this option is non-zero, a socket will linger that many seconds for an acknowledgement that all data was written to a peer. This option can be used to avoid consuming sockets on a Cassandra server by immediately closing connections when they are no longer needed.
cassandra.retry-policy Policy used to retry failed requests to Cassandra. This property defaults to DEFAULT. Using BACKOFF may help when queries fail with “not enough replicas”. The other possible values are DOWNGRADING_CONSISTENCY and FALLTHROUGH.

Querying Cassandra Tables

The users table is an example Cassandra table from the Cassandra Getting Started guide. It can be created along with the mykeyspace keyspace using Cassandra’s cqlsh (CQL interactive terminal):

cqlsh> CREATE KEYSPACE mykeyspace
   ... WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh> USE mykeyspace;
cqlsh:mykeyspace> CREATE TABLE users (
              ...   user_id int PRIMARY KEY,
              ...   fname text,
              ...   lname text
              ... );

This table can be described in Presto:

DESCRIBE cassandra.mykeyspace.users;
 Column  |  Type   | Null | Partition Key | Comment
---------+---------+------+---------------+---------
 user_id | bigint  | true | true          |
 fname   | varchar | true | false         |
 lname   | varchar | true | false         |
(3 rows)

This table can then be queried in Presto:

SELECT * FROM cassandra.mykeyspace.users;