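ClickHouse MPP Database: New Feature Usage Examples
New ClickHouse features:
From version 22.3 to the latest release, 24.3.2.23, ClickHouse has been evolving quickly. Each release adds new features and speeds up both data ingestion and queries.
This article goes through the release posts on the ClickHouse blog and collects the new features most likely to be useful in day-to-day work.
The commands below show how to try these features with Docker: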
docker run -d --name=ch -p 8123:8123 -p 9000:9000 -p 9009:9009 --ulimit nofile=262144:262144 -v D:/ch/latest/external:/external:rw -v chlatest:/var/lib/clickhouse:rw -v D:/ch/latest/logs:/var/log/clickhouse-server:rw -v D:/ch/latest/etc/clickhouse-server:/etc/clickhouse-server:rw clickhouse/clickhouse-server:24.3.2.23
docker exec -it ch bash
clickhouse-client --format_csv_delimiter=','
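The transform function
Performs dictionary-style replacement: each value found in array_from is mapped to the element at the same position in array_to.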
transform(x, array_from, array_to, default)
transform(T, Array(T), Array(U), U) -> U
transform(x, array_from, array_to)
UK-house-price-dataset.csv
CREATE TABLE uk_price_paid
(
price UInt32,
date Date,
postcode1 LowCardinality(String),
postcode2 LowCardinality(String),
type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
is_new UInt8,
duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
addr1 String,
addr2 String,
street LowCardinality(String),
locality LowCardinality(String),
town LowCardinality(String),
district LowCardinality(String),
county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2);
INSERT INTO uk_price_paid
WITH
splitByChar(' ', postcode) AS p
SELECT
toUInt32(price_string) AS price,
parseDateTimeBestEffortUS(time) AS date,
p[1] AS postcode1,
p[2] AS postcode2,
transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
b = 'Y' AS is_new,
transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county
FROM file('UK-house-price-dataset.csv','CSV','uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
);
SELECT transform(number, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], NULL) AS numbers
FROM system.numbers
LIMIT 10
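The lookup behaviour of transform can be mimicked in a few lines of Python. This is only a sketch of the semantics, not ClickHouse code: the helper name and the sentinel are illustrative, and it assumes that the first matching element wins and that the three-argument form returns the input unchanged when nothing matches.

```python
from typing import Any, Sequence

_MISSING = object()  # sentinel: distinguishes "no default given" from default=None


def transform(x: Any, array_from: Sequence[Any], array_to: Sequence[Any],
              default: Any = _MISSING) -> Any:
    """Map x through the lookup defined by array_from -> array_to."""
    if len(array_from) != len(array_to):
        raise ValueError("array_from and array_to must have the same length")
    for source, target in zip(array_from, array_to):
        if x == source:
            return target  # first match wins
    # No match: the four-argument form returns default,
    # the three-argument form returns x itself.
    return x if default is _MISSING else default


codes = ["T", "S", "D", "F", "O"]
names = ["terraced", "semi-detached", "detached", "flat", "other"]
print(transform("D", codes, names))                  # mapped value
print(transform("X", codes, names))                  # unmatched, no default
print(transform(42, [0, 1], ["zero", "one"], None))  # unmatched, default given
```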
Reading files
ClickHouse can detect a file's format automatically and suggest column types.
SELECT * FROM (
WITH
splitByChar(' ', postcode) AS p
SELECT
toUInt32(price_string) AS price,
parseDateTimeBestEffortUS(time) AS date,
p[1] AS postcode1,
p[2] AS postcode2,
transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
b = 'Y' AS is_new,
transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county
FROM file('UK-house-price-dataset.csv','CSV','uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
) SETTINGS format_csv_delimiter=','
) LIMIT 2;
User-defined functions
Write your own functions as needed.
CREATE OR REPLACE TABLE line_changes
(
version UInt32,
line_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3),
line_number UInt32,
line_content String,
time datetime default now()
)
ENGINE = MergeTree
ORDER BY time;
INSERT INTO default.line_changes (version,line_change_type,line_number,line_content) VALUES
(1, 'Add' , 1, 'ClickHouse provides SQL'),
(2, 'Add' , 2, 'with improvements'),
(3, 'Add' , 3, 'that makes it more friendly for analytical tasks.'),
(4, 'Add' , 2, 'with many extensions'),
(5, 'Modify', 3, 'and powerful improvements'),
(6, 'Delete', 1, ''),
(7, 'Add' , 1, 'ClickHouse provides a superset of SQL');
-- add a string (str) into an array (arr) at a specific position (pos)
CREATE OR REPLACE FUNCTION add AS (arr, pos, str) ->
arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos));
-- delete the element at a specific position (pos) from an array (arr)
CREATE OR REPLACE FUNCTION delete AS (arr, pos) ->
arrayConcat(arraySlice(arr, 1, pos-1), arraySlice(arr, pos+1));
-- replace the element at a specific position (pos) in an array (arr)
CREATE OR REPLACE FUNCTION modify AS (arr, pos, str) ->
arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos+1));
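To check the slicing arithmetic in the three functions above, here is a Python sketch of the same operations. Positions are 1-based, as in ClickHouse arrays; the function names add_line, delete_line and modify_line are illustrative and do not exist in ClickHouse.

```python
from typing import List


def add_line(arr: List[str], pos: int, s: str) -> List[str]:
    """Insert s so that it ends up at 1-based position pos."""
    return arr[:pos - 1] + [s] + arr[pos - 1:]


def delete_line(arr: List[str], pos: int) -> List[str]:
    """Remove the element at 1-based position pos."""
    return arr[:pos - 1] + arr[pos:]


def modify_line(arr: List[str], pos: int, s: str) -> List[str]:
    """Replace the element at 1-based position pos with s."""
    return arr[:pos - 1] + [s] + arr[pos:]


doc = ["a", "b", "c"]
print(add_line(doc, 2, "x"))
print(delete_line(doc, 1))
print(modify_line(doc, 3, "z"))
```

Each helper returns a new list rather than mutating its input, which matches the way the SQL versions build a new array with arrayConcat.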
arrayFold
SELECT arrayFold((acc, v) -> (acc + v), [10, 20, 30], 0::UInt64) AS sum;
CREATE OR REPLACE VIEW text_version AS
WITH T1 AS (
SELECT arrayZip(
groupArray(line_change_type),
groupArray(line_number),
groupArray(line_content)) as line_ops
FROM (SELECT * FROM line_changes
WHERE version <= {version:UInt32} ORDER BY version ASC)
)
SELECT arrayJoin(
arrayFold((acc, v) ->
if(v.'change_type' = 'Add', add(acc, v.'line_nr', v.'content'),
if(v.'change_type' = 'Delete', delete(acc, v.'line_nr'),
if(v.'change_type' = 'Modify', modify(acc, v.'line_nr', v.'content'), []))),
line_ops::Array(Tuple(change_type String, line_nr UInt32, content String)),
[]::Array(String))) as lines
FROM T1;
SELECT * FROM text_version(version = 3);
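The view above folds the ordered change list into a list of lines. A Python sketch of the same idea uses functools.reduce in place of arrayFold and replays the rows from the INSERT statement. The names LINE_CHANGES, apply_change and text_version are illustrative, and the function is a model of the view, not a way to call it.

```python
from functools import reduce
from typing import List, Tuple

# (version, change_type, line_number, content), copied from the INSERT above
Change = Tuple[int, str, int, str]

LINE_CHANGES: List[Change] = [
    (1, "Add", 1, "ClickHouse provides SQL"),
    (2, "Add", 2, "with improvements"),
    (3, "Add", 3, "that makes it more friendly for analytical tasks."),
    (4, "Add", 2, "with many extensions"),
    (5, "Modify", 3, "and powerful improvements"),
    (6, "Delete", 1, ""),
    (7, "Add", 1, "ClickHouse provides a superset of SQL"),
]


def apply_change(lines: List[str], change: Change) -> List[str]:
    """One fold step: apply a single change to the accumulated lines."""
    _, kind, pos, content = change
    if kind == "Add":
        return lines[:pos - 1] + [content] + lines[pos - 1:]
    if kind == "Delete":
        return lines[:pos - 1] + lines[pos:]
    if kind == "Modify":
        return lines[:pos - 1] + [content] + lines[pos:]
    return []  # mirrors the SQL fallback of an empty array


def text_version(version: int) -> List[str]:
    """Replay all changes up to and including the given version."""
    ops = sorted((c for c in LINE_CHANGES if c[0] <= version), key=lambda c: c[0])
    return reduce(apply_change, ops, [])


for line in text_version(7):
    print(line)
```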
Parallel window functions
Window functions are now computed in parallel, which improves performance substantially.
SELECT
country,
day,
max(tempAvg) AS temperature,
avg(temperature) OVER (PARTITION BY country ORDER BY day ASC ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS moving_avg_temp
FROM noaa
WHERE country != ''
GROUP BY
country,
date AS day
ORDER BY
country ASC,
day ASC
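The window frame ROWS BETWEEN 5 PRECEDING AND CURRENT ROW can be hard to picture, so here is a Python sketch of what it computes for each partition. The sample rows are made up for illustration and are not taken from the noaa table. Each country is processed independently of the others, which is the property that makes partitions a natural unit for parallel execution.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]  # (country, day, temperature)


def moving_avg(rows: List[Row], preceding: int = 5) -> List[Tuple[str, str, float, float]]:
    """Per country, average the current row and up to `preceding` rows before it."""
    by_country: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for country, day, temp in rows:
        by_country[country].append((day, temp))

    result = []
    for country in sorted(by_country):            # ORDER BY country ASC
        series = sorted(by_country[country])       # ORDER BY day ASC (ISO dates)
        for i, (day, temp) in enumerate(series):
            window = [t for _, t in series[max(0, i - preceding): i + 1]]
            result.append((country, day, temp, sum(window) / len(window)))
    return result


# Hypothetical sample data
sample = [("FR", f"2024-01-0{d}", float(d)) for d in range(1, 8)]
sample += [("DE", "2024-01-01", 10.0), ("DE", "2024-01-02", 20.0)]

for row in moving_avg(sample):
    print(row)
```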
FINAL
With FINAL and the enable_vertical_final setting, queries on tables that use the ReplacingMergeTree or AggregatingMergeTree engines can return the latest data quickly.
SELECT
postcode1,
formatReadableQuantity(avg(price))
FROM uk_property_offers FINAL
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3;
SELECT
postcode1,
formatReadableQuantity(avg(price))
FROM uk_property_offers FINAL
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3
SETTINGS enable_vertical_final = 1;
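Conceptually, FINAL on a ReplacingMergeTree table collapses rows that share a sorting key down to the most recent one at query time. The Python sketch below models only that result, not how ClickHouse executes it. The columns (offer_id, version, price) are assumptions, since the schema of uk_property_offers is not shown here.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, int]  # (offer_id, version, price): hypothetical columns


def final(rows: Iterable[Row]) -> List[Row]:
    """Keep one row per offer_id: the highest version, later rows winning ties."""
    latest: Dict[str, Row] = {}
    for row in rows:
        key = row[0]
        if key not in latest or row[1] >= latest[key][1]:
            latest[key] = row
    return list(latest.values())


rows = [
    ("a", 1, 100),
    ("b", 1, 200),
    ("a", 2, 150),  # newer version of offer "a"
]
print(final(rows))
```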
Variant Type
SET allow_experimental_variant_type=1,
use_variant_as_common_type = 1;
SELECT
map('Hello', 1, 'World', 'Mark') AS x,
toTypeName(x) AS type
FORMAT Vertical;
SELECT
arrayJoin([1, true, 3.4, 'Mark']) AS value,
toTypeName(value)
Row 1:
──────
x: {'Hello':1,'World':'Mark'}
type: Map(String, Variant(String, UInt8))
┌─value─┬─toTypeName(value)─────────────────────┐
1. │ true │ Variant(Bool, Float64, String, UInt8) │
2. │ true │ Variant(Bool, Float64, String, UInt8) │
3. │ 3.4 │ Variant(Bool, Float64, String, UInt8) │
4. │ Mark │ Variant(Bool, Float64, String, UInt8) │
└───────┴───────────────────────────────────────┘
String similarity functions
- byteHammingDistance: the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols differ. In other words, it is the minimum number of substitutions required to change one string into the other. It is named after the American mathematician Richard Hamming. Examples:
  - “karolin” and “kathrin” is 3.
  - “karolin” and “kerstin” is 3.
  - “kathrin” and “kerstin” is 4.
  - 0000 and 1111 is 4.
  - 2173896 and 2233796 is 3.
- editDistance: a way of quantifying how dissimilar two strings (e.g., words) are, measured as the minimum number of operations required to transform one string into the other.
- damerauLevenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
- jaroWinklerSimilarity: a string metric for comparing two sequences. It is a variant of the Jaro metric that gives extra weight to a shared prefix; higher values mean more similar strings.
- levenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character insertions, deletions or substitutions required to change one word into the other. Unlike Damerau–Levenshtein, a transposition of two adjacent characters does not count as a single operation.
https://clickhouse.com/docs/en/sql-reference/functions/string-functions#dameraulevenshteindistance
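The definitions above are easy to verify by hand with a short Python sketch. It is a reference implementation of the two textbook metrics, not the ClickHouse code: it compares characters rather than bytes and it rejects inputs of unequal length for the Hamming distance.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs inputs of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep when equal
            ))
        prev = cur
    return prev[-1]


print(hamming("karolin", "kathrin"))
print(levenshtein("facebook.com", "facebonk.com"))
print(levenshtein("ab", "ba"))  # a swap costs two edits under plain Levenshtein
```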
CREATE TABLE domains
(
`domain` String,
`rank` Float64
)
ENGINE = MergeTree
ORDER BY domain;
INSERT INTO domains SELECT
c2 AS domain,
1 / c1 AS rank
FROM url('domains.csv', 'CSV');
SELECT
domain,
levenshteinDistance(domain, 'facebook.com') AS d1,
damerauLevenshteinDistance(domain, 'facebook.com') AS d2,
jaroSimilarity(domain, 'facebook.com') AS d3,
jaroWinklerSimilarity(domain, 'facebook.com') AS d4
FROM domains
ORDER BY d1 ASC
LIMIT 10
Query id: 6f499f27-8274-4787-819a-b510322bdce3
┌─domain────────┬─d1─┬─d2─┬─────────────────d3─┬─────────────────d4─┐
1. │ facebook.com │ 0 │ 0 │ 1 │ 1 │
2. │ facebonk.com │ 1 │ 1 │ 0.8838383838383838 │ 0.9303030303030303 │
3. │ fabebook.com │ 1 │ 1 │ 0.914141414141414 │ 0.9313131313131312 │
4. │ facabook.com │ 1 │ 1 │ 0.9444444444444443 │ 0.961111111111111 │
5. │ facobook.com │ 1 │ 1 │ 0.8535353535353535 │ 0.8974747474747474 │
6. │ facebook1.com │ 1 │ 1 │ 0.9743589743589745 │ 0.9846153846153847 │
7. │ faceook.com │ 1 │ 1 │ 0.9722222222222221 │ 0.9833333333333333 │
8. │ faacebook.com │ 1 │ 1 │ 0.9743589743589745 │ 0.9794871794871796 │
9. │ faceboock.com │ 1 │ 1 │ 0.9326923076923077 │ 0.9596153846153846 │
10. │ facebool.com │ 1 │ 1 │ 0.9444444444444443 │ 0.9666666666666666 │
└───────────────┴────┴────┴────────────────────┴────────────────────┘
Vectorized distance functions
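ClickHouse can be used as a vector database. It supports three vector similarity measures: L2 distance, cosine distance (cosineDistance), and inner product (IP).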
https://clickhouse.com/blog/clickhouse-release-24-02
WITH 'dog' AS search_term,
(
SELECT vector
FROM glove
WHERE word = search_term
LIMIT 1
) AS target_vector
SELECT word, cosineDistance(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;
WITH
'dog' AS search_term,
(
SELECT vector
FROM glove
WHERE word = search_term
LIMIT 1
) AS target_vector
SELECT
word,
1 - dotProduct(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;
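For reference, the three measures can be written out in plain Python. The tiny two-dimensional vectors below are invented for illustration and have nothing to do with the real glove embeddings. Note that 1 - dotProduct only coincides with cosine distance when both vectors have unit length.

```python
import math
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product (IP)."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 minus the cosine of the angle between a and b."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical embeddings
VECTORS: Dict[str, List[float]] = {
    "dog": [1.0, 0.9],
    "puppy": [0.9, 1.0],
    "car": [-1.0, 0.2],
}


def nearest(term: str, limit: int = 5) -> List[str]:
    """Other words ordered by ascending cosine distance to term."""
    target = VECTORS[term]
    others = [w for w in VECTORS if w.lower() != term.lower()]
    return sorted(others, key=lambda w: cosine_distance(VECTORS[w], target))[:limit]


print(nearest("dog"))
```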
Adaptive asynchronous inserts
Asynchronous inserts shift data batching from the client side to the server side: data from insert queries is first placed in a buffer and then written to database storage later, asynchronously.
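The buffering idea can be illustrated with a toy Python class. It is a sketch only: the class name, the row-count threshold and the flush callback are all assumptions made for this example, and the actual server-side flush policy is not described in this article.

```python
from typing import Any, Callable, List


class InsertBuffer:
    """Collects rows from many small inserts and writes them out in batches."""

    def __init__(self, write_batch: Callable[[List[Any]], None], max_rows: int = 3):
        self._write_batch = write_batch  # stands in for "write to storage"
        self._max_rows = max_rows        # illustrative threshold
        self._pending: List[Any] = []

    def insert(self, rows: List[Any]) -> None:
        """Accept rows immediately; flush once enough have accumulated."""
        self._pending.extend(rows)
        if len(self._pending) >= self._max_rows:
            self.flush()

    def flush(self) -> None:
        """Write everything buffered so far as a single batch."""
        if self._pending:
            self._write_batch(list(self._pending))
            self._pending.clear()


written: List[List[Any]] = []
buf = InsertBuffer(written.append, max_rows=3)
buf.insert([1])
buf.insert([2])
buf.insert([3])  # threshold reached: one batch of three rows is written
buf.insert([4])
buf.flush()      # e.g. triggered by a timer in a real system
print(written)
```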