RISC-V Performance Measurements: A Case Study of the T-Head C906

 

Corrector: TinyCorrect v0.1-rc3 - [spaces tables pangu autocorrect epw]
Author: LucasXu lucas.xuyq.work@outlook.com
Revisor: Taotieren
Date: 2022/09/06
Project: RISC-V Linux Kernel Analysis
Sponsor: PLCT Lab, ISCAS

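Introduction

The previous article gave a brief overview of benchmark tools on Linux. This article uses the T-Head C906 as an example to show how to run performance tests on real hardware.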

Test Environment

Hardware

Hardware        Specification            Note
RISC-V 64 CPU   D1-H (C906), 1.0 GHz     Used in the tests below
X86_64 CPU      Intel Core i7-4770HQ     Used for the UnixBench and Matrix tests;
                                         base frequency 2.2 GHz, all cores steady at 3.2 GHz during testing
X86_64 CPU      Intel Core i5-8500       Used for the microbench test;
                                         base frequency 3.0 GHz, boosting to 4.10 GHz during testing
RAM             DDR3 512 MB              Used in the tests below
ROM             NOR Flash 128 MB         Used in the tests below
Network         100 Mb
Audio           3.5 mm CTIA
Button          fel 1 + LRADC OK1
Debug           UART + ADB USB
Power           USB Type-C 5 V / 2 A
PCB layers      6

Software

  • Debian 11, Linux kernel version 5.4

Cross-compilation toolchain

  • riscv64-linux-gnu-gcc (GCC) 11.2.0

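Test Methods and Tools

Test tools

  • microbench, produced by TinyLab (泰晓科技): instruction-level performance evaluation
  • A large-scale matrix computation program written by the author. It performs heavy integer and floating-point computation, mainly exercises single-core integer and floating-point performance, and can be compared against an X86_64 machine for reference
  • UnixBench

Tool overview and porting

Large-scale matrix computation program

The program is a C/C++ rewrite of an earlier piece of MATLAB matrix code and uses the Eigen 3.4.0 library for the matrix operations.

The code is as follows: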

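#include <iostream>
#include <Eigen/Dense>
#include <random>
#include <cstdlib> // rand, srand
#include <ctime>
#include <cmath>
#include <vector>
// EIGEN_USE_BLAS only takes effect when defined before including Eigen (and then
// requires linking a BLAS library). It was defined after the include here, so it
// had no effect and is omitted.
#define PI 3.141592654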

using namespace Eigen;
using namespace std;
int sign(double x)
{
    if (x > 0)
        return 1;
    else if (x == 0)
        return 0;
    else
        return -1;
}
double max(double a, double b)
{
    if (a > b)
        return a;
    else
        return b;
}
double max(MatrixXd x)
{
    double max = x(0, 0);
    for (int i = 0; i < x.rows(); i++)
    {
        for (int j = 0; j < x.cols(); j++)
        {
            if (x(i, j) > max)
            {
                max = x(i, j);
            }
        }
    }
    return max;
}

MatrixXd sign(MatrixXd x)
{
    MatrixXd y(x.rows(), x.cols());
    for (int i = 0; i < x.rows(); i++)
    {
        for (int j = 0; j < x.cols(); j++)
        {
            y(i, j) = sign(x(i, j));
        }
    }
    return y;
}
MatrixXd proxlasso(MatrixXd x, double mu, double tk, int max_step)
{
    // Create matrix Y and initialize it to zero
    MatrixXd Y = MatrixXd::Zero(x.rows(), x.cols());
    // MATLAB equivalent: y = sign(x).*max(abs(x) - tk*mu, 0);
    for (int i = 0; i < x.rows(); i++)
    {
        for (int j = 0; j < x.cols(); j++)
        {
            Y(i, j) = sign(x(i, j)) * max(abs(x(i, j)) - tk * mu, 0);
        }
    }
    return Y;
}
MatrixXd eig(MatrixXd A)
{
    // Compute the eigenvalues of A (real parts only)
    EigenSolver<MatrixXd> es(A);
    MatrixXd eigenvalues = es.eigenvalues().real();
    return eigenvalues;
}
MatrixXd randomMatrix(int rows, int cols)
{
    srand(time(NULL));
    cout << "开始生成随机数矩阵" << endl;
    // Fill the matrix with random values in (-1, 1) using a nested for loop
    MatrixXd A;
    A.resize(rows, cols);

    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            double tmp = rand() % 1000000;
            double tmp2 = tmp / 1000000.0;
            int sign = rand() % 2;
            if (sign == 0)
            {
                A(i, j) = tmp2;
            }
            else
            {
                A(i, j) = -tmp2;
            }
        }
    }
    cout << "随机数矩阵生成完毕" << endl;
    return A;
}
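double generateGaussianNoise(double mean, double sigma)
{
    // Draw from a normal distribution and return only non-zero samples.
    // The engine is static so that it is seeded once, not on every call.
    static std::mt19937 gen(std::random_device{}());
    std::normal_distribution<> d(mean, sigma);
    // Sample once and return that same sample; the original tested one
    // sample and returned a second, unchecked one.
    double value = d(gen);
    while (value == 0.0)
    {
        value = d(gen);
    }
    return value;
}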
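MatrixXd sprandnMatrix(int m, int n, double density)
{
    // Create a random m×n sparse matrix; density lies in [0,1] and is meant to give
    // roughly density*m*n normally distributed non-zero entries.
    // Note: density is currently unused; the non-zero count is hard-coded below.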
    MatrixXd A;
    A.resize(m, n);
    for (int i = 0; i < m; i++)
    {
        for (int j = 0; j < n; j++)
        {
            A(i, j) = 0;
        }
    }
    int nonzero = 97;
    int count = 0;
    // Generate `nonzero` normally distributed non-zero entries
    cout << "开始生成稀疏矩阵" << endl;
    while (count < nonzero)
    {
        int i = rand() % m;
        int j = rand() % n;
        if (A(i, j) == 0)
        {

            int sign = rand() % 2;
            if (sign == 0)
            {
                A(i, j) = generateGaussianNoise(0, 1);
                count++;
            }
            else
            {
                A(i, j) = -generateGaussianNoise(0, 1);
                count++;
            }
        }
    }
    cout << "稀疏矩阵生成完毕" << endl;

    return A;
}
int main()
{
    clock_t start, end;
    start = clock();
    cout << "Start time is: " << start << endl;
    int max_step = 500;
    double r = 0.1;
    double mu = 0.001;
    int m = 512;
    int n = 1024;
    // Generate the random matrix
    MatrixXd A = randomMatrix(m, n);
    // Generate a sparse, normally distributed random n×1 vector with roughly r*n non-zero entries
    MatrixXd u = sprandnMatrix(n, 1, r);
    // MatrixXd u = MatrixXd::Random(n, 1);
    MatrixXd b = A * u;
    double threshold = 0.0001;
    // Setup complete
    // Eigenvalues of A^T * A (result unused; kept as part of the workload)
    MatrixXd eigenvalues = eig(A.transpose() * A);
    // Determinant of the large n×n matrix A^T * A (result unused; kept as part of the workload)
    double det = (A.transpose() * A).determinant();
    double tk = 1.0 / max(eig(A.transpose() * A));
    int k = 0;
    MatrixXd gradg = MatrixXd::Zero(n, max_step);
    MatrixXd x = MatrixXd::Zero(n, max_step);
    // f(0) is evaluated on the first column of x
    MatrixXd f = MatrixXd::Zero(1, max_step);
    f(0) = mu * x.col(0).lpNorm<1>() + 0.5 * ((A * x.col(0) - b).lpNorm<2>() * (A * x.col(0) - b).lpNorm<2>());
    MatrixXd err = MatrixXd::Zero(1, max_step);
    while (1)
    {
        gradg.col(k) = A.transpose() * (A * x.col(k) - b);
        x.col(k + 1) = proxlasso((x.col(k) - tk * gradg.col(k)), mu, tk, max_step);
        f(k + 1) = mu * x.col(k + 1).lpNorm<1>() + 0.5 * (A * x.col(k + 1) - b).lpNorm<2>() * (A * x.col(k + 1) - b).lpNorm<2>();
        err(k) = abs((f(k + 1) - f(k)) / f(k));
        if (err(k) <= threshold || k == max_step - 2)
        {
            break;
        }
        k++;
    }
    cout << "求得最小的函数值为:" << f(k) << endl;
    cout << "迭代次数为:" << k << endl;
    end = clock();
    cout << "End time is: " << end << endl;
    cout << "Time consumption is: " << (end - start) / CLOCKS_PER_SEC << endl;
    return 0;
}
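
For readers who want the mathematics behind the loop in `main`: it appears to be the proximal gradient method (often called ISTA) applied to a LASSO problem. The block below restates what the code computes, using the code's own names (`mu`, `tk`, `A`, `b`, `threshold`). The intermediate vector z_k is a name introduced here for readability and does not appear in the program.

```latex
\begin{aligned}
f(x)     &= \mu \lVert x \rVert_1 + \tfrac{1}{2} \lVert A x - b \rVert_2^2, \\
t_k      &= \frac{1}{\lambda_{\max}\!\left(A^{\top} A\right)}, \\
z_k      &= x_k - t_k \, A^{\top} (A x_k - b), \\
x_{k+1}  &= \operatorname{sign}(z_k) \odot \max\bigl(\lvert z_k \rvert - t_k \mu,\; 0\bigr), \\
\mathrm{err}_k &= \left\lvert \frac{f(x_{k+1}) - f(x_k)}{f(x_k)} \right\rvert .
\end{aligned}
```

The iteration stops when err_k falls to `threshold` (0.0001) or below, or when k reaches `max_step - 2`. The fourth line is the element-wise soft-thresholding step implemented by `proxlasso`.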

microbench

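microbench was described in detail in the previous article and is not covered again here. The porting instructions for the RV64 platform from its repository are quoted below. Porting microbench is straightforward and only requires adjusting a few paths.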

make clean
make ARCH=riscv64 clean
make ARCH=riscv64

UnixBench

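UnixBench is a benchmark suite for Unix-like systems. Its default run covers integer and floating-point throughput (Dhrystone, Whetstone), execl throughput, file copy, pipe throughput and pipe-based context switching, process creation, shell scripts, and system call overhead. Porting UnixBench is also fairly simple and only requires changes to the Makefile. The modified Makefile is as follows: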

##############################################################################
# UnixBench v5.1.3
#  Based on The BYTE UNIX Benchmarks - Release 3
#          Module: Makefile   SID: 3.9 5/15/91 19:30:15
#
##############################################################################
# Bug reports, patches, comments, suggestions should be sent to:
# David C Niemi <niemi@tux.org>
#
# Original Contacts at Byte Magazine:
# Ben Smith or Tom Yager at BYTE Magazine
# bensmith@bytepb.byte.com    tyager@bytepb.byte.com
#
##############################################################################
#  Modification Log: 7/28/89 cleaned out workload files
#                    4/17/90 added routines for installing from shar mess
#                    7/23/90 added compile for dhrystone version 2.1
#                          (this is not part of Run file. still use old)
#                          removed HZ from everything but dhry.
#                          HZ is read from the environment, if not
#                          there, you must define it in this file
#                    10/30/90 moved new dhrystone into standard set
#                          new pgms (dhry included) run for a specified
#                          time rather than specified number of loops
#                    4/5/91 cleaned out files not needed for
#                          release 3 -- added release 3 files -ben
#                    10/22/97 added compiler options for strict ANSI C
#                          checking for gcc and DEC's cc on
#                          Digital Unix 4.x (kahn@zk3.dec.com)
#                    09/26/07 changes for UnixBench 5.0
#                    09/30/07 adding ubgears, GRAPHIC_TESTS switch
#                    10/14/07 adding large.txt
#                    01/13/11 added support for parallel compilation
#                    01/07/16 [refer to version control commit messages and
#                              cease using two-digit years in date formats]
##############################################################################

##############################################################################
# CONFIGURATION
##############################################################################

SHELL = /bin/sh

# GRAPHIC TESTS: Uncomment the definition of "GRAPHIC_TESTS" to enable
# the building of the graphics benchmarks.  This will require the
# X11 libraries on your system. (e.g. libX11-devel mesa-libGL-devel)
#
# Comment the line out to disable these tests.
# GRAPHIC_TESTS = defined

# Set "GL_LIBS" to the libraries needed to link a GL program.
GL_LIBS = -lGL -lXext -lX11

# COMPILER CONFIGURATION: Set "CC" to the name of the compiler to use
# to build the binary benchmarks.  You should also set "$cCompiler" in the
# Run script to the name of the compiler you want to test.
CC=riscv64-linux-gnu-gcc

# OPTIMISATION SETTINGS:
# Use gcc option if defined UB_GCC_OPTIONS via "Environment variable" or "Command-line arguments".
ifdef UB_GCC_OPTIONS
  OPTON = $(UB_GCC_OPTIONS)

else
  ## Very generic
  #OPTON = -O

  ## For Linux 486/Pentium, GCC 2.7.x and 2.8.x
  #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math \
  # -m486 -malign-loops=2 -malign-jumps=2 -malign-functions=2

  ## For Linux, GCC previous to 2.7.0
  #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math -m486

  #OPTON = -O2 -fomit-frame-pointer -fforce-addr -fforce-mem -ffast-math \
  # -m386 -malign-loops=1 -malign-jumps=1 -malign-functions=1

  ## For Solaris 2, or general-purpose GCC 2.7.x
  #OPTON = -O2 -fomit-frame-pointer -fforce-addr -ffast-math -Wall

  ## For Digital Unix v4.x, with DEC cc v5.x
  #OPTON = -O4
  #CFLAGS = -DTIME -std1 -verbose -w0

  ## gcc optimization flags
  ## (-ffast-math) disables strict IEEE or ISO rules/specifications for math funcs
  OPTON = -O3 -ffast-math

  ## OS detection.  Comment out if gmake syntax not supported by other 'make'.
  OSNAME:=$(shell uname -s)
  ARCH := $(shell uname -p)
  ifeq ($(OSNAME),Linux)
    # Not all CPU architectures support "-march" or "-march=native".
    #   - Supported    : x86, x86_64, ARM, AARCH64, etc..
    #   - Not Supported: RISC-V, IBM Power, etc...
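    # As written, "-march = rv64imaf" only assigns a make variable named "-march"
    # and never reaches the compiler. The C906 implements RV64GC (RV64IMAFDC).
    OPTON += -march=rv64gc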
  endif

  ifeq ($(OSNAME),Darwin)
    # (adjust flags or comment out this section for older versions of XCode or OS X)
    # (-mmacosx-versin-min= requires at least that version of SDK be installed)
    ifneq ($(ARCH),$(filter $(ARCH),ppc64 ppc64le))
        OPTON += -march=native -mmacosx-version-min=10.10
    else
        OPTON += -mcpu=native
    endif
    #http://stackoverflow.com/questions/9840207/how-to-use-avx-pclmulqdq-on-mac-os-x-lion/19342603#19342603
    CFLAGS += -Wa,-q
  endif

endif

## generic gcc CFLAGS.  -DTIME must be included.
CFLAGS += -Wall -pedantic $(OPTON) -I $(SRCDIR) -DTIME

##############################################################################
# END CONFIGURATION
##############################################################################

# local directories
PROGDIR = ./pgms
SRCDIR = ./src
TESTDIR = ./testdir
RESULTDIR = ./results
TMPDIR = ./tmp
# other directories
INCLDIR = /usr/include
LIBDIR = /lib
SCRIPTS = unixbench.logo multi.sh tst.sh index.base
SOURCES = arith.c big.c context1.c \
 dummy.c execl.c \
 fstime.c hanoi.c \
 pipe.c spawn.c \
 syscall.c looper.c timeit.c time-polling.c \
 dhry_1.c dhry_2.c dhry.h whets.c ubgears.c
TESTS = sort.src cctest.c dc.dat large.txt

ifneq (,$(GRAPHIC_TESTS))
GRAPHIC_BINS = $(PROGDIR)/ubgears
else
GRAPHIC_BINS =
endif

# Program binaries.
BINS = $(PROGDIR)/arithoh $(PROGDIR)/register $(PROGDIR)/short \
 $(PROGDIR)/int $(PROGDIR)/long $(PROGDIR)/float $(PROGDIR)/double \
 $(PROGDIR)/hanoi $(PROGDIR)/syscall $(PROGDIR)/context1 \
 $(PROGDIR)/pipe $(PROGDIR)/spawn $(PROGDIR)/execl \
 $(PROGDIR)/dhry2 $(PROGDIR)/dhry2reg  $(PROGDIR)/looper \
 $(PROGDIR)/fstime $(PROGDIR)/whetstone-double $(GRAPHIC_BINS)
## These compile only on some platforms...
# $(PROGDIR)/poll $(PROGDIR)/poll2 $(PROGDIR)/select

# Required non-binary files.
REQD = $(BINS) $(PROGDIR)/unixbench.logo \
 $(PROGDIR)/multi.sh $(PROGDIR)/tst.sh $(PROGDIR)/index.base \
 $(PROGDIR)/gfx-x11 \
 $(TESTDIR)/sort.src $(TESTDIR)/cctest.c $(TESTDIR)/dc.dat \
 $(TESTDIR)/large.txt

# ######################### the big ALL ############################
all:
## Ick!!!  What is this about???  How about let's not chmod everything bogusly.
# @chmod 744 * $(SRCDIR)/* $(PROGDIR)/* $(TESTDIR)/* $(DOCDIR)/*
	$(MAKE) distr
	$(MAKE) programs

# ####################### a check for Run ######################
check: $(REQD)
	$(MAKE) all
# ##############################################################
# distribute the files out to subdirectories if they are in this one
distr:
	@echo "Checking distribution of files"
# scripts
	@if  test ! -d  $(PROGDIR) \
        ; then  \
           mkdir $(PROGDIR) \
           ; mv $(SCRIPTS) $(PROGDIR) \
        ; else \
           echo "$(PROGDIR)  exists" \
        ; fi
# C sources
	@if  test ! -d  $(SRCDIR) \
        ; then  \
           mkdir $(SRCDIR) \
           ; mv $(SOURCES) $(SRCDIR) \
        ; else \
           echo "$(SRCDIR)  exists" \
        ; fi
# test data
	@if  test ! -d  $(TESTDIR) \
        ; then  \
           mkdir $(TESTDIR) \
           ; mv $(TESTS) $(TESTDIR) \
        ; else \
           echo "$(TESTDIR)  exists" \
        ; fi
# temporary work directory
	@if  test ! -d  $(TMPDIR) \
        ; then  \
           mkdir $(TMPDIR) \
        ; else \
           echo "$(TMPDIR)  exists" \
        ; fi
# directory for results
	@if  test ! -d  $(RESULTDIR) \
        ; then  \
           mkdir $(RESULTDIR) \
        ; else \
           echo "$(RESULTDIR)  exists" \
        ; fi

.PHONY: all check distr programs run clean spotless

programs: $(BINS)

# (use $< to link only the first dependency, instead of $^,
#  since the programs matching this pattern have only
#  one input file, and others are #include "xxx.c"
#  within the first.  (not condoning, just documenting))
# (dependencies could be generated by modern compilers,
#  but let's not assume modern compilers are present)
$(PROGDIR)/%:
	$(CC) -o $@ $(CFLAGS) $< $(LDFLAGS)

# Individual programs
# Sometimes the same source file is compiled in different ways.
# This limits the 'make' patterns that can usefully be applied.

$(PROGDIR)/arithoh:  $(SRCDIR)/arith.c $(SRCDIR)/timeit.c
$(PROGDIR)/arithoh:  CFLAGS += -Darithoh
$(PROGDIR)/register: $(SRCDIR)/arith.c $(SRCDIR)/timeit.c
$(PROGDIR)/register: CFLAGS += -Ddatum='register int'
$(PROGDIR)/short:    $(SRCDIR)/arith.c $(SRCDIR)/timeit.c
$(PROGDIR)/short:    CFLAGS += -Ddatum=short
$(PROGDIR)/int:      $(SRCDIR)/arith.c $(SRCDIR)/timeit.c
$(PROGDIR)/int:      CFLAGS += -Ddatum=int
$(PROGDIR)/long:     $(SRCDIR)/arith.c $(SRCDIR)/timeit.c
$(PROGDIR)/long:     CFLAGS += -Ddatum=long
$(PROGDIR)/float:    $(SRCDIR)/arith.c $(SRCDIR)/timeit.c
$(PROGDIR)/float:    CFLAGS += -Ddatum=float
$(PROGDIR)/double:   $(SRCDIR)/arith.c $(SRCDIR)/timeit.c
$(PROGDIR)/double:   CFLAGS += -Ddatum=double

$(PROGDIR)/poll:     $(SRCDIR)/time-polling.c
$(PROGDIR)/poll:     CFLAGS += -DUNIXBENCH -DHAS_POLL
$(PROGDIR)/poll2:    $(SRCDIR)/time-polling.c
$(PROGDIR)/poll2:    CFLAGS += -DUNIXBENCH -DHAS_POLL2
$(PROGDIR)/select:   $(SRCDIR)/time-polling.c
$(PROGDIR)/select:   CFLAGS += -DUNIXBENCH -DHAS_SELECT

$(PROGDIR)/whetstone-double: $(SRCDIR)/whets.c
$(PROGDIR)/whetstone-double: CFLAGS += -DDP -DGTODay -DUNIXBENCH
$(PROGDIR)/whetstone-double: LDFLAGS += -lm

$(PROGDIR)/pipe: $(SRCDIR)/pipe.c $(SRCDIR)/timeit.c

$(PROGDIR)/execl: $(SRCDIR)/execl.c $(SRCDIR)/big.c

$(PROGDIR)/spawn: $(SRCDIR)/spawn.c $(SRCDIR)/timeit.c

$(PROGDIR)/hanoi: $(SRCDIR)/hanoi.c $(SRCDIR)/timeit.c

$(PROGDIR)/fstime: $(SRCDIR)/fstime.c

$(PROGDIR)/syscall: $(SRCDIR)/syscall.c $(SRCDIR)/timeit.c

$(PROGDIR)/context1: $(SRCDIR)/context1.c $(SRCDIR)/timeit.c

$(PROGDIR)/looper: $(SRCDIR)/looper.c $(SRCDIR)/timeit.c

$(PROGDIR)/ubgears: $(SRCDIR)/ubgears.c
$(PROGDIR)/ubgears: LDFLAGS += -lm $(GL_LIBS)

$(PROGDIR)/dhry2: CFLAGS += -DHZ=${HZ}
$(PROGDIR)/dhry2: $(SRCDIR)/dhry_1.c $(SRCDIR)/dhry_2.c \
                  $(SRCDIR)/dhry.h $(SRCDIR)/timeit.c
	$(CC) -o $@ ${CFLAGS} $(SRCDIR)/dhry_1.c $(SRCDIR)/dhry_2.c

$(PROGDIR)/dhry2reg: CFLAGS += -DHZ=${HZ} -DREG=register
$(PROGDIR)/dhry2reg: $(SRCDIR)/dhry_1.c $(SRCDIR)/dhry_2.c \
                     $(SRCDIR)/dhry.h $(SRCDIR)/timeit.c
	$(CC) -o $@ ${CFLAGS} $(SRCDIR)/dhry_1.c $(SRCDIR)/dhry_2.c

# Run the benchmarks and create the reports
run:
	sh ./Run

clean:
	$(RM) $(BINS) core *~ */*~

spotless: clean
	$(RM) $(RESULTDIR)/* $(TMPDIR)/*

## END ##

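Test Method

  • microbench and UnixBench are ported and cross-compiled on the X86_64 host, then run on the RV64 board to obtain their scores.
  • The large-scale matrix computation program is likewise cross-compiled on the X86_64 host for RV64, then run on the RV64 board to measure its running time.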

Test Results

microbench results

BM_nop                               3.08 ns         2.99 ns    233889145
BM_ub                                3.10 ns         2.99 ns    233893313
BM_bnez                              12.4 ns         12.0 ns     58507883
BM_beqz                              12.3 ns         12.0 ns     58468358
BM_load_bnez                         12.2 ns         12.0 ns     58395094
BM_load_beqz                         12.1 ns         12.0 ns     58380949
BM_cache_miss_load_bnez              12.2 ns         5.99 ns    116951654
BM_cache_miss_load_beqz              12.4 ns         5.99 ns    116876961
BM_branch_miss_load_bnez             8.56 ns         4.00 ns    175331742
BM_branch_miss_load_beqz             8.15 ns         3.99 ns    175418777
BM_cache_branch_miss_load_bnez       9.79 ns         4.82 ns    141039686
BM_cache_branch_miss_load_beqz       10.3 ns         4.96 ns    127321246
BM_inc                               10.2 ns         9.97 ns     70122305
BM_dec                               11.2 ns         11.0 ns     63613111
BM_mul                               12.2 ns         12.0 ns     58461644
BM_div                               11.1 ns         11.0 ns     63466231
BM_float_inc                         18.5 ns         18.0 ns     39003335
BM_float_dec                         18.5 ns         18.0 ns     39007247
BM_float_mul                         18.5 ns         18.0 ns     38973395
BM_float_div                         33.9 ns         32.9 ns     21276946
BM_and                               11.2 ns         11.0 ns     63815846
BM_or                                11.4 ns         11.0 ns     63767351
BM_not                               11.3 ns         11.0 ns     63709582
BM_bits_and                          11.3 ns         11.0 ns     63730270
BM_bits_or                           12.2 ns         12.0 ns     58254510
BM_bits_nor                          12.2 ns         12.0 ns     58446391
BM_bits_not                          12.7 ns         12.0 ns     58380200
BM_bits_rshift                       11.3 ns         11.0 ns     63750367
BM_bits_lshift                       11.4 ns         11.0 ns     63743425
BM_for_loop                          24.7 ns         24.0 ns     29241522
BM_while_loop                        24.6 ns         23.9 ns     29267222
BM_do_while_loop                     24.8 ns         23.9 ns     29262257
BM_bubble_sort                        339 ns          328 ns      2160002
BM_std_sort                           197 ns          192 ns      3683400
BM_calculate_pi                      5182 ns         5073 ns       137929
BM_factorial                          108 ns          106 ns      6612491
benchmark time_rv64/ns cpu_rv64/ns iterations_rv64
BM_nop 3.08 2.99 233889145
BM_ub 3.1 2.99 233893313
BM_bnez 12.4 12 58507883
BM_beqz 12.3 12 58468358
BM_load_bnez 12.2 12 58395094
BM_load_beqz 12.1 12 58380949
BM_cache_miss_load_bnez 12.2 5.99 116951654
BM_cache_miss_load_beqz 12.4 5.99 116876961
BM_branch_miss_load_bnez 8.56 4 175331742
BM_branch_miss_load_beqz 8.15 3.99 175418777
BM_cache_branch_miss_load_bnez 9.79 4.82 141039686
BM_cache_branch_miss_load_beqz 10.3 4.96 127321246
BM_inc 10.2 9.97 70122305
BM_dec 11.2 11 63613111
BM_mul 12.2 12 58461644
BM_div 11.1 11 63466231
BM_float_inc 18.5 18 39003335
BM_float_dec 18.5 18 39007247
BM_float_mul 18.5 18 38973395
BM_float_div 33.9 32.9 21276946
BM_and 11.2 11 63815846
BM_or 11.4 11 63767351
BM_not 11.3 11 63709582
BM_bits_and 11.3 11 63730270
BM_bits_or 12.2 12 58254510
BM_bits_nor 12.2 12 58446391
BM_bits_not 12.7 12 58380200
BM_bits_rshift 11.3 11 63750367
BM_bits_lshift 11.4 11 63743425
BM_for_loop 24.7 24 29241522
BM_while_loop 24.6 23.9 29267222
BM_do_while_loop 24.8 23.9 29262257
BM_bubble_sort 339 328 2160002
BM_std_sort 197 192 3683400
BM_calculate_pi 5182 5073 137929
BM_factorial 108 106 6612491

UnixBench results

------unixbench----
------------------------------------------------------------------------
Benchmark Run: Wed Sep 07 2022 14:56:22 - 15:24:31
1 CPU in system; running 1 parallel copy of tests

Dhrystone 2 using register variables        3001049.4 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     1047.8 MWIPS (10.0 s, 7 samples)
Execl Throughput                                334.6 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         42369.6 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           11763.5 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        120604.7 KBps  (30.0 s, 2 samples)
Pipe Throughput                              208931.1 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  30214.6 lps   (10.0 s, 7 samples)
Process Creation                                790.2 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                    719.0 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                     92.6 lpm   (60.2 s, 2 samples)
System Call Overhead                         380766.8 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0    3001049.4    257.2
Double-Precision Whetstone                       55.0       1047.8    190.5
Execl Throughput                                 43.0        334.6     77.8
File Copy 1024 bufsize 2000 maxblocks          3960.0      42369.6    107.0
File Copy 256 bufsize 500 maxblocks            1655.0      11763.5     71.1
File Copy 4096 bufsize 8000 maxblocks          5800.0     120604.7    207.9
Pipe Throughput                               12440.0     208931.1    168.0
Pipe-based Context Switching                   4000.0      30214.6     75.5
Process Creation                                126.0        790.2     62.7
Shell Scripts (1 concurrent)                     42.4        719.0    169.6
Shell Scripts (8 concurrent)                      6.0         92.6    154.4
System Call Overhead                          15000.0     380766.8    253.8
                                                                   ========
System Benchmarks Index Score                                         133.4
name                                    baseline  rv64_result  unit   comment
Dhrystone 2 using register variables    116700    3001049.4    lps    (10.0 s, 7 samples)
Double-Precision Whetstone              55        1047.8       MWIPS  (10.0 s, 7 samples)
Execl Throughput                        43        334.6        lps    (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks   3960      42369.6      KBps   (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks     1655      11763.5      KBps   (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks   5800      120604.7     KBps   (30.0 s, 2 samples)
Pipe Throughput                         12440     208931.1     lps    (10.0 s, 7 samples)
Pipe-based Context Switching            4000      30214.6      lps    (10.0 s, 7 samples)
Process Creation                        126       790.2        lps    (30.0 s, 2 samples)
Shell Scripts (1 concurrent)            42.4      719          lpm    (60.0 s, 2 samples)
Shell Scripts (8 concurrent)            6         92.6         lpm    (60.2 s, 2 samples)
System Call Overhead                    15000     380766.8     lps    (10.0 s, 7 samples)

Large-scale matrix computation results

Start time is: 6262
开始生成随机数矩阵
随机数矩阵生成完毕
开始生成稀疏矩阵
稀疏矩阵生成完毕
求得最小的函数值为:0.142769
迭代次数为:162
End time is: 785892387
Time consumption is: 785

Performance Comparison with Current X86_64 Machines

Comparison with the Intel Core i7-4770HQ and Intel Core i5-8500

microbench compared with the i5-8500: the table and chart are shown below.

benchmark time_X86_64/ns time_rv64/ns cpu_X86_64/ns cpu_rv64/ns iterations_X86_64 iterations_rv64 ratio_time ratio_cpu ratio_iterations
BM_nop 0.248 3.08 0.248 2.99 1000000000 233889145 12.4193548 12.0564516 4.27552976
BM_ub 0.879 3.1 0.879 2.99 794629566 233893313 3.52673493 3.40159272 3.397401815
BM_bnez 0.99 12.4 0.99 12 708983810 58507883 12.5252525 12.1212121 12.11774848
BM_beqz 0.991 12.3 0.991 12 706511735 58468358 12.4117053 12.1089808 12.08365959
BM_load_bnez 0.799 12.2 0.799 12 906538755 58395094 15.2690864 15.0187735 15.52422803
BM_load_beqz 1.1 12.1 1.1 12 632241417 58380949 11 10.9090909 10.82958444
BM_cache_miss_load_bnez 2.2 12.2 2.2 5.99 333106358 116951654 5.54545455 2.72272727 2.848239821
BM_cache_miss_load_beqz 2.34 12.4 2.34 5.99 294563456 116876961 5.2991453 2.55982906 2.520286748
BM_branch_miss_load_bnez 1.91 8.56 1.91 4 363788231 175331742 4.48167539 2.09424084 2.074856651
BM_branch_miss_load_beqz 1.8 8.15 1.8 3.99 409687271 175418777 4.52777778 2.21666667 2.335481287
BM_cache_branch_miss_load_bnez 3.75 9.79 3.75 4.82 190417310 141039686 2.61066667 1.28533333 1.350097376
BM_cache_branch_miss_load_beqz 3.57 10.3 3.57 4.96 199812939 127321246 2.88515406 1.38935574 1.569360537
BM_inc 0.495 10.2 0.495 9.97 1000000000 70122305 20.6060606 20.1414141 14.26079762
BM_dec 0.618 11.2 0.618 11 1000000000 63613111 18.1229773 17.7993528 15.72002979
BM_mul 0.558 12.2 0.558 12 1000000000 58461644 21.8637993 21.5053763 17.10523228
BM_div 0.805 11.1 0.805 11 871044955 63466231 13.7888199 13.6645963 13.72454203
BM_float_inc 0.997 18.5 0.997 18 710661226 39003335 18.555667 18.0541625 18.22052463
BM_float_dec 0.992 18.5 0.992 18 696412695 39007247 18.6491935 18.1451613 17.85341824
BM_float_mul 0.801 18.5 0.801 18 870941987 38973395 23.0961298 22.4719101 22.34709055
BM_float_div 0.909 33.9 0.909 32.9 764747492 21276946 37.2937294 36.1936194 35.94254044
BM_and 0.617 11.2 0.617 11 1000000000 63815846 18.1523501 17.828201 15.67008921
BM_or 0.618 11.4 0.618 11 1000000000 63767351 18.4466019 17.7993528 15.6820063
BM_not 0.619 11.3 0.619 11 1000000000 63709582 18.2552504 17.7705977 15.69622604
BM_bits_and 0.562 11.3 0.562 11 1000000000 63730270 20.1067616 19.5729537 15.69113076
BM_bits_or 0.559 12.2 0.559 12 1000000000 58254510 21.8246869 21.4669052 17.16605289
BM_bits_nor 0.556 12.2 0.556 12 1000000000 58446391 21.942446 21.5827338 17.1096963
BM_bits_not 0.555 12.7 0.555 12 1000000000 58380200 22.8828829 21.6216216 17.12909514
BM_bits_rshift 0.555 11.3 0.555 11 1000000000 63750367 20.3603604 19.8198198 15.68618421
BM_bits_lshift 0.558 11.4 0.558 11 1000000000 63743425 20.4301075 19.7132616 15.68789252
BM_for_loop 3.04 24.7 3.04 24 235779692 29241522 8.125 7.89473684 8.063181253
BM_while_loop 3.25 24.6 3.25 23.9 215554143 29267222 7.56923077 7.35384615 7.365035978
BM_do_while_loop 2.98 24.8 2.98 23.9 235437929 29262257 8.32214765 8.02013423 8.045788437
BM_bubble_sort 31 339 31 328 22556145 2160002 10.9354839 10.5806452 10.44265005
BM_std_sort 9.74 197 9.74 192 72264119 3683400 20.2258727 19.7125257 19.61886274
BM_calculate_pi 127 5182 127 5073 5446257 137929 40.8031496 39.9448819 39.48594567
BM_factorial 9.15 108 9.15 106 76995221 6612491 11.8032787 11.5846995 11.6439056
average - - - - - - 15.4073332 14.6701879 13.23012203

[Figure: microbench comparison chart, C906 vs. i5-8500]

UnixBench compared with the i7-4770HQ

Benchmark Run: Wed Sep 07 2022 19:59:17 - 20:23:52
8 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       32343533.0 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     4533.0 MWIPS (10.5 s, 7 samples)
Execl Throughput                               3263.5 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        635715.1 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          168269.6 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       2003899.3 KBps  (30.0 s, 2 samples)
Pipe Throughput                              677305.6 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 110291.0 lps   (10.0 s, 7 samples)
Process Creation                               8915.2 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   9925.1 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   2371.0 lpm   (60.0 s, 2 samples)
System Call Overhead                         554765.5 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   32343533.0   2771.5
Double-Precision Whetstone                       55.0       4533.0    824.2
Execl Throughput                                 43.0       3263.5    759.0
File Copy 1024 bufsize 2000 maxblocks          3960.0     635715.1   1605.3
File Copy 256 bufsize 500 maxblocks            1655.0     168269.6   1016.7
File Copy 4096 bufsize 8000 maxblocks          5800.0    2003899.3   3455.0
Pipe Throughput                               12440.0     677305.6    544.5
Pipe-based Context Switching                   4000.0     110291.0    275.7
Process Creation                                126.0       8915.2    707.6
Shell Scripts (1 concurrent)                     42.4       9925.1   2340.8
Shell Scripts (8 concurrent)                      6.0       2371.0   3951.7
System Call Overhead                          15000.0     554765.5    369.8
                                                                   ========
System Benchmarks Index Score                                        1111.4

------------------------------------------------------------------------
Benchmark Run: Wed Sep 07 2022 20:23:52 - 20:48:38
8 CPUs in system; running 8 parallel copies of tests

Dhrystone 2 using register variables      107735436.2 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    25460.0 MWIPS (9.3 s, 7 samples)
Execl Throughput                              10155.8 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1254565.5 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          326444.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       3831671.6 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2871966.9 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 280889.3 lps   (10.0 s, 7 samples)
Process Creation                              24198.2 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  24486.0 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   3283.5 lpm   (60.0 s, 2 samples)
System Call Overhead                        2258288.6 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  107735436.2   9231.8
Double-Precision Whetstone                       55.0      25460.0   4629.1
Execl Throughput                                 43.0      10155.8   2361.8
File Copy 1024 bufsize 2000 maxblocks          3960.0    1254565.5   3168.1
File Copy 256 bufsize 500 maxblocks            1655.0     326444.0   1972.5
File Copy 4096 bufsize 8000 maxblocks          5800.0    3831671.6   6606.3
Pipe Throughput                               12440.0    2871966.9   2308.7
Pipe-based Context Switching                   4000.0     280889.3    702.2
Process Creation                                126.0      24198.2   1920.5
Shell Scripts (1 concurrent)                     42.4      24486.0   5775.0
Shell Scripts (8 concurrent)                      6.0       3283.5   5472.5
System Call Overhead                          15000.0    2258288.6   1505.5
                                                                   ========
System Benchmarks Index Score                                        3037.7

The comparison table and chart are shown below:

name                                    baseline  rv64_result  X86_64_result  ratio        unit   comment
Dhrystone 2 using register variables    116700    3001049.4    32343533       10.77740773  lps    (10.0 s, 7 samples)
Double-Precision Whetstone              55        1047.8       4533           4.326207291  MWIPS  (10.0 s, 7 samples)
Execl Throughput                        43        334.6        3263.5         9.75343694   lps    (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks   3960      42369.6      635715.1       15.00403827  KBps   (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks     1655      11763.5      168269.6       14.3043822   KBps   (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks   5800      120604.7     2003899.3      16.6154329   KBps   (30.0 s, 2 samples)
Pipe Throughput                         12440     208931.1     677305.6       3.241765348  lps    (10.0 s, 7 samples)
Pipe-based Context Switching            4000      30214.6      110291         3.650255175  lps    (10.0 s, 7 samples)
Process Creation                        126       790.2        8915.2         11.28220704  lps    (30.0 s, 2 samples)
Shell Scripts (1 concurrent)            42.4      719          9925.1         13.80403338  lpm    (60.0 s, 2 samples)
Shell Scripts (8 concurrent)            6         92.6         2371           25.60475162  lpm    (60.0 s, 2 samples)
System Call Overhead                    15000     380766.8     554765.5       1.4569692    lps    (10.0 s, 7 samples)
avg                                     -         -            -              10.81840726

[Figure: UnixBench comparison chart, C906 vs. i7-4770HQ]

From these figures, the selected X86_64 CPU is roughly 10 times as fast as the selected rv64 CPU (the average per-test ratio is about 10.8).

Large-scale matrix program compared with the i7-4770HQ

Start time is: 1097
开始生成随机数矩阵
随机数矩阵生成完毕
开始生成稀疏矩阵
稀疏矩阵生成完毕
求得最小的函数值为:0.151914
迭代次数为:170
End time is: 20455667
Time consumption is: 20

The rv64 processor needed 785 s against 20 s on the i7-4770HQ, a difference in running time of roughly 39.25 times.

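Summary

  • This was the author's first time working with a RISC-V development board. The board originally ran an incompletely ported ArchLinux system whose device drivers were far from complete, which caused many problems during porting, such as the HDMI port being unusable. Moving to a more complete Debian port resolved these problems.
  • The ported Debian system had no glibc library, so gcc could not be used for cross-compilation. Porting glibc resolved this.
  • When porting the large-scale matrix program, the binary built from the original code was so compute-heavy that it ran on the board for three days and nights without finishing, which made testing impossible. Compiling with -O2 resolved this.
  • The tests and charts above show that, even in a single-core comparison, this RISC-V processor is clearly slower than the mature X86_64 processors. The gap is not caused by clock frequency alone. It also stems from differences in the design of the RISC-V and X86_64 processors and in the manufacturing processes of the two chips. On basic instructions the RISC-V chip differs considerably from the X86_64 chips (4-10 times), and the difference remains clear even after scaling to the same clock frequency (2-5 times).
  • After scaling to the same frequency, the comparison is as follows:

Test         Same-frequency comparison
microbench   i7-4770HQ is 4.51 times the XuanTie C906
UnixBench    i5-8500 is 2.64 times the XuanTie C906
Matrix       i7-4770HQ is 12.26 times the XuanTie C906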

References

  1. Eigen
  2. microbench
  3. UnixBench