CUDA Programming – String Matching

CUDA is worked on by NVIDIA to help application programmers to use the untapped potential of GPUs. As GPU have so many cores that are not used most of the time,Developers can take advantage of this situation and make their programs faster in the process.
Introduction to CUDA

Now , String Mathching
The Kernel Code :

__global__ void searchKeywordKernel(int *result, char *data, char *keyword,int datasize)
int i = blockIdx.x*threadIdx.x;
// Detect the first matching character

if (data[i] == keyword[0]) {
// Loop through next keyword character
for (int j=1; j<3; i++) {
if (data[i+j] != keyword[j])
// Store the first matching character to the result list
result[i] = 1;

Looking at the kernel code. By using the blockID , ThreadID we get a unique indexing number to match string.

We also specify the number of BLOCKS and THREADS
We do that by ,

searchKeywordKernel<<>>(dev_result, dev_data, dev_keyword,datasize);

So we get the number of blocks by Dividing the size with 512(As that is the optimal no. of threads in a block)

Ran with CUDA
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

1)nvcc -o string su_2.c

input.text a file it is reading from
and patter to be matched being Sherlock

String Matched at
4 52 34 54 56 65 23 54 453 545
Time Taken 33:534232 seconds

GPU and its powerful thread capabilities is real power that is yet to be exploited, CPU has only 4 cores whereas GPU can have multiple number of codes.


Matrix Multiplication

It always frightens us when we hear the word “Matrix” and the complex computations involved in Multiplying two Prime Numbers.

Lets take a look at the serial code , Brute force approach

Brute force

Brute Force Serial

  • Input: matrices A and B
  • Let C be a new matrix of the appropriate size
  • Pick a tile size T = Θ(√M)
  • For I from 1 to n in steps of T:
    • For J from 1 to p in steps of T:
      • For K from 1 to m in steps of T:
        • Multiply AI:I+T, K:K+T and BK:K+T, J:J+T into CI:I+T, J:J+T, that is:
        • For i from I to min(I + T, n):
          • For j from J to min(J + T, p):
            • Let sum = 0
            • For k from K to min(K + T, m):
              • Set sum ← sum + Aik × Bkj
            • Set CijCij + sum
  • Return C

Is the Algorithm for Brute force is optimal for smaller sizes

Divide and Conquer 
Divide and Conquer Serial

  • Inputs: matrices A of size n × m, B of size m × p.
  • Base case: if max(n, m, p) is below some threshold, use an unrolled version of t
  • If max(n, m, p) = n, split A horizontally:

Now Serial Divide and Conquer- D & C (Serial)


Now Strassen Matrix Multiplication Strassen Serial


Source : WIKIPEDIA for algo


SIMD – Single Instruction Multiple Data

In our Parallel Computing course, We were so eager to dive into Parallelize code but took a step back and looked at how we could improve already present serial code.

SIMD is an approach to do the same instruction on Multiple data points simultaneously. So more work is done with a single process instruction at given moment.


As you can see in the above picture , four scalar operation are converted to one SIMD operation.

CUDA Programming

A simple way to understand the difference between a GPU and a CPU is to compare how they process tasks. A CPU consists of a few cores optimized for sequential serial processing while a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously.GPU does mathematically heavy tasks that usually would strain the CPU .


CUDA programming was invented to get great software to GPU , so we could run C,C++ or fotron code directly on GPU’s. NVIDIA , went out to build CUDA programming so developers can take advantage of this hardware for various tasks that strain the CPU. NVIDIA main was to , Bring Software and Hardware together.  Now , we use GPU to perform any tasks that burden the CPU for example , floating point number operations. NVIDIA made it possible to harness the GPU power for its true potential and by removing the learning curve required before , to write code for GPU.

So by using CUDA extensions , we can write Kernel code in C


source :

CUDA is not API or a language for that matter , It is just a platform to write seamless GPU code. In the above picture , you can see with simple keywords let the developer write CUDA code.

First dip into Parallel Programming

Parallel Computing , words itself seem very intriguing. The reason behind Parallel Computing was the increase in power in our present systems and chance to exploit that power.

So I was very eager to dive into some Parallel Programming.What is the first program every computer science engineer writes in a new Environment,Hello world Program.

#include "omp.h"
int main(){
#pragma omp parallel for
for(int i=0;i<5;i++){
    printf(" Hello World %d ....",omp_get_thread_num());


Output I got were pretty Interesting
Test 1:

Hello World 2
3rd iteration in Thread 2
Hello World 1
2nd iteration in Thread 1
Hello World 0
0th iteration in Thread 0
Hello World 0
1st iteration in Thread 0
Hello World 3
4th iteration in Thread 3

Test 2:

 Hello World 0 
0th iteration in Thread 0 
 Hello World 0 
1st iteration in Thread 0 
 Hello World 2 
3rd iteration in Thread 2 
 Hello World 1 
2nd iteration in Thread 1 
 Hello World 3 
4th iteration in Thread 3 

#Pragma OpenMP for ,this flag tells the compiler that this for loop can be executed in parallel.(YES , as simple as that)
It created 4 threads on my machine.Which is running an Intel i5 Processor,two physical cores and two virtual cores.Which add up to 4 threads.

The outputs show an Interesting Behaviour!The for loop iterations may occur in any order when executed.
With that I found my first learning in Parallel Computing If the output of a program must remain consistent throughout multiple tests we must make sure the tasks we flag parallel are well and truly independent of the other iterations.