A pattern growth-based sequential pattern mining algorithm called prefixSuffixSpan

Sequential pattern mining is an important data mining problem widely addressed by the data mining community, with a very large field of applications. Sequential pattern mining aims at extracting a set of attributes, shared across time among a large number of objects in a given database. The work presented in this paper is directed towards the general theoretical foundations of the pattern-growth approach. It supports an in-depth understanding of the pattern-growth approach, of the current status of the proposed solutions, and of the directions of research in this area. The study is carried out on a particular class of pattern-growth algorithms in which patterns are grown by extending either the current pattern prefix or the current pattern suffix from the same position at each growth-step. This study leads to a new algorithm called prefixSuffixSpan. Its correctness is proven and experiments are performed. Received on 07 November 2016; accepted on 05 December 2016; published on 19 January 2017


Introduction
A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. Sequences are common, occurring in any metric space that facilitates either partial or total ordering. Customer transactions, codons or nucleotides in an amino acid, website traversal, computer networks, DNA sequences and characters in a text string are examples of where the existence of sequences may be significant and where the detection of frequent (totally or partially ordered) subsequences might be useful. Sequential pattern mining has arisen as a technology to discover such subsequences. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a customer transaction database, is a (frequent) sequential pattern.
Sequential pattern mining [5,13,14,16] is an important data mining problem widely addressed by the data mining community, with a very large field of applications such as finding network alarm patterns, mining customer purchase patterns, identifying outer membrane proteins, automatically detecting erroneous sentences, discovering block correlations in storage systems, identifying plan failures, identifying copy-paste and related bugs in large-scale software code, API specification mining and API usage mining from open source repositories, and Web log data mining. Sequential pattern mining aims at extracting a set of attributes, shared across time among a large number of objects in a given database.
The sequential pattern mining problem was first introduced by Agrawal and Srikant [3] in 1995 based on their study of customer purchase sequences, as follows: Given a set of sequences, where each sequence consists of a list of events (or elements) and each event consists of a set of items, and given a user-specified minimum support threshold min_sup, sequential pattern mining finds all frequent subsequences, that is, the subsequences whose occurrence frequency in the set of sequences is no less than min_sup.
Since the first proposal of this new data mining task and its associated efficient mining algorithms, there has been a growing number of researchers in the field and tremendous progress [16] has been made, evidenced by hundreds of follow-up research publications, on various kinds of extensions and applications, ranging from scalable data mining methodologies, to handling a wide diversity of data types, various extended mining tasks, and a variety of new applications.
Improvements in sequential pattern mining algorithms have followed a similar trend to that of the related area of association rule mining and have been motivated by the need to process more data at a faster speed with lower cost. Previous studies have developed two major classes of sequential pattern mining methods: Apriori-based approaches [3, 4, 8-10, 17, 21, 23, 25, 26] and pattern-growth algorithms [11, 12, 18-20, 22].
The Apriori-based approaches form the vast majority of algorithms proposed in the literature for sequential pattern mining. Apriori-like algorithms depend mainly on the Apriori anti-monotonicity property, which states that any super-pattern of an infrequent pattern cannot be frequent, and are based on a candidate generation-and-test paradigm proposed in association rule mining [1,2]. This candidate generation-and-test paradigm is carried out by GSP [3], SPADE [26], and SPAM [4]. Mining algorithms derived from this approach are based on either horizontal or vertical data formats. Algorithms based on the horizontal data format include AprioriAll, AprioriSome and DynamicSome [3], GSP [3], PSP [17] and SPIRIT [8], while those based on the vertical data format include SPADE [26], cSPADE [25], SPAM [4], LAPIN-SPAM [23], IBM [21] and PRISM [9,10]. The generation-and-test paradigm has the disadvantage of repeatedly generating an explosive number of candidate sequences and scanning the database to maintain the support count information for these sequences during each iteration of the algorithm, which makes these algorithms computationally expensive. To increase their performance, constraint-driven discovery can be carried out. With constraint-driven approaches, systems concentrate only on the patterns the user is interested in, expressed through user-specified constraints such as minimum support, minimum gap or time interval. Constraints expressed as regular expressions are studied in SPIRIT [8].
To alleviate these problems, the pattern-growth approach, represented by FreeSpan [11], PrefixSpan [18,19] and their further extensions, namely FS-Miner [6], LAPIN [12,24], SLPMiner [22] and WAP-mine [20], adopts a divide-and-conquer pattern-growth paradigm as follows. Sequence databases are recursively projected into a set of smaller projected databases based on the current sequential patterns, and sequential patterns are grown in each projected database by exploring only locally frequent fragments [11,19]. The frequent pattern-growth paradigm removes the need for the candidate generation and prune steps that occur in the Apriori-based algorithms and repeatedly narrows the search space by dividing a sequence database into a set of smaller projected databases, which are mined separately. Unlike Apriori-based algorithms, projection-based pattern-growth algorithms grow longer sequential patterns from the shorter frequent ones. The major cost of these algorithms is the cost of forming projected databases recursively. A pseudo-projection method is exploited to reduce this cost. Instead of performing a physical projection, one registers the index (or identifier) of the corresponding sequence and the starting position of the projected suffix in the sequence. A physical projection of a sequence is thus replaced by a sequence identifier and the index of the projected position. Pseudo-projection reduces the cost of projection substantially when the projected database can fit in main memory.
PrefixSpan [18,19] and FreeSpan [11] differ in the criteria used to partition projected databases and in the criteria used to grow patterns. FreeSpan creates projected databases based on the current set of frequent patterns without a particular ordering (i.e., pattern-growth direction), whereas PrefixSpan projects databases by growing frequent prefixes. Thus, PrefixSpan follows a unidirectional growth whereas FreeSpan follows a bidirectional growth. Another difference between FreeSpan and PrefixSpan is that pseudo-projection works efficiently for PrefixSpan but not for FreeSpan. This is because, for PrefixSpan, an offset position clearly identifies the suffix and thus the projected subsequence. For FreeSpan, however, since the next pattern-growth step can be in both forward and backward directions from any position, one needs to register more information on the possible extension positions in order to identify the remainder of the projected subsequences.
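The pseudo-projection idea described above can be illustrated with a minimal Python sketch. It assumes, for simplicity only, that every element of a sequence is a single item, so that a sequence can be written as a string; the names `Pointer`, `initial_pointers`, `pseudo_project` and `materialise` are hypothetical and do not come from PrefixSpan or from this paper.

```python
from typing import List, Tuple

# A pointer replaces a physically projected sequence:
# (index of the sequence in the database, offset where the projected suffix starts).
Pointer = Tuple[int, int]


def initial_pointers(db: List[str]) -> List[Pointer]:
    """Every sequence is initially projected on the empty prefix (offset 0)."""
    return [(sid, 0) for sid in range(len(db))]


def pseudo_project(db: List[str], pointers: List[Pointer], item: str) -> List[Pointer]:
    """Grow the current prefix by `item` without copying any sequence.

    For each pointed suffix, look for the first occurrence of `item`
    and register the position that follows it.
    """
    projected = []
    for sid, start in pointers:
        pos = db[sid].find(item, start)
        if pos != -1:  # sequences that do not contain the item are dropped
            projected.append((sid, pos + 1))
    return projected


def materialise(db: List[str], pointers: List[Pointer]) -> List[str]:
    """Physical view of a pseudo-projected database (for inspection only)."""
    return [db[sid][start:] for sid, start in pointers]


if __name__ == "__main__":
    db = ["abcb", "acb", "bca"]
    after_a = pseudo_project(db, initial_pointers(db), "a")
    after_ab = pseudo_project(db, after_a, "b")
    print(after_a, materialise(db, after_a))
    print(after_ab, materialise(db, after_ab))
```

A single offset per sequence is enough here because growth always happens at the right-hand side of the prefix, which is the property the text attributes to PrefixSpan.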
The work presented in this paper is directed towards the general theoretical foundations of the pattern-growth approach. It does not look into algorithms specific to closed, maximal or incremental sequences, nor does it investigate special cases of constrained, approximate or near-match sequential pattern mining. It aims at enhancing the understanding of the pattern-growth approach, of the current status of the proposed solutions, and of the directions of research in this area. To this end, the key concepts upon which that approach relies, namely pattern-growth direction, pattern-growth ordering, search space pruning and search space partitioning, are revisited. The study is carried out on a particular class of pattern-growth algorithms in which patterns are grown by extending either the current pattern prefix or the current pattern suffix from the same position at each growth-step. This class contains PrefixSpan and involves both unidirectional and bidirectional growth. Thus, it is a generalization of PrefixSpan. However, it does not contain FreeSpan, which grows patterns from any position. Stemming from this theoretical study, we design a new algorithm called prefixSuffixSpan. We prove its correctness and perform experiments.
The rest of the paper is organized as follows. Section 2 presents the formal definition of the problem of sequential pattern mining. Section 3 presents the contribution of the paper. Concluding remarks are given in section 4.

Problem statement and Notation
The problem of mining sequential patterns, and its associated notation, can be given as follows. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of literals, termed items, which comprise the alphabet. An itemset is a subset of items. A sequence is an ordered list of itemsets. A sequence $s$ is denoted by $\prec s_1, s_2, \ldots, s_n \succ$, where $s_j$ is an itemset. $s_j$ is also called an element of the sequence, and denoted as $(x_1, x_2, \ldots, x_m)$, where $x_k$ is an item. For brevity, the brackets are omitted if an element has only one item, i.e. element $(x)$ is written as $x$. An item can occur at most once in an element of a sequence, but can occur multiple times in different elements of a sequence. The number of instances of items in a sequence is called the length of the sequence. A sequence with length $l$ is called an $l$-sequence. The length of a sequence $\alpha$ is denoted $|\alpha|$. A sequence $\alpha = \prec a_1 a_2 \ldots a_n \succ$ is called a subsequence of another sequence $\beta = \prec b_1 b_2 \ldots b_m \succ$, and $\beta$ a supersequence of $\alpha$, denoted as $\alpha \subseteq \beta$, if there exist integers $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$. Symbol $\epsilon$ denotes the empty sequence.
We are given a database $S$ of input-sequences. A sequence database is a set of tuples of the form $\prec sid, s \succ$ where $sid$ is a sequence_id and $s$ a sequence. A tuple $\prec sid, s \succ$ is said to contain a sequence $\alpha$ if $\alpha$ is a subsequence of $s$. The support of a sequence $\alpha$ in a sequence database $S$ is the number of tuples in the database containing $\alpha$, i.e.

$$support(S, \alpha) = \left|\{\prec sid, s \succ \in S \mid \alpha \subseteq s\}\right|.$$
It can be denoted as $support(\alpha)$ if the sequence database is clear from the context. Given a user-specified positive integer denoted min_support, termed the minimum support or the support threshold, a sequence $\alpha$ is called a sequential pattern in the sequence database $S$ if $support(S, \alpha) \ge$ min_support. A sequential pattern with length $l$ is called an $l$-pattern. Given a sequence database and the min_support threshold, sequential pattern mining is to find the complete set of sequential patterns in the database.
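The definitions of subsequence, length and support given above can be made concrete with a short Python sketch. The representation (a sequence as a tuple of frozensets, a database as a list of `(sid, sequence)` pairs) and the helper names `seq`, `length`, `is_subsequence` and `support` are assumptions made for illustration; the toy database is not taken from the experiments of this paper.

```python
from typing import FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Sequence_ = Tuple[Itemset, ...]          # ordered list of itemsets
Database = List[Tuple[int, Sequence_]]   # tuples <sid, s>


def seq(*elements: str) -> Sequence_:
    """Build a sequence from one string per element, e.g. seq("a", "abc")."""
    return tuple(frozenset(e) for e in elements)


def length(s: Sequence_) -> int:
    """Length of a sequence: the number of item instances it contains."""
    return sum(len(e) for e in s)


def is_subsequence(alpha: Sequence_, beta: Sequence_) -> bool:
    """True iff each element of alpha is included in a distinct element of beta,
    in the same order. Matching each element as early as possible is sufficient."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


def support(db: Database, alpha: Sequence_) -> int:
    """Number of tuples of the database that contain alpha."""
    return sum(1 for _sid, s in db if is_subsequence(alpha, s))


if __name__ == "__main__":
    db: Database = [
        (10, seq("a", "abc", "ac", "d", "cf")),
        (20, seq("ad", "c", "bc", "ae")),
        (30, seq("ef", "ab", "df", "c", "b")),
        (40, seq("e", "g", "af", "c", "b", "c")),
    ]
    min_support = 2
    for alpha in (seq("ab", "c"), seq("a", "c"), seq("g", "a")):
        s = support(db, alpha)
        print(sorted(map(sorted, alpha)), s, s >= min_support)
```

Under these assumptions, a sequence is a sequential pattern exactly when the last printed value is `True`.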

Pattern-Growth Directions and Orderings
Definition 1 (Pattern-growth direction). A pattern-growth direction is a direction along which patterns could grow. There are two pattern-growth directions, namely the left-to-right and right-to-left directions. To grow a pattern along the left-to-right (resp. right-to-left) direction is to add one or more items to its right-hand (resp. left-hand) side.
Definition 2 (Pattern-growth ordering). A pattern-growth ordering is a specification of the order in which patterns should grow. A pattern-growth ordering is said to be unidirectional iff all the patterns should grow along a unique direction. Otherwise it is said to be bidirectional. A pattern-growth ordering is said to be static (resp. dynamic) iff it is fully specified before the beginning of the mining process (resp. iff it is constructed during the mining process).
Definition 3 (Basic-static pattern-growth ordering). A basic-static pattern-growth ordering, also called basic pattern-growth ordering for the sake of simplicity, is an ordering which is based on a unique pattern-growth direction and grows a pattern at the rate of one item per growth-step.
There are two basic-static pattern-growth orderings, namely left-to-right ordering (also called prefix-growth ordering), which consists in growing a prefix of a pattern at the rate of one item per growth-step at its right hand side, and right-to-left ordering (also called suffix-growth ordering), which consists in growing a suffix of a pattern at the rate of one item per growth-step at its left hand side.
Definition 4 (Basic-dynamic pattern-growth ordering). A basic-dynamic pattern-growth ordering is an ordering which grows a pattern at the rate of one item per growth-step, and whose pattern-growth direction is determined at the beginning of each growth-step during the mining process. It is denoted $\ast$-growth.

Definition 5 (Basic-bidirectional pattern-growth ordering). A basic-bidirectional pattern-growth ordering is an ordering which is based on the two distinct pattern-growth directions and grows a pattern in each direction at the rate of one item per couple of growth-steps.
There are two basic-bidirectional pattern-growth orderings. The prefix-suffix-growth ordering (i.e. left-to-right direction followed by right-to-left direction) grows a pattern at the rate of one item per growth-step during a couple of steps, by first growing a prefix of that pattern (i.e. adding one item at its right-hand side) and then growing the corresponding suffix (i.e. adding one item at its left-hand side). The suffix-prefix-growth ordering (i.e. right-to-left direction followed by left-to-right direction) grows a pattern at the rate of one item per growth-step during a couple of steps, by first growing a suffix of that pattern and then growing the corresponding prefix.
Definition 6 (Linear pattern-growth ordering). A linear pattern-growth ordering is a series of compositions of $\ast$-growth, prefix-growth and suffix-growth orderings. It is said to be static iff it involves only prefix-growth and suffix-growth orderings. Otherwise, it is said to be dynamic.
The $o_0$-$o_1$-$o_2 \ldots o_{n-1}$-growth linear ordering consists in growing a pattern at the rate of one item per growth-step during a series of $n$ growth-steps, by growing at step $i$ ($0 \le i \le n-1$) a prefix (resp. suffix) of that pattern if $o_i$ denotes prefix (resp. suffix). If $o_i \in \{\ast\}$, a pattern-growth direction is determined and an item is added to the pattern following that direction. For instance, stemming from the prefix-suffix-suffix-prefix-growth static linear ordering, one should grow a pattern in the following order:
• Growth-step 0: Add one item to the right-hand side of a prefix of that pattern.
• Growth-step 1: Add one item to the left-hand side of the corresponding suffix of the previous prefix.
• Growth-step 2: Add one item to the left-hand side of the previous suffix.
• Growth-step 3: Add one item to the right-hand side of the previous prefix.
The prefix-suffix-$\ast$-prefix-growth dynamic linear ordering grows patterns as the prefix-suffix-suffix-prefix-growth ordering does, except for the steps $k$ that satisfy $(k \bmod 4) = 2$, i.e. the steps whose position in the ordering holds $\ast$. During such a step, a pattern-growth direction is determined and an item is added to the pattern following that direction.
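The way a linear ordering drives the growth direction step by step can be sketched in Python as follows. This is only an illustration under stated assumptions: the string `"*"` is a placeholder chosen here for the dynamic entry of an ordering, the ordering is assumed to be reused cyclically once its $n$ entries are exhausted, and the names `is_static`, `direction_at`, `schedule` and the `choose` callback are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

PREFIX = "prefix"    # grow the prefix: add one item at its right-hand side
SUFFIX = "suffix"    # grow the suffix: add one item at its left-hand side
DYNAMIC = "*"        # placeholder: direction decided at the beginning of the step


def is_static(ordering: Sequence[str]) -> bool:
    """A linear ordering is static iff it contains no dynamic entry."""
    return DYNAMIC not in ordering


def direction_at(
    ordering: Sequence[str],
    step: int,
    choose: Optional[Callable[[int], str]] = None,
) -> str:
    """Direction followed at growth-step `step`.

    Assumption of this sketch: the ordering is applied cyclically,
    so step k uses the entry at position k mod n.
    """
    entry = ordering[step % len(ordering)]
    if entry != DYNAMIC:
        return entry
    if choose is None:
        raise ValueError("a dynamic step needs a direction-choosing callback")
    return choose(step)


def schedule(
    ordering: Sequence[str],
    steps: int,
    choose: Optional[Callable[[int], str]] = None,
) -> List[str]:
    """Directions followed during the first `steps` growth-steps."""
    return [direction_at(ordering, k, choose) for k in range(steps)]


if __name__ == "__main__":
    static_ordering = [PREFIX, SUFFIX, SUFFIX, PREFIX]
    print(is_static(static_ordering), schedule(static_ordering, 6))

    dynamic_ordering = [PREFIX, SUFFIX, DYNAMIC, PREFIX]
    # Hypothetical policy: always grow the suffix at dynamic steps.
    print(is_static(dynamic_ordering),
          schedule(dynamic_ordering, 6, choose=lambda step: SUFFIX))
```

In a real miner the `choose` callback would typically inspect the current projected database rather than the step number; the signature used here is kept minimal on purpose.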
FreeSpan and PrefixSpan differ in the criteria used to grow patterns. FreeSpan creates projected databases based on the current set of frequent patterns without a particular ordering (i.e., pattern-growth direction). Since a length-$k$ pattern may grow at any position, the search for length-$(k+1)$ patterns needs to check every possible combination, which is costly. Because of this, FreeSpan does not follow a linear ordering. PrefixSpan, however, follows the prefix-growth static ordering as it projects databases by growing frequent prefixes.
Given a database of sequences, an open problem is to find a linear ordering that leads to the best mining performances over all possible linear orderings.

Search Space Pruning and Partitioning
Definition 7 (Prefix of an itemset). Suppose all the items within an itemset are listed alphabetically. Given an itemset $x = (x_1 x_2 \ldots x_n)$, an itemset $x' = (x_1 x_2 \ldots x_m)$ with $m \le n$ is called a prefix of $x$.
Definition 8 (The corresponding suffix of a prefix of an itemset). Itemset $x'' = (x_{m+1} x_{m+2} \ldots x_n)$ is called the suffix of $x$ with regards to prefix $x'$, denoted as $x'' = x/x'$. We also denote $x = x'.x''$. Note that, if $x' = x$, the suffix of $x$ with regards to $x'$ is empty. If $1 \le m < n$, the suffix is also denoted as $(\_x_{m+1} x_{m+2} \ldots x_n)$.
The following definition introduces the dot operator. It permits itemset concatenations and sequence concatenations.
Definition 9 ("." operator). Let $e$ and $e'$ be two itemsets that do not contain the underscore symbol (_). Assume that all the items in $e'$ are alphabetically sorted after those in $e$. Let $\gamma = \prec e_1 \ldots e_{n-1}\, a \succ$ and $\mu = \prec b\, e'_2 \ldots e'_m \succ$ be two sequences, where $e_i$ and $e'_i$ are itemsets that do not contain the underscore symbol, $a \in \{e, (\_\text{items in } e), (\text{items in } e\_), (\_\text{items in } e\_)\}$ and $b \in \{e', (\_\text{items in } e'), (\text{items in } e'\_), (\_\text{items in } e'\_)\}$. The dot operator is defined as follows.
Stemming from the canonical decompositions of sequences following prefix $\alpha$ and suffix $\alpha'$, we define two sets of the sequence database $S$ as follows. We denote $S^{\alpha,\alpha'}$ the set of subsequences of $S$ prefixed with $\alpha$ and suffixed with $\alpha'$ which are obtained by replacing the left and right parts of canonical decompositions respectively with $\alpha$ and $\alpha'$. We have $S^{\alpha,\alpha'} = \{\prec sid, \alpha.y_{\alpha,\alpha'}.\alpha' \succ \mid \prec sid, y \succ \in S \text{ and } y = spc(y, \alpha).y_{\alpha,\alpha'}.ssc(y, \alpha')\}$. We denote $S_{\alpha,\alpha'}$ the set of subsequences which are obtained by removing the left and right parts of canonical decompositions. We have $S_{\alpha,\alpha'} = \{\prec sid, y_{\alpha,\alpha'} \succ \mid \prec sid, y \succ \in S \text{ and } y = spc(y, \alpha).y_{\alpha,\alpha'}.ssc(y, \alpha')\}$. We also have $S^{\epsilon,\epsilon} = S$ and $S_{\epsilon,\epsilon} = S$, as $\epsilon$ denotes the empty sequence.
Definition 13 (Extension of the "." operator). Let $S$ be a sequence database and let $\alpha$ be a sequence that may contain the underscore symbol (_). The dot operator is extended as follows.
We have the following lemmas.
Lemma 1 (The support of $z$ in $S_{\alpha,\alpha'}$ is that of its counterpart in $S$). [15] Given a sequence database $S$ and two sequences $\alpha$ and $\alpha'$, for any sequence $y$ prefixed with $\alpha$ and suffixed with $\alpha'$, i.e. $y = \alpha.z.\alpha'$ for some sequence $z$, we have $support(S, y) = support(S_{\alpha,\alpha'}, z)$.
Function $f$ is bijective because it is injective and surjective. Let us consider a sequence $y$ prefixed with $\alpha$ and suffixed with $\alpha'$, i.e. $y = \alpha.z.\alpha'$ for some sequence $z$.
Lemma 2 (What does set $\alpha.patterns(S_{\alpha,\alpha'}).\alpha'$ denote for $patterns(S)$?). The complete set of sequential patterns of $S$ which are prefixed with $\alpha$ and suffixed with $\alpha'$ is equal to $\alpha.patterns(S_{\alpha,\alpha'}).\alpha'$, where function $patterns$ denotes the complete set of sequential patterns of its unique argument.
Proof. A similar proof is provided in [15]. Let $x$ be a sequence. Assume that $x \in \alpha.patterns(S_{\alpha,\alpha'}).\alpha'$. This means that $x = \alpha.z.\alpha'$ for some $z \in patterns(S_{\alpha,\alpha'})$. From lemma 1, we have $support(S_{\alpha,\alpha'}, z) = support(S, \alpha.z.\alpha')$. It follows that $x$ is also a sequential pattern in $S$, as $z$ is a sequential pattern in $S_{\alpha,\alpha'}$. Thus, $\alpha.patterns(S_{\alpha,\alpha'}).\alpha'$ is included in the set of sequential patterns of $S$ which are prefixed with $\alpha$ and suffixed with $\alpha'$. Now, assume that $x$ is a sequential pattern of $S$ which is prefixed with $\alpha$ and suffixed with $\alpha'$. We have $x = \alpha.z.\alpha'$ for some sequence $z$. From lemma 1, we have $support(S_{\alpha,\alpha'}, z) = support(S, \alpha.z.\alpha')$. It follows that $z$ is also a sequential pattern in $S_{\alpha,\alpha'}$, as $x$ is a sequential pattern in $S$. This means that $z \in patterns(S_{\alpha,\alpha'})$. Thus, the complete set of sequential patterns of $S$ which are prefixed with $\alpha$ and suffixed with $\alpha'$ is included in $\alpha.patterns(S_{\alpha,\alpha'}).\alpha'$. Hence the lemma.
Definition 14. A search-space partitioning is said to be static iff it is fully specified before the beginning of the mining process. It is said to be dynamic iff it is constructed during the mining process.
Lemma 4 (Search-space partitioning based on prefix and/or suffix). We have the following.
1. Let $\{x_1, x_2, \ldots, x_n\}$ be the complete set of length-1 sequential patterns in a sequence database $S$. The complete set of sequential patterns in $S$ can be divided into $n$ disjoint subsets in two different ways:
(a) Prefix-item-based search-space partitioning [19]: The $i$-th subset ($1 \le i \le n$) is the set of sequential patterns with prefix $x_i$.
(b) Suffix-item-based search-space partitioning [19]: The $i$-th subset ($1 \le i \le n$) is the set of sequential patterns with suffix $x_i$.
2. Let $\alpha$ be a length-$l$ sequential pattern and $\{\beta_1, \beta_2, \ldots, \beta_p\}$ be the set of all length-$(l+1)$ sequential patterns with prefix $\alpha$. Let $\alpha'$ be a length-$l'$ sequential pattern and $\{\gamma_1, \gamma_2, \ldots, \gamma_q\}$ be the set of all length-$(l'+1)$ sequential patterns with suffix $\alpha'$. We have:
(a) Prefix-based search-space partitioning [19]: The complete set of sequential patterns with prefix $\alpha$, except for $\alpha$ itself, can be divided into $p$ disjoint subsets. The $i$-th subset ($1 \le i \le p$) is the set of sequential patterns prefixed with $\beta_i$.
(b) Suffix-based search-space partitioning [19]: The complete set of sequential patterns with suffix $\alpha'$, except for $\alpha'$ itself, can be divided into $q$ disjoint subsets. The $j$-th subset ($1 \le j \le q$) is the set of sequential patterns suffixed with $\gamma_j$.
(c) Prefix-suffix-based search-space partitioning [15]: The complete set of sequential patterns with prefix $\alpha$ and suffix $\alpha'$, and of length greater than or equal to $l + l' + 1$, can be divided into $p$ or $q$ disjoint subsets. In the first partition, the $i$-th subset ($1 \le i \le p$) is the set of sequential patterns prefixed with $\beta_i$ and suffixed with $\alpha'$. In the second partition, the $j$-th subset ($1 \le j \le q$) is the set of sequential patterns prefixed with $\alpha$ and suffixed with $\gamma_j$.
Proof. Let $\mu$ be a sequential pattern of length greater than or equal to $l + l' + 1$, with prefix $\alpha$ and with suffix $\alpha'$, where $\alpha$ is of length $l$ and $\alpha'$ is of length $l'$. The length-$(l+1)$ prefix of $\mu$ is a sequential pattern according to an Apriori principle which states that a subsequence of a sequential pattern is also a sequential pattern. Furthermore, $\alpha$ is a prefix of the length-$(l+1)$ prefix of $\mu$, according to the definition of prefix. This implies that there exists some $i$ ($1 \le i \le p$) such that $\beta_i$ is the length-$(l+1)$ prefix of $\mu$. Thus $\mu$ is in the $i$-th subset of the first partition. On the other hand, since the length-$k$ prefix of a sequence is unique, the subsets are disjoint and this implies that $\mu$ belongs to only one determined subset. Thus, we have (2.c) for the first partition. The proof of (2.c) for the second partition is similar. Therefore we have the lemma.
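Corollary 2 (Partitioning $S$ with sets $x_i.patterns(S_{x_i,\epsilon})$ and $patterns(S_{\epsilon,x_i}).x_i$). [15] Let $\{x_1, x_2, \ldots, x_n\}$ be the complete set of length-1 sequential patterns in a sequence database $S$. The complete set of sequential patterns in $S$ can be divided into $n$ disjoint subsets in two different ways:
1. Prefix-item-based search-space partitioning: The $i$-th subset ($1 \le i \le n$) is $x_i.patterns(S_{x_i,\epsilon})$.
2. Suffix-item-based search-space partitioning: The $i$-th subset ($1 \le i \le n$) is $patterns(S_{\epsilon,x_i}).x_i$,
where function $patterns$ denotes the set of sequential patterns of its unique argument and $\epsilon$ denotes the empty sequence.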
Proof. According to part 1.(a) of lemma 4, the $i$-th subset is the set of sequential patterns which are prefixed with $x_i$. From lemma 2, this subset is $x_i.patterns(S_{x_i,\epsilon})$. Similarly, according to part 1.(b) of lemma 4, the $i$-th subset is the set of sequential patterns suffixed with $x_i$. From lemma 2, this subset is $patterns(S_{\epsilon,x_i}).x_i$.
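The two partitions of corollary 2 can be checked by brute force on a toy database. The Python sketch below assumes that every element is a single item, so that a sequence is a string and the underscore notation is not needed. Under that assumption the projected database on a prefix item is taken to be the part of each sequence after the first occurrence of the item, and the projected database on a suffix item the part before its last occurrence. All function names are hypothetical, and the pattern $x_i$ itself is added explicitly to each subset.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set


def subsequences(s: str) -> Set[str]:
    """All distinct non-empty subsequences of s (exponential: toy inputs only)."""
    return {"".join(c) for r in range(1, len(s) + 1) for c in combinations(s, r)}


def patterns(db: List[str], min_support: int) -> Set[str]:
    """Complete set of sequential patterns of db, computed by brute force."""
    counts: Counter = Counter()
    for s in db:
        counts.update(subsequences(s))  # a sequence counts each pattern once
    return {p for p, n in counts.items() if n >= min_support}


def project_prefix(db: List[str], x: str) -> List[str]:
    """Simplified projection on prefix x: what follows the first occurrence of x."""
    return [s[s.index(x) + 1:] for s in db if x in s]


def project_suffix(db: List[str], x: str) -> List[str]:
    """Simplified projection on suffix x: what precedes the last occurrence of x."""
    return [s[:s.rindex(x)] for s in db if x in s]


def check_partitions(db: List[str], min_support: int) -> bool:
    """Compare both partitions of corollary 2 with the brute-force pattern set."""
    all_patterns = patterns(db, min_support)
    items = sorted(p for p in all_patterns if len(p) == 1)
    for x in items:
        by_prefix = {x} | {x + z for z in patterns(project_prefix(db, x), min_support)}
        by_suffix = {x} | {z + x for z in patterns(project_suffix(db, x), min_support)}
        if by_prefix != {p for p in all_patterns if p.startswith(x)}:
            return False
        if by_suffix != {p for p in all_patterns if p.endswith(x)}:
            return False
    return True


if __name__ == "__main__":
    print(check_partitions(["abcb", "acb", "bca", "abc"], min_support=2))
```

The subsets are disjoint by construction, since a pattern has a single first item and a single last item; the check above only verifies that each subset is recovered from the corresponding projected database.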

Lemma 5 (A linear ordering induces a recursive pruning and partitioning). [15] A linear ordering induces a recursive pruning and partitioning of the search space. The recursive partitioning is static if the linear ordering is static and dynamic otherwise.
Proof. Let us consider the initial sequence database $S$, two integer numbers $l$ and $l'$, a length-$l$ sequential pattern $\alpha$, a length-$l'$ sequential pattern $\alpha'$, and a linear ordering $L_0 = o_0$-$o_1$-$o_2 \ldots o_{n-1}$-growth. Note that $\epsilon.S_{\epsilon,\epsilon}.\epsilon = S$ is the starting database of the recursive pruning and partitioning of the search space. In the following, we show how $L_0$ induces a recursive pruning and partitioning of $\alpha.S_{\alpha,\alpha'}.\alpha'$.
• Case 1: $o_0 \in \{\text{prefix}\}$. Let $\{\beta_1.\alpha', \beta_2.\alpha', \ldots, \beta_p.\alpha'\}$ be the set of all length-$(l + l' + 1)$ sequential patterns with respect to database $\alpha.S_{\alpha,\alpha'}.\alpha'$, prefixed with $\alpha$ and suffixed with $\alpha'$. From lemma 3, either $\beta_i = \alpha.\prec (x_i) \succ$ or $\beta_i = \alpha.\prec (\_x_i) \succ$, where $x_i$ is an item and $1 \le i \le p$. This implies that $X = \{\prec x_1 \succ, \prec x_2 \succ, \ldots, \prec x_p \succ\}$ is the complete set of length-1 sequential patterns with respect to database $S_{\alpha,\alpha'}$. It follows that any item that does not belong to $X$ is not frequent with respect to $S_{\alpha,\alpha'}$. Thus, any sequence that contains an item that does not belong to $X$ is not frequent with respect to $S_{\alpha,\alpha'}$, according to an Apriori principle which states that any supersequence of an infrequent sequence is also infrequent. Because of this, all the infrequent items with respect to $S_{\alpha,\alpha'}$ are removed from the $z$ part (also called the middle part) of every sequence $\alpha.z.\alpha' \in \alpha.S_{\alpha,\alpha'}.\alpha'$. This pruning step leads to a new version of the sequence database $\alpha.S_{\alpha,\alpha'}.\alpha'$ in which the middle parts of sequences do not contain infrequent items with respect to $S_{\alpha,\alpha'}$. Then, $\alpha.S_{\alpha,\alpha'}.\alpha'$ is partitioned according to part (2.c) of lemma 4. The $i$-th sub-database ($1 \le i \le p$) of $\alpha.S_{\alpha,\alpha'}.\alpha'$, denoted $\alpha.x_i.S_{\alpha.x_i,\alpha'}.\alpha'$, is the set of subsequences of $\alpha.S_{\alpha,\alpha'}.\alpha'$ with prefix $\beta_i = \alpha.x_i$ and with suffix $\alpha'$. Each sub-database is in turn recursively pruned and partitioned according to the rest of the linear ordering, starting from $o_1$.
• Case 2: $o_0 \in \{\text{suffix}\}$. Let $\{\alpha.\gamma_1, \alpha.\gamma_2, \ldots, \alpha.\gamma_p\}$ be the set of all length-$(l + l' + 1)$ sequential patterns with respect to database $\alpha.S_{\alpha,\alpha'}.\alpha'$, prefixed with $\alpha$ and suffixed with $\alpha'$. From lemma 3, either $\gamma_i = \prec (x_i) \succ.\alpha'$ or $\gamma_i = \prec (x_i\_) \succ.\alpha'$ ($1 \le i \le p$). As in case 1, $\alpha.S_{\alpha,\alpha'}.\alpha'$ is pruned and then partitioned according to part (2.c) of lemma 4. The $i$-th sub-database ($1 \le i \le p$) of $\alpha.S_{\alpha,\alpha'}.\alpha'$, denoted $\alpha.S_{\alpha,x_i.\alpha'}.x_i.\alpha'$, is the set of subsequences of $\alpha.S_{\alpha,\alpha'}.\alpha'$ with prefix $\alpha$ and with suffix $\gamma_i = x_i.\alpha'$. As in case 1, each sub-database is in turn recursively pruned and partitioned according to the rest of the linear ordering, starting from $o_1$.
• Case 3: $o_0 \in \{\ast\}$. A pattern-growth direction is determined during the mining process. Then, $\alpha.S_{\alpha,\alpha'}.\alpha'$ is recursively pruned and partitioned as in case 1 if the determined direction is left-to-right and as in case 2 otherwise.
From definitions 6 and 14 it is easy to see that the recursive partitioning is static if the linear ordering is static and dynamic otherwise.
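The pruning step used in the proof above (removing from the middle parts every item that is infrequent in the current projected database) can be sketched in Python. The sketch assumes that every element is a single item, so that a middle part is a plain string; the function names `frequent_items` and `prune_middle_parts` are hypothetical.

```python
from collections import Counter
from typing import List, Set


def frequent_items(middle_parts: List[str], min_support: int) -> Set[str]:
    """Items that occur in at least min_support middle parts."""
    counts: Counter = Counter()
    for z in middle_parts:
        counts.update(set(z))  # an item counts once per sequence
    return {x for x, n in counts.items() if n >= min_support}


def prune_middle_parts(middle_parts: List[str], min_support: int) -> List[str]:
    """Remove every infrequent item from every middle part.

    By the Apriori principle, no sequential pattern can contain
    an infrequent item, so the patterns are left unchanged.
    """
    keep = frequent_items(middle_parts, min_support)
    return ["".join(x for x in z if x in keep) for z in middle_parts]


if __name__ == "__main__":
    middle_parts = ["abzc", "acb", "yba"]
    print(sorted(frequent_items(middle_parts, 2)))
    print(prune_middle_parts(middle_parts, 2))
```

Pruning shortens the sequences that are copied or scanned at the next recursion level, which is where its benefit is expected to come from.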

A Pattern-growth algorithm based on linear orderings
In this section, we translate the study made in sections 3.1 and 3.2 into a function called prefixSuffixSpan.
It is presented in algorithm 1. The initial call of prefixSuffixSpan (1) takes as arguments the initial database $S$, the empty sequence $\epsilon$ as the current prefix and suffix values, a linear ordering $o = o_0$-$o_1$-$o_2 \ldots o_{n-1}$-growth, the index of the pattern-growth direction $o_0$ in $o$, i.e. 0, and the support threshold, (2) searches for the complete list $X = \{x_1, x_2, \ldots, x_p\}$ of all the length-1 sequential patterns of $S$, (4) saves $\alpha.x_i.\alpha'$ as a new sequential pattern for each pattern $x_i$ found, assuming that the current prefix and suffix values are respectively $\alpha$ and $\alpha'$, (5) constructs, following corollary 2, a new database $S_{x_i,\epsilon}$ (resp. $S_{\epsilon,x_i}$) for each length-1 pattern $x_i$ found if $o_0 = \text{prefix}$ (resp. $o_0 = \text{suffix}$), and (6) makes a recursive call per newly constructed database with arguments (6.1) $\alpha.x_i$ as the new current prefix value if $o_0 = \text{prefix}$ and $\alpha$ otherwise, (6.2) $x_i.\alpha'$ as the new current suffix value if $o_0 = \text{suffix}$ and $\alpha'$ otherwise, (6.3) $o$ as the pattern-growth ordering, (6.4) the index of the pattern-growth direction $o_1$ in $o$, i.e. 1, and (6.5) the support threshold.
Function prefixSuffixSpan recursively generates sub-databases from a partition of the current database following corollary 2. We consider that database $S$ is of depth 0. A generated database is of depth $d$ if it has been constructed using $d$ length-1 patterns. Such a database is denoted $S(x_1, x_2, \ldots, x_d)$, where $x_1, x_2, \ldots, x_d$ are the length-1 patterns used to construct that database step by step, in this order. In the behaviour of prefixSuffixSpan, $S(x_1)$ is generated from the initial database $S$, $S(x_1, x_2)$ is generated from $S(x_1)$, and more generally $S(x_1, x_2, \ldots, x_i)$ is generated from $S(x_1, x_2, \ldots, x_{i-1})$ for $i \le d$. Thus $S(x_1, x_2, \ldots, x_d)$ is constructed in $d$ steps, where step 1 corresponds to the construction of $S(x_1)$ from $S$ and step $i$ corresponds to the construction of $S(x_1, x_2, \ldots, x_i)$ from $S(x_1, x_2, \ldots, x_{i-1})$. In terms of prefixSuffixSpan calls, step 1 corresponds to the initial function call prefixSuffixSpan($S$, $\epsilon$, $\epsilon$, $o$, 0, min_support) and step $i$ corresponds to the function call prefixSuffixSpan($S(x_1, x_2, \ldots, x_{i-1})$, $\alpha$, $\alpha'$, $o$, $i-1$, min_support). We consider that this last function call is of depth $i-1$. Similarly, we consider that the initial call is of depth 0. For the sake of simplicity, we assume that if $d = 0$, $S(x_1, x_2, \ldots, x_d)$ denotes the initial sequence database $S$, i.e. $S(x_1, x_2, \ldots, x_d) = S$. We have the following lemmas and corollaries.

Algorithm 1 (fragment).
direction ← getTheGrowthDirection()
Comment: the following loop appends successively $x_i$ and $\alpha'$ to $\alpha$ to form a sequential pattern.
for all $x_i \in X$ do

[...] the right-to-left orderings. [...] on each data set while decreasing the support threshold until the algorithms took too long to execute or ran out of memory. The performances are presented in figures 1, 2, 3 and 4. These figures show that the order in which patterns grow has a significant influence on the performances.
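As a complement to the description of prefixSuffixSpan given in this section, the following Python sketch mimics its recursive behaviour. It is not the authors' algorithm 1: it assumes that every element is a single item (a sequence is a string), uses simplified physical projections, reuses the ordering cyclically, writes the dynamic entry as the placeholder `"*"`, and relies on a hypothetical `choose` callback to pick the direction at dynamic steps. The function and parameter names are illustrative.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Sequence

Chooser = Callable[[List[str], str, str], str]


def prefix_suffix_span(
    db: List[str],
    min_support: int,
    ordering: Sequence[str],
    choose: Optional[Chooser] = None,
) -> Dict[str, int]:
    """Return every sequential pattern of db with its support.

    `ordering` is a list of "prefix", "suffix" or "*" entries.
    """
    found: Dict[str, int] = {}

    def mine(middle_parts: List[str], alpha: str, alpha2: str, step: int) -> None:
        # Length-1 sequential patterns of the current projected database.
        counts: Counter = Counter()
        for z in middle_parts:
            counts.update(set(z))
        frequent = sorted(x for x, n in counts.items() if n >= min_support)
        if not frequent:
            return

        direction = ordering[step % len(ordering)]
        if direction == "*":  # dynamic step: decide now (default: grow the prefix)
            direction = choose(middle_parts, alpha, alpha2) if choose else "prefix"

        for x in frequent:
            # alpha.x.alpha2 is a pattern whose support is that of x here (lemma 1).
            found[alpha + x + alpha2] = counts[x]
            if direction == "prefix":
                projected = [z[z.index(x) + 1:] for z in middle_parts if x in z]
                mine(projected, alpha + x, alpha2, step + 1)
            else:
                projected = [z[:z.rindex(x)] for z in middle_parts if x in z]
                mine(projected, alpha, x + alpha2, step + 1)

    mine(list(db), "", "", 0)
    return found


if __name__ == "__main__":
    db = ["abcb", "acb", "bca", "abc"]
    left_to_right = prefix_suffix_span(db, 2, ["prefix"])
    mixed = prefix_suffix_span(db, 2, ["prefix", "suffix", "suffix", "prefix"])
    print(sorted(left_to_right.items()))
    print(left_to_right == mixed)
```

With the single-entry ordering `["prefix"]` the sketch reduces to a PrefixSpan-like prefix growth, and with `["suffix"]` to the symmetric right-to-left growth; different orderings visit the same patterns through different projected databases, which is the effect the experiments measure.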

Conclusion
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a challenging problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. It has been a focused theme in data mining research for over a decade. Abundant literature has been dedicated to this research and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent itemset mining to numerous research frontiers, such as sequential pattern mining, structured pattern mining, correlation mining, associative classification, and frequent pattern-based clustering, as well as their broad applications. In this article, an overview is provided on the current status of pattern growth-based sequential pattern mining algorithms. The key concepts of the pattern-growth approach are revisited, formally defined and extended. A new class of pattern-growth algorithms inspired by a new class of pattern-growth orderings, called linear orderings, is introduced. Issues of this new class of pattern-growth algorithms related to search space pruning and partitioning are investigated. Stemming from this theoretical study, a new algorithm called prefixSuffixSpan is designed. Its correctness is proven and related experimental results are presented.

Definition 10 (Prefix of a sequence). [19] Suppose all the items within an element are listed alphabetically. Given a sequence $\alpha = \prec e_1 e_2 \ldots e_n \succ$, a sequence $\beta = \prec e'_1 e'_2 \ldots e'_m \succ$ ($m \le n$) is called a prefix of $\alpha$ if and only if 1) $e'_i = e_i$ for all $i \le m-1$; 2) $e'_m \subseteq e_m$; and 3) all the frequent items in $e_m - e'_m$ are alphabetically sorted after those in $e'_m$. If $e'_m \ne \emptyset$ and $e'_m \subset e_m$, the prefix is also denoted as $\prec e_1 e_2 \ldots e_{m-1} (\text{items in } e'_m\_) \succ$.
Definition 11 (The corresponding suffix of a prefix of a sequence). [19] Given a sequence $\alpha = \prec e_1 e_2 \ldots e_n \succ$, let $\beta = \prec e_1 e_2 \ldots e_{m-1} e'_m \succ$ ($m \le n$) be a prefix of $\alpha$. Sequence $\gamma = \prec e''_m e_{m+1} \ldots e_n \succ$ is called the suffix of $\alpha$ with regards to prefix $\beta$, denoted as $\gamma = \alpha/\beta$, where $e''_m = e_m - e'_m$. We also denote $\alpha = \beta.\gamma$. Note that, if $\beta = \alpha$, the suffix of $\alpha$ with regards to $\beta$ is empty. If $e''_m$ is not empty, the suffix is also denoted as $\prec (\_\text{items in } e''_m) e_{m+1} \ldots e_n \succ$. For example, for the sequence $s = \prec a(abc)(ac)(efgh) \succ$, $\prec (ac)(efgh) \succ$

Lemma 8 (Computation of the current prefix and suffix values of a prefixSuffixSpan call of depth $d$ with $S(x_1, x_2, \ldots, x_d)$ as the database (of depth $d$)).

Figure 1. Performances of left-to-right and right-to-left pattern-growth orderings on the real-life data set LEVIATHAN. The left-to-right pattern-growth ordering is 1.27 to 1.4 times faster, and requires less memory if the support threshold is less than 0.05 and a little more memory otherwise.

Figure 2. Performances of left-to-right and right-to-left pattern-growth orderings on the real-life data set kosarak_converted. The right-to-left pattern-growth ordering is 2.6 to 5.6 times faster and requires almost 1.2 times less memory than the other ordering.

Figure 3. Performances of left-to-right and right-to-left pattern-growth orderings on the real-life data set BIBLE. The right-to-left pattern-growth ordering is 1.21 to 1.25 times faster and requires almost 1.04 to 1.10 times less memory than the other ordering.