Hacking Articles: GSM HACKING 2

                              [Extracts From]

                      TECHNICAL INFORMATION

              GSM System Security Study


                      10-1617-01        10^th June 1988


                          RACAL RESEARCH LTD.

               WORTON DRIVE, WORTON GRANGE INDUSTRIAL ESTATE,
                    READING, BERKS. RG2 0SB, ENGLAND.

            Telephone:  Reading  (0734)  868601     Telex:  847152



3  COMMERCIAL IN CONFIDENCE [Top of each page; omitted hereafter]


Part I


Overview of GSM Security Features

The object of this part of the report is to provide an overview of the security
features in the GSM system. The description is brief, and focuses on the
algorithms which are needed and how they are to be used: for a more detailed
description the reader is referred to the GSM recommendations GSM 02.09 [2]
and GSM 03.20 [3].

   Three distinct security services are provided. These are subscriber identity
authentication, user and signalling data confidentiality, and subscriber identity
confidentiality. Each of these is considered in turn, and the mechanisms used to
provide them outlined. Actually the second of the services is a grouping of three
GSM features: user data confidentiality on physical connections, connectionless
user data confidentiality and signalling information element confidentiality. [3].
The reason for combining them into one service is that they are all provided
by one and the same mechanism.


1 Subscriber Identity Authentication


This subscriber identity authentication service is the heart of the GSM security
system. It is used to enable the fixed network to authenticate the identity of
mobile subscribers, and to establish and manage the encryption keys needed
to provide the confidentiality services. The service must be supported by all
networks and mobiles, although the frequency of application is at the discretion
of the network.

   Authentication is initiated by the fixed network, and is based upon a simple
challenge-response protocol. When a mobile subscriber ( MS ) attempts to
access the system, the network issues it a random challenge RAND. The MS
computes a response SRES to RAND using a one-way function A3 under control
of a subscriber authentication key K_i. The key K_i is unique to the subscriber,
and is shared only by the subscriber and an authentication centre which serves
the subscriber's home network. The value SRES computed by the MS is sig-


4


nalled to the network, where it is compared with a pre-computed value. If the
two values of SRES agree, the mobile subscriber has been authenticated, and
the call is allowed to proceed. If the values are different, then access is denied.

   The same mechanism is also used to establish a cipher key K_c for encrypt-
ing user and signalling data on the radio path. This procedure is called cipher
key setting in [3]. The key is computed by the MS using a one-way func-
tion A8, again under control of the subscriber authentication key K_i, and is
pre-computed for the network by the authentication centre which serves the
subscriber's home network. Thus at the end of a successful authentication
exchange, both parties possess a fresh cipher key K_c.

   The pre-computed triples ( RAND, SRES, K_c ), held by the fixed networks
for a particular subscriber, are passed from the home network's authentication
centre to visited networks upon demand. The challenges are used just once.
Thus the authentication centre never sends the same triple to two distinct
networks, and a network never re-uses a challenge.

   In practice the two functions A3 and A8 are combined into a single algo-
rithm, called A38 in [2], which is used to simultaneously compute SRES and
K_c from RAND and K_i. In this report this combined algorithm is referred to
as the authentication algorithm. The protocol described above makes it quite
clear that this algorithm need only be available to an authentication centre and
the mobile subscribers which that authentication centre serves. In particular,
there is no need for a common GSM authentication algorithm. and different
networks may use different algorithms. ( The algorithms do, however, need
to have the same input and output parameters; in particular, the length of K_c
is determined by the GSM cipher algorithm ). Never-the-less it is desirable
that there is a GSM standard authentication algorithm which may be used
by all networks which do not wish to develop a proprietary algorithm. There
is just one candidate for such an algorithm; it was proposed by the German
administration, and is analysed in Part VI of this report.


2 User and Signalling Data Confidentiality


As mentioned earlier, this service consists of three elements; user data confiden-
tiality and signalling information on physical connections, connectionless user
data confidentiality and signalling information element confidentiality. The
first element provides for privacy of all user generated data, both voice and


5


non-voice, transferred over the radio path on traffic channels [2, Section 3.3].
The second element provides for privacy of user data transferred in packet
mode over the radio path on a dedicated signalling channel [2. Section 3.4],
whilst the third element provides for privacy of certain user related signalling
elements transferred over the radio path on dedicated signalling channels [2,
Section 3.5].

   All of these elements of service are provided using the same layer 1 encryp-
tion mechanism, and must be supported and used by all networks and mobiles.
The mechanism is now briefly described. For a more comprehensive description
the reader is referred to [3. Section 4]. Encryption is achieved by means of a
ciphering algorithm A5 which produces a key stream under control of a cipher
key K_c. This key stream is then bit-for-bit exclusive-or'd with the data trans-
ferred over the radio path between the MS and the base station ( BS ). The
cipher key is established at the MS as part of the authentication procedure, as
described in the last section, and is transferred through the fixed network to
the BS after the MS has been identified.

   It is essential that the MS and BS synchronise the starting of their cipher
algorithms. A technique for achieving this is described in [4, Section 4], but
this only directly addresses the situation when the network initiates an authen-
tication check. The procedures still need to be specified in detail to cover the
situation when the network does not authenticate the MS. When the network
intends to issue an authentication challenge, the BS starts deciphering all data
immediately after the MS has been identified using the cipher key K_c which the
MS will derive upon receipt of the challenge RAND. The MS starts ciphering
and deciphering the moment it has computed K_c ( and SRES ) from RAND, as
described in the last section, and before SRES is transmitted. On the BS side,
enciphering starts as soon as SRES has been received, deciphered and found to
be correct. To cope with possible transmission loss or errors, the authentication
request and response message are repeated under the control of timers.

   Synchronisation of the ciphering key stream is maintained by using the
TDMA frame structure of the radio sub-system. The TDMA frame number is
used as a message key for the cipher algorithm AS, and the algorithm produces
a synchronised key stream for enciphering and deciphering the data bits in the
frame. For each frame, a total of 114 bits are produced for enciphering / de-
ciphering data transferred from the MS to the BS, and an additional 114 bits
are produced for deciphering / enciphering data received at the MS from the
BS. A frame lasts for 4.6 ms, so that the cipher has to produce the 228 bits in
this time.


6


   The cipher algorithm A5 must be common to all GSM networks, and three
algorithms have been proposed as candidates for the GSM standard: a French
algorithm, a Swedish algorithm and a UK algorithm. These algorithms are
discussed in detail in subsequent parts of this report.


3 Subscriber Identity Confidentiality


This service allows mobile subscribers to originate calls, update their location,
etc, without revealing their International Mobile Subscriber Identity ( IMSI )
to an eavesdropper on the radio path. It thus prevents location tracing of
individual mobile subscribers by listening to the signalling exchanges on the
radio path. All mobiles and networks must be capable of supporting the service,
but its use is not mandatory.

   In order to provide the subscriber identity confidentiality service it is neces-
sary to ensure that the IMSI, or any information which allows an eavesdropper
to derive the IMSI, it not ( normally ) transmitted in clear in any signalling
message on the radio path. The mechanism used to provide this service is
based on the use of a temporary mobile subscriber identity ( TMSI ), which is
securely updated after each successful access to the system. Thus, in principle,
the IMSI need only be transmitted in clear over the radio path at registration.
In addition, the signalling elements which convey information about the IMSI
are enciphered as described in the last section.

   The TMSI updating mechanism functions in the following manner. For
simplicity, assume the MS has been allocated a TMSI, denoted by TMSI_o,
and the network knows the association between TMSI_o and the subscriber's
IMSI. The MS identifies itself to the network by sending TMSI_o. Immediately
after authentication ( if this takes place ), the network generates a new TMSI,
denoted TMSI_n, and sends this to the MS encrypted under the cipher key K_c
as described in the last section. Upon receipt of the message, the MS deciphers
and replaces TMSI_o by TMSI_n.

[End first extract]


10


Part III


The French Proposal for the Cipher



The cipher proposed by France has always been considered as a hardware rather
than a software algorithm. The study of this cipher is based on the descrip-
tion reproduced in Appendix A and described in PDL ( program definition
language ) in Section 4. Software and hardware implementations of the cipher
are considered in Sections 5 and 6. The statistical tests applied are discussed
in Section 7¹ and there is a mathematical analysis in Section 8.


4 PDL Description of the Cipher


4.1 Main Algorithm


( Load Base Key )
FOR each base key bit from 1 to 64
   Load bit into corresponding LFSR cell
END FOR


( Load Message Key )
FOR each message key bit from 1 to 22
   shift_bits = f()         ( Call to shift function f )
   FOR each register i, from 1 to 3
       Exclusive-or message key bit into lsb
       IF bit i of shift_bits is set
         THEN Shift
       END IF
   END FOR
END FOR

( Produce both enciphering and deciphering streams )
FOR i from 1 to 2

_________________________

  ¹ The actual results from the statistical tests are in [1].


11


   ( Perform additional shifts )
   FOR j from 1 to 100
       shift_bits = f()
       FOR each register k from 1 to 3
           IF bit k of shift_bits is set
             THEN Shift
           END IF
       END FOR
   END FOR

   ( Output stream of 114 bits )
   FOR J from 1 to 114
       shift_bits = f()
       FOR each register k from 1 to 3
           IF bit k of shift_bits is set
             THEN Shift
           END IF
       END FOR
       Output = Exclusive-or msb of all three registers
   END FOR
END FOR


4.2 The Shift Function f


BEGIN FUNCTION f

   FOR each register i from 1 to 3
       Let middle[i] = the 'middle' bit of register i
   END FOR

   IF less than two of the 'middle bits' are '1'
     THEN bit-complement code
   END IF

   RETURN code

END FUNCTION


12


5 Software Estimates


In this section the cipher is described in a readable form similar to micropro-
cessor instruction code and an estimate of the speed is made from this. It was
suspected that it would not be practical to implement this code in software, so
the code was based on a very specialized microprocessor, which may not even
exist. If the cipher can not meet the time requirement of 4.6ms on the special-
ized microprocessor then it will not be able to meet it on a more general one.
This special microprocessor has a word size⁵ at least as long as the longest reg-
ister, i.e., 23 bits, it also has the function PARITY⁶ which exclusive-ors all the
bits in the accumulator and places the result in the least significant bit while
setting all the other bits to zero. Additionally it is assumed that the CARRY can
be loaded from the least significant bit of the accumulator. The problem of
directing the feedback bit to the appropriate part of the accumulator is ignored.

   The external memory, considered to be arranged so that REG contains the
registers and MASK contains bit masks for both the extraction of the central
bits of each register and for the calculation of the feedback values, with the
feedback masks last and in reverse order.

   The symbol & is used to mean 'address of'


5.1 Evaluating the Shift Function f


The following code extracts the central bits of each register and calculates the
corresponding output of the shift function f.


Load registers and masks

       LOAD    index register 1 with &REG
       LOAD    index register 2 with &MASK




__________________________

    ⁵Word size is considered here to be the natural size or accumulator and memory 'portions'.
    ⁶ Which will be considered as an ALU operation



13


Extract middle bits of each register

       LOAD    acc with MEM(index 1), POST INC index reg 1
       AND     acc with MEM(index 2), POST INC index reg 2
       PARITY  acc
       STORE   acc in M1
       LOAD    acc with MEM(index 1), POST INC index reg 1
       AND     acc with MEM(index 2), POST INC index reg 2
       PARITY  acc
       STORE   acc in M2
       LOAD    acc with MEM(index 1), POST INC index reg 1
       AND     acc with MEM(index 2)
       PARITY  acc
       STORE   acc in M3

Calculate shift function f

       XOR     acc with M2
       AND     acc with M1
       STORE   acc in I

       LOAD    acc with M2
       AND     acc with M3
       OR      acc with I

       XOR     acc with M1
       STORE   acc in M1
       LOAD    acc with I
       XOR     acc with M2
       STORE   acc in M2
       LOAD    acc with I
       XOR     acc with M3
       STORE   acc in M3




14


   Loading the data itself requires:

    2 index register operations  k cycles each

   Extracting the middle bits of each register requires:

    6 index register operations  @ k cycles each
    3 ALU operations             @ m cycles each
    3 load / store operations    @ n cycles each

   The function f requires:

    7 ALU operations             @ m cycles each
    8 load / store operations    @ n cycles each

   This part of the code requires 8k + 10m + 11n cycles per iteration



5.2 Performing the Shifts


The following code clocks the appropriate registers using the results of the
shift function. Note that the values of M1, M2 and M3 determine whether or
not each register is clocked. Note also that the registers are treated in reverse
order since index register 1 'points' to the contents of shift register 3 at this
stage.



       LOAD    acc with M3
       BRANCH  if zero to A
       LOAD    acc with MEM(index 1)
       AND     MEM(index 2) to acc
       PARITY  acc
       STORE   acc in CARRY
       ROTATE  acc right through CARRY
       STORE   CARRY in 03
       STORE   acc in MEM(index 1)




15


A:      DEC     index register 1
       DEC     index register 2

       LOAD    acc with M2
       BRANCH  if zero to B
       LOAD    acc with MEM(index 1)
       AND     MEM(index 2) to acc
       PARITY  acc
       STORE   acc in CARRY
       ROTATE  acc right through CARRY
       STORE   CARRY in 02
       STORE   acc in MEM(index 1)

B:      DEC     index register 1
       DEC     index register 2

       LOAD    acc with M1
       BRANCH  if zero to C
       LOAD    acc with MEM(index 1)
       AND     MEM(index 2) to acc
       PARITY  acc
       STORE   acc in CARRY
       ROTATE  acc right through CARRY
       STORE   acc in MEM(index 1)

C       LOAD    acc with CARRY
       XOR     acc with 03
       XOR     acc with 02
       OUTPUT  acc



   To estimate the speed it is assumed that 9/4 of the registers are clocked on
each iteration, i e., that 3/4 of the operations to shift the registers are performed
for each iteration.

   Shifting the registers requires:


    3 branch operations           @ j cycles each
   13 index register operations   @ k cycles each
    6 ALU operations              @ m cycles each
    8 load / store operations     @ n cycles each



16


   The calculation of the output bit requires a further:

    2 ALU operations              @ m cycles each
    1 load / store operation      @ n cycles

   Therefore, the clocking requires a total of.
   
    3/4 x (3j + 13k + 6m + 8n) + 2m + n = 9/4j 39/4k + 13/2m + 7n

cycles per iteration.

   The data loading and shift function calculation requires a further 8k +
10m + 11n cycles per iteration.

   Therefore the total number of cycles required is given by:

                        9/4j + 71/4k + 33/2m + 18n

   For typical values of j = k = 5 and m = n = 4 this gives 11.25 + 88.75 +
66 + 72 = 238 cycles per iteration.

   The algorithm must be iterated 450 times to produce 228 bits of output.
This corresponds to ~~ 450 x 238 = 107100 clock cycles to produce 228 bits.
On a 1ms microprocessor this would take approximately 107 ms.


5.3 Summary


These estimates show that even on a specialized microprocessor, and ignoring
some of the detail, this cipher can not operate at the required speed. It is
therefore reasonable to assume that it would not be viable to implement this
cipher in software on a more general microprocessor.

   In light of the unsuitability of this algorithm the memory requirement was
not estimated.



17


6  Hardware Estimates


The following estimates are based on the two Figures 2. and 3. Only the
hardware necessary for the shift registers and the shift function f is considered,
i.e., none of the control, interfacing or test circuitry is studied here. The overall
architecture is shown in Figure 1.

   The transistor counts for various components are based on the Racal Re-
search Ltd., 2.5 mm CMOS microcell library.




    Figure 1: Overall Architecture



6.1 Notation


The elements of the circuits in Figures 2 and 3 are shown as boxes marked
with various symbols as described in the following table.

 ______________________________________
 Symbol    Function
 ______________________________________

 MUX       Multiplexor
 .
 _         Exclusive-or gate
 .

 -         Exclusive-nor gate
 .

 .         And gate

 +         Or gate

 Unmarked  D-type, i e., register stage
 ______________________________________
 [Symbols partly illegible]


18


  All the signals shown are single bits. However. the various "Load Control"
signals in Figure 2 are different signals which control different parts of the
loading mechanism.



6.2 Shift Registers


Figure 2 shows the register R1. The number of exclusive-or gates necessary for
each register depends upon the feedback function for that particular register;
in total seven such gates are needed for the three registers.



  

     Figure 2: Shift Register Architecture for R1
     [Poor original]


   To load the base key, the registers are concatenated together and the key
is shifted through, suppressing the output so that the key does not reappear
again. To load the message key the key bits are exclusive-or'd into the feedback
path. In ordinary operation the feedback path is fed back to the left hand cell
without obstruction. In order to implement this a multiplexor is used to chose
between the feedback and input, while an and gate is used to suppress the
external input to the exclusive-or on the feedback path when it is not required.
The overall components together with their respective transistor counts are:

   64  D-types            @ 22 transistors each  =  1408 transistors
   6   2-input AND gates  @ 6 transistors each   =    36 transistors
   10  XOR gates          @ 10 transistors each  =   100 transistors
   3   2-input MUXs       @ 12 transistors each  =    36 transistors
                                                    1580 transistors

   An additional two exclusive-or gates are required to combine the output of
the three registers, each requiring 10 transistors. This gives a total of 1600
transistors to implement the shift register.



19


^{6.3 Shift function f}


The shift function is implemented by producing a signal comp which is true if
the three bits M1, H2 and M3 need to be inverted. Th;s signal is the exclusive-
ored with each of the three original bits to effect the inversion. If the three bits
are regarded as numbers, then comp is true if and only if their sum is greater
than or equal to 2. Thus, if the three bits are fed into a full adder, then comp
is the negation of the carry out signal. The equation for this carry out signal
is
                                       .
                               M1 .(M2 - M3) + (M2 . M3)
                                       .

which was also employed in Section 5.1. This is shown in Figure 3. Rather than
inverting this value, and then using exclusive-or gates, exclusive-nor ( XNR )
gates are used.


    

 Figure 3: Shift Function f Architecture
  [Poor original]



20


   This requires:

    2  AND gates  @ 6 transistors each   =  12 transistors
    1  OR gate    @ 6 transistors each   =   6 transistors
    1  XOR gate   @ 10 transistors each  =  10 transistors
    3  XNR gates  @ 10 transistors each  =  30 transistors
                                            48

   In addition three further And gates are required to combine the output of
the shift function with the clock signal, see Figure 1.

   The total number of transistors required is 1600 + 48 + 18 = 1666.



6.4 Speed Estimates


In order to produce the two 114-bit key streams the shift registers have to be
shifted the following number of times:

    64   : to load the base key
    22   : to load the message key
    100  : intermediate shifts
    114  : to produce the encrypt stream
    100  : intermediate shifts
    114  : to produce decrypt stream
    514

   If these shifts take two clock cycles each then 1028 clock cycles would be
required. At a clock speed of 50ns per cycle then it would take 51.4ms to
produce the key streams from the keys, which is well within the requirement.



6.5 Summary


A hardware implementation of this cipher requires a relatively small number of
transistors. approximately 1666. If the base key was loaded in parallel then
the circuitry would be more complex, however, given the speed estimate above
it is unlikely that this would be necessary.



21


   These estimates suggest that it should be possible to produce a hardware
implementation of this cipher which meets the speed requirement using a
relatively small number of transistors.


[End second extract]


65


Part I


The German Proposal for the Authentication Algorithm


The authentication algorithm need not be universal and different networks are
free to use algorithms of their own choice ( provided that the parameters are
of the correct length ). However, there will be a GSM standard which can be
used by any administrations who do not wish to develop their own proprietary
algorithms. The AEG have already decided to recommend the German pro-
posal also referred to as COMP128 in some literature for this purpose. This
algorithm was included in the study in order to assess its suitability for the
task. The algorithm is specified in [8]. which is reproduced in Appendix D
( with the exception of the details of the tables ).

   The functionality of the algorithm is described in PDL ( program definition
language ) in Section 20. Using this description an estimate for the complexity
of a software implementation is made in Section 21. This algorithm will even-
tually be implemented in the Subscriber Interface Module ( SIM )¹ which will
be either a smart card or plug-in module. Both options for the SIM contain
a microprocessor thus the authentication algorithm will be implemented in
software rather than hardware. Therefore no hardware estimates were made
for this cipher. A brief description of the statistical analysis applied to the
algorithm is given in Section 22 and a summary of the mathematical analysis
performed is given in Section 23.

________________________________

   ¹ See Part VII



66


20 PDL Description of the Algorithm


( Load RAND into last 16 bytes of input )
FOR i from 16 to 31
   x[i] = rand[i]
END FOR

( Loop eight times )
FOR i from 1 to 8

   ( Load key into first 16 bytes of input )
   FOR j from 0 to 15
       x[j] = key[j]
   END FOR

   ( Perform substitutions )
   FOR j from 0 to 4
       FOR k from 0 to 2^j - 1
           FOR l from 0 to 2^4-j - 1

               m = 1 + k x 2^5-j
               n = m +2^4-j

               y = (x[m] + 2 x x[n]) mod 2^9-j
               z = (2 x x[n] + x[n]) mod 2^9-j

               x[m] = table [j,y]
               x[n] = table [j,z]

           END FOR
       END FOR
   END FOR

   ( Form bits from bytes )
   FOR j from 0 to 31
       FOR k from 0 to 7
           bit [4*j+k] = the (8-k)th bit of byte j
       END FOR
   END FOR



67


   ( Permutation but not on the last loop )
   IF (i < 8) THEN
       FOR j from 0 to 15
           FOR k from 0 to 7
               next bit = (8 x j + k) x 17 mod 128
               Bit k of x[j + 16] = bit[next_bit]
           END FOR
       END FOR
   END IF

END FOR



   At this stage the vector x[ ] consists of 32 nibbles. The last 8 of these
are taken as the output SRES.



21 Software Estimates


In order to estimate the complexity of a software implementation of the German
authentication algorithm, it has been described in a readable form similar to
microprocessor instruction code. This code is then used as a basis for the
estimates.

   Assume that the external memory is arranged as follows:

    TAB   contains the compression tables.
    KEY   contains the 128-bit key.
    SRES  is 256 bits of external memory used to store the
          intermediate and final values - assuming that the
          last 16 bytes have been initialised with RAND.
    TEMP  is 16 bytes of external memory available as working space.

   The symbol & is used to mean 'address of'.



68


21.1 Substitutions


The following code performs the substitutions using the tables. Note that the
indices of the j, k and l loops run 'downwards' for reasons of simplicity of
coding. The corresponding PDL segment is



  FOR j from 4 to 0 step -1
      FOR k from 2^4-j - 1 to 0 step -1
          FOR l from 2^j-1 to 0 step -1
              m = 2^j -1 - 1 + (2^4-j - k - 1)x (2^j+l)


Code to Perform the Substitutions


      LOAD   index register 2 with &SRES
      LOAD   acc with 8
      STORE  acc with I

( Top of I loop )
I:     LOAD   acc with 16
      STORE  acc in J

( Load key into first 16 bytes )
A:     LOAD   index register 1 with &KEY
      LOAD   acc with MEM(index 1), POST INC index reg 1
      STORE  acc in MEM(index 2), POST INC index reg 2
      LOAD   acc with J
      DEC    acc
      STORE  acc in J
      BRANCH if not zero to A

( Perform substitutions )
      LOAD   acc with 5
      STORE  acc in J

( Top of J loop )
J:     DEC    acc
      STORE  acc in J



69


      STORE  acc in X
      LOAD   acc with 1
      STORE  acc in T1
      LOAD   acc with 16
      STORE  acc in T2
      LOAD   acc with X
      BRANCH if zero to C

B:     LOAD   acc with T1
      SHIFT  acc LEFT
      STORE  acc in T1
      LOAD   acc with T2
      SHIFT  acc RIGHT
      STORE  acc in T2
      LOAD   acc with X
      DEC    acc
      STORE  acc in X                 ( T1 = 2**j )
      BRANCH if not zero to B         ( T2 = 2**(4-j) )

C:     LOAD   acc with T2
      STORE  acc in K

( Top of K loop )
K:     DEC    acc
      STORE  acc in K
      LOAD   acc with T1
      STORE  acc in L                 ( L = 2**j )

( Top of L loop )
L:     DEC    acc
      STORE  acc in L

      LOAD   acc with T2
      SUB    K from acc
      DEC    acc
      STORE  acc in M
      LOAD   acc With J
      INC    acc
      STORE  acc in X



70


D:     LOAD   acc with M
      SHIFT  acc LEFT
      STORE  acc in M
      LOAD   acc with X
      DEC    acc
      STORE  acc in X                 ( M = (2**(4-j)-K-1)
      BRANCH if not zero to D              * 2**(J+1) )

      LOAD   acc with M
      ADD    T1 to acc
      SUB    L from acc
      DEC    acc
      STORE  acc in M                 ( M = M+(T1-L-1) )

      ADD    T1 to acc
      STORE  acc in N                 ( N = M + T1 )

      LOAD   index register 1 with &SRES
      LOAD   acc with M
      ADD    acc to index register 1
      LOAD   acc with MEM(index 1)
      STORE  acc in XM                ( XM = X[M] )

      LOAD   index register 1 with &SRES
      LOAD   acc with N
      ADD    acc to index register 1
      LOAD   acc with MEM(index 1)
      STORE  acc in XN                ( XN = X[N] )

      SHIFT  acc LEFT
      ADD    XM to acc
      SHIFT  acc LEFT                 ( Lose last bit of Y )
      SHIFT  acc RIGHT
      STORE  acc in Y                 ( Y = XM + 2*XW )
      LOAD   acc with XM
      SHIFT  acc LEFT
      ADD    XN to acc
      SHIFT  acc LEFT
      SHIFT  acc RIGHT
      STORE  acc in Z                 ( Z = 2*XM + XN )



71


      LOAD   index register 2 with &INDX
      LOAD   acc with I
      ADD    acc to index register 2
      LOAD   acc with MEM(index 2)
      STORE  acc in P
      LOAD   index register 2 with &TAB
      ADD    acc to index register 2  ( Point to correct
      LOAD   acc with Z                 table using INDX )
      ADD    acc to index register 2
      LOAD   acc with MEM(index 2)
      STORE  acc in MEM(index 1)      ( table[Z] -> X[N] )

      LOAD   index register 1 with &SRES
      LOAD   acc with M
      ADD    acc to index register 1
      LOAD   index register 2 with &TAB
      LOAD   acc with Y
      ADD    acc to index register 2
      LOAD   acc with P               ( Point to correct
      ADD    acc to index register 2    table using INDX )
      LOAD   acc with MEM(index 2)
      STORE  acc in MEM(index 1)      ( table[Y] -> X[M] )

( Bottom of L )
      LOAD   acc with L
      BRANCH if not zero to L

( Bottom of K )
      LOAD   acc with K
      BRANCH if not zero to K

( Bottom of J )
      LOAD   acc with J
      BRANCH if not zero to J



72


      LOAD   acc with I
      DEC    acc
      STORE  acc in I
      DEC    acc                       ( No permutation
      BRANCH if zero to Z                if I = 1 )

   Counting the number of operations in each segment of the code gives:

Loop D 1  branch operation           @ j cycles each
      2  ALU operations             @ m cycles each
      4  load / store operations    @ n cycles each

Loop L 1  branch operation           @ j cycles each
      21 index register operations  @ k cycles each
      16 ALU operations             @ m cycles each
      22 load / store operations    @ n cycles each

Loop K 1  branch operation           @ j cycles each
      1  ALU operation              @ m cycles each
      4  load / store operations    @ n cycles each

Seg C  2  load / store operations    @ n cycles each

Loop B 1  branch operation           @ j cycles each
      3  ALU operations             @ m cycles each
      6  load / store operations    @ n cycles each

Loop J 2  branch operation           @ j cycles each
      1  ALU operations             @ m cycles each
      8  load / store operations    @ n cycles each

Loop A 1  branch operation           @ j cycles each
      3  index register operations  @ k cycles each
      1  ALU operations             @ m cycles each
      2  load / store operations    @ n cycles each



73


   In order to assess the speed of the algorithm the number of operations in
each segment of the code need to be multiplied by the number of times that
segment is performed. The code to perform the substitution contains three
nested loops J, K and L, the number of iterations of the K and L loops being
dependent on the value of the J counter, a. There are 2^4-a iterations of K and
2^a iterations of L. Thus there are always 2^4-a x 2^a = 2⁴ = 16 iterations of L.

   The number of iterations of each individual segment is shown in the fol-
lowing table.

   __________________________________________
   Nesting of Loops  N^o of Loops        Total
   __________________________________________
   I A               8 x 16               128
   I J               8 x 5                 40
   I J B             8 x 5 x a            400
   I J C             8 x 5                 40
   I J K             8 x 5 x 2^4-a        1240
   I J K L           8 x 5 x 16           640
   I J K L D         8 x 5 x 16 x (a +1) 9600
   __________________________________________

   This giVes the total number of cycles required to be


   128(j+3k+m+2n) + 40(2j+m+8n) + 400(j+3m+6n) + 40 x 2n
   + 1240(j+m+4n) + 640(j+21k+16m+22n) + 9600(j+2m+4n)


   i.e., 12080j + 13824k + 32048m + 60498n instruction cycles to
perform the substitutions.



21.2 Permutations


The following code performs the permutation. Note that although the PDL
description contains code for extracting and storing individual bits of the bytes
before permutation the low-level code here does not. What is done here is that
for each bit its corresponding byte number and bit-position within that byte
are calculated, the identified bit is masked off and the result is accumulated in
the appropriate byte.



74


_{Code to Perform the Permutations}


      LOAD   index register 2 with &TEMP
      LOAD   acc with 16
      STORE  acc in J

( Top of J loop )
J1     DEC    acc
      STORE  acc in J
      LOAD   acc with 0
      STORE  acc in M
      LOAD   acc with 7
      STORE  acc in K

( Top of K loop )
K1     DEC    acc
      STORE  acc in K

      LOAD   acc with 1
      STORE  acc in R                ( Initialize mask )

      LOAD   acc with J
      SHIFT  acc LEFT
      SHIFT  acc LEFT
      SHIFT  acc LEFT                ( acc = 8*j )
      ADD    K to acc
      STORE  acc in N                ( N = 8*j + K )
      SHIFT  acc LEFT
      SHIFT  acc LEFT
      SHIFT  acc LEFT
      SHIFT  acc LEFT                ( acc = 16*N )
      ADD    N to acc
      STORE  acc in N                ( N = 17 * (8*j + K) )
      SHIFT  acc LEFT
      SHIFT  acc RIGHT               ( N = N mod 128 )

      LOAD   index register 1 with &SRES
      LOAD   acc with N
      SHIFT  acc RIGHT               ( The required type
      SHIFT  acc RIGHT                 is INT(N/4) )



75


      ADD    acc to index register 1

      LOAD   acc with 3              ( The required bit
      XOR    acc with N                is (N mod 4) so
      STORE  acc in A                  mask is created by
      BRANCH if zero to F              3-(N mod 4) shifts )

E      LOAD   acc with R
      SHIFT  acc LEFT
      STORE  acc in R
      LOAD   acc with A
      DEC    acc
      STORE  acc in A
      BRANCH if not zero to E        ( Mask is now in R )

F      LOAD   acc with R
      AND    acc with MEM(index 1)
      ADD    acc to M                ( accumulate result )

( Bottom of K loop )
      LOAD   acc with K
      BRANCH if acc not zero to K1

      LOAD   acc with M
      STORE  acc in MEM(index 2), POST INC index reg 2

( Bottom of J loop )
      LOAD   acc with J
      BRANCH if not zero to J1

(Copy permutation bytes into last 16 bytes in reverse order)
      LOAD   index register 1 with &SRES
      LOAD   acc with 32
      ADD    acc to index register 1

      LOAD   acc with 16
      STORE  acc in J

J2     LOAD   acc with MEM(index 2), POST DEC index reg 2
      STORE  acc in MEM(index 1), POST INC index reg 1



76


      LOAD   acc with J
      DEC    acc
      STORE  acc in J
      BRANCH if not zero to J2

( Bottom of I loop )
Z      LOAD   acc with I
      DEC acc
      BRANCH if not zero to I


   Counting the number of instructions for the individual segments of code
gives:


Loop E  1  branch operation           @ j cycles each
       2  ALU operations             @ m cycles each
       4  load / store operations    @ n cycles each

Loop K1 2  branch operations          @ j cycles each
       3  index register operations  @ k cycles each
       16 ALU operations             @ m cycles each
       11 load / store operations    @ n cycles each

Loop J1 1  branch operation           @ j cycles each
       1  index register operation   @ k cycles each
       1  ALU operation              @ m cycles each
       7  load / store operations    @ n cycles each

Loop J2 1  branch operation           @ j cycles each
       2  index register operations  @ k cycles each
       1  ALU operations             @ m cycles each
       2  load / store operations    @ n cycles each

Other   3  index register operations  @ k cycles each
       5  load / store operations    @ n cycles each



77


Outer Loop


Loop I  1  branch operation           @ j cycles each
       1  ALU operations             @ m cycles each
       3  load / store operations    @ n cycles each

Other   1  index register operation   @ k cycles each
       2  load / store operations    @ n cycles each


   Counting the number of iterations of each segment of the code gives the
following table. Note that the permutation is not performed in the final round
of the algorithm thus the latter part of the I loop has been labelled I'.

   _________________________________________
   Nesting of Loops    No of Loops     Total
   _________________________________________

   I' J1               7 x 16            112
   I' J1 K1            7 x 16 x 7        784
   I' J1 K1 E          7 x 16 x 7 x 2   1568
   I'                  7                   7
   I' J2               7 x 16            112
   I                   8                   8
   _________________________________________


   The number of iterations of the loops together with the numbers and types
of operations required by each of the loops gives the total number of instruction
cycles required to be:


   112(j+k+m+7n) + 784(2j+3k+16m+11n) + 1568(j+2m+4n) +
   7(3k+5n) + 112(j+2k+m+2n) + 8(j+m+3n) + k + 2n


   i.e. 3368j + 2710k + 15912m + 15965n instruction cycles to perform
the permutation.



78


21.3 Speed Estimate


Adding the number of instruction cycles required for the substitutions and
permutations together gives a total of 15.456j+16.534k+47.960m+76.463n
instruction cycles to perform the algorithm. For typical values of j = k = 5
and m = n = 4, leads to a total of 657,634 instruction cycles per run.
On a 1ms processor this would require approximately 0.65s.


21.4 Memory Requirement Estimate


The code to perform the substitutions contains

   72 load / store instructions @ 3 bytes each = 216 bytes
    8 branch instructions       @ 3 bytes each =  24 bytes
   35 ALU instructions          @ 1 byte each  =  35 bytes
                                                 275

   The code to perform the permutations contains

   36 load / store instructions @ 3 bytes each = 108 bytes
    6 branch instructions       @ 3 bytes each =  18 bytes
   23 ALU instructions          @ 1 byte each  =  23 bytes
                                                 149


   Therefore a total of 424 bytes of memory are require to store the code.

   In addition the compression tables TAB occupy 512 + 256 + 128 + 52 + 32 =
992 bytes of external memory. The variables KEY, SRES and TEMP require a
further 16 + 32 + 16 = 64 bytes of memory. In the code the following 16
single byte variables are used: A, I, J, K, L, M, N, P, R, T1, T2, X,
XM, XN, Y and Z.

   Thus the total memory requirement for the data and program storage is
424 + 992 + 64 + 16 = 1072 bytes.

[End third extraction]