Hacking Articles

GSM HACKING 2

Filed under ( HACKING CRACKING MAC OS WINDOWS LINUX NETWORK PASSWORD )

                              [Extracts From]

                      TECHNICAL INFORMATION

              GSM System Security Study


                      10-1617-01        10^th June 1988


                          RACAL RESEARCH LTD.

               WORTON DRIVE, WORTON GRANGE INDUSTRIAL ESTATE,
                    READING, BERKS. RG2 0SB, ENGLAND.

            Telephone:  Reading  (0734)  868601     Telex:  847152



3  COMMERCIAL IN CONFIDENCE [Top of each page; omitted hereafter]


Part I


Overview of GSM Security Features

The object of this part of the report is to provide an overview of the security
features in the GSM system. The description is brief, and focuses on the
algorithms which are needed and how they are to be used: for a more detailed
description the reader is referred to the GSM recommendations GSM 02.09 [2]
and GSM 03.20 [3].

   Three distinct security services are provided. These are subscriber identity
authentication, user and signalling data confidentiality, and subscriber identity
confidentiality. Each of these is considered in turn, and the mechanisms used to
provide them outlined. Actually the second of the services is a grouping of three
GSM features: user data confidentiality on physical connections, connectionless
user data confidentiality and signalling information element confidentiality. [3].
The reason for combining them into one service is that they are all provided
by one and the same mechanism.


1 Subscriber Identity Authentication


This subscriber identity authentication service is the heart of the GSM security
system. It is used to enable the fixed network to authenticate the identity of
mobile subscribers, and to establish and manage the encryption keys needed
to provide the confidentiality services. The service must be supported by all
networks and mobiles, although the frequency of application is at the discretion
of the network.

   Authentication is initiated by the fixed network, and is based upon a simple
challenge-response protocol. When a mobile subscriber ( MS ) attempts to
access the system, the network issues it a random challenge RAND. The MS
computes a response SRES to RAND using a one-way function A3 under control
of a subscriber authentication key K_i. The key K_i is unique to the subscriber,
and is shared only by the subscriber and an authentication centre which serves
the subscriber's home network. The value SRES computed by the MS is sig-


4


nalled to the network, where it is compared with a pre-computed value. If the
two values of SRES agree, the mobile subscriber has been authenticated, and
the call is allowed to proceed. If the values are different, then access is denied.

   The same mechanism is also used to establish a cipher key K_c for encrypt-
ing user and signalling data on the radio path. This procedure is called cipher
key setting in [3]. The key is computed by the MS using a one-way func-
tion A8, again under control of the subscriber authentication key K_i, and is
pre-computed for the network by the authentication centre which serves the
subscriber's home network. Thus at the end of a successful authentication
exchange, both parties possess a fresh cipher key K_c.

   The pre-computed triples ( RAND, SRES, K_c ), held by the fixed networks
for a particular subscriber, are passed from the home network's authentication
centre to visited networks upon demand. The challenges are used just once.
Thus the authentication centre never sends the same triple to two distinct
networks, and a network never re-uses a challenge.

   In practice the two functions A3 and A8 are combined into a single algo-
rithm, called A38 in [2], which is used to simultaneously compute SRES and
K_c from RAND and K_i. In this report this combined algorithm is referred to
as the authentication algorithm. The protocol described above makes it quite
clear that this algorithm need only be available to an authentication centre and
the mobile subscribers which that authentication centre serves. In particular,
there is no need for a common GSM authentication algorithm. and different
networks may use different algorithms. ( The algorithms do, however, need
to have the same input and output parameters; in particular, the length of K_c
is determined by the GSM cipher algorithm ). Never-the-less it is desirable
that there is a GSM standard authentication algorithm which may be used
by all networks which do not wish to develop a proprietary algorithm. There
is just one candidate for such an algorithm; it was proposed by the German
administration, and is analysed in Part VI of this report.


2 User and Signalling Data Confidentiality


As mentioned earlier, this service consists of three elements; user data confiden-
tiality and signalling information on physical connections, connectionless user
data confidentiality and signalling information element confidentiality. The
first element provides for privacy of all user generated data, both voice and


5


non-voice, transferred over the radio path on traffic channels [2, Section 3.3].
The second element provides for privacy of user data transferred in packet
mode over the radio path on a dedicated signalling channel [2. Section 3.4],
whilst the third element provides for privacy of certain user related signalling
elements transferred over the radio path on dedicated signalling channels [2,
Section 3.5].

   All of these elements of service are provided using the same layer 1 encryp-
tion mechanism, and must be supported and used by all networks and mobiles.
The mechanism is now briefly described. For a more comprehensive description
the reader is referred to [3. Section 4]. Encryption is achieved by means of a
ciphering algorithm A5 which produces a key stream under control of a cipher
key K_c. This key stream is then bit-for-bit exclusive-or'd with the data trans-
ferred over the radio path between the MS and the base station ( BS ). The
cipher key is established at the MS as part of the authentication procedure, as
described in the last section, and is transferred through the fixed network to
the BS after the MS has been identified.

   It is essential that the MS and BS synchronise the starting of their cipher
algorithms. A technique for achieving this is described in [4, Section 4], but
this only directly addresses the situation when the network initiates an authen-
tication check. The procedures still need to be specified in detail to cover the
situation when the network does not authenticate the MS. When the network
intends to issue an authentication challenge, the BS starts deciphering all data
immediately after the MS has been identified using the cipher key K_c which the
MS will derive upon receipt of the challenge RAND. The MS starts ciphering
and deciphering the moment it has computed K_c ( and SRES ) from RAND, as
described in the last section, and before SRES is transmitted. On the BS side,
enciphering starts as soon as SRES has been received, deciphered and found to
be correct. To cope with possible transmission loss or errors, the authentication
request and response message are repeated under the control of timers.

   Synchronisation of the ciphering key stream is maintained by using the
TDMA frame structure of the radio sub-system. The TDMA frame number is
used as a message key for the cipher algorithm AS, and the algorithm produces
a synchronised key stream for enciphering and deciphering the data bits in the
frame. For each frame, a total of 114 bits are produced for enciphering / de-
ciphering data transferred from the MS to the BS, and an additional 114 bits
are produced for deciphering / enciphering data received at the MS from the
BS. A frame lasts for 4.6 ms, so that the cipher has to produce the 228 bits in
this time.


6


   The cipher algorithm A5 must be common to all GSM networks, and three
algorithms have been proposed as candidates for the GSM standard: a French
algorithm, a Swedish algorithm and a UK algorithm. These algorithms are
discussed in detail in subsequent parts of this report.


3 Subscriber Identity Confidentiality


This service allows mobile subscribers to originate calls, update their location,
etc, without revealing their International Mobile Subscriber Identity ( IMSI )
to an eavesdropper on the radio path. It thus prevents location tracing of
individual mobile subscribers by listening to the signalling exchanges on the
radio path. All mobiles and networks must be capable of supporting the service,
but its use is not mandatory.

   In order to provide the subscriber identity confidentiality service it is neces-
sary to ensure that the IMSI, or any information which allows an eavesdropper
to derive the IMSI, it not ( normally ) transmitted in clear in any signalling
message on the radio path. The mechanism used to provide this service is
based on the use of a temporary mobile subscriber identity ( TMSI ), which is
securely updated after each successful access to the system. Thus, in principle,
the IMSI need only be transmitted in clear over the radio path at registration.
In addition, the signalling elements which convey information about the IMSI
are enciphered as described in the last section.

   The TMSI updating mechanism functions in the following manner. For
simplicity, assume the MS has been allocated a TMSI, denoted by TMSI_o,
and the network knows the association between TMSI_o and the subscriber's
IMSI. The MS identifies itself to the network by sending TMSI_o. Immediately
after authentication ( if this takes place ), the network generates a new TMSI,
denoted TMSI_n, and sends this to the MS encrypted under the cipher key K_c
as described in the last section. Upon receipt of the message, the MS deciphers
and replaces TMSI_o by TMSI_n.

[End first extract]


10


Part III


The French Proposal for the Cipher



The cipher proposed by France has always been considered as a hardware rather
than a software algorithm. The study of this cipher is based on the descrip-
tion reproduced in Appendix A and described in PDL ( program definition
language ) in Section 4. Software and hardware implementations of the cipher
are considered in Sections 5 and 6. The statistical tests applied are discussed
in Section 7¹ and there is a mathematical analysis in Section 8.


4 PDL Description of the Cipher


4.1 Main Algorithm


( Load Base Key )
FOR each base key bit from 1 to 64
   Load bit into corresponding LFSR cell
END FOR


( Load Message Key )
FOR each message key bit from 1 to 22
   shift_bits = f()         ( Call to shift function f )
   FOR each register i, from 1 to 3
       Exclusive-or message key bit into lsb
       IF bit i of shift_bits is set
         THEN Shift
       END IF
   END FOR
END FOR

( Produce both enciphering and deciphering streams )
FOR i from 1 to 2

_________________________

  ¹ The actual results from the statistical tests are in [1].


11


   ( Perform additional shifts )
   FOR j from 1 to 100
       shift_bits = f()
       FOR each register k from 1 to 3
           IF bit k of shift_bits is set
             THEN Shift
           END IF
       END FOR
   END FOR

   ( Output stream of 114 bits )
   FOR J from 1 to 114
       shift_bits = f()
       FOR each register k from 1 to 3
           IF bit k of shift_bits is set
             THEN Shift
           END IF
       END FOR
       Output = Exclusive-or msb of all three registers
   END FOR
END FOR


4.2 The Shift Function f


BEGIN FUNCTION f

   FOR each register i from 1 to 3
       Let middle[i] = the 'middle' bit of register i
   END FOR

   IF less than two of the 'middle bits' are '1'
     THEN bit-complement code
   END IF

   RETURN code

END FUNCTION


12


5 Software Estimates


In this section the cipher is described in a readable form similar to micropro-
cessor instruction code and an estimate of the speed is made from this. It was
suspected that it would not be practical to implement this code in software, so
the code was based on a very specialized microprocessor, which may not even
exist. If the cipher can not meet the time requirement of 4.6ms on the special-
ized microprocessor then it will not be able to meet it on a more general one.
This special microprocessor has a word size⁵ at least as long as the longest reg-
ister, i.e., 23 bits, it also has the function PARITY⁶ which exclusive-ors all the
bits in the accumulator and places the result in the least significant bit while
setting all the other bits to zero. Additionally it is assumed that the CARRY can
be loaded from the least significant bit of the accumulator. The problem of
directing the feedback bit to the appropriate part of the accumulator is ignored.

   The external memory, considered to be arranged so that REG contains the
registers and MASK contains bit masks for both the extraction of the central
bits of each register and for the calculation of the feedback values, with the
feedback masks last and in reverse order.

   The symbol & is used to mean 'address of'


5.1 Evaluating the Shift Function f


The following code extracts the central bits of each register and calculates the
corresponding output of the shift function f.


Load registers and masks

       LOAD    index register 1 with &REG
       LOAD    index register 2 with &MASK




__________________________

    ⁵Word size is considered here to be the natural size or accumulator and memory 'portions'.
    ⁶ Which will be considered as an ALU operation



13


Extract middle bits of each register

       LOAD    acc with MEM(index 1), POST INC index reg 1
       AND     acc with MEM(index 2), POST INC index reg 2
       PARITY  acc
       STORE   acc in M1
       LOAD    acc with MEM(index 1), POST INC index reg 1
       AND     acc with MEM(index 2), POST INC index reg 2
       PARITY  acc
       STORE   acc in M2
       LOAD    acc with MEM(index 1), POST INC index reg 1
       AND     acc with MEM(index 2)
       PARITY  acc
       STORE   acc in M3

Calculate shift function f

       XOR     acc with M2
       AND     acc with M1
       STORE   acc in I

       LOAD    acc with M2
       AND     acc with M3
       OR      acc with I

       XOR     acc with M1
       STORE   acc in M1
       LOAD    acc with I
       XOR     acc with M2
       STORE   acc in M2
       LOAD    acc with I
       XOR     acc with M3
       STORE   acc in M3




14


   Loading the data itself requires:

    2 index register operations  k cycles each

   Extracting the middle bits of each register requires:

    6 index register operations  @ k cycles each
    3 ALU operations             @ m cycles each
    3 load / store operations    @ n cycles each

   The function f requires:

    7 ALU operations             @ m cycles each
    8 load / store operations    @ n cycles each

   This part of the code requires 8k + 10m + 11n cycles per iteration



5.2 Performing the Shifts


The following code clocks the appropriate registers using the results of the
shift function. Note that the values of M1, M2 and M3 determine whether or
not each register is clocked. Note also that the registers are treated in reverse
order since index register 1 'points' to the contents of shift register 3 at this
stage.



       LOAD    acc with M3
       BRANCH  if zero to A
       LOAD    acc with MEM(index 1)
       AND     MEM(index 2) to acc
       PARITY  acc
       STORE   acc in CARRY
       ROTATE  acc right through CARRY
       STORE   CARRY in 03
       STORE   acc in MEM(index 1)




15


A:      DEC     index register 1
       DEC     index register 2

       LOAD    acc with M2
       BRANCH  if zero to B
       LOAD    acc with MEM(index 1)
       AND     MEM(index 2) to acc
       PARITY  acc
       STORE   acc in CARRY
       ROTATE  acc right through CARRY
       STORE   CARRY in 02
       STORE   acc in MEM(index 1)

B:      DEC     index register 1
       DEC     index register 2

       LOAD    acc with M1
       BRANCH  if zero to C
       LOAD    acc with MEM(index 1)
       AND     MEM(index 2) to acc
       PARITY  acc
       STORE   acc in CARRY
       ROTATE  acc right through CARRY
       STORE   acc in MEM(index 1)

C       LOAD    acc with CARRY
       XOR     acc with 03
       XOR     acc with 02
       OUTPUT  acc



   To estimate the speed it is assumed that 9/4 of the registers are clocked on
each iteration, i e., that 3/4 of the operations to shift the registers are performed
for each iteration.

   Shifting the registers requires:


    3 branch operations           @ j cycles each
   13 index register operations   @ k cycles each
    6 ALU operations              @ m cycles each
    8 load / store operations     @ n cycles each



16


   The calculation of the output bit requires a further:

    2 ALU operations              @ m cycles each
    1 load / store operation      @ n cycles

   Therefore, the clocking requires a total of.
   
    3/4 x (3j + 13k + 6m + 8n) + 2m + n = 9/4j 39/4k + 13/2m + 7n

cycles per iteration.

   The data loading and shift function calculation requires a further 8k +
10m + 11n cycles per iteration.

   Therefore the total number of cycles required is given by:

                        9/4j + 71/4k + 33/2m + 18n

   For typical values of j = k = 5 and m = n = 4 this gives 11.25 + 88.75 +
66 + 72 = 238 cycles per iteration.

   The algorithm must be iterated 450 times to produce 228 bits of output.
This corresponds to ~~ 450 x 238 = 107100 clock cycles to produce 228 bits.
On a 1ms microprocessor this would take approximately 107 ms.


5.3 Summary


These estimates show that even on a specialized microprocessor, and ignoring
some of the detail, this cipher can not operate at the required speed. It is
therefore reasonable to assume that it would not be viable to implement this
cipher in software on a more general microprocessor.

   In light of the unsuitability of this algorithm the memory requirement was
not estimated.



17


6  Hardware Estimates


The following estimates are based on the two Figures 2. and 3. Only the
hardware necessary for the shift registers and the shift function f is considered,
i.e., none of the control, interfacing or test circuitry is studied here. The overall
architecture is shown in Figure 1.

   The transistor counts for various components are based on the Racal Re-
search Ltd., 2.5 mm CMOS microcell library.




    Figure 1: Overall Architecture



6.1 Notation


The elements of the circuits in Figures 2 and 3 are shown as boxes marked
with various symbols as described in the following table.

 ______________________________________
 Symbol    Function
 ______________________________________

 MUX       Multiplexor
 .
 _         Exclusive-or gate
 .

 -         Exclusive-nor gate
 .

 .         And gate

 +         Or gate

 Unmarked  D-type, i e., register stage
 ______________________________________
 [Symbols partly illegible]


18


  All the signals shown are single bits. However. the various "Load Control"
signals in Figure 2 are different signals which control different parts of the
loading mechanism.



6.2 Shift Registers


Figure 2 shows the register R1. The number of exclusive-or gates necessary for
each register depends upon the feedback function for that particular register;
in total seven such gates are needed for the three registers.



  

     Figure 2: Shift Register Architecture for R1
     [Poor original]


   To load the base key, the registers are concatenated together and the key
is shifted through, suppressing the output so that the key does not reappear
again. To load the message key the key bits are exclusive-or'd into the feedback
path. In ordinary operation the feedback path is fed back to the left hand cell
without obstruction. In order to implement this a multiplexor is used to chose
between the feedback and input, while an and gate is used to suppress the
external input to the exclusive-or on the feedback path when it is not required.
The overall components together with their respective transistor counts are:

   64  D-types            @ 22 transistors each  =  1408 transistors
   6   2-input AND gates  @ 6 transistors each   =    36 transistors
   10  XOR gates          @ 10 transistors each  =   100 transistors
   3   2-input MUXs       @ 12 transistors each  =    36 transistors
                                                    1580 transistors

   An additional two exclusive-or gates are required to combine the output of
the three registers, each requiring 10 transistors. This gives a total of 1600
transistors to implement the shift register.



19


^{6.3 Shift function f}


The shift function is implemented by producing a signal comp which is true if
the three bits M1, H2 and M3 need to be inverted. Th;s signal is the exclusive-
ored with each of the three original bits to effect the inversion. If the three bits
are regarded as numbers, then comp is true if and only if their sum is greater
than or equal to 2. Thus, if the three bits are fed into a full adder, then comp
is the negation of the carry out signal. The equation for this carry out signal
is
                                       .
                               M1 .(M2 - M3) + (M2 . M3)
                                       .

which was also employed in Section 5.1. This is shown in Figure 3. Rather than
inverting this value, and then using exclusive-or gates, exclusive-nor ( XNR )
gates are used.


    

 Figure 3: Shift Function f Architecture
  [Poor original]



20


   This requires:

    2  AND gates  @ 6 transistors each   =  12 transistors
    1  OR gate    @ 6 transistors each   =   6 transistors
    1  XOR gate   @ 10 transistors each  =  10 transistors
    3  XNR gates  @ 10 transistors each  =  30 transistors
                                            48

   In addition three further And gates are required to combine the output of
the shift function with the clock signal, see Figure 1.

   The total number of transistors required is 1600 + 48 + 18 = 1666.



6.4 Speed Estimates


In order to produce the two 114-bit key streams the shift registers have to be
shifted the following number of times:

    64   : to load the base key
    22   : to load the message key
    100  : intermediate shifts
    114  : to produce the encrypt stream
    100  : intermediate shifts
    114  : to produce decrypt stream
    514

   If these shifts take two clock cycles each then 1028 clock cycles would be
required. At a clock speed of 50ns per cycle then it would take 51.4ms to
produce the key streams from the keys, which is well within the requirement.



6.5 Summary


A hardware implementation of this cipher requires a relatively small number of
transistors. approximately 1666. If the base key was loaded in parallel then
the circuitry would be more complex, however, given the speed estimate above
it is unlikely that this would be necessary.



21


   These estimates suggest that it should be possible to produce a hardware
implementation of this cipher which meets the speed requirement using a
relatively small number of transistors.


[End second extract]


65


Part I


The German Proposal for the Authentication Algorithm


The authentication algorithm need not be universal and different networks are
free to use algorithms of their own choice ( provided that the parameters are
of the correct length ). However, there will be a GSM standard which can be
used by any administrations who do not wish to develop their own proprietary
algorithms. The AEG have already decided to recommend the German pro-
posal also referred to as COMP128 in some literature for this purpose. This
algorithm was included in the study in order to assess its suitability for the
task. The algorithm is specified in [8]. which is reproduced in Appendix D
( with the exception of the details of the tables ).

   The functionality of the algorithm is described in PDL ( program definition
language ) in Section 20. Using this description an estimate for the complexity
of a software implementation is made in Section 21. This algorithm will even-
tually be implemented in the Subscriber Interface Module ( SIM )¹ which will
be either a smart card or plug-in module. Both options for the SIM contain
a microprocessor thus the authentication algorithm will be implemented in
software rather than hardware. Therefore no hardware estimates were made
for this cipher. A brief description of the statistical analysis applied to the
algorithm is given in Section 22 and a summary of the mathematical analysis
performed is given in Section 23.

________________________________

   ¹ See Part VII



66


20 PDL Description of the Algorithm


( Load RAND into last 16 bytes of input )
FOR i from 16 to 31
   x[i] = rand[i]
END FOR

( Loop eight times )
FOR i from 1 to 8

   ( Load key into first 16 bytes of input )
   FOR j from 0 to 15
       x[j] = key[j]
   END FOR

   ( Perform substitutions )
   FOR j from 0 to 4
       FOR k from 0 to 2^j - 1
           FOR l from 0 to 2^4-j - 1

               m = 1 + k x 2^5-j
               n = m +2^4-j

               y = (x[m] + 2 x x[n]) mod 2^9-j
               z = (2 x x[n] + x[n]) mod 2^9-j

               x[m] = table [j,y]
               x[n] = table [j,z]

           END FOR
       END FOR
   END FOR

   ( Form bits from bytes )
   FOR j from 0 to 31
       FOR k from 0 to 7
           bit [4*j+k] = the (8-k)th bit of byte j
       END FOR
   END FOR



67


   ( Permutation but not on the last loop )
   IF (i < 8) THEN
       FOR j from 0 to 15
           FOR k from 0 to 7
               next bit = (8 x j + k) x 17 mod 128
               Bit k of x[j + 16] = bit[next_bit]
           END FOR
       END FOR
   END IF

END FOR



   At this stage the vector x[ ] consists of 32 nibbles. The last 8 of these
are taken as the output SRES.



21 Software Estimates


In order to estimate the complexity of a software implementation of the German
authentication algorithm, it has been described in a readable form similar to
microprocessor instruction code. This code is then used as a basis for the
estimates.

   Assume that the external memory is arranged as follows:

    TAB   contains the compression tables.
    KEY   contains the 128-bit key.
    SRES  is 256 bits of external memory used to store the
          intermediate and final values - assuming that the
          last 16 bytes have been initialised with RAND.
    TEMP  is 16 bytes of external memory available as working space.

   The symbol & is used to mean 'address of'.



68


21.1 Substitutions


The following code performs the substitutions using the tables. Note that the
indices of the j, k and l loops run 'downwards' for reasons of simplicity of
coding. The corresponding PDL segment is



  FOR j from 4 to 0 step -1
      FOR k from 2^4-j - 1 to 0 step -1
          FOR l from 2^j-1 to 0 step -1
              m = 2^j -1 - 1 + (2^4-j - k - 1)x (2^j+l)


Code to Perform the Substitutions


      LOAD   index register 2 with &SRES
      LOAD   acc with 8
      STORE  acc with I

( Top of I loop )
I:     LOAD   acc with 16
      STORE  acc in J

( Load key into first 16 bytes )
A:     LOAD   index register 1 with &KEY
      LOAD   acc with MEM(index 1), POST INC index reg 1
      STORE  acc in MEM(index 2), POST INC index reg 2
      LOAD   acc with J
      DEC    acc
      STORE  acc in J
      BRANCH if not zero to A

( Perform substitutions )
      LOAD   acc with 5
      STORE  acc in J

( Top of J loop )
J:     DEC    acc
      STORE  acc in J



69


      STORE  acc in X
      LOAD   acc with 1
      STORE  acc in T1
      LOAD   acc with 16
      STORE  acc in T2
      LOAD   acc with X
      BRANCH if zero to C

B:     LOAD   acc with T1
      SHIFT  acc LEFT
      STORE  acc in T1
      LOAD   acc with T2
      SHIFT  acc RIGHT
      STORE  acc in T2
      LOAD   acc with X
      DEC    acc
      STORE  acc in X                 ( T1 = 2**j )
      BRANCH if not zero to B         ( T2 = 2**(4-j) )

C:     LOAD   acc with T2
      STORE  acc in K

( Top of K loop )
K:     DEC    acc
      STORE  acc in K
      LOAD   acc with T1
      STORE  acc in L                 ( L = 2**j )

( Top of L loop )
L:     DEC    acc
      STORE  acc in L

      LOAD   acc with T2
      SUB    K from acc
      DEC    acc
      STORE  acc in M
      LOAD   acc With J
      INC    acc
      STORE  acc in X



70


D:     LOAD   acc with M
      SHIFT  acc LEFT
      STORE  acc in M
      LOAD   acc with X
      DEC    acc
      STORE  acc in X                 ( M = (2**(4-j)-K-1)
      BRANCH if not zero to D              * 2**(J+1) )

      LOAD   acc with M
      ADD    T1 to acc
      SUB    L from acc
      DEC    acc
      STORE  acc in M                 ( M = M+(T1-L-1) )

      ADD    T1 to acc
      STORE  acc in N                 ( N = M + T1 )

      LOAD   index register 1 with &SRES
      LOAD   acc with M
      ADD    acc to index register 1
      LOAD   acc with MEM(index 1)
      STORE  acc in XM                ( XM = X[M] )

      LOAD   index register 1 with &SRES
      LOAD   acc with N
      ADD    acc to index register 1
      LOAD   acc with MEM(index 1)
      STORE  acc in XN                ( XN = X[N] )

      SHIFT  acc LEFT
      ADD    XM to acc
      SHIFT  acc LEFT                 ( Lose last bit of Y )
      SHIFT  acc RIGHT
      STORE  acc in Y                 ( Y = XM + 2*XW )
      LOAD   acc with XM
      SHIFT  acc LEFT
      ADD    XN to acc
      SHIFT  acc LEFT
      SHIFT  acc RIGHT
      STORE  acc in Z                 ( Z = 2*XM + XN )



71


      LOAD   index register 2 with &INDX
      LOAD   acc with I
      ADD    acc to index register 2
      LOAD   acc with MEM(index 2)
      STORE  acc in P
      LOAD   index register 2 with &TAB
      ADD    acc to index register 2  ( Point to correct
      LOAD   acc with Z                 table using INDX )
      ADD    acc to index register 2
      LOAD   acc with MEM(index 2)
      STORE  acc in MEM(index 1)      ( table[Z] -> X[N] )

      LOAD   index register 1 with &SRES
      LOAD   acc with M
      ADD    acc to index register 1
      LOAD   index register 2 with &TAB
      LOAD   acc with Y
      ADD    acc to index register 2
      LOAD   acc with P               ( Point to correct
      ADD    acc to index register 2    table using INDX )
      LOAD   acc with MEM(index 2)
      STORE  acc in MEM(index 1)      ( table[Y] -> X[M] )

( Bottom of L )
      LOAD   acc with L
      BRANCH if not zero to L

( Bottom of K )
      LOAD   acc with K
      BRANCH if not zero to K

( Bottom of J )
      LOAD   acc with J
      BRANCH if not zero to J



72


      LOAD   acc with I
      DEC    acc
      STORE  acc in I
      DEC    acc                       ( No permutation
      BRANCH if zero to Z                if I = 1 )

   Counting the number of operations in each segment of the code gives:

Loop D 1  branch operation           @ j cycles each
      2  ALU operations             @ m cycles each
      4  load / store operations    @ n cycles each

Loop L 1  branch operation           @ j cycles each
      21 index register operations  @ k cycles each
      16 ALU operations             @ m cycles each
      22 load / store operations    @ n cycles each

Loop K 1  branch operation           @ j cycles each
      1  ALU operation              @ m cycles each
      4  load / store operations    @ n cycles each

Seg C  2  load / store operations    @ n cycles each

Loop B 1  branch operation           @ j cycles each
      3  ALU operations             @ m cycles each
      6  load / store operations    @ n cycles each

Loop J 2  branch operation           @ j cycles each
      1  ALU operations             @ m cycles each
      8  load / store operations    @ n cycles each

Loop A 1  branch operation           @ j cycles each
      3  index register operations  @ k cycles each
      1  ALU operations             @ m cycles each
      2  load / store operations    @ n cycles each



73


   In order to assess the speed of the algorithm the number of operations in
each segment of the code need to be multiplied by the number of times that
segment is performed. The code to perform the substitution contains three
nested loops J, K and L, the number of iterations of the K and L loops being
dependent on the value of the J counter, a. There are 2^4-a iterations of K and
2^a iterations of L. Thus there are always 2^4-a x 2^a = 2⁴ = 16 iterations of L.

   The number of iterations of each individual segment is shown in the fol-
lowing table.

   __________________________________________
   Nesting of Loops  N^o of Loops        Total
   __________________________________________
   I A               8 x 16               128
   I J               8 x 5                 40
   I J B             8 x 5 x a            400
   I J C             8 x 5                 40
   I J K             8 x 5 x 2^4-a        1240
   I J K L           8 x 5 x 16           640
   I J K L D         8 x 5 x 16 x (a +1) 9600
   __________________________________________

   This giVes the total number of cycles required to be


   128(j+3k+m+2n) + 40(2j+m+8n) + 400(j+3m+6n) + 40 x 2n
   + 1240(j+m+4n) + 640(j+21k+16m+22n) + 9600(j+2m+4n)


   i.e., 12080j + 13824k + 32048m + 60498n instruction cycles to
perform the substitutions.



21.2 Permutations


The following code performs the permutation. Note that although the PDL
description contains code for extracting and storing individual bits of the bytes
before permutation the low-level code here does not. What is done here is that
for each bit its corresponding byte number and bit-position within that byte
are calculated, the identified bit is masked off and the result is accumulated in
the appropriate byte.



74


_{Code to Perform the Permutations}


      LOAD   index register 2 with &TEMP
      LOAD   acc with 16
      STORE  acc in J

( Top of J loop )
J1     DEC    acc
      STORE  acc in J
      LOAD   acc with 0
      STORE  acc in M
      LOAD   acc with 7
      STORE  acc in K

( Top of K loop )
K1     DEC    acc
      STORE  acc in K

      LOAD   acc with 1
      STORE  acc in R                ( Initialize mask )

      LOAD   acc with J
      SHIFT  acc LEFT
      SHIFT  acc LEFT
      SHIFT  acc LEFT                ( acc = 8*j )
      ADD    K to acc
      STORE  acc in N                ( N = 8*j + K )
      SHIFT  acc LEFT
      SHIFT  acc LEFT
      SHIFT  acc LEFT
      SHIFT  acc LEFT                ( acc = 16*N )
      ADD    N to acc
      STORE  acc in N                ( N = 17 * (8*j + K) )
      SHIFT  acc LEFT
      SHIFT  acc RIGHT               ( N = N mod 128 )

      LOAD   index register 1 with &SRES
      LOAD   acc with N
      SHIFT  acc RIGHT               ( The required type
      SHIFT  acc RIGHT                 is INT(N/4) )



75


      ADD    acc to index register 1

      LOAD   acc with 3              ( The required bit
      XOR    acc with N                is (N mod 4) so
      STORE  acc in A                  mask is created by
      BRANCH if zero to F              3-(N mod 4) shifts )

E      LOAD   acc with R
      SHIFT  acc LEFT
      STORE  acc in R
      LOAD   acc with A
      DEC    acc
      STORE  acc in A
      BRANCH if not zero to E        ( Mask is now in R )

F      LOAD   acc with R
      AND    acc with MEM(index 1)
      ADD    acc to M                ( accumulate result )

( Bottom of K loop )
      LOAD   acc with K
      BRANCH if acc not zero to K1

      LOAD   acc with M
      STORE  acc in MEM(index 2), POST INC index reg 2

( Bottom of J loop )
      LOAD   acc with J
      BRANCH if not zero to J1

(Copy permutation bytes into last 16 bytes in reverse order)
      LOAD   index register 1 with &SRES
      LOAD   acc with 32
      ADD    acc to index register 1

      LOAD   acc with 16
      STORE  acc in J

J2     LOAD   acc with MEM(index 2), POST DEC index reg 2
      STORE  acc in MEM(index 1), POST INC index reg 1



76


      LOAD   acc with J
      DEC    acc
      STORE  acc in J
      BRANCH if not zero to J2

( Bottom of I loop )
Z      LOAD   acc with I
      DEC acc
      BRANCH if not zero to I


   Counting the number of instructions for the individual segments of code
gives:


Loop E  1  branch operation           @ j cycles each
       2  ALU operations             @ m cycles each
       4  load / store operations    @ n cycles each

Loop K1 2  branch operations          @ j cycles each
       3  index register operations  @ k cycles each
       16 ALU operations             @ m cycles each
       11 load / store operations    @ n cycles each

Loop J1 1  branch operation           @ j cycles each
       1  index register operation   @ k cycles each
       1  ALU operation              @ m cycles each
       7  load / store operations    @ n cycles each

Loop J2 1  branch operation           @ j cycles each
       2  index register operations  @ k cycles each
       1  ALU operations             @ m cycles each
       2  load / store operations    @ n cycles each

Other   3  index register operations  @ k cycles each
       5  load / store operations    @ n cycles each



77


Outer Loop


Loop I  1  branch operation           @ j cycles each
       1  ALU operations             @ m cycles each
       3  load / store operations    @ n cycles each

Other   1  index register operation   @ k cycles each
       2  load / store operations    @ n cycles each


   Counting the number of iterations of each segment of the code gives the
following table. Note that the permutation is not performed in the final round
of the algorithm thus the latter part of the I loop has been labelled I'.

   _________________________________________
   Nesting of Loops    No of Loops     Total
   _________________________________________

   I' J1               7 x 16            112
   I' J1 K1            7 x 16 x 7        784
   I' J1 K1 E          7 x 16 x 7 x 2   1568
   I'                  7                   7
   I' J2               7 x 16            112
   I                   8                   8
   _________________________________________


   The number of iterations of the loops together with the numbers and types
of operations required by each of the loops gives the total number of instruction
cycles required to be:


   112(j+k+m+7n) + 784(2j+3k+16m+11n) + 1568(j+2m+4n) +
   7(3k+5n) + 112(j+2k+m+2n) + 8(j+m+3n) + k + 2n


   i.e. 3368j + 2710k + 15912m + 15965n instruction cycles to perform
the permutation.



78


21.3 Speed Estimate


Adding the number of instruction cycles required for the substitutions and
permutations together gives a total of 15.456j+16.534k+47.960m+76.463n
instruction cycles to perform the algorithm. For typical values of j = k = 5
and m = n = 4, leads to a total of 657,634 instruction cycles per run.
On a 1ms processor this would require approximately 0.65s.


21.4 Memory Requirement Estimate


The code to perform the substitutions contains

   72 load / store instructions @ 3 bytes each = 216 bytes
    8 branch instructions       @ 3 bytes each =  24 bytes
   35 ALU instructions          @ 1 byte each  =  35 bytes
                                                 275

   The code to perform the permutations contains

   36 load / store instructions @ 3 bytes each = 108 bytes
    6 branch instructions       @ 3 bytes each =  18 bytes
   23 ALU instructions          @ 1 byte each  =  23 bytes
                                                 149


   Therefore a total of 424 bytes of memory are require to store the code.

   In addition the compression tables TAB occupy 512 + 256 + 128 + 52 + 32 =
992 bytes of external memory. The variables KEY, SRES and TEMP require a
further 16 + 32 + 16 = 64 bytes of memory. In the code the following 16
single byte variables are used: A, I, J, K, L, M, N, P, R, T1, T2, X,
XM, XN, Y and Z.

   Thus the total memory requirement for the data and program storage is
424 + 992 + 64 + 16 = 1072 bytes.

[End third extraction]

[End]

Note: For expanded interpretation of COMP128 see:

http://www.scard.org/gsm/a3a8.txt

See related information and discussion of GSM encryption algorithms:

http://jya.com/crack-a5.htm

GSM HACKING 1

Filed under ( HACKING CRACKING MAC OS WINDOWS LINUX NETWORK PASSWORD )

Real Time Cryptanalysis of A5/1 on a PC

Alex Biryukov * Adi Shamir ** David Wagner ***

Abstract. A5/1 is the strong version of the encryption algorithm used by about 130 million GSM customers in Europe to protect the over-the-air privacy of their cellular voice and data communication. The best published attacks against it require between 2⁴⁰ and 2⁴⁵ steps. This level of security makes it vulnerable to hardware-based attacks by large organizations, but not to software-based attacks on multiple targets by hackers.
In this paper we describe new attacks on A5/1, which are based on subtle flaws in the tap structure of the registers, their noninvertible clocking mechanism, and their frequent resets. After a 2⁴⁸ parallelizable data preparation stage (which has to be carried out only once), the actual attacks can be carried out in real time on a single PC.
The first attack requires the output of the A5/1 algorithm during the first two minutes of the conversation, and computes the key in about one second. The second attack requires the output of the A5/1 algorithm during about two seconds of the conversation, and computes the key in several minutes. The two attacks are related, but use diffrent types of time-memory tradeoff. The attacks were verified with actual implementations, except for the preprocessing stage which was extensively sampled rather than completely executed.
REMARK: We based our attack on the version of the algorithm which was derived by reverse engineering an actual GSM telephone and published at http://www.scard.org. We would like to thank the GSM organization for graciously confiming to us the correctness of this unofficial description. In addition, we would like to stress that this paper considers the narrow issue of the cryptographic strength of A5/1, and not the broader issue of the practical security of fielded GSM systems, about which we make no claims.
* Computer Science department, The Weizmann Institute, Rehovot 76100, Israel.
** Computer Science department, The Weizmann Institute, Rehovot 76100, Israel.
*** Computer Science department, University of California, Berkeley CA 94720, USA.

1 Introduction

The over-the-air privacy of GSM telephone conversations is protected by the A5 stream cipher. This algorithm has two main variants: The stronger A5/1 version is used by about 130 million customers in Europe, while the weaker A5/2 version is used by another 100 million customers in other markets. The approximate design of A5/1 was leaked in 1994, and the exact design of both A5/1 and A5/2 was reverse engineered by Briceno from an actual GSM telephone in 1999 (see [3]).

In this paper we develop two new cryptanalytic attacks on A5/1, in which a single PC can extract the conversation key in real time from a small amount of generated output. The attacks are related, but each one of them optimizes a different parameter: The first attack (called the biased birthday attack) requires two minutes of data and one second of processing time, whereas the second attack (called the the random subgraph attack) requires two seconds of data and several minutes of processing time. There are many possible choices of tradeo parameters in these attacks, and three of them are summarized in Table 1.

Attack Type	Preprocessing steps	Available data	Number of 73GB disks	Attack time
Biased Birthday attack (1)	2⁴²	2 minutes	4	1 second
Biased Birthday attack (2)	2⁴⁸	2 minutes	2	1 second
Random Subgraph attack	2⁴⁸	2 seconds	4	minutes

Table 1. Three possible tradeoff points in the attacks on A5/1.

Many of the ideas in these two new attacks are applicable to other stream ciphers as well, and define new quantifiable measures of security.

The paper is organized in the following way: Section 2 contains a full description of the A5/1 algorithm. Previous attacks on A5/1 are surveyed in Section 3, and an informal description of the new attacks is contained in Section 4. Finally, Section 5 contains various implementation details and an analysis of the expected success rate of the attacks, based on large scale sampling with actual implementations.

2 Description of the A5/1 stream cipher

A GSM conversation is sent as a sequence of frames every 4.6 millisecond. Each frame contains 114 bits representing the digitized A to B communication, and 114 bits representing the digitized B to A communication. Each conversation can be encrypted by a new session key K. For each frame, K is mixed with a publicly known frame counter Fn , and the result serves as the initial state of a generator which produces 228 pseudo random bits. These bits are XOR'ed by the two parties with the 114+114 bits of the plaintext to produce the 114+114 bits of the ciphertext.

A5/1 is built from three short linear feedback shift registers (LFSR) of lengths 19, 22, and 23 bits, which are denoted by R1; R2 and R3 respectively. The rightmost bit in each register is labelled as bit zero. The taps of R1 are at bit positions 13,16,17,18; the taps of R2 are at bit positions 20,21; and the taps of R3 are at bit positions 7, 20,21,22 (see Figure 1). When a register is clocked, its taps are XORed together, and the result is stored in the rightmost bit of the left-shifted register. The three registers are maximal length LFSR's with periods 2¹⁹ -1, 2²² - 1, and 2²³ -1, respectively. They are clocked in a stop/go fashion using the following majority rule: Each register has a single "clocking" tap (bit 8 for R1, bit 10 for R2, and bit 10 for for R3); each clock cycle, the majority function of the clocking taps is calculated and only those registers whose clocking taps agree with the majority bit are actually clocked. Note that at each step either two or three registers are clocked, and that each register moves with probability 3/4 and stops with probability 1/4.

The process of generating pseudo random bits from the session key K and the frame counter F_n is carried out in four steps:

-- The three registers are zeroed, and then clocked for 64 cycles (ignoring the stop/go clock control). During this period each bit of K (from lsb to msb) is XOR'ed in parallel into the lsb's of the three registers.
-- The three registers are clocked for 22 additional cycles (ignoring the stop/go clock control). During this period the successive bits of F_n (from lsb to msb) are again XOR'ed in parallel into the lsb's of the three registers. The contents of the three registers at the end of this step is called the initial state of the frame.
-- The three registers are clocked for 100 additional clock cycles with the stop/go clock control but without producing any outputs.
-- The three registers are clocked for 228 additional clock cycles with the stop/go clock control in order to produce the 228 output bits. At each clock cycle, one output bit is produced as the XOR of the msb's of the three registers.

3 Previous attacks

The attacker is assumed to know some pseudo random bits generated by A5/1 in some of the frames. This is the standard assumption in the cryptanalysis of stream ciphers, and we do not consider in this paper the crucial issue of how one can obtain these bits in fielded GSM systems. For the sake of simplicity, we assume that the attacker has complete knowledge of the outputs of the A5/1 algorithm during some initial period of the conversation, and his goal is to find the key in order to decrypt the remaining part of the conversation. Since GSM telephones send a new frame every 4.6 milliseconds, each second of the conversation contains about 2⁸ frames.

At the rump session of Crypto 99, Ian Goldberg and David Wagner announced an attack on A5/2 which requires very few pseudo random bits and just O(2¹⁶) steps. This demonstrated that the \export version" A5/2 is totally insecure.

The security of the A5/1 encryption algorithm was analyzed in several papers. Some of them are based on the early imprecise description of this algorithm, and thus their details have to be slightly modified. The known attacks can be summarized in the following way:

-- Briceno[3] found out that in all the deployed versions of the A5/1 algorithm, the 10 least signicant of the 64 key bits were always set to zero. The complexity of exhaustive search is thus reduced to O(2 54 ).⁴
-- Anderson and Roe[1] proposed an attack based on guessing the 41 bits in the shorter R₁ and R₂ registers, and deriving the 23 bits of the longer R₃ register from the output. However, they occasionally have to guess additional bits to determine the majority-based clocking sequence, and thus the total complexity of the attack is about O(2⁴⁵). Assuming that a standard PC can test ten million guesses per second, this attack needs more than a month to find one key.
-- Golic[4] described an improved attack which requires O(2⁴⁰) steps. However, each operation in this attack is much more complicated, since it is based on the solution of a system of linear equations. In practice, this algorithm is not likely to be faster than the previous attack on a PC.
-- Golic[4] describes a general time-memory tradeo attack on stream ciphers (which was independently discovered by Babbage [2] two years earlier), and concludes that it is possible to find the A5/1 key in 2 22 probes into random locations in a precomputed table with 2⁴² 128 bit entries. Since such a table requires a 64 terabyte hard disk, the space requirement is unrealistic. Alternatively, it is possible to reduce the space requirement to 862 gigabytes, but then the number of probes increases to O(2²⁸). Since random access to the fastest commercially available PC disks requires about 6 milliseconds, the total probing time is almost three weeks. In addition, this tradeoff point can only be used to attack GSM phone conversations which last more than 3 hours, which again makes it unrealistic.
___________________
4 Our new attack is not based on this assumption, and is thus applicable to A5/1 implementations with full 64 bit keys. It is an interesting open problem whether we can speed it up by assuming that 10 key bits are zero.

4 Informal Description of the New Attacks

We start with an executive summary of the key ideas of the two attacks. More technical descriptions of the various steps will be provided in the next section.

Key idea 1: Use the Golic time-memory tradeoff. The starting point for the new attacks is the time-memory tradeoff described in Golic[3], which is applicable to any cryptosystem with a relatively small number of internal states. A5/1 has this weakness, since it has n = 2⁶⁴ states defined by the 19+22+23 = 64 bits in its three shift registers. The basic idea of the Golic time-memory tradeoff is to keep a large set A of precomputed states on a hard disk, and to consider the large set B of states through which the algorithm progresses during the actual generation of output bits. Any intersection between A and B will enable us to identify an actual state of the algorithm from stored information.

Key idea 2: Identify states by prefixes of their output sequences. Each state defines an infinite sequence of output bits produced when we start clocking the algorithm from that state. In the other direction, states are usually uniquely defined by the first log(n) bits in their output sequences, and thus we can look for equality between unknown states by comparing such prefixes of their output sequences. During precomputation, pick a subset A of states, compute their output prefixes, and store the (prefix, state) pairs sorted into increasing prefix values. Given actual outputs of the A5/1 algorithm, extract all their (partially overlapping) prefixes, and define B as the set of their corresponding (unknown) states. Searching for common states in A and B can be efficiently done by probing the sorted data A on the hard disk with prefix queries from B.

Key idea 3: A5/1 can be efficiently inverted. As observed by Golic, the state transition function of A5/1 is not uniquely invertible: The majority clock control rule implies that up to 4 states can converge to a common state in one clock cycle, and some states have no predecessors. We can run A5/1 backwards by exploring the tree of possible predecessor states, and backtracking from dead ends. The average number of predecessors of each node is 1, and thus the expected number of vertices in the first k levels of each tree grows only linearly in k (see[3]). As a result, if we find a common state in the disk and data, we can obtain a small number of candidates for the initial state of the frame. The weakness we exploit here is that due to the frequent reinitializations there is a very short distance from intermediate states to initial states.

Key idea 4: The key can be extracted from the initial state of any frame. Here we exploit the weakness of the A5/1 key setup routine. Assume that we know the state of A5/1 immediately after the key and frame counter were used, and before the 100 mixing steps. By running backwards, we can eliminate the effect of the known frame counter in a unique way, and obtain 64 linear combinations of the 64 key bits. Since the tree exploration may suggest several keys, we can choose the correct one by mixing it with the next frame counter, running A5/1 forward for more than 100 steps, and comparing the results with the actual data in the next frame.

Key idea 5: The Golic attack on A5/1 is marginally impractical. By the well known birthday paradox, A and B are likely to have a common state when their sizes a and b satisfy a * b approx = n. We would like a to be bounded by the size of commercially available PC hard disks, and b to be bounded by the number of overlapping prefixes in a typical GSM telephone conversation. Reasonable bounds on these values (justified later in this paper) are a approx = 2³⁵ and b approx = 2²². Their product is 2⁵⁷, which is about 100 times smaller than n = 2⁶⁴. To make the intersection likely, we either have to increase the storage requirement from 150 gigabytes to 15 terabytes, or to increase the length of the conversation from two minutes to three hours. Neither approach seems to be practical, but the gap is not huge and a relatively modest improvement by two orders of magnitude is all we need to make it practical.

Key idea 6: Use special states. An important consideration in implementing time-memory tradeoff attacks is that access to disk is about a million times slower than a computational step, and thus it is crucial to minimize the number of times we look for data on the hard disk. An old idea due to Ron Rivest is to keep on the disk only special states which are guaranteed to produce output bits starting with a particular pattern alpha of length k, and to access the disk only when we encounter such a prefix in the data. This reduces the number b of disk probes by a factor of about 2k . The number of points a we have to memorize remains unchanged, since in the formula a * b approx = n both b and n are reduced by the same factor 2k . The downside is that we have to work 2k times harder during the preprocessing stage, since only 2^-k the preprocessing time by a factor of about 64,000, which makes it impractically long.

Key idea 7: Special states can be efficiently sampled in A5/1. A major weakness of A5/1 which we exploit in both attacks is that it is easy to generate all the states which produce output sequences that start with a particular k-bit pattern alpha with k = 16 without trying and discarding other states. This is due to a poor choice of the clocking taps, which makes the register bits that affect the clock control and the register bits that affect the output unrelated for about 16 clock cycles, so we can choose them independently. This easy access to special states does not happen in good block ciphers, but can happen in stream ciphers due to their simpler transition functions. In fact, the maximal value of k for which special states can be sampled without trial and error can serve as a new security measure for stream ciphers, which we call its sampling resistance. As demonstrated in this paper, high values of k can have a big impact on the efficiency of time-memory tradeoff attacks on such cryptosystems.

Key idea 8: Use biased birthday attacks. The main idea of the first attack is to consider sets A and B which are not chosen with uniform probability distribution among all the possible states. Assume that each state s is chosen for A with probability P_A (s), and is chosen for B with probability P_B (s). If the means of these probability distributions are a/n and b/n, respectively, then the expected size of A is a, and the expected size of B is b.

The birthday threshold happens when Sigma_s P_A (s) P_B (s) approx = 1. For independent uniform distributions, this evaluates to the standard condition a * b approx = n. However, in the new attack we choose states for the disk and states in the data with two non-uniform probability distributions which have strong positive correlation. This makes our time memory tradeoff much more efficient than the one used by Golic. This is made possible by the fact that in A5/1, the initial state of each new frame is rerandomized very frequently with different frame counters.

Key idea 9: Use Hellman's time-memory tradeoff on a subgraph of special states. The main idea of the second attack (called the random subgraph attack) is to make most of the special states accessible by simple computations from the subset of special states which are actually stored in the hard disk. The first occurrence of a special state in the data is likely to happen in the first two seconds of the conversation, and this single occurrence sufacces in order to locate a related special state in the disk even though we are well below the threshold of either the normal or the biased birthday attack. The attack is based on a new function f which maps one special state into another special state in an easily computable way. This f can be viewed as a random function over the subspace of 2⁴⁸ special states, and thus we can use Hellman's time-memory tradeoff [4] in order to invert it efficiently. The inverse function enables us to compute special states from output prefixes even when they are not actually stored on the hard disk, with various combinations of time T and memory M satisfying M square root T = 2⁴⁸. If we choose M = 2³⁶ , we get T = 2²⁴ , and thus we can carry out the attack in a few minutes, after a 2⁴⁸ preprocessing stage which explores the structure of this function f.

Key idea 10: A5/1 is very efficient on a PC. The A5/1 algorithm was designed to be efficient in hardware, and its straightforward software implementation is quite slow. To execute the preprocessing stage, we have to run it on a distributed network of PC's up to 2⁴⁸ times, and thus we need an extremely efficient way to compute the effect of one clock cycle on the three registers.

We exploit the following weakness in the design of A5/1: Each one of the three shift registers is so small that we can precompute all its possible states, and keep them in RAM as three cyclic arrays, where successive locations in each array represent successive states of the corresponding shift register. In fact, we don't have to keep the full states in the arrays, since the only information we have to know about a state is its clocking tap and its output tap. A state can thus be viewed as a triplet of indices (i; j; k) into three large single bit arrays (see Figure 2). A₁ (i); A₂ (j); A₃ (k) are the clocking taps of the current state, and A₁ (i - 11), A₂ (j - 12), A₃ (k - 13) are the output taps of the current state (since these are the corresponding delays in the movement of clocking taps to output taps when each one of the three registers is clocked). Since there is no mixing of the values of the three registers, their only interaction is in determining which of the three indices should be incremented by 1. This can be determined by a precomputed table with three input bits (the clocking taps) and three output bits (the increments of the three registers). When we clock A5/1 in our software implementation, we don't shift registers or compute feedbacks - we just add a 0/1 vector to the current triplet of indices. A typical two dimensional variant of such movement vectors in triplet space is described in Figure 3. Note the local tree structure determined by the deterministic forward evaluation and the nondeterministic backward exploration in this triplet representation.

Since the increment table is so small, we can expand the A tables from bits to bytes, and use a larger precomputed table with 2²⁴ entries, whose inputs are the three bytes to the right of the clocking taps in the three registers, and outputs are the three increments to the indices which allow us to jump directly to the state which is 8 clock cycles away. The total amount of RAM needed for the state arrays and precomputed movement tables is less than 128 MB, and the total cost of advancing the three registers for 8 clock cycles is one table lookup and three integer additions! A similar table lookup technique can be used to compute in a single step output bytes instead of output bits, and to speed up the process of running A5/1 backwards.

5 Detailed Description of the Attacks

In this section we fill in the missing details, and analyse the success rate of the new attacks.

5.1 Efficient Sampling of Special States

Let alpha be any 16 bit pattern of bits. To simplify the analysis, we prefer to use an alpha which does not coincide with shifted versions of itself (such as alpha = 1000...0) since this makes it very unlikely that a single 228-bit frame contains more than one occurrence of alpha.

The total number of states which generate an output prefix of alpha is about 2⁶⁴ * 2^-16 = 2⁴⁸. We would like to generate all of them in a (barely doable) 2⁴⁸ preprocessing stage, without trying all the 2⁶⁴ possible states and discarding the vast majority which fail the test. The low sampling resistance of A5/1 is made possible by several flaws in its design, which are exploited in the following algorithm:

-- Pick an arbitrary 19-bit value for the shortest register R1. Pick arbitrary values for the rightmost 11 bits in R2 and R3 which will enter the clock control taps in the next few cycles. We can thus define 2^19+11+11 = 2⁴¹ partial states.
-- For each partial state we can uniquely determine the clock control of the three registers for the next few cycles, and thus determine the identity of the bits that enter their msb's and affect the output.
-- Due to the majority clock control, at least one of R2 and R3 shifts a new (still unspecified) bit into its msb at each clock cycle, and thus we can make sure that the computed output bit has the desired value. Note that about half the time only one new bit is shifted (and then its choice is forced), and about half the time two new bits are shifted (and then we can choose them in two possible ways). We can keep this process alive without time consuming trial and error as long as the clock control taps contain only known bits whereas the output taps contain at least one unknown bit. A5/1 makes this very easy, by using a single clocking tap and placing it in the middle of each register: We can place in R2 and R3 11 specified bits to the right of the clock control tap, and 11-12 unspecified bits to the right of the output tap. Since each register moves only 3/4 of the time, we can keep this process alive for about 16 clock cycles, as desired.
-- This process generates only special states, and cannot miss any special state (if we start the process with its partial specification, we cannot get into an early contradiction). We can similarly generate any number c <>48 of randomly chosen special states in time proportional to c. As explained later in the paper, this can make the preprocessing faster, at the expense of other parameters in our attack.

5.2 Efficient Disk Probing

To leave room for a sufficiently long identifying prefix of 35 bits after the 16-bit alpha, we allow it to start only at bit positions 1 to 177 in each one of the given frames (i.e., at a distance of 101 to 277 from the initial state). The expected number of occurrences of alpha in the data produced by A5/1 during a two minute conversation is thus 2^-16 * 177 * 120 * 1000/4.6 approx = 71. This is the expected number of times b we access the hard disk. Since each random access takes about 6 milliseconds, the total disk access time becomes negligible (about 0.4 seconds).

5.3 Efficient Disk Storage

The data items we store on the disk are (prefix, state) pairs. The state of A5/1 contains 64 bits, but we keep only special states and thus we can encode them efficiently with shorter 48 bit names, by specifying the 41 bits of the partial state and the approx = 7 choice bits in the sampling procedure. We can further reduce the state to less than 40 bits (5 bytes) by leaving some of the 48 bits unspecified. This saves a considerable fraction of the disk space prepared during preprocessing, and the only penalty is that we have to try a small number of candidate states instead of one candidate state for each one of the 71 relevant frames. Since this part is so fast, even in its slowed down version it takes less than a second.

The output prefix produced from each special state is nominally of length 16+35=51 bits. However, the first 16 bits are always the constant alpha, and the next 35 bits are stored in sorted order on the disk. We can thus store the full value of these 35 bits only once per sector, and encode on the disk only their small increments (with a default value of 1). Other possible implementations are to use the top parts of the prefixes as direct sector addresses or as file names. With these optimizations, we can store each one of the sorted (prefix, state) pairs in just 5 bytes. The largest commercially available PC hard disks (such as IBM Ultrastar 72 ZX or Seagate Cheetah 73) have 73 gigabytes. By using two such disks, we can store 146 * 2³⁰ =5 approx = 2³⁵ pairs during the preprocessing stage, and characterize each one of them by the (usually unique) 35-bit output prefix which follows alpha.

5.4 Efficient Tree Exploration

The forward state-transition function of A5/1 is deterministic, but in the reverse direction we have to consider four possible predecessors. About 3/8 of the states have no predecessors, 13/32 of the states have one predecessor, 3/32 of the states have two predecessors, 3/32 of the states have three predecessors, and 1/32 of the states have four predecessors.

Since the average number of predecessors is 1, Golic assumed that a good statistical model for the generated trees of predecessors is the critical branching process (see [3]). We were surprised to discover that in the case of A5/1, there was a very significant difference between the predictions of this model and our experimental data. For example, the theory predicted that only 2% of the states would have some predecessor at depth 100, whereas in a large sample of 100,000,000 trees we generated from random A5/1 states the percentage was close to 15%. Another major difference was found in the tail distributions of the number of sons at depth 100: Theory predicted that in our sample we should see some cases with close to 1000 sons, whereas in our sample we never saw trees with more than 120 sons at depth 100.

5.5 The Biased Birthday Attack.

To analyse the performance of our biased birthday attack, we introduce the following notation:

Definition 1 A state s is coloured red, if the sequence of output bits produced from state s starts with alpha (i.e., it is a special state). The subspace of all the red states is denoted by R.

Definition 2 A state is coloured green, if the sequence of output bits produced from state s contains an occurrence of alpha which starts somewhere between bit positions 101 and 277. The subspace of all the green states is denoted by G.

The red states are the states that we keep in the disk, look for in the data, and try to collide by comparing their prexes. The green states are all the states that could serve as initial states in frames that contain alpha. Non-green initial states are of no interest to us, since we discard the frames they generate from the actual data.

The size of R is approximately 2⁴⁸, since there are 2⁶⁴ possible states, and the probability that alpha occurs right at the beginning of the output sequence is 2^-16. Since the redness of a state is not directly related to its separate coordinates i, j, k in the triplet space, the red states can be viewed as randomly and sparsely located in this representation. The size of G is approximately 177 * 2⁴⁸ (which is still a small fraction of the state space) since alpha has 177 opportunities to occur along the output sequence.

Since a short path of length 277 in the output sequence is very unlikely to contain two occurrences of alpha, the relationship between green and red states is essentially many to one: The set of all the relevant states we consider can be viewed as a collection of disjoint trees of various sizes, where each tree has a red state as its root and a "belt" of green states at levels 101 to 277 below it (see Figure 4). The weight W (s) of a tree whose root is the red state s is defined as the number of green states in its belt, and s is called k-heavy if W (s) > k.

The crucial observation which makes our biased birthday attack efficient is that in A5/1 there is a huge variance in the weights of the various red states. We ran the tree exploration algorithm on 100,000,000 random states and computed their weights. We found out that the weight of about 85% of the states was zero, because their trees died out before reaching depth 100. Other weights ranged all the way from 1 to more than 26,000.

The leftmost graph of Figure 5 describes for each x which is a multiple of 100 the value y which is the total weight of all the trees whose weights were between x and x + 100. The total area under the graph to the right of x = k represents the total number of green states in all the k-heavy trees in our sample.

The initial mixing of the key and frame number, which ignores the usual clock control and ips the least signicant bits of the registers about half the time before shifting them, can be viewed as random jumps with uniform probability distribution into new initial states: even a pair of frame counters with Hamming distance 1 can lead to far away initial states in the triplet space. When we restrict our attention to the frames that contain alpha, we get a uniform probability distribution over the green states, since only green states can serve as initial states in such frames.

The red states, on the other hand, are not encountered with uniform probability distribution in the actual data. For example, a red state whose tree has no green belt will never be seen in the data. On the other hand, a red state with a huge green belt has a huge number of chances to be reached when the green initial state is chosen with uniform probability distribution. In fact the probability of encountering a particular red state s in a particular frame which is known to contain alpha is the ratio of its weight W (s) and the total number of green states 177 * 2⁴⁸ , and the probability of encountering it in one of the 71 relevant frames is P_B (s) = 71 * W (s)=(177 * 2⁴⁸).

Since P_B (s) has a huge variance, we can maximize the expected number of collisions Sigma_s P_A (s) * P_B (s) by choosing red points for the hard disk not with uniform probability distribution, but with a biased probability P_A (s) which maximizes the correlation between these distributions, while minimizing the expected size of A. The best way to do this is to keep on the disk only the heaviest trees. In other words, we choose a threshold number k, and define P_A (s) = 0 if W (s) < k, and P_A (s) = 1 if W (s) > k. We can now easily compute the expected number of collisions by the formula:

which is just the number of red states we keep on the disk, times the average weight of their trees, times 71/(177 * 2⁴⁸).

In our actual attack, we keep 2³⁵ red states on the disk. This is a 2^-13 fraction of the 2⁴⁸ red states. With such a tiny fraction, we can choose particularly heavy trees with an average weight of 12,500. The expected number of colliding red states in the disk and the actual data is 2³⁵ *12,500 * 71/(177 * 2⁴⁸) approx = 0.61. This expected value makes it quite likely that a collision will actually exist.⁵

____________________

5 Note that in time memory tradeoff attacks, it becomes increasingly expensive to push this probability towards 1, since the only way to guarantee success is to memorize the whole state space.

The intuition behind the biased time memory tradeoff attack is very simple. We store red states, but what we really want to collide are the green states in their belts (which are accessible from the red roots by an easy computation). The 71 green states in the actual data are uniformly distributed, and thus we want to cover about 1% of the green area under the curve in the right side of Figure 5. Standard time memory tradeoff attacks store random red states, but each stored state increases the coverage by just 177 green states on average. With our optimized choice in the preprocessing stage, each stored state increases the coverage by 12,500 green states on average, which improves the efficiency of the attack by almost two orders of magnitude.

5.6 Efficient Determination of Initial States

One possible disadvantage of storing heavy trees is that once we find a collision, we have to try a large number of candidate states in the green belt of the colliding red state. Since each green state is only partially specified in our compact 5-byte representation, the total number of candidate green states can be hundreds of thousands, and the real time part of the attack can be relatively slow.

However, this simple estimate is misleading. The parasitic red states obtained from the partial specification can be quickly discarded by evaluating their outputs beyond the guaranteed occurrence of alpha and comparing it to the bits in the given frame. In addition, we know the exact location of alpha in this frame, and thus we know the exact depth of the initial state we are interested in within the green belt. As a result, we have to try only about 70 states in a cut through the green belt, and not the 12,500 states in the full belt.

5.7 Reducing the Preprocessing Time of the Biased Birthday Attack

The 2⁴⁸ complexity of the preprocessing stage of this attack can make it too time consuming for a small network of PC's. In this section we show how to reduce this complexity by any factor of up to 1000, by slightly increasing either the space complexity or the length of the attacked conversation.

The efficient sampling procedure makes it possible to generate any number c <>48 of random red states in time proportional to c. To store the same number of states in the disk, we have to choose a larger fraction of the tested trees, which have a lower average weight, and thus a less efficient coverage of the green states. Table 2 describes the average weight of the heaviest trees for various fractions of the red states, which was experimentally derived from our sample of 100,000,000 A5/1 trees. This table can be used to choose the appropriate value of k in the definition the k-heavy trees for various choices of c. The implied tradeoff is very favorable: If we increase the fraction from 2^-13 to 2^-7, we can reduce the preprocessing time by a factor of 64 (from 2⁴⁸ to 2⁴²), and compensate by either doubling the length of attacked conversation from 2 minutes to 4 minutes or doubling the number of hard disks from 2 to 4. The extreme point in this tradeoff is to store in the disk all the sampled red states with nonzero weights (the other sampled red states are just a waste of space, since they will never be seen in the actual data). In A5/1 about 15% of the red states have nonzero weights, and thus we have to sample about 2³⁸ red states in the preprocessing stage in order to find the 15% among them (about 2³⁵ states) which we want to store, with an average tree weight of 1180. To keep the same probability of success, we have to attack conversations which last about half an hour.

Average Weights
2^-4	2432	2^-5	3624	2^-6	4719	2^-7	5813
2^-8	6910	2^-9	7991	2-¹⁰	9181	2^-11	10277
2-¹²	11369	2^-13	12456	2^-14	13471	2^-15	14581

2^-16	15686	2^-17	16839	2^-18	17925	2^-19	19012
2^-20	20152	2^-21	21227	2^-22	22209	2^-23	23515
2^-24	24597	2^-25	25690	2^-26	26234

Table 2. The average weight of the heaviest trees for various functions of R.

A further reduction in the complexity of the preprocessing stage can be obtained by the early abort strategy: Explore each red state to a shallow depth, and continue to explore only the most promising candidates which have a large number of sons at that depth. This heuristic does not guarantee the existence of a large belt, but there is a clear correlation between these events.

To check whether the efficiency of our biased birthday attack depends on the details of the stream cipher, we ran several experiments with modified variants of A5/1. In particular, we concentrated on the effect of the clock control rule, which determines the noninvertibility of the model. For example, we hashed the full state of the three registers and used the result to choose among the four possible majority-like movements (+1,+1,+1), (+1,+1,0), (+1,0,+1), (0,+1,+1) in the triplet space. The results were very different from the real majority rule. We then replaced the majority rule by a minority rule (if all the clocking taps agree, all the registers move, otherwise only the minority register moves). The results of this minority rule were very similar to the majority-like hashing case, and very different from the real majority case (see Figure 5). It turns out that in this sense A5/1 is actually stronger than its modified versions, but we do not currently understand the reason for this strikingly different behavior. We believe that the type of data in Table 2, which we call the tail coverage of the cryptosystem, can serve as a new security measure for stream ciphers with noninvertible state transition functions.

Fig. 5. Weight distributions. The graph on the left shows weight distributions for the majority function; the graph on the right compares the weight distributions of several clock-control functions.

5.8 Extracting the Key From a Single Red State

The biased birthday attack was based on a direct collision between a state in the disk and a state in the data, and required approx = 71 red states from a relatively long (approx = 2 minute) prefix of the conversation. In the random subgraph attack we use indirect collisions, which make it possible to find the key with reasonable probability from the very first red state we encounter in the data, even though it is unlikely to be stored in the disk. This makes it possible to attack A5/1 with less than two seconds of available data. The actual attack requires several minutes instead of one second, but this is still a real time attack on normal telephone conversations.

The attack is based on Hellman's original time-memory tradeoff for block ciphers, described in [4]. Let E be an arbitrary block cipher, and let P be some fixed plaintext. Define the function f from keys K to ciphertexts C by f(K) = E_K (P ). Assuming that all the plaintexts, ciphertexts and keys have the same binary size, we can consider f as a random function (which is not necessarily one-to-one) over a common space U . This function is easy to evaluate and to iterate but difficult to invert, since computing the key K from the ciphertext f(K) = E_K (P ) is essentially the problem of chosen message cryptanalysis.

Hellman's idea was to perform a precomputation in which we choose a large number m of random start points in U , and iterate f on each one of them t times. We store the m (start point, end point) pairs on a large disk, sorted into increasing endpoint order. If we are given f(K) for some unknown K which is located somewhere along one of the covered paths, we can recover K by repeatedly applying f in the easy forward direction until we hit a stored end point, jump to its corresponding start point, and continue to apply f from there. The last point before we hit f(K) again is likely to be the key K which corresponds to the given ciphertext f(K).

Since it is difficult to cover a random graph with random paths in an efficient way, Hellman proposed a rerandomization technique which creates multiple variants of f (e.g., by permuting the order of the output bits of f ). We use t variants f_i , and iterate each one of them t times on m random start points to get m corresponding end points. If the parameters m and t satisfy mt² = |U|, then each state is likely to be covered by one of the variants of f. Since we have to handle each variant separately (both in the preprocessing and in the actual attack), the total memory becomes M = mt and the total running time becomes T = t² , where M and T can be anywhere along the tradeoff curve M square root T = |U|. In particular, Hellman suggests using M = T = |U|^2/3.

A straightforward application of this M square root T = |U| tradeoff to the |U| = 2⁶⁴ states of A5/1 with the maximal memory M = 2³⁶ requires time T = 2⁵⁶, which is much worse than previously known attacks. The basic idea of the new random subgraph attack is to apply the time-memory tradeoff to the subspace R of 2⁴⁸ red states, which is made possible by the fact that it can be efficiently sampled. Since T occurs in the tradeoff formula M square root T = |U| with a square root, reducing the size of the graph by a modest 2¹⁶ (from |U| = 2⁶⁴ to |R| = 2⁴⁸ ) and using the same memory (M = 2³⁶), reduces the time by a huge factor of 2³² (from T = 2⁵⁶ to just T = 2²⁴ ). This number of steps can be carried out in several minutes on a fast PC.

What is left is to design a random function f over R whose output-permuted variants are easy to evaluate, and for which the inversion of any variant yields the desired key. Each state has a "full name" of 64 bits which describes the contents of its three registers. However, our efficient sampling technique enables us to give each red state a "short name" of 48 bits (which consists of the partial contents of the registers and the random choices made during the sampling process), and to quickly translate short names to full names. In addition, red states are characterized (almost uniquely) by their "output names" defined as the 48 bits which occur after alpha in their output sequences. We can now define the desired function f over 48-bit strings as the mapping from short names to output names of red states: Given a 48-bit short name x, we expand it to the full name of a red state, clock this state 64 times, delete the initial 16-bit alpha, and define f(x) as the remaining 48 output bits. The computation of f(x) from x can be efficiently done by using the previously described precomputed tables, but the computation of x from f(x) is exactly the problem of computing the (short) name of an unknown red state from the 48 output bits it produces after alpha. When we consider some output-permuted variant f_i of f, we obviously have to apply the same permutation to the given output sequence before we try to invert f_i over it.

The recommended preprocessing stage stores 2¹² tables on the hard disk. Each table is defined by iterating one of the variants f_i 2¹² times on 2²⁴ randomly chosen 48-bit strings. Each table contains 2²⁴ (start point, end point) pairs, but implicitly covers about 2³⁶ intermediate states. The collection of all the 2¹² tables requires 2³⁶ disk space, but implicitly covers about 2⁴⁸ red states.

The simplest implementation of the actual attack iterates each one of the 2¹² variants of f separately 2¹² times on appropriately permuted versions of the single red state we expect to find in the 2 seconds of data. After each step we have to check whether the result is recorded as an end point in the corresponding table, and thus we need T = 2²⁴ probes to random disk locations. At 6 ms per probe, this requires more than a day. However, we can again use Rivest's idea of special points: We say that a red state is bright if the first 28 bits of its output sequence contain the 16-bit alpha extended by 12 additional zero bits. During preprocessing, we pick a random red start point, and use f_i to quickly jump from one red state to another. After approximately 2¹² jumps, we expect to encounter another bright red state, at which we stop and store the pair of (start point, end point) in the hard disk. In fact, each end point consists of a 28 bit fixed prefix followed by 36 additional bits. As explained in the previous attack, we do not have to store either the prefix (which is predictable) or the suffix (which is used as an index) on the hard disk, and thus we need only half the expected storage. We can further reduce the required storage by using the fact that the bright red states have even shorter short names than red states (36 instead of 48 bits), and thus we can save 25% of the space by using bright red instead of red start points in the table.⁶ During the actual attack, we find the first red state in the data, iterate each one of the 2¹² variants of f over it until we encounter a bright red state, and only then search this state among the pairs stored in the disk. We thus have to probe the disk only once in each one of the t = 2¹² tables, and the total probing time is reduced to 24 seconds.

____________________

6 Note that we do not know how to jump in a direct way from one bright red state to another, since we do not know how to sample them in an eÆcient way. We have to try about 2¹² red states in order to find one bright red start point, but the total time needed to find the 2³⁶ bright red start points in all the tables is less than the 2⁴⁸ complexity of the path evaluations during the preprocessing stage.

There are many additional improvement ideas and implementation details which will be described in the final version of this paper.

Acknowledgements: We would like to thank Ross Anderson, Mike Roe, Jovan Golic, Marc Briceno, and Ian Goldberg for their pioneering contributions to the analysis of A5/1, which made this paper possible. We would like to thank Marc Briceno, Ian Goldberg, Ron Rivest and Nicko van Someren for sending very useful comments on the early version of this paper.

References

1. R. Anderson, M. Roe, A5, http://jya.com/crack-a5.htm, 1994.

2. S. Babbage, A Space/Time Tradeoff in Exhaustive Search Attacks on Stream Ciphers, European Convention on Security and Detection, IEE Conference publication, No. 408, May 1995.

3. M. Briceno, I. Goldberg, D. Wagner, A pedagogical implementation of A5/1, http://www.scard.org, May 1999.

4. J. Golic, Cryptanalysis of Alleged A5 Stream Cipher, proceedings of EUROCRYPT'97, LNCS 1233,pp.239{255, Springer-Verlag 1997.

5. M. E. Hellman, A Cryptanalytic Time-Memory Trade-Off, IEEE Transactions on Information Theory, Vol. IT-26, N 4, pp.401{406, July 1980.