
[Reinforcement Learning / review article / without TensorFlow] Policy Gradient (CartPole)


JaykayChoi  2017. 4. 8. 11:38  





This is a reimplementation of the example covered in [Reinforcement Learning] Policy Gradient (CartPole)
https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724
with the TensorFlow parts rewritten using only Python and NumPy.

Python 3.6
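
The network maps the 4-dimensional observation through 10 ReLU hidden units to a single sigmoid output p, which is treated as the probability of pushing the cart to the right (action 1). Each step the action is sampled from that Bernoulli distribution, and the quantity stored per step, action - probability, is the derivative of log π(action) with respect to the network's output score. The small sketch below (my own helper names, not part of the post's code) checks that identity numerically:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def log_prob(s, a):
    # log-likelihood of action a under a Bernoulli policy with p = sigmoid(s)
    p = sigmoid(s)
    return a * np.log(p) + (1 - a) * np.log(1 - p)

s, a, eps = 0.3, 1, 1e-6
numeric = (log_prob(s + eps, a) - log_prob(s - eps, a)) / (2 * eps)
analytic = a - sigmoid(s)   # the value the script below stores in arrError
print(numeric, analytic)    # both come out around 0.4256

Multiplying this per-step quantity by the discounted return and backpropagating it through the network gives the gradient that backpropagation() accumulates and update() applies.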

import numpy as np

import gym

 

 

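# Policy network with one hidden layer: 4 observation inputs -> 10 ReLU units
# -> a single sigmoid output giving the probability of action 1 (push right).
# Weights are trained with a REINFORCE-style policy gradient that is accumulated
# in gradient buffers and applied in update().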
class NN:

    def __init__(self):

 

        self.numHiddenLayerNeurons = 10

        self.learningRate = 1e-2

        self.discountFactorForReward = 0.99

        self.inputDimension = 4

 

        self.W1 = self.HeInitialization(self.inputDimension, self.numHiddenLayerNeurons)

        self.W2 = self.HeInitialization(self.numHiddenLayerNeurons, 1)

 

        self.W1GradientBuffer = np.zeros_like(self.W1)

        self.W2GradientBuffer = np.zeros_like(self.W2)

 

 

    def sigmoid(self,x):

        return 1.0 / (1.0 + np.exp(-x))

 

 

    def dsigmoid(self,x):

        return x * (1. - x)

 

 

    def tanh(self,x):

        return np.tanh(x)

 

 

    def dtanh(self,x):

        return 1.0 - x * x

 

 

    def ReLU(self, x):

        return x * (x > 0)

 

 

    def dReLU(self,x):

        return 1.0 * (x > 0)

 

 

    def softmax(self, x):

        if x.ndim == 1:

            x = x.reshape([1, x.size])

        modifiedX = x - np.max(x, 1).reshape([x.shape[0], 1])

        exps = np.exp(modifiedX)

        return exps / np.sum(exps, axis=1).reshape([exps.shape[0], 1])

 

 

    def XavierInitialization(self, NumIn, NumOut):

        return np.random.randn(NumIn, NumOut) / np.sqrt(NumIn)

 

 

    def HeInitialization(self, NumIn, NumOut):

        return np.random.randn(NumIn, NumOut) / np.sqrt(NumIn / 2)

 

 

    def feedForward(self, x):

        y1 = self.ReLU(np.matmul(x, self.W1))

        score = np.matmul(y1, self.W2)

        probability = self.sigmoid(score)

 

        return y1, probability

 

 

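    # error is (action - probability) per step; weighting it by the normalized
    # discounted return and backpropagating through the network yields the
    # policy-gradient estimate, which is accumulated in the gradient buffers.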
    def backpropagation(self, x, error, y1, reward):

        discountedReward = self.discountReward(reward)

        discountedReward -= np.mean(discountedReward)

        discountedReward /= np.std(discountedReward)

        error *= discountedReward

 

        # equivalently: dY2 = np.matmul(error, self.W2.T)

        dY2 = np.outer(error, self.W2)

        dY1 = self.dReLU(y1)

        dW1 = np.matmul(x.T, (dY2 * dY1))

 

        dW2 = np.matmul(y1.T, error)

 

        self.W1GradientBuffer += dW1

        self.W2GradientBuffer += dW2

 

 

    def update(self):

        self.W1 += self.learningRate * self.W1GradientBuffer

        self.W2 += self.learningRate * self.W2GradientBuffer

        self.W1GradientBuffer = np.zeros_like(self.W1)

        self.W2GradientBuffer = np.zeros_like(self.W2)

 

 

    def discountReward(self, r):

        discounted_r = np.zeros_like(r)

        running_add = 0

        for t in reversed(range(0, r.size)):

            running_add = running_add * self.discountFactorForReward + r[t]

            discounted_r[t] = running_add

        return discounted_r

 

 

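# Training loop: play episodes with the current policy, compute gradients at the
# end of each episode, and apply the accumulated update every batchSize episodes.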
if __name__ == '__main__':

 

    batchSize = 5

 

    env = gym.make('CartPole-v0')

    observation = env.reset()

 

    net = NN()

 

    arrX, arrReward, arrY1, arrError = [], [], [], []

    rewardSum = 0

    episodeIndex = 1

 


 

    while episodeIndex <= 10000:

        x = np.reshape(observation, [1, net.inputDimension])

        y1, probability = net.feedForward(x)

        action = 1 if np.random.uniform() < probability else 0  # might need e-greedy exploration here

 

        arrX.append(x)

        arrY1.append(y1)

        arrError.append(action - probability)

 

        observation, reward, done, info = env.step(action)

 

        rewardSum += reward

 

        arrReward.append(reward)

 

        if done:

            episodeIndex += 1

 

            episodeX = np.vstack(arrX)

            episodeReward = np.vstack(arrReward)

            episodeY1 = np.vstack(arrY1)

            episodeError = np.vstack(arrError)

            arrX, arrReward, arrY1, arrError = [], [], [], []

 

            net.backpropagation(episodeX, episodeError, episodeY1, episodeReward)

 

            if episodeIndex % batchSize == 0:

                net.update()

 

 

                print('Average reward over the last %d episodes: %f.' % (batchSize, rewardSum / batchSize))

 

                if rewardSum / batchSize >= 200:

                    print("Task solved in", episodeIndex, 'episodes!')

                    break

 

                rewardSum = 0

 

            observation = env.reset()

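discountReward() converts the per-step rewards (1 for every step the pole stays up) into discounted returns by a backward pass over the episode, and backpropagation() then standardizes them to zero mean and unit variance, so steps early in a long episode are reinforced and steps just before failure are discouraged. A minimal standalone sketch of that computation, using an assumed toy 5-step episode:

import numpy as np

gamma = 0.99
rewards = np.array([1.0, 1.0, 1.0, 1.0, 1.0])   # toy episode: 5 steps, reward 1 each

# backward recursion, same as NN.discountReward
discounted = np.zeros_like(rewards)
running_add = 0.0
for t in reversed(range(rewards.size)):
    running_add = running_add * gamma + rewards[t]
    discounted[t] = running_add
print(discounted)    # roughly [4.901  3.9404 2.9701 1.99   1.    ]

# normalization applied in backpropagation(): zero mean, unit variance
discounted -= np.mean(discounted)
discounted /= np.std(discounted)
print(discounted)    # early steps positive, late steps negative

Since CartPole-v0 caps an episode at 200 steps, the stopping condition rewardSum / batchSize >= 200 is only met once every episode in a batch reaches that cap.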
